XCiT: Cross-Covariance Image Transformers (Facebook AI Machine Learning Research Paper Explained)
#xcit #transformer #attentionmechanism
After dominating Natural Language Processing, Transformers have recently taken over Computer Vision with the advent of Vision Transformers. However, the attention mechanism’s quadratic complexity in the number of tokens means that Transformers do not scale well to high-resolution images. XCiT is a new Transformer architecture built around XCA, a transposed version of attention that operates across feature channels rather than tokens, reducing the complexity from quadratic to linear in the number of tokens; at least on image data, it appears to perform on par with other models. What does this mean for the field? Is this even a transformer? What really matters in deep learning?
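For intuition, here is a minimal PyTorch sketch contrasting standard self-attention with cross-covariance attention. This is not the paper's official implementation: it uses a single head with no learned projections, the function names are illustrative, and `temperature` is a plain scalar argument where the paper uses a learnable per-head parameter. The ℓ2-normalization of queries and keys over the token axis follows the paper's description.

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    """Standard self-attention: the (N, N) token-token map is quadratic in N."""
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # (N, N)
    return attn @ v  # (N, d)

def cross_covariance_attention(q, k, v, temperature=1.0):
    """XCA sketch: attention over channels; the (d, d) map makes the cost linear in N."""
    q = F.normalize(q, dim=0)  # l2-normalize each channel over the N tokens
    k = F.normalize(k, dim=0)
    attn = F.softmax((k.T @ q) * temperature, dim=-1)  # (d, d) cross-covariance map
    return v @ attn  # (N, d), computed in O(N * d^2) instead of O(N^2 * d)

N, d = 196, 64  # e.g. 14x14 image patches, 64 channels per head
q, k, v = (torch.randn(N, d) for _ in range(3))
print(self_attention(q, k, v).shape)              # torch.Size([196, 64])
print(cross_covariance_attention(q, k, v).shape)  # torch.Size([196, 64])
```

Note how the only change is where the softmax'd matrix lives: between tokens in self-attention, between channels in XCA, which is why the cost in tokens drops from quadratic to linear.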
OUTLINE:
0:00 - Intro & Overview
3:45 - Self-Attention vs Cross-Covariance Attention (XCA)
19:55 - Cross-Covariance Image Transformer (XCiT) Architecture
26:00 - Theoretical & Engineering considerations
30:40 - Experimental Results
33:20 - Comments & Conclusion
Paper:
Abstract:
Following their success in natural language processing, transformers have recently shown much promise for computer vision. [...]