Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained
In this video I cover the "Do Vision Transformers See Like Convolutional Neural Networks?" paper. The authors dissect ViTs and ResNets, showing the differences in the features they learn as well as what contributes to those differences (such as the amount of training data, skip connections, etc.).
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ Paper:
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
00:00 Intro
00:45 Contrasting features in ViTs vs CNNs
06:45 Global vs Local receptive fields
13:55 Data matters, Mr. Obvious
17:40 Contrasting receptive fields
20:30 Data flow through CLS vs spatial tokens
23:30 Skip connections matter a lot in ViTs
24:20 Spatial information