Do Vision Transformers See Like Convolutional Neural Networks? | Paper Explained

In this video I cover the "Do Vision Transformers See Like Convolutional Neural Networks?" paper. The authors dissect ViTs and ResNets, showing the differences in the features they learn as well as what contributes to those differences (e.g., the amount of training data, skip connections, etc.).

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

✅ Paper:

▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬

⌚️ Timetable:
00:00 Intro
00:45 Contrasting features in ViTs vs CNNs
06:45 Global vs local receptive fields
13:55 Data matters, mr. obvious
17:40 Contrasting receptive fields
20:30 Data flow through CLS vs spatial tokens
23:30 Skip connections matter a lot in ViTs
24:20 Spatial inform