Why can Transformers do both images and text?

What are the differences between text and images? Why can neural networks process both? Ms. Coffee Bean explains the Transformer: https://youtu....