GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding (Paper Explained)
Google builds a 600 billion parameter transformer for massively multilingual, massive machine translation. Interestingly, the larger model scale does not come from increasing the depth of the transformer, but from increasing the width of its feedforward layers, combined with hard routing to parallelize computation across up to 2048 TPUs. A very detailed engineering paper!
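To make the "hard routing" idea concrete, here is a minimal sketch of top-2 gated Mixture-of-Experts routing, the mechanism the description refers to. This is an illustration only, not the GShard implementation: the names (moe_layer, w_gate, experts) are hypothetical, and the real system adds per-expert capacity limits, an auxiliary load-balancing loss, and sharding of experts across TPU cores.

```python
# Minimal sketch of top-2 MoE routing (assumed names, not GShard's code).
import numpy as np

def moe_layer(tokens, w_gate, experts):
    """Route each token to its top-2 experts and mix their outputs.

    tokens:  (num_tokens, d_model) input activations
    w_gate:  (d_model, num_experts) gating weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = tokens @ w_gate                       # (num_tokens, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)     # softmax over experts

    output = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top2 = np.argsort(probs[i])[-2:]           # indices of the 2 best experts
        gates = probs[i, top2] / probs[i, top2].sum()  # renormalize the 2 gate values
        for e, g in zip(top2, gates):
            output[i] += g * experts[e](tok)       # weighted sum of expert outputs
    return output

# Tiny usage example with random gating weights and random expert networks.
rng = np.random.default_rng(0)
d_model, num_experts, num_tokens = 8, 4, 3
experts = [lambda x, W=rng.normal(size=(d_model, d_model)) / d_model: np.tanh(x @ W)
           for _ in range(num_experts)]
w_gate = rng.normal(size=(d_model, num_experts))
out = moe_layer(rng.normal(size=(num_tokens, d_model)), w_gate, experts)
print(out.shape)  # (3, 8)
```

Because each token only runs through 2 of the experts, total parameter count grows with the number of experts while per-token compute stays roughly constant, which is how the model gets wide without getting deep.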
OUTLINE:
0:00 - Intro & Overview
4:10 - Main Results
5:10 - Mixture-of-Experts
16:00 - Difference to Scaling Classic Transformers
18:50 - Backp