Training at scale

From a single device to a full pod: tensor, pipeline, and data parallelism; automatic sharding; sequence parallelism for long context; and the runtime that orchestrates it all.

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GSPMD: General and Scalable Parallelization for ML Computation Graphs
Ring Attention with Blockwise Transformers for Near-Infinite Context
Pathways: Asynchronous Distributed Dataflow for ML