Training at scale
From a single device to a full pod: tensor, pipeline, and data parallelism; automatic sharding; sequence parallelism for long context; and the runtime that orchestrates it all.
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- ZeRO: Zero Redundancy Optimizer
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- GSPMD: General and Scalable Parallelization for ML Computation Graphs
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- Pathways: Asynchronous Distributed Dataflow for ML