Training at scale

From a single device to a full pod: tensor, pipeline, and data parallelism; automatic sharding; sequence parallelism for long context; and the runtime that orchestrates it all.