Muon is Scalable for LLM Training
An orthogonalizing optimizer that beats Adam on the matrix parameters, scaled to LLM training.
Liu et al. · arXiv 2025 · Foundations.
A free, interactive, animated visual explainer of "Muon is Scalable for LLM Training". Every exhibit is computed from the real formulas, with verbatim quotes from the source.
Questions
- What is "Muon is Scalable for LLM Training"?
  An orthogonalizing optimizer that beats Adam on the matrix parameters, scaled to LLM training.
- Who published "Muon is Scalable for LLM Training", and where?
  Liu et al., arXiv 2025 (arXiv:2502.16982).
- Where can I find a visual explainer of "Muon is Scalable for LLM Training"?
  Right here: a free, interactive, animated walkthrough of the whole paper, with exhibits computed from the real formulas and verbatim quotes from the source.
Related explainers
- Attention Is All You Need
- GPT-3: Language Models are Few-Shot Learners
- Mixtral of Experts
- Training Compute-Optimal Large Language Models
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- BERT: Pre-training of Deep Bidirectional Transformers
- Scaling Laws for Neural Language Models
- Adam: A Method for Stochastic Optimization