Deep learning foundations
The handful of papers the whole field stands on: the optimizer everyone uses, the trick that made networks deep, the attention mechanism, the scaling laws, and how to adapt a giant model cheaply.
- Adam: A Method for Stochastic Optimization
- Muon is Scalable for LLM Training
- Pretraining Large Language Models with NVFP4
- Deep Residual Learning for Image Recognition
- Attention Is All You Need
- Language Models are Few-Shot Learners (GPT-3)
- Scaling Laws for Neural Language Models
- Training Compute-Optimal Large Language Models
- KAN: Kolmogorov–Arnold Networks
- LoRA: Low-Rank Adaptation of Large Language Models