Deep learning foundations

The handful of papers the whole field stands on: the default optimizer, the residual connections that made very deep networks trainable, the attention mechanism, the scaling laws, and how to adapt a giant model cheaply.

Adam: A Method for Stochastic Optimization
Muon is Scalable for LLM Training
Pretraining Large Language Models with NVFP4
Deep Residual Learning for Image Recognition
Attention Is All You Need
GPT-3: Language Models are Few-Shot Learners
Scaling Laws for Neural Language Models
Training Compute-Optimal Large Language Models
KAN: Kolmogorov–Arnold Networks
LoRA: Low-Rank Adaptation of Large Language Models