# Rudrite Research

> Interactive, animated, visual explainers of landmark AI & ML papers — each paper rebuilt as a walkthrough you can follow, with the real formulas, computed exhibits, and verbatim quotes. Monochrome, static, free, and open. (the frontier, made legible.)

## Papers

- [Attention Is All You Need](https://research.rudrite.com/attention): The 2017 paper behind every LLM you use — watch attention decide what matters. (Vaswani et al., NeurIPS 2017)
- [FlashAttention](https://research.rudrite.com/flash-attention): Exact attention, made fast by never writing the big matrix to memory. (Dao et al., NeurIPS 2022)
- [PagedAttention (vLLM)](https://research.rudrite.com/paged-attention): Serve far more requests by paging the KV cache like an operating system. (Kwon et al., SOSP 2023)
- [Megatron-LM](https://research.rudrite.com/megatron-lm): Split a model across GPUs along the matrix — and train billions of parameters. (Shoeybi et al., arXiv 2019)
- [DeepSeek-R1](https://research.rudrite.com/deepseek-r1): Reasoning that emerges from reinforcement learning, not imitation. (DeepSeek-AI, 2025)
- [GPT-3: Language Models are Few-Shot Learners](https://research.rudrite.com/gpt-3): Scale a language model until it learns new tasks from a few examples. (Brown et al., NeurIPS 2020)
- [ZeRO: Zero Redundancy Optimizer](https://research.rudrite.com/zero): Partition a model across GPUs instead of replicating it — and train toward a trillion parameters. (Rajbhandari et al., SC 2020)
- [Mixtral of Experts](https://research.rudrite.com/mixtral): Grow capacity without growing per-token cost — route each token to two of eight experts. (Jiang et al., 2024)
- [Training Compute-Optimal Large Language Models](https://research.rudrite.com/chinchilla): Given a fixed compute budget, double the model and double the data — in equal proportion.
(Hoffmann et al., NeurIPS 2022)
- [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://research.rudrite.com/mamba): Let a state-space model read what it's reading — and a recurrence outruns attention. (Gu, Dao, 2023)
- [BERT: Pre-training of Deep Bidirectional Transformers](https://research.rudrite.com/bert): Read the whole sentence at once — pre-train by filling in the blanks, then fine-tune anywhere. (Devlin et al., NAACL 2019)
- [DeepSeek-V3](https://research.rudrite.com/deepseek-v3): A 671B mixture-of-experts that activates only 37B — via latent-KV attention and loss-free routing. (DeepSeek-AI, 2024)
- [Qwen3](https://research.rudrite.com/qwen3): One family, dense and MoE — with a unified thinking / non-thinking switch. (Qwen Team, 2025)
- [OLMo 2](https://research.rudrite.com/olmo-2): A fully-open model, stabilized by moving the norms to the output and clamping QK. (Ai2, 2025)
- [MiniMax-01](https://research.rudrite.com/minimax-01): Near-linear attention at 456B — lightning attention, with a softmax layer every eighth block. (MiniMax, 2025)
- [Gemma 4](https://research.rudrite.com/gemma-4): Five sizes, one design — interleaved local/global sliding-window attention, now with MoE. (Google DeepMind, 2026)
- [Scaling Laws for Neural Language Models](https://research.rudrite.com/scaling-laws): Loss falls as a clean power law in size, data, and compute — and tells you how to spend the budget. (Kaplan et al., arXiv 2020)
- [Adam: A Method for Stochastic Optimization](https://research.rudrite.com/adam): A per-parameter adaptive learning rate from two moving averages of the gradient. (Kingma & Ba, ICLR 2015)
- [Deep Residual Learning for Image Recognition](https://research.rudrite.com/resnet): Add the input back — the identity skip that made 152-layer nets trainable.
(He et al., CVPR 2016)
- [Denoising Diffusion Probabilistic Models](https://research.rudrite.com/ddpm): Add noise to an image, then learn the reverse — the recipe behind modern diffusion. (Ho et al., NeurIPS 2020)
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://research.rudrite.com/switch-transformers): Send each token to a single expert — and scale a model to a trillion parameters. (Fedus et al., JMLR 2022)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://research.rudrite.com/lora): Freeze the model, learn its change as two skinny matrices — 10,000× fewer trainable weights, zero added latency. (Hu et al., ICLR 2022)
- [GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism](https://research.rudrite.com/gpipe): Split a giant model across chips and pipeline micro-batches to keep them all busy. (Huang et al., NeurIPS 2019)
- [GSPMD: General and Scalable Parallelization for ML Computation Graphs](https://research.rudrite.com/gspmd): Annotate a few tensors; the compiler shards the trillion-parameter rest. (Xu et al., arXiv 2021)
- [Pathways: Asynchronous Distributed Dataflow for ML](https://research.rudrite.com/pathways): One controller, thousands of accelerators — parallel dispatch makes single-controller ML as fast as SPMD. (Barham et al., MLSys 2022)
- [Ring Attention with Blockwise Transformers for Near-Infinite Context](https://research.rudrite.com/ring-attention): Shard one sequence across a ring of devices, rotate the KV blocks — context scales with device count.
(Liu et al., ICLR 2024)
- [Efficiently Scaling Transformer Inference](https://research.rudrite.com/scaling-inference): Chop a 540B model across a TPU pod: 29ms/token, 76% MFU, 32x longer context. (Pope et al., MLSys 2023)
- [Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving](https://research.rudrite.com/mooncake): Schedule the KV cache, not the GPU: disaggregated prefill/decode serving that survives overload. (Qin et al., arXiv 2024)
- [Fast Inference from Transformers via Speculative Decoding](https://research.rudrite.com/speculative-decoding): A small model guesses ahead, the big one verifies in parallel — same output, 2–3× faster. (Leviathan et al., ICML 2023)
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://research.rudrite.com/chain-of-thought): Add worked examples to the prompt — and reasoning emerges in big models, no training. (Wei et al., NeurIPS 2022)
- [Training language models to follow instructions with human feedback](https://research.rudrite.com/instructgpt): RLHF: align GPT-3 from human feedback — a 1.3B model beats the 175B on preference. (Ouyang et al., NeurIPS 2022)
- [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://research.rudrite.com/dpo): Skip the reward model and the RL — one cross-entropy loss aligns the policy directly from preferences. (Rafailov et al., NeurIPS 2023)
- [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://research.rudrite.com/deepseekmath): A 7B open model hits 51.7% on MATH — by web-mining 120B math tokens and inventing GRPO. (Shao et al., arXiv 2024)
- [Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters](https://research.rudrite.com/test-time-compute): Think longer on hard prompts — and let difficulty decide how to spend the compute.
(Snell et al., arXiv 2024)
- [Constitutional AI: Harmlessness from AI Feedback](https://research.rudrite.com/constitutional-ai): Train a harmless, non-evasive assistant from a written constitution — zero human harm labels. (Bai et al., arXiv 2022)
- [DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https://research.rudrite.com/dapo): Four named techniques turn DeepSeek-style RL into a reproducible run to AIME 50. (Yu et al., arXiv 2025)
- [Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://research.rudrite.com/tree-of-thoughts): Wrap a frozen GPT-4 in tree search — branch, self-evaluate, prune. Game of 24: 4% to 74%. (Yao et al., NeurIPS 2023)
- [ReAct: Synergizing Reasoning and Acting in Language Models](https://research.rudrite.com/react): A frozen LLM that thinks, acts, and reads results in one loop — the blueprint for every agent. (Yao et al., ICLR 2023)
- [FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision](https://research.rudrite.com/flash-attention-3): Rebuilds attention for Hopper — async warps + FP8 — for 740 TFLOPs/s, 1.5-2.0x over FA-2. (Shah et al., NeurIPS 2024)
- [Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality](https://research.rudrite.com/mamba-2): Selective SSMs and masked attention are one structured matrix, computed two ways. (Dao & Gu, ICML 2024)
- [DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model](https://research.rudrite.com/deepseek-v2): 236B MoE, 21B active per token — MLA folds the whole KV cache into one latent vector. (DeepSeek-AI, arXiv 2024)
- [EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://research.rudrite.com/eagle): Draft one layer down: autoregress on features, not tokens — 2.7–3.5× faster, losslessly.
(Li et al., ICML 2024)
- [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://research.rudrite.com/awq): Find the 1% of weights that matter by watching activations, then scale to protect them at INT4. (Lin et al., MLSys 2024)
- [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://research.rudrite.com/rope): Encode position by rotating Q and K, so attention sees only the relative offset m−n. (Su et al., arXiv 2021)
- [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://research.rudrite.com/vision-transformer): Cut an image into 16×16 patches, call each a word, feed a plain Transformer. (Dosovitskiy et al., ICLR 2021)
- [Learning Transferable Visual Models From Natural Language Supervision](https://research.rudrite.com/clip): Match captions to images, and you get a classifier for any concept you can name. (Radford et al., ICML 2021)
- [High-Resolution Image Synthesis with Latent Diffusion Models](https://research.rudrite.com/latent-diffusion): Move diffusion into a compact latent space — cheaper, and the architecture behind Stable Diffusion. (Rombach et al., CVPR 2022)
- [Scalable Diffusion Models with Transformers](https://research.rudrite.com/dit): Drop the U-Net: a plain transformer on latent patches whose quality scales with Gflops.
(Peebles & Xie, ICCV 2023)
- [Robust Speech Recognition via Large-Scale Weak Supervision](https://research.rudrite.com/whisper): 680k hours of weak supervision → one Transformer that transcribes the real world, zero-shot. (Radford et al., ICML 2023)
- [Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention](https://research.rudrite.com/native-sparse-attention): Trainable, hardware-aligned sparse attention: 3 gated branches, 11.6x decode, beats dense. (Yuan et al., ACL 2025)
- [Group Sequence Policy Optimization](https://research.rudrite.com/gspo): Reward lands on the whole sequence — so the importance ratio should too, not per token. (Zheng et al. (Qwen Team), arXiv 2025)
- [DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving](https://research.rudrite.com/distserve): Split a request's timeline into prefill and decode GPU pools — 4.48x more requests under SLO. (Zhong et al., OSDI 2024)
- [CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion](https://research.rudrite.com/cacheblend): Reuse every retrieved chunk's KV cache anywhere, then recompute the ~15% of tokens that stitch cross-attention back. (Yao et al., EuroSys 2025)
- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://research.rudrite.com/gshard): Top-2 experts per token + an SPMD compiler: a 600B model trained in 4 days. (Lepikhin et al., ICLR 2021)
- [GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://research.rudrite.com/gqa): One dial from MQA to MHA — near-MHA quality at near-MQA decode speed, retrofitted cheaply.
(Ainslie et al., EMNLP 2023)
- [YaRN: Efficient Context Window Extension of Large Language Models](https://research.rudrite.com/yarn): Extend a RoPE model to 128k by reshaping frequencies per wavelength — for a tenth of the tuning. (Peng et al., ICLR 2024)
- [Efficient Streaming Language Models with Attention Sinks](https://research.rudrite.com/streaming-llm): Pin 4 "attention-sink" tokens + a rolling window — stream 4M tokens, no fine-tuning. (Xiao et al., ICLR 2024)
- [Generative Adversarial Networks](https://research.rudrite.com/gan): Two networks duel — a forger and a detective — until the fakes pass for real. (Goodfellow et al., NeurIPS 2014)
- [Segment Anything](https://research.rudrite.com/segment-anything): Point at anything, get a clean mask back in milliseconds — segmentation as a foundation model. (Kirillov et al., ICCV 2023)
- [Visual Instruction Tuning](https://research.rudrite.com/llava): A blind GPT-4 writes the lessons; one matrix turns sight into tokens — the open VLM template. (Liu et al., NeurIPS 2023)
- [s1: Simple test-time scaling](https://research.rudrite.com/s1): 1,000 curated examples + one word — “Wait” — buy o1-style test-time scaling in 26 GPU-minutes. (Muennighoff et al., arXiv 2025)
- [Tülu 3: Pushing Frontiers in Open Language Model Post-Training](https://research.rudrite.com/tulu-3): The full post-training recipe in the open — SFT, DPO, and RL with verifiable rewards. (Lambert et al., arXiv 2024)
- [Let's Verify Step by Step](https://research.rudrite.com/lets-verify): Reward every correct step, not just the answer — process supervision trains the stronger verifier. (Lightman et al., ICLR 2024)
- [Self-Consistency Improves Chain of Thought Reasoning in Language Models](https://research.rudrite.com/self-consistency): Sample many reasoning paths, take the majority answer — accuracy jumps with no training at all.
(Wang et al., ICLR 2023)
- [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://research.rudrite.com/rag): Bolt a retriever to a generator and train them together — knowledge moves out of the weights. (Lewis et al., NeurIPS 2020)
- [SWE-bench: Can Language Models Resolve Real-World GitHub Issues?](https://research.rudrite.com/swe-bench): 2,294 real GitHub issues as the exam — can a model patch an actual repository? (Jimenez et al., ICLR 2024)
- [The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits](https://research.rudrite.com/bitnet): Every weight is −1, 0, or 1 — matching full precision while multiplication becomes addition. (Ma et al., arXiv 2024)
- [KAN: Kolmogorov–Arnold Networks](https://research.rudrite.com/kan): Put the learnable functions on the edges, not the nodes — splines replace fixed activations. (Liu et al., arXiv 2024)
- [Differential Transformer](https://research.rudrite.com/differential-transformer): Two softmax maps subtracted — attention noise cancels like a differential amplifier. (Ye et al., ICLR 2025)
- [Mixture-of-Depths: Dynamically allocating compute in transformer-based language models](https://research.rudrite.com/mixture-of-depths): A fixed compute budget, spent unevenly — tokens route around blocks they don't need. (Raposo et al., arXiv 2024)
- [RWKV: Reinventing RNNs for the Transformer Era](https://research.rudrite.com/rwkv): An RNN trained like a Transformer — constant-state inference at GPT scale. (Peng et al., EMNLP 2023 Findings)
- [Titans: Learning to Memorize at Test Time](https://research.rudrite.com/titans): A neural memory that learns at test time — surprise decides what's worth remembering. (Behrouz et al., arXiv 2025)
- [Byte Latent Transformer: Patches Scale Better Than Tokens](https://research.rudrite.com/byte-latent-transformer): No tokenizer — bytes group into entropy-sized patches, and patches scale better than tokens.
(Pagnoni et al., arXiv 2024)
- [The Llama 3 Herd of Models](https://research.rudrite.com/llama-3): The 405B herd report — data, scale, and infrastructure for an open frontier model. (Llama Team, AI @ Meta, arXiv 2024)
- [Mistral 7B](https://research.rudrite.com/mistral-7b): Sliding-window attention + GQA — the 7B that beat 13B and started the small-model race. (Jiang et al., arXiv 2023)
- [Phi-4 Technical Report](https://research.rudrite.com/phi-4): Synthetic data as the main course, not the garnish — a 14B that punches at reasoning. (Abdin et al., arXiv 2024)
- [FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://research.rudrite.com/flash-attention-2): Re-cutting the same tiles across warps — ~2× faster by fixing the work partition, not the math. (Dao, ICLR 2024)
- [Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads](https://research.rudrite.com/medusa): Extra decoding heads draft ahead; tree attention verifies — speedup with no draft model. (Cai et al., ICML 2024)
- [Scaling Rectified Flow Transformers for High-Resolution Image Synthesis](https://research.rudrite.com/stable-diffusion-3): Rectified flow meets a multimodal DiT — straighter paths, two streams, one per modality. (Esser et al., ICML 2024)
- [Flow Matching for Generative Modeling](https://research.rudrite.com/flow-matching): Train the velocity field directly — diffusion-quality generation from straight probability paths. (Lipman et al., ICLR 2023)
- [Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty](https://research.rudrite.com/rlcr): Add a calibration reward to RLVR so a reasoning model states how sure it is — and means it. (Damani et al., arXiv 2025)
- [Rewarding Doubt: Calibrated Confidence Expression of LLMs](https://research.rudrite.com/rewarding-doubt): RL on a proper scoring rule teaches an LLM to express calibrated confidence in its answers.
(Stangel et al., arXiv 2025)
- [Why Language Models Hallucinate](https://research.rudrite.com/why-llms-hallucinate): Hallucinations are the predictable result of training and grading that reward confident guessing. (Kalai et al., arXiv 2025)
- [τ-bench: Tool-Agent-User Interaction in Real-World Domains](https://research.rudrite.com/tau-bench): A benchmark for tool-using agents talking to a simulated user — and the reliability cliff at pass^k. (Yao et al., arXiv 2024)
- [ToolRL: Reward is All Tool Learning Needs](https://research.rudrite.com/toolrl): Tool use learned by RL with a decomposed reward — format plus correctness beats SFT imitation. (Qian et al., arXiv 2025)
- [Group-in-Group Policy Optimization for LLM Agent Training](https://research.rudrite.com/gigpo): Group-in-group advantages give long-horizon LLM agents step-level credit without a critic. (Feng et al., NeurIPS 2025)
- [MiniMax-M1: Scaling Test-Time Compute with Lightning Attention](https://research.rudrite.com/cispo): Clip the importance weight, not the update — efficient RL for a hybrid-attention reasoning model. (MiniMax, arXiv 2025)
- [ProRL: Prolonged RL Expands Reasoning Boundaries](https://research.rudrite.com/prorl): Prolonged RL with KL resets expands what a reasoning model can do, not just sharpens it. (Liu et al., arXiv 2025)
- [The Entropy Mechanism of RL for Reasoning Language Models](https://research.rudrite.com/entropy-mechanism): Why RL entropy collapses, the law that predicts it, and two covariance-clipping fixes. (Cui et al., arXiv 2025)
- [Spurious Rewards: Rethinking Training Signals in RLVR](https://research.rudrite.com/spurious-rewards): On Qwen, even random or wrong RLVR rewards lift math accuracy — what the signal really does. (Shao et al., arXiv 2025)
- [GenPRM: Generative Process Reward Models](https://research.rudrite.com/genprm): A process reward model that reasons and runs code to verify each step — a 7B beats a 72B.
(Zhao et al., arXiv 2025)
- [From Hard Refusals to Safe-Completions](https://research.rudrite.com/safe-completions): Train safety on the output, not the request: graded safe-completions over hard refusals. (Yuan et al., arXiv 2025)
- [Proximal Policy Optimization Algorithms](https://research.rudrite.com/ppo): The clipped-objective RL algorithm under RLHF — stable policy gradients without trust-region overhead. (Schulman et al., arXiv 2017)
- [Efficiently Modeling Long Sequences with Structured State Spaces](https://research.rudrite.com/s4): A state-space layer that models 16k-long sequences — the origin of the line that leads to Mamba. (Gu et al., ICLR 2022)
- [Auto-Encoding Variational Bayes](https://research.rudrite.com/vae): The reparameterization trick turns a variational lower bound into a trainable autoencoder. (Kingma, Welling, ICLR 2014)
- [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://research.rudrite.com/t5): Cast every NLP task as text-to-text — one model, one objective, one format. (Raffel et al., JMLR 2020)
- [Toolformer: Language Models Can Teach Themselves to Use Tools](https://research.rudrite.com/toolformer): A model self-supervises where to call APIs — keeping only the calls that lower its own loss. (Schick et al., NeurIPS 2023)
- [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://research.rudrite.com/gptq): Quantize a 175B model to 3–4 bits in a few GPU-hours with a one-shot, Hessian-aware solver. (Frantar et al., ICLR 2023)
- [Muon is Scalable for LLM Training](https://research.rudrite.com/muon): An orthogonalizing optimizer that beats Adam on the matrix parameters — scaled to LLM training. (Liu et al., arXiv 2025)
- [Consistency Models](https://research.rudrite.com/consistency-models): Map any point on the diffusion trajectory straight to its origin — generation in a single step.
(Song et al., ICML 2023)

## Comparisons

- [Transformers vs Mamba](https://research.rudrite.com/compare/transformers-vs-mamba): All-pairs attention versus a selective state-space recurrence — quadratic recall against linear-time throughput.
- [FlashAttention vs PagedAttention](https://research.rudrite.com/compare/flashattention-vs-pagedattention): Two attention optimizations that solve different problems — and are used together, not instead of each other.
- [Dense vs Mixture-of-Experts](https://research.rudrite.com/compare/dense-vs-mixture-of-experts): Activate every parameter for every token, or route each token to a few of many experts.
- [ReAct vs Toolformer vs ToolRL](https://research.rudrite.com/compare/react-vs-toolformer-vs-toolrl): Three eras of teaching a model to use a tool — prompt the loop, filter the data on its own loss, or reward the policy.
- [PPO vs DPO vs GRPO](https://research.rudrite.com/compare/ppo-vs-dpo-vs-grpo): Three ways to turn preferences into a better policy — a full RL loop, a single classification loss, or group-relative RL without a critic.
- [MHA vs GQA vs MLA](https://research.rudrite.com/compare/mha-vs-gqa-vs-mla): Three points on the attention-memory curve — how much of the KV cache you keep decides how long a context you can afford to serve.
- [GAN vs VAE vs Diffusion](https://research.rudrite.com/compare/gan-vs-vae-vs-diffusion): Three ways to learn a distribution and sample from it — an adversarial game, a probabilistic autoencoder, and an iterative denoiser.
- [FlashAttention vs FlashAttention-3](https://research.rudrite.com/compare/flashattention-vs-flashattention-3): The same exact-attention algorithm, rebuilt for a new generation of GPU — IO-aware tiling, then Hopper-era asynchrony and FP8.
- [Speculative Decoding vs Medusa vs EAGLE](https://research.rudrite.com/compare/speculative-decoding-vs-medusa-vs-eagle): Three ways to draft tokens for a target model to verify in parallel — a separate draft model, self-drafting heads, or feature-level autoregression.
- [Scaling Laws vs Chinchilla](https://research.rudrite.com/compare/scaling-laws-vs-chinchilla): Two readings of the same power laws — one prescribed bigger models, one showed compute-optimal training needs far more data per parameter.
- [BERT vs GPT vs T5](https://research.rudrite.com/compare/bert-vs-gpt-vs-t5): Three ways to pretrain the same transformer — read both directions, predict the next token, or cast every task as text-to-text.
- [AWQ vs GPTQ vs BitNet](https://research.rudrite.com/compare/awq-vs-gptq-vs-bitnet): Three ways to shrink an LLM — scale the salient weights, compensate the rounding with second-order math, or train ternary so the matmul becomes addition.
- [S4 vs Mamba vs RWKV](https://research.rudrite.com/compare/s4-vs-mamba-vs-rwkv): The post-Transformer sequence lineage — a structured state space, a selective one, and a linear-attention RNN, all chasing linear cost without losing quality.
- [CoT vs Self-Consistency vs Tree-of-Thoughts](https://research.rudrite.com/compare/cot-vs-self-consistency-vs-tot): One chain, many chains, or a searched tree of chains — three rungs of a reasoning ladder, none of which touch the weights.
- [DDPM vs Flow Matching vs Consistency Models](https://research.rudrite.com/compare/ddpm-vs-flow-matching-vs-consistency): One family, three answers to the same question — how should a model walk from noise to data?

## Tracks (guided reading paths — what to read, in what order)

- [Post-training LLMs](https://research.rudrite.com/tracks/post-training-llms): How a base model becomes a frontier assistant — RLHF, preference optimization, and RL for reasoning.
(10 papers, 4 stages)
- [The Transformer, end to end](https://research.rudrite.com/tracks/the-transformer): From the attention mechanism to modern architectures — how today's models actually compute. (20 papers, 6 stages)
- [Serving LLMs efficiently](https://research.rudrite.com/tracks/serving-llms): Make a trained model fast and cheap to run — memory, batching, speculation, disaggregation. (16 papers, 4 stages)
- [Training at scale](https://research.rudrite.com/tracks/training-at-scale): Spread one model across thousands of chips — the parallelism stack behind frontier training. (7 papers, 3 stages)
- [Diffusion & generative vision](https://research.rudrite.com/tracks/diffusion-generative-vision): How models learn to generate images — from denoising to flow matching to modern text-to-image. (8 papers, 3 stages)
- [The DeepSeek lineage](https://research.rudrite.com/tracks/deepseek-lineage): One lab’s stack, paper by paper — GRPO, MLA, MoE, and pure-RL reasoning. (5 papers, 3 stages)
- [Reasoning & agents](https://research.rudrite.com/tracks/reasoning-and-agents): From a single prompt trick to verifier-checked, tool-using agents — how models learn to think. (22 papers, 5 stages)
- [Open model architectures](https://research.rudrite.com/tracks/open-models): How the frontier open models are actually built — the design choices, paper by paper. (9 papers, 3 stages)
- [Vision & multimodal](https://research.rudrite.com/tracks/vision-multimodal): How models learned to see, hear, and connect images to language — and to generate them. (10 papers, 3 stages)
- [Deep learning foundations](https://research.rudrite.com/tracks/deep-learning-foundations): The bedrock under everything else — optimization, depth, attention, scale, and adaptation. (9 papers, 3 stages)

## Site

- [Home](https://research.rudrite.com/): featured, recently decoded, and most-explored explainers.
- [Library](https://research.rudrite.com/library): every explainer, filterable by field.
- [Feed](https://research.rudrite.com/feed.xml): RSS of new explainers and comparisons.