Rudrite Research — the frontier, made legible

Interactive, animated visual explainers of landmark AI & ML papers: the systems and ideas behind the models you use, redrawn and made legible. Free and open.

Browse all 100 explainers · Guided reading tracks

Attention Is All You Need
FlashAttention
PagedAttention (vLLM)
Megatron-LM
DeepSeek-R1
GPT-3: Language Models are Few-Shot Learners
ZeRO: Zero Redundancy Optimizer
Mixtral of Experts
Training Compute-Optimal Large Language Models
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
BERT: Pre-training of Deep Bidirectional Transformers
DeepSeek-V3
Qwen3
OLMo 2
MiniMax-01
Gemma 4
Scaling Laws for Neural Language Models
Adam: A Method for Stochastic Optimization
Deep Residual Learning for Image Recognition
Denoising Diffusion Probabilistic Models
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
LoRA: Low-Rank Adaptation of Large Language Models
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
GSPMD: General and Scalable Parallelization for ML Computation Graphs
Pathways: Asynchronous Distributed Dataflow for ML
Ring Attention with Blockwise Transformers for Near-Infinite Context
Efficiently Scaling Transformer Inference
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Fast Inference from Transformers via Speculative Decoding
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Training language models to follow instructions with human feedback
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Constitutional AI: Harmlessness from AI Feedback
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
ReAct: Synergizing Reasoning and Acting in Language Models
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
RoFormer: Enhanced Transformer with Rotary Position Embedding
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Learning Transferable Visual Models From Natural Language Supervision
High-Resolution Image Synthesis with Latent Diffusion Models
Scalable Diffusion Models with Transformers
Robust Speech Recognition via Large-Scale Weak Supervision
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Group Sequence Policy Optimization
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
YaRN: Efficient Context Window Extension of Large Language Models
Efficient Streaming Language Models with Attention Sinks
Generative Adversarial Networks
Segment Anything
Visual Instruction Tuning
s1: Simple test-time scaling
Tülu 3: Pushing Frontiers in Open Language Model Post-Training
Let's Verify Step by Step
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
KAN: Kolmogorov–Arnold Networks
Differential Transformer
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
RWKV: Reinventing RNNs for the Transformer Era
Titans: Learning to Memorize at Test Time
Byte Latent Transformer: Patches Scale Better Than Tokens
The Llama 3 Herd of Models
Mistral 7B
Phi-4 Technical Report
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Flow Matching for Generative Modeling
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Rewarding Doubt: Calibrated Confidence Expression of LLMs
Why Language Models Hallucinate
τ-bench: Tool-Agent-User Interaction in Real-World Domains
ToolRL: Reward is All Tool Learning Needs
Group-in-Group Policy Optimization for LLM Agent Training
MiniMax-M1: Scaling Test-Time Compute with Lightning Attention
ProRL: Prolonged RL Expands Reasoning Boundaries
The Entropy Mechanism of RL for Reasoning Language Models
Spurious Rewards: Rethinking Training Signals in RLVR
GenPRM: Generative Process Reward Models
From Hard Refusals to Safe-Completions
Proximal Policy Optimization Algorithms
Efficiently Modeling Long Sequences with Structured State Spaces
Auto-Encoding Variational Bayes
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Toolformer: Language Models Can Teach Themselves to Use Tools
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Muon is Scalable for LLM Training
Consistency Models