Rudrite Research — the frontier, made legible
Interactive, animated, visual explainers of landmark AI & ML papers — the systems and ideas behind the models you use, redrawn and made legible. Free and open.
Browse all 100 explainers · Guided reading tracks
- Attention Is All You Need
- FlashAttention
- PagedAttention (vLLM)
- Megatron-LM
- DeepSeek-R1
- GPT-3: Language Models are Few-Shot Learners
- ZeRO: Zero Redundancy Optimizer
- Mixtral of Experts
- Training Compute-Optimal Large Language Models
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- BERT: Pre-training of Deep Bidirectional Transformers
- DeepSeek-V3
- Qwen3
- OLMo 2
- MiniMax-01
- Gemma 4
- Scaling Laws for Neural Language Models
- Adam: A Method for Stochastic Optimization
- Deep Residual Learning for Image Recognition
- Denoising Diffusion Probabilistic Models
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- LoRA: Low-Rank Adaptation of Large Language Models
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- GSPMD: General and Scalable Parallelization for ML Computation Graphs
- Pathways: Asynchronous Distributed Dataflow for ML
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- Efficiently Scaling Transformer Inference
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Fast Inference from Transformers via Speculative Decoding
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Training language models to follow instructions with human feedback
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Constitutional AI: Harmlessness from AI Feedback
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- ReAct: Synergizing Reasoning and Acting in Language Models
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Learning Transferable Visual Models From Natural Language Supervision
- High-Resolution Image Synthesis with Latent Diffusion Models
- Scalable Diffusion Models with Transformers
- Robust Speech Recognition via Large-Scale Weak Supervision
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Group Sequence Policy Optimization
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
- YaRN: Efficient Context Window Extension of Large Language Models
- Efficient Streaming Language Models with Attention Sinks
- Generative Adversarial Networks
- Segment Anything
- Visual Instruction Tuning
- s1: Simple test-time scaling
- Tülu 3: Pushing Frontiers in Open Language Model Post-Training
- Let's Verify Step by Step
- Self-Consistency Improves Chain of Thought Reasoning in Language Models
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- KAN: Kolmogorov–Arnold Networks
- Differential Transformer
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
- RWKV: Reinventing RNNs for the Transformer Era
- Titans: Learning to Memorize at Test Time
- Byte Latent Transformer: Patches Scale Better Than Tokens
- The Llama 3 Herd of Models
- Mistral 7B
- Phi-4 Technical Report
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
- Flow Matching for Generative Modeling
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
- Rewarding Doubt: Calibrated Confidence Expression of LLMs
- Why Language Models Hallucinate
- τ-bench: Tool-Agent-User Interaction in Real-World Domains
- ToolRL: Reward is All Tool Learning Needs
- Group-in-Group Policy Optimization for LLM Agent Training
- MiniMax-M1: Scaling Test-Time Compute with Lightning Attention
- ProRL: Prolonged RL Expands Reasoning Boundaries
- The Entropy Mechanism of RL for Reasoning Language Models
- Spurious Rewards: Rethinking Training Signals in RLVR
- GenPRM: Generative Process Reward Models
- From Hard Refusals to Safe-Completions
- Proximal Policy Optimization Algorithms
- Efficiently Modeling Long Sequences with Structured State Spaces
- Auto-Encoding Variational Bayes
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Toolformer: Language Models Can Teach Themselves to Use Tools
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- Muon is Scalable for LLM Training
- Consistency Models