Serving LLMs efficiently

Understand the inference stack: why the KV cache dominates GPU memory at serving time, how paging and quantization cut memory waste and footprint, how speculative decoding reduces per-token latency, and how disaggregating prefill from decoding improves throughput.
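
To make the KV cache claim concrete, here is a rough back-of-the-envelope sketch in Python. The model shape (32 layers, 32 KV heads, head dimension 128, fp16 values, about 7B parameters) and the batch settings are assumed values chosen for illustration, not figures taken from any paper in this list.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to store keys and values for every layer, head, and token."""
    # Leading factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical 7B-class configuration (assumed for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1, batch_size=1)
per_batch = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096, batch_size=8)
weights_bytes = 7_000_000_000 * 2  # ~7B parameters stored in fp16

print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
print(f"KV cache for 8 sequences of 4096 tokens: {per_batch / 2**30:.1f} GiB")
print(f"Model weights: {weights_bytes / 2**30:.1f} GiB")
```

Under these assumptions the cache for a modest batch is already larger than the weights themselves, which is why the reading list below spends so much time on paging, quantization, and cache reuse.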

Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Fast Inference from Transformers via Speculative Decoding
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Efficiently Scaling Transformer Inference
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
Efficient Streaming Language Models with Attention Sinks
YaRN: Efficient Context Window Extension of Large Language Models