Serving LLMs efficiently
Understand the inference stack: why the KV cache dominates GPU memory at long contexts and large batch sizes, how paging and quantization cut the memory footprint, how speculative decoding reduces decoding latency, and how disaggregated serving raises throughput under latency targets.
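- Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness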
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
- Fast Inference from Transformers via Speculative Decoding
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- Efficiently Scaling Transformer Inference
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- Efficient Streaming Language Models with Attention Sinks
- YaRN: Efficient Context Window Extension of Large Language Models
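
As a warm-up before the readings, the sketch below works through the KV cache arithmetic. It is a back-of-the-envelope estimate, not code from any of the papers. The model shape (32 layers, 32 KV heads, head dimension 128), the 16-token block size, and all function names are illustrative assumptions.

```python
import math

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The factor 2 counts one key vector and one value vector per layer.
    bytes_per_elem=2 corresponds to fp16/bf16; 1 to an 8-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def wasted_slots_contiguous(seq_len, max_len):
    """Token slots left unused when max_len slots are reserved up front."""
    return max_len - seq_len

def wasted_slots_paged(seq_len, block_size=16):
    """Token slots left unused when memory is handed out in fixed blocks.

    Only the last block of a sequence can be partially empty.
    """
    n_blocks = math.ceil(seq_len / block_size)
    return n_blocks * block_size - seq_len

# Assumed (hypothetical) 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
per_token_fp16 = kv_bytes_per_token(32, 32, 128, bytes_per_elem=2)
per_token_int8 = kv_bytes_per_token(32, 32, 128, bytes_per_elem=1)

print(per_token_fp16)                    # 524288 bytes = 0.5 MiB per token
print(per_token_fp16 * 4096 / 2**30)     # 2.0 GiB for one 4096-token sequence
print(per_token_int8 / per_token_fp16)   # 0.5: an 8-bit cache halves the footprint

# A 1000-token request, with 4096 slots reserved versus 16-token blocks.
print(wasted_slots_contiguous(1000, 4096))  # 3096 unused slots
print(wasted_slots_paged(1000, 16))         # 8 unused slots
```

Under these assumptions a single 4096-token sequence needs about 2 GiB of cache in fp16, so a modest batch can outgrow the model weights. Paging does not change the per-token cost. It reduces the slots that are reserved but never filled, while a lower-precision cache reduces the cost of each slot.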