Fast Inference from Transformers via Speculative Decoding
A small model guesses ahead, the big one verifies in parallel — same output, 2–3× faster.
Leviathan et al. · ICML 2023 · Serving
A free, interactive, animated visual explainer of Fast Inference from Transformers via Speculative Decoding — every exhibit computed from the real formulas, with verbatim quotes from the source.
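The guess-then-verify loop in the tagline can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the authors' implementation: `speculative_step`, `draft_probs`, and `target_probs` are hypothetical names, each "model" is just a function from a token prefix to a list of probabilities, and the target model's parallel pass is simulated with an ordinary loop. `gamma` is the number of tokens guessed per step, `q` is the draft distribution, and `p` is the target distribution.

```python
import random


def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(prefix, draft_probs, target_probs, gamma, rng):
    """Run one speculative step and return the tokens to append to prefix."""
    # 1. The small (draft) model guesses gamma tokens one after another.
    guesses, q_dists = [], []
    for _ in range(gamma):
        q = draft_probs(prefix + guesses)
        guesses.append(sample(q, rng))
        q_dists.append(q)

    # 2. The big (target) model scores every guessed position.
    #    A real system does this in one parallel pass; here it is a loop.
    p_dists = [target_probs(prefix + guesses[:i]) for i in range(gamma + 1)]

    # 3. Accept guesses left to right, each with probability min(1, p(x)/q(x)).
    accepted = []
    for i, x in enumerate(guesses):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            accepted.append(x)
            continue
        # First rejection: resample from the residual max(0, p - q)
        # (rng.choices normalizes the weights) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(sample(residual, rng))
        return accepted

    # 4. Every guess was accepted: take one extra token from the target model.
    accepted.append(sample(p_dists[gamma], rng))
    return accepted
```

Each call emits between 1 and `gamma + 1` tokens. The accept-or-resample rule is what keeps the output distribution equal to the target model's: a token `x` is emitted with probability `min(p(x), q(x))` through acceptance plus `max(0, p(x) - q(x))` through resampling, which sums to `p(x)`.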
Questions
- What is Fast Inference from Transformers via Speculative Decoding?
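- A decoding method in which a small draft model guesses several tokens ahead and the large target model verifies those guesses in a single parallel pass. The output distribution is identical to that of the large model alone, and the paper reports 2–3× faster inference.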
- Who published Fast Inference from Transformers via Speculative Decoding, and where?
- Leviathan et al. — ICML 2023 (arXiv:2211.17192).
- Where can I find a visual explainer of Fast Inference from Transformers via Speculative Decoding?
- Right here — a free, interactive, animated walkthrough of the whole paper, with exhibits computed from the real formulas and verbatim quotes from the source.
Related explainers
- PagedAttention (vLLM)
- Efficiently Scaling Transformer Inference
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- Efficient Streaming Language Models with Attention Sinks