Fast Inference from Transformers via Speculative Decoding

A small model guesses ahead, the big one verifies in parallel — same output, 2–3× faster.

Leviathan et al. · ICML 2023 · Serving. Paper: arXiv:2211.17192

A free, interactive, animated visual explainer of Fast Inference from Transformers via Speculative Decoding — every exhibit computed from the real formulas, with verbatim quotes from the source.

Questions

What is Fast Inference from Transformers via Speculative Decoding?
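A technique for speeding up sampling from a large autoregressive transformer: a small draft model guesses several tokens ahead, and the large model verifies those guesses in parallel. The output distribution is unchanged, and the paper reports a 2–3× speedup.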
Who published Fast Inference from Transformers via Speculative Decoding, and where?
Leviathan et al. — ICML 2023 (arXiv:2211.17192).
Where can I find a visual explainer of Fast Inference from Transformers via Speculative Decoding?
Right here — a free, interactive, animated walkthrough of the whole paper, with exhibits computed from the real formulas and verbatim quotes from the source.
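
For readers who prefer code, here is a minimal sketch of one draft-then-verify round as described in the first answer. It assumes toy "models" that are plain Python functions returning a fixed probability list over a three-token vocabulary; the names speculative_step, draft_model, target_model and gamma are illustrative choices, not an API from the paper or any library. The acceptance rule (keep a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p - q) follows the sampling scheme the paper describes, but this is not the authors' implementation, and a real system would score all drafted positions in a single batched forward pass rather than in a Python loop.

```python
import random


def sample(probs, rng):
    """Draw a token index from a discrete distribution (list of weights)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(prefix, draft_model, target_model, gamma, rng):
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The small draft model proposes gamma tokens, one after another.
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(gamma):
        q = draft_model(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The large target model scores every drafted position, plus one extra.
    #    (Done in a loop here; in practice this is one parallel forward pass.)
    p_dists = [target_model(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3. Accept each drafted token with probability min(1, p/q).
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from the positive part of (p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # 4. Every guess was accepted: take one bonus token from the target model.
    out.append(sample(p_dists[gamma], rng))
    return out


# Hypothetical toy models: context-independent distributions over 3 tokens.
def target_model(ctx):
    return [0.7, 0.2, 0.1]


def draft_model(ctx):
    return [0.5, 0.3, 0.2]


if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_step([0], draft_model, target_model, gamma=4, rng=rng))
```

Each round emits between 1 and gamma + 1 tokens, which is where the speedup comes from: the closer the draft model's guesses are to the target model's distribution, the more tokens are kept per call to the large model.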
