Fast Inference from Transformers via Speculative Decoding

A small draft model guesses several tokens ahead and the large target model verifies them in parallel: same output distribution, 2–3× faster.

Leviathan et al. · ICML 2023 · Serving.

A free, interactive, animated visual explainer of Fast Inference from Transformers via Speculative Decoding — every exhibit computed from the real formulas, with verbatim quotes from the source.

Questions

What is Fast Inference from Transformers via Speculative Decoding? A decoding method in which a small model guesses ahead and the big one verifies the guesses in parallel, giving the same output 2–3× faster.
Who published Fast Inference from Transformers via Speculative Decoding, and where? Leviathan et al., at ICML 2023 (arXiv:2211.17192).
Where can I find a visual explainer of Fast Inference from Transformers via Speculative Decoding? Right here: a free, interactive, animated walkthrough of the whole paper, with exhibits computed from the real formulas and verbatim quotes from the source.
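
The "guess ahead, then verify" idea in the answers above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names speculative_step, target, draft and gamma are hypothetical, the two "models" are toy fixed distributions over three tokens, and the target's parallel verification pass is imitated with an ordinary loop. The accept rule (keep a guessed token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized max(0, p - q)) is the one described in the paper.

```python
import random

def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]

def speculative_step(target, draft, prefix, gamma, rng):
    """One speculative decoding step; returns between 1 and gamma + 1 new tokens.

    target(seq) and draft(seq) are hypothetical callables that return the
    next-token probabilities for a token sequence.
    """
    # 1. The small draft model guesses gamma tokens, one after another.
    guesses, q_dists = [], []
    seq = list(prefix)
    for _ in range(gamma):
        q = draft(seq)
        x = sample(q, rng)
        guesses.append(x)
        q_dists.append(q)
        seq.append(x)

    # 2. The big target model scores every guessed prefix.
    #    A real system does this in one parallel pass; a loop stands in here.
    p_dists = [target(list(prefix) + guesses[:i]) for i in range(gamma + 1)]

    # 3. Accept each guess with probability min(1, p(x) / q(x)).
    out = []
    for i, x in enumerate(guesses):
        if rng.random() < min(1.0, p_dists[i][x] / q_dists[i][x]):
            out.append(x)
        else:
            # First rejection: resample from the residual and stop.
            out.append(sample(residual(p_dists[i], q_dists[i]), rng))
            return out

    # 4. Every guess was accepted: take one extra token from the target.
    out.append(sample(p_dists[gamma], rng))
    return out

# Toy stand-ins for the two models (assumptions for illustration only).
def target(seq):
    return [0.7, 0.2, 0.1]

def draft(seq):
    return [0.4, 0.4, 0.2]

if __name__ == "__main__":
    rng = random.Random(0)
    print(speculative_step(target, draft, [0], 4, rng))
```

With these toy distributions the first emitted token is 0 with probability 0.4 + 0.3 = 0.7, which matches the target's own probability for that token; that is the sense in which the output is unchanged. If the draft is identical to the target, every guess is accepted and each step yields gamma + 1 tokens.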
