Kokoro

TTS without the usual heavy machinery: an 82M-parameter, non-autoregressive StyleTTS2 + ISTFTNet model turns phonemes and a split style vector into 24 kHz audio in one feed-forward pass, with no LLM, no diffusion, no codec, and no tokens.
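To make "one feed-forward pass" concrete, here is a minimal shape-level sketch in Python. It is not Kokoro's code and contains no neural network. It only shows how a non-autoregressive pipeline can fix the output length up front: each phoneme gets a duration, the durations expand into frames, and the frames map to a fixed number of 24 kHz samples. The 256-dimensional style vector, the equal split into two halves, the roles given to each half, the per-phoneme durations, and the `samples_per_frame` value are all illustrative assumptions rather than figures from the source.

```python
SAMPLE_RATE = 24_000  # from the text: Kokoro outputs 24 kHz audio


def split_style(style):
    """Split one style vector into two halves.

    Assumption: an equal split, with one half conditioning prosody and
    the other conditioning the waveform decoder. The source only says
    the style vector is "split".
    """
    mid = len(style) // 2
    return style[:mid], style[mid:]


def expand_by_duration(phonemes, durations):
    """Repeat each phoneme for its (predicted) number of frames.

    In a non-autoregressive model, all durations are known before any
    audio is produced, so the total length is fixed in advance.
    """
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


def num_samples(n_frames, samples_per_frame):
    """Number of output samples for a given frame count."""
    return n_frames * samples_per_frame


# Toy inputs (hypothetical values, chosen only for illustration).
phonemes = ["h", "ə", "l", "oʊ"]
durations = [3, 5, 4, 8]        # frames per phoneme, made up
style = [0.0] * 256             # assumed style-vector size
samples_per_frame = 600         # assumed upsampling factor

prosody_style, decoder_style = split_style(style)
frames = expand_by_duration(phonemes, durations)
total = num_samples(len(frames), samples_per_frame)

print(len(prosody_style), len(decoder_style))  # 128 128
print(len(frames), total, total / SAMPLE_RATE)  # 20 12000 0.5
```

Because the frame count is known before synthesis starts, the whole waveform can be produced in a single pass rather than one token or sample at a time.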

hexgrad · 2025 · Speech / TTS.

A free, interactive, animated visual explainer of Kokoro — every exhibit computed from the real formulas, with verbatim quotes from the source.

Questions

What is Kokoro?
Kokoro is an 82M-parameter, non-autoregressive text-to-speech model built on StyleTTS2 and ISTFTNet. It turns phonemes and a split style vector into 24 kHz audio in one feed-forward pass, with no LLM, no diffusion, no codec, and no tokens.
Who published Kokoro, and where?
hexgrad, in 2025 (the model's official release).
Where can I find a visual explainer of Kokoro?
Right here — a free, interactive, animated walkthrough of the whole paper, with exhibits computed from the real formulas and verbatim quotes from the source.
