How AI Learned to Speak
A from-zero ramp through how modern text-to-speech actually works: a sound wave as numbers, the neural codec that turns audio into tokens, TTS as next-token prediction, and finally today’s talking LLMs. Built from the verified explainers that underpin it.
47 min · 27 chapters
Chapters
- 0:00 — TTS as next-token prediction
- 0:50 — Sound is just numbers
- 2:47 — The mel spectrogram
- 4:13 — From text to phonemes
- 5:37 — The alignment problem
- 6:12 — Before deep learning
- 6:56 — WaveNet
- 7:43 — Tacotron and attention
- 9:46 — FastSpeech and the variance adaptor
- 12:15 — Flow, diffusion, and VITS
- 13:38 — Audio as tokens: VQ and RVQ
- 15:51 — Neural audio codecs
- 19:09 — VALL-E and the three-slot frame
- 21:37 — A gallery of modern systems
- 24:58 — How these models are trained
- 28:42 — Post-training with GRPO
- 29:41 — Inference and the latency problem
- 31:29 — Making it fast
- 33:48 — Controlling prosody, voice, and emotion
- 36:13 — A decade in one sentence
- 36:57 — Moshi and Whisper
- 38:12 — How we evaluate TTS
- 39:17 — The whole field on one map
- 40:13 — Reading a 2026 paper
- 41:02 — The gist
- 41:53 — Bonus: FACodec, AudioLM, and Tortoise
- 44:08 — Bonus: multilingual, safety, and beyond