How AI Learned to Speak

Name: How AI Learned to Speak
Uploaded: 2026-06-26
Duration: 46 min 44 s
Description: A from-zero ramp through how modern text-to-speech actually works: a sound wave as numbers, the neural codec that turns audio into tokens, TTS as next-token prediction, and on to today’s talking LLMs. Built on the verified explainers behind it.

47 min · 27 chapters

Chapters

0:00 — TTS as next-token prediction
0:50 — Sound is just numbers
2:47 — The mel spectrogram
4:13 — From text to phonemes
5:37 — The alignment problem
6:12 — Before deep learning
6:56 — WaveNet
7:43 — Tacotron and attention
9:46 — FastSpeech and the variance adaptor
12:15 — Flow, diffusion, and VITS
13:38 — Audio as tokens: VQ and RVQ
15:51 — Neural audio codecs
19:09 — VALL-E and the three-slot frame
21:37 — A gallery of modern systems
24:58 — How these models are trained
28:42 — Post-training with GRPO
29:41 — Inference and the latency problem
31:29 — Making it fast
33:48 — Controlling prosody, voice, and emotion
36:13 — A decade in one sentence
36:57 — Moshi and Whisper
38:12 — How we evaluate TTS
39:17 — The whole field on one map
40:13 — Reading a 2026 paper
41:02 — The gist
41:53 — Bonus: FACodec, AudioLM, and Tortoise
44:08 — Bonus: multilingual, safety, and beyond