Moshi: a speech-text foundation model for real-time dialogue

Full-duplex spoken dialogue: both voices modeled as parallel token streams, with a text "Inner Monologue" and roughly 200 ms latency in practice.

Défossez et al. · arXiv 2024 · Foundations.

A free, interactive, animated visual explainer of Moshi: a speech-text foundation model for real-time dialogue — every exhibit computed from the real formulas, with verbatim quotes from the source.

Questions

What is Moshi: a speech-text foundation model for real-time dialogue?
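Moshi is a speech-text foundation model for full-duplex spoken dialogue. It models the user's speech and its own speech as parallel token streams, predicts time-aligned text tokens (the "Inner Monologue") ahead of its audio tokens, and responds with roughly 200 ms latency in practice.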
Who published Moshi: a speech-text foundation model for real-time dialogue, and where?
Défossez et al. — arXiv 2024 (arXiv:2410.00037).
Where can I find a visual explainer of Moshi: a speech-text foundation model for real-time dialogue?
Right here — a free, interactive, animated walkthrough of the whole paper, with exhibits computed from the real formulas and verbatim quotes from the source.
