Moshi: a speech-text foundation model for real-time dialogue
Full-duplex spoken dialogue: both voices modeled as parallel token streams, with a text inner monologue and roughly 200 ms of latency in practice.
Défossez et al. · arXiv 2024 · Foundations
A free, interactive, animated visual explainer of Moshi: a speech-text foundation model for real-time dialogue. Every exhibit is computed from the paper's formulas and paired with verbatim quotes from the source.
Questions
- What is Moshi: a speech-text foundation model for real-time dialogue?
  A full-duplex spoken dialogue model: both voices are modeled as parallel token streams, with a text inner monologue and roughly 200 ms of latency in practice.
- Who published Moshi: a speech-text foundation model for real-time dialogue, and where?
  Défossez et al., arXiv 2024 (arXiv:2410.00037).
- Where can I find a visual explainer of Moshi: a speech-text foundation model for real-time dialogue?
  Right here: a free, interactive, animated walkthrough of the whole paper, with exhibits computed from the paper's formulas and verbatim quotes from the source.
Related explainers
- Attention Is All You Need
- GPT-3: Language Models are Few-Shot Learners
- Mixtral of Experts
- Training Compute-Optimal Large Language Models
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- BERT: Pre-training of Deep Bidirectional Transformers
- Scaling Laws for Neural Language Models
- Adam: A Method for Stochastic Optimization