The generative & multimodal frontier

See where generation is heading beyond autoregressive text: language models that generate by diffusion, image models that predict the next scale or generate in a single step, unified models that both understand and create, and the video, speech, and world models reaching into the physical world.

Large Language Diffusion Models
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Mean Flows for One-step Generative Modeling
Emerging Properties in Unified Multimodal Pretraining
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Cosmos World Foundation Model Platform for Physical AI
Orpheus TTS
Fish Audio S2
IndexTTS2
CosyVoice 2
Higgs Audio v2
Chatterbox
Spark-TTS
Kokoro
Moshi: a speech-text foundation model for real-time dialogue
π0.5: a Vision-Language-Action Model with Open-World Generalization