The generative & multimodal frontier
See where generation is heading beyond autoregressive text: language models that generate by diffusion, image models that predict the next scale or generate in a single step, single models that both understand and create, and the video, speech, and world models reaching into the physical world.
- Large Language Diffusion Models
- Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- Mean Flows for One-step Generative Modeling
- Emerging Properties in Unified Multimodal Pretraining
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
- Cosmos World Foundation Model Platform for Physical AI
- Moshi: a speech-text foundation model for real-time dialogue
- π0.5: a Vision-Language-Action Model with Open-World Generalization