Vision & multimodal
From the residual CNN that topped ImageNet to vision Transformers, promptable segmentation, contrastive image-text models, speech recognition, vision-language assistants, and the diffusion lineage behind modern image generation.
- Deep Residual Learning for Image Recognition
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Segment Anything
- Learning Transferable Visual Models From Natural Language Supervision
- Robust Speech Recognition via Large-Scale Weak Supervision
- Visual Instruction Tuning
- Denoising Diffusion Probabilistic Models
- High-Resolution Image Synthesis with Latent Diffusion Models
- Scalable Diffusion Models with Transformers
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis