Vision & multimodal

From the CNN that cracked ImageNet to vision Transformers, contrastive image-text models, segmentation, speech, vision-language assistants, and the diffusion lineage behind modern image generation.