Jana Kosecka

George Mason University

From Pixels to Words: Frontiers of Multi-Modal Vision-Language Learning

About the talk

In recent years, foundation models have reshaped the field of computer vision, achieving remarkable advances across tasks such as classification, detection, segmentation, and image generation. This talk surveys the current state of foundation vision models, highlighting advances in self-supervised learning, masked image modeling, contrastive learning, and vision transformers. These models are pre-trained on massive, diverse datasets and then adapted to a wide range of downstream applications. Beyond vision alone, we are witnessing a rapid convergence toward multi-modal vision-language models that align visual and textual information in shared embedding spaces. Models such as Segment Anything, CLIP, BLIP, Flamingo, and LLaVA bridge the gap between visual understanding and natural language, enabling powerful new capabilities such as open-vocabulary recognition, image captioning, visual question answering, and interactive grounding. The talk will explore the underlying principles that enable these models to generalize and highlight how vision-language integration is transforming both research and real-world applications. I will also discuss emerging challenges and future directions at the intersection of perception, language, and reasoning.
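
To give a concrete sense of the contrastive alignment idea mentioned above, the sketch below shows a minimal CLIP-style symmetric contrastive loss over paired image and text embeddings in a shared space. It is an illustrative example only, not the implementation of any model discussed in the talk; the function name, dimensions, and temperature value are assumptions chosen for clarity, and the random tensors stand in for outputs of a vision encoder and a text encoder.

    # Minimal sketch of CLIP-style contrastive alignment between image and text
    # embeddings in a shared space (illustrative only; names and values here are
    # placeholders, not the models discussed in the talk).
    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features: torch.Tensor,
                              text_features: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
        """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
        # Project both modalities onto the unit sphere so similarity is cosine.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / T.
        logits = image_features @ text_features.t() / temperature

        # Matched image/text pairs lie on the diagonal of the similarity matrix.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Cross-entropy in both directions (image-to-text and text-to-image).
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)

    # Example with random features standing in for encoder outputs.
    if __name__ == "__main__":
        batch, dim = 8, 512
        img = torch.randn(batch, dim)   # would come from a vision encoder
        txt = torch.randn(batch, dim)   # would come from a text encoder
        print(clip_contrastive_loss(img, txt).item())

Training with this kind of objective pulls matched image/text pairs together and pushes mismatched pairs apart, which is what makes open-vocabulary recognition by text prompting possible after pre-training.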