Multi-modal learning

Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence

Vision-language alignment is crucial for various downstream tasks such as cross-modal generation and retrieval. Previous multimodal approaches like CLIP utilize InfoNCE to maximize mutual information, primarily aligning pairwise samples across …

VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an …

Vision and Language Training Helps Deploy Taxonomic Knowledge But Does Not Fundamentally Alter It

Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this …

TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Multimodal Large Language Models (MLLMs) struggle at this task. In this paper, we introduce TWIST & SCOUT, a framework that …

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success …

Hyperbolic Safety-Aware Vision-Language Models

Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model's knowledge of unsafe concepts. …

Compositional entailment learning for hyperbolic vision-language models

Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight

The Sound of Water: Inferring Physical Properties from Pouring Liquids

TULIP: Token-length Upgraded CLIP