Visual reasoning and logical representation

RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To …

Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious …

Intriguing Properties of Hyperbolic Embeddings in Vision-Language Models

Vision-language models have in a short time been established as powerful networks, demonstrating strong performance on a wide range of downstream tasks. A key factor behind their success is the learning of a joint embedding space where pairs of images …

Rotating Features for Object Discovery

The binding problem in human cognition, concerning how the brain represents and connects objects within a fixed network of neural connections, remains a subject of intense debate. Most machine learning efforts addressing this issue in an unsupervised …

Poincaré ResNet

This paper introduces an end-to-end residual network that operates entirely on the Poincaré ball model of hyperbolic space. Hyperbolic learning has recently shown great potential for visual understanding, but is currently only performed in the …

BISCUIT: Causal Representation Learning from Binary Interactions

Identifying the causal variables of an environment and how to intervene on them is of core value in applications such as robotics and embodied AI. While an agent can commonly interact with the environment and may implicitly perturb the behavior of …

Complex-Valued Autoencoders for Object Discovery

Object-centric representations form the basis of human perception, and enable us to reason about the world and to systematically generalize to new settings. Currently, most works on unsupervised object discovery focus on slot-based approaches, which …

MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g. a question or a category). Unlike most prior works that use explicit, symbolic models which can be computationally …