Action and behavior recognition

Query by Activity Video in the Wild

This paper considers retrieval of videos containing human activity from just a video query. In the literature, a common assumption is that all activities have sufficient labelled examples when learning an embedding for retrieval. However, this …
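As a rough illustration of the retrieval setting only (not the method proposed in the paper), gallery videos can be ranked against a query clip by cosine similarity in a learned embedding space; the function names and the embedding dimension below are hypothetical.

import numpy as np

def cosine_rank(query_emb, gallery_embs):
    """Rank gallery videos by cosine similarity to a query embedding.

    query_emb:    (d,) embedding of the query video
    gallery_embs: (n, d) embeddings of the gallery videos
    Returns gallery indices, most similar first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarities, shape (n,)
    return np.argsort(-sims)          # descending similarity

# Toy usage with random 128-d embeddings for 1000 gallery videos.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 128))
query = rng.normal(size=128)
top10 = cosine_rank(query, gallery)[:10]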

Universal Prototype Transport for Zero-Shot Action Recognition and Localization

This work addresses the problem of recognizing action categories in videos when no training examples are available. The current state-of-the-art enables such a zero-shot recognition by learning universal mappings from videos to a semantic space, …

Graph Switching Dynamical Systems

Dynamical systems with complex behaviours, e.g. immune system cells interacting with a pathogen, are commonly modelled by splitting the behaviour into different regimes, or modes, each with simpler dynamics, and then learning the switching behaviour …
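The regime idea can be made concrete with a toy switching linear dynamical system, a deliberately simplified sketch rather than the graph formulation of the paper: each mode has its own linear transition matrix, and a discrete switch variable decides which dynamics generate the next state.

import numpy as np

def simulate_slds(A_modes, T=200, p_stay=0.95, noise=0.05, seed=0):
    """Simulate a toy switching linear dynamical system.

    A_modes: list of (d, d) transition matrices, one per mode
    p_stay:  probability of keeping the current mode at each step
    Returns the state trajectory (T, d) and the mode sequence (T,).
    """
    rng = np.random.default_rng(seed)
    d = A_modes[0].shape[0]
    x = rng.normal(size=d)
    mode = 0
    states, modes = [], []
    for _ in range(T):
        if rng.random() > p_stay:                      # switching behaviour
            mode = rng.integers(len(A_modes))
        x = A_modes[mode] @ x + noise * rng.normal(size=d)
        states.append(x.copy())
        modes.append(mode)
    return np.array(states), np.array(modes)

# Two modes: a slow rotation and a contraction towards the origin.
theta = 0.1
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
contract = 0.9 * np.eye(2)
trajectory, mode_seq = simulate_slds([rotate, contract])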

Are 3D convolutional networks inherently biased towards appearance?

3D convolutional networks, as direct inheritors of 2D convolutional networks for images, have left their mark on action recognition in videos. Combined with pretraining on large-scale video data, high classification accuracies have been obtained on …

TubeR: Tubelet Transformer for Video Action Detection

We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to …

Diagnosing Errors in Video Relation Detectors

Video relation detection forms a new and challenging problem in computer vision, where subjects and objects need to be localized spatio-temporally and a predicate label needs to be assigned if and only if there is an interaction between the two. …
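A matching criterion such a diagnosis has to reason about is volumetric tubelet overlap; the sketch below shows one simple variant (per-frame IoU summed over shared frames and normalised by the temporal union), which may differ in detail from the protocol used in the paper.

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def tubelet_viou(tube_a, tube_b):
    """Volumetric IoU between two tubelets.

    Each tubelet is a dict {frame_index: (x1, y1, x2, y2)}.
    Per-frame IoU is summed over frames where both tubes exist and
    normalised by the union of their temporal extents.
    """
    frames_a, frames_b = set(tube_a), set(tube_b)
    shared = frames_a & frames_b
    overlap = sum(box_iou(tube_a[f], tube_b[f]) for f in shared)
    return overlap / (len(frames_a | frames_b) + 1e-9)

# A predicted relation typically counts as correct when both subject and
# object tubelets exceed a vIoU threshold and the predicate label agrees.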

Zero-Shot Action Recognition from Diverse Object-Scene Compositions

This paper investigates the problem of zero-shot action recognition, in the setting where no training videos with seen actions are available. For this challenging scenario, the current leading approach is to transfer knowledge from the image domain …
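The object-to-action transfer can be illustrated with a minimal, hypothetical sketch that scores unseen actions by combining per-video object confidences with word-embedding similarity between object and action names; it does not reproduce the object-scene compositions of the paper.

import numpy as np

def zero_shot_action_scores(object_scores, object_embs, action_embs):
    """Score unseen actions for one video from object evidence alone.

    object_scores: (n_objects,) confidence of each object in the video,
                   e.g. averaged frame-level detector outputs
    object_embs:   (n_objects, d) word embeddings of the object names
    action_embs:   (n_actions, d) word embeddings of the action names
    Returns (n_actions,) scores; higher means more likely.
    """
    o = object_embs / np.linalg.norm(object_embs, axis=1, keepdims=True)
    a = action_embs / np.linalg.norm(action_embs, axis=1, keepdims=True)
    sim = a @ o.T                     # (n_actions, n_objects) semantic similarity
    return sim @ object_scores        # weight similarity by object evidence

# Toy usage: 100 objects, 10 unseen actions, random 300-d embeddings.
rng = np.random.default_rng(1)
scores = zero_shot_action_scores(rng.random(100),
                                 rng.normal(size=(100, 300)),
                                 rng.normal(size=(10, 300)))
predicted_action = int(np.argmax(scores))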

Skeleton-Contrastive 3D Action Representation Learning

This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via a noise …
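A minimal sketch of the noise-contrastive recipe such self-supervised methods build on, assuming two augmented views of each skeleton sequence are embedded and pulled together with an InfoNCE-style loss; the encoder, augmentations and hyperparameters here are placeholders, not the paper's.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss between two batches of embeddings.

    z1, z2: (batch, dim) embeddings of two augmented views of the same
            skeleton sequences; matching rows are positives, all other
            rows in the batch act as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Toy usage: 32 sequences, 128-d embeddings from some skeleton encoder.
z_view1 = torch.randn(32, 128)
z_view2 = torch.randn(32, 128)
loss = info_nce(z_view1, z_view2)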

Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term …

Social Fabric: Tubelet Compositions for Video Relation Detection

This paper strives to classify and detect the relationship between object tubelets appearing within a video as a triplet. Where existing works treat object proposals or tubelets as single entities and model their relations a posteriori, we propose to …