Action and behavior recognition

TubeR: Tubelet Transformer for Video Action Detection

We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to …

Diagnosing Errors in Video Relation Detectors

Video relation detection forms a new and challenging problem in computer vision, where subjects and objects need to be localized spatio-temporally and a predicate label needs to be assigned if and only if there is an interaction between the two. …

Zero-Shot Action Recognition from Diverse Object-Scene Compositions

This paper investigates the problem of zero-shot action recognition, in the setting where no training videos with seen actions are available. For this challenging scenario, the current leading approach is to transfer knowledge from the image domain …

Skeleton-Contrastive 3D Action Representation Learning

This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via a noise …

Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term …

Social Fabric: Tubelet Compositions for Video Relation Detection

This paper strives to classify and detect the relationship between object tubelets appearing within a video as a triplet. Where existing works treat object proposals or tubelets as single entities and model their relations a posteriori, we propose …

Deep 3D human pose estimation: A review

Three-dimensional (3D) human pose estimation involves estimating the articulated 3D joint locations of a human body from an image or video. Due to its widespread applications in a great variety of areas, such as human motion analysis, human–computer …

Object Priors for Classifying and Localizing Unseen Actions

This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples. Where existing work relies on transferring global attribute or object information from seen to unseen …