Video analysis and understanding

Less than Few: Self-Shot Video Instance Segmentation

The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While few-shot learning has proven effective, in many practical video settings even labelling a few examples appears unrealistic. This is especially true as the …

How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs

We aim to understand how actions are performed and identify subtle differences, such as ‘fold firmly’ vs. ‘fold gently’. To this end, we propose a method which recognizes adverbs across different actions. However, such fine-grained annotations are …

Variational Abnormal Behavior Detection with Motion Consistency

Abnormal crowd behavior detection has recently attracted increasing attention due to its wide range of applications in computer vision. However, it remains an extremely challenging task due to the great variability of abnormal behavior coupled …

Learning Hierarchical Embedding for Video Instance Segmentation

In this paper, we address video instance segmentation using a new generative model that learns effective representations of the target and background appearance. We propose to exploit hierarchical structural embedding over spatio-temporal space, …

Motion-Augmented Self-Training for Video Recognition at Smaller Scale

The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than appearance, we strive to train our …
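The self-training the abstract refers to is, in its generic form, pseudo-labelling: a teacher model assigns labels to unlabeled clips, and a student is trained on the confident ones. A minimal sketch of that generic loop, not the paper's method; the names (`teacher`, `student`, `threshold`) are hypothetical:

```python
import torch
import torch.nn.functional as F

def self_train_step(teacher, student, optimizer, clips, threshold=0.8):
    """One generic pseudo-labelling update; all details are illustrative."""
    with torch.no_grad():
        probs = F.softmax(teacher(clips), dim=1)   # teacher class probabilities
        conf, pseudo_labels = probs.max(dim=1)     # most confident class per clip
    mask = conf > threshold                        # keep only confident clips
    if mask.any():
        logits = student(clips[mask])
        loss = F.cross_entropy(logits, pseudo_labels[mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```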

Few-Shot Transformation of Common Actions into Time and Space

This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query …

On Semantic Similarity in Video Retrieval

Current video retrieval efforts all ground their evaluation on an instance-based assumption: that only a single caption is relevant to a query video and vice versa. We demonstrate that this assumption results in performance comparisons often not …
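To make the instance-based assumption concrete: standard retrieval evaluation treats exactly one caption-video pair as correct and scores rank-based metrics against that single ground truth. A minimal sketch of recall@k under this one-to-one assumption (the similarity matrix `sims` is a hypothetical placeholder):

```python
import numpy as np

def recall_at_k(sims, k):
    """Instance-based recall@k: sims[i, j] scores caption i against video j;
    caption i's single ground truth is assumed to be video i."""
    n = sims.shape[0]
    ranking = np.argsort(-sims, axis=1)  # videos ranked by similarity, best first
    # A query counts as correct only if its one paired video is in the top k.
    hits = (ranking[:, :k] == np.arange(n)[:, None]).any(axis=1)
    return hits.mean()

sims = np.random.rand(100, 100)  # toy caption-video similarity scores
print(recall_at_k(sims, k=5))
```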

Repetitive Activity Counting by Sight and Sound

This paper strives for repetitive activity counting in videos. Different from existing works, which all analyze the visual video content only, we incorporate for the first time the corresponding sound into the repetition counting process. This …

Support-set bottlenecks for video-text representation learning

The dominant paradigm for learning video-text representations, noise contrastive learning, increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes …
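In its standard InfoNCE form, this contrastive objective is a cross-entropy over pairwise similarities: each video is pulled toward its own caption and pushed away from the other captions in the batch. A minimal sketch of that generic loss, not the paper's support-set variant, with hypothetical embedding tensors:

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    """Generic noise-contrastive (InfoNCE) loss over paired video/text
    embeddings; row i of each tensor comes from the same sample."""
    v = F.normalize(video_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    logits = v @ t.T / temperature                       # pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)   # matches on the diagonal
    # Symmetrize: video-to-text and text-to-video directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = info_nce(torch.randn(32, 256), torch.randn(32, 256))
```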

A Dynamic, Self Supervised, Large Scale AudioVisual Dataset for Stuttered Speech

Stuttering affects at least 1% of the world population. It manifests as irregular disruptions in speech production, which occur in various forms and frequencies: repetition of words or parts of words, prolongations, or blocks in getting …