Video analysis and understanding

Towards Open-Vocabulary Video Instance Segmentation

Video Instance Segmentation (VIS) aims to segment and categorize objects in videos from a closed set of training categories, and thus lacks the ability to generalize to novel categories in real-world videos. To address this limitation, we make …
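The open-vocabulary setting is commonly realized by matching instance features against text embeddings of category names, so novel categories can be added at test time without retraining. The sketch below illustrates that general idea with toy embeddings; it is not the paper's method, and all names and values are hypothetical.

```python
import numpy as np

def open_vocab_classify(mask_features, text_features, category_names):
    """Assign each segmented instance the category whose text embedding
    is most similar (cosine similarity). Novel categories are handled
    simply by extending the text-embedding list at test time."""
    m = mask_features / np.linalg.norm(mask_features, axis=1, keepdims=True)
    t = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    sims = m @ t.T                      # (instances, categories)
    return [category_names[i] for i in sims.argmax(axis=1)]

# Toy embeddings: two instances, three (possibly novel) categories.
masks = np.array([[1.0, 0.1, 0.0], [0.0, 0.9, 0.4]])
texts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
print(open_vocab_classify(masks, texts, ["dog", "skateboard", "drone"]))
```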

Multi-Task Edge Prediction in Temporally-Dynamic Video Graphs

Graph neural networks have been shown to learn effective node representations, enabling node-, link-, and graph-level inference. Conventional graph networks assume static relations between nodes, while relations between entities in a video often evolve …
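Edge prediction over a temporally-dynamic graph can be pictured as scoring every node pair at every timestep from time-varying node features, so that the same pair of entities may be linked in one frame and not the next. A minimal sketch of that framing (not the paper's model; shapes and values are illustrative):

```python
import numpy as np

def edge_scores(node_feats):
    """Score every node pair at every timestep from time-varying node
    features. node_feats has shape (T, N, D); returns (T, N, N) edge
    probabilities via a sigmoid over pairwise inner products."""
    logits = np.einsum('tnd,tmd->tnm', node_feats, node_feats)
    return 1.0 / (1.0 + np.exp(-logits))

# Two frames, two nodes: the nodes' features align in frame 0 and
# become orthogonal in frame 1, so their edge score drops over time.
feats = np.array([
    [[1.0, 0.0], [1.0, 0.0]],   # frame 0: similar features
    [[1.0, 0.0], [0.0, 1.0]],   # frame 1: dissimilar features
])
s = edge_scores(feats)
print(s[0, 0, 1], s[1, 0, 1])   # edge (0, 1) weakens from frame 0 to 1
```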

Teaching a New Dog Old Tricks: Contrastive Random Walks in Videos with Unsupervised Priors

This work proposes codebook encodings for graph networks that operate on hyperbolic manifolds. Where graph networks commonly learn node representations in Euclidean space, recent work has provided a generalization to Riemannian manifolds, with a …
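A codebook on a hyperbolic manifold amounts to assigning each node embedding to its nearest code under the hyperbolic rather than the Euclidean metric. The sketch below uses the standard Poincaré-ball distance for a hard nearest-code assignment; it illustrates the general construction, not the paper's specific encoding.

```python
import numpy as np

def poincare_dist(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball model of hyperbolic
    space (all points must have Euclidean norm < 1)."""
    sq = np.sum((u - v) ** 2, axis=-1)
    du = 1.0 - np.sum(u ** 2, axis=-1)
    dv = 1.0 - np.sum(v ** 2, axis=-1)
    return np.arccosh(1.0 + 2.0 * sq / (du * dv + eps))

def assign_codes(nodes, codebook):
    """Hard-assign each node embedding to its nearest codebook entry
    under the hyperbolic (not Euclidean) metric."""
    d = np.stack([poincare_dist(nodes, c) for c in codebook], axis=1)
    return d.argmin(axis=1)

# Two node embeddings, two codebook entries inside the unit ball.
nodes = np.array([[0.1, 0.0], [0.0, 0.8]])
codebook = np.array([[0.2, 0.0], [0.0, 0.7]])
print(assign_codes(nodes, codebook))
```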

Less than Few: Self-Shot Video Instance Segmentation

The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While few-shot learning has proven effective, even labelling a few examples appears unrealistic in many practical video settings. This is especially true as the …

How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs

We aim to understand how actions are performed and identify subtle differences, such as ‘fold firmly’ vs. ‘fold gently’. To this end, we propose a method which recognizes adverbs across different actions. However, such fine-grained annotations are …

Variational Abnormal Behavior Detection with Motion Consistency

Abnormal crowd behavior detection has recently attracted increasing attention due to its wide applicability in computer vision. However, it remains an extremely challenging task due to the great variability of abnormal behavior coupled …

Learning Hierarchical Embedding for Video Instance Segmentation

In this paper, we address video instance segmentation using a new generative model that learns effective representations of the target and background appearance. We propose to exploit hierarchical structural embedding over spatio-temporal space, …

Motion-Augmented Self-Training for Video Recognition at Smaller Scale

The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than appearance, we strive to train our …
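Self-training of this kind typically follows a pseudo-labeling loop: a teacher scores unlabeled clips, only sufficiently confident predictions are kept, and the argmax class becomes the student's training target. A generic sketch of that selection step (the threshold and outputs are illustrative, not the paper's configuration):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.8):
    """Generic self-training step: keep only unlabeled clips whose
    teacher prediction is confident enough, and use the argmax class
    as the pseudo-label for the student."""
    conf = probs.max(axis=1)
    keep = conf >= threshold
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]

# Teacher softmax outputs for four unlabeled clips.
probs = np.array([
    [0.95, 0.03, 0.02],   # confident  -> kept
    [0.40, 0.35, 0.25],   # uncertain  -> discarded
    [0.10, 0.85, 0.05],   # confident  -> kept
    [0.33, 0.33, 0.34],   # uncertain  -> discarded
])
idx, labels = select_pseudo_labels(probs)
print(idx, labels)   # clips 0 and 2 survive, with classes 0 and 1
```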

Few-Shot Transformation of Common Actions into Time and Space

This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query …

On Semantic Similarity in Video Retrieval

Current video retrieval efforts all base their evaluation on an instance-based assumption: that only a single caption is relevant to a query video and vice versa. We demonstrate that this assumption results in performance comparisons often not …
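The contrast between the instance-based assumption and a semantic-similarity view can be made concrete with two standard metrics: recall@1, which credits only the single paired caption, versus nDCG with graded relevance, which also credits near-paraphrases. A toy sketch under assumed relevance scores (not the paper's proposed evaluation):

```python
import numpy as np

def recall_at_1(ranking, true_idx):
    """Instance-based view: only the single paired caption is relevant."""
    return float(ranking[0] == true_idx)

def dcg(rels):
    return float(np.sum((2.0 ** rels - 1) / np.log2(np.arange(2, len(rels) + 2))))

def ndcg(ranking, relevance):
    """Graded view: every caption carries a semantic relevance score."""
    ideal = np.sort(relevance)[::-1]
    return dcg(relevance[ranking]) / dcg(ideal)

# One query video, four candidate captions. Caption 0 is the "true"
# paired caption, but caption 1 is a near-paraphrase (relevance 0.9).
relevance = np.array([1.0, 0.9, 0.1, 0.0])
ranking = np.array([1, 0, 2, 3])   # system ranks the paraphrase first

print(recall_at_1(ranking, true_idx=0))    # scored as a complete miss
print(round(ndcg(ranking, relevance), 3))  # graded view: near-perfect
```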