Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, lacking the generalization ability to handle novel categories in real-world videos. To address this limitation, we make …
Graph neural networks have shown to learn effective node representations, enabling node-, link-, and graph-level inference. Conventional graph networks assume static relations between nodes, while relations between entities in a video often evolve …
This work proposes codebook encodings for graph networks that operate on hyperbolic manifolds. Where graph networks commonly learn node representations in Euclidean space, recent work has provided a generalization to Riemannian manifolds, with a …
The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time. While proven effective, in many practical video settings even labelling a few examples appears unrealistic. This is especially true as the …
We aim to understand how actions are performed and identify subtle differences, such as ‘fold firmly’ vs. ‘fold gently’. To this end, we propose a method which recognizes adverbs across different actions. However, such fine-grained annotations are …
Abnormal crowd behavior detection has recently attracted increasing attention due to its wide applications in computer vision research areas. However, it is still an extremely challenging task due to the great variability of abnormal behavior coupled …
In this paper, we address video instance segmentation using a new generative model that learns effective representations of the target and background appearance. We propose to exploit hierarchical structural embedding over spatio-temporal space, …
The goal of this paper is to self-train a 3D convolutional neural network on an unlabeled video collection for deployment on small-scale video collections. As smaller video datasets benefit more from motion than appearance, we strive to train our …
This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query …
Current video retrieval efforts all found their evaluation on an instance-based assumption, that only a single caption is relevant to a query video and vice versa. We demonstrate that this assumption results in performance comparisons often not …