Learning Hierarchical Embedding for Video Instance Segmentation

Abstract

In this paper, we address video instance segmentation with a new generative model that learns effective representations of target and background appearance. We propose to exploit a hierarchical structural embedding over the spatio-temporal space, which is compact, powerful, and flexible compared with current tracking-by-detection methods. Specifically, our model segments and tracks instances across space and time in a single forward pass, formulated as hierarchical embedding learning. The model is trained to locate the pixels belonging to specific instances over a video clip. We first design a novel mixing function to better fuse spatio-temporal embeddings. Moreover, we introduce normalizing flows to further improve the robustness of the learned appearance embedding, which theoretically extends conventional generative flows to a factorized conditional scheme. Comprehensive experiments on the YouTube-VIS video instance segmentation benchmark demonstrate the effectiveness of the proposed approach. Furthermore, we evaluate our method on an unsupervised video object segmentation dataset to demonstrate its generalizability.
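The abstract does not spell out the mixing function, but one plausible form of temporal fusion is a learned per-pixel softmax over the frame axis. The sketch below illustrates that idea only; `MixingFusion` and all shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's released code): fusing per-frame
# embeddings of shape (T, C, H, W) with a learned mixing function.
import torch
import torch.nn as nn

class MixingFusion(nn.Module):
    """Predicts per-frame mixing weights and takes a weighted sum over time."""
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv scores each frame's embedding at every spatial location.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (T, C, H, W) -> scores: (T, 1, H, W)
        scores = self.score(embeddings)
        # Softmax over the temporal axis: weights sum to 1 at each pixel.
        weights = torch.softmax(scores, dim=0)
        # Weighted sum over frames yields one fused embedding of shape (C, H, W).
        return (weights * embeddings).sum(dim=0)

# Usage: fuse a 5-frame clip of 64-channel embeddings at 32x32 resolution.
fusion = MixingFusion(channels=64)
clip = torch.randn(5, 64, 32, 32)
fused = fusion(clip)  # shape: (64, 32, 32)
```

Any soft weighting of this kind keeps the fusion differentiable, so the mixing weights can be trained end-to-end with the embedding network.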

Publication
ACMMM 2021
