Dynamic Transformer for Few-shot Instance Segmentation


Few-shot instance segmentation aims to train an instance segmentation model that can fast adapt to novel classes with only a few reference images. Existing methods are usually derived from standard detection models and tackle few-shot instance segmentation indirectly by conducting classification, box regression, and mask prediction on a large set of redundant proposals followed by indispensable post-processing, e.g., Non-Maximum Suppression. Such complicated hand-crafted procedures and hyperparameters lead to degraded optimization and insufficient generalization ability. In this work, we propose an end-to-end Dynamic Transformer Network, DTN for short, to directly segment all target object instances from arbitrary categories given by reference images, relieving the requirements of dense proposal generation and post-processing. Specifically, a small set of Dynamic Queries, conditioned on reference images, are exclusively assigned to target object instances and generate all the instance segmentation masks of reference categories simultaneously. Moreover, a Semantic-induced Transformer Decoder is introduced to constrain the cross-attention between dynamic queries and target images within the pixels of the reference category, which suppresses the noisy interaction with the background and irrelevant categories. Extensive experiments are conducted on the COCO-20 dataset. The experiment results demonstrate that our proposed Dynamic Transformer Network significantly outperforms the state-of-the-arts.

ACMMM 2022