Representing videos using vocabularies composed of concept detectors appears promising for generic event recognition. While many have recently shown the benefits of concept vocabularies for recognition, the characteristics of a universal concept vocabulary suited for representing events have not been studied. In this paper, we study how to create an effective vocabulary for arbitrary-event recognition in web video. We consider five research questions related to the number, the type, the specificity, the quality and the normalization of the detectors in concept vocabularies. A rigorous experimental protocol using a pool of 1346 concept detectors trained on publicly available annotations, two large arbitrary web video datasets and a common event recognition pipeline allows us to analyze the performance of various concept vocabulary definitions. From the analysis we arrive at the recommendation that for effective event recognition the concept vocabulary should (i) contain more than 200 concepts, (ii) be diverse by covering object, action, scene, people, animal and attribute concepts, (iii) include both general and specific concepts, (iv) grow in the number of concepts rather than in the quality of the individual detectors, and (v) contain detectors that are appropriately normalized. We consider the recommendations for recognizing video events by concept vocabularies the most important contribution of the paper, as they provide guidelines for future work.
@Article{HabibianCVIU2014,
author = "Habibian, A. and Snoek, C. G. M.",
title = "Recommendations for Recognizing Video Events by Concept Vocabularies",
journal = "Computer Vision and Image Understanding",
volume = "124",
pages = "110--122",
year = "2014",
url = "https://ivi.fnwi.uva.nl/isis/publications/2014/HabibianCVIU2014",
pdf = "https://ivi.fnwi.uva.nl/isis/publications/2014/HabibianCVIU2014/HabibianCVIU2014.pdf",
has_image = 1
}