Publications

Last updated: Wed May 28 17:07:48 2014

Journal papers

Show all abstracts | Hide all abstracts

: Amirhossein Habibian and Cees G. M. Snoek. Recommendations for recognizing video events by concept vocabularies. Computer Vision and Image Understanding, 2014. In press.
[ BibTeX | abstract ]
: Svetlana Kordumova, Xirong Li, and Cees G. M. Snoek. Best practices for learning video concept detectors from social media examples. Multimedia Tools and Applications, 2014. In press.
[ BibTeX | | abstract ]
Learning video concept detectors from social media sources, such as Flickr images and YouTube videos, has the potential to address a wide variety of concept queries for video search. While the potential has been recognized by many, and progress on the topic has been impressive, we argue that key questions crucial to know how to learn effective video concept detectors from social media examples? remain open. As an initial attempt to answer these questions, we conduct an experimental study using a video search engine which is capable of learning concept detectors from social media examples, be it socially tagged videos or socially tagged images. Within the video search engine we investigate three strategies for positive example selection, three negative example selection strategies and three learning strategies. The performance is evaluated on the challenging TRECVID 2012 benchmark consisting of 600 h of Internet video. From the experiments we derive four best practices: (1) tagged images are a better source for learning video concepts than tagged videos, (2) selecting tag relevant positive training examples is always beneficial, (3) selecting relevant negative examples is advantageous and should be treated differently for video and image sources, and (4) learning concept detectors with selected relevant training data before learning is better then incorporating the relevance during the learning process. The best practices within our video search engine lead to state-of-the-art performance in the TRECVID 2013 benchmark for concept detection without manually provided annotations.
: Gregory K. Myers, Ramesh Nallapati, Julien van Hout, Stephanie Pancoast, Ram Nevatia, Chen Sun, Amirhossein Habibian, Dennis C. Koelma, Koen E. A. van de Sande, Arnold W. M. Smeulders, and Cees G. M. Snoek. Evaluating multimedia features and fusion for example-based event detection. Machine Vision and Applications, 25(1):17-32, January 2014.
[ BibTeX | | abstract ]
Multimedia event detection (MED)is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME's performance in the 2012 TRECVID MED evaluation was one of the best reported.
: Xirong Li, Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders. Bootstrapping visual categorization with relevant negatives. IEEE Transactions on Multimedia, 15(4):933-945, June 2013.
[ BibTeX | | abstract ]
Learning classifiers for many visual concepts are important for image categorization and retrieval. As a classifier tends to misclassify negative examples which are visually similar to positive ones, inclusion of such misclassified and thus relevant negatives should be stressed during learning. User-tagged images are abundant online, but which images are the relevant negatives remains unclear. Sampling negatives at random is the de facto standard in the literature. In this paper, we go beyond random sampling by proposing Negative Bootstrap. Given a visual concept and a few positive examples, the new algorithm iteratively finds relevant negatives. Per iteration, we learn from a small proportion of many user-tagged images, yielding an ensemble of meta classifiers. For efficient classification, we introduce Model Compression such that the classification time is independent of the ensemble size. Compared with the state of the art, we obtain relative gains of 14% and 18% on two present-day benchmarks in terms of mean average precision. For concept search in one million images, model compression reduces the search time from over 20 h to approximately 6 min. The effectiveness and efficiency, without the need of manually labeling any negatives, make negative bootstrap appealing for learning better visual concept classifiers.
: Bouke Huurnink, Cees G. M. Snoek, Maarten de Rijke, and Arnold W. M. Smeulders. Content-based analysis improves audiovisual archive retrieval. IEEE Transactions on Multimedia, 14(4):1166-1178, August 2012.
[ BibTeX | | abstract ]
Content-based video retrieval is maturing to the point where it can be used in real-world retrieval practices. One such practice is the audiovisual archive, whose users increasingly require fine-grained access to broadcast television content. In this paper, we take into account the information needs and retrieval data already present in the audiovisual archive, and demonstrate that retrieval performance can be significantly improved when content-based methods are applied to search. To the best of our knowledge, this is the first time that the practice of an audiovisual archive has been taken into account for quantitative retrieval evaluation. To arrive at our main result, we propose an evaluation methodology tailored to the specific needs and circumstances of the audiovisual archive, which are typically missed by existing evaluation initiatives. We utilize logged searches, content purchases, session information, and simulators to create realistic query sets and relevance judgments. To reflect the retrieval practice of both the archive and the video retrieval community as closely as possible, our experiments with three video search engines incorporate archive-created catalog entries as well as state-of-the-art multimedia content analysis results. A detailed query-level analysis indicates that individual content-based retrieval methods such as transcript-based retrieval and concept-based retrieval yield approximately equal performance gains. When combined, we find that content-based video retrieval incorporated into the archive�s practice results in significant performance increases for shot retrieval and for retrieving entire television programs. The time has come for audiovisual archives to start accommodating content-based video retrieval methods into their daily practice.
: Xirong Li, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders. Harvesting social images for bi-concept search. IEEE Transactions on Multimedia, 14(4):1091-1104, August 2012.
[ BibTeX | | abstract ]
Searching for the co-occurrence of two visual concepts in unlabeled images is an important step towards answering complex user queries. Traditional visual search methods use combinations of the confidence scores of individual concept detectors to tackle such queries. In this paper we introduce the notion of bi-concepts, a new concept-based retrieval method that is directly learned from social-tagged images. As the number of potential bi-concepts is gigantic, manually collecting training examples is infeasible. Instead, we propose a multimedia framework to collect de-noised positive as well as informative negative training examples from the social web, to learn bi-concept detectors from these examples, and to apply them in a search engine for retrieving bi-concepts in unlabeled images. We study the behavior of our bi-concept search engine using 1.2M social-tagged images as a data source. Our experiments indicate that harvesting examples for bi-concepts differs from traditional single-concept methods, yet the examples can be collected with high accuracy using a multi-modal approach. We find that directly learning bi-concepts is better than oracle linear fusion of single-concept detectors, with a relative improvement of 100%. This study reveals the potential of learning high-order semantics from social images, for free, suggesting promising new lines of research.
: Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. Visual synonyms for landmark image retrieval. Computer Vision and Image Understanding, 116(2):238-249, February 2012.
[ BibTeX | | abstract ]
In this paper, we address the incoherence problem of the visual words in bag-of-words vocabularies. Different from existing work, which assigns words based on closeness in descriptor space, we focus on identifying pairs of independent, distant words - the visual synonyms - that are likely to host image patches of similar visual reality. We focus on landmark images, where the image geometry guides the detection of synonym pairs. Image geometry is used to find those image features that lie in the nearly identical physical location, yet are assigned to different words of the visual vocabulary. Defined in this way, we evaluate the validity of visual synonyms. We also examine the closeness of synonyms in the L2-normalized feature space. We show that visual synonyms may successfully be used for vocabulary reduction. Furthermore, we show that combining the reduced visual vocabularies with synonym augmentation, we perform on par with the state-of-the-art bag-of-words approach, while having a 98% smaller vocabulary.
: Jeroen Steggink and Cees G. M. Snoek. Adding semantics to image-region annotations with the name-it-game. Multimedia Systems, 17(5):367-378, October 2011.
[ BibTeX | | abstract ]
In this paper we present the Name-It-Game, an interactive multimedia game fostering the swift creation of a large data set of region-based image annotations. Compared to existing annotation games, we consider an added semantic structure, by means of the WordNet ontology, the main innovation of the Name-It-Game. Using an ontology-powered game, instead of the more traditional annotation tools, potentially makes region-based image labeling more fun and accessible for every type of user. However, the current games often present the players with hard-to-guess objects. To prevent this from happening in the Name-It-Game, we successfully identify WordNet categories which filter out hard-to-guess objects. To verify the speed of the annotation process, we compare the online Name-It-Game with a desktop tool with similar features. Results show that the Name-It-Game outperforms this tool for semantic region-based image labeling. Lastly, we measure the accuracy of the produced segmentations and compare them with carefully created LabelMe segmentations. Judging from the quantitative and qualitative results, we believe the segmentations are competitive to those of LabelMe, especially when averaged over multiple games. By adding semantics to region-based image annotations, using the Name-It-Game, we have opened up an efficient means to provide precious labels in a playful manner.
: Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Empowering visual categorization with the GPU. IEEE Transactions on Multimedia, 13(1):60-70, February 2011.
[ BibTeX | | abstract ]
Visual categorization is important to manage large collections of digital images and video, where textual meta-data is often incomplete or simply unavailable. The bag-of-words model has become the most powerful method for visual categorization of images and video. Despite its high accuracy, a severe drawback of this model is its high computational cost. As the trend to increase computational power in newer CPU and GPU architectures is to increase their level of parallelism, exploiting this parallelism becomes an important direction to handle the computational cost of the bag-of-words approach. When optimizing a system based on the bag-of-words approach, the goal is to minimize the time it takes to process batches of images. Additionally, we also consider power usage as an evaluation metric. In this paper, we analyze the bag-of-words model for visual categorization in terms of computational cost and identify two major bottlenecks: the quantization step and the classification step. We address these two bottlenecks by proposing two efficient algorithms for quantization and classification by exploiting the GPU hardware and the CUDA parallel programming model. The algorithms are designed to (1) keep categorization accuracy intact, (2) decompose the problem and (3) give the same numerical results. In the experiments on large scale datasets it is shown that, by using a parallel implementation on the Geforce GTX260 GPU, classifying unseen images is 4.8 times faster than a quad-core CPU version on the Core i7 920, while giving the exact same numerical results. In addition, we show how the algorithms can be generalized to other applications, such as text retrieval and video retrieval. Moreover, when the obtained speedup is used to process extra video frames in a video retrieval benchmark, the accuracy of visual categorization is improved by 29%.
: Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582-1596, September 2010.
[ BibTeX | | abstract ]
Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used for feature extraction at salient points. To increase illumination invariance and discriminative power, color descriptors have been proposed. Because many different descriptors exist, a structured overview is required of color invariant descriptors in the context of image category recognition. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors in a structured way. The analytical invariance properties of color descriptors are explored, using a taxonomy based on invariance properties with respect to photometric transformations, and tested experimentally using a dataset with known illumination conditions. In addition, the distinctiveness of color descriptors is assessed experimentally using two benchmarks, one from the image domain and one from the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results reveal further that, for light intensity changes, the usefulness of invariance is category-specific. Overall, when choosing a single descriptor and no prior knowledge about the dataset and object and scene categories is available, the OpponentSIFT is recommended. Furthermore, a combined set of color descriptors outperforms intensity-based SIFT and improves category recognition by 8% on the PASCAL VOC 2007 and by 7% on the MediaMill Challenge.
: Daragh Byrne, Aiden R. Doherty, Cees G. M. Snoek, Gareth J. F. Jones, and Alan F. Smeaton. Everyday concept detection in visual lifelogs: Validation, relationships and trends. Multimedia Tools and Applications, 49(1):119-144, August 2010.
[ BibTeX | | abstract ]
The Microsoft SenseCam is a small lightweight wearable camera used to passively capture photos and other sensor readings from a user�s day-to-day activities. It captures on average 3,000 images in a typical day, equating to almost 1 million images per year. It can be used to aid memory by creating a personal multimedia lifelog, or visual recording of the wearer�s life. However the sheer volume of image data captured within a visual lifelog creates a number of challenges, particularly for locating relevant content. Within this work, we explore the applicability of semantic concept detection, a method often used within video retrieval, on the domain of visual lifelogs. Our concept detector models the correspondence between low-level visual features and high-level semantic concepts (such as indoors, outdoors, people, buildings, etc.) using supervised machine learning. By doing so it determines the probability of a concept�s presence. We apply detection of 27 everyday semantic concepts on a lifelog collection composed of 257,518 SenseCam images from 5 users. The results were evaluated on a subset of 95,907 images, to determine the accuracy for detection of each semantic concept. We conducted further analysis on the temporal consistency, co-occurance and relationships within the detected concepts to more extensively investigate the robustness of the detectors within this domain.
: Cees G. M. Snoek and Arnold W. M. Smeulders. Visual-concept search solved?. IEEE Computer, 43(6):76-78, June 2010.
[ BibTeX | | abstract ]
Progress in visual-concept search suggests that machine understanding of images is within reach.
: Ork de Rooij, Marcel Worring, and Jack J. van Wijk. Mediatable: Interactive categorization of multimedia collections. IEEE Computer Graphics and Applications, 30(5):42-51, May 2010.
[ BibTeX | http ] abstract ]
: Jan C. van Gemert, Cees G. M. Snoek, Cor J. Veenman, Arnold W. M. Smeulders, and Jan-Mark Geusebroek. Comparing compact codebooks for visual categorization. Computer Vision and Image Understanding, 114(4):450-462, April 2010.
[ BibTeX | | abstract ]
In the face of current large-scale video libraries, the practical applicability of content-based indexing algorithms is constrained by their efficiency. This paper strives for efficient large-scale video indexing by comparing various visual-based concept categorization techniques. In visual categorization, the popular codebook model has shown excellent categorization performance. The codebook model represents continuous visual features by discrete prototypes predefined in a vocabulary. The vocabulary size has a major impact on categorization efficiency, where a more compact vocabulary is more efficient. However, smaller vocabularies typically score lower on classification performance than larger vocabularies. This paper compares four approaches to achieve a compact codebook vocabulary while retaining categorization performance. For these four methods, we investigate the trade-off between codebook compactness and categorization performance. We evaluate the methods on more than 200 h of challenging video data with as many as 101 semantic concepts. The results allow us to create a taxonomy of the four methods based on their efficiency and categorization performance.
: Ork de Rooij and Marcel Worring. Browsing video along multiple threads. IEEE Transactions on Multimedia, 12(2):121-130, February 2010.
[ BibTeX | http ] abstract ]
: Xirong Li, Cees G. M. Snoek, and Marcel Worring. Learning social tag relevance by neighbor voting. IEEE Transactions on Multimedia, 11(7):1310-1322, November 2009.
[ BibTeX | | abstract ]
Social image analysis and retrieval is important for helping people organize and access the increasing amount of user-tagged multimedia. Since user tagging is known to be uncontrolled, ambiguous, and overly personalized, a fundamental problem is how to interpret the relevance of a user-contributed tag with respect to the visual content the tag is describing. Intuitively, if different persons label visually similar images using the same tags, these tags are likely to reflect objective aspects of the visual content. Starting from this intuition, we propose in this paper a neighbor voting algorithm which accurately and efficiently learns tag relevance by accumulating votes from visual neighbors. Under a set of well defined and realistic assumptions, we prove that our algorithm is a good tag relevance measurement for both image ranking and tag ranking. Three experiments on 3.5 million Flickr photos demonstrate the general applicability of our algorithm in both social image retrieval and image tag suggestion. Our tag relevance learning algorithm substantially improves upon baselines for all the experiments. The results suggest that the proposed algorithm is promising for real-world applications.
: Cees G. M. Snoek and Marcel Worring. Concept-based video retrieval. Foundations and Trends in Information Retrieval, 4(2):215-322, 2009.
[ BibTeX | | abstract ]
In this paper, we review 300 references on video retrieval, indicating when text-only solutions are unsatisfactory and showing the promising alternatives which are in majority concept-based. Therefore, central to our discussion is the notion of a semantic concept: an objective linguistic description of an observable entity. Specifically, we present our view on how its automated detection, selection under uncertainty, and interactive usage might solve the major scientific problem for video retrieval: the semantic gap. To bridge the gap, we lay down the anatomy of a concept-based video search engine. We present a component-wise decomposition of such an interdisciplinary multimedia system, covering influences from information retrieval, computer vision, machine learning, and human-computer interaction. For each of the components we review state-of-the-art solutions in the literature, each having different characteristics and merits. Because of these differences, we cannot understand the progress in video retrieval without serious evaluation efforts such as carried out in the NIST TRECVID benchmark. We discuss its data, tasks, results, and the many derived community initiatives in creating annotations and baselines for repeatable experiments. We conclude with our perspective on future challenges and opportunities.
: Giang P. Nguyen and Marcel Worring. Interactive access to large image collections using similarity-based visualization. Journal of Visual Languages and Computing, 19(2):203-224, April 2008.
[ BibTeX | | abstract ]
Image collections are getting larger and larger. To access those collections, systems for managing, searching, and browsing are necessary. Visualization plays an essential role in such systems. Existing visualization systems do not analyze all the problems occurring when dealing with large visual collections. In this paper, we make these problems explicit. From there, we establish three general requirements: overview, visibility, and structure preservation. Solutions for each requirement are proposed, as well as functions balancing the different requirements. We present an optimal visualization scheme, supporting users in interacting with large image collections. Experimental results with a collection of 10,000 Corel images, using simulated user actions, show that the proposed scheme significantly improves performance for a given task compared to the 2D grid-based visualizations commonly used in content-based image retrieval.
: Alan F. Smeaton, Peter Wilkins, Marcel Worring, Ork de Rooij, Tat-Seng Chua, and Huanbo Luan. Content-based video retrieval: Three example systems from TRECVid. International Journal of Imaging Systems and Technology, 18(2-3):195-201, 2008.
[ BibTeX | abstract ]
: Cees G. M. Snoek, Marcel Worring, Ork de Rooij, Koen E. A. van de Sande, Rong Yan, and Alexander G. Hauptmann. VideOlympics: Real-time evaluation of multimedia retrieval systems. IEEE MultiMedia, 15(1):86-91, January-March 2008.
[ BibTeX | | abstract ]
Video search is an experience for the senses. As a result, traditional information retrieval metrics can't fully measure the quality of a video search system. To provide a more interactive assessment of today's video search engines, the authors have organized the VideOlympics as a real-time evaluation showcase where systems compete to answer specific video searches in front of a live audience. At VideOlympics, seeing and hearing is believing.
: Giang P. Nguyen and Marcel Worring. Optimization of interactive visual-similarity-based search. ACM Transactions on Multimedia Computing, Communications and Applications, 4(1):7:1-23, January 2008.
[ BibTeX | | abstract ]
At one end of the spectrum, research in interactive content-based retrieval concentrates on machine learning methods for effective use of relevance feedback. On the other end, the information visualization community focuses on effective methods for conveying information to the user. What is lacking is research considering the information visualization and interactive retrieval as truly integrated parts of one content-based search system. In such an integrated system, there are many degrees of freedom like the similarity function, the number of images to display, the image size, different visualization modes, and possible feedback modes. To base the optimal values for all of those on user studies is unfeasible. We therefore develop search scenarios in which tasks and user actions are simulated. From there, the proposed scheme is optimized based on objective constraints and evaluation criteria. In such a manner, the degrees of freedom are reduced and the remaining degrees can be evaluated in user studies. In this article, we present a system that integrates advanced similarity based visualization with active learning. We have performed extensive experimentation on interactive category search with different image collections. The results using the proposed simulation scheme show that indeed the use of advanced visualization and active learning pays off in all of these datasets.
: Giang P. Nguyen, Marcel Worring, and Arnold W. M. Smeulders. Interactive search by direct manipulation of dissimilarity space. IEEE Transactions on Multimedia, 9(7):1404-1415, November 2007.
[ BibTeX | | abstract ]
In this paper, we argue to learn dissimilarity for interactive search in content based image retrieval. In literature, dissimilarity is often learned via the feature space by feature selection, feature weighting or by adjusting the parameters of a function of the features. Other than existing techniques, we use feedback to adjust the dissimilarity space independent of feature space. This has the great advantage that it manipulates dissimilarity directly. To create a dissimilarity space, we use the method proposed by Pekalska and Duin, selecting a set of images called prototypes and computing distances to those prototypes for all images in the collection. After the user gives feedback, we apply active learning with a one-class support vector machine to decide the movement of images such that relevant images stay close together while irrelevant ones are pushed away (the work of Guo ). The dissimilarity space is then adjusted accordingly. Results on a Corel dataset of 10000 images and a TrecVid collection of 43907 keyframes show that our proposed approach is not only intuitive, it also significantly improves the retrieval performance.
: Frank J. Seinstra, Jan-Mark Geusebroek, Dennis Koelma, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders. High-performance distributed image and video content analysis with parallel-horus. IEEE MultiMedia, 14(4):64-75, October-December 2007.
[ BibTeX | | abstract ]
As the world uses more digital video that requires greater storage space, Grid computing is becoming indispensable for urgent problems in multimedia content analysis. Parallel-Horus, a support tool for applications in multimedia Grid computing, lets users implement multimedia applications as sequential programs for efficient execution on clusters and Grids, based on wide-area multimedia services.
: Cees G. M. Snoek, Bouke Huurnink, Laura Hollink, Maarten de Rijke, Guus Schreiber, and Marcel Worring. Adding semantics to detectors for video retrieval. IEEE Transactions on Multimedia, 9(5):975-986, August 2007.
[ BibTeX | | abstract ]
In this paper, we propose an automatic video retrieval method based on high-level concept detectors. Research in video analysis has reached the point where over 100 concept detectors can be learned in a generic fashion, albeit with mixed performance. Such a set of detectors is very small still compared to ontologies aiming to capture the full vocabulary a user has. We aim to throw a bridge between the two fields by building a multimedia thesaurus, i.e., a set of machine learned concept detectors that is enriched with semantic descriptions and semantic structure obtained from WordNet. Given a multimodal user query, we identify three strategies to select a relevant detector from this thesaurus, namely: text matching, ontology querying, and semantic visual querying. We evaluate the methods against the automatic search task of the TRECVID 2005 video retrieval benchmark, using a news video archive of 85 h in combination with a thesaurus of 363 machine learned concept detectors. We assess the influence of thesaurus size on video search performance, evaluate and compare the multimodal selection strategies for concept detectors, and finally discuss their combined potential using oracle fusion. The set of queries in the TRECVID 2005 corpus is too small for us to be definite in our conclusions, but the results suggest promising new lines of research.
: Marcel Worring and Guus Schreiber. Semantic image and video indexing in broad domains. IEEE Transactions on Multimedia, 9(5):909-911, August 2007.
[ BibTeX | | abstract ]
The six papers in this special section focus on semantic image and video indexing in broad domains. To bring semantics to the user in broad domains both the indexing and retrieval step have to be considered. The papers here address both steps and the relation to ontologies.
: Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders. A learned lexicon-driven paradigm for interactive video retrieval. IEEE Transactions on Multimedia, 9(2):280-292, February 2007.
[ BibTeX | | abstract ]
Effective video retrieval is the result of an interplay between interactive query selection, advanced visualization of results, and a goal-oriented human user. Traditional interactive video retrieval approaches emphasize paradigms, such as query-by-keyword and query-by-example, to aid the user in the search for relevant footage. However, recent results in automatic indexing indicate that query-by-concept is becoming a viable resource for interactive retrieval also. We propose in this paper a new video retrieval paradigm. The core of the paradigm is formed by first detecting a large lexicon of semantic concepts. From there, we combine query-by-concept, query-by-example, query-by-keyword, and user interaction into the MediaMill semantic video search engine. To measure the impact of increasing lexicon size on interactive video retrieval performance, we performed two experiments against the 2004 and 2005 NIST TRECVID benchmarks, using lexicons containing 32 and 101 concepts respectively. The results suggest that from all factors that play a role in interactive retrieval, a large lexicon of semantic concepts matters most. Indeed, by exploiting large lexicons, many video search questions are solvable without using query-by-keyword and query-by-example. What is more, we show that the lexicon-driven search engine outperforms all state-of-the-art video retrieval systems in both TRECVID 2004 and 2005.
: Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, Frank J. Seinstra, and Arnold W. M. Smeulders. The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1678-1689, October 2006.
[ BibTeX | | abstract ]
This paper presents the semantic pathfinder architecture for generic indexing of multimedia archives. The semantic pathfinder extracts semantic concepts from video by exploring different paths through three consecutive analysis steps, which we derive from the observation that produced video is the result of an authoring-driven process. We exploit this authoring metaphor for machine-driven understanding. The pathfinder starts with the content analysis step. In this analysis step, we follow a data-driven approach of indexing semantics. The style analysis step is the second analysis step. Here we tackle the indexing problem by viewing a video from the perspective of production. Finally, in the context analysis step, we view semantics in context. The virtue of the semantic pathfinder is its ability to learn the best path of analysis steps on a per-concept basis. To show the generality of this novel indexing approach we develop detectors for a lexicon of 32 concepts and we evaluate the semantic pathfinder against the 2004 NIST TRECVID video retrieval benchmark, using a news archive of 64 hours. Top ranking performance in the semantic concept detection task indicates the merit of the semantic pathfinder for generic indexing of multimedia archives.
: Cees G. M. Snoek, Marcel Worring, and Alexander G. Hauptmann. Learning rich semantics from news video archives by style analysis. ACM Transactions on Multimedia Computing, Communications and Applications, 2(2):91-108, May 2006.
[ BibTeX | | abstract ]
We propose a generic and robust framework for news video indexing, which we found on a broadcast news production model. We identify within this model four production phases, each providing useful metadata for annotation. In contrast to semi-automatic indexing approaches, which exploit this information at production time, we adhere to an automatic data-driven approach. To that end, we analyze a digital news video using a separate set of multimodal detectors for each production phase. By combining the resulting production-derived features into a statistical classifier ensemble, the framework facilitates robust classification of several rich semantic concepts in news video; rich meaning that concepts share many similarities in their production process. Experiments on an archive of 120 hours of news video, from the 2003 TRECVID benchmark, show that a combined analysis of production phases yields the best results. In addition, we demonstrate that the accuracy of the proposed style analysis framework for classification of several rich semantic concepts is state-of-the-art.
: Laura Hollink, Giang Nguyen, Dennis C. Koelma, Guus Schreiber, and Marcel Worring. Assessing user behaviour in news video retrieval. IEE on Vision, Image and Signal Processing, 152(6):911-918, December 2005.
[ BibTeX | | abstract ]
The results of a study are presented, in which people queried a news archive using an interactive video retrieval system. 242 search sessions by 39 participants on 24 topics were assessed. Before, during and after the study, participants filled in questionnaires about their expectations of a search. The questionnaire data, logged user actions on the system, queries formulated by users, and a quality measure of each search were studied. The results of the study show that topics concerning 'specific' people or objects were better retrieved than topics concerning 'general' objects and scenes. Users were able to estimate the overall quality of a search but did not know when the optimal result was reached within the search process. Analysis of the results at various stages in the retrieval process suggests that retrieval based on transcriptions of the speech in video data adds more to the average precision of the result than content-based image retrieval based on low-level visual features. The latter is particularly useful in providing the user with an overview of the dataset and thus an indication of the success of a search. Based on the results, implications for the design of user interfaces of video retrieval systems are discussed.
: Giang P. Nguyen and Marcel Worring. Relevance feedback based saliency adaptation in CBIR. ACM Springer Multimedia Systems, 10(6):499-512, October 2005.
[ BibTeX | | abstract ]
Content-based image retrieval (CBIR) has been under investigation for a long time with many systems built to meet different application demands. However, in all systems, there is still a gap between the user's expectation and the system's retrieval capabilities. Therefore, user interaction is an essential component of any CBIR system. Interaction up to now has mostly focused on changing global image features or similarities between images. We consider the interaction with salient details in the image i.e. points, lines, and regions. Interactive salient detail definition goes further than summarizing the image into a set of salient details. We aim to dynamically update the user- and context-dependent definition of saliency based on relevance feedback. To that end, we propose an interaction framework for salient details from the perspective of the user. A number of instantiations of the framework are presented. Finally, we apply our approach for query refinement in detail based image retrieval system with salient points and regions. Experimental results prove the effectiveness of adapting the saliency from user feedback in the retrieval process.
: Cees G. M. Snoek and Marcel Worring. Multimedia event-based video indexing using time intervals. IEEE Transactions on Multimedia, 7(4):638-647, August 2005.
[ BibTeX | | abstract ]
We propose the Time Interval Multimedia Event (TIME) framework as a robust approach for classification of semantic events in multimodal video documents. The representation used in TIME extends the Allen time relations and allows for proper inclusion of context and synchronization of the heterogeneous information sources involved in multimodal video analysis. To demonstrate the viability of our approach, it was evaluated on the domains of soccer and news broadcasts. For automatic classification of semantic events, we compare three different machine learning techniques, i.c. C4.5 decision tree, Maximum Entropy, and Support Vector Machine. The results show that semantic video indexing results significantly benefit from using the TIME framework.
: Cees G. M. Snoek and Marcel Worring. Multimodal video indexing: A review of the state-of-the-art. Multimedia Tools and Applications, 25(1):5-35, January 2005.
[ BibTeX | | abstract ]
Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is unfeasible for large video collections. In this paper we survey several methods aiming at automating this time and resource consuming process. Good reviews on single modality based video indexing have appeared in literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in collaborative fashion. Therefore, instead of separately treating the different information sources involved, and their specific algorithms, we focus on the similarities and differences between the modalities. To that end we put forward a unifying and multimodal framework, which views a video document from the perspective of its author. This framework forms the guiding principle for identifying index types, for which automatic methods are found in literature. It furthermore forms the basis for categorizing these different methods.

Conference papers

Show all abstracts | Hide all abstracts

: Mihir Jain, Jan C. van Gemert, Hervé Jégou, Patrick Bouthemy, and Cees G. M. Snoek. Action localization by tubelets from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, Ohio, USA, June 2014.
[ BibTeX | | abstract ]
This paper considers the problem of action localization, where the objective is to determine when and where certain actions appear. We introduce a sampling strategy to produce 2D+t sequences of bounding boxes, called tubelets. Compared to state-of-the-art alternatives, this drastically reduces the number of hypotheses that are likely to include the action of interest. Our method is inspired by a recent technique introduced in the context of image localization. Beyond considering this technique for the first time for videos, we revisit this strategy for 2D+t sequences obtained from super-voxels. Our sampling strategy advantageously exploits a criterion that reflects how action related motion deviates from background motion. We demonstrate the interest of our approach by extensive experiments on two public datasets: UCF Sports and MSR-II. Our approach significantly outperforms the state-of-theart on both datasets, while restricting the search of actions to a fraction of possible bounding box sequences.
: Thomas Mensink, Efstratios Gavves, and Cees G. M. Snoek. Costa: Co-occurrence statistics for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, Ohio, USA, June 2014.
[ BibTeX | | abstract ]
In this paper we aim for zero-shot classification, that is visual recognition of an unseen class by using knowledge transfer from known classes. Our main contribution is COSTA, which exploits co-occurrences of visual concepts in images for knowledge transfer. These inter-dependencies arise naturally between concepts, and are easy to obtain from existing annotations or web-search hit counts. We estimate a classifier for a new label, as a weighted combination of related classes, using the co-occurrences to define the weight. We propose various metrics to leverage these co-occurrences, and a regression model for learning a weight for each related class. We also show that our zero-shot classifiers can serve as priors for few-shot learning. Experiments on three multi-labeled datasets reveal that our proposed zero-shot methods, are approaching and occasionally outperforming fully supervised SVMs. We conclude that co-occurrence statistics suffice for zero-shot classification.
: Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders. Fisher and vlad with flair. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, Ohio, USA, June 2014.
[ BibTeX | | abstract ]
A major computational bottleneck in many current algorithms is the evaluation of arbitrary boxes. Dense local analysis and powerful bag-of-word encodings, such as Fisher vectors and VLAD, lead to improved accuracy at the expense of increased computation time. Where a simplification in the representation is tempting, we exploit novel representations while maintaining accuracy. We start from state-of-the-art, fast selective search, but our method will apply to any initial box-partitioning. By representing the picture as sparse integral images, one per codeword, we achieve a Fast Local Area Independent Representation. FLAIR allows for very fast evaluation of any box encoding and still enables spatial pooling. In FLAIR we achieve exact VLADs difference coding, even with l2 and power-norms. Finally, by multiple codeword assignments, we achieve exact and approximate Fisher vectors with FLAIR. The results are a 18x speedup, which enables us to set a new state-of-the- art on the challenging 2010 PASCAL VOC objects and the fine-grained categorization of the CUB-2011 200 bird species. Plus, we rank number one in the official ImageNet 2013 detection challenge.
: Ran Tao, Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. Locality in generic instance search from one example. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, Ohio, USA, June 2014.
[ BibTeX | | abstract ]
: Julien van Hout, Eric Yeh, Dennis Koelma Cees G. M. Snoek, Chen Sun, Ramakant Nevatia, Julie Wong, and Gregory Myers. Late fusion and calibration for multimedia event detection using few examples. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Florence, Italy, May 2014.
[ BibTeX | | abstract ]
The state-of-the-art in example-based multimedia event detection (MED) rests on heterogeneous classifiers whose scores are typically combined in a late-fusion scheme. Recent studies on this topic have failed to reach a clear consensus as to whether machine learning techniques can outperform rule-based fusion schemes with varying amount of training data. In this paper, we present two parametric approaches to late fusion: a normalization scheme for arithmetic mean fusion (logistic averaging) and a fusion scheme based on logistic regression, and compare them to widely used rule-based fusion schemes. We also describe how logistic regression can be used to calibrate the fused detection scores to predict an optimal threshold given a detection prior and costs on errors. We discuss the advantages and shortcomings of each approach when the amount of positives available for training varies from 10 positives (10Ex) to 100 positives (100Ex). Experiments were run using video data from the NIST TRECVID MED 2013 evaluation and results were reported in terms of a ranking metric: the mean average precision (mAP) and R0, a cost-based metric introduced in TRECVID MED 2013.
: Amirhossein Habibian, Thomas Mensink, and Cees G. M. Snoek. Composite concept discovery for zero-shot video event detection. In Proceedings of the ACM International Conference on Multimedia Retrieval. Glasgow, UK, April 2014.
[ BibTeX | | abstract ]
We consider automated detection of events in video without the use of any visual training examples. A common approach is to represent videos as classification scores obtained from a vocabulary of pre-trained concept classifiers. Where others construct the vocabulary by training individual concept classifiers, we propose to train classifiers for combination of concepts composed by Boolean logic operators. We call these concept combinations composite concepts and contribute an algorithm that automatically discovers them from existing video-level concept annotations. We discover composite concepts by jointly optimizing the accuracy of concept classifiers and their effectiveness for detecting events. We demonstrate that by combining concepts into composite concepts, we can train more accurate classifiers for the concept vocabulary, which leads to improved zero-shot event detection. Moreover, we demonstrate that by using different logic operators, namely ?AND?, ?OR?, we discover different types of composite concepts, which are complementary for zero-shot event detection. We perform a search for 20 events in 41K web videos from two test sets of the challenging TRECVID Multimedia Event Detection 2013 corpus. The experiments demonstrate the superior performance of the discovered composite concepts, compared to present-day alternatives, for zero-shot event detection.
: Amirhossein Habibian and Cees G. M. Snoek. Stop-frame removal improves web video classification. In Proceedings of the ACM International Conference on Multimedia Retrieval. Glasgow, UK, April 2014.
[ BibTeX | | abstract ]
Web videos available in sharing sites like YouTube, are becoming an alternative to manually annotated training data, which are necessary for creating video classifiers. However, when looking into web videos, we observe they contain several irrelevant frames that may randomly appear in any video, i.e., blank and over exposed frames. We call these irrelevant frames stop-frames and propose a simple algorithm to identify and exclude them during classifier training. Stop-frames might appear in any video, so it is hard to recognize their category. Therefore we identify stop-frames as those frames, which are commonly misclassified by any concept classifier. Our experiments demonstrates that using our algorithm improves classification accuracy by 60 in terms of mean average precision for an event and concept detection benchmark.
: Masoud Mazloom, Xirong Li, and Cees G. M. Snoek. Few-example video event retrieval using tag propagation. In Proceedings of the ACM International Conference on Multimedia Retrieval. Glasgow, UK, April 2014.
[ BibTeX | | abstract ]
An emerging topic in multimedia retrieval is to detect a complex event in video using only a handful of video examples. Different from existing work which learns a ranker from positive video examples and hundreds of negative examples, we aim to query web video for events using zero or only a few visual examples. To that end, we propose in this paper a tag-based video retrieval system which propagates tags from a tagged video source to an unlabeled video collection without the need of any training examples. Our algorithm is based on weighted frequency neighbor voting using concept vector similarity. Once tags are propagated to unlabeled video we can rely on off-the-shelf language models to rank these videos by the tag similarity. We study the behavior of our tag-based video event retrieval system by performing three experiments on web videos from the TRECVID multimedia event detection corpus, with zero, one and multiple query examples that beats a recent alternative.
: Chen Sun, Brian Burns, Ram Nevatia, Cees G. M. Snoek, Bob Bolles, Greg Myers, Wen Wang, and Eric Yeh. Isomer: Informative segment observations for multimedia event recounting. In Proceedings of the ACM International Conference on Multimedia Retrieval. Glasgow, UK, April 2014.
[ BibTeX | | abstract ]
This paper describes a system for multimedia event detection and recounting. The goal is to detect a high level event class in unconstrained web videos and generate event oriented summarization for display to users. For this purpose, we detect informative segments and collect observations for them, leading to our ISOMER system. We combine a large collection of both low level and semantic level visual and audio features for event detection. For event recounting, we propose a novel approach to identify event oriented discriminative video segments and their descriptions with a linear SVM event classifier. User friendly concepts including objects, actions, scenes, speech and optical character recognition are used in generating descriptions. We also develop several mapping and filtering strategies to cope with noisy concept detectors. Our system performed competitively in the TRECVID 2013 Multimedia Event Detection task with near 100,000 videos and was the highest performer in TRECVID 2013 Multimedia Event Recounting task.
: Efstratios Gavves, Basura Fernando, Cees G. M. Snoek, Arnold W. M. Smeulders, and Tinne Tuytelaars. Fine-grained categorization by alignments. In Proceedings of the IEEE International Conference on Computer Vision. Sydney, Australia, December 2013.
[ BibTeX | | abstract ]
The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape, since implicit to fine-grained categorization is the existence of a super-class shape shared among all classes. The alignments are then used to transfer part annotations from training images to test images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We furthermore argue that in the distinction of fine-grained sub-categories, classification-oriented encodings like Fisher vectors are better suited for describing localized information than popular matching oriented features like HOG. We evaluate the method on the CU-2011 Birds and Stanford Dogs fine-grained datasets, outperforming the state-of-the-art.
: Zhenyang Li, Efstratios Gavves, Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders. Codemaps segment, classify and search objects locally. In Proceedings of the IEEE International Conference on Computer Vision. Sydney, Australia, December 2013.
[ BibTeX | | abstract ]
In this paper we aim for segmentation and classification of objects. We propose codemaps that are a joint formulation of the classification score and the local neighborhood it belongs to in the image. We obtain the codemap by reordering the encoding, pooling and classification steps over lattice elements. Other than existing linear decompositions who emphasize only the efficiency benefits for localized search, we make three novel contributions. As a preliminary, we provide a theoretical generalization of the sufficient mathematical conditions under which image encodings and classification becomes locally decomposable. As first novelty we introduce l2 normalization for arbitrarily shaped image regions, which is fast enough for semantic segmentation using our Fisher codemaps. Second, using the same lattice across images, we propose kernel pooling which embeds nonlinearities into codemaps for object classification by explicit or approximate feature mappings. Results demonstrate that l2 normalized Fisher codemaps improve the state-of-the-art in semantic segmentation for PASCAL VOC. For object classification the addition of nonlinearities brings us on par with the state-of-the-art, but is 3x faster. Because of the codemaps? inherent efficiency, we can reach significant speed-ups for localized search as well. We exploit the efficiency gain for our third novelty: object segment retrieval using a single query image only.
: Xirong Li and Cees G. M. Snoek. Classifying tag relevance with relevant positive and negative examples. In Proceedings of the ACM International Conference on Multimedia. Barcelona, Spain, October 2013.
[ BibTeX | | abstract ]
Image tag relevance estimation aims to automatically determine what people label about images is factually present in the pictorial content. Different from previous works, which either use only positive examples of a given tag or use positive and random negative examples, we argue the importance of relevant positive and relevant negative examples for tag relevance estimation. We propose a system that selects positive and negative examples, deemed most relevant with respect to the given tag from crowd-annotated images. While applying models for many tags could be cumbersome, our system trains efficient ensembles of Support Vector Machines per tag, enabling fast classification. Experiments on two benchmark sets show that the proposed system compares favorably against five present day methods. Given extracted visual features, for each image our system can process up to 3,787 tags per second. The new system is both effective and efficient for tag relevance estimation.
: Masoud Mazloom, Amirhossein Habibian, and Cees G. M. Snoek. Querying for video events by semantic signatures from few examples. In Proceedings of the ACM International Conference on Multimedia. Barcelona, Spain, October 2013.
[ BibTeX | | abstract ]
We aim to query web video for complex events using only a handful of video query examples, where the standard approach learns a ranker from hundreds of examples. We consider a semantic signature representation, consisting of off-the-shelf concept detectors, to capture the variance in semantic appearance of events. Since it is unknown what similarity metric and query fusion to use in such an event retrieval setting, we perform three experiments on unconstrained web videos from the TRECVID event detection task. It reveals that: retrieval with semantic signatures using normalized correlation as similarity metric outperforms a low-level bag-of-words alternative, multiple queries are best combined using late fusion with an average operator, and event retrieval is preferred over event classication when less than eight positive video examples are available.
: Svetlana Kordumova, Xirong Li, and Cees G. M. Snoek. Evaluating sources and strategies for learning video concepts from social media. In International Workshop on Content-Based Multimedia Indexing. Veszprém, Hungary, June 2013.
[ BibTeX | abstract ]
: Amirhossein Habibian, Koen E. A. van de Sande, and Cees G. M. Snoek. Recommendations for video event recognition using concept vocabularies. In Proceedings of the ACM International Conference on Multimedia Retrieval, pages 89-96. Dallas, Texas, USA, April 2013.
[ BibTeX | | abstract ]
Representing videos using vocabularies composed of concept detectors appears promising for event recognition. While many have recently shown the benefits of concept vocabularies for recognition, the important question what concepts to include in the vocabulary is ignored. In this paper, we study how to create an effective vocabulary for arbitrary event recognition in web video. We consider four research questions related to the number, the type, the specificity and the quality of the detectors in concept vocabularies. A rigorous experimental protocol using a pool of 1,346 concept detectors trained on publicly available annotations, a dataset containing 13,274 web videos from the Multimedia Event Detection benchmark, 25 event groundtruth definitions, and a state-of-the-art event recognition pipeline allow us to analyze the performance of various concept vocabulary definitions. From the analysis we arrive at the recommendation that for effective event recognition the concept vocabulary should i) contain more than 200 concepts, ii) be diverse by covering object, action, scene, people, animal and attribute concepts, iii) include both general and specific concepts, and iv) increase the number of concepts rather than improve the quality of the individual detectors. We consider the recommendations for video event recognition using concept vocabularies the most important contribution of the paper, as they provide guidelines for future work.
: Masoud Mazloom, Efstratios Gavves, Koen E. A. van de Sande, and Cees G. M. Snoek. Searching informative concept banks for video event detection. In Proceedings of the ACM International Conference on Multimedia Retrieval, pages 255-262. Dallas, Texas, USA, April 2013.
[ BibTeX | | abstract ]
An emerging trend in video event detection is to learn an event from a bank of concept detector scores. Different from existing work, which simply relies on a bank containing all available detectors, we propose in this paper an algorithm that learns from examples what concepts in a bank are most informative per event. We model finding this bank of informative concepts out of a large set of concept detectors as a rare event search. Our proposed approximate solution finds the optimal concept bank using a cross-entropy optimization. We study the behavior of video event detection based on a bank of informative concepts by performing three experiments on more than 1,000 hours of arbitrary internet video from the TRECVID multimedia event detection task. Starting from a concept bank of 1,346 detectors we show that 1.) some concept banks are more informative than others for specific events, 2.) event detection using an automatically obtained informative concept bank is more robust than using all available concepts, 3.) even for small amounts of training examples an informative concept bank outperforms a full bank and a bag-of-word event representation, and 4.) we show qualitatively that the informative concept banks make sense for the events of interest, without being programmed to do so. We conclude that for concept banks it pays to be informative.
: Davide Modolo and Cees G. M. Snoek. Can object detectors aid internet video event retrieval?. In Proceedings of the IS&T/SPIE Symposium on Electronic Imaging. San Francisco, CA, USA, February 2013.
[ BibTeX | | abstract ]
The problem of event representation for automatic event detection in Internet videos is acquiring an increasing importance, due to their applicability to a large number of applications. Existing methods focus on representing events in terms of either low-level descriptors or domain-speci c models suited for a limited class of video only, ignoring the high-level meaning of the events. Ultimately aiming for a more robust and meaningful representation, in this paper we question whether object detectors can aid video event retrieval. We propose an experimental study that investigates the utility of present-day local and global object detectors for video event search. By evaluating object detectors optimized for high-quality photographs on low-quality Internet video, we establish that present-day detectors can successfully be used for recognizing objects in web videos. We use an object-based representation to re-rank the results of an appearance-based event detector. Results on the challenging TRECVID multimedia event detection corpus demonstrate that objects can indeed aid event retrieval. While much remains to be studied, we believe that our experimental study is a rst step towards revealing the potential of object-based event representations.
: Efstratios Gavves, Cees G. M. Snoek, and Arnold W. M. Smeulders. Convex reduction of high-dimensional kernels for visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Providence, Rhode Island, USA, June 2012.
[ BibTeX | | abstract ]
Limiting factors of fast and effective classifiers for large sets of images are their dependence on the number of images analyzed and the dimensionality of the image representation. Considering the growing number of images as a given, we aim to reduce the image feature dimensionality in this paper. We propose reduced linear kernels that use only a portion of the dimensions to reconstruct a linear kernel. We formulate the search for these dimensions as a convex optimization problem, which can be solved efficiently. Different from existing kernel reduction methods, our reduced kernels are faster and maintain the accuracy benefits from non-linear embedding methods that mimic non-linear SVMs. We show these properties on both the Scenes and PASCAL VOC 2007 datasets. In addition, we demonstrate how our reduced kernels allow to compress Fisher vector for use with non-linear embeddings, leading to high accuracy. What is more, without using any labeled examples the selected and weighed kernel dimensions appear to correspond to visually meaningful patches in the images.
: Xirong Li, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders. Fusing concept detection and geo context for visual search. In Proceedings of the ACM International Conference on Multimedia Retrieval. Hong Kong, China, June 2012. Best paper runner-up.
[ BibTeX | | abstract ]
Given the proliferation of geo-tagged images, the question of how to exploit geo tags and the underlying geo context for visual search is emerging. Based on the observation that the importance of geo context varies over concepts, we propose a concept-based image search engine which fuses visual concept detection and geo context in a concept-dependent manner. Compared to individual content-based and geo-based concept detectors and their uniform combination, concept-dependent fusion shows improvements. Moreover, since the proposed search engine is trained on social-tagged images alone without the need of human interaction, it is flexible to cope with many concepts. Search experiments on 101 popular visual concepts justify the viability of the proposed solution. In particular, for 79 out of the 101 concepts, the learned weights yield improvements over the uniform weights, with a relative gain of at least 5% in terms of average precision.
: Daan T. J. Vreeswijk, Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders. All vehicles are cars: Subclass preferences in container concepts. In Proceedings of the ACM International Conference on Multimedia Retrieval. Hong Kong, China, June 2012.
[ BibTeX | | abstract ]
This paper investigates the natural bias humans display when labeling images with a container label like vehicle or carnivore. Using three container concepts as subtree root nodes, and all available concepts between these roots and the images from the ImageNet Large Scale Visual Recogni- tion Challenge (ILSVRC) dataset, we analyze the differences between the images labeled at these varying levels of abstraction and the union of their constituting leaf nodes. We find that for many container concepts, a strong preference for one or a few different constituting leaf nodes occurs. These results indicate that care is needed when using hierarchical knowledge in image classification: if the aim is to classify vehicles the way humans do, then cars and buses may be the only correct results.
: Bauke Freiburg, Jaap Kamps, and Cees G. M. Snoek. Crowdsourcing visual detectors for video search. In Proceedings of the ACM International Conference on Multimedia. Scottsdale, AZ, USA, December 2011.
[ BibTeX | | abstract ]
In this paper, we study social tagging at the video fragment-level using a combination of automated content understanding and the wisdom of the crowds. We are interested in the question whether crowdsourcing can be beneficial to a video search engine that automatically recognizes video fragments on a semantic level. To answer this question, we perform a 3-month online field study with a concert video search engine targeted at a dedicated user-community of pop concert enthusiasts. We harvest the feedback of more than 500 active users and perform two experiments. In experiment 1 we measure user incentive to provide feedback, in experiment 2 we determine the tradeoff between feedback quality and quantity when aggregated over multiple users. Results show that users provide sufficient feedback, which becomes highly reliable when a crowd agreement of 67% is enforced.
: Xirong Li, Efstratios Gavves, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders. Personalizing automated image annotation using cross-entropy. In Proceedings of the ACM International Conference on Multimedia. Scottsdale, AZ, USA, December 2011.
[ BibTeX | | abstract ]
Annotating the increasing amounts of user-contributed images in a personalized manner is in great demand. However, this demand is largely ignored by the mainstream of automated image annotation research. In this paper we aim for personalizing automated image annotation by jointly exploiting personalized tag statistics and content-based image annotation. We propose a cross-entropy based learning algorithm which personalizes a generic annotation model by learning from a user�s multimedia tagging history. Using cross-entropy-minimization basedMonte Carlo sampling, the proposed algorithm optimizes the personalization process in terms of a performance measurement which can be flexibly chosen. Automatic image annotation experiments with 5,315 realistic users in the social web show that the proposed method compares favorably to a generic image annotation method and a method using personalized tag statistics only. For 4,442 users the performance improves, where for 1,088 users the absolute performance gain is at least 0.05 in terms of average precision. The results show the value of the proposed method.
: Xirong Li, Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders. Social negative bootstrapping for visual categorization. In Proceedings of the ACM International Conference on Multimedia Retrieval. Trento, Italy, April 2011.
[ BibTeX | | abstract ]
To learn classifiers for many visual categories, obtaining labeled training examples in an efficient way is crucial. Since a classifier tends to misclassify negative examples which are visually similar to positive examples, inclusion of such informative negatives should be stressed in the learning process. However, they are unlikely to be hit by random sampling, the de facto standard in literature. In this paper, we go beyond random sampling by introducing a novel social negative bootstrapping approach. Given a visual category and a few positive examples, the proposed approach adaptively and iteratively harvests informative negatives from a large amount of social-tagged images. To label negative examples without human interaction, we design an effective virtual labeling procedure based on simple tag reasoning. Virtual labeling, in combination with adaptive sampling, enables us to select the most misclassified negatives as the informative samples. Learning from the positive set and the informative negative sets results in visual classifiers with higher accuracy. Experiments on two present-day image benchmarks employing 650K virtually labeled negative examples show the viability of the proposed approach. On a popular visual categorization benchmark our precision at 20 increases by 34%, compared to baselines trained on randomly sampled negatives. We achieve more accurate visual categorization without the need of manually labeling any negatives.
: Wolfgang Hürst, Cees G. M. Snoek, Willem-Jan Spoel, and Mate Tomin. Size matters! how thumbnail number, size, and motion influence mobile video retrieval. In International Conference on MultiMedia Modeling. Taipei, Taiwan, January 2011.
[ BibTeX | | abstract ]
Various interfaces for video browsing and retrieval have been proposed that provide improved usability, better retrieval performance, and richer user experience compared to simple result lists that are just sorted by relevance. These browsing interfaces take advantage of the rather large screen estate on desktop and laptop PCs to visualize advanced configurations of thumbnails summarizing the video content. Naturally, the usefulness of such screen-intensive visual browsers can be called into question when applied on small mobile handheld devices, such as smart phones. In this paper, we address the usefulness of thumbnail images for mobile video retrieval interfaces. In particular, we investigate how thumbnail number, size, and motion influence the performance of humans in common recognition tasks. Contrary to widespread believe that screens of handheld devices are unsuited for visualizing multiple (small) thumbnails simultaneously, our study shows that users are quite able to handle and assess multiple small thumbnails at the same time, especially when they show moving images. Our results give suggestions for appropriate video retrieval interface designs on handheld devices.
: Efstratios Gavves and Cees G. M. Snoek. Landmark image retrieval using visual synonyms. In Proceedings of the ACM International Conference on Multimedia. Firenze, Italy, October 2010.
[ BibTeX | | abstract ]
In this paper, we consider the incoherence problem of the visual words in bag-of-words vocabularies. Different from existing work, which performs assignment of words based solely on closeness in descriptor space, we focus on identifying pairs of independent, distant words - the visual synonyms - that are still likely to host image patches with similar appearance. To study this problem, we focus on landmark images, where we can examine whether image geometry is an appropriate vehicle for detecting visual synonyms. We propose an algorithm for the extraction of visual synonyms in landmark images. To show the merit of visual synonyms, we perform two experiments. We examine closeness of synonyms in descriptor space and we show a first application of visual synonyms in a landmark image retrieval setting. Using visual synonyms, we perform on par with the state-of-the-art, but with six times less visual words.
: Wolfgang Hürst, Cees G. M. Snoek, Willem-Jan Spoel, and Mate Tomin. Keep moving! revisiting thumbnails for mobile video retrieval. In Proceedings of the ACM International Conference on Multimedia. Firenze, Italy, October 2010.
[ BibTeX | | abstract ]
Motivated by the increasing popularity of video on handheld devices and the resulting importance for effective video retrieval, this paper revisits the relevance of thumbnails in a mobile video retrieval setting. Our study indicates that users are quite able to handle and assess small thumbnails on a mobile's screen - especially with moving images - suggesting promising avenues for future research in design of mobile video retrieval interfaces.
: Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Accelerating visual categorization with the GPU. In ECCV Workshop on Computer Vision on GPU. Crete, Greece, September 2010.
[ BibTeX | abstract ]
: Bouke Huurnink, Cees G. M. Snoek, Maarten de Rijke, and Arnold W. M. Smeulders. Today's and tomorrow's retrieval practice in the audiovisual archive. In Proceedings of the ACM International Conference on Image and Video Retrieval, pages 18-25. Xi'an, China, July 2010.
[ BibTeX | | abstract ]
Content-based video retrieval is maturing to the point where it can be used in real-world retrieval practices. One such practice is the audiovisual archive, whose users increasingly require fine-grained access to broadcast television content. We investigate to what extent content-based video retrieval methods can improve search in the audiovisual archive. In particular, we propose an evaluation methodology tailored to the specific needs and circumstances of the audiovisual archive, which are typically missed by existing evaluation initiatives. We utilize logged searches and content purchases from an existing audiovisual archive to create realistic query sets and relevance judgments. To reflect the retrieval practice of both the archive and the video retrieval community as closely as possible, our experiments with three video search engines incorporate archive-created catalog entries as well as state-of-the-art multimedia content analysis results. We find that incorporating content-based video retrieval into the archive�s practice results in significant performance increases for shot retrieval and for retrieving entire television programs. Our experiments also indicate that individual content-based retrieval methods yield approximately equal performance gains. We conclude that the time has come for audiovisual archives to start accommodating content-based video retrieval methods into their daily practice.
: Xirong Li and Cees G. M. Snoek. Visual categorization with negative examples for free. In Proceedings of the ACM International Conference on Multimedia. Beijing, China, October 2009.
[ BibTeX | | abstract ]
Automatic visual categorization is critically dependent on labeled examples for supervised learning. As an alternative to traditional expert labeling, social-tagged multimedia is becoming a novel yet subjective and inaccurate source of learning examples. Different from existing work focusing on collecting positive examples, we study in this paper the potential of substituting social tagging for expert labeling for creating negative examples. We present an empirical study using 6.5 million Flickr photos as a source of social tagging. Our experiments on the PASCAL VOC challenge 2008 show that with a relative loss of only 4.3% in terms of mean average precision, expert-labeled negative examples can be completely replaced by social-tagged negative examples for consumer photo categorization.
: Arjan T. Setz and Cees G. M. Snoek. Can social tagged images aid concept-based video search?. In Proceedings of the IEEE International Conference on Multimedia & Expo, pages 1460-1463. June-July 2009. Invited paper.
[ BibTeX | | abstract ]
This paper seeks to unravel whether commonly available social tagged images can be exploited as a training resource for concept-based video search. Since social tags are known to be ambiguous, overly personalized, and often error prone, we place special emphasis on the role of disambiguation. We present a systematic experimental study that evaluates concept detectors based on social tagged images, and their disambiguated versions, in three application scenarios: within-domain, cross-domain, and together with an interacting user. The results indicate that social tagged images can aid concept-based video search indeed, especially after disambiguation and when used in an interactive video retrieval setting. These results open-up interesting avenues for future research.
: Xirong Li, Cees G. M. Snoek, and Marcel Worring. Annotating images by harnessing worldwide user-tagged photos. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Taipei, Taiwan, April 2009. Invited paper.
[ BibTeX | | abstract ]
Automatic image tagging is important yet challenging due to the semantic gap and the lack of learning examples to model a tag's visual diversity. Meanwhile, social user tagging is creating rich multimedia content on the web. In this paper, we propose to combine the two tagging approaches in a search-based framework. For an unlabeled image, we first retrieve its visual neighbors from a large user-tagged image database. We then select relevant tags from the result images to annotate the unlabeled image. To tackle the unreliability and sparsity of user tagging, we introduce a joint-modality tag relevance estimation method which efficiently addresses both textual and visual clues. Experiments on 1.5 million Flickr photos and 10 000 Corel images verify the proposed method.
: Daragh Byrne, Aiden R. Doherty, Cees G. M. Snoek, Gareth J. F. Jones, and Alan F. Smeaton. Validating the detection of everyday concepts in visual lifelogs. In Proceedings of the International Conference on Semantic and Digital Media Technologies, SAMT 2008, Koblenz, Germany, December 3-5, 2008, LNCS, pages 15-30. Springer-Verlag, Berlin, Germany, December 2008.
[ BibTeX | | abstract ]
The Microsoft SenseCam is a small lightweight wearable camera used to passively capture photos and other sensor readings from a user's day-to-day activities. It can capture up to 3,000 images per day, equating to almost 1 million images per year. It is used to aid memory by creating a personal multimedia lifelog, or visual recording of the wearer's life. However the sheer volume of image data captured within a visual lifelog creates a number of challenges, particularly for locating relevant content. Within this work, we explore the applicability of semantic concept detection, a method often used within video retrieval, on the novel domain of visual lifelogs. A concept detector models the correspondence between low-level visual features and high-level semantic concepts (such as indoors, outdoors, people, buildings, etc.) using supervised machine learning. By doing so it determines the probability of a concept's presence. We apply detection of 27 everyday semantic concepts on a lifelog collection composed of 257,518 SenseCam images from 5 users. The results were then evaluated on a subset of 95,907 images, to determine the precision for detection of each semantic concept and to draw some interesting inferences on the lifestyles of those 5 users. We additionally present future applications of concept detection within the domain of lifelogging.
: Xirong Li, Cees G. M. Snoek, and Marcel Worring. Learning tag relevance by neighbor voting for social image retrieval. In Proceedings of the ACM International Conference on Multimedia Information Retrieval, pages 180-187. Vancouver, Canada, October 2008.
[ BibTeX | | abstract ]
Social image retrieval is important for exploiting the increasing amounts of amateur-tagged multimedia such as Flickr images. Since amateur tagging is known to be uncontrolled, ambiguous, and personalized, a fundamental problem is how to reliably interpret the relevance of a tag with respect to the visual content it is describing. Intuitively, if different persons label similar images using the same tags, these tags are likely to reflect objective aspects of the visual content. Starting from this intuition, we propose a novel algorithm that scalably and reliably learns tag relevance by accumulating votes from visually similar neighbors. Further, treated as tag frequency, learned tag relevance is seamlessly embedded into current tag-based social image retrieval paradigms. Preliminary experiments on one million Flickr images demonstrate the potential of the proposed algorithm. Overall comparisons for both single-word queries and multiple-word queries show substantial improvement over the baseline by learning and using tag relevance. Specifically, compared with the baseline using the original tags, on average, retrieval using improved tags increases mean average precision by 24%, from 0.54 to 0.67. Moreover, simulated experiments indicate that performance can be improved further by scaling up the amount of images used in the proposed neighbor voting algorithm.
: Ork de Rooij, Cees G. M. Snoek, and Marcel Worring. Balancing thread based navigation for targeted video search. In Proceedings of the ACM International Conference on Image and Video Retrieval, pages 485-494. Niagara Falls, Canada, July 2008.
[ BibTeX | | abstract ]
Various query methods for video search exist. Because of the semantic gap each method has its limitations. We argue that for effective retrieval query methods need to be combined at retrieval time. However, switching query methods often involves a change in query and browsing interface, which puts a heavy burden on the user. In this paper, we propose a novel method for fast and effective search trough large video collections by embedding multiple query methods into a single browsing environment. To that end we introduced the notion of query threads, which contain a shot-based ranking of the video collection according to some feature-based similarity measure. On top of these threads we define several thread-based visualizations, ranging from fast targeted search to very broad exploratory search, with the ForkBrowser as the balance between fast search and video space exploration. We compare the effectiveness and efficiency of the ForkBrowser with the CrossBrowser on the TRECVID 2007 interactive search task. Results show that different query methods are needed for different types of search topics, and that the ForkBrowser requires signifficantly less user interactions to achieve the same result as the CrossBrowser. In addition, both browsers rank among the best interactive retrieval systems currently available.
: Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. A comparison of color features for visual concept classification. In Proceedings of the ACM International Conference on Image and Video Retrieval, pages 141-149. Niagara Falls, Canada, July 2008.
[ BibTeX | | abstract ]
Concept classification is important to access visual information on the level of objects and scene types. So far, intensity-based features have been widely used. To increase discriminative power, color features have been proposed only recently. As many features exist, a structured overview is required of color features in the context of concept classification. Therefore, this paper studies 1. the invariance properties and 2. the distinctiveness of color features in a structured way. The invariance properties of color features with respect to photometric changes are summarized. The distinctiveness of color features is assessed experimentally using an image and a video benchmark: the PASCAL VOC Challenge 2007 and the Mediamill Challenge. Because color features cannot be studied independently from the points at which they are extracted, different point sampling strategies based on Harris-Laplace salient points, dense sampling and the spatial pyramid are also studied. From the experimental results, it can be derived that invariance to light intensity changes and light color changes affects concept classification. The results reveal further that the usefulness of invariance is concept-specific.
: Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Evaluation of color descriptors for object and scene recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, Alaska, June 2008.
[ BibTeX | | abstract ]
Image category recognition is important to access visual information on the level of objects and scene types. So far, intensity-based descriptors have been widely used. To increase illumination invariance and discriminative power, color descriptors have been proposed only recently. As many descriptors exist, a structured overview of color invariant descriptors in the context of image category recognition is required. Therefore, this paper studies the invariance properties and the distinctiveness of color descriptors in a structured way. The invariance properties of color descriptors are shown analytically using a taxonomy based on invariance properties with respect to photometric transformations. The distinctiveness of color descriptors is assessed experimentally using two benchmarks from the image domain and the video domain. From the theoretical and experimental results, it can be derived that invariance to light intensity changes and light color changes affects category recognition. The results reveal further that, for light intensity changes, the usefulness of invariance is category-specific.
: Koen E. A. van de Sande, Theo Gevers, and Cees G. M. Snoek. Color descriptors for object category recognition. In Proceedings of the IS&T European Conference on Colour in Graphics, Imaging, and Vision. Terrassa-Barcelona, Spain, June 2008.
[ BibTeX | | abstract ]
Category recognition is important to access visual information on the level of objects. A common approach is to compute image descriptors first and then to apply machine learning to achieve category recognition from annotated examples. As a consequence,the choice of image descriptors is of great influence on the recognition accuracy. So far, intensity-based (e.g. SIFT) descriptors computed at salient points have been used. However, color has been largely ignored. The question is, can color information improve accuracy of category recognition? Therefore, in this paper, we will extend both salient point detection and region description with color information. The extension of color descriptors is integrated into the framework of category recognition enabling to select both intensity and color variants. Our experiments on an image benchmark show that category recognition benefits from the use of color. Moreover, the combination of intensity and color descriptors yields a 30% improvement over intensity features alone.
: Ork de Rooij, Cees G. M. Snoek, and Marcel Worring. Query on demand video browsing. In Proceedings of the ACM International Conference on Multimedia, pages 811-814. Augsburg, Germany, September 2007.
[ BibTeX | | abstract ]
This paper describes a novel method for browsing a large collection of news video by linking various forms of related video fragments together as threads. Each thread contains a sequence of shots with high feature-based similarity. Two interfaces are designed which use threads as the basis for browsing. One interface shows a minimal set of threads, and the other as many as possible. Both interfaces are evaluated in the TRECVID interactive retrieval task, where they ranked among the best interactive retrieval systems currently available. The results indicate that the use of threads in interactive video search is very beneficial. We have found that in general the query result and the timeline are the most important threads. However, having several additional threads allow a user to find unique results which cannot easily be found by using query results and time alone.
: Arnold W. M. Smeulders, Jan C. van Gemert, Bouke Huurnink, Dennis C. Koelma, Ork de Rooij, Koen E. A. van de Sande, Cees G. M. Snoek, Cor J. Veenman, and Marcel Worring. Semantic video search. In International Conference on Image Analysis and Processing. Modena, Italy, September 2007.
[ BibTeX | | abstract ]
In this paper we describe the current performance of our MediaMill system as presented in the TRECVID 2006 benchmark for video search engines. The MediaMill team participated in two tasks: concept detection and search. For concept detection we use the MediaMill Challenge as experimental platform. The MediaMill Challenge divides the generic video indexing problem into a visual-only, textual-only, early fusion, late fusion, and combined analysis experiment. We provide a baseline implementation for each experiment together with baseline results. We extract image features, on global, regional, and keypoint level, which we combine with various supervised learners. A late fusion approach of visual-only analysis methods using geometric mean was our most successful run. With this run we conquer the Challenge baseline by more than 50%. Our concept detection experiments have resulted in the best score for three concepts: i.e. desert, flag us, and charts. What is more, using LSCOM annotations, our visual-only approach generalizes well to a set of 491 concept detectors. To handle such a large thesaurus in retrieval, an engine is developed which allows users to select relevant concept detectors based on interactive browsing using advanced visualizations. Similar to previous years our best interactive search runs yield top performance, ranking 2nd and 6th overall.
: Cees G. M. Snoek, Marcel Worring, Arnold W. M. Smeulders, and Bauke Freiburg. The role of visual content and style for concert video indexing. In Proceedings of the IEEE International Conference on Multimedia & Expo, pages 252-255. Beijing, China, July 2007.
[ BibTeX | | abstract ]
This paper contributes to the automatic indexing of concert video. In contrast to traditional methods, which rely primarily on audio information for summarization applications, we explore how a visual-only concept detection approach could be employed. We investigate how our recent method for news video indexing - which takes into account the role of content and style - generalizes to the concert domain. We analyze concert video on three levels of visual abstraction, namely: content, style, and their fusion. Experiments with 12 concept detectors, on 45 hours of visually challenging concert video, show that the automatically learned best approach is concept-dependent. Moreover, these results suggest that the visual modality provides ample opportunity for more effective indexing and retrieval of concert video when used in addition to the auditory modality.
: Cees G. M. Snoek and Marcel Worring. Are concept detector lexicons effective for video search?. In Proceedings of the IEEE International Conference on Multimedia & Expo, pages 1966-1969. Beijing, China, July 2007.
[ BibTeX | | abstract ]
Until now, systematic studies on the effectiveness of concept detectors for video search have been carried out using less than 20 detectors, or in combination with other retrieval techniques. We investigate whether video search using just large concept detector lexicons is a viable alternative for present day approaches. We demonstrate that increasing the number of concept detectors in a lexicon yields improved video retrieval performance indeed. In addition, we show that combining concept detectors at query time has the potential to boost performance further. We obtain the experimental evidence on the automatic video search task of TRECVID 2005 using 363 machine learned concept detectors.
: Marcel Worring, Cees G. M. Snoek, Ork de Rooij, Giang P. Nguyen, and Arnold W. M. Smeulders. The MediaMill semantic video search engine. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages -. Honolulu, Hawaii, USA, April 2007. Invited paper.
[ BibTeX | | abstract ]
In this paper we present the methods underlying the MediaMill semantic video search engine. The basis for the engine is a semantic indexing process which is currently based on a lexicon of 491 concept detectors. To support the user in navigating the collection, the system defines a visual similarity space, a semantic similarity space, a semantic thread space, and browsers to explore them. We compare the different browsers and their utility within the TRECVID benchmark. In 2005, We obtained a top-3 result for 19 out of 24 search topics. In 2006 for 14 out of 24.
: Giang P. Nguyen, Marcel Worring, and Arnold W. M. Smeulders. Similarity learning via dissimilarity space in CBIR. In Proceedings of the ACM SIGMM International Workshop on Multimedia Information Retrieval, pages 107-116. Santa Barbara, USA, October 2006.
[ BibTeX | | abstract ]
In this paper, we introduce a new approach to learn dissimilarity for interactive search in content based image retrieval. In literature, dissimilarity is often learned via the feature space by feature selection, feature weighting or a parameterized function of the features. Different from existing techniques, we use relevance feedback to adjust dissimilarity in a dissimilarity space. To create a dissimilarity space, we use Pekalska�s method [15]. After the user gives feedback, we apply active learning with one-class SVM on this space. Results on a Corel dataset of 10000 images and a TrecVid collection of 43907 keyframes show that our proposed approach can improve the retrieval performance over the feature space based approach.
: Cees G. M. Snoek, Marcel Worring, Jan C. van Gemert, Jan-Mark Geusebroek, and Arnold W. M. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the ACM International Conference on Multimedia, pages 421-430. Santa Barbara, USA, October 2006.
[ BibTeX | | abstract ]
We introduce the challenge problem for generic video indexing to gain insight in intermediate steps that affect performance of multimedia analysis methods, while at the same time fostering repeatability of experiments. To arrive at a challenge problem, we provide a general scheme for the systematic examination of automated concept detection methods, by decomposing the generic video indexing problem into 2 unimodal analysis experiments, 2 multimodal analysis experiments, and 1 combined analysis experiment. For each experiment, we evaluate generic video indexing performance on 85 hours of international broadcast news data, from the TRECVID 2005/2006 benchmark, using a lexicon of 101 semantic concepts. By establishing a minimum performance on each experiment, the challenge problem allows for component-based optimization of the generic indexing issue, while simultaneously offering other researchers a reference for comparison during indexing methodology development. To stimulate further investigations in intermediate analysis steps that influence video indexing performance, the challenge offers to the research community a manually annotated concept lexicon, pre-computed low-level multimedia features, trained classifier models, and five experiments together with baseline performance, which are all available at http://www.mediamill.nl/challenge/.
: Jan C. van Gemert, Cees G. M. Snoek, Cor Veenman, and Arnold W. M. Smeulders. The influence of cross-validation on video classification performance. In Proceedings of the ACM International Conference on Multimedia, pages 695-698. Santa Barbara, USA, October 2006.
[ BibTeX | | abstract ]
Digital video is sequential in nature. When video data is used in a semantic concept classification task, the episodes are usually summarized with shots. The shots are annotated as containing, or not containing, a certain concept resulting in a labeled dataset. These labeled shots can subsequently be used by supervised learning methods (classifiers) where they are trained to predict the absence or presence of the concept in unseen shots and episodes. The performance of such automatic classification systems is usually estimated with cross-validation. By taking random samples from the dataset for training and testing as such, part of the shots from an episode are in the training set and another part from the same episode is in the test set. Accordingly, data dependence between training and test set is introduced, resulting in too optimistic performance estimates. In this paper, we experimentally show this bias, and propose how this bias can be prevented using "episode-constrained" cross-validation. Moreover, we show that a 15% higher classifier performance can be achieved by using episode constrained cross-validation for classifier parameter tuning.
: Jan-Mark Geusebroek. Compact object descriptors from local colour invariant histograms. In British Machine Vision Conference. Edinburgh, UK, September 2006.
[ BibTeX | | abstract ]
Much emphasis has recently been placed on the detection and recognition of locally (weak) affine invariant region descriptors for object recognition. In this paper, we take recognition one step further by developing features for non-planar objects. We consider the description of objects with locally smoothly varying surface. For this class of objects, colour invariant histogram matching has proven to be very encouraging. However, matching many local colour cubes is computationally demanding. We propose a compact colour descriptor, which we call Wiccest, requiring only 12 numbers to locally capture colour and texture information. The Wiccest features are shown to be fairly insensitive to photometric effects like shadow, shading, and illumination colour. Moreover, we demonstrate the features to be applicable to highly compressed images while retaining discriminative power.
: Marcel Worring, Cees G. M. Snoek, Ork de Rooij, Giang P. Nguyen, and Dennis C. Koelma. Lexicon-based browsers for searching in news video archives. In Proceedings of the International Conference on Pattern Recognition, pages 1256-1259. Hong Kong, China, August 2006.
[ BibTeX | | abstract ]
In this paper we present the methods and visualizations used in the MediaMill video search engine. The basis for the engine is a semantic indexing process which derives a lexicon of 101 concepts. To support the user in navigating the collection, the system defines a visual similarity space, a semantic similarity space, a semantic thread space, and browsers to explore them. The search system is evaluated within the TRECVID benchmark. We obtain a top-3 result for 19 out of 24 search topics. In addition, we obtain the highest mean average precision of all search participants.
: Cees G. M. Snoek, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders. Learned lexicon-driven interactive video retrieval. In H. Sundaram et al., editors, Proceedings of the International Conference on Image and Video Retrieval, CIVR 2006, Tempe, Arizona, July 13-15, 2006, volume 4071 of LNCS, pages 11-20. Springer-Verlag, Heidelberg, Germany, July 2006.
[ BibTeX | | abstract ]
We combine in this paper automatic learning of a large lexicon of semantic concepts with traditional video retrieval methods into a novel approach to narrow the semantic gap. The core of the proposed solution is formed by the automatic detection of an unprecedented lexicon of 101 concepts. From there, we explore the combination of query-by-concept, query-by-example, query-by-keyword, and user interaction into the MediaMill semantic video search engine. We evaluate the search engine against the 2005 NIST TRECVID video retrieval benchmark, using an international broadcast news archive of 85 hours. Top ranking results show that the lexicon-driven search engine is highly effective for interactive video retrieval.
: Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, Frank J. Seinstra, and Arnold W. M. Smeulders. The semantic pathfinder for generic news video indexing. In Proceedings of the IEEE International Conference on Multimedia & Expo. Toronto, Canada, July 2006.
[ BibTeX | | abstract ]
This paper presents the semantic pathfinder architecture for generic indexing of video archives. The pathfinder automatically extracts semantic concepts from video based on the exploration of different paths through three consecutive analysis steps, closely linked to the video production process, namely: content analysis, style analysis, and context analysis. The virtue of the semantic pathfinder is its learned ability to find a best path of analysis steps on a per-concept basis. To show the generality of this indexing approach we develop detectors for a lexicon of 32 concepts and we evaluate the semantic pathfinder against the 2004 NIST TRECVID video retrieval benchmark, using a news archive of 64 hours. Top ranking performance indicates the merit of the semantic pathfinder.
: Jan C. van Gemert, Jan-Mark Geusebroek, Cor J. Veenman, Cees G. M. Snoek, and Arnold W. M. Smeulders. Robust scene categorization by learning image statistics in context. In Int'l Workshop on Semantic Learning Applications in Multimedia, in conjunction with CVPR'06. New York, USA, June 2006.
[ BibTeX | | abstract ]
We present a generic and robust approach for scene categorization. A complex scene is described by proto-concepts like vegetation, water, fire, sky etc. These proto-concepts are represented by low level features, where we use natural images statistics to compactly represent color invariant texture information by a Weibull distribution. We introduce the notion of contextures which preserve the context of textures in a visual scene with an occurrence histogram (context) of similarities to proto-concept descriptors (texture). In contrast to a codebook approach, we use the similarity to all vocabulary elements to generalize beyond the code words. Visual descriptors are attained by combining different types of contexts with different texture parameters. The visual scene descriptors are generalized to visual categories by training a support vector machine. We evaluate our approach on 3 different datasets: 1) 50 categories for the TRECVID video dataset; 2) the Caltech 101-object images; 3) 89 categories being the intersection of the Corel photo stock with the Art Explosion photo stock. Results show that our approach is robust over different datasets, while maintaining competitive performance.
: Arnold W. M. Smeulders, Jan van Gemert, Jan-Mark Geusebroek, Cees Snoek, and Marcel Worring. Browsing for the national dutch video archive. In ISCCSP2006. Marrakech, Morocco, March 2006.
[ BibTeX | | abstract ]
Pictures have always been a prime carrier of Dutch culture. But pictures take a new form. We live in times of broad- and narrowcasting through Internet, of passive and active viewers, of direct or delayed broadcast, and of digital pictures being delivered in the museum or at home. At the same time, the picture and television archives turn digital. Archives are going to be swamped with information requests unless they swiftly adapt to partially automatic annotation and digital retrieval. Our aim is to provide faster and more complete access to picture archives by digital analysis. Our approach consists of a multi-media analysis of features of pictures in tandem with the language that describes those pictures, under the guidance of a visual ontology. The general scientific paradigm we address is the detection of directly observables fused into semantic features learned from large repositories of digital video. We use invariant, natural-image statisticsbased contextual feature sets for capturing the concepts of images and integrate that as early as possible with text. The system consists of a large for science yet small for practice set of visual concepts permitting the retrieval of semantically formulated queries. We will demonstrate a PC-based, off-line trained state of the art system for browsing broadcast news-archives.
: Giang P. Nguyen and Marcel Worring. Scenario optimization for interactive category search. In Proceedings of the ACM SIGMM International Workshop on Multimedia Information Retrieval. Singapore, November 2005.
[ BibTeX | | abstract ]
Most of the existing work in interactive content based retrieval concentrates on machine learning methods for effective use of relevance feedback. On the other end of the spectrum, the information visualization community focusses on effective methods for conveying information to the user. What lacks is research considering the information visualization and interactive content based retrieval as truly integrated parts of one search system. In such an integrated system there are many degrees of freedom like the number of images to display, the image size, different visualization modes, and possible feedback modes. To find optimal values for all of those using user studies is unfeasible. We therefore develop scenarios in which tasks and user actions are simulated. These are then optimized based on objective constraints and evaluation criteria. In such a manner the degrees of freedom are reduced and the remaining degrees can be evaluated in user studies. In this paper we present a system which integrates advanced similarity based visualization with active learning. We have performed extensive scenario based experimentation on an interactive category search task. The results show that indeed the use of advanced visualization and active learning pays off.
: Laura Hollink, Marcel Worring, and Guus Schreiber. Building a visual ontology for video retrieval. In Proceedings of the ACM International Conference on Multimedia, pages 479-482. Singapore, November 2005.
[ BibTeX | | abstract ]
To ensure access to growing video collections, annotation is becoming more and more important. Using background knowledge in the form of ontologies or thesauri is a way to facilitate annotation in a broad domain. Current ontologies are not suitable for (semi-) automatic annotation of visual resources as they contain little visual information about the concepts they describe. We investigate how an ontology that does contain visual information can facilitate annotation in a broad domain and identify requirements that a visual ontology has to meet. Based on these requirements, we create a visual ontology out of two existing knowledge corpora (WordNet and MPEG-7) by creating links between visualand general concepts. We test performance of the ontology on 40 shots of news video, and discuss the added value of each visual property.
: Cees G. M. Snoek, Marcel Worring, and Arnold W. M. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the ACM International Conference on Multimedia, pages 399-402. Singapore, November 2005.
[ BibTeX | | abstract ]
Semantic analysis of multimodal video aims to index segments of interest at a conceptual level. In reaching this goal, it requires an analysis of several information streams. At some point in the analysis these streams need to be fused. In this paper, we consider two classes of fusion schemes, namely early fusion and late fusion. The former fuses modalities in feature space, the latter fuses modalities in semantic space. We show by experiment on 184 hours of broadcast video data and for 20 semantic concepts, that late fusion tends to give slightly better performance for most concepts. However, for those concepts where early fusion performs better the difference is more significant.
: Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, and Frank J. Seinstra. On the surplus value of semantic video analysis beyond the key frame. In Proceedings of the IEEE International Conference on Multimedia & Expo. Amsterdam, The Netherlands, July 2005.
[ BibTeX | | abstract ]
Typical semantic video analysis methods aim for classification of camera shots based on extracted features from a single key frame only. In this paper, we sketch a video analysis scenario and evaluate the benefit of analysis beyond the key frame for semantic concept detection performance. We developed detectors for a lexicon of 26 concepts, and evaluated their performance on 120 hours of video data. Results show that, on average, detection performance can increase with almost 40% when the analysis method takes more visual content into account.
: Giang P. Nguyen and Marcel Worring. Similarity based visualization of image collections. In Proceedings of the 7th International Workshop of the EU Network of Excellence DELOS on Audio-visual Content and Information Visualization in Digital Libraries. Cortona, Italy, May 2005.
[ BibTeX | | abstract ]
In literature, few content based multimedia retrieval systems take the visualization as a tool for exploring the collections. However, when searching for images without examples to start with, one needs to explore the data set. Up to now, most available systems just show random collections of images in 2D grid form. More recently, advanced techniques have been developed for browsing based on similarity. However, none of them analyze the problems that occur when visualizing large visual collections. In this paper, we make these problems explicit. From there, we establish three general requirements: overview, visibility, and data structure preservation. Solutions for each requirement are proposed. Finally, a system is presented and experimental results are given to demonstrate our theory and approach.
: Frank J. Seinstra, Cees G. M. Snoek, Dennis C. Koelma, Jan-Mark Geusebroek, and Marcel Worring. User transparent parallel processing of the 2004 NIST TRECVID data set. In Proceedings of the 19th International Parallel & Distributed Processing Symposium. Denver, USA, April 2005.
[ BibTeX | | abstract ]
The Parallel-Horus framework, developed at the University of Amsterdam, is a unique software architecture that allows non-expert parallel programmers to develop fully sequential multimedia applications for efficient execution on homogeneous Beowulf-type commodity clusters. Previously obtained results for realistic, but relatively small-sized applications have shown the feasibility of the Parallel-Horus approach, with parallel performance consistently being found to be optimal with respect to the abstraction level of message passing programs. In this paper we discuss the most serious challenge Parallel-Horus has had to deal with so far: the processing of over 184 hours of video included in the 2004 NIST TRECVID evaluation, i.e. the de facto international standard benchmark for content-based video retrieval. Our results and experiences confirm that Parallel- Horus is a very powerful support-tool for state-of-the-art research and applications in multimedia processing.
: Cees G. M. Snoek and Marcel Worring. Multimedia pattern recognition in soccer video using time intervals. In Classification the Ubiquitous Challenge, Proceedings of the 28th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Dortmund, March 9-11, 2004, Studies in Classification, Data Analysis, and Knowledge Organization, pages 97-108. Springer-Verlag, Berlin, Germany, 2005. Invited paper.
[ BibTeX | abstract ]
We focus on the problem of learning rich semantic patterns from the multimedia data associated with broadcast video documents. In this talk we propose a generic and flexible framework for produced video classification that is capable to learn semantic concepts from multimodal sources based on analyzed style elements. Four properties that are indicative for style are identified, i.e. layout, content, capture, and concept context. The framework allows for robust classification of different semantic concepts in produced video by using a fixed core of common layout, content, and capture elements in combination with varying concept specific context elements. Concepts are classified using a Stacked Probabilistic Support Vector Machine. Results on 120 hours of video data from the 2003 TRECVID benchmark show that, by using the proposed framework, several rich semantic concepts in broadcast news can be classified with state-of-the-art accuracy.
: Laura Hollink, Giang Nguyen, Guus Schreiber, Jan Wielemaker, Bob Wielinga, and Marcel Worring. Adding spatial semantics to image annotations. In International Workshop on Knowledge Markup and Semantic Annotation. Hiroshima, Japan, November 2004.
[ BibTeX | | abstract ]
In this paper we discuss a the support of users in adding spatial information semi-automatically to annotations of images. Descriptions of objects depicted in an image are extended with information about the position of those objects. We distinguish two types of spatial concepts: absolute positions of objects (e.g., east, west) and relative spatial relations between objects (e.g., left, above). We show the use of a tool for a collection of art paintings with preexisting RDF annotations, including a list of image objects. First, the tool segments a painting into regions. The user selects regions, and labels these with objects from the existing annotation. Then, the tool computes absolute positions and relative spatial relations of the selected regions, and adds these to the annotation. A small evaluation study is reported in which annotations generated by the tool are compared to manual annotations by ten volunteers.
: Giang P. Nguyen and Marcel Worring. A user based framework for salient detail extraction. In Proceedings of the IEEE International Conference on Multimedia & Expo. Taipei, Taiwan, June 2004.
[ BibTeX | | abstract ]
In this paper, we consider the interaction with salient details in the image i.e. points, lines, and regions. Interactive salient detail definition goes further than summarizing the image into a set of salient details since the saliency of details depends on the context, the application and the user. We propose an interaction framework for salient details from the perspective of the user, which dynamically updates the user- and context-dependent definition of saliency based on relevance feedback. A number of instantiations of the framework are presented.
: Giang P. Nguyen and Marcel Worring. Optimizing similarity based visualization in content based image retrieval. In Proceedings of the IEEE International Conference on Multimedia & Expo. Taipei, Taiwan, June 2004.
[ BibTeX | | abstract ]
In any CBIR system, visualization is important, either to show the final result to the user or to form the basis for interaction. Advanced systems use 2-dimensional similarity based visualization which show not only the information of one image itself but also the relations between images. A problem in interactive 2D visualization is the overlap between the images displayed. This obviously reduces the search capability. Simply spreading the images on the screen space will not preserve the relations between them. In this paper, we propose a visualization scheme which reduces the overlap as well as preserves the general distribution of the images displayed. Results show that an effective balance between display of structures and limited overlap can be achieved.
: Cees G. M. Snoek, Marcel Worring, and Alexander G. Hauptmann. Detection of TV news monologues by style analysis. In Proceedings of the IEEE International Conference on Multimedia & Expo. Taipei, Taiwan, June 2004.
[ BibTeX | | abstract ]
We propose a method for detection of semantic concepts in produced video based on style analysis. Recognition of concepts is done by applying a classifier ensemble to the detected style elements. As a case study we present a method for detecting the concept of news subject monologues. Our approach had the best average precision performance amongst 26 submissions in the 2003 TRECVID benchmark.
: Marcel Worring, Giang P. Nguyen, Laura Hollink, Jan C. van Gemert, and Dennis C. Koelma. Accessing video archives using interactive search. In Proceedings of the IEEE International Conference on Multimedia & Expo. Taipei, Taiwan, June 2004.
[ BibTeX | | abstract ]
In this presentation we present a system for interactive search in video archives. In our view interactive search is a fourstep process composed of indexing, filtering, browsing, and ranking. We have experimentally verified, using 22 groups of two participants each, how users apply these steps in the interactive search and how well they perform.
: Laura Hollink, Giang P.Nguyen, Dennis Koelma, Guus Schreiber, and M.Worring. User strategies in video retrieval: a case study. In P. Enser, Y. Kompatsiaris, N.E. O'Connor, A.F. Smeaton, and A.W. M. Smeulders, editors, Proceedings of the International Conference on Image and Video Retrieval, CIVR 2004, Dublin, Ireland, July 21-23, 2004, volume 3115 of LNCS, pages 6-14. Springer-Verlag, Heidelberg, Germany, 2004.
[ BibTeX | | abstract ]
In this paper we present the results of a user study that was conducted in combination with a submission to TRECVID 2003. Search behavior of students querying an interactive video-retrieval system was analyzed. 242 Searches by 39 students on 24 topics were assessed. Questionnaire data, logged user actions on the system, and a quality mea- sure of each search provided by TRECVID were studied. Analysis of the results at various stages in the retrieval process suggests that retrieval based on transcriptions of the speech in video data adds more to the average precision of the result than content-based retrieval. The latter is particularly useful in providing the user with an overview of the dataset and thus an indication of the success of a search.
: Giang P. Nguyen and Marcel Worring. Query definition using interactive saliency. In Proceedings of the ACM SIGMM International Workshop on Multimedia Information Retrieval. Berkeley, USA, November 2003.
[ BibTeX | | abstract ]
Content-based image retrieval (CBIR) has been under investigation for a long time with many systems built to meet different application demands. However, in all systems, there is still a big gap between the user's expectation and the system's retrieval capabilities. Therefore, user interaction is an essential component of any CBIR system. Interaction up to now has mostly focused on global image features or similarities. We consider the interaction with salient details in the image i.e. points, lines, and regions. Interactive salient detail definition goes further than automatically summarizing the image into a set of salient details. We aim to dynamically update the user- and context-dependent definition of saliency based on relevance feedback from the user. In this paper, we propose an interaction framework for salient details from the perspective of the user.
: Cees G. M. Snoek and Marcel Worring. Time interval maximum entropy based event indexing in soccer video. In Proceedings of the IEEE International Conference on Multimedia & Expo, pages 481-484. Baltimore, USA, July 2003.
[ BibTeX | | abstract ]
Multimodal indexing of events in video documents poses problems with respect to representation, inclusion of contextual information, and synchronization of the heterogeneous information sources involved. In this paper we present the Time Interval Maximum Entropy (TIME) framework that tackles aforementioned problems. To demonstrate the viability of TIME for event classification in multimodal video, an evaluation was performed on the domain of soccer broadcasts. It was found that by applying TIME, the amount of video a user has to watch in order to see almost all highlights can be reduced considerably.
: Cees G. M. Snoek and Marcel Worring. A review on multimodal video indexing. In Proceedings of the IEEE International Conference on Multimedia & Expo, volume 2, pages 21-24. Lausanne, Switzerland, August 2002.
[ BibTeX | | abstract ]
Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is unfeasible for large video collections. Efficient, single modality based, video indexing methods have appeared in literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in collaborative fashion. In this paper we present a framework for multimodal video indexing, which views a video document from the perspective of its author. The framework serves as a blueprint for a generic and flexible multimodal video indexing system, and generalizes different state-of-the-art video indexing methods. It furthermore forms the basis for categorizing these different methods.
: Marcel Worring, Andrew Bagdanov, Jan van Gemert, Jan-Mark Geusebroek, Minh Hoang, Guus Schreiber, Cees G. M. Snoek, Jeroen Vendrig, Jan Wielemaker, and Arnold W. M. Smeulders. Interactive indexing and retrieval of multimedia content. In Proceedings of the 29th Annual Conference on Current Trends in Theory and Practice of Informatics, volume 2540 of Lecture Notes in Computer Science, pages 135-148. Springer-Verlag, Milovy, Czech Republic, 2002.
[ BibTeX | | abstract ]
The indexing and retrieval of multimedia items is difficult due to the semantic gap between the user's perception of the data and the descriptions we can derive automatically from the data using computer vision, speech recognition, and natural language processing. In this contribution we consider the nature of the semantic gap in more detail and show examples of methods that help in limiting the gap. These methods can be automatic, but in general the indexing and retrieval of multimedia items should be a collaborative process between the system and the user. We show how to employ the user's interaction for limiting the semantic gap.

TRECVID Benchmark

Show all abstracts | Hide all abstracts

: Cees G. M. Snoek, Koen E. A. van de Sande, Amirhossein Habibian, Svetlana Kordumova, Zhenyang Li, Masoud Mazloom, Silvia-Laura Pintea, Ran Tao, Dennis C. Koelma, and Arnold W. M. Smeulders. The MediaMill TRECVID 2012 semantic video search engine, November 2012.
[ BibTeX | abstract ]
: Cees G. M. Snoek, Koen E. A. van de Sande, Xirong Li, Masoud Mazloom, Yu-Gang Jiang, Dennis C. Koelma, and Arnold W. M. Smeulders. The MediaMill TRECVID 2011 semantic video search engine, December 2011.
[ BibTeX | | abstract ]
In this paper we describe our TRECVID 2011 video retrieval experiments. The MediaMill team participated in two tasks: semantic indexing and multimedia event detection. The starting point for the MediaMill detection approach is our top-performing bag-of-words system of TRECVID 2010, which uses multiple color SIFT descriptors, sparse codebooks with spatial pyramids, and kernel-based machine learning. All supported by GPU-optimized algorithms, approximated histogram intersection kernels, and multi-frame video processing. This year our experiments focus on 1) the soft assignment of descriptors with the use of difference coding, 2) the exploration of bag-of-words for event detection, and 3) the selection of informative concepts out of 1,346 concept detectors as a representation for event detection. The 2011 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the runner-up ranking for concept detection in the semantic indexing task.
: Koen E. A. van de Sande and Cees G. M. Snoek. The University of Amsterdam's concept detection system at ImageCLEF 2011, September 2011.
[ BibTeX | abstract ]
: Cees G. M. Snoek, Koen E. A. van de Sande, Ork de Rooij, Bouke Huurnink, Efstratios Gavves, Daan Odijk, Maarten de Rijke, Theo Gevers, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders. The MediaMill TRECVID 2010 semantic video search engine, November 2010.
[ BibTeX | abstract ]
: Cees G. M. Snoek, Koen E. A. van de Sande, Ork de Rooij, Bouke Huurnink, Jasper R. R. Uijlings, Michiel van Liempt, Miguel Bugalho, Isabel Trancoso, Fei Yan, Muhammad A. Tahir, Krystian Mikolajczyk, Josef Kittler, Maarten de Rijke, Jan-Mark Geusebroek, Theo Gevers, Marcel Worring, Dennis C. Koelma, and Arnold W. M. Smeulders. The MediaMill TRECVID 2009 semantic video search engine, November 2009.
[ BibTeX | | abstract ]
In this paper we describe our TRECVID 2009 video retrieval experiments. The MediaMill team participated in three tasks: concept detection, automatic search, and interactive search. The starting point for the MediaMill concept detection approach is our top-performing bag-of-words system of last year, which uses multiple color descriptors, codebooks with soft-assignment, and kernel-based supervised learning. We improve upon this baseline system by exploring two novel research directions. Firstly, we study a multi-modal extension by including 20 audio concepts and fusion using two novel multi-kernel supervised learning methods. Secondly, with the help of recently proposed algorithmic refinements of bag-of-word representations, a GPU implementation, and compute clusters, we scale-up the amount of visual information analyzed by an order of magnitude, to a total of 1,000,000 i-frames. Our experiments evaluate the merit of these new components, ultimately leading to 64 robust concept detectors for video retrieval. For retrieval, a robust but limited set of concept detectors justifies the need to rely on as many auxiliary information channels as possible. For automatic search we therefore explore how we can learn to rank various information channels simultaneously to maximize video search results for a given topic. To further improve the video retrieval results, our interactive search experiments investigate the roles of visualizing preview results for a certain browse-dimension and relevance feedback mechanisms that learn to solve complex search topics by analysis from user browsing behavior. The 2009 edition of the TRECVID benchmark has again been a fruitful participation for the MediaMill team, resulting in the top ranking for both concept detection and interactive search. Again a lot has been learned during this year's TRECVID campaign; we highlight the most important lessons at the end of this paper.
: Cees G. M. Snoek, Koen E. A. van de Sande, Ork de Rooij, Bouke Huurnink, Jan C. van Gemert, Jasper R. R. Uijlings, J. He, Xirong Li, Ivo Everts, Vladimir Nedović, Michiel van Liempt, Richard van Balen, Fei Yan, Muhammad A. Tahir, Krystian Mikolajczyk, Josef Kittler, Maarten de Rijke, Jan-Mark Geusebroek, Theo Gevers, Marcel Worring, Arnold W. M. Smeulders, and Dennis C. Koelma. The MediaMill TRECVID 2008 semantic video search engine, November 2008.
[ BibTeX | | abstract ]
In this paper we describe our TRECVID 2008 video retrieval experiments. The MediaMill team participated in three tasks: concept detection, automatic search, and interactive search. Rather than continuing to increase the number of concept detectors available for retrieval, our TRECVID 2008 experiments focus on increasing the robustness of a small set of detectors using a bag-of-words approach. To that end, our concept detection experiments emphasize in particular the role of visual sampling, the value of color invariant features, the influence of codebook construction, and the effectiveness of kernel-based learning parameters. For retrieval, a robust but limited set of concept detectors necessitates the need to rely on as many auxiliary information channels as possible. Therefore, our automatic search experiments focus on predicting which information channel to trust given a certain topic, leading to a novel framework for predictive video retrieval. To improve the video retrieval results further, our interactive search experiments investigate the roles of visualizing preview results for a certain browse-dimension and active learning mechanisms that learn to solve complex search topics by analysis from user browsing behavior. The 2008 edition of the TRECVID benchmark has been the most successful MediaMill participation to date, resulting in the top ranking for both concept detection and interactive search, and a runner-up ranking for automatic retrieval. Again a lot has been learned during this year's TRECVID campaign; we highlight the most important lessons at the end of this paper.
: Cees G. M. Snoek, I. Everts, Jan C. van Gemert, Jan-Mark Geusebroek, Bouke Huurnink, Dennis C. Koelma, Michiel van Liempt, Ork de Rooij, Koen E. A. van de Sande, Arnold W. M. Smeulders, Jasper R. R. Uijlings, and Marcel Worring. The MediaMill TRECVID 2007 semantic video search engine, November 2007.
[ BibTeX | | abstract ]
In this paper we describe our TRECVID 2007 experiments. The MediaMill team participated in two tasks: concept detection and search. For concept detection we extract region-based image features, on grid, keypoint, and segmentation level, which we combine with various supervised learners. In addition, we explore the utility of temporal image features. A late fusion approach of all region-based analysis methods using geometric mean was our most successful run. What is more, using MediaMill Challenge and LSCOM annotations, our visual-only approach generalizes to a set of 572 concept detectors. To handle such a large thesaurus in retrieval, an engine is developed which automatically selects a set of relevant concept detectors based on text matching, ontology querying, and visual concept likelihood. The suggestion engine is evaluated as part of the automatic search task and forms the entry point for our interactive search experiments. For this task we experiment with two browsers for interactive exploration: the well-known CrossBrowser and the novel ForkBrowser. It was found that, while retrieval performance varies substantially per topic, the ForkBrowser is able to produce the same overall results as the CrossBrowser. However, the ForkBrowser obtains top-performance for most topics with less user interaction. Indicating the potential of this browser for interactive search. Similar to previous years our best interactive search runs yield high overall performance, ranking 3rd and 4th.
: Cees G. M. Snoek, Jan C. van Gemert, Theo Gevers, Bouke Huurnink, Dennis C. Koelma, Michiel van Liempt, Ork de Rooij, Koen E. A. van de Sande, Frank J. Seinstra, Arnold W. M. Smeulders, Andrew H. C. Thean, Cor J. Veenman, and Marcel Worring. The MediaMill TRECVID 2006 semantic video search engine, November 2006.
[ BibTeX | | abstract ]
In this paper we describe our TRECVID 2006 experiments. The MediaMill team participated in two tasks: concept detection and search. For concept detection we use the MediaMill Challenge as experimental platform. The MediaMill Challenge divides the generic video indexing problem into a visual-only, textual-only, early fusion, late fusion, and combined analysis experiment. We provide a baseline implementation for each experiment together with baseline results, which we made available for the TRECVID community. The Challenge package was downloaded more than 80 times and we anticipate that it has been used by several teams for their 2006 submission. Our Challenge experiments focus specifically on visual-only analysis of video (run id: B_MM). We extract image features, on global, regional, and keypoint level, which we combine with various supervised learners. A late fusion approach of visual-only analysis methods using geometric mean was our most successful run. With this run we conquer the Challenge baseline by more than 50%. Our concept detection experiments have resulted in the best score for three concepts: i.e. desert, flag us, and charts. What is more, using LSCOM annotations, our visual-only approach generalizes well to a set of 491 concept detectors. To handle such a large thesaurus in retrieval, an engine is developed which automatically selects a set of relevant concept detectors based on text matching and ontology querying. The suggestion engine is evaluated as part of the automatic search task (run id: A-MM) and forms the entry point for our interactive search experiments (run id: A-MM). Here we experiment with query by object matching and two browsers for interactive exploration: the CrossBrowser and the novel NovaBrowser. It was found that the NovaBrowser is able to produce the same results as the CrossBrowser, but with less user interaction. Similar to previous years our best interactive search runs yield top performance, ranking 2nd and 6th overall. Again a lot has been learned during this year's TRECVID campaign, we highlight the most important lessons at the end of this paper.
: Cees G. M. Snoek, Jan C. van Gemert, Jan-Mark Geusebroek, Bouke Huurnink, Dennis C. Koelma, Giang P. Nguyen, Ork de Rooij, Frank J. Seinstra, Arnold W. M. Smeulders, Cor J. Veenman, and Marcel Worring. The MediaMill TRECVID 2005 semantic video search engine, November 2005.
[ BibTeX | | abstract ]
In this paper we describe our TRECVID 2005 experiments. The UvA-MediaMill team participated in four tasks. For the detection of camera work (runid: A_CAM) we investigate the benefit of using a tessellation of detectors in combination with supervised learning over a standard approach using global image information. Experiments indicate that average precision results increase drastically, especially for pan (+51%) and tilt (+28%). For concept detection we propose a generic approach using our semantic pathfinder. Most important novelty compared to last years system is the improved visual analysis using proto-concepts based on Wiccest features. In addition, the path selection mechanism was extended. Based on the semantic pathfinder architecture we are currently able to detect an unprecedented lexicon of 101 semantic concepts in a generic fashion. We performed a large set of experiments (runid: B_vA). The results show that an optimal strategy for generic multimedia analysis is one that learns from the training set on a per-concept basis which tactic to follow. Experiments also indicate that our visual analysis approach is highly promising. The lexicon of 101 semantic concepts forms the basis for our search experiments (runid: B_2_A-MM). We participated in automatic, manual (using only visual information), and interactive search. The lexicon-driven retrieval paradigm aids substantially in all search tasks. When coupled with interaction, exploiting several novel browsing schemes of our semantic video search engine, results are excellent. We obtain a top-3 result for 19 out of 24 search topics. In addition, we obtain the highest mean average precision of all search participants. We exploited the technology developed for the above tasks to explore the BBC rushes. Most intriguing result is that from the lexicon of 101 visual-only models trained for news data 25 concepts perform reasonably well on BBC data also.
: Cees G. M. Snoek, Marcel Worring, Jan-Mark Geusebroek, Dennis C. Koelma, and Frank J. Seinstra. The MediaMill TRECVID 2004 semantic video search engine, November 2004.
[ BibTeX | | abstract ]
This year the UvA-MediaMill team participated in the Feature Extraction and Search Task. We developed a generic approach for semantic concept classification using the semantic value chain. The semantic value chain extracts concepts from video documents based on three consecutive analysis links, named the content link, the style link, and the context link. Various experiments within the analysis links were performed, showing amongst others the merit of processing beyond key frames, the value of style elements, and the importance of learning semantic context. For all experiments a lexicon of 32 concepts was exploited, 10 of which are part of the Feature Extraction Task. Top three system-based ranking in 8 out of the 10 benchmark concepts indicates that our approach is very promising. Apart from this, the lexicon of 32 concepts proved very useful in an interactive search scenario with our semantic video search engine, where we obtained the highest mean average precision of all participants.
: Alexander Hauptmann, Robert V. Baron, Ming-Yu Chen, Michael Christel, Pinar Duygulu, Chang Huang, Rong Jin, Wei-Hao Lin, Dorbin Ng, Neema Moraveji, Norman Papernick, Cees G. M. Snoek, George Tzanetakis, Jun Yang, Rong Yan, and Howard D. Wactlar. Informedia at TRECVID 2003: Analyzing and searching broadcast news video, November 2003.
[ BibTeX | | abstract ]
: Marcel Worring, Giang P.Nguyen, Laura Hollink, Jan van Gemert, and Dennis C. Koelma. Interactive search using indexing, filtering, browsing, and ranking, November 2003.
[ BibTeX | | abstract ]
: Jeroen Vendrig, Jurgen den Hartog, David van Leeuwen, Ioannis Patras, Stephan Raaijmakers, Cees Snoek, Jeroen van Rest, and Marcel Worring. TREC feature extraction by active learning, November 2002.
[ BibTeX | | abstract ]
: Jan Baan, Alex van Ballegooij, Jan-Mark Geusebroek, Djoerd Hiemstra, Jurgen den Hartog, Johan List, Cees Snoek, Ioannis Patras, Stephan Raaijmakers, Leon Todoran, Jeroen Vendrig, Arjen de Vries, Thijs Westerveld, and Marcel Worring. Lazy users and automatic video retrieval tools in (the) lowlands, November 2001.
[ BibTeX | | abstract ]

Technical Demonstrators

Show all abstracts | Hide all abstracts

: Cees G. M. Snoek, Bauke Freiburg, Johan Oomen, and Roeland Ordelman. Crowdsourcing rock n' roll multimedia retrieval, October 2010.
[ BibTeX | | abstract ]
In this technical demonstration, we showcase a multimedia search engine that facilitates semantic access to archival rock n' roll concert video. The key novelty is the crowdsourcing mechanism, which relies on online users to improve, extend, and share, automatically detected results in video fragments using an advanced timeline-based video player. The user-feedback serves as valuable input to further improve automated multimedia retrieval results, such as automatically detected concepts and automatically transcribed interviews. The search engine has been operational online to harvest valuable feedback from rock n' roll enthusiasts.
: Cees G. M. Snoek. The MediaMill search engine video, October 2010.
[ BibTeX | | abstract ]
In this video demonstration, we advertise the MediaMill video search engine, a system that facilitates semantic access to video based on a large lexicon of visual concept detectors and interactive video browsers. With an ultimate aim to disseminate video retrieval research to a non-technical audience, we explain the need for a visual video retrieval solution, summarize the MediaMill technology, and hint at future perspectives.
: Ork de Rooij, Cees G. M. Snoek, and Marcel Worring. MediaMill: Guiding the user to results using the ForkBrowser, July 2009.
[ BibTeX | | abstract ]
In this technical demonstration we showcase the MediaMill Semantic Video Search Engine. It allows usage of multiple query methods embedded into a single browsing environment while guiding the user to better results by using a novel active learning strategy. This allows for fast and effective search trough large video collections.
: Ork de Rooij, Cees G. M. Snoek, and Marcel Worring. Mediamill: Fast and effective video search using the ForkBrowser, July 2008.
[ BibTeX | | abstract ]
In this technical demonstration we showcase the MediaMill ForkBrowser for video retrieval. It embeds multiple query methods into a single browsing environment. We show that users can switch query methods on demand without the need to adapt to a different interface. This allows for fast and effective search trough large video collections.
: Cees G. M. Snoek, Richard van Balen, Dennis C. Koelma, Arnold W. M. Smeulders, and Marcel Worring. Analyzing video concept detectors visually, June 2008.
[ BibTeX | | abstract ]
In this demonstration we showcase an interactive analysis tool for researchers working on concept-based video retrieval. By visualizing intermediate concept detection analysis stages, the tool aids in understanding the success and failure of video concept detection methods. We demonstrate the tool on the domain of pop concert video.
: Ork de Rooij, Cees G. M. Snoek, and Marcel Worring. Mediamill: Semantic video browsing using the RotorBrowser, July 2007.
[ BibTeX | | abstract ]
In this technical demonstration we showcase the current version of the MediaMill system, a search engine that facilitates access to news video archives at a semantic level. The core of the system is a thesaurus of 500 automatically detected semantic concepts. To handle such a large thesaurus in retrieval, an engine is developed which automatically selects a set of relevant concepts based on a textual query, and an novel user interface which uses multi dimensional browsing to visualize the result set.
: Ork de Rooij, Cees G. M. Snoek, and Marcel Worring. Mediamill: Video query on demand using the RotorBrowser, July 2007.
[ BibTeX | | abstract ]
In this technical demonstration we showcase the RotorBrowser, A visualization within MediaMill system which uses query exploration as the basis for search in video archives.
: Cees G. M. Snoek, Marcel Worring, Bouke Huurnink, Jan C. van Gemert, Koen E. A. van de Sande, Dennis C. Koelma, and Ork de Rooij. MediaMill: Video search using a thesaurus of 500 machine learned concepts, December 2006.
[ BibTeX | | abstract ]
In this technical demonstration we showcase the current version of the MediaMill system, a search engine that facilitates access to news video archives at a semantic level. The core of the system is a thesaurus of 500 automatically detected semantic concepts. To handle such a large thesaurus in retrieval, an engine is developed which automatically selects a set of relevant concepts based on the textual query and userspecified example images. The result set can be browsed easily to obtain the final result for the query.
: Marcel Worring, Cees G. M. Snoek, Bouke Huurnink, Jan van Gemert, Dennis Koelma, and Ork de Rooij. The MediaMill large-lexicon concept suggestion engine, October 2006.
[ BibTeX | | abstract ]
In this technical demonstration we show the current version of the MediaMill system, a search engine that facilitates access to news video archives at a semantic level. The core of the system is a lexicon of 436 automatically detected semantic concepts. To handle such a large lexicon in retrieval, an engine is developed which automatically selects a set of relevant concepts based on the textual query and example images. The result set can be browsed easily to obtain the final result for the query.
: Marcel Worring, Cees G. M. Snoek, Ork de Rooij, Giang. P. Nguyen, Richard van Balen, and Dennis C. Koelma. MediaMill: Advanced browsing in news video archives, July 2006.
[ BibTeX | | abstract ]
In this paper we present our Mediamill video search engine. The basis for the engine is a semantic indexing process which derives a lexicon of 101 concepts. To support the user in navigating the collection, the system defines a visual similarity space, a semantic similarity space, a semantic thread space, and browsers to explore them. It extends upon [1] with improved browsing tools. The search system is evaluated within the TRECVID benchmark [2]. We obtain a top-3 result for 19 out of 24 search topics. In addition, we obtain the highest mean average precision of all search participants.
: Cees G. M. Snoek, Marcel Worring, Jan van Gemert, Jan-Mark Geusebroek, Dennis Koelma, Giang P. Nguyen, Ork de Rooij, and Frank Seinstra. MediaMill: Exploring news video archives based on learned semantics, November 2005.
[ BibTeX | | abstract ]
In this technical demonstration we showcase the MediaMill system. A search engine that facilitates access to news video archives at a semantic level. The core of the system is an unprecedented lexicon of 100 automatically detected semantic concepts. Based on this lexicon we demonstrate how users can obtain highly relevant retrieval results using query-by-concept. In addition, we show how the lexicon of concepts can be exploited for novel applications using advanced semantic visualizations. Several aspects of the MediaMill system are evaluated as part of our TRECVID 2005 efforts.
: Cees G. M. Snoek, Dennis Koelma, Jeroen van Rest, Nellie Schipper, Frank J. Seinstra, Andrew Thean, and Marcel Worring. MediaMill: Searching multimedia archives based on learned semantics, July 2005.
[ BibTeX | abstract ]
Video is about to conquer the Internet. Real-time delivery of video content is technically possible to any desktop and mobile device, even with modest connections. The main problem hampering massive (re)usage of video content today is the lack of effective content based tools that provide semantic access. In this contribution we discuss systems for both video analysis and video retrieval that facilitate semantic access to video sources. Both systems were evaluated in the 2004 TRECVID benchmark as top performers in their task.
: Cees G. M. Snoek and Marcel Worring. Goalgle: A soccer video search engine, July 2003.
[ BibTeX | | abstract ]
Goalgle is a prototype search engine for soccer video. Browsing and retrieval functionality is provided by means of a web based interface. This interface allows users to jump to video segments from a collection of prerecorded and analyzed soccer matches based on queries on specific players, events, matches, and/or text. In this contribution we discuss the system architecture and functionality of the Goalgle soccer video search engine.

Ph.D. Theses

Show all abstracts | Hide all abstracts

: Xirong Li. Content-Based Visual Search Learned from Social Media. Ph.D. thesis, University of Amsterdam, Amsterdam, The Netherlands, 2012.
[ BibTeX ]
: Koen E. A. van de Sande. Invariant Color Descriptors for Efficient Object Recognition. Ph.D. thesis, University of Amsterdam, Amsterdam, The Netherlands, 2011.
[ BibTeX ]
: Bouke Huurnink. Search in Audiovisual Broadcast Archives. Ph.D. thesis, University of Amsterdam, Amsterdam, The Netherlands, 2010.
[ BibTeX ]
: Jan C. van Gemert. Robust Visual Scene Categorization in Context. Ph.D. thesis, University of Amsterdam, Amsterdam, The Netherlands, 2010.
[ BibTeX ]
: Giang P. Nguyen. Interactive Image Search using Similarity-Based Visualization. Ph.D. thesis, University of Amsterdam, December 2006.
[ BibTeX | | abstract ]
To search for images in a small collection, it can be done by just looking at them one-by-one. The sizes of image collections on the web or professional collections are in the order of a hundred thousand if not a million. For such collections systems should provide efficient browsing techniques. As most of the time, users are non-expert searchers the systems must have a user friendly interface. To satisfy these requirements, we design image search systems that allow the user to interact with image collections in an intuitive way. To that end, advanced visualization techniques are used in which a cloud of images is presented on the screen in such a way that similar images are presented close to each other. In this way the user's attention is pointed to the right search direction. While exploring this direction the user can give feedback to the system by indicating relevant images. The system than learns to adapt itself to get closer to the user's search expectation. We have demonstrated our proposed approach to different image collections from simple to very complicated ones such as images taken from large news video archives. The experimental results show a significant improvement in search performance over existing methods.
: Cees G. M. Snoek. The Authoring Metaphor to Machine Understanding of Multimedia. Ph.D. thesis, University of Amsterdam, October 2005.
[ BibTeX | | abstract ]
This thesis makes a contribution to the field of multimedia understanding. Where our ultimate aim is to structure the digital multimedia chaos by bridging the semantic gap between computable data features on one end and the semantic interpretation of the data by a user on the other end. We distinguish between produced and non-produced multimedia or video documents. We depart from the view that a produced video is the result of an authoring-driven production process. This authoring process serves as a metaphor for machine-driven understanding. We present a step-by-step extrapolation of this authoring metaphor for automatic multimedia understanding. While doing so, we cover in this thesis an extensive overview of the field, a theoretical foundation for authoring-driven multimedia understanding, state-of-the-art benchmark validation, and practical semantic video retrieval applications.

M.Sc. Theses

Show all abstracts | Hide all abstracts

: Koen E. A. van de Sande. Coloring Concept Detection in Video using Interest Regions. Master's thesis, University of Amsterdam, March 2007.
[ BibTeX | | abstract ]
Video concept detection aims to detect high-level semantic information present in video. State-of-the-art systems are based on visual features and use machine learning to build concept detectors from annotated examples. The choice of features and machine learning algorithms is of great influence on the accuracy of the concept detector. So far, intensitybased SIFT features based on interest regions have been applied with great success in image retrieval. Features based on interest regions, also known as local features, consist of an interest region detector and a region descriptor. In contrast to using intensity information only, we will extend both interest region detection and region description with color information in this thesis. We hypothesize that automated concept detection using interest region features benefits from the addition of color information. Our experiments, using the Mediamill Challenge benchmark, show that the combination of intensity features with color features improves significantly over intensity features alone.
: Ork de Rooij. Browsing News Video using Semantic Threads. Master's thesis, University of Amsterdam, December 2005.
[ BibTeX | | abstract ]
This paper describes a novel approach for finding threads in video material using basic clustering techniques by combining knowledge from the content-based retrieval in video material domain and the topic detection and tracking domain. For this the notion of the semantic thread as an ordered list of video shots about the same semantic subject is proposed. A method for generating semantic threads from a large collection of video material is presented. Several standard algorithms for creating clusters are compared and a method for including both clusters and time to create threads is discussed. With these threads an interface for searching through a large dataset of video material is proposed and implemented. This interface is then evaluated with the TRECVID interactive retrieval task, where it ranked among the best interactive retrieval systems currently available. The interface proved to be very usefull for finding video material where the topic cannot be easily found by using traditional keyword search.
: Bouke Huurnink. AutoSeek: Towards a Fully Automated Video Search System. Master's thesis, University of Amsterdam, October 2005.
[ BibTeX | | abstract ]
The astounding rate at which digital video is becoming available has stimulated research into video retrieval systems that incorporate visual, auditory, and spatio-temporal analysis. In the beginning, these multimodal systems required intensive user interaction, but during the past few years automatic search systems that need no interaction at all have emerged, requiring only a string of natural language text and a number of multimodal examples as input. We apply ourselves to this task of automatic search, and investigate the feasibility of automatic search without multimodal examples. The result is AutoSeek, an automatic multimodal search system that requires only text as input. In our search strategy we first extract semantic concepts from text and match them to semantic concept indices using a large lexical database. Queries are then created for the semantic concept indices as well as for indices that incorporate ASR text. Finally, the result sets from the different indices are fused with a combination strategy that was created using a set of development search statements. We subject our system to an external assessment in the form of the TRECVID 2005 automatic search task, and find that our system performs competitively when compared to systems that also use multimodal examples, ranking in the top three systems for 25% of the search tasks and scoring the fourth highest in overall mean average precision. We conclude that automatic search without using multimodal examples is a realistic task, and predict that performance will improve further as semantic concept detectors increase in quantity and quality.
: Cees G. M. Snoek. Camera Distance Classification: Indexing Video Shots based on Visual Features. Master's thesis, University of Amsterdam, October 2000. Awarded by the NGI as best computer science masters thesis for the state of North-Holland in 2001.
[ BibTeX | | abstract ]
In this thesis we describe a method that automatically indexes shots from cinematographic video data based on the camera distance used. The proposed method can be used for automatic analysis and interpretation of the meaning of the shot within a video stream, as an assistance tool for video librarians, and as indexing mechanism to be used within a video database system. Three types of camera distance, originating from the art of filming, are distinguished. Based on extracted and evaluated visual features an integrated classification method is proposed and evaluated. It was found that, although discriminative power of some features was limited, classification of cinematographic video based on visual features is possible in the majority of shots.