Video concept detection aims to detect high-level semantic information present in video. State-of-the-art systems are based on visual features and use machine learning to build concept detectors from annotated examples. The choice of features and machine learning algorithms greatly influences the accuracy of the concept detector. So far, intensity-based SIFT features computed over interest regions have been applied with great success in image retrieval. Features based on interest regions, also known as local features, consist of an interest region detector and a region descriptor. In this thesis, in contrast to using intensity information only, we extend both interest region detection and region description with color information. We hypothesize that automated concept detection using interest region features benefits from the addition of color information. Our experiments, using the Mediamill Challenge benchmark, show that combining intensity features with color features improves significantly over intensity features alone.
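A minimal sketch of the general idea, assuming a per-channel color extension of SIFT followed by bag-of-words quantization and a per-concept SVM. The color space, codebook size, and function names are illustrative assumptions, not the thesis's exact descriptor:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def color_sift_descriptors(bgr_image):
    """Detect interest regions on the intensity image, then describe each
    region on every color channel and concatenate the descriptors."""
    sift = cv2.SIFT_create()
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    keypoints = sift.detect(gray, None)           # interest region detection
    per_channel = []
    for channel in cv2.split(bgr_image):          # region description per channel
        _, desc = sift.compute(channel, keypoints)
        per_channel.append(desc)
    return np.hstack(per_channel)                 # 3 x 128 dims per region

def bag_of_words(descriptor_sets, codebook_size=256):
    """Quantize local descriptors into fixed-length histograms."""
    codebook = KMeans(n_clusters=codebook_size, n_init=3).fit(
        np.vstack(descriptor_sets))
    histograms = []
    for desc in descriptor_sets:
        words = codebook.predict(desc)
        hist, _ = np.histogram(words, bins=codebook_size,
                               range=(0, codebook_size))
        histograms.append(hist / max(hist.sum(), 1))
    return np.array(histograms)

# One SVM per semantic concept, trained on annotated keyframes:
# features = bag_of_words([color_sift_descriptors(img) for img in keyframes])
# detector = SVC(probability=True).fit(features, concept_labels)
```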
This paper describes a novel approach to finding threads in video material using basic clustering techniques, combining knowledge from the domains of content-based video retrieval and topic detection and tracking. To this end, the notion of a semantic thread is proposed: an ordered list of video shots about the same semantic subject. A method for generating semantic threads from a large collection of video material is presented. Several standard clustering algorithms are compared, and a method that combines both clusters and time to create threads is discussed. Based on these threads, an interface for searching through a large dataset of video material is proposed and implemented. This interface is then evaluated on the TRECVID interactive retrieval task, where it ranked among the best interactive retrieval systems currently available. The interface proved to be very useful for finding video material whose topic cannot easily be found through traditional keyword search.
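An illustrative sketch of the thread-building step, assuming shots are already represented as feature vectors with timestamps. The clustering algorithm, gap threshold, and data layout below are assumptions for clarity; the paper itself compares several standard clustering algorithms:

```python
from dataclasses import dataclass
from sklearn.cluster import AgglomerativeClustering
import numpy as np

@dataclass
class Shot:
    shot_id: int
    time: float           # position in the broadcast, in seconds
    features: np.ndarray  # content-based feature vector

def build_threads(shots, n_clusters=50, max_gap=300.0):
    """Cluster shots by content, then cut each cluster into
    time-ordered threads wherever the temporal gap is too large."""
    X = np.vstack([s.features for s in shots])
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    threads = []
    for c in set(labels):
        members = sorted((s for s, l in zip(shots, labels) if l == c),
                         key=lambda s: s.time)
        thread = [members[0]]
        for prev, cur in zip(members, members[1:]):
            if cur.time - prev.time > max_gap:  # temporal break: new thread
                threads.append(thread)
                thread = []
            thread.append(cur)
        threads.append(thread)
    return threads
```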
The astounding rate at which digital video is becoming available has stimulated research into video retrieval systems that incorporate visual, auditory, and spatio-temporal analysis. In the beginning, these multimodal systems required intensive user interaction, but during the past few years automatic search systems that need no interaction at all have emerged, requiring only a natural language text string and a number of multimodal examples as input. We apply ourselves to this task of automatic search, and investigate the feasibility of automatic search without multimodal examples. The result is AutoSeek, an automatic multimodal search system that requires only text as input. In our search strategy, we first extract semantic concepts from text and match them to semantic concept indices using a large lexical database. Queries are then created for the semantic concept indices as well as for indices that incorporate ASR text. Finally, the result sets from the different indices are fused with a combination strategy that was created using a set of development search statements. We subject our system to an external assessment in the form of the TRECVID 2005 automatic search task, and find that our system performs competitively when compared to systems that also use multimodal examples, ranking in the top three systems for 25% of the search tasks and achieving the fourth-highest overall mean average precision. We conclude that automatic search without multimodal examples is a realistic task, and predict that performance will improve further as semantic concept detectors increase in quantity and quality.
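A hedged sketch of this pipeline: map query terms to concept detectors via a lexical database (WordNet here, which is an assumption), combine with an ASR text index, and fuse the ranked result lists. The concept lexicon, similarity threshold, and fusion weights are illustrative, not the system's tuned values:

```python
from collections import defaultdict
from nltk.corpus import wordnet as wn

# Hypothetical names of available concept detectors
CONCEPT_LEXICON = ["car", "boat", "building", "sports", "explosion"]

def match_concepts(query_terms, threshold=0.5):
    """Link query terms to concept detectors by WordNet path similarity."""
    matches = []
    for term in query_terms:
        for concept in CONCEPT_LEXICON:
            sims = [s1.path_similarity(s2) or 0.0
                    for s1 in wn.synsets(term)
                    for s2 in wn.synsets(concept)]
            if sims and max(sims) >= threshold:
                matches.append(concept)
    return matches

def fuse(result_lists, weights):
    """Weighted linear fusion of per-index shot rankings
    (scores assumed normalized to [0, 1])."""
    fused = defaultdict(float)
    for ranking, w in zip(result_lists, weights):
        for shot_id, score in ranking.items():
            fused[shot_id] += w * score
    return sorted(fused, key=fused.get, reverse=True)

# concept_runs = [query_concept_index(c) for c in match_concepts(terms)]
# final = fuse([asr_results, *concept_runs], weights=[0.4, 0.6, ...])
```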
In this thesis we describe a method that automatically indexes shots from cinematographic video data based on the camera distance used. The proposed method can be used for automatic analysis and interpretation of the meaning of a shot within a video stream, as an assistance tool for video librarians, and as an indexing mechanism within a video database system. Three types of camera distance, originating from the art of filming, are distinguished. Based on extracted and evaluated visual features, an integrated classification method is proposed and evaluated. It was found that, although the discriminative power of some features was limited, classification of cinematographic video based on visual features is possible for the majority of shots.
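A minimal sketch of classifying a shot's camera distance from visual features. The three classes (close-up, medium shot, long shot) follow film grammar; the specific cues and classifier below are illustrative assumptions, not the feature set evaluated in the thesis:

```python
import cv2
import numpy as np
from sklearn.tree import DecisionTreeClassifier

DISTANCES = ["close-up", "medium shot", "long shot"]

def shot_features(keyframe_bgr):
    """Cheap visual cues that correlate with camera distance:
    largest detected face relative to frame size, and edge density."""
    gray = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray)
    frame_area = gray.shape[0] * gray.shape[1]
    face_ratio = max((w * h for (_, _, w, h) in faces), default=0) / frame_area
    edge_density = cv2.Canny(gray, 100, 200).mean() / 255.0
    return np.array([face_ratio, edge_density])

# Train on manually labeled keyframes (labels are indices into DISTANCES),
# then predict the distance class for a new shot's keyframe:
# clf = DecisionTreeClassifier().fit(
#     np.vstack([shot_features(f) for f in train_frames]), train_labels)
# label = DISTANCES[int(clf.predict([shot_features(new_frame)])[0])]
```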