Adding ears to a computer counting the bounces on a trampoline.

Researchers from the Informatics Institute of the University of Amsterdam recently published an article at the IEEE Conference on Computer Vision and Pattern Recognition, the leading peer-reviewed publication venue in the field of artificial intelligence. They introduce a method for counting repetitions, which is relevant when analyzing human activity (sports), animal behavior (a bee’s waggle dance) or natural phenomena (leaves in the wind). For the first time, the method integrates the audio modality into a visual counting system based on neural networks.

Yunhua Zhang and Cees Snoek of the Video & Image Sense Lab (VIS), in collaboration with Ling Shao from the Inception Institute of Artificial Intelligence (IIAI), developed a method for estimating how many times a repetitive phenomenon, such as bouncing on a trampoline, slicing an onion, or playing ping pong, occurs in a video stream. The method applies to any scenario in which repetitive motion patterns exist. By using both sight and sound, as well as their cross-modal interaction, its counting predictions are shown to be much more robust than those of a sight-only model. The results highlight the benefit of adding sound to sight, especially in harsh vision conditions such as low illumination, camera viewpoint changes, and even occlusions, where the combination of sight and sound always outperforms sight alone.

Detailed information can be found in

Cees G.M. Snoek