Video analysis and understanding

Repetitive Activity Counting by Sight and Sound

This paper addresses repetitive activity counting in videos. Unlike existing works, which all analyze only the visual video content, we incorporate, for the first time, the corresponding sound into the repetition counting process. This …

Support-set bottlenecks for video-text representation learning

The dominant paradigm for learning video-text representations, noise contrastive learning, increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes …
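The noise contrastive objective described in this abstract can be illustrated with a generic symmetric InfoNCE loss. This is a minimal sketch under stated assumptions, not the paper's method. It shows only the baseline paradigm the abstract refers to and omits the support-set bottleneck the paper proposes. The function name `info_nce_loss`, the temperature of 0.07, and the random toy embeddings are hypothetical choices made for illustration. In the sketch, row i of the video batch and row i of the text batch form the related pair. Every other row in the batch acts as a negative.

```python
import numpy as np


def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE sketch (not the paper's implementation).

    Row i of video_emb and row i of text_emb are treated as a related
    (positive) pair; all other rows in the batch serve as negatives.
    """
    # L2-normalise so that dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B) similarity matrix

    def diag_cross_entropy(l):
        # Numerically stable row-wise log-softmax
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # Positives lie on the diagonal: maximise their log-probability,
        # which pulls related pairs together and pushes the rest apart
        return -np.mean(np.diag(log_probs))

    # Average the video->text and text->video directions
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


# Toy usage: text embeddings are noisy copies of the matching video embeddings
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))
text = video + 0.1 * rng.normal(size=(8, 16))
aligned = info_nce_loss(video, text)          # correct pairing
shuffled = info_nce_loss(video, text[::-1])   # every pair mismatched
print(aligned, shuffled)
```

The loss is low when the true pairs sit on the diagonal and high when the pairing is scrambled. This is the behaviour the abstract attributes to noise contrastive learning: it raises the similarity of known-related pairs and lowers it for all others.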

A Dynamic, Self Supervised, Large Scale AudioVisual Dataset for Stuttered Speech

Stuttering affects at least 1% of the world population. It is caused by irregular disruptions in speech production. These interruptions occur in various forms and frequencies. Repetition of words or parts of words, prolongations, or blocks in getting …