Real-World Repetition Estimation by Div, Grad and Curl – Runia, Tom F.H.; Snoek, Cees G.M.; Smeulders, Arnold W.M. – The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
Abstract | BibTeX
@inproceedings{Runia_2018_CVPR,
title = {Real-World Repetition Estimation by Div, Grad and Curl},
author = {Tom F.H. Runia and Cees G.M. Snoek and Arnold W.M. Smeulders},
url = {https://arxiv.org/abs/1802.09971},
year = {2018},
date = {2018-06-01},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {We consider the problem of estimating repetition in video, such as performing push-ups, cutting a melon or playing violin. Existing work shows good results under the assumption of static and stationary periodicity. As realistic video is rarely perfectly static and stationary, the often preferred Fourier-based measurement is inapt. Instead, we adopt the wavelet transform to better handle non-static and non-stationary video dynamics. From the flow field and its differentials, we derive three fundamental motion types and three motion continuities of intrinsic periodicity in 3D. On top of this, the 2D perception of 3D periodicity considers two extreme viewpoints. What follows are 18 fundamental cases of recurrent perception in 2D. In practice, to deal with the variety of repetitive appearance, our theory implies measuring time-varying flow and its differentials (gradient, divergence and curl) over segmented foreground motion. For experiments, we introduce the new QUVA Repetition dataset, reflecting reality by including non-static and non-stationary videos. On the task of counting repetitions in video, we obtain favorable results compared to a deep learning alternative.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
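As an illustration of the measurement pipeline the abstract describes, here is a minimal Python sketch, assuming optical flow and a foreground mask are computed elsewhere; the scalar reduction (mean absolute divergence), the Morlet wavelet and the scale range are illustrative choices rather than the paper's exact configuration.

import numpy as np
import pywt

def flow_differentials(flow):
    """flow: (H, W, 2) array holding the x- and y-components of optical flow."""
    u, v = flow[..., 0], flow[..., 1]
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    divergence = du_dx + dv_dy   # expansion / contraction
    curl = dv_dx - du_dy         # rotation
    return divergence, curl

def repetition_signal(flows, mask=None):
    """Reduce each frame's flow field to one scalar (here: mean absolute divergence)."""
    signal = []
    for flow in flows:
        div, _ = flow_differentials(flow)
        if mask is not None:
            div = div[mask]      # restrict to the segmented foreground
        signal.append(np.abs(div).mean())
    return np.asarray(signal)

def dominant_period(signal, fps, scales=np.arange(1, 64)):
    """Continuous wavelet transform; the strongest scale per time step gives a time-varying period estimate."""
    coeffs, freqs = pywt.cwt(signal - signal.mean(), scales, "morl", sampling_period=1.0 / fps)
    best = np.abs(coeffs).argmax(axis=0)
    return 1.0 / freqs[best]     # period in seconds, per frame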
Tracking by Natural Language Specification – Li, Zhenyang; Tao, Ran; Gavves, Efstratios; Snoek, Cees GM; Smeulders, Arnold WM – IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
BibTeX
@conference{li2017tracking,
title = {Tracking by Natural Language Specification},
author = {Li, Zhenyang and Tao, Ran and Gavves, Efstratios and Snoek, Cees GM and Smeulders, Arnold WM},
year = {2017},
date = {2017-01-01},
booktitle = {Computer Vision and Pattern Recognition, 2017. CVPR 2017. IEEE Conference on},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
Self-supervised video representation learning with odd-one-out networks – Fernando, Basura; Bilen, Hakan; Gavves, Efstratios; Gould, Stephen – IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017
BibTeX
@inproceedings{fernando2016self,
title = {Self-supervised video representation learning with odd-one-out networks},
author = {Fernando, Basura and Bilen, Hakan and Gavves, Efstratios and Gould, Stephen},
year = {2017},
date = {2017-01-01},
booktitle = {Computer Vision and Pattern Recognition, 2017. CVPR 2017. IEEE Conference on},
organization = {IEEE},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
A unified embedding for zero exemplar event detection – Hussein, Noureldien; Gavves, Efstratios; Smeulders, Arnold WM – IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017
BibTeX
@inproceedings{nour2017unified,
title = {A unified embedding for zero exemplar event detection},
author = {Hussein, Noureldien and Gavves, Efstratios and Smeulders, Arnold WM},
year = {2017},
date = {2017-01-01},
booktitle = {Computer Vision and Pattern Recognition, 2017. CVPR 2017. IEEE Conference on},
organization = {IEEE},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
On the Theory and Practice of Privacy-Preserving Bayesian Data Analysis – Foulds, James; Geumlek, Joseph; Welling, Max; Chaudhuri, Kamalika – arXiv, 2016
Abstract | BibTeX
@inproceedings{FouldsArxive16,
title = {On the Theory and Practice of Privacy-Preserving Bayesian Data Analysis},
author = {James R. Foulds and Joseph Geumlek and Max Welling and Kamalika Chaudhuri},
year = {2016},
date = {2016-01-01},
booktitle = {arXiv},
abstract = {Bayesian inference has great promise for the privacy-preserving analysis of sensitive data, as posterior sampling automatically preserves differential privacy, an algorithmic notion of data privacy, under certain conditions (Wang et al., 2015). While Wang et al. (2015)'s one posterior sample (OPS) approach elegantly provides privacy "for free," it is data inefficient in the sense of asymptotic relative efficiency (ARE). We show that a simple alternative based on the Laplace mechanism, the workhorse technique of differential privacy, is as asymptotically efficient as non-private posterior inference, under general assumptions. The Laplace mechanism has additional practical advantages including efficient use of the privacy budget for MCMC. We demonstrate the practicality of our approach on a time-series analysis of sensitive military records from the Afghanistan and Iraq wars disclosed by the Wikileaks organization.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
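For readers who want to see the Laplace-mechanism alternative to one-posterior-sample (OPS) in the smallest possible setting, here is a hedged Python sketch for a toy Beta-Bernoulli model: perturb the sufficient statistic, then compute the posterior from the noisy statistic. The model, the clamping and the epsilon value are illustrative assumptions, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)

def laplace_private_count(x, epsilon):
    """x: binary observations; the sum has sensitivity 1, so the noise is Lap(1/epsilon)."""
    noisy_sum = float(np.sum(x)) + rng.laplace(scale=1.0 / epsilon)
    return min(max(noisy_sum, 0.0), float(len(x)))  # clamp to a valid count

def private_beta_posterior(x, epsilon, a0=1.0, b0=1.0):
    """Beta(a, b) posterior computed from the privatised sufficient statistic."""
    s = laplace_private_count(x, epsilon)
    return a0 + s, b0 + len(x) - s

x = rng.binomial(1, 0.3, size=1000)
a, b = private_beta_posterior(x, epsilon=0.5)
print("private posterior mean:", a / (a + b))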
VideoLSTM Convolves, Attends and Flows for Action Recognition – Li, Zhenyang; Gavves, Efstratios; Jain, Mihir; Snoek, Cees – arXiv, 2016
BibTeX
@inproceedings{LiArxive16,
title = {VideoLSTM Convolves, Attends and Flows for Action Recognition},
author = {Zhenyang Li and Efstratios Gavves and Mihir Jain and Cees G. M. Snoek},
year = {2016},
date = {2016-01-01},
booktitle = {arXiv},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Deep Spiking Networks – O'Connor, Peter; Welling, Max – arXiv, 2016
BibTeX
@inproceedings{OConnorArxive16,
title = {Deep Spiking Networks},
author = {Peter O'Connor and Max Welling},
year = {2016},
date = {2016-01-01},
booktitle = {arXiv},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Practical Privacy For Expectation Maximization – Park, Mijung; Foulds, Jimmy; Chaudhuri, Kamalika; Welling, Max – arXiv, 2016
Abstract | BibTeX
@inproceedings{ParkArxive16,
title = {Practical Privacy For Expectation Maximization},
author = {Mijung Park and Jimmy Foulds and Kamalika Chaudhuri and Max Welling},
year = {2016},
date = {2016-01-01},
booktitle = {arXiv},
abstract = {Expectation maximization (EM) is an iterative algorithm that computes maximum likelihood and maximum a posteriori estimates for models with unobserved variables. While widely used, the iterative nature of EM presents challenges for privacy-preserving estimation. Multiple iterations are required to obtain accurate parameter estimates, yet each iteration increases the amount of noise that must be added to achieve a reasonable degree of privacy. We propose a practical algorithm that overcomes this challenge and outputs EM parameter estimates that are both accurate and private. Our algorithm focuses on the frequent use case of models whose joint distribution over observed and unobserved variables remains in the exponential family. For these models, the EM parameters are functions of moments of the data. Our algorithm leverages this to preserve privacy by perturbing the moments, for which the amount of additive noise scales naturally with the data. In addition, our algorithm uses a relaxed notion of the differential privacy (DP) gold standard, called concentrated differential privacy (CDP). Rather than focusing on single-query loss, CDP provides high probability bounds for cumulative privacy loss, which is well suited for iterative algorithms. For mixture models, we show that our method requires a significantly smaller privacy budget for the same estimation accuracy compared to both DP and its (epsilon, delta)-DP relaxation. Our general approach of moment perturbation equipped with CDP can be readily extended to many iterative machine learning algorithms, which opens up various exciting future directions.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
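To make the moment-perturbation idea concrete, here is a hedged Python sketch of one privatised M-step for a Gaussian mixture: the per-component moment sums are perturbed and the parameters are recomputed from the noisy moments. The noise scales are placeholders, not the concentrated-differential-privacy calibration used in the paper.

import numpy as np

rng = np.random.default_rng(0)

def private_m_step(X, resp, noise_scale=0.05):
    """X: (n, d) data; resp: (n, K) E-step responsibilities."""
    n, d = X.shape
    K = resp.shape[1]
    Nk = resp.sum(axis=0) + noise_scale * rng.standard_normal(K)
    Nk = np.maximum(Nk, 1e-6)                       # keep counts positive
    S1 = resp.T @ X + noise_scale * rng.standard_normal((K, d))
    means = S1 / Nk[:, None]
    covs = []
    for k in range(K):
        S2 = (resp[:, k, None] * X).T @ X + noise_scale * rng.standard_normal((d, d))
        cov = S2 / Nk[k] - np.outer(means[k], means[k])
        covs.append((cov + cov.T) / 2 + 1e-3 * np.eye(d))  # symmetrise and regularise
    weights = Nk / Nk.sum()
    return weights, means, np.stack(covs)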
Rank Pooling for Action Recognition – Fernando, Basura; Gavves, Efstratios; Oramas, Jose; Ghodrati, Amir; Tuytelaars, Tinne – IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016
Abstract | BibTeX
@article{FernandoPAMI16,
title = {Rank Pooling for Action Recognition},
author = {Basura Fernando and Efstratios Gavves and Jose Oramas and Amir Ghodrati and Tinne Tuytelaars},
year = {2016},
date = {2016-01-01},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
abstract = {We propose a function-based temporal pooling method that captures the latent structure of the video sequence data - e.g. how frame-level features evolve over time in a video. We show how the parameters of a function that has been fit to the video data can serve as a robust new video representation. As a specific example, we learn a pooling function via ranking machines. By learning to rank the frame-level features of a video in chronological order, we obtain a new representation that captures the video-wide temporal dynamics of a video, suitable for action recognition. Other than ranking functions, we explore different parametric models that could also explain the temporal changes in videos. The proposed functional pooling methods, and rank pooling in particular, are easy to interpret and implement, fast to compute and effective in recognizing a wide variety of actions. We evaluate our method on various benchmarks for generic action, fine-grained action and gesture recognition. Results show that rank pooling brings an absolute improvement of 7-10 points over the average pooling baseline. At the same time, rank pooling is compatible with and complementary to several appearance and local motion based methods and features, such as improved trajectories and deep learning features.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
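A hedged Python sketch of the rank-pooling idea follows: fit a linear function whose scores increase with frame order over smoothed frame features, and use its parameter vector as the video descriptor. The paper fits a ranking machine; the ridge regression onto frame indices below is a simpler stand-in.

import numpy as np

def rank_pool(frame_features, reg=1.0):
    """frame_features: (T, D) array with one feature vector per frame."""
    X = np.cumsum(frame_features, axis=0)
    X /= np.arange(1, len(X) + 1)[:, None]      # time-varying mean of the frames
    t = np.arange(1, len(X) + 1, dtype=float)   # chronological targets
    d = X.shape[1]
    # Ridge-regularised least squares: w = (X^T X + reg * I)^-1 X^T t
    w = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ t)
    return w                                    # video-level representation

video = np.random.rand(120, 512)                # e.g. 120 frames of CNN features
descriptor = rank_pool(video)
print(descriptor.shape)                         # (512,)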
Online Action Detection – Geest, Roeland De; Gavves, Efstratios; Ghodrati, Amir; Li, Zhenyang; Snoek, Cees; Tuytelaars, Tinne – ECCV, 2016
BibTeX
@inproceedings{DeGeestECCV16,
title = {Online Action Detection},
author = {Roeland De Geest and Efstratios Gavves and Amir Ghodrati and Zhenyang Li and Cees Snoek and Tinne Tuytelaars},
year = {2016},
date = {2016-01-01},
booktitle = {ECCV},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Spot On: Action Localization from Pointly-Supervised Proposals – Mettes, Pascal; van Gemert, Jan; Snoek, Cees – ECCV, 2016
BibTeX
@inproceedings{MettesECCV16,
title = {Spot On: Action Localization from Pointly-Supervised Proposals},
author = {Pascal Mettes and Jan C. van Gemert and Cees G. M. Snoek},
year = {2016},
date = {2016-01-01},
booktitle = {ECCV},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Video Stream Retrieval of Unseen Queries using Semantic Memory – Cappallo, Spencer; Mensink, Thomas; Snoek, Cees – BMVC, 2016
BibTeX
@inproceedings{CappalloBMVC16,
title = {Video Stream Retrieval of Unseen Queries using Semantic Memory},
author = {Spencer Cappallo and Thomas Mensink and Cees G. M. Snoek},
year = {2016},
date = {2016-01-01},
booktitle = {BMVC},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Dynamic Image Networks for Action Recognition – Bilen, Hakan; Fernando, Basura; Gavves, Efstratios; Vedaldi, Andrea; Gould, Stephen – IEEE Conference on Computer Vision and Pattern Recognition, 2016
Abstract | BibTeX
@inproceedings{BilenCVPR16,
title = {Dynamic Image Networks for Action Recognition},
author = {Hakan Bilen and Basura Fernando and Efstratios Gavves and Andrea Vedaldi and Stephen Gould},
year = {2016},
date = {2016-01-01},
booktitle = {CVPR},
abstract = {We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis especially when convolutional neural networks (CNNs) are used. The dynamic image is based on the rank pooling concept and is obtained through the parameters of a ranking machine that encodes the temporal evolution of the frames of the video. Dynamic images are obtained by directly applying rank pooling on the raw image pixels of a video producing a single RGB image per video. This idea is simple but powerful as it enables the use of existing CNN models directly on video data with fine-tuning. We present an efficient and effective approximate rank pooling operator, speeding it up orders of magnitude compared to rank pooling. Our new approximate rank pooling CNN layer allows us to generalize dynamic images to dynamic feature maps and we demonstrate the power of our new representations on standard benchmarks in action recognition achieving state-of-the-art performance.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
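As a hedged illustration, a dynamic image can be sketched as a single weighted sum of the RGB frames; the closed-form approximate rank-pooling coefficients used below are the ones commonly associated with this line of work, so treat the exact weighting as an assumption and verify it against the paper before reuse.

import numpy as np

def dynamic_image(frames):
    """frames: (T, H, W, 3) array of video frames."""
    T = len(frames)
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    t = np.arange(1, T + 1)
    alpha = 2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
    img = np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))
    img -= img.min()                             # rescale to a displayable 8-bit image
    img /= max(img.max(), 1e-8)
    return (255 * img).astype(np.uint8)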
Structured Receptive Fields in CNNs – Jacobsen, Joern-Henrik; van Gemert, Jan; Lou, Zhongyu; Smeulders, Arnold – IEEE Conference on Computer Vision and Pattern Recognition, 2016
Abstract | BibTeX
@inproceedings{JacobsenCVPR16,
title = {Structured Receptive Fields in CNNs},
author = {Joern-Henrik Jacobsen and Jan van Gemert and Zhongyu Lou and Arnold W.M. Smeulders},
year = {2016},
date = {2016-01-01},
booktitle = {CVPR},
abstract = {Learning powerful feature representations with CNNs is hard when training data are limited. Pre-training is one way to overcome this, but it requires large datasets sufficiently similar to the target domain. Another option is to design priors into the model, which can range from tuned hyperparameters to fully engineered representations like Scattering Networks. We combine these ideas into structured receptive field networks, a model which has a fixed filter basis and yet retains the flexibility of CNNs. This flexibility is achieved by expressing receptive fields in CNNs as a weighted sum over a fixed basis which is similar in spirit to Scattering Networks. The key difference is that we learn arbitrary effective filter sets from the basis rather than modeling the filters. This approach explicitly connects classical multiscale image analysis with general CNNs. With structured receptive field networks, we improve considerably over unstructured CNNs for small and medium dataset scenarios as well as over Scattering for large datasets. We validate our findings on ILSVRC2012, Cifar-10, Cifar-100 and MNIST. As a realistic small dataset example, we show state-of-the-art classification results on popular 3D MRI brain-disease datasets where pre-training is difficult due to a lack of large public datasets in a similar domain.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
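A hedged Python sketch of the structured receptive field idea: an effective filter is a learned weighted sum over a fixed Gaussian-derivative basis rather than a free-form kernel. The basis order, sigma, kernel size and the random toy weights are illustrative assumptions.

import numpy as np

def gaussian_derivative_basis(sigma=1.5, size=7, max_order=2):
    """Separable Gaussian derivatives up to the given total order in x and y."""
    x = np.arange(size) - size // 2
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    d1 = -x / sigma**2 * g                        # first derivative of the Gaussian
    d2 = (x**2 / sigma**4 - 1 / sigma**2) * g     # second derivative
    profiles = [g, d1, d2][: max_order + 1]
    basis = []
    for oy, gy in enumerate(profiles):
        for ox, gx in enumerate(profiles):
            if ox + oy <= max_order:
                basis.append(np.outer(gy, gx))    # 2D kernel as an outer product
    return np.stack(basis)                        # (B, size, size)

basis = gaussian_derivative_basis()
weights = np.random.randn(len(basis))             # these would be learned per filter
effective_filter = np.tensordot(weights, basis, axes=(0, 0))
print(effective_filter.shape)                     # (7, 7)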
Deep Reflectance Maps – Rematas, Konstantinos; Ritschel, Tobias; Fritz, Mario; Gavves, Efstratios; Tuytelaars, Tinne – IEEE Conference on Computer Vision and Pattern Recognition, 2016
Abstract | BibTeX
@inproceedings{RematasCVPR16,
title = {Deep Reflectance Maps},
author = {Konstantinos Rematas and Tobias Ritschel and Mario Fritz and Efstratios Gavves and Tinne Tuytelaars},
year = {2016},
date = {2016-01-01},
booktitle = {CVPR},
abstract = {Undoing the image formation process and therefore decomposing appearance into its intrinsic properties is a challenging task due to the under-constrained nature of this inverse problem. While significant progress has been made on inferring shape, materials and illumination from images only, progress in an unconstrained setting is still limited. We propose a convolutional neural architecture to estimate reflectance maps of specular materials in natural lighting conditions. We achieve this in an end-to-end learning formulation that directly predicts a reflectance map from the image itself. We show how to improve estimates by facilitating additional supervision in an indirect scheme that first predicts surface orientation and afterwards predicts the reflectance map by a learning-based sparse data interpolation. In order to analyze performance on this difficult task, we propose a new challenge of Specular MAterials on SHapes with complex IllumiNation (SMASHINg) using both synthetic and real images. Furthermore, we show the application of our method to a range of image-based editing tasks on real images.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Siamese Instance Search for Tracking – Tao, Ran; Gavves, Efstratios; Smeulders, Arnold – IEEE Conference on Computer Vision and Pattern Recognition, 2016
Abstract | BibTeX
@inproceedings{TaoCVPR16,
title = {Siamese Instance Search for Tracking},
author = {Ran Tao and Efstratios Gavves and Arnold W.M. Smeulders},
year = {2016},
date = {2016-01-01},
booktitle = {CVPR},
abstract = {In this paper we present a tracker, which is radically different from state-of-the-art trackers: we apply no model updating, no occlusion detection, no combination of trackers, no geometric matching, and still deliver state-of-the-art tracking performance, as demonstrated on the popular online tracking benchmark (OTB) and six very challenging YouTube videos. The presented tracker simply matches the initial patch of the target in the first frame with candidates in a new frame and returns the most similar patch by a learned matching function. The strength of the matching function comes from being extensively trained generically, i.e., without any data of the target, using a Siamese deep neural network, which we design for tracking. Once learned, the matching function is used as is, without any adapting, to track previously unseen targets. It turns out that the learned matching function is so powerful that a simple tracker built upon it, coined Siamese INstance search Tracker, SINT, which only uses the original observation of the target from the first frame, suffices to reach state-of-the-art performance. Further, we show the proposed tracker even allows for target re-identification after the target was absent for a complete video shot.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
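A hedged Python sketch of the matching step only: embed the first-frame target patch once, embed candidate crops in each new frame with the same frozen network, and return the most similar candidate. The embed argument stands in for the learned Siamese branch, which this sketch does not define.

import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def track_frame(target_embedding, candidate_patches, embed):
    """candidate_patches: list of image crops; embed: function mapping a patch to a feature vector."""
    scores = [cosine_similarity(target_embedding, embed(p)) for p in candidate_patches]
    best = int(np.argmax(scores))
    return best, scores[best]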
A note on privacy preserving iteratively reweighted least squares – Park, Mijung; Welling, Max – ICML workshops, 2016
Abstract | BibTeX
@conference{ParkICML16,
title = {A note on privacy preserving iteratively reweighted least squares},
author = {Mijung Park and Max Welling},
year = {2016},
date = {2016-01-01},
booktitle = {ICML workshops},
abstract = {Iteratively reweighted least squares (IRLS) is a widely-used method in machine learning to estimate the parameters of generalised linear models. In particular, IRLS for L1 minimisation under the linear model provides a closed-form solution in each step, which is a simple multiplication between the inverse of the weighted second moment matrix and the weighted first moment vector. When dealing with privacy sensitive data, however, developing a privacy preserving IRLS algorithm faces two challenges. First, due to the inversion of the second moment matrix, the usual sensitivity analysis in differential privacy incorporating a single datapoint perturbation gets complicated and often requires unrealistic assumptions. Second, due to its iterative nature, a significant cumulative privacy loss occurs. However, adding a high level of noise to compensate for the privacy loss hinders obtaining accurate estimates. Here, we develop a practical algorithm that overcomes these challenges and outputs privatised and accurate IRLS solutions. In our method, we analyse the sensitivity of each moment separately and treat the matrix inversion and multiplication as a post-processing step, which simplifies the sensitivity analysis. Furthermore, we apply the concentrated differential privacy formalism, a more relaxed version of differential privacy, which requires adding a significantly smaller amount of noise for the same level of privacy guarantee, compared to the conventional and advanced compositions of differentially private mechanisms.},
keywords = {},
pubstate = {published},
tppubtype = {conference}
}
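A hedged Python sketch of the moment-perturbation idea for one weighted-least-squares step: privatise the weighted second-moment matrix and first-moment vector separately, then treat the matrix solve as post-processing. The noise scales and the clipping norm are illustrative, not the paper's calibrated values.

import numpy as np

rng = np.random.default_rng(0)

def private_wls_step(X, y, w, noise_scale=0.1, clip=1.0):
    """One privatised weighted-least-squares step; X: (n, d), y: (n,), w: (n,) IRLS weights."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xc = X / np.maximum(norms / clip, 1.0)         # bound each row's contribution
    M2 = (Xc * w[:, None]).T @ Xc                  # weighted second moment
    m1 = (Xc * w[:, None]).T @ y                   # weighted first moment
    d = X.shape[1]
    M2_noisy = M2 + noise_scale * rng.standard_normal((d, d))
    M2_noisy = (M2_noisy + M2_noisy.T) / 2         # keep the perturbed matrix symmetric
    m1_noisy = m1 + noise_scale * rng.standard_normal(d)
    return np.linalg.solve(M2_noisy + 1e-3 * np.eye(d), m1_noisy)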
Sigma Delta Quantized Networks – O'Connor, Peter; Welling, Max – CoRR, abs/1611.02024, 2016
BibTeX
@article{DBLP:journals/corr/OConnorW16a,
title = {Sigma Delta Quantized Networks},
author = {Peter O'Connor and Max Welling},
url = {http://arxiv.org/abs/1611.02024},
year = {2016},
date = {2016-01-01},
journal = {CoRR},
volume = {abs/1611.02024},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Learning-to-rank based on subsequences – Fernando, Basura; Gavves, Efstratios; Muselet, Damien; Tuytelaars, Tinne – ICCV, 2015
Abstract | BibTeX
@inproceedings{FernandoICCV15,
title = {Learning-to-rank based on subsequences},
author = {Basura Fernando and Efstratios Gavves and Damien Muselet and Tinne Tuytelaars},
year = {2015},
date = {2015-01-01},
booktitle = {ICCV},
abstract = {We present a supervised learning to rank algorithm that effectively orders images by exploiting the structure in image sequences. Most often in the supervised learning to rank literature, ranking is approached either by analyzing pairs of images or by optimizing a list-wise surrogate loss function on full sequences. In this work we propose MidRank, which learns from moderately sized sub-sequences instead. These sub-sequences contain useful structural ranking information that leads to better learnability during training and better generalization during testing. By exploiting sub-sequences, the proposed MidRank improves ranking accuracy considerably on an extensive array of image ranking applications and datasets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks – Gavves, Efstratios; Mensink, Thomas; Tommasi, Tatiana; Snoek, Cees; Tuytelaars, Tinne – ICCV, 2015
Abstract | BibTeX
@inproceedings{GavvesICCV15,
title = {Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks},
author = {Efstratios Gavves and Thomas Mensink and Tatiana Tommasi and Cees G. M. Snoek and Tinne Tuytelaars},
year = {2015},
date = {2015-01-01},
booktitle = {ICCV},
abstract = {How can we reuse existing knowledge, in the form of available datasets, when solving a new and apparently unrelated target task from a set of unlabeled data? In this work we make a first contribution to answer this question in the context of image classification. We frame this quest as an active learning problem and use zero-shot classifiers to guide the learning process by linking the new task to the existing classifiers. By revisiting the dual formulation of adaptive SVM, we reveal two basic conditions to choose greedily only the most relevant samples to be annotated. On this basis we propose an effective active learning algorithm which learns the best possible target classification model with minimum human labeling effort. Extensive experiments on two challenging datasets show the value of our approach compared to the state-of-the-art active learning methodologies, as well as its potential to reuse past datasets with minimal effort for future tasks.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Objects2action: Classifying and localizing actions without any video example – Jain, Mihir; van Gemert, Jan; Mensink, Thomas; Snoek, Cees – ICCV, 2015
Abstract | BibTeX
@inproceedings{JainICCV15,
title = {Objects2action: Classifying and localizing actions without any video example},
author = {Mihir Jain and Jan C. van Gemert and Thomas Mensink and Cees G. M. Snoek},
year = {2015},
date = {2015-01-01},
booktitle = {ICCV},
abstract = {The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to allow for transfer from seen classes to unseen classes. Our key contribution is objects2action, a semantic word embedding that is spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate for the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. And finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
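A hedged Python sketch of the zero-shot transfer: score an unseen action as a convex combination of per-video object scores, weighted by word-embedding similarity between the action label and the object names. The word_vec lookup (assumed to return unit-norm vectors) and the top-k value are illustrative assumptions.

import numpy as np

def action_score(action_label, object_names, object_probs, word_vec, top_k=10):
    """object_probs: (num_objects,) classifier scores for one video; word_vec: name -> unit-norm vector."""
    a = word_vec(action_label)
    sims = np.array([a @ word_vec(name) for name in object_names])
    keep = np.argsort(sims)[-top_k:]               # automated selection of the most responsive objects
    weights = np.exp(sims[keep])
    weights /= weights.sum()                       # convex combination of object affinities
    return float(weights @ object_probs[keep])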
APT: Action localization proposals from dense trajectories – van Gemert, Jan; Jain, Mihir; Gati, Ella; Snoek, Cees – BMVC, 2015
Abstract | BibTeX
@inproceedings{GemertBMVC15,
title = {APT: Action localization proposals from dense trajectories},
author = {Jan van Gemert and Mihir Jain and Ella Gati and Cees G. M. Snoek},
year = {2015},
date = {2015-01-01},
booktitle = {BMVC},
abstract = {This paper is on action localization in video with the aid of spatio-temporal proposals. To alleviate the computationally expensive segmentation step of existing proposals, we propose bypassing the segmentations completely by generating proposals directly from the dense trajectories used to represent videos during classification. Our Action localization Proposals from dense Trajectories (APT) use an efficient proposal generation algorithm to handle the high number of trajectories in a video. Our spatio-temporal proposals are faster than current methods and outperform the localization and classification accuracy of current proposals on the UCF Sports, UCF 101, and MSR-II video datasets. Corrected version: we fixed a mistake in our UCF-101 ground truth. Numbers are different; conclusions are unchanged.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams – Nagel, Markus; Mensink, Thomas; Snoek, Cees – BMVC, 2015
Abstract | BibTeX
@inproceedings{NagelBMVC15,
title = {Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams},
author = {Markus Nagel and Thomas Mensink and Cees G. M. Snoek},
year = {2015},
date = {2015-01-01},
booktitle = {BMVC},
abstract = {In this paper we focus on event recognition in visual image streams. More specifically, we aim to construct a compact representation which encodes the diversity of the visual stream from just a few observations. For this purpose, we introduce the Event Fisher Vector, a Fisher Kernel based representation to describe a collection of images or the sequential frames of a video. We explore different generative models beyond the Gaussian mixture model as underlying probability distribution. First, the Student's-t mixture model which captures the heavy tails of the small sample size of a collection of images. Second, Hidden Markov Models to explicitly capture the temporal ordering of the observations in a stream. For all our models we derive analytical approximations of the Fisher information matrix, which significantly improves recognition performance. We extensively evaluate the properties of our proposed method on three recent datasets for event recognition in photo collections and web videos, leading to an efficient compact image representation which achieves state-of-the-art performance on all these datasets.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
What do 15,000 object categories tell us about classifying and localizing actions? – Jain, Mihir; van Gemert, Jan; Snoek, Cees – IEEE Conference on Computer Vision and Pattern Recognition, 2015
Abstract | BibTeX
@inproceedings{JainCVPR15,
title = {What do 15,000 object categories tell us about classifying and localizing actions?},
author = {Mihir Jain and Jan C. van Gemert and Cees G. M. Snoek},
year = {2015},
date = {2015-01-01},
booktitle = {CVPR},
abstract = {This paper contributes to automatic classification and localization of human actions in video. Whereas motion is the key ingredient in modern approaches, we assess the benefits of having objects in the video representation. Rather than considering a handful of carefully selected and localized objects, we conduct an empirical study on the benefit of encoding 15,000 object categories for action using 6 datasets totaling more than 200 hours of video and covering 180 action classes. Our key contributions are i) the first in-depth study of encoding objects for actions, ii) we show that objects matter for actions, and are often semantically relevant as well. iii) We establish that actions have object preferences. Rather than using all objects, selection is advantageous for action recognition. iv) We reveal that object-action relations are generic, which allows transferring these relationships from one domain to the other. And, v) objects, when combined with motion, improve the state-of-the-art for both action classification and localization.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Attributes and Categories for Generic Instance Search from One Example – Tao, Ran; Smeulders, Arnold; Chang, Shih-Fu – IEEE Conference on Computer Vision and Pattern Recognition, 2015
BibTeX
@inproceedings{TaoCVPR15,
title = {Attributes and Categories for Generic Instance Search from One Example},
author = {Ran Tao and Arnold W.M. Smeulders and Shih-Fu Chang},
year = {2015},
date = {2015-01-01},
booktitle = {CVPR},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}
Local Alignments for Fine-Grained Categorization – Gavves, Efstratios; Fernando, Basura; Snoek, Cees; Smeulders, Arnold; Tuytelaars, Tinne – International Journal of Computer Vision, 111 (2), pp. 191–212, 2015
Abstract | BibTeX
@article{GavvesIJCV15,
title = {Local Alignments for Fine-Grained Categorization},
author = {Efstratios Gavves and Basura Fernando and Cees G. M. Snoek and Arnold W. M. Smeulders and Tinne Tuytelaars},
year = {2015},
date = {2015-01-01},
journal = {International Journal of Computer Vision},
volume = {111},
number = {2},
pages = {191--212},
abstract = {The aim of this paper is fine-grained categorization without human interaction. Different from prior work, which relies on detectors for specific object parts, we propose to localize distinctive details by roughly aligning the objects using just the overall shape. Then, one may proceed to the differential classification by examining the corresponding regions of the alignments. More specifically, the alignments are used to transfer part annotations from training images to unseen images (supervised alignment), or to blindly yet consistently segment the object in a number of regions (unsupervised alignment). We further argue that for the distinction of sub-classes, distribution-based features like color Fisher vectors are better suited for describing localized appearance of fine-grained categories than popular matching oriented intensity features, like HOG. They allow capturing the subtle local differences between subclasses, while at the same time being robust to misalignments between distinctive details. We evaluate the local alignments on the CUB-2011 and on the Stanford Dogs datasets, composed of 200 and 120, visually very hard to distinguish bird and dog species. In our experiments we study and show the benefit of the color Fisher vector parameterization, the influence of the alignment partitioning, and the significance of object segmentation on fine-grained categorization. We, furthermore, show that by using object detectors as voters to generate object confidence saliency maps, we arrive at fully unsupervised, yet highly accurate fine-grained categorization. The proposed local alignments set a new state-of-the-art on both the fine-grained birds and dogs datasets, even without any human intervention. What is more, the local alignments reveal what appearance details are most decisive per fine-grained object category.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}
Convolutional Neural Networks in the Brain: an fMRI study – Ramakrishnan, K.; Scholte, H. S.; Lamme, V. A. F.; Smeulders, A. W. M.; Ghebreab, S. – Journal of Vision, 2015
Abstract | BibTeX
@article{RamakrishnanJV2015,
title = {Convolutional Neural Networks in the Brain: an fMRI study},
author = {Ramakrishnan, K. and Scholte, H. S. and Lamme, V. A. F. and Smeulders, A. W. M. and Ghebreab, S.},
year = {2015},
date = {2015-01-01},
journal = {Journal of Vision},
abstract = {Biologically inspired computational models replicate the hierarchical visual processing in the human ventral stream. One such recent model, the Convolutional Neural Network (CNN), has achieved state of the art performance on automatic visual recognition tasks. The CNN architecture contains successive layers of convolution and pooling, and resembles the simple and complex cell hierarchy as proposed by Hubel and Wiesel. This makes it a candidate model to test against the human brain. In this study we look at 1) where in the brain different layers of the CNN account for brain responses, and 2) how the CNN network compares against existing and widely used hierarchical vision models such as Bag-of-Words (BoW) and HMAX. fMRI brain activity of 20 subjects obtained while viewing a short video clip was analyzed voxel-wise using a distance-based variation partitioning method. Variation partitioning was done on successive CNN layers to determine the unique contribution of each layer in explaining fMRI brain activity. We observe that each of the 7 different layers of CNN accounts for brain activity consistently across subjects in areas known to be involved in visual processing. In addition, we find a relation between the visual processing hierarchy in the brain and the 7 CNN layers: visual areas such as V1, V2 and V3 are sensitive to lower layers of the CNN while areas such as LO, TO and PPA are sensitive to higher layers. The comparison of CNN with HMAX and BoW furthermore shows that while all three models explain brain activity in early visual areas, the CNN additionally explains brain activity deeper in the brain. Overall, our results suggest that Convolutional Neural Networks provide a suitable computational basis for visual processing in the brain, allowing us to decode feed-forward representations in the visual brain. Meeting abstract presented at VSS 2015.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}