QUVA Colloquium

The goal of the Qualcomm-UvA Deep Vision Seminars is to invite seminal guest speakers to provide talks on the latest advances in the areas of Deep Learning, Computer Vision, and Machine Learning.

Next Seminar: Riza Alp Guler, “DensePose: Dense Human Pose Estimation In The Wild” (Apr 6, 11.00-12.00).

April 6, 2018, Room C0.110 – Invited Talk by Riza Alp Guler from École Centrale Paris
Title: DensePose: Dense Human Pose Estimation In The Wild
Abstract: Non-planar object deformations result in challenging but informative signal variations. We aim to recover this information in a feedforward manner by employing discriminatively trained convolutional networks. We formulate this task as a regression problem and train our networks by leveraging manually annotated correspondences between images and 3D surfaces. We show that we can combine ideas from semantic segmentation with regression networks, yielding a highly accurate ‘quantized regression’ architecture, which has been shown to perform well for the task of establishing dense correspondences to a template face surface using fully convolutional networks. In this talk, the focus will be on our recent work “DensePose”, where we show that the same approach can be used to establish dense correspondences to a surface-based representation of the human body. We form the “COCO-DensePose” dataset by introducing an efficient annotation pipeline to collect correspondences between 50K persons appearing in the COCO dataset and the SMPL 3D deformable human-body model. We then use our dataset to train CNN-based systems that deliver dense correspondences ‘in the wild’, namely in the presence of background, occlusions, multiple objects and scale variations. We experiment with fully convolutional networks and a region-based DensePose-RCNN model and observe the superiority of the latter; we further improve accuracy through cascading, obtaining a system that delivers highly accurate results in real time.
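The ‘quantized regression’ idea can be sketched in a few lines: a classifier first picks a discrete surface chart (body part) for each pixel, and a chart-specific regressor then supplies continuous (u, v) coordinates within that chart. The toy NumPy snippet below is only an illustration, not code from the talk; the random tensors stand in for network outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Quantized regression": classify each pixel into one of K surface charts,
# then read continuous (u, v) coordinates from that chart's regressor.
K = 4                # number of surface charts (DensePose itself uses 24)
H, W = 2, 3          # a tiny feature map for illustration

part_logits = rng.normal(size=(K + 1, H, W))    # +1 channel for background
uv_regression = rng.uniform(size=(K, 2, H, W))  # per-chart (u, v) fields

part = part_logits.argmax(axis=0)               # hard chart assignment

uv = np.zeros((2, H, W))
for y in range(H):
    for x in range(W):
        k = part[y, x]
        if k > 0:                                # label 0 = background
            uv[:, y, x] = uv_regression[k - 1, :, y, x]
```

The classification step bounds the regression problem: each regressor only has to be accurate within its own chart, rather than over the whole body surface.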

March 16, 2018, Room C0.05 – Invited Talk by Dr. I. Tolstikhin from the Max Planck Institute for Intelligent Systems
Title: Wasserstein Auto-Encoders: from optimal transport to generative modeling and beyond
Abstract: The modern field of unsupervised generative modeling and representation learning is growing rapidly. The empirical success of recently introduced methods such as variational auto-encoders (VAEs) and generative adversarial nets (GANs) is attracting the attention of more and more Machine Learning researchers. In this talk I will briefly sketch one particular way to look at VAEs and GANs which emphasizes their pros and cons. This viewpoint naturally leads to a new Wasserstein auto-encoder (WAE) algorithm. WAE shares many of the nice properties of VAEs (stable training, an encoder-decoder architecture, a well-behaved latent manifold structure) while generating samples of better quality, as measured by the recently introduced FID scores. I will also discuss more recent work on WAEs, where we (a) address the problem of choosing the dimensionality of the latent space, and (b) highlight the potential of WAEs for representation learning, with promising results on a benchmark disentanglement task (dSprites). The talk is based on joint work with Olivier Bousquet and Sylvain Gelly (of Google Brain Zurich) and Carl Johann Simon-Gabriel, Paul Rubenstein, and Bernhard Schoelkopf (of MPI IS, Tuebingen).

December 4, 2017, Room C1.112 – Invited Talk by Prof. Mubarak Shah from the University of Central Florida
Title: Solving Semantic Segmentation: Precision Matrix, Knowledge-Based Rules and Generative Adversarial Networks
Abstract: Figure-ground separation is a landmark problem in visual perception, which has fascinated many scientists for centuries. In computer vision, edge detection and region segmentation have been two grand challenges for understanding an image in terms of objects and contextual surroundings, and the shapes and appearances of objects. Generic segmentation of an image involves grouping pixels which are perceptually similar. In semantic segmentation, by contrast, the aim is to assign a semantic label to each pixel in the image. Even though semantic segmentation can be achieved by simply applying classifiers (trained using supervised learning) to each pixel or region in the image, the results may not be desirable, because context beyond simple smoothness is not taken into account. In this talk, I will start by briefly presenting two supervised approaches to this problem. First, I will discuss an approach that discovers interactions between labels and regions using a sparse estimate of the precision matrix (the inverse of the covariance matrix of the data), obtained via the graphical lasso. In this way we find a graph over labels as well as regions in the image which encodes significant interactions and is also able to capture long-distance associations. Second, I will introduce a knowledge-based method to incorporate dependencies among regions in the image during inference. High-level knowledge rules – such as co-occurrence, spatial relations and mutual exclusivity – are extracted from training data and transformed into constraints in an Integer Programming formulation. A difficulty most supervised semantic segmentation approaches are confronted with is the lack of sufficient training data, particularly for deep learning methods, which have become enormously popular recently. Annotation must be at the pixel level (i.e., each pixel of every training image must be labeled), which is highly expensive to obtain.
To address this limitation, I will next present a semi-supervised learning approach that exploits the plentiful supply of unlabeled images as well as synthetic images generated via a Generative Adversarial Network (GAN). Furthermore, I will discuss an extension of the model that uses additional weakly labeled data to solve the problem in a weakly supervised manner. The basic idea is that, through the fake samples provided by the generator and the real/fake competition between the discriminator and generator networks, true samples are encouraged to lie close together in feature space. The model therefore learns more discriminative features, which lead to better classification results for semantic segmentation.
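The discriminator-as-classifier idea behind this kind of semi-supervised GAN can be sketched as follows (an illustrative toy, not the actual model): the pixel classifier gets K semantic classes plus one extra ‘fake’ class, and unlabeled real pixels are trained only to be ‘not fake’; the logit values below are invented:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Semi-supervised GAN classification: K semantic classes + 1 "fake" class.
K = 5
logits_unlabeled = np.array([[2.0, 0.1, 0.1, 0.1, 0.1, -1.0]])  # K + 1 scores

p = softmax(logits_unlabeled)
p_fake = p[:, -1]               # probability the pixel came from the generator
# An unlabeled real pixel only needs to be classified as "not fake":
loss_unlabeled = -np.log(1.0 - p_fake)
```

Labeled pixels would use the ordinary cross-entropy over the first K classes, so unlabeled and synthetic data shape the same feature space that the supervised loss uses.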

October 30, 2017 – Invited Talk by Prof. Vincent Lepetit from University of Bordeaux
Title: Deep Learning for 3D Localization
The first part of the talk will describe a novel method for 3D object detection and pose estimation from color images only. He introduces a “holistic” approach that relies on a representation of 3D pose suitable for Deep Networks and on a feedback loop. This approach, like many previous ones, is however not sufficient for handling objects with an axis of rotational symmetry, as the pose of such objects is in fact ambiguous. He shows how to resolve this ambiguity with a combination of classification and regression. The second part will describe an approach bridging the gap between learning-based and geometric approaches, for accurate and robust camera pose estimation in urban environments from a single input image and simple 2D maps as the only reference data.

Prof. Vincent Lepetit is a Full Professor at the LaBRI, University of Bordeaux, and an associate member of the Inria Manao team. He also supervises a research group in Computer Vision for Augmented Reality at the Institute for Computer Graphics and Vision, TU Graz. He received the PhD degree in Computer Vision in 2001 from the University of Nancy, France, after working in the ISA INRIA team. He then joined the Virtual Reality Lab at EPFL as a post-doctoral fellow and became a founding member of the Computer Vision Laboratory. He became a Professor at TU Graz in February 2014, and at the University of Bordeaux in January 2017. His research is at the interface between Machine Learning and 3D Computer Vision, with applications to 3D hand pose estimation, feature point detection and description, and 3D registration from images. In particular, he introduced with his colleagues methods such as Ferns, BRIEF, LINE-MOD, and DeepPrior for feature point matching and 3D object recognition. He often serves as a program committee member and area chair of major vision conferences (CVPR, ICCV, ECCV, ACCV, BMVC). He is an editor for the International Journal of Computer Vision (IJCV) and the Computer Vision and Image Understanding (CVIU) journal.


September 15th, 2017 – Invited Talk by Oriol Vinyals from Google DeepMind.
Title: New Challenges in Reinforcement Learning
Oriol Vinyals is a Staff Research Scientist at Google DeepMind, working in Deep Learning. Prior to joining DeepMind, Oriol was part of the Google Brain team. He holds a Ph.D. in EECS from the University of California, Berkeley and is a recipient of the 2016 MIT TR35 innovator award. At DeepMind he continues working on his areas of interest, which include artificial intelligence, with particular emphasis on sequences, deep learning and reinforcement learning. In this talk he’ll describe some of the recent work that he and his collaborators did at DeepMind on model-based RL, and his recent work on StarCraft II, a strategy game which poses a new challenge for deep RL.

December 2nd 2016 – Invited Talk by Jason Yosinski from Geometric Intelligence.
Title: A deeper understanding of large neural nets
Deep neural networks have recently been making a bit of a splash, enabling machines to learn to solve problems that had previously been easy for humans but hard for machines, like playing Atari games or identifying lions or jaguars in photos. But how do these neural nets actually work? What do they learn? This turns out to be a surprisingly tricky question to answer — surprising because we built the networks, but tricky because they are so large and have many millions of connections that carry out complex, hard-to-interpret computations. Trickiness notwithstanding, in this talk we’ll see what we can learn about neural nets by looking at a few examples of networks in action and experiments designed to elucidate network behavior. The combined experiments yield a better understanding of network behavior and capabilities and promise to bolster our ability to apply neural nets as components in real-world computer vision systems.

October 28th 2016 – Invited Talk by Max Jaderberg from Google DeepMind.
Title: Temporal Credit Assignment for Training Recurrent Neural Networks
The problem of temporal credit assignment is at the heart of training temporal models — how the processing or actions performed in the past affect the future, and how we can train this processing to optimise future performance. This talk will focus on two distinct scenarios. First, the reinforcement learning scenario, where we consider an agent which is a recurrent neural network that takes actions in its environment. I will show our state-of-the-art approach to deep reinforcement learning, and some of the latest methods which enhance temporal credit assignment, presenting results on new 3D environments. I will then look at how temporal credit assignment is performed more generically during the training of recurrent neural networks, and how this can be improved by the introduction of Synthetic Gradients — predicted gradients from future processing, produced by local models learnt online.
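A one-dimensional caricature of the Synthetic Gradients idea (our own toy, not DeepMind's code): a small local model is regressed online onto the true gradient arriving from future computation, after which its prediction can stand in for that gradient so the earlier module can update without waiting; the "true gradient" here is an invented linear function of the hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Local linear model M(h) = w * h + b that learns to predict the gradient
# dL/dh coming from future timesteps, so updates can happen immediately.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(200):
    h = rng.normal()             # stand-in hidden state
    true_grad = 3.0 * h          # stand-in "gradient from the future"
    pred = w * h + b
    err = pred - true_grad       # regress the predictor onto the true gradient
    w -= lr * err * h            # SGD on the squared prediction error
    b -= lr * err
```

Once `w` approaches 3, the local model's output matches the future gradient, decoupling the earlier module's update from the full backward pass.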

September 29th 2016 – Invited Talk by Iasonas Kokkinos from INRIA.
Title: DeepLab to UberNet: From Task-specific to Task-agnostic Deep Learning
Over the last few years Convolutional Neural Networks (CNNs) have been shown to deliver excellent results in a broad range of low- and high-level vision tasks, spanning effectively the whole spectrum of computer vision problems.
In this talk we will present recent research progress along two complementary directions. In the first part we will present research efforts on integrating established computer vision ideas with CNNs, thereby allowing us to incorporate task-specific domain knowledge in CNNs. We will present CNN-based adaptations of structured prediction techniques that use discrete (DenseCRF – DeepLab) and continuous energy-based formulations (Deep Gaussian CRF), and will also present methods to incorporate ideas from multi-scale processing, Multiple-Instance Learning and Spectral Clustering into CNNs. In the second part of the talk we will turn to designing a generic architecture that can tackle a multitude of tasks jointly, aiming at a ‘Swiss knife’ for vision. We call this network an ‘UberNet’ to underline its overarching nature. We will introduce techniques that allow us to train an UberNet on datasets with diverse annotations, while also handling the memory limitations of current hardware. The proposed architecture is able to jointly address (a) boundary detection, (b) saliency detection, (c) normal estimation, (d) semantic segmentation, (e) human part segmentation, (f) human boundary detection, and (g) region proposal generation and object detection in 0.7 seconds per frame, with a level of performance comparable to the current state of the art on these tasks.
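The ‘diverse annotations’ ingredient can be sketched independently of the architecture (toy numbers, not from the talk): each training example carries a mask of the tasks it is actually labeled for, and only those per-task losses enter its objective:

```python
import numpy as np

# Multi-task training on heterogeneously annotated data: mask out the loss
# of every task for which a given example has no ground-truth labels.
task_losses = np.array([
    [0.9, 0.4, 0.7],   # example 0: per-task losses (e.g. boundaries, saliency, segm.)
    [0.5, 0.2, 0.3],   # example 1
])
has_label = np.array([
    [1, 0, 1],          # example 0 is annotated for tasks 0 and 2 only
    [0, 1, 0],          # example 1 is annotated for task 1 only
])

masked = task_losses * has_label
total_loss = masked.sum() / has_label.sum()   # average over available labels only
```

This lets one network train on the union of several single-task datasets without requiring any image to carry all seven kinds of annotation.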