• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-05-08
Yue Wu, Qiang Ji

The locations of the fiducial facial landmark points around facial components and facial contour capture the rigid and non-rigid facial deformations due to head movements and facial expressions. They are hence important for various facial analysis tasks. Many facial landmark detection algorithms have been developed to automatically detect those key points over the years, and in this paper, we perform an extensive review of them. We classify the facial landmark detection algorithms into three major categories: holistic methods, Constrained Local Model (CLM) methods, and the regression-based methods. They differ in the ways to utilize the facial appearance and shape information. The holistic methods explicitly build models to represent the global facial appearance and shape information. The CLMs explicitly leverage the global shape model but build the local appearance models. The regression based methods implicitly capture facial shape and appearance information. For algorithms within each category, we discuss their underlying theories as well as their differences. We also compare their performances on both controlled and in the wild benchmark datasets, under varying facial expressions, head poses, and occlusion. Based on the evaluations, we point out their respective strengths and weaknesses. There is also a separate section to review the latest deep learning based algorithms. The survey also includes a listing of the benchmark databases and existing software. Finally, we identify future research directions, including combining methods in different categories to leverage their respective strengths to solve landmark detection “in-the-wild”.

Updated: 2019-02-14
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-05-23
Bo Liu, Liping Jing, Jia Li, Jian Yu, Alex Gittens, Michael W. Mahoney

With significant advances in imaging technology, multiple images of a person or an object are becoming readily available in a number of real-life scenarios. In contrast to single images, image sets can capture a broad range of variations in the appearance of a single face or object. Recognition from these multiple images (i.e., image set classification) has gained significant attention in the area of computer vision. Unlike many existing approaches, which assume that only the images in the same set affect each other, this work develops a group collaborative representation (GCR) model which makes no such assumption, and which can effectively determine the hidden structure among image sets. Specifically, GCR takes advantage of the relationship between image sets to capture the inter- and intra-set variations, and it determines the characteristic subspaces of all the gallery sets. In these subspaces, individual gallery images and each probe set can be effectively represented via a self-representation learning scheme, which leads to increased discriminative ability and enhances robustness and efficiency of the prediction process. By conducting extensive experiments and comparing with state-of-the-art, we demonstrated the superiority of the proposed method on set-based face recognition and object categorization tasks.

Updated: 2019-02-14
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-05-24
Martin Hofmann, Marco Seeland, Patrick Mäder

The projection of a real world scenery to a planar image sensor inherits the loss of information about the 3D structure as well as the absolute dimensions of the scene. For image analysis and object classification tasks, however, absolute size information can make results more accurate. Today, the creation of size annotated image datasets is effort intensive and typically requires measurement equipment not available to public image contributors. In this paper, we propose an effective annotation method that utilizes the camera within smart mobile devices to capture the missing size information along with the image. The approach builds on the fact that with a camera, calibrated to a specific object distance, lengths can be measured in the object’s plane. We use the camera’s minimum focus distance as calibration distance and propose an adaptive feature matching process for precise computation of the scale change between two images facilitating measurements on larger object distances. Eventually, the measured object is segmented and its size information is annotated for later analysis. A user study showed that humans are able to retrieve the calibration distance with a low variance. The proposed approach facilitates a measurement accuracy comparable to manual measurement with a ruler and outperforms state-of-the-art methods in terms of accuracy and repeatability. Consequently, the proposed method allows in-situ size annotation of objects in images without the need for additional equipment or an artificial reference object in the scene.
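Under an ideal pinhole camera, the measurement principle described above reduces to similar triangles. The following is a minimal sketch under that assumption, not the authors' implementation: the function names, the focal length expressed in pixels, and the reading of the scale change as the ratio of the two object distances are all illustrative assumptions.

```python
def length_in_object_plane(pixel_length, distance_mm, focal_length_px):
    """Metric length of a segment in a fronto-parallel plane at a known distance.

    Assumes an ideal pinhole camera with the focal length given in pixels.
    """
    return pixel_length * distance_mm / focal_length_px


def length_from_scale_change(pixel_length_far, scale_change,
                             calib_distance_mm, focal_length_px):
    """Measure in a far image using the scale change relative to a calibrated image.

    Assumption: scale_change is how many times larger the object appears at the
    calibration distance than in the far image, so under the pinhole model the
    far distance equals scale_change * calib_distance_mm.
    """
    far_distance_mm = scale_change * calib_distance_mm
    return length_in_object_plane(pixel_length_far, far_distance_mm, focal_length_px)


# Hypothetical numbers: the same object seen at 100 mm and at four times that distance.
near = length_in_object_plane(500, 100.0, 1000.0)
far = length_from_scale_change(125, 4.0, 100.0, 1000.0)
```

With these made-up values both measurements agree, which is the consistency the scale-change step is meant to preserve.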

Updated: 2019-02-14
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-02-13
Amir Jamaludin, Joon Son Chung, Andrew Zisserman

We describe a method for generating a video of a talking face. The method takes still images of the target face and an audio speech segment as inputs, and generates a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we develop an encoder–decoder convolutional neural network (CNN) model that uses a joint embedding of the face and audio to generate synthesised talking face video frames. The model is trained on unlabelled videos using cross-modal self-supervision. We also propose methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.

Updated: 2019-02-14
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-02-13
Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A. Nicolaou, Athanasios Papaioannou, Guoying Zhao, Björn Schuller, Irene Kotsia, Stefanos Zafeiriou

Automatic understanding of human affect using visual signals is of great importance in everyday human–machine interactions. Appraising human emotional states, behaviors and reactions displayed in real-world settings can be accomplished using latent continuous dimensions (e.g., the circumplex model of affect). Valence (i.e., how positive or negative an emotion is) and arousal (i.e., power of the activation of the emotion) constitute popular and effective representations for affect. Nevertheless, the majority of datasets collected thus far, although containing naturalistic emotional states, have been captured in highly controlled recording conditions. In this paper, we introduce the Aff-Wild benchmark for training and evaluating affect recognition algorithms. We also report on the results of the First Affect-in-the-wild Challenge (Aff-Wild Challenge) that was recently organized in conjunction with CVPR 2017 on the Aff-Wild database, and was the first ever challenge on the estimation of valence and arousal in-the-wild. Furthermore, we design and extensively train an end-to-end deep neural architecture which performs prediction of continuous emotion dimensions based on visual cues. The proposed deep learning architecture, AffWildNet, includes convolutional and recurrent neural network layers, exploiting the invariant properties of convolutional features, while also modeling temporal dynamics that arise in human behavior via the recurrent layers. The AffWildNet produced state-of-the-art results on the Aff-Wild Challenge. We then exploit the AffWild database for learning features, which can be used as priors for achieving the best performance in both dimensional and categorical emotion recognition, using the RECOLA, AFEW-VA and EmotiW 2017 datasets, compared to all other methods designed for the same goal. The database and emotion recognition models are available at http://ibug.doc.ic.ac.uk/resources/first-affect-wild-challenge.

Updated: 2019-02-14
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-02-13
Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, Gérard Medioni

We present a novel method for modeling 3D face shape, viewpoint, and expression from a single, unconstrained photo. Our method uses three deep convolutional neural networks to estimate each of these components separately. Importantly, unlike others, our method does not use facial landmark detection at test time; instead, it estimates these properties directly from image intensities. In fact, rather than using detectors, we show how accurate landmarks can be obtained as a by-product of our modeling process. We rigorously test our proposed method. To this end, we raise a number of concerns with existing practices used in evaluating face landmark detection methods. In response to these concerns, we propose novel paradigms for testing the effectiveness of rigid and non-rigid face alignment methods without relying on landmark detection benchmarks. We evaluate rigid face alignment by measuring its effects on face recognition accuracy on the challenging IJB-A and IJB-B benchmarks. Non-rigid, expression estimation is tested on the CK+ and EmotiW’17 benchmarks for emotion classification. We do, however, report the accuracy of our approach as a landmark detector for 3D landmarks on AFLW2000-3D and 2D landmarks on 300W and AFLW-PIFA. A surprising conclusion of these results is that better landmark detection accuracy does not necessarily translate to better face processing. Parts of this paper were previously published by Tran et al. (2017) and Chang et al. (2017, 2018).

Updated: 2019-02-14
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-02-12
Jiyoung Jung, Joon-Young Lee, In So Kweon

We present an outdoor photometric stereo method using images captured in a single day. We simulate a sky hemisphere for each image according to its GPS and timestamp and parameterize the obtained sky hemisphere into quadratic skylight and Gaussian sunlight distributions. Our previous work recovered an outdoor scene on a clear day, whereas the current paper shows that cloudy days can provide better illumination conditions for surface orientation recovery, and hence we propose a modified sky model to represent a well-conditioned skylight distribution for outdoor photometric stereo. The proposed method models natural illumination according to a sky model, providing sufficient constraints for shape reconstruction from 1-day images. We tested the proposed method to recover various sized objects and scenes from real-world outdoor daylight images and verified the method using synthetic and real data experiments.

Updated: 2019-02-13
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-02-12
Yao Sui, Ziming Zhang, Guanghui Wang, Yafei Tang, Li Zhang

Correlation filtering based tracking models have received significant attention and achieved great success in terms of both tracking accuracy and computational complexity. However, due to the limitation of the loss function, the current correlation filtering paradigm cannot reliably respond to abrupt appearance changes of the target object. This study focuses on improving the robustness of correlation filter learning. An anisotropy of the filter response is observed and analyzed for the correlation filtering based tracking model, through which the overfitting issue of previous methods is alleviated. Three sparsity-related loss functions are proposed to exploit the anisotropy, leading to three implementations of visual trackers and, correspondingly, to improved overall tracking performance. A large number of experiments are conducted, and the experimental results demonstrate that the proposed approach greatly improves the robustness of the learned correlation filter. The proposed trackers perform comparably against state-of-the-art tracking methods on the four latest standard tracking benchmark datasets.

Updated: 2019-02-13
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-02-12
Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, William T. Freeman

Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is however intractable; and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.

Updated: 2019-02-13
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-02-12
Huaibo Huang, Ran He, Zhenan Sun, Tieniu Tan

Most modern face hallucination methods resort to convolutional neural networks (CNN) to infer high-resolution (HR) face images. However, when dealing with very low-resolution (LR) images, these CNN based methods tend to produce over-smoothed outputs. To address this challenge, this paper proposes a wavelet-domain generative adversarial method that can ultra-resolve a very low-resolution (like $$16\times 16$$ or even $$8\times 8$$) face image to a larger version at multiple upscaling factors ($$2\times$$ to $$16\times$$) in a unified framework. Unlike most existing studies, which hallucinate faces in the image pixel domain, our method first learns to predict the wavelet information of HR face images from the corresponding LR inputs before image-level super-resolution. To capture both global topology information and local texture details of human faces, a flexible and extensible generative adversarial network is designed with three types of losses: (1) a wavelet reconstruction loss that pushes the predicted wavelets closer to the ground truth; (2) a wavelet adversarial loss that encourages realistic wavelets; (3) an identity preserving loss that helps recover identity information. Extensive experiments demonstrate that the presented approach not only achieves more appealing results both quantitatively and qualitatively than state-of-the-art face hallucination methods, but also can significantly improve identification accuracy for low-resolution face images captured in the wild.
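To make the wavelet reconstruction loss concrete, here is a minimal sketch of a one-level 2D wavelet decomposition and a mean squared error over the resulting coefficients. The choice of the Haar wavelet, the sub-band names, and the functions `haar2d` and `wavelet_mse` are assumptions for illustration only; the abstract does not specify the wavelet or the exact form of the loss.

```python
def haar2d(img):
    """One-level orthonormal 2D Haar transform of an image with even height and width.

    Returns four half-size sub-bands: (approx, horiz_detail, vert_detail, diag_detail).
    """
    approx, horiz, vert, diag = [], [], [], []
    for r in range(0, len(img), 2):
        row_a, row_h, row_v, row_d = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]          # top-left, top-right
            d, e = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            row_a.append((a + b + d + e) / 2)  # local average (low-pass)
            row_h.append((a - b + d - e) / 2)  # left minus right
            row_v.append((a + b - d - e) / 2)  # top minus bottom
            row_d.append((a - b - d + e) / 2)  # diagonal difference
        approx.append(row_a)
        horiz.append(row_h)
        vert.append(row_v)
        diag.append(row_d)
    return approx, horiz, vert, diag


def wavelet_mse(pred_bands, true_bands):
    """Mean squared error over all coefficients of all sub-bands."""
    total, count = 0.0, 0
    for pred, true in zip(pred_bands, true_bands):
        for pred_row, true_row in zip(pred, true):
            for p, t in zip(pred_row, true_row):
                total += (p - t) ** 2
                count += 1
    return total / count


bands = haar2d([[1, 2], [3, 4]])
loss = wavelet_mse(bands, haar2d([[1, 1], [1, 1]]))
```

Because this transform is orthonormal, the summed squared error over coefficients equals the summed squared error over pixels; the benefit of a wavelet-domain loss would come from treating or weighting the sub-bands separately, which this sketch does not do.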

Updated: 2019-02-13
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-02-11
Thanh-Trung Ngo, Hajime Nagahara, Ko Nishino, Rin-ichiro Taniguchi, Yasushi Yagi

Reflectance and shape are two important components in visually perceiving the real world. Inferring the reflectance and shape of an object through cameras is a fundamental research topic in the field of computer vision. While three-dimensional shape recovery is pervasive with varieties of approaches and practical applications, reflectance recovery has only emerged recently. Reflectance recovery is a challenging task that is usually conducted in controlled environments, such as a laboratory environment with a special apparatus. However, it is desirable that the reflectance be recovered in the field with a handy camera so that reflectance can be jointly recovered with the shape. To that end, we present a solution that simultaneously recovers the reflectance and shape (i.e., dense depth and normal maps) of an object under natural illumination with commercially available handy cameras. We employ a light field camera to capture one light field image of the object, and a 360-degree camera to capture the illumination. The proposed method provides positive results in both simulation and real-world experiments.

Updated: 2019-02-11
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-01-25
Xinchu Shi, Haibin Ling, Yu Pang, Weiming Hu, Peng Chu, Junliang Xing

High-order motion information is important in multi-target tracking (MTT), especially when dealing with large inter-target ambiguities. Such high-order information can be naturally modeled as a multi-dimensional assignment (MDA) problem, whose global solution is however intractable in general. In this paper, we propose a novel framework for the problem by reshaping MTT as a rank-1 tensor approximation problem (R1TA). We first show that MDA and R1TA share the same objective function and similar constraints. This discovery opens the door to using high-order tensor analysis for MTT and suggests the exploration of R1TA. In particular, we develop a tensor power iteration algorithm to effectively capture high-order motion information as well as appearance variation. The proposed algorithm is evaluated on a diverse set of datasets including aerial video sequences containing airborne dense highway scenes, top-view pedestrian trajectories, multiple similar objects, normal view pedestrians and vehicles. The effectiveness of the proposed algorithm is clearly demonstrated in these experiments.
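For readers unfamiliar with rank-1 tensor approximation, the sketch below shows the generic higher-order power method for a 3-way tensor stored as nested lists. It is an illustration of the general technique only: it uses unit Euclidean normalization and ignores the assignment constraints that the tracking formulation imposes, so it should not be read as the algorithm proposed in the paper. The function names are hypothetical.

```python
import math


def _unit(v):
    """Scale a vector to unit Euclidean norm (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank1_power_iteration(T, iters=50):
    """Approximate T[i][j][k] by lam * u[i] * v[j] * w[k] via alternating updates."""
    I, J, K = len(T), len(T[0]), len(T[0][0])
    u, v, w = _unit([1.0] * I), _unit([1.0] * J), _unit([1.0] * K)
    for _ in range(iters):
        # Contract the tensor with the other two factors, then renormalize.
        u = _unit([sum(T[i][j][k] * v[j] * w[k] for j in range(J) for k in range(K))
                   for i in range(I)])
        v = _unit([sum(T[i][j][k] * u[i] * w[k] for i in range(I) for k in range(K))
                   for j in range(J)])
        w = _unit([sum(T[i][j][k] * u[i] * v[j] for i in range(I) for j in range(J))
                   for k in range(K)])
    lam = sum(T[i][j][k] * u[i] * v[j] * w[k]
              for i in range(I) for j in range(J) for k in range(K))
    return lam, u, v, w


# Sanity check on an exactly rank-1 tensor built from known factors.
a, b, c = [1.0, 2.0], [3.0, 4.0], [1.0, 1.0, 2.0]
T = [[[ai * bj * ck for ck in c] for bj in b] for ai in a]
lam, u, v, w = rank1_power_iteration(T)
```

On an exactly rank-1 tensor with positive factors the method recovers the normalized factors and the product of their norms as the scale.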

Updated: 2019-01-25
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-04-27
Ju Hong Yoon, Chang-Ryeol Lee, Ming-Hsuan Yang, Kuk-Jin Yoon

Online two-dimensional (2D) multi-object tracking (MOT) is a challenging task when the objects of interest have similar appearances. In that case, the motion of objects is another helpful cue for tracking and discriminating multiple objects. However, when using a single moving camera for online 2D MOT, observable motion cues are contaminated by global camera movements and, thus, are not always predictable. To deal with unexpected camera motion, we propose a new data association method that effectively exploits structural constraints in the presence of large camera motion. In addition, to reduce incorrect associations with mis-detections and false positives, we develop a novel event aggregation method to integrate assignment costs computed by structural constraints. We also utilize structural constraints to track missing objects when they are re-detected. By doing this, identities of the missing objects can be retained continuously. Experimental results validated the effectiveness of the proposed data association algorithm under unexpected camera motions. In addition, tracking results on a large number of benchmark datasets demonstrated that the proposed MOT algorithm performs robustly and favorably against various online methods in terms of several quantitative metrics, and that its performance is comparable to offline methods.

Updated: 2019-01-23
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-05-15
Miaomiao Zhang, P. Thomas Fletcher

This paper introduces Fourier-approximated Lie algebras for shooting (FLASH), a fast geodesic shooting algorithm for diffeomorphic image registration. We approximate the infinite-dimensional Lie algebra of smooth vector fields, i.e., the tangent space at the identity of the diffeomorphism group, with a low-dimensional, bandlimited space. We show that most of the computations for geodesic shooting can be carried out entirely in this low-dimensional space. Our algorithm results in dramatic savings in time and memory over traditional large-deformation diffeomorphic metric mapping algorithms, which require dense spatial discretizations of vector fields. To validate the effectiveness of FLASH, we run pairwise image registration on both 2D synthetic data and real 3D brain images and compare with the state-of-the-art geodesic shooting methods. Experimental results show that our algorithm dramatically reduces the computational cost and memory footprint of diffeomorphic image registration with little or no loss of accuracy.

Updated: 2019-01-23
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-04-30
Zicheng Liao, Kevin Karsch, Hongyi Zhang, David Forsyth

We present an object relighting system that allows an artist to select an object from an image and insert it into a target scene. Through simple interactions, the system can adjust illumination on the inserted object so that it appears natural in the scene. To support image-based relighting, we build an object model from the image, and propose a perceptually-inspired approximate shading model for the relighting. It decomposes the shading field into (a) a rough shape term that can be reshaded, (b) a parametric shading detail that encodes missing features from the first term, and (c) a geometric detail term that captures fine-scale material properties. With this decomposition, the shading model combines 3D rendering and image-based composition and allows more flexible compositing than image-based methods. Quantitative evaluation and a set of user studies suggest our method is a promising alternative to existing methods of object insertion.

Updated: 2019-01-23
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-05-03
Tatiana Tommasi, Arun Mallya, Bryan Plummer, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg

This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions from a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.

Updated: 2019-01-23
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-01-17
Yandong Wen, Kaipeng Zhang, Zhifeng Li, Yu Qiao

Deep convolutional neural networks (CNNs) trained with the softmax loss have achieved remarkable successes in a number of closed-set recognition problems, e.g. object recognition, action recognition, etc. Unlike these closed-set tasks, face recognition is an open-set problem where the testing classes (persons) are usually different from those in training. This paper addresses the open-set property of face recognition by developing the center loss. Specifically, the center loss simultaneously learns a center for each class, and penalizes the distances between the deep features of the face images and their corresponding class centers. Training with the center loss enables CNNs to extract deep features with two desirable properties: inter-class separability and intra-class compactness. In addition, we extend the center loss in two aspects. First, we adopt parameter sharing between the softmax loss and the center loss, to reduce the extra parameters introduced by centers. Second, we generalize the concept of center from a single point to a region in embedding space, which further allows us to account for intra-class variations. The advanced center loss significantly enhances the discriminative power of deep features. Experimental results show that our method achieves high accuracies on several important face recognition benchmarks, including Labeled Faces in the Wild, YouTube Faces, IJB-A Janus, and MegaFace Challenge 1.
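The basic center loss described above, a penalty on the distance between each deep feature and the center of its class, can be sketched in a few lines. This is a simplified illustration, not the paper's training procedure: averaging over the batch, the plain moving-average center update, the step size `alpha`, and both function names are assumptions, and the parameter-sharing and region-based extensions are not covered.

```python
def center_loss(features, labels, centers):
    """Half the mean squared distance between each feature and its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total / len(features)


def update_centers(features, labels, centers, alpha=0.5):
    """Move each class center a fraction alpha toward the mean of its batch features."""
    updated = {k: list(c) for k, c in centers.items()}
    for k, c in centers.items():
        members = [x for x, y in zip(features, labels) if y == k]
        if not members:
            continue  # class absent from this batch: leave its center unchanged
        mean = [sum(col) / len(members) for col in zip(*members)]
        updated[k] = [ci + alpha * (mi - ci) for ci, mi in zip(c, mean)]
    return updated


# Toy batch: two samples of class 0, one sample of class 1, 2-D features.
feats = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
labels = [0, 0, 1]
centers = {0: [0.0, 0.0], 1: [3.0, 3.0]}
loss_before = center_loss(feats, labels, centers)
loss_after = center_loss(feats, labels, update_centers(feats, labels, centers))
```

In practice this term is combined with the softmax loss, as the abstract notes; on its own it only encourages intra-class compactness and says nothing about separating classes.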

Updated: 2019-01-17
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-01-17
Yibing Song, Jiawei Zhang, Lijun Gong, Shengfeng He, Linchao Bao, Jinshan Pan, Qingxiong Yang, Ming-Hsuan Yang

We address the problem of restoring a high-resolution face image from a blurry low-resolution input. This problem is difficult as super-resolution and deblurring need to be tackled simultaneously. Moreover, existing algorithms cannot handle face images well as low-resolution face images do not have much texture which is especially critical for deblurring. In this paper, we propose an effective algorithm by utilizing the domain-specific knowledge of human faces to recover high-quality faces. We first propose a facial component guided deep Convolutional Neural Network (CNN) to restore a coarse face image, which is denoted as the base image where the facial component is automatically generated from the input face image. However, the CNN based method cannot handle image details well. We further develop a novel exemplar-based detail enhancement algorithm via facial component matching. Extensive experiments show that the proposed method outperforms the state-of-the-art algorithms both quantitatively and qualitatively.

Updated: 2019-01-17
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-01-09
Lerenhan Li, Jinshan Pan, Wei-Sheng Lai, Changxin Gao, Nong Sang, Ming-Hsuan Yang

We present an effective blind image deblurring method based on a data-driven discriminative prior. Our work is motivated by the fact that a good image prior should favor sharp images over blurred ones. In this work, we formulate the image prior as a binary classifier using a deep convolutional neural network. The learned prior is able to distinguish whether an input image is sharp or not. Embedded into the maximum a posteriori framework, it helps blind deblurring in various scenarios, including natural, face, text, and low-illumination images, as well as non-uniform deblurring. However, it is difficult to optimize the deblurring method with the learned image prior as it involves a non-linear neural network. In this work, we develop an efficient numerical approach based on the half-quadratic splitting method and gradient descent algorithm to optimize the proposed model. Furthermore, we extend the proposed model to handle image dehazing. Both qualitative and quantitative experimental results show that our method performs favorably against the state-of-the-art algorithms as well as domain-specific image deblurring approaches.

Updated: 2019-01-09
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-01-09
Pierre-Luc St-Charles, Guillaume-Alexandre Bilodeau, Robert Bergevin

The segmentation of video sequences into foreground and background regions is a low-level process commonly used in video content analysis and smart surveillance applications. Using a multispectral camera setup can improve this process by providing more diverse data to help identify objects despite adverse imaging conditions. The registration of several data sources is however not trivial if the appearance of objects produced by each sensor differs substantially. This problem is further complicated when parallax effects cannot be ignored when using close-range stereo pairs. In this work, we present a new method to simultaneously tackle multispectral segmentation and stereo registration. Using an iterative procedure, we estimate the labeling result for one problem using the provisional result of the other. Our approach is based on the alternating minimization of two energy functions that are linked through the use of dynamic priors. We rely on the integration of shape and appearance cues to find proper multispectral correspondences, and to properly segment objects in low contrast regions. We also formulate our model as a frame processing pipeline using higher order terms to improve the temporal coherence of our results. Our method is evaluated under different configurations on multiple multispectral datasets, and our implementation is available online.

Updated: 2019-01-09
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2019-01-04
Bailey Kong, James Supanc̆ic̆, Deva Ramanan, Charless C. Fowlkes

We investigate the problem of automatically determining what type of shoe left an impression found at a crime scene. This recognition problem is made difficult by the variability in types of crime scene evidence (ranging from traces of dust or oil on hard surfaces to impressions made in soil) and the lack of comprehensive databases of shoe outsole tread patterns. We find that mid-level features extracted by pre-trained convolutional neural nets are surprisingly effective descriptors for this specialized domain. However, the choice of similarity measure for matching exemplars to a query image is essential to good performance. For matching multi-channel deep features, we propose the use of multi-channel normalized cross-correlation and analyze its effectiveness. Our proposed metric significantly improves performance in matching crime scene shoeprints to laboratory test impressions. We also show its effectiveness in other cross-domain image retrieval problems: matching facade images to segmentation labels and aerial photos to map images. Finally, we introduce a discriminatively trained variant and fine-tune our system through our proposed metric, obtaining state-of-the-art performance.
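One plausible reading of multi-channel normalized cross-correlation is to normalize each feature channel independently and then combine the per-channel correlations. The sketch below follows that reading for two already-aligned feature maps, with each channel flattened to a list; averaging across channels, the handling of constant channels, and the function names are assumptions, and the paper's exact definition may differ.

```python
import math


def _standardize(channel, eps=1e-12):
    """Return a zero-mean, unit-norm copy of a flattened channel."""
    mean = sum(channel) / len(channel)
    centered = [v - mean for v in channel]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm < eps:
        return [0.0] * len(channel)  # constant channel carries no pattern
    return [v / norm for v in centered]


def mcncc(feat_a, feat_b):
    """Average of per-channel normalized cross-correlations at a single alignment."""
    scores = []
    for chan_a, chan_b in zip(feat_a, feat_b):
        na, nb = _standardize(chan_a), _standardize(chan_b)
        scores.append(sum(x * y for x, y in zip(na, nb)))
    return sum(scores) / len(scores)


# Two channels with four spatial positions each.
query = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
rescaled = [[2.0 * v + 3.0 for v in ch] for ch in query]   # per-channel gain and bias
inverted = [[-v for v in ch] for ch in query]
same_score = mcncc(query, rescaled)
opposite_score = mcncc(query, inverted)
```

Per-channel normalization makes the score insensitive to a positive gain and offset applied to each channel, which is one reason such a measure could suit cross-domain matching. A full matcher would also search over translations, which this sketch omits.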

Updated: 2019-01-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-12-17
Grigorios G. Chrysos, Paolo Favaro, Stefanos Zafeiriou

Face analysis lies at the heart of computer vision, with remarkable progress in the past decades. Face recognition and tracking are tackled by building invariance to fundamental modes of variation such as illumination and 3D pose. A much less studied mode of variation is motion blur, which nevertheless presents substantial challenges in face analysis. Recent approaches either make oversimplifying assumptions, e.g. in cases of joint optimization with other tasks, or fail to preserve the highly structured shape/identity information. We introduce a two-step architecture tailored to the challenges of motion deblurring: the first step restores the low frequencies; the second restores the high frequencies, while ensuring that the outputs span the natural images manifold. Both steps are implemented with a supervised data-driven method; to train them we devise a method for creating realistic motion blur by averaging a variable number of frames. The averaged images originate from the $$2MF^2$$ dataset with $$19$$ million facial frames, which we introduce for the task. Considering deblurring as an intermediate step, we conduct thorough experiments on high-level face analysis tasks, i.e. landmark localization and face verification, on blurred images. The experimental evaluation demonstrates the superiority of our method.
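The idea of synthesizing motion blur by averaging a variable number of consecutive frames can be illustrated as follows. This is a bare sketch with grayscale frames stored as nested lists; the function names and the windowing convention are assumptions, and the actual pipeline may include steps not mentioned in the abstract.

```python
def average_frames(frames):
    """Pixel-wise mean of equally sized grayscale frames (each a list of rows)."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(width)]
            for r in range(height)]


def synthesize_blur(video, start, num_frames):
    """Average num_frames consecutive frames starting at index start.

    A larger num_frames corresponds to a longer simulated exposure and stronger blur.
    """
    window = video[start:start + num_frames]
    if len(window) < num_frames:
        raise ValueError("not enough frames left in the video for this window")
    return average_frames(window)


# Toy video: a bright dot moving one pixel to the right per frame on a 1x5 image.
video = [[[1.0 if c == t else 0.0 for c in range(5)]] for t in range(5)]
blurred = synthesize_blur(video, 0, 3)
```

In the toy example the dot is smeared into a streak over the three positions it visited, and the sharp frame in the middle of the window could serve as the training target.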

Updated: 2018-12-17
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-12-07
Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, Antonio Torralba

Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both of the benchmarks and re-implement state-of-the-art models for open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for the semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.

Updated: 2018-12-07
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-12-05
Rudi Penne, Bart Ribbens, Pedro Roios

In this article we provide a very robust algorithm to compute the position of the center of a sphere with known radius from one image by a calibrated camera. To our knowledge it is the first time that an exact sphere localization formula is published that only uses the (pixel) area and the ellipse center of the sphere image. Other authors either derived an approximation formula or followed the less robust and more time consuming procedure of fitting an ellipse through the detected edge pixels. Our method is analytic and deterministic, making use of the unique positive real root of a cubic equation. We observe that the proposed area method is significantly more accurate and precise than an ellipse fitting method. Furthermore, we investigate under what conditions for sphere images the proposed exact method is preferable to the robust approximation method. These observations are validated by virtual, synthetic and real experiments.

Updated: 2018-12-05
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-12-03
Arun Mukundan, Giorgos Tolias, Andrei Bursuc, Hervé Jégou, Ondřej Chum

We propose a multiple-kernel local-patch descriptor based on efficient match kernels from pixel gradients. It combines two parametrizations of gradient position and direction, each of which provides robustness to a different type of patch mis-registration: the polar parametrization for noise in the detection of the patch dominant orientation, and the Cartesian parametrization for imprecise location of the feature point. Combined with whitening of the descriptor space, which is learned with or without supervision, the performance is significantly improved. We analyze the effect of the whitening on patch similarity and demonstrate its semantic meaning. Our unsupervised variant is the best performing descriptor constructed without the need for labeled data. Despite the simplicity of the proposed descriptor, it competes well with deep learning approaches on a number of different tasks.

Updated: 2018-12-03
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-12-01
Cewu Lu, Jianping Shi, Weiming Wang, Jiaya Jia

Fast abnormal event detection meets the growing demand to process an enormous number of surveillance videos. Based on the inherent redundancy of video structures, we propose an efficient sparse combination learning framework with both batch and online solvers. It achieves decent performance in the detection phase without compromising result quality. The extremely fast execution speed is guaranteed owing to the fact that our method effectively turns the original complicated problem into a few small-scale least square optimizations. Our method reaches high detection rates on benchmark datasets at a speed of 1000–1200 frames per second on average when computing on an ordinary single core desktop PC using MATLAB.
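
To make the idea of "a few small-scale least square optimizations" concrete, here is a minimal sketch under stated assumptions: each learned combination is a short list of linearly independent basis vectors, a test feature is scored by its least-squares reconstruction residual, and it is flagged as abnormal only if no combination reconstructs it well. The function names, the threshold parameter and the plain normal-equation solver are illustrative choices, not the authors' implementation.

```python
def least_squares_residual(basis, x):
    """Squared residual min_b ||x - sum_k b[k] * basis[k]||^2.

    Solved through the normal equations; assumes the basis vectors are
    linearly independent (otherwise a pivot is zero).
    """
    k = len(basis)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Augmented system [G | g] with G[i][j] = <basis_i, basis_j>, g[i] = <basis_i, x>.
    aug = [[dot(basis[i], basis[j]) for j in range(k)] + [dot(basis[i], x)]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    coeffs = [aug[i][k] / aug[i][i] for i in range(k)]
    recon = [sum(c * b[d] for c, b in zip(coeffs, basis)) for d in range(len(x))]
    return sum((xi - ri) ** 2 for xi, ri in zip(x, recon))


def is_abnormal(combinations, x, threshold):
    """Abnormal if every combination leaves a residual above the threshold."""
    return all(least_squares_residual(b, x) > threshold for b in combinations)


combos = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
normal_feature = [2.0, 3.0, 0.0]    # lies in the span of the combination
unusual_feature = [0.0, 0.0, 1.0]   # orthogonal to the combination
```

Because each system has only as many unknowns as there are basis vectors in one combination, every solve stays tiny regardless of how many combinations are learned.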

Updated: 2018-12-01
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-11-30
Qi Cai, Yuanxin Wu, Lilian Zhang, Peike Zhang

Two-view relative pose estimation and structure reconstruction is a classical problem in computer vision. Typical methods employ the singular value decomposition of the essential matrix to obtain multiple solutions of the relative pose, from which the right solution is picked out by reconstructing the three-dimensional (3D) feature points and imposing the constraint of positive depth. This paper revisits the two-view geometry problem and discovers that the two-view imaging geometry is equivalently governed by a Pair of new Pose-Only (PPO) constraints: the same-side constraint and the intersection constraint. From the perspective of equation solving, the complete pose solutions of the essential matrix are explicitly derived, and we rigorously prove that the orientation part of the pose can still be recovered in the case of pure rotation. The PPO constraints are simplified and formulated as inequalities to directly identify the right pose solution with no need for 3D reconstruction, and the 3D reconstruction can be analytically achieved from the identified right pose. Furthermore, the intersection inequality also enables a robust criterion for pure rotation identification. Experimental results validate the correctness of the analyses and the robustness of the derived pose solution, pure rotation identification and analytical 3D reconstruction.

Updated: 2018-11-30
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-07-27
Hanxiao Wang, Xiatian Zhu, Shaogang Gong, Tao Xiang

Most existing person re-identification (re-id) methods are unsuitable for real-world deployment for two reasons: poor scalability to large population sizes and inability to adapt over time. In this work, we present a unified solution to address both problems. Specifically, we propose to construct an identity regression space (IRS) based on embedding different training person identities (classes) and formulate re-id as a regression problem solved by identity regression in the IRS. The IRS approach is characterised by a closed-form solution with high learning efficiency and an inherent incremental learning capability with human-in-the-loop. Extensive experiments on four benchmarking datasets (VIPeR, CUHK01, CUHK03 and Market-1501) show that the IRS model not only outperforms state-of-the-art re-id methods, but is also more scalable to large re-id population sizes by rapidly updating the model and actively selecting informative samples with reduced human labelling effort.

Updated: 2018-11-29
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-07-04
Gottfried Munda, Christian Reinbacher, Thomas Pock

Event cameras or neuromorphic cameras mimic the human perception system as they measure the per-pixel intensity change rather than the actual intensity level. In contrast to traditional cameras, such cameras capture new information about the scene at MHz frequency in the form of sparse events. The high temporal resolution comes at the cost of losing the familiar per-pixel intensity information. In this work we propose a variational model that accurately models the behaviour of event cameras, enabling reconstruction of intensity images with arbitrary frame rate in real-time. Our method is formulated on a per-event-basis, where we explicitly incorporate information about the asynchronous nature of events via an event manifold induced by the relative timestamps of events. In our experiments we verify that solving the variational model on the manifold produces high-quality images without explicitly estimating optical flow. This paper is an extended version of our previous work (Reinbacher et al. in British machine vision conference (BMVC), 2016) and contains additional details of the variational model, an investigation of different data terms and a quantitative evaluation of our method against competing methods as well as synthetic ground-truth data.

Updated: 2018-11-29
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-11-29
Shan Li, Weihong Deng

Comprehending different categories of facial expressions plays a great role in the design of computational models that analyze human perceived and affective states. Authoritative studies have revealed that facial expressions in daily life often reflect multiple or co-occurring mental states. However, due to the lack of valid datasets, most previous studies are still restricted to basic emotions with a single label. In this paper, we present a novel multi-label facial expression database, RAF-ML, along with a new deep learning algorithm, to address this problem. Specifically, a crowdsourcing annotation of 1.2 million labels from 315 participants was carried out to identify the multi-label expressions collected from social networks, then an EM algorithm was designed to filter out unreliable labels. To the best of our knowledge, RAF-ML is the first in-the-wild database that provides crowdsourced cognition for multi-label expressions. Focusing on the ambiguity and continuity of blended expressions, we propose a new deep manifold learning network, called Deep Bi-Manifold CNN, to learn discriminative features for multi-label expressions by jointly preserving the local affinity of deep features and the manifold structures of emotion labels. Furthermore, a deep domain adaptation method is leveraged to extend the deep manifold features learned from RAF-ML to other expression databases under various imaging conditions and cultures. Extensive experiments on RAF-ML and other diverse databases (JAFFE, CK+, SFEW and MMI) show that the deep manifold feature is not only superior in multi-label expression recognition in the wild, but also captures the elemental and generic components that are effective for a wide range of expression recognition tasks.

Updated: 2018-11-29
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-11-29
Jiankang Deng, Anastasios Roussos, Grigorios Chrysos, Evangelos Ververas, Irene Kotsia, Jie Shen, Stefanos Zafeiriou

In this article, we present the Menpo 2D and Menpo 3D benchmarks, two new datasets for multi-pose 2D and 3D facial landmark localisation and tracking. In contrast to previous benchmarks such as 300W and 300VW, the proposed benchmarks contain facial images in both semi-frontal and profile poses. We introduce an elaborate semi-automatic methodology for providing high-quality annotations for both the Menpo 2D and Menpo 3D benchmarks. In the Menpo 2D benchmark, different visible landmark configurations are designed for semi-frontal and profile faces, thus making the 2D face alignment full-pose. In the Menpo 3D benchmark, a unified landmark configuration is designed for both semi-frontal and profile faces based on the correspondence with a 3D face model, thus making face alignment not only full-pose but also corresponding to the real-world 3D space. Based on the considerable number of annotated images, we organised the Menpo 2D Challenge and the Menpo 3D Challenge for face alignment under large pose variations in conjunction with CVPR 2017 and ICCV 2017, respectively. The results of these challenges demonstrate that recent deep learning architectures, when trained with abundant data, lead to excellent results. We also provide a very simple yet effective solution, named Cascade Multi-view Hourglass Model, to 2D and 3D face alignment. In our method, we take advantage of all 2D and 3D facial landmark annotations in a joint way. We not only capitalise on the correspondences between the semi-frontal and profile 2D facial landmarks but also employ joint supervision from both 2D and 3D facial landmarks. Finally, we discuss future directions on the topic of face alignment.

Updated: 2018-11-29
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-11-28
Yongming Rao, Jiwen Lu, Jie Zhou

In this paper, we propose a discriminative aggregation network method for video-based face recognition and person re-identification, which aims to integrate information from video frames for feature representation effectively and efficiently. Unlike existing video aggregation methods, our method aggregates raw video frames directly instead of the features obtained by complex processing. By combining the idea of metric learning and adversarial learning, we learn an aggregation network to generate more discriminative images compared to the raw input frames. Our framework reduces the number of image frames per video to be processed and significantly speeds up the recognition procedure. Furthermore, low-quality frames containing misleading information can be well filtered and denoised during the aggregation procedure, which makes our method more robust and discriminative. Experimental results on several widely used datasets show that our method can generate discriminative images from video clips and improve the overall recognition performance in both the speed and the accuracy for video-based face recognition and person re-identification.

Updated: 2018-11-28
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-11-26
Mariko Isogawa, Dan Mikami, Kosuke Takahashi, Daisuke Iwai, Kosuke Sato, Hideaki Kimata

This paper proposes a learning-based quality evaluation framework for inpainted results that does not require any subjectively annotated training data. Image inpainting, which removes and restores unwanted regions in images, is widely acknowledged as a task whose results are quite difficult to evaluate objectively. Thus, existing learning-based image quality assessment (IQA) methods for inpainting require subjectively annotated data for training. However, subjective annotation requires huge cost and subjects’ judgment occasionally differs from person to person in accordance with the judgment criteria. To overcome these difficulties, the proposed framework generates and uses simulated failure results of inpainted images whose subjective qualities are controlled as the training data. We also propose a masking method for generating training data towards fully automated training data generation. These approaches make it possible to successfully estimate better inpainted images, even though the task is quite subjective. To demonstrate the effectiveness of our approach, we test our algorithm with various datasets and show it outperforms existing IQA methods for inpainting.

Updated: 2018-11-27
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-11-21
Chuhang Zou, Ruiqi Guo, Zhizhong Li, Derek Hoiem

One major goal of vision is to infer physical models of objects, surfaces, and their layout from sensors. In this paper, we aim to interpret indoor scenes from one RGBD image. Our representation encodes the layout of orthogonal walls and the extent of objects, modeled with CAD-like 3D shapes. We parse both the visible and occluded portions of the scene and all observable objects, producing a complete 3D parse. Such a scene interpretation is useful for robotics and visual reasoning, but difficult to produce due to the well-known challenge of segmentation, the high degree of occlusion, and the diversity of objects in indoor scenes. We take a data-driven approach, generating sets of potential object regions, matching to regions in training images, and transferring and aligning associated 3D models while encouraging fit to observations and spatial consistency. We use support inference to aid interpretation and propose a retrieval scheme that uses convolutional neural networks to classify regions and retrieve objects with similar shapes. We demonstrate the performance of our method on our newly annotated NYUd v2 dataset (Silberman et al., in: Computer vision-ECCV, 2012, pp 746–760, 2012) with detailed 3D shapes.

Updated: 2018-11-24
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-11-16
Yujiang Wang, Bingnan Luo, Jie Shen, Maja Pantic

Inspired by the recent development of deep network-based methods in semantic image segmentation, we introduce an end-to-end trainable model for face mask extraction in video sequences. Compared to landmark-based sparse face shape representation, our method can produce the segmentation masks of individual facial components, which can better reflect their detailed shape variations. By integrating the convolutional LSTM (ConvLSTM) algorithm with fully convolutional networks (FCN), our new ConvLSTM-FCN model works on a per-sequence basis and takes advantage of the temporal correlation in video clips. In addition, we propose a novel loss function, called segmentation loss, to directly optimise the intersection over union (IoU) performance. In practice, to further increase segmentation accuracy, one primary model and two additional models were trained to focus on the face, eyes, and mouth regions, respectively. Our experiments show that the proposed method achieves a 16.99% relative improvement (from 54.50 to 63.76% mean IoU) over the baseline FCN model on the 300 Videos in the Wild (300VW) dataset.
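
The reported figures can be checked with a short calculation, and the notion of optimising IoU directly can be illustrated with a common soft-IoU surrogate. The sketch below is an assumption-laden illustration: the function names are hypothetical, masks are flat lists rather than images, and the soft-IoU form shown is a widely used surrogate, not necessarily the segmentation loss defined in the paper.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0


def soft_iou_loss(probs, target, eps=1e-7):
    """1 - soft IoU, where probs are per-pixel foreground probabilities in [0, 1].

    Products replace the set intersection, which keeps the expression
    differentiable with respect to probs.
    """
    inter = sum(p * t for p, t in zip(probs, target))
    union = sum(p + t - p * t for p, t in zip(probs, target))
    return 1.0 - (inter + eps) / (union + eps)


def relative_improvement(baseline, improved):
    """Relative gain in percent: (improved - baseline) / baseline * 100."""
    return (improved - baseline) / baseline * 100.0


# Mean IoU figures quoted in the abstract: 54.50 -> 63.76.
gain = round(relative_improvement(54.50, 63.76), 2)  # 16.99
```

The absolute gain is 9.26 points of mean IoU; dividing by the 54.50 baseline gives the 16.99% relative figure.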

Updated: 2018-11-16
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-11-08
Li Liu, Jie Chen, Paul Fieguth, Guoying Zhao, Rama Chellappa, Matti Pietikäinen

Texture is a fundamental characteristic of many types of images, and texture representation is one of the essential and challenging problems in computer vision and pattern recognition which has attracted extensive research attention over several decades. Since 2000, texture representations based on Bag of Words and on Convolutional Neural Networks have been extensively studied with impressive performance. Given this period of remarkable evolution, this paper aims to present a comprehensive survey of advances in texture representation over the last two decades. More than 250 major publications are cited in this survey covering different aspects of the research, including benchmark datasets and state of the art results. In retrospect of what has been achieved so far, the survey discusses open challenges and directions for future research.

Updated: 2018-11-08
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-01-31
Isinsu Katircioglu, Bugra Tekin, Mathieu Salzmann, Vincent Lepetit, Pascal Fua

Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from an image to a 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images or 2D joint location heatmaps that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and accounts for joint dependencies. We further propose an efficient Long Short-Term Memory network to enforce temporal consistency on 3D pose predictions. We demonstrate that our approach achieves state-of-the-art performance both in terms of structure preservation and prediction accuracy on standard 3D human pose estimation benchmarks.

Updated: 2018-11-05
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2017-11-07
Henri Rebecq, Guillermo Gallego, Elias Mueggler, Davide Scaramuzza

Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. They offer significant advantages over standard cameras, namely a very high dynamic range, no motion blur, and a latency in the order of microseconds. However, because the output is composed of a sequence of asynchronous events rather than actual intensity images, traditional vision algorithms cannot be applied, so that a paradigm shift is needed. We introduce the problem of event-based multi-view stereo (EMVS) for event cameras and propose a solution to it. Unlike traditional MVS methods, which address the problem of estimating dense 3D structure from a set of known viewpoints, EMVS estimates semi-dense 3D structure from an event camera with known trajectory. Our EMVS solution elegantly exploits two inherent properties of an event camera: (1) its ability to respond to scene edges—which naturally provide semi-dense geometric information without any pre-processing operation—and (2) the fact that it provides continuous measurements as the sensor moves. Despite its simplicity (it can be implemented in a few lines of code), our algorithm is able to produce accurate, semi-dense depth maps, without requiring any explicit data association or intensity estimation. We successfully validate our method on both synthetic and real data. Our method is computationally very efficient and runs in real-time on a CPU.

Updated: 2018-11-05
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-01-31
Bernhard Egger, Sandro Schönborn, Andreas Schneider, Adam Kortylewski, Andreas Morel-Forster, Clemens Blumer, Thomas Vetter

Faces in natural images are often occluded by a variety of objects. We propose a fully automated, probabilistic and occlusion-aware 3D morphable face model adaptation framework following an analysis-by-synthesis setup. The key idea is to segment the image into regions explained by separate models. Our framework includes a 3D morphable face model, a prototype-based beard model and a simple model for occlusions and background regions. The segmentation and all the model parameters have to be inferred from the single target image. Face model adaptation and segmentation are solved jointly using an expectation–maximization-like procedure. During the E-step, we update the segmentation and in the M-step the face model parameters are updated. For face model adaptation we apply a stochastic sampling strategy based on the Metropolis–Hastings algorithm. For segmentation, we apply loopy belief propagation for inference in a Markov random field. Illumination estimation is critical for occlusion handling. Our combined segmentation and model adaptation needs a proper initialization of the illumination parameters. We propose a RANSAC-based robust illumination estimation technique. By applying this method to a large face image database we obtain a first empirical distribution of real-world illumination conditions. The obtained empirical distribution is made publicly available and can be used as prior in probabilistic frameworks, for regularization or to synthesize data for deep learning methods.
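
For readers unfamiliar with the Metropolis–Hastings algorithm mentioned above, the following generic random-walk sampler shows the accept/reject step on a one-dimensional toy target. It is only a sketch: the face model in the paper has many parameters and its own proposal distributions, whereas the function name, the Gaussian proposal and the standard normal target here are assumptions for illustration.

```python
import math
import random


def metropolis_hastings(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a 1D unnormalised log density."""
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        # Symmetric proposal, so the acceptance probability is
        # min(1, target(proposal) / target(x)). Using 1 - random() keeps the
        # uniform draw in (0, 1] and avoids log(0).
        if math.log(1.0 - rng.random()) < logp_new - logp:
            x, logp = proposal, logp_new
        samples.append(x)
    return samples


# Toy target: standard normal, log density up to an additive constant.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
```

Only ratios of the target density are needed, which is why an unnormalised posterior is sufficient for sampling-based model adaptation.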

Updated: 2018-11-05
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-03-27
Daniel Maurer, Yong Chul Ju, Michael Breuß, Andrés Bruhn

Shape from shading (SfS) and stereo are two fundamentally different strategies for image-based 3-D reconstruction. While approaches for SfS infer the depth solely from pixel intensities, methods for stereo are based on a matching process that establishes correspondences across images. This difference in approaching the reconstruction problem yields complementary advantages that are worthwhile being combined. So far, however, most “joint” approaches are based on an initial stereo mesh that is subsequently refined using shading information. In this paper we follow a completely different approach. We propose a joint variational method that combines both cues within a single minimisation framework. To this end, we fuse a Lambertian SfS approach with a robust stereo model and supplement the resulting energy functional with a detail-preserving anisotropic second-order smoothness term. Moreover, we extend the resulting model in such a way that it jointly estimates depth, albedo and illumination. This in turn makes the approach applicable to objects with non-uniform albedo as well as to scenes with unknown illumination. Experiments for synthetic and real-world images demonstrate the benefits of our combined approach: They not only show that our method is capable of generating very detailed reconstructions, but also that joint approaches are feasible in practice.

Updated: 2018-11-05
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2017-07-27
Arash Akbarinia, C. Alejandro Parraga

Edges are key components of any visual scene to the extent that we can recognise objects merely by their silhouettes. The human visual system captures edge information through neurons in the visual cortex that are sensitive to both intensity discontinuities and particular orientations. The “classical approach” assumes that these cells are only responsive to the stimulus present within their receptive fields; however, recent studies demonstrate that surrounding regions and inter-areal feedback connections influence their responses significantly. In this work we propose a biologically-inspired edge detection model in which orientation selective neurons are represented through the first derivative of a Gaussian function resembling double-opponent cells in the primary visual cortex (V1). In our model we account for four kinds of receptive field surround, i.e. full, far, iso- and orthogonal-orientation, whose contributions are contrast-dependent. The output signal from V1 is pooled in its perpendicular direction by larger V2 neurons employing a contrast-variant centre-surround kernel. We further introduce a feedback connection from higher-level visual areas to the lower ones. The results of our model on three benchmark datasets show a large improvement over the current non-learning and biologically-inspired state-of-the-art algorithms while being competitive with the learning-based methods.
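
The first-derivative-of-Gaussian operator at the core of the model can be illustrated in one dimension. The sketch below is a simplification under stated assumptions: it omits orientation selectivity, the surround modulation, V2 pooling and feedback described in the abstract, and the function names and the truncation radius of three standard deviations are illustrative choices.

```python
import math


def gaussian_derivative_kernel(sigma, radius=None):
    """Samples of d/dx of a Gaussian with standard deviation sigma."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    norm = sigma ** 3 * math.sqrt(2.0 * math.pi)
    return [-x / norm * math.exp(-x * x / (2.0 * sigma * sigma))
            for x in range(-radius, radius + 1)]


def convolve_valid(signal, kernel):
    """True convolution (kernel flipped), keeping only fully overlapping positions."""
    k = len(kernel)
    return [sum(signal[i + k - 1 - j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


# A rising step edge between samples 9 and 10.
signal = [0.0] * 10 + [1.0] * 10
kernel = gaussian_derivative_kernel(sigma=1.0)
response = convolve_valid(signal, kernel)
peak = max(range(len(response)), key=lambda i: response[i])
edge_position = peak + len(kernel) // 2  # index in the original signal
```

The response is positive for a rising edge and peaks next to the step, while flat regions give a response of zero, which is the behaviour expected of an edge-sensitive unit.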

Updated: 2018-11-05
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-10-29
David Stutz, Andreas Geiger

We address the problem of 3D shape completion from sparse and noisy point clouds, a fundamental problem in computer vision and robotics. Recent approaches are either data-driven or learning-based: Data-driven approaches rely on a shape model whose parameters are optimized to fit the observations; Learning-based approaches, in contrast, avoid the expensive optimization step by learning to directly predict complete shapes from incomplete observations in a fully-supervised setting. However, full supervision is often not available in practice. In this work, we propose a weakly-supervised learning-based approach to 3D shape completion which neither requires slow optimization nor direct supervision. While we also learn a shape prior on synthetic data, we amortize, i.e., learn, maximum likelihood fitting using deep neural networks resulting in efficient shape completion without sacrificing accuracy. On synthetic benchmarks based on ShapeNet (Chang et al. Shapenet: an information-rich 3d model repository, 2015. arXiv:1512.03012) and ModelNet (Wu et al., in: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), 2015) as well as on real robotics data from KITTI (Geiger et al., in: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), 2012) and Kinect (Yang et al., 3d object dense reconstruction from a single depth view, 2018. arXiv:1802.00411), we demonstrate that the proposed amortized maximum likelihood approach is able to compete with the fully supervised baseline of Dai et al. (in: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), 2017) and outperforms the data-driven approach of Engelmann et al. (in: Proceedings of the German conference on pattern recognition (GCPR), 2016), while requiring less supervision and being significantly faster.

Updated: 2018-10-30
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-10-22
Olivia Wiles, Andrew Zisserman

The objective of this work is to reconstruct the 3D surfaces of sculptures from one or more images using a view-dependent representation. To this end, we train a network, SiDeNet, to predict the Silhouette and Depth of the surface given a variable number of images; the silhouette is predicted at a different viewpoint from the inputs (e.g. from the side), while the depth is predicted at the viewpoint of the input images. This has three benefits. First, the network learns a representation of shape beyond that of a single viewpoint, as the silhouette forces it to respect the visual hull, and the depth image forces it to predict concavities (which don’t appear on the visual hull). Second, as the network learns about 3D using the proxy tasks of predicting depth and silhouette images, it is not limited by the resolution of the 3D representation. Finally, using a view-dependent representation (e.g. additionally encoding the viewpoint with the input image) improves the network’s generalisability to unseen objects. Additionally, the network is able to handle the input views in a flexible manner. First, it can ingest a different number of views during training and testing, and it is shown that the reconstruction performance improves as additional views are added at test-time. Second, the additional views do not need to be photometrically consistent. The network is trained and evaluated on two synthetic datasets: a realistic sculpture dataset (SketchFab) and ShapeNet. The design of the network is validated by comparing to state of the art methods for a set of tasks. It is shown that (i) passing the input viewpoint (i.e. using a view-dependent representation) improves the network’s generalisability at test time; (ii) predicting depth/silhouette images allows for higher quality predictions in 2D, as the network is not limited by the chosen latent 3D representation; and (iii) on both datasets the method of combining views in a global manner performs better than a local method. Finally, we show that the trained network generalizes to real images, and probe how the network has encoded the latent 3D shape.

Updated: 2018-10-23
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-10-19
Fotios Logothetis, Roberto Mecca, Fiorella Sgallari, Roberto Cipolla

Despite long-standing research aimed at retrieving geometrical information of an object from polarimetric imaging, physical limitations in the polarisation phenomena restrict current approaches to ambiguous depth estimates. As an additional constraint, the polarimetric imaging formulation differs depending on whether light is reflected off the object specularly or diffusely. This introduces another source of ambiguity that current formulations cannot overcome. With the aim of deriving a formulation capable of dealing with as many heterogeneous effects as possible, we propose a differential formulation of the Shape from Polarisation problem that depends only on polarimetric images. This allows the direct geometrical characterisation of the level-set of the object while keeping a consistent mathematical formulation for diffuse and specular reflection. We show via synthetic and real-world experiments that diffuse and specular reflection can be easily distinguished in order to extract meaningful geometrical features from just polarimetric imaging. The inherent ambiguity of the Shape from Polarisation problem becomes evident through the impossibility of reconstructing the whole surface with this differential approach. To overcome this limitation, we consider shading information, elegantly embedding this new formulation into a two-light calibrated photometric stereo approach.

Updated: 2018-10-19
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-10-05
Katsushi Ikeuchi, Zhaoyuan Ma, Zengqiang Yan, Shunsuke Kudoh, Minako Nakamura

We have been developing a paradigm that we call learning-from-observation for a robot to automatically acquire a robot program to conduct a series of operations, or for a robot to understand what to do, through observing humans performing the same operations. Since a simple mimicking method that repeats exact joint angles or exact end-effector trajectories does not work well because of the kinematic and dynamic differences between a human and a robot, the proposed method employs intermediate symbolic representations, tasks, for conceptually representing what-to-do through observation. These tasks are subsequently mapped to appropriate robot operations depending on the robot hardware. In the present work, task models for upper-body operations of humanoid robots are presented, which are designed on the basis of Labanotation. Given a series of human operations, we first analyze the upper-body motions and extract certain fixed poses from key frames. These key poses are translated into tasks represented by Labanotation symbols. Then, a robot performs the operations corresponding to those task models. Because tasks based on Labanotation are independent of robot hardware, different robots can share the same observation module, and only different task-mapping modules specific to robot hardware are required. The system was implemented, and we demonstrated that three different robots can automatically mimic human upper-body operations with a satisfactory level of resemblance.

Updated: 2018-10-05
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-10-05
Oscar Koller, Sepehr Zargaran, Hermann Ney, Richard Bowden

This manuscript introduces the end-to-end embedding of a CNN into an HMM, while interpreting the outputs of the CNN in a Bayesian framework. The hybrid CNN-HMM combines the strong discriminative abilities of CNNs with the sequence modelling capabilities of HMMs. Most current approaches in the field of gesture and sign language recognition disregard the necessity of dealing with sequence data both for training and evaluation. With our presented end-to-end embedding we are able to improve over the state-of-the-art on three challenging benchmark continuous sign language recognition tasks by between 15 and 38% relative reduction in word error rate and up to 20% absolute. We analyse the effect of the CNN structure, network pretraining and number of hidden states. We compare the hybrid modelling to a tandem approach and evaluate the gain of model combination.
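
In hybrid neural-network/HMM systems, a Bayesian reading of classifier outputs usually means converting state posteriors p(s|x) into scaled likelihoods p(x|s) ∝ p(s|x) / p(s), which the HMM can then use as emission scores. The sketch below shows that conversion in the log domain. It reflects the standard hybrid recipe and is assumed, not confirmed, to match the paper's formulation; the function name and the `prior_scale` parameter are hypothetical.

```python
import math


def scaled_log_likelihoods(posteriors, priors, prior_scale=1.0):
    """Per-state log p(x|s) up to a constant: log p(s|x) - prior_scale * log p(s).

    posteriors: classifier outputs for one frame (for example CNN softmax values).
    priors: state priors, for example relative state frequencies in training.
    prior_scale: hypothetical tuning factor; 1.0 is the plain Bayes rule.
    """
    return [math.log(p) - prior_scale * math.log(q)
            for p, q in zip(posteriors, priors)]


# Two states with equal posteriors but very different priors: the rare state
# ends up with the higher emission score.
scores = scaled_log_likelihoods([0.5, 0.5], [0.9, 0.1])
```

Dividing by the prior prevents frequent states from dominating the decoding simply because the classifier saw them more often during training.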

Updated: 2018-10-05
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-04-21
Zsolt Sánta, Zoltan Kato

A novel region-based approach is proposed to find a thin plate spline map between a pair of deformable 3D objects represented by triangular surface meshes. The proposed method works without landmark extraction and feature correspondences. The aligning transformation is simply found by solving a system of integral equations. Each equation is generated by integrating a non-linear function over the object domains. We derive recursive formulas for the efficient computation of these integrals for open and closed surface meshes. Based on a series of comparative tests on a large synthetic dataset, our triangular mesh-based algorithm outperforms state of the art methods both in terms of computing time and accuracy. The applicability of the proposed approach has been demonstrated on the registration of 3D lung CT volumes, brain surfaces and 3D human faces.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-04-21
Manuel Ruder, Alexey Dosovitskiy, Thomas Brox

Manually re-drawing an image in a certain artistic style takes a professional artist a long time. Doing this for a video sequence single-handedly is beyond imagination. We present two computational approaches that transfer the style from one image (for example, a painting) to a whole video sequence. In our first approach, we adapt to videos the original image style transfer technique by Gatys et al. based on energy minimization. We introduce new ways of initialization and new loss functions to generate consistent and stable stylized video sequences even in cases with large motion and strong occlusion. Our second approach formulates video stylization as a learning problem. We propose a deep network architecture and training procedures that allow us to stylize arbitrary-length videos in a consistent and stable way, and nearly in real time. We show that the proposed methods clearly outperform simpler baselines both qualitatively and quantitatively. Finally, we propose a way to adapt these approaches also to 360° images and videos as they emerge with recent virtual reality hardware.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-04-23
Xing Wei, Qingxiong Yang, Yihong Gong

Edge/structure-preserving operations for images aim to smooth images without blurring the edges/structures. Many exemplary edge-preserving filtering methods have recently been proposed to reduce the computational complexity and/or separate structures of different scales. They normally adopt a user-selected scale measurement to control the detail smoothing. However, natural photos contain objects of different sizes, which cannot be described by a single scale measurement. On the other hand, contour analysis is closely related to edge-preserving filtering, and significant progress has recently been achieved. Nevertheless, the majority of state-of-the-art filtering techniques have ignored the successes in this area. Inspired by the fact that learning-based edge detectors significantly outperform traditional manually-designed detectors, this paper proposes a learning-based edge-preserving filtering technique. It synergistically combines the differential operations in edge-preserving filters with the effectiveness of the recent edge detectors for scale-aware filtering. Unlike previous filtering methods, the proposed filters can efficiently extract subjectively meaningful structures from natural scenes containing multiple-scale objects.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-04-06
Yuan Xie, Dacheng Tao, Wensheng Zhang, Yan Liu, Lei Zhang, Yanyun Qu

In this paper, we address the multi-view subspace clustering problem. Our method utilizes the circulant algebra for tensors: a tensor is constructed by stacking the subspace representation matrices of different views and then rotating it, in order to capture the low-rank tensor subspace, so that the view-specific subspaces can be refined and the high-order correlations underlying multi-view data can be explored. By introducing a recently proposed tensor factorization, namely tensor-Singular Value Decomposition (t-SVD) (Kilmer et al. in SIAM J Matrix Anal Appl 34(1):148–172, 2013), we can impose a new type of low-rank tensor constraint on the rotated tensor to ensure the consensus among multiple views. Unlike the traditional unfolding-based tensor norm, this low-rank tensor constraint has optimality properties similar to those of the matrix rank derived from SVD, so the complementary information can be explored and propagated among all the views more thoroughly and effectively. The established model, called t-SVD based Multi-view Subspace Clustering (t-SVD-MSC), falls into the applicable scope of the augmented Lagrangian method, and its minimization problem can be efficiently solved with a theoretical convergence guarantee and relatively low computational complexity. Extensive experimental testing on eight challenging image datasets shows that the proposed method achieves highly competitive performance compared to several state-of-the-art multi-view clustering methods.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-04-12
James Steven Supančič, Grégory Rogez, Yi Yang, Jamie Shotton, Deva Ramanan

Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and have released software and evaluation code. We summarize important conclusions here: (1) Coarse pose estimation appears viable for scenes with isolated hands. However, high precision pose estimation [required for immersive virtual reality and cluttered scenes (where hands may be interacting with nearby objects and surfaces)] remains a challenge. To spur further progress we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define consistent evaluation criteria, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems. This implies that most systems do not generalize beyond their training sets. This also reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-07-17
Yingzhen Yang, Jiashi Feng, Nebojsa Jojic, Jianchao Yang, Thomas S. Huang

Subspace clustering methods partition the data that lie in or close to a union of subspaces in accordance with the subspace structure. Such methods with sparsity prior, such as sparse subspace clustering (SSC) (Elhamifar and Vidal in IEEE Trans Pattern Anal Mach Intell 35(11):2765–2781, 2013) with the sparsity induced by the $$\ell ^{1}$$-norm, are demonstrated to be effective in subspace clustering. Most of those methods require certain assumptions, e.g. independence or disjointness, on the subspaces. However, these assumptions are not guaranteed to hold in practice and they limit the application of existing sparse subspace clustering methods. In this paper, we propose $$\ell ^{0}$$-induced sparse subspace clustering ($$\ell ^{0}$$-SSC). In contrast to the required assumptions, such as independence or disjointness, on subspaces for most existing sparse subspace clustering methods, we prove that $$\ell ^{0}$$-SSC guarantees the subspace-sparse representation, a key element in subspace clustering, for arbitrary distinct underlying subspaces almost surely under the mild i.i.d. assumption on the data generation. We also present the “no free lunch” theorem which shows that obtaining the subspace representation under our general assumptions cannot be much computationally cheaper than solving the corresponding $$\ell ^{0}$$ sparse representation problem of $$\ell ^{0}$$-SSC. A novel approximate algorithm named Approximate $$\ell ^{0}$$-SSC (A$$\ell ^{0}$$-SSC) is developed which employs proximal gradient descent to obtain a sub-optimal solution to the optimization problem of $$\ell ^{0}$$-SSC with theoretical guarantee. The sub-optimal solution is used to build a sparse similarity matrix upon which spectral clustering is performed for the final clustering results. Extensive experimental results on various data sets demonstrate the superiority of A$$\ell ^{0}$$-SSC compared to other competing clustering methods. Furthermore, we extend $$\ell ^{0}$$-SSC to semi-supervised learning by performing label propagation on the sparse similarity matrix learnt by A$$\ell ^{0}$$-SSC and demonstrate the effectiveness of the resultant semi-supervised learning method termed $$\ell ^{0}$$-sparse subspace label propagation ($$\ell ^{0}$$-SSLP).
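As a rough illustration of proximal gradient descent with an $$\ell ^{0}$$ penalty, the sketch below minimises 0.5·‖y − Da‖² + λ‖a‖₀ for a single signal, using hard thresholding as the proximal step. It is a generic sparse-coding toy, not the authors' A$$\ell ^{0}$$-SSC: the function names, the dictionary `D`, and the parameters `lam` and `step` are assumptions, and the self-representation constraint of subspace clustering (a point may not be used to represent itself) is left out.

```python
import math


def hard_threshold(v, tau):
    """Proximal operator of the scaled l0 penalty: zero out small entries."""
    return [x if abs(x) > tau else 0.0 for x in v]


def l0_proximal_gradient(D, y, lam, step, iters=50):
    """Approximately minimise 0.5*||y - D a||^2 + lam*||a||_0.

    `step` should not exceed 1 / (largest eigenvalue of D^T D).
    D is a list of rows; all names here are illustrative assumptions.
    """
    n_rows, n_cols = len(D), len(D[0])
    a = [0.0] * n_cols
    tau = math.sqrt(2.0 * lam * step)  # threshold implied by the l0 prox
    for _ in range(iters):
        # Residual D a - y, then gradient D^T (D a - y) of the smooth term.
        residual = [
            sum(D[i][j] * a[j] for j in range(n_cols)) - y[i]
            for i in range(n_rows)
        ]
        grad = [
            sum(D[i][j] * residual[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        # Gradient step followed by the proximal (hard-thresholding) step.
        a = hard_threshold([a[j] - step * grad[j] for j in range(n_cols)], tau)
    return a


# Toy problem with an identity dictionary: the small coefficient is pruned.
D = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.1]
a = l0_proximal_gradient(D, y, lam=0.5, step=1.0)
```

With the identity dictionary the threshold is 1, so the entry 3.0 survives and the entry 0.1 is set to zero. Because the $$\ell ^{0}$$ problem is non-convex, such iterations reach only a sub-optimal point in general, which matches the abstract's wording.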

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-05-23
Xi Peng, Rogerio S. Feris, Xiaoyu Wang, Dimitris N. Metaxas

We propose a novel method for real-time face alignment in videos based on a recurrent encoder–decoder network model. Our proposed model predicts 2D facial point heat maps regularized by both detection and regression loss, while uniquely exploiting recurrent learning at both spatial and temporal dimensions. At the spatial level, we add a feedback loop connection between the combined output response map and the input, in order to enable iterative coarse-to-fine face alignment using a single network model, instead of relying on traditional cascaded model ensembles. At the temporal level, we first decouple the features in the bottleneck of the network into temporal-variant factors, such as pose and expression, and temporal-invariant factors, such as identity information. Temporal recurrent learning is then applied to the decoupled temporal-variant features. We show that such feature disentangling yields better generalization and significantly more accurate results at test time. We perform a comprehensive experimental analysis, showing the importance of each component of our proposed model, as well as superior results over the state of the art and several variations of our method in standard datasets.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-07-11
Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, Antonio Torralba

The sound of crashing waves, the roar of fast-moving cars—sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. (in: European conference on computer vision, 2016b), with additional experiments and discussion.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2017-11-13
Emma Alexander, Qi Guo, Sanjeev Koppal, Steven J. Gortler, Todd Zickler

We present the focal flow sensor. It is an unactuated, monocular camera that simultaneously exploits defocus and differential motion to measure a depth map and a 3D scene velocity field. It does this using an optical-flow-like, per-pixel linear constraint that relates image derivatives to depth and velocity. We derive this constraint, prove its invariance to scene texture, and prove that it is exactly satisfied only when the sensor’s blur kernels are Gaussian. We analyze the inherent sensitivity of the focal flow cue, and we build and test a prototype. Experiments produce useful depth and velocity information for a broader set of aperture configurations, including a simple lens with a pillbox aperture.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2017-12-23
Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, Stan Sclaroff

We aim to model the top-down attention of a convolutional neural network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, to pass along top-down signals downwards in the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. We show a theoretic connection between the proposed contrastive attention formulation and the Class Activation Map computation. Efficient implementation of Excitation Backprop for common neural network layers is also presented. In experiments, we visualize the evidence of a model’s classification decision by computing the proposed top-down attention maps. For quantitative evaluation, we report the accuracy of our method in weakly supervised localization tasks on the MS COCO, PASCAL VOC07 and ImageNet datasets. The usefulness of our method is further validated in the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising performance in phrase localization by leveraging the top-down attention of a CNN model that has been trained on weakly labeled web images. Finally, we demonstrate applications of our method in model interpretation and data annotation assistance for facial expression analysis and medical imaging tasks.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-09-22
Jiayi Ma, Ji Zhao, Junjun Jiang, Huabing Zhou, Xiaojie Guo

Seeking reliable correspondences between two feature sets is a fundamental and important task in computer vision. This paper attempts to remove mismatches from given putative image feature correspondences. To achieve this goal, we design an efficient approach, termed locality preserving matching (LPM), whose principle is to maintain the local neighborhood structures of the potential true matches. We formulate the problem as a mathematical model, and derive a closed-form solution with linearithmic time and linear space complexities. Our method can remove mismatches from thousands of putative correspondences in only a few milliseconds. To demonstrate the generality of our strategy for handling image matching problems, we conduct extensive experiments on various real image pairs for general feature matching, as well as for point set registration, visual homing and near-duplicate image retrieval. Compared with other state-of-the-art alternatives, our LPM achieves better or favorably competitive accuracy while cutting the time cost by more than two orders of magnitude.
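The neighborhood-preservation principle can be illustrated with a simplified filter: a putative match is kept only if the nearest neighbors of the point in the first image largely correspond to the nearest neighbors of its partner in the second image. This is a sketch of the idea, not the LPM formulation from the paper; the function names, the neighborhood size `k`, and the threshold `max_mismatch_ratio` are assumptions, and the brute-force neighbor search does not reach the linearithmic complexity the abstract reports.

```python
def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    px, py = points[i]
    others = [j for j in range(len(points)) if j != i]
    # Sort by squared distance; break ties by index for determinism.
    others.sort(key=lambda j: ((points[j][0] - px) ** 2
                               + (points[j][1] - py) ** 2, j))
    return set(others[:k])


def filter_matches(xs, ys, k=2, max_mismatch_ratio=0.5):
    """Keep match i (xs[i] <-> ys[i]) if its neighborhoods mostly agree."""
    kept = []
    for i in range(len(xs)):
        neighbors_x = k_nearest(xs, i, k)
        neighbors_y = k_nearest(ys, i, k)
        # Fraction of neighbors that are not shared between the two images.
        mismatch = 1.0 - len(neighbors_x & neighbors_y) / k
        if mismatch <= max_mismatch_ratio:
            kept.append(i)
    return kept


# Five consistent matches (a pure translation) plus one gross mismatch.
xs = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
ys = [(10, 10), (11, 10), (12, 10), (13, 10), (14, 10), (0, 10)]
kept = filter_matches(xs, ys)
```

In this toy data the last correspondence has entirely different neighbors in the two point sets, so it is the only one rejected.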

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-09-22
Meng Tang, Dmitrii Marin, Ismail Ben Ayed, Yuri Boykov

This work bridges the gap between two popular methodologies for data partitioning: kernel clustering and regularization-based segmentation. While addressing closely related practical problems, these general methodologies may seem very different based on how they are covered in the literature. The differences may show up in motivation, formulation, and optimization, e.g. spectral relaxation versus max-flow. We explain how regularization and kernel clustering can work together and why this is useful. Our joint energy combines standard regularization, e.g. MRF potentials, and kernel clustering criteria like normalized cut. Complementarity of such terms is demonstrated in many applications using our bound optimization Kernel Cut algorithm for the joint energy (code is publicly available). While detailing combinatorial move-making, our main focus is on new linear kernel and spectral bounds for kernel clustering criteria, which allow their integration with any regularization objectives that have existing discrete or continuous solvers.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-09-22
Pavel Tokmakov, Cordelia Schmid, Karteek Alahari

We study the problem of segmenting moving objects in unconstrained videos. Given a video, the task is to segment all the objects that exhibit independent motion in at least one frame. We formulate this as a learning problem and design our framework with three cues: (1) independent object motion between a pair of frames, which complements object recognition, (2) object appearance, which helps to correct errors in motion estimation, and (3) temporal consistency, which imposes additional constraints on the segmentation. The framework is a two-stream neural network with an explicit memory module. The two streams encode appearance and motion cues in a video sequence respectively, while the memory module captures the evolution of objects over time, exploiting the temporal consistency. The motion stream is a convolutional neural network trained on synthetic videos to segment independently moving objects in the optical flow field. The module to build a “visual memory” in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. For every pixel in a frame of a test video, our approach assigns an object or background label based on the learned spatio-temporal features as well as the “visual memory” specific to the video. We evaluate our method extensively on three benchmarks, DAVIS, Freiburg-Berkeley motion segmentation dataset and SegTrack. In addition, we provide an extensive ablation study to investigate both the choice of the training data and the influence of each component in the proposed framework.

Updated: 2018-10-04
• Int. J. Comput. Vis. (IF 11.541) Pub Date : 2018-09-11
Danda Pani Paudel, Adlane Habed, Cédric Demonceaux, Pascal Vasseur

This paper addresses the problem of registering a known structured 3D scene, typically a 3D scan, and its metric Structure-from-Motion (SfM) counterpart. The proposed registration method relies on a prior plane segmentation of the 3D scan. Alignment is carried out by solving either the point-to-plane assignment problem, should the SfM reconstruction be sparse, or the plane-to-plane one in case of dense SfM. A Polynomial Sum-of-Squares optimization theory framework is employed for identifying point-to-plane and plane-to-plane mismatches, i.e. outliers, with certainty. An inlier set maximization approach within a Branch-and-Bound search scheme is adopted to iteratively build potential inlier sets and converge to the solution satisfied by the largest number of assignments. Plane visibility conditions and vague camera locations may be incorporated for better efficiency without sacrificing optimality. The registration problem is solved in two cases: (i) putative correspondences (with possibly overwhelmingly many outliers) are provided as input and (ii) no initial correspondences are available. Our approach yields outstanding results in terms of robustness and optimality.

Updated: 2018-10-04