显示样式:     当前期刊: International Journal of Computer Vision    加入关注    导出
  • Hierarchical Cellular Automata for Visual Saliency
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-02-23
    Yao Qin, Mengyang Feng, Huchuan Lu, Garrison W. Cottrell

    Saliency detection, finding the most important parts of an image, has become increasingly popular in computer vision. In this paper, we introduce Hierarchical Cellular Automata (HCA)—a temporally evolving model to intelligently detect salient objects. HCA consists of two main components: Single-layer Cellular Automata (SCA) and Cuboid Cellular Automata (CCA). As an unsupervised propagation mechanism, Single-layer Cellular Automata can exploit the intrinsic relevance of similar regions through interactions with neighbors. Low-level image features as well as high-level semantic information extracted from deep neural networks are incorporated into the SCA to measure the correlation between different image patches. With these hierarchical deep features, an impact factor matrix and a coherence matrix are constructed to balance the influences on each cell’s next state. The saliency values of all cells are iteratively updated according to a well-defined update rule. Furthermore, we propose CCA to integrate multiple saliency maps generated by SCA at different scales in a Bayesian framework. Therefore, single-layer propagation and multi-scale integration are jointly modeled in our unified HCA. Surprisingly, we find that the SCA can improve all existing methods that we applied it to, resulting in a similar precision level regardless of the original results. The CCA can act as an efficient pixel-wise aggregation algorithm that can integrate state-of-the-art methods, resulting in even better results. Extensive experiments on four challenging datasets demonstrate that the proposed algorithm outperforms state-of-the-art conventional methods and is competitive with deep learning based approaches.

  • Scale-Free Registrations in 3D: 7 Degrees of Freedom with Fourier Mellin SOFT Transforms
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-02-23
    Heiko Bülow, Andreas Birk

    Fourier Mellin SOFT (FMS) as a novel method for global registration of 3D data is presented. It determines the seven degrees of freedom (7-DoF) transformation, i.e., the 6-DoF rigid motion parameters plus 1-DoF scale, between two scans, i.e., two noisy, only partially overlapping views on objects or scenes. It is based on a sequence of the 3D Fourier transform, the Mellin transform and the SO(3) Fourier transform. This combination represents a non-trivial complete 3D extension of the well known Fourier-Mellin registration for 2D images. It is accordingly based on decoupling rotation and scale from translation. First, rotation—which is the main challenge for the extension to 3D data - is tackled with a SO(3) Fourier Transform (SOFT) based on Spherical Harmonics. In a second step, scale is determined via a 3D Mellin transform. Finally, translation is calculated by Phase-Matching. Experiments are presented with simulated data sets for ground truth comparisons and with real world data including object recognition and localization in Magnetic Resonance Tomography (MRT) data, registration of 2.5D RGBD scans from a Microsoft Kinect with a scale-free 3D model generated by Multi-View Vision, and 3D mapping by registration of a sequence of consecutive scans from a low-cost actuated Laser Range Finder. The results show that the method is fast and that it can robustly handle partial overlap, interfering structures, and noise. It is also shown that the method is a very interesting option for 6-DoF registration, i.e., when scale is known.

  • Prediction of Manipulation Actions
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-20
    Cornelia Fermüller, Fang Wang, Yezhou Yang, Konstantinos Zampogiannis, Yi Zhang, Francisco Barranco, Michael Pfeiffer

    By looking at a person’s hands, one can often tell what the person is going to do next, how his/her hands are moving and where they will be, because an actor’s intentions shape his/her movement kinematics during action execution. Similarly, active systems with real-time constraints must not simply rely on passive video-segment classification, but they have to continuously update their estimates and predict future actions. In this paper, we study the prediction of dexterous actions. We recorded videos of subjects performing different manipulation actions on the same object, such as “squeezing”, “flipping”, “washing”, “wiping” and “scratching” with a sponge. In psychophysical experiments, we evaluated human observers’ skills in predicting actions from video sequences of different length, depicting the hand movement in the preparation and execution of actions before and after contact with the object. We then developed a recurrent neural network based method for action prediction using as input image patches around the hand. We also used the same formalism to predict the forces on the finger tips using for training synchronized video and force data streams. Evaluations on two new datasets show that our system closely matches human performance in the recognition task, and demonstrate the ability of our algorithms to predict in real time what and how a dexterous action is performed.

  • Dynamic Behavior Analysis via Structured Rank Minimization
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-01-19
    Christos Georgakis, Yannis Panagakis, Maja Pantic

    Human behavior and affect is inherently a dynamic phenomenon involving temporal evolution of patterns manifested through a multiplicity of non-verbal behavioral cues including facial expressions, body postures and gestures, and vocal outbursts. A natural assumption for human behavior modeling is that a continuous-time characterization of behavior is the output of a linear time-invariant system when behavioral cues act as the input (e.g., continuous rather than discrete annotations of dimensional affect). Here we study the learning of such dynamical system under real-world conditions, namely in the presence of noisy behavioral cues descriptors and possibly unreliable annotations by employing structured rank minimization. To this end, a novel structured rank minimization method and its scalable variant are proposed. The generalizability of the proposed framework is demonstrated by conducting experiments on 3 distinct dynamic behavior analysis tasks, namely (i) conflict intensity prediction, (ii) prediction of valence and arousal, and (iii) tracklet matching. The attained results outperform those achieved by other state-of-the-art methods for these tasks and, hence, evidence the robustness and effectiveness of the proposed approach.

  • Joint Estimation of Human Pose and Conversational Groups from Social Scenes
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-07-14
    Jagannadan Varadarajan, Ramanathan Subramanian, Samuel Rota Bulò, Narendra Ahuja, Oswald Lanz, Elisa Ricci

    Despite many attempts in the last few years, automatic analysis of social scenes captured by wide-angle camera networks remains a very challenging task due to the low resolution of targets, background clutter and frequent and persistent occlusions. In this paper, we present a novel framework for jointly estimating (i) head, body orientations of targets and (ii) conversational groups called F-formations from social scenes. In contrast to prior works that have (a) exploited the limited range of head and body orientations to jointly learn both, or (b) employed the mutual head (but not body) pose of interactors for deducing F-formations, we propose a weakly-supervised learning algorithm for joint inference. Our algorithm employs body pose as the primary cue for F-formation estimation, and an alternating optimization strategy is proposed to iteratively refine F-formation and pose estimates. We demonstrate the increased efficacy of joint inference over the state-of-the-art via extensive experiments on three social datasets.

  • Toward Personalized Modeling: Incremental and Ensemble Alignment for Sequential Faces in the Wild
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-15
    Xi Peng, Shaoting Zhang, Yang Yu, Dimitris N. Metaxas

    Fitting facial landmarks on unconstrained videos is a challenging task with broad applications. Both generic and joint alignment methods have been proposed with varying degrees of success. However, many generic methods are heavily sensitive to initializations and usually rely on offline-trained static models, which limit their performance on sequential images with extensive variations. On the other hand, joint methods are restricted to offline applications, since they require all frames to conduct batch alignment. To address these limitations, we propose to exploit incremental learning for personalized ensemble alignment. We sample multiple initial shapes to achieve image congealing within one frame, which enables us to incrementally conduct ensemble alignment by group-sparse regularized rank minimization. At the same time, incremental subspace adaptation is performed to achieve personalized modeling in a unified framework. To alleviate the drifting issue, we leverage a very efficient fitting evaluation network to pick out well-aligned faces for robust incremental learning. Extensive experiments on both controlled and unconstrained datasets have validated our approach in different aspects and demonstrated its superior performance compared with state of the arts in terms of fitting accuracy and efficiency.

  • Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-05-22
    Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, Li Fei-Fei

    Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.

  • Space-Time Tree Ensemble for Action Recognition and Localization
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-02
    Shugao Ma, Jianming Zhang, Stan Sclaroff, Nazli Ikizler-Cinbis, Leonid Sigal

    Human actions are, inherently, structured patterns of body movements. We explore ensembles of hierarchical spatio-temporal trees, discovered directly from training data, to model these structures for action recognition and spatial localization. Discovery of frequent and discriminative tree structures is challenging due to the exponential search space, particularly if one allows partial matching. We address this by first building a concise action word vocabulary via discriminative clustering of the hierarchical space-time segments, which is a two-level video representation that captures both static and non-static relevant space-time segments of the video. Using this vocabulary we then utilize tree mining with subsequent tree clustering and ranking to select a compact set of discriminative tree patterns. Our experiments show that these tree patterns, alone, or in combination with shorter patterns (action words and pairwise patterns) achieve promising performance on three challenging datasets: UCF Sports, HighFive and Hollywood3D. Moreover, we perform cross-dataset validation, using trees learned on HighFive to recognize the same actions in Hollywood3D, and using trees learned on UCF-Sports to recognize and localize the similar actions in JHMDB. The results demonstrate the potential for cross-dataset generalization of the trees our approach discovers.

  • Unconstrained Still/Video-Based Face Verification with Deep Convolutional Neural Networks
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-07-01
    Jun-Cheng Chen, Rajeev Ranjan, Swami Sankaranarayanan, Amit Kumar, Ching-Hui Chen, Vishal M. Patel, Carlos D. Castillo, Rama Chellappa

    Over the last 5 years, methods based on Deep Convolutional Neural Networks (DCNNs) have shown impressive performance improvements for object detection and recognition problems. This has been made possible due to the availability of large annotated datasets, a better understanding of the non-linear mapping between input images and class labels as well as the affordability of GPUs. In this paper, we present the design details of a deep learning system for unconstrained face recognition, including modules for face detection, association, alignment and face verification. The quantitative performance evaluation is conducted using the IARPA Janus Benchmark A (IJB-A), the JANUS Challenge Set 2 (JANUS CS2), and the Labeled Faces in the Wild (LFW) dataset. The IJB-A dataset includes real-world unconstrained faces of 500 subjects with significant pose and illumination variations which are much harder than the LFW and Youtube Face datasets. JANUS CS2 is the extended version of IJB-A which contains not only all the images/frames of IJB-A but also includes the original videos. Some open issues regarding DCNNs for face verification problems are then discussed.

  • Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2016-10-04
    Lionel Pigou, Aäron van den Oord, Sander Dieleman, Mieke Van Herreweghe, Joni Dambre

    Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning and speech recognition. For the task of capturing temporal structure in video, however, there still remain numerous open research questions. Current research suggests using a simple temporal feature pooling strategy to take into account the temporal aspect of video. We demonstrate that this method is not sufficient for gesture recognition, where temporal information is more discriminative compared to general video classification tasks. We explore deep architectures for gesture recognition in video and propose a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. Our main contributions are twofold; first, we show that recurrence is crucial for this task; second, we show that adding temporal convolutions leads to significant improvements. We evaluate the different approaches on the Montalbano gesture recognition dataset, where we achieve state-of-the-art results.

  • A Comprehensive Performance Evaluation of Deformable Face Tracking “In-the-Wild”
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-25
    Grigorios G. Chrysos, Epameinondas Antonakos, Patrick Snape, Akshay Asthana, Stefanos Zafeiriou

    Recently, technologies such as face detection, facial landmark localisation and face recognition and verification have matured enough to provide effective and efficient solutions for imagery captured under arbitrary conditions (referred to as “in-the-wild”). This is partially attributed to the fact that comprehensive “in-the-wild” benchmarks have been developed for face detection, landmark localisation and recognition/verification. A very important technology that has not been thoroughly evaluated yet is deformable face tracking “in-the-wild”. Until now, the performance has mainly been assessed qualitatively by visually assessing the result of a deformable face tracking technology on short videos. In this paper, we perform the first, to the best of our knowledge, thorough evaluation of state-of-the-art deformable face tracking pipelines using the recently introduced 300 VW benchmark. We evaluate many different architectures focusing mainly on the task of on-line deformable face tracking. In particular, we compare the following general strategies: (a) generic face detection plus generic facial landmark localisation, (b) generic model free tracking plus generic facial landmark localisation, as well as (c) hybrid approaches using state-of-the-art face detection, model free tracking and facial landmark localisation technologies. Our evaluation reveals future avenues for further research on the topic.

  • Transferring Deep Object and Scene Representations for Event Recognition in Still Images
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-09-13
    Limin Wang, Zhe Wang, Yu Qiao, Luc Van Gool

    This paper addresses the problem of image-based event recognition by transferring deep representations learned from object and scene datasets. First we empirically investigate the correlation of the concepts of object, scene, and event, thus motivating our representation transfer methods. Based on this empirical study, we propose an iterative selection method to identify a subset of object and scene classes deemed most relevant for representation transfer. Afterwards, we develop three transfer techniques: (1) initialization-based transfer, (2) knowledge-based transfer, and (3) data-based transfer. These newly designed transfer techniques exploit multitask learning frameworks to incorporate extra knowledge from other networks or additional datasets into the fine-tuning procedure of event CNNs. These multitask learning frameworks turn out to be effective in reducing the effect of over-fitting and improving the generalization ability of the learned CNNs. We perform experiments on four event recognition benchmarks: the ChaLearn LAP Cultural Event Recognition dataset, the Web Image Dataset for Event Recognition, the UIUC Sports Event dataset, and the Photo Event Collection dataset. The experimental results show that our proposed algorithm successfully transfers object and scene representations towards the event dataset and achieves the current state-of-the-art performance on all considered datasets.

  • Deep Multimodal Fusion: A Hybrid Approach
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-20
    Mohamed R. Amer, Timothy Shields, Behjat Siddiquie, Amir Tamrakar, Ajay Divakaran, Sek Chai

    We propose a novel hybrid model that exploits the strength of discriminative classifiers along with the representation power of generative models. Our focus is on detecting multimodal events in time varying sequences as well as generating missing data in any of the modalities. Discriminative classifiers have been shown to achieve higher performances than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn a rich informative space which allows for data generation and joint feature representation that discriminative models lack. We propose a new model that jointly optimizes the representation space using a hybrid energy function. We employ a Restricted Boltzmann Machines (RBMs) based model to learn a shared representation across multiple modalities with time varying data. The Conditional RBMs (CRBMs) is an extension of the RBM model that takes into account short term temporal phenomena. The hybrid model involves augmenting CRBMs with a discriminative component for classification. For these purposes we propose a novel Multimodal Discriminative CRBMs (MMDCRBMs) model. First, we train the MMDCRBMs model using labeled data by training each modality, followed by training a fusion layer. Second, we exploit the generative capability of MMDCRBMs to activate the trained model so as to generate the lower-level data corresponding to the specific label that closely matches the actual input data. We evaluate our approach on ChaLearn dataset, audio-mocap, as well as the Tower Game dataset, mocap-mocap as well as three multimodal toy datasets. We report classification accuracy, generation accuracy, and localization accuracy and demonstrate its superiority compared to the state-of-the-art methods.

  • Subjects and Their Objects: Localizing Interactees for a Person-Centric View of Importance
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2016-10-28
    Chao-Yeh Chen, Kristen Grauman

    Understanding images with people often entails understanding their interactions with other objects or people. As such, given a novel image, a vision system ought to infer which other objects/people play an important role in a given person’s activity. However, existing methods are limited to learning action-specific interactions (e.g., how the pose of a tennis player relates to the position of his racquet when serving the ball) for improved recognition, making them unequipped to reason about novel interactions with actions or objects unobserved in the training data. We propose to predict the “interactee” in novel images—that is, to localize the object of a person’s action. Given an arbitrary image with a detected person, the goal is to produce a saliency map indicating the most likely positions and scales where that person’s interactee would be found. To that end, we explore ways to learn the generic, action-independent connections between (a) representations of a person’s pose, gaze, and scene cues and (b) the interactee object’s position and scale. We provide results on a newly collected UT Interactee dataset spanning more than 10,000 images from SUN, PASCAL, and COCO. We show that the proposed interaction-informed saliency metric has practical utility for four tasks: contextual object detection, image retargeting, predicting object importance, and data-driven natural language scene description. All four scenarios reveal the value in linking the subject to its object in order to understand the story of an image.

  • Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2016-08-10
    Rasmus Rothe, Radu Timofte, Luc Van Gool

    In this paper we propose a deep learning solution to age estimation from a single face image without the use of facial landmarks and introduce the IMDB-WIKI dataset, the largest public dataset of face images with age and gender labels. If the real age estimation research spans over decades, the study of apparent age estimation or the age as perceived by other humans from a face image is a recent endeavor. We tackle both tasks with our convolutional neural networks (CNNs) of VGG-16 architecture which are pre-trained on ImageNet for image classification. We pose the age estimation problem as a deep classification problem followed by a softmax expected value refinement. The key factors of our solution are: deep learned models from large data, robust face alignment, and expected value formulation for age regression. We validate our methods on standard benchmarks and achieve state-of-the-art results for both real and apparent age estimation.

  • Real-Time Accurate 3D Head Tracking and Pose Estimation with Consumer RGB-D Cameras
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-02
    David Joseph Tan, Federico Tombari, Nassir Navab

    We demonstrate how 3D head tracking and pose estimation can be effectively and efficiently achieved from noisy RGB-D sequences. Our proposal leverages on a random forest framework, designed to regress the 3D head pose at every frame in a temporal tracking manner. One peculiarity of the algorithm is that it exploits together (1) a generic training dataset of 3D head models, which is learned once offline; and, (2) an online refinement with subject-specific 3D data, which aims for the tracker to withstand slight facial deformations and to adapt its forest to the specific characteristics of an individual subject. The combination of these works allows our algorithm to be robust even under extreme poses, where the user’s face is no longer visible on the image. Finally, we also propose another solution that utilizes a multi-camera system such that the data simultaneously acquired from multiple RGB-D sensors helps the tracker to handle challenging conditions that affect a subset of the cameras. Notably, the proposed multi-camera frameworks yields a real-time performance of approximately 8 ms per frame given six cameras and one CPU core, and scales up linearly to 30 fps with 25 cameras.

  • Large Scale 3D Morphable Models
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-04-08
    James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, Stefanos Zafeiriou

    We present large scale facial model (LSFM)—a 3D Morphable Model (3DMM) automatically constructed from 9663 distinct facial identities. To the best of our knowledge LSFM is the largest-scale Morphable Model ever constructed, containing statistical information from a huge variety of the human population. To build such a large model we introduce a novel fully automated and robust Morphable Model construction pipeline, informed by an evaluation of state-of-the-art dense correspondence techniques. The dataset that LSFM is trained on includes rich demographic information about each subject, allowing for the construction of not only a global 3DMM model but also models tailored for specific age, gender or ethnicity groups. We utilize the proposed model to perform age classification from 3D shape alone and to reconstruct noisy out-of-sample data in the low-dimensional model space. Furthermore, we perform a systematic analysis of the constructed 3DMM models that showcases their quality and descriptive power. The presented extensive qualitative and quantitative evaluations reveal that the proposed 3DMM achieves state-of-the-art results, outperforming existing models by a large margin. Finally, for the benefit of the research community, we make publicly available the source code of the proposed automatic 3DMM construction pipeline, as well as the constructed global 3DMM and a variety of bespoke models tailored by age, gender and ethnicity.

  • Confidence-Weighted Local Expression Predictions for Occlusion Handling in Expression Recognition and Action Unit Detection
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-04-08
    Arnaud Dapogny, Kevin Bailly, Séverine Dubuisson

    Fully-automatic facial expression recognition (FER) is a key component of human behavior analysis. Performing FER from still images is a challenging task as it involves handling large interpersonal morphological differences, and as partial occlusions can occasionally happen. Furthermore, labelling expressions is a time-consuming process that is prone to subjectivity, thus the variability may not be fully covered by the training data. In this work, we propose to train random forests upon spatially-constrained random local subspaces of the face. The output local predictions form a categorical expression-driven high-level representation that we call local expression predictions (LEPs). LEPs can be combined to describe categorical facial expressions as well as action units (AUs). Furthermore, LEPs can be weighted by confidence scores provided by an autoencoder network. Such network is trained to locally capture the manifold of the non-occluded training data in a hierarchical way. Extensive experiments show that the proposed LEP representation yields high descriptive power for categorical expressions and AU occurrence prediction, and leads to interesting perspectives towards the design of occlusion-robust and confidence-aware FER systems.

  • Predicting Foreground Object Ambiguity and Efficiently Crowdsourcing the Segmentation(s)
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-02-05
    Danna Gurari, Kun He, Bo Xiong, Jianming Zhang, Mehrnoosh Sameki, Suyog Dutt Jain, Stan Sclaroff, Margrit Betke, Kristen Grauman

    We propose the ambiguity problem for the foreground object segmentation task and motivate the importance of estimating and accounting for this ambiguity when designing vision systems. Specifically, we distinguish between images which lead multiple annotators to segment different foreground objects (ambiguous) versus minor inter-annotator differences of the same object. Taking images from eight widely used datasets, we crowdsource labeling the images as “ambiguous” or “not ambiguous” to segment in order to construct a new dataset we call STATIC. Using STATIC, we develop a system that automatically predicts which images are ambiguous. Experiments demonstrate the advantage of our prediction system over existing saliency-based methods on images from vision benchmarks and images taken by blind people who are trying to recognize objects in their environment. Finally, we introduce a crowdsourcing system to achieve cost savings for collecting the diversity of all valid “ground truth” foreground object segmentations by collecting extra segmentations only when ambiguity is expected. Experiments show our system eliminates up to 47% of human effort compared to existing crowdsourcing methods with no loss in capturing the diversity of ground truths.

  • Label Propagation with Ensemble of Pairwise Geometric Relations: Towards Robust Large-Scale Retrieval of Object Instances
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-31
    Xiaomeng Wu, Kaoru Hiramatsu, Kunio Kashino

    Spatial verification methods permit geometrically stable image matching, but still involve a difficult trade-off between robustness as regards incorrect rejection of true correspondences and discriminative power in terms of mismatches. To address this issue, we ask whether an ensemble of weak geometric constraints that correlates with visual similarity only slightly better than a bag-of-visual-words model performs better than a single strong constraint. We consider a family of spatial verification methods and decompose them into fundamental constraints imposed on pairs of feature correspondences. Encompassing such constraints leads us to propose a new method, which takes the best of existing techniques and functions as a unified Ensemble of pAirwise GEometric Relations (EAGER), in terms of both spatial contexts and between-image transformations. We also introduce a novel and robust reranking method, in which the object instances localized by EAGER in high-ranked database images are reissued as new queries. EAGER is extended to develop a smoothness constraint where the similarity between the optimized ranking scores of two instances should be maximally consistent with their geometrically constrained similarity. Reranking is newly formulated as two label propagation problems: one is to assess the confidence of new queries and the other to aggregate new independently executed retrievals. Extensive experiments conducted on four datasets show that EAGER and our reranking method outperform most of their state-of-the-art counterparts, especially when large-scale visual vocabularies are used.

  • Learning Latent Representations of 3D Human Pose with Deep Neural Networks
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-31
    Isinsu Katircioglu, Bugra Tekin, Mathieu Salzmann, Vincent Lepetit, Pascal Fua

    Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from an image to a 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images or 2D joint location heatmaps that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and accounts for joint dependencies. We further propose an efficient Long Short-Term Memory network to enforce temporal consistency on 3D pose predictions. We demonstrate that our approach achieves state-of-the-art performance both in terms of structure preservation and prediction accuracy on standard 3D human pose estimation benchmarks.

  • Occlusion-Aware 3D Morphable Models and an Illumination Prior for Face Image Analysis
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-31
    Bernhard Egger, Sandro Schönborn, Andreas Schneider, Adam Kortylewski, Andreas Morel-Forster, Clemens Blumer, Thomas Vetter

    Faces in natural images are often occluded by a variety of objects. We propose a fully automated, probabilistic and occlusion-aware 3D morphable face model adaptation framework following an analysis-by-synthesis setup. The key idea is to segment the image into regions explained by separate models. Our framework includes a 3D morphable face model, a prototype-based beard model and a simple model for occlusions and background regions. The segmentation and all the model parameters have to be inferred from the single target image. Face model adaptation and segmentation are solved jointly using an expectation–maximization-like procedure. During the E-step, we update the segmentation and in the M-step the face model parameters are updated. For face model adaptation we apply a stochastic sampling strategy based on the Metropolis–Hastings algorithm. For segmentation, we apply loopy belief propagation for inference in a Markov random field. Illumination estimation is critical for occlusion handling. Our combined segmentation and model adaptation needs a proper initialization of the illumination parameters. We propose a RANSAC-based robust illumination estimation technique. By applying this method to a large face image database we obtain a first empirical distribution of real-world illumination conditions. The obtained empirical distribution is made publicly available and can be used as prior in probabilistic frameworks, for regularization or to synthesize data for deep learning methods.

  • Graph-Based Slice-to-Volume Deformable Registration
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-22
    Enzo Ferrante, Nikos Paragios

    Deformable image registration is a fundamental problem in computer vision and medical image computing. In this paper we investigate the use of graphical models in the context of a particular type of image registration problem, known as slice-to-volume registration. We introduce a scalable, modular and flexible formulation that can accommodate low-rank and high order terms, that simultaneously selects the plane and estimates the in-plane deformation through a single shot optimization approach. The proposed framework is instantiated into different variants seeking either a compromise between computational efficiency (soft plane selection constraints and approximate definition of the data similarity terms through pair-wise components) or exact definition of the data terms and the constraints on the plane selection. Simulated and real-data in the context of ultrasound and magnetic resonance registration (where both framework instantiations as well as different optimization strategies are considered) demonstrate the potentials of our method.

  • Baseline and Triangulation Geometry in a Standard Plenoptic Camera
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-20
    Christopher Hahne, Amar Aggoun, Vladan Velisavljevic, Susanne Fiebig, Matthias Pesch

    In this paper, we demonstrate light field triangulation to determine depth distances and baselines in a plenoptic camera. Advances in micro lenses and image sensors have enabled plenoptic cameras to capture a scene from different viewpoints with sufficient spatial resolution. While object distances can be inferred from disparities in a stereo viewpoint pair using triangulation, this concept remains ambiguous when applied in the case of plenoptic cameras. We present a geometrical light field model allowing the triangulation to be applied to a plenoptic camera in order to predict object distances or specify baselines as desired. It is shown that distance estimates from our novel method match those of real objects placed in front of the camera. Additional benchmark tests with an optical design software further validate the model’s accuracy with deviations of less than ±0.33% for several main lens types and focus settings. A variety of applications in the automotive and robotics field can benefit from this estimation model.

  • Efficient Label Collection for Image Datasets via Hierarchical Clustering
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-24
    Maggie Wigness, Bruce A. Draper, J. Ross Beveridge

    Raw visual data used to train classifiers is abundant and easy to gather, but lacks semantic labels that describe visual concepts of interest. These labels are necessary for supervised learning and can require significant human effort to collect. We discuss four labeling objectives that play an important role in the design of frameworks aimed at collecting label information for large training sets while maintaining low human effort: discovery, efficiency, exploitation and accuracy. We introduce a framework that explicitly models and balances these four labeling objectives with the use of (1) hierarchical clustering, (2) a novel interestingness measure that defines structural change within the hierarchy, and (3) an iterative group-based labeling process that exploits relationships between labeled and unlabeled data. Results on benchmark data show that our framework collects labeled training data more efficiently than existing labeling techniques and trains higher performing visual classifiers. Further, we show that our resulting framework is fast and significantly reduces human interaction time when labeling real-world multi-concept imagery depicting outdoor environments.

  • On the Beneficial Effect of Noise in Vertex Localization
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-09-19
    Konstantinos A. Raftopoulos, Stefanos D. Kollias, Dionysios D. Sourlas, Marin Ferecatu

    A theoretical and experimental analysis related to the effect of noise in the task of vertex identification in unknown shapes is presented. Shapes are seen as real functions of their closed boundary. An alternative global perspective of curvature is examined providing insight into the process of noise-enabled vertex localization. The analysis reveals that noise facilitates in the localization of certain vertices. The concept of noising is thus considered and a relevant global method for localizing Global Vertices is investigated in relation to local methods under the presence of increasing noise. Theoretical analysis reveals that induced noise can indeed help localizing certain vertices if combined with global descriptors. Experiments with noise and a comparison to localized methods validate the theoretical results.

  • Learning to Detect Good 3D Keypoints
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-08
    Alessio Tonioni, Samuele Salti, Federico Tombari, Riccardo Spezialetti, Luigi Di Stefano

    The established approach to 3D keypoint detection consists in defining effective handcrafted saliency functions based on geometric cues with the aim of maximizing keypoint repeatability. Differently, the idea behind our work is to learn a descriptor-specific keypoint detector so as to optimize the end-to-end performance of the feature matching pipeline. Accordingly, we cast 3D keypoint detection as a classification problem between surface patches that can or cannot be matched correctly by a given 3D descriptor, i.e. those either good or not in respect to that descriptor. We propose a machine learning framework that allows for defining examples of good surface patches from the training data and leverages Random Forest classifiers to realize both fixed-scale and adaptive-scale 3D keypoint detectors. Through extensive experiments on standard datasets, we show how feature matching performance improves significantly by deploying 3D descriptors together with companion detectors learned by our methodology with respect to the adoption of established state-of-the-art 3D detectors based on hand-crafted saliency functions.

  • Attentive Systems: A Survey
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-09-15
    Tam V. Nguyen, Qi Zhao, Shuicheng Yan

    Visual saliency analysis detects salient regions/objects that attract human attention in natural scenes. It has attracted intensive research in different fields such as computer vision, computer graphics, and multimedia. While many such computational models exist, the focused study of what and how applications can be beneficial is still lacking. In this article, our ultimate goal is thus to provide a comprehensive review of the applications using saliency cues, the so-called attentive systems. We would like to provide a broad vision about saliency applications and what visual saliency can do. We categorize the vast amount of applications into different areas such as computer vision, computer graphics, and multimedia. Intensively covering 200+ publications we survey (1) key application trends, (2) the role of visual saliency, and (3) the usability of saliency into different tasks.

  • Discriminative Correlation Filter Tracker with Channel and Spatial Reliability
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-08
    Alan Lukežič, Tomáš Vojíř, Luka Čehovin Zajc, Jiří Matas, Matej Kristan

    Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a learning algorithm for its efficient and seamless integration in the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This both allows to enlarge the search region and improves tracking of non-rectangular objects. Reliability scores reflect channel-wise quality of the learned filters and are used as feature weighting coefficients in localization. Experimentally, with only two simple standard feature sets, HoGs and colornames, the novel CSR-DCF method—DCF with channel and spatial reliability—achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB100. The CSR-DCF runs close to real-time on a CPU.

  • Separable Anisotropic Diffusion
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-05
    Roi Méndez-Rial, Julio Martín-Herrero

    Anisotropic diffusion has many applications in image processing, but the high computational cost usually requires accuracy trade-offs in order to grant its applicability in practical problems. This is specially true when dealing with 3D images, where anisotropic diffusion should be able to provide interesting results for many applications, but the usual implementation methods greatly scale in complexity with the additional dimension. Here we propose a separable implementation of the most general anisotropic diffusion formulation, based on Gaussian convolutions, whose favorable computational complexity scales linearly with the number of dimensions, without any assumptions about specific parameterizations. We also present variants that bend the Gaussian kernels for improved results when dealing with highly anisotropic curved or sharp structures. We test the accuracy, speed, stability, and scale-space properties of the proposed methods, and present some results (both synthetic and real) which show their advantages, including up to 60 times faster computation in 3D with respect to the explicit method, improved accuracy and stability, and min–max preservation.

  • Top-Down Neural Attention by Excitation Backprop
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-12-23
    Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, Stan Sclaroff

    We aim to model the top-down attention of a convolutional neural network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, to pass along top-down signals downwards in the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. We show a theoretic connection between the proposed contrastive attention formulation and the Class Activation Map computation. Efficient implementation of Excitation Backprop for common neural network layers is also presented. In experiments, we visualize the evidence of a model’s classification decision by computing the proposed top-down attention maps. For quantitative evaluation, we report the accuracy of our method in weakly supervised localization tasks on the MS COCO, PASCAL VOC07 and ImageNet datasets. The usefulness of our method is further validated in the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising performance in phrase localization by leveraging the top-down attention of a CNN model that has been trained on weakly labeled web images. Finally, we demonstrate applications of our method in model interpretation and data annotation assistance for facial expression analysis and medical imaging tasks.

  • SDF-2-SDF Registration for Real-Time 3D Reconstruction from RGB-D Data
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-12-18
    Miroslava Slavcheva, Wadim Kehl, Nassir Navab, Slobodan Ilic

    We tackle the task of dense 3D reconstruction from RGB-D data. Contrary to the majority of existing methods, we focus not only on trajectory estimation accuracy, but also on reconstruction precision. The key technique is SDF-2-SDF registration, which is a correspondence-free, symmetric, dense energy minimization method, performed via the direct voxel-wise difference between a pair of signed distance fields. It has a wider convergence basin than traditional point cloud registration and cloud-to-volume alignment techniques. Furthermore, its formulation allows for straightforward incorporation of photometric and additional geometric constraints. We employ SDF-2-SDF registration in two applications. First, we perform small-to-medium scale object reconstruction entirely on the CPU. To this end, the camera is tracked frame-to-frame in real time. Then, the initial pose estimates are refined globally in a lightweight optimization framework, which does not involve a pose graph. We combine these procedures into our second, fully real-time application for larger-scale object reconstruction and SLAM. It is implemented as a hybrid system, whereby tracking is done on the GPU, while refinement runs concurrently over batches on the CPU. To bound memory and runtime footprints, registration is done over a fixed number of limited-extent volumes, anchored at geometry-rich locations. Extensive qualitative and quantitative evaluation of both trajectory accuracy and model fidelity on several public RGB-D datasets, acquired with various quality sensors, demonstrates higher precision than related techniques.

  • RAW Image Reconstruction Using a Self-contained sRGB–JPEG Image with Small Memory Overhead
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-12-18
    Rang M. H. Nguyen, Michael S. Brown

    Most camera images are saved as 8-bit standard RGB (sRGB) compressed JPEGs. Even when JPEG compression is set to its highest quality, the encoded sRGB image has been significantly processed in terms of color and tone manipulation. This makes sRGB–JPEG images undesirable for many computer vision tasks that assume a direct relationship between pixel values and incoming light. For such applications, the RAW image format is preferred, as RAW represents a minimally processed, sensor-specific RGB image that is linear with respect to scene radiance. The drawback with RAW images, however, is that they require large amounts of storage and are not well-supported by many imaging applications. To address this issue, we present a method to encode the necessary data within an sRGB–JPEG image to reconstruct a high-quality RAW image. Our approach requires no calibration of the camera’s colorimetric properties and can reconstruct the original RAW to within 0.5% error with a small memory overhead for the additional data (e.g., 128 KB). More importantly, our output is a fully self-contained 100% compliant sRGB–JPEG file that can be used as-is, not affecting any existing image workflow—the RAW image data can be extracted when needed, or ignored otherwise. We detail our approach and show its effectiveness against competing strategies.

  • Hallucinating Compressed Face Images
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-12-08
    Chih-Yuan Yang, Sifei Liu, Ming-Hsuan Yang

    A face hallucination algorithm is proposed to generate high-resolution images from JPEG compressed low-resolution inputs by decomposing a deblocked face image into structural regions such as facial components and non-structural regions like the background. For structural regions, landmarks are used to retrieve adequate high-resolution component exemplars in a large dataset based on the estimated head pose and illumination condition. For non-structural regions, an efficient generic super resolution algorithm is applied to generate high-resolution counterparts. Two sets of gradient maps extracted from these two regions are combined to guide an optimization process of generating the hallucination image. Numerous experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art hallucination methods on JPEG compressed face images with different poses, expressions, and illumination conditions.

  • Learning Image Representations Tied to Egomotion from Unlabeled Video
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-03-04
    Dinesh Jayaraman, Kristen Grauman

    Understanding how images of objects and scenes behave in response to specific egomotions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose a new “embodied” visual learning paradigm, exploiting proprioceptive motor signals to train visual representations from egocentric video with no manual supervision. Specifically, we enforce that our learned features exhibit equivariance i.e., they respond predictably to transformations associated with distinct egomotions. With three datasets, we show that our unsupervised feature learning approach significantly outperforms previous approaches on visual recognition and next-best-view prediction tasks. In the most challenging test, we show that features learned from video captured on an autonomous driving platform improve large-scale scene recognition in static images from a disjoint domain.

  • Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-29
    Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

    We propose a Deep Learning approach to the visual question answering task, where machines answer to questions about real-world images. By combining latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We evaluate our approaches on the DAQUAR as well as the VQA dataset where we also report various baselines, including an analysis how much information is contained in the language part only. To study human consensus, we propose two novel metrics and collect additional answers which extend the original DAQUAR dataset to DAQUAR-Consensus. Finally, we evaluate a rich set of design choices how to encode, combine and decode information in our proposed Deep Learning formulation.

  • How Good Is My Test Data? Introducing Safety Analysis for Computer Vision
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-06-09
    Oliver Zendel, Markus Murschitz, Martin Humenberger, Wolfgang Herzner

    Good test data is crucial for driving new developments in computer vision (CV), but two questions remain unanswered: which situations should be covered by the test data, and how much testing is enough to reach a conclusion? In this paper we propose a new answer to these questions using a standard procedure devised by the safety community to validate complex systems: the hazard and operability analysis (HAZOP). It is designed to systematically identify possible causes of system failure or performance loss. We introduce a generic CV model that creates the basis for the hazard analysis and—for the first time—apply an extensive HAZOP to the CV domain. The result is a publicly available checklist with more than 900 identified individual hazards. This checklist can be utilized to evaluate existing test datasets by quantifying the covered hazards. We evaluate our approach by first analyzing and annotating the popular stereo vision test datasets Middlebury and KITTI. Second, we demonstrate a clearly negative influence of the hazards in the checklist on the performance of six popular stereo matching algorithms. The presented approach is a useful tool to evaluate and improve test datasets and creates a common basis for future dataset designs.

  • Global, Dense Multiscale Reconstruction for a Billion Points
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-06-03
    Benjamin Ummenhofer, Thomas Brox

    We present a variational approach for surface reconstruction from a set of oriented points with scale information. We focus particularly on scenarios with nonuniform point densities due to images taken from different distances. In contrast to previous methods, we integrate the scale information in the objective and globally optimize the signed distance function of the surface on a balanced octree grid. We use a finite element discretization on the dual structure of the octree minimizing the number of variables. The tetrahedral mesh is generated efficiently with a lookup table which allows to map octree cells to the nodes of the finite elements. We optimize memory efficiency by data aggregation, such that robust data terms can be used even on very large scenes. The surface normals are explicitly optimized and used for surface extraction to improve the reconstruction at edges and corners.

  • Holistically-Nested Edge Detection
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-03-15
    Saining Xie, Zhuowen Tu

    We develop a new edge detection algorithm that addresses two important issues in this long-standing vision problem: (1) holistic image training and prediction; and (2) multi-scale and multi-level feature learning. Our proposed method, holistically-nested edge detection (HED), performs image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets. HED automatically learns rich hierarchical representations (guided by deep supervision on side responses) that are important in order to resolve the challenging ambiguity in edge and object boundary detection. We significantly advance the state-of-the-art on the BSDS500 dataset (ODS F-score of 0.790) and the NYU Depth dataset (ODS F-score of 0.746), and do so with an improved speed (0.4 s per image) that is orders of magnitude faster than some CNN-based edge detection algorithms developed before HED. We also observe encouraging results on other boundary detection benchmark datasets such as Multicue and PASCAL-Context.

  • Mutual-Structure for Joint Filtering
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-06-03
    Xiaoyong Shen, Chao Zhou, Li Xu, Jiaya Jia

    Previous joint/guided filters directly transfer structural information from the reference to the target image. In this paper, we analyze the major drawback—that is, there may be completely different edges in the two images. Simply considering all patterns could introduce significant errors. To address this issue, we propose the concept of mutual-structure, which refers to the structural information that is contained in both images and thus can be safely enhanced by joint filtering. We also use an untraditional objective function that can be efficiently optimized to yield mutual structure. Our method results in important edge preserving property, which greatly benefits depth completion, optical flow estimation, image enhancement, stereo matching, to name a few.

  • Depth Sensing Using Geometrically Constrained Polarization Normals
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-06-22
    Achuta Kadambi, Vage Taamazyan, Boxin Shi, Ramesh Raskar

    Analyzing the polarimetric properties of reflected light is a potential source of shape information. However, it is well-known that polarimetric information contains fundamental shape ambiguities, leading to an underconstrained problem of recovering 3D geometry. To address this problem, we use additional geometric information, from coarse depth maps, to constrain the shape information from polarization cues. Our main contribution is a framework that combines surface normals from polarization (hereafter polarization normals) with an aligned depth map. The additional geometric constraints are used to mitigate physics-based artifacts, such as azimuthal ambiguity, refractive distortion and fronto-parallel signal degradation. We believe our work may have practical implications for optical engineering, demonstrating a new option for state-of-the-art 3D reconstruction.

  • 3D Time-Lapse Reconstruction from Internet Photos
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-03-21
    Ricardo Martin-Brualla, David Gallup, Steven M. Seitz

    Given an Internet photo collection of a landmark, we compute a 3D time-lapse video sequence where a virtual camera moves continuously in time and space. While previous work assumed a static camera, the addition of camera motion during the time-lapse creates a very compelling impression of parallax. Achieving this goal, however, requires addressing multiple technical challenges, including solving for time-varying depth maps, regularizing 3D point color profiles over time, and reconstructing high quality, hole-free images at every frame from the projected profiles. Our results show photorealistic time-lapses of skylines and natural scenes over many years, with dramatic parallax effects.

  • Automatic Registration of Images to Untextured Geometry Using Average Shading Gradients
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-06-06
    Tobias Plötz, Stefan Roth

    Many existing approaches for image-to-geometry registration assume that either a textured 3D model or a good initial guess of the 3D pose is available to bootstrap the registration process. In this paper we consider the registration of photographs to 3D models even when no texture information is available. This is very challenging as we cannot rely on texture gradients, and even shading gradients are hard to estimate since the lighting conditions are unknown. To that end, we propose average shading gradients, a rendering technique that estimates the average gradient magnitude over all lighting directions under Lambertian shading. We use this gradient representation as the building block of a registration pipeline based on matching sparse features. To cope with inevitable false matches due to the missing texture information and to increase robustness, the pose of the 3D model is estimated in two stages. Coarse pose hypotheses are first obtained from a single correct match each, subsequently refined using SIFT flow, and finally verified. We apply our algorithm to registering images of real-world objects to untextured 3D meshes of limited accuracy. Moreover, we show that registration can be performed even for paintings despite lacking photo-realism.

  • Defining the Pose of Any 3D Rigid Object and an Associated Distance
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-11-24
    Romain Brégier, Frédéric Devernay, Laetitia Leyrit, James L. Crowley

    The pose of a rigid object is usually regarded as a rigid transformation, described by a translation and a rotation. However, equating the pose space with the space of rigid transformations is in general abusive, as it does not account for objects with proper symmetries—which are common among man-made objects. In this article, we define pose as a distinguishable static state of an object, and equate a pose to a set of rigid transformations. Based solely on geometric considerations, we propose a frame-invariant metric on the space of possible poses, valid for any physical rigid object, and requiring no arbitrary tuning. This distance can be evaluated efficiently using a representation of poses within a Euclidean space of at most 12 dimensions depending on the object’s symmetries. This makes it possible to efficiently perform neighborhood queries such as radius searches or k-nearest neighbor searches within a large set of poses using off-the-shelf methods. Pose averaging considering this metric can similarly be performed easily, using a projection function from the Euclidean space onto the pose space. The practical value of those theoretical developments is illustrated with an application of pose estimation of instances of a 3D rigid object given an input depth map, via a Mean Shift procedure.

  • From Facial Expression Recognition to Interpersonal Relation Prediction
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-11-24
    Zhanpeng Zhang, Ping Luo, Chen Change Loy, Xiaoou Tang

    Interpersonal relation defines the association, e.g., warm, friendliness, and dominance, between two or more people. We investigate if such fine-grained and high-level relation traits can be characterized and quantified from face images in the wild. We address this challenging problem by first studying a deep network architecture for robust recognition of facial expressions. Unlike existing models that typically learn from facial expression labels alone, we devise an effective multitask network that is capable of learning from rich auxiliary attributes such as gender, age, and head pose, beyond just facial expression data. While conventional supervised training requires datasets with complete labels (e.g., all samples must be labeled with gender, age, and expression), we show that this requirement can be relaxed via a novel attribute propagation method. The approach further allows us to leverage the inherent correspondences between heterogeneous attribute sources despite the disparate distributions of different datasets. With the network we demonstrate state-of-the-art results on existing facial expression recognition benchmarks. To predict inter-personal relation, we use the expression recognition network as branches for a Siamese model. Extensive experiments show that our model is capable of mining mutual context of faces for accurate fine-grained interpersonal prediction.

  • No-Reference Image Quality Assessment for Image Auto-Denoising
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-11-17
    Xiangfei Kong, Qingxiong Yang

    This paper proposes two new non-reference image quality metrics that can be adopted by the state-of-the-art image/video denoising algorithms for auto-denoising. The first metric is proposed based on the assumption that the noise should be independent of the original image. A direct measurement of this dependence is, however, impractical due to the relatively low accuracy of existing denoising method. The proposed metric thus tackles the homogeneous regions and highly-structured regions separately. Nevertheless, this metric is only stable when the noise level is relatively low. Most denoising algorithms reduce noise by (weighted) averaging repeated noisy measurements. As a result, another metric is proposed for high-level noise based on the fact that more noisy measurements will be required when the noise level increases. The number of measurements before converging is thus related to the quality of noisy images. Our patch-matching based metric proposes to iteratively find and add noisy image measurements for averaging until there is no visible difference between two successively averaged images. Both metrics are evaluated on LIVE2 (Sheikh et al. in LIVE image quality assessment database release 2: 2013) and TID2013 (Ponomarenko et al. in Color image database tid2013: Peculiarities and preliminary results: 2005) data sets using standard Spearman and Kendall rank-order correlation coefficients (ROCC), showing that they subjectively outperforms current state-of-the-art no-reference metrics. Quantitative evaluation w.r.t. different level of synthetic noisy images also demonstrates consistently higher performance over state-of-the-art non-reference metrics when used for image denoising.

  • Focal Flow: Velocity and Depth from Differential Defocus Through Motion
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-11-13
    Emma Alexander, Qi Guo, Sanjeev Koppal, Steven J. Gortler, Todd Zickler

    We present the focal flow sensor. It is an unactuated, monocular camera that simultaneously exploits defocus and differential motion to measure a depth map and a 3D scene velocity field. It does this using an optical-flow-like, per-pixel linear constraint that relates image derivatives to depth and velocity. We derive this constraint, prove its invariance to scene texture, and prove that it is exactly satisfied only when the sensor’s blur kernels are Gaussian. We analyze the inherent sensitivity of the focal flow cue, and we build and test a prototype. Experiments produce useful depth and velocity information for a broader set of aperture configurations, including a simple lens with a pillbox aperture.

  • Classification of Multi-class Daily Human Motion using Discriminative Body Parts and Sentence Descriptions
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-11-10
    Yusuke Goutsu, Wataru Takano, Yoshihiko Nakamura

    In this paper, we propose a motion model that focuses on the discriminative parts of the human body related to target motions to classify human motions into specific categories, and apply this model to multi-class daily motion classifications. We extend this model to a motion recognition system which generates multiple sentences associated with human motions. The motion model is evaluated with the following four datasets acquired by a Kinect sensor or multiple infrared cameras in a motion capture studio: UCF-kinect; UT-kinect; HDM05-mocap; and YNL-mocap. We also evaluate the sentences generated from the dataset of motion and language pairs. The experimental results indicate that the motion model improves classification accuracy and our approach is better than other state-of-the-art methods for specific datasets, including human–object interactions with variations in the duration of motions, such as daily human motions. We achieve a classification rate of 81.1% for multi-class daily motion classifications in a non cross-subject setting. Additionally, the sentences generated by the motion recognition system are semantically and syntactically appropriate for the description of the target motion, which may lead to human–robot interaction using natural language.

  • Visual Tracking via Subspace Learning: A Discriminative Approach
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-11-10
    Yao Sui, Yafei Tang, Li Zhang, Guanghui Wang

    Good tracking performance is in general attributed to accurate representation over previously obtained targets and/or reliable discrimination between the target and the surrounding background. In this work, a robust tracker is proposed by integrating the advantages of both approaches. A subspace is constructed to represent the target and the neighboring background, and their class labels are propagated simultaneously via the learned subspace. In addition, a novel criterion is proposed, by taking account of both the reliability of discrimination and the accuracy of representation, to identify the target from numerous target candidates in each frame. Thus, the ambiguity in the class labels of neighboring background samples, which influences the reliability of the discriminative tracking model, is effectively alleviated, while the training set still remains small. Extensive experiments demonstrate that the proposed approach outperforms most state-of-the-art trackers.

  • EMVS: Event-Based Multi-View Stereo—3D Reconstruction with an Event Camera in Real-Time
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-11-07
    Henri Rebecq, Guillermo Gallego, Elias Mueggler, Davide Scaramuzza

    Event cameras are bio-inspired vision sensors that output pixel-level brightness changes instead of standard intensity frames. They offer significant advantages over standard cameras, namely a very high dynamic range, no motion blur, and a latency in the order of microseconds. However, because the output is composed of a sequence of asynchronous events rather than actual intensity images, traditional vision algorithms cannot be applied, so that a paradigm shift is needed. We introduce the problem of event-based multi-view stereo (EMVS) for event cameras and propose a solution to it. Unlike traditional MVS methods, which address the problem of estimating dense 3D structure from a set of known viewpoints, EMVS estimates semi-dense 3D structure from an event camera with known trajectory. Our EMVS solution elegantly exploits two inherent properties of an event camera: (1) its ability to respond to scene edges—which naturally provide semi-dense geometric information without any pre-processing operation—and (2) the fact that it provides continuous measurements as the sensor moves. Despite its simplicity (it can be implemented in a few lines of code), our algorithm is able to produce accurate, semi-dense depth maps, without requiring any explicit data association or intensity estimation. We successfully validate our method on both synthetic and real data. Our method is computationally very efficient and runs in real-time on a CPU.

  • Do Semantic Parts Emerge in Convolutional Neural Networks?
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-10-17
    Abel Gonzalez-Garcia, Davide Modolo, Vittorio Ferrari

    Semantic object parts can be useful for several visual recognition tasks. Lately, these tasks have been addressed using Convolutional Neural Networks (CNN), achieving outstanding results. In this work we study whether CNNs learn semantic parts in their internal representation. We investigate the responses of convolutional filters and try to associate their stimuli with semantic parts. We perform two extensive quantitative analyses. First, we use ground-truth part bounding-boxes from the PASCAL-Part dataset to determine how many of those semantic parts emerge in the CNN. We explore this emergence for different layers, network depths, and supervision levels. Second, we collect human judgements in order to study what fraction of all filters systematically fire on any semantic part, even if not annotated in PASCAL-Part. Moreover, we explore several connections between discriminative power and semantics. We find out which are the most discriminative filters for object recognition, and analyze whether they respond to semantic parts or to other image patches. We also investigate the other direction: we determine which semantic parts are the most discriminative and whether they correspond to those parts emerging in the network. This enables to gain an even deeper understanding of the role of semantic parts in the network.

  • Dense Reconstruction of Transparent Objects by Altering Incident Light Paths Through Refraction
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-09-30
    Kai Han, Kwan-Yee K. Wong, Miaomiao Liu

    This paper addresses the problem of reconstructing the surface shape of transparent objects. The difficulty of this problem originates from the viewpoint dependent appearance of a transparent object, which quickly makes reconstruction methods tailored for diffuse surfaces fail disgracefully. In this paper, we introduce a fixed viewpoint approach to dense surface reconstruction of transparent objects based on refraction of light. We present a simple setup that allows us to alter the incident light paths before light rays enter the object by immersing the object partially in a liquid, and develop a method for recovering the object surface through reconstructing and triangulating such incident light paths. Our proposed approach does not need to model the complex interactions of light as it travels through the object, neither does it assume any parametric form for the object shape nor the exact number of refractions and reflections taken place along the light paths. It can therefore handle transparent objects with a relatively complex shape and structure, with unknown and inhomogeneous refractive index. We also show that for thin transparent objects, our proposed acquisition setup can be further simplified by adopting a single refraction approximation. Experimental results on both synthetic and real data demonstrate the feasibility and accuracy of our proposed approach.

Some contents have been Reproduced with permission of the American Chemical Society.
Some contents have been Reproduced by permission of The Royal Society of Chemistry.
化学 • 材料 期刊列表