显示样式:     当前期刊: International Journal of Computer Vision    加入关注       排序: 导出
  • Efficiently Annotating Object Images with Absolute Size Information Using Mobile Devices
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-05-24
    Martin Hofmann, Marco Seeland, Patrick Mäder

    The projection of a real world scenery to a planar image sensor inherits the loss of information about the 3D structure as well as the absolute dimensions of the scene. For image analysis and object classification tasks, however, absolute size information can make results more accurate. Today, the creation of size annotated image datasets is effort intensive and typically requires measurement equipment not available to public image contributors. In this paper, we propose an effective annotation method that utilizes the camera within smart mobile devices to capture the missing size information along with the image. The approach builds on the fact that with a camera, calibrated to a specific object distance, lengths can be measured in the object’s plane. We use the camera’s minimum focus distance as calibration distance and propose an adaptive feature matching process for precise computation of the scale change between two images facilitating measurements on larger object distances. Eventually, the measured object is segmented and its size information is annotated for later analysis. A user study showed that humans are able to retrieve the calibration distance with a low variance. The proposed approach facilitates a measurement accuracy comparable to manual measurement with a ruler and outperforms state-of-the-art methods in terms of accuracy and repeatability. Consequently, the proposed method allows in-situ size annotation of objects in images without the need for additional equipment or an artificial reference object in the scene.

  • Group Collaborative Representation for Image Set Classification
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-05-23
    Bo Liu, Liping Jing, Jia Li, Jian Yu, Alex Gittens, Michael W. Mahoney

    With significant advances in imaging technology, multiple images of a person or an object are becoming readily available in a number of real-life scenarios. In contrast to single images, image sets can capture a broad range of variations in the appearance of a single face or object. Recognition from these multiple images (i.e., image set classification) has gained significant attention in the area of computer vision. Unlike many existing approaches, which assume that only the images in the same set affect each other, this work develops a group collaborative representation (GCR) model which makes no such assumption, and which can effectively determine the hidden structure among image sets. Specifically, GCR takes advantage of the relationship between image sets to capture the inter- and intra-set variations, and it determines the characteristic subspaces of all the gallery sets. In these subspaces, individual gallery images and each probe set can be effectively represented via a self-representation learning scheme, which leads to increased discriminative ability and enhances robustness and efficiency of the prediction process. By conducting extensive experiments and comparing with state-of-the-art, we demonstrated the superiority of the proposed method on set-based face recognition and object categorization tasks.

  • RED-Net: A Recurrent Encoder–Decoder Network for Video-Based Face Alignment
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-05-23
    Xi Peng, Rogerio S. Feris, Xiaoyu Wang, Dimitris N. Metaxas

    We propose a novel method for real-time face alignment in videos based on a recurrent encoder–decoder network model. Our proposed model predicts 2D facial point heat maps regularized by both detection and regression loss, while uniquely exploiting recurrent learning at both spatial and temporal dimensions. At the spatial level, we add a feedback loop connection between the combined output response map and the input, in order to enable iterative coarse-to-fine face alignment using a single network model, instead of relying on traditional cascaded model ensembles. At the temporal level, we first decouple the features in the bottleneck of the network into temporal-variant factors, such as pose and expression, and temporal-invariant factors, such as identity information. Temporal recurrent learning is then applied to the decoupled temporal-variant features. We show that such feature disentangling yields better generalization and significantly more accurate results at test time. We perform a comprehensive experimental analysis, showing the importance of each component of our proposed model, as well as superior results over the state of the art and several variations of our method in standard datasets.

  • Superpixel-Guided Two-View Deterministic Geometric Model Fitting
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-05-19
    Guobao Xiao, Hanzi Wang, Yan Yan, David Suter

    Geometric model fitting is a fundamental research topic in computer vision and it aims to fit and segment multiple-structure data. In this paper, we propose a novel superpixel-guided two-view geometric model fitting method (called SDF), which can obtain reliable and consistent results for real images. Specifically, SDF includes three main parts: a deterministic sampling algorithm, a model hypothesis updating strategy and a novel model selection algorithm. The proposed deterministic sampling algorithm generates a set of initial model hypotheses according to the prior information of superpixels. Then the proposed updating strategy further improves the quality of model hypotheses. After that, by analyzing the properties of the updated model hypotheses, the proposed model selection algorithm extends the conventional “fit-and-remove” framework to estimate model instances in multiple-structure data. The three parts are tightly coupled to boost the performance of SDF in both speed and accuracy, and SDF has the deterministic nature. Experimental results show that the proposed SDF has significant advantages over several state-of-the-art fitting methods when it is applied to real images with single-structure and multiple-structure data.

  • Understanding Image Representations by Measuring Their Equivariance and Equivalence
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-05-18
    Karel Lenc, Andrea Vedaldi

    Despite the importance of image representations such as histograms of oriented gradients and deep Convolutional Neural Networks (CNN), our theoretical understanding of them remains limited. Aimed at filling this gap, we investigate two key mathematical properties of representations: equivariance and equivalence. Equivariance studies how transformations of the input image are encoded by the representation, invariance being a special case where a transformation has no effect. Equivalence studies whether two representations, for example two different parameterizations of a CNN, two different layers, or two different CNN architectures, share the same visual information or not. A number of methods to establish these properties empirically are proposed, including introducing transformation and stitching layers in CNNs. These methods are then applied to popular representations to reveal insightful aspects of their structure, including clarifying at which layers in a CNN certain geometric invariances are achieved and how various CNN architectures differ. We identify several predictors of geometric and architectural compatibility, including the spatial resolution of the representation and the complexity and depth of the models. While the focus of the paper is theoretical, direct applications to structured-output regression are demonstrated too.

  • Fast Diffeomorphic Image Registration via Fourier-Approximated Lie Algebras
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-05-15
    Miaomiao Zhang, P. Thomas Fletcher

    This paper introduces Fourier-approximated Lie algebras for shooting (FLASH), a fast geodesic shooting algorithm for diffeomorphic image registration. We approximate the infinite-dimensional Lie algebra of smooth vector fields, i.e., the tangent space at the identity of the diffeomorphism group, with a low-dimensional, bandlimited space. We show that most of the computations for geodesic shooting can be carried out entirely in this low-dimensional space. Our algorithm results in dramatic savings in time and memory over traditional large-deformation diffeomorphic metric mapping algorithms, which require dense spatial discretizations of vector fields. To validate the effectiveness of FLASH, we run pairwise image registration on both 2D synthetic data and real 3D brain images and compare with the state-of-the-art geodesic shooting methods. Experimental results show that our algorithm dramatically reduces the computational cost and memory footprint of diffemorphic image registration with little or no loss of accuracy.

  • Facial Landmark Detection: A Literature Survey
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-05-08
    Yue Wu, Qiang Ji

    The locations of the fiducial facial landmark points around facial components and facial contour capture the rigid and non-rigid facial deformations due to head movements and facial expressions. They are hence important for various facial analysis tasks. Many facial landmark detection algorithms have been developed to automatically detect those key points over the years, and in this paper, we perform an extensive review of them. We classify the facial landmark detection algorithms into three major categories: holistic methods, Constrained Local Model (CLM) methods, and the regression-based methods. They differ in the ways to utilize the facial appearance and shape information. The holistic methods explicitly build models to represent the global facial appearance and shape information. The CLMs explicitly leverage the global shape model but build the local appearance models. The regression based methods implicitly capture facial shape and appearance information. For algorithms within each category, we discuss their underlying theories as well as their differences. We also compare their performances on both controlled and in the wild benchmark datasets, under varying facial expressions, head poses, and occlusion. Based on the evaluations, we point out their respective strengths and weaknesses. There is also a separate section to review the latest deep learning based algorithms. The survey also includes a listing of the benchmark databases and existing software. Finally, we identify future research directions, including combining methods in different categories to leverage their respective strengths to solve landmark detection “in-the-wild”.

  • Combining Multiple Cues for Visual Madlibs Question Answering
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-05-03
    Tatiana Tommasi, Arun Mallya, Bryan Plummer, Svetlana Lazebnik, Alexander C. Berg, Tamara L. Berg

    This paper presents an approach for answering fill-in-the-blank multiple choice questions from the Visual Madlibs dataset. Instead of generic and commonly used representations trained on the ImageNet classification task, our approach employs a combination of networks trained for specialized tasks such as scene recognition, person activity classification, and attribute prediction. We also present a method for localizing phrases from candidate answers in order to provide spatial support for feature extraction. We map each of these features, together with candidate answers, to a joint embedding space through normalized canonical correlation analysis (nCCA). Finally, we solve an optimization problem to learn to combine scores from nCCA models trained on multiple cues to select the best answer. Extensive experimental results show a significant improvement over the previous state of the art and confirm that answering questions from a wide range of types benefits from examining a variety of image cues and carefully choosing the spatial support for feature extraction.

  • An Approximate Shading Model with Detail Decomposition for Object Relighting
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-04-30
    Zicheng Liao, Kevin Karsch, Hongyi Zhang, David Forsyth

    We present an object relighting system that allows an artist to select an object from an image and insert it into a target scene. Through simple interactions, the system can adjust illumination on the inserted object so that it appears naturally in the scene. To support image-based relighting, we build object model from the image, and propose a perceptually-inspired approximate shading model for the relighting. It decomposes the shading field into (a) a rough shape term that can be reshaded, (b) a parametric shading detail that encodes missing features from the first term, and (c) a geometric detail term that captures fine-scale material properties. With this decomposition, the shading model combines 3D rendering and image-based composition and allows more flexible compositing than image-based methods. Quantitative evaluation and a set of user studies suggest our method is a promising alternative to existing methods of object insertion.

  • Structural Constraint Data Association for Online Multi-object Tracking
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-04-27
    Ju Hong Yoon, Chang-Ryeol Lee, Ming-Hsuan Yang, Kuk-Jin Yoon

    Online two-dimensional (2D) multi-object tracking (MOT) is a challenging task when the objects of interest have similar appearances. In that case, the motion of objects is another helpful cue for tracking and discriminating multiple objects. However, when using a single moving camera for online 2D MOT, observable motion cues are contaminated by global camera movements and, thus, are not always predictable. To deal with unexpected camera motion, we propose a new data association method that effectively exploits structural constraints in the presence of large camera motion. In addition, to reduce incorrect associations with mis-detections and false positives, we develop a novel event aggregation method to integrate assignment costs computed by structural constraints. We also utilize structural constraints to track missing objects when they are re-detected again. By doing this, identities of the missing objects can be retained continuously. Experimental results validated the effectiveness of the proposed data association algorithm under unexpected camera motions. In addition, tracking results on a large number of benchmark datasets demonstrated that the proposed MOT algorithm performs robustly and favorably against various online methods in terms of several quantitative metrics, and that its performance is comparable to offline methods.

  • Joint Contour Filtering
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-04-23
    Xing Wei, Qingxiong Yang, Yihong Gong

    Edge/structure-preserving operations for images aim to smooth images without blurring the edges/structures. Many exemplary edge-preserving filtering methods have recently been proposed to reduce the computational complexity and/or separate structures of different scales. They normally adopt a user-selected scale measurement to control the detail smoothing. However, natural photos contain objects of different sizes, which cannot be described by a single scale measurement. On the other hand, contour analysis is closely related to edge-preserving filtering, and significant progress has recently been achieved. Nevertheless, the majority of state-of-the-art filtering techniques have ignored the successes in this area. Inspired by the fact that learning-based edge detectors significantly outperform traditional manually-designed detectors, this paper proposes a learning-based edge-preserving filtering technique. It synergistically combines the differential operations in edge-preserving filters with the effectiveness of the recent edge detectors for scale-aware filtering. Unlike previous filtering methods, the proposed filters can efficiently extract subjectively meaningful structures from natural scenes containing multiple-scale objects.

  • Elastic Alignment of Triangular Surface Meshes
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-04-21
    Zsolt Sánta, Zoltan Kato

    A novel region-based approach is proposed to find a thin plate spline map between a pair of deformable 3D objects represented by triangular surface meshes. The proposed method works without landmark extraction and feature correspondences. The aligning transformation is simply found by solving a system of integral equations. Each equation is generated by integrating a non-linear function over the object domains. We derive recursive formulas for the efficient computation of these integrals for open and closed surface meshes. Based on a series of comparative tests on a large synthetic dataset, our triangular mesh-based algorithm outperforms state of the art methods both in terms of computing time and accuracy. The applicability of the proposed approach has been demonstrated on the registration of 3D lung CT volumes, brain surfaces and 3D human faces.

  • Artistic Style Transfer for Videos and Spherical Images
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-04-21
    Manuel Ruder, Alexey Dosovitskiy, Thomas Brox

    Manually re-drawing an image in a certain artistic style takes a professional artist a long time. Doing this for a video sequence single-handedly is beyond imagination. We present two computational approaches that transfer the style from one image (for example, a painting) to a whole video sequence. In our first approach, we adapt to videos the original image style transfer technique by Gatys et al. based on energy minimization. We introduce new ways of initialization and new loss functions to generate consistent and stable stylized video sequences even in cases with large motion and strong occlusion. Our second approach formulates video stylization as a learning problem. We propose a deep network architecture and training procedures that allow us to stylize arbitrary-length videos in a consistent and stable way, and nearly in real time. We show that the proposed methods clearly outperform simpler baselines both qualitatively and quantitatively. Finally, we propose a way to adapt these approaches also to 360\(^\circ \) images and videos as they emerge with recent virtual reality hardware.

  • Depth-Based Hand Pose Estimation: Methods, Data, and Challenges
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-04-12
    James Steven Supančič, Grégory Rogez, Yi Yang, Jamie Shotton, Deva Ramanan

    Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state-of-the-art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and have released software and evaluation code. We summarize important conclusions here: (1) Coarse pose estimation appears viable for scenes with isolated hands. However, high precision pose estimation [required for immersive virtual reality and cluttered scenes (where hands may be interacting with nearby objects and surfaces) remain a challenge. To spur further progress we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define a consistent evaluation criteria, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems. This implies that most systems do not generalize beyond their training sets. This also reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.

  • Multi-label Learning with Missing Labels Using Mixed Dependency Graphs
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-04-06
    Baoyuan Wu, Fan Jia, Wei Liu, Bernard Ghanem, Siwei Lyu

    This work focuses on the problem of multi-label learning with missing labels (MLML), which aims to label each test instance with multiple class labels given training instances that have an incomplete/partial set of these labels (i.e., some of their labels are missing). The key point to handle missing labels is propagating the label information from the provided labels to missing labels, through a dependency graph that each label of each instance is treated as a node. We build this graph by utilizing different types of label dependencies. Specifically, the instance-level similarity is served as undirected edges to connect the label nodes across different instances and the semantic label hierarchy is used as directed edges to connect different classes. This base graph is referred to as the mixed dependency graph, as it includes both undirected and directed edges. Furthermore, we present another two types of label dependencies to connect the label nodes across different classes. One is the class co-occurrence, which is also encoded as undirected edges. Combining with the above base graph, we obtain a new mixed graph, called mixed graph with co-occurrence (MG-CO). The other is the sparse and low rank decomposition of the whole label matrix, to embed high-order dependencies over all labels. Combining with the base graph, the new mixed graph is called as MG-SL (mixed graph with sparse and low rank decomposition). Based on MG-CO and MG-SL, we further propose two convex transductive formulations of the MLML problem, denoted as MLMG-CO and MLMG-SL respectively. In both formulations, the instance-level similarity is embedded through a quadratic smoothness term, while the semantic label hierarchy is used as a linear constraint. In MLMG-CO, the class co-occurrence is also formulated as a quadratic smoothness term, while the sparse and low rank decomposition is incorporated into MLMG-SL, through two additional matrices (one is assumed as sparse, and the other is assumed as low rank) and an equivalence constraint between the summation of this two matrices and the original label matrix. Interestingly, two important applications, including image annotation and tag based image retrieval, can be jointly handled using our proposed methods. Experimental results on several benchmark datasets show that our methods lead to significant improvements in performance and robustness to missing labels over the state-of-the-art methods.

  • On Unifying Multi-view Self-Representations for Clustering by Tensor Multi-rank Minimization
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-04-06
    Yuan Xie, Dacheng Tao, Wensheng Zhang, Yan Liu, Lei Zhang, Yanyun Qu

    In this paper, we address the multi-view subspace clustering problem. Our method utilizes the circulant algebra for tensor, which is constructed by stacking the subspace representation matrices of different views and then rotating, to capture the low rank tensor subspace so that the refinement of the view-specific subspaces can be achieved, as well as the high order correlations underlying multi-view data can be explored. By introducing a recently proposed tensor factorization, namely tensor-Singular Value Decomposition (t-SVD) (Kilmer et al. in SIAM J Matrix Anal Appl 34(1):148–172, 2013), we can impose a new type of low-rank tensor constraint on the rotated tensor to ensure the consensus among multiple views. Different from traditional unfolding based tensor norm, this low-rank tensor constraint has optimality properties similar to that of matrix rank derived from SVD, so the complementary information can be explored and propagated among all the views more thoroughly and effectively. The established model, called t-SVD based Multi-view Subspace Clustering (t-SVD-MSC), falls into the applicable scope of augmented Lagrangian method, and its minimization problem can be efficiently solved with theoretical convergence guarantee and relatively low computational complexity. Extensive experimental testing on eight challenging image datasets shows that the proposed method has achieved highly competent objective performance compared to several state-of-the-art multi-view clustering methods.

  • What Makes Good Synthetic Training Data for Learning Disparity and Optical Flow Estimation?
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-04-02
    Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy, Thomas Brox

    The finding that very large networks can be trained efficiently and reliably has led to a paradigm shift in computer vision from engineered solutions to learning formulations. As a result, the research challenge shifts from devising algorithms to creating suitable and abundant training data for supervised learning. How to efficiently create such training data? The dominant data acquisition method in visual recognition is based on web data and manual annotation. Yet, for many computer vision problems, such as stereo or optical flow estimation, this approach is not feasible because humans cannot manually enter a pixel-accurate flow field. In this paper, we promote the use of synthetically generated data for the purpose of training deep networks on such tasks. We suggest multiple ways to generate such data and evaluate the influence of dataset properties on the performance and generalization properties of the resulting networks. We also demonstrate the benefit of learning schedules that use different types of data at selected stages of the training process.

  • Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-03-07
    Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, Carsten Rother

    The success of deep learning in computer vision is based on the availability of large annotated datasets. To lower the need for hand labeled images, virtually rendered 3D worlds have recently gained popularity. Unfortunately, creating realistic 3D content is challenging on its own and requires significant human effort. In this work, we propose an alternative paradigm which combines real and synthetic data for learning semantic instance segmentation and object detection models. Exploiting the fact that not all aspects of the scene are equally important for this task, we propose to augment real-world imagery with virtual objects of the target category. Capturing real-world images at large scale is easy and cheap, and directly provides real background appearances without the need for creating complex 3D models of the environment. We present an efficient procedure to augment these images with virtual objects. In contrast to modeling complete 3D environments, our data augmentation approach requires only a few user interactions in combination with 3D models of the target object category. Leveraging our approach, we introduce a novel dataset of augmented urban driving scenes with 360 degree images that are used as environment maps to create realistic lighting and reflections on rendered objects. We analyze the significance of realistic object placement by comparing manual placement by humans to automatic methods based on semantic scene analysis. This allows us to create composite images which exhibit both realistic background appearance as well as a large number of complex object arrangements. Through an extensive set of experiments, we conclude the right set of parameters to produce augmented data which can maximally enhance the performance of instance segmentation models. Further, we demonstrate the utility of the proposed approach on training standard deep models for semantic instance segmentation and object detection of cars in outdoor driving scenarios. We test the models trained on our augmented data on the KITTI 2015 dataset, which we have annotated with pixel-accurate ground truth, and on the Cityscapes dataset. Our experiments demonstrate that the models trained on augmented imagery generalize better than those trained on fully synthetic data or models trained on limited amounts of annotated real data.

  • Hierarchical Cellular Automata for Visual Saliency
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-02-23
    Yao Qin, Mengyang Feng, Huchuan Lu, Garrison W. Cottrell

    Saliency detection, finding the most important parts of an image, has become increasingly popular in computer vision. In this paper, we introduce Hierarchical Cellular Automata (HCA)—a temporally evolving model to intelligently detect salient objects. HCA consists of two main components: Single-layer Cellular Automata (SCA) and Cuboid Cellular Automata (CCA). As an unsupervised propagation mechanism, Single-layer Cellular Automata can exploit the intrinsic relevance of similar regions through interactions with neighbors. Low-level image features as well as high-level semantic information extracted from deep neural networks are incorporated into the SCA to measure the correlation between different image patches. With these hierarchical deep features, an impact factor matrix and a coherence matrix are constructed to balance the influences on each cell’s next state. The saliency values of all cells are iteratively updated according to a well-defined update rule. Furthermore, we propose CCA to integrate multiple saliency maps generated by SCA at different scales in a Bayesian framework. Therefore, single-layer propagation and multi-scale integration are jointly modeled in our unified HCA. Surprisingly, we find that the SCA can improve all existing methods that we applied it to, resulting in a similar precision level regardless of the original results. The CCA can act as an efficient pixel-wise aggregation algorithm that can integrate state-of-the-art methods, resulting in even better results. Extensive experiments on four challenging datasets demonstrate that the proposed algorithm outperforms state-of-the-art conventional methods and is competitive with deep learning based approaches.

  • Scale-Free Registrations in 3D: 7 Degrees of Freedom with Fourier Mellin SOFT Transforms
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-02-23
    Heiko Bülow, Andreas Birk

    Fourier Mellin SOFT (FMS) as a novel method for global registration of 3D data is presented. It determines the seven degrees of freedom (7-DoF) transformation, i.e., the 6-DoF rigid motion parameters plus 1-DoF scale, between two scans, i.e., two noisy, only partially overlapping views on objects or scenes. It is based on a sequence of the 3D Fourier transform, the Mellin transform and the SO(3) Fourier transform. This combination represents a non-trivial complete 3D extension of the well known Fourier-Mellin registration for 2D images. It is accordingly based on decoupling rotation and scale from translation. First, rotation—which is the main challenge for the extension to 3D data - is tackled with a SO(3) Fourier Transform (SOFT) based on Spherical Harmonics. In a second step, scale is determined via a 3D Mellin transform. Finally, translation is calculated by Phase-Matching. Experiments are presented with simulated data sets for ground truth comparisons and with real world data including object recognition and localization in Magnetic Resonance Tomography (MRT) data, registration of 2.5D RGBD scans from a Microsoft Kinect with a scale-free 3D model generated by Multi-View Vision, and 3D mapping by registration of a sequence of consecutive scans from a low-cost actuated Laser Range Finder. The results show that the method is fast and that it can robustly handle partial overlap, interfering structures, and noise. It is also shown that the method is a very interesting option for 6-DoF registration, i.e., when scale is known.

  • Prediction of Manipulation Actions
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-20
    Cornelia Fermüller, Fang Wang, Yezhou Yang, Konstantinos Zampogiannis, Yi Zhang, Francisco Barranco, Michael Pfeiffer

    By looking at a person’s hands, one can often tell what the person is going to do next, how his/her hands are moving and where they will be, because an actor’s intentions shape his/her movement kinematics during action execution. Similarly, active systems with real-time constraints must not simply rely on passive video-segment classification, but they have to continuously update their estimates and predict future actions. In this paper, we study the prediction of dexterous actions. We recorded videos of subjects performing different manipulation actions on the same object, such as “squeezing”, “flipping”, “washing”, “wiping” and “scratching” with a sponge. In psychophysical experiments, we evaluated human observers’ skills in predicting actions from video sequences of different length, depicting the hand movement in the preparation and execution of actions before and after contact with the object. We then developed a recurrent neural network based method for action prediction using as input image patches around the hand. We also used the same formalism to predict the forces on the finger tips using for training synchronized video and force data streams. Evaluations on two new datasets show that our system closely matches human performance in the recognition task, and demonstrate the ability of our algorithms to predict in real time what and how a dexterous action is performed.

  • Dynamic Behavior Analysis via Structured Rank Minimization
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-01-19
    Christos Georgakis, Yannis Panagakis, Maja Pantic

    Human behavior and affect is inherently a dynamic phenomenon involving temporal evolution of patterns manifested through a multiplicity of non-verbal behavioral cues including facial expressions, body postures and gestures, and vocal outbursts. A natural assumption for human behavior modeling is that a continuous-time characterization of behavior is the output of a linear time-invariant system when behavioral cues act as the input (e.g., continuous rather than discrete annotations of dimensional affect). Here we study the learning of such dynamical system under real-world conditions, namely in the presence of noisy behavioral cues descriptors and possibly unreliable annotations by employing structured rank minimization. To this end, a novel structured rank minimization method and its scalable variant are proposed. The generalizability of the proposed framework is demonstrated by conducting experiments on 3 distinct dynamic behavior analysis tasks, namely (i) conflict intensity prediction, (ii) prediction of valence and arousal, and (iii) tracklet matching. The attained results outperform those achieved by other state-of-the-art methods for these tasks and, hence, evidence the robustness and effectiveness of the proposed approach.

  • Joint Estimation of Human Pose and Conversational Groups from Social Scenes
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-07-14
    Jagannadan Varadarajan, Ramanathan Subramanian, Samuel Rota Bulò, Narendra Ahuja, Oswald Lanz, Elisa Ricci

    Despite many attempts in the last few years, automatic analysis of social scenes captured by wide-angle camera networks remains a very challenging task due to the low resolution of targets, background clutter and frequent and persistent occlusions. In this paper, we present a novel framework for jointly estimating (i) head, body orientations of targets and (ii) conversational groups called F-formations from social scenes. In contrast to prior works that have (a) exploited the limited range of head and body orientations to jointly learn both, or (b) employed the mutual head (but not body) pose of interactors for deducing F-formations, we propose a weakly-supervised learning algorithm for joint inference. Our algorithm employs body pose as the primary cue for F-formation estimation, and an alternating optimization strategy is proposed to iteratively refine F-formation and pose estimates. We demonstrate the increased efficacy of joint inference over the state-of-the-art via extensive experiments on three social datasets.

  • Toward Personalized Modeling: Incremental and Ensemble Alignment for Sequential Faces in the Wild
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-15
    Xi Peng, Shaoting Zhang, Yang Yu, Dimitris N. Metaxas

    Fitting facial landmarks on unconstrained videos is a challenging task with broad applications. Both generic and joint alignment methods have been proposed with varying degrees of success. However, many generic methods are heavily sensitive to initializations and usually rely on offline-trained static models, which limit their performance on sequential images with extensive variations. On the other hand, joint methods are restricted to offline applications, since they require all frames to conduct batch alignment. To address these limitations, we propose to exploit incremental learning for personalized ensemble alignment. We sample multiple initial shapes to achieve image congealing within one frame, which enables us to incrementally conduct ensemble alignment by group-sparse regularized rank minimization. At the same time, incremental subspace adaptation is performed to achieve personalized modeling in a unified framework. To alleviate the drifting issue, we leverage a very efficient fitting evaluation network to pick out well-aligned faces for robust incremental learning. Extensive experiments on both controlled and unconstrained datasets have validated our approach in different aspects and demonstrated its superior performance compared with state of the arts in terms of fitting accuracy and efficiency.

  • Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-05-22
    Serena Yeung, Olga Russakovsky, Ning Jin, Mykhaylo Andriluka, Greg Mori, Li Fei-Fei

    Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.

  • Space-Time Tree Ensemble for Action Recognition and Localization
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-02
    Shugao Ma, Jianming Zhang, Stan Sclaroff, Nazli Ikizler-Cinbis, Leonid Sigal

    Human actions are, inherently, structured patterns of body movements. We explore ensembles of hierarchical spatio-temporal trees, discovered directly from training data, to model these structures for action recognition and spatial localization. Discovery of frequent and discriminative tree structures is challenging due to the exponential search space, particularly if one allows partial matching. We address this by first building a concise action word vocabulary via discriminative clustering of the hierarchical space-time segments, which is a two-level video representation that captures both static and non-static relevant space-time segments of the video. Using this vocabulary we then utilize tree mining with subsequent tree clustering and ranking to select a compact set of discriminative tree patterns. Our experiments show that these tree patterns, alone, or in combination with shorter patterns (action words and pairwise patterns) achieve promising performance on three challenging datasets: UCF Sports, HighFive and Hollywood3D. Moreover, we perform cross-dataset validation, using trees learned on HighFive to recognize the same actions in Hollywood3D, and using trees learned on UCF-Sports to recognize and localize the similar actions in JHMDB. The results demonstrate the potential for cross-dataset generalization of the trees our approach discovers.

  • Unconstrained Still/Video-Based Face Verification with Deep Convolutional Neural Networks
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-07-01
    Jun-Cheng Chen, Rajeev Ranjan, Swami Sankaranarayanan, Amit Kumar, Ching-Hui Chen, Vishal M. Patel, Carlos D. Castillo, Rama Chellappa

    Over the last 5 years, methods based on Deep Convolutional Neural Networks (DCNNs) have shown impressive performance improvements for object detection and recognition problems. This has been made possible due to the availability of large annotated datasets, a better understanding of the non-linear mapping between input images and class labels as well as the affordability of GPUs. In this paper, we present the design details of a deep learning system for unconstrained face recognition, including modules for face detection, association, alignment and face verification. The quantitative performance evaluation is conducted using the IARPA Janus Benchmark A (IJB-A), the JANUS Challenge Set 2 (JANUS CS2), and the Labeled Faces in the Wild (LFW) dataset. The IJB-A dataset includes real-world unconstrained faces of 500 subjects with significant pose and illumination variations which are much harder than the LFW and Youtube Face datasets. JANUS CS2 is the extended version of IJB-A which contains not only all the images/frames of IJB-A but also includes the original videos. Some open issues regarding DCNNs for face verification problems are then discussed.

  • Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2016-10-04
    Lionel Pigou, Aäron van den Oord, Sander Dieleman, Mieke Van Herreweghe, Joni Dambre

    Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning and speech recognition. For the task of capturing temporal structure in video, however, there still remain numerous open research questions. Current research suggests using a simple temporal feature pooling strategy to take into account the temporal aspect of video. We demonstrate that this method is not sufficient for gesture recognition, where temporal information is more discriminative compared to general video classification tasks. We explore deep architectures for gesture recognition in video and propose a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. Our main contributions are twofold; first, we show that recurrence is crucial for this task; second, we show that adding temporal convolutions leads to significant improvements. We evaluate the different approaches on the Montalbano gesture recognition dataset, where we achieve state-of-the-art results.

  • A Comprehensive Performance Evaluation of Deformable Face Tracking “In-the-Wild”
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-25
    Grigorios G. Chrysos, Epameinondas Antonakos, Patrick Snape, Akshay Asthana, Stefanos Zafeiriou

    Recently, technologies such as face detection, facial landmark localisation and face recognition and verification have matured enough to provide effective and efficient solutions for imagery captured under arbitrary conditions (referred to as “in-the-wild”). This is partially attributed to the fact that comprehensive “in-the-wild” benchmarks have been developed for face detection, landmark localisation and recognition/verification. A very important technology that has not been thoroughly evaluated yet is deformable face tracking “in-the-wild”. Until now, the performance has mainly been assessed qualitatively by visually assessing the result of a deformable face tracking technology on short videos. In this paper, we perform the first, to the best of our knowledge, thorough evaluation of state-of-the-art deformable face tracking pipelines using the recently introduced 300 VW benchmark. We evaluate many different architectures focusing mainly on the task of on-line deformable face tracking. In particular, we compare the following general strategies: (a) generic face detection plus generic facial landmark localisation, (b) generic model free tracking plus generic facial landmark localisation, as well as (c) hybrid approaches using state-of-the-art face detection, model free tracking and facial landmark localisation technologies. Our evaluation reveals future avenues for further research on the topic.

  • Transferring Deep Object and Scene Representations for Event Recognition in Still Images
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-09-13
    Limin Wang, Zhe Wang, Yu Qiao, Luc Van Gool

    This paper addresses the problem of image-based event recognition by transferring deep representations learned from object and scene datasets. First we empirically investigate the correlation of the concepts of object, scene, and event, thus motivating our representation transfer methods. Based on this empirical study, we propose an iterative selection method to identify a subset of object and scene classes deemed most relevant for representation transfer. Afterwards, we develop three transfer techniques: (1) initialization-based transfer, (2) knowledge-based transfer, and (3) data-based transfer. These newly designed transfer techniques exploit multitask learning frameworks to incorporate extra knowledge from other networks or additional datasets into the fine-tuning procedure of event CNNs. These multitask learning frameworks turn out to be effective in reducing the effect of over-fitting and improving the generalization ability of the learned CNNs. We perform experiments on four event recognition benchmarks: the ChaLearn LAP Cultural Event Recognition dataset, the Web Image Dataset for Event Recognition, the UIUC Sports Event dataset, and the Photo Event Collection dataset. The experimental results show that our proposed algorithm successfully transfers object and scene representations towards the event dataset and achieves the current state-of-the-art performance on all considered datasets.

  • Deep Multimodal Fusion: A Hybrid Approach
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-20
    Mohamed R. Amer, Timothy Shields, Behjat Siddiquie, Amir Tamrakar, Ajay Divakaran, Sek Chai

    We propose a novel hybrid model that exploits the strength of discriminative classifiers along with the representation power of generative models. Our focus is on detecting multimodal events in time varying sequences as well as generating missing data in any of the modalities. Discriminative classifiers have been shown to achieve higher performances than the corresponding generative likelihood-based classifiers. On the other hand, generative models learn a rich informative space which allows for data generation and joint feature representation that discriminative models lack. We propose a new model that jointly optimizes the representation space using a hybrid energy function. We employ a Restricted Boltzmann Machines (RBMs) based model to learn a shared representation across multiple modalities with time varying data. The Conditional RBMs (CRBMs) is an extension of the RBM model that takes into account short term temporal phenomena. The hybrid model involves augmenting CRBMs with a discriminative component for classification. For these purposes we propose a novel Multimodal Discriminative CRBMs (MMDCRBMs) model. First, we train the MMDCRBMs model using labeled data by training each modality, followed by training a fusion layer. Second, we exploit the generative capability of MMDCRBMs to activate the trained model so as to generate the lower-level data corresponding to the specific label that closely matches the actual input data. We evaluate our approach on ChaLearn dataset, audio-mocap, as well as the Tower Game dataset, mocap-mocap as well as three multimodal toy datasets. We report classification accuracy, generation accuracy, and localization accuracy and demonstrate its superiority compared to the state-of-the-art methods.

  • Subjects and Their Objects: Localizing Interactees for a Person-Centric View of Importance
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2016-10-28
    Chao-Yeh Chen, Kristen Grauman

    Understanding images with people often entails understanding their interactions with other objects or people. As such, given a novel image, a vision system ought to infer which other objects/people play an important role in a given person’s activity. However, existing methods are limited to learning action-specific interactions (e.g., how the pose of a tennis player relates to the position of his racquet when serving the ball) for improved recognition, making them unequipped to reason about novel interactions with actions or objects unobserved in the training data. We propose to predict the “interactee” in novel images—that is, to localize the object of a person’s action. Given an arbitrary image with a detected person, the goal is to produce a saliency map indicating the most likely positions and scales where that person’s interactee would be found. To that end, we explore ways to learn the generic, action-independent connections between (a) representations of a person’s pose, gaze, and scene cues and (b) the interactee object’s position and scale. We provide results on a newly collected UT Interactee dataset spanning more than 10,000 images from SUN, PASCAL, and COCO. We show that the proposed interaction-informed saliency metric has practical utility for four tasks: contextual object detection, image retargeting, predicting object importance, and data-driven natural language scene description. All four scenarios reveal the value in linking the subject to its object in order to understand the story of an image.

  • Deep Expectation of Real and Apparent Age from a Single Image Without Facial Landmarks
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2016-08-10
    Rasmus Rothe, Radu Timofte, Luc Van Gool

    In this paper we propose a deep learning solution to age estimation from a single face image without the use of facial landmarks and introduce the IMDB-WIKI dataset, the largest public dataset of face images with age and gender labels. If the real age estimation research spans over decades, the study of apparent age estimation or the age as perceived by other humans from a face image is a recent endeavor. We tackle both tasks with our convolutional neural networks (CNNs) of VGG-16 architecture which are pre-trained on ImageNet for image classification. We pose the age estimation problem as a deep classification problem followed by a softmax expected value refinement. The key factors of our solution are: deep learned models from large data, robust face alignment, and expected value formulation for age regression. We validate our methods on standard benchmarks and achieve state-of-the-art results for both real and apparent age estimation.

  • Real-Time Accurate 3D Head Tracking and Pose Estimation with Consumer RGB-D Cameras
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-02-02
    David Joseph Tan, Federico Tombari, Nassir Navab

    We demonstrate how 3D head tracking and pose estimation can be effectively and efficiently achieved from noisy RGB-D sequences. Our proposal leverages on a random forest framework, designed to regress the 3D head pose at every frame in a temporal tracking manner. One peculiarity of the algorithm is that it exploits together (1) a generic training dataset of 3D head models, which is learned once offline; and, (2) an online refinement with subject-specific 3D data, which aims for the tracker to withstand slight facial deformations and to adapt its forest to the specific characteristics of an individual subject. The combination of these works allows our algorithm to be robust even under extreme poses, where the user’s face is no longer visible on the image. Finally, we also propose another solution that utilizes a multi-camera system such that the data simultaneously acquired from multiple RGB-D sensors helps the tracker to handle challenging conditions that affect a subset of the cameras. Notably, the proposed multi-camera frameworks yields a real-time performance of approximately 8 ms per frame given six cameras and one CPU core, and scales up linearly to 30 fps with 25 cameras.

  • Large Scale 3D Morphable Models
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-04-08
    James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, Stefanos Zafeiriou

    We present large scale facial model (LSFM)—a 3D Morphable Model (3DMM) automatically constructed from 9663 distinct facial identities. To the best of our knowledge LSFM is the largest-scale Morphable Model ever constructed, containing statistical information from a huge variety of the human population. To build such a large model we introduce a novel fully automated and robust Morphable Model construction pipeline, informed by an evaluation of state-of-the-art dense correspondence techniques. The dataset that LSFM is trained on includes rich demographic information about each subject, allowing for the construction of not only a global 3DMM model but also models tailored for specific age, gender or ethnicity groups. We utilize the proposed model to perform age classification from 3D shape alone and to reconstruct noisy out-of-sample data in the low-dimensional model space. Furthermore, we perform a systematic analysis of the constructed 3DMM models that showcases their quality and descriptive power. The presented extensive qualitative and quantitative evaluations reveal that the proposed 3DMM achieves state-of-the-art results, outperforming existing models by a large margin. Finally, for the benefit of the research community, we make publicly available the source code of the proposed automatic 3DMM construction pipeline, as well as the constructed global 3DMM and a variety of bespoke models tailored by age, gender and ethnicity.

  • Confidence-Weighted Local Expression Predictions for Occlusion Handling in Expression Recognition and Action Unit Detection
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-04-08
    Arnaud Dapogny, Kevin Bailly, Séverine Dubuisson

    Fully-automatic facial expression recognition (FER) is a key component of human behavior analysis. Performing FER from still images is a challenging task as it involves handling large interpersonal morphological differences, and as partial occlusions can occasionally happen. Furthermore, labelling expressions is a time-consuming process that is prone to subjectivity, thus the variability may not be fully covered by the training data. In this work, we propose to train random forests upon spatially-constrained random local subspaces of the face. The output local predictions form a categorical expression-driven high-level representation that we call local expression predictions (LEPs). LEPs can be combined to describe categorical facial expressions as well as action units (AUs). Furthermore, LEPs can be weighted by confidence scores provided by an autoencoder network. Such network is trained to locally capture the manifold of the non-occluded training data in a hierarchical way. Extensive experiments show that the proposed LEP representation yields high descriptive power for categorical expressions and AU occurrence prediction, and leads to interesting perspectives towards the design of occlusion-robust and confidence-aware FER systems.

  • Predicting Foreground Object Ambiguity and Efficiently Crowdsourcing the Segmentation(s)
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-02-05
    Danna Gurari, Kun He, Bo Xiong, Jianming Zhang, Mehrnoosh Sameki, Suyog Dutt Jain, Stan Sclaroff, Margrit Betke, Kristen Grauman

    We propose the ambiguity problem for the foreground object segmentation task and motivate the importance of estimating and accounting for this ambiguity when designing vision systems. Specifically, we distinguish between images which lead multiple annotators to segment different foreground objects (ambiguous) versus minor inter-annotator differences of the same object. Taking images from eight widely used datasets, we crowdsource labeling the images as “ambiguous” or “not ambiguous” to segment in order to construct a new dataset we call STATIC. Using STATIC, we develop a system that automatically predicts which images are ambiguous. Experiments demonstrate the advantage of our prediction system over existing saliency-based methods on images from vision benchmarks and images taken by blind people who are trying to recognize objects in their environment. Finally, we introduce a crowdsourcing system to achieve cost savings for collecting the diversity of all valid “ground truth” foreground object segmentations by collecting extra segmentations only when ambiguity is expected. Experiments show our system eliminates up to 47% of human effort compared to existing crowdsourcing methods with no loss in capturing the diversity of ground truths.

  • Label Propagation with Ensemble of Pairwise Geometric Relations: Towards Robust Large-Scale Retrieval of Object Instances
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-31
    Xiaomeng Wu, Kaoru Hiramatsu, Kunio Kashino

    Spatial verification methods permit geometrically stable image matching, but still involve a difficult trade-off between robustness as regards incorrect rejection of true correspondences and discriminative power in terms of mismatches. To address this issue, we ask whether an ensemble of weak geometric constraints that correlates with visual similarity only slightly better than a bag-of-visual-words model performs better than a single strong constraint. We consider a family of spatial verification methods and decompose them into fundamental constraints imposed on pairs of feature correspondences. Encompassing such constraints leads us to propose a new method, which takes the best of existing techniques and functions as a unified Ensemble of pAirwise GEometric Relations (EAGER), in terms of both spatial contexts and between-image transformations. We also introduce a novel and robust reranking method, in which the object instances localized by EAGER in high-ranked database images are reissued as new queries. EAGER is extended to develop a smoothness constraint where the similarity between the optimized ranking scores of two instances should be maximally consistent with their geometrically constrained similarity. Reranking is newly formulated as two label propagation problems: one is to assess the confidence of new queries and the other to aggregate new independently executed retrievals. Extensive experiments conducted on four datasets show that EAGER and our reranking method outperform most of their state-of-the-art counterparts, especially when large-scale visual vocabularies are used.

  • Learning Latent Representations of 3D Human Pose with Deep Neural Networks
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-31
    Isinsu Katircioglu, Bugra Tekin, Mathieu Salzmann, Vincent Lepetit, Pascal Fua

    Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from an image to a 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images or 2D joint location heatmaps that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and accounts for joint dependencies. We further propose an efficient Long Short-Term Memory network to enforce temporal consistency on 3D pose predictions. We demonstrate that our approach achieves state-of-the-art performance both in terms of structure preservation and prediction accuracy on standard 3D human pose estimation benchmarks.

  • Occlusion-Aware 3D Morphable Models and an Illumination Prior for Face Image Analysis
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-31
    Bernhard Egger, Sandro Schönborn, Andreas Schneider, Adam Kortylewski, Andreas Morel-Forster, Clemens Blumer, Thomas Vetter

    Faces in natural images are often occluded by a variety of objects. We propose a fully automated, probabilistic and occlusion-aware 3D morphable face model adaptation framework following an analysis-by-synthesis setup. The key idea is to segment the image into regions explained by separate models. Our framework includes a 3D morphable face model, a prototype-based beard model and a simple model for occlusions and background regions. The segmentation and all the model parameters have to be inferred from the single target image. Face model adaptation and segmentation are solved jointly using an expectation–maximization-like procedure. During the E-step, we update the segmentation and in the M-step the face model parameters are updated. For face model adaptation we apply a stochastic sampling strategy based on the Metropolis–Hastings algorithm. For segmentation, we apply loopy belief propagation for inference in a Markov random field. Illumination estimation is critical for occlusion handling. Our combined segmentation and model adaptation needs a proper initialization of the illumination parameters. We propose a RANSAC-based robust illumination estimation technique. By applying this method to a large face image database we obtain a first empirical distribution of real-world illumination conditions. The obtained empirical distribution is made publicly available and can be used as prior in probabilistic frameworks, for regularization or to synthesize data for deep learning methods.

  • Graph-Based Slice-to-Volume Deformable Registration
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-22
    Enzo Ferrante, Nikos Paragios

    Deformable image registration is a fundamental problem in computer vision and medical image computing. In this paper we investigate the use of graphical models in the context of a particular type of image registration problem, known as slice-to-volume registration. We introduce a scalable, modular and flexible formulation that can accommodate low-rank and high order terms, that simultaneously selects the plane and estimates the in-plane deformation through a single shot optimization approach. The proposed framework is instantiated into different variants seeking either a compromise between computational efficiency (soft plane selection constraints and approximate definition of the data similarity terms through pair-wise components) or exact definition of the data terms and the constraints on the plane selection. Simulated and real-data in the context of ultrasound and magnetic resonance registration (where both framework instantiations as well as different optimization strategies are considered) demonstrate the potentials of our method.

  • Baseline and Triangulation Geometry in a Standard Plenoptic Camera
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-20
    Christopher Hahne, Amar Aggoun, Vladan Velisavljevic, Susanne Fiebig, Matthias Pesch

    In this paper, we demonstrate light field triangulation to determine depth distances and baselines in a plenoptic camera. Advances in micro lenses and image sensors have enabled plenoptic cameras to capture a scene from different viewpoints with sufficient spatial resolution. While object distances can be inferred from disparities in a stereo viewpoint pair using triangulation, this concept remains ambiguous when applied in the case of plenoptic cameras. We present a geometrical light field model allowing the triangulation to be applied to a plenoptic camera in order to predict object distances or specify baselines as desired. It is shown that distance estimates from our novel method match those of real objects placed in front of the camera. Additional benchmark tests with an optical design software further validate the model’s accuracy with deviations of less than ±0.33% for several main lens types and focus settings. A variety of applications in the automotive and robotics field can benefit from this estimation model.

  • Efficient Label Collection for Image Datasets via Hierarchical Clustering
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-24
    Maggie Wigness, Bruce A. Draper, J. Ross Beveridge

    Raw visual data used to train classifiers is abundant and easy to gather, but lacks semantic labels that describe visual concepts of interest. These labels are necessary for supervised learning and can require significant human effort to collect. We discuss four labeling objectives that play an important role in the design of frameworks aimed at collecting label information for large training sets while maintaining low human effort: discovery, efficiency, exploitation and accuracy. We introduce a framework that explicitly models and balances these four labeling objectives with the use of (1) hierarchical clustering, (2) a novel interestingness measure that defines structural change within the hierarchy, and (3) an iterative group-based labeling process that exploits relationships between labeled and unlabeled data. Results on benchmark data show that our framework collects labeled training data more efficiently than existing labeling techniques and trains higher performing visual classifiers. Further, we show that our resulting framework is fast and significantly reduces human interaction time when labeling real-world multi-concept imagery depicting outdoor environments.

  • On the Beneficial Effect of Noise in Vertex Localization
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-09-19
    Konstantinos A. Raftopoulos, Stefanos D. Kollias, Dionysios D. Sourlas, Marin Ferecatu

    A theoretical and experimental analysis related to the effect of noise in the task of vertex identification in unknown shapes is presented. Shapes are seen as real functions of their closed boundary. An alternative global perspective of curvature is examined providing insight into the process of noise-enabled vertex localization. The analysis reveals that noise facilitates in the localization of certain vertices. The concept of noising is thus considered and a relevant global method for localizing Global Vertices is investigated in relation to local methods under the presence of increasing noise. Theoretical analysis reveals that induced noise can indeed help localizing certain vertices if combined with global descriptors. Experiments with noise and a comparison to localized methods validate the theoretical results.

  • Learning to Detect Good 3D Keypoints
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-08
    Alessio Tonioni, Samuele Salti, Federico Tombari, Riccardo Spezialetti, Luigi Di Stefano

    The established approach to 3D keypoint detection consists in defining effective handcrafted saliency functions based on geometric cues with the aim of maximizing keypoint repeatability. Differently, the idea behind our work is to learn a descriptor-specific keypoint detector so as to optimize the end-to-end performance of the feature matching pipeline. Accordingly, we cast 3D keypoint detection as a classification problem between surface patches that can or cannot be matched correctly by a given 3D descriptor, i.e. those either good or not in respect to that descriptor. We propose a machine learning framework that allows for defining examples of good surface patches from the training data and leverages Random Forest classifiers to realize both fixed-scale and adaptive-scale 3D keypoint detectors. Through extensive experiments on standard datasets, we show how feature matching performance improves significantly by deploying 3D descriptors together with companion detectors learned by our methodology with respect to the adoption of established state-of-the-art 3D detectors based on hand-crafted saliency functions.

  • Attentive Systems: A Survey
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-09-15
    Tam V. Nguyen, Qi Zhao, Shuicheng Yan

    Visual saliency analysis detects salient regions/objects that attract human attention in natural scenes. It has attracted intensive research in different fields such as computer vision, computer graphics, and multimedia. While many such computational models exist, the focused study of what and how applications can be beneficial is still lacking. In this article, our ultimate goal is thus to provide a comprehensive review of the applications using saliency cues, the so-called attentive systems. We would like to provide a broad vision about saliency applications and what visual saliency can do. We categorize the vast amount of applications into different areas such as computer vision, computer graphics, and multimedia. Intensively covering 200+ publications we survey (1) key application trends, (2) the role of visual saliency, and (3) the usability of saliency into different tasks.

  • Discriminative Correlation Filter Tracker with Channel and Spatial Reliability
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-08
    Alan Lukežič, Tomáš Vojíř, Luka Čehovin Zajc, Jiří Matas, Matej Kristan

    Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a learning algorithm for its efficient and seamless integration in the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This both allows to enlarge the search region and improves tracking of non-rectangular objects. Reliability scores reflect channel-wise quality of the learned filters and are used as feature weighting coefficients in localization. Experimentally, with only two simple standard feature sets, HoGs and colornames, the novel CSR-DCF method—DCF with channel and spatial reliability—achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB100. The CSR-DCF runs close to real-time on a CPU.

  • Separable Anisotropic Diffusion
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2018-01-05
    Roi Méndez-Rial, Julio Martín-Herrero

    Anisotropic diffusion has many applications in image processing, but the high computational cost usually requires accuracy trade-offs in order to grant its applicability in practical problems. This is specially true when dealing with 3D images, where anisotropic diffusion should be able to provide interesting results for many applications, but the usual implementation methods greatly scale in complexity with the additional dimension. Here we propose a separable implementation of the most general anisotropic diffusion formulation, based on Gaussian convolutions, whose favorable computational complexity scales linearly with the number of dimensions, without any assumptions about specific parameterizations. We also present variants that bend the Gaussian kernels for improved results when dealing with highly anisotropic curved or sharp structures. We test the accuracy, speed, stability, and scale-space properties of the proposed methods, and present some results (both synthetic and real) which show their advantages, including up to 60 times faster computation in 3D with respect to the explicit method, improved accuracy and stability, and min–max preservation.

  • Top-Down Neural Attention by Excitation Backprop
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-12-23
    Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, Stan Sclaroff

    We aim to model the top-down attention of a convolutional neural network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, to pass along top-down signals downwards in the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. We show a theoretic connection between the proposed contrastive attention formulation and the Class Activation Map computation. Efficient implementation of Excitation Backprop for common neural network layers is also presented. In experiments, we visualize the evidence of a model’s classification decision by computing the proposed top-down attention maps. For quantitative evaluation, we report the accuracy of our method in weakly supervised localization tasks on the MS COCO, PASCAL VOC07 and ImageNet datasets. The usefulness of our method is further validated in the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising performance in phrase localization by leveraging the top-down attention of a CNN model that has been trained on weakly labeled web images. Finally, we demonstrate applications of our method in model interpretation and data annotation assistance for facial expression analysis and medical imaging tasks.

  • SDF-2-SDF Registration for Real-Time 3D Reconstruction from RGB-D Data
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-12-18
    Miroslava Slavcheva, Wadim Kehl, Nassir Navab, Slobodan Ilic

    We tackle the task of dense 3D reconstruction from RGB-D data. Contrary to the majority of existing methods, we focus not only on trajectory estimation accuracy, but also on reconstruction precision. The key technique is SDF-2-SDF registration, which is a correspondence-free, symmetric, dense energy minimization method, performed via the direct voxel-wise difference between a pair of signed distance fields. It has a wider convergence basin than traditional point cloud registration and cloud-to-volume alignment techniques. Furthermore, its formulation allows for straightforward incorporation of photometric and additional geometric constraints. We employ SDF-2-SDF registration in two applications. First, we perform small-to-medium scale object reconstruction entirely on the CPU. To this end, the camera is tracked frame-to-frame in real time. Then, the initial pose estimates are refined globally in a lightweight optimization framework, which does not involve a pose graph. We combine these procedures into our second, fully real-time application for larger-scale object reconstruction and SLAM. It is implemented as a hybrid system, whereby tracking is done on the GPU, while refinement runs concurrently over batches on the CPU. To bound memory and runtime footprints, registration is done over a fixed number of limited-extent volumes, anchored at geometry-rich locations. Extensive qualitative and quantitative evaluation of both trajectory accuracy and model fidelity on several public RGB-D datasets, acquired with various quality sensors, demonstrates higher precision than related techniques.

  • RAW Image Reconstruction Using a Self-contained sRGB–JPEG Image with Small Memory Overhead
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-12-18
    Rang M. H. Nguyen, Michael S. Brown

    Most camera images are saved as 8-bit standard RGB (sRGB) compressed JPEGs. Even when JPEG compression is set to its highest quality, the encoded sRGB image has been significantly processed in terms of color and tone manipulation. This makes sRGB–JPEG images undesirable for many computer vision tasks that assume a direct relationship between pixel values and incoming light. For such applications, the RAW image format is preferred, as RAW represents a minimally processed, sensor-specific RGB image that is linear with respect to scene radiance. The drawback with RAW images, however, is that they require large amounts of storage and are not well-supported by many imaging applications. To address this issue, we present a method to encode the necessary data within an sRGB–JPEG image to reconstruct a high-quality RAW image. Our approach requires no calibration of the camera’s colorimetric properties and can reconstruct the original RAW to within 0.5% error with a small memory overhead for the additional data (e.g., 128 KB). More importantly, our output is a fully self-contained 100% compliant sRGB–JPEG file that can be used as-is, not affecting any existing image workflow—the RAW image data can be extracted when needed, or ignored otherwise. We detail our approach and show its effectiveness against competing strategies.

  • Hallucinating Compressed Face Images
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-12-08
    Chih-Yuan Yang, Sifei Liu, Ming-Hsuan Yang

    A face hallucination algorithm is proposed to generate high-resolution images from JPEG compressed low-resolution inputs by decomposing a deblocked face image into structural regions such as facial components and non-structural regions like the background. For structural regions, landmarks are used to retrieve adequate high-resolution component exemplars in a large dataset based on the estimated head pose and illumination condition. For non-structural regions, an efficient generic super resolution algorithm is applied to generate high-resolution counterparts. Two sets of gradient maps extracted from these two regions are combined to guide an optimization process of generating the hallucination image. Numerous experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art hallucination methods on JPEG compressed face images with different poses, expressions, and illumination conditions.

  • Learning Image Representations Tied to Egomotion from Unlabeled Video
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-03-04
    Dinesh Jayaraman, Kristen Grauman

    Understanding how images of objects and scenes behave in response to specific egomotions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose a new “embodied” visual learning paradigm, exploiting proprioceptive motor signals to train visual representations from egocentric video with no manual supervision. Specifically, we enforce that our learned features exhibit equivariance i.e., they respond predictably to transformations associated with distinct egomotions. With three datasets, we show that our unsupervised feature learning approach significantly outperforms previous approaches on visual recognition and next-best-view prediction tasks. In the most challenging test, we show that features learned from video captured on an autonomous driving platform improve large-scale scene recognition in static images from a disjoint domain.

  • Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-08-29
    Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

    We propose a Deep Learning approach to the visual question answering task, where machines answer to questions about real-world images. By combining latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We evaluate our approaches on the DAQUAR as well as the VQA dataset where we also report various baselines, including an analysis how much information is contained in the language part only. To study human consensus, we propose two novel metrics and collect additional answers which extend the original DAQUAR dataset to DAQUAR-Consensus. Finally, we evaluate a rich set of design choices how to encode, combine and decode information in our proposed Deep Learning formulation.

  • How Good Is My Test Data? Introducing Safety Analysis for Computer Vision
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-06-09
    Oliver Zendel, Markus Murschitz, Martin Humenberger, Wolfgang Herzner

    Good test data is crucial for driving new developments in computer vision (CV), but two questions remain unanswered: which situations should be covered by the test data, and how much testing is enough to reach a conclusion? In this paper we propose a new answer to these questions using a standard procedure devised by the safety community to validate complex systems: the hazard and operability analysis (HAZOP). It is designed to systematically identify possible causes of system failure or performance loss. We introduce a generic CV model that creates the basis for the hazard analysis and—for the first time—apply an extensive HAZOP to the CV domain. The result is a publicly available checklist with more than 900 identified individual hazards. This checklist can be utilized to evaluate existing test datasets by quantifying the covered hazards. We evaluate our approach by first analyzing and annotating the popular stereo vision test datasets Middlebury and KITTI. Second, we demonstrate a clearly negative influence of the hazards in the checklist on the performance of six popular stereo matching algorithms. The presented approach is a useful tool to evaluate and improve test datasets and creates a common basis for future dataset designs.

  • Global, Dense Multiscale Reconstruction for a Billion Points
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-06-03
    Benjamin Ummenhofer, Thomas Brox

    We present a variational approach for surface reconstruction from a set of oriented points with scale information. We focus particularly on scenarios with nonuniform point densities due to images taken from different distances. In contrast to previous methods, we integrate the scale information in the objective and globally optimize the signed distance function of the surface on a balanced octree grid. We use a finite element discretization on the dual structure of the octree minimizing the number of variables. The tetrahedral mesh is generated efficiently with a lookup table which allows to map octree cells to the nodes of the finite elements. We optimize memory efficiency by data aggregation, such that robust data terms can be used even on very large scenes. The surface normals are explicitly optimized and used for surface extraction to improve the reconstruction at edges and corners.

  • Holistically-Nested Edge Detection
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-03-15
    Saining Xie, Zhuowen Tu

    We develop a new edge detection algorithm that addresses two important issues in this long-standing vision problem: (1) holistic image training and prediction; and (2) multi-scale and multi-level feature learning. Our proposed method, holistically-nested edge detection (HED), performs image-to-image prediction by means of a deep learning model that leverages fully convolutional neural networks and deeply-supervised nets. HED automatically learns rich hierarchical representations (guided by deep supervision on side responses) that are important in order to resolve the challenging ambiguity in edge and object boundary detection. We significantly advance the state-of-the-art on the BSDS500 dataset (ODS F-score of 0.790) and the NYU Depth dataset (ODS F-score of 0.746), and do so with an improved speed (0.4 s per image) that is orders of magnitude faster than some CNN-based edge detection algorithms developed before HED. We also observe encouraging results on other boundary detection benchmark datasets such as Multicue and PASCAL-Context.

  • Mutual-Structure for Joint Filtering
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-06-03
    Xiaoyong Shen, Chao Zhou, Li Xu, Jiaya Jia

    Previous joint/guided filters directly transfer structural information from the reference to the target image. In this paper, we analyze the major drawback—that is, there may be completely different edges in the two images. Simply considering all patterns could introduce significant errors. To address this issue, we propose the concept of mutual-structure, which refers to the structural information that is contained in both images and thus can be safely enhanced by joint filtering. We also use an untraditional objective function that can be efficiently optimized to yield mutual structure. Our method results in important edge preserving property, which greatly benefits depth completion, optical flow estimation, image enhancement, stereo matching, to name a few.

  • Depth Sensing Using Geometrically Constrained Polarization Normals
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-06-22
    Achuta Kadambi, Vage Taamazyan, Boxin Shi, Ramesh Raskar

    Analyzing the polarimetric properties of reflected light is a potential source of shape information. However, it is well-known that polarimetric information contains fundamental shape ambiguities, leading to an underconstrained problem of recovering 3D geometry. To address this problem, we use additional geometric information, from coarse depth maps, to constrain the shape information from polarization cues. Our main contribution is a framework that combines surface normals from polarization (hereafter polarization normals) with an aligned depth map. The additional geometric constraints are used to mitigate physics-based artifacts, such as azimuthal ambiguity, refractive distortion and fronto-parallel signal degradation. We believe our work may have practical implications for optical engineering, demonstrating a new option for state-of-the-art 3D reconstruction.

  • 3D Time-Lapse Reconstruction from Internet Photos
    Int. J. Comput. Vis. (IF 8.222) Pub Date : 2017-03-21
    Ricardo Martin-Brualla, David Gallup, Steven M. Seitz

    Given an Internet photo collection of a landmark, we compute a 3D time-lapse video sequence where a virtual camera moves continuously in time and space. While previous work assumed a static camera, the addition of camera motion during the time-lapse creates a very compelling impression of parallax. Achieving this goal, however, requires addressing multiple technical challenges, including solving for time-varying depth maps, regularizing 3D point color profiles over time, and reconstructing high quality, hole-free images at every frame from the projected profiles. Our results show photorealistic time-lapses of skylines and natural scenes over many years, with dramatic parallax effects.

Some contents have been Reproduced with permission of the American Chemical Society.
Some contents have been Reproduced by permission of The Royal Society of Chemistry.
化学 • 材料 期刊列表