Current journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
  • End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-02-14
    Wenhan Luo; Peng Sun; Fangwei Zhong; Wei Liu; Tong Zhang; Yizhou Wang

    We study active object tracking, where a tracker takes visual observations (i.e., frame sequences) as inputs and produces the corresponding camera control signals as outputs (e.g., move forward, turn left, etc.). Conventional methods tackle the tracking and camera control tasks separately, and the resulting system is difficult to tune jointly. Such an approach also requires significant human effort for image labeling and expensive trial-and-error system tuning in the real world. To address these issues, we propose, in this paper, an end-to-end solution via deep reinforcement learning. A ConvNet-LSTM function approximator is adopted for direct frame-to-action prediction. We further propose environment augmentation techniques and a customized reward function, which are crucial for successful training. The tracker trained in simulators (ViZDoom, Unreal Engine) generalizes well to unseen object moving paths, unseen object appearances, unseen backgrounds, and distracting objects. The system is robust and can recover tracking after occasional loss of the target. We also find that the tracking ability, obtained solely from simulators, can potentially transfer to real-world scenarios. We demonstrate successful examples of such transfer via experiments on the VOT dataset and the deployment of a real-world robot using the proposed active tracker trained in simulation.
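    The frame-to-action mapping is compact enough to sketch. Below is a minimal, hypothetical ConvNet-LSTM actor-critic head in PyTorch; the layer sizes, the 84×84 input resolution, and the six-way action set are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConvLSTMTracker(nn.Module):
    """Sketch of a frame-to-action tracker: ConvNet encoder -> LSTM -> policy."""
    def __init__(self, num_actions=6):
        super().__init__()
        self.encoder = nn.Sequential(                 # per-frame feature extractor
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(input_size=32 * 9 * 9, hidden_size=256,
                            batch_first=True)         # temporal aggregation
        self.policy = nn.Linear(256, num_actions)     # camera-control logits
        self.value = nn.Linear(256, 1)                # baseline for actor-critic RL

    def forward(self, frames, state=None):
        # frames: (batch, time, 3, 84, 84)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, state = self.lstm(feats, state)
        return self.policy(out), self.value(out), state
```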

    Updated: 2019-02-14
  • Measuring Shapes with Desired Convex Polygons
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-02-12
    Jovisa Zunic; Paul Rosin

    In this paper we develop a family of shape measures. All the measures in the family evaluate the degree to which a shape looks like a predefined convex polygon. We apply a rather new approach to designing object-shape-based measures. In most cases, such measures have been defined by exploiting some shape property; an illustrative example is the shape circularity measure, derived from the well-known result that the circle has the largest area among all shapes with the same perimeter. In the approach applied here, no desired property is needed and no optimizing shape has to be found. We start from a desired/selected convex polygon and develop the related shape measure. The measures obtained range over the interval (0,1] and attain the maximal possible value, equal to 1, if and only if the measured shape coincides with the convex polygon used to develop that particular measure. All the measures are invariant with respect to translation, rotation, and scaling. The method has a straightforward extension to a wider family of shape measures, dependent on a tuning parameter. Another extension leads to a family of new shape convexity measures.

    Updated: 2019-02-13
  • Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-02-12
    Shubham Tulsiani; Tinghui Zhou; Alyosha Efros; Jitendra Malik

    We study the notion of consistency between a 3D shape and a 2D observation and propose a differentiable formulation which allows computing gradients of the 3D shape given an observation from an arbitrary view. We do so by reformulating view consistency using a differentiable ray consistency (DRC) term. We show that this formulation can be incorporated in a learning framework to leverage different types of multi-view observations, such as foreground masks, depth, color images, and semantics, as supervision for learning single-view 3D prediction. We present an empirical analysis of our technique in a controlled setting. We also show that this approach allows us to improve over existing techniques for single-view reconstruction of objects from the PASCAL VOC dataset.
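    The heart of the DRC term can be sketched for the foreground-mask case. Assuming occupancy probabilities sampled along each ray, the termination distribution below follows the standard ray-termination formulation, and the expected event cost is differentiable with respect to the occupancies; the cost definitions for depth, color, or semantic supervision are omitted.

```python
import torch

def ray_consistency_loss(occ, mask):
    """occ: (rays, steps) occupancy probs sampled along each ray, in [0, 1].
    mask: (rays,) observed foreground label of the pixel each ray hits."""
    # probability of passing the first i samples without terminating
    pass_through = torch.cumprod(1.0 - occ, dim=1)
    prob_stop = occ.clone()
    prob_stop[:, 1:] = occ[:, 1:] * pass_through[:, :-1]  # terminate at step i
    prob_escape = pass_through[:, -1]                     # ray exits the grid
    # foreground rays should terminate inside; background rays should escape
    loss = mask * prob_escape + (1.0 - mask) * prob_stop.sum(dim=1)
    return loss.mean()
```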

    Updated: 2019-02-13
  • Min-Entropy Latent Model for Weakly Supervised Object Detection
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-02-12
    Fang Wan; Pengxu Wei; Zhenjun Han; Jianbin Jiao; Qixiang Ye

    Weakly supervised object detection is a challenging task when provided with image category supervision but required to learn, at the same time, object locations and object detectors. The inconsistency between the weak supervision and the learning objectives introduces significant randomness to object locations and ambiguity to detectors. In this paper, a min-entropy latent model (MELM) is proposed for weakly supervised object detection. Min-entropy serves as a model to learn object locations and as a metric to measure the randomness of object localization during learning. It aims principally to reduce the variance of learned instances and alleviate the ambiguity of detectors. MELM is decomposed into three components: proposal clique partition, object clique discovery, and object localization. MELM is optimized with a recurrent learning algorithm, which leverages continuation optimization to solve the challenging non-convexity problem. Experiments demonstrate that MELM significantly improves the performance of weakly supervised object detection, weakly supervised object localization, and image classification against state-of-the-art approaches.

    Updated: 2019-02-13
  • Skeleton-Based Online Action Prediction Using Scale Selection Network
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-02-12
    Jun Liu; Amir Shahroudy; Gang Wang; Ling-Yu Duan; Alex Kot Chichung

    Action prediction aims to recognize the class label of an ongoing activity when only a part of it is observed. In this paper, we focus on online action prediction in streaming 3D skeleton sequences. A dilated convolutional network is introduced to model the motion dynamics in the temporal dimension via a sliding window over the temporal axis. Since there are significant temporal scale variations in the observed part of the ongoing action at different time steps, a novel window scale selection method is proposed to make our network focus on the performed part of the ongoing action and suppress possible interference from previous actions at each step. An activation sharing scheme is also proposed to handle the overlapping computations among adjacent time steps, which enables our framework to run more efficiently. Moreover, to enhance the performance of our framework for action prediction with skeletal input data, a hierarchy of dilated tree convolutions is also designed to learn multi-level structured semantic representations over the skeleton joints at each frame. Our proposed approach is evaluated on four challenging datasets. The extensive experiments demonstrate the effectiveness of our method for skeleton-based online action prediction.
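    A hedged sketch of the temporal backbone: a causal, dilated 1-D convolution stack over streaming skeleton features, so each output step sees an exponentially growing window of past frames. The channel sizes, depth, and 75-dimensional input (25 joints × 3 coordinates) are assumptions, and the paper's window scale selection and activation sharing schemes are not shown.

```python
import torch.nn as nn

class DilatedTemporalNet(nn.Module):
    def __init__(self, in_dim=75, hidden=64, levels=4, num_classes=60):
        super().__init__()
        layers = []
        for i in range(levels):
            d = 2 ** i                                     # dilations 1, 2, 4, 8
            layers += [nn.ConstantPad1d((2 * d, 0), 0.0),  # left pad => causal
                       nn.Conv1d(in_dim if i == 0 else hidden, hidden,
                                 kernel_size=3, dilation=d),
                       nn.ReLU()]
        self.tcn = nn.Sequential(*layers)
        self.cls = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, x):               # x: (batch, in_dim, frames)
        return self.cls(self.tcn(x))    # per-frame class logits
```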

    Updated: 2019-02-13
  • Hyperbolic Wasserstein Distance for Shape Indexing
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-02-08
    Jie Shi; Yalin Wang

    Shape space is an active research topic in the computer vision and medical imaging fields. The distance defined in a shape space may provide a simple and refined index to represent a unique shape. This work studies the Wasserstein space and proposes a novel framework to compute the Wasserstein distance between general topological surfaces by integrating hyperbolic Ricci flow, hyperbolic harmonic map, and hyperbolic power Voronoi diagram algorithms. The resulting hyperbolic Wasserstein distance can intrinsically measure the similarity between general topological surfaces. Our proposed algorithms are theoretically rigorous and practically efficient. The framework has the potential to be a powerful tool for 3D shape indexing research. We tested our algorithm with human face classification and Alzheimer's disease (AD) progression tracking studies. Experimental results demonstrate that our work may provide a succinct and effective shape index.

    Updated: 2019-02-11
  • A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-19
    Srikrishna Karanam; Mengran Gou; Ziyan Wu; Angels Rates-Borras; Octavia Camps; Richard J. Radke

    Person re-identification (re-id) is a critical problem in video analytics applications such as security and surveillance. The public release of several datasets and code for vision algorithms has facilitated rapid progress in this area over the last few years. However, directly comparing re-id algorithms reported in the literature has become difficult, since a wide variety of features, experimental protocols, and evaluation metrics are employed. To address this issue, we present an extensive review and performance evaluation of single- and multi-shot re-id algorithms. The experimental protocol incorporates the most recent advances in both feature extraction and metric learning. To ensure a fair comparison, all of the approaches were implemented using a unified code library that includes 11 feature extraction algorithms and 22 metric learning and ranking techniques. All approaches were evaluated using a new large-scale dataset that closely mimics a real-world problem setting, in addition to 16 other publicly available datasets: VIPeR, GRID, CAVIAR, DukeMTMC4ReID, 3DPeS, PRID, V47, WARD, SAIVT-SoftBio, CUHK01, CUHK02, CUHK03, RAiD, iLIDSVID, HDA+, and Market1501. The evaluation codebase and results will be made publicly available for community use.

    Updated: 2019-02-06
  • Constant-Time Calculation of Zernike Moments for Detection with Rotational Invariance
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-07
    Aneta Bera; Przemysław Klęsk; Dariusz Sychel

    We construct a set of special complex-valued integral images and an algorithm that allows Zernike moments to be calculated fast, namely in constant time. The technique is suitable for dense detection procedures, where the image is scanned by a sliding window at multiple scales, and where rotational invariance is required at the level of each window. We assume no preliminary image segmentation. Owing to the proposed integral images and binomial expansions, the extraction of each feature does not depend on the number of pixels in the window and is thereby an $O(1)$ calculation. We analyze algorithmic properties of the proposed approach, such as the number of integral images needed, the complex conjugacy of the integral images, the number of operations involved in feature extraction, and speed-up possibilities based on lookup tables. We also point out connections between Zernike and orthogonal Fourier–Mellin moments in the context of computations backed with integral images. Finally, we demonstrate three examples of detection tasks of varying difficulty. Detectors are trained on the proposed features by the RealBoost algorithm. When learning, the classifiers get acquainted only with examples of target objects in their upright position or rotated within a limited range. At the testing stage, generalization onto the full $360^\circ$ angle takes place automatically.
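    The enabling trick can be illustrated with an ordinary real-valued integral image: one cumulative-sum pass, after which any axis-aligned window sum costs O(1) regardless of window size. The paper constructs analogous complex-valued integral images for the Zernike moment terms.

```python
import numpy as np

def integral_image(img):
    return img.cumsum(axis=0).cumsum(axis=1)

def window_sum(ii, top, left, h, w):
    """Sum of img[top:top+h, left:left+w] in O(1) from the integral image."""
    total = ii[top + h - 1, left + w - 1]
    if top > 0:
        total -= ii[top - 1, left + w - 1]
    if left > 0:
        total -= ii[top + h - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.random.rand(480, 640)
ii = integral_image(img)
assert np.isclose(window_sum(ii, 10, 20, 32, 32), img[10:42, 20:52].sum())
```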

    Updated: 2019-02-06
  • Empirical Bayesian Light-Field Stereo Matching by Robust Pseudo Random Field Modeling
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-27
    Chao-Tsung Huang

    Light-field stereo matching problems are commonly modeled by Markov Random Fields (MRFs) for statistical inference of depth maps. Nevertheless, most previous approaches did not adapt to image statistics but instead adopted fixed model parameters. They explored explicit vision cues, such as depth consistency and occlusion, to provide local adaptability and enhance depth quality. However, such additional assumptions can end up confining their applicability; e.g., algorithms designed for dense view sampling are not suitable for sparse sampling. In this paper, we go back to MRF fundamentals and develop an empirical Bayesian framework—the Robust Pseudo Random Field—to explore intrinsic statistical cues for broad applicability. Based on pseudo-likelihoods with hidden soft-decision priors, we apply soft expectation-maximization (EM) for good model fitting and perform hard EM for robust depth estimation. We introduce novel pixel difference models to enable such adaptability and robustness simultaneously. Accordingly, we devise a stereo matching algorithm that employs this framework on dense, sparse, and even denoised light fields. It can be applied to both true-color and grey-scale pixels. Experimental results show that it estimates scene-dependent parameters robustly and converges quickly. In terms of depth accuracy and computation speed, it also consistently outperforms state-of-the-art algorithms.

    Updated: 2019-02-06
  • Evaluating the Group Detection Performance: The GRODE Metrics
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-16
    Francesco Setti; Marco Cristani

    The detection of groups of individuals is attracting the attention of many researchers in diverse fields, from automated surveillance to human-computer interaction, with a growing number of approaches published every year. Unexpectedly, the evaluation metrics for this problem are not consolidated: some measures are inherited from the people detection field, others from clustering, and others are designed specifically for a particular approach, thus lacking generality and making comparisons between different approaches hard to carry out. Moreover, most existing metrics are scarcely expressive, treating groups as if they were atomic entities, ignoring that they may have different cardinalities and that group detection approaches may fail to capture the exact number of individuals that compose them. This paper fills this gap by presenting the GROup DEtection (GRODE) metrics, which formally define precision and recall on groups, including the group cardinality as a variable. This makes it possible to investigate aspects never considered so far, such as the tendency of a method to over- or under-segment, or to better deal with specific group cardinalities. The GRODE metrics are first evaluated on controlled scenarios, where the differences with alternative metrics are evident. Then, the metrics are applied to eight group detection approaches on eight public datasets, providing a fresh panorama of the state-of-the-art and uncovering interesting strengths and pitfalls of recent approaches.
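    A simplified, cardinality-aware precision/recall sketch is shown below. It is not the exact GRODE definition: the matching rule is the common "at least two thirds of members shared" criterion from the group-detection literature, and true/false positives and false negatives are simply tabulated per group cardinality.

```python
from collections import defaultdict

def groups_match(det, gt, ratio=2 / 3):
    shared = len(set(det) & set(gt))
    return shared >= ratio * max(len(det), len(gt))

def cardinality_scores(detections, ground_truth):
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    matched = set()
    for det in detections:
        hit = next((i for i, gt in enumerate(ground_truth)
                    if i not in matched and groups_match(det, gt)), None)
        if hit is None:
            fp[len(det)] += 1                    # spurious group of this size
        else:
            matched.add(hit)
            tp[len(ground_truth[hit])] += 1
    for i, gt in enumerate(ground_truth):
        if i not in matched:
            fn[len(gt)] += 1                     # missed group of this size
    return {c: (tp[c] / max(tp[c] + fp[c], 1),   # precision at cardinality c
                tp[c] / max(tp[c] + fn[c], 1))   # recall at cardinality c
            for c in sorted(set(tp) | set(fp) | set(fn))}
```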

    Updated: 2019-02-06
  • FCSS: Fully Convolutional Self-Similarity for Dense Semantic Correspondence
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-07
    Seungryong Kim; Dongbo Min; Bumsub Ham; Stephen Lin; Kwanghoon Sohn

    We present a descriptor, called fully convolutional self-similarity (FCSS), for dense semantic correspondence. Unlike traditional dense correspondence approaches for estimating depth or optical flow, semantic correspondence estimation poses additional challenges due to intra-class appearance and shape variations among different instances within the same object or scene category. To robustly match points across semantically similar images, we formulate FCSS using local self-similarity (LSS), which is inherently insensitive to intra-class appearance variations. LSS is incorporated through a proposed convolutional self-similarity (CSS) layer, where the sampling patterns and the self-similarity measure are jointly learned in an end-to-end and multi-scale manner. Furthermore, to address shape variations among different object instances, we propose a convolutional affine transformer (CAT) layer that estimates explicit affine transformation fields at each pixel to transform the sampling patterns and corresponding receptive fields. As training data for semantic correspondence is rather limited, we propose to leverage object candidate priors provided in most existing datasets and also correspondence consistency between object pairs to enable weakly-supervised learning. Experiments demonstrate that FCSS significantly outperforms conventional handcrafted descriptors and CNN-based descriptors on various benchmarks.

    Updated: 2019-02-06
  • Hierarchical Scene Parsing by Weakly Supervised Learning with Image Descriptions
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-01-30
    Ruimao Zhang; Liang Lin; Guangrun Wang; Meng Wang; Wangmeng Zuo

    This paper investigates a fundamental problem of scene understanding: how to parse a scene image into a structured configuration (i.e., a semantic object hierarchy with object interaction relations). We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) extracting the image representation for pixel-wise object labeling and ii) a recursive neural network (RsNN) discovering the hierarchical object structure and the inter-object relations. Rather than relying on elaborate annotations (e.g., manually labeled semantic maps and relations), we train our deep model in a weakly-supervised manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and apply these tree structures to discover the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and RsNN are updated accordingly by back propagation. The entire model training is accomplished through an Expectation-Maximization method. Extensive experiments show that our model is capable of producing meaningful scene configurations and achieving more favorable scene labeling results on two benchmarks (i.e., PASCAL VOC 2012 and SYSU-Scenes) compared with other state-of-the-art weakly-supervised deep learning methods. In particular, SYSU-Scenes, which we created to advance research on scene parsing, contains more than 5,000 scene images with semantic sentence descriptions.

    Updated: 2019-02-06
  • Improving Shadow Suppression for Illumination Robust Face Recognition
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-07
    Wuming Zhang; Xi Zhao; Jean-Marie Morvan; Liming Chen

    2D face analysis techniques, such as face landmarking, face recognition and face verification, remain strongly dependent on illumination conditions, which are usually uncontrolled and unpredictable in the real world. Current massive data-driven approaches, e.g., deep learning-based face recognition, require a huge amount of labeled training face data that can hardly cover the infinite lighting variations encountered in real-life applications. An illumination-robust preprocessing method thus remains a very interesting but significant challenge for reliable face analysis. In this paper we propose a novel model-driven approach to improve lighting normalization of face images. Specifically, we build an underlying reflectance model which characterizes the interactions between skin surface, lighting source and camera sensor, and elaborates the formation of face color appearance. The proposed illumination processing pipeline enables the generation of a Chromaticity Intrinsic Image (CII) in a log chromaticity space which is robust to illumination variations. Moreover, as an advantage over most prevailing methods, a photo-realistic color face image is subsequently reconstructed, which eliminates a wide variety of shadows whilst retaining the color information and identity details. Experimental results under different scenarios and using various face databases show the effectiveness of the proposed approach in dealing with lighting variations, including both soft and hard shadows, in face recognition.
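    The log chromaticity space the abstract mentions can be sketched in a few lines; the geometric-mean normalization below is one standard construction, while the full CII pipeline (the reflectance model and shadow removal) is beyond this sketch.

```python
import numpy as np

def log_chromaticity(rgb, eps=1e-6):
    """rgb: float array (H, W, 3) in [0, 1] -> log-chromaticity coordinates."""
    rgb = np.clip(rgb, eps, 1.0)
    geo_mean = rgb.prod(axis=2, keepdims=True) ** (1.0 / 3.0)
    # dividing by the geometric mean cancels a global illumination scaling;
    # the three output channels sum to zero at each pixel
    return np.log(rgb / geo_mean)
```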

    Updated: 2019-02-06
  • Learning Hyperedge Replacement Grammars for Graph Generation
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-03-01
    Salvador Aguinaga; David Chiang; Tim Weninger

    The discovery and analysis of network patterns are central to the scientific enterprise. In the present work, we developed and evaluated a new approach that learns the building blocks of graphs that can be used to understand and generate new realistic graphs. Our key insight is that a graph's clique tree encodes robust and precise information. We show that a Hyperedge Replacement Grammar (HRG) can be extracted from the clique tree, and we develop a fixed-size graph generation algorithm that can be used to produce new graphs of a specified size. In experiments on large real-world graphs, we show that graphs generated from the HRG approach exhibit a diverse range of properties that are similar to those found in the original networks. In addition to graph properties like degree or eigenvector centrality, what a graph “looks like” ultimately depends on small details in local graph substructures that are difficult to define at a global level. We show that the HRG model can also preserve these local substructures when generating new graphs.

    Updated: 2019-02-06
  • Mixed Supervised Object Detection with Robust Objectness Transfer
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-28
    Yan Li; Junge Zhang; Kaiqi Huang; Jianguo Zhang

    In this paper, we consider the problem of leveraging existing fully labeled categories to improve the weakly supervised detection (WSD) of new object categories, which we refer to as mixed supervised detection (MSD). Different from previous MSD methods that directly transfer the pre-trained object detectors from existing categories to new categories, we propose a more reasonable and robust objectness transfer approach for MSD. In our framework, we first learn domain-invariant objectness knowledge from the existing fully labeled categories. The knowledge is modeled based on invariant features that are robust to the distribution discrepancy between the existing categories and new categories; therefore the resulting knowledge would generalize well to new categories and could assist detection models to reject distractors (e.g., object parts) in weakly labeled images of new categories. Under the guidance of learned objectness knowledge, we utilize multiple instance learning (MIL) to model the concepts of both objects and distractors and to further improve the ability of rejecting distractors in weakly labeled images. Our robust objectness transfer approach outperforms the existing MSD methods, and achieves state-of-the-art results on the challenging ILSVRC2013 detection dataset and the PASCAL VOC datasets.

    Updated: 2019-02-06
  • Recurrent Face Aging with Hierarchical AutoRegressive Memory
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-07
    Wei Wang; Yan Yan; Zhen Cui; Jiashi Feng; Shuicheng Yan; Nicu Sebe

    Modeling the aging process of human faces is important for cross-age face verification and recognition. In this paper, we propose a Recurrent Face Aging (RFA) framework which takes as input a single image and automatically outputs a series of aged faces. The hidden units in the RFA are connected autoregressively allowing the framework to age the person by referring to the previous aged faces. Due to the lack of labeled face data of the same person captured in a long range of ages, traditional face aging models split the ages into discrete groups and learn a one-step face transformation for each pair of adjacent age groups. Since human face aging is a smooth progression, it is more appropriate to age the face by going through smooth transitional states. In this way, the intermediate aged faces between the age groups can be generated. Towards this target, we employ a recurrent neural network whose recurrent module is a hierarchical triple-layer gated recurrent unit which functions as an autoencoder. The bottom layer of the module encodes the input to a latent representation, and the top layer decodes the representation to a corresponding aged face. The experimental results demonstrate the effectiveness of our framework.

    Updated: 2019-02-06
  • Recursive Nearest Agglomeration (ReNA): Fast Clustering for Approximation of Structured Signals
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-03-13
    Andrés Hoyos-Idrobo; Gaël Varoquaux; Jonas Kahn; Bertrand Thirion

    In this work, we revisit fast dimension reduction approaches, such as random projections and random sampling. Our goal is to summarize the data to decrease the computational costs and memory footprint of subsequent analysis. Such dimension reduction can be very efficient when the signals of interest have a strong structure, as with images. We focus on this setting and investigate feature clustering schemes for data reduction that capture this structure. An impediment to fast dimension reduction is that good clustering comes with large algorithmic costs. We address this by contributing a linear-time agglomerative clustering scheme, Recursive Nearest Agglomeration (ReNA). Unlike existing fast agglomerative schemes, it avoids the creation of giant clusters. We empirically validate that it approximates the data as well as traditional variance-minimizing clustering schemes that have quadratic complexity. In addition, we analyze signal approximation with feature clustering and show that it can remove noise, improving subsequent analysis steps. As a consequence, data reduction by clustering features with ReNA yields very fast and accurate models, making it possible to process large datasets on a budget. Our theoretical analysis is backed by extensive experiments on publicly-available data that illustrate the computational efficiency and denoising properties of the resulting dimension reduction scheme.
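    ReNA itself is not in standard libraries, but the data-reduction pattern it accelerates can be demonstrated with scikit-learn's variance-minimizing feature agglomeration, i.e., the kind of quadratic-cost baseline the paper improves on.

```python
import numpy as np
from sklearn.cluster import FeatureAgglomeration

X = np.random.rand(200, 4096)                  # 200 samples, 4096 voxel features
agglo = FeatureAgglomeration(n_clusters=256)   # Ward linkage by default
X_reduced = agglo.fit_transform(X)             # (200, 256) cluster-averaged data
X_approx = agglo.inverse_transform(X_reduced)  # back to (200, 4096)
print(X_reduced.shape, np.abs(X - X_approx).mean())
```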

    Updated: 2019-02-06
  • Robust and Globally Optimal Manhattan Frame Estimation in Near Real Time
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-01-30
    Kyungdon Joo; Tae-Hyun Oh; Junsik Kim; In So Kweon

    Most man-made environments, such as urban and indoor scenes, consist of a set of parallel and orthogonal planar structures. These structures are approximated by the Manhattan world assumption, a notion that can be represented as a Manhattan frame (MF). Given a set of inputs such as surface normals or vanishing points, we pose the MF estimation problem as a consensus set maximization that maximizes the number of inliers over the rotation search space. Conventionally, this problem is solved with a branch-and-bound framework, which mathematically guarantees global optimality. However, the computational time of conventional branch-and-bound algorithms is rather far from real-time. In this paper, we propose a novel bound computation method on an efficient measurement domain for MF estimation, i.e., the extended Gaussian image (EGI). By relaxing the original problem, we can compute the bound with constant complexity, while preserving global optimality. Furthermore, we quantitatively and qualitatively demonstrate the performance of the proposed method on various synthetic and real-world data. We also show the versatility of our approach through three different applications: extension to multiple MF estimation, 3D rotation based video stabilization, and vanishing point estimation (line clustering).

    Updated: 2019-02-06
  • Searching for Representative Modes on Hypergraphs for Robust Geometric Model Fitting
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-07
    Hanzi Wang; Guobao Xiao; Yan Yan; David Suter

    In this paper, we propose a simple and effective geometric model fitting method to fit and segment multi-structure data even in the presence of severe outliers. We cast the task of geometric model fitting as a representative mode-seeking problem on hypergraphs. Specifically, a hypergraph is first constructed, where the vertices represent model hypotheses and the hyperedges denote data points. The hypergraph involves higher-order similarities (instead of pairwise similarities used on a simple graph), and it can characterize complex relationships between model hypotheses and data points. In addition, we develop a hypergraph reduction technique to remove “insignificant” vertices while retaining as many “significant” vertices as possible in the hypergraph. Based on the simplified hypergraph, we then propose a novel mode-seeking algorithm to search for representative modes within reasonable time. Finally, the proposed mode-seeking algorithm detects modes according to two key elements, i.e., the weighting scores of vertices and the similarity analysis between vertices. Overall, the proposed fitting method is able to efficiently and effectively estimate the number and the parameters of model instances in the data simultaneously. Experimental results demonstrate that the proposed method achieves significant superiority over several state-of-the-art model fitting methods on both synthetic data and real images.

    Updated: 2019-02-06
  • Self Paced Deep Learning for Weakly Supervised Object Detection
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-12
    Enver Sangineto; Moin Nabi; Dubravko Culibrk; Nicu Sebe

    In a weakly-supervised scenario, object detectors need to be trained using image-level annotation alone. Since bounding-box-level ground truth is not available, most of the solutions proposed so far are based on an iterative Multiple Instance Learning framework in which the current classifier is used to select the highest-confidence boxes in each image, which are treated as pseudo-ground truth in the next training iteration. However, the errors of an immature classifier can make the process drift, usually introducing many false positives into the training dataset. To alleviate this problem, we propose a training protocol based on the self-paced learning paradigm. The main idea is to iteratively select a subset of images and boxes that are the most reliable, and use them for training. While in the past few years similar strategies have been adopted for SVMs and other classifiers, we are the first to show that a self-paced approach can be used with deep-network-based classifiers in an end-to-end training pipeline. The method we propose is built on the fully-supervised Fast-RCNN architecture and can be applied to similar architectures which represent the input image as a bag of boxes. We show state-of-the-art results on Pascal VOC 2007, Pascal VOC 2010 and ILSVRC 2013. On ILSVRC 2013 our results based on a low-capacity AlexNet network outperform even those weakly-supervised approaches which are based on much higher-capacity networks.
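    The outer self-paced loop can be sketched schematically; all detector internals are hidden behind hypothetical methods (best_box, best_box_confidence, train), so only the grow-the-reliable-subset logic is shown.

```python
def self_paced_training(images, detector, rounds=5, start_frac=0.4):
    """Train on the currently most reliable images, enlarging the set each round."""
    for r in range(rounds):
        scored = [(detector.best_box_confidence(img), img) for img in images]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frac = start_frac + r * (1.0 - start_frac) / (rounds - 1)
        reliable = [img for _, img in scored[:int(len(images) * frac)]]
        pseudo_gt = [detector.best_box(img) for img in reliable]
        detector.train(reliable, pseudo_gt)     # pseudo-labels as ground truth
    return detector
```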

    Updated: 2019-02-06
  • SurfCut: Surfaces of Minimal Paths from Topological Structures
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-03-05
    Marei Algarni; Ganesh Sundaramoorthi

    We present SurfCut, an algorithm for extracting a smooth, simple surface with an unknown 3D curve boundary from a noisy 3D image and a seed point. Our method is built on the novel observation that ridge curves of the Euclidean length of minimal paths ending on a level set of the solution of the eikonal equation lie on the surface. Our method extracts these ridges and cuts them to form the surface boundary. The surface extraction step rests on a second observation: the surface itself lies in a valley of the eikonal equation solution. The resulting surface is a collection of minimal paths. Using the framework of cubical complexes and Morse theory, we design algorithms to extract ridges and valleys robustly. Experiments on three 3D datasets show the robustness of our method, and that it achieves higher accuracy with lower computational cost than the state-of-the-art.

    Updated: 2019-02-06
  • What Do Different Evaluation Metrics Tell Us About Saliency Models?
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-03-13
    Zoya Bylinskii; Tilke Judd; Aude Oliva; Antonio Torralba; Frédo Durand

    How best to evaluate a saliency model's ability to predict where humans look in images is an open research question. The choice of evaluation metric depends on how saliency is defined and how the ground truth is represented. Metrics differ in how they rank saliency models, and this results from how false positives and false negatives are treated, whether viewing biases are accounted for, whether spatial deviations are factored in, and how the saliency maps are pre-processed. In this paper, we provide an analysis of 8 different evaluation metrics and their properties. With the help of systematic experiments and visualizations of metric computations, we add interpretability to saliency scores and more transparency to the evaluation of saliency models. Building off the differences in metric properties and behaviors, we make recommendations for metric selections under specific assumptions and for specific applications.
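    One of the metrics under analysis, Normalized Scanpath Saliency (NSS), is compact enough to show in full: z-score the saliency map, then average its values at the ground-truth fixation locations.

```python
import numpy as np

def nss(saliency, fixations):
    """saliency: (H, W) float map; fixations: (H, W) binary fixation map."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return s[fixations.astype(bool)].mean()
```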

    Updated: 2019-02-06
  • Compressive Binary Patterns: Designing a Robust Binary Face Descriptor with Random-Field Eigenfilters
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-01-31
    Weihong Deng; Jiani Hu; Jun Guo

    A binary descriptor typically consists of three stages: image filtering, binarization, and spatial histogram. This paper first demonstrates that the binary code of the maximum-variance filtering responses leads to the lowest bit error rate under Gaussian noise. Then, an optimal eigenfilter bank is derived from a universal assumption on the local stationary random field. Finally, compressive binary patterns (CBP) is designed by replacing the local derivative filters of local binary patterns (LBP) with these novel random-field eigenfilters, which leads to a compact and robust binary descriptor that characterizes the most stable local structures that are resistant to image noise and degradation. A scattering-like operator is subsequently applied to enhance the distinctiveness of the descriptor. Surprisingly, the results obtained from experiments on the FERET, LFW, and PaSC databases show that the scattering CBP (SCBP) descriptor, which is handcrafted by only 6 optimal eigenfilters under restrictive assumptions, outperforms the state-of-the-art learning-based face descriptors in terms of both matching accuracy and robustness. In particular, on probe images degraded with noise, blur, JPEG compression, and reduced resolution, SCBP outperforms other descriptors by a greater than 10 percent accuracy margin.

    Updated: 2019-02-06
  • HARD-PnP: PnP Optimization Using a Hybrid Approximate Representation
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-15
    Simon Hadfield; Karel Lebeda; Richard Bowden

    This paper proposes a Hybrid Approximate Representation (HAR) based on unifying several efficient approximations of the generalized reprojection error (which is known as the gold standard for multiview geometry). The HAR is an over-parameterization scheme where the approximation is applied simultaneously in multiple parameter spaces. A joint minimization scheme “HAR-Descent” can then solve the PnP problem efficiently, while remaining robust to approximation errors and local minima. The technique is evaluated extensively, including numerous synthetic benchmark protocols and the real-world data evaluations used in previous works. The proposed technique was found to have runtime complexity comparable to the fastest $O(n)$ techniques, and up to 10 times faster than current state of the art minimization approaches. In addition, the accuracy exceeds that of all 9 previous techniques tested, providing definitive state of the art performance on the benchmarks, across all 90 of the experiments in the paper and supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2018.2806446 .

    Updated: 2019-02-06
  • EuroCity Persons: A Novel Benchmark for Person Detection in Traffic Scenes
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-02-05
    Markus Braun; Sebastian Krebs; Fabian Flohr; Dariu Gavrila

    Big data has had a great share in the success of deep learning in computer vision. Recent works suggest that there is significant further potential to increase object detection performance by utilizing even bigger datasets. In this paper, we introduce the EuroCity Persons dataset, which provides a large number of highly diverse, accurate and detailed annotations of pedestrians, cyclists and other riders in urban traffic scenes. The images for this dataset were collected on-board a moving vehicle in 31 cities of 12 European countries. With over 238200 person instances manually labeled in over 47300 images, EuroCity Persons is nearly one order of magnitude larger than datasets used previously for person detection in traffic scenes. The dataset furthermore contains a large number of person orientation annotations (over 211200). We optimize four state-of-the-art deep learning approaches (Faster R-CNN, R-FCN, SSD and YOLOv3) to serve as baselines for the new object detection benchmark. We analyze the generalization capabilities of these detectors when trained with the new dataset. We furthermore study the effect of the training set size, the dataset diversity (day- vs. night-time, geographical region), the dataset detail (i.e. availability of object orientation information) and the annotation quality on the detector performance. Finally, we analyze error sources and discuss the road ahead.

    Updated: 2019-02-06
  • View Adaptive Neural Networks for High Performance Skeleton-based Human Action Recognition
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-31
    Pengfei Zhang; Cuiling Lan; Junliang Xing; Wenjun Zeng; Jianru Xue; Nanning Zheng

    Skeleton-based human action recognition has recently attracted increasing attention thanks to the accessibility and the popularity of 3D skeleton data. One of the key challenges in skeleton-based action recognition lies in the large view variations when capturing data. In order to alleviate the effects of view variations, this paper introduces a novel view adaptation scheme, which automatically determines the virtual observation viewpoints in a learning-based, data-driven manner. We design two view adaptive neural networks, i.e., VA-RNN based on RNN and VA-CNN based on CNN. For each network, a novel view adaptation module learns and determines the most suitable observation viewpoints, and transforms the skeletons to those viewpoints for end-to-end recognition with a main classification network. Ablation studies find that the proposed view adaptive models are capable of transforming the skeletons of various viewpoints to much more consistent virtual viewpoints, which largely eliminates the viewpoint influence. In addition, we design a two-stream scheme (referred to as VA-fusion) that fuses the scores of the two networks to provide the fused prediction. Extensive experimental evaluations on six challenging benchmarks demonstrate the effectiveness of the proposed view-adaptive networks and their superior performance over state-of-the-art approaches.

    Updated: 2019-02-06
  • Hierarchical Surface Prediction
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-30
    Christian Häne; Shubham Tulsiani; Jitendra Malik

    Recently, Convolutional Neural Networks have shown promising results for 3D geometry prediction. They can make predictions from very little input data such as a single color image. A major limitation of such approaches is that they only predict a coarse resolution voxel grid, which does not capture the surface of the objects well. We propose a general framework, called hierarchical surface prediction (HSP), which facilitates prediction of high resolution voxel grids. The main insight is that it is sufficient to predict high resolution voxels around the predicted surfaces. The exterior and interior of the objects can be represented with coarse resolution voxels. This allows us to predict significantly higher resolution voxel grids around the surface, from which triangle meshes can be extracted. Additionally it allows us to predict properties such as surface color which are only defined on the surface. Our approach is not dependent on a specific input type. We show results for geometry prediction from color images and depth images. Our analysis shows that our high resolution predictions are more accurate than low resolution predictions.

    Updated: 2019-01-31
  • Absent Multiple Kernel Learning Algorithms
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Xinwang Liu; Lei Wang; Xinzhong Zhu; Miaomiao Li; En Zhu; Tongliang Liu; Li Liu; Yong Dou; Jianping Yin

    Multiple kernel learning (MKL) has been intensively studied during the past decade. It optimally combines the multiple channels of each sample to improve classification performance. However, existing MKL algorithms cannot effectively handle the situation where some channels of the samples are missing, which is not uncommon in practical applications. This paper proposes three absent MKL (AMKL) algorithms to address this issue. Different from existing approaches, where missing channels are first imputed and then a standard MKL algorithm is deployed on the imputed data, our algorithms directly classify each sample based on its observed channels, without performing imputation. Specifically, we define a margin for each sample in its own relevant space, the space corresponding to the observed channels of that sample. The proposed AMKL algorithms then maximize the minimum of all sample-based margins, which leads to a difficult optimization problem. We first provide two two-step iterative algorithms to approximately solve this problem. After that, we show that this problem can be reformulated as a convex one by applying the representer theorem. This makes it readily solvable via existing convex optimization packages. In addition, we provide a generalization error bound to justify the proposed AMKL algorithms from a theoretical perspective. Extensive experiments are conducted on nine UCI and six MKL benchmark datasets to compare the proposed algorithms with existing imputation-based methods. As demonstrated, our algorithms achieve superior performance, and the improvement is more significant as the missing ratio increases.

    Updated: 2019-01-28
  • Training Faster by Separating Modes of Variation in Batch-normalized Models
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Mahdi M. Kalayeh; Mubarak Shah

    Batch Normalization (BN) is essential to effectively train state-of-the-art deep Convolutional Neural Networks (CNN). It normalizes the layer outputs during training using the statistics of each mini-batch. BN accelerates the training procedure by allowing large learning rates to be used safely and alleviates the need for careful initialization of the parameters. In this work, we study BN from the viewpoint of Fisher kernels that arise from generative probability models. We show that, assuming samples within a mini-batch are from the same probability density function, BN is identical to the Fisher vector of a Gaussian distribution. That means the batch normalizing transform can be explained in terms of kernels that naturally emerge from the probability density function that models the generative process of the underlying data distribution. Consequently, it promises higher discrimination power for the batch-normalized mini-batch. However, given the rectifying non-linearities employed in CNN architectures, the distribution of the layer outputs shows an asymmetric characteristic. Therefore, in order for BN to fully benefit from the aforementioned properties, we propose approximating the underlying data distribution not with one, but with a mixture of Gaussian densities. Deriving the Fisher vector for a Gaussian Mixture Model (GMM) reveals that batch normalization can be improved by independently normalizing with respect to the statistics of disentangled sub-populations. We refer to our proposed soft piecewise version of batch normalization as Mixture Normalization (MN). Through an extensive set of experiments on CIFAR-10 and CIFAR-100, using both a 5-layers deep CNN and the modern Inception-V3 architecture, we show that mixture normalization reduces the required number of gradient updates to reach the maximum test accuracy of the batch normalized model by ∼31%-47% across a variety of training scenarios. Replacing even a few BN modules with MN in the 48-layers deep Inception-V3 architecture is sufficient to not only obtain considerable training acceleration but also better final test accuracy. We show that similar observations are valid for 40 and 100-layers deep DenseNet architectures as well. We complement our study by evaluating the application of mixture normalization to Generative Adversarial Networks (GANs), where "mode collapse" hinders the training process. We solely replace a few batch normalization layers in the generator with our proposed mixture normalization. Our experiments using Deep Convolutional GAN (DCGAN) on CIFAR-10 show that mixture normalized DCGAN not only provides an acceleration of ∼58% but also reaches a lower (better) "Fréchet Inception Distance" (FID) of 33.35, compared to 37.56 for its batch normalized counterpart.
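    A hedged sketch of the normalization step for a flattened activation vector: soft-assign activations to K Gaussian components, z-score them with each component's statistics, and recombine by posterior weight. The paper's per-channel treatment and the EM estimation of the mixture parameters are omitted.

```python
import torch

def mixture_normalize(x, means, stds, priors, eps=1e-5):
    """x: (N,) activations; means, stds, priors: (K,) 1-D GMM parameters."""
    # posterior p(k|x); the constant -0.5*log(2*pi) cancels in the softmax
    log_prob = (-0.5 * ((x[:, None] - means) / stds) ** 2
                - stds.log() + priors.log())
    post = torch.softmax(log_prob, dim=1)             # (N, K) responsibilities
    normed = (x[:, None] - means) / (stds + eps)      # per-component z-score
    return (post * normed).sum(dim=1)                 # posterior-weighted mix
```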

    Updated: 2019-01-28
  • Joint Rain Detection and Removal from a Single Image with Contextualized Deep Networks
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Wenhan Yang; Robby T. Tan; Jiashi Feng; Jiaying Liu; Shuicheng Yan; Zongming Guo

    Rain streaks, particularly in heavy rain, not only degrade visibility but also make many computer vision algorithms fail to function properly. In this paper, we address this visibility problem by focusing on single-image rain removal, even in the presence of dense rain streaks and rain-streak accumulation, which is visually similar to mist or fog. To achieve this, we introduce a new rain model and a deep learning architecture. Our rain model incorporates a binary rain map indicating rain-streak regions, and accommodates various shapes, directions, and sizes of overlapping rain streaks, as well as rain accumulation, to model heavy rain. Based on this model, we construct a multi-task deep network, which jointly learns three targets: the binary rain-streak map, rain streak layers, and the clean background, which is our ultimate output. To generate features that can be invariant to rain streaks, we introduce a contextual dilated network, which is able to exploit regional contextual information. To handle various shapes and directions of overlapping rain streaks, our strategy is to utilize a recurrent process that progressively removes rain streaks. Our binary map provides a constraint and thus additional information to train our network. Extensive evaluation on real images, particularly in heavy rain, shows the effectiveness of our model and architecture.

    Updated: 2019-01-28
  • Calibrating Classification Probabilities with Shape-restricted Polynomial Regression
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Yongqiao Wang; Lishuai Li; Chuangyin Dang

    In many real-world classification problems, accurate prediction of membership probabilities is critical for further decision making. The probability calibration problem studies how to map scores obtained from one classification algorithm to membership probabilities. The requirement that this mapping be non-decreasing involves an infinite number of inequality constraints, which makes its estimation computationally intractable. Owing to this difficulty, existing methods fail to achieve four desiderata of probability calibration: universal flexibility, non-decreasingness, continuousness and computational tractability. This paper proposes a method based on shape-restricted polynomial regression, which satisfies all four desiderata. In the method, the calibrating function is approximated with monotone polynomials, and the continuously-constrained requirement of monotonicity is equivalent to a set of semidefinite constraints. Thus, the calibration problem can be solved with tractable semidefinite programs. The estimator is both strongly and weakly universally consistent under a trivial condition. Experimental results on both artificial and real data sets clearly show that the method can greatly improve calibration performance in terms of reliability-curve related measures.
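    A runnable stand-in for the idea, under a stated simplification: the paper enforces monotonicity exactly through semidefinite constraints, whereas the sketch below enforces a nonnegative derivative only on a dense grid of score values, solved with SLSQP.

```python
import numpy as np
from scipy.optimize import minimize

def fit_monotone_poly(scores, labels, degree=5, grid=200):
    """Least-squares polynomial calibration with monotonicity on a grid."""
    V = np.vander(scores, degree + 1, increasing=True)    # [1, s, ..., s^d]
    t = np.linspace(scores.min(), scores.max(), grid)
    # rows of D evaluate p'(t) given the coefficients c[1:]
    D = np.vander(t, degree + 1, increasing=True)[:, :-1] * np.arange(1, degree + 1)
    obj = lambda c: ((V @ c - labels) ** 2).mean()
    cons = {"type": "ineq", "fun": lambda c: D @ c[1:]}   # p'(t) >= 0 on the grid
    res = minimize(obj, np.zeros(degree + 1), constraints=[cons], method="SLSQP")
    return lambda s: np.clip(np.vander(s, degree + 1, increasing=True) @ res.x, 0, 1)
```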

    Updated: 2019-01-28
  • Unsupervised Video Matting via Sparse and Low-Rank Representation
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-25
    Dongqing Zou; Xiaowu Chen; Guangying Cao; Xiaogang Wang

    A novel method, unsupervised video matting via sparse and low-rank representation, is proposed which can achieve high quality in a variety of challenging examples featuring illumination changes, feature ambiguity, topology changes, transparency variation, dis-occlusion, fast motion and motion blur. Previous matting methods introduced a nonlocal prior to search samples for estimating the alpha matte and achieved impressive results on some data. However, on the one hand, searching inadequate or excessive samples may miss good samples or introduce noise; on the other hand, it is difficult to construct consistent nonlocal structures for pixels with similar features, yielding inconsistent video mattes. In this paper, we propose a novel video matting method to achieve spatially and temporally consistent matting results. Toward this end, a sparse and low-rank representation model is introduced for video matting. The sparse representation is used to adaptively select the best samples for all unknown pixels, while the low-rank representation is used to globally ensure consistent nonlocal structures for pixels with similar features. The two representations are combined to generate spatially and temporally consistent video mattes. Our method has achieved the best performance among all unsupervised matting methods on the public alpha matting evaluation dataset for images.

    Updated: 2019-01-28
  • Models Matter, So Does Training: An Empirical Study of CNNs for Optical Flow Estimation
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-22
    Deqing Sun; Xiaodong Yang; Ming-Yu Liu; Jan Kautz

    We investigate two crucial and closely related aspects of CNNs for optical flow estimation: models and training. First, we design a compact but effective CNN model, called PWC-Net, according to simple and well-established principles: pyramidal processing, warping, and cost volume processing. PWC-Net is 17 times smaller in size, 2 times faster in inference, and 11% more accurate on Sintel final than the recent FlowNet2 model. It is the winning entry in the optical flow competition of the robust vision challenge. Next, we experimentally analyze the sources of our performance gains. In particular, we use the same training procedure of PWC-Net to retrain FlowNetC, a sub-network of FlowNet2. The retrained FlowNetC is 56% more accurate on Sintel final than the previously trained one and even 5% more accurate than the FlowNet2 model. We further improve the training procedure and increase the accuracy of PWC-Net on Sintel by 10% and on KITTI 2012 and 2015 by 20%. Our newly trained model parameters and training protocols are available on https://github.com/NVlabs/PWC-Net.
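    The cost-volume ingredient named above is easy to sketch: correlate each feature in one image with the features in a (2d+1)×(2d+1) neighborhood of the other. The naive loop below favors clarity over the optimized correlation layers real implementations use.

```python
import torch
import torch.nn.functional as F

def cost_volume(f1, f2, max_disp=4):
    """f1, f2: (B, C, H, W) feature maps -> (B, (2d+1)^2, H, W) matching costs."""
    b, c, h, w = f1.shape
    f2 = F.pad(f2, [max_disp] * 4)          # pad left/right/top/bottom
    vols = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2[:, :, dy:dy + h, dx:dx + w]
            vols.append((f1 * shifted).mean(dim=1, keepdim=True))  # correlation
    return torch.cat(vols, dim=1)
```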

    Updated: 2019-01-22
  • Feature Boosting Network For 3D Pose Estimation
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-22
    Jun Liu; Henghui Ding; Amir Shahroudy; Ling-Yu Duan; Xudong Jiang; Gang Wang; Alex Kot Chichung

    In this paper, a feature boosting network is proposed for estimating 3D hand pose and 3D body pose from a single RGB image. In this method, the features learned by the convolutional layers are boosted with a new long short-term dependence-aware (LSTD) module, which enables the intermediate convolutional feature maps to perceive the graphical long short-term dependency among different hand (or body) parts using the designed Graphical ConvLSTM. Learning a set of features that are reliable and discriminatively representative of the pose of a hand (or body) part is difficult due to the ambiguities, texture and illumination variation, and self-occlusion in the real application of 3D pose estimation. To improve the reliability of the features for representing each body part and enhance the LSTD module, we further introduce a context consistency gate (CCG) in this paper, with which the convolutional feature maps are modulated according to their consistency with the context representations. We evaluate the proposed method on challenging benchmark datasets for 3D hand pose estimation and 3D full body pose estimation. Experimental results show the effectiveness of our method that achieves state-of-the-art performance on both of the tasks.

    Updated: 2019-01-22
  • Rolling Shutter Camera Absolute Pose
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-22
    Cenek Albl; Zuzana Kukelova; Viktor Larsson; Tomas Pajdla

    We present minimal, non-iterative solutions to the absolute pose problem for images from rolling shutter cameras. The absolute pose problem is a key problem in computer vision, and rolling shutter is present in the vast majority of today's digital cameras. We discuss several camera motion models and propose two feasible rolling shutter camera models for a polynomial solver. In previous work, a linearized camera model was used that required an initial estimate of the camera orientation. We show how to simplify the system of equations and make this solver faster. Furthermore, we present a first solution of the non-linearized camera orientation model using the Cayley parameterization. The new solver does not require an initial camera orientation estimate and therefore serves as a standalone solution to the rolling shutter camera pose problem from six 2D-to-3D correspondences. We show that our algorithms outperform P3P followed by non-linear refinement using a rolling shutter model.
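    The Cayley parameterization at the heart of the non-linearized model maps three parameters to a rotation through a rational expression, which is what makes it amenable to polynomial solvers; rotations by exactly 180 degrees are the one excluded case.

```python
import numpy as np

def cayley_to_rotation(v):
    """v: length-3 Cayley vector -> 3x3 rotation matrix, R = (I + K)(I - K)^-1."""
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])      # skew-symmetric [v]_x
    I = np.eye(3)
    return (I + K) @ np.linalg.inv(I - K)

R = cayley_to_rotation(np.array([0.1, -0.2, 0.3]))
assert np.allclose(R @ R.T, np.eye(3))      # orthogonal
assert np.isclose(np.linalg.det(R), 1.0)    # proper rotation
```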

    Updated: 2019-01-22
  • Heterogeneous Recommendation via Deep Low-rank Sparse Collective Factorization
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-21
    Shuhui Jiang; Zhengming Ding; Yun Fu

    Real-world recommenders usually make use of heterogeneous types of user feedback—for example, binary ratings such as likes and dislikes, and numerical ratings such as 5-star grades. In this work, we focus on transferring knowledge from binary ratings to numerical ratings, which face a more serious data sparsity problem. Conventional Collective Factorization methods usually assume that multiple domains share some common latent information across users and items. However, related domains may also share some knowledge of rating patterns. Furthermore, existing works may fail to consider the hierarchical structures (i.e., genre, sub-genre, detailed category) in heterogeneous recommendation scenarios. To address these challenges, in this paper we propose a novel Deep Low-rank Sparse Collective Factorization (DLSCF) framework to facilitate cross-domain recommendation. Specifically, low-rank sparse decomposition is adopted to capture the rating patterns shared across multiple domains while splitting off the domain-specific patterns. We also factorize the model in multiple layers to capture the affiliation relation between latent categories and latent sub-categories. We propose both batch and Stochastic Gradient Descent (SGD) based optimization algorithms for solving DLSCF. Experimental results on the MoviePilot, Netflix, Flixter, MovieLens10 and MovieLens20 datasets demonstrate the effectiveness of the proposed algorithms by comparing them with several state-of-the-art batch and SGD based approaches.

    Updated: 2019-01-22
  • Light Field Super-Resolution using a Low-Rank Prior and Deep Convolutional Neural Networks
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-21
    Reuben Farrugia; Christine Guillemot

    Light field imaging has recently attracted renewed interest due to the availability of practical light field capturing systems that offer a wide range of applications in the field of computer vision. However, capturing high-resolution light fields remains technologically challenging, since the increase in angular resolution is often accompanied by a significant reduction in spatial resolution. This paper describes a learning-based spatial light field super-resolution method that allows the restoration of the entire light field with consistency across all angular views. The algorithm first uses optical flow to align the light field and then reduces its angular dimension using low-rank approximation. We then consider the linearly independent columns of the resulting low-rank model as an embedding, which is restored using a deep convolutional neural network (DCNN). The super-resolved embedding is then used to reconstruct the remaining views. The original disparities are restored using inverse warping, where missing pixels are approximated using a novel light field inpainting algorithm. Experimental results show that the proposed method outperforms existing light field super-resolution algorithms, achieving PSNR gains of 0.23 dB over the second best performing method. The performance is shown to be further improved using iterative back-projection as a post-processing step.

    Updated: 2019-01-22
  • Hierarchical LSTMs with Adaptive Attention for Visual Captioning
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-21
    Lianli Gao; Xiangpeng Li; Jingkuan Song; Heng Tao Shen

    Recent progress has been made in using attention-based encoder-decoder frameworks for image and video captioning. Most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). However, these non-visual words can be easily predicted using a natural language model, without considering visual signals or attention. Imposing an attention mechanism on non-visual words could mislead the decoder and decrease the overall performance of visual captioning. Furthermore, a hierarchy of LSTMs enables a more complex representation of visual data, capturing information at different scales. Considering these issues, we propose a hierarchical LSTM with adaptive attention (hLSTMat) approach for image and video captioning. Specifically, the proposed framework utilizes spatial or temporal attention for selecting specific regions or frames to predict the related words, while the adaptive attention decides whether to depend on the visual information or the language context information. Also, a hierarchy of LSTMs is designed to simultaneously consider both low-level visual information and high-level language context information to support caption generation. We design the hLSTMat model as a general framework and first instantiate it for the task of video captioning. Then, we further refine it and apply it to the image captioning task. To demonstrate the effectiveness of our proposed framework, we test our method on both video and image captioning tasks. Experimental results show that our approach achieves state-of-the-art performance for most of the evaluation metrics on both tasks. The effect of important components is also analyzed in an ablation study.
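    The "deciding whether to depend on visual or language information" step can be illustrated with a sentinel-style gate: a softmax over the spatial attention logits plus one extra logit for a language-only vector. This is a hedged sketch of the mechanism, not the authors' code; all tensor names are ours:

        import torch

        def adaptive_context(visual_feats, att_logits, sentinel, sentinel_logit):
            """Blend visual attention with a language-only 'sentinel' vector.

            visual_feats:   (B, N, D) region features
            att_logits:     (B, N)    attention scores over regions
            sentinel:       (B, D)    language-context vector from the LSTM state
            sentinel_logit: (B, 1)    score for ignoring the visual input
            """
            logits = torch.cat([att_logits, sentinel_logit], dim=1)   # (B, N+1)
            weights = torch.softmax(logits, dim=1)
            visual_ctx = (weights[:, :-1].unsqueeze(-1) * visual_feats).sum(dim=1)
            beta = weights[:, -1:]                    # how much to rely on language
            return beta * sentinel + (1.0 - beta) * visual_ctx

        B, N, D = 2, 49, 512
        ctx = adaptive_context(torch.randn(B, N, D), torch.randn(B, N),
                               torch.randn(B, D), torch.randn(B, 1))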

    Updated: 2019-01-22
  • RefineNet: Multi-Path Refinement Networks for Dense Prediction
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-18
    Guosheng Lin; Fayao Liu; Anton Milan; Chunhua Shen; Ian Reid

    Recently, very deep convolutional neural networks (CNNs) have shown outstanding performance in object recognition and have also been the first choice for dense prediction problems such as semantic segmentation and depth estimation. However, repeated sub-sampling operations like pooling or convolution striding in deep CNNs lead to a significant decrease in the initial image resolution. Here, we present RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. In this way, the deeper layers that capture high-level semantic features can be directly refined using fine-grained features from earlier convolutions. The individual components of RefineNet employ residual connections following the identity-mapping principle, which allows for effective end-to-end training. Further, we introduce chained residual pooling, which captures rich background context in an efficient manner. We carry out comprehensive experiments on semantic segmentation, which is a dense classification problem, and set new state-of-the-art results on seven public datasets. We further apply our method to depth estimation and demonstrate its effectiveness on dense regression problems.
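    Chained residual pooling is concrete enough to sketch: a chain of stride-1 max-pool plus convolution blocks whose outputs are successively summed onto the input, so each stage reuses the context pooled by the previous one. A minimal PyTorch module under assumed kernel sizes (the paper's exact hyper-parameters may differ):

        import torch
        import torch.nn as nn

        class ChainedResidualPooling(nn.Module):
            """Sketch of RefineNet-style chained residual pooling: a chain of
            {stride-1 max-pool -> conv} blocks whose outputs are summed onto
            the input, each stage reusing the previous stage's pooled context."""
            def __init__(self, channels, n_stages=2):
                super().__init__()
                self.relu = nn.ReLU(inplace=True)
                self.stages = nn.ModuleList([
                    nn.Sequential(
                        nn.MaxPool2d(kernel_size=5, stride=1, padding=2),
                        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                    ) for _ in range(n_stages)
                ])

            def forward(self, x):
                x = self.relu(x)
                out, path = x, x
                for stage in self.stages:
                    path = stage(path)
                    out = out + path
                return out

        y = ChainedResidualPooling(64)(torch.randn(1, 64, 32, 32))  # shape preserved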

    Updated: 2019-01-22
  • Large-scale Urban Reconstruction with Tensor Clustering and Global Boundary Refinement
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-18
    Charalambos Poullis

    Accurate and efficient methods for large-scale urban reconstruction are of significant importance to the computer vision and computer graphics communities. Although rapid acquisition techniques such as airborne LiDAR have been around for many years, creating a useful and functional virtual environment from such data remains difficult and labor intensive, largely because present solutions require data-dependent, user-defined parameters. In this paper we present a new solution for automatically converting large LiDAR point clouds into simplified polygonal 3D models. The data is first divided into smaller components which are processed independently and concurrently to extract various metrics about the points. Next, the extracted information is converted into tensors. A robust agglomerative clustering algorithm is proposed to segment the tensors into clusters representing geospatial objects, e.g., roads, buildings, etc. Unlike previous methods, the proposed tensor clustering process has no data dependencies and does not require any user-defined parameters; the required parameters are instead computed adaptively by assuming a Weibull distribution for the similarity distances. Lastly, to extract boundaries from the clusters, a new multi-stage boundary refinement process is developed by reformulating the extraction as a global optimization problem. We have extensively tested our methods on several point cloud datasets of different resolutions which exhibit significant variability in geospatial characteristics, e.g., ground surface inclination, building density, etc., and the results are reported. The source code for both tensor clustering and global boundary refinement will be made publicly available with the publication.
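    The adaptive-parameter idea can be illustrated in a few lines: fit a Weibull distribution to the observed similarity distances and derive the merging threshold as a quantile of the fit, rather than asking the user for a constant. The quantile level 0.9 below is an assumption for the sketch, not the paper's value:

        from scipy.stats import weibull_min

        # Hypothetical similarity distances between neighbouring tensor clusters.
        distances = weibull_min.rvs(1.5, scale=2.0, size=5000, random_state=0)

        # Fit a Weibull distribution (location fixed at 0) and derive the merging
        # threshold adaptively as a quantile of the fitted model.
        shape, loc, scale = weibull_min.fit(distances, floc=0)
        threshold = weibull_min.ppf(0.9, shape, loc=loc, scale=scale)
        may_merge = distances < threshold   # pairs below the threshold may merge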

    Updated: 2019-01-22
  • Pixel Transposed Convolutional Networks
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-18
    Hongyang Gao; Hao Yuan; Zhengyang Wang; Shuiwang Ji

    Transposed convolutional layers have been widely used in a variety of deep models for up-sampling, including encoder-decoder networks for semantic segmentation and deep generative models for unsupervised learning. One of the key limitations of transposed convolutional operations is that they result in the so-called checkerboard problem. This is caused by the fact that no direct relationship exists among adjacent pixels on the output feature map. To address this problem, we propose the pixel transposed convolutional layer (PixelTCL) to establish direct relationships among adjacent pixels on the up-sampled feature map. Our method is based on a fresh interpretation of the regular transposed convolutional operation. The resulting PixelTCL can be used to replace any transposed convolutional layer in a plug-and-play manner without compromising the fully trainable capabilities of the original models. The proposed PixelTCL may result in a slight decrease in efficiency, but this can be overcome by an implementation trick. Experimental results on semantic segmentation demonstrate that PixelTCL takes spatial features such as edges and shapes into account and yields more accurate segmentation outputs than transposed convolutional layers. When used in image generation tasks, our PixelTCL can largely overcome the checkerboard problem suffered by regular transposed convolutional operations.
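    The core idea is sketchable: generate the four intermediate maps of a 2x up-sampling sequentially, each conditioned on the previous one, then interleave them spatially, whereas a regular transposed convolution generates the four maps independently (the source of the checkerboard artifact). A hedged PyTorch sketch, with layer choices that are ours:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class PixelTCLSketch(nn.Module):
            """Sketch of the pixel transposed convolution idea: the four
            intermediate maps of a 2x up-sampling are produced sequentially,
            each depending on the previous one, then interleaved spatially."""
            def __init__(self, in_ch, out_ch):
                super().__init__()
                self.first = nn.Conv2d(in_ch, out_ch, 3, padding=1)
                self.next = nn.ModuleList(
                    [nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in range(3)])

            def forward(self, x):
                maps = [self.first(x)]
                for conv in self.next:                 # each map depends on the last
                    maps.append(conv(maps[-1]))
                # Group channels so pixel_shuffle interleaves map1..map4 into
                # the four positions of every 2x2 output block.
                stacked = torch.stack(maps, dim=2).flatten(1, 2)
                return F.pixel_shuffle(stacked, 2)     # (B, out_ch, 2H, 2W)

        y = PixelTCLSketch(16, 8)(torch.randn(1, 16, 10, 10))  # -> (1, 8, 20, 20)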

    Updated: 2019-01-22
  • Shared Multi-view Data Representation for Multi-domain Event Detection
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-18
    Zhenguo Yang; Qing Li; Liu Wenyin; Jianming Lv

    Internet platforms provide new ways for people to share experiences, generating massive amounts of data related to various real-world concepts. In this paper, we present an event detection framework to discover real-world events from multiple data domains, including online news media and social media. As multi-domain data possess multiple heterogeneous data views, initial dictionaries consisting of labeled data samples are exploited to align the multi-view data. Furthermore, a shared multi-view data representation (SMDR) model is devised, which learns underlying and intrinsic structures shared among the data views by considering the structures underlying the data, data variations, and the informativeness of the dictionaries. SMDR incorporates various constraints in the objective function, including shared representation, low-rank, local invariance, reconstruction error, and dictionary independence constraints. Given the data representations achieved by SMDR, class-wise residual models are designed to discover the events underlying the data based on the reconstruction residuals. Extensive experiments conducted on two real-world event detection datasets, i.e., the Multi-domain and Multi-modality Event Detection dataset and the MediaEval Social Event Detection 2014 dataset, indicate the effectiveness of the proposed approaches.

    Updated: 2019-01-22
  • Structured Label Inference for Visual Understanding
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-16
    Nelson Isao Nauata Junior; Hexiang Hu; Guang-Tong Zhou; Zhiwei Deng; Zicheng Liao; Greg Mori

    Visual data such as images and videos contain a rich source of structured semantic labels as well as a wide range of interacting components. Visual content could be assigned with fine-grained labels describing major components, coarse-grained labels depicting high level abstractions, or a set of labels revealing attributes. Such categorization over different, interacting layers of labels evinces the potential for a graph-based encoding of label information. In this paper, we exploit this rich structure for performing graph-based inference in label space for a number of tasks: multi-label image and video classification and action detection in untrimmed videos. We consider the use of the Bidirectional Inference Neural Network (BINN) and Structured Inference Neural Network (SINN) for performing graph-based inference in label space and propose a Long Short-Term Memory (LSTM) based extension for exploiting activity progression on untrimmed videos. The methods were evaluated on (i) the Animals with Attributes (AwA), Scene Understanding (SUN) and NUS-WIDE datasets for multi-label image classification, (ii) the first two releases of the YouTube-8M large scale dataset for multi-label video classification, and (iii) the THUMOS'14 and MultiTHUMOS video datasets for action detection. Our results demonstrate the effectiveness of structured label inference in these challenging tasks, achieving significant improvements against baselines.

    Updated: 2019-01-17
  • Fast Cross-Validation for Kernel-based Algorithms
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Yong Liu; Shizhong Liao; Shali Jiang; Lizhong Ding; Hailun Lin; Weiping Wang

    Cross-validation (CV) is a widely adopted approach for selecting the optimal model. However, computing the empirical cross-validation error (CVE) has high complexity because the learner must be trained multiple times. In this paper, we develop a novel approximation theory of CVE and present an approximate approach to CV based on the Bouligand influence function (BIF) for kernel-based algorithms. We first represent the BIF and higher-order BIFs in Taylor expansions and approximate CV via these expansions. We then derive a tight upper bound on the discrepancy between the original and approximate CV. Furthermore, we provide a novel computing method to calculate the BIF for a general distribution, and evaluate the BIF criterion on the sample distribution to approximate CV. The proposed approximate CV requires training on the full data set only once and is suitable for a wide variety of kernel-based algorithms. Experimental results demonstrate that the proposed approximate CV is sound and effective.
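    A classic special case makes the "train once, get all folds" idea concrete: for kernel ridge regression, every leave-one-out residual follows exactly from a single fit via the hat matrix. The BIF machinery of the paper generalizes this kind of shortcut to other kernel methods; the sketch below shows only the classic case:

        import numpy as np

        # For kernel ridge regression, all n leave-one-out residuals follow
        # from one training run via the hat matrix H = K (K + lambda I)^{-1}:
        # LOO residual_i = (y_i - yhat_i) / (1 - H_ii).
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 3))
        y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

        gamma, lam = 0.5, 1e-2
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-gamma * sq)                        # RBF kernel matrix

        H = K @ np.linalg.inv(K + lam * np.eye(len(y)))
        yhat = H @ y
        loo_residuals = (y - yhat) / (1.0 - np.diag(H))
        cve = np.mean(loo_residuals ** 2)              # LOO CV error, no refitting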

    Updated: 2019-01-14
  • Online Meta Adaptation for Fast Video Object Segmentation
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Huaxin Xiao; Bingyi Kang; Yu Liu; Maojun Zhang; Jiashi Feng

    Conventional deep neural network based video object segmentation (VOS) methods rely on heavily fine-tuning a segmentation model on the first frame of a given video, which is time-consuming and inefficient. In this paper, we propose a novel method which rapidly adapts a base segmentation model to new video sequences with only a couple of model-update iterations, without sacrificing performance. This attractive efficiency comes from a meta-learning paradigm, which yields a meta-segmentation model, and a novel continuous learning approach, which enables online adaptation of the segmentation model. Concretely, we train a meta-learner on multiple VOS tasks such that the meta model captures their common knowledge and gains the ability to quickly adapt the segmentation model to new video sequences. Furthermore, to deal with the unique challenges that VOS tasks pose through temporal variations in the video, e.g., object motion and appearance changes, we propose a principled online adaptation approach that continuously adapts the segmentation model across video frames by exploiting temporal context effectively, providing robustness to such temporal variations. Integrating the meta-learner with the online adaptation approach, the proposed VOS model achieves competitive performance against the state of the art and, moreover, provides faster per-frame processing.
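    The fast-adaptation claim follows the usual meta-learning recipe: start from a meta-learned initialization and take only a couple of gradient steps on the first annotated frame. A generic first-order sketch (the toy model, loss and step counts are ours; the meta-training outer loop is omitted):

        import copy
        import torch
        import torch.nn as nn

        # Toy stand-in for a segmentation network with a meta-learned init.
        model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(8, 1, 3, padding=1))
        loss_fn = nn.BCEWithLogitsLoss()

        def adapt(meta_model, frame, mask, steps=2, lr=1e-2):
            """Rapid per-video adaptation: only a couple of update iterations,
            starting from the meta-learned initialization."""
            fast = copy.deepcopy(meta_model)
            opt = torch.optim.SGD(fast.parameters(), lr=lr)
            for _ in range(steps):
                opt.zero_grad()
                loss_fn(fast(frame), mask).backward()
                opt.step()
            return fast

        frame = torch.randn(1, 3, 64, 64)
        mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
        video_model = adapt(model, frame, mask)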

    Updated: 2019-01-14
  • LCR-Net++: Multi-person 2D and 3D Pose Detection in Natural Images
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Gregory Rogez; Philippe Weinzaepfel; Cordelia Schmid

    We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-Regression architecture, named LCR-Net, contains 3 main components: 1) the pose proposal generator that suggests candidate poses at different locations in the image; 2) a classifier that scores the different pose proposals; and 3) a regressor that refines pose proposals both in 2D and 3D. All three stages share the convolutional feature layers and are trained jointly. The final pose estimation is obtained by integrating over neighboring pose hypotheses, which is shown to improve over a standard non-maximum suppression algorithm. Our method recovers full-body 2D and 3D poses, hallucinating plausible body parts when the persons are partially occluded or truncated by the image boundary. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, a controlled environment. Moreover, it shows promising results on real images for both single and multi-person subsets of the MPII 2D pose benchmark.
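    The final "integration over neighboring pose hypotheses" can be pictured as a score-weighted average of proposals close to the best-scoring one, instead of discarding them as plain non-maximum suppression would. A hedged numpy sketch with our own distance rule:

        import numpy as np

        def integrate_proposals(poses, scores, anchor, radius=20.0):
            """Score-weighted average of pose proposals near an anchor proposal.
            `poses` are (N, J, 2) 2D joints, `scores` the classifier
            confidences; the distance rule is our simplification."""
            dists = np.linalg.norm(poses - poses[anchor], axis=-1).mean(axis=-1)
            nb = dists < radius
            w = scores[nb] / scores[nb].sum()
            return (w[:, None, None] * poses[nb]).sum(axis=0)

        rng = np.random.default_rng(0)
        poses = rng.normal(size=(50, 14, 2)) * 5 + 100   # 50 proposals, 14 joints
        scores = rng.random(50)
        refined = integrate_proposals(poses, scores, int(scores.argmax()))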

    Updated: 2019-01-14
  • Minimal Case Relative Pose Computation using Ray-Point-Ray Features
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Ji Zhao; Laurent Kneip; Yijia He; Jiayi Ma

    Corners are popular features for relative pose computation with 2D-2D point correspondences. Stable corners may be formed by two 3D rays sharing a common starting point. We call such elements ray-point-ray (RPR) structures. Besides a local invariant keypoint given by the lines' intersection, their reprojection also defines a corner orientation and an inscribed angle in the image plane. The present paper investigates such RPR features, and aims at answering the fundamental question of what additional constraints can be formed from correspondences between RPR features in two views. In particular, we show that knowing the value of the inscribed angle between the two 3D rays poses additional constraints on the relative orientation. Using the latter enables the solution of the relative pose with only 3 correspondences across two images. We provide a detailed analysis of all minimal cases distinguishing between 90-degree RPR-structures and structures with an arbitrary, known inscribed angle. We furthermore investigate the special cases of a known directional correspondence and planar motion, the latter being solvable with only a single RPR correspondence. We complete the exposition by outlining an image processing technique for robust RPR-feature extraction. Our results suggest high practicality in man-made environments, where 90-degree RPR-structures naturally occur.

    Updated: 2019-01-14
  • 3D Human Pose Machines with Self-supervised Learning
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Keze Wang; Liang Lin; Chenhan Jiang; Chen Qian; Pengxu Wei

    Driven by recent computer vision and robotic applications, recovering 3D human poses has become increasingly important and has attracted growing interest. The task is quite challenging due to the diverse appearances, viewpoints, occlusions and inherent geometric ambiguities in monocular images. Most existing methods focus on designing elaborate priors/constraints to directly regress 3D human poses from the corresponding 2D human pose-aware features or 2D pose predictions. However, due to insufficient 3D pose data for training and the domain gap between 2D and 3D space, these methods have limited scalability for practical scenarios (e.g., outdoor scenes). In an attempt to address this issue, this paper proposes a simple yet effective self-supervised correction mechanism to learn the intrinsic structures of human poses from abundant images without 3D pose annotations. We further apply our self-supervised correction mechanism to develop a recurrent 3D pose machine, which jointly integrates 2D spatial relationships, temporal smoothness of predictions and 3D geometric knowledge. Extensive evaluations on the Human3.6M and HumanEva-I benchmarks demonstrate the superior performance and efficiency of our framework over the compared methods.

    Updated: 2019-01-14
  • Multiple Kernel k-means with Incomplete Kernels
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Xinwang Liu; Xinzhong Zhu; Miaomiao Li; Lei Wang; En Zhu; Tongliang Liu; Marius Kloft; Dinggang Shen; Jianping Yin; Wen Gao

    Existing multiple kernel clustering (MKC) algorithms cannot efficiently address the situation where some rows and columns of the base kernel matrices are absent. This paper proposes two simple yet effective algorithms to address this issue. Different from existing approaches, where incomplete kernel matrices are first imputed and a standard MKC algorithm is then applied to the imputed kernel matrices, our first algorithm integrates imputation and clustering into a unified learning procedure. Specifically, we perform multiple kernel clustering directly in the presence of incomplete kernel matrices, which are treated as auxiliary variables to be jointly optimized. The algorithm adaptively imputes incomplete kernel matrices and combines them to best serve clustering. Moreover, we further improve this algorithm by encouraging the incomplete kernel matrices to mutually complete each other. A three-step iterative algorithm is designed to solve the resultant optimization problems. We then theoretically study the generalization bound of the proposed algorithms. Extensive experiments are conducted on 13 benchmark data sets to compare the proposed algorithms with existing imputation-based methods. Our algorithms consistently achieve superior performance, and the improvement becomes more significant with an increasing missing ratio, verifying the effectiveness and advantages of joint imputation and clustering.
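    A skeleton of the joint scheme, as we read it from the abstract: alternate between a clustering step on the combined kernel and an imputation step that fills the unobserved entries of each base kernel from the current consensus. Everything below (the spectral relaxation, the fill rule) is our hedged simplification, not the paper's exact updates:

        import numpy as np

        def mkc_incomplete(kernels, masks, k, n_iters=10):
            """Joint imputation and clustering skeleton.
            kernels: list of (n, n) base kernels, missing entries set to 0
            masks:   list of boolean (n, n) arrays, True where observed"""
            kernels = [K.copy() for K in kernels]
            for _ in range(n_iters):
                combined = sum(kernels) / len(kernels)
                # Clustering step: spectral relaxation of kernel k-means.
                vals, vecs = np.linalg.eigh(combined)
                H = vecs[:, -k:]                    # top-k eigenvectors
                target = H @ H.T
                # Imputation step: fill unobserved entries from the consensus.
                for K, M in zip(kernels, masks):
                    K[~M] = target[~M]
            return H, kernels

        rng = np.random.default_rng(0)
        X = rng.normal(size=(60, 5))
        K_full = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))
        mask = np.ones((60, 60), bool)
        mask[:10] = False                           # absent rows...
        mask[:, :10] = False                        # ...and columns
        H, imputed = mkc_incomplete([np.where(mask, K_full, 0.0)], [mask], k=3)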

    Updated: 2019-01-14
  • On Detection of Faint Edges in Noisy Images
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-10
    Nati Ofir; Meirav Galun; Sharon Alpert; Achi Brandt; Boaz Nadler; Ronen Basri

    A fundamental question for edge detection in noisy images is how faint can an edge be and still be detected. In this paper we offer a formalism to study this question and subsequently introduce computationally efficient multiscale edge detection algorithms designed to detect faint edges in noisy images. In our formalism we view edge detection as a search in a discrete, though potentially large, set of feasible curves. First, we derive approximate expressions for the detection threshold as a function of curve length and the complexity of the search space. We then present two edge detection algorithms, one for straight edges, and the second for curved ones. Both algorithms efficiently search for edges in a large set of candidates by hierarchically constructing difference filters that match the curves traced by the sought edges. We demonstrate the utility of our algorithms in both simulations and applications involving challenging real images. Finally, based on these principles, we develop an algorithm for fiber detection and enhancement. We exemplify its utility to reveal and enhance nerve axons in light microscopy images.
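    The dependence of the threshold on curve length and search-space size follows standard matched-filter reasoning: averaging along an edge of L pixels shrinks the noise to sigma/sqrt(L), while testing N candidate curves costs roughly a sqrt(2 ln N) multiple-comparison factor. The function below encodes only this scaling, not the paper's exact constants:

        import numpy as np

        def faint_edge_threshold(length, n_candidates, sigma):
            """Back-of-the-envelope detection threshold (our hedged reading of
            the scaling): minimal edge contrast detectable among
            `n_candidates` curves of `length` pixels in noise level sigma."""
            return sigma * np.sqrt(2.0 * np.log(n_candidates)) / np.sqrt(length)

        # Longer edges can be fainter; richer (curved) search spaces raise the bar.
        print(faint_edge_threshold(length=16, n_candidates=1e4, sigma=0.1))
        print(faint_edge_threshold(length=256, n_candidates=1e7, sigma=0.1))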

    Updated: 2019-01-11
  • Learning Two-Branch Neural Networks for Image-Text Matching Tasks
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-01-24
    Liwei Wang; Yin Li; Jing Huang; Svetlana Lazebnik

    Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for learning the similarity between these two data modalities. We propose two network structures that produce different output representations. The first one, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. Compared to standard triplet sampling, we perform improved neighborhood sampling that takes neighborhood information into consideration while constructing mini-batches. The second network structure, referred to as a similarity network, fuses the two branches via element-wise product and is trained with regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracies for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on Flickr30K and MSCOCO datasets.
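    The embedding network's objective can be sketched as a bidirectional max-margin loss over a mini-batch, where matched image-sentence pairs must beat mismatched ones in both retrieval directions. This omits the paper's neighborhood constraints and sampling strategy:

        import torch

        def bidirectional_margin_loss(im, sent, margin=0.2):
            """Max-margin ranking loss over a mini-batch: matched pairs sit on
            the diagonal of the similarity matrix and must exceed mismatched
            pairs by `margin` in both directions. A minimal sketch."""
            im = torch.nn.functional.normalize(im, dim=1)
            sent = torch.nn.functional.normalize(sent, dim=1)
            sim = im @ sent.t()                          # (B, B) cosine similarities
            pos = sim.diag().unsqueeze(1)                # matched-pair scores
            cost_s = (margin + sim - pos).clamp(min=0)   # image -> wrong sentence
            cost_im = (margin + sim - pos.t()).clamp(min=0)  # sentence -> wrong image
            off_diag = 1 - torch.eye(sim.size(0))
            return ((cost_s + cost_im) * off_diag).sum()

        loss = bidirectional_margin_loss(torch.randn(8, 128), torch.randn(8, 128))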

    Updated: 2019-01-10
  • Hierarchical Bayesian Inverse Lighting of Portraits with a Virtual Light Stage
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-09
    Davoud Shahlaei; Volker Blanz

    From a single RGB image of an unknown face, taken under unknown conditions, we estimate a physically plausible lighting model. First, the 3D geometry and texture of the face are estimated by fitting a 3D Morphable Model to the 2D input. With this estimated 3D model and a Virtual Light Stage (VLS), we generate a gallery of images of the face with all the same conditions, but different lighting. We consider non-Lambertian reflectance and non-convex geometry to handle more realistic illumination effects in complex lighting conditions. Our hierarchical Bayesian approach automatically suppresses inconsistencies between the model and the input. It estimates the RGB values for the light sources of a VLS to reconstruct the input face with the estimated 3D face model. We discuss the relevance of the hierarchical approach to this minimally constrained inverse rendering problem and show how the hyperparameters can be controlled to improve the results of the algorithm for complex effects, such as cast shadows. Our algorithm is a contribution to single image face modeling and analysis, provides information about the imaging condition and facilitates realistic reconstruction of the input image, relighting, lighting transfer and lighting design.
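    Stripped of the hierarchical Bayesian machinery, the core inverse problem is a non-negative mixing one: find per-source intensities so that the Virtual Light Stage renderings combine into the input photo. A hedged per-channel sketch with synthetic data:

        import numpy as np
        from scipy.optimize import nnls

        # Each VLS source i yields a rendering g_i of the estimated face; find
        # non-negative intensities w with sum_i w_i g_i ~ input photo, solved
        # independently per color channel. All arrays here are synthetic.
        n_pixels, n_lights = 5000, 40
        rng = np.random.default_rng(0)
        G = rng.random((n_pixels, n_lights))       # column i: rendering under light i
        w_true = np.maximum(rng.normal(0.2, 0.3, n_lights), 0.0)
        photo = G @ w_true + 0.01 * rng.normal(size=n_pixels)

        w_est, residual = nnls(G, photo)           # physically plausible: w >= 0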

    Updated: 2019-01-10
  • Tattoo Image Search at Scale: Joint Detection and Compact Representation Learning
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-09
    Hu Han; Jie Li; Anil K. Jain; Shiguang Shan; Xilin Chen

    The explosive growth of digital images in video surveillance and social media has led to the significant need for efficient search of persons of interest in law enforcement and forensic applications. Despite tremendous progress in primary biometric traits (e.g., face and fingerprint) based person identification, a single biometric trait alone cannot meet the desired recognition accuracy in forensic scenarios. Tattoos, as one of the important soft biometric traits, have been found to be valuable for assisting in person identification. However, tattoo search in a large collection of unconstrained images remains a difficult problem, and existing tattoo search methods mainly focus on matching cropped tattoos, which is different from real application scenarios. To close the gap, we propose an efficient tattoo search approach that is able to learn tattoo detection and compact representation jointly in a single convolutional neural network (CNN) via multi-task learning. While the features in the backbone network are shared by both tattoo detection and compact representation learning, individual latent layers of each sub-network optimize the shared features toward the detection and feature learning tasks, respectively. We resolve the small batch size issue inside the joint tattoo detection and compact representation learning network via random image stitch and preceding feature buffering. We evaluate the proposed tattoo search system using multiple public-domain tattoo benchmarks, and a gallery set with about 300K distracter tattoo images compiled from these datasets and images from the Internet. In addition, we also introduce a tattoo sketch dataset containing 300 tattoos for sketch-based tattoo search. Experimental results show that the proposed approach has superior performance in tattoo detection and tattoo search at scale compared to several state-of-the-art tattoo retrieval algorithms.

    Updated: 2019-01-10
  • Weighted Manifold Alignment using Wave Kernel Signatures for Aligning Medical image Datasets
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-09
    James Clough; Daniel Robert Balfour; Gastao Lima Da Cruz; Paul Marsden; Claudia Prieto; Andrew Reader; Andrew King

    Manifold alignment (MA) is a technique to map many high-dimensional datasets to one shared low-dimensional space. Here we develop a pipeline for using MA to reconstruct high-resolution medical images. We present two key contributions. First, we develop a novel MA scheme in which each high-dimensional dataset can be weighted differently, preventing noisier or less informative data from corrupting the aligned embedding. We find that this generalisation improves performance in our experiments on both supervised and unsupervised MA problems. Second, we use the wave kernel signature as a graph descriptor for the unsupervised MA case, finding that it significantly outperforms the current state-of-the-art methods and provides higher-quality reconstructed magnetic resonance volumes than existing methods.
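    The wave kernel signature itself has a compact closed form over the Laplacian spectrum, which we can sketch for a small graph (parameter choices below are ours):

        import numpy as np

        def wave_kernel_signature(L, n_energies=32, n_eigs=64, sigma=1.0):
            """Wave kernel signature of each node of a graph with Laplacian L:
            WKS(x, e) ~ sum_k phi_k(x)^2 * exp(-(e - log lambda_k)^2 / (2 sigma^2)),
            normalized over k. A dense-numpy sketch for small graphs."""
            vals, vecs = np.linalg.eigh(L)
            keep = vals > 1e-8                     # drop the constant eigenvector(s)
            vals, vecs = vals[keep][:n_eigs], vecs[:, keep][:, :n_eigs]
            log_l = np.log(vals)
            energies = np.linspace(log_l.min(), log_l.max(), n_energies)
            weights = np.exp(-(energies[:, None] - log_l[None, :]) ** 2
                             / (2 * sigma ** 2))
            wks = (vecs ** 2) @ weights.T          # (n_nodes, n_energies)
            return wks / weights.sum(axis=1)       # per-energy normalization

        # Tiny example: path graph on 6 nodes.
        A = np.diag(np.ones(5), 1); A = A + A.T
        L = np.diag(A.sum(1)) - A
        descriptor = wave_kernel_signature(L, n_energies=8, n_eigs=5)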

    Updated: 2019-01-10
  • Tensor Robust Principal Component Analysis with A New Tensor Nuclear Norm
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-09
    Canyi Lu; Jiashi Feng; Yudong Chen; Wei Liu; Zhouchen Lin; Shuicheng Yan

    In this paper, we consider the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to exactly recover the low-rank and sparse components from their sum. Our model is based on the recently proposed tensor-tensor product (or t-product) [15]. Induced by the t-product, we first rigorously deduce the tensor spectral norm, tensor nuclear norm, and tensor average rank, and show that the tensor nuclear norm is the convex envelope of the tensor average rank within the unit ball of the tensor spectral norm. These definitions, their relationships and properties are consistent with matrix cases. Equipped with the new tensor nuclear norm, we then solve the TRPCA problem by solving a convex program and provide the theoretical guarantee for the exact recovery. Our TRPCA model and recovery guarantee include matrix RPCA as a special case. Numerical experiments verify our results, and the applications to image recovery and background modeling problems demonstrate the effectiveness of our method.
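    The norm itself is easy to state computationally, following the definition in the abstract: transform to the Fourier domain along the third mode, sum the singular values of every frontal slice, and divide by n3. A small numpy sketch:

        import numpy as np

        def tensor_nuclear_norm(A):
            """Tensor nuclear norm induced by the t-product: FFT along the
            third mode, sum the singular values of every frontal slice in the
            Fourier domain, and divide by n3."""
            n3 = A.shape[2]
            Af = np.fft.fft(A, axis=2)
            total = 0.0
            for i in range(n3):
                total += np.linalg.svd(Af[:, :, i], compute_uv=False).sum()
            return total / n3

        rng = np.random.default_rng(0)
        print(tensor_nuclear_norm(rng.normal(size=(20, 20, 5))))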

    Updated: 2019-01-10
  • Visibility graphs for image processing
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2019-01-09
    Jacopo Iacovacci; Lucas Lacasa

    The family of image visibility graphs (IVG/IHVG) has recently been introduced as a set of simple algorithms by which scalar fields can be mapped into graphs. Here we explore the usefulness of such operators in the scenario of image processing and image classification. We demonstrate that the link architecture of image visibility graphs encapsulates relevant information on the structure of the images, and we explore their potential as image filters. We introduce several graph features, including the novel concept of Visibility Patches, and show through several examples that these features are highly informative, computationally efficient and universally applicable for general pattern recognition and image classification tasks.
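    The underlying visibility criterion is simple to state. For the horizontal variant on a scalar series, two samples are linked when every value between them is lower than both; the image versions apply the same test along rows, columns and diagonals of the pixel lattice. A pure-Python sketch for the 1D case:

        import numpy as np

        def horizontal_visibility_edges(x):
            """Horizontal visibility graph of a scalar series: i and j (i < j)
            are linked iff every intermediate value lies strictly below
            min(x_i, x_j)."""
            n = len(x)
            edges = []
            for i in range(n - 1):
                for j in range(i + 1, n):
                    if j == i + 1 or x[i + 1:j].max() < min(x[i], x[j]):
                        edges.append((i, j))
            return edges

        print(horizontal_visibility_edges(np.array([3.0, 1.0, 2.0, 5.0, 0.5, 4.0])))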

    Updated: 2019-01-10
  • A Benchmark Dataset and Evaluation for Non-Lambertian and Uncalibrated Photometric Stereo
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-02-05
    Boxin Shi; Zhipeng Mo; Zhe Wu; Dinglong Duan; Sai-Kit Yeung; Ping Tan

    Classic photometric stereo is often extended to deal with real-world materials and work with unknown lighting conditions for practicability. To quantitatively evaluate non-Lambertian and uncalibrated photometric stereo, a photometric stereo image dataset containing objects of various shapes with complex reflectance properties and high-quality ground truth normals is still missing. In this paper, we introduce the 'DiLiGenT' dataset, with calibrated Directional Lightings, objects of General reflectance with different shininess, and 'ground Truth' normals from high-precision laser scanning. We use our dataset to quantitatively evaluate state-of-the-art photometric stereo methods for general materials and unknown lighting conditions, selected from a newly proposed photometric stereo taxonomy emphasizing non-Lambertian and uncalibrated methods. The dataset and evaluation results are made publicly available, and we hope it can serve as a benchmark platform that inspires future research.

    Updated: 2019-01-09
  • Analysis of Spatio-Temporal Representations for Robust Footstep Recognition with Deep Residual Neural Networks
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-01-30
    Omar Costilla-Reyes; Ruben Vera-Rodriguez; Patricia Scully; Krikor B. Ozanyan

    Human footsteps can provide a unique behavioural pattern for robust biometric systems. We propose spatio-temporal footstep representations from floor-only sensor data for use in advanced computational models for automatic biometric verification. Our models deliver an artificial intelligence capable of effectively differentiating the fine-grained variability of footsteps between legitimate users (clients) and impostor users of the biometric system. The methodology is validated on the largest footstep database to date, containing nearly 20,000 footstep signals from more than 120 users. The database is organized with a large cohort of impostors and a small set of clients to verify the reliability of biometric systems. We provide experimental results for 3 critical data-driven security scenarios, according to the amount of footstep data made available for model training: airport security checkpoints (smallest training set), workspace environments (medium training set) and home environments (largest training set). We report state-of-the-art footstep recognition rates with an optimal equal error rate (equal false acceptance and false rejection rate) of 0.7 percent, an improvement ratio of 371 percent over the previous state of the art. We also perform a feature analysis of the deep residual neural networks, showing effective clustering of clients' footstep data and providing insights into the feature learning process.

    Updated: 2019-01-09
  • Depth from a Light Field Image with Learning-Based Matching Costs
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-01-18
    Hae-Gon Jeon; Jaesik Park; Gyeongmin Choe; Jinsun Park; Yunsu Bok; Yu-Wing Tai; In So Kweon

    One of the core applications of light field imaging is depth estimation. To acquire a depth map, existing approaches apply a single photo-consistency measure to an entire light field. However, this is not an optimal choice because of the non-uniform light field degradations produced by limitations in the hardware design. In this paper, we introduce a pipeline that automatically determines the best configuration for photo-consistency measure, which leads to the most reliable depth label from the light field. We analyzed the practical factors affecting degradation in lenslet light field cameras, and designed a learning based framework that can retrieve the best cost measure and optimal depth label. To enhance the reliability of our method, we augmented an existing light field benchmark to simulate realistic source dependent noise, aberrations, and vignetting artifacts. The augmented dataset was used for the training and validation of the proposed approach. Our method was competitive with several state-of-the-art methods for the benchmark and real-world light field datasets.

    Updated: 2019-01-09
  • Disambiguating Visual Verbs
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-12-27
    Spandana Gella; Frank Keller; Mirella Lapata

    In this article, we introduce a new task, visual sense disambiguation for verbs: given an image and a verb, assign the correct sense of the verb, i.e., the one that describes the action depicted in the image. Just as textual word sense disambiguation is useful for a wide range of NLP tasks, visual sense disambiguation can be useful for multimodal tasks such as image retrieval, image description, and text illustration. We introduce a new dataset, which we call VerSe (short for Verb Sense), that augments existing multimodal datasets (COCO and TUHOI) with verb and sense labels. We explore supervised and unsupervised models for the sense disambiguation task using textual, visual, and multimodal embeddings. We also consider a scenario in which we must detect the verb depicted in an image prior to predicting its sense (i.e., there is no verbal information associated with the image). We find that textual embeddings perform well when gold-standard annotations (object labels and image descriptions) are available, while multimodal embeddings perform well on unannotated images. VerSe is publicly available at https://github.com/spandanagella/verse.

    Updated: 2019-01-09