• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-02-12
Enver Sangineto; Moin Nabi; Dubravko Culibrk; Niculae Sebe

In a weakly-supervised scenario, object detectors need to be trained using image-level annotation alone. Since bounding-box-level ground truth is not available, most of the solutions proposed so far are based on an iterative, Multiple Instance Learning framework in which the current classifier is used to select the highest-confidence boxes in each image, which are treated as pseudo-ground truth in the next training iteration. However, the errors of an immature classifier can make the process drift, usually introducing many false positives in the training dataset. To alleviate this problem, we propose in this paper a training protocol based on the self-paced learning paradigm. The main idea is to iteratively select a subset of images and boxes that are the most reliable, and use them for training. While in the past few years similar strategies have been adopted for SVMs and other classifiers, we are the first to show that a self-paced approach can be used with deep-network-based classifiers in an end-to-end training pipeline. Our method is built on the fully-supervised Fast-RCNN architecture and can be applied to similar architectures which represent the input image as a bag of boxes. We show state-of-the-art results on Pascal VOC 2007-2010 and ILSVRC 2013. On ILSVRC 2013 our results based on a low-capacity AlexNet network outperform even those weakly-supervised approaches which are based on much higher-capacity networks.
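The selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' training protocol: the function name `select_reliable`, the `keep_fraction` schedule, and the toy scores are assumptions made for illustration, and the detector itself is left out.

```python
def select_reliable(image_scores, keep_fraction):
    """Return indices of the images whose best box score is highest.

    image_scores: one list of box confidences per image (hypothetical input).
    keep_fraction: share of images to keep in this round, 0 < f <= 1.
    """
    # Rank images by the confidence of their single best candidate box.
    best = [(max(boxes), idx) for idx, boxes in enumerate(image_scores) if boxes]
    best.sort(reverse=True)
    n_keep = max(1, round(keep_fraction * len(best)))
    return sorted(idx for _, idx in best[:n_keep])


# Hypothetical confidences for four images with two candidate boxes each.
scores = [[0.2, 0.9], [0.4, 0.3], [0.8, 0.1], [0.05, 0.1]]
easy_round = select_reliable(scores, 0.5)   # early round: most reliable images only
later_round = select_reliable(scores, 1.0)  # later round: every image contributes
```

Growing `keep_fraction` over iterations is one simple way to admit harder images only once the classifier has matured, which is the intuition behind self-paced learning.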

Updated: 2018-02-13
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-02-07
Wuming Zhang; Xi Zhao; Jean-Marie Morvan; Liming Chen

2D face analysis techniques, such as face landmarking, face recognition and face verification, are reasonably dependent on illumination conditions which are usually uncontrolled and unpredictable in the real world. An illumination robust preprocessing method thus remains a significant challenge in reliable face analysis. In this paper we propose a novel approach for improving lighting normalization through building the underlying reflectance model which characterizes interactions between skin surface, lighting source and camera sensor, and elaborates the formation of face color appearance. Specifically, the proposed illumination processing pipeline enables the generation of Chromaticity Intrinsic Image (CII) in a log chromaticity space which is robust to illumination variations. Moreover, as an advantage over most prevailing methods, a photo-realistic color face image is subsequently reconstructed which eliminates a wide variety of shadows whilst retaining the color information and identity details. Experimental results under different scenarios and using various face databases show the effectiveness of the proposed approach to deal with lighting variations, including both soft and hard shadows, in face recognition.
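As a rough illustration of why a log chromaticity space helps, the sketch below maps an RGB pixel to two log-ratio coordinates. This is a generic log-chromaticity transform, not the paper's Chromaticity Intrinsic Image; dividing by the green channel and the sample pixel values are assumptions. It only demonstrates invariance to a uniform change in light intensity.

```python
import math


def log_chromaticity(r, g, b):
    """Map a positive RGB triple to 2D log-chromaticity coordinates."""
    # Channel ratios cancel any common intensity factor before the log.
    return (math.log(r / g), math.log(b / g))


pixel = (120.0, 80.0, 60.0)
dimmed = tuple(0.5 * c for c in pixel)  # same surface under half the light intensity

coords_bright = log_chromaticity(*pixel)
coords_dimmed = log_chromaticity(*dimmed)
```

Both calls return the same coordinates because the intensity factor cancels in each ratio.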

Updated: 2018-02-08
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-02-07
Seungryong Kim; Dongbo Min; Bumsub Ham; Stephen Lin; Kwanghoon Sohn

We present a descriptor, called fully convolutional self-similarity (FCSS), for dense semantic correspondence. Unlike traditional dense correspondence for estimating depth or optical flow, semantic correspondence estimation poses additional challenges due to intra-class appearance and shape variations among different instances within the same object or scene category. To robustly match points across semantically similar images, we formulate FCSS using local self-similarity (LSS), which is inherently insensitive to intra-class appearance variations. LSS is incorporated through a proposed convolutional self-similarity (CSS) layer, where the sampling patterns and the self-similarity measure are jointly learned in an end-to-end and multi-scale manner. Furthermore, to address shape variations among object instances, we propose a convolutional affine transformer (CAT) layer that estimates explicit affine transformation fields at each pixel to transform the sampling patterns and corresponding receptive fields. As training data for semantic correspondence is rather limited, we propose to leverage object candidate priors provided in most existing datasets and also correspondence consistency between object pairs to enable weakly-supervised learning. Experiments demonstrate that FCSS significantly outperforms conventional handcrafted descriptors and CNN-based descriptors on various benchmarks.

Updated: 2018-02-08
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-02-07
Aneta Bera; Przemyslaw Klesk; Dariusz Sychel

We construct a set of special complex-valued integral images and an algorithm that allows Zernike moments to be calculated fast, namely in constant time. The technique is suitable for dense detection procedures, where the image is scanned by a sliding window at multiple scales, and where rotational invariance is required at the level of each window. We assume no preliminary image segmentation. Owing to the proposed integral images and binomial expansions, the extraction of each feature does not depend on the number of pixels in the window and is thereby an $O(1)$ calculation. We analyze algorithmic properties of the proposition, such as the number of integral images needed, the complex-conjugacy of integral images, the number of operations involved in feature extraction, and speed-up possibilities based on lookup tables. We also point out connections between Zernike and orthogonal Fourier--Mellin moments in the context of computations backed with integral images. Finally, we demonstrate three examples of detection tasks of varying difficulty. Detectors are trained on the proposed features by the RealBoost algorithm. When learning, the classifiers get acquainted only with examples of target objects in their upright position or rotated within a limited range. At the testing stage, generalization onto the full 360° angle takes place automatically.
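The constant-time property rests on the classical integral image, sketched below in its plain real-valued form. The complex-valued integral images and binomial expansions used for Zernike moments are not reproduced here; the function names and the toy image are illustrative assumptions.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table; ii[y][x] sums all pixels above and left of (y, x)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def window_sum(ii, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 with four lookups."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
lower_right = window_sum(ii, 1, 1, 3, 3)  # pixels 5, 6, 8, 9
```

The table is built once in time linear in the image size; afterwards every window sum costs four lookups regardless of the window's area.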

Updated: 2018-02-08
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-02-07
Hanzi Wang; Guobao Xiao; Yan Yan; David Suter

In this paper, we propose a simple and effective geometric model fitting method to fit and segment multi-structure data even in the presence of severe outliers. We cast the task of geometric model fitting as a representative mode-seeking problem on hypergraphs. Specifically, a hypergraph is firstly constructed, where the vertices represent model hypotheses and the hyperedges denote data points. The hypergraph involves higher-order similarities (instead of pairwise similarities used on a simple graph), and it can characterize complex relationships between model hypotheses and data points. In addition, we develop a hypergraph reduction technique to remove “insignificant” vertices while retaining as many “significant” vertices as possible in the hypergraph. Based on the simplified hypergraph, we then propose a novel mode-seeking algorithm to search for representative modes within reasonable time. Finally, the proposed mode-seeking algorithm detects modes according to two key elements, i.e., the weighting scores of vertices and the similarity analysis between vertices. Overall, the proposed fitting method is able to efficiently and effectively estimate the number and the parameters of model instances in the data simultaneously. Experimental results demonstrate that the proposed method achieves significant superiority over several state-of-the-art model fitting methods on both synthetic data and real images.

Updated: 2018-02-08
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-02-07
Wei Wang; Yan Yan; Zhen Cui; Jiashi Feng; Shuicheng Yan; Niculae Sebe

Modeling the aging process of human faces is important for cross-age face verification and recognition. In this paper, we propose a Recurrent Face Aging (RFA) framework which takes as input a single image and automatically outputs a series of aged faces. The hidden units in the RFA are connected autoregressively allowing the framework to age the person by referring to the previous aged faces. Due to the lack of labeled face data of the same person captured in a long range of ages, traditional face aging models split the ages into discrete groups and learn a one-step face transformation for each pair of adjacent age groups. Since human face aging is a smooth progression, it is more appropriate to age the face by going through smooth transitional states. In this way, the intermediate aged faces between the age groups can be generated. Towards this target, we employ a recurrent neural network whose recurrent module is a hierarchical triple-layer gated recurrent unit which functions as an autoencoder. The bottom layer of the module encodes the input to a latent representation, and the top layer decodes the representation to a corresponding aged face. The experimental results demonstrate the effectiveness of our framework.

Updated: 2018-02-08
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-30
Ugo Moschini; Arnold Meijster; Michael H. F. Wilkinson

Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about $20\times$ better than the fastest sequential algorithm and speed-up goes up to $30-40$ on 64 threads.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-29
Canyi Lu; Jiashi Feng; Shuicheng Yan; Zhouchen Lin

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-04-12
Shu Tian; Xu-Cheng Yin; Ya Su; Hong-Wei Hao

Video text extraction plays an important role in multimedia understanding and retrieval. Most previous research efforts are conducted within individual frames. A few recent methods pay attention to text tracking using multiple frames, but they do not effectively mine the relations among text detection, tracking and recognition. In this paper, we propose a generic Bayesian-based framework of Tracking based Text Detection And Recognition (T$^2$DAR) from web videos for embedded captions, which is composed of three major components, i.e., text tracking, tracking based text detection, and tracking based text recognition. In this unified framework, text tracking is first conducted by tracking-by-detection. Tracking trajectories are then revised and refined with detection or recognition results. Text detection or recognition is finally improved with multi-frame integration. Moreover, a challenging video text (embedded caption text) database (USTB-VidTEXT) is constructed and made publicly available. A variety of experiments on this dataset verify that our proposed approach largely improves the performance of text detection and recognition from web videos.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-20
Vassileios Balntas; Lilian Tang; Krystian Mikolajczyk

We propose a novel approach to generate a binary descriptor optimized for each image patch independently. The approach is inspired by the linear discriminant embedding that simultaneously increases inter-class and decreases intra-class distances. A set of discriminative and uncorrelated binary tests is established from all possible tests in an offline training process. The patch-adapted descriptors are then efficiently built online from a subset of features which lead to lower intra-class distances and thus to a more robust descriptor. We perform experiments on three widely used benchmarks and demonstrate improvements in matching performance, and illustrate that per-patch optimization outperforms global optimization.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-24
Afshin Dehghan; Mubarak Shah

Multi-object tracking has been studied for decades. However, when it comes to tracking pedestrians in extremely crowded scenes, we are limited to only a few works. This is an important problem which gives rise to several challenges. Pre-trained object detectors fail to localize targets in crowded sequences. This consequently limits the use of data-association based multi-target tracking methods which rely on the outcome of an object detector. Additionally, the small apparent target size makes it challenging to extract features to discriminate targets from their surroundings. Finally, the large number of targets greatly increases computational complexity, which in turn makes it hard to extend existing multi-target tracking approaches to high-density crowd scenarios. In this paper, we propose a tracker that addresses the aforementioned problems and is capable of tracking hundreds of people efficiently. We formulate online crowd tracking as Binary Quadratic Programming. Our formulation employs each target's individual information in the form of appearance and motion as well as contextual cues in the form of neighborhood motion, spatial proximity and grouping, and solves detection and data association simultaneously. In order to solve the proposed quadratic optimization efficiently, where state-of-the-art commercial quadratic programming solvers fail to find the solution in a reasonable amount of time, we propose to use the most recent version of the Modified Frank-Wolfe algorithm, which takes advantage of SWAP-steps to speed up the optimization. We show that the proposed formulation can track hundreds of targets efficiently and improves on state-of-the-art results by significant margins on eleven challenging high-density crowd sequences.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-15
Gang Hua; Chengjiang Long; Ming Yang; Yan Gao

Active learning is an effective way of engaging users to interactively train models for visual recognition more efficiently. The vast majority of previous works focused on active learning with a single human oracle. The problem of active learning with multiple oracles in a collaborative setting has not been well explored. We present a collaborative computational model for active learning with multiple human oracles, whose input may carry different levels of noise. It leads not only to an ensemble kernel machine that is robust to label noise, but also to a principled label quality measure for detecting irresponsible labelers online. Instead of running independent active learning processes for each individual human oracle, our model captures the inherent correlations among the labelers through shared data among them. Our experiments with both simulated and real crowd-sourced noisy labels demonstrate the efficacy of our model.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-04-06
Seung-Hwan Bae; Kuk-Jin Yoon

Online multi-object tracking aims at estimating the tracks of multiple objects instantly with each incoming frame and the information provided up to the moment. It still remains a difficult problem in complex scenes, because of the large ambiguity in associating multiple objects in consecutive frames and the low discriminability between object appearances. In this paper, we propose a robust online multi-object tracking method that can handle these difficulties effectively. We first define the tracklet confidence using the detectability and continuity of a tracklet, and decompose a multi-object tracking problem into small subproblems based on the tracklet confidence. We then solve the online multi-object tracking problem by associating tracklets and detections in different ways according to their confidence values. Based on this strategy, tracklets sequentially grow with online-provided detections, and fragmented tracklets are linked up with others without any iterative and expensive association steps. For more reliable association between tracklets and detections, we also propose a deep appearance learning method to learn a discriminative appearance model from large training datasets, since the conventional appearance learning methods do not provide a rich representation that can distinguish multiple objects with large appearance variations. In addition, we incorporate online transfer learning to improve appearance discriminability by adapting the pre-trained deep model during online tracking. Experiments with challenging public datasets show distinct performance improvement over other state-of-the-art batch and online tracking methods, and demonstrate the effect and usefulness of the proposed methods for online multi-object tracking.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-04-12
Jakob Engel; Vladlen Koltun; Daniel Cremers

Direct Sparse Odometry (DSO) is a visual odometry method based on a novel, highly accurate sparse and direct structure and motion formulation. It combines a fully direct probabilistic model (minimizing a photometric error) with consistent, joint optimization of all model parameters, including geometry (represented as inverse depth in a reference frame) and camera motion. This is achieved in real time by omitting the smoothness prior used in other direct methods and instead sampling pixels evenly throughout the images. Since our method does not depend on keypoint detectors or descriptors, it can naturally sample pixels from across all image regions that have intensity gradient, including edges or smooth intensity variations on essentially featureless walls. The proposed model integrates a full photometric calibration, accounting for exposure time, lens vignetting, and non-linear response functions. We thoroughly evaluate our method on three different datasets comprising several hours of video. The experiments show that the presented approach significantly outperforms state-of-the-art direct and indirect methods in a variety of real-world settings, both in terms of tracking accuracy and robustness.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-23
Thanh-Toan Do; Ngai-Man Cheung

The objective of this paper is to design an embedding method that maps local features describing an image (e.g., SIFT) to a higher dimensional representation useful for the image retrieval problem. First, motivated by the relationship between the linear approximation of a nonlinear function in high dimensional space and the state-of-the-art feature representation used in image retrieval, i.e., VLAD, we propose a new approach for the approximation. The embedded vectors resulting from the function approximation process are then aggregated to form a single representation for image retrieval. Second, in order to make the proposed embedding method applicable to large-scale problems, we further derive a fast version in which the embedded vectors can be efficiently computed, i.e., in closed form. We compare the proposed embedding methods with the state of the art in the context of image search under various settings: when the images are represented by medium length vectors, short vectors, or binary vectors. The experimental results show that the proposed embedding methods outperform the existing state of the art on the standard public image retrieval benchmarks.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-04-12
Martin Storath; Andreas Weinmann

Median filtering is among the most utilized tools for smoothing real-valued data, as it is robust, edge-preserving, value-preserving, and yet can be computed efficiently. For data living on the unit circle, such as phase data or orientation data, a filter with similar properties is desirable. For these data, there is no unique means to define a median; so we discuss various possibilities. The arc distance median turns out to be the only variant which leads to robust, edge-preserving and value-preserving smoothing. However, there are no efficient algorithms for filtering based on the arc distance median. Here, we propose fast algorithms for filtering of signals and images with values on the unit circle based on the arc distance median. For non-quantized data, we develop an algorithm that scales linearly with the filter size. The runtime of our reference implementation is only moderately higher than the Matlab implementation of the classical median filter for real-valued data. For quantized data, we obtain an algorithm of constant complexity w.r.t. the filter size. We demonstrate the performance of our algorithms for real life data sets: phase images from interferometric synthetic aperture radar, planar flow fields from optical flow, and time series of wind directions.
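To make the arc distance median concrete, here is a brute-force reference sketch for a single filter window. It is not one of the fast algorithms the abstract proposes: it costs quadratic time in the window size, restricts candidates to the samples themselves (an assumption that also keeps the result value-preserving), and uses hypothetical function names.

```python
import math

TWO_PI = 2.0 * math.pi


def arc_distance(a, b):
    """Length of the shorter arc between two angles given in radians."""
    d = abs(a - b) % TWO_PI
    return min(d, TWO_PI - d)


def arc_distance_median(angles):
    """Sample that minimizes the summed arc distance to all samples in the window."""
    return min(angles, key=lambda c: sum(arc_distance(c, a) for a in angles))


# Three phase values clustered around the wrap-around point at 0 (equivalently 2*pi).
window = [0.1, 0.2, 6.2]
circular_median = arc_distance_median(window)
linear_median = sorted(window)[1]
```

For this window the circular median is 0.1, while the ordinary median of the raw numbers is 0.2, because 6.2 rad lies just below the wrap-around and is therefore close to 0.1 on the circle.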

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-23
Yong-Jin Liu; Minjing Yu; Bing-Jun Li; Ying He

Superpixels are perceptually meaningful atomic regions that can effectively capture image features. Among various methods for computing uniform superpixels, simple linear iterative clustering (SLIC) is popular due to its simplicity and high performance. In this paper, we extend SLIC to compute content-sensitive superpixels, i.e., small superpixels in content-dense regions with high intensity or colour variation and large superpixels in content-sparse regions. Rather than using the conventional SLIC method that clusters pixels in $\mathbb{R}^5$, we map the input image $I$ to a 2-dimensional manifold $\mathcal{M}\subset\mathbb{R}^5$, whose area elements are a good measure of the content density in $I$. We propose a simple method, called intrinsic manifold SLIC (IMSLIC), for computing a geodesic centroidal Voronoi tessellation (GCVT), which is a uniform tessellation, on $\mathcal{M}$; this induces the content-sensitive superpixels in $I$. In contrast to the existing algorithms, IMSLIC characterizes the content sensitivity by measuring areas of Voronoi cells on $\mathcal{M}$. Using a simple and fast approximation to a closed-form solution, the method can compute the GCVT at a very low cost and guarantees that all Voronoi cells are simply connected. We thoroughly evaluate IMSLIC and compare it with eleven representative methods on the BSDS500 dataset and seven representative methods on the NYUV2 dataset. Computational results show that IMSLIC outperforms existing methods in terms of commonly used quality measures pertaining to superpixels such as compactness, adherence to boundaries, and achievable segmentation accuracy. We also evaluate IMSLIC and seven representative methods in an image contour closure application, and the results on two datasets, WHD and WSD, show that IMSLIC achieves the best foreground segmentation performance.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-04-06
Hossein Rahmani; Ajmal Mian; Mubarak Shah

Recognizing human actions from unknown and unseen (novel) views is a challenging problem. We propose a Robust Non-Linear Knowledge Transfer Model (R-NKTM) for human action recognition from novel views. The proposed R-NKTM is a deep fully-connected neural network that transfers knowledge of human actions from any unknown view to a shared high-level virtual view by finding a set of non-linear transformations that connects the views. The R-NKTM is learned from 2D projections of dense trajectories of synthetic 3D human models fitted to real motion capture data and generalizes to real videos of human actions. The strength of our technique is that we learn a single R-NKTM for all actions and all viewpoints for knowledge transfer of any real human action video without the need for re-training or fine-tuning the model. Thus, R-NKTM can efficiently scale to incorporate new action classes. R-NKTM is learned with dummy labels and does not require knowledge of the camera viewpoint at any stage. Experiments on three benchmark cross-view human action datasets show that our method outperforms the existing state of the art.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-04-06
Tomás F. Yago Vicente; Minh Hoai; Dimitris Samaras

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-04-06
Wei Liao; Stefan Wörz; Chang-Ki Kang; Zang-Hee Cho; Karl Rohr

We propose a novel minimal path method for the segmentation of 2D and 3D line structures. Minimal path methods perform propagation of a wavefront emanating from a start point at a speed derived from image features, followed by path extraction using backtracing. Usually, the computation of the speed and the propagation of the wave are two separate steps, and point features are used to compute a static speed. We introduce a new continuous minimal path method which steers the wave propagation progressively using dynamic speed based on path features. We present three instances of our method, using an appearance feature of the path, a geometric feature based on the curvature of the path, and a joint appearance and geometric feature based on the tangent of the wavefront. These features have not been used in previous continuous minimal path methods. We compute the features dynamically during the wave propagation, and also efficiently using a fast numerical scheme and a low-dimensional parameter space. Our method does not suffer from discretization or metrication errors. We performed qualitative and quantitative evaluations using 2D and 3D images from different application areas.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-29
Xiaobai Liu; Yibiao Zhao; Song-Chun Zhu

In this paper, we present an attribute grammar for solving two coupled tasks: i) parsing a 2D image into semantic regions; and ii) recovering the 3D scene structures of all regions. The proposed grammar consists of a set of production rules, each describing a kind of spatial relation between planar surfaces in 3D scenes. These production rules are used to decompose an input image into a hierarchical parse graph representation where each graph node indicates a planar surface or a composite surface. Unlike other stochastic image grammars, the proposed grammar augments each graph node with a set of attribute variables to depict scene-level global geometry, e.g., camera focal length, or local geometry, e.g., surface normal, contact lines between surfaces. These geometric attributes impose constraints between a node and its offspring in the parse graph. Under a probabilistic framework, we develop a Markov Chain Monte Carlo method to construct a parse graph that optimizes the 2D image recognition and 3D scene reconstruction objectives simultaneously. We evaluated our method on both public benchmarks and newly collected datasets. Experiments demonstrate that the proposed method is capable of achieving state-of-the-art scene reconstruction of a single image.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-15
Yong Ren; Yining Wang; Jun Zhu

Supervised topic models simultaneously model the latent topic structure of large collections of documents and a response variable associated with each document. Existing inference methods are based on variational approximation or Monte Carlo sampling, which often suffer from local minima. Spectral methods have been applied to learn unsupervised topic models, such as latent Dirichlet allocation (LDA), with provable guarantees. This paper investigates the possibility of applying spectral methods to recover the parameters of supervised LDA (sLDA). We first present a two-stage spectral method, which recovers the parameters of LDA followed by a power update method to recover the regression model parameters. Then, we further present a single-phase spectral algorithm to jointly recover the topic distribution matrix as well as the regression weights. Our spectral algorithms are provably correct and computationally efficient. We prove a sample complexity bound for each algorithm and subsequently derive a sufficient condition for the identifiability of sLDA. Thorough experiments on synthetic and real-world datasets verify the theory and demonstrate the practical effectiveness of the spectral algorithms. In fact, our results on a large-scale review rating dataset demonstrate that our single-phase spectral algorithm alone gets comparable or even better performance than state-of-the-art methods, while previous work on spectral methods has rarely reported such promising performance.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-09
Ting-Chun Wang; Manmohan Chandraker; Alexei A. Efros; Ravi Ramamoorthi

Light-field cameras have recently emerged as a powerful tool for one-shot passive 3D shape capture. However, obtaining the shape of glossy objects like metals or plastics remains challenging, since standard Lambertian cues like photo-consistency cannot be easily applied. In this paper, we derive a spatially-varying (SV)BRDF-invariant theory for recovering 3D shape and reflectance from light-field cameras. Our key theoretical insight is a novel analysis of diffuse plus single-lobe SVBRDFs under a light-field setup. We show that, although direct shape recovery is not possible, an equation relating depths and normals can still be derived. Using this equation, we then propose using a polynomial (quadratic) shape prior to resolve the shape ambiguity. Once shape is estimated, we also recover the reflectance. We present extensive synthetic data on the entire MERL BRDF dataset, as well as a number of real examples to validate the theory, where we simultaneously recover shape and BRDFs from a single image taken with a Lytro Illum camera.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-10
Przemysław Głowacki; Miguel Amável Pinheiro; Agata Mosinska; Engin Türetken; Daniel Lebrecht; Raphael Sznitman; Anthony Holtmaat; Jan Kybic; Pascal Fua

We propose a novel approach to reconstructing curvilinear tree structures evolving over time, such as road networks in 2D aerial images or neural structures in 3D microscopy stacks acquired in vivo. To enforce temporal consistency, we simultaneously process all images in a sequence, as opposed to reconstructing structures of interest in each image independently. We formulate the problem as a Quadratic Mixed Integer Program and demonstrate the additional robustness that comes from using all available visual clues at once, instead of working frame by frame. Furthermore, when the linear structures undergo local changes over time, our approach automatically detects them.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-05-23
Ethan M. Rudd; Lalit P. Jain; Walter J. Scheirer; Terrance E. Boult

It is often desirable to be able to recognize when inputs to a recognition function learned in a supervised manner correspond to classes unseen at training time. With this ability, new class labels could be assigned to these inputs by a human operator, allowing them to be incorporated into the recognition function, ideally under an efficient incremental update mechanism. While good algorithms that assume inputs from a fixed set of classes exist, e.g., artificial neural networks and kernel machines, it is not immediately obvious how to extend them to perform incremental learning in the presence of unknown query classes. Existing algorithms take little to no distributional information into account when learning recognition functions and lack a strong theoretical foundation. We address this gap by formulating a novel, theoretically sound classifier, the Extreme Value Machine (EVM). The EVM has a well-grounded interpretation derived from statistical Extreme Value Theory (EVT), and is the first classifier to be able to perform nonlinear kernel-free variable bandwidth incremental learning. Compared to other classifiers in the same deep network derived feature space, the EVM is accurate and efficient on an established benchmark partition of the ImageNet dataset.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-02-05
Boxin Shi; Zhipeng Mo; Zhe Wu; Dinglong Duan; Sai Kit Yeung; Ping Tan

Classic photometric stereo is often extended to deal with real-world materials and work with unknown lighting conditions for practicability. To quantitatively evaluate non-Lambertian and uncalibrated photometric stereo, a photometric stereo image dataset containing objects of various shapes with complex reflectance properties and high-quality ground truth normals is still missing. In this paper, we introduce the ‘DiLiGenT’ dataset with calibrated Directional Lightings, objects of General reflectance with different shininess, and ‘ground Truth’ normals from high-precision laser scanning. We use our dataset to quantitatively evaluate state-of-the-art photometric stereo methods for general materials and unknown lighting conditions, selected from a newly proposed photometric stereo taxonomy emphasizing non-Lambertian and uncalibrated methods. The dataset and evaluation results are made publicly available, and we hope they can serve as a benchmark platform that inspires future research.

Updated: 2018-02-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-31
Weihong Deng; Jiani Hu; Jun Guo

A binary descriptor typically consists of three stages: image filtering, binarization, and spatial histogram. This paper first demonstrates that the binary code of the maximum-variance filtering responses leads to the lowest bit error rate under Gaussian noise. Then, an optimal eigenfilter bank is derived from a universal assumption on the local stationary random field. Finally, compressive binary patterns (CBP) is designed by replacing the local derivative filters of local binary patterns (LBP) with these novel random-field eigenfilters, which leads to a compact and robust binary descriptor that characterizes the most stable local structures that are resistant to image noise and degradation. A scattering-like operator is subsequently applied to enhance the distinctiveness of the descriptor. Surprisingly, the results obtained from experiments on the FERET, LFW, and PaSC databases show that the scattering CBP (SCBP) descriptor, which is handcrafted by only 6 optimal eigenfilters under restrictive assumptions, outperforms the state-of-the-art learning-based face descriptors in terms of both matching accuracy and robustness. In particular, on probe images degraded with noise, blur, JPEG compression, and reduced resolution, SCBP outperforms other descriptors by a greater than 10% accuracy margin.
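
For reference, the filtering-plus-binarization step of the classic local binary pattern (the baseline whose derivative filters CBP replaces) can be sketched as below. This is the textbook 3x3 LBP, not the paper's eigenfilter-based CBP; the function name `lbp_code`, the clockwise neighbor order, and the `>=` tie rule are illustrative assumptions.

```python
def lbp_code(patch):
    """Return the 8-bit LBP code of a 3x3 patch (list of three rows).

    Each neighbor is compared with the center pixel; a neighbor that is
    greater than or equal to the center sets one bit of the code.
    """
    center = patch[1][1]
    # Neighbor positions, clockwise starting at the top-left corner
    # (the ordering is a convention chosen for this sketch).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (row, col) in enumerate(offsets):
        if patch[row][col] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(codes):
    """Spatial-histogram stage: count how often each of the 256 codes occurs."""
    hist = [0] * 256
    for code in codes:
        hist[code] += 1
    return hist
```

In a full descriptor, `lbp_code` would be applied at every pixel of an image cell and `lbp_histogram` would summarize the cell; the histograms of all cells are then concatenated.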

Updated: 2018-02-01
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-30
Ruimao Zhang; Liang Lin; Guangrun Wang; Meng Wang; Wangmeng Zuo

This paper investigates a fundamental problem of scene understanding: how to parse a scene image into a structured configuration (i.e., a semantic object hierarchy with object interaction relations). We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) extracting the image representation for pixel-wise object labeling and ii) a recursive neural network (RsNN) discovering the hierarchical object structure and the inter-object relations. Rather than relying on elaborate annotations (e.g., manually labeled semantic maps and relations), we train our deep model in a weakly-supervised learning manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and apply these tree structures to discover the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and RsNN are updated accordingly by back propagation. The entire model training is accomplished through an Expectation-Maximization method. Extensive experiments show that our model is capable of producing meaningful scene configurations and achieving more favorable scene labeling results on two benchmarks (i.e., PASCAL VOC 2012 and SYSU-Scenes) compared with other state-of-the-art weakly-supervised deep learning methods. In particular, SYSU-Scenes, which we created to advance research on scene parsing, contains more than 5000 scene images with their semantic sentence descriptions.

Updated: 2018-01-31
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-30
Kyungdon Joo; Tae-Hyun Oh; Junsik Kim; In So Kweon

Most man-made environments, such as urban and indoor scenes, consist of a set of parallel and orthogonal planar structures. These structures are approximated by the Manhattan world assumption, a notion that can be represented as a Manhattan Frame (MF). Given a set of inputs such as surface normals or vanishing points, we pose an MF estimation problem as a consensus set maximization that maximizes the number of inliers over the rotation search space. Conventionally this problem can be solved by a branch-and-bound framework which mathematically guarantees global optimality. However, the computational time of the conventional branch-and-bound algorithms is rather far from real-time. In this paper, we propose a novel bound computation method on an efficient measurement domain for MF estimation, i.e., the extended Gaussian image (EGI). By relaxing the original problem, we can compute the bound with a constant complexity, while preserving global optimality. Furthermore, we quantitatively and qualitatively demonstrate the performance of the proposed method for various synthetic and real-world data. We also show the versatility of our approach through three different applications: extension to multiple MF estimation, 3D rotation based video stabilization and vanishing point estimation (line clustering).
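
The objective being maximized, the size of the consensus set for one candidate frame, can be sketched as follows. This only evaluates a single candidate; the branch-and-bound search and the EGI-based bound from the paper are not shown. The function name `count_inliers`, the 5-degree threshold, and the convention that a normal may align with either direction of an axis are assumptions made for illustration.

```python
import math


def count_inliers(normals, axes, max_angle_deg=5.0):
    """Count unit normals lying within max_angle_deg of any frame axis.

    normals: iterable of unit 3-vectors (tuples).
    axes: the three orthonormal axes of a candidate Manhattan Frame.
    A normal is an inlier if it is nearly parallel to an axis in either
    direction, hence the absolute value of the dot product.
    """
    cos_threshold = math.cos(math.radians(max_angle_deg))
    inliers = 0
    for normal in normals:
        for axis in axes:
            dot = sum(a * b for a, b in zip(normal, axis))
            if abs(dot) >= cos_threshold:
                inliers += 1
                break  # count each normal at most once
    return inliers
```

A global solver would search over rotations of `axes` for the frame that makes this count as large as possible.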

Updated: 2018-01-31
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-30
Omar Costilla Reyes; Ruben Vera-Rodriguez; Patricia Scully; Krikor B Ozanyan

Human footsteps can provide a unique behavioural pattern for robust biometric systems. We propose spatio-temporal footstep representations from floor-only sensor data in advanced computational models for automatic biometric verification. Our models deliver an artificial intelligence capable of effectively differentiating the fine-grained variability of footsteps between legitimate users (clients) and impostor users of the biometric system. The methodology is validated on the largest footstep database to date, containing nearly 20,000 footstep signals from more than 120 users. The database is organized by considering a large cohort of impostors and a small set of clients to verify the reliability of biometric systems. We provide experimental results in 3 critical data-driven security scenarios, according to the amount of footstep data made available for model training: airport security checkpoints (smallest training set), workspace environments (medium training set) and home environments (largest training set). We report state-of-the-art footstep recognition rates with an optimal equal false acceptance and false rejection rate of 0.7% (equal error rate), an improvement ratio of 371% over the previous state of the art. We perform a feature analysis of deep residual neural networks showing effective clustering of clients' footstep data and provide insights into the feature learning process.
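
The equal error rate quoted above is the operating point at which the false acceptance rate (impostors accepted) equals the false rejection rate (clients rejected). A minimal sketch of how it can be estimated from verification scores follows; the function name `equal_error_rate`, the "accept when score >= threshold" rule, and the choice to average the two rates at the closest crossing are assumptions of this sketch, not details taken from the paper.

```python
def equal_error_rate(client_scores, impostor_scores):
    """Estimate the EER and its threshold from two lists of scores.

    Higher scores mean "more likely a client". Every observed score is
    tried as a threshold, and the one where the false acceptance rate
    (FAR) and false rejection rate (FRR) are closest is returned.
    """
    thresholds = sorted(set(client_scores) | set(impostor_scores))
    best_gap, best_eer, best_threshold = None, None, None
    for threshold in thresholds:
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < threshold for s in client_scores) / len(client_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2  # the two rates rarely match exactly
            best_threshold = threshold
    return best_eer, best_threshold
```

With finite data the two rates seldom cross exactly, which is why the sketch reports the midpoint at the nearest threshold.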

Updated: 2018-01-31
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-25
Tadas Baltrušaitis; Chaitanya Ahuja; Louis-Philippe Morency

Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

Updated: 2018-01-26
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-24
Liwei Wang; Yin Li; Jing Huang; Svetlana Lazebnik

Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for learning the similarity between these two data modalities. We propose two network structures that produce different output representations. The first one, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. Compared to standard triplet sampling, we perform improved neighborhood sampling that takes neighborhood information into consideration while constructing mini-batches. The second network structure, referred to as a similarity network, fuses the two branches via element-wise product and is trained with regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracies for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on Flickr30K and MSCOCO datasets.
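
The maximum-margin ranking loss used by the embedding network can be illustrated with a small bidirectional hinge loss over a similarity matrix. This sketch omits the neighborhood constraints and neighborhood sampling described above, and the function name `bidirectional_ranking_loss`, the default margin, and the assumption that matched pairs sit on the diagonal are all illustrative choices rather than the authors' implementation.

```python
def bidirectional_ranking_loss(sim, margin=0.1):
    """Hinge ranking loss over a square image-sentence similarity matrix.

    sim[i][j] is the similarity between image i and sentence j, and the
    matching pairs are assumed to lie on the diagonal. Each image should
    score its own sentence at least `margin` above every other sentence,
    and each sentence should do the same for its own image.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        positive = sim[i][i]
        for j in range(n):
            if i == j:
                continue
            # Image i as the query, sentence j as the negative.
            loss += max(0.0, margin + sim[i][j] - positive)
            # Sentence i as the query, image j as the negative.
            loss += max(0.0, margin + sim[j][i] - positive)
    return loss
```

The loss is zero once every matched pair beats every mismatched pair by the margin in both retrieval directions.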

Updated: 2018-01-25
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-23
Khurram Soomro; Haroon Idrees; Mubarak Shah

This paper proposes a person-centric and online approach to the challenging problem of localization and prediction of actions and interactions in videos. Typically, localization or recognition is performed in an offline manner where all the frames in the video are processed together. This prevents timely localization and prediction of actions and interactions - an important consideration for many tasks including surveillance and human-machine interaction.

Updated: 2018-01-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-23
Tianzhu Zhang; Changsheng Xu; Ming-Hsuan Yang

Sparse representations have been applied to visual tracking by finding the best candidate region with minimal reconstruction error based on a set of target templates. However, most existing sparse trackers only consider holistic or local representations and do not make full use of the intrinsic structure among and inside target candidate regions, thereby making them less effective when similar objects appear at close proximity or under occlusion. In this paper, we propose a novel structural sparse representation, which not only exploits the intrinsic relationships among target candidate regions and local patches to learn their representations jointly, but also preserves the spatial structure among the local patches inside each target candidate region. For robust visual tracking, we take outliers resulting from occlusion and noise into account when searching for the best target region. Constructed within a Bayesian filtering framework, we show that the proposed algorithm accommodates most existing sparse trackers with respective merits. The formulated problem can be efficiently solved using an accelerated proximal gradient method that yields a sequence of closed form updates. Qualitative and quantitative evaluations on challenging benchmark datasets demonstrate that the proposed tracking algorithm performs favorably against several state-of-the-art methods.

Updated: 2018-01-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-23
Tianzhu Zhang; Changsheng Xu; Ming-Hsuan Yang

We propose a multi-task correlation particle filter (MCPF) for robust visual tracking. We first present the multi-task correlation filter (MCF) that takes the interdependencies among different object parts and features into account to learn the correlation filters jointly. The proposed MCPF is introduced to exploit and complement the strength of an MCF and a particle filter. Compared with existing tracking methods based on correlation filters and particle filters, the proposed MCPF enjoys several merits. First, it exploits the interdependencies among different features to derive the correlation filters jointly, and makes the learned filters complement and enhance each other to obtain consistent responses. Second, it handles partial occlusion via a part-based representation, and exploits the intrinsic relationship among local parts via spatial constraints to preserve object structure and learn the correlation filters jointly. Third, it effectively handles large scale variation via a sampling scheme by drawing particles at different scales for target object state estimation. Fourth, it shepherds the sampled particles toward the modes of the target state distribution via the MCF, and effectively covers object states well using fewer particles than conventional particle filters. Extensive experimental results demonstrate that the proposed MCPF tracking algorithm performs favorably against the state-of-the-art methods.

Updated: 2018-01-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-18
Hae-Gon Jeon; Jaesik Park; Gyeongmin Choe; Jinsun Park; Yunsu Bok; Yu Wing Tai; In So Kweon

One of the core applications of light field imaging is depth estimation. To acquire a depth map, existing approaches apply a single photo-consistency measure to an entire light field. However, this is not an optimal choice because of the non-uniform light field degradations produced by limitations in the hardware design. In this paper, we introduce a pipeline that automatically determines the best configuration for photo-consistency measurement, which leads to the most reliable depth label from the light field. We analyzed the practical factors affecting degradation in lenslet light field cameras, and designed a learning based framework that can retrieve the best cost measure and optimal depth label. To enhance the reliability of our method, we augmented an existing light field benchmark to simulate realistic source dependent noise, aberrations, and vignetting artifacts. The augmented dataset was used for the training and validation of the proposed approach. Our method was competitive with several state-of-the-art methods for the benchmark and real-world light field datasets.

Updated: 2018-01-19
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-18
Weiwei Liu; Donna Xu; Ivor Tsang; Wenjie Zhang

Multi-output learning with the task of simultaneously predicting multiple outputs for an input has increasingly attracted interest from researchers due to its wide application. The k nearest neighbor (kNN) algorithm is one of the most popular frameworks for handling multi-output problems. The performance of kNN depends crucially on the metric used to compute the distance between different instances. However, our experiment results show that the existing advanced metric learning technique cannot provide an appropriate distance metric for multi-output tasks. This paper systematically studies how to learn an appropriate distance metric for multi-output problems. In particular, we present a novel large margin metric learning paradigm for multi-output tasks, which projects both the input and output into the same embedding space and then learns a distance metric to discover output dependency such that instances with very different multiple outputs will be moved far away. Several strategies are then proposed to speed up the training and testing time. Moreover, we study the generalization error bound of our method, which shows that our method is able to tighten the excess risk bounds. Experiments on three multi-output learning tasks (multi-label classification, multi-target regression, and multi-concept retrieval) validate the effectiveness and scalability of the proposed method.

Updated: 2018-01-19
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-17
Changsheng Li; Fan Wei; Weishan Dong; Xiangfeng Wang; Qingshan Liu; Xin Zhang

Online multiple-output regression is an important machine learning technique for modeling, predicting, and compressing multi-dimensional correlated data streams. In this paper, we propose a novel online multiple-output regression method, called MORES, for streaming data. MORES can dynamically learn the structure of the regression coefficients to facilitate the model's continuous refinement. Considering that the limited expressive ability of regression models often leads to dependent residual errors, MORES dynamically learns and leverages the structure of the residual errors to improve the prediction accuracy. Moreover, we introduce three modified covariance matrices to extract necessary information from all the seen data for training, and set different weights on samples so as to track the data streams' evolving characteristics. Furthermore, an efficient algorithm is designed to optimize the proposed objective function, and an efficient online eigenvalue decomposition algorithm is developed for the modified covariance matrix. Finally, we analyze the convergence of MORES under certain ideal conditions. Experiments on two synthetic datasets and three real-world datasets validate the effectiveness and efficiency of MORES. In addition, MORES can process at least 2,000 instances per second (including training and testing) on the three real-world datasets, more than 12 times faster than the state-of-the-art online learning algorithm.

Updated: 2018-01-18
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-17
Ehsan Adeli; Kim-Han Thung; Le An; Guorong Wu; Feng Shi; Tao Wang; Dinggang Shen

Discriminative methods commonly produce models with relatively good generalization abilities. However, this advantage is challenged in real-world applications (e.g., medical image analysis problems), in which there often exist outlier data points (sample-outliers) and noises in the predictor values (feature-noises). Methods robust to both types of these deviations are somewhat overlooked in the literature. We further argue that denoising can be more effective, if we learn the model using all the available labeled and unlabeled samples, as the intrinsic geometry of the sample manifold can be better constructed using more data points. In this paper, we propose a semi-supervised robust discriminative classification method based on the least-squares formulation of linear discriminant analysis to detect sample-outliers and feature-noises simultaneously, using both labeled training and unlabeled testing data. We conduct several experiments on a synthetic, some benchmark semi-supervised learning, and two brain neurodegenerative disease diagnosis datasets (for Parkinson's and Alzheimer's diseases). Specifically for the application of neurodegenerative diseases diagnosis, incorporating robust machine learning methods can be of great benefit, due to the noisy nature of neuroimaging data. Our results show that our method outperforms the baseline and several state-of-the-art methods, in terms of both accuracy and the area under the ROC curve.

Updated: 2018-01-18
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-16
Canyi Lu; Jiashi Feng; Zhouchen Lin; Tao Mei; Shuicheng Yan

This paper studies the subspace clustering problem. Given some data points approximately drawn from a union of subspaces, the goal is to group these data points into their underlying subspaces. Many subspace clustering methods have been proposed, among which sparse subspace clustering and low-rank representation are two representative ones. Despite the different motivations, we observe that many existing methods share a common block diagonal property, which possibly leads to correct clustering, yet with their proofs given case by case. In this work, we first consider a general formulation and provide a unified theoretical guarantee of the block diagonal property. The block diagonal property of many existing methods falls into our special case. Second, we observe that many existing methods approximate the block diagonal representation matrix by using different structure priors, e.g., sparsity and low-rankness, which are indirect. We propose the first block diagonal matrix induced regularizer for directly pursuing the block diagonal matrix. With this regularizer, we solve the subspace clustering problem by Block Diagonal Representation (BDR), which uses the block diagonal structure prior. The BDR model is nonconvex and we propose an alternating minimization solver and prove its convergence. Experiments on real datasets demonstrate the effectiveness of BDR.
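
The block diagonal property can be checked on a small example by treating the representation (affinity) matrix as a graph: up to a reordering of the points, the matrix splits into k irreducible diagonal blocks exactly when the graph has k connected components. The sketch below counts those components with a union-find structure. It is only a diagnostic, not the BDR regularizer or its solver, and the function name `count_blocks` and the tolerance `tol` are assumptions.

```python
def count_blocks(affinity, tol=1e-8):
    """Count the diagonal blocks of a square affinity matrix.

    Points i and j are linked when either affinity[i][j] or
    affinity[j][i] exceeds `tol` in magnitude; each connected component
    of the resulting graph corresponds to one diagonal block after a
    suitable permutation of the points.
    """
    n = len(affinity)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if abs(affinity[i][j]) > tol or abs(affinity[j][i]) > tol:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})
```

If the number of blocks equals the number of subspaces and each block contains points from one subspace only, the clustering read off from the matrix is correct.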

Updated: 2018-01-17
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-12
Iacopo Masi; Feng-Ju Chang; Jongmoo Choi; Shai Harel; Jungyeon Kim; KangGeon Kim; Jatuporn Leksut; Stephen Rawls; Yue Wu; Tal Hassner; Wael AbdAlmageed; Gerard Medioni; Louis-Philippe Morency; Prem Natarajan; Ramkant Nevatia

We propose a method designed to push the frontiers of unconstrained face recognition in the wild with an emphasis on extreme out-of-plane pose variations. Existing methods either expect a single model to learn pose invariance by training on massive amounts of data or else normalize images by aligning faces to a single frontal pose. Contrary to these, our method is designed to explicitly tackle pose variations. Our proposed Pose-Aware Models (PAM) process a face image using several pose-specific, deep convolutional neural networks (CNN). 3D rendering is used to synthesize multiple face poses from input images to both train these models and to provide additional robustness to pose variations at test time. Our paper presents an extensive analysis on the IJB-A benchmark, evaluating the effects that landmark detection accuracy, CNN layer selection, and pose model selection all have on the performance of the recognition pipeline. It further provides comparative evaluations on the IARPA Janus Benchmarks A (IJB-A) and the PIPA dataset. These tests show that our approach outperforms existing methods, even surprisingly matching the accuracy of methods that were specifically fine-tuned to the target dataset. Parts of this work previously appeared in [1] and [2].

Updated: 2018-01-13
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-02-22
Huan Wan; Hui Wang; Gongde Guo; Xin Wei

Linear discriminant analysis (LDA) is a classical method for discriminative dimensionality reduction. The original LDA may degrade in its performance for non-Gaussian data, and may be unable to extract sufficient features to satisfactorily explain the data when the number of classes is small. Two prominent extensions to address these problems are subclass discriminant analysis (SDA) and mixture subclass discriminant analysis (MSDA). They divide every class into subclasses and re-define the within-class and between-class scatter matrices on the basis of subclass. In this paper we study the issue of how to obtain subclasses more effectively in order to achieve higher class separation. We observe that there is significant overlap between models of the subclasses, which we hypothesise is undesirable. In order to reduce their overlap we propose an extension of LDA, separability oriented subclass discriminant analysis (SSDA), which employs hierarchical clustering to divide a class into subclasses using a separability oriented criterion, before applying LDA optimisation using re-defined scatter matrices. Extensive experiments have shown that SSDA has better performance than LDA, SDA and MSDA in most cases. Additional experiments have further shown that SSDA can project data into LDA space that has higher class separation than LDA, SDA and MSDA in most cases.

Updated: 2018-01-11
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-10
Alessandro Achille; Stefano Soatto

The cross-entropy loss commonly used in deep learning is closely related to the defining properties of optimal representations, but does not enforce some of the key properties. We show that this can be solved by adding a regularization term, which is related to injecting multiplicative noise in the activations of a Deep Neural Network, a special case of which is the common practice of dropout. We show that our regularized loss function can be efficiently minimized using Information Dropout, a generalization of dropout rooted in information-theoretic principles that automatically adapts to the data and can better exploit architectures of limited capacity. When the task is the reconstruction of the input, our loss function yields a Variational Autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference. Finally, we prove that we can promote the creation of disentangled representations simply by enforcing a factorized prior, a fact that has been observed empirically in recent work. Our experiments validate the theoretical intuitions behind our method, and we find that information dropout achieves a comparable or better generalization performance than dropout, especially on smaller models, since it adapts to both the architecture and the test samples.

Updated: 2018-01-11
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-01-10
Wei Li; Farnaz Abtahi; Zhigang Zhu; Lijun Yin

In this paper, we propose a deep learning based approach for facial action unit (AU) detection by enhancing and cropping regions of interest of face images. The approach is implemented by adding two novel nets, the enhancing layers and the cropping layers, to a pretrained convolutional neural network (CNN) model. For the enhancing layers (E-Net), we have designed an attention map based on facial landmark features. For the cropping layers (C-Net), we crop facial regions around the detected landmarks and design individual convolutional layers. We then combine the E-Net and the C-Net to construct an Enhancing and Cropping Net (EAC-Net). Our approach shows a significant performance improvement over the state-of-the-art methods when tested on the BP4D and DISFA AU datasets. We have also studied the performance of the proposed EAC-Net with partial occlusion and with large head pose variations. Experimental results show that (1) the EAC-Net learns facial AU correlations effectively and can predict AUs reliably even with only half of a face; (2) the EAC-Net model also works well under large head poses. It further shows that the EAC-Net works much better without face alignment than with face alignment as pre-processing, in terms of efficiency and accuracy.

Updated: 2018-01-11
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-02-13
Akihiko Torii; Relja Arandjelović; Josef Sivic; Masatoshi Okutomi; Tomas Pajdla

We address the problem of large-scale visual place recognition for situations where the scene undergoes a major change in appearance, for example, due to illumination (day/night), change of seasons, aging, or structural modifications over time such as buildings being built or destroyed. Such situations represent a major challenge for current large-scale place recognition methods. This work has the following three principal contributions. First, we demonstrate that matching across large changes in the scene appearance becomes much easier when both the query image and the database image depict the scene from approximately the same viewpoint. Second, based on this observation, we develop a new place recognition approach that combines (i) an efficient synthesis of novel views with (ii) a compact indexable image representation. Third, we introduce a new challenging dataset of 1,125 camera-phone query images of Tokyo that contain major changes in illumination (day, sunset, night) as well as structural changes in the scene. We demonstrate that the proposed approach significantly outperforms other large-scale place recognition techniques on this challenging data.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-01
Michael Villamizar; Juan Andrade-Cetto; Alberto Sanfeliu; Francesc Moreno-Noguer

In this paper we introduce the Boosted Random Ferns (BRFs) to rapidly build discriminative classifiers for learning and detecting object categories. At the core of our approach we use standard random ferns, but we introduce four main innovations that let us bring ferns from an instance to a category level, and still retain efficiency. First, we define binary features on the histogram of oriented gradients-domain (as opposed to intensity-), allowing for a better representation of intra-class variability. Second, both the positions where ferns are evaluated within the sliding window, and the location of the binary features for each fern are not chosen completely at random, but instead we use a boosting strategy to pick the most discriminative combination of them. This is further enhanced by our third contribution, that is to adapt the boosting strategy to enable sharing of binary features among different ferns, yielding high recognition rates at a low computational cost. And finally, we show that training can be performed online, for sequentially arriving images. Overall, the resulting classifier can be very efficiently trained, densely evaluated for all image locations in about 0.1 seconds, and provides detection rates similar to competing approaches that require expensive and significantly slower processing times. We demonstrate the effectiveness of our approach by thorough experimentation in publicly available datasets in which we compare against state-of-the-art, and for tasks of both 2D detection and 3D multi-view estimation.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-07
Charles Otto; Dayong Wang; Anil K. Jain

Given a large collection of unlabeled face images, we address the problem of clustering faces into an unknown number of identities. This problem is of interest in social media, law enforcement, and other applications, where the number of faces can be of the order of hundreds of millions, while the number of identities (clusters) can range from a few thousand to millions. To address the challenges of run-time complexity and cluster quality, we present an approximate Rank-Order clustering algorithm that performs better than popular clustering algorithms (k-Means and Spectral). Our experiments include clustering up to 123 million face images into over 10 million clusters. Clustering results are analyzed in terms of external (known face labels) and internal (unknown face labels) quality measures, and run-time. Our algorithm achieves an F-measure of 0.87 on the LFW benchmark (13K faces of 5,749 individuals), which drops to 0.27 on the largest dataset considered (13K faces in LFW + 123M distractor images). Additionally, we show that frames in the YouTube benchmark can be clustered with an F-measure of 0.71. An internal per-cluster quality measure is developed to rank individual clusters for manual exploration of high quality clusters that are compact and isolated.
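
One common way to compute an external F-measure for a clustering is over pairs of samples: a pair placed in the same cluster counts as correct only if both faces share an identity. The sketch below uses that pairwise definition; whether it matches the exact definition used for the figures above is an assumption, as is the function name `pairwise_f_measure`. The quadratic loop is meant for small illustrations, not for millions of faces.

```python
from itertools import combinations


def pairwise_f_measure(true_labels, cluster_ids):
    """Pairwise F-measure of a clustering against known identity labels.

    Precision: fraction of same-cluster pairs that share an identity.
    Recall: fraction of same-identity pairs placed in the same cluster.
    """
    true_pos = false_pos = false_neg = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_identity = true_labels[i] == true_labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if same_cluster and same_identity:
            true_pos += 1
        elif same_cluster:
            false_pos += 1
        elif same_identity:
            false_neg += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, adding distractor images that get merged into existing clusters lowers precision, which is one way a score can fall as the dataset grows.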

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-02-22
Odyssée Merveille; Hugues Talbot; Laurent Najman; Nicolas Passat

The analysis of thin curvilinear objects in 3D images is a complex and challenging task. In this article, we introduce a new, non-linear operator, called RORPO (Ranking the Orientation Responses of Path Operators). Inspired by the multidirectional paradigm currently used in linear filtering for thin structure analysis, RORPO is built upon the notion of path operator from mathematical morphology. This operator, unlike most operators commonly used for 3D curvilinear structure analysis, is discrete, non-linear and non-local. From this new operator, two main curvilinear structure characteristics can be estimated: an intensity feature, that can be assimilated to a quantitative measure of curvilinearity; and a directional feature, providing a quantitative measure of the structure's orientation. We provide a full description of the structural and algorithmic details for computing these two features from RORPO, and we discuss computational issues. We experimentally assess RORPO by comparison with three of the most popular curvilinear structure analysis filters, namely Frangi Vesselness, Optimally Oriented Flux, and Hybrid Diffusion with Continuous Switch. In particular, we show that our method provides up to 8 percent more true positives and 50 percent fewer false positives than the next best method, on synthetic and real 3D images.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-02
Jen-Tzung Chien; Chao-Hsi Lee

Deep unfolding provides an approach to integrate probabilistic generative models and deterministic neural networks. Such an approach benefits from deep representation, easy interpretation, flexible learning and stochastic modeling. This study develops the unsupervised and supervised learning of deep unfolded topic models for document representation and classification. Conventionally, the unsupervised and supervised topic models are inferred via the variational inference algorithm where the model parameters are estimated by maximizing the lower bound of the logarithm of the marginal likelihood using input documents without and with class labels, respectively. The representation capability or classification accuracy is constrained by the variational lower bound and the tied model parameters across the inference procedure. This paper aims to relax these constraints by directly maximizing the end performance criterion and continuously untying the parameters in the learning process via deep unfolding inference (DUI). The inference procedure is treated as layer-wise learning in a deep neural network. The end performance is iteratively improved by using the estimated topic parameters according to the exponentiated updates. Deep learning of topic models is therefore implemented through a back-propagation procedure. Experimental results show the merits of DUI with an increasing number of layers compared with variational inference in unsupervised as well as supervised topic models.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-02-14
Yunlian Sun; Man Zhang; Zhenan Sun; Tieniu Tan

Biometrics is the technique of automatically recognizing individuals based on their biological or behavioral characteristics. Various biometric traits have been introduced and widely investigated, including fingerprint, iris, face, voice, palmprint, gait and so forth. Apart from identity, biometric data may convey various other personal information, covering affect, age, gender, race, accent, handedness, height, weight, etc. Among these, analysis of demographics (age, gender, and race) has received tremendous attention owing to its wide real-world applications, with significant efforts devoted and great progress achieved. This survey first presents biometric demographic analysis from the standpoint of human perception, then provides a comprehensive overview of state-of-the-art advances in automated estimation from both academia and industry. Despite these advances, a number of challenging issues continue to inhibit its full potential. We then discuss these open problems and finally provide an outlook into the future of this very active field of research by sharing some promising opportunities.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-02-16
Yu-Gang Jiang; Zuxuan Wu; Jun Wang; Xiangyang Xue; Shih-Fu Chang

In this paper, we study the challenging problem of categorizing videos according to high-level semantics such as the existence of a particular human action or a complex event. Although extensive efforts have been devoted in recent years, most existing works combined multiple video features using simple fusion strategies and neglected the utilization of inter-class semantic relationships. This paper proposes a novel unified framework that jointly exploits the feature relationships and the class relationships for improved categorization performance. Specifically, these two types of relationships are estimated and utilized by imposing regularizations in the learning process of a deep neural network (DNN). Through arming the DNN with better capability of harnessing both the feature and the class relationships, the proposed regularized DNN (rDNN) is more suitable for modeling video semantics. We show that rDNN produces better performance over several state-of-the-art approaches. Competitive results are reported on the well-known Hollywood2 and Columbia Consumer Video benchmarks. In addition, to stimulate future research on large scale video categorization, we collect and release a new benchmark dataset, called FCVID, which contains 91,223 Internet videos and 239 manually annotated categories.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-08
Zhongyu Lou; Fares Alnajar; Jose M. Alvarez; Ninghang Hu; Theo Gevers

In this paper, we investigate and exploit the influence of facial expressions on automatic age estimation. Different from existing approaches, our method jointly learns the age and expression by introducing a new graphical model with a latent layer between the age/expression labels and the features. This layer aims to learn the relationship between the age and expression and captures the face changes which induce the aging and expression appearance, thus obtaining expression-invariant age estimation. Conducted on three age-expression datasets (FACES [1], Lifespan [2] and NEMO [3]), our experiments illustrate the improvement in performance when the age is jointly learnt with expression in comparison to expression-independent age estimation. The age estimation error is reduced by 14.43, 37.75 and 9.30 percent for the FACES, Lifespan and NEMO datasets respectively. The results obtained by our graphical model, without prior knowledge of the expressions of the tested faces, are better than the best reported ones for all datasets. The flexibility of the proposed model to include more cues is explored by incorporating gender together with age and expression. The results show performance improvements for all cues.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-03
Tae-Hyun Oh; Yasuyuki Matsushita; Yu-Wing Tai; In So Kweon

Rank minimization can be converted into tractable surrogate problems, such as Nuclear Norm Minimization (NNM) and Weighted NNM (WNNM). The problems related to NNM, or WNNM, can be solved iteratively by applying a closed-form proximal operator, called Singular Value Thresholding (SVT), or Weighted SVT, but they suffer from the high computational cost of Singular Value Decomposition (SVD) at each iteration. We propose a fast and accurate approximation method for SVT, that we call fast randomized SVT (FRSVT), with which we avoid direct computation of SVD. The key idea is to extract an approximate basis for the range of the matrix from its compressed matrix. Given the basis, we compute partial singular values of the original matrix from the small factored matrix. In addition, by developing a range propagation method, our method further speeds up the extraction of the approximate basis at each iteration. Our theoretical analysis shows the relationship between the approximation bound of SVD and its effect on NNM via SVT. Along with the analysis, our empirical results quantitatively and qualitatively show that our approximation rarely harms the convergence of the host algorithms. We assess the efficiency and accuracy of the proposed method on various computer vision problems, e.g., subspace clustering, weather artifact removal, and simultaneous multi-image alignment and rectification.
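To make the idea in this abstract concrete, the following is a minimal NumPy sketch of singular value thresholding computed from a randomized range basis. It is not the authors' FRSVT implementation: the function names `svt` and `randomized_svt`, the `rank`, `oversample` and `seed` parameters, and the use of a plain Gaussian test matrix are assumptions chosen for illustration, and the range propagation step described above is omitted.

```python
import numpy as np

def svt(A, tau):
    """Exact SVT: soft-threshold every singular value of A by tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def randomized_svt(A, tau, rank, oversample=5, seed=0):
    """Approximate SVT from a randomized range basis (illustrative sketch).

    `rank`, `oversample` and `seed` are hypothetical parameters, not names
    taken from the paper.
    """
    rng = np.random.default_rng(seed)
    k = min(rank + oversample, min(A.shape))
    # Compress A with a random test matrix, then orthonormalize the result
    # to obtain an approximate basis Q for the range of A.
    Y = A @ rng.standard_normal((A.shape[1], k))
    Q, _ = np.linalg.qr(Y)
    # The SVD is taken only on the small factored matrix B = Q^T A.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Lift the left singular vectors back and soft-threshold.
    return ((Q @ Ub) * np.maximum(s - tau, 0.0)) @ Vt
```

When the matrix has numerical rank at most `rank`, the sketch should agree with the exact operator up to floating-point error; for matrices with slowly decaying spectra the result is only an approximation.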

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-02-09
Ying-Cong Chen; Xiatian Zhu; Wei-Shi Zheng; Jian-Huang Lai

The challenge of person re-identification (re-id) is to match individual images of the same person captured by different non-overlapping camera views against significant and unknown cross-view feature distortion. While a large number of distance metric/subspace learning models have been developed for re-id, the cross-view transformations they learned are view-generic and thus potentially less effective in quantifying the feature distortion inherent to each camera view. Learning view-specific feature transformations for re-id (i.e., view-specific re-id), an under-studied approach, becomes an alternative resort for this problem. In this work, we formulate a novel view-specific person re-identification framework from the feature augmentation point of view, called Camera coRrelation Aware Feature augmenTation (CRAFT). Specifically, CRAFT performs cross-view adaptation by automatically measuring camera correlation from cross-view visual data distribution and adaptively conducting feature augmentation to transform the original features into a new adaptive space. Through our augmentation framework, view-generic learning algorithms can be readily generalized to learn and optimize view-specific sub-models whilst simultaneously modelling view-generic discrimination information. Therefore, our framework not only inherits the strength of view-generic model learning but also provides an effective way to take into account view-specific characteristics. Our CRAFT framework can be extended to jointly learn view-specific feature transformations for person re-id across a large network with more than two cameras, a largely under-investigated but realistic re-id setting. Additionally, we present a domain-generic deep person appearance representation which is designed specifically to be view invariant, facilitating cross-view adaptation by CRAFT. We conducted extensive comparative experiments to validate the superiority and advantages of our proposed framework over state-of-the-art competitors on contemporary challenging person re-id datasets.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-02-17
Di Xu; Qi Duan; Jianmin Zheng; Juyong Zhang; Jianfei Cai; Tat-Jen Cham

Reconstructing the shape of a 3D object from multi-view images under unknown, general illumination is a fundamental problem in computer vision. High quality reconstruction is usually challenging especially when fine detail is needed and the albedo of the object is non-uniform. This paper introduces vertex overall illumination vectors to model the illumination effect and presents a total variation (TV) based approach for recovering surface details using shading and multi-view stereo (MVS). Behind the approach are the two important observations: (1) the illumination over the surface of an object often appears to be piecewise smooth and (2) the recovery of surface orientation is not sufficient for reconstructing the surface, which was often overlooked previously. Thus we propose to use TV to regularize the overall illumination vectors and use visual hull to constrain partial vertices. The reconstruction is formulated as a constrained TV-minimization problem that simultaneously treats the shape and illumination vectors as unknowns. An augmented Lagrangian method is proposed to quickly solve the TV-minimization problem. As a result, our approach is robust, stable and is able to efficiently recover high-quality surface details even when starting with a coarse model obtained using MVS. These advantages are demonstrated by extensive experiments on the state-of-the-art MVS database, which includes challenging objects with varying albedo.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-02-09
Huei-Fang Yang; Kevin Lin; Chu-Song Chen

This paper presents a simple yet effective supervised deep hash approach that constructs binary hash codes from labeled data for large-scale image search. We assume that the semantic labels are governed by several latent attributes with each attribute on or off, and classification relies on these attributes. Based on this assumption, our approach, dubbed supervised semantics-preserving deep hashing (SSDH), constructs hash functions as a latent layer in a deep network and the binary codes are learned by minimizing an objective function defined over classification error and other desirable hash code properties. With this design, SSDH has a nice characteristic that classification and retrieval are unified in a single learning model. Moreover, SSDH performs joint learning of image representations, hash codes, and classification in a point-wise manner, and thus is scalable to large-scale datasets. SSDH is simple and can be realized by a slight enhancement of an existing deep architecture for classification; yet it is effective and outperforms other hashing approaches on several benchmarks and large datasets. Compared with state-of-the-art approaches, SSDH achieves higher retrieval accuracy, while the classification performance is not sacrificed.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-02-24
Erik Johannes Bekkers; Marco Loog; Bart M. ter Haar Romeny; Remco Duits

We propose a template matching method for the detection of 2D image objects that are characterized by orientation patterns. Our method is based on data representations via orientation scores, which are functions on the space of positions and orientations, and which are obtained via a wavelet-type transform. This new representation allows us to detect orientation patterns in an intuitive and direct way, namely via cross-correlations. Additionally, we propose a generalized linear regression framework for the construction of suitable templates using smoothing splines. Here, it is important to recognize a curved geometry on the position-orientation domain, which we identify with the Lie group SE(2): the roto-translation group. Templates are then optimized in a B-spline basis, and smoothness is defined with respect to the curved geometry. We achieve state-of-the-art results on three different applications: detection of the optic nerve head in the retina (99.83 percent success rate on 1,737 images), of the fovea in the retina (99.32 percent success rate on 1,616 images), and of the pupil in regular camera images (95.86 percent on 1,521 images). The high performance is due to inclusion of both intensity and orientation features with effective geometric priors in the template matching. Moreover, our method is fast due to a cross-correlation based matching approach.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-07
Chenxia Wu; Jiemi Zhang; Ozan Sener; Bart Selman; Silvio Savarese; Ashutosh Saxena

There is a large variation in the activities that humans perform in their everyday lives. We consider modeling these composite human activities, which comprise multiple basic level actions, in a completely unsupervised setting. Our model learns high-level co-occurrence and temporal relations between the actions. We consider the video as a sequence of short-term action clips, which contain human-words and object-words. An activity is about a set of action-topics and object-topics indicating which actions are present and which objects are being interacted with. We then propose a new probabilistic model relating the words and the topics. It allows us to model long-range action relations that commonly exist in the composite activities, which was challenging in previous work. We apply our model to unsupervised action segmentation and clustering, and to a novel application that detects forgotten actions, which we call action patching. For evaluation, we contribute a new challenging RGB-D activity video dataset recorded by the new Kinect v2, which contains several human daily activities as compositions of multiple actions interacting with different objects. Moreover, we develop a robotic system that watches and reminds people using our action patching algorithm. Our robotic setup can be easily deployed on any assistive robot.

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-06
Manolis C. Tsakiris; René Vidal

Subspace clustering is an important problem in machine learning with many applications in computer vision and pattern recognition. Prior work has studied this problem using algebraic, iterative, statistical, low-rank and sparse representation techniques. While these methods have been applied to both linear and affine subspaces, theoretical results have only been established in the case of linear subspaces. For example, algebraic subspace clustering (ASC) is guaranteed to provide the correct clustering when the data points are in general position and the union of subspaces is transversal. In this paper we study in a rigorous fashion the properties of ASC in the case of affine subspaces. Using notions from algebraic geometry, we prove that the homogenization trick, which embeds points in a union of affine subspaces into points in a union of linear subspaces, preserves the general position of the points and the transversality of the union of subspaces in the embedded space, thus establishing the correctness of ASC for affine subspaces.
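As a small illustration of the homogenization trick mentioned in this abstract, the sketch below appends a constant 1 to each point so that points on an affine line in the plane land on a plane through the origin in three dimensions. The helper names `homogenize` and `dot`, and the example line y = 2x + 1, are assumptions made for illustration and do not come from the paper.

```python
def homogenize(point):
    """Embed a point of R^D into R^(D+1) by appending a constant 1."""
    return tuple(point) + (1.0,)

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical data: points on the affine line y = 2x + 1, which does not
# pass through the origin and is therefore not a linear subspace of R^2.
points = [(x, 2.0 * x + 1.0) for x in (-1.0, 0.0, 0.5, 3.0)]

# After homogenization, every point (x, y, 1) satisfies 2x - y + w = 0,
# the equation of a plane through the origin (a linear subspace of R^3).
embedded = [homogenize(p) for p in points]
normal = (2.0, -1.0, 1.0)
residuals = [dot(normal, q) for q in embedded]
```

Each entry of `residuals` is the signed value of the linear equation at an embedded point, so all of them vanishing indicates that the embedded points lie in a common linear subspace.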

Updated: 2018-01-09
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-03-07
Jie Gui; Tongliang Liu; Zhenan Sun; Dacheng Tao; Tieniu Tan

Learning-based hashing algorithms are “hot topics” because they can greatly increase the scale at which existing methods operate. In this paper, we propose a new learning-based hashing method called “fast supervised discrete hashing” (FSDH) based on “supervised discrete hashing” (SDH). Regressing the training examples (or hash code) to the corresponding class labels is widely used in ordinary least squares regression. Rather than adopting this method, FSDH uses a very simple yet effective regression of the class labels of training examples to the corresponding hash code to accelerate the algorithm. To the best of our knowledge, this strategy has not previously been used for hashing. Traditional SDH decomposes the optimization into three sub-problems, with the most critical sub-problem - discrete optimization for binary hash codes - solved using iterative discrete cyclic coordinate descent (DCC), which is time-consuming. However, FSDH has a closed-form solution and only requires a single rather than iterative hash code-solving step, which is highly efficient. Furthermore, FSDH is usually faster than SDH for solving the projection matrix for least squares regression, making FSDH generally faster than SDH. For example, our results show that FSDH is about 12 times faster than SDH when the number of hashing bits is 128 on the CIFAR-10 database, and FSDH is about 151 times faster than FastHash when the number of hashing bits is 64 on the MNIST database. Our experimental results show that FSDH is not only fast, but also outperforms other comparative methods.

Updated: 2018-01-09