• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Zhanning Gao; Le Wang; Nebojsa Jojic; Zhenxing Niu; Nanning Zheng; Gang Hua

A new unified video analytics framework (ER3) is proposed for complex event retrieval, recognition and recounting. It is built on the proposed video imprint representation, which exploits temporal correlations among image features across video frames. The video imprint can be conveniently mapped back to both temporal and spatial locations in the video, allowing for both key frame identification and key area localization within each frame. In the proposed framework, a dedicated feature alignment module removes redundancy across frames to produce the tensor representation, i.e., the video imprint. Subsequently, the video imprint is fed separately into a reasoning network and a feature aggregation module, for event recognition/recounting and event retrieval, respectively. Thanks to an attention mechanism inspired by the memory networks used in language modeling, the proposed reasoning network is capable of simultaneous event category recognition and localization of the key pieces of evidence for event recounting. In addition, the latent structure in the reasoning network highlights the areas of the video imprint that can be directly used for event recounting. For the event retrieval task, the compact video representation aggregated from the video imprint yields better retrieval results than existing state-of-the-art methods.
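The memory-network-style attention the abstract alludes to amounts to softmax-weighted pooling over the cells of a representation. A minimal numpy sketch of that generic mechanism (shapes and names are illustrative, not the paper's architecture):

```python
import numpy as np

def attend(query, imprint):
    # attention weights: softmax of dot products between the query
    # and every cell of the (flattened) imprint tensor
    scores = imprint @ query
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    summary = weights @ imprint             # attention-weighted evidence
    return weights, summary

rng = np.random.default_rng(0)
imprint = rng.normal(size=(6, 4))           # 6 spatial cells, 4-dim features
query = rng.normal(size=4)
weights, summary = attend(query, imprint)
```

The weights themselves are what recounting would inspect: they say which cells of the imprint, and hence which frames and regions, carried the evidence.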

Updated: 2018-08-20
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-08-17
Emli-Mari Nel; Per Ola Kristensson; David MacKay

Ticker is a novel probabilistic stereophonic single-switch text entry method for visually-impaired users with motor disabilities who rely on single-switch scanning systems to communicate. Ticker models and tolerates the wide variety of noise that is inevitably introduced in practical use of single-switch systems. We evaluate its efficacy through performance modelling and three user studies.

Updated: 2018-08-18
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-08-17
Bo Xiong; Suyog Jain; Kristen Grauman

We propose an end-to-end learning framework for segmenting generic objects in both images and videos. Given a novel image or video, our approach produces a pixel-level mask for all “object-like” regions, even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented using a deep fully convolutional network. When applied to a video, our model further incorporates a motion stream, and the network learns to combine both appearance and motion and attempts to extract all prominent objects whether they are moving or not. Beyond the core model, a second contribution of our approach is how it leverages varying strengths of training annotations. Pixel-level annotations are quite difficult to obtain, yet crucial for training a deep network approach for segmentation. Thus we propose ways to exploit weakly labeled data for learning dense foreground segmentation. For images, we show the value in mixing object category examples with image-level labels together with relatively few images with boundary-level annotations. For video, we show how to bootstrap weakly annotated videos together with the network trained for image segmentation. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state-of-the-art for fully automatic segmentation of generic (unseen) objects. In addition, we demonstrate how our approach benefits image retrieval and image retargeting, both of which flourish when given our high-quality foreground maps. Code, models, and videos are at: http://vision.cs.utexas.edu/projects/pixelobjectness/

Updated: 2018-08-18
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Chi Li; M. Zeeshan Zia; Quoc-Huy Tran; Xiang Yu; Gregory D. Hager; Manmohan Chandraker

Recent data-driven approaches to scene interpretation predominantly pose inference as an end-to-end black-box mapping, commonly performed by a Convolutional Neural Network (CNN). However, decades of work on perceptual organization in both human and machine vision suggest that there are often intermediate representations that are intrinsic to an inference task, and which provide essential structure to improve generalization. In this work, we explore an approach for injecting prior domain structure into neural network training by supervising hidden layers of a CNN with intermediate concepts that normally are not observed in practice. We formulate a probabilistic framework which formalizes these notions and predicts improved generalization via this deep supervision method. One advantage of this approach is that we are able to train only from synthetic CAD renderings of cluttered scenes, where concept values can be extracted, but apply the results to real images. Our implementation achieves state-of-the-art performance in 2D/3D keypoint localization and image classification on real image benchmarks including KITTI, PASCAL VOC, PASCAL3D+, IKEA, and CIFAR100. We provide additional evidence that our approach outperforms alternative forms of supervision, such as multi-task networks.

Updated: 2018-08-13
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Wei-Sheng Lai; Jia-Bin Huang; Narendra Ahuja; Ming-Hsuan Yang

Convolutional neural networks have recently demonstrated high-quality reconstruction for single image super-resolution. However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating high-accuracy super-resolution results. In this paper, we propose the deep Laplacian Pyramid Super-Resolution Network for fast and accurate image super-resolution. The proposed network progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels. In contrast to existing methods that involve bicubic interpolation for pre-processing (which results in large feature maps), the proposed method directly extracts features from the low-resolution input space and thereby entails low computational loads. We train the proposed network with deep supervision using the robust Charbonnier loss function and achieve high-quality image reconstruction. Furthermore, we utilize recursive layers to share parameters across as well as within pyramid levels, and thus drastically reduce the number of parameters. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of run-time and image quality.
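The Charbonnier loss mentioned above is a standard differentiable relative of the L1 loss. A minimal sketch (the `eps` default here is illustrative, not the paper's training setting):

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    # rho(x) = sqrt(x^2 + eps^2): smooth near zero, ~|x| for large
    # residuals, which makes training robust to outliers
    diff = pred - target
    return np.sqrt(diff * diff + eps * eps).mean()

# residual-free inputs cost ~eps; a large residual costs ~its magnitude
zero_cost = charbonnier(np.zeros(3), np.zeros(3))
big_cost = charbonnier(np.array([10.0]), np.array([0.0]))
```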

Updated: 2018-08-13
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Chao Ma; Jia-Bin Huang; Xiaokang Yang; Ming-Hsuan Yang

Visual tracking is challenging as target objects often undergo significant appearance changes caused by deformation, abrupt motion, background clutter and occlusion. In this paper, we propose to exploit the rich hierarchical features of deep convolutional neural networks to improve the accuracy and robustness of visual tracking. Deep neural networks trained on object recognition datasets consist of multiple convolutional layers. These layers encode target appearance with different levels of abstraction. For example, the outputs of the last convolutional layers encode the semantic information of targets, and such representations are invariant to significant appearance variations. However, their spatial resolutions are too coarse to precisely localize the target. In contrast, features from earlier convolutional layers provide more precise localization but are less invariant to appearance changes. We interpret the hierarchical features of convolutional layers as a nonlinear counterpart of an image pyramid representation and explicitly exploit these multiple levels of abstraction to represent target objects. Specifically, we learn adaptive correlation filters on the outputs from each convolutional layer to encode the target appearance. We infer the maximum response of each layer to locate targets in a coarse-to-fine manner. To further handle scale estimation and to re-detect target objects after tracking failures caused by heavy occlusion or out-of-view movement, we conservatively learn another correlation filter that maintains a long-term memory of target appearance, as a discriminative classifier. We apply the classifier to two types of object proposals: (1) proposals with a small step size and tightly around the estimated location, for scale estimation; and (2) proposals with a large step size and across the whole image, for target re-detection. Extensive experimental results on large-scale benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art tracking methods.
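Learning a correlation filter has a well-known closed form in the Fourier domain (MOSSE-style ridge regression). A single-channel numpy sketch of that generic mechanism, not the authors' exact hierarchical ensemble:

```python
import numpy as np

def train_filter(x, g, lam=1e-2):
    # closed-form ridge regression in the Fourier domain:
    # H* = (G . conj(X)) / (X . conj(X) + lambda)
    X, G = np.fft.fft2(x), np.fft.fft2(g)
    return (G * np.conj(X)) / (X * np.conj(X) + lam)

def respond(Hconj, z):
    # correlation response for a new patch z
    return np.real(np.fft.ifft2(np.fft.fft2(z) * Hconj))

# toy patch and a desired Gaussian response peaked at the centre
n = 32
yy, xx = np.mgrid[0:n, 0:n]
g = np.exp(-((yy - n // 2) ** 2 + (xx - n // 2) ** 2) / (2 * 2.0 ** 2))
rng = np.random.default_rng(1)
x = rng.normal(size=(n, n))
Hc = train_filter(x, g)
r = respond(Hc, x)                 # response on the training patch itself
```

The response peak locates the target; the paper runs this per convolutional layer and fuses the peaks coarse-to-fine.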

Updated: 2018-08-13
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Chenglong Li; Liang Lin; Wangmeng Zuo; Jin Tang; Ming-Hsuan Yang

Existing visual tracking methods usually localize a target object with a bounding box, in which case the performance of the foreground object trackers or detectors is often affected by the inclusion of background clutter. To handle this problem, we learn a patch-based graph representation for visual tracking. The tracked object is modeled with a graph by taking a set of non-overlapping image patches as nodes, in which the weight of each node indicates how likely it is to belong to the foreground, and edges are weighted to indicate the appearance compatibility of two neighboring nodes. This graph is dynamically learned and applied in object tracking and model updating. The proposed algorithm performs three main steps in each frame. First, the graph is initialized by assigning binary weights to some image patches to indicate the object and background patches according to the predicted bounding box. Second, the graph is optimized to refine the patch weights. Third, the object feature representation is updated by imposing the weights of patches on the extracted image features. The object location is predicted by maximizing the classification score in the structured support vector machine. Extensive experiments show that the proposed tracking algorithm performs well against the state-of-the-art methods.

Updated: 2018-08-13
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Christian Bailer; Bertram Taetz; Didier Stricker

Modern large displacement optical flow algorithms usually use an initialization by either sparse descriptor matching techniques or dense approximate nearest neighbor fields. While the latter have the advantage of being dense, they have the major disadvantage of being very outlier-prone, as they are not designed to find the optical flow but the visually most similar correspondence. In this article we present a dense correspondence field approach that is much less outlier-prone and thus much better suited for optical flow estimation than approximate nearest neighbor fields. Our approach does not require explicit regularization, smoothing (like median filtering) or a new data term. Instead we solely rely on patch matching techniques and a novel multi-scale matching strategy. We also present enhancements for outlier filtering. We show that our approach is better suited for large displacement optical flow estimation than modern descriptor matching techniques. We do so by initializing EpicFlow with our approach instead of the originally used state-of-the-art descriptor matching technique. We significantly outperform the original EpicFlow on MPI-Sintel, KITTI 2012, KITTI 2015 and Middlebury. In this extended version of our earlier conference publication we further improve our approach in matching accuracy as well as runtime, and present more experiments and insights.
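Patch matching at its most basic is an exhaustive sum-of-squared-differences (SSD) search; the paper's contribution is the multi-scale strategy and outlier handling built on top of such primitives. A brute-force baseline sketch, purely illustrative:

```python
import numpy as np

def best_match(patch, image, k):
    # exhaustive SSD search for a k x k patch inside a 2D image;
    # returns the top-left corner of the best-matching window
    H, W = image.shape
    best, best_pos = np.inf, None
    for y in range(H - k + 1):
        for x in range(W - k + 1):
            d = np.sum((image[y:y + k, x:x + k] - patch) ** 2)
            if d < best:
                best, best_pos = d, (y, x)
    return best_pos

rng = np.random.default_rng(0)
img = rng.normal(size=(20, 20))
p = img[5:10, 7:12].copy()          # a patch cut from a known location
```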

Updated: 2018-08-13
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Ignacio Rocco; Relja Arandjelovic; Josef Sivic

We address the problem of determining correspondences between two images in agreement with a geometric model, such as an affine, homography or thin-plate spline transformation, and estimating its parameters. The contributions of this work are three-fold. First, we propose a convolutional neural network architecture for geometric matching. The architecture is based on three main components that mimic the standard steps of feature extraction, matching and simultaneous inlier detection and model parameter estimation, while being trainable end-to-end. Second, we demonstrate that the network parameters can be trained from synthetically generated imagery without the need for manual annotation, and that our matching layer significantly increases generalization to previously unseen images. Finally, we show that the same model can perform both instance-level and category-level matching, giving state-of-the-art results on the challenging PF, TSS and Caltech-101 datasets.
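A matching layer of this kind typically computes normalised correlations between all pairs of feature positions in the two maps, yielding a 4D correlation volume. A numpy sketch of that generic operation (not the trained network itself):

```python
import numpy as np

def correlation_map(fA, fB):
    # normalised dot products between every pair of positions in two
    # (H, W, D) feature maps -> a (H, W, H, W) correlation volume
    a = fA / (np.linalg.norm(fA, axis=-1, keepdims=True) + 1e-8)
    b = fB / (np.linalg.norm(fB, axis=-1, keepdims=True) + 1e-8)
    return np.einsum('ijd,kld->ijkl', a, b)

rng = np.random.default_rng(0)
fA = rng.normal(size=(4, 4, 8))
c = correlation_map(fA, fA)        # self-correlation: diagonal should be ~1
```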

Updated: 2018-08-13
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-08-09
Qi Zhang; Chunping Zhang; Jinbo Ling; Qing Wang; Jingyi Yu

Light field cameras can capture both spatial and angular information of light rays, enabling 3D reconstruction from a single exposure. The geometry of 3D reconstruction is significantly affected by the intrinsic parameters of a light field camera. In this paper, we propose a multi-projection-center (MPC) model with 6 intrinsic parameters to characterize light field cameras, based on the traditional two-parallel-plane (TPP) representation. The MPC model can generally parameterize the light field in different imaging formations, including conventional and focused light field cameras. From the constraints of 4D rays and 3D geometry, a 3D projective transformation is deduced to describe the relationship between geometric structure and the MPC coordinates. Based on the MPC model and projective transformation, we propose a calibration algorithm to verify our light field camera model. Our calibration method combines a closed-form solution with a non-linear optimization that minimizes re-projection errors. Experimental results on both simulated and real scene data verify the performance of our algorithm.

Updated: 2018-08-10
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-29
Enrique Sánchez-Lozano; Georgios Tzimiropoulos; Brais Martinez; Fernando De la Torre; Michel Valstar

Linear regression is a fundamental building block in many face detection and tracking algorithms, typically used to predict shape displacements from image features through a linear mapping. This paper presents a Functional Regression solution to the least squares problem, which we coin Continuous Regression, resulting in the first real-time incremental face tracker. Contrary to prior work in Functional Regression, in which B-splines or Fourier series were used, we propose to approximate the input space by its first-order Taylor expansion, yielding a closed-form solution for the continuous domain of displacements. We then extend the continuous least squares problem to correlated variables, and demonstrate the generalisation of our approach. We incorporate Continuous Regression into the cascaded regression framework, and show its computational benefits for both training and testing. We then present a fast approach for incremental learning within Cascaded Continuous Regression, coined iCCR, and show that its complexity allows real-time face tracking, being 20 times faster than the state of the art. To the best of our knowledge, this is the first incremental face tracker that is shown to operate in real-time. We show that iCCR achieves state-of-the-art performance on the 300-VW dataset, the most recent, large-scale benchmark for face tracking.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-30
Umar Asif; Mohammed Bennamoun; Ferdous A. Sohel

While deep convolutional neural networks have shown remarkable success in image classification, the problems of inter-class similarities, intra-class variances, the effective combination of multi-modal data, and the spatial variability in images of objects remain major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multi-modal RGB-D imagery. This is achieved through three postulates: 1) spatial invariance: achieved by combining a spatial transformer network with a deep convolutional neural network to learn features which are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability: achieved by introducing Fisher encoding within the CNN architecture to learn features which have small inter-class similarities and large intra-class compactness; and 3) multi-modal hierarchical fusion: achieved through the regularization of semantic segmentation in a multi-modal CNN architecture, where class probabilities are estimated at different hierarchical levels (i.e., image- and pixel-levels) and fused into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and live video streams (acquired from Kinect), show that our framework produces superior object and scene classification results compared to the state-of-the-art methods.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-04
Fanhua Shang; James Cheng; Yuanyuan Liu; Zhi-Quan Luo; Zhouchen Lin

The heavy-tailed distributions of corrupted outliers and singular values of all channels in low-level vision have proven to be effective priors for many applications, such as background modeling, photometric stereo and image alignment, and they can be well modeled by a hyper-Laplacian. However, the use of such distributions generally leads to challenging non-convex, non-smooth and non-Lipschitz problems, and makes existing algorithms very slow for large-scale applications. Building on the analytic solutions to $\ell_p$-norm minimization for two specific values of $p$, i.e., $p=1/2$ and $p=2/3$, we propose two novel bilinear factor matrix norm minimization models for robust principal component analysis. We first define the double nuclear norm and Frobenius/nuclear hybrid norm penalties, and then prove that they are in essence the Schatten-$1/2$ and Schatten-$2/3$ quasi-norms, respectively, which lead to much more tractable and scalable Lipschitz optimization problems. Our experimental analysis shows that both our methods yield more accurate solutions than original Schatten quasi-norm minimization, even when the number of observations is very limited. Finally, we apply our penalties to various low-level vision problems, e.g., text removal, moving object detection, image alignment and inpainting, and show that our methods usually outperform the state-of-the-art methods.
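The bilinear factor matrix norms above generalize a classical identity for the nuclear norm (the Schatten-1 case): $\|X\|_* = \min_{X=UV^\top} \tfrac{1}{2}(\|U\|_F^2 + \|V\|_F^2)$, with the minimum attained by splitting the singular values between the factors. A numpy check of that identity (the paper's Schatten-$1/2$ and $2/3$ penalties have more involved closed forms, not sketched here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
U_, s, Vt = np.linalg.svd(X, full_matrices=False)
nuclear = s.sum()                          # ||X||_* = sum of singular values

# optimal bilinear factors: U = U_ diag(sqrt(s)), V = Vt^T diag(sqrt(s))
U = U_ * np.sqrt(s)
V = Vt.T * np.sqrt(s)
bound = 0.5 * (np.linalg.norm(U, 'fro') ** 2 + np.linalg.norm(V, 'fro') ** 2)
```

Minimizing over small factors $U, V$ instead of the full matrix is what makes such formulations scale.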

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-18
Arash Akbarinia; C. Alejandro Parraga

The problem of removing illuminant variations to preserve the colours of objects (colour constancy) has already been solved by the human brain using mechanisms that rely largely on centre-surround computations of local contrast. In this paper we adopt some of these biological solutions, described by long-known physiological findings, into a simple, fully automatic, functional model (termed Adaptive Surround Modulation, or ASM). In ASM, the size of a visual neuron's receptive field (RF) as well as the relationship with its surround varies according to the local contrast within the stimulus, which in turn determines the nature of the centre-surround normalisation of cortical neurons higher up in the processing chain. We modelled colour constancy by means of two overlapping asymmetric Gaussian kernels whose sizes are adapted based on the contrast of the surround pixels, resembling the change of RF size. We simulated the contrast-dependent surround modulation by weighting the contribution of each Gaussian according to the centre-surround contrast. Finally, we obtain an estimate of the illuminant from the outputs of the most activated RFs. Our results on three single-illuminant and one multi-illuminant benchmark datasets show that ASM is highly competitive against the state-of-the-art, and it even outperforms learning-based algorithms in one case. Moreover, the robustness of our model is underscored by the fact that our results were obtained using the same parameters for all datasets, that is, mimicking how the human visual system operates. These results suggest that dynamic adaptation mechanisms contribute to achieving higher accuracy in computational colour constancy.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-11
Qianggong Zhang; Tat-Jun Chin

Multiple-view triangulation by $\ell_\infty$ minimisation has become established in computer vision. State-of-the-art $\ell_\infty$ triangulation algorithms exploit the quasiconvexity of the cost function to derive iterative update rules that deliver the global minimum. Such algorithms, however, can be computationally costly for large problem instances that contain many image measurements, e.g., from web-based photo sharing sites or long-term video recordings. In this paper, we prove that $\ell_\infty$ triangulation admits a coreset approximation scheme, which seeks small representative subsets of the input data called coresets. A coreset possesses the special property that the error of the $\ell_\infty$ solution on the coreset is within known bounds from the global minimum. We establish the necessary mathematical underpinnings of the coreset algorithm, specifically by establishing the stopping criterion of the algorithm and proving that the resulting coreset gives the desired approximation accuracy. On large-scale triangulation problems, our method provides theoretically sound approximate solutions. Iterated until convergence, our coreset algorithm is also guaranteed to reach the true optimum. On practical datasets, we show that our technique can in fact attain the global minimiser much faster than current methods.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-29
Xingyu Zeng; Wanli Ouyang; Junjie Yan; Hongsheng Li; Tong Xiao; Kun Wang; Yu Liu; Yucong Zhou; Bin Yang; Zhe Wang; Hui Zhou; Xiaogang Wang

The visual cues from multiple support regions of different sizes and resolutions are complementary in classifying a candidate box in object detection. Effective integration of local and contextual visual cues from these regions has become a fundamental problem in object detection. In this paper, we propose a gated bi-directional CNN (GBD-Net) to pass messages among features from different support regions during both feature learning and feature extraction. Such message passing can be implemented through convolution between neighboring support regions in two directions and can be conducted in various layers. Therefore, local and contextual visual patterns can validate the existence of each other by learning their nonlinear relationships, and their close interactions are modeled in a more complex way. It is also shown that message passing is not always helpful, but dependent on individual samples. Gated functions are therefore needed to control message transmission, whose on-or-off states are controlled by extra visual evidence from the input sample. The effectiveness of GBD-Net is shown through experiments on three object detection datasets: ImageNet, Pascal VOC2007 and Microsoft COCO. Beyond GBD-Net, this paper also details our approach in winning the ImageNet object detection challenge of 2016, with source code provided at https://github.com/craftGBD/craftGBD. This winning system includes the modified GBD-Net, a new pretraining scheme and better region proposal designs. We also show the effectiveness of different network structures and existing techniques for object detection, such as multi-scale testing, left-right flip, bounding box voting, NMS, and context.
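The gating idea can be caricatured in a few lines: a contextual message is added to a local feature only when a gate, driven by the input's own evidence, is open. A toy numpy sketch with hypothetical names, not the actual GBD-Net layer:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gated_pass(h_local, msg_context, gate_logit):
    # the contextual message flows only when the gate (a sigmoid of
    # evidence from the input sample) is open; a closed gate blocks it
    return h_local + sigmoid(gate_logit) * np.maximum(msg_context, 0.0)

h = np.array([1.0, -0.5])
msg = np.array([2.0, 3.0])
open_out = gated_pass(h, msg, 50.0)      # gate ~1: message passes
closed_out = gated_pass(h, msg, -50.0)   # gate ~0: message blocked
```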

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-07
Christian Knoll; Dhagash Mehta; Tianran Chen; Franz Pernkopf

Belief propagation (BP) is an iterative method to perform approximate inference on arbitrary graphical models. Whether BP converges, and whether the solution is a unique fixed point, depends on both the structure and the parametrization of the model. To understand this dependence it is interesting to find all fixed points. In this work, we formulate a set of polynomial equations, the solutions of which correspond to BP fixed points. To solve such a nonlinear system we present the numerical polynomial-homotopy-continuation (NPHC) method. Experiments on binary Ising models and on error-correcting codes show how our method is capable of obtaining all BP fixed points. On Ising models with fixed parameters we show how the structure influences both the number of fixed points and the convergence properties. We further assess the accuracy of the marginals and weighted combinations thereof. Weighting marginals with their respective partition function increases the accuracy in all experiments. Contrary to the conjecture that uniqueness of BP fixed points implies convergence, we find graphs for which BP fails to converge, even though a unique fixed point exists. Moreover, we show that this fixed point gives a good approximation, and the NPHC method is able to obtain this fixed point.
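For readers unfamiliar with BP fixed points: iterating the sum-product message updates until the messages stop changing yields a fixed point, and on tree-structured models that fixed point gives the exact marginals. A tiny numpy sketch on a 3-node binary chain (plain iteration, not the NPHC solver that finds all fixed points):

```python
import numpy as np
from itertools import product

# 3-node binary chain MRF: p(x) prop. to prod_i psi_i(x_i) * prod_edges psi(x_i, x_j)
psi = [np.array([1.2, 0.8]), np.array([1.0, 1.0]), np.array([0.7, 1.3])]
pair = np.array([[1.5, 0.5], [0.5, 1.5]])        # attractive pairwise potential
edges = [(0, 1), (1, 2)]
directed = edges + [(j, i) for i, j in edges]

m = {e: np.ones(2) for e in directed}            # uniform initial messages
for _ in range(200):                             # parallel sum-product updates
    new = {}
    for i, j in directed:
        inc = np.ones(2)                         # messages into i, except from j
        for k, l in directed:
            if l == i and k != j:
                inc = inc * m[(k, l)]
        msg = pair.T @ (psi[i] * inc)            # marginalise x_i out
        new[(i, j)] = msg / msg.sum()            # normalise for stability
    m = new

def belief(i):
    b = psi[i].copy()
    for k, l in directed:
        if l == i:
            b = b * m[(k, l)]
    return b / b.sum()

# brute-force marginals for comparison (BP is exact on trees)
joint = np.zeros((2, 2, 2))
for x in product(range(2), repeat=3):
    joint[x] = (psi[0][x[0]] * psi[1][x[1]] * psi[2][x[2]]
                * pair[x[0], x[1]] * pair[x[1], x[2]])
joint /= joint.sum()
```

On loopy graphs the same updates may cycle or converge to one of several fixed points, which is exactly the behaviour the paper characterizes.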

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-15
Antonio Agudo; Francesc Moreno-Noguer

This paper addresses the problem of simultaneously recovering 3D shape, pose and the elastic model of a deformable object from only 2D point tracks in a monocular video. This is a severely under-constrained problem that has typically been addressed by enforcing the shape or the point trajectories to lie on low-rank dimensional spaces. We show that formulating the problem in terms of a low-rank force space that induces the deformation, and introducing the elastic model as an additional unknown, allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object’s behavior. In order to simultaneously estimate force, pose, and the elastic model of the object, we use an expectation maximization strategy, where each of these parameters is successively learned by partial M-steps. Once the elastic model is learned, it can be transferred to similar objects to encode their 3D deformation. Moreover, our approach can robustly deal with missing data, and encode both rigid and non-rigid points under the same formalism. We thoroughly validate the approach on Mocap and real sequences, showing more accurate 3D reconstructions than the state-of-the-art, and additionally providing an estimate of the full elastic model with no a priori information.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-01
Yeqing Li; Chen Chen; Fei Yang; Junzhou Huang

Similarity measure is an essential component in image registration. In this article, we propose a novel similarity measure for registration of two or more images. The proposed method is motivated by the fact that optimally registered images can be sparsified hierarchically in the gradient domain and frequency domain with the separation of sparse errors. One of the key advantages of the proposed similarity measure is its robustness in dealing with severe intensity distortions, which widely exist on medical images, remotely sensed images and natural photos due to differences of acquisition modalities or illumination conditions. Two efficient algorithms are proposed to solve the batch image registration and pair registration problems in a unified framework. We have validated our method on extensive and challenging data sets. The experimental results demonstrate the robustness, accuracy and efficiency of our method over nine traditional and state-of-the-art algorithms on synthetic images and a wide range of real-world applications.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-04
Hyung Jin Chang; Yiannis Demiris

In this paper, we present a novel framework for unsupervised kinematic structure learning of complex articulated objects from a single-view 2D image sequence. In contrast to prior motion-based methods, which estimate relatively simple articulations, our method can generate arbitrarily complex kinematic structures with skeletal topology via a successive iterative merging strategy. The iterative merge process is guided by a density weighted skeleton map which is generated from a novel object boundary generation method from sparse 2D feature points. Our main contributions can be summarised as follows: (i) An unsupervised complex articulated kinematic structure estimation method that combines motion segments with skeleton information. (ii) An iterative fine-to-coarse merging strategy for adaptive motion segmentation and structural topology embedding. (iii) A skeleton estimation method based on a novel silhouette boundary generation from sparse feature points using an adaptive model selection method. (iv) A new highly articulated object dataset with ground truth annotation. We have verified the effectiveness of our proposed method in terms of computational time and estimation accuracy through rigorous experiments with multiple datasets. Our experiments show that the proposed method outperforms state-of-the-art methods both quantitatively and qualitatively.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-30
Shaojing Fan; Tian-Tsong Ng; Bryan Lee Koenig; Jonathan Samuel Herberg; Ming Jiang; Zhiqi Shen; Qi Zhao

Visual realism is defined as the extent to which an image appears to people as a photo rather than computer generated. Assessing visual realism is important in applications like computer graphics rendering and photo retouching. However, current realism evaluation approaches use either labor-intensive human judgments or automated algorithms largely dependent on comparing renderings to reference images. We develop a reference-free computational framework for visual realism prediction to overcome these constraints. First, we construct a benchmark dataset of 2,520 images with comprehensive human annotated attributes. From statistical modeling on this data, we identify image attributes most relevant for visual realism. We propose both empirically-based (guided by our statistical modeling of human data) and deep convolutional neural network models to predict visual realism of images. Our framework has the following advantages: (1) it creates an interpretable and concise empirical model that characterizes human perception of visual realism; (2) it links computational features to latent factors of human image perception.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-05
Jean-Baptiste Alayrac; Piotr Bojanowski; Nishant Agrawal; Josef Sivic; Ivan Laptev; Simon Lacoste-Julien

Automatic assistants could guide a person or a robot in performing new tasks, such as changing a car tire or repotting a plant. Creating such assistants, however, is non-trivial and requires understanding of visual and verbal content of a video. Towards this goal, we here address the problem of automatically learning the main steps of a task from a set of narrated instruction videos. We develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method sequentially clusters textual and visual representations of a task, where the two clustering problems are linked by joint constraints to obtain a single coherent sequence of steps in both modalities. To evaluate our method, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains videos for five different tasks with complex interactions between people and objects, captured in a variety of indoor and outdoor settings. We experimentally demonstrate that the proposed method can automatically discover, learn and localize the main steps of a task in input videos.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-29
Antoine Lejeune; Jacques G. Verly; Marc Van Droogenbroeck

We develop a powerful probabilistic framework for the local characterization of surfaces and edges in range images. We use the geometrical nature of the data to derive an analytic expression for the joint probability density function (pdf) for the random variables used to model the ranges of a set of pixels in a local neighborhood of an image. We decompose this joint pdf by considering independently the cases where two real world points corresponding to two neighboring pixels are locally on the same real world surface or not. In particular, we show that this joint pdf is linked to the Voigt pdf and not to the Gaussian pdf as it is assumed in some applications. We apply our framework to edge detection and develop a locally adaptive algorithm that is based on a probabilistic decision rule. We show in an objective evaluation that this new edge detector performs better than prior art edge detectors. This proves the benefits of the probabilistic characterization of the local neighborhood as a tool to improve applications that involve range images.
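The claimed link to the Voigt pdf can be sanity-checked numerically, since the Voigt density is by definition the convolution of a Gaussian with a Lorentzian. A small illustrative sketch (the grid and parameter values are ours, not the paper's range-image model):

```python
import numpy as np

# The Voigt pdf is the convolution of a Gaussian pdf with a Lorentzian (Cauchy)
# pdf. Build that convolution on a grid and verify it behaves like a valid
# density with heavier-than-Gaussian tails.
dx = 0.02
x = np.arange(-60.0, 60.0, dx)
sigma, gamma = 1.0, 0.5
gauss = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
lorentz = gamma / (np.pi * (x**2 + gamma**2))
voigt = np.convolve(gauss, lorentz, mode="same") * dx
print(abs(voigt.sum() * dx - 1.0) < 2e-2)                       # integrates to ~1
print(voigt[x.searchsorted(4.0)] > gauss[x.searchsorted(4.0)])  # heavier tails
```

The heavier tails are what make a Gaussian noise assumption misleading for edge decisions on range data.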

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-22
Enliang Zheng; Dinghuang Ji; Enrique Dunn; Jan-Michael Frahm

We target the problem of sparse 3D reconstruction of dynamic objects observed by multiple unsynchronized video cameras with unknown temporal overlap. To this end, we develop a framework to recover the unknown structure without sequencing information across video sequences. Our proposed compressed sensing framework poses the estimation of 3D structure as a dictionary learning problem, where the dictionary is defined as an aggregation of the temporally varying 3D structures. Given the smooth motion of dynamic objects, we observe that any element in the dictionary can be well approximated by a sparse linear combination of other elements in the same dictionary (i.e., self-expression). Our formulation optimizes a biconvex cost function that leverages a compressed sensing formulation and enforces both structural dependency coherence across video streams and motion smoothness across estimates from common video sources. We further analyze the reconstructability of our approach under different capture scenarios, and compare and relate it to existing methods. Experimental results on large amounts of synthetic data as well as real imagery demonstrate the effectiveness of our approach.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-08
Ahmad Mheich; Mahmoud Hassan; Mohamad Khalil; Vincent Gripon; Olivier Dufor; Fabrice Wendling

Quantifying the similarity between two networks is critical in many applications. A number of algorithms have been proposed to compute graph similarity, mainly based on the properties of nodes and edges. Interestingly, most of these algorithms ignore the physical location of the nodes, which is a key factor in the context of brain networks involving spatially defined functional areas. In this paper, we present a novel algorithm called “SimiNet” for measuring similarity between two graphs whose nodes are defined a priori within a 3D coordinate system. SimiNet provides a quantified index (ranging from 0 to 1) that accounts for node, edge and spatial features. Complex graphs were simulated to evaluate the performance of SimiNet against eight state-of-the-art methods. Results show that SimiNet is able to detect weak spatial variations in the compared graphs in addition to computing similarity using both nodes and edges. SimiNet was also applied to real brain networks obtained during a visual recognition task. The algorithm shows high performance in detecting spatial variations of brain networks obtained during a naming task over two categories of visual stimuli: animals and tools. A perspective of this work is a better understanding of object categorization in the human brain.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-11
Sergey Tulyakov; László A. Jeni; Jeffrey F. Cohn; Nicu Sebe

Most approaches to face alignment treat the face as a 2D object, which fails to represent depth variation and is vulnerable to loss of shape consistency when the face rotates along a 3D axis. Because faces commonly rotate three-dimensionally, 2D approaches are vulnerable to significant error. 3D morphable models, employed as a second step in 2D+3D approaches, are robust to face rotation but computationally too expensive for many applications, and their ability to maintain viewpoint consistency has not been established. We present an alternative approach that estimates 3D face landmarks in a single face image. The method uses a regression forest-based algorithm that adds a third dimension to the common cascade pipeline. 3D face landmarks are estimated directly, which avoids fitting a 3D morphable model. The proposed method achieves viewpoint consistency in a computationally efficient manner that is robust to 3D face rotation. To train and test our approach, we introduce the Multi-PIE Viewpoint Consistent database. In empirical tests, the proposed method achieved simple yet effective head pose estimation and viewpoint consistency on multiple measures relative to alternative approaches.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-22

Helmholtz Stereopsis is a 3D reconstruction method uniquely independent of surface reflectance. Yet, its sub-optimal maximum likelihood formulation with drift-prone normal integration limits performance. Via three contributions, this paper presents a complete novel pipeline for Helmholtz Stereopsis. First, we propose a Bayesian formulation replacing the maximum likelihood problem with a maximum a posteriori one. Second, we propose a tailored prior that enforces consistency between depth and normal estimates via a novel metric related to optimal surface integrability. Third, explicit surface integration is eliminated by taking advantage of the accuracy of the prior and the high resolution of the coarse-to-fine approach. The pipeline is validated quantitatively and qualitatively against alternative formulations, reaching sub-millimetre accuracy and coping with complex geometry and reflectance.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-29

Background subtraction is a fundamental video analysis technique that builds a background model against which foreground pixels can be distinguished. We present a new method in which the image sequence is assumed to be made up of the sum of a low-rank background matrix and a dynamic tree-structured sparse matrix. The decomposition task is then solved using our approximated Robust Principal Component Analysis (ARPCA) method, an extension of RPCA that can handle camera motion and noise. Our model dynamically estimates the support of the foreground regions via a superpixel generation step, so that spatial coherence can be imposed on these regions. Unlike conventional smoothness constraints such as MRFs, our method is able to obtain crisp and meaningful foreground regions and, in general, handles large dynamic background motion better. To reduce the dimensionality and the curse of scale that is persistent in RPCA-based methods, we model the background via the Column Subset Selection Problem, which reduces the order of complexity and hence decreases computation time. Comprehensive evaluation on four benchmark datasets demonstrates the effectiveness of our method in outperforming state-of-the-art alternatives.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-09-07
Junlin Hu; Jiwen Lu; Yap-Peng Tan

This paper presents a sharable and individual multi-view metric learning (MvML) approach for visual recognition. Unlike conventional metric learning methods, which learn a distance metric on either a single type of feature representation or a concatenated representation of multiple types of features, the proposed MvML jointly learns an optimal combination of multiple distance metrics on multi-view representations: it not only learns an individual distance metric for each view to retain its specific properties, but also learns a shared representation for the different views in a unified latent subspace to preserve their common properties. The objective function of the MvML is formulated in the large margin learning framework via pairwise constraints, under which the distance of each similar pair is smaller than that of each dissimilar pair by a margin. Moreover, to exploit the nonlinear structure of data points, we extend MvML to a sharable and individual multi-view deep metric learning (MvDML) method by utilizing a neural network architecture to seek multiple nonlinear transformations. Experimental results on face verification, kinship verification, and person re-identification show the effectiveness of the proposed sharable and individual multi-view metric learning methods.
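The pairwise large-margin constraint above — every similar-pair distance smaller than every dissimilar-pair distance by a margin — can be written as a hinge penalty. A toy sketch (our simplification, not the full MvML objective; function and variable names are hypothetical):

```python
import numpy as np

def margin_violation(d_sim, d_dis, margin=1.0):
    """Hinge penalty over all (similar, dissimilar) pair combinations:
    each dissimilar distance should exceed each similar one by `margin`.
    The penalty is zero iff every margin constraint is satisfied."""
    gap = d_dis[None, :] - d_sim[:, None]       # all pairwise gaps
    return np.maximum(0.0, margin - gap).sum()

d_sim = np.array([0.2, 0.4])   # distances of similar pairs
d_dis = np.array([2.0, 0.9])   # distances of dissimilar pairs
print(margin_violation(d_sim, d_dis))  # the 0.9 dissimilar distance violates both constraints
```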

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Joyce Jiyoung Whang; Yangyang Hou; David Gleich; Inderjit S. Dhillon

Traditional clustering algorithms, such as K-Means, output a clustering that is disjoint and exhaustive, i.e., every single data point is assigned to exactly one cluster. However, in many real-world datasets, clusters can overlap and there are often outliers that do not belong to any cluster. While this is a well-recognized problem, most existing algorithms address either overlap or outlier detection and do not tackle the problem in a unified way. In this paper, we propose an intuitive objective function, which we call the NEO-K-Means (Non-Exhaustive, Overlapping K-Means) objective, that captures the issues of overlap and non-exhaustiveness in a unified manner. Our objective function can be viewed as a reformulation of the traditional K-Means objective, with easy-to-understand parameters that capture the degrees of overlap and non-exhaustiveness. By considering an extension to weighted kernel K-Means, we show that we can also apply our NEO-K-Means idea to overlapping community detection, which is an important task in network analysis. To optimize the NEO-K-Means objective, we develop not only fast iterative algorithms but also more sophisticated algorithms using low-rank semidefinite programming techniques. Our experimental results show that the new objective and algorithms are effective in finding ground-truth clusterings that have varied overlap and non-exhaustiveness.
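The two parameters can be seen in a toy assignment round: make (1+alpha)n assignments in total while leaving up to beta*n points unassigned. A simplified sketch of one such round (ours, not the authors' optimized algorithms):

```python
import numpy as np

def neo_assign(X, centers, alpha, beta):
    """One NEO-K-Means-style assignment round (sketch): (1+alpha)n
    assignments in total, with up to beta*n points left unassigned."""
    n = len(X)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # n x k squared distances
    assign = np.zeros(d.shape, dtype=bool)
    # Stage 1: the n - beta*n points closest to any centroid join their nearest cluster.
    keep = np.argsort(d.min(axis=1))[: n - int(beta * n)]
    assign[keep, d[keep].argmin(axis=1)] = True
    # Stage 2: spend the remaining alpha*n + beta*n assignments on the smallest
    # unused point-centroid distances; this is what creates overlapping clusters.
    extra = int(alpha * n) + int(beta * n)
    flat = np.argsort(np.where(assign, np.inf, d), axis=None)[:extra]
    assign[np.unravel_index(flat, d.shape)] = True
    return assign

X = np.array([[0.0, 0], [0.1, 0], [5, 5], [5.1, 5], [100, 100]])  # last point: outlier
C = np.array([[0.0, 0], [5.0, 5]])
A = neo_assign(X, C, alpha=0.2, beta=0.2)
print(A.sum(), A[4].any())  # six assignments in total; the outlier stays unassigned
```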

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Jian-Fang Hu; Wei-Shi Zheng; Lianyang Ma; Gang Wang; Jian-Huang Lai; Jianguo Zhang

We propose a novel approach for predicting on-going actions with the assistance of a low-cost depth camera. Our approach introduces a soft regression-based early prediction framework. In this framework, we estimate soft labels for the subsequences at different progress levels, jointly learned with an action predictor. Our soft regression formulation 1) overcomes a usual assumption in existing early action prediction systems that the progress level of the on-going sequence is given at testing time; and 2) presents a theoretical framework to better resolve the ambiguity and uncertainty of subsequences at early stages of the performed action. The proposed soft regression framework is further enhanced to take into account the relationships among subsequences and the discrepancy of soft labels over different classes, resulting in a Multiple Soft labels Recurrent Neural Network (MSRNN). For real-time performance, we also introduce the "local accumulative frame feature" (LAFF), which can be computed efficiently by constructing an integral feature map. Our experiments on three RGB-D benchmark datasets and an unconstrained RGB action set demonstrate that the proposed regression-based early action prediction model outperforms existing models, and that early action prediction on RGB-D sequences is more accurate than on the RGB channel alone.
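The integral feature map behind LAFF is ordinary prefix summation over time: after one O(T) pass, the accumulative feature of any subsequence is available in O(1). A sketch with hypothetical names, assuming a simple mean-pooling normalization (the paper's exact normalization may differ):

```python
import numpy as np

def integral_map(frame_feats):
    """Prefix sums over the time axis; row t holds the sum of frames [0, t)."""
    T, D = frame_feats.shape
    return np.vstack([np.zeros((1, D)), np.cumsum(frame_feats, axis=0)])

def laff(S, a, b):
    """Accumulative (here: mean) feature over frames a..b-1 via two lookups."""
    return (S[b] - S[a]) / (b - a)

feats = np.array([[1.0], [2.0], [3.0], [4.0]])   # 4 frames, 1-D feature
S = integral_map(feats)
print(laff(S, 1, 3))   # mean of frames 1 and 2
```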

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date :
Adam Polyak; Lior Wolf; Yaniv Taigman

We study the problem of mapping an input image to a tied pair consisting of a vector of parameters and an image that is created using a graphical engine from the vector of parameters. The mapping's objective is to have the output image as similar as possible to the input image. During training, no supervision is given in the form of matching inputs and outputs.

Updated: 2018-08-06
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-08-01
Shuyang Wang; Zhengming Ding; Yun Fu

Kinship verification is an important technique in real-world applications, e.g., album organization and forensic analysis. However, it is very difficult to verify a family pair across a generation gap, e.g., father and son, since there exist both an age gap and identity variation. Addressing these challenges well is essential to achieving promising kinship verification performance. To this end, we propose a towards-young cross-generation model for effective kinship verification that mitigates both age and identity divergences. Specifically, we explore a conditional generative model that brings in an intermediate domain as a bridge to link each pair. We can thus extract more effective features through deep architectures with a newly-designed Sparse Discriminative Metric Loss (SDM-Loss), which exploits both positive and negative information. Experimental results on kinship benchmarks demonstrate the superiority of our proposed model in comparison with state-of-the-art kinship verification methods.

Updated: 2018-08-02
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-31

Machine learning models are susceptible to adversarial perturbations: small changes to the input that can cause large changes in the output. Additionally, there exist input-agnostic perturbations, called universal adversarial perturbations, which can change the inference of a target model on most data samples. However, existing methods to craft universal perturbations (i) are task specific, (ii) require samples from the training data distribution, and (iii) perform complex optimizations. Additionally, the fooling ability of the crafted perturbations is proportional to the amount of available training data. In this paper, we present a novel, generalizable and data-free approach for crafting universal adversarial perturbations. Independent of the underlying task, our objective achieves fooling by corrupting the extracted features at multiple layers. The proposed objective is therefore generalizable to craft image-agnostic perturbations across multiple vision tasks such as object recognition, semantic segmentation, and depth estimation. In the practical black-box attack setting, we show that our objective outperforms data-dependent objectives. Further, by exploiting simple priors related to the data distribution, our objective remarkably boosts the fooling ability of the crafted perturbations. The significant fooling rates achieved by our objective emphasize that current deep learning models are at increased risk, since our objective generalizes across multiple tasks without requiring training data.

Updated: 2018-08-01
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-31
Sheng-Jun Huang; Wei Gao; Zhi-Hua Zhou

In many real-world tasks, particularly those involving data objects with complicated semantics such as images and texts, one object can be represented by multiple instances and simultaneously be associated with multiple labels. Such tasks can be formulated as multi-instance multi-label learning (MIML) problems, and have been extensively studied during the past few years. Existing MIML approaches have been found useful in many applications; however, most of them can only handle moderate-sized data. To efficiently handle large data sets, in this paper we propose the MIMLfast approach, which first constructs a low-dimensional subspace shared by all labels, and then trains label-specific linear models to optimize an approximated ranking loss via stochastic gradient descent. Although the MIML problem is complicated, MIMLfast is able to achieve excellent performance by exploiting label relations with the shared space and discovering sub-concepts for complicated labels. Experiments show that the performance of MIMLfast is highly competitive with state-of-the-art techniques, whereas its time cost is much lower. Moreover, our approach is able to identify the most representative instance for each label, thus providing a chance to understand the relation between input patterns and output label semantics.

Updated: 2018-08-01
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-31
Tian Tian; Jun Zhu; You Qiaoben

Learning-from-crowds aims to design proper aggregation strategies to infer the unknown true labels from noisy labels provided by ordinary web workers. This paper presents max-margin majority voting (M3V) to improve the discriminative ability of majority voting, and further presents a Bayesian generalization to incorporate the flexibility of generative methods in modeling noisy observations with worker confusion matrices for different application settings. We first introduce the crowdsourcing margin of majority voting, and then formulate the joint learning as a regularized Bayesian inference (RegBayes) problem, where the posterior regularization is derived by maximizing the margin between the aggregated score of a potential true label and that of any alternative label. Our Bayesian model naturally covers the Dawid-Skene estimator and M3V as its two special cases. Due to the flexibility of our model, we extend it to handle crowdsourced labels with an ordinal structure, with the main ideas about the crowdsourcing margin unchanged. Moreover, we consider an online learning-from-crowds setting where labels arrive in a stream. Empirical results demonstrate that our methods are competitive, often achieving better results than state-of-the-art estimators.
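The crowdsourcing margin of plain majority voting can be illustrated as the gap between the winning label's aggregated score and the best alternative's. A toy sketch with raw vote counts (M3V itself learns worker weights rather than counting raw votes; names are ours):

```python
import numpy as np

def vote_margin(votes, n_labels):
    """Aggregate one item's worker labels by majority vote and report the
    margin between the top label's count and the runner-up's."""
    counts = np.bincount(votes, minlength=n_labels)
    top, second = np.sort(counts)[::-1][:2]
    return int(np.argmax(counts)), int(top - second)

label, margin = vote_margin(np.array([0, 0, 1, 0, 2]), n_labels=3)
print(label, margin)   # label 0 wins with margin 3 - 1 = 2
```

Maximizing this margin (over learned per-worker scores instead of raw counts) is what makes the aggregation discriminative.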

Updated: 2018-08-01
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-31
Emanuele Sansone; Francesco G.B. De Natale; Zhi-Hua Zhou

Positive unlabeled (PU) learning is useful in various practical situations, where there is a need to learn a classifier for a class of interest from an unlabeled data set, which may contain anomalies as well as samples from unknown classes. The learning task can be formulated as an optimization problem under the framework of statistical learning theory. Recent studies have theoretically analyzed its properties and generalization performance, nevertheless, little effort has been made to consider the problem of scalability, especially when large sets of unlabeled data are available. In this work we propose a novel scalable PU learning algorithm that is theoretically proven to provide the optimal solution, while showing superior computational and memory performance. Experimental evaluation confirms the theoretical evidence and shows that the proposed method can be successfully applied to a large variety of real-world problems involving PU learning.

Updated: 2018-08-01
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-31
Di Wang; X.-B. Gao; Xiumei Wang; Lihuo He

Multimodal hashing has attracted much interest for cross-modal similarity search on large-scale multimedia data sets because of its efficiency and effectiveness. Recently, supervised multimodal hashing, which tries to preserve the semantic information obtained from the labels of training data, has received considerable attention for its higher search accuracy compared with unsupervised multimodal hashing. Although these algorithms are promising, they are mainly designed to preserve pairwise similarities. They do not enable the hash codes to preserve the discriminative information reflected by labels and, hence, the retrieval accuracies of these methods are affected. To address these challenges, this paper introduces a simple yet effective supervised multimodal hashing method, called label consistent matrix factorization hashing (LCMFH), which focuses on directly utilizing semantic labels to guide the hashing learning procedure. Considering that relevant data from different modalities have semantic correlations, LCMFH transforms heterogeneous data into latent semantic spaces in which multimodal data from the same category share the same representation. Therefore, hash codes quantified by the obtained representations are consistent with the semantic labels of the original data and, thus, can have more discriminative power for cross-modal similarity search tasks. Thorough experiments on standard databases show that the proposed algorithm outperforms several state-of-the-art methods.

Updated: 2018-08-01
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-20
Jian-Hao Luo; Hao Zhang; Hong-Yu Zhou; Chen-Wei Xie; Jianxin Wu; Weiyao Lin

This paper aims at accelerating and compressing deep neural networks to deploy CNN models on small devices like mobile phones or embedded gadgets. We focus on filter-level pruning, i.e., a whole filter is discarded if it is less important. An effective and unified framework, ThiNet (short for "Thin Net"), is proposed in this paper. We formally establish filter pruning as an optimization problem, and reveal that we need to prune filters based on statistics computed from the next layer, not the current layer, which differentiates ThiNet from existing methods. We also propose "gcos" (Group COnvolution with Shuffling), a more accurate group convolution scheme, to further reduce the pruned model size. Experimental results demonstrate the effectiveness of our method, which has advanced the state of the art. Moreover, we show that the original VGG-16 model can be compressed into a very small model (ThiNet-Tiny) with a model size of only 2.66 MB while preserving AlexNet-level accuracy. This small model is evaluated on several benchmarks with different vision tasks (e.g., classification, detection, segmentation), and shows excellent generalization ability.
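The next-layer-statistics idea can be caricatured in a few lines: a channel matters only insofar as the next layer's weights propagate its activations. The score below is our crude stand-in for ThiNet's greedy, reconstruction-based selection, not the paper's actual procedure:

```python
import numpy as np

def channel_scores(acts, W_next):
    """Score each channel of the current layer by its expected squared
    contribution through the next layer's weights: E[a_c^2] * ||W_next[:, c]||^2.
    Channels with the smallest scores are candidates for pruning."""
    return (acts ** 2).mean(axis=0) * (W_next ** 2).sum(axis=0)

rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 4))     # 64 sample activations, 4 channels
acts[:, 2] = 0.0                    # a dead channel
W_next = rng.normal(size=(8, 4))    # next layer: 4 channels in, 8 out
scores = channel_scores(acts, W_next)
print(scores.argmin())              # the dead channel is pruned first
```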

Updated: 2018-08-01
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-23
Tsung-Yi Lin; Priyal Goyal; Ross Girshick; Kaiming He; Piotr Dollar

The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
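The paper's loss reshaping is FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); with gamma = 0 it reduces to (alpha-weighted) cross entropy. A minimal binary-case sketch:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss; p is the predicted foreground probability, y in {0, 1}."""
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# A well-classified example (p_t = 0.9) is down-weighted by (1 - 0.9)^2 = 0.01
# relative to plain cross entropy, so masses of easy negatives cannot dominate.
ce = focal_loss(0.9, 1, gamma=0.0, alpha=1.0)
fl = focal_loss(0.9, 1, gamma=2.0, alpha=1.0)
print(fl / ce)   # ~0.01
```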

Updated: 2018-07-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-23
Yueqi Duan; Jiwen Lu; Ziwei Wang; Jianjiang Feng; Jie Zhou

In this paper, we propose an unsupervised feature learning method called deep binary descriptor with multi-quantization (DBD-MQ) for visual analysis. Existing learning-based binary descriptors utilize the rigid sign function for binarization regardless of the data distribution, and thus usually suffer from severe quantization loss. To this end, we propose a deep multi-quantization network to learn a data-dependent binarization in an unsupervised manner. More specifically, we design a K-Autoencoders (KAEs) network to jointly learn the parameters of the feature extractor and the binarization functions under a deep learning framework, so that discriminative binary descriptors can be obtained with a fine-grained multi-quantization. As DBD-MQ simply allocates the same number of quantizers to each real-valued feature dimension, ignoring the elementwise diversity of informativeness, we further propose a deep competitive binary descriptor with multi-quantization (DCBD-MQ) method to learn an optimal allocation of bits with a fixed binary length in a competitive manner, where informative dimensions gain more bits for complete representation. Moreover, we present a similarity-aware binary encoding strategy based on the earth mover's distance of Autoencoders, so that elements that are quantized into similar Autoencoders will have smaller Hamming distances. Extensive experimental results on six widely-used datasets show that our DBD-MQ and DCBD-MQ outperform most state-of-the-art unsupervised binary descriptors.

Updated: 2018-07-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-23
Takeru Miyato; Shin-Ichi Maeda; Shin Ishii; Masanori Koyama

We propose a new regularization method based on virtual adversarial loss: a new measure of local smoothness of the conditional label distribution given input. Virtual adversarial loss is defined as the robustness of the conditional label distribution around each input data point against local perturbation. Unlike adversarial training, our method defines the adversarial direction without label information and is hence applicable to semi-supervised learning. Because the directions in which we smooth the model are only "virtually" adversarial, we call our method virtual adversarial training (VAT). The computational cost of VAT is relatively low. For neural networks, the approximated gradient of virtual adversarial loss can be computed with no more than two pairs of forward- and back-propagations. In our experiments, we applied VAT to supervised and semi-supervised learning tasks on multiple benchmark datasets. With a simple enhancement of the algorithm based on the entropy minimization principle, our VAT achieves state-of-the-art performance for semi-supervised learning tasks on SVHN and CIFAR-10.
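The unlabeled adversarial direction is obtained by power iteration: starting from a random unit vector d, replace d with the normalized gradient of KL(p(y|x) || p(y|x + xi*d)). A self-contained sketch for a linear softmax model, using finite differences instead of backprop and a larger xi than the paper's small value, purely so the illustration is numerically robust:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    return float((p * np.log(p / q)).sum())

def vat_direction(W, x, xi=0.1, eps=1e-5, iters=2):
    """A few power-iteration steps toward the virtual adversarial direction
    for p(y|x) = softmax(Wx). Finite-difference gradients are for clarity only."""
    p = softmax(W @ x)
    rng = np.random.default_rng(0)
    d = rng.normal(size=x.shape)
    d /= np.linalg.norm(d)
    for _ in range(iters):
        g = np.empty_like(x)
        base = kl(p, softmax(W @ (x + xi * d)))
        for i in range(x.size):
            step = np.zeros_like(x)
            step[i] = eps
            g[i] = (kl(p, softmax(W @ (x + xi * d + step))) - base) / eps
        d = g / np.linalg.norm(g)
    return d   # unit vector along which the output distribution is least smooth

W = np.array([[3.0, 0.0], [-3.0, 0.0]])   # the output depends only on x[0]
d = vat_direction(W, np.array([0.1, 0.0]))
print(d)   # the direction concentrates on the sensitive first coordinate
```

No label appears anywhere above, which is exactly why the same smoothing applies to unlabeled data in semi-supervised training.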

Updated: 2018-07-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-23
Juan Liu; Emmanouil Psarakis; Yang Feng; Ioannis Stamos

Repeated patterns (such as windows, balconies, and doors) are prominent and significant features in urban scenes. Detection of these repeated patterns is therefore very important for city scene analysis. This paper attacks the problem of repeated pattern detection in a precise, efficient and automatic way, by combining traditional feature extraction with a Kronecker product based low-rank model. We introduce novel algorithms, with solid theoretical support, that extract repeated patterns from rectified images. Our method is tailored for 2D images of building facades and tested on a large set of facade images.

Updated: 2018-07-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-23
Joshua James Engelsma; Kai Cao; Anil K. Jain

We open source an easy-to-assemble, spoof-resistant, high-resolution optical fingerprint reader, called RaspiReader, built from ubiquitous components. Using our open-source STL files and software, RaspiReader can be built in under one hour for only US $175. As such, RaspiReader provides the fingerprint research community with a simple method for quickly prototyping new ideas involving fingerprint reader hardware. In particular, we posit that this open-source fingerprint reader will facilitate the exploration of novel fingerprint spoof detection techniques involving both hardware and software. We demonstrate one such spoof detection technique by specially customizing RaspiReader with two cameras for fingerprint image acquisition. One camera provides high-contrast, frustrated total internal reflection (FTIR) fingerprint images, and the other outputs direct images of the finger in contact with the platen. Using both of these image streams, we extract complementary information which, when fused together and used for spoof detection, results in marked performance improvement over previous methods relying only on grayscale FTIR images provided by COTS optical readers.

Updated: 2018-07-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-23
Bolei Zhou; David Bau; Aude Oliva; Antonio Torralba

The success of recent deep convolutional neural networks (CNNs) depends on learning hidden representations that can summarize the important factors of variation behind the data. In this work, we describe Network Dissection, a method that interprets networks by providing meaningful labels to their individual units.
The proposed method quantifies the interpretability of CNN representations by evaluating the alignment between individual hidden units and visual semantic concepts. By identifying the best alignments, units are given interpretable labels ranging over colors, materials, textures, parts, objects and scenes. The method reveals that deep representations are more transparent and interpretable than they would be under a random equivalently powerful basis. We apply our approach to interpret and compare the latent representations of several network architectures trained to solve a wide range of supervised and self-supervised tasks. We then examine factors affecting network interpretability, such as the number of training iterations, regularization, different initialization parameters, and network depth and width. Finally, we show that the interpreted units can be used to provide explicit explanations of a given CNN prediction for an image. Our results highlight that interpretability is an important property of deep neural networks that provides new insights into what hierarchical structures can learn.

Updated: 2018-07-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-23
Jian Zhao; Lin Xiong; Jianshu Li; Junliang Xing; Shuicheng Yan; Jiashi Feng

In this paper, we propose a Dual-Agent Generative Adversarial Network (DA-GAN) model, which can improve the realism of a face simulator's output using unlabeled real faces, while preserving identity information during realism refinement. The dual agents are specifically designed to distinguish real vs. fake and identities simultaneously. In particular, we employ an off-the-shelf 3D face model as a simulator to generate profile face images with varying poses. DA-GAN leverages an FCN as the generator and an auto-encoder as the discriminator with dual agents.
Besides the novel architecture, we make several key modifications to the standard GAN to preserve pose and texture, preserve identity, and stabilize the training process: (i) a pose perception loss; (ii) an identity perception loss; (iii) an adversarial loss with a boundary equilibrium regularization term. Experimental results show that DA-GAN not only achieves outstanding perceptual results but also significantly outperforms the state of the art on the challenging NIST IJB-A and CFP unconstrained face recognition benchmarks. In addition, the proposed DA-GAN is a promising new approach for solving generic transfer learning problems more effectively. DA-GAN is the foundation of our winning entry in the NIST IJB-A face recognition competition, in which we secured first place on both the verification and identification tracks.

Updated: 2018-07-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-20
Quanming Yao; James T. Kwok; Tai-Feng Wang; Tie-Yan Liu

Low-rank modeling has many important applications in computer vision and machine learning. While the matrix rank is often approximated by the convex nuclear norm, the use of nonconvex low-rank regularizers has demonstrated better empirical performance. However, the resulting optimization problem is much more challenging. Recent state-of-the-art methods require an expensive full SVD in each iteration. In this paper, we show that for many commonly-used nonconvex low-rank regularizers, the singular values obtained from the proximal operator can be automatically thresholded. This allows the proximal operator to be efficiently approximated by the power method. We then develop a fast proximal algorithm and its accelerated variant with an inexact proximal step. It can be guaranteed that the squared distance between consecutive iterates converges at a rate of $O(1/T)$, where $T$ is the number of iterations.
Furthermore, we show that the proposed algorithm can be parallelized, and the resulting algorithm achieves nearly linear speedup with respect to the number of threads. Extensive experiments are performed on matrix completion and robust principal component analysis. Significant speedup over the state-of-the-art is observed.

Updated: 2018-07-21
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-20
Eyasu Zemene Mequanint; Leulseged Tesfaye Alemu; Marcello Pelillo

Image segmentation has come a long way since the early days of computer vision, and still remains a challenging task. Modern variations of the classical (purely bottom-up) approach involve, e.g., some form of user assistance (interactive segmentation) or ask for the simultaneous segmentation of two or more images (co-segmentation). At an abstract level, all these variants can be thought of as "constrained" versions of the original formulation, whereby the segmentation process is guided by some external source of information. In this paper, we propose a new approach to tackle this kind of problem in a unified way. Our work is based on some properties of a family of quadratic optimization problems related to dominant sets, a graph-theoretic notion of a cluster which generalizes the concept of a maximal clique to edge-weighted graphs. In particular, we show that by properly controlling a regularization parameter which determines the structure and the scale of the underlying problem, we are in a position to extract groups of dominant-set clusters that are constrained to contain predefined elements. We focus on interactive segmentation and co-segmentation (in both the unsupervised and the interactive versions). The proposed algorithm can deal naturally with several types of constraints and input modalities, including scribbles, sloppy contours and bounding boxes, and is able to robustly handle noisy annotations on the part of the user.
Experiments on standard benchmark datasets show the effectiveness of our approach as compared to state-of-the-art algorithms on a variety of natural images under several input conditions and constraints.

Updated: 2018-07-21
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-19
Edouard Oyallon; Sergey Zagoruyko; Gabriel Huang; Nikos Komodakis; Simon Lacoste-Julien; Matthew B. Blaschko; Eugene Belilovsky

Scattering networks are a class of designed Convolutional Neural Networks (CNNs) with fixed weights. We argue they can serve as generic representations for modelling images. In particular, by working in scattering space, we achieve competitive results both for supervised and unsupervised learning tasks, while making progress towards constructing more interpretable CNNs. For supervised learning, we demonstrate that the early layers of CNNs do not necessarily need to be learned, and can be replaced with a scattering network instead. Indeed, using hybrid architectures, we achieve the best results to date with predefined representations, while remaining competitive with end-to-end learned CNNs. Specifically, even applying a shallow cascade of small-windowed scattering coefficients followed by 1×1 convolutions results in AlexNet accuracy on the ILSVRC2012 classification task. Moreover, by combining scattering networks with deep residual networks, we achieve a single-crop top-5 error of 11.4% on ILSVRC2012. Also, we show they can yield excellent performance in the small-sample regime on the CIFAR-10 and STL-10 datasets, exceeding their end-to-end counterparts, through their ability to incorporate geometrical priors. For unsupervised learning, scattering coefficients can be a competitive representation that permits image recovery. We use this fact to train hybrid GANs to generate images. Finally, we empirically analyze several properties related to stability and reconstruction of images from scattering coefficients.

Updated: 2018-07-20
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-19
Kilho Son; James Hays; David B. Cooper

We present a novel computational puzzle solver for square-piece image jigsaw puzzles with no prior information such as piece orientation or anchor pieces. By "piece" we mean a square d × d block of pixels, where we investigate pieces as small as 7 × 7 pixels. To reconstruct such challenging puzzles, we propose to find maximum geometric consensus between pieces, specifically hierarchical piece loops. The proposed algorithm seeks out loops of four pieces and aggregates the smaller loops into higher-order loops of loops in a bottom-up fashion. In contrast to previous puzzle solvers, which aim to maximize compatibility measures between all pairs of pieces and thus depend heavily on the pairwise compatibility measure used, our approach exploits geometric agreement among pieces, reducing its dependency on pairwise compatibility measures, which become increasingly uninformative at small scales. Our contribution also includes an improved pairwise compatibility measure which exploits directional derivative information along adjoining boundaries of the pieces. We verify the proposed algorithm as well as its individual components with mathematical analysis and reconstruction experiments.

Updated: 2018-07-20
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-19
Yongqin Xian; Christoph H Lampert; Bernt Schiele; Zeynep Akata

Due to the importance of zero-shot learning, i.e., classifying images for which there is no labeled training data, the number of proposed approaches has recently increased steadily. We argue that it is time to take a step back and analyze the status quo of the area. The purpose of this paper is three-fold. First, since there is no agreed-upon zero-shot learning benchmark, we define a new benchmark by unifying both the evaluation protocols and the data splits of publicly available datasets used for this task. This is an important contribution, as published results are often not comparable and sometimes even flawed due to, e.g., pre-training on zero-shot test classes. Moreover, we propose a new zero-shot learning dataset, the Animals with Attributes 2 (AWA2) dataset, which we make publicly available both in terms of image features and the images themselves. Second, we compare and analyze a significant number of state-of-the-art methods in depth, both in the classic zero-shot setting and in the more realistic generalized zero-shot setting. Finally, we discuss in detail the limitations of the current status of the area, which can serve as a basis for advancing it.
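Many of the methods compared in such benchmarks share a simple compatibility-based prediction rule: embed the image feature into attribute space and pick the unseen class whose attribute vector scores highest. A minimal sketch, assuming a pre-learned linear map `W` and per-class attribute vectors (all names and values here are illustrative, not from the paper):

```python
import numpy as np

def zero_shot_predict(x, W, class_attributes):
    """Predict an unseen class via attribute compatibility.

    x: image feature, shape (d,); W: learned d-by-a linear map
    (assumed given); class_attributes: dict mapping class name to an
    attribute vector of shape (a,). The score for class c is x^T W a_c.
    """
    proj = x @ W  # embed the image feature into attribute space
    scores = {c: float(proj @ a) for c, a in class_attributes.items()}
    return max(scores, key=scores.get)

# Toy example: two unseen classes described by two attributes
# ("striped", "aquatic"); W is the identity purely for illustration.
attrs = {"zebra": np.array([1.0, 0.0]), "whale": np.array([0.0, 1.0])}
pred = zero_shot_predict(np.array([0.9, 0.1]), np.eye(2), attrs)
```

Generalized zero-shot evaluation differs only in the candidate set: `class_attributes` would include seen classes as well, which is precisely what makes that setting harder.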

Updated: 2018-07-20
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-19
Yunhe Wang; Chang Xu; Chao Xu; Dacheng Tao

Deep convolutional neural networks (CNNs) are successfully used in a number of applications. However, their storage and computational requirements have largely prevented their widespread use on mobile devices. Here we present a series of approaches for compressing and speeding up CNNs in the frequency domain, which focuses not only on the smaller weights but on all the weights and their underlying connections. By treating convolutional filters as images, we decompose their frequency-domain representations into common parts (i.e., cluster centers) shared by similar filters and individual private parts (i.e., individual residuals). A large number of low-energy frequency coefficients in both parts can be discarded to achieve high compression without significantly compromising accuracy. Furthermore, we explore a data-driven method for removing redundancies in both the spatial and frequency domains, which allows us to discard more unnecessary weights while maintaining similar accuracy. After obtaining the optimal sparse CNN in the frequency domain, we reduce the computational burden of convolution operations in CNNs by linearly combining the convolution responses of discrete cosine transform (DCT) bases. The compression and speed-up ratios of the proposed algorithm are thoroughly analyzed and evaluated on benchmark image datasets to demonstrate its superiority over state-of-the-art methods.
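The core frequency-domain step — transform a filter with the DCT, discard low-energy coefficients, and reconstruct — can be sketched as follows. This is a toy illustration of coefficient truncation only, not the paper's full shared/private decomposition; the orthonormal DCT-II matrix is built by hand to keep the sketch self-contained:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)  # normalise the DC row so M @ M.T == I
    return M

def compress_filter(F, keep):
    """Keep only the `keep` largest-magnitude 2-D DCT coefficients of a
    square convolutional filter F, then reconstruct it."""
    n = F.shape[0]
    M = dct_matrix(n)
    C = M @ F @ M.T                       # 2-D DCT of the filter
    thresh = np.sort(np.abs(C), axis=None)[-keep]
    C[np.abs(C) < thresh] = 0.0           # discard low-energy coefficients
    return M.T @ C @ M                    # inverse 2-D DCT
```

Smooth filters concentrate their energy in few coefficients, which is why aggressive truncation costs little accuracy: a constant 3×3 filter survives `keep=1` exactly, while keeping all 9 coefficients reproduces any filter.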

Updated: 2018-07-20
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-16
Han Zhang; Tao Xu; Hongsheng Li; Shaoting Zhang; Xiaogang Wang; Xiaolei Huang; Dimitris N. Metaxas

Although Generative Adversarial Networks (GANs) have shown remarkable success in various tasks, they still face challenges in generating high quality images. In this paper, we propose Stacked Generative Adversarial Networks (StackGANs) aimed at generating high-resolution photo-realistic images. First, we propose a two-stage generative adversarial network architecture, StackGAN-v1, for text-to-image synthesis. The Stage-I GAN sketches the primitive shape and colors of a scene based on a given text description, yielding low-resolution images. The Stage-II GAN takes Stage-I results and the text description as inputs, and generates high-resolution images with photo-realistic details. Second, an advanced multi-stage generative adversarial network architecture, StackGAN-v2, is proposed for both conditional and unconditional generative tasks. Our StackGAN-v2 consists of multiple generators and multiple discriminators arranged in a tree-like structure; images at multiple scales corresponding to the same scene are generated from different branches of the tree. StackGAN-v2 shows more stable training behavior than StackGAN-v1 by jointly approximating multiple distributions. Extensive experiments demonstrate that the proposed stacked generative adversarial networks significantly outperform other state-of-the-art methods in generating photo-realistic images.

Updated: 2018-07-18
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-11
Saeed Anwar; Cong Phuoc Huynh; Fatih Porikli

A fundamental problem in image deblurring is to reliably recover distinct spatial frequencies that have been suppressed by the blur kernel. To tackle this issue, existing image deblurring techniques often rely on generic image priors such as the sparsity of salient features including image gradients and edges. However, these priors only help recover part of the frequency spectrum, such as the frequencies near the high end. To this end, we pose the following specific questions: (i) Does any image class information offer an advantage over existing generic priors for image quality restoration? (ii) If a class-specific prior exists, how should it be encoded into a deblurring framework to recover attenuated image frequencies? Throughout this work, we devise a class-specific prior based on the band-pass filter responses and incorporate it into a deblurring strategy. More specifically, we show that the subspace of band-pass filtered images and their intensity distributions serve as useful priors for recovering image frequencies that are difficult to recover by generic image priors. We demonstrate that our image deblurring framework, when equipped with the above priors, significantly outperforms many state-of-the-art methods using generic image priors or class-specific exemplars.

Updated: 2018-07-12
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-10
Tianfan Xue; Jiajun Wu; Katherine Bouman; William Freeman

We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods that tackle this problem deterministically or non-parametrically, we propose to model future frames in a probabilistic manner. Our probabilistic model makes it possible to sample and synthesize many possible future frames from a single input image. To synthesize realistic movement of objects, we propose a novel network structure, a Cross Convolutional Network, which encodes image and motion information as feature maps and convolutional kernels, respectively. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, as well as on real-world video frames. We present analyses of the learned network representations, showing that the network implicitly learns a compact encoding of object appearance and motion. We also demonstrate a few of its applications, including visual analogy-making and video extrapolation.

Updated: 2018-07-12
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-05
Jae-Pil Heo; Zhe Lin; Sung-Eui Yoon

Approximate K-nearest neighbor search is a fundamental problem in computer science, and is especially important for high-dimensional, large-scale data. Recently, many techniques for encoding high-dimensional data as compact codes have been proposed. Product quantization and its variants, which encode a cluster index in each subspace, have been shown to provide impressive accuracy. In this paper, we explore a simple question: is it best to use the entire bit budget for encoding a cluster index? We have found that as data points lie farther from their cluster centers, the error of the estimated distance grows. To address this issue, we propose a novel compact code representation that distributes the bit budget to encode, in each subspace, both the cluster index and the quantized distance between a point and its cluster center. We also propose two distance estimators tailored to our representation. We further extend our method to encode global residual distances in the original space. We have evaluated the proposed methods on benchmarks consisting of GIST, VLAD, and CNN features. Our extensive experiments show that the proposed methods significantly and consistently improve search accuracy over the other tested techniques, mainly because they estimate distances more accurately.
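The encoding idea — spending part of the bit budget on a quantized point-to-center distance in each subspace rather than on the cluster index alone — can be sketched as below. The codebooks are assumed pre-trained (e.g., by k-means), and the uniform bin layout is an illustrative choice, not the paper's estimator:

```python
import numpy as np

def encode(x, codebooks, n_dist_bins, max_radius):
    """Encode x as, per subspace, a pair (cluster index, quantized
    distance from the sub-vector to its cluster center).

    codebooks: list of (k, d_sub) arrays of cluster centers, one per
    subspace; n_dist_bins: quantization levels for the distance;
    max_radius: largest expected distance (farther points saturate).
    """
    codes, offset = [], 0
    for C in codebooks:
        d_sub = C.shape[1]
        sub = x[offset:offset + d_sub]
        offset += d_sub
        dists = np.linalg.norm(C - sub, axis=1)
        idx = int(dists.argmin())  # nearest cluster center in this subspace
        # Quantize the residual distance uniformly into n_dist_bins levels.
        b = min(int(dists[idx] / max_radius * n_dist_bins), n_dist_bins - 1)
        codes.append((idx, b))
    return codes
```

A distance estimator can then treat the decoded bin as an approximate radius around the center, which is exactly the extra information plain product quantization throws away.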

Updated: 2018-07-08
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-07-04
Zechao Li; Jinhui Tang; Tao Mei

In this work, we investigate the problem of learning knowledge from massive community-contributed images with rich but weakly-supervised context information, which can benefit multiple image understanding tasks simultaneously, such as social image tag refinement and assignment, content-based image retrieval, tag-based image retrieval, and tag expansion. To this end, we propose a Deep Collaborative Embedding (DCE) model to uncover a unified latent space for images and tags. The proposed method incorporates end-to-end learning and collaborative factor analysis in one unified framework for optimal compatibility between representation learning and latent space discovery. A nonnegative, discrete refined tagging matrix is learned to guide the end-to-end learning. To collaboratively explore the rich context information of social images, the proposed method integrates the weakly-supervised image-tag correlation, image correlation, and tag correlation simultaneously and seamlessly. The model is also extended to embed new tags in the uncovered space. To verify its effectiveness, extensive experiments are conducted on two widely-used social image benchmarks for multiple social image understanding tasks. The encouraging performance over state-of-the-art approaches demonstrates the superiority of the proposed method.

Updated: 2018-07-05
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-09
Shaul Oron; Tali Dekel; Tianfan Xue; William T. Freeman; Shai Avidan

We propose a novel method for template matching in unconstrained environments. Its essence is the Best-Buddies Similarity (BBS), a useful, robust, and parameter-free similarity measure between two sets of points. BBS is based on counting the number of Best-Buddies Pairs (BBPs)—pairs of points in source and target sets that are mutual nearest neighbours, i.e., each point is the nearest neighbour of the other. BBS has several key features that make it robust against complex geometric deformations and high levels of outliers, such as those arising from background clutter and occlusions. We study these properties, provide a statistical analysis that justifies them, and demonstrate the consistent success of BBS on a challenging real-world dataset while using different types of features.
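The BBS computation itself is compact: count mutual nearest neighbours between the two point sets and normalize. A sketch on raw point coordinates (feature extraction is omitted; normalizing by the smaller set size, so that identical sets score 1.0, is an illustrative choice consistent with the measure being parameter-free and bounded):

```python
import numpy as np

def best_buddies_similarity(P, Q):
    """Best-Buddies Similarity between point sets P (m, d) and Q (n, d).

    Counts Best-Buddies Pairs: pairs (p_i, q_j) where each point is the
    other's nearest neighbour. Normalised by min(m, n).
    """
    # Pairwise squared Euclidean distances, shape (m, n).
    d = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=-1)
    nn_pq = d.argmin(axis=1)  # nearest q for every p
    nn_qp = d.argmin(axis=0)  # nearest p for every q
    # A pair is a BBP iff the nearest-neighbour relation is mutual.
    bbp = sum(1 for i, j in enumerate(nn_pq) if nn_qp[j] == i)
    return bbp / min(len(P), len(Q))
```

Because an outlier's nearest neighbour in the other set rarely points back at it, outliers simply fail to form pairs instead of corrupting the score, which is the intuition behind the measure's robustness.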

Updated: 2018-07-04
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-07-06
Yu Zhang; Mao Ye; Dinesh Manocha; Ruigang Yang

We present a practical and inexpensive method to reconstruct 3D scenes that include transparent and mirror objects. Our work is motivated by the need for automatically generating 3D models of interior scenes, which commonly include glass. These large structures are often invisible to cameras or even to our human visual system. Existing 3D reconstruction methods for transparent objects are usually not applicable in such a room-sized reconstruction setting. Our simple hardware setup augments a regular depth camera (e.g., the Microsoft Kinect camera) with a single ultrasonic sensor, which is able to measure the distance to any object, including transparent surfaces. The key technical challenge is the sparse sampling rate from the acoustic sensor, which only takes one point measurement per frame. To address this challenge, we take advantage of the fact that the large scale glass structures in indoor environments are usually either piece-wise planar or a simple parametric surface. Based on these assumptions, we have developed a novel sensor fusion algorithm that first segments the (hybrid) depth map into different categories such as opaque/transparent/infinity (e.g., too far to measure) and then updates the depth map based on the segmentation outcome. We validated our algorithms with a number of challenging cases, including multiple panes of glass, mirrors, and even a curved glass cabinet.

Updated: 2018-07-02
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-09
Ziwei Liu; Xiaoxiao Li; Ping Luo; Chen Change Loy; Xiaoou Tang

Semantic segmentation tasks can be well modeled by a Markov Random Field (MRF). This paper addresses semantic segmentation by incorporating high-order relations and a mixture of label contexts into the MRF. Unlike previous works that optimized MRFs using iterative algorithms, we solve the MRF with a Convolutional Neural Network (CNN), namely the Deep Parsing Network (DPN), which enables deterministic end-to-end computation in a single forward pass. Specifically, DPN extends a contemporary CNN to model unary terms, and additional layers are devised to approximate the mean field (MF) algorithm for pairwise terms. It has several appealing properties. First, different from recent works that required many iterations of MF during back-propagation, DPN achieves high performance by approximating one iteration of MF. Second, DPN represents various types of pairwise terms, making many existing models its special cases. Furthermore, the pairwise terms in DPN provide a unified framework to encode rich contextual information in high-dimensional data, such as images and videos. Third, DPN makes MF easier to parallelize and speed up, thus enabling efficient inference. DPN is thoroughly evaluated on standard semantic image/video segmentation benchmarks, where a single DPN model yields state-of-the-art segmentation accuracies on the PASCAL VOC 2012, Cityscapes, and CamVid datasets.

Updated: 2018-07-02