• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-25
Runze Zhang; Siyu Zhu; Tianwei Shen; Lei Zhou; Zixin Luo; Tian Fang; Long Quan

The increasing scale of Structure-from-Motion is fundamentally limited by the conventional optimization framework of all-in-one global bundle adjustment. In this paper, we propose a distributed approach to this global bundle adjustment for very large scale Structure-from-Motion computation. First, we derive the distributed formulation from the classical Alternating Direction Method of Multipliers (ADMM), based on a global camera consensus. Then, we analyze the conditions under which the convergence of this distributed optimization is guaranteed. In particular, we adopt over-relaxation and self-adaption schemes to improve the convergence rate. After that, we propose to split the large-scale camera-point visibility graph in order to reduce the communication overhead of the distributed computation. Experiments on both public large-scale SfM datasets and our very large scale aerial photo sets demonstrate that the proposed distributed method clearly outperforms the state-of-the-art method in efficiency and accuracy.
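The camera-consensus ADMM splitting described above can be illustrated with a small numerical sketch. This is a generic consensus-ADMM loop on a toy distributed least-squares problem, not the authors' bundle-adjustment implementation; the quadratic per-block costs, the penalty rho and the fixed iteration count are illustrative assumptions, and the sketch omits the over-relaxation and self-adaption schemes.

```python
import numpy as np

def consensus_admm(blocks, dim, rho=1.0, iters=1000):
    """Generic consensus ADMM: minimize sum_i 0.5*||A_i x - b_i||^2
    by giving each block a local copy x_i and driving all copies to a
    global consensus z (the role the shared camera parameters play in
    distributed bundle adjustment)."""
    x = [np.zeros(dim) for _ in blocks]   # local variables
    u = [np.zeros(dim) for _ in blocks]   # scaled dual variables
    z = np.zeros(dim)                     # global consensus variable
    # Precompute each block's local normal equations plus rho*I.
    solvers = [(A.T @ A + rho * np.eye(dim), A.T @ b) for A, b in blocks]
    for _ in range(iters):
        # Local updates: each could run on a separate machine.
        for i, (H, g) in enumerate(solvers):
            x[i] = np.linalg.solve(H, g + rho * (z - u[i]))
        # Consensus update: average of local copies plus duals.
        z = np.mean([xi + ui for xi, ui in zip(x, u)], axis=0)
        # Dual updates penalize disagreement with the consensus.
        for i in range(len(blocks)):
            u[i] += x[i] - z
    return z

# Toy problem: one least-squares objective split across three "machines".
rng = np.random.default_rng(0)
blocks = [(rng.normal(size=(8, 4)), rng.normal(size=8)) for _ in range(3)]
z = consensus_admm(blocks, dim=4)
# The consensus should match the centralized least-squares solution.
A = np.vstack([A for A, _ in blocks])
b = np.concatenate([b for _, b in blocks])
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Each local update touches only its own block's data, which is what makes the scheme distributable; the consensus and dual steps are the only global communication.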

Updated: 2018-05-27
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-25
Wenguan Wang; Jianbing Shen; Haibin Ling

We study the problem of photo cropping, which aims to find a cropping window of an input image that preserves as much of its important content as possible while remaining aesthetically pleasing. Seeking a deep-learning-based solution, we design a neural network with two branches for attention box prediction (ABP) and aesthetics assessment (AA), respectively. Given the input image, the ABP network predicts an attention bounding box as an initial minimum cropping window, around which a set of cropping candidates is generated with little loss of important information. Then, the AA network is employed to select, among the candidates, the final cropping window with the best aesthetic quality. The two sub-networks are designed to share the same full-image convolutional feature map and are thus computationally efficient. By leveraging attention prediction and aesthetics assessment, the cropping model produces high-quality cropping results, even with the limited availability of training data for photo cropping. Experimental results on benchmark datasets clearly validate the effectiveness of the proposed approach. In addition, our approach runs at 5 fps, outperforming most previous solutions.
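The candidate-generation step around the predicted attention box can be sketched as follows. The scale set, the grow-and-clip rule and the function name are illustrative assumptions, not the paper's actual scheme.

```python
def crop_candidates(img_w, img_h, box, scales=(1.0, 1.2, 1.5, 2.0)):
    """Generate cropping windows that fully contain an attention box.

    `box` is (x0, y0, x1, y1). For each scale factor the attention box
    is grown about its center and clipped to the image, so every
    candidate keeps the predicted important region. (Illustrative
    sketch; the paper's actual candidate generation may differ.)
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = x1 - x0, y1 - y0
    candidates = []
    for s in scales:
        nw, nh = w * s, h * s
        # Grow about the center, then clip to the image bounds.
        c = (max(0.0, cx - nw / 2), max(0.0, cy - nh / 2),
             min(float(img_w), cx + nw / 2), min(float(img_h), cy + nh / 2))
        candidates.append(c)
    return candidates

cands = crop_candidates(640, 480, (100, 100, 300, 250))
```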

Updated: 2018-05-27
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-24
Jan Funke; Fabian David Tschopp; William Grisaitis; Arlo Sheridan; Chandan Singh; Stephan Saalfeld; Srinivas C. Turaga

We present a method combining affinity prediction with region agglomeration, which significantly improves on the state of the art in accuracy and scalability for neuron segmentation from electron microscopy (EM). Our method consists of a 3D U-Net, trained to predict affinities between voxels, followed by iterative region agglomeration. We train using a structured loss based on MALIS, encouraging topologically correct segmentations obtained from affinity thresholding. Our extension consists of two parts: first, we present a quasi-linear method to compute the loss gradient, improving over the original quadratic algorithm; second, we compute the gradient in two separate passes to avoid spurious gradient contributions in early training stages. Our predictions are accurate enough that simple, learning-free percentile-based agglomeration outperforms more involved methods used earlier on inferior predictions. We present results on three diverse EM datasets, achieving relative improvements over previous results of 27%, 15%, and 250%. Our findings suggest that a single method can be applied to both nearly isotropic block-face EM data and anisotropic serial-section EM data. The runtime of our method scales linearly with the size of the volume and achieves a throughput of ~2.6 seconds per megavoxel, qualifying it for the processing of very large datasets.
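The agglomeration stage can be sketched as a greedy merge over an affinity graph. This is a simplified stand-in, a plain threshold on affinity-sorted edges with union-find, for the percentile-based agglomeration the abstract describes; the toy graph and threshold are illustrative.

```python
def agglomerate(n_nodes, edges, threshold):
    """Greedy region agglomeration on an affinity graph.

    `edges` is a list of (affinity, u, v). Starting from one region per
    node, edges are processed in order of decreasing affinity and the
    two regions are merged while the affinity stays above `threshold`.
    (Simplified stand-in for percentile-based agglomeration on
    predicted voxel affinities.)
    """
    parent = list(range(n_nodes))

    def find(a):
        # Union-find with path compression.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for aff, u, v in sorted(edges, reverse=True):
        if aff < threshold:
            break  # remaining edges are even weaker
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv  # merge the two regions
    return [find(i) for i in range(n_nodes)]

# Chain of 5 voxels; the low affinity between 2 and 3 keeps two segments.
labels = agglomerate(5, [(0.9, 0, 1), (0.8, 1, 2), (0.2, 2, 3), (0.95, 3, 4)], 0.5)
```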

Updated: 2018-05-25
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-23
Jinwei Ye; Yu Ji; Mingyuan Zhou; Sing Bing Kang; Jingyi Yu

The goal of image pre-compensation is to process an image such that, after being convolved with a known kernel, it appears close to the sharp reference image. In a practical setting, the pre-compensated image has a significantly higher dynamic range than the latent image, so some form of tone mapping is needed. In this paper, we show how global tone mapping functions affect contrast and ringing in image pre-compensation. We further enhance contrast and reduce ringing by considering visual saliency. Specifically, we prioritize contrast preservation in salient regions while tolerating more blurriness elsewhere. For quantitative analysis, we design new metrics to measure the contrast of an image with ringing: we find its "equivalent ringing-free" image, which matches its intensity histogram, and use the contrast of that image as the measure. We illustrate our approach on projector defocus compensation and visual acuity enhancement. Compared with the state of the art, our approach significantly improves contrast. We also perform user studies to demonstrate that our method can effectively improve the viewing experience for users with impaired vision.
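A minimal frequency-domain sketch of pre-compensation, assuming circular convolution and a Wiener-style regularized inverse; the paper's actual formulation, which accounts for dynamic range, tone mapping and saliency, is considerably more involved.

```python
import numpy as np

def precompensate(img, kernel, eps=1e-3):
    """Regularized inverse filtering in the frequency domain: produce a
    pre-compensated image that, once circularly convolved with the
    known blur kernel, approximates the sharp input. Minimal sketch;
    tone mapping and saliency weighting are omitted."""
    K = np.fft.fft2(kernel, s=img.shape)
    I = np.fft.fft2(img)
    # Wiener-style inverse avoids dividing by near-zero frequencies.
    P = I * np.conj(K) / (np.abs(K) ** 2 + eps)
    return np.real(np.fft.ifft2(P))

def circular_blur(img, kernel):
    """Circular convolution via the FFT (the assumed forward model)."""
    K = np.fft.fft2(kernel, s=img.shape)
    return np.real(np.fft.ifft2(np.fft.fft2(img) * K))

# Smooth test image: blurring the pre-compensated image recovers it closely.
y, x = np.mgrid[0:64, 0:64]
img = np.sin(2 * np.pi * x / 64) * np.sin(2 * np.pi * y / 64)
kernel = np.outer([0.25, 0.5, 0.25], [0.25, 0.5, 0.25])
pre = precompensate(img, kernel)
restored = circular_blur(pre, kernel)
```

Because the same kernel spectrum appears in both steps, any phase shift introduced by the uncentered kernel cancels; only the regularization term eps limits how exactly low-contrast frequencies are restored.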

Updated: 2018-05-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-23
Kevis-Kokitsi Maninis; Sergi Caelles; Yuhua Chen; Jordi Pont-Tuset; Laura Leal-Taixé; Daniel Cremers; Luc Van Gool

Video Object Segmentation, and video processing in general, has been historically dominated by methods that rely on the temporal consistency and redundancy in consecutive video frames. When temporal smoothness is suddenly broken, such as when an object is occluded, the result of these methods can deteriorate significantly. This paper explores the orthogonal approach of processing each frame independently, i.e. disregarding temporal information. In particular, it tackles the task of semi-supervised video object segmentation: the separation of an object from the background in a video, given its mask in the first frame. We present Semantic One-Shot Video Object Segmentation (OSVOS$^{S}$), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one shot). We show that instance-level semantic information, when combined effectively, can dramatically improve the results of our previous method, OSVOS. We perform experiments on two recent single-object video segmentation databases, which show that OSVOS$^{S}$ is both the fastest and most accurate method in the state of the art. Experiments on multi-object video segmentation show that OSVOS$^{S}$ obtains competitive results.

Updated: 2018-05-24
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-23
Chaowei Fang; Zicheng Liao; Yizhou Yu

We propose a new multi-dimensional nonlinear embedding, Piecewise Flat Embedding (PFE), for image segmentation. Based on the theory of sparse signal recovery, piecewise flat embedding with diverse channels attempts to recover a piecewise constant image representation with sparse region boundaries and sparse cluster value scattering. The resultant piecewise flat embedding exhibits interesting properties, such as suppressing slowly varying signals, and offers an image representation with higher region identifiability, which is desirable for image segmentation or high-level semantic analysis tasks. We formulate our embedding as a variant of the Laplacian Eigenmap embedding with an $L_{1,p}$ $(0 < p \leq 1)$ regularization term.

Updated: 2018-05-24

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-22
Dan Xu; Elisa Ricci; Wanli Ouyang; Xiaogang Wang; Nicu Sebe

Depth cues have proven very useful in various computer vision and robotic tasks. This paper addresses the problem of monocular depth estimation from a single still image. Inspired by the effectiveness of recent work on multi-scale convolutional neural networks (CNNs), we propose a deep model which fuses complementary information derived from multiple CNN side outputs. Different from previous methods using concatenation or weighted-average schemes, the integration is obtained by means of continuous Conditional Random Fields (CRFs). In particular, we propose two different variants, one based on a cascade of multiple CRFs, the other on a unified graphical model. By designing a novel CNN implementation of mean-field updates for continuous CRFs, we show that both proposed models can be regarded as sequential deep networks and that training can be performed end-to-end. Through an extensive experimental evaluation, we demonstrate the effectiveness of the proposed approach and establish new state-of-the-art results for monocular depth estimation on three publicly available datasets, i.e., NYUD-V2, Make3D and KITTI.

Updated: 2018-05-23

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-22
Dan Teng; Delin Chu

Recently, a deterministic method, frequent directions (FD), was proposed to solve the high-dimensional low-rank approximation problem. It works well in practice but incurs high computational cost. In this paper, we establish a fast frequent directions algorithm for the low-rank approximation problem, which implants a randomized algorithm, sparse subspace embedding (SpEmb), in FD. This new algorithm makes use of FD's natural block structure and sends more information through SpEmb to each block in FD. We prove that our new algorithm produces a good low-rank approximation with a sketch whose size is linear in the approximated rank. Its effectiveness and efficiency are demonstrated by experimental results on both synthetic and real-world datasets, as well as by applications in network analysis.

Updated: 2018-05-23

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-17
Min-Gyu Park; Kuk-Jin Yoon

We present a robust approach for computing disparity maps with supervised learning-based confidence prediction. This approach takes the following aspects into consideration. First, we analyze the characteristics of various confidence measures in the random forest framework to select effective confidence measures depending on the characteristics of the training data and matching strategies, such as similarity measures and parameters. We then train a random forest using the selected confidence measures to improve the efficiency of confidence prediction and to build a better prediction model. Second, we present a confidence-based matching cost modulation scheme based on the predicted confidence values to improve the robustness and accuracy of (semi-)global stereo matching algorithms. Finally, we apply the proposed modulation scheme to popularly used algorithms to make them robust against unexpected difficulties that could occur in an uncontrolled environment, using challenging outdoor datasets. The proposed confidence measure selection and cost modulation schemes are experimentally verified from various perspectives using the KITTI and Middlebury datasets.

Updated: 2018-05-18

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-17
Yudong Guo; Juyong Zhang; Jianfei Cai; Boyi Jiang; Jianmin Zheng

With the power of convolutional neural networks (CNNs), CNN-based face reconstruction has recently shown promising performance in reconstructing detailed face shape from 2D face images. The success of CNN-based methods relies on large amounts of labeled data. The state of the art synthesizes such data using a coarse morphable face model, which, however, has difficulty generating detailed photo-realistic images of faces (with wrinkles). This paper presents a novel face data generation method. Specifically, we render a large number of photo-realistic face images with different attributes based on inverse rendering. Furthermore, we construct a fine-detailed face image dataset by transferring different scales of details from one image to another. We also construct a large number of video-type adjacent frame pairs by simulating the distribution of real video data. With these carefully constructed datasets, we propose a coarse-to-fine learning framework consisting of three convolutional networks. The networks are trained for real-time detailed 3D face reconstruction from monocular video as well as from a single image. Extensive experimental results demonstrate that our framework can produce high-quality reconstructions with much less computation time compared to the state of the art. Moreover, our method is robust to pose, expression and lighting thanks to the diversity of the data.

Updated: 2018-05-18

• IEEE Trans. Pattern Anal. Mach. Intell.
(IF 8.329) Pub Date : 2018-05-15
Daniel Martinho-Corbishley; Mark Nixon; John N. Carter

Recognising human attributes from surveillance footage is widely studied for attribute-based re-identification. However, most works assume coarse, expertly-defined categories, which are ineffective for describing challenging images. Such brittle representations are limited in discriminative power and hamper the efficacy of learnt estimators. We aim to discover more relevant and precise subject descriptions, improving image retrieval and closing the semantic gap. Inspired by fine-grained and relative attributes, we introduce super-fine attributes, which describe multiple, integral concepts of a single trait as multi-dimensional perceptual coordinates. Crowd prototyping facilitates efficient crowdsourcing of super-fine labels by pre-discovering salient perceptual concepts for prototype matching. We re-annotate gender, age and ethnicity traits from PETA, a highly diverse (19K instances, 8.7K identities) amalgamation of 10 re-id datasets including VIPER, CUHK and TownCentre. Employing joint attribute regression with the ResNet-152 CNN, we demonstrate substantially improved ranked retrieval performance with super-fine attributes in direct comparison to conventional binary labels, reporting up to an 11.2% and 14.8% mAP improvement for gender and age, further surpassed by ethnicity. We also find our three super-fine traits to outperform 35 binary attributes by 6.5% mAP for subject retrieval in a challenging zero-shot identification scenario.

Updated: 2018-05-16

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-15
James Booth; Anastasios Roussos; Evangelos Ververas; Epameinondas Antonakos; Stylianos Poumpis; Yannis Panagakis; Stefanos P. Zafeiriou

3D Morphable Models (3DMMs) are powerful statistical models of 3D facial shape and texture, and among the state-of-the-art methods for reconstructing facial shape from single images. With the advent of new 3D sensors, many 3D facial datasets have been collected containing both neutral as well as expressive faces. However, all these datasets are captured under controlled conditions. Thus, even though powerful 3D facial shape models can be learnt from such data, it is difficult to build statistical texture models that are sufficient to reconstruct faces captured in unconstrained conditions ("in-the-wild"). In this paper, we propose the first "in-the-wild" 3DMM by combining a statistical model of facial identity and expression shape with an "in-the-wild" texture model. We show that such an approach allows for the development of a greatly simplified fitting procedure for images and videos, as there is no need to optimise with respect to the illumination parameters. We have collected three new databases that combine "in-the-wild" images and video with ground truth 3D facial geometry, the first of their kind, and report extensive quantitative evaluations using them that demonstrate that our method is state-of-the-art.

Updated: 2018-05-16

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-11
Jingdong Wang; Ting Zhang

This paper studies the compact coding approach to approximate nearest neighbor search. We introduce a composite quantization framework. It uses the composition of several ($M$) elements, each of which is selected from a different dictionary, to accurately approximate a $D$-dimensional vector, thus yielding accurate search, and represents the data vector by a short code composed of the indices of the selected elements in the corresponding dictionaries. Our key contribution lies in introducing a near-orthogonality constraint, which guarantees search efficiency, as the cost of the distance computation is reduced from $O(D)$ to $O(M)$ through a distance table lookup scheme. The resulting approach is called near-orthogonal composite quantization. We theoretically justify the equivalence between near-orthogonal composite quantization and minimizing an upper bound of a function formed by jointly considering the quantization error and the search cost according to a generalized triangle inequality. We empirically show the efficacy of the proposed approach on several benchmark datasets. In addition, we demonstrate superior performance in three other applications: combination with the inverted multi-index, inner-product similarity search, and query compression for mobile search.

Updated: 2018-05-12

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-08
Kevin Lin; Jiwen Lu; Chu-Song Chen; Jie Zhou; Ming-Ting Sun

Binary descriptors have been widely used for efficient image matching and retrieval. However, most existing binary descriptors are designed with hand-crafted sampling patterns or learned with label annotations provided by datasets. In this paper, we propose a new unsupervised deep learning approach, called DeepBit, to learn compact binary descriptors for efficient visual object matching. We enforce three criteria on the binary descriptors learned at the top layer of the deep neural network: 1) minimal quantization loss, 2) evenly distributed codes and 3) transformation-invariant bits. We then estimate the parameters of the network by optimizing the proposed objectives with a back-propagation technique. Extensive experimental results on various visual recognition tasks demonstrate the effectiveness of the proposed approach. We further demonstrate that our approach can be realized with a simplified deep neural network, enabling efficient image matching and retrieval with very competitive accuracy.

Updated: 2018-05-09

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-07
Trevor Campbell; Brian Kulis; Jonathan P How

Bayesian nonparametrics are a class of probabilistic models in which the model size is inferred from data.
A recently developed methodology in this field is small-variance asymptotic analysis, a mathematical technique for deriving learning algorithms that capture much of the flexibility of Bayesian nonparametric inference algorithms, but are simpler to implement and less computationally expensive. Past work on small-variance analysis of Bayesian nonparametric inference algorithms has exclusively considered batch models trained on a single, static dataset, which are incapable of capturing time evolution in the latent structure of the data. This work presents a small-variance analysis of the maximum a posteriori filtering problem for a temporally varying mixture model with a Markov dependence structure, which captures temporally evolving clusters within a dataset. Two clustering algorithms result from the analysis: D-Means, an iterative clustering algorithm for linearly separable, spherical clusters; and SD-Means, a spectral clustering algorithm derived from a kernelized, relaxed version of the clustering problem. Experimental results demonstrate the advantages of D-Means and SD-Means over contemporary clustering algorithms, in terms of both computational cost and clustering accuracy.

Updated: 2018-05-08

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-05-04
Cosimo Rubino; Marco Crocco; Alessio Del Bue

In this work we present a novel approach to recovering the 3D position and occupancy of objects in a generic scene using only 2D object detections from multiple-view images. The method reformulates the problem as the estimation of a quadric (ellipsoid) in 3D given a set of 2D ellipses fitted to the object detection bounding boxes in multiple views. We show that a closed-form solution exists in the dual space using a minimum of three views, while a solution with two views is possible through the use of non-linear optimisation and constraints on the size of the object shape. In order to make the solution robust to inaccurate bounding boxes, a likely occurrence in object detection methods, we introduce a data preconditioning technique and a non-linear refinement of the closed-form solution based on implicit subspace constraints. Results on synthetic tests and on different real datasets, involving challenging scenarios, demonstrate the applicability and potential of our method.

Updated: 2018-05-05

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-01
Adrian Šošić; Abdelhak M. Zoubir; Heinz Koeppl

Learning from demonstration (LfD) is the process of building behavioral models of a task from demonstrations provided by an expert. These models can be used, e.g., for system control by generalizing the expert demonstrations to previously unencountered situations. Most LfD methods, however, make strong assumptions about the expert behavior, e.g., they assume the existence of a deterministic optimal ground truth policy or require direct monitoring of the expert's controls, which limits their practical use as part of a general system identification framework. In this work, we consider the LfD problem in a more general setting where we allow for arbitrary stochastic expert policies, without reasoning about the optimality of the demonstrations. Following a Bayesian methodology, we model the full posterior distribution of possible expert controllers that explain the provided demonstration data. Moreover, we show that our methodology can be applied in a nonparametric context to infer the complexity of the state representation used by the expert, and to learn task-appropriate partitionings of the system state space.

Updated: 2018-05-05

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-07-04
Tsung-Yu Lin; Aruni RoyChowdhury; Subhransu Maji

We present a simple and effective architecture for fine-grained recognition called Bilinear Convolutional Neural Networks (B-CNNs). These networks represent an image as a pooled outer product of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner. B-CNNs are related to orderless texture representations built on deep features but can be trained in an end-to-end manner. Our most accurate model obtains 84.1, 79.4, 84.5 and 91.3 percent per-image accuracy on the Caltech-UCSD birds [1], NABirds [2], FGVC aircraft [3], and Stanford cars [4] datasets, respectively, and runs at 30 frames per second on an NVIDIA Titan X GPU. We then present a systematic analysis of these networks and show that (1) the bilinear features are highly redundant and can be reduced by an order of magnitude in size without significant loss in accuracy, (2) they are also effective for other image classification tasks such as texture and scene recognition, and (3) they can be trained from scratch on the ImageNet dataset, offering consistent improvements over the baseline architecture. Finally, we present visualizations of these models on various datasets using top activations of neural units and gradient-based inversion techniques. The source code for the complete system is available at http://vis-www.cs.umass.edu/bcnn .

Updated: 2018-05-05

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-15
Tingting Mu; John Yannis Goulermas; Sophia Ananiadou

A typical objective of data visualization is to generate low-dimensional plots that maximally convey the information within the data. The visualization output should help the user not only identify the local neighborhood structure of individual samples, but also obtain a global view of the relative positioning and separation between cohorts. Here, we propose a novel visualization framework designed to satisfy these needs.
By incorporating additional cohort positioning and discriminative constraints into local neighbor preservation models through the use of computed cohort prototypes, effective control over the arrangement and proximity of data cohorts can be obtained. We introduce various embedding and projection algorithms based on objective functions addressing the different visualization requirements. Their underlying models are optimized effectively using matrix manifold procedures that incorporate the problem constraints. Additionally, to facilitate large-scale applications, a matrix decomposition based model is also proposed to accelerate the computation. The improved capabilities of the new methods are demonstrated against various state-of-the-art dimensionality reduction algorithms. We present many qualitative and quantitative comparisons, on both synthetic problems and real-world tasks involving complex text and image data, that show notable improvements over existing techniques.

Updated: 2018-05-05

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-02
Girum Getachew Demisse; Djamila Aouada; Björn Ottersten

In this paper, we introduce a deformation-based representation space for curved shapes in $\mathbb{R}^{n}$. Given an ordered set of points sampled from a curved shape, the proposed method represents the set as an element of a finite-dimensional matrix Lie group. Variations due to scale and location are filtered out in a preprocessing stage, while shapes that vary only in rotation are identified by an equivalence relationship. The use of a finite-dimensional matrix Lie group leads to a similarity metric with an explicit geodesic solution. Subsequently, we discuss some of the properties of the metric and its relationship with a deformation by least action. Furthermore, invariance to reparametrization, or estimation of point correspondence between shapes, is formulated as the estimation of a sampling function. Thereafter, two possible approaches are presented to solve the point correspondence estimation problem. Finally, we propose an adaptation of k-means clustering for shape analysis in the proposed representation space. Experimental results show that the proposed representation is robust to uninformative cues, e.g., local shape perturbation and displacement. In comparison to state-of-the-art methods, it achieves high precision on the Swedish and Flavia leaf datasets and comparable results on the MPEG-7, Kimia99 and Kimia216 datasets.

Updated: 2018-05-05

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-05-26
Guosheng Lin; Chunhua Shen; Anton van den Hengel; Ian Reid

We propose an approach for exploiting contextual information in semantic image segmentation, and particularly investigate the use of patch-patch context and patch-background context in deep CNNs. We formulate deep structured models by combining CNNs and Conditional Random Fields (CRFs) for learning the patch-patch context between image regions. Specifically, we formulate CNN-based pairwise potential functions to capture semantic correlations between neighboring patches. Efficient piecewise training of the proposed deep structured model is then applied in order to avoid repeated expensive CRF inference during the course of back propagation. For capturing the patch-background context, we show that a network design with traditional multi-scale image inputs and sliding pyramid pooling is very effective for improving performance. We perform a comprehensive evaluation of the proposed method and achieve new state-of-the-art performance on a number of challenging semantic segmentation datasets.

Updated: 2018-05-05

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-05-26
Qi Wu; Chunhua Shen; Peng Wang; Anthony Dick; Anton van den Hengel

Much of the recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state of the art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high-level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. It particularly allows questions to be asked where the image alone does not contain the information required to select the appropriate answer. Our final model achieves the best reported results for both image captioning and visual question answering on several of the major benchmark datasets.

Updated: 2018-05-05

• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-08
Fatemeh Sadat Saleh; Mohammad Sadegh Aliakbarian; Mathieu Salzmann; Lars Petersson; Jose M. Alvarez; Stephen Gould

Pixel-level annotations are expensive and time-consuming to obtain. Hence, weak supervision using only image tags could have a significant impact on semantic segmentation. Recently, CNN-based methods have proposed to fine-tune pre-trained networks using image tags. Without additional information, this leads to poor localization accuracy.
This problem, however, was alleviated by making use of objectness priors to generate foreground/background masks. Unfortunately these priors either require pixel-level annotations/bounding boxes, or still yield inaccurate object boundaries. Here, we propose a novel method to extract accurate masks from networks pre-trained for the task of object recognition, thus forgoing external objectness modules. We first show how foreground/background masks can be obtained from the activations of higher-level convolutional layers of a network. We then show how to obtain multi-class masks by the fusion of foreground/background ones with information extracted from a weakly-supervised localization network. Our experiments evidence that exploiting these masks in conjunction with a weakly-supervised training loss yields state-of-the-art tag-based weakly-supervised semantic segmentation results. 更新日期：2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-21 Georgios Dimitrios Evangelidis; Radu Horaud This paper addresses the problem of registering multiple point sets. Solutions to this problem are often approximated by repeatedly solving for pairwise registration, which results in an uneven treatment of the sets forming a pair: a model set and a data set. The main drawback of this strategy is that the model set may contain noise and outliers, which negatively affects the estimation of the registration parameters. In contrast, the proposed formulation treats all the point sets on an equal footing. Indeed, all the points are drawn from a central Gaussian mixture, hence the registration is cast into a clustering problem. We formally derive batch and incremental EM algorithms that robustly estimate both the GMM parameters and the rotations and translations that optimally align the sets. 
Moreover, the mixture's means play the role of the registered set of points, while the variances provide rich information about the contribution of each component to the alignment. We thoroughly test the proposed algorithms on simulated data and on challenging real data collected with range sensors. We compare them with several state-of-the-art algorithms, and we show their potential for surface reconstruction from depth data. Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-07-17 Xi Yin; Xiaoming Liu; Jin Chen; David M. Kramer This paper proposes a novel framework for fluorescence plant video processing. The plant research community is interested in leaf-level photosynthetic analysis within a plant. A prerequisite for such analysis is to segment all leaves, estimate their structures, and track them over time. We identify this as a joint multi-leaf segmentation, alignment, and tracking problem. First, leaf segmentation and alignment are applied on the last frame of a plant video to find a number of well-aligned leaf candidates. Second, leaf tracking is applied on the remaining frames with leaf candidate transformation from the previous frame. We form two optimization problems with shared terms in their objective functions for leaf alignment and tracking, respectively. A quantitative evaluation framework is formulated to evaluate the performance of our algorithm with four metrics. Two models are learned to predict the alignment accuracy and detect tracking failure, respectively, in order to provide guidance for subsequent plant biology analysis. The limitation of our algorithm is also studied. Experimental results show the effectiveness, efficiency, and robustness of the proposed method. Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-02 György Kovács In this paper, a novel dissimilarity measure called Matching by Monotonic Tone Mapping (MMTM) is proposed.
The MMTM technique allows matching under non-linear monotonic tone mappings and can be computed efficiently when the tone mappings are approximated by piecewise constant or piecewise linear functions. The proposed method is evaluated in various template matching scenarios involving simulated and real images, and compared to other measures developed to be invariant to monotonic intensity transformations. The results show that the MMTM technique is a highly competitive alternative to conventional measures in problems where the possible tone mappings are close to monotonic. Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-01 Relja Arandjelović; Petr Gronat; Akihiko Torii; Tomas Pajdla; Josef Sivic We tackle the problem of large-scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following four principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the “Vector of Locally Aggregated Descriptors” image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we create a new weakly supervised ranking loss, which enables end-to-end learning of the architecture's parameters from images depicting the same places over time downloaded from Google Street View Time Machine. Third, we develop an efficient training procedure which can be applied on very large-scale weakly labelled tasks. Finally, we show that the proposed architecture and training procedure significantly outperform non-learnt image representations and off-the-shelf CNN descriptors on challenging place recognition and image retrieval benchmarks. Updated: 2018-05-05 • IEEE Trans.
Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-07-04 Bolei Zhou; Agata Lapedriza; Aditya Khosla; Aude Oliva; Antonio Torralba The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database along with the Places-CNNs offers a novel resource to guide future progress on scene recognition problems. Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-05-26 Alberto Crivellaro; Mahdi Rad; Yannick Verdie; Kwang Moo Yi; Pascal Fua; Vincent Lepetit We present an algorithm for estimating the pose of a rigid object in real time under challenging conditions. Our method effectively handles poorly textured objects in cluttered, changing environments, even when their appearance is corrupted by large occlusions, and it relies on grayscale images to handle metallic environments on which depth cameras would fail. As a result, our method is suitable for practical Augmented Reality applications including industrial environments. At the core of our approach is a novel representation for the 3D pose of object parts: we predict the 3D pose of each part in the form of the 2D projections of a few control points.
The advantages of this representation are three-fold: we can predict the 3D pose of the object even when only one part is visible; when several parts are visible, we can easily combine them to compute a better pose of the object; and the 3D pose we obtain is usually very accurate, even when only a few parts are visible. We show how to use this representation in a robust 3D tracking framework. In addition to extensive comparisons with the state-of-the-art, we demonstrate our method on a practical Augmented Reality application for maintenance assistance in the ATLAS particle detector at CERN. Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-06 Bing Shuai; Zhen Zuo; Bing Wang; Gang Wang In this paper, we address the challenging task of scene segmentation. In order to capture the rich contextual dependencies over image regions, we propose Directed Acyclic Graph-Recurrent Neural Networks (DAG-RNN) to perform context aggregation over locally connected feature maps. More specifically, DAG-RNN is placed on top of a pre-trained CNN (feature extractor) to embed context into local features so that their representative capability can be enhanced. In comparison with a plain CNN (as in Fully Convolutional Networks, FCN), DAG-RNN is empirically found to be significantly more effective at aggregating context. Therefore, DAG-RNN demonstrates noticeable performance superiority over FCNs on scene segmentation. Moreover, DAG-RNN has dramatically fewer parameters and demands fewer computational operations, which makes it better suited for deployment on resource-constrained embedded devices. Meanwhile, the class occurrence frequencies are extremely imbalanced in scene segmentation, so we propose a novel class-weighted loss to train the segmentation network. The loss assigns reasonably higher attention weights to infrequent classes during network training, which is essential to boost their parsing performance.
We evaluate our segmentation network on three challenging public scene segmentation benchmarks: Sift Flow, Pascal Context and COCO Stuff. On all of them, we achieve very impressive segmentation performance. Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-16 Tao Wang; Haibin Ling Matching-based algorithms have been commonly used in planar object tracking. They often model a planar object as a set of keypoints, and then find correspondences between keypoint sets via descriptor matching. In previous work, unary constraints on appearances or locations are usually used to guide the matching. However, these approaches rarely utilize the structure information of the object, and thus suffer from various perturbation factors. In this paper, we propose a graph-based tracker, named Gracker, which is able to fully explore the structure information of the object to enhance tracking performance. We model a planar object as a graph, instead of a simple collection of keypoints, to represent its structure. Then, we reformulate tracking as a sequential graph matching process, which establishes keypoint correspondence in a geometric graph matching manner. For evaluation, we compare the proposed Gracker with state-of-the-art planar object trackers on three benchmark datasets: two public ones and a newly collected one. Experimental results show that Gracker achieves robust tracking results against various environmental variations, and outperforms other algorithms in general on the datasets. Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-07-07 Davide Modolo; Vittorio Ferrari We propose a technique to train semantic part-based models of object classes from Google Images. Our models encompass the appearance of parts and their spatial arrangement on the object, specific to each viewpoint.
We learn these rich models by collecting training instances for both parts and objects, and automatically connecting the two levels. Our framework works incrementally, learning from easy examples first and then gradually adapting to harder ones. A key benefit of this approach is that it requires no manual part location annotations. We evaluate our models on the challenging PASCAL-Part dataset [1] and show how their performance increases at every step of the learning, with the final models more than doubling the performance of directly training from images retrieved by querying for part names (from 12.9 to 27.2 AP). Moreover, we show that our part models can help object detection performance by enriching the R-CNN detector with parts. Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-06-06 Gül Varol; Ivan Laptev; Cordelia Schmid Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames, failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition: UCF101 (92.7%) and HMDB51 (67.2%). Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell.
(IF 8.329) Pub Date : 2017-06-21 Mingliang Chen; Xing Wei; Qingxiong Yang; Qing Li; Gang Wang; Ming-Hsuan Yang We propose a background subtraction algorithm using hierarchical superpixel segmentation, spanning trees and optical flow. First, we generate superpixel segmentation trees using a number of Gaussian Mixture Models (GMMs) by treating each GMM as one vertex to construct spanning trees. Next, we use the $M$-smoother to enhance the spatial consistency on the spanning trees, and estimate optical flow to extend the $M$-smoother to the temporal domain. Experimental results on synthetic and real-world benchmark datasets show that the proposed algorithm performs favorably for background subtraction in videos against the state-of-the-art methods in spite of frequent and sudden changes of pixel values. Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2017-05-31 Yeqing Li; Wei Liu; Junzhou Huang Recently, with the explosive growth of visual content on the Internet, large-scale image search has attracted intensive attention. It has been shown that mapping high-dimensional image descriptors to compact binary codes can lead to considerable efficiency gains in both storage and similarity computation of images. However, most existing methods still suffer from expensive training devoted to large-scale binary code learning. To address this issue, we propose a sub-selection based matrix manipulation algorithm, which can significantly reduce the computational cost of code learning. As case studies, we apply the sub-selection algorithm to several popular quantization techniques, including cases using linear and nonlinear mappings. Crucially, we can justify the resulting sub-selective quantization by proving its theoretical properties. Extensive experiments are carried out on three image benchmarks with up to one million samples, corroborating the efficacy of the sub-selective quantization method in terms of image retrieval.
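The core idea of the sub-selection approach above, fitting the code-learning step on a small row subset rather than on the full data, can be illustrated with a minimal sketch. This is not the authors' algorithm: the PCA-based linear mapping, the sign quantizer, the 16-bit code length, and the 5% subset size are all assumptions chosen for the example.

```python
# Illustrative sketch of sub-selective binary code learning (assumed details,
# not the paper's method): fit a linear mapping on a sub-selected row subset,
# then binarize all descriptors by taking the sign of the projection.
import numpy as np

rng = np.random.default_rng(0)

def fit_projection(X_sub, n_bits):
    """Fit a linear mapping to n_bits dimensions using only the sub-selected
    rows X_sub (here: top principal directions of the subset via SVD)."""
    Xc = X_sub - X_sub.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # cheap on the subset
    return Vt[:n_bits].T  # shape (d, n_bits)

def encode(X, W, mean):
    """Map descriptors to compact binary codes: sign of the projection."""
    return (X - mean) @ W > 0  # boolean array of shape (n, n_bits)

# Toy data: 1000 descriptors in 64-D; the code is learned from a 5% subset,
# so the expensive fitting step never touches the full matrix.
X = rng.standard_normal((1000, 64))
idx = rng.choice(len(X), size=50, replace=False)   # sub-selection
W = fit_projection(X[idx], n_bits=16)
codes = encode(X, W, X[idx].mean(axis=0))

# Hamming distance between binary codes approximates descriptor similarity.
ham = int(np.count_nonzero(codes[0] != codes[1]))
print(codes.shape, ham)
```

The efficiency gain the abstract describes comes from the same place as in this sketch: the cost of fitting the mapping scales with the subset size rather than the full dataset size, while encoding remains a single matrix multiply.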
Updated: 2018-05-05 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-03 Matthias Reso; Jorn Jachalsky; Bodo Rosenhahn; Joern Ostermann A wide variety of computer vision applications rely on superpixel or supervoxel algorithms as a preprocessing step, which underlines the overall importance that these approaches have gained in recent years. However, most methods show a lack of temporal consistency or fail to produce temporally stable superpixels. In this paper, we present an approach to generate temporally consistent superpixels for video content. Our method is formulated as a contour-evolving expectation-maximization framework, which utilizes an efficient label propagation scheme to encourage the preservation of superpixel shapes and their relative positioning over time. By explicitly detecting the occlusion of superpixels and the disocclusion of new image regions, our framework is able to terminate and create superpixels whose corresponding image regions become hidden or newly appear. Additionally, the occluded parts of superpixels are incorporated in the further optimization. This increases the compliance of the superpixel flow with the optical flow present in the scene. Using established benchmark suites, we show the performance of our approach in comparison to state-of-the-art supervoxel and superpixel algorithms for video content. Updated: 2018-05-04 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-03 Qi Dong; Shaogang Gong; Xiatian Zhu Model learning from class-imbalanced training data is a long-standing and significant challenge for machine learning. In particular, existing deep learning methods mostly consider either class-balanced data or moderately imbalanced data in model training, and ignore the challenge of learning from significantly imbalanced training data.
To address this problem, we formulate a class-imbalanced deep learning model based on batch-wise incremental minority (sparsely sampled) class rectification by hard sample mining in majority (frequently sampled) classes during model training. This model is designed to minimise the dominant effect of majority classes by discovering sparsely sampled boundaries of minority classes in an iterative batch-wise learning process. To that end, we introduce a Class Rectification Loss (CRL) function that can be deployed readily in deep network architectures. Extensive experimental evaluations are conducted on three imbalanced person attribute benchmark datasets (CelebA, X-Domain, DeepFashion) and one balanced object category benchmark dataset (CIFAR-100). These experimental results demonstrate the performance advantages and model scalability of the proposed batch-wise incremental minority class rectification model over the existing state-of-the-art models for addressing the problem of imbalanced data learning. Updated: 2018-05-04 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-01 Jian Liang; Ran He; Zhenan Sun; Tieniu Tan Unsupervised domain adaptation aims to leverage labeled source data to learn with unlabeled target data. Previous methods tackle it by seeking a low-dimensional projection to extract the invariant features and building a classifier on the source data. However, they merely concentrate on minimizing the cross-domain distribution divergence, while ignoring the intra-domain structure, especially for the target domain. Even after projection, possible risk factors such as an imbalanced data distribution may still hinder the performance of target label inference. In this paper, we propose a simple yet effective domain-invariant projection ensemble approach to tackle these two issues together.
Specifically, we seek the optimal projection via a novel relaxed domain-irrelevant clustering-promoting term that jointly bridges the cross-domain semantic gap and increases the intra-class compactness in both domains. To further enhance target label inference, we first develop a ‘sampling-and-fusion’ framework, under which multiple projections are independently learned on various randomized coupled domain subsets. Subsequently, aggregation models such as majority voting are utilized to leverage the multiple projections and classify the unlabeled target data. Extensive experimental results on four visual benchmarks including object, face, and digit images demonstrate that the proposed methods gain remarkable margins over state-of-the-art unsupervised domain adaptation methods. Updated: 2018-05-01 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-01 Erbo Li; Hanlin Mo; Dong Xu; Hua Li In this paper, we prove the existence of projective moment invariants of images using finite combinations of weighted moments, with relative projective differential invariants as weight functions. We give some instances constructed in that way, and analyze possible issues that could affect their performance. Procedures are introduced to estimate partial derivatives of discrete images, and a new method is designed to normalize the number of pixels of discrete images so as to minimize the changes before and after the projective transformation. We carry out experiments using popular image databases and real images to test the performance. The results show that the invariants proposed in this paper have better stability and discriminability than other previously used moment invariants in image retrieval and classification. Users can directly extract invariant features of images of a given planar object from different viewpoints without knowing the parameters of the 2D projective transformations.
Therefore, the projective moment invariants could be potentially useful for planar object recognition, image description and classification. Updated: 2018-05-01 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-01 Chen Li; Kun Zhou; Hsiang-Tao Wu; Stephen Lin We present a physically-based approach for simulating makeup in face images. The key idea is to decompose the face image into intrinsic image layers - namely albedo, diffuse shading, and specular highlights - which are each differently affected by cosmetics, and then manipulate each layer according to corresponding models of reflectance. Accurate intrinsic image decompositions for faces are obtained with the help of human face priors, including statistics on skin reflectance and facial geometry. The intrinsic image layers are then transformed in appearance according to measured optical properties of cosmetics and proposed adaptations of physically-based reflectance models. With this approach, realistic results are generated in a manner that preserves the personal appearance features and lighting conditions of the target face while not requiring detailed geometric and reflectance measurements. We demonstrate this technique on various forms of cosmetics including foundation, blush, lipstick, and eye shadow. Results on both images and videos exhibit a close approximation to ground truth and compare favorably to existing techniques. Updated: 2018-05-01 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-01 Shervin Ardeshir; Ali Borji Thanks to the availability of wearable devices such as GoPro cameras, smartphones, and glasses, we now have access to a plethora of videos captured from the first-person perspective. Surveillance cameras and Unmanned Aerial Vehicles (UAVs) also offer tremendous amounts of video data recorded from top and oblique viewpoints. Egocentric and surveillance vision have been studied extensively but separately in the computer vision community.
The relationship between these two domains, however, remains unexplored. In this study, we make the first attempt in this direction by addressing two basic yet challenging questions. First, having a set of egocentric videos and a top-view video, does the top-view video contain all or some of the egocentric viewers? In other words, have these videos been shot in the same environment at the same time? Second, if so, how can we identify the egocentric viewers in the top-view video? These problems can become even more challenging when the videos are not temporally aligned. We model each view using a graph, and compute the assignment and time delays in an iterative, alternating fashion using spectral graph matching and time delay estimation. Such an approach handles the temporal misalignment between the egocentric videos and the top-view video. Updated: 2018-05-01 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-05-01 Jinshan Pan; Wenqi Ren; Zhe Hu; Ming-Hsuan Yang Human faces are one interesting object class with numerous applications. While significant progress has been made on the generic deblurring problem, existing methods are less effective for blurry face images. The success of the state-of-the-art image deblurring algorithms stems mainly from implicit or explicit restoration of salient edges for kernel estimation. However, these methods are less effective on face images, as only a few edges can be restored from blurry face images for kernel estimation. In this paper, we address the problem of deblurring face images by exploiting facial structures. We propose a deblurring algorithm based on an exemplar dataset, without using coarse-to-fine strategies or heuristic edge selections. In addition, we develop a convolutional neural network to restore sharp edges from blurry face images for deblurring. Extensive experiments against the state-of-the-art methods demonstrate the effectiveness of the proposed algorithms for deblurring face images.
In addition, we show that the proposed algorithms can be applied to image deblurring for other object classes. Updated: 2018-05-01 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-30 Bjorn Barz; Erik Rodner; Yanira Guanche Garcia; Joachim Denzler Automatic detection of anomalies in space- and time-varying measurements is an important tool in several fields, e.g., fraud detection, climate analysis, or healthcare monitoring. We present an algorithm for detecting anomalous regions in multivariate spatio-temporal time-series, which allows for spotting the interesting parts in large amounts of data, including video and text data. In contrast to existing techniques for detecting isolated anomalous data points, we propose the “Maximally Divergent Intervals” (MDI) framework for unsupervised detection of coherent spatial regions and time intervals characterized by a high Kullback-Leibler divergence compared with all other data given. In this regard, we define an unbiased Kullback-Leibler divergence that allows for ranking regions of different size, and show how to enable the algorithm to run on large-scale data sets in reasonable time using an interval proposal technique. Experiments on both synthetic and real data from various domains, such as climate analysis, video surveillance, and text forensics, demonstrate that our method is widely applicable and a valuable tool for finding interesting events in different types of data. Updated: 2018-05-01 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-30 Philipp Jauer; Ivo Kuhlemann; Ralf Bruder; Achim Schweikard; Floris Ernst We present a novel framework for rigid point cloud registration. Our approach is based on the principles of mechanics and thermodynamics. We solve the registration problem by treating point clouds as rigid bodies consisting of particles. Forces can be applied between both particle systems so that they attract or repel each other.
These forces are used to cause rigid-body motion of one particle system toward the other, until both are aligned. The framework supports physics-based registration processes with arbitrary driving forces, depending on the desired behaviour. Additionally, the approach handles feature-enhanced point clouds, e.g., with colours or intensity values. Our framework is freely accessible for download. In contrast to existing algorithms, our contribution is to precisely register high-resolution point clouds with nearly constant computational effort and without the need for pre-processing, subsampling or pre-alignment. At the same time, the quality is up to 28% higher than for state-of-the-art algorithms, and up to 49% higher when considering feature-enhanced point clouds. Even in the presence of noise, our registration approach is one of the most robust, on par with state-of-the-art implementations. Updated: 2018-05-01 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-20 Song Bai; Xiang Bai; Qi Tian; Longin Jan Latecki The diffusion process has greatly advanced object retrieval, as it can capture the underlying manifold structure. Recent studies have experimentally demonstrated that tensor product diffusion can better reveal the intrinsic relationship between objects than other variants. However, the principle remains unclear, i.e., what kind of manifold structure is captured. In this paper, we propose a new affinity learning algorithm called Regularized Diffusion Process (RDP). By deeply exploring the properties of RDP, our first yet basic contribution is providing a manifold-based explanation for tensor product diffusion. A novel criterion measuring the smoothness of the manifold is defined, which simultaneously regularizes four vertices in the affinity graph. Inspired by this observation, we further contribute two variants towards two specific goals.
While ARDP can learn similarities across heterogeneous domains, HRDP performs affinity learning on a tensor product hypergraph, since the relationships between objects are generally more complex than pairwise. Consequently, RDP, ARDP and HRDP constitute a generic tool for object retrieval in the most commonly-used settings, whether or not the input relationships between objects are derived from the same domain, and whether or not they are in pairwise formulation. Comprehensive experiments on 10 retrieval benchmarks, especially on large-scale data, validate the effectiveness and generalization of our work. Updated: 2018-04-23 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-20 Xin Jiang Hunt; Rebecca Willett In an era of ubiquitous large-scale streaming data, the availability of data far exceeds the capacity of expert human analysts. In many settings, such data is either discarded or stored unprocessed in data centers. This paper proposes a method of online data thinning, in which large-scale streaming datasets are winnowed to preserve unique, anomalous, or salient elements for timely expert analysis. At the heart of this proposed approach is an online anomaly detection method based on dynamic, low-rank Gaussian mixture models. Specifically, the high-dimensional covariance matrices associated with the Gaussian components are given low-rank models. According to this model, most observations lie near a union of subspaces. The low-rank modeling mitigates the curse of dimensionality associated with anomaly detection for high-dimensional data, and recent advances in subspace clustering and subspace tracking allow the proposed method to adapt to dynamic environments. Furthermore, the proposed method allows subsampling, is robust to missing data, and uses a mini-batch online optimization approach. The resulting algorithms are scalable, efficient, and capable of operating in real time.
Experiments on wide-area motion imagery and e-mail databases illustrate the efficacy of the proposed approach. Updated: 2018-04-23 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-20 Tianyi Zhao; Qiuyu Chen; Zhenzhong Kuang; Jun Yu; Wei Zhang; Ming He; Jianping Fan A deep mixture of diverse experts algorithm is developed by seamlessly combining a set of base deep CNNs (convolutional neural networks) with diverse but overlapping task spaces (outputs) to generate a mixture network with larger outputs, e.g., these base deep CNNs are trained to recognize different subsets of tens of thousands of atomic object classes. One particular base deep CNN with $M+1$ outputs is learned for each task group to recognize its $M$ atomic object classes and identify one special class of "not-in-group", where the network structure of well-designed deep CNNs is directly used to configure such base deep CNNs. For the $M$ semantically-related atomic object classes in the same task group, a deep multi-task learning algorithm is developed to leverage their inter-class visual similarities to learn more discriminative base deep CNNs and a multi-task softmax for enhancing their separability. All these base deep CNNs with diverse but overlapping task spaces are seamlessly combined to build a mixture network with larger outputs for recognizing tens of thousands of atomic object classes. Our experimental results demonstrate that our deep mixture of diverse experts algorithm can achieve very competitive results on large-scale visual recognition. Updated: 2018-04-23 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-20 Wangmeng Zuo; Xiaohe Wu; Liang Lin; Lei Zhang; Ming-Hsuan Yang For visual tracking methods based on kernel support vector machines (SVMs), data sampling is usually adopted to reduce the computational cost in training. In addition, budgeting of support vectors is required for computational efficiency.
Instead of sampling and budgeting, the circulant matrix formed by dense sampling of translated image patches has recently been utilized in kernel correlation filters for fast tracking. In this paper, we derive an equivalent formulation of an SVM model with the circulant matrix expression and present an efficient alternating optimization method for visual tracking. We incorporate the discrete Fourier transform into the proposed alternating optimization process, and pose the tracking problem as an iterative learning of support correlation filters (SCFs). In the fully-supervised setting, our SCF can find the globally optimal solution with real-time performance. For a given circulant data matrix with $n^2$ samples of $n \times n$ pixels, the computational complexity of the proposed algorithm is $O(n^2 \log n)$, whereas that of the standard SVM-based approaches is at least $O(n^4)$. In addition, we extend the SCF-based tracking algorithm with multi-channel features, kernel functions, and scale-adaptive approaches to further improve the tracking performance. Experimental results on a large benchmark dataset show that the proposed SCF-based algorithms perform favorably against the state-of-the-art tracking methods in terms of accuracy and speed. Updated: 2018-04-23 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-20 Han-Jia Ye; De-Chuan Zhan; Yuan Jiang; Zhi-Hua Zhou Linkages are essentially determined by similarity measures that may be derived from multiple perspectives. For example, spatial linkages are usually generated based on localities of heterogeneous data. Semantic linkages, however, can come from even more properties, such as the different physical meanings behind social relations. Many existing metric learning models focus on spatial linkages but leave the rich semantic factors unconsidered.
We propose a Unified Multi-Metric Learning ($\mathrm{UM^2L}$) framework to exploit multiple types of metrics with respect to diverse types of similarities between linkages. In $\mathrm{UM^2L}$, a type of combination operator is introduced for distance characterization from multiple perspectives, which introduces flexibility for representing and utilizing both spatial and semantic linkages. Besides, we propose a uniform solver for $\mathrm{UM^2L}$, and theoretical analysis reflects its generalization ability. Extensive experiments on diverse applications exhibit the superior classification performance and comprehensibility of $\mathrm{UM^2L}$. Visualization results also validate its ability to discover physical meanings. Updated: 2018-04-23 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-20 Yuankai Qi; Shengping Zhang; Lei Qin; Qingming Huang; Hongxun Yao; Jongwoo Lim; Ming-Hsuan Yang Convolutional Neural Networks (CNNs) have been applied to visual tracking with demonstrated success in recent years. Most CNN-based trackers utilize hierarchical features extracted from a certain layer to represent the target. However, features from a certain layer are not always effective for distinguishing the target object from the background, especially in the presence of complicated interfering factors (e.g., heavy occlusion, background clutter, illumination variation, and shape deformation). In this work, we propose a CNN-based tracking algorithm which hedges deep features from different CNN layers to better distinguish target objects from background clutter. Correlation filters are applied to the feature maps of each CNN layer to construct a weak tracker, and all weak trackers are hedged into a strong one.
For robust visual tracking, we propose a hedge method that adaptively determines the weights of the weak trackers by considering both the difference between historical and instantaneous performance, and the differences among all weak trackers over time. In addition, we design a Siamese network to define the loss of each weak tracker for the proposed hedge method. Extensive experiments on large benchmark datasets demonstrate the effectiveness of the proposed algorithm against state-of-the-art tracking methods. Updated: 2018-04-23 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-19 Chunyu Wang; Yizhou Wang; Zhouchen Lin; Alan Yuille We propose a method for estimating 3D human poses from single images or video sequences. The task is challenging because: (a) many 3D poses can have similar 2D pose projections, which makes the lifting ambiguous, and (b) current 2D joint detectors are not accurate, which can cause large errors in the 3D estimates. We represent 3D poses by a sparse combination of bases which encode structural pose priors to reduce the lifting ambiguity. This prior is strengthened by adding limb length constraints. We estimate the 3D pose by minimizing an $L_1$-norm measurement error between the 2D pose and the 3D pose, because it is less sensitive to inaccurate 2D poses. We modify our algorithm to output $K$ 3D pose candidates for an image and, for videos, we impose a temporal smoothness constraint to select the best sequence of 3D poses from the candidates. We demonstrate good results on 3D pose estimation from static images and improved performance by selecting the best 3D pose from the $K$ proposals. Our results on video sequences also show improvements of roughly 15% over static images. Updated: 2018-04-20 • IEEE Trans. Pattern Anal. Mach. Intell.
(IF 8.329) Pub Date : 2018-04-19 Abhishek Das; Satwik Kottur; Khushi Gupta; Avi Singh; Deshraj Yadav; Stefan Lee; José Moura; Devi Parikh; Dhruv Batra We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being sufficiently grounded in vision to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person real-time chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and consists of $\sim$1.2M dialog question-answer pairs from 10-round, human-human dialogs grounded in $\sim$120k images from the COCO dataset. Updated: 2018-04-20 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-19 Zhen Cui; Shengtao Xiao; Zhiheng Niu; Shuicheng Yan; Wenming Zheng An end-to-end network architecture, the Recurrent Shape Regression (RSR), is presented to deal with the task of facial shape detection, a crucial step in many computer vision problems. The RSR generalizes conventional cascaded regression into a recurrent dynamic network by abstracting common latent models with stage-to-stage operations. Instead of an invariant regression transformation, we construct shape-dependent dynamic regressors to attain the recurrence of the regression action itself. The regressors can be stacked into a high-order regression network to represent more complex shape regression.
By further integrating feature learning as well as a global shape constraint, the RSR becomes more controllable in the entire optimization of shape regression, where the gradient computation can be efficiently back-propagated through time. To handle possible partial occlusions of shapes, we propose a mimic virtual occlusion strategy that randomly disturbs certain point cliques, without requiring any annotations of occlusion information or even occluded training data. Extensive experiments on five face datasets demonstrate that the proposed RSR is highly competitive with recent state-of-the-art cascaded approaches. Updated: 2018-04-20 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-17 Jean-Baptiste Fasquel; Nicolas Delanoue This paper presents a method for recovering and identifying image regions from an initial oversegmentation using qualitative knowledge of its content. Compared to recent works favoring spatial information and quantitative techniques, our approach focuses on simple a priori qualitative inclusion and photometric relationships such as "region A is included in region B", "the intensity of region A is lower than that of region B", or "regions A and B depict similar intensities" (photometric uncertainty). The proposed method is based on a two-step inexact graph matching approach. The first step searches for the best subgraph isomorphism candidate between expected regions and a subset of regions resulting from the initial oversegmentation. Then, the remaining segmented regions are progressively merged with appropriate already-matched regions, while preserving coherence with the a priori declared relationships. Strengths and weaknesses of the method are studied on various images (grayscale and color), with various initial oversegmentation algorithms (k-means, meanshift, quickshift). Results show the potential of the method to recover, in a reasonable runtime, expected regions described a priori in a qualitative manner.
For further evaluation and comparison purposes, an open-source Python package implementing the method is provided, together with the specifically built experimental database. Updated: 2018-04-18 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-16 Xiao Fu; Kejun Huang; Nicholas D. Sidiropoulos; Qingjiang Shi; Mingyi Hong In topic modeling, identifiability of the topics is an essential issue. Many topic modeling approaches have been developed under the premise that each topic has an anchor word, which may be fragile in practice because words and terms have multiple uses; yet it is commonly adopted because it enables identifiability guarantees. Remedies in the literature include using three- or higher-order word co-occurrence statistics to come up with tensor factorization models, but identifiability still hinges on additional assumptions. In this work, we propose a new topic identification criterion using second-order statistics of the words. The criterion is theoretically guaranteed to identify the underlying topics even when the anchor-word assumption is grossly violated. An algorithm based on alternating optimization and an efficient primal-dual algorithm are proposed to handle the resulting identification problem. The former exhibits high performance and is completely parameter-free; the latter affords up to 200 times speedup relative to the former, but requires step-size tuning and a slight sacrifice in accuracy. A variety of real text corpora are employed to showcase the effectiveness of the approach, where the proposed anchor-free method demonstrates substantial improvements compared to a number of anchor-word based approaches under various evaluation metrics. Updated: 2018-04-17 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-16 Guangcan Mai; Kai Cao; Pong C YUEN; Anil K. Jain State-of-the-art face recognition systems are based on deep (convolutional) neural networks.
Therefore, it is imperative to determine to what extent face templates derived from deep networks can be inverted to obtain the original face image. In this paper, we study the vulnerabilities of a state-of-the-art face recognition system to a template reconstruction attack. We propose a neighborly de-convolutional neural network (NbNet) to reconstruct face images from their deep templates. In our experiments, we assume that no knowledge of the target subject or the deep network is available. To train the NbNet reconstruction models, we augmented two benchmark face datasets (VGG-Face and Multi-PIE) with a large collection of images synthesized using a face generator. The proposed reconstruction was evaluated using type-I (comparing the reconstructed images against the original face images used to generate the deep templates) and type-II (comparing the reconstructed images against a different face image of the same subject) attacks. Given the images reconstructed from NbNets, we show that for verification, we achieve a TAR of 95.20% (58.05%) on LFW under type-I (type-II) attacks at a FAR of 0.1%. Besides, 96.58% (92.84%) of the images reconstructed from templates of partition fa (fb) can be identified from partition fa in color FERET. Updated: 2018-04-17 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-16 Liuhao Ge; Hui Liang; Junsong Yuan; Daniel Thalmann In this paper, we present a novel method for real-time 3D hand pose estimation from single depth images using 3D Convolutional Neural Networks (CNNs). Image-based features extracted by 2D CNNs are not directly suitable for 3D hand pose estimation due to the lack of 3D spatial information. Our proposed 3D CNN-based method, taking a 3D volumetric representation of the hand depth image as input and extracting 3D features from the volumetric input, can capture the 3D spatial structure of the hand and accurately regress the full 3D hand pose in a single pass.
To make the 3D CNN robust to variations in hand sizes and global orientations, we perform 3D data augmentation on the training data. To further improve the estimation accuracy, we propose to apply 3D deep network architectures and to leverage the complete hand surface as intermediate supervision for learning 3D hand pose from depth images. Extensive experiments on three challenging datasets demonstrate that our proposed approach outperforms baselines and state-of-the-art methods. A cross-dataset experiment also shows that our method has good generalization ability. Furthermore, our method is fast, as our implementation runs at over 91 frames per second on a standard computer with a single GPU. Updated: 2018-04-17 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-16 Jayakorn Vongkulbhisal; Fernando De La Torre; Joao Paulo Costeira Many computer vision problems are formulated as the optimization of a cost function. This approach faces two main challenges: designing a cost function with a local optimum at an acceptable solution, and developing an efficient numerical method to search for this optimum. While designing such functions is feasible in the noiseless case, the stability and location of local optima are largely unknown under noise, occlusion, or missing data. In practice, this can result in undesirable local optima or in not having a local optimum in the expected place. On the other hand, numerical optimization algorithms in high-dimensional spaces are typically local and often rely on expensive first- or second-order information to guide the search. To overcome these limitations, we propose Discriminative Optimization (DO), a method that learns search directions from data without the need for a cost function. DO explicitly learns a sequence of updates in the search space that leads to stationary points corresponding to the desired solutions.
We provide a formal analysis of DO and illustrate its benefits on the problems of 3D registration, camera pose estimation, and image denoising. We show that DO outperforms or matches state-of-the-art algorithms in terms of accuracy, robustness, and computational efficiency. Updated: 2018-04-17 • IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-16 Kang Zhu; Yujia Xue; Qiang Fu; Sing Bing Kang; Xilin Chen; Jingyi Yu In this paper, we describe how scene depth can be extracted using a hyperspectral light field capture (H-LF) system. Our H-LF system consists of a $5 \times 6$ array of cameras, with each camera sampling a different narrow band in the visible spectrum. There are two parts to extracting scene depth. The first part is our novel cross-spectral pairwise matching technique, which involves a new spectral-invariant feature descriptor and its companion matching metric, which we call bidirectional weighted normalized cross correlation (BWNCC). The second part, namely H-LF stereo matching, uses a combination of spectral-dependent correspondence and defocus cues that rely on BWNCC. These two new cost terms are integrated into a Markov Random Field (MRF) for disparity estimation. Experiments on synthetic and real H-LF data show that our approach can produce high-quality disparity maps. We also show that these results can be used to produce the complete plenoptic cube in addition to synthesizing all-focus and defocused color images under different sensor spectral responses.

Updated: 2018-04-17
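The BWNCC metric above builds on classical normalized cross-correlation. The abstract does not specify the bidirectional weighting, so the following is only a minimal sketch of the standard NCC building block between two equal-size patches; the function name `ncc` is our own:

```python
import numpy as np

def ncc(patch_a, patch_b, eps=1e-12):
    """Classical normalized cross-correlation between two equal-size patches.

    Returns a score in [-1, 1]: 1 for identical (up to affine intensity
    change), -1 for inverted patches. BWNCC in the paper adds bidirectional
    weighting across spectral bands, which is not reproduced here.
    """
    a = patch_a.astype(np.float64).ravel()
    b = patch_b.astype(np.float64).ravel()
    a -= a.mean()          # remove mean so the score is brightness-invariant
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + eps
    return float(np.dot(a, b) / denom)
```

Because the score is invariant to per-patch gain and offset, a metric of this family is a natural starting point when matching patches captured through different spectral bands.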
• IEEE Trans. Pattern Anal. Mach. Intell. (IF 8.329) Pub Date : 2018-04-13
Joseph Peter Robinson; Ming Shao; Yue Wu; Hongfu Liu; Timothy Gillis; Yun Fu

We present the largest database for visual kinship recognition, Families In the Wild (FIW), with over 13,000 family photos of 1,000 family trees with 4-to-38 members each. It took only a small team to build FIW by designing efficient labeling tools and a work-flow. To extend FIW, we further improved this process with a novel semi-automatic labeling scheme that used annotated faces and unlabeled text metadata to discover labels, which were then used, along with the existing FIW data, by the proposed clustering algorithm to generate label proposals for all newly added data. Both processes are described and compared in depth, showing great savings in the time and human input required. Essentially, the proposed clustering algorithm is semi-supervised and uses labeled data to produce more accurate clusters. We statistically compare FIW to related datasets, which clearly shows the enormous gains in overall size and in the amount of information encapsulated in the labels. We benchmark two tasks, kinship verification and family classification, at scales incomparably larger than ever before. Pre-trained CNN models fine-tuned on FIW outscore conventional methods and achieve state-of-the-art results on the renowned KinWild datasets. We also measure human performance on kinship recognition and compare it to a fine-tuned CNN.

Updated: 2018-04-14
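The semi-supervised clustering step above uses labeled faces to anchor clusters of unlabeled ones. One common way to do this is seeded k-means, sketched below as an assumed stand-in (the paper's exact algorithm, and the `seeded_kmeans` name and interface, are not from the abstract):

```python
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, n_iter=20):
    """Seeded k-means: labeled points fix their cluster and seed the centroids.

    X           : (n, d) feature array (e.g., face embeddings)
    seed_idx    : indices of labeled rows in X
    seed_labels : integer class labels for those rows, in 0..k-1
    """
    k = int(seed_labels.max()) + 1
    # Initialize each centroid as the mean of its labeled members.
    centroids = np.stack([X[seed_idx[seed_labels == c]].mean(axis=0)
                          for c in range(k)])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign every point to its nearest centroid ...
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # ... but labeled points always keep their ground-truth cluster.
        labels[seed_idx] = seed_labels
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels
```

The key difference from plain k-means is that the labeled rows never change cluster, so every centroid update is pulled toward its ground-truth family, which is how labeled data "produce more accurate clusters" in the sense described above.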