Current journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Robust Kronecker Component Analysis
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-11-15
    Mehdi Bahri; Yannis Panagakis; Stefanos P. Zafeiriou

    Dictionary learning and component analysis models are fundamental for learning compact representations that are relevant to a given task. The model complexity is encoded by means of structure, such as sparsity, low-rankness, or nonnegativity. Unfortunately, approaches like K-SVD that learn dictionaries for sparse coding via Singular Value Decomposition (SVD) are hard to scale and fragile in the presence of outliers. Conversely, robust component analysis methods such as Robust Principal Component Analysis (RPCA) are able to recover low-complexity representations from data corrupted with noise of unknown magnitude and support, but do not provide a dictionary that respects the structure of the data, and also involve expensive computations. In this paper, we propose a novel Kronecker-decomposable component analysis model, coined Robust Kronecker Component Analysis (RKCA), that combines ideas from sparse dictionary learning and robust component analysis. RKCA has several appealing properties: it is robust to gross corruption, can be used for low-rank modeling, and leverages separability to solve significantly smaller problems. We design an efficient learning algorithm by drawing links with tensor factorizations, and analyze its optimality and low-rankness properties. The effectiveness of the proposed approach is demonstrated on real-world applications, namely background subtraction and image denoising and completion, through a thorough comparison with the current state of the art.
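
    A minimal numpy sketch of a Kronecker-structured robust decomposition in the spirit of RKCA: each observation X_i is modeled as A R_i B^T plus a sparse outlier term E_i, with shared factors A and B. The plain proximal alternating scheme and all names below are illustrative assumptions, not the authors' exact algorithm.

    import numpy as np

    def soft(Z, t):
        """Elementwise soft-thresholding, the prox of the l1 norm."""
        return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

    def rkca_sketch(X, rank=5, lam=0.1, iters=100, step=1e-2, seed=0):
        n, m, p = X.shape                          # n samples of m x p matrices
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((m, rank)) * 0.1   # left dictionary (shared)
        B = rng.standard_normal((p, rank)) * 0.1   # right dictionary (shared)
        R = np.zeros((n, rank, rank))              # per-sample core codes
        E = np.zeros_like(X)                       # sparse corruption
        for _ in range(iters):
            D = np.einsum('mr,nrs,ps->nmp', A, R, B) + E - X   # residual
            A -= step * np.einsum('nmp,nrs,ps->mr', D, R, B)   # gradient steps
            B -= step * np.einsum('nmp,nrs,mr->ps', D, R, A)
            R -= step * np.einsum('mr,nmp,ps->nrs', A, D, B)
            recon = np.einsum('mr,nrs,ps->nmp', A, R, B)
            E = soft(X - recon, lam)               # exact prox for the outliers
        return A, R, B, E

    A, R, B, E = rkca_sketch(np.random.randn(20, 30, 30))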

    Updated: 2018-11-16
  • SPFTN: A Joint Learning Framework for Localizing and Segmenting Objects in Weakly Labeled Videos
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-11-13
    Dingwen Zhang; Junwei Han; Le Yang; Dong Xu

    Object localization and segmentation in weakly labeled videos are two interesting yet challenging tasks. Models built for simultaneous object localization and segmentation have been explored in the conventional fully supervised learning scenario to boost the performance of each task. However, none of the existing works has attempted to jointly learn object localization and segmentation models under weak supervision. To this end, we propose a joint learning framework called Self-Paced Fine-Tuning Network (SPFTN) for localizing and segmenting objects in weakly labeled videos. Learning the deep model jointly for object localization and segmentation under weak supervision is very challenging, as the learning process of each single task would face a serious ambiguity issue due to the lack of bounding-box or pixel-level supervision. To address this problem, our proposed deep SPFTN model is carefully designed with a novel multi-task self-paced learning objective, which leverages task-specific prior knowledge and the knowledge that has already been captured to infer confident training samples for each task. By aggregating the confident knowledge from each single task to mine reliable patterns and learning a deep feature representation for both tasks, the proposed learning framework can address the ambiguity issue under weak supervision with simple optimization. Comprehensive experiments on the large-scale YouTube-Objects and DAVIS datasets demonstrate that the proposed approach achieves superior performance when compared with other state-of-the-art methods and the baseline networks/models.

    Updated: 2018-11-14
  • Open Set Domain Adaptation for Image and Action Recognition
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 
    Pau Panareda Busto; Ahsan Iqbal; Juergen Gall

    Since annotating and curating large datasets is very expensive, there is a need to transfer the knowledge from existing annotated datasets to unlabelled data. Data that is relevant for a specific application, however, usually differs from publicly available datasets since it is sampled from a different domain. While domain adaptation methods compensate for such a domain shift, they assume that all categories in the target domain are known and match the categories in the source domain. Since this assumption is violated under real-world conditions, we propose an approach for open set domain adaptation where the target domain contains instances of categories that are not present in the source domain. The proposed approach achieves state-of-the-art results on various datasets for image classification and action recognition. Since the approach can be used for open set and closed set domain adaptation, as well as unsupervised and semi-supervised domain adaptation, it is a versatile tool for many applications.

    Updated: 2018-11-12
  • Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-11-09
    Feng Liu; Tao Xiang; Timothy M. Hospedales; Wankou Yang; Changyin Sun

    In this paper we propose the inverse problem of VQA (iVQA). The iVQA task is to generate a question that corresponds to a given image and answer pair. We propose a variational iVQA model that can generate diverse, grammatically correct and content-correlated questions that match the given answer. Based on this model, we show that iVQA is an interesting benchmark for visuo-linguistic understanding, and a more challenging alternative to VQA, because an iVQA model needs to understand the image better to be successful. As a second contribution, we show how to use iVQA in a novel reinforcement learning framework to diagnose any existing VQA model by exposing its belief set: the set of question-answer pairs that the VQA model would predict true for a given image. This provides a completely new window into what VQA models ‘believe’ about images. We show that existing VQA models have more erroneous beliefs than previously thought, revealing their intrinsic weaknesses. Suggestions are then made on how to address these weaknesses going forward.

    Updated: 2018-11-10
  • Person Re-Identification by Cross-View Multi-Level Dictionary Learning
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-10-26
    Sheng Li; Ming Shao; Yun Fu

    Person re-identification plays an important role in many safety-critical applications. Existing works mainly focus on extracting patch-level features or learning distance metrics. However, the representation power of extracted features might be limited, due to the various viewing conditions of pedestrian images in complex real-world scenarios. To improve the representation power of features, we learn discriminative and robust representations via dictionary learning in this paper. First, we propose a Cross-view Dictionary Learning (CDL) model, which is a general solution to the multi-view learning problem. Inspired by dictionary learning based domain adaptation, CDL learns a pair of dictionaries from two views. In particular, CDL adopts a projective learning strategy, which is more efficient than the $l_1$ optimization in traditional dictionary learning. Second, we propose a Cross-view Multi-level Dictionary Learning (CMDL) approach based on CDL. CMDL contains dictionary learning models at different representation levels, including the image level, horizontal part level, and patch level. The proposed models take advantage of the view-consistency information, and adaptively learn pairs of dictionaries to generate robust and compact representations for pedestrian images. Third, we incorporate a discriminative regularization term into CMDL, and propose a CMDL-Dis approach which learns pairs of discriminative dictionaries at the image and part levels. We devise efficient optimization algorithms to solve the proposed models. Finally, a fusion strategy is utilized to generate the similarity scores for test images. Experiments on the public VIPeR, CUHK Campus, iLIDS, GRID and PRID450S datasets show that our approach achieves state-of-the-art performance.

    Updated: 2018-11-08
  • Action Recognition with Dynamic Image Networks
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-02
    Hakan Bilen; Basura Fernando; Efstratios Gavves; Andrea Vedaldi

    We introduce the concept of the dynamic image, a novel compact representation of videos useful for video analysis, particularly in combination with convolutional neural networks (CNNs). A dynamic image encodes temporal data such as RGB or optical flow videos by using the concept of ‘rank pooling’. The idea is to learn a ranking machine that captures the temporal evolution of the data and to use the parameters of the latter as a representation. We call the resulting representation a dynamic image because it summarizes the video dynamics in addition to appearance. This powerful idea makes it possible to convert any video to an image, so that existing CNN models pre-trained on still images can be immediately extended to videos. We also present an efficient approximate rank pooling operator that runs two orders of magnitude faster than the standard ones without any loss in ranking performance and can be formulated as a CNN layer. To demonstrate the power of the representation, we introduce a novel four-stream CNN architecture which can learn from RGB and optical flow frames as well as from their dynamic image representations. We show that the proposed network achieves state-of-the-art performance, with 95.5 and 72.5 percent accuracy on UCF101 and HMDB51, respectively.
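
    Since the approximate operator is just a fixed weighted sum of frames, it is easy to sketch. The minimal numpy version below uses the closed-form rank-pooling coefficients as they are commonly stated for dynamic images; treat the exact constants as an assumption of this sketch.

    import numpy as np

    def dynamic_image(video):
        """video: (T, H, W, C) float array -> (H, W, C) dynamic image."""
        T = video.shape[0]
        # H[t] = sum_{i<=t} 1/i, with H[0] = 0
        H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
        t = np.arange(1, T + 1)
        alpha = 2.0 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
        return np.tensordot(alpha, video, axes=(0, 0))

    # A random 16-frame clip collapses to a single frame-like image that any
    # still-image CNN can consume.
    clip = np.random.rand(16, 224, 224, 3).astype(np.float32)
    di = dynamic_image(clip)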

    Updated: 2018-11-05
  • Clickstream Analysis for Crowd-Based Object Segmentation with Confidence
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-27
    Eric Heim; Alexander Seitel; Jonas Andrulis; Fabian Isensee; Christian Stock; Tobias Ross; Lena Maier-Hein

    With the rapidly increasing interest in machine learning based solutions for automatic image annotation, the availability of reference annotations for algorithm training is one of the major bottlenecks in the field. Crowdsourcing has evolved as a valuable option for low-cost and large-scale data annotation; however, quality control remains a major issue which needs to be addressed. To our knowledge, we are the first to analyze the annotation process to improve crowd-sourced image segmentation. Our method involves training a regressor to estimate the quality of a segmentation from the annotator's clickstream data. The quality estimation can be used to identify spam and to weight individual annotations by their (estimated) quality when merging multiple segmentations of one image. Using a total of 29,000 crowd annotations performed on publicly available data of different object classes, we show that (1) our method is highly accurate in estimating the segmentation quality based on clickstream data, and (2) it outperforms state-of-the-art methods for merging multiple annotations. As the regressor does not need to be trained on the object class that it is applied to, it can be regarded as a low-cost option for quality control and confidence analysis in the context of crowd-based image annotation.
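
    The merging step lends itself to a short illustration. Below is a minimal numpy sketch of quality-weighted fusion of crowd segmentations, assuming the per-annotator quality scores (in practice predicted by the clickstream regressor) are already given; the names and the simple weighted vote are assumptions of this sketch.

    import numpy as np

    def merge_segmentations(masks, quality, thresh=0.5):
        """masks: (K, H, W) binary masks from K annotators;
        quality: (K,) estimated segmentation quality of each annotator."""
        w = np.asarray(quality, dtype=float)
        w = w / w.sum()                                  # normalized weights
        prob = np.tensordot(w, masks.astype(float), axes=(0, 0))
        return (prob >= thresh).astype(np.uint8)         # weighted vote

    masks = np.stack([np.random.rand(64, 64) > 0.5 for _ in range(5)])
    fused = merge_segmentations(masks, quality=[0.9, 0.8, 0.2, 0.7, 0.1])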

    Updated: 2018-11-05
  • Cross Euclidean-to-Riemannian Metric Learning with Application to Face Recognition from Video
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-22
    Zhiwu Huang; Ruiping Wang; Shiguang Shan; Luc Van Gool; Xilin Chen

    Riemannian manifolds have been widely employed for video representations in visual classification tasks including video-based face recognition. The success mainly derives from learning a discriminant Riemannian metric which encodes the non-linear geometry of the underlying Riemannian manifolds. In this paper, we propose a novel metric learning framework to learn a distance metric across a Euclidean space and a Riemannian manifold to fuse the average appearance and pattern variation of faces within one video. The proposed metric learning framework can handle three typical tasks of video-based face recognition: the Video-to-Still, Still-to-Video and Video-to-Video settings. To accomplish this new framework, by exploiting typical Riemannian geometries for kernel embedding, we map the source Euclidean space and Riemannian manifold into a common Euclidean subspace, each through a corresponding high-dimensional Reproducing Kernel Hilbert Space (RKHS). With this mapping, the problem of learning a cross-view metric between the two source heterogeneous spaces can be converted to learning a single-view Euclidean distance metric in the target common Euclidean space. By learning on heterogeneous data with shared labels, the discriminant metric in the common space improves face recognition from videos. Extensive experiments on four challenging video face databases demonstrate that the proposed framework has a clear advantage over state-of-the-art methods in the three classical video-based face recognition scenarios.

    Updated: 2018-11-05
  • Ensembles of Lasso Screening Rules
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-24
    Seunghak Lee; Nico Görnitz; Eric P. Xing; David Heckerman; Christoph Lippert

    In order to solve large-scale lasso problems, screening algorithms have been developed that discard features with zero coefficients based on a computationally efficient screening rule. Most existing screening rules were developed from a spherical constraint and half-space constraints on a dual optimal solution. However, existing rules admit at most two half-space constraints, due to the computational cost incurred by the half-spaces, even though additional constraints may be useful to discard more features. In this paper, we present AdaScreen, an adaptive lasso screening rule ensemble, which can combine any one sphere with multiple half-space constraints on a dual optimal solution. Thanks to geometrical considerations that lead to a simple closed-form solution for AdaScreen, we can incorporate multiple half-space constraints at small computational cost. In our experiments, we show that AdaScreen with multiple half-space constraints simultaneously improves screening performance and speeds up lasso solvers.
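
    To make the notion of a safe sphere test concrete, here is a minimal numpy sketch of the classic SAFE rule for the lasso, shown only to illustrate the kind of constraint AdaScreen combines; AdaScreen itself adds multiple half-space constraints, which this sketch omits.

    import numpy as np

    def safe_sphere_screen(X, y, lam):
        """Features that can be provably discarded for
        min_b 0.5*||y - X b||^2 + lam*||b||_1 (returns a boolean mask)."""
        lam_max = np.max(np.abs(X.T @ y))     # smallest lam giving an all-zero solution
        slack = np.linalg.norm(X, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
        return np.abs(X.T @ y) < lam - slack  # True => coefficient is certainly zero

    X = np.random.randn(100, 500)
    y = np.random.randn(100)
    discard = safe_sphere_screen(X, y, lam=0.5 * np.max(np.abs(X.T @ y)))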

    Updated: 2018-11-05
  • Graph Matching with Adaptive and Branching Path Following
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-10-30
    Tao Wang; Haibin Ling; Congyan Lang; Songhe Feng

    Graph matching aims at establishing correspondences between graph elements, and is widely used in many computer vision tasks. Among recently proposed graph matching algorithms, those utilizing the path following strategy have attracted special research attention due to their state-of-the-art performance. However, the paths computed in these algorithms often contain singular points, which can hurt the matching performance if not dealt with properly. To address this issue, we propose a novel path following strategy, named branching path following (BPF), to improve graph matching accuracy. In particular, we first propose a singular point detector by solving a KKT system, and then design a branch switching method to seek better paths at singular points. Moreover, to reduce the computational burden of the BPF strategy, an adaptive path estimation (APE) strategy is integrated into BPF to accelerate the convergence of searching along each path. A new graph matching algorithm named ABPF-G is developed by applying APE and BPF to a recently proposed path following algorithm named GNCCP (Liu & Qiao 2014). Experimental results show that our approach consistently outperforms state-of-the-art graph matching algorithms on five public benchmark datasets.

    Updated: 2018-11-05
  • Guaranteed Outlier Removal for Point Cloud Registration with Correspondences
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-14
    Álvaro Parra Bustos; Tat-Jun Chin

    An established approach for 3D point cloud registration is to estimate the registration function from 3D keypoint correspondences. Typically, a robust technique is required to conduct the estimation, since there are false correspondences or outliers. Current 3D keypoint techniques are much less accurate than their 2D counterparts, so they tend to produce extremely high outlier rates. A large number of putative correspondences must thus be extracted to ensure that sufficient good correspondences are available. Both factors (high outlier rates, large data sizes), however, cause existing robust techniques to require very high computational cost. In this paper, we present a novel preprocessing method called guaranteed outlier removal for point cloud registration. Our method reduces the input to a smaller set, in such a way that any rejected correspondence is guaranteed not to exist in the globally optimal solution. The reduction is performed using purely geometric operations which are deterministic and fast. Our method significantly reduces the population of outliers, such that further optimization can be performed quickly. Further, since only true outliers are removed, the globally optimal solution is preserved. On various synthetic and real data experiments, we demonstrate the effectiveness of our preprocessing method. Demo code is available as supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2017.2773482.

    Updated: 2018-11-05
  • Hand-Object Contact Force Estimation from Markerless Visual Tracking
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-10-26
    Tu-Hoa Pham; Nikolaos Kyriazis; Antonis A. Argyros; Abderrahmane Kheddar

    We consider the problem of estimating realistic contact forces during manipulation, backed with ground-truth measurements, using vision alone. Interaction forces are usually measured by mounting force transducers onto the manipulated objects or the hands. Those are costly, cumbersome, and alter the objects’ physical properties and their perception by the human sense of touch. Our work establishes that interaction forces can be estimated in a cost-effective, reliable, non-intrusive way using vision. This is a complex and challenging problem. Indeed, in multi-contact, a given motion can generally be caused by an infinity of possible force distributions. To alleviate the limitations of traditional models based on inverse optimization, we collect and release the first large-scale dataset on manipulation kinodynamics, comprising 3.2 hours of synchronized force and motion measurements under 193 object-grasp configurations. We learn a mapping between high-level kinematic features based on the equations of motion and the underlying manipulation forces using recurrent neural networks (RNNs). The RNN predictions are consistently refined using physics-based optimization through second-order cone programming (SOCP). We show that our method can successfully capture interaction forces compatible with both the observations and the way humans intuitively manipulate objects, using a single RGB-D camera.

    Updated: 2018-11-05
  • Information Dropout: Learning Optimal Representations Through Noisy Computation
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-01-10
    Alessandro Achille; Stefano Soatto

    The cross-entropy loss commonly used in deep learning is closely related to the defining properties of optimal representations, but does not enforce some of the key properties. We show that this can be solved by adding a regularization term, which is in turn related to injecting multiplicative noise in the activations of a Deep Neural Network, a special case of which is the common practice of dropout. We show that our regularized loss function can be efficiently minimized using Information Dropout, a generalization of dropout rooted in information theoretic principles that automatically adapts to the data and can better exploit architectures of limited capacity. When the task is the reconstruction of the input, we show that our loss function yields a Variational Autoencoder as a special case, thus providing a link between representation learning, information theory and variational inference. Finally, we prove that we can promote the creation of optimal disentangled representations simply by enforcing a factorized prior, a fact that has been observed empirically in recent work. Our experiments validate the theoretical intuitions behind our method, and we find that Information Dropout achieves a comparable or better generalization performance than binary dropout, especially on smaller models, since it can automatically adapt the noise to the structure of the network, as well as to the test sample.
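
    The multiplicative-noise view is easy to illustrate. Below is a minimal numpy sketch of an Information-Dropout-style layer in which each activation is scaled by log-normal noise; in the actual method the noise scale alpha is produced by the network itself, so the fixed alpha and all names here are assumptions of this sketch.

    import numpy as np

    def information_dropout(h, alpha, rng, training=True):
        """h: activations; alpha: per-unit noise scales (same shape as h),
        learned in practice by an extra layer of the network."""
        if not training:
            return h                                        # noise-free at test time
        eps = np.exp(alpha * rng.standard_normal(h.shape))  # log eps ~ N(0, alpha^2)
        return h * eps

    rng = np.random.default_rng(0)
    h = np.maximum(rng.standard_normal((8, 32)), 0.0)       # ReLU activations
    noisy = information_dropout(h, alpha=0.3 * np.ones_like(h), rng=rng)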

    Updated: 2018-11-05
  • Learning Consensus Representation for Weak Style Classification
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-09
    Shuhui Jiang; Ming Shao; Chengcheng Jia; Yun Fu

    Style classification (e.g., Baroque and Gothic architecture styles) is attracting increasing attention in many fields such as fashion, architecture, and manga. Most existing methods focus on extracting discriminative features from local patches or patterns. However, the spread-out phenomenon in style classification has not been recognized yet: visually less representative images in a style class are usually very diverse and easily misclassified. We name them weak style images. Another issue when employing multiple visual features for effective weak style classification is the lack of consensus among different features; that is, the weights for different visual features of the same local patch should be allocated similar values. To address these issues, we propose a Consensus Style Centralizing Auto-Encoder (CSCAE) for learning a robust style feature representation, especially for weak style classification. First, we propose a Style Centralizing Auto-Encoder (SCAE) which centralizes weak style features in a progressive way. Then, based on SCAE, we propose both non-linear and linear versions of CSCAE, which adaptively allocate weights for different features during the progressive centralization process. Consensus constraints are added based on the assumption that the weights of different features of the same patch should be similar. Specifically, the proposed linear counterpart of CSCAE, motivated by the “shared weights” idea as well as group sparsity, improves both efficacy and efficiency. For evaluation, we experiment extensively on fashion, manga and architecture style classification problems. In addition, we collect a new dataset, Online Shopping, for fashion style classification, which will be made publicly available for vision-based fashion style research. Experiments demonstrate the effectiveness of SCAE and CSCAE on both the public and the newly collected datasets when compared with the most recent state-of-the-art works.

    Updated: 2018-11-05
  • Learning Kinematic Structure Correspondences Using Multi-Order Similarities
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-24
    Hyung Jin Chang; Tobias Fischer; Maxime Petit; Martina Zambelli; Yiannis Demiris

    In this paper, we present a novel framework for finding the kinematic structure correspondences between two articulated objects in videos via hypergraph matching. In contrast to appearance and graph alignment based matching methods, which have been applied to two similar static images, the proposed method finds correspondences between two dynamic kinematic structures of heterogeneous objects in videos. Thus our method allows matching the structure of objects which have similar topologies or motions, or a combination of the two. Our main contributions can be summarised as follows: (i) casting the kinematic structure correspondence problem into a hypergraph matching problem by incorporating multi-order similarities with normalising weights, (ii) introducing a structural topology similarity measure by aggregating topology constrained subgraph isomorphisms, (iii) measuring kinematic correlations between pairwise nodes, and (iv) proposing a combinatorial local motion similarity measure using geodesic distance on the Riemannian manifold. We demonstrate the robustness and accuracy of our method through a number of experiments on synthetic and real data, outperforming various other state-of-the-art methods. Our method is not limited to a specific application or sensor, and can be used as a building block in applications such as action recognition, human motion retargeting to robots, and articulated object manipulation, amongst others.

    Updated: 2018-11-05
  • Learning without Forgetting
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-14
    Zhizhong Li; Derek Hoiem

    When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to the commonly used feature extraction and fine-tuning adaptation techniques, and performs similarly to multitask learning that uses original task data we assume unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning with similar old and new task datasets for improved new task performance.
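
    The training objective behind this behavior is compact enough to sketch. Below is a minimal numpy version of a Learning-without-Forgetting-style loss: cross-entropy on the new task plus a distillation term that keeps the old-task outputs close to the responses recorded before fine-tuning. The temperature, the weighting, and the function names are assumptions of this sketch.

    import numpy as np

    def softmax(z, T=1.0):
        z = z / T
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def lwf_loss(new_logits, labels, old_logits, recorded_old, T=2.0, lam=1.0):
        # standard cross-entropy on the new task
        p_new = softmax(new_logits)
        ce = -np.mean(np.log(p_new[np.arange(len(labels)), labels] + 1e-12))
        # distillation: keep current old-task outputs close to recorded ones
        p, q = softmax(old_logits, T), softmax(recorded_old, T)
        kd = -np.mean(np.sum(q * np.log(p + 1e-12), axis=1))
        return ce + lam * kd

    rng = np.random.default_rng(0)
    loss = lwf_loss(rng.standard_normal((4, 10)), np.array([1, 3, 5, 7]),
                    rng.standard_normal((4, 20)), rng.standard_normal((4, 20)))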

    Updated: 2018-11-05
  • Linear Maximum Margin Classifier for Learning from Uncertain Data
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-10
    Christos Tzelepis; Vasileios Mezaris; Ioannis Patras

    In this paper, we propose a maximum margin classifier that deals with uncertainty in data input. More specifically, we reformulate the SVM framework such that each training example can be modeled by a multi-dimensional Gaussian distribution described by its mean vector and its covariance matrix—the latter modeling the uncertainty. We address the classification problem and define a cost function that is the expected value of the classical SVM cost when data samples are drawn from the multi-dimensional Gaussian distributions that form the set of the training examples. Our formulation approximates the classical SVM formulation when the training examples are isotropic Gaussians with variance tending to zero. We arrive at a convex optimization problem, which we solve efficiently in the primal form using a stochastic gradient descent approach. The resulting classifier, which we name SVM with Gaussian Sample Uncertainty (SVM-GSU), is tested on synthetic data and five publicly available and popular datasets; namely, the MNIST, WDBC, DEAP, TV News Channel Commercial Detection, and TRECVID MED datasets. Experimental results verify the effectiveness of the proposed method.
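
    The expected-loss idea can be illustrated without the paper's closed form. The sketch below is a hedged Monte-Carlo stand-in: it draws samples from one training example's Gaussian N(mu, Sigma) and averages the hinge-loss gradient, which is what a primal SGD step would use; the closed-form expectation derived in the paper is not reproduced here, and all names are assumptions.

    import numpy as np

    def expected_hinge_grad(w, b, mu, Sigma, y, rng, n_draws=64):
        """Monte-Carlo gradient of E_{x~N(mu,Sigma)}[max(0, 1 - y*(w.x + b))]."""
        L = np.linalg.cholesky(Sigma)
        x = mu + rng.standard_normal((n_draws, len(mu))) @ L.T
        active = (1.0 - y * (x @ w + b)) > 0        # samples inside the margin
        gw = -(y * active[:, None] * x).mean(axis=0)
        gb = -(y * active).mean()
        return gw, gb

    rng = np.random.default_rng(0)
    w, b = np.zeros(3), 0.0
    gw, gb = expected_hinge_grad(w, b, np.ones(3), 0.1 * np.eye(3), y=1, rng=rng)
    w, b = w - 0.1 * gw, b - 0.1 * gb               # one SGD step (regularizer omitted)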

    Updated: 2018-11-05
  • Proposal-Free Network for Instance-Level Object Segmentation
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-22
    Xiaodan Liang; Liang Lin; Yunchao Wei; Xiaohui Shen; Jianchao Yang; Shuicheng Yan

    Instance-level object segmentation is an important yet under-explored task. Most state-of-the-art methods rely on region proposal methods to extract candidate segments and then utilize object classification to produce final results. Nonetheless, generating reliable region proposals is itself a quite challenging and unsolved task. In this work, we propose a Proposal-Free Network (PFN) to address the instance-level object segmentation problem, which outputs the numbers of instances of different categories and the pixel-level information on i) the coordinates of the instance bounding box each pixel belongs to, and ii) the confidences of different categories for each pixel, based on a pixel-to-pixel deep convolutional neural network. All the outputs together, by using any off-the-shelf clustering method for simple post-processing, can naturally generate the ultimate instance-level object segmentation results. The whole PFN can be easily trained without the requirement of a proposal generation stage. Extensive evaluations on the challenging PASCAL VOC 2012 semantic segmentation benchmark demonstrate the effectiveness of the proposed PFN solution without relying on any proposal generation methods.

    Updated: 2018-11-05
  • Safe Feature Screening for Generalized LASSO
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-22
    Shaogang Ren; Shuai Huang; Jieping Ye; Xiaoning Qian

    Solving Generalized LASSO (GL) problems is challenging, particularly when analyzing many features with a complex interacting structure. Recent developments have found effective ways to identify inactive features so that they can be removed or aggregated to reduce the problem size before applying optimization solvers for learning. However, existing methods are mostly devoted to special cases of GL problems with special structures for feature interactions, such as chains or trees. Developing screening rules, particularly safe screening rules, to remove or aggregate features with general interaction structures calls for a very different screening approach for GL problems. To tackle this challenge, we formulate the GL screening problem as a bound estimation problem in a large linear inequality system when solving GL problems in the dual space. We propose a novel bound propagation algorithm for efficient safe screening for general GL problems, which can be further enhanced by novel transformation methods that effectively decouple interactions among features. The proposed propagation and transformation methods are applicable with dynamic screening, which can easily initiate the screening process, whereas existing screening methods require knowledge of the solution under a desirable regularization parameter. Experiments on both synthetic and real-world data demonstrate the effectiveness of the proposed screening method.

    Updated: 2018-11-05
  • Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-09
    Jun Liu; Amir Shahroudy; Dong Xu; Alex C. Kot; Gang Wang

    Skeleton-based human action recognition has attracted a lot of research attention during the past few years. Recent works attempted to utilize recurrent neural networks to model the temporal dependencies between the 3D positional configurations of human body joints for better analysis of human activities in skeletal data. The proposed work extends this idea to the spatial domain as well as the temporal domain, to better analyze the hidden sources of action-related information within human skeleton sequences in both of these domains simultaneously. Based on the pictorial structure of Kinect's skeletal data, an effective tree-structure based traversal framework is also proposed. In order to deal with the noise in the skeletal data, a new gating mechanism within the LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell. Moreover, we introduce a novel multi-modal feature fusion strategy within the LSTM unit in this paper. The comprehensive experimental results on seven challenging benchmark datasets for human action recognition demonstrate the effectiveness of the proposed method.

    Updated: 2018-11-05
  • Tetrahedron Based Fast 3D Fingerprint Identification Using Colored LEDs Illumination
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-09
    Chenhao Lin; Ajay Kumar

    Emerging 3D fingerprint recognition technologies have attracted growing attention for addressing the limitations of contact-based fingerprint acquisition and improving recognition accuracy. However, the complex 3D imaging setups employed in these systems typically require structured lighting with scanners or multiple cameras, which are bulky and costly. This paper presents a more accurate and efficient 3D fingerprint identification approach using a single 2D camera with multiple colored LED illumination. A 3D minutiae tetrahedron based algorithm is developed to more efficiently match recovered minutiae features in 3D space and address the limitations of 3D minutiae matching approaches in the literature. This algorithm reduces the matching time by a factor of about 15 relative to the state of the art. A hierarchical tetrahedron matching scheme is also developed to further improve the matching accuracy at even faster speed. The 2D images acquired to reconstruct the 3D fingerprints are also used to recover 2D minutiae and further improve the matching performance for 3D fingerprints. A new two-session database acquired from 300 different clients, consisting of 2,760 3D fingerprints reconstructed from 5,520 colored 2D fingerprints, is also developed and shared in the public domain to further advance much-needed research in this area. Extensive experimental results presented in this paper validate our approach and demonstrate the effectiveness of the proposed algorithms.

    Updated: 2018-11-05
  • Unsupervised Deep Hashing with Similarity-Adaptive and Discrete Optimization
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-01-05
    Fumin Shen; Yan Xu; Li Liu; Yang Yang; Zi Huang; Heng Tao Shen

    Recent vision and learning studies show that learning compact hash codes can facilitate massive data processing with significantly reduced storage and computation. In particular, learning deep hash functions has greatly improved the retrieval performance, typically under semantic supervision. In contrast, current unsupervised deep hashing algorithms can hardly achieve satisfactory performance due to either the relaxed optimization or the absence of a similarity-sensitive objective. In this work, we propose a simple yet effective unsupervised hashing framework, named Similarity-Adaptive Deep Hashing (SADH), which alternatingly proceeds over three training modules: deep hash model training, similarity graph updating and binary code optimization. The key difference from the widely-used two-step hashing method is that the output representations of the learned deep model help update the similarity graph matrix, which is then used to improve the subsequent code optimization. In addition, for producing high-quality binary codes, we devise an effective discrete optimization algorithm which can directly handle the binary constraints with a general hashing loss. Extensive experiments validate the efficacy of SADH, which consistently outperforms the state of the art by large margins.

    Updated: 2018-11-05
  • Visual and Semantic Knowledge Transfer for Large Scale Semi-Supervised Object Detection
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-09
    Yuxing Tang; Josiah Wang; Xiaofang Wang; Boyang Gao; Emmanuel Dellandréa; Robert Gaizauskas; Liming Chen

    Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers into object detectors. This is done by modeling the differences between the two on categories with both image-level and bounding box annotations, and transferring this information to convert classifiers to detectors for categories without bounding box annotations. We improve on this previous work by incorporating knowledge about object similarities from visual and semantic domains during the transfer process. The intuition behind our proposed method is that visually and semantically similar categories should exhibit more common transferable properties than dissimilar categories; e.g., a better cat detector would result from transferring the differences between a dog classifier and a dog detector onto the cat class than from transferring those of the violin class. Experimental results on the challenging ILSVRC2013 detection dataset demonstrate that each of our proposed object similarity based knowledge transfer methods outperforms the baseline methods. We found strong evidence that visual similarity and semantic relatedness are complementary for the task, and when combined notably improve detection, achieving state-of-the-art detection performance in a semi-supervised setting.

    Updated: 2018-11-05
  • A Simple, Fast and Highly-Accurate Algorithm to Recover 3D Shape from 2D Landmarks on a Single Image
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-13
    Ruiqi Zhao; Yan Wang; Aleix M. Martinez

    Three-dimensional shape reconstruction from 2D landmark points on a single image is a hallmark of human vision, but is a task that has proven difficult for computer vision algorithms. We define a feed-forward deep neural network algorithm that can reconstruct 3D shapes from 2D landmark points almost perfectly (i.e., with extremely small reconstruction errors), even when these 2D landmarks are from a single image. Our experimental results show an improvement of up to two-fold over state-of-the-art computer vision algorithms; the 3D shape reconstruction error (measured as the Procrustes distance between the reconstructed shape and the ground truth) is $<.004$ for human faces, .0022 for cars, .022 for human bodies, and .0004 for highly-deformable flags. Our algorithm was also a top performer at the 2016 3D Face Alignment in the Wild Challenge competition (held in conjunction with the European Conference on Computer Vision, ECCV) that required the reconstruction of 3D face shape from a single image. The derived algorithm can be trained in a couple of hours and testing runs at more than 1,000 frames/s on an i7 desktop. We also present an innovative data augmentation approach that allows us to train the system efficiently with a small number of samples. The system is also robust to noise (e.g., imprecise landmark points) and missing data (e.g., occluded or undetected landmark points).

    Updated: 2018-11-05
  • Facial Landmark Detection with Tweaked Convolutional Neural Networks
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-12-25
    Yue Wu; Tal Hassner; Kanggeon Kim; Gérard Medioni; Prem Natarajan

    This paper concerns the problem of facial landmark detection. We provide a unique new analysis of the features produced at intermediate layers of a convolutional neural network (CNN) trained to regress facial landmark coordinates. This analysis shows that while being processed by the CNN, face images can be partitioned in an unsupervised manner into subsets containing faces in similar poses (i.e., 3D views) and facial properties (e.g., presence or absence of eye-wear). Based on this finding, we describe a novel CNN architecture, specialized to regress the facial landmark coordinates of faces in specific poses and appearances. To address the shortage of training data, particularly in extreme profile poses, we additionally present data augmentation techniques designed to provide sufficient training examples for each of these specialized sub-networks. The proposed Tweaked CNN (TCNN) architecture is shown to outperform existing landmark detection methods in an extensive battery of tests on the AFW, AFLW, and 300W benchmarks. Finally, to promote reproducibility of our results, we make code and trained models publicly available through our project webpage.

    Updated: 2018-11-05
  • iDeLog: Iterative Dual Spatial and Kinematic Extraction of Sigma-Lognormal Parameters
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-11-02
    Miguel A. A. Ferrer; Moises Diaz; Cristina A. Carmona; Rejean Plamondon

    The Kinematic Theory of rapid movements and its associated Sigma-Lognormal model have been extensively used in a large variety of applications. While the physical and biological meaning of the model have been widely tested and validated for rapid movements, some shortcomings have been detected when it is used with continuous long and complex movements. To alleviate such drawbacks, and inspired by the motor equivalence theory and a conceivable visual feedback, this paper proposes a novel framework to extract the Sigma-Lognormal parameters, named iDeLog. Specifically, iDeLog consists of two steps. The first one, influenced by the motor equivalence model, separately derives an initial action plan, defined by a set of virtual points and angles, from the trajectory, and a sequence of lognormals from the velocity. In the second step, based on a hypothetical visual feedback compatible with an open-loop motor control, the virtual target points of the action plan are iteratively moved to improve the match between the observed and reconstructed trajectory and velocity. In experiments conducted on handwritten signatures, iDeLog obtained promising results compared to the previous Sigma-Lognormal parameter extraction method.

    Updated: 2018-11-05
  • Numerical Quadrature for Probabilistic Policy Search
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-11-02
    Julia Vinogradska; Bastian Bischoff; Jan Achterhold; Torsten Koller; Jan Peters

    Learning control policies has become an appealing alternative to the derivation of control laws based on classic control theory. Model-based approaches have demonstrated outstanding data efficiency, especially when combined with probabilistic models to eliminate model bias. However, a major difficulty for these methods is that multi-step-ahead predictions typically become intractable for larger planning horizons and can be only poorly approximated. In this paper, we propose the use of numerical quadrature to overcome this drawback and provide significantly more accurate multi-step-ahead predictions. As a result, our approach increases data efficiency and enhances the quality of learned policies. Furthermore, policy learning is not restricted to optimizing locally around one trajectory, as numerical quadrature provides a principled approach to extend optimization to all trajectories starting in a specified starting state region. Thus, manual effort, such as choosing informative starting points for simultaneous policy optimization, is significantly decreased. Furthermore, learning is highly robust to the choice of initial policy and, thus, interaction time with the system is minimized. Empirical evaluations on simulated benchmark problems show the efficiency of the proposed approach and support our theoretical results.
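
    The quadrature idea itself is standard and easy to demonstrate. The following minimal numpy sketch propagates a one-dimensional Gaussian through a nonlinearity with Gauss-Hermite quadrature; the multi-step, multi-dimensional machinery of the paper (tensor products or sparse grids of nodes) is not shown.

    import numpy as np

    def gh_expectation(f, mu, sigma, order=10):
        """E_{x ~ N(mu, sigma^2)}[f(x)] by Gauss-Hermite quadrature."""
        xs, ws = np.polynomial.hermite.hermgauss(order)  # nodes/weights for e^{-x^2}
        return (ws * f(mu + np.sqrt(2.0) * sigma * xs)).sum() / np.sqrt(np.pi)

    # E[sin(x)] for x ~ N(0.5, 0.2^2); the exact value is sin(mu)*exp(-sigma^2/2).
    approx = gh_expectation(np.sin, 0.5, 0.2)
    exact = np.sin(0.5) * np.exp(-0.02)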

    Updated: 2018-11-05
  • Person Recognition in Personal Photo Collections
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-11-01
    Seong Joon Oh; Rodrigo Benenson; Mario Fritz; Bernt Schiele

    People nowadays share large parts of their personal lives through social media. Being able to automatically recognise people in personal photos may greatly enhance user convenience by easing photo album organisation. For the human identification task, however, the traditional focus of computer vision has been face recognition and pedestrian re-identification. Person recognition in social media photos sets new challenges for computer vision, including non-cooperative subjects (e.g. backward viewpoints, unusual poses) and great changes in appearance. To tackle this problem, we build a simple person recognition framework that leverages convnet features from multiple image regions (head, body, etc.). We propose new recognition scenarios that focus on the time and appearance gap between training and testing samples. We present an in-depth analysis of the importance of different features according to time and viewpoint generalisability. In the process, we verify that our simple approach achieves the state-of-the-art result on the PIPA [1] benchmark, arguably the largest social media based benchmark for person recognition to date, with diverse poses, viewpoints, social groups, and events. Compared with the conference version of the paper [2], this paper additionally presents (1) analysis of a face recogniser (DeepID2+ [3]), (2) a new method, naeil2, that combines the conference version method naeil and DeepID2+ to achieve state-of-the-art results even compared to post-conference works, (3) a discussion of related work since the conference version, (4) additional analysis including the head viewpoint-wise breakdown of performance, and (5) results on the open-world setup.

    Updated: 2018-11-02
  • Cooperative Learning of Descriptor and Generator Networks
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-11-01
    Jianwen Xie; Yang Lu; Ruiqi Gao; Song-Chun Zhu; Ying Nian Wu

    This paper studies the cooperative learning of two generative models. Both models are parametrized by ConvNets. The first model is a deep energy-based model, whose energy function is defined by a bottom-up ConvNet, which maps the observed image to the energy. We call it the descriptor network. The second model is a generator network, which is defined by a top-down ConvNet, which maps the latent factors to the observed image. The maximum likelihood learning algorithms of both models involve MCMC sampling such as Langevin dynamics. We observe that the two learning algorithms can be seamlessly interwoven into a cooperative learning algorithm that can train both models simultaneously. Specifically, within each iteration of the cooperative learning algorithm, the generator model generates initial synthetic examples to initialize a finite-step MCMC that samples and trains the energy-based descriptor model. After that, the generator model learns from how the MCMC changes its synthetic examples. That is, the descriptor model teaches the generator model by MCMC, so that the generator model accumulates the MCMC transitions and reproduces them by direct ancestral sampling. We call this scheme MCMC teaching. We show that the cooperative algorithm can learn highly realistic generative models.
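
    The inner MCMC step is simple enough to sketch on a toy energy. Below is a minimal numpy implementation of Langevin dynamics, the sampler the cooperative scheme runs on the descriptor's energy; the quadratic toy energy, step size, and chain length are assumptions of this sketch, and in the actual scheme the chains would be initialized from the generator's syntheses ("MCMC teaching").

    import numpy as np

    def langevin(x, grad_E, n_steps=30, delta=0.1, rng=None):
        """Langevin dynamics: x <- x - (delta^2/2) * grad E(x) + delta * noise."""
        rng = rng or np.random.default_rng(0)
        for _ in range(n_steps):
            x = x - 0.5 * delta**2 * grad_E(x) + delta * rng.standard_normal(x.shape)
        return x

    # Toy descriptor energy E(x) = 0.5*||x - m||^2, so grad E(x) = x - m; the
    # chains then sample (approximately) N(m, I).
    m = np.array([2.0, -1.0])
    samples = langevin(np.zeros((1000, 2)), lambda x: x - m, n_steps=200)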

    Updated: 2018-11-02
  • Late Fusion Incomplete Multi-view Clustering
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-11-01
    Xinwang Liu; Xinzhong Zhu; Miaomiao Li; Lei Wang; Chang Tang; Jianping Yin; Dinggang Shen; Huaimin Wang; Wen Gao

    Incomplete multi-view clustering optimally integrates a group of pre-specified incomplete views to improve clustering performance. Among various excellent solutions, multiple kernel $k$-means with incomplete kernels forms a benchmark, which redefines the incomplete multi-view clustering as a joint optimization problem where the imputation and clustering are alternately performed until convergence. However, the comparatively intensive computational and storage complexities preclude it from practical applications. To address these issues, we propose Late Fusion Incomplete Multi-view Clustering (LF-IMVC) which effectively and efficiently integrates the incomplete clustering matrices generated by incomplete views. Specifically, our algorithm jointly learns a consensus clustering matrix, imputes each incomplete base matrix, and optimizes the corresponding permutation matrices. We develop a three-step iterative algorithm to solve the resultant optimization problem with linear computational complexity and theoretically prove its convergence. Further, we conduct comprehensive experiments to study the proposed LF-IMVC in terms of clustering accuracy, running time, advantages of late fusion multi-view clustering, evolution of the learned consensus clustering matrix, parameter sensitivity and convergence. As indicated, our algorithm significantly and consistently outperforms some state-of-the-art algorithms with much less running time and memory.

    Updated: 2018-11-02
  • Richer Convolutional Features for Edge Detection
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-31
    Yun Liu; Ming-Ming Cheng; Xiaowei Hu; Jia-Wang Bian; Le Zhang; Xiang Bai; Jinhui Tang

    Edge detection is a fundamental problem in computer vision. Recently, convolutional neural networks (CNNs) have pushed this field forward significantly. Existing methods which adopt specific layers of deep CNNs may fail to capture complex data structures caused by variations of scales and aspect ratios. In this paper, we propose an accurate edge detector using richer convolutional features (RCF). RCF encapsulates all convolutional features into a more discriminative representation, which makes good use of the rich feature hierarchies and is amenable to training via backpropagation. RCF fully exploits the multiscale and multilevel information of objects to perform the image-to-image prediction holistically. Using the VGG16 network, we achieve state-of-the-art performance on several available datasets. When evaluated on the well-known BSDS500 benchmark, we achieve an ODS F-measure of 0.811 while retaining a fast speed (8 FPS). Besides, our fast version of RCF achieves an ODS F-measure of 0.806 at 30 FPS. We also demonstrate the versatility of the proposed method by applying RCF edges to classical image segmentation.
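
    The architectural idea, namely that every conv layer in a stage contributes to that stage's edge map before all stage maps are fused, can be sketched compactly. Below is a hedged PyTorch sketch with illustrative channel sizes and only two stages; the real RCF builds on all five VGG16 stages and adds deep supervision on every map.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RCFStage(nn.Module):
        def __init__(self, in_ch, out_ch, n_convs):
            super().__init__()
            chans = [in_ch] + [out_ch] * n_convs
            self.convs = nn.ModuleList(
                [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1) for i in range(n_convs)])
            self.reduce = nn.ModuleList(
                [nn.Conv2d(out_ch, 21, 1) for _ in range(n_convs)])
            self.score = nn.Conv2d(21, 1, 1)   # per-stage edge map

        def forward(self, x):
            side = 0.0
            for conv, red in zip(self.convs, self.reduce):
                x = F.relu(conv(x))
                side = side + red(x)           # every conv layer contributes
            return x, self.score(side)

    class RCFSketch(nn.Module):
        def __init__(self):
            super().__init__()
            self.stage1 = RCFStage(3, 16, 2)
            self.stage2 = RCFStage(16, 32, 2)
            self.fuse = nn.Conv2d(2, 1, 1)     # fuse the stage edge maps

        def forward(self, x):
            h, w = x.shape[-2:]
            f1, e1 = self.stage1(x)
            _, e2 = self.stage2(F.max_pool2d(f1, 2))
            e2 = F.interpolate(e2, size=(h, w), mode='bilinear', align_corners=False)
            fused = self.fuse(torch.cat([e1, e2], dim=1))
            return [torch.sigmoid(e) for e in (e1, e2, fused)]

    maps = RCFSketch()(torch.rand(1, 3, 64, 64))  # three edge maps, all 64x64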

    Updated: 2018-11-02
  • Context-Aware Query Selection for Active Learning in Event Recognition
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-30
    Mahmudul Hasan; Sujoy Paul; Anastasios I. Mourikis; Amit K. Roy-Chowdhury

    Activity recognition is a challenging problem with many practical applications. In addition to visual features, recent approaches have benefited from the use of context, e.g., interrelationships among activities and objects. However, these approaches require the data to be labeled and entirely available beforehand, and are not designed to be updated continuously, which makes them unsuitable for surveillance applications. In contrast, we propose a continuous-learning framework for context-aware activity recognition from unlabeled video, which has two distinct advantages over existing methods. First, it employs a novel active-learning technique that not only exploits the informativeness of the individual activities but also utilizes their contextual information during the query selection; this leads to a significant reduction in expensive manual annotation effort. Second, the learned models can be adapted online as more data becomes available. We formulate a conditional random field model that encodes the context and devise an information-theoretic approach that utilizes the entropy and mutual information of the nodes to compute the set of most informative queries, which are labeled by a human. These labels are combined with graphical inference techniques for incremental updates. We provide a theoretical formulation of the active learning framework with an analytic solution. Experiments on six challenging datasets demonstrate that our framework achieves superior performance with significantly less manual labeling.
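
    The informativeness criterion can be illustrated in a few lines. Below is a minimal numpy sketch that scores unlabeled nodes by predictive entropy and queries the top-k; the paper's full criterion also uses mutual information between context-linked nodes, which this sketch omits.

    import numpy as np

    def select_queries(probs, k):
        """probs: (N, C) per-node class posteriors; returns indices to label."""
        ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # predictive entropy
        return np.argsort(-ent)[:k]                           # most uncertain first

    probs = np.random.dirichlet(np.ones(5), size=100)
    ask = select_queries(probs, k=10)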

    Updated: 2018-10-31
  • Runtime Network Routing for Efficient Image Classification
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-26
    Yongming Rao; Jiwen Lu; Ji Lin; Jie Zhou

    In this paper, we propose a generic Runtime Network Routing (RNR) framework for efficient image classification, which selects an optimal path inside the network. Unlike existing static neural network acceleration methods, our method preserves the full ability of the original large network and conducts dynamic routing at runtime according to the input image and current feature maps. The routing is performed in a bottom-up, layer-by-layer manner, where we model it as a Markov decision process and use reinforcement learning for training. The agent determines the estimated reward of each sub-path and conducts routing conditioned on different samples, where a faster path is taken when the image is easier for the task. Since the ability of network is fully preserved, the balance point is easily adjustable according to the available resources. We test our method on both multi-path residual networks and incremental convolutional channel pruning, and show that RNR consistently outperforms static methods at the same computation complexity on both the CIFAR and ImageNet datasets. Our method can also be applied to off-the-shelf neural network structures and easily extended to other application scenarios.

    Updated: 2018-10-27
  • Discrete-Continuous Transformation Matching for Dense Semantic Correspondence
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-26
    Seungryong Kim; Dongbo Min; Stephen Lin; Kwanghoon Sohn

    Techniques for dense semantic correspondence have provided limited ability to deal with the geometric variations that commonly exist between semantically similar images. While variations due to scale and rotation have been examined, there is a lack of practical solutions for more complex deformations such as affine transformations because of the tremendous size of the associated solution space. To address this problem, we present a discrete-continuous transformation matching (DCTM) framework where dense affine transformation fields are inferred through a discrete label optimization in which the labels are iteratively updated via continuous regularization. In this way, our approach draws solutions from the continuous space of affine transformations in a manner that can be computed efficiently through constant-time edge-aware filtering and a proposed affine-varying CNN-based descriptor. Furthermore, leveraging correspondence consistency and confidence-guided filtering in each iteration facilitates the convergence of our method. Experimental results show that this model outperforms the state-of-the-art methods for dense semantic correspondence on various benchmarks and applications.

    Updated: 2018-10-27
  • Pictionary-style word-guessing on hand-drawn object sketches: dataset, analysis and deep network models
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-25
    Ravi Kiran Sarvadevabhatla; Shiv Surya; Trisha Mittal; Venkatesh Babu Radhakrishnan

    The ability of intelligent agents to play games in human-like fashion is popularly considered a benchmark of progress in Artificial Intelligence. Similarly, performance on multi-disciplinary tasks such as Visual Question Answering (VQA) is considered a marker for gauging progress in Computer Vision. In our work, we bring games and multi-disciplinary tasks together. Specifically, we introduce the first computational model aimed at Pictionary, the popular word-guessing social game. We first introduce Sketch-QA. Styled after Pictionary, Sketch-QA uses incrementally accumulated sketch stroke sequences as visual data. Notably, Sketch-QA involves asking a fixed question ("What object is being drawn?") and gathering open-ended guess-words from human guessers. We analyze the resulting dataset and present many interesting findings therein. To mimic Pictionary-style guessing, we subsequently propose a deep neural model which generates guess-words in response to temporally evolving human-drawn sketches. Our model even makes human-like mistakes while guessing, thus amplifying the human mimicry factor. We evaluate our model on the large-scale guess-word dataset generated via Sketch-QA task and compare with various baselines. We also conduct a Visual Turing Test to obtain human impressions of the guess-words generated by humans and our model. Experimental results demonstrate the promise of our approach for Pictionary and similarly themed games.

    Updated: 2018-10-26
  • Efficient Inter-Geodesic Distance Computation and Fast Classical Scaling
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-25
    Gil Shamai; Michael Zibulevsky; Ron Kimmel

    Multidimensional scaling (MDS) is a dimensionality reduction tool used for information analysis, data visualization and manifold learning. Most MDS procedures embed data points in low-dimensional Euclidean (flat) domains, such that distances between the points are as close as possible to given inter-point dissimilarities. We present an efficient solver for classical scaling, a specific MDS model, by extrapolating the information provided by distances measured from a subset of the points to the remainder. The computational and space complexities of the new MDS methods are thereby reduced from quadratic to quasi-linear in the number of data points. Incorporating both local and global information about the data allows us to construct a low-rank approximation of the inter-geodesic distances between the data points. As a by-product, the proposed method allows for efficient computation of geodesic distances.
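
    As a reference point, plain classical scaling, the model the paper accelerates, fits in a few numpy lines: double-center the squared distance matrix and embed with its top eigenpairs. The quasi-linear subset-extrapolation contribution of the paper is not reproduced in this sketch.

    import numpy as np

    def classical_scaling(D, dim=2):
        """D: (n, n) pairwise (e.g., geodesic) distances -> (n, dim) embedding."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
        vals, vecs = np.linalg.eigh(B)
        idx = np.argsort(vals)[::-1][:dim]         # top eigenpairs
        return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

    pts = np.random.rand(200, 3)
    D = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
    emb = classical_scaling(D)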

    Updated: 2018-10-26
  • Generalized Latent Multi-View Subspace Clustering
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-23
    Changqing Zhang; Huazhu Fu; Qinghua Hu; Xiaochun Cao; Yuan Xie; Dacheng Tao; Dong Xu

    Subspace clustering is an effective method that has been successfully applied to many applications. Here we propose a novel subspace clustering model for multi-view data using a latent representation termed Latent Multi-View Subspace Clustering (LMSC). Unlike most existing single-view subspace clustering methods, which directly reconstruct data points using original features, our method explores underlying complementary information from multiple views and simultaneously seeks the underlying latent representation. Using the complementarity of multiple views, the latent representation depicts data more comprehensively than each individual view, accordingly making subspace representation more accurate and robust. We propose two LMSC formulations: linear LMSC (lLMSC), based on linear correlations between the latent representation and each view, and generalized LMSC (gLMSC), based on neural networks to handle general relationships. The proposed method can be efficiently optimized under the Augmented Lagrangian Multiplier with Alternating Direction Minimization (ALM-ADM) framework. Extensive experiments on diverse datasets demonstrate the effectiveness of the proposed method.

    Updated: 2018-10-23
  • Visibility Constrained Generative Model for Depth-based 3D Facial Pose Tracking
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-23
    Lu Sheng; Jianfei Cai; Tat-Jen Cham; Vladimir Pavlovic; King Ngi Ngan

    In this paper, we propose a generative framework that unifies depth-based 3D facial pose tracking and face model adaptation on-the-fly, in unconstrained scenarios with heavy occlusions and arbitrary facial expression variations. Specifically, we introduce a statistical 3D morphable model that flexibly describes the distribution of points on the surface of the face model, with an efficient switchable online adaptation that gradually captures the identity of the tracked subject and rapidly constructs a suitable face model when the subject changes. Moreover, unlike prior art that employed ICP-based facial pose estimation, to improve robustness to occlusions, we propose a ray visibility constraint that regularizes the pose based on the face model's visibility with respect to the input point cloud. Ablation studies and experimental results on the Biwi and ICT-3DHP datasets demonstrate that the proposed framework is effective and outperforms competing state-of-the-art depth-based methods.

    Updated: 2018-10-23
  • High-Fidelity Monocular Face Reconstruction based on an Unsupervised Model-based Face Autoencoder
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-18
    Ayush Tewari; Michael Zollhoefer; Florian Bernard; Pablo Garrido; Hyeongwoo Kim; Patrick Perez; Christian Theobalt

    In this work, we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as the decoder. The core innovation is the differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance and scene illumination. Due to this new way of combining CNN-based with model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real-world datasets feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation. This work is an extended version of [1], where we additionally present a stochastic vertex sampling technique for faster training of our networks, and moreover, we propose and evaluate analysis-by-synthesis and shape-from-shading refinement approaches to achieve a high-fidelity reconstruction.

    Updated: 2018-10-19
  • Neural Machine Translation with Deep Attention
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-16
    Biao Zhang; Deyi Xiong; Jinsong Su

    Deepening neural models has proven very successful in improving model capacity when solving complex learning tasks such as machine translation. Previous efforts on deep neural machine translation mainly focus on the encoder and the decoder, while paying little attention to the attention mechanism. However, the attention mechanism is of vital importance for inducing the translation correspondence between different languages, a task for which shallow neural networks are relatively insufficient, especially when the encoder and decoder are deep. In this paper, we propose a deep attention model (DeepAtt). Based on the low-level attention information, DeepAtt is capable of automatically determining what should be passed or suppressed from the corresponding encoder layer so as to make the distributed representation appropriate for high-level attention and translation. We conduct experiments on NIST Chinese-English, WMT English-German and WMT English-French translation tasks, where, with 5 attention layers, DeepAtt yields very competitive performance against the state-of-the-art results. We empirically find that with an adequate increase of attention layers, DeepAtt tends to produce more accurate attention weights. An in-depth analysis on the translation of important context words further reveals that DeepAtt significantly improves the faithfulness of system translations.

    Updated: 2018-10-17
  • Approximate Fisher Information Matrix to Characterise the Training of Deep Neural Networks
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-16
    Zhibin Liao; Tom Drummond; Ian Reid; Gustavo Carneiro

    In this paper, we introduce a novel methodology for characterising the performance of deep learning networks (ResNets and DenseNet) with respect to training convergence and generalisation as a function of mini-batch size and learning rate for image classification. This methodology is based on novel measurements derived from the eigenvalues of the approximate Fisher information matrix, which can be efficiently computed even for high capacity deep models. Our proposed measurements can help practitioners to monitor and control the training process (by actively tuning the mini-batch size and learning rate) to allow for good training convergence and generalisation. Furthermore, the proposed measurements also allow us to show that it is possible to optimise the training process with a new dynamic sampling training approach that continuously and automatically changes the mini-batch size and learning rate during the training process. Finally, we show that the proposed dynamic sampling training approach has a faster training time and a competitive classification accuracy compared to the current state of the art.
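
    A hedged sketch of why such spectra are cheap: the non-zero eigenvalues of an empirical outer-product Fisher approximation coincide with those of a small Gram matrix of per-example gradients, so the huge parameter-by-parameter matrix is never formed. This is a generic estimator; the paper's specific measurements are built on top of such spectra.

```python
import numpy as np

def fim_eigenvalues(grads):
    """Spectrum of the empirical Fisher approximation F = (1/k) sum_i g_i g_i^T.

    grads : (k x p) array of per-example flattened gradients, with k << p.
    The non-zero eigenvalues of F equal those of the small (k x k) Gram
    matrix, so the (p x p) matrix is never formed. Generic sketch only.
    """
    k = grads.shape[0]
    G = grads @ grads.T / k                   # (k x k) Gram matrix
    return np.linalg.eigvalsh(G)[::-1]        # eigenvalues, descending
```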

    Updated: 2018-10-17
  • PCL: Proposal Cluster Learning for Weakly Supervised Object Detection
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-16
    Peng Tang; Xinggang Wang; Song Bai; Wei Shen; Xiang Bai; Wenyu Liu; Alan Loddon Yuille

    Weakly Supervised Object Detection (WSOD), using only image-level annotations to train object detectors, is of growing importance in object recognition. In this paper, we propose a novel deep network for WSOD. Unlike previous networks that transfer the object detection problem to an image classification problem using Multiple Instance Learning (MIL), our strategy generates proposal clusters to learn refined instance classifiers by an iterative process. The proposals in the same cluster are spatially adjacent and associated with the same object. This prevents the network from concentrating too much on parts of objects instead of whole objects. We first show that instances can be assigned object or background labels directly based on proposal clusters for instance classifier refinement, and then show that treating each cluster as a small new bag yields fewer ambiguities than assigning labels directly. The iterative instance classifier refinement is implemented online using multiple streams in convolutional neural networks, where the first stream is an MIL network and each subsequent stream performs instance classifier refinement supervised by the preceding one. Experiments are conducted on the PASCAL VOC, ImageNet detection, and MS-COCO benchmarks for WSOD. Results show that our method outperforms the previous state of the art significantly.

    Updated: 2018-10-17
  • Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-16
    Margret Keuper; Siyu Tang; Bjorn Andres; Thomas Brox; Bernt Schiele

    Models for computer vision are commonly defined either w.r.t. low-level concepts such as pixels that are to be grouped, or w.r.t. high-level concepts such as semantic objects that are to be detected and tracked. Combining bottom-up grouping with top-down detection and tracking, although highly desirable, is a challenging problem. We state this joint problem as a co-clustering problem that is principled and tractable by existing algorithms. We demonstrate the effectiveness of this approach by combining bottom-up motion segmentation by grouping of point trajectories with high-level multiple object tracking by clustering of bounding boxes. We show that solving the joint problem is beneficial at the low-level, in terms of the FBMS59 motion segmentation benchmark, and at the high-level, in terms of the Multiple Object Tracking benchmarks MOT15, MOT16 and the MOT17 challenge, and is state-of-the-art in some metrics.

    Updated: 2018-10-17
  • Detecting Coherent Groups in Crowd Scenes by Multiview Clustering
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-09
    Qi Wang; Mulin Chen; Feiping Nie; Xuelong Li

    Detecting coherent groups is fundamentally important for crowd behavior analysis. In the past few decades, plenty of works have been conducted on this topic, but most of them have limitations due to the insufficient utilization of crowd properties and the arbitrary processing of individuals. In this study, a Multiview-based Parameter Free framework (MPF) is proposed. Based on the L1-norm and L2-norm, we design two versions of the multiview clustering method, which is the main part of the proposed framework. This paper makes contributions in three aspects: (1) a new structural context descriptor is designed to characterize the structural properties of individuals in crowd scenes; (2) a self-weighted multiview clustering method is proposed to cluster feature points by incorporating their motion and context similarities; (3) a novel framework is introduced for group detection, which is able to determine the group number automatically without any parameter or threshold to be tuned. The effectiveness of the proposed framework is evaluated on real-world crowd videos, and the experimental results show its promising performance on group detection. In addition, the proposed multiview clustering method is also evaluated on a synthetic dataset and several standard benchmarks, and its superiority over the state-of-the-art competitors is demonstrated.

    Updated: 2018-10-10
  • Learning Compact Features for Human Activity Recognition via Probabilistic First-Take-All
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-08
    Jun Ye; Guojun Qi; Naifan Zhuang; Hao Hu; Kien A. Hua

    With the popularity of mobile technology, wearable devices, such as smart wristbands and smartphones, open an unprecedented opportunity to solve the challenging human activity recognition (HAR) problem by learning expressive representations from the multi-dimensional sensor signals that record huge amounts of daily activities. This inspires us to develop a new algorithm applicable to both camera-based and wearable sensor-based HAR systems. Although competitive classification accuracy has been reported, existing methods often face the challenge of distinguishing visually similar activities composed of activity patterns in different temporal orders. In this paper, we propose a novel probabilistic algorithm to compactly encode temporal orders of activity patterns for HAR. Specifically, the algorithm learns an optimal set of latent patterns such that their temporal structures really matter in recognizing different human activities. Then a novel probabilistic First-Take-All (pFTA) approach is introduced to generate compact features from the orders of these latent patterns to encode an entire sequence, and the temporal structural similarity between different sequences can be efficiently computed by the Hamming distance between compact features. Experiments on three public HAR datasets show the proposed pFTA approach can achieve competitive performance in terms of accuracy as well as efficiency.
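
    To illustrate the encoding idea only: the paper's pFTA is probabilistic and learns the latent patterns jointly, whereas the deterministic sketch below just shows how pairwise firing orders become bits that are compared by Hamming distance. All function names are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def fta_code(responses):
    """Deterministic First-Take-All code for one sequence (simplified).

    responses : (T x K) array, detection score of each of K latent
    patterns at each of T time steps. One bit per pattern pair records
    whether pattern i peaks before pattern j.
    """
    peaks = responses.argmax(axis=0)          # time of strongest response per pattern
    K = responses.shape[1]
    return np.array([peaks[i] < peaks[j] for i, j in combinations(range(K), 2)])

def temporal_dissimilarity(code_a, code_b):
    """Hamming distance between two compact temporal codes."""
    return int(np.sum(code_a != code_b))
```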

    Updated: 2018-10-09
  • Shallowing Deep Networks: Layer-wise Pruning based on Feature Representations
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-08
    Shi Chen; Qi Zhao

    The recent surge of Convolutional Neural Networks (CNNs) has brought success in various applications. However, these successes are accompanied by a significant increase in computational cost and the demand for computational resources, which critically hampers the utilization of complex CNNs on devices with limited computational power. In this work, we propose a feature representation based layer-wise pruning method that aims at reducing complex CNNs to more compact ones with equivalent performance. Different from previous parameter pruning methods that conduct connection-wise or filter-wise pruning based on weight information, our method determines redundant parameters by investigating the features learned in the convolutional layers, and the pruning process is operated at a layer level. Experiments demonstrate that the proposed method is able to significantly reduce computational cost and the pruned models achieve equivalent or even better performance compared to the original models on various datasets.
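
    One plausible reading of "investigating the features learned in the convolutional layers" is a per-layer redundancy score. The proxy below, linear reconstructability of a layer's output features from its input features, is an illustrative stand-in under that assumption, not the paper's actual criterion.

```python
import numpy as np

def layer_redundancy(feat_in, feat_out):
    """Score a layer by how well its output features are linearly
    predictable from its input features (both (n_samples x d) matrices).
    A score near 1 suggests the layer adds little and is a candidate for
    layer-level pruning. Illustrative proxy, not the paper's criterion.
    """
    W, *_ = np.linalg.lstsq(feat_in, feat_out, rcond=None)
    rel_err = np.linalg.norm(feat_out - feat_in @ W) / np.linalg.norm(feat_out)
    return 1.0 - rel_err
```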

    Updated: 2018-10-09
  • Rank Minimization for Snapshot Compressive Imaging
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-04
    Yang Liu; Xin Yuan; Jinli Suo; David Brady; Qionghai Dai

    Snapshot compressive imaging (SCI) refers to compressive imaging systems where multiple frames are mapped into a single measurement, with video compressive imaging and hyperspectral compressive imaging as two representative applications. Though exciting results of high-speed videos and hyperspectral images have been demonstrated, the poor reconstruction quality precludes SCI from wide applications. This paper aims to boost the reconstruction quality of SCI via exploiting the high-dimensional structure in the desired signal. We build a joint model to integrate the nonlocal self-similarity of video/hyperspectral frames and the rank minimization approach with the SCI sensing process. Following this, an alternating minimization algorithm is developed to solve this non-convex problem. We further investigate the special structure of the sampling process in SCI to tackle the computational workload and memory issues in SCI reconstruction. Both simulation and real data (captured by four different SCI cameras) results demonstrate that our proposed algorithm leads to significant improvements compared with current state-of-the-art algorithms. We hope our results will encourage researchers and engineers to pursue compressive imaging further for real applications.
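
    In alternating schemes of this kind, the rank-minimization step is commonly the singular value thresholding operator applied to matrices of grouped similar patches; a generic sketch follows. The paper's full algorithm additionally integrates the SCI sensing operator and nonlocal patch grouping.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the nuclear
    norm, the standard single step of rank-minimization schemes applied
    to a matrix of grouped similar patches.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0)              # shrink singular values
    return (U * s) @ Vt                       # low-rank estimate
```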

    Updated: 2018-10-05
  • Hyperspectral recovery from RGB images using Gaussian Processes
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-04
    Naveed Akhtar; Ajmal S. Mian

    We propose to recover spectral details from RGB images of known spectral quantization by modeling natural spectra under Gaussian Processes and combining them with the RGB images. Our technique exploits Process Kernels to model the relative smoothness of reflectance spectra, and encourages non-negativity in the resulting signals for better estimation of the reflectance values. The Gaussian Processes are inferred in sets using clusters of spatio-spectrally correlated hyperspectral training patches. Each set is transformed to match the spectral quantization of the test RGB image. We extract overlapping patches from the RGB image and match them to the hyperspectral training patches by spectrally transforming the latter. The RGB patches are encoded over the transformed Gaussian Processes related to those hyperspectral patches and the resulting image is constructed by combining the codes with the original Processes. Our approach infers the desired Gaussian Processes under a fully Bayesian model inspired by Beta-Bernoulli Process, for which we also present the inference procedure. A thorough evaluation using three hyperspectral datasets demonstrates the effective extraction of spectral details from RGB images by the proposed technique.

    Updated: 2018-10-05
  • Denoising Prior Driven Deep Neural Network for Image Restoration
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-04
    Weisheng Dong; Peiyao Wang; Wotao Yin; Guangming Shi

    Deep neural networks (DNNs) have shown very promising results for various image restoration (IR) tasks. However, the design of network architectures remains a major challenge for achieving further improvements. While most existing DNN-based methods solve the IR problems by directly mapping low quality images to desirable high-quality images, the observation models characterizing the image degradation processes have been largely ignored. In this paper, we first propose a denoising-based IR algorithm, whose iterative steps can be computed efficiently. Then, the iterative process is unfolded into a deep neural network, which is composed of multiple denoiser modules interleaved with back-projection (BP) modules that ensure the observation consistencies. A convolutional neural network (CNN) based denoiser that can exploit the multi-scale redundancies of natural images is proposed. As such, the proposed network not only exploits the powerful denoising ability of DNNs, but also leverages the prior of the observation model. Through end-to-end training, both the denoisers and the BP modules can be jointly optimized. Experimental results on several IR tasks, including image denoising, super-resolution, and deblurring, show that the proposed method leads to very competitive and often state-of-the-art results.
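
    Stripped of learning, the unfolded iteration alternates a denoising (prior) step with a back-projection (data-consistency) step. A plain-Python sketch follows, under the assumption that the degradation operator, its adjoint, and a denoiser are available as callables; in the paper both modules are CNNs trained end-to-end.

```python
def restore(y, A, At, denoiser, n_iter=10):
    """Denoising-prior-driven restoration (un-learned sketch).

    y        : degraded observation
    A, At    : degradation operator and its adjoint (callables, assumed)
    denoiser : any image denoiser (callable, assumed)
    """
    x = At(y)                                 # crude initial estimate
    for _ in range(n_iter):
        x = denoiser(x)                       # prior step: suppress artifacts
        x = x + At(y - A(x))                  # back-projection: data consistency
    return x
```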

    Updated: 2018-10-05
  • First-Person Activity Forecasting from Video with Online Inverse Reinforcement Learning
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-04
    Nicholas Rhinehart; Kris Kitani

    We address the problem of incrementally modeling and forecasting long-term goals of a first-person camera wearer: what the user will do, where they will go, and what goal they seek. In contrast to prior work in trajectory forecasting, our algorithm, DARKO, goes further to reason about semantic states (will I pick up an object?), and future goal states that are far in terms of both space and time. DARKO learns and forecasts from first-person visual observations of the user's daily behaviors via an Online Inverse Reinforcement Learning (IRL) approach. Classical IRL discovers only the rewards in a batch setting, whereas DARKO discovers the transitions, rewards, and goals of a user from streaming data. Among other results, we show DARKO forecasts goals better than competing methods in both noisy and ideal settings, and our approach is theoretically and empirically no-regret.

    Updated: 2018-10-05
  • Visual Permutation Learning
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-10-04
    Rodrigo Santa Cruz; Basura Fernando; Anoop Cherian; Stephen Gould

    We present a principled approach to uncover the structure of visual data by solving a deep learning task coined visual permutation learning. The goal of this task is to find the permutation that recovers the structure of data from shuffled versions of it. In the case of natural images, this task boils down to recovering the original image from patches shuffled by an unknown permutation matrix. Permutation matrices are discrete, thereby posing difficulties for gradient-based optimization methods. To this end, we resort to a continuous approximation using doubly-stochastic matrices and formulate a novel bi-level optimization problem on such matrices that learns to recover the permutation. Unfortunately, such a scheme leads to expensive gradient computations. We circumvent this issue by further proposing a computationally cheap scheme for generating doubly stochastic matrices based on Sinkhorn iterations. To implement our approach, we propose DeepPermNet, an end-to-end CNN model for this task. The utility of DeepPermNet is demonstrated on three challenging computer vision problems, namely, relative attributes learning, supervised learning-to-rank, and self-supervised representation learning. Our results show state-of-the-art performance on relative attributes learning and supervised learning-to-rank, and competitive results in the classification and segmentation tasks of the PASCAL VOC dataset for self-supervised learning.
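
    The Sinkhorn scheme mentioned above is simple to state: exponentiate a score matrix and alternately normalize its rows and columns, converging toward a doubly-stochastic matrix. A minimal NumPy sketch of that relaxation (DeepPermNet applies it inside an end-to-end CNN):

```python
import numpy as np

def sinkhorn(scores, n_iter=20):
    """Relax a square score matrix toward a doubly-stochastic matrix by
    alternating row and column normalization (Sinkhorn-Knopp).
    """
    P = np.exp(scores - scores.max())         # positive matrix, numerically stable
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)     # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)     # columns sum to 1
    return P
```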

    Updated: 2018-10-05
  • Two-Stream Transformer Networks for Video-Based Face Alignment
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-01
    Hao Liu; Jiwen Lu; Jianjiang Feng; Jie Zhou

    In this paper, we propose a two-stream transformer networks (TSTN) approach for video-based face alignment. Unlike conventional image-based face alignment approaches, which cannot explicitly model the temporal dependency in videos, and motivated by the fact that consistent movements of facial landmarks usually occur across consecutive frames, our TSTN aims to capture the complementary information of both the spatial appearance on still frames and the temporal consistency information across frames. To achieve this, we develop a two-stream architecture, which decomposes the video-based face alignment into spatial and temporal streams accordingly. Specifically, the spatial stream aims to transform the facial image to the landmark positions by preserving the holistic facial shape structure. Accordingly, the temporal stream encodes the video input as active appearance codes, where the temporal consistency information across frames is captured to help shape refinements. Experimental results on the benchmark video-based face alignment datasets show very competitive performance of our method in comparison to the state of the art.

    Updated: 2018-10-03
  • PD2T: Person-Specific Detection, Deformable Tracking
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-11-03
    Grigorios G. Chrysos; Stefanos Zafeiriou

    Face detection/alignment methods have reached a satisfactory state in static images captured under arbitrary conditions. Such methods typically perform (joint) fitting for each frame and are used in commercial applications; however, in the majority of real-world scenarios the dynamic scenes are of interest. We argue that generic fitting per frame is suboptimal (it discards the informative correlation of sequential frames) and propose to learn person-specific statistics from the video to improve the generic results. To that end, we introduce a meticulously studied pipeline, which we name PD2T, that performs person-specific detection and landmark localisation. We carry out extensive experimentation with a diverse set of (i) generic fitting results and (ii) different objects (human faces, animal faces), which illustrates the powerful properties of our proposed pipeline, and experimentally verify that PD2T outperforms all compared methods.

    Updated: 2018-10-03
  • Recurrent Convolutional Shape Regression
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-03-01
    Wei Wang; Sergey Tulyakov; Nicu Sebe

    The mainstream direction in face alignment is now dominated by cascaded regression methods. These methods start from an image with an initial shape and build a set of shape increments based on features with respect to the current estimated shape. These shape increments move the initial shape to the desired location. Despite the advantages of the cascaded methods, they all share two major limitations: (i) shape increments are learned independently from each other in a cascaded manner, (ii) the use of standard generic computer vision features such as SIFT and HOG does not allow these methods to learn problem-specific features. In this work, we propose a novel Recurrent Convolutional Shape Regression (RCSR) method that overcomes these limitations. We formulate the standard cascaded alignment problem as a recurrent process and learn all shape increments jointly, by using a recurrent neural network with a gated recurrent unit. Importantly, by combining a convolutional neural network with a recurrent one we avoid hand-crafted features, widely adopted in the literature, and thus we allow the model to learn task-specific features. Besides, we employ the convolutional gated recurrent unit which takes as input the feature tensors instead of flattened feature vectors. Therefore, the spatial structure of the features can be better preserved in the memory of the recurrent neural network. Moreover, both the convolutional and the recurrent neural networks are learned jointly. Experimental evaluation shows that the proposed method has better performance than the state-of-the-art methods, and further supports the importance of learning a single end-to-end model for face alignment.
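
    For concreteness, here is the textbook convolutional GRU cell of the kind named above, in PyTorch: the gates are computed by convolutions, so input and hidden state stay spatial feature tensors and spatial structure is preserved across the recurrence. A hedged illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: gating via convolutions keeps the hidden
    state a spatial feature tensor rather than a flattened vector."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_new = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_new        # gated update of the hidden tensor
```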

    Updated: 2018-10-03
  • EAC-Net: Deep Nets with Enhancing and Cropping for Facial Action Unit Detection
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-01-10
    Wei Li; Farnaz Abtahi; Zhigang Zhu; Lijun Yin

    In this paper, we propose a deep learning based approach for facial action unit (AU) detection by enhancing and cropping regions of interest of face images. The approach is implemented by adding two novel nets (a.k.a. layers): the enhancing layers and the cropping layers, to a pretrained convolutional neural network (CNN) model. For the enhancing layers (noted as E-Net), we have designed an attention map based on facial landmark features and apply it to a pretrained neural network to conduct enhanced learning. For the cropping layers (noted as C-Net), we crop facial regions around the detected landmarks and design individual convolutional layers to learn deeper features for each facial region. We then combine the E-Net and the C-Net to construct a so-called Enhancing and Cropping Net (EAC-Net), which can learn both feature-enhancing and region-cropping functions effectively. The EAC-Net integrates three important elements, i.e., learning transfer, attention coding, and regions of interest processing, making our AU detection approach more efficient and more robust to facial position and orientation changes. Our approach shows a significant performance improvement over the state-of-the-art methods when tested on the BP4D and DISFA AU datasets. The EAC-Net with a slight modification also shows its potential in estimating accurate AU intensities. We have also studied the performance of the proposed EAC-Net under two very challenging conditions: (1) faces with partial occlusion and (2) faces with large head pose variations. Experimental results show that (1) the EAC-Net learns facial AU correlations effectively and predicts AUs reliably even with only half of a face being visible, especially for the lower half; (2) our EAC-Net model also works well under very large head poses, significantly outperforming a baseline approach. It further shows that the EAC-Net works much better without face frontalization than with face frontalization through image warping as pre-processing, in terms of both computational efficiency and AU detection accuracy.

    Updated: 2018-10-03
  • Heterogeneous Face Attribute Estimation: A Deep Multi-Task Learning Approach
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-08-11
    Hu Han; Anil K. Jain; Fang Wang; Shiguang Shan; Xilin Chen

    Face attribute estimation has many potential applications in video surveillance, face retrieval, and social media. While a number of methods have been proposed for face attribute estimation, most of them do not explicitly consider attribute correlation and heterogeneity (e.g., ordinal versus nominal and holistic versus local) during feature representation learning. In this paper, we present a Deep Multi-Task Learning (DMTL) approach to jointly estimate multiple heterogeneous attributes from a single face image. In DMTL, we tackle attribute correlation and heterogeneity with convolutional neural networks (CNNs) consisting of shared feature learning for all the attributes, and category-specific feature learning for heterogeneous attributes. We also introduce an unconstrained face database (LFW+), an extension of the public-domain LFW, with heterogeneous demographic attributes (age, gender, and race) obtained via crowdsourcing. Experimental results on benchmarks with multiple face attributes (MORPH II, LFW+, CelebA, LFWA, and FotW) show that the proposed approach has superior performance compared to the state of the art. Finally, evaluations on a public-domain face database (LAP) with a single attribute show that the proposed approach has excellent generalization ability.

    Updated: 2018-10-03
  • Efficient Group-n Encoding and Decoding for Facial Age Estimation
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2017-12-04
    Zichang Tan; Jun Wan; Zhen Lei; Ruicong Zhi; Guodong Guo; Stan Z. Li

    Different ages are closely related, especially adjacent ages, because aging is a slow and extremely non-stationary process with much randomness. To explore the relationship between the real age and its adjacent ages, an age group-n encoding (AGEn) method is proposed in this paper. In our model, adjacent ages are grouped into the same group and each age corresponds to n groups. The ages grouped into the same group are regarded as an independent class in the training stage. On this basis, the original age estimation problem can be transformed into a series of binary classification sub-problems, and a deep Convolutional Neural Network (CNN) with multiple classifiers is designed to cope with such sub-problems. A Local Age Decoding (LAD) strategy is then presented to accelerate the prediction process, which locally decodes the estimated age value from the ordinal classifiers. Besides, to alleviate the imbalanced data learning problem of each classifier, a penalty factor is inserted into the unified objective function to favor the minority class. To compare with state-of-the-art methods, we evaluate the proposed method on the FG-NET, MORPH II, CACD and Chalearn LAP 2015 databases, where it achieves the best performance.
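
    A sketch of the encoding and decoding as described above: each group covers n adjacent ages, every age therefore lies in (up to) n overlapping groups, and a simplified local decoder picks the age whose covering groups score highest. Boundary handling and the class-imbalance penalty are simplified relative to the paper, and all names are illustrative.

```python
import numpy as np

def encode_age(age, n, max_age):
    """Group-n encoding: group g covers ages {g, ..., g + n - 1}, so each
    age lies in up to n overlapping groups, one binary target per group."""
    code = np.zeros(max_age + 1)
    code[max(0, age - n + 1):age + 1] = 1.0   # groups that contain `age`
    return code

def decode_age(group_probs, n):
    """Simplified Local Age Decoding: choose the age whose covering
    groups have the highest mean predicted probability."""
    scores = [group_probs[max(0, a - n + 1):a + 1].mean()
              for a in range(len(group_probs))]
    return int(np.argmax(scores))
```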

    Updated: 2018-10-03
  • Visual Kinship Recognition of Families in the Wild
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-04-13
    Joseph P. Robinson; Ming Shao; Yue Wu; Hongfu Liu; Timothy Gillis; Yun Fu

    We present the largest database for visual kinship recognition, Families In the Wild (FIW), with over 13,000 family photos of 1,000 family trees with 4-to-38 members. It took only a small team to build FIW with efficient labeling tools and workflow. To extend FIW, we further improved upon this process with a novel semi-automatic labeling scheme that used annotated faces and unlabeled text metadata to discover labels, which were then used, along with existing FIW data, for the proposed clustering algorithm that generated label proposals for all newly added data; both processes are shared and compared in depth, showing great savings in the time and human input required. Essentially, the proposed clustering algorithm is semi-supervised and uses labeled data to produce more accurate clusters. We statistically compare FIW to related datasets, which unarguably shows enormous gains in overall size and in the amount of information encapsulated in the labels. We benchmark two tasks, kinship verification and family classification, at scales incomparably larger than ever before. Pre-trained CNN models fine-tuned on FIW outperform other conventional methods and achieve state-of-the-art results on the renowned KinWild datasets. We also measure human performance on kinship recognition and compare it to a fine-tuned CNN.

    Updated: 2018-10-03
  • 3D Reconstruction of “In-the-Wild” Faces in Images and Videos
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-05-15
    James Booth; Anastasios Roussos; Evangelos Ververas; Epameinondas Antonakos; Stylianos Ploumpis; Yannis Panagakis; Stefanos Zafeiriou

    3D Morphable Models (3DMMs) are powerful statistical models of 3D facial shape and texture, and are among the state-of-the-art methods for reconstructing facial shape from single images. With the advent of new 3D sensors, many 3D facial datasets have been collected containing both neutral as well as expressive faces. However, all datasets are captured under controlled conditions. Thus, even though powerful 3D facial shape models can be learnt from such data, it is difficult to build statistical texture models that are sufficient to reconstruct faces captured in unconstrained conditions (“in-the-wild”). In this paper, we propose the first “in-the-wild” 3DMM by combining a statistical model of facial identity and expression shape with an “in-the-wild” texture model. We show that such an approach allows for the development of a greatly simplified fitting procedure for images and videos, as there is no need to optimise with regard to the illumination parameters. We have collected three new benchmarks that combine “in-the-wild” images and video with ground truth 3D facial geometry, the first of their kind, and report extensive quantitative evaluations using them that demonstrate our method is state-of-the-art.

    Updated: 2018-10-03
  • HeadFusion: 360° Head Pose Tracking Combining 3D Morphable Model and 3D Reconstruction
    IEEE Trans. Pattern Anal. Mach. Intell. (IF 9.455) Pub Date : 2018-05-29
    Yu Yu; Kenneth Alberto Funes Mora; Jean-Marc Odobez

    Head pose estimation is a fundamental task for face and social related research. Although 3D morphable model (3DMM) based methods relying on depth information usually achieve accurate results, they typically require frontal or mid-profile poses, which precludes a large set of applications where such conditions cannot be guaranteed, like monitoring natural interactions from fixed sensors placed in the environment. A major reason is that 3DMM models usually only cover the face region. In this paper, we present a framework which combines the strengths of a 3DMM model fitted online with a prior-free reconstruction of a 3D full head model, providing support for pose estimation from any viewpoint. In addition, we also propose a symmetry regularizer for accurate 3DMM fitting under partial observations, and exploit visual tracking to address natural head dynamics with fast accelerations. Extensive experiments show that our method achieves state-of-the-art performance on the public BIWI dataset, as well as accurate and robust results on UbiPose, an annotated dataset of natural interactions that we make public and where adverse poses, occlusions or fast motions regularly occur.

    Updated: 2018-10-03