Current journal: Image and Vision Computing
  • A Complementary Regression Network for Accurate Face Alignment
    Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-22
    Hyunsung Park; Daijin Kim

    This paper proposes a complementary regression network (CRN) that combines global and local regression methods to align faces. A global regression network (GRN) directly generates the coordinates of the facial landmark points so that all facial feature points are fitted to the input face as a whole, and a local regression network (LRN) generates a heatmap of the facial landmark points so that each channel localizes the detail of its facial landmark point well. The CRN converts the GRN’s coordinates to another heatmap, then uses it together with the LRN’s heatmap to obtain the final facial landmark points. The CRN works complementarily: the GRN’s overall fitting tendency compensates for the LRN’s poor alignment caused by missing local information, whereas the LRN’s detailed representation compensates for the GRN’s poor alignment caused by global misfitting. We conducted several experiments on the 300-W public dataset, the 300-W private dataset, and the Menpo dataset, and the proposed CRN achieved state-of-the-art face alignment accuracy of 3.14%, 3.74%, and 1.996%, respectively, in terms of percentage of normalized mean error.
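
    The coordinate-to-heatmap conversion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian rendering, the `sigma` values, the averaging rule in `fuse_heatmaps`, and all function names are assumptions made for illustration, since the abstract does not say how the two heatmaps are combined.

```python
import math


def coords_to_heatmap(x, y, width, height, sigma=1.5):
    """Render one landmark (x, y) as a 2D Gaussian heatmap (list of rows)."""
    return [
        [math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2.0 * sigma ** 2))
         for c in range(width)]
        for r in range(height)
    ]


def fuse_heatmaps(global_map, local_map, alpha=0.5):
    """Blend two equally sized heatmaps; alpha weights the global one.

    Plain weighted averaging is an assumed fusion rule, chosen for brevity.
    """
    return [
        [alpha * g + (1.0 - alpha) * l for g, l in zip(g_row, l_row)]
        for g_row, l_row in zip(global_map, local_map)
    ]


def argmax_2d(heatmap):
    """Return the (x, y) position of the highest-scoring pixel."""
    best = max(
        ((value, c, r) for r, row in enumerate(heatmap) for c, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]


# Hypothetical example: the global estimate sits at (4, 4),
# while the sharper local heatmap peaks at (5, 4).
global_map = coords_to_heatmap(4, 4, width=9, height=9)
local_map = coords_to_heatmap(5, 4, width=9, height=9, sigma=0.8)
fused = fuse_heatmaps(global_map, local_map)
landmark = argmax_2d(fused)
```

    In this toy setting the broad global heatmap keeps the estimate near the overall fit while the sharper local peak decides the final pixel, which is one simple way to read the complementary behaviour the abstract describes.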

  • A Riemannian Approach for Free-Space Extraction and Path Planning using Catadioptric Omnidirectional Vision
    Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-19
    Fatima Aziz; Ouiddad Labbani-Igbida; Amina Radgui; Ahmed Tamtaoui

    This paper presents a Riemannian approach for free-space extraction and path planning using color catadioptric vision. The problem is formulated by considering color catadioptric images as Riemannian manifolds and is solved using the Riemannian Eikonal equation with an anisotropic fast marching numerical scheme. This formulation allows the integration of adapted color and spatial metrics in an incremental process. First, the traversable ground (namely free-space) is delimited using a color structure tensor built on the multi-dimensional components of the catadioptric image. Then, the Eikonal equation is solved in the image plane, incorporating a generic metric tensor for central catadioptric systems. The resulting Riemannian metric copes with the geometric distortions introduced in the catadioptric image plane by the curved mirror, in order to compute the geodesic distance map and the shortest path between image points. We present comparative results using Euclidean and Riemannian distance transforms and show the effectiveness of the Riemannian approach in planning the safest paths.

  • Salient object detection based on backbone enhanced network
    Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-15
    Ronghua Luo; Huailin Huang; WeiZeng Wu

    Convolutional neural networks (CNNs) with an encoder-decoder architecture have shown powerful ability in semantic segmentation and have also been applied to saliency detection. In most studies, the parameters of the backbone network, which have been pre-trained on the ImageNet dataset, are retrained on the new training dataset so that the CNN adapts better to the new task. However, retraining weakens the generalization of the pre-trained backbone network and results in over-fitting, especially when the new training dataset is not very large. To strike a balance between generalization and precision, and to further improve the performance of encoder-decoder CNNs in salient object detection, we propose a framework with an enhanced backbone network (BENet). An encoder with a structure of dual backbone networks (DBNs) is adopted in BENet to extract more diverse feature maps. In addition, BENet includes a connection module based on an improved Res2Net to efficiently fuse feature maps from the two backbone networks, and a decoder based on a weighted multi-scale feedback module (WMFM) to perform synchronous learning. Our approach is extensively evaluated on six public datasets, and experimental results show significant and consistent improvements over the state-of-the-art methods without any additional supervision.

  • Person re-identification with expanded neighborhoods distance re-ranking
    Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-15
    Jingyi Lv; Zhiyong Li; Ke Nai; Ying Chen; Jin Yuan

    In person re-identification (re-ID), pedestrians often change greatly in appearance and many persons look similar, which degrades accuracy. Re-ranking is an effective way to address these problems, and this paper proposes an expanded neighborhoods distance (END) to re-rank the re-ID results. We assume that if two persons in different images are the same, their initial ranking lists and two-level neighborhoods will be very similar when they are taken as the query. Our method follows the principle of similarity and selects expanded neighborhoods in the initial ranking list to calculate the END distance. The final distance is calculated as the combination of the END distance and the Jaccard distance. Experiments on the Market-1501, DukeMTMC-reID and CUHK03 datasets confirm the effectiveness of the proposed re-ranking method. Compared with the re-ID baseline, the proposed method increases mAP by 14.2% on Market-1501 and Rank1 by 12.9% on DukeMTMC-reID.
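
    The combination of distances mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the END distance is taken as an already computed input (the abstract does not define it), the Jaccard distance is computed on plain neighbor sets, and the linear blend with `weight` as well as all function names are hypothetical.

```python
def jaccard_distance(neighbors_a, neighbors_b):
    """Jaccard distance between two neighbor sets: 1 - |A & B| / |A | B|."""
    a, b = set(neighbors_a), set(neighbors_b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def combine_distances(end_distance, jaccard, weight=0.5):
    """Blend the two distances; the linear rule and weight are assumptions."""
    return weight * end_distance + (1.0 - weight) * jaccard


def rerank(gallery_ids, end_distances, jaccard_distances, weight=0.5):
    """Sort gallery ids by the combined distance, closest first."""
    final = {
        gid: combine_distances(end_distances[gid], jaccard_distances[gid], weight)
        for gid in gallery_ids
    }
    return sorted(gallery_ids, key=lambda gid: final[gid])


# Hypothetical toy data: three gallery images with made-up distances.
example_jaccard = jaccard_distance([1, 2, 3, 4], [3, 4, 5, 6])
order = rerank(
    ["a", "b", "c"],
    end_distances={"a": 0.9, "b": 0.2, "c": 0.5},
    jaccard_distances={"a": 0.8, "b": 0.3, "c": 0.1},
)
```

    With these made-up numbers the combined distances are 0.85, 0.25 and 0.30, so the re-ranked list starts with "b".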

  • Depth-Guided View Synthesis for Light Field Reconstruction From a Single Image
    Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-14
    Wenhui Zhou; Gaomin Liu; Jiangwei Shi; Hua Zhang; Guojun Dai

    Light field imaging has recently become a promising technology for 3D rendering and display. However, capturing real-world light field images still faces many challenges in both quantity and quality. In this paper, we develop a learning-based technique to reconstruct a light field from a single 2D RGB image. It includes three steps: unsupervised monocular depth estimation, view synthesis and depth-guided view inpainting. We first propose a novel monocular depth estimation network to predict the disparity map of each sub-aperture view from the central view of the light field. Then we synthesize the initial sub-aperture views using a warping scheme. Considering that occlusion makes synthesis ambiguous for pixels invisible in the central view, we present a simple but effective fully convolutional network (FCN) for view inpainting. Note that the proposed network architecture is a general framework for light field reconstruction, which can be extended to take a sparse set of views as input without changing any structure or parameters of the network. Comparison experiments demonstrate that our method outperforms the state-of-the-art light field reconstruction methods with single-view input, and achieves results comparable to those of the multi-input methods.

  • Learning Reliable-Spatial and Spatial-Variation Regularization Correlation Filters for Visual Tracking
    Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-10
    Hengcheng Fu; Yihong Zhang; Wuneng Zhou; Xiaofeng Wang; Huanlong Zhang

    Single-object tracking is a significant and challenging computer vision problem. Recently, discriminative correlation filters (DCF) have shown excellent performance. However, they have a theoretical defect: the boundary effect, caused by the periodic assumption on training samples, greatly limits tracking performance. Spatially regularized DCF (SRDCF) introduces a spatial regularization that penalizes the filter coefficients depending on their spatial location, which improves tracking performance considerably. However, this simple regularization strategy applies unequal penalties to the filter coefficients in the target area, which makes the filter learn a distorted object appearance model. In this paper, a novel spatial regularization strategy is proposed that utilizes a reliability map to approximate the target area and to keep the penalty coefficients of the relevant region consistent. In addition, we introduce a spatial variation regularization component, the second-order difference of the filter, which smooths changes in the filter coefficients to prevent the filter from over-fitting the current frame. Furthermore, an efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) is developed. Comprehensive experiments are performed on three benchmark datasets: OTB-2013, OTB-2015 and TempleColor-128, and our algorithm achieves more favorable performance than several state-of-the-art methods. Compared with SRDCF, our approach obtains an absolute gain of 6.6% and 5.1% in mean distance precision on OTB-2013 and OTB-2015, respectively. Our approach runs in real-time on a CPU.

  • CollectiveSports: A Multi-task dataset for collective activity recognition
    Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-09
    Cemil Zalluhoglu; Nazli Ikizler-Cinbis

    Collective activity recognition is an important subtask of human action recognition, for which the existing datasets are mostly limited. In this paper, we look into this issue and introduce the “Collective Sports (C-Sports)” dataset, which is a novel benchmark dataset for multi-task recognition of both collective activity and sports categories. Various state-of-the-art techniques are evaluated on this dataset, together with multi-task variants which demonstrate increased performance. The experimental results show that while the sports categories of the videos are inferred accurately, there is still room for improvement in collective activity recognition, especially regarding the ability to generalize to previously unseen sports categories. In order to evaluate this ability, we introduce a novel evaluation protocol called unseen sports, where training and testing are carried out on disjoint sets of sports categories. The relatively lower recognition performance under this evaluation protocol indicates that the recognition models tend to be influenced by the surrounding context, rather than focusing on the essence of the collective activities. We believe that the C-Sports dataset will stir further interest in this research direction.

  • Bottom-up Unsupervised Image Segmentation using FC-Dense u-net based Deep Representation Clustering and Multidimensional Feature Fusion based Region Merging
    Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-08
    Zubair Khan; Jie Yang

    Recent advances in system resources have eased the application of deep learning approaches in computer vision. In this paper, we propose a deep learning-based unsupervised approach for natural image segmentation. Image segmentation aims to partition an image into regions representing the various objects in the image. Our method consists of a fully convolutional dense network-based unsupervised deep representation oriented clustering, followed by shallow-feature-based high-dimensional region merging to produce the final segmented image. We evaluate our proposed approach on the BSD300 database and perform a comparison with several classical and some recent deep learning-based unsupervised segmentation methods. The experimental results indicate that the proposed method is comparable to these methods and confirm the efficacy of the proposed approach.

  • SAANet: Spatial adaptive alignment network for object detection in automatic driving
    Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-07
    Junying Chen; Tongyao Bai

    Both images and point clouds are beneficial for object detection in a visual navigation module for autonomous driving. The spatial relationships between different objects at different times in a bimodal space can vary significantly. It is difficult to combine bimodal descriptions into a unified model to effectively detect objects in an efficient amount of time. In addition, conventional voxelization methods resolve point clouds into voxels at a global level, and often overlook local attributes of the voxels. To address these problems, we propose a novel fusion-based deep framework named SAANet. SAANet utilizes a spatial adaptive alignment (SAA) module to align point cloud features and image features, by automatically discovering the complementary information between point clouds and images. Specifically, we transform the point clouds into 3D voxels, and introduce local orientation encoding to represent the point clouds. Then, we use a sparse convolutional neural network to learn a point cloud feature. Simultaneously, a ResNet-like 2D convolutional neural network is used to extract an image feature. Next, the point cloud feature and image feature are fused by our SAA block to derive a comprehensive feature. Then, the labels and 3D boxes for objects are learned using a multi-task learning network. Finally, an experimental evaluation on the KITTI benchmark demonstrates the advantages of our method in terms of average precision and inference time, as compared to previous state-of-the-art results for 3D object detection.

  • Enhancing deep discriminative feature maps via perturbation for face presentation attack detection
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-02
    Yasar Abbas Ur Rehman; Lai-Man Po; Jukka Komulainen

    Face presentation attack detection (PAD) in unconstrained conditions is one of the key issues in face biometric-based authentication and security applications. In this paper, we propose a perturbation layer (a learnable pre-processing layer for low-level deep features) to enhance the discriminative ability of deep features in face PAD. The perturbation layer takes the deep features of a candidate layer in a Convolutional Neural Network (CNN) and the corresponding hand-crafted features of an input image, and produces adaptive convolutional weights for the deep features of the candidate layer. These adaptive convolutional weights determine the importance of the pixels in the deep features of the candidate layer for face PAD. The proposed perturbation layer adds very little overhead to the total trainable parameters in the model. We evaluated the proposed perturbation layer with Local Binary Patterns (LBP), with and without color information, on three publicly available face PAD databases, i.e., the CASIA, Idiap Replay-Attack, and OULU-NPU databases. Our experimental results show that the introduction of the proposed perturbation layer in the CNN improved the face PAD performance in both intra-database and cross-database scenarios. Our results also highlight the attention created by the proposed perturbation layer in the deep features and its effectiveness for face PAD in general.

  • A fast and accurate iterative method for the camera pose estimation problem
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-16
    Maoteng Zheng; Shiguang Wang; Xiaodong Xiong; Junfeng Zhu

    This paper presents a fast and accurate iterative method for the camera pose estimation problem. The dependence on initial values is reduced by replacing unknown angular parameters with three independent non-angular parameters. Image point coordinates are treated as observations with errors, and a new model is built using a conditional adjustment with parameters for relative orientation. This model allows for the estimation of the errors in the observations. The estimated observation errors are then used iteratively to detect and eliminate gross errors in the adjustment. A total of 22 synthetic datasets and 10 real datasets are used to compare the proposed method with the traditional iterative method, the 5-point-RANSAC and the state-of-the-art 5-point-USAC methods. Preliminary results show that our proposed method is not only faster than the other methods, but also more accurate and stable.

  • Skin detection and lightweight encryption for privacy protection in real-time surveillance applications
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-10
    Amna Shifa; Muhammad Babar Imtiaz; Mamoona Naveed Asghar; Martin Fleury

    An individual's privacy is a significant concern in surveillance videos. Existing research into locating individuals on the basis of detecting their skin focuses either on different techniques for detecting human skin or on protecting individuals from the consequences of applying such techniques. This paper considers both lines of research and proposes a hybrid scheme for human skin detection and subsequent privacy protection by utilizing color information in dynamically varying illumination and environmental conditions. For those purposes, dynamic and explicit skin-detection approaches are implemented, simultaneously considering multiple color-spaces, i.e. RGB, perceptual (HSV) and orthogonal (YCbCr) color-spaces, and then detecting the human skin by the proposed Combined Threshold Rule (CTR)-based segmentation. Comparative qualitative and quantitative detection results, with an average 93.73% accuracy, imply that the proposed scheme achieves considerable accuracy without incurring a training cost. Once skin detection has been performed, the detected skin pixels (including false positives) are encrypted, and standard AES-CFB encryption of skin pixels is shown to be preferable to selective encryption of a whole video frame. The scheme preserves the behavior of the subjects within the video. Hence, subsequent image processing and behavior analysis, if required, can be performed by an authorized user. The experimental results are encouraging, as they show that the average encryption time is 8.268 s and the Encryption Space Ratio (ESR) is an average 7.25% for a high definition video (1280 × 720 pixels/frame).
    A performance comparison in terms of Correct Detection Rate (CDR) showed an average 91.5% for CTR-based segmentation, compared to using only one color space for segmentation, such as RGB with 85.86%, HSV with 80.93% and YCbCr with an average 84.8%, which implies that the proposed method of combining color-space skin identifications has a higher ability to detect skin accurately. Security analysis confirmed that the proposed scheme could be a suitable choice for real-time surveillance applications operating on resource-constrained devices.

  • Multi-feature fusion for image retrieval using constrained dominant sets
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-12
    Leulseged Tesfaye Alemu; Marcello Pelillo

    Aggregating different image features for image retrieval has recently shown its effectiveness. While highly effective, though, the question of how to uplift the impact of the best features for a specific query image persists as an open computer vision problem. In this paper, we propose a computationally efficient approach to fuse several hand-crafted and deep features, based on the probabilistic distribution of a given membership score of a constrained cluster in an unsupervised manner. First, we introduce an incremental nearest neighbor (NN) selection method, whereby we dynamically select k-NN to the query. We then build several graphs from the obtained NN sets and employ constrained dominant sets (CDS) on each graph G to assign edge weights which consider the intrinsic manifold structure of the graph, and detect false matches to the query. Finally, we elaborate the computation of feature positive-impact weight (PIW) based on the dispersive degree of the characteristics vector. To this end, we exploit the entropy of a cluster membership-score distribution. In addition, the final NN set bypasses a heuristic voting scheme. Experiments on several retrieval benchmark datasets show that our method can improve the state-of-the-art result.
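
    The idea of weighting features by the entropy of a membership-score distribution can be sketched as below. The abstract does not give the PIW formula, so everything here is an assumption for illustration: the scores are normalized into a distribution, a peaked (low-entropy) distribution is treated as more informative, and the names `shannon_entropy` and `feature_weights` are hypothetical.

```python
import math


def shannon_entropy(scores):
    """Shannon entropy (natural log) of non-negative scores normalized to sum to 1."""
    total = sum(scores)
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log(p) for p in probs)


def feature_weights(score_lists):
    """Give each feature a weight that grows as its score entropy shrinks.

    Assumed rule: raw weight = 1 - H / log(n), then normalize across features.
    """
    raw = []
    for scores in score_lists:
        if len(scores) < 2:
            raw.append(0.0)  # entropy is undefined for a single score
            continue
        max_entropy = math.log(len(scores))
        raw.append(max(0.0, 1.0 - shannon_entropy(scores) / max_entropy))
    total = sum(raw)
    if total == 0:
        # All features equally uninformative: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]


# Hypothetical membership scores for two features over four neighbors.
peaked = [0.7, 0.1, 0.1, 0.1]       # one clear match stands out
flat = [0.25, 0.25, 0.25, 0.25]     # no neighbor stands out
weights = feature_weights([peaked, flat])
```

    Here the feature with the peaked score distribution receives almost all of the weight, while the flat one contributes close to nothing.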

  • A new cast shadow detection method for traffic surveillance video analysis using color and statistical modeling
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-13
    Hang Shi; Chengjun Liu

    In traffic surveillance video analysis systems, the cast shadows of vehicles often have a negative effect on video analysis results. A novel cast shadow detection framework, which consists of a new foreground detection method and a cast shadow detection method, is presented in this paper to detect and remove the cast shadows from the foreground. The new foreground detection method applies an innovative Global Foreground Modeling (GFM) method, a Gaussian mixture model or GMM, and the Bayes classifier for foreground and background classification. While the GFM method is for global foreground modeling, the GMM is for local background modeling, and the Bayes classifier applies both the foreground and the background models for foreground detection. The rationale of the GFM method stems from the observation that the foreground objects often appear in recent frames and their trajectories often lead them to different locations in these frames. As a result, the statistical models used to characterize the foreground objects should not be pixel based or locally defined. The cast shadow detection method contains four hierarchical steps. First, a set of new chromatic criteria is presented to detect the candidate shadow pixels in the HSV color space. Second, a new shadow region detection method is proposed to cluster the candidate shadow pixels into shadow regions. Third, a statistical shadow model, which uses a single Gaussian distribution to model the shadow class, is presented for classifying shadow pixels. Fourth, an aggregated shadow detection method is presented for final shadow detection. Experiments using the public video data ‘Highway-1’ and ‘Highway-3’, and the New Jersey Department of Transportation (NJDOT) real traffic video sequences show the feasibility of the proposed method. In particular, the proposed method achieves better shadow detection performance than the popular shadow detection methods, and is able to improve the traffic video analysis results.

  • Single image dehazing via a dual-fusion method
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-14
    Yin Gao; Qiming Li; Jun Li

  • Post-mortem iris recognition with deep-learning-based image segmentation
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-13
    Mateusz Trokielewicz; Adam Czajka; Piotr Maciejewicz

    This paper proposes the first iris recognition methodology known to us that is designed specifically for post-mortem samples. We propose to use deep learning-based iris segmentation models to extract highly irregular iris texture areas in post-mortem iris images. We show how to use segmentation masks predicted by neural networks in a conventional, Gabor-based iris recognition method, which employs circular approximations of the pupillary and limbic iris boundaries. As a whole, this method allows for a significant improvement in post-mortem iris recognition accuracy over the methods designed only for ante-mortem irises, including the academic OSIRIS and commercial IriCore implementations. The proposed method reaches an EER of less than 1% for samples collected up to 10 hours after death, compared to EERs of 16.89% and 5.37% observed for OSIRIS and IriCore, respectively. For samples collected up to 369 hours post-mortem, the proposed method achieves an EER of 21.45%, while 33.59% and 25.38% are observed for OSIRIS and IriCore, respectively. Additionally, the method is tested on a database of iris images collected from ophthalmology clinic patients, for which it also offers an advantage over the two other algorithms. This work is the first step towards post-mortem-specific iris recognition, which increases the chances of identifying deceased subjects in forensic investigations. The new database of post-mortem iris images acquired from 42 subjects, as well as the deep learning-based segmentation models, are made available along with the paper to ensure that all the results presented in this manuscript are reproducible.
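
    The equal error rate (EER) quoted above is the operating point where the false accept rate equals the false reject rate. A minimal sketch of how it might be estimated from comparison scores follows. It assumes distance-like scores (lower means more similar, as with Hamming distances between iris codes), a simple threshold sweep over the observed scores, and made-up data; it is not the evaluation code used in the paper.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from distance scores (lower = more similar).

    Sweeps every observed score as a threshold and returns the mean of FAR
    and FRR at the threshold where the two rates are closest.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine) | set(impostor)):
        # False accepts: impostor pairs whose distance passes the threshold.
        far = sum(1 for d in impostor if d <= threshold) / len(impostor)
        # False rejects: genuine pairs whose distance exceeds the threshold.
        frr = sum(1 for d in genuine if d > threshold) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Hypothetical comparison scores, for illustration only.
genuine_scores = [0.10, 0.20, 0.30, 0.45]
impostor_scores = [0.35, 0.40, 0.50, 0.60]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

    For these toy scores the two error rates meet at a threshold of 0.35, where one of four impostor pairs is accepted and one of four genuine pairs is rejected, giving an EER of 25%.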

  • Variational shape prior segmentation with an initial curve based on image registration technique
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-16
    Doyeob Yeo; Chang-Ock Lee

    In general images, it is hard in practice to isolate only the desired object using conventional image segmentation methods. In many cases, we can segment the desired object by using the shape information of the object in addition to the standard image segmentation. Chan and Zhu's model is not robust to the intensity changes of objects. In this paper, we propose a novel model for shape prior segmentation that produces robust results using hierarchical image segmentation and an attraction term. Moreover, we adopt an image registration technique and a multi-region image segmentation to obtain an initial curve for a given shape prior. Finally, we consider free-form deformation in obtaining the shape function from the reference shape prior for real-world images. Numerical experiments demonstrate results that are independent of the intensities of objects and the location of the reference shape prior. All numerical calculations are automatic and proceed without any user input.

  • Flow Adaptive Video Object Segmentation
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-21
    Fanqing Lin; Yao Chou; Tony Martinez

    We tackle the task of semi-supervised video object segmentation, i.e. pixel-level object classification of the images in video sequences using very limited ground truth training data from the corresponding video. We present FLow Adaptive Video Object Segmentation (FLAVOS), an efficient pipeline based on a novel online adaptation algorithm that utilizes optical flow and is capable of tracking objects effectively throughout videos. In contrast to most recent deep learning based approaches, which trade efficiency for accuracy, we provide extensive complexity analysis and additionally demonstrate that FLAVOS is a natural fit for real-world applications by introducing an interactive pipeline that enables the user to provide feedback for online training. Our method achieves state-of-the-art accuracy on three challenging benchmark datasets and nearly ground-truth level segmentation results with interactive user feedback.

  • Online Maximum a Posteriori Tracking of Multiple Objects Using Sequential Trajectory Prior
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-13
    Min Yang; Mingtao Pei; Yunde Jia

    In this paper, we address the problem of online multi-object tracking based on the Maximum a Posteriori (MAP) framework. Given the observations up to the current frame, we estimate the optimal object trajectories via two MAP estimation stages: object detection and data association. By introducing the sequential trajectory prior, i.e., the prior information from previous frames about “good” trajectories, into the two MAP stages, the inference of optimal detections is refined and the association correctness between trajectories and detections is enhanced. Furthermore, the sequential trajectory prior allows the two MAP stages to interact with each other in a sequential manner, which jointly optimizes the detections and trajectories to facilitate online multi-object tracking. Compared with existing methods, our approach is able to alleviate the association ambiguity caused by noisy detections and frequent inter-object interactions without using sophisticated association likelihood models. The experiments on publicly available challenging datasets demonstrate that our approach provides superior tracking performance over state-of-the-art algorithms in various complex scenes.

  • Coupled Generative Adversarial Network for Heterogeneous Face Recognition
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-10
    Seyed Mehdi Iranmanesh; Benjamin Riggan; Shuowen Hu; Nasser M. Nasrabadi

    The large modality gap between faces captured in different spectra makes heterogeneous face recognition (HFR) a challenging problem. In this paper, we present a coupled generative adversarial network (CpGAN) to address the problem of matching non-visible facial imagery against a gallery of visible faces. Our CpGAN architecture consists of two sub-networks, one dedicated to the visible spectrum and the other dedicated to the non-visible spectrum. Each sub-network consists of a generative adversarial network (GAN) architecture. Inspired by a dense network which is capable of maximizing the information flow among features at different levels, we utilize a densely connected encoder-decoder structure as the generator in each GAN sub-network. The proposed CpGAN framework uses multiple loss functions to force the features from each sub-network to be as close as possible for the same identities in a common latent subspace. To achieve a realistic photo reconstruction while preserving the discriminative information, we also add a perceptual loss function to the coupling loss function. An ablation study is performed to show the effectiveness of different loss functions in optimizing the proposed method. Moreover, the superiority of the model compared to the state-of-the-art models in HFR is demonstrated using multiple datasets.

  • Face presentation attack detection in mobile scenarios: A comprehensive evaluation
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-12
    Shan Jia; Guodong Guo; Zhengquan Xu; Qiangchang Wang

    The vulnerability of face recognition systems to different presentation attacks has aroused increasing concern in the biometric community. Face presentation attack detection (PAD) techniques, which aim to distinguish real face samples from spoof artifacts, are an efficient countermeasure. In recent years, various methods have been proposed to address 2D face presentation attacks, including photo print attacks and video replay attacks. However, it is difficult to tell which methods perform better against these attacks, especially in practical mobile authentication scenarios, since there is no systematic evaluation or benchmark of the state-of-the-art methods on a common ground (i.e., using the same databases and protocols). Therefore, this paper presents a comprehensive evaluation of several representative face PAD methods (30 in total) on three public mobile spoofing datasets to quantitatively compare their detection performance. Furthermore, the generalization ability of existing methods is tested under cross-database testing scenarios to show the possible database bias. We also summarize meaningful observations and give some insights that will help promote both academic research and practical applications.

  • Depth prediction from 2D images: A taxonomy and an evaluation study
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-10
    Ambroise Moreau; Matei Mancas; Thierry Dutoit

    Among the various cues that help us understand and interact with our surroundings, depth is of particular importance. It allows us to move in space and grab objects to complete different tasks. Therefore, depth prediction has been an active research field for decades and many algorithms have been proposed to retrieve depth. Some imitate human vision and compute depth through triangulation on correspondences found between pixels or handcrafted features in different views of the same scene. Others rely on simple assumptions and semantic knowledge of the structure of the scene to get the depth information. Recently, numerous algorithms based on deep learning have emerged from the computer vision community. They implement the same principles as the non-deep learning methods and leverage the ability of deep neural networks to automatically learn important features that help to solve the task. By doing so, they produce new state-of-the-art results and show encouraging prospects. In this article, we propose a taxonomy of deep learning methods for depth prediction from 2D images. We retained the training strategy as the sorting criterion. Indeed, some methods are trained in a supervised manner, which means depth labels are needed during training, while others are trained in an unsupervised manner. In that case, the models learn to perform a different task, such as view synthesis, and depth is only a by-product of this learning. In addition to this taxonomy, we also evaluate nine models on two similar datasets without retraining. Our analysis showed that (i) most models are sensitive to sharp discontinuities created by shadows or colour contrasts and (ii) the post-processing applied to the results before computing the commonly used metrics can change the model ranking. Moreover, we showed that most metrics agree with each other and are thus redundant.
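
    The abstract does not list the metrics it evaluates, so the sketch below uses three that are widely reported in the depth prediction literature (absolute relative error, RMSE, and the fraction of pixels with ratio below 1.25) and shows median scaling as one hypothetical example of post-processing. The function names and the toy depth values are assumptions made for illustration, not taken from the paper.

```python
import math
import statistics


def depth_metrics(predicted, ground_truth):
    """Compute AbsRel, RMSE and delta < 1.25 over pixels with valid depth."""
    pairs = [(p, g) for p, g in zip(predicted, ground_truth) if p > 0 and g > 0]
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


def median_scale(predicted, ground_truth):
    """Rescale predictions so their median matches the ground-truth median.

    This is one possible post-processing step; applying it or not can
    change the metric values and therefore how models are ranked.
    """
    scale = statistics.median(ground_truth) / statistics.median(predicted)
    return [p * scale for p in predicted]


# Hypothetical depths in metres for three pixels.
pred = [1.0, 2.0, 4.0]
truth = [1.0, 2.2, 2.0]
metrics = depth_metrics(pred, truth)
rescaled = median_scale([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

    Running the metrics once on raw predictions and once on rescaled ones is a quick way to see the effect described in point (ii) of the abstract.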

  • An integrated ship segmentation method based on discriminator and extractor
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-09
    Wen Zhang; Xujie He; Wanyi Li; Zhi Zhang; Yongkang Luo; Li Su; Peng Wang

    Ship segmentation is an important task in maritime surveillance systems. A great deal of research on image segmentation has been done in the past few years, but problems appear when these methods are applied directly to ship segmentation against complex maritime backgrounds. The interference factors that decrease segmentation performance usually stem from the peculiarities of the complex maritime background, such as sea fog, large wakes and large waves. To deal with these interference factors, this paper presents an integrated ship segmentation method based on discriminator and extractor (ISDE). Different from traditional segmentation methods, our method consists of two structural components: an Interference Factor Discriminator (IFD) and a Ship Extractor (SE). SqueezeNet is employed to implement the IFD as the first step, which judges what interference factors are contained in the input image, while DeepLabv3+ and an improved DeepLabv3+ are employed to implement the SE as the second step, which finally extracts the ships. We collect a ship segmentation dataset and conduct intensive experiments on it. The experimental results demonstrate that our method for ship segmentation outperforms state-of-the-art methods in terms of segmentation accuracy, especially for images that contain sea fog. In addition, our method can run in real time.

  • View-based weight network for 3D object recognition
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-09
    Qiang Huang; Yongxiong Wang; Zhong Yin

    Projective methods have generally achieved better results in 3D object recognition in recent years. This may be analogous to the way human perception of 3D shapes relies on various 2D observations that are formed unconsciously on the retina. Each projection is treated equally in existing methods. However, we note that different viewpoint images of the same object have different discriminative features, and only some of the images are significant. We propose a novel View-based Weight Network (VWN) for 3D object recognition in which different view-based weights are assigned to different projections. The trainable view-level weights are incorporated as a pooling layer of the multi-view residual network. The pooling layer contains 7 sub-layers. Meanwhile, we find a simple unsupervised criterion to evaluate the prediction results before they are output. To improve the recognition accuracy, a new multi-channel integrated classifier combining Extreme Learning Machine, KNN, SVM and Random Forest is proposed based on the criterion. The multi-channel classifier can bring the Top1 accuracy close to that of Top2. Experiments on the Princeton ModelNet 3D datasets demonstrate that our proposed method significantly outperforms the state-of-the-art approaches in recognition accuracy.

  • Improving head pose estimation using two-stage ensembles with top-k regression
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-09
    Bin Huang; Renwen Chen; Wang Xu; Qinbang Zhou

    Conventional head pose estimation methods treat the problem as either a classification or a regression task. The accuracy of classification-based approaches is limited by the pose quantization interval, and regression-based methods are fragile under extremely large poses in non-ideal conditions. In contrast to these methods, this paper introduces a novel head pose estimation method using two-stage ensembles with average top-k regression. The first stage is a binned classification subtask with the optimal pose partition. The second stage achieves average top-k regression based on the former prediction. Then we combine the two subtasks by considering the task-dependent weights instead of setting coefficients by grid search. We conduct several experiments to analyze the optimal pose partition for the classification part and to validate the average top-k loss for the regression part. Furthermore, we report the performance of the proposed method on the AFW, AFLW2000 and BIWI datasets, and the results show rather competitive performance in head pose prediction.
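
    The abstract above does not spell out the average top-k loss. As a minimal sketch, assuming the common definition (the mean of the k largest per-sample losses in a batch), it could look like the following. The function name and the sample values are hypothetical, and the paper's exact formulation may differ.

```python
def average_top_k(losses, k):
    """Return the mean of the k largest per-sample losses (assumed definition)."""
    if not 1 <= k <= len(losses):
        raise ValueError("k must be between 1 and len(losses)")
    largest = sorted(losses, reverse=True)[:k]
    return sum(largest) / k


# Hypothetical per-sample absolute angle errors (in degrees) for one batch.
errors = [1.0, 4.0, 2.0, 8.0]
print(average_top_k(errors, k=2))  # mean of the two hardest samples, 8.0 and 4.0
print(average_top_k(errors, k=4))  # k = batch size reduces to the ordinary mean
```

    Under this assumed definition, k = 1 gives the maximum loss and k equal to the batch size gives the average loss, so k interpolates between the two.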

  • Spec-Net and Spec-CGAN: Deep learning models for specularity removal from faces
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-07
    Siraj Muhammad; Matthew N. Dailey; Muhammad Farooq; Muhammad F. Majeed; Mongkol Ekpanyapong

    The process of splitting an image into specular and diffuse components is a fundamental problem in computer vision, because most computer vision algorithms, such as image segmentation and tracking, assume diffuse surfaces, so existence of specular reflection can mislead algorithms to make incorrect decisions. Existing decomposition methods tend to work well for images with low specularity and high chromaticity, but they fail in cases of high intensity specular light and on images with low chromaticity. In this paper, we address the problem of removing high intensity specularity from low chromaticity images (faces). We introduce a new dataset, Spec-Face, comprising face images corrupted with specular lighting and corresponding ground truth diffuse images. We also introduce two deep learning models for specularity removal, Spec-Net and Spec-CGAN. Spec-Net takes an intensity channel as input and produces an output image that is very close to ground truth, while Spec-CGAN takes an RGB image as input and produces a diffuse image very similar to the ground truth RGB image. On Spec-Face, with Spec-Net, we obtain a peak signal-to-noise ratio (PSNR) of 3.979, a local mean squared error (LMSE) of 0.000071, a structural similarity index (SSIM) of 0.899, and a Fréchet Inception Distance (FID) of 20.932. With Spec-CGAN, we obtain a PSNR of 3.360, a LMSE of 0.000098, a SSIM of 0.707, and a FID of 31.699. With Spec-Net and Spec-CGAN, it is now feasible to perform specularity removal automatically prior to other critical complex vision processes for real world images, i.e., faces. This will potentially improve the performance of algorithms later in the processing stream, such as face recognition and skin cancer detection.
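
    For readers unfamiliar with the first metric quoted above, the following is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE), over flattened pixel lists. The function name, the 8-bit peak value of 255 and the sample pixels are assumptions for illustration; the abstract does not state the value range or any normalization the authors used.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between ground truth and the predicted image.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical ground-truth diffuse pixels and a prediction that is off by one level.
truth = [100.0, 110.0, 120.0, 130.0]
predicted = [101.0, 109.0, 121.0, 129.0]
print(psnr(truth, predicted))
```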

  • Salient object detection via double random walks with dual restarts
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-06
    Jiaxing Yang; Xiang Fang; Lihe Zhang; Huchuan Lu; Guohua Wei

    In this paper, we propose a novel saliency model based on double random walks with dual restarts. Two agents (also known as walkers), representing the foreground and background properties respectively, simultaneously walk on a graph to explore the saliency distribution. First, we propose the propagation distance measure and use it, instead of geodesic distance, to calculate the initial distributions of the two agents. Second, the two agents traverse the graph starting from their own initial distributions, and then interact with each other to correct their travel routes through the restart mechanism, which forces the agents to return to some specific nodes with a certain probability after every movement. We define the dual restarts to take into account the interaction between, and the weighting of, the two agents. Extensive evaluations demonstrate that the proposed algorithm performs favorably against other state-of-the-art methods on four benchmark datasets.
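
    The restart mechanism described above builds on the generic random walk with restart. The sketch below shows only that textbook single-walker version, iterating p <- (1 - c) * P^T p + c * r. It is not the paper's dual-restart model with two interacting agents, and the graph, the restart probability c and all names are hypothetical.

```python
def random_walk_with_restart(transition, restart, c=0.15, iters=200):
    """Power iteration for p <- (1 - c) * P^T p + c * r.

    transition[i][j] is the probability of moving from node i to node j
    (each row sums to 1); restart is the distribution the walker jumps back to.
    """
    n = len(restart)
    p = list(restart)
    for _ in range(iters):
        p = [
            (1.0 - c) * sum(p[i] * transition[i][j] for i in range(n))
            + c * restart[j]
            for j in range(n)
        ]
    return p


# Toy 3-node chain 0 - 1 - 2; the walker restarts at node 0.
P = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
seed = [1.0, 0.0, 0.0]
scores = random_walk_with_restart(P, seed)
print(scores)  # node 0 ends up with more mass than the symmetric node 2
```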

  • Fine-Grained Image Retrieval via Piecewise Cross Entropy loss
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-01
    Xianxian Zeng; Yun Zhang; Xiaodong Wang; Kairui Chen; Dong Li; Weijun Yang

    Fine-Grained Image Retrieval is an important problem in computer vision. It is more challenging than content-based image retrieval because it has small diversity between different classes but large diversity within the same class. Recently, the cross entropy loss has been utilized to make a Convolutional Neural Network (CNN) generate discriminative features for Fine-Grained Image Retrieval, and it can obtain further improvement with some extra operations, such as a Normalize-Scale layer. In this paper, we propose a variant of the cross entropy loss, named the Piecewise Cross Entropy loss function, for enhancing model generalization and improving retrieval performance. In addition, the Piecewise Cross Entropy loss is easy to implement. We evaluate the performance of the proposed scheme on two standard fine-grained retrieval benchmarks, and obtain significant improvements over the state-of-the-art, with 11.8% and 3.3% over the previous work on CARS196 and CUB-200-2011, respectively.

  • ResFeats: Residual network based features for underwater image classification
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-01
    Ammar Mahmood; Mohammed Bennamoun; Senjian An; Ferdous Sohel; Farid Boussaid

    Oceanographers rely on advanced digital imaging systems to assess the health of marine ecosystems. The majority of the imagery collected by these systems does not get annotated due to lack of resources. Consequently, the expert-labeled data is not enough to train dedicated deep networks. Meanwhile, in the deep learning community, much focus is on transfer learning and on how to use pre-trained deep networks to classify out-of-domain images. In this paper, we leverage these advances to evaluate how well features extracted from deep neural networks transfer to underwater image classification. We propose new image features (called ResFeats) extracted from the different convolutional layers of a deep residual network pre-trained on ImageNet. We further combine the ResFeats extracted from different layers to obtain compact and powerful deep features. Moreover, we show that ResFeats consistently perform better than their CNN counterparts. Experimental results are provided to show the effectiveness of ResFeats with state-of-the-art classification accuracies on the MLC, Benthoz15, EILAT and RSMAS datasets.

  • Multiple stream deep learning model for human action recognition
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-10-31
    Ye Gu; Xiaofeng Ye; Weihua Sheng; Yongsheng Ou; Yongqiang Li

    Human action recognition is one of the most important and challenging topics in the field of image processing. Unlike object recognition, action recognition requires motion feature modeling, which contains not only spatial but also temporal information. In this paper, we use multiple models to characterize both global and local motion features. Global motion patterns are represented efficiently by the depth-based 3-channel motion history images (MHIs). Meanwhile, the local spatial and temporal patterns are extracted from the skeleton graph. The decisions of these two streams are fused. Finally, domain knowledge, namely the object/action dependency, is considered. The proposed framework is evaluated on two RGB-D datasets. The experimental results show the effectiveness of our proposed approach. The performance is comparable with the state-of-the-art.

  • Multi-label learning for concept-oriented labels of product image data
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-10-31
    Yong Dai; Yi Li; Shu-Tao Li

    In the design field, designers usually retrieve images for reference according to product attributes when designing new proposals. To obtain the attributes of a product, designers spend much time and effort collecting product images and annotating them with multiple labels. However, the labels of product images represent concepts of subjective perception, which makes multi-label learning more challenging, since it must imitate human aesthetics rather than discriminate appearance. In this paper, a Feature Correlation Learning (FCL) network is proposed to solve this problem by exploiting the potential feature correlations of product images. Given a product image, the FCL network calculates the features of different levels and their correlations via Gram matrices. The FCL network is aggregated with DenseNet to predict the labels of the input product image. The proposed method is compared with several outstanding multi-label learning methods, as well as DenseNet. Experimental results demonstrate that the proposed method outperforms the state-of-the-art methods on the multi-label learning problem for product image data.

  • Region-based Fitting of Overlapping Ellipses and its application to cells segmentation
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-10-31
    Costas Panagiotakis; Antonis Argyros

    We present RFOVE, a region-based method for approximating an arbitrary 2D shape with an automatically determined number of possibly overlapping ellipses. RFOVE is completely unsupervised, operates without any assumption or prior knowledge of the object's shape, and extends and improves the Decremental Ellipse Fitting Algorithm (DEFA) [1]. Both RFOVE and DEFA solve the multi-ellipse fitting problem by performing model selection that is guided by the minimization of the Akaike Information Criterion on a suitably defined shape complexity measure. However, in contrast to DEFA, RFOVE minimizes an objective function that allows for ellipses with a higher degree of overlap and, thus, achieves better ellipse-based shape approximation. A comparative evaluation of RFOVE with DEFA on several standard datasets shows that RFOVE achieves better shape coverage with simpler models (fewer ellipses). As a practical exploitation of RFOVE, we present its application to the problem of detecting and segmenting potentially overlapping cells in fluorescence microscopy images. Quantitative results obtained on three public datasets (one synthetic and two with more than 4000 actual stained cells) show the superiority of RFOVE over the state of the art in overlapping cell segmentation.

  • Saddle: Fast and repeatable features with good coverage
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-09-03
    Javier Aldana-Iuit; Dmytro Mishkin; Ondřej Chum; Jiří Matas

    A novel similarity-covariant feature detector is presented that extracts points whose neighborhoods, when treated as a 3D intensity surface, have a saddle-like intensity profile. The saddle condition is verified efficiently by intensity comparisons on two concentric rings that must have exactly two dark-to-bright and two bright-to-dark transitions satisfying certain geometric constraints. Saddle is a fast approximation of the Hessian detector, as ORB, which implements the FAST detector, is for the Harris detector. We propose to use the matching strategy called "first geometric inconsistent" with binary descriptors, which is suitable for our feature detector, and include experiments with fix point descriptors, both hand-crafted and learned. Experiments show that the Saddle features are general, evenly spread and appear in high density in a range of images. The Saddle detector is among the fastest proposed. In comparison with detectors of similar speed, the Saddle features show superior matching performance on a number of challenging datasets. Compared to recently proposed deep-learning based interest point detectors and popular hand-crafted keypoint detectors, evaluated for repeatability on the ApolloScape dataset [1], the Saddle detector shows the best performance in most of the street-level view sequences, a.k.a. traversals.

  • Multi-view dynamic facial action unit detection
    Image Vis. Comput. (IF 2.747) Pub Date : 2018-09-26
    Andrés Romero; Juán León; Pablo Arbeláez

    We propose a novel convolutional neural network approach to address the fine-grained recognition problem of multi-view dynamic facial action unit detection. We leverage recent gains in large-scale object recognition by formulating the task of predicting the presence or absence of a specific action unit in a still image of a human face as holistic classification. We then explore the design space of our approach by considering both shared and independent representations for separate action units, and also different CNN architectures for combining color and motion information. We then move to the novel setup of the FERA 2017 Challenge, in which we propose a multi-view extension of our approach that operates by first predicting the viewpoint from which the video was taken, and then evaluating an ensemble of action unit detectors that were trained for that specific viewpoint. Our approach is holistic, efficient, and modular, since new action units can be easily included in the overall system. Our approach significantly outperforms the baseline of the FERA 2017 Challenge, with an absolute improvement of 14% on the F1-metric. Additionally, it compares favorably against the winner of the FERA 2017 Challenge.

  • Postnatal gestational age estimation of newborns using Small Sample Deep Learning.
    Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-26
    Mercedes Torres Torres; Michel Valstar; Caroline Henry; Carole Ward; Don Sharkey

    A baby's gestational age determines whether or not they are premature, which helps clinicians decide on suitable post-natal treatment. The most accurate dating methods use Ultrasound Scan (USS) machines, but these are expensive, require trained personnel and cannot always be deployed to remote areas. In the absence of USS, the Ballard Score, a postnatal clinical examination, can be used. However, this method is highly subjective and results vary widely depending on the experience of the examiner. Our main contribution is a novel system for automatic postnatal gestational age estimation using small sets of images of a newborn's face, foot and ear. Our two-stage architecture makes the most out of Convolutional Neural Networks trained on small sets of images to predict broad classes of gestational age, and then fuses the outputs of these discrete classes with a baby's weight to make fine-grained predictions of gestational age using Support Vector Regression. On a purpose-collected dataset of 130 babies, experiments show that our approach surpasses current automatic state-of-the-art postnatal methods and attains an expected error of 6 days. It is three times more accurate than the Ballard method. Making use of images improves predictions by 33% compared to using weight only. This indicates that even with a very small set of data, our method is a viable candidate for postnatal gestational age estimation in areas where USS is not available.

  • The role of image registration in brain mapping.
    Image Vis. Comput. (IF 2.747) Pub Date : 2001-01-01
    A W Toga; P M Thompson

    Image registration is a key step in a great variety of biomedical imaging applications. It provides the ability to geometrically align one dataset with another, and is a prerequisite for all imaging applications that compare datasets across subjects, imaging modalities, or across time. Registration algorithms also enable the pooling and comparison of experimental findings across laboratories, the construction of population-based brain atlases, and the creation of systems to detect group patterns in structural and functional imaging data. We review the major types of registration approaches used in brain imaging today. We focus on their conceptual basis, the underlying mathematics, and their strengths and weaknesses in different contexts. We describe the major goals of registration, including data fusion, quantification of change, automated image segmentation and labeling, shape measurement, and pathology detection. We indicate that registration algorithms have great potential when used in conjunction with a digital brain atlas, which acts as a reference system in which brain images can be compared for statistical analysis. The resulting armory of registration approaches is fundamental to medical image analysis, and in a brain mapping context provides a means to elucidate clinical, demographic, or functional trends in the anatomy or physiology of the brain.

  • Learning Facial Action Units with Spatiotemporal Cues and Multi-label Sampling.
    Image Vis. Comput. (IF 2.747) Pub Date : 2018-12-14
    Wen-Sheng Chu; Fernando De la Torre; Jeffrey F Cohn

    Facial action units (AUs) may be represented spatially, temporally, and in terms of their correlation. Previous research focuses on one or another of these aspects or addresses them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations, and a Long Short-Term Memory (LSTM) to model temporal dependencies among them. The outputs of CNNs and LSTMs are aggregated into a fusion network to produce per-frame prediction of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches in two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to standard multi-label CNN and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and obtained increased accuracy for AU detection. To address class imbalance within and between batches during network training, we introduce multi-labeling sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualization of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.

  • A Portable Stereo Vision System for Whole Body Surface Imaging.
    Image Vis. Comput. (IF 2.747) Pub Date : 2010-02-18
    Wurong Yu; Bugao Xu

    This paper presents a whole body surface imaging system based on stereo vision technology. We have adopted a compact and economical configuration which involves only four stereo units to image the frontal and rear sides of the body. The success of the system depends on a stereo matching process that can effectively segment the body from the background in addition to recovering sufficient geometric details. For this purpose, we have developed a novel sub-pixel, dense stereo matching algorithm which includes two major phases. In the first phase, the foreground is accurately segmented with the help of a predefined virtual interface in the disparity space image, and a coarse disparity map is generated with block matching. In the second phase, local least squares matching is performed in combination with global optimization within a regularization framework, so as to ensure both accuracy and reliability. Our experimental results show that the system can realistically capture smooth and natural whole body shapes with high accuracy.

  • Modelling and Recognition of the Linguistic Components in American Sign Language.
    Image Vis. Comput. (IF 2.747) Pub Date : 2010-02-18
    Liya Ding; Aleix M Martinez

    The manual signs in sign languages are generated and interpreted using three basic building blocks: handshape, motion, and place of articulation. When combined, these three components (together with palm orientation) uniquely determine the meaning of the manual sign. This means that the use of pattern recognition techniques that only employ a subset of these components is inappropriate for interpreting the sign or for building automatic recognizers of the language. In this paper, we define an algorithm to model these three basic components from a single video sequence of two-dimensional pictures of a sign. The recognition results for these three components are then combined to determine the class of the signs in the videos. Experiments are performed on a database of (isolated) American Sign Language (ASL) signs. The results demonstrate that, using semi-automatic detection, all three components can be reliably recovered from two-dimensional video sequences, allowing for an accurate representation and recognition of the signs.

  • Figure-Ground Segmentation Using Factor Graphs.
    Image Vis. Comput. (IF 2.747) Pub Date : 2010-02-18
    Huiying Shen; James Coughlan; Volodymyr Ivanchenko

    Foreground-background segmentation has recently been applied [26,12] to the detection and segmentation of specific objects or structures of interest from the background as an efficient alternative to techniques such as deformable templates [27]. We introduce a graphical model (i.e. Markov random field)-based formulation of structure-specific figure-ground segmentation based on simple geometric features extracted from an image, such as local configurations of linear features, that are characteristic of the desired figure structure. Our formulation is novel in that it is based on factor graphs, which are graphical models that encode interactions among arbitrary numbers of random variables. The ability of factor graphs to express interactions higher than pairwise order (the highest order encountered in most graphical models used in computer vision) is useful for modeling a variety of pattern recognition problems. In particular, we show how this property makes factor graphs a natural framework for performing grouping and segmentation, and demonstrate that the factor graph framework emerges naturally from a simple maximum entropy model of figure-ground segmentation. We cast our approach in a learning framework, in which the contributions of multiple grouping cues are learned from training data, and apply our framework to the problem of finding printed text in natural scenes. Experimental results are described, including a performance analysis that demonstrates the feasibility of the approach.

  • Updated: 2019-11-01
  • Efficient Constrained Local Model Fitting for Non-Rigid Face Alignment.
    Image Vis. Comput. (IF 2.747) Pub Date : 2010-01-05
    Simon Lucey; Yang Wang; Mark Cox; Sridha Sridharan; Jeffery F Cohn

    Active appearance models (AAMs) have demonstrated great utility when employed for non-rigid face alignment/tracking. The "simultaneous" algorithm for fitting an AAM achieves good non-rigid face registration performance, but has poor real time performance (2-3 fps). The "project-out" algorithm for fitting an AAM achieves faster than real time performance (> 200 fps) but suffers from poor generic alignment performance. In this paper we introduce an extension to a discriminative method for non-rigid face registration/tracking referred to as a constrained local model (CLM). Our proposed method is able to achieve superior performance to the "simultaneous" AAM algorithm along with real time fitting speeds (35 fps). To gain this performance, we improve upon the canonical CLM formulation in a number of ways by employing: (i) linear SVMs as patch-experts, (ii) a simplified optimization criterion, and (iii) a composite rather than additive warp update step. Most notably, our simplified optimization criterion for fitting the CLM divides the problem of finding a single complex registration/warp displacement into that of finding N simple warp displacements. From these N simple warp displacements, a single complex warp displacement is estimated using a weighted least-squares constraint. Another major advantage of this simplified optimization stems from its ability to be parallelized, a step which we also theoretically explore in this paper. We refer to our approach for fitting the CLM as the "exhaustive local search" (ELS) algorithm. Experiments were conducted on the CMU Multi-PIE database.

  • The Painful Face - Pain Expression Recognition Using Active Appearance Models.
    Image Vis. Comput. (IF 2.747) Pub Date : 2009-10-01
    Ahmed Bilal Ashraf; Simon Lucey; Jeffrey F Cohn; Tsuhan Chen; Zara Ambadar; Kenneth M Prkachin; Patricia E Solomon

    Pain is typically assessed by patient self-report. Self-reported pain, however, is difficult to interpret and may be impaired or in some circumstances (i.e., young children and the severely ill) not even possible. To circumvent these problems behavioral scientists have identified reliable and valid facial indicators of pain. Hitherto, these methods have required manual measurement by highly skilled human observers. In this paper we explore an approach for automatically recognizing acute pain without the need for human observers. Specifically, our study was restricted to automatically detecting pain in adult patients with rotator cuff injuries. The system employed video input of the patients as they moved their affected and unaffected shoulder. Two types of ground truth were considered. Sequence-level ground truth consisted of Likert-type ratings by skilled observers. Frame-level ground truth was calculated from presence/absence and intensity of facial actions previously associated with pain. Active appearance models (AAM) were used to decouple shape and appearance in the digitized face images. Support vector machines (SVM) were compared for several representations from the AAM and of ground truth of varying granularity. We explored two questions pertinent to the construction, design and development of automatic pain detection systems. First, at what level (i.e., sequence- or frame-level) should datasets be labeled in order to obtain satisfactory automatic pain detection performance? Second, how important is it, at both levels of labeling, that we non-rigidly register the face?

  • Dense 3D Face Alignment from 2D Video for Real-Time Use.
    Image Vis. Comput. (IF 2.747) Pub Date : 2017-02-01
    László A Jeni; Jeffrey F Cohn; Takeo Kanade

    To enable real-time, person-independent 3D registration from 2D video, we developed a 3D cascade regression approach in which facial landmarks remain invariant across pose over a range of approximately 60 degrees. From a single 2D image of a person's face, a dense 3D shape is registered in real time for each frame. The algorithm utilizes a fast cascade regression framework trained on high-resolution 3D face-scans of posed and spontaneous emotion expression. The algorithm first estimates the location of a dense set of landmarks and their visibility, then reconstructs face shapes by fitting a part-based 3D model. Because no assumptions are required about illumination or surface properties, the method can be applied to a wide range of imaging conditions that include 2D video and uncalibrated multi-view video. The method has been validated in a battery of experiments that evaluate its precision of 3D reconstruction, extension to multi-view reconstruction, temporal integration for videos and 3D head-pose estimation. Experimental findings strongly support the validity of real-time, 3D registration and reconstruction from 2D video. The software is available online at http://zface.org.

  • Multiview stereo and silhouette fusion via minimizing generalized reprojection error.
    Image Vis. Comput. (IF 2.747) Pub Date : 2015-01-06
    Zhaoxin Li; Kuanquan Wang; Wenyan Jia; Hsin-Chen Chen; Wangmeng Zuo; Deyu Meng; Mingui Sun

    Accurate reconstruction of 3D geometrical shape from a set of calibrated 2D multiview images is an active yet challenging task in computer vision. The existing multiview stereo methods usually perform poorly in recovering deeply concave and thinly protruding structures, and suffer from several common problems like slow convergence, sensitivity to initial conditions, and high memory requirements. To address these issues, we propose a two-phase optimization method for generalized reprojection error minimization (TwGREM), where a generalized framework of reprojection error is proposed to integrate stereo and silhouette cues into a unified energy function. For the minimization of the function, we first introduce a convex relaxation on 3D volumetric grids which can be efficiently solved using variable splitting and Chambolle projection. Then, the resulting surface is parameterized as a triangle mesh and refined using surface evolution to obtain a high-quality 3D reconstruction. Our comparative experiments with several state-of-the-art methods show that the performance of TwGREM based 3D reconstruction is among the highest with respect to accuracy and efficiency, especially for data with smooth texture and sparsely sampled viewpoints.

  • Nonverbal Social Withdrawal in Depression: Evidence from manual and automatic analysis.
    Image Vis. Comput. (IF 2.747) Pub Date : 2014-11-08
    Jeffrey M Girard; Jeffrey F Cohn; Mohammad H Mahoor; S Mohammad Mavadati; Zakia Hammal; Dean P Rosenwald

    The relationship between nonverbal behavior and severity of depression was investigated by following depressed participants over the course of treatment and video recording a series of clinical interviews. Facial expressions and head pose were analyzed from video using manual and automatic systems. Both systems were highly consistent for FACS action units (AUs) and showed similar effects for change over time in depression severity. When symptom severity was high, participants made fewer affiliative facial expressions (AUs 12 and 15) and more non-affiliative facial expressions (AU 14). Participants also exhibited diminished head motion (i.e., amplitude and velocity) when symptom severity was high. These results are consistent with the Social Withdrawal hypothesis: that depressed individuals use nonverbal behavior to maintain or increase interpersonal distance. As individuals recover, they send more signals indicating a willingness to affiliate. The finding that automatic facial expression analysis was both consistent with manual coding and revealed the same pattern of findings suggests that automatic facial expression analysis may be ready to relieve the burden of manual coding in behavioral and clinical science.

  • Classification and Weakly Supervised Pain Localization using Multiple Segment Representation.
    Image Vis. Comput. (IF 2.747) Pub Date : 2014-09-23
    Karan Sikka; Abhinav Dhall; Marian Stewart Bartlett

    Automatic pain recognition from videos is a vital clinical application and, owing to its spontaneous nature, poses interesting challenges to automatic facial expression recognition (AFER) research. Previous pain vs no-pain systems have highlighted two major challenges: (1) ground truth is provided for the sequence, but the presence or absence of the target expression for a given frame is unknown, and (2) the time point and the duration of the pain expression event(s) in each video are unknown. To address these issues we propose a novel framework (referred to as MS-MIL) where each sequence is represented as a bag containing multiple segments, and multiple instance learning (MIL) is employed to handle this weakly labeled data in the form of sequence level ground-truth. These segments are generated via multiple clustering of a sequence or running a multi-scale temporal scanning window, and are represented using a state-of-the-art Bag of Words (BoW) representation. This work extends the idea of detecting facial expressions through 'concept frames' to 'concept segments' and argues through extensive experiments that algorithms such as MIL are needed to reap the benefits of such representation. The key advantages of our approach are: (1) joint detection and localization of painful frames using only sequence-level ground-truth, (2) incorporation of temporal dynamics by representing the data not as individual frames but as segments, and (3) extraction of multiple segments, which is well suited to signals with uncertain temporal location and duration in the video. Extensive experiments on UNBC-McMaster Shoulder Pain dataset highlight the effectiveness of the approach by achieving competitive results on both tasks of pain classification and localization in videos. We also empirically evaluate the contributions of different components of MS-MIL. The paper also includes the visualization of discriminative facial patches, important for pain detection, as discovered by our algorithm and relates them to Action Units that have been associated with pain expression. We conclude the paper by demonstrating that MS-MIL yields a significant improvement on another spontaneous facial expression dataset, the FEEDTUM dataset.

  • Non-rigid Face Tracking with Local Appearance Consistency Constraint.
    Image Vis. Comput. (IF 2.747) Pub Date : 2010-05-01
    Yang Wang; Simon Lucey; Jeffrey F Cohn; Jason Saragih

    In this paper we present a new discriminative approach to achieve consistent and efficient tracking of non-rigid object motion, such as facial expressions. By utilizing both spatial and temporal appearance coherence at the patch level, the proposed approach can reduce ambiguity and increase accuracy. Recent research demonstrates that feature based approaches, such as constrained local models (CLMs), can achieve good performance in non-rigid object alignment/tracking using local region descriptors and a non-rigid shape prior. However, the matching performance of the learned generic patch experts is susceptible to local appearance ambiguity. Since there is no motion continuity constraint between neighboring frames of the same sequence, the resultant object alignment might not be consistent from frame to frame and the motion field is not temporally smooth. In this paper, we extend the CLM method into the spatio-temporal domain by enforcing the appearance consistency constraint of each local patch between neighboring frames. More importantly, we show that the global warp update can be optimized jointly in an efficient manner using convex quadratic fitting. Finally, we demonstrate that our approach achieves improved performance for the task of non-rigid facial motion tracking on the videos of clinical patients.

Contents have been reproduced by permission of the publishers.