Current journal: Computer Vision and Image Understanding
  • Rotation invariant features based on three dimensional Gaussian Markov random fields for volumetric texture classification
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-02-12
    Yasseen Almakady; Sasan Mahmoodi; Joy Conway; Michael Bennett

    This paper proposes a set of rotation invariant features based on three dimensional Gaussian Markov Random Fields (3D-GMRF) for volumetric texture image classification. In the method proposed here, the mathematical notion of spherical harmonics is employed to produce a set of features which are used to construct the rotation invariant descriptor. Our proposed method is evaluated and compared with other methods in the literature on datasets containing synthetic textures as well as medical images. The results of our experiments demonstrate excellent classification performance for our proposed method compared with state-of-the-art methods. Furthermore, our method is evaluated on a clinical dataset and shows good performance in discriminating between healthy individuals and COPD patients. Our method also performs well in classifying lung nodules in the LIDC-IDRI dataset. These results indicate that our 3D-GMRF-based method achieves superior performance compared with other methods in the literature.
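
    As a toy illustration of why spherical harmonics yield rotation invariance (a minimal sketch of the general principle, not the authors' 3D-GMRF construction): a 3D rotation mixes the coefficients of degree l only among themselves by a unitary matrix, so the per-degree energy is unchanged. Hypothetical coefficients, in Python/NumPy:

        import numpy as np

        def per_degree_energies(coeffs):
            # coeffs[l] holds the 2l+1 complex coefficients c_{l,-l..l};
            # a rotation mixes them by a unitary matrix within each degree,
            # so the energy sum_m |c_{l,m}|^2 is a rotation invariant.
            return np.array([np.sum(np.abs(c) ** 2) for c in coeffs])

        rng = np.random.default_rng(0)
        coeffs = [rng.normal(size=2 * l + 1) + 1j * rng.normal(size=2 * l + 1)
                  for l in range(4)]
        # Mix the degree-3 coefficients by a random unitary matrix (a stand-in
        # for the action of a rotation); the invariants do not change.
        q, _ = np.linalg.qr(rng.normal(size=(7, 7)) + 1j * rng.normal(size=(7, 7)))
        rotated = coeffs[:3] + [q @ coeffs[3]]
        print(np.allclose(per_degree_energies(coeffs), per_degree_energies(rotated)))  # True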

    Updated: 2020-02-12
  • Image dehazing based on a transmission fusion strategy by automatic image matting
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-02-11
    Feiniu Yuan; Yu Zhou; Xue Xia; Jinting Shi; Yuming Fang; Xueming Qian

    Most dehazing methods fail to estimate satisfactory transmission in both normal and bright regions simultaneously. To estimate more accurate transmission for these two kinds of regions, we propose a transmission fusion strategy based on automatic image matting for image dehazing. We first extract the mean and variance of a local patch around each pixel, and propose a binary classification method using the mean and variance of each patch to coarsely segment an input image into a binary map of normal and bright regions. Then we smooth and quantize the binary map to automatically generate a trimap of ternary values, thus avoiding the difficulty of manually labeling trimaps. Both the image and the trimap are fed into a Bayesian matting method for soft segmentation of normal and bright regions to produce an alpha map. The dark channel prior (DCP) is adopted to extract a transmission map for normal regions, while an improved atmospheric veil correction (AVC) method is proposed to generate another transmission map for bright regions. Finally, we propose to use the alpha map to fuzzily fuse the two transmission maps for final image dehazing. Experimental results show that our method significantly outperforms existing methods.
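
    A minimal sketch of the fusion-and-recovery step under the standard haze model I = J*t + A*(1 - t); the alpha matte, the two transmission maps and the airlight A are assumed given (the paper's matting, DCP and AVC stages are not reproduced here):

        import numpy as np

        def fuse_and_dehaze(I, t_dcp, t_avc, alpha, A, t_min=0.1):
            # I: (H, W, 3) image in [0, 1]; A: RGB airlight estimate.
            # alpha in [0, 1]: soft normal-region mask from the matting step;
            # t_dcp: transmission for normal regions (dark channel prior);
            # t_avc: transmission for bright regions (atmospheric veil correction).
            t = alpha * t_dcp + (1.0 - alpha) * t_avc   # fuzzy per-pixel fusion
            t = np.clip(t, t_min, 1.0)[..., None]       # avoid division blow-up
            return np.clip((I - A) / t + A, 0.0, 1.0)   # invert I = J*t + A*(1-t)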

    Updated: 2020-02-12
  • Efficient distance transformation for path-based metrics
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-02-11
    David Coeurjolly; Isabelle Sivignon

    In many applications, separable algorithms have demonstrated their efficiency for high-performance volumetric processing of shapes, such as distance transformation or medial axis extraction. In the literature, several authors have discussed conditions on the metric for it to be amenable to a separable approach. In this article, we present generic separable algorithms to efficiently compute Voronoi maps and distance transformations for a large class of metrics. Focusing on path-based norms (chamfer masks, neighborhood sequences), we propose efficient algorithms to compute such volumetric transformations in dimension n. We describe a new O(n⋅N^n⋅log N⋅(n+log f)) algorithm for shapes in an N^n domain for chamfer norms with a rational ball of f facets (compared to O(f^⌊n/2⌋⋅N^n) for previous approaches). Lastly, we investigate a more elaborate algorithm with the same worst-case complexity, which experimentally reaches a complexity of O(n⋅N^n⋅log f⋅(n+log f)) under the assumption of a regular distribution of the mask vectors.

    Updated: 2020-02-11
  • Multi-exposure photomontage with hand-held cameras
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-02-07
    Ru Li; Shuaicheng Liu; Guanghui Liu; Tiecheng Sun; Jishun Guo

    This paper studies the fusion of multiple images taken by hand-held cameras with different exposures. Existing methods often generate unsatisfactory results, such as blurring/ghosting artifacts, due to the problematic handling of camera motions, dynamic contents, and the inappropriate fusion of local regions (e.g., over- or under-exposed ones). In addition, they often require a high-quality image registration, which is hard to achieve in scenarios with large depth variations and dynamic textures, and is also time-consuming. In this paper, we propose to enable a rough registration by a single homography and combine the inputs seamlessly to hide any possible misalignment. Specifically, the method first uses a Markov Random Field (MRF) energy for the labeling of all pixels, which assigns different labels to the different aligned input images. During the labeling, it chooses well-exposed regions and skips moving objects at the same time. Then, the proposed method assembles a composite Laplacian image according to the labels and constructs the fusion result by solving the Poisson equation. Furthermore, it adds some internal constraints when solving the Poisson equation to balance and improve the fusion results. We present various challenging examples, including static/dynamic, indoor/outdoor and daytime/nighttime scenes, to demonstrate the effectiveness and practicability of the proposed method.
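
    One plausible unary cost for such an MRF labeling, favouring well-exposed pixels (an assumption for illustration; the abstract does not spell out the paper's exact energy terms):

        import numpy as np

        def well_exposedness_cost(gray, sigma=0.2):
            # Low cost near mid-grey, high cost for under/over-exposure.
            return 1.0 - np.exp(-((gray - 0.5) ** 2) / (2.0 * sigma ** 2))

        # Stacking the costs of K aligned exposures gives a per-pixel unary
        # term; the MRF adds a pairwise smoothness term (not shown) to avoid
        # seams.  Without smoothness the labeling degenerates to:
        # labels = np.stack([well_exposedness_cost(g) for g in grays]).argmin(0)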

    Updated: 2020-02-07
  • Deep code operation network for multi-label image retrieval
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-02-04
    Ge Song; Xiaoyang Tan

    Deep hashing methods have been extensively studied for large-scale image search and have achieved promising results in recent years. However, there are two major limitations of previous deep hashing methods for multi-label image retrieval: the first concerns the flexibility for users to express their query intention (the so-called intention gap), and the second concerns the exploitation of the rich similarity structures of the semantic space (the so-called semantic gap). To address these issues, we propose a novel Deep Code Operation Network (CoNet), in which a user is allowed to present multiple images simultaneously instead of a single one as his/her query, and the system then triggers a series of code operators to extract the hidden relations among them. In this way, a set of new queries is automatically constructed to cover the user's real, complex query intention, without the need to state it explicitly. The CoNet is trained with a newly proposed margin-adaptive triplet loss function, which effectively encourages the system to incorporate the hierarchical similarity structures of the semantic space into the learning procedure of the code operations. The whole system has an end-to-end differentiable architecture, equipped with an adversarial mechanism to further improve the quality of the final intention representation. Experimental results on four multi-label image datasets demonstrate that our method significantly improves the state of the art in performing complex multi-label retrieval tasks with multiple query images.
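
    The margin-adaptive triplet loss can be sketched as an ordinary triplet loss whose margin varies per triplet; how the margin is derived from the semantic hierarchy is a detail the abstract leaves open, so the margin here is simply an input:

        import numpy as np

        def margin_adaptive_triplet_loss(anchor, pos, neg, margin):
            # anchor/pos/neg: (N, D) embeddings; margin: scalar or (N,) array.
            # A margin that grows with the semantic distance between the
            # positive and negative classes is one way to encode hierarchy
            # (a hypothetical schedule, not the paper's exact rule).
            d_pos = np.sum((anchor - pos) ** 2, axis=1)
            d_neg = np.sum((anchor - neg) ** 2, axis=1)
            return np.maximum(0.0, d_pos - d_neg + margin).mean()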

    Updated: 2020-02-04
  • Adversarial autoencoders for compact representations of 3D point clouds
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-02-01
    Maciej Zamorski; Maciej Zięba; Piotr Klukowski; Rafał Nowak; Karol Kurach; Wojciech Stokowiec; Tomasz Trzciński

    Deep generative architectures provide a way to model not only images but also complex, 3-dimensional objects, such as point clouds. In this work, we present a novel method to obtain meaningful representations of 3D shapes that can be used for challenging tasks, including 3D point generation, reconstruction, compression, and clustering. Contrary to existing methods for 3D point cloud generation that train separate decoupled models for representation learning and generation, our approach is the first end-to-end solution that allows us to simultaneously learn a latent representation space and generate 3D shapes from it. Moreover, our model is capable of learning meaningful compact binary descriptors with adversarial training conducted on the latent space. To achieve this goal, we extend a deep Adversarial Autoencoder model (AAE) to accept 3D input and create 3D output. Thanks to our end-to-end training regime, the resulting method, called 3D Adversarial Autoencoder (3dAAE), obtains either a binary or a continuous latent space that covers a much broader portion of the training data distribution. Finally, our quantitative evaluation shows that 3dAAE provides state-of-the-art results for 3D point clustering and 3D object retrieval.

    Updated: 2020-02-03
  • UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-01-27
    Longyin Wen; Dawei Du; Zhaowei Cai; Zhen Lei; Ming-Ching Chang; Honggang Qi; Jongwoo Lim; Ming-Hsuan Yang; Siwei Lyu

    Effective multi-object tracking (MOT) methods have been developed in recent years for a wide range of applications including visual surveillance and behavior understanding. Existing performance evaluations of MOT methods usually separate the tracking step from the detection step by using a single predefined object detection setting for comparisons. In this work, we propose the new University at Albany DEtection and TRACking (UA-DETRAC) dataset for comprehensive performance evaluation of MOT systems, especially with respect to detectors. The UA-DETRAC benchmark dataset consists of 100 challenging videos captured from real-world traffic scenes (over 140,000 frames with rich annotations, including illumination, vehicle type, occlusion, truncation ratio, and vehicle bounding boxes) for multi-object detection and tracking. We evaluate complete MOT systems constructed from combinations of state-of-the-art object detection and tracking methods. Our analysis shows the complex effects of detection accuracy on MOT system performance. Based on these observations, we propose effective and informative evaluation metrics for MOT systems that consider the effect of object detection for comprehensive performance analysis.

    Updated: 2020-01-27
  • Learning a confidence measure in the disparity domain from O(1) features
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-01-18
    Matteo Poggi; Fabio Tosi; Stefano Mattoccia

    Depth sensing is of paramount importance for countless applications, and stereo represents a popular, effective and cheap solution for this purpose. As highlighted by recent works on stereo, uncertainty estimation can be a powerful cue for improving accuracy. Most confidence measures rely on features, mainly extracted from the cost volume, fed to a random forest or a convolutional neural network trained to estimate match uncertainty. In contrast, we propose a novel strategy for confidence estimation based on features computed in the disparity domain, making our proposal suited to any stereo system, including COTS devices, and computable in constant time. We exhaustively assess the performance of our proposals, referred to as O1 and O2, on the KITTI and Middlebury datasets with three popular and different stereo algorithms (CENSUS, MC-CNN and SGM), as well as a deep stereo network (PSM-Net). We also evaluate how well the confidence measures generalize to different environments/datasets.
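
    A single disparity-domain cue of this kind, computable in O(1) per pixel via separable box filtering (an illustrative hand-crafted feature; O1/O2 feed several such features to a learned model):

        import numpy as np
        from scipy.ndimage import uniform_filter

        def disparity_variance_confidence(disp, size=9):
            # Local variance of the disparity map via two box filters; the
            # separable moving average costs O(1) per pixel for any window.
            mean = uniform_filter(disp.astype(float), size)
            mean_sq = uniform_filter(disp.astype(float) ** 2, size)
            var = np.maximum(mean_sq - mean ** 2, 0.0)
            return 1.0 / (1.0 + var)   # low local variance -> high confidence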

    Updated: 2020-01-21
  • A progressive learning framework based on single-instance annotation for weakly supervised object detection
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-01-15
    Ming Zhang; Bing Zeng

    Fully-supervised object detection (FSOD) and weakly-supervised object detection (WSOD) are two extremes in the field of object detection. The former relies entirely on detailed bounding-box annotations while the latter discards them completely. To balance these two extremes, we propose to make use of so-called single-instance annotations, i.e., all images that contain only a single object are labelled with the corresponding bounding boxes. Using such instance annotations of the simplest images, we propose a progressive learning framework that integrates image-level learning, single-instance learning, and multi-instance learning into an end-to-end network. Specifically, our framework is composed of three parallel streams that share a proposal feature extractor. The first stream is supervised by image-level annotations, which provides global information from all training data for the shared feature extractor. The second stream is supervised by single-instance annotations to bridge the feature-learning gap between the image level and the instance level. To further learn from complex images, we propose an overlap-based instance mining algorithm to mine pseudo multi-instance annotations from the detection results of the second stream, and use them to supervise the third stream. Our method achieves a trade-off between detection accuracy and annotation cost. Extensive experiments demonstrate the effectiveness of our proposed method on the PASCAL VOC and MS-COCO datasets, implying that a few single-instance annotations can improve the detection performance of WSOD significantly (by more than 10%) and reduce the average annotation cost of FSOD greatly (by more than a factor of 5).
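
    A sketch of what overlap-based instance mining could look like: keep confident detections that do not overlap each other as pseudo ground truth (thresholds and the greedy selection rule are assumptions, not the paper's exact algorithm):

        import numpy as np

        def iou(a, b):
            # Boxes in (x1, y1, x2, y2) format.
            ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = ix * iy
            area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
            return inter / (area(a) + area(b) - inter + 1e-9)

        def mine_pseudo_instances(boxes, scores, score_thr=0.8, iou_thr=0.3):
            # Greedily keep confident, mutually non-overlapping detections
            # as pseudo multi-instance annotations for the third stream.
            kept = []
            for i in np.argsort(-np.asarray(scores)):
                if scores[i] < score_thr:
                    break
                if all(iou(boxes[i], boxes[j]) < iou_thr for j in kept):
                    kept.append(i)
            return [boxes[i] for i in kept]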

    Updated: 2020-01-15
  • Triplanar convolution with shared 2D kernels for 3D classification and shape retrieval
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-01-15
    Eu Young Kim; Seung Yeon Shin; Soochahn Lee; Kyong Joon Lee; Kyoung Ho Lee; Kyoung Mu Lee

    Increasing the depth of Convolutional Neural Networks (CNNs) has been recognized to provide better generalization performance. However, in the case of 3D CNNs, stacking layers increases the number of learnable parameters linearly, making them more prone to learning redundant features. In this paper, we propose a novel 3D CNN structure, termed S3PNet, that learns shared 2D triplanar features viewed from the three orthogonal planes. Due to the reduced dimension of the convolutions, the proposed S3PNet is able to learn 3D representations with substantially fewer learnable parameters. Experimental evaluations show that the combination of 2D representations of the different orthogonal views learned through the S3PNet is sufficient and effective for 3D representation, with results outperforming current methods based on fully 3D CNNs. We support this with extensive evaluations on widely used 3D data sources in computer vision: CAD models, LiDAR point clouds, RGB-D images, and 3D Computed Tomography scans. Experiments further demonstrate that S3PNet has better generalization capability for smaller training sets, and learns kernels with less redundancy than those learned by 3D CNNs.

    Updated: 2020-01-15
  • Monocular human pose estimation: A survey of deep learning-based methods
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-01-07
    Yucheng Chen; Yingli Tian; Mingyi He

    Vision-based monocular human pose estimation, one of the most fundamental and challenging problems in computer vision, aims to estimate the posture of the human body from input images or video sequences. Recent developments in deep learning techniques have brought significant progress and remarkable breakthroughs to the field of human pose estimation. This survey extensively reviews the recent deep learning-based 2D and 3D human pose estimation methods published since 2014. It summarizes the challenges, main frameworks, benchmark datasets, evaluation metrics, and performance comparisons, and discusses some promising future research directions.

    Updated: 2020-01-07
  • Photometric camera characterization from a single image with invariance to light intensity and vignetting
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-12-13
    Pedro M.C. Rodrigues; João P. Barreto; Michel Antunes

    Photometric characterization of a camera entails describing how the camera transforms the light reaching its sensors into an image and how this image can be defined in a standard color space. Although the research in this area has been extensive, the current literature lacks practical methods designed for cameras operating under near light. Two major application scenarios considered in this paper would benefit from this type of approach. First, camera rigs for minimally invasive procedures cannot be calibrated in the operating room with current methods, since existing approaches need multiple images, assume uniform lighting, and/or use over-simplistic camera models, which prevents fast and reliable calibration of near-light setups. The second scenario is the calibration of cellphone cameras, which currently cannot be calibrated at close range from a single image, especially if the flash is used, as the lighting on the scene would be non-uniform. In this work, we describe a method to characterize cameras from a single image of a known target. This enables both geometric and photometric calibration to be performed on-the-fly without making assumptions about vignetting or the spatial properties of the light. The presented method showed good repeatability and color accuracy even when compared to multiple-image approaches. Applications to laparoscopic cameras, generic cameras (such as cellphone cameras), and cameras other than trichromatic are shown to be viable.

    Updated: 2020-01-04
  • Learning feature aggregation in temporal domain for re-identification
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-11-28
    Jakub Špaňhel; Jakub Sochor; Roman Juránek; Petr Dobeš; Vojtěch Bartl; Adam Herout

    Person re-identification is a standard and established problem in the computer vision community. In recent years, vehicle re-identification has also been receiving more attention. In this paper, we focus on both tasks and propose a method for the aggregation of features in the temporal domain, as it is common to have multiple observations of the same object. The aggregation is based on weighting different elements of the feature vectors by different weights, and it is trained in an end-to-end manner by a Siamese network. The experimental results show that our method outperforms other existing methods for feature aggregation in the temporal domain on both vehicle and person re-identification tasks. Furthermore, to push research in vehicle re-identification further, we introduce a novel dataset, CarsReId74k. The dataset is not limited to frontal/rear viewpoints. It contains 17,681 unique vehicles, 73,976 observed tracks, and 277,236 positive pairs, and was captured by 66 cameras from various angles.
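
    The element-wise weighted aggregation can be sketched as a time-softmax over per-dimension weights; in the paper the weights come from a learned network, here they are simply an input:

        import numpy as np

        def aggregate_track(features, weight_logits):
            # features, weight_logits: (T, D) for T observations of one object.
            # Softmax over time, independently per feature dimension, then a
            # weighted sum: each dimension picks its own reliable frames.
            w = np.exp(weight_logits - weight_logits.max(axis=0, keepdims=True))
            w /= w.sum(axis=0, keepdims=True)
            return (w * features).sum(axis=0)   # (D,) track descriptor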

    Updated: 2020-01-04
  • Cascade multi-head attention networks for action recognition
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-01-02
    Jiaze Wang; Xiaojiang Peng; Yu Qiao

    Long-term temporal information yields crucial cues for video action understanding. Previous research has relied on sequential models such as recurrent networks, memory units, segmental models, and self-attention mechanisms to integrate local temporal features for long-term temporal modeling. Recurrent or memory networks record temporal patterns (or relations) in memory units, which have proven difficult at capturing long-term information in machine translation. Self-attention mechanisms directly aggregate all local information with attention weights, which is more straightforward and efficient than the former. However, the attention weights from self-attention ignore the relations between local information and global information, which may lead to unreliable attention. To this end, we propose a new attention network architecture, termed the Cascade multi-head ATtention Network (CATNet), which constructs video representations with two levels of attention, namely multi-head local self-attention and relation-based global attention. Starting from the segment features generated by backbone networks, CATNet first learns multiple attention weights for each segment to capture the importance of local features in a self-attention manner. With the local attention weights, CATNet integrates local features into several global representations, and then learns the second-level attention for the global information in a relational manner. Extensive experiments on Kinetics, HMDB51, and UCF101 show that our CATNet boosts the baseline network by a large margin. With only RGB information, we achieve 75.8%, 75.2%, and 96.0% on these three datasets respectively, which is comparable or superior to the state of the art.
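
    A rough sketch of the two-level idea: local self-attention heads pool segments into several global vectors, which are then re-weighted by how they relate to a global summary (the relation function below is a simplified stand-in, and the weight matrices would be learned):

        import numpy as np

        def softmax(x, axis=-1):
            e = np.exp(x - x.max(axis=axis, keepdims=True))
            return e / e.sum(axis=axis, keepdims=True)

        def cascade_attention(segs, W_local, w_global):
            # segs: (T, D) segment features; W_local: (D, H) for H local heads;
            # w_global: (D,) relation weights.
            local = softmax(segs @ W_local, axis=0)              # (T, H)
            globals_ = local.T @ segs                            # (H, D) pooled views
            relation = globals_ @ (w_global * globals_.mean(0))  # score vs. summary
            return softmax(relation) @ globals_                  # (D,) video feature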

    Updated: 2020-01-04
  • Graph-matching-based correspondence search for nonrigid point cloud registration
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2020-01-02
    Seunggyu Chang; Chanho Ahn; Minsik Lee; Songhwai Oh

    Nonrigid registration finds transformations that fit a source point cloud/mesh to a target point cloud/mesh. Most nonrigid registration algorithms consist of two steps: correspondence search and optimization. Of these, the correspondence search plays an important role in registration performance. However, when the two point clouds are separated by a large displacement, correct correspondences are hard to determine, and algorithms often fail to find the correct transformations. In this paper, we propose a novel graph-matching-based correspondence search for nonrigid registration, together with a corresponding optimization method for finding the transformation that completes the registration. By considering global connectivity as well as local similarity in the correspondence search, the proposed method finds good correspondences according to semantics, and consequently finds correct transformations even when the motion is large. Our algorithm is experimentally validated on human body and animal datasets, which verifies that it is capable of finding correct transformations to fit a source to a target.

    Updated: 2020-01-04
  • Guess where? Actor-supervision for spatiotemporal action localization
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-12-09
    Victor Escorcia; Cuong D. Dao; Mihir Jain; Bernard Ghanem; Cees Snoek

    This paper addresses the problem of spatiotemporal localization of actions in videos. Compared to leading approaches, which all learn to localize based on carefully annotated boxes on training video frames, we adhere to a solution requiring only video class labels. We introduce an actor-supervised architecture that exploits the inherent compositionality of actions in terms of actor transformations to localize actions. We make two contributions. First, we propose actor proposals derived from a detector for human and non-human actors intended for images, which are linked over time by Siamese similarity matching to account for actor deformations. Second, we propose an actor-based attention mechanism enabling localization from action class labels and actor proposals. It exploits a new actor pooling operation and is end-to-end trainable. Experiments on four action datasets show that actor supervision is state-of-the-art for action localization from video class labels and is even competitive with some box-supervised alternatives.

    Updated: 2020-01-04
  • Graph convolutional neural network for multi-scale feature learning
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-12-02
    Michael Edwards; Xianghua Xie; Robert I. Palmer; Gary K.L. Tam; Rob Alcock; Carl Roobottom
    Updated: 2020-01-04
  • Momental directional patterns for dynamic texture recognition
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-12-02
    Thanh Tuan Nguyen; Thanh Phuong Nguyen; Frédéric Bouchara; Xuan Son Nguyen

    Understanding the chaotic motions of dynamic textures (DTs) is a challenging problem of video representation for different tasks in computer vision. This paper presents a new approach for efficient DT representation by addressing the following novel concepts. First, a model of moment volumes is introduced as an effective pre-processing technique for enriching the robust and discriminative information of dynamic voxels at low computational cost. Second, two important extensions of the Local Derivative Pattern operator are proposed to improve its performance in capturing directional features. Third, we present a new framework, called Momental Directional Patterns, that takes into account the advantages of filtering and local-feature-based approaches to form effective DT descriptors. Furthermore, motivated by convolutional neural networks, the proposed framework is boosted by utilizing additional global features extracted from max-pooled videos to improve the discrimination power of the descriptors. Our proposal is verified on benchmark datasets, i.e., UCLA, DynTex, and DynTex++, for the DT classification task. The experimental results substantiate the interest of our method.

    Updated: 2020-01-04
  • Human Visual System vs Convolution Neural Networks in food recognition task: An empirical comparison
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-11-30
    Pedro Furtado; Manuel Caldeira; Pedro Martins

    Automated food recognition from a food plate is useful for smartphone-based applications promoting healthy lifestyles and for automated carbohydrate counting, e.g. targeted at type I diabetic patients, but the variation in the appearance of food items makes it a difficult task. Convolutional Neural Networks (CNNs) rose to prominence in recent years, and they will enable those applications if they are able to match Human Visual System (HVS) accuracy, at least in meal classification. In this work we run an experimental comparison of accuracy between CNNs and the HVS based on a simple meal recognition task. We set up a survey for humans with two phases, training and testing, and also give the food dataset to state-of-the-art CNNs. The results, considering some relevant variations in the setup, allow us to reach conclusions regarding the comparison, characteristics and limitations of CNNs, which are relevant for future improvements.

    Updated: 2020-01-04
  • Comparison of monocular depth estimation methods using geometrically relevant metrics on the IBims-1 dataset
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-11-26
    Tobias Koch; Lukas Liebel; Marco Körner; Friedrich Fraundorfer
    Updated: 2020-01-04
  • An efficient EM-ICP algorithm for non-linear registration of large 3D point sets
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-11-12
    Benoit Combès; Sylvain Prima

    In this paper, we present a new method for non-linear pairwise registration of 3D point sets. In this method, we consider the points of the first set as draws from a Gaussian mixture model whose centres are the displaced points of the second set. We then perform maximum a posteriori estimation of the parameters of this model (which include the unknown transformation) using the expectation–maximisation (EM) algorithm. Compared to other methods using the same “EM-ICP” framework, we propose four key modifications leading to an efficient algorithm that allows for fast registration of large 3D point sets: (1) truncation of the cost function; (2) symmetrisation of the point-to-point correspondences; (3) specification of priors on these correspondences using differential geometry; (4) efficient encoding of deformations using RKHS theory and Fourier analysis. We evaluate the added value of these modifications and compare our method to the state-of-the-art CPD algorithm on real and simulated data.
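
    For reference, the E-step of an EM-ICP-style method computes soft correspondences under a Gaussian mixture; the hard truncation below is a simplified stand-in for modification (1), not the paper's exact scheme:

        import numpy as np

        def em_icp_e_step(X, Y, sigma2, trunc=1e-3):
            # X: (N, 3) target points; Y: (M, 3) transformed source points.
            # Soft correspondences under an isotropic Gaussian mixture.
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # (N, M)
            w = np.exp(-d2 / (2.0 * sigma2))
            w /= w.sum(axis=1, keepdims=True) + 1e-12
            w[w < trunc] = 0.0   # drop negligible terms to speed up the M-step
            return w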

    Updated: 2020-01-04
  • Simultaneous compression and quantization: A joint approach for efficient unsupervised hashing
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-11-12
    Tuan Hoang; Thanh-Toan Do; Huu Le; Dang-Khoa Le-Tan; Ngai-Man Cheung

    For unsupervised data-dependent hashing, the two most important requirements are to preserve similarity in the low-dimensional feature space and to minimize the binary quantization loss. A well-established hashing approach is Iterative Quantization (ITQ), which addresses these two requirements in separate steps. In this paper, we revisit the ITQ approach and propose novel formulations and algorithms for the problem. Specifically, we propose a novel approach, named Simultaneous Compression and Quantization (SCQ), to jointly learn to compress (reduce dimensionality) and binarize input data in a single formulation under a strict orthogonal constraint. With this approach, we introduce a loss function and its relaxed version, termed Orthonormal Encoder (OnE) and Orthogonal Encoder (OgE) respectively, which involve challenging binary and orthogonal constraints. We propose to attack the optimization using novel algorithms based on recent advances in cyclic coordinate descent. Comprehensive experiments on unsupervised image retrieval demonstrate that our proposed methods consistently outperform other state-of-the-art hashing methods. Notably, our proposed methods outperform recent deep neural network and GAN-based hashing methods in accuracy, while being very computationally efficient.
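
    The "separate steps" baseline that SCQ replaces can be made concrete in a few lines: PCA compression followed by sign binarization (the ITQ rotation refinement and the joint SCQ formulation are omitted):

        import numpy as np

        def pca_sign_hash(X, n_bits):
            # Step 1: compress by PCA to n_bits dimensions.
            Xc = X - X.mean(axis=0)
            _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
            Z = Xc @ Vt[:n_bits].T
            # Step 2: quantize by taking signs, independently of step 1;
            # SCQ's point is to optimize both jointly instead.
            return (Z >= 0).astype(np.uint8)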

    Updated: 2020-01-04
  • An Entropic Optimal Transport loss for learning deep neural networks under label noise in remote sensing images
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-11-06
    Bharath Bhushan Damodaran; Rémi Flamary; Vivien Seguy; Nicolas Courty

    Deep neural networks have become established as a powerful tool for large-scale supervised classification tasks. The state-of-the-art performance of deep neural networks is conditioned on the availability of a large number of accurately labeled samples. In practice, collecting large accurately labeled datasets is a challenging and tedious task in most scenarios of remote sensing image analysis, so cheap surrogate procedures are employed to label the data. Training deep neural networks on such datasets with inaccurate labels easily overfits the noisy training labels and drastically degrades classification performance. To mitigate this effect, we propose an original solution based on entropic optimal transport. It allows learning, in an end-to-end fashion, deep neural networks that are, to some extent, robust to inaccurately labeled samples. We demonstrate our method empirically on several remote sensing datasets, considering both scene-level and pixel-level hyperspectral image classification. Our method proves to be highly tolerant to significant amounts of label noise and achieves favorable results against state-of-the-art methods.
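
    The regularised problem underlying an entropic OT loss can be solved by the standard Sinkhorn iterations (a generic sketch of entropic optimal transport, not the paper's training objective):

        import numpy as np

        def sinkhorn_cost(a, b, C, eps=0.1, n_iter=200):
            # a: (n,), b: (m,) histograms summing to 1; C: (n, m) cost matrix.
            K = np.exp(-C / eps)
            u = np.ones_like(a)
            for _ in range(n_iter):
                v = b / (K.T @ u)
                u = a / (K @ v)
            P = u[:, None] * K * v[None, :]   # regularised transport plan
            return (P * C).sum()              # entropic OT cost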

    Updated: 2020-01-04
  • CRF with deep class embedding for large scale classification
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-11-06
    Eran Goldman; Jacob Goldberger

    This paper presents a novel deep learning architecture for classifying structured objects in ultrafine-grained datasets, where classes may not be clearly distinguishable by their appearance but rather by their context. We model sequences of images as linear-chain CRFs, and jointly learn the parameters from both local visual features and neighboring class information. The visual features are learned by convolutional layers, whereas the class-structure information is reparametrized by factorizing the CRF pairwise potential matrix. This forms a context-based semantic similarity space, learned alongside the visual similarities, and dramatically increases the learning capacity of contextual information. This new parametrization, however, forms a highly nonlinear objective function that is challenging to optimize. To overcome this, we develop a novel surrogate likelihood which allows for a local likelihood approximation of the original CRF with integrated batch normalization. This model overcomes the difficulties existing CRF methods have in learning contextual relationships thoroughly when there is a large number of classes and the data is sparse. The performance of the proposed method is illustrated on a huge dataset that contains images of retail-store product displays, and shows significantly improved results compared to linear CRF parametrization, unnormalized likelihood optimization, and RNN modeling. We also show improved results on a standard OCR dataset.

    Updated: 2020-01-04
  • CompactNets: Compact Hierarchical Compositional Networks for Visual Recognition
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019-10-28
    Hans Lobel; René Vidal; Alvaro Soto

    CNN-based models currently provide state-of-the-art performance in image categorization tasks. While these methods are powerful in terms of representational capacity, they are generally not conceived with explicit means to control complexity. This can lead to scenarios where resources are used in a non-optimal manner, increasing the number of unspecialized or repeated neurons and overfitting to the data. In this work we propose CompactNets, a new approach to visual recognition that learns a hierarchy of shared, discriminative, specialized, and compact representations. CompactNets naturally capture the notion of compositional compactness, a characterization of complexity in compositional models, consisting of using the smallest number of patterns to build a suitable visual representation. We employ a structural regularizer with group-sparse terms in the objective function, which induces, on each layer, an efficient and effective use of elements from the layer below. In particular, this allows groups of top-level features to be specialized based on category information. We evaluate CompactNets on the ILSVRC12 dataset, obtaining compact representations and competitive performance, using an order of magnitude fewer parameters than common CNN-based approaches. We show that CompactNets outperform other group-sparse-based approaches in terms of performance and compactness. Finally, transfer-learning experiments on small-scale datasets demonstrate high generalization power, providing remarkable categorization performance with respect to alternative approaches.
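
    A group-sparse structural regularizer of the kind mentioned can be written as a group-lasso penalty over parameter groups (the grouping scheme below is illustrative, not the paper's exact construction):

        import numpy as np

        def group_sparsity(w, groups):
            # Sum of L2 norms over index groups: the penalty drives whole
            # groups to zero, pruning unused patterns in a layer.
            return sum(np.linalg.norm(w[g]) for g in groups)

        # e.g. one group per filter of a layer (hypothetical layout):
        # groups = [np.arange(k * 16, (k + 1) * 16) for k in range(n_filters)]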

    Updated: 2020-01-04
  • Visual tracking in video sequences based on biologically inspired mechanisms
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2018-10-26
    Alireza Sokhandan; Amirhassan Monadjemi

    Visual tracking is the process of locating one or more objects based on their appearance. The high variation in the conditions and states of a moving object, and the presence of challenges such as background clutter, illumination variation, and occlusion, make this problem extremely complex and a robust algorithm hard to achieve. However, unlike machine vision, biological vision performs visual tracking well even in the worst conditions. Consequently, in this paper, taking into account the superior performance of biological vision in visual tracking, a biologically inspired visual tracking algorithm is introduced. The proposed algorithm draws on the task-driven recognition procedure of the early layers of the ventral pathway and on visual cortex mechanisms including spatial–temporal processing, motion perception, attention, and saliency to track a single object in a video sequence. For this purpose, a set of low-level features including oriented edges, color, and motion information (inspired by layer V1) is extracted from the target area, and, based on the discrimination rate that each feature creates against the background (inspired by the saliency mechanism), a subset of these features is employed to generate the appearance model and identify the target location. Moreover, by memorizing the shape and motion information (inspired by short-term memory), scale variation and occlusion are handled. The experimental results showed that the proposed algorithm handles most visual tracking challenges well, achieves high precision in target localization, and runs in real time.

    Updated: 2020-01-04
  • A novel algebraic solution to the perspective-three-line pose problem
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2018-09-15
    Ping Wang; Guili Xu; Yuehua Cheng

    In this work, we present a novel algebraic method for the perspective-three-line (P3L) problem: determining the position and attitude of a calibrated camera from features of three known reference lines. Unlike other methods, the proposed method uses an intermediate camera frame F and an intermediate world frame E with sparse known line coordinates, facilitating the formulation of the P3L problem. Additionally, the rotation matrix between frame E and frame F is parameterized using its orthogonality, and a closed-form solution for the P3L pose problem is then obtained through subsequent substitutions. This algebraic method makes the derivation easier to follow and significantly improves performance. The experimental results show that the proposed method offers numerical stability, accuracy and efficiency comparable to or better than those of state-of-the-art methods.

    Updated: 2020-01-04
  • Descriptor extraction based on a multilayer dictionary architecture for classification of natural images
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2018-08-29
    Stefen Chan Wai Tim; Michele Rombaut; Denis Pellerin; Anuvabh Dutt

    This paper presents a descriptor extraction method in the context of image classification, based on a multilayer structure of dictionaries. We propose to learn an architecture of discriminative dictionaries for classification in a supervised framework using a patch-level approach. This method combines many layers of sparse coding and pooling in order to reduce the dimension of the problem. The supervised learning of dictionary atoms allows them to be specialized for a classification task. The method has been tested on well-known datasets of natural images such as MNIST, CIFAR-10 and STL, under various conditions, especially when the size of the training set is limited, and in a transfer learning application. The results are also compared with those obtained with Convolutional Neural Networks (CNNs) of similar complexity in terms of the number of layers and the processing pipeline.

    Updated: 2020-01-04
  • Identifying motion pathways in highly crowded scenes: A non-parametric tracklet clustering approach
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2018-08-24
    Allam S. Hassanein; Mohamed E. Hussein; Walid Gomaa; Yasushi Makihara; Yasushi Yagi

    Many approaches that address the analysis of crowded scenes rely on using short trajectory fragments, also known as tracklets, of moving objects to identify motion pathways. Typically, such approaches aim at defining meaningful relationships among tracklets. However, defining these relationships and incorporating them in a crowded scene analysis framework is a challenge. In this article, we introduce a robust approach to identifying motion pathways based on tracklet clustering. We formulate a novel measure, inspired by line geometry, to capture the pairwise similarities between tracklets. For tracklet clustering, the recent distance dependent Chinese restaurant process (DD-CRP) model is adapted to use the estimated pairwise tracklet similarities. The motion pathways are identified based on two hierarchical levels of DD-CRP clustering such that the output clusters correspond to the pathways of moving objects in the crowded scene. Moreover, we extend our DD-CRP clustering adaptation to incorporate the source and sink gate probabilities for each tracklet as a high-level semantic prior for improving clustering performance. For quantitative evaluation, we propose a robust pathway matching metric, based on the chi-square distance, that accounts for both spatial coverage and motion orientation in the matched pathways. Our experimental evaluation on multiple crowded scene datasets, principally the challenging Grand Central Station dataset, demonstrates the state-of-the-art performance of our approach. Finally, we demonstrate the task of motion abnormality detection, both at the tracklet and frame levels, against the normal motion patterns encountered in the motion pathways identified by our method, with competent quantitative performance on multiple datasets.

    Updated: 2020-01-04
  • Special issue on 3D imaging and modelling
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2011
    Adrian Hilton, Guy Godin, Chang Shu, Takeshi Masuda

    Three-dimensional imaging and modelling has advanced over the past decade from the measurement of isolated static objects using laser scanning techniques to the recovery of accurate shape from images and video. This special issue presents an overview of the latest research across the breadth of advances in this field, from computational photography for 3D acquisition to techniques for online capture and modelling of dynamic shape. This special issue of CVIU follows the Workshop on 3D Imaging and Modelling (3DIM) held in conjunction with ICCV 2009 in Kyoto, Japan. An open call for papers, together with invited submissions from the workshop, resulted in 25 full-length journal paper submissions. These papers have been subject to a full journal review process, with each paper reviewed by both members of the 3DIM programme committee and other members of the community. Most papers have undergone two rounds of review. Twelve papers have been selected for this CVIU-3DIM special issue based on the reviewers' comments. The papers in this special issue can be categorised into five topics: 3D imaging; shape reconstruction from single images; shape reconstruction from multiple images; shape registration and matching; and shape representation and recognition.

    Updated: 2020-01-01
  • Simultaneous Registration of Multiple Corresponding Point Sets
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2001
    John A. Williams, Mohammed Bennamoun

    We present a new technique for the simultaneous registration of multiple corresponding point sets with rigid 3D transformations. This class of problems is a generalization of the classic pairwise point set registration task, involving multiple views with multiple correspondences existing between them. The proposed technique requires the computation of a constant matrix which encodes the point correspondence information, followed by an efficient iterative algorithm to compute the optimal rotations. The optimal translations are then recovered directly through the solution of a linear equation system. The algorithm supports weighting of data according to confidence, and we show how it may be incorporated into two robust estimation frameworks to detect and reject outlier data. We have integrated our method into a generalized multiview ICP surface matching system and tested it with synthetic and real data. These tests indicate that the technique is accurate and efficient. The algorithm also compares favorably to another multiview technique from the literature.
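
    The single-pair building block that such multiview methods generalise is the weighted SVD solution for rigid alignment of corresponding points (a standard sketch, not the paper's simultaneous multiview algorithm):

        import numpy as np

        def rigid_align(P, Q, w=None):
            # Weighted least-squares R, t with R @ P[i] + t ~ Q[i]; P, Q: (N, 3).
            w = np.ones(len(P)) if w is None else np.asarray(w, float)
            w = w / w.sum()
            mp, mq = w @ P, w @ Q
            H = (P - mp).T @ ((Q - mq) * w[:, None])   # weighted cross-covariance
            U, _, Vt = np.linalg.svd(H)
            d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
            R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
            return R, mq - R @ mp

    The per-point weights correspond to the confidence weighting of data mentioned in the abstract; downweighting suspect correspondences is one simple route to outlier rejection.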

    Updated: 2020-01-01
  • Alternative search techniques for face detection using location estimation and binary features
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2013
    B. S. Venkatesh, Sébastien Marcel

    The sliding window approach is the most widely used technique to detect an object in an image. In the past few years, classifiers have been improved in many ways to increase the scanning speed. Apart from the classifier design (such as the cascade), the scanning speed also depends on a number of different factors (such as the grid spacing and the scale at which the image is searched). When the scanning grid spacing is larger than the tolerance of the trained classifier, detections are missed. In this paper, we present a technique to reduce the number of missed detections when fewer subwindows are processed in the sliding window approach for face detection. This is achieved by using a small patch to predict the location of the face within a local search area. We use simple binary features and a decision tree for location estimation, as this proved to be efficient for our application. We also show that by using a simple interest point detector based on quantized gradient orientation as the front-end to the proposed location estimation technique, we can further improve the performance. Experimental evaluation on several face databases shows better detection rates and speed with our proposed approach, when fewer subwindows are processed, compared to the standard scanning technique.

    Updated: 2020-01-01
  • Recent Progress in CAD-Based Computer Vision: An Introduction to the Special Issue
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 1998
    Octavia I. Camps, Patrick J. Flynn, George C. Stockman

    Updated: 2020-01-01
  • Scene-Based Shot Change Detection and Comparative Evaluation
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2000
    Loong Fah Cheong

    A key step in managing a large video database is partitioning the video sequences into shots. Past approaches to this problem tend to confuse gradual shot changes with changes caused by smooth camera motions, in part because camera motion has not been dealt with in a fundamental way. We propose an approach based on a physical constraint used in optical flow analysis, namely that the total brightness of a scene point across two frames should remain constant if the change across the two frames results from smooth camera motion. Since the brightness constraint is violated across a shot change, detection can be based on detecting violations of this constraint. The approach is robust because it uses only the qualitative aspect of the brightness constraint: detecting a scene change rather than estimating the scene itself. Moreover, by tapping into the significant know-how in using this constraint, the algorithm's robustness is further enhanced. Experimental results are presented to demonstrate the performance of various algorithms. It is shown that our algorithm is less likely to interpret gradual camera motions as shot changes, resulting in better precision than most other algorithms. However, its performance deteriorates under large camera or object motions. A twin-threshold scheme is proposed to improve its robustness.
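
    A crude proxy for testing the brightness constraint between consecutive grayscale frames (the tolerance is hypothetical; the paper's test is grounded in optical flow analysis rather than raw differencing):

        import numpy as np

        def brightness_violation(f0, f1, tol=0.15):
            # Fraction of pixels whose brightness change exceeds what smooth
            # camera motion would plausibly allow; f0, f1: grayscale in [0, 1].
            return float((np.abs(f1.astype(float) - f0.astype(float)) > tol).mean())

        # A shot change is declared when this score spikes; the paper applies
        # a twin-threshold scheme on top of its constraint test.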

    Updated: 2020-01-01
  • Region-based image registration for remote sensing imagery
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019
    Azubuike Okorie, Sokratis Makrogiannis

    We propose an automatic region-based registration method for remote sensing imagery. In this method, we aim to register two images by matching region properties, to address possible errors caused by local feature estimators. We apply automated image segmentation to identify the regions, and calculate regional Fourier descriptors and standardized regional intensity descriptors for each region. We define a joint matching cost, as a linear combination of Euclidean distances, to establish and extract correspondences between regions. The segmentation technique utilizes kernel density estimators for edge localization, followed by morphological reconstruction and the watershed transform. We evaluated the registration performance of our method on synthetic and real datasets. We measured the registration accuracy by calculating the root-mean-squared error (RMSE) between the estimated transformation and the ground-truth transformation. The results obtained using the joint intensity-Fourier descriptor were compared to the results obtained using Harris, minimum-eigenvalue, features from accelerated segment test (FAST), speeded-up robust features (SURF), binary robust invariant scalable keypoints (BRISK) and KAZE keypoint descriptors. The joint intensity-Fourier descriptor yielded average RMSEs of 0.446 ± 0.359 pixels and 1.152 ± 0.488 pixels on two satellite imagery datasets consisting of 35 image pairs in total. These results indicate the capacity of the proposed technique for high accuracy. Our method also produces a lower registration error than the compared feature-based methods.
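
    The joint matching cost reduces to a weighted sum of two Euclidean distances; the weight below is illustrative, not the paper's tuned value:

        import numpy as np

        def joint_matching_cost(fd_a, fd_b, id_a, id_b, lam=0.5):
            # fd_*: regional Fourier descriptors; id_*: standardized regional
            # intensity descriptors.  Regions are matched by minimum cost.
            return (lam * np.linalg.norm(fd_a - fd_b)
                    + (1.0 - lam) * np.linalg.norm(id_a - id_b))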

    Updated: 2020-01-01
  • An efficient solution to the perspective-three-point pose problem
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2018
    Ping Wang, Guili Xu, Zhengsheng Wang, Yuehua Cheng

    In this paper, we present a new algebraic method to solve the perspective-three-point (P3P) problem, which directly computes the rotation and position of a camera in the world frame without the intermediate derivation of the points in the camera frame. Unlike other online methods, the proposed method uses an “object” coordinate frame, in which the known 3D coordinates are sparse, facilitating the formulation of the P3P problem. Additionally, two auxiliary variables are introduced to parameterize the rotation and position matrices, and a closed-form solution for the camera pose is then obtained from subsequent substitutions. This algebraic approach makes the derivation easier to follow and significantly improves performance. Experimental results demonstrated that our method offers accuracy and precision comparable to existing state-of-the-art methods, with better computational efficiency.

    Updated: 2020-01-01
  • A general shape context framework for object identification
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2008
    Yanling Chi, Maylor K. H. Leung

    A general shape context framework is proposed for object/image retrieval in occluded and cluttered environments, with hundreds of models as the potential matches of an input. The approach is general in that it does not require separating input objects from a complex background. It works by first extracting consistent and structurally unique local neighborhood information from inputs or models, and then voting on the optimal matches. Its performance degrades gracefully with the amount of structural information that is occluded or lost. The local neighborhood information used by the system can be shape, color, texture features, etc.; currently, we employ shape information only. The voting mechanism is based on a novel hypercube-based indexing structure and driven by dynamic programming. The proposed concepts have been tested on a database with thousands of images, and very encouraging results have been obtained.

    Updated: 2020-01-01
  • Indoor Manhattan spatial layout recovery from monocular videos via line matching
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2016
    Chelhwon Kim, Roberto Manduchi

    An end-to-end system for structure and motion computation in a Manhattan layout from monocular videos is presented. The system is based on the idea of characteristic lines, which can be seen as an invariant of two views of a parallel line pair lying on a plane with known orientation. The characteristic lines algorithm enables segmentation of planar patches from visible lines, and thus reconstruction of motion and structure. Extending the characteristic lines algorithm to the multi-view case results in a more robust planar segmentation.

    We present an end-to-end system for structure and motion computation in a Manhattan layout from monocular videos. Unlike most SFM algorithms that rely on point feature matching, only line matches are considered in this work. This may be convenient in indoor environments characterized by extended textureless walls, where point features may be scarce. Our system relies on the notion of characteristic lines, which are invariants of two views of the same parallel line pairs on a surface of known orientation. Experiments with indoor video sequences demonstrate the robustness of the proposed system.

    Updated: 2020-01-01
  • A differential geometry approach to camera-independent image correspondence
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2018
    József Molnár, Ivan Eichhardt

    Projective geometry is a standard mathematical tool for image-based 3D reconstruction, and most reconstruction methods establish pointwise image correspondences using it. We present an alternative approach based on differential geometry, using oriented patches rather than points. Our approach assumes that the scene to be reconstructed is observed by any camera, existing or potential, that satisfies very general conditions, namely differentiability of the surface and bijectivity of the projection functions. We show how notions of differential geometry such as diffeomorphisms, pushforwards and pullbacks relate to the reconstruction problem. A unified theory applicable to various 3D reconstruction problems is presented. Considering two views of the surface, we derive reconstruction equations for oriented patches and pose equations to determine the relative pose of the two cameras. We then discuss the generalized epipolar geometry and derive the generalized epipolar constraint (compatibility equation) along the epipolar curves. Applying the proposed theory to the projective camera, and assuming that the affine mapping between small corresponding regions has been estimated, we obtain the minimal pose equation for the case when a fully calibrated camera is moved with its internal parameters unchanged. Equations for the projective epipolar constraint and the fundamental matrix are also derived. Finally, two important nonlinear camera types, the axial and the spherical, are examined.

    Updated: 2020-01-01
  • A comparative study of pose representation and dynamics modelling for online motion quality assessment
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2016
    Lili Tao, Adeline Paiement, Dima Damen, Majid Mirmehdi, Sion L. Hannuna, Massimo Camplani, Tilo Burghardt, Ian Craddock

    Quantitative assessment of the quality of motion is increasingly in demand by clinicians in healthcare and in the rehabilitation monitoring of patients. We study and compare the performance of different pose representations and HMM models of movement dynamics for the online quality assessment of human motion. In a general sense, our assessment framework builds a model of normal human motion from skeleton-based samples of healthy individuals. It encapsulates the dynamics of human body pose using a robust manifold representation and a first-order Markovian assumption, and assesses deviations from the model via a continuous online measure. We compare different feature representations, reduced-dimensionality spaces, and HMM models on motions typically tested in clinical settings, such as gait on stairs and flat surfaces, and transitions between sitting and standing. Our dataset is manually labelled by a qualified physiotherapist. The continuous-state HMM, combined with a pose representation based on body-joint locations, outperforms standard discrete-state HMM approaches and other skeleton-based features in detecting gait abnormalities, as well as in assessing deviations from the motion model on a frame-by-frame basis.

    Updated: 2020-01-01
  • Image processing using 3-state cellular automata
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2010
    Paul L. Rosin

    This paper describes the application of cellular automata (CA) to various image processing tasks such as denoising and feature detection. Whereas our previous work mainly dealt with binary images, the current work operates on intensity images. The increased number of cell states (i.e. pixel intensities) leads to a vast increase in the number of possible rules. Therefore, a reduced intensity representation is used, leading to a more practical three-state CA. In addition, a modified sequential floating forward search mechanism is developed to speed up the selection of good rule sets in the CA training stage. Results are compared with our previous method based on threshold decomposition and are found to be generally superior. The results demonstrate that the CA can be trained to perform many different tasks, and that the quality of these results is in many cases comparable to or better than established specialised algorithms.
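
    A toy 3-state CA update on a quantised image, with the rule stored as a small lookup table (a simplified rule encoding on a toroidal grid, not the trained rule sets of the paper):

        import numpy as np

        def ca_step(grid, rule):
            # grid: image quantised to states {0, 1, 2}; rule: (3, 9) table
            # mapping (centre state, 4-neighbour state sum) -> next state.
            s = (np.roll(grid, 1, 0) + np.roll(grid, -1, 0)
                 + np.roll(grid, 1, 1) + np.roll(grid, -1, 1))  # sums in 0..8
            out = np.empty_like(grid)
            for c in range(3):
                for n in range(9):
                    out[(grid == c) & (s == n)] = rule[c, n]
            return out

    Training then amounts to searching over such rule tables for the one whose repeated application best performs the target task, which is where the modified sequential floating forward search comes in.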

    Updated: 2020-01-01
  • Cooperative Stereo-Motion: Matching and Reconstruction
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 1998
    Fadi Dornaika, Rhy Chung

    One of the most interesting goals of computer vision is the 3D structure recovery of scenes. Traditionally, two cues are used: structure from motion and structure from stereo, two subfields with complementary sets of assumptions and techniques. This paper introduces a new general framework of cooperation between stereo and motion. This framework combines the advantages of both cues: (i) easy correspondence from motion and (ii) accurate 3D reconstruction from stereo. First, we show how the stereo matching can be recovered from motion correspondences using only geometric constraints. Second, we propose a method of 3D reconstruction of both binocular and monocular features using all stereo pairs in the case of a calibrated stereo rig. Third, we perform an analysis of the performance of the proposed framework as well as a comparison with an affine method. Experiments involving real and synthetic stereo pairs indicate that rich and reliable information can be derived from the proposed framework. They also indicate that robust 3D reconstruction can be obtained even with short image sequences.

    Updated: 2020-01-01
  • Incremental, scalable tracking of objects inter camera
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2008
    Andrew Gilbert; Richard Bowden

    This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated cameras with non-overlapping fields of view. The approach relies on three cues, namely colour, relative size, and movement between cameras, to describe the relationship of objects between cameras. This relationship weights the observation likelihood for correlating or tracking objects between cameras. Any individual cue alone performs poorly, but fusing them together yields a large boost in accuracy. Unlike previous work, this paper uses an incremental learning technique. The three cues are learnt in parallel and then fused together to track objects across the spatially separated cameras. The colour appearance cue is incrementally calibrated through transformation matrices, while probabilistic links between cameras, modelling an object's bounding box, represent the object's relative size. Probabilistic region links between entry and exit areas on cameras provide the movement cue. The approach needs no prior colour or environment calibration and does not use batch processing. It works completely unsupervised and becomes more accurate over time as new evidence is accumulated.
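
    A minimal sketch of the fusion step only, under the assumption that the three cue likelihoods have already been learnt; all names are hypothetical.

```python
def fused_likelihood(p_colour, p_size, p_movement):
    """Naive-Bayes style fusion: each cue is weak alone, but the product
    sharply down-weights candidates that any single cue finds implausible."""
    return p_colour * p_size * p_movement

def best_inter_camera_match(candidates):
    """candidates: iterable of (object_id, p_colour, p_size, p_movement)."""
    return max(candidates, key=lambda c: fused_likelihood(c[1], c[2], c[3]))
```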

    Updated: 2020-01-01
  • An Optimizing Line Finder Using a Hough Transform Algorithm
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 1997
    Phil L. Palmer; Josef Kittler; Maria Petrou

    In this paper we present an optimization algorithm for locating peaks in the accumulator of a Hough transform with a robust voting kernel. We present a detailed discussion of the accuracy that can be achieved by locating these peaks in the accumulator, and show that the error bounds on the estimates of the line parameters are always within those based upon least squares. This arises from the robust nature of the voting kernel. We describe the optimization algorithm in some detail, since the peaks in the standard parameter space for straight lines take the shape of sinusoidal ridges; standard approaches therefore fail, but the method described is shown to be robust in the experimental results presented. We also discuss post-processing, which can remedy a shortcoming of standard Hough techniques: the splitting of long lines across parameter bins. We further discuss a confidence measure for the line parameters based upon the value of the accumulator, and show that it is related to the mean squared distance of the associated edge pixels from the line. Finally, we present results produced by this optimizing Hough technique on a disparate set of images, with various application areas in mind, to demonstrate the versatility of the method and the accuracy that can be achieved at little computational overhead.
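
    A minimal sketch of accumulator voting with a robust (truncated-quadratic) kernel, the kind of smooth accumulator such a peak optimizer works on; the bin counts, kernel width and kernel shape are placeholder choices, and the peak refinement itself is left out.

```python
import numpy as np

def hough_accumulate(edge_pts, n_theta=180, n_rho=200, rho_max=400.0, width=2.0):
    """Vote for lines x*cos(theta) + y*sin(theta) = rho with a smooth kernel
    instead of hard binning, so peaks stay well shaped for optimization."""
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rhos = np.linspace(-rho_max, rho_max, n_rho)
    acc = np.zeros((n_rho, n_theta))
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    for x, y in edge_pts:
        rho = x * cos_t + y * sin_t            # ideal rho at every theta
        resid = np.abs(rhos[:, None] - rho)    # residual per (rho, theta) bin
        acc += np.maximum(0.0, 1.0 - (resid / width) ** 2)  # robust vote
    return acc, rhos, thetas
```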

    Updated: 2020-01-01
  • GOLD: Gaussians of Local Descriptors for image representation
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2015
    Giuseppe Serra; Costantino Grana; Marco Manfredi; Rita Cucchiara

    The Bag of Words paradigm has been the baseline from which several successful image classification solutions were developed in the last decade. These represent images by quantizing local descriptors and summarizing their distribution. The quantization step introduces a dependency on the dataset which, even though it significantly boosts performance in some contexts, severely limits the model's generalization capabilities. In this paper, by contrast, we propose to model the distribution of local features with a multivariate Gaussian, without any quantization. The full-rank covariance matrix, which lies on a Riemannian manifold, is projected onto the tangent Euclidean space and concatenated to the mean vector. The resulting representation, a Gaussian of Local Descriptors (GOLD), allows the dot product to closely approximate a distance between distributions without the need for expensive kernel computations. We describe an image by an improved spatial pyramid, which avoids boundary effects through soft assignment: local descriptors contribute to neighboring Gaussians, forming a weighted spatial pyramid of GOLD descriptors. In addition, we extend the model by leveraging dataset characteristics in a mixture-of-Gaussians formulation, further improving the classification accuracy. To deal with large-scale datasets and high-dimensional feature spaces, a Stochastic Gradient Descent solver is adopted. Experimental results on several publicly available datasets show that the proposed method obtains state-of-the-art performance.
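
    The core construction is compact enough to sketch directly from the description above (SciPy assumed; the regularization epsilon is a placeholder choice):

```python
import numpy as np
from scipy.linalg import logm

def gold_descriptor(local_descs, eps=1e-4):
    """GOLD: mean vector concatenated with the log-Euclidean projection of
    the covariance of local descriptors. local_descs: (n, d) array."""
    mu = local_descs.mean(axis=0)
    cov = np.cov(local_descs, rowvar=False) + eps * np.eye(local_descs.shape[1])
    log_cov = logm(cov).real                   # project onto the tangent space
    iu = np.triu_indices(log_cov.shape[0])     # symmetric: keep upper triangle
    return np.concatenate([mu, log_cov[iu]])
```

    With this vectorization, the dot product between two GOLD descriptors approximates a similarity between the underlying Gaussians without any kernel evaluation.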

    Updated: 2020-01-01
  • Tracking in object action space
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2013
    Volker Krüger; Dennis Herzog

    In this paper we focus on the joint problem of tracking humans and recognizing human actions in scenarios such as a kitchen, or one where a robot cooperates with a human, e.g., in a manufacturing task. In these scenarios, the human interacts with objects directly, by using or manipulating them or by, e.g., pointing at them, as in "Give me that...". Recognizing these types of human actions is difficult because (a) they ought to be recognized independently of scene parameters such as viewing direction, and (b) the actions are parametric, where the parameters are either object-dependent or, as in the case of a pointing direction, convey important information. One common way to achieve recognition is 3D human body tracking followed by action recognition based on the captured tracking data. For the kind of scenarios considered here, we argue that 3D body tracking and action recognition should be seen as an intertwined problem that is primed by the objects on which the actions are applied. In this paper, we look at human body tracking and action recognition from an object-driven perspective. Instead of the space of human body poses, we consider the space of object affordances, i.e., the space of possible actions that can be applied to a given object. This way, 3D body tracking reduces to action tracking in the object- (and context-) primed parameter space of the object affordances, which reduces the high-dimensional joint space to a low-dimensional action space. In our approach, we use parametric hidden Markov models to represent parametric movements; particle filtering is used to track in the space of action parameters. We demonstrate the approach's effectiveness on synthetic and real image sequences using single-arm upper-body actions that involve objects.
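
    A minimal sketch of one tracking step in the low-dimensional action-parameter space; `synthesize_pose` and `pose_likelihood` are hypothetical stand-ins for the parametric-HMM pose model and the image likelihood used in the paper.

```python
import numpy as np

def particle_filter_step(particles, weights, observation,
                         synthesize_pose, pose_likelihood, noise_std=0.05):
    """particles: (n, d) action parameters; weights: (n,), normalized.
    Resample, diffuse, then reweight by how well the synthesized pose
    explains the current observation."""
    n = len(particles)
    idx = np.random.choice(n, size=n, p=weights)        # resample
    moved = particles[idx] + np.random.normal(0.0, noise_std, particles.shape)
    w = np.array([pose_likelihood(observation, synthesize_pose(p))
                  for p in moved])
    return moved, w / w.sum()
```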

    Updated: 2020-01-01
  • Scene parsing by nonparametric label transfer of content-adaptive windows
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2016
    Frederick Tung; James J. Little

    CollageParsing is a scene parsing algorithm that matches content-adaptive windows. Unlike superpixels, content-adaptive windows are designed to preserve objects. A powerful MRF unary is constructed by performing label transfer using the windows. Gains of 15-19% average per-class accuracy are obtained on a standard benchmark. Scene parsing is the task of labeling every pixel in an image with its semantic category. We present CollageParsing, a nonparametric scene parsing algorithm that performs label transfer by matching content-adaptive windows. Content-adaptive windows provide a higher level of perceptual organization than superpixels, and unlike superpixels are designed to preserve entire objects instead of fragmenting them. Performing label transfer using content-adaptive windows enables the construction of a more effective Markov random field unary potential than previous approaches. On a standard benchmark consisting of outdoor scenes from the LabelMe database, CollageParsing obtains state-of-the-art performance with 15-19% higher average per-class accuracy than recent nonparametric scene parsing algorithms.

    Updated: 2020-01-01
  • Topological analysis of shapes using Morse theory
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2007
    Madjid Allili; David Corriveau

    In this paper, we propose a novel method for shape analysis that is suitable for any multi-dimensional data set that can be modelled as a manifold. The descriptor is obtained for any pair (M, φ), where M is a closed smooth manifold and φ is a Morse function defined on M. More precisely, we characterize the topology of all pairs of sub-level sets (M_y, M_x) of φ, where M_a = φ^{-1}((−∞, a]) for all a ∈ ℝ. Classical Morse theory is used to establish a link between the topology of a pair of sub-level sets of φ and the critical points of φ lying between the two levels.
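
    For readers unfamiliar with the notation, a brief restatement with a standard textbook example (the height function on the sphere), not taken from the paper:

```latex
\[
  M_a \;=\; \varphi^{-1}\bigl((-\infty,\, a]\bigr), \qquad a \in \mathbb{R}.
\]
% Example: for the height function \varphi on the sphere $S^2$,
% $M_a$ is empty for $a < \min\varphi$, a topological disc for
% $\min\varphi \le a < \max\varphi$, and all of $S^2$ for $a \ge \max\varphi$;
% the topology of a pair $(M_y, M_x)$ changes only when a critical value
% of \varphi lies between the two levels.
```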

    Updated: 2020-01-01
  • Experimental Evaluation of FLIR ATR Approaches - A Comparative Study
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2001
    Baoxin Li; Rama Chellappa; Qinfen Zheng; Sandor Z. Der; Nasser M. Nasrabadi; LipChen Alex Chan; Lin-Cheng Wang

    This paper presents an empirical evaluation of a number of recently developed Automatic Target Recognition algorithms for Forward-Looking Infrared (FLIR) imagery using a large database of real FLIR images. The algorithms evaluated are based on convolutional neural networks (CNN), principal component analysis (PCA), linear discriminant analysis (LDA), learning vector quantization (LVQ), modular neural networks (MNN), and two model-based algorithms, using Hausdorff metric-based matching and geometric hashing. The evaluation results show that among the neural approaches, the LVQ- and MNN-based algorithms perform the best; the classical LDA and the PCA methods and our implementation of the geometric hashing method ended up in the bottom three, with the CNN- and Hausdorff metric-based methods in the middle. Analyses show that the less-than-desirable performance of the approaches is mainly due to the lack of a good training set.

    Updated: 2020-01-01
  • Face alignment in-the-wild: A Survey
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2016
    Xin Jin; Xiaoyang Tan

    Over the last two decades, face alignment, i.e. localizing fiducial facial points, has received increasing attention owing to its wide-ranging applications in automatic face analysis. However, the task has proven extremely challenging in unconstrained environments due to many confounding factors, such as pose, occlusion, expression and illumination. While numerous techniques have been developed to address these challenges, the problem is still far from solved. In this survey, we present an up-to-date critical review of the existing literature on face alignment, focusing on methods that address the overall difficulties and challenges of the topic under uncontrolled conditions. Specifically, we categorize existing face alignment techniques, present detailed descriptions of the prominent algorithms within each category, and discuss their advantages and disadvantages. Furthermore, we organize special discussions on the practical aspects of face alignment in-the-wild, towards the development of a robust face alignment system. In addition, we show performance statistics of the state of the art, and conclude the paper with several promising directions for future research.

    Updated: 2020-01-01
  • Enhanced control of a wheelchair-mounted robotic manipulator using 3-D vision and multimodal interaction
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2016
    Hairong Jiang; Ting Zhang; Juan Pablo Wachs; Bradley S. Duerstock

    A wheelchair-mounted robotic arm integrates 3D computer vision with multimodal input. Object recognition was improved by combining RGB information and 3D point clouds. Hybrid input (using both gestures and speech) outperformed any single modality. Input performance was validated on daily living tasks: feeding and dressing. This paper presents a multiple-sensor, 3D vision-based, autonomous wheelchair-mounted robotic manipulator (WMRM). Two 3D sensors were employed: one for object recognition, and the other for recognizing body parts (face and hands). The goal is to recognize everyday items and automatically interact with them in an assistive fashion. For example, when a cereal box is recognized, it is grasped, poured into a bowl, and brought to the user. Everyday objects (e.g. a bowl or a hat) were automatically detected and classified using a three-step procedure: (1) remove the background based on 3D information and find the point cloud of each object; (2) extract feature vectors for each segmented object from its 3D point cloud and its color image; and (3) classify the feature vectors with a nonlinear support vector machine (SVM). To retrieve specific objects, three user interface methods were adopted: voice-based, gesture-based, and hybrid commands. The presented system was tested on two common activities of daily living, feeding and dressing. The results revealed that an accuracy of 98.96% is achieved on a dataset with twelve everyday objects. The experimental results indicated that hybrid (gesture and speech) interaction outperforms any single-modality interaction.
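
    A minimal sketch of step (3) only, assuming scikit-learn is available; the feature extraction of steps (1)-(2) is abstracted into the input array, and the SVM hyperparameters are placeholder choices.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_object_classifier(features, labels):
    """features: (n, d) vectors built from each segmented object's 3D point
    cloud and colour image; labels: object categories (e.g. 'bowl', 'hat')."""
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=10.0, gamma="scale"))
    return clf.fit(features, labels)
```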

    Updated: 2020-01-01
  • Cross-domain mapping learning for transductive zero-shot learning
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019
    Mingyu Ding; Zhe Wang; Zhiwu Lu

    Zero-shot learning (ZSL) aims to learn a projection function from a visual feature space to a semantic embedding space, or the reverse. The main challenge of ZSL is the domain shift problem, where the unseen test data exhibit a large gap from the seen training data. Transductive ZSL methods alleviate this problem by learning from both labeled and unlabeled data to capture their common semantic information. In this paper, we propose a framework that learns a robust cross-domain mapping for transductive ZSL, with an extremely efficient algorithm for model optimization. Combined with a deep model, we formulate the cross-domain mapping as a general loss function that optimizes both the projection function and discriminative visual features simultaneously in an end-to-end manner. Extensive experiments on five benchmark datasets show that the proposed Cross-Domain Mapping (CDM) model outperforms the state of the art.
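
    Not the paper's model, but the simplest cross-domain mapping baseline it generalizes: a closed-form linear projection from visual features to semantic embeddings, followed by nearest-embedding classification. The regularizer lambda is a placeholder.

```python
import numpy as np

def fit_mapping(X_vis, S_sem, lam=1.0):
    """Solve W = argmin ||X W - S||^2 + lam ||W||^2 in closed form."""
    d = X_vis.shape[1]
    return np.linalg.solve(X_vis.T @ X_vis + lam * np.eye(d), X_vis.T @ S_sem)

def predict_class(W, x, class_embeddings):
    """Return the class whose semantic embedding is most cosine-similar
    to the projected visual feature x @ W."""
    s = x @ W
    sims = class_embeddings @ s / (
        np.linalg.norm(class_embeddings, axis=1) * np.linalg.norm(s) + 1e-12)
    return int(np.argmax(sims))
```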

    Updated: 2020-01-01
  • Registration without ICP
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2004
    Helmut Pottmann; Stefan Leopoldseder; Michael Hofer

    We present a new approach to the geometric alignment of a point cloud to a surface and to related registration problems. The standard algorithm is the familiar ICP algorithm. Here, we provide an alternative concept which relies on instantaneous kinematics and on the geometry of the squared distance function of a surface. The proposed algorithm exhibits faster convergence than ICP: this is supported both by results of a local convergence analysis and by experiments.
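
    The method relies on local approximants of the squared distance function to the surface; as a minimal sketch of the flavour of one linearized alignment step (using the simplest such approximant, point-to-plane, with closest points and normals assumed precomputed; this is not the authors' full algorithm):

```python
import numpy as np

def linearized_alignment_step(src, closest_pts, normals):
    """One least-squares step for a small rigid motion: the rotation is
    linearized as p + r x p, minimizing point-to-plane squared distances.
    src, closest_pts, normals: (n, 3) arrays. Returns (r, t)."""
    A = np.hstack([np.cross(src, normals), normals])        # (n, 6) system
    b = np.einsum("ij,ij->i", closest_pts - src, normals)   # signed residuals
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3], x[3:]
```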

    Updated: 2020-01-01
  • Cross-modality motion parameterization for fine-grained video prediction
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2019
    Yichao Yan; Bingbing Ni; Wendong Zhang; Jun Tang; Xiaokang Yang

    While predicting video content is challenging given the huge unconstrained search space, this work explores cross-modality constraints to safeguard the video generation process and improve content prediction. Observing the underlying correspondence between sound and object movement, we propose a novel cross-modality video generation network. Via adversarial training, this network directly links sound with the movement parameters of the manipulated object and automatically outputs the corresponding object motion according to the rhythm of the given audio signal. We experiment on both rigid-object and non-rigid-object motion prediction tasks and show that our method significantly reduces motion uncertainty in the generated video content, with the guidance of the associated audio information.

    Updated: 2020-01-01
  • Computer Vision in Sports
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2017
    Thomas B. Moeslund; Graham A. Thomas; Adrian Hilton; Peter Carr; Irfan Essa

    It is evident that sports play a major role in modern society. Sports allow interaction of people irrespective of age, social status, etc. With the increasing importance of mass media, a significant amount of resources has been channeled into the world of sports in order to improve presentation, understanding and performance. For example, the interest in statistics regarding performance is no longer limited to coaches and sports scientists; such statistics are now being applied in broadcast and other media to add value to the overall experience of the viewer. Obtaining such statistics, or other information, from a game is a tedious manual task, and therefore automatic systems based on sensing and machine learning are a natural approach. This is where computer vision comes in, since it has the ability for non-contact capture of data at a distance, i.e. without interfering with the sport/activity. Computer vision algorithms have huge potential in many aspects of sports, ranging from automatic annotation of live footage and improved understanding of sports injuries through to enhanced viewing. The tasks to be solved by computer vision systems in sports are similar to those needed in other domains, namely detection, tracking and recognition. Until recently, the use of computer vision in sports has been scattered between different disciplines. To this end, a series of dedicated workshops has been organized: Computer Vision in Sports (@CVPR13, @ICCV15, @CVPR17). The first workshop resulted in an edited book on Computer Vision in Sports. This special issue (SI) is rooted in the second of these workshops and contains extended workshop articles together with contributions from an open call. In total, 22 submissions were received and 12 accepted for publication in this SI. Many of the accepted articles contain aspects of detection, tracking and recognition. In the following, we briefly present them in four categories according to the main focus of each article.

    Updated: 2020-01-01
  • Unsupervised action proposal ranking through proposal recombination
    Comput. Vis. Image Underst. (IF 2.645) Pub Date : 2017
    Waqas Sultani; Dong Zhang; Mubarak Shah

    Recently, action proposal methods have played an important role in action recognition tasks, as they reduce the search space dramatically. Most unsupervised action proposal methods tend to generate hundreds of action proposals, many of which are noisy, inconsistent, and unranked, while supervised action proposal methods take advantage of predefined object detectors (e.g., a human detector) to refine and score the proposals, but require thousands of manual annotations to train. Given the action proposals in a video, the goal of the proposed work is to generate a few better action proposals that are ranked properly. In our approach, we first divide each action proposal into sub-proposals and then use a dynamic-programming-based graph optimization scheme to select the optimal combination of sub-proposals from different proposals and assign each new proposal a score. We propose a new unsupervised image-based actionness detector that leverages web images, and we employ it as one of the node scores in our graph formulation. Moreover, we capture motion information by estimating the number of motion contours within each action proposal patch. The proposed method is unsupervised: it needs neither bounding box annotations nor video-level labels, which is desirable given the current explosion of large-scale action datasets. Our approach is generic and does not depend on a specific action proposal method. We evaluate it on several publicly available trimmed and untrimmed datasets and obtain better performance than several proposal ranking methods. In addition, we demonstrate that properly ranked proposals produce significantly better action detection than state-of-the-art proposal-based methods.
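
    A minimal sketch of the recombination step as a Viterbi-style dynamic program; the per-segment node scores (e.g. actionness and motion contours) and the pairwise consistency function are hypothetical inputs, not the paper's exact formulation.

```python
import numpy as np

def recombine_subproposals(node_scores, consistency):
    """node_scores[t]: scores of candidate sub-proposals at segment t.
    consistency(t, i, j): score linking sub-proposal i at segment t to
    sub-proposal j at segment t+1. Returns (best path, total score)."""
    best = [np.asarray(node_scores[0], dtype=float)]
    back = []
    for t in range(1, len(node_scores)):
        cur = np.asarray(node_scores[t], dtype=float)
        pair = np.array([[consistency(t - 1, i, j) for j in range(len(cur))]
                         for i in range(len(best[-1]))])
        total = best[-1][:, None] + pair      # all (prev, cur) combinations
        back.append(total.argmax(axis=0))     # best predecessor per candidate
        best.append(total.max(axis=0) + cur)
    path = [int(np.argmax(best[-1]))]
    for pointers in reversed(back):           # walk the back-pointers
        path.append(int(pointers[path[-1]]))
    return path[::-1], float(np.max(best[-1]))
```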

    Updated: 2020-01-01
Contents have been reproduced by permission of the publishers.