• arXiv.cs.CV Pub Date : 2020-01-17
Hao Shao; Shengju Qian; Yu Liu

For a long time, the vision community has tried to learn spatio-temporal representations by combining convolutional neural networks with various temporal models, such as the families of Markov chain, optical flow, RNN and temporal convolution. However, these pipelines consume enormous computing resources due to the alternating learning process for spatial and temporal information. One natural question is whether we can embed the temporal information into the spatial one so that the information in the two domains can be learned jointly in a single pass. In this work, we answer this question by presenting a simple yet powerful operator, the temporal interlacing network (TIN). Instead of learning the temporal features, TIN fuses the two kinds of information by interlacing spatial representations from the past to the future, and vice versa. A differentiable interlacing target can be learned to control the interlacing process. In this way, a heavy temporal model is replaced by a simple interlacing operator. We theoretically prove that with a learnable interlacing target, TIN performs equivalently to the regularized temporal convolution network (r-TCN), but gains 4% more accuracy with 6x less latency on 6 challenging benchmarks. These results push the state-of-the-art performance of video understanding by a considerable margin. Not surprisingly, the ensemble model of the proposed TIN won $1^{st}$ place in the ICCV19 - Multi Moments in Time challenge. Code is made available to facilitate further research at https://github.com/deepcs233/TIN
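
The interlacing idea can be illustrated with a minimal sketch. It assumes fixed integer offsets per channel, whereas the abstract describes offsets that are learned through a differentiable interlacing target; the function name `temporal_interlace` and the toy data are hypothetical and are not taken from the authors' repository.

```python
def temporal_interlace(frames, offsets):
    """Shift each channel along the time axis by its own integer offset.

    frames:  list of T frames, each a list of C channel values
             (one value stands in for a whole spatial feature map).
    offsets: list of C integers (assumed fixed here, not learned);
             +k reads the channel from k frames in the past,
             -k reads it from k frames in the future.
             Positions that fall outside the clip are zero-filled.
    """
    num_frames = len(frames)
    num_channels = len(offsets)
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c, offset in enumerate(offsets):
            source = t - offset
            if 0 <= source < num_frames:
                shifted[t][c] = frames[source][c]
    return shifted


# Toy clip: 3 frames, 3 channels.
frames = [[1.0, 10.0, 100.0],
          [2.0, 20.0, 200.0],
          [3.0, 30.0, 300.0]]
# Channel 0 looks one frame back, channel 1 stays, channel 2 looks one frame ahead.
mixed = temporal_interlace(frames, offsets=[1, 0, -1])
```

After the shift, every frame holds channels taken from its neighbours in time, so an ordinary 2D convolution applied per frame already sees temporal context without a separate temporal model.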

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-17
A. Emre Kavur; N. Sinem Gezer; Mustafa Barış; Pierre-Henri Conze; Vladimir Groza; Duc Duy Pham; Soumick Chatterjee; Philipp Ernst; Savaş Özkan; Bora Baydar; Dmitry Lachinov; Shuo Han; Josef Pauli; Fabian Isensee; Matthias Perkonigg; Rachana Sathish; Ronnie Rajan; Sinem Aslan; Debdoot Sheet; Gurbandurdy Dovletov; Oliver Speck; Andreas Nürnberger; Klaus H. Maier-Hein; Gözde Bozdağı Akar; Gözde Ünal; Oğuz Dicle; M. Alper Selver

Segmentation of abdominal organs has been a comprehensive, yet unresolved, research field for many years. In the last decade, intensive developments in deep learning (DL) have introduced new state-of-the-art segmentation systems. Although these systems outperform existing ones in overall accuracy, the effects of DL model properties and parameters on performance are hard to interpret. This makes comparative analysis a necessary tool to achieve explainable studies and systems. Moreover, the performance of DL for emerging learning approaches such as cross-modality and multi-modal tasks has rarely been discussed. In order to expand the knowledge in these topics, the CHAOS (Combined (CT-MR) Healthy Abdominal Organ Segmentation) challenge was organized at the IEEE International Symposium on Biomedical Imaging (ISBI) 2019 in Venice, Italy. In contrast to the large number of previous abdomen-related challenges, the majority of which focus on tumor/lesion detection and/or classification with a single modality, CHAOS provides both abdominal CT and MR data from healthy subjects. Five different and complementary tasks have been designed to analyze the capabilities of the current approaches from multiple perspectives. The results are investigated thoroughly and compared with manual annotations and interactive methods. The outcomes are reported in detail to reflect the latest advancements in the field. The CHAOS challenge and data will be available online to provide a continuous benchmark resource for segmentation.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-17
Lei Chen; Jianhui Chen; Hossein Hajimirsadeghi; Greg Mori

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Luisa Verdoliva

With the rapid progress of recent years, techniques that generate and manipulate multimedia content can now guarantee a very advanced level of realism. The boundary between real and synthetic media has become very thin. On the one hand, this opens the door to a series of exciting applications in different fields such as creative arts, advertising, film production, video games. On the other hand, it poses enormous security threats. Software packages freely available on the web allow any individual, without special skills, to create very realistic fake images and videos. So-called deepfakes can be used to manipulate public opinion during elections, commit fraud, discredit or blackmail people. Potential abuses are limited only by human imagination. Therefore, there is an urgent need for automated tools capable of detecting false multimedia content and avoiding the spread of dangerous false information. This review paper aims to present an analysis of the methods for visual media integrity verification, that is, the detection of manipulated images and videos. Special emphasis will be placed on the emerging phenomenon of deepfakes and, from the point of view of the forensic analyst, on modern data-driven forensic methods. The analysis will help to highlight the limits of current forensic tools, the most relevant issues, the upcoming challenges, and suggest future directions for research.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Matej Ulicny; Vladimir A. Krylov; Rozenn Dahyot

Convolutional neural networks (CNNs) learn filters in order to capture local correlation patterns in feature space. In this paper we propose to revert to learning combinations of preset spectral filters by switching to CNNs with harmonic blocks. We rely on the use of the Discrete Cosine Transform (DCT) filters which have excellent energy compaction properties and are widely used for image compression. The proposed harmonic blocks rely on DCT-modeling and replace conventional convolutional layers to produce partially or fully harmonic versions of new or existing CNN architectures. We demonstrate how the harmonic networks can be efficiently compressed in a straightforward manner by truncating high-frequency information in harmonic blocks which is possible due to the redundancies in the spectral domain. We report extensive experimental validation demonstrating the benefits of the introduction of harmonic blocks into state-of-the-art CNN models in image classification, segmentation and edge detection applications.
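
As a rough illustration of "learning combinations of preset spectral filters", the sketch below builds an unnormalized 2D DCT-II filter bank and forms one effective filter as a weighted sum of its members. The function names and the weight values are hypothetical; in a trained harmonic block the weights would be learned, and the paper's exact normalization is not stated in the abstract.

```python
import math


def dct_filter_bank(size):
    """Return size*size preset 2D DCT-II basis filters, each size x size.

    Filters are ordered by (u, v) frequency index and left unnormalized.
    """
    bank = []
    for u in range(size):
        for v in range(size):
            filt = [[math.cos(math.pi * (x + 0.5) * u / size)
                     * math.cos(math.pi * (y + 0.5) * v / size)
                     for y in range(size)]
                    for x in range(size)]
            bank.append(filt)
    return bank


def combine_filters(bank, weights):
    """Build one effective filter as a weighted sum of the preset filters."""
    size = len(bank[0])
    return [[sum(w * f[x][y] for w, f in zip(weights, bank))
             for y in range(size)]
            for x in range(size)]


bank = dct_filter_bank(3)
# Hypothetical weights: keep the DC filter and one low-frequency filter,
# and zero out the remaining seven (a crude form of spectral truncation).
weights = [1.0, 0.5] + [0.0] * 7
effective = combine_filters(bank, weights)
```

Setting the weights of high-frequency filters to zero, as in the example, is one simple way to picture the truncation-based compression the abstract mentions.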

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Lirong Wu; Kejie Huang; Haibin Shen

The method of importance map has been widely adopted in DNN-based lossy image compression to achieve bit allocation according to the importance of image contents. However, insufficient allocation of bits in non-important regions often leads to severe distortion at low bpp (bits per pixel), which hampers the development of efficient content-weighted image compression systems. This paper rethinks content-based compression by using Generative Adversarial Network (GAN) to reconstruct the non-important regions. Moreover, multiscale pyramid decomposition is applied to both the encoder and the discriminator to achieve global compression of high-resolution images. A tunable compression scheme is also proposed in this paper to compress an image to any specific compression ratio without retraining the model. The experimental results show that our proposed method improves MS-SSIM by more than 10.3% compared to the recently reported GAN-based method to achieve the same low bpp (0.05) on the Kodak dataset.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Lirong Wu; Kejie Huang; Haibin Shen; Lianli Gao

Data storage has been one of the bottlenecks in surveillance systems. Conventional video compression algorithms such as H.264 and H.265 do not fully exploit the low information density of surveillance video. In this paper, we propose a video compression method that extracts and compresses the foreground and background of the video separately. The compression ratio is greatly improved by sharing background information among multiple adjacent frames through an adaptive background updating and interpolation module. In addition, we present two different schemes to compress the foreground and compare their performance in the ablation study to show the importance of temporal information for video compression. At the decoding end, a coarse-to-fine two-stage module is applied to compose the foreground and background and to enhance frame quality. Furthermore, an adaptive sampling method for surveillance cameras is proposed, and we show its effects through software simulation. The experimental results show that our proposed method requires 69.5% less bpp (bits per pixel) than the conventional algorithm H.265 to achieve the same PSNR (36 dB) on the HECV dataset.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Pedro D. Marrero Fernandez; Tsang Ing Ren; Tsang Ing Jyh; Fidel A. Guerrero Peña; Alexandre Cunha

We propose a deep metric learning model to create embedded sub-spaces with a well defined structure. A new loss function that imposes Gaussian structures on the output space is introduced to create these sub-spaces thus shaping the distribution of the data. Having a mixture of Gaussians solution space is advantageous given its simplified and well established structure. It allows fast discovering of classes within classes and the identification of mean representatives at the centroids of individual classes. We also propose a new semi-supervised method to create sub-classes. We illustrate our methods on the facial expression recognition problem and validate results on the FER+, AffectNet, Extended Cohn-Kanade (CK+), BU-3DFE, and JAFFE datasets. We experimentally demonstrate that the learned embedding can be successfully used for various applications including expression retrieval and emotion recognition.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Hari Om Aggrawal; Jan Modersitzki

Multilevel strategies are an integral part of many image registration algorithms. These strategies are very well-known for avoiding undesirable local minima, providing an outstanding initial guess, and reducing overall computation time. State-of-the-art multilevel strategies build a hierarchy of discretization in the spatial dimensions. In this paper, we present a spatio-temporal strategy, where we introduce a hierarchical discretization in the temporal dimension at each spatial level. This strategy is suitable for a motion estimation problem where the motion is assumed smooth over time. Our strategy exploits the temporal smoothness among image frames by following a predictor-corrector approach. The strategy predicts the motion by a novel interpolation method and later corrects it by registration. The prediction step provides a good initial guess for the correction step, hence reduces the overall computational time for registration. The acceleration is achieved by a factor of 2.5 on average, over the state-of-the-art multilevel methods on three examined optical coherence tomography datasets.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Shuo Wang; Tianle Chen; Shangyu Chen; Carsten Rudolph; Surya Nepal; Marthie Grobler

Anomaly detection aims to recognize samples with anomalous and unusual patterns with respect to a set of normal data, which is significant for numerous domain applications, e.g. in industrial inspection, medical imaging, and security enforcement. There are two key research challenges associated with existing anomaly detection approaches: (1) many of them perform well on low-dimensional problems, but their performance on high-dimensional instances, such as images, is limited; (2) many of them still rely on traditional supervised approaches and manual engineering of features, while the topic has not yet been fully explored using modern deep learning approaches, even when well-labeled samples are limited. In this paper, we propose a One-for-all Image Anomaly Detection system (OIAD) based on disentangled learning using only clean samples. Our key insight is that the impact of small perturbation on the latent representation can be bounded for normal samples while anomaly images are usually outside such bounded intervals, called structure consistency. We implement this idea and evaluate its performance for anomaly detection. Our experiments with three datasets show that OIAD can detect over $90\%$ of anomalies while maintaining a low false alarm rate. It can also detect suspicious samples among samples labeled as clean, coinciding with what humans would deem unusual.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Anubha Pandey; Ashish Mishra; Vinay Kumar Verma; Anurag Mittal; Hema A. Murthy

Conventional approaches to Sketch-Based Image Retrieval (SBIR) assume that the data of all the classes are available during training. The assumption may not always be practical since the data of a few classes may be unavailable, or the classes may not appear at the time of training. Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) relaxes this constraint and allows the algorithm to handle previously unseen classes during the test. This paper proposes a generative approach for ZS-SBIR based on the Stacked Adversarial Network (SAN) combined with the advantages of the Siamese Network (SN). While SAN generates a high-quality sample, SN learns a better distance metric compared to that of the nearest neighbor search. The capability of the generative model to synthesize image features based on the sketch reduces the SBIR problem to an image-to-image retrieval problem. We evaluate the efficacy of our proposed approach on the TU-Berlin and Sketchy databases in both the standard ZSL and generalized ZSL settings. The proposed method yields a significant improvement in standard ZSL as well as in the more challenging generalized ZSL setting (GZSL) for SBIR.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Tehseen Zia; Shahan Arif; Shakeeb Murtaza; Mirza Ahsan Ullah

Conditional image modeling based on textual descriptions is a relatively new domain in unsupervised learning. Previous approaches use a latent variable model and generative adversarial networks. While the former is approximated using variational auto-encoders and relies on intractable inference that can hamper its performance, the latter is unstable to train due to its Nash-equilibrium-based objective function. We develop a tractable and stable caption-based image generation model. The model uses an attention-based encoder to learn word-to-pixel dependencies. A conditional autoregressive decoder is used for learning pixel-to-pixel dependencies and generating images. Experiments are performed on the Microsoft COCO and MNIST-with-captions datasets, and performance is evaluated using the Structural Similarity Index. Results show that the proposed model performs better than contemporary approaches and generates better quality images. Keywords: Generative image modeling, autoregressive image modeling, caption-based image generation, neural attention, recurrent neural networks.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Min Li; Zhenglong Zhou; Zhe Wu; Boxin Shi; Changyu Diao; Ping Tan

We present a method to capture both 3D shape and spatially varying reflectance with a multi-view photometric stereo (MVPS) technique that works for general isotropic materials. Our algorithm is suitable for perspective cameras and nearby point light sources. Our data capture setup is simple, which consists of only a digital camera, some LED lights, and an optional automatic turntable. From a single viewpoint, we use a set of photometric stereo images to identify surface points with the same distance to the camera. We collect this information from multiple viewpoints and combine it with structure-from-motion to obtain a precise reconstruction of the complete 3D shape. The spatially varying isotropic bidirectional reflectance distribution function (BRDF) is captured by simultaneously inferring a set of basis BRDFs and their mixing weights at each surface point. In experiments, we demonstrate our algorithm with two different setups: a studio setup for highest precision and a desktop setup for best usability. According to our experiments, under the studio setting, the captured shapes are accurate to 0.5 millimeters and the captured reflectance has a relative root-mean-square error (RMSE) of 9%. We also quantitatively evaluate state-of-the-art MVPS on a newly collected benchmark dataset, which is publicly available for inspiring future research.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Pietro Falco; Shuang Lu; Ciro Natale; Salvatore Pirozzi; Dongheui Lee

In this work, we introduce the problem of cross-modal visuo-tactile object recognition with robotic active exploration. With this term, we mean that the robot observes a set of objects with visual perception and, later on, it is able to recognize such objects only with tactile exploration, without having touched any object before. Using a machine learning terminology, in our application we have a visual training set and a tactile test set, or vice versa. To tackle this problem, we propose an approach constituted by four steps: finding a visuo-tactile common representation, defining a suitable set of features, transferring the features across the domains, and classifying the objects. We show the results of our approach using a set of 15 objects, collecting 40 visual examples and five tactile examples for each object. The proposed approach achieves an accuracy of 94.7%, which is comparable with the accuracy of the monomodal case, i.e., when using visual data both as training set and test set. Moreover, it performs well compared to the human ability, which we have roughly estimated carrying out an experiment with ten participants.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Zhun Fan; Jiahong Wei; Guijie Zhu; Jiajie Mo; Wenji Li

Accurate retinal vessel segmentation (RVS) is of great significance in assisting doctors with the diagnosis of ophthalmological diseases and other systemic diseases, and manually designing a valid neural network architecture for retinal vessel segmentation requires high expertise and a large workload. In order to further improve the performance of vessel segmentation and reduce the workload of manually designing neural networks, we propose a specific search space based on the encoder-decoder framework and apply neural architecture search (NAS) to retinal vessel segmentation. The search space is a macro-architecture search that involves some operations and adjustments to the entire network topology. For the architecture optimization, we adopt a modified evolutionary strategy which can evolve the architectures with limited computing resources. During the evolution, we select the elite architectures for the next generation based on their performance. After the evolution, the searched model is evaluated on three mainstream datasets, namely DRIVE, STARE and CHASE_DB1. The searched model achieves top performance on all three datasets with fewer parameters (about 2.3M). Moreover, the results of cross-training between the above three datasets show that the searched model has considerable scalability, which indicates that the searched model has potential for clinical disease diagnosis.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Jie Wu; Guanbin Li; Si Liu; Liang Lin

Temporal language grounding in untrimmed videos is a newly raised task in video understanding. Most of the existing methods suffer from inferior efficiency, lack interpretability, and deviate from the human perception mechanism. Inspired by the human coarse-to-fine decision-making paradigm, we formulate a novel Tree-Structured Policy based Progressive Reinforcement Learning (TSP-PRL) framework to sequentially regulate the temporal boundary by an iterative refinement process. The semantic concepts are explicitly represented as the branches in the policy, which contributes to efficiently decomposing complex policies into an interpretable primitive action. Progressive reinforcement learning provides correct credit assignment via two task-oriented rewards that encourage mutual promotion within the tree-structured policy. We extensively evaluate TSP-PRL on the Charades-STA and ActivityNet datasets, and experimental results show that TSP-PRL achieves competitive performance over existing state-of-the-art methods.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-18
Yazhao Li; Yanwei Pang; Jianbing Shen; Jiale Cao; Ling Shao

Due to the advantages of real-time detection and improved performance, single-shot detectors have gained great attention recently. To solve the complex scale variations, single-shot detectors make scale-aware predictions based on multiple pyramid layers. However, the features in the pyramid are not scale-aware enough, which limits the detection performance. Two common problems in single-shot detectors caused by object scale variations can be observed: (1) small objects are easily missed; (2) the salient part of a large object is sometimes detected as an object. With this observation, we propose a new Neighbor Erasing and Transferring (NET) mechanism to reconfigure the pyramid features and explore scale-aware features. In NET, a Neighbor Erasing Module (NEM) is designed to erase the salient features of large objects and emphasize the features of small objects in shallow layers. A Neighbor Transferring Module (NTM) is introduced to transfer the erased features and highlight large objects in deep layers. With this mechanism, a single-shot network called NETNet is constructed for scale-aware object detection. In addition, we propose to aggregate nearest neighboring pyramid features to enhance our NET. NETNet achieves 38.5% AP at a speed of 27 FPS and 32.0% AP at a speed of 55 FPS on MS COCO dataset. As a result, NETNet achieves a better trade-off for real-time and accurate object detection.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Kaiyu Shan; Yongtao Wang; Zhuoying Wang; Tingting Liang; Zhi Tang; Ying Chen; Yangyan Li

To efficiently extract spatiotemporal features of video for action recognition, most state-of-the-art methods integrate 1D temporal convolution into a conventional 2D CNN backbone. However, they all exploit 1D temporal convolution of fixed kernel size (i.e., 3) in the network building block, thus have suboptimal temporal modeling capability to handle both long-term and short-term actions. To address this problem, we first investigate the impacts of different kernel sizes for the 1D temporal convolutional filters. Then, we propose a simple yet efficient operation called Mixed Temporal Convolution (MixTConv), which consists of multiple depthwise 1D convolutional filters with different kernel sizes. By plugging MixTConv into the conventional 2D CNN backbone ResNet-50, we further propose an efficient and effective network architecture named MSTNet for action recognition, and achieve state-of-the-art results on multiple benchmarks.
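
The mixed-kernel idea can be sketched in a few lines. The example assumes that channels are split into equal groups, that each group shares one kernel (a true depthwise convolution would hold a separate kernel per channel), and that the kernel sizes are 1, 3 and 5; none of these details are given in the abstract, and the function names are hypothetical.

```python
def depthwise_temporal_conv(signal, kernel):
    """Correlate one channel's time series with one odd-length kernel.

    Zero padding keeps the output the same length as the input.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            source = t + k - half
            if 0 <= source < len(signal):
                acc += weight * signal[source]
        out.append(acc)
    return out


def mixed_temporal_conv(channels, kernels_by_group):
    """Split channels into equal groups; each group uses its own kernel size."""
    group_size = len(channels) // len(kernels_by_group)
    out = []
    for index, signal in enumerate(channels):
        # Any leftover channels fall into the last group.
        group = min(index // group_size, len(kernels_by_group) - 1)
        out.append(depthwise_temporal_conv(signal, kernels_by_group[group]))
    return out


# Three channels, each a unit impulse at the middle of a 5-frame clip.
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
channels = [impulse[:], impulse[:], impulse[:]]
# Assumed kernel sizes 1, 3 and 5 with all-ones weights for readability.
kernels = [[1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0, 1.0]]
mixed = mixed_temporal_conv(channels, kernels)
```

Each group spreads the impulse over a different temporal extent, which is the sense in which mixing kernel sizes covers both short-term and long-term motion within a single layer.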

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Hui Zhu; Zhulin An; Kaiqiang Xu; Xiaolong Hu; Yongjun Xu

Existing approaches to improve the performance of convolutional neural networks by optimizing the local architectures or deepening the networks tend to increase the size of models significantly. In order to deploy and apply neural networks on edge devices, which are in great demand, reducing the scale of networks is crucial. However, compressing the networks easily degrades image processing performance. In this paper, we propose a method which is suitable for edge devices while improving the efficiency and effectiveness of inference. The joint decision of multiple participants, mainly comprising multiple layers and multiple networks, can achieve higher classification accuracy (up to 0.26% on CIFAR-10 and 4.49% on CIFAR-100) with a similar total number of parameters for classical convolutional neural networks.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Quan Xiao; Canhong Wen; Zirui Yan

The K-SVD algorithm has been successfully applied to image denoising tasks for many years, but its bottleneck in speed and accuracy still needs attention. For the sparse coding stage in K-SVD, which involves an $\ell_{0}$ constraint, prevailing methods usually seek approximate solutions greedily but are less effective once the noise level is high. The alternative $\ell_{1}$ optimization has proved to be more powerful than $\ell_{0}$; however, its time consumption prevents it from being implemented. In this paper, we propose a new K-SVD framework called K-SVD$_P$ by applying the Primal-dual active set (PDAS) algorithm to it. Different from greedy-algorithm-based K-SVD, the K-SVD$_P$ algorithm develops a selection strategy motivated by the KKT (Karush-Kuhn-Tucker) conditions and yields an efficient update in the sparse coding stage. Since the K-SVD$_P$ algorithm iteratively seeks an equivalent solution to the dual problem with a simple explicit expression in this denoising problem, speed and quality of denoising can be achieved simultaneously. Experiments are carried out and demonstrate the comparable denoising performance of our K-SVD$_P$ with state-of-the-art methods.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Tianhe Yu; Saurabh Kumar; Abhishek Gupta; Sergey Levine; Karol Hausman; Chelsea Finn

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Wenguan Wang; Zhijie Zhang; Siyuan Qi; Jianbing Shen; Yanwei Pang; Ling Shao

This work proposes to combine neural networks with the compositional hierarchy of human bodies for efficient and complete human parsing. We formulate the approach as a neural information fusion framework. Our model assembles the information from three inference processes over the hierarchy: direct inference (directly predicting each part of a human body using image information), bottom-up inference (assembling knowledge from constituent parts), and top-down inference (leveraging context from parent nodes). The bottom-up and top-down inferences explicitly model the compositional and decompositional relations in human bodies, respectively. In addition, the fusion of multi-source information is conditioned on the inputs, i.e., by estimating and considering the confidence of the sources. The whole model is end-to-end differentiable, explicitly modeling information flows and structures. Our approach is extensively evaluated on four popular datasets, outperforming the state of the art in all cases, with a fast processing speed of 23 fps. Our code and results have been released to help ease future research in this direction.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Wenguan Wang; Xiankai Lu; Jianbing Shen; David Crandall; Ling Shao

This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS). The suggested AGNN recasts this task as a process of iterative information fusion over video graphs. Specifically, AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges. The underlying pair-wise relations are described by a differentiable attention mechanism. Through parametric message passing, AGNN is able to efficiently capture and mine much richer and higher-order relations between video frames, thus enabling a more complete understanding of video content and more accurate foreground estimation. Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case. To further demonstrate the generalizability of our framework, we extend AGNN to an additional task: image object co-segmentation (IOCS). We perform experiments on two famous IOCS datasets and observe again the superiority of our AGNN model. The extensive experiments verify that AGNN is able to learn the underlying semantic/appearance relationships among video frames or related images, and discover the common objects.
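
To make "iterative information fusion over video graphs" concrete, here is a minimal sketch of one message-passing round on a fully connected graph. It assumes plain dot-product attention with a softmax and a simple additive node update. The abstract only says the attention is differentiable and the message passing is parametric, so the learned components are omitted here. The function name is hypothetical.

```python
import math


def attention_message_passing(nodes):
    """One round of message passing on a fully connected graph.

    nodes: list of at least two feature vectors, one per video frame.
    Each node receives a softmax-weighted sum of the other nodes' features,
    weighted by dot-product similarity, and adds it to its own state.
    """
    updated = []
    for i, node in enumerate(nodes):
        others = [other for j, other in enumerate(nodes) if j != i]
        scores = [sum(a * b for a, b in zip(node, other)) for other in others]
        peak = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        message = [sum(e / total * other[d] for e, other in zip(exps, others))
                   for d in range(len(node))]
        updated.append([n + m for n, m in zip(node, message)])
    return updated


# Three toy frame descriptors: two similar frames and one different frame.
nodes = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
updated = attention_message_passing(nodes)
```

Calling the function repeatedly on its own output gives the iterative fusion: similar frames reinforce each other more strongly than dissimilar ones because their attention weights are larger.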

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Xiankai Lu; Wenguan Wang; Chao Ma; Jianbing Shen; Ling Shao; Fatih Porikli

We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism to further improve the state-of-the-art deep learning based solutions that primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments. The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context by jointly computing and appending co-attention responses into a joint feature space. We train COSNet with pairs of video frames, which naturally augments training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to better infer the frequently reappearing and salient foreground objects. We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks demonstrate that COSNet outperforms the current alternatives by a large margin.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Shizhen Zhao; Changxin Gao; Yuanjie Shao; Lerenhan Li; Changqian Yu; Zhong Ji; Nong Sang

We propose a Generative Transfer Network (GTNet) for zero shot object detection (ZSD). GTNet consists of an Object Detection Module and a Knowledge Transfer Module. The Object Detection Module can learn large-scale seen domain knowledge. The Knowledge Transfer Module leverages a feature synthesizer to generate unseen class features, which are applied to train a new classification layer for the Object Detection Module. In order to synthesize features for each unseen class with both the intra-class variance and the IoU variance, we design an IoU-Aware Generative Adversarial Network (IoUGAN) as the feature synthesizer, which can be easily integrated into GTNet. Specifically, IoUGAN consists of three unit models: Class Feature Generating Unit (CFU), Foreground Feature Generating Unit (FFU), and Background Feature Generating Unit (BFU). CFU generates unseen features with the intra-class variance conditioned on the class semantic embeddings. FFU and BFU add the IoU variance to the results of CFU, yielding class-specific foreground and background features, respectively. We evaluate our method on three public datasets and the results demonstrate that our method performs favorably against the state-of-the-art ZSD approaches.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Ziyi Shen; Wenguan Wang; Xiankai Lu; Jianbing Shen; Haibin Ling; Tingfa Xu; Ling Shao

This paper proposes a human-aware deblurring model that disentangles the motion blur between foreground (FG) humans and background (BG). The proposed model is based on a triple-branch encoder-decoder architecture. The first two branches are learned for sharpening FG humans and BG details, respectively; while the third one produces global, harmonious results by comprehensively fusing multi-scale deblurring information from the two domains. The proposed model is further endowed with a supervised, human-aware attention mechanism in an end-to-end fashion. It learns a soft mask that encodes FG human information and explicitly drives the FG/BG decoder-branches to focus on their specific domains. To further benefit the research towards Human-aware Image Deblurring, we introduce a large-scale dataset, named HIDE, which consists of 8,422 blurry and sharp image pairs with 65,784 densely annotated FG human bounding boxes. HIDE is specifically built to span a broad range of scenes, human object sizes, motion patterns, and background complexities. Extensive experiments on public benchmarks and our dataset demonstrate that our model performs favorably against the state-of-the-art motion deblurring methods, especially in capturing semantic details.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Qichuan Geng; Hong Zhang; Xiaojuan Qi; Ruigang Yang; Zhong Zhou; Gao Huang

Semantic segmentation is a challenging task that needs to handle large scale variations, deformations and different viewpoints. In this paper, we develop a novel network named Gated Path Selection Network (GPSNet), which aims to learn adaptive receptive fields. In GPSNet, we first design a two-dimensional multi-scale network, SuperNet, which densely incorporates features from growing receptive fields. To dynamically select desirable semantic context, a gate prediction module is further introduced. In contrast to previous works that focus on optimizing sample positions on regular grids, GPSNet can adaptively capture free-form dense semantic contexts. The derived adaptive receptive fields are data-dependent and flexible enough to model different geometric transformations of objects. On two representative semantic segmentation datasets, i.e., Cityscapes and ADE20K, we show that the proposed approach consistently outperforms previous methods and achieves competitive performance without bells and whistles.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Ziyi Shen; Wei-Sheng Lai; Tingfa Xu; Jan Kautz; Ming-Hsuan Yang

In this paper, we propose an effective and efficient face deblurring algorithm by exploiting semantic cues via deep convolutional neural networks. Because human faces are highly structured and share unified facial components (e.g., eyes and mouths), such semantic information provides a strong prior for restoration. We incorporate face semantic labels as input priors and propose an adaptive structural loss to regularize facial local structures within an end-to-end deep convolutional neural network. Specifically, we first use a coarse deblurring network to reduce the motion blur on the input face image. We then adopt a parsing network to extract the semantic features from the coarse deblurred image. Finally, the fine deblurring network utilizes the semantic information to restore a clear face image. We train the network with perceptual and adversarial losses to generate photo-realistic results. The proposed method restores sharp images with more accurate facial features and details. Quantitative and qualitative evaluations demonstrate that the proposed face deblurring algorithm performs favorably against the state-of-the-art methods in terms of restoration quality, face recognition and execution speed.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
David Morris; Eric Müller-Budack; Ralph Ewerth

In the past few years, convolutional neural networks (CNNs) have achieved impressive results in computer vision tasks, which, however, mainly focus on photos with natural scene content. Meanwhile, non-sensor-derived images such as illustrations, data visualizations, figures, etc. are typically used to convey complex information or to explore large datasets. However, this kind of image has received little attention in computer vision. CNNs and similar techniques use large volumes of training data. Currently, many document analysis systems are trained in part on scene images due to the lack of large datasets of educational image data. In this paper, we address this issue and present SlideImages, a dataset for the task of classifying educational illustrations. SlideImages contains training data collected from various sources, e.g., Wikimedia Commons and the AI2D dataset, and test data collected from educational slides. We have reserved all the actual educational images as a test dataset in order to ensure that the approaches using this dataset generalize well to new educational images, and potentially other domains. Furthermore, we present a baseline system using a standard deep neural architecture and discuss dealing with the challenge of limited training data.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Chunle Guo; Chongyi Li; Jichang Guo; Chen Change Loy; Junhui Hou; Sam Kwong; Runmin Cong

The paper presents a novel method, Zero-Reference Deep Curve Estimation (Zero-DCE), which formulates light enhancement as a task of image-specific curve estimation with a deep network. Our method trains a lightweight deep network, DCE-Net, to estimate pixel-wise and high-order curves for dynamic range adjustment of a given image. The curve estimation is specially designed, considering pixel value range, monotonicity, and differentiability. Zero-DCE is appealing in its relaxed assumption on reference images, i.e., it does not require any paired or unpaired data during training. This is achieved through a set of carefully formulated non-reference loss functions, which implicitly measure the enhancement quality and drive the learning of the network. Our method is efficient as image enhancement can be achieved by an intuitive and simple nonlinear curve mapping. Despite its simplicity, we show that it generalizes well to diverse lighting conditions. Extensive experiments on various benchmarks demonstrate the advantages of our method over state-of-the-art methods qualitatively and quantitatively. Furthermore, the potential benefits of our Zero-DCE to face detection in the dark are discussed. Code and model will be available at https://github.com/Li-Chongyi/Zero-DCE.
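The abstract lists three design constraints for the curve (pixel value range, monotonicity, differentiability) without giving its form. As a minimal sketch, assume the quadratic curve x + a·x·(1 − x) with a per-step parameter a in [−1, 1] and pixel values normalized to [0, 1]; applying it repeatedly gives a higher-order curve. The function name `enhance` and the scalar (rather than pixel-wise) parameters are illustrative assumptions, not the released implementation.

```python
def enhance(x, alphas):
    """Apply an assumed quadratic adjustment curve once per value in alphas.

    x      : pixel intensity normalized to [0, 1]
    alphas : curve parameters, each assumed to lie in [-1, 1]
    """
    for a in alphas:
        # For |a| <= 1 the derivative 1 + a * (1 - 2x) is non-negative on
        # [0, 1], so each step is monotonic and keeps 0 and 1 fixed.
        x = x + a * x * (1.0 - x)
    return x


# Brighten a dark pixel with two iterations of the same curve.
print(enhance(0.2, [0.8, 0.8]))
```

Positive parameters brighten mid-tones and negative ones darken them, while the end points 0 and 1 stay fixed, which is why the output stays in the valid pixel range under these assumptions.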

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Junjie Yan; Ruosi Wan; Xiangyu Zhang; Wei Zhang; Yichen Wei; Jian Sun

Batch Normalization (BN) is one of the most widely used techniques in the deep learning field, but its performance can degrade severely when the batch size is insufficient. This weakness limits the usage of BN on many computer vision tasks like detection or segmentation, where batch size is usually small due to the constraint of memory consumption. Therefore, many modified normalization techniques have been proposed, which either fail to restore the performance of BN completely or have to introduce additional nonlinear operations in the inference procedure and greatly increase resource consumption. In this paper, we reveal that there are two extra batch statistics involved in the backward propagation of BN, which have never been well discussed before. The extra batch statistics associated with gradients can also severely affect the training of deep neural networks. Based on our analysis, we propose a novel normalization method, named Moving Average Batch Normalization (MABN). MABN can completely restore the performance of vanilla BN in small batch cases, without introducing any additional nonlinear operations in the inference procedure. We prove the benefits of MABN by both theoretical analysis and experiments. Our experiments demonstrate the effectiveness of MABN in multiple computer vision tasks including ImageNet and COCO. The code has been released at https://github.com/megvii-model/MABN.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Javier Civera; Seong Hun Lee

The emergence of modern RGB-D sensors has had a significant impact on many application fields, including robotics, augmented reality (AR) and 3D scanning. They are low-cost, low-power and compact alternatives to traditional range sensors such as LiDAR. Moreover, unlike RGB cameras, RGB-D sensors provide additional depth information that removes the need for frame-by-frame triangulation in 3D scene reconstruction. These merits have made them very popular in mobile robotics and AR, where it is of great interest to estimate ego-motion and 3D scene structure. Such spatial understanding can enable robots to navigate autonomously without collisions and allow users to insert virtual entities consistent with the image stream. In this chapter, we review common formulations of odometry and Simultaneous Localization and Mapping (known by its acronym SLAM) using RGB-D stream input. The two topics are closely related, as the former aims to track the incremental camera motion with respect to a local map of the scene, and the latter to jointly estimate the camera trajectory and the global map with consistency. In both cases, the standard approaches minimize a cost function using nonlinear optimization techniques. This chapter consists of three main parts. In the first part, we introduce the basic concept of odometry and SLAM and motivate the use of RGB-D sensors. We also give mathematical preliminaries relevant to most odometry and SLAM algorithms. In the second part, we detail the three main components of SLAM systems: camera pose tracking, scene mapping and loop closing. For each component, we describe different approaches proposed in the literature. In the final part, we provide a brief discussion of advanced research topics with references to the state of the art.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Zhu Zhang; Zhou Zhao; Yang Zhao; Qi Wang; Huasheng Liu; Lianli Gao

In this paper, we consider a novel task, Spatio-Temporal Video Grounding for Multi-Form Sentences (STVG). Given an untrimmed video and a declarative/interrogative sentence depicting an object, STVG aims to localize the spatio-temporal tube of the queried object. STVG has two challenging settings: (1) we need to localize spatio-temporal object tubes from untrimmed videos, where the object may only exist in a very small segment of the video; (2) we deal with multi-form sentences, including declarative sentences with explicit objects and interrogative sentences with unknown objects. Existing methods cannot tackle the STVG task due to ineffective tube pre-generation and the lack of object relationship modeling. We therefore propose a novel Spatio-Temporal Graph Reasoning Network (STGRN) for this task. First, we build a spatio-temporal region graph to capture the region relationships with temporal object dynamics, which involves the implicit and explicit spatial subgraphs in each frame and the temporal dynamic subgraph across frames. We then incorporate textual clues into the graph and develop multi-step cross-modal graph reasoning. Next, we introduce a spatio-temporal localizer with a dynamic selection method to directly retrieve the spatio-temporal tubes without tube pre-generation. Moreover, we contribute a large-scale video grounding dataset, VidSTG, based on the video relation dataset VidOR. Extensive experiments demonstrate the effectiveness of our method.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Chandrakanth Jayachandran Preetha; Jonathan Kloss; Fabian Siegfried Wehrtmann; Lalith Sharan; Carolyn Fan; Beat Peter Müller-Stich; Felix Nickel; Sandy Engelhardt

Minimally Invasive Surgery (MIS) techniques have gained rapid popularity among surgeons since they offer significant clinical benefits including reduced recovery time and diminished post-operative adverse effects. However, conventional endoscopic systems output monocular video, which compromises depth perception, spatial orientation and field of view. Suturing is one of the most complex tasks performed under these circumstances. A key component of this task is the interplay between the needle holder and the surgical needle. Reliable 3D localization of needle and instruments in real time could be used to augment the scene with additional parameters that describe their quantitative geometric relation, e.g. the relation between the estimated needle plane and its rotation center and the instrument. This could contribute towards standardization and training of basic skills and operative techniques, enhance overall surgical performance, and reduce the risk of complications. The paper proposes an Augmented Reality environment with quantitative and qualitative visual representations to enhance laparoscopic training outcomes performed on a silicone pad. This is enabled by a multi-task supervised deep neural network which performs multi-class segmentation and depth map prediction. The scarcity of labels has been overcome by creating a virtual environment which resembles the surgical training scenario to generate dense depth maps and segmentation maps. The proposed convolutional neural network was tested on real surgical training scenarios and was shown to be robust to occlusion of the needle. The network achieves a Dice score of 0.67 for surgical needle segmentation, 0.81 for needle holder instrument segmentation and a mean absolute error of 6.5 mm for depth estimation.
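For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how a Dice score and a mean absolute depth error are commonly computed. The function names, the flat-list mask representation, and the convention for two empty masks are assumptions made for illustration; the abstract does not describe the authors' evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Count pixels that are foreground in both masks.
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def mean_absolute_error(pred_depth, true_depth):
    """Mean absolute depth error, in the same unit as the inputs (e.g. mm)."""
    if len(pred_depth) != len(true_depth):
        raise ValueError("depth maps must have the same number of pixels")
    return sum(abs(p - t) for p, t in zip(pred_depth, true_depth)) / len(true_depth)


pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
print(dice_score(pred_mask, true_mask))
print(mean_absolute_error([10.0, 20.0], [12.0, 17.0]))
```

A Dice score of 1 means the predicted and reference masks coincide, and 0 means they do not overlap at all.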

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-19
Simon Vandenhende; Stamatios Georgoulis; Luc Van Gool

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Ramprasaath R. Selvaraju; Purva Tendulkar; Devi Parikh; Eric Horvitz; Marco Ribeiro; Besmira Nushi; Ece Kamar

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Saad Bin Sami; Abdul Muqeet; Humera Tariq

Images captured in hazy weather conditions often suffer from degraded color contrast and color fidelity. This degradation is modeled by a transmission map, which represents the amount of attenuation, and airlight, which represents the color of the additive noise. In this paper, we propose a method to estimate the transmission map using haze levels instead of airlight color, since there are some ambiguities in the estimation of airlight. Qualitative and quantitative results show that the proposed method is competitive. In addition, we propose two metrics based on statistics of natural outdoor images for the assessment of haze removal algorithms.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Chao Zhang; Xuequan Lu; Katsuya Hotta; Xi Yang

In this paper we attempt to address the problem of geometric multi-model fitting by resorting to a few weakly annotated (WA) data points, an approach that has been sparsely studied so far. In weak annotation, most of the manual annotations are supposed to be correct yet inevitably mixed with incorrect ones. The WA data can be naturally obtained in an interactive way for specific tasks; for example, in the case of homography estimation, one can easily annotate points on the same plane/object with a single label by observing the image. Motivated by this, we propose a novel method to make full use of the WA data to boost the multi-model fitting performance. Specifically, a graph for model proposal sampling is first constructed using the WA data, given the prior that WA data annotated with the same weak label have a high probability of being assigned to the same model. By incorporating this prior knowledge into the calculation of edge probabilities, vertices (i.e., data points) that lie on/near the latent model are likely to connect together and further form a subset/cluster for effective proposal generation. With the proposals generated, $\alpha$-expansion is adopted for labeling, and our method in return updates the proposals. This works in an iterative way. Extensive experiments validate our method and show that the proposed method produces noticeably better results than state-of-the-art techniques in most cases.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Subhayan Mukherjee; Ram Mohana Reddy Guddeti

In this paper, we propose a novel method for stereo disparity estimation that combines the existing methods of block-based and region-based stereo matching. Our method can generate dense disparity maps from disparity measurements of only 18% of the pixels of either the left or the right image of a stereo image pair. It works by segmenting the lightness values of image pixels using a fast implementation of K-Means clustering. It then refines those segment boundaries by morphological filtering and connected components analysis, thus removing many redundant boundary pixels. This is followed by determining the boundaries' disparities with the SAD cost function. Lastly, we reconstruct the entire disparity map of the scene from the boundaries' disparities through disparity propagation along the scan lines and disparity prediction for regions of uncertainty by considering the disparities of the neighboring regions. Experimental results on the Middlebury stereo vision dataset demonstrate that the proposed method outperforms traditional disparity determination methods like SAD and NCC by up to 30% and achieves an improvement of 2.6% when compared to a recent approach based on the absolute difference (AD) cost function for disparity calculations [1].
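The SAD (sum of absolute differences) cost mentioned above can be illustrated on a single scan line. This is a generic sketch of SAD block matching, not the authors' boundary-only pipeline; the function names, the window half-width `half`, and the toy intensity values are assumptions made for illustration.

```python
def sad(left, right, x, d, half):
    """Sum of absolute differences between the window centred at x in the left
    scan line and the window centred at x - d in the right scan line."""
    return sum(abs(left[x + k] - right[x - d + k]) for k in range(-half, half + 1))


def best_disparity(left, right, x, max_disp, half=1):
    """Return the disparity in [0, max_disp] with the lowest SAD cost at x.

    Assumes x + half is a valid index; disparities whose window would run off
    the left edge of the right scan line are skipped.
    """
    candidates = [d for d in range(max_disp + 1) if x - d - half >= 0]
    return min(candidates, key=lambda d: sad(left, right, x, d, half))


# Toy scan lines: the left line is the right line shifted by two pixels.
right_line = [0, 0, 5, 9, 5, 0, 0, 0, 0, 0]
left_line = [0, 0, 0, 0, 5, 9, 5, 0, 0, 0]
print(best_disparity(left_line, right_line, x=5, max_disp=3))
```

In this toy case the cost is zero at a disparity of two pixels, which matches the shift used to build the left line.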

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Yu Dong; Yihao Liu; He Zhang; Shifeng Chen; Yu Qiao

Recently, convolutional neural networks (CNNs) have achieved great improvements in single image dehazing and attracted much attention in research. Most existing learning-based dehazing methods are not fully end-to-end and still follow the traditional dehazing procedure: first estimate the medium transmission and the atmospheric light, then recover the haze-free image based on the atmospheric scattering model. However, in practice, due to the lack of priors and constraints, it is hard to precisely estimate these intermediate parameters. Inaccurate estimation further degrades the performance of dehazing, resulting in artifacts, color distortion and insufficient haze removal. To address this, we propose a fully end-to-end Generative Adversarial Network with Fusion-discriminator (FD-GAN) for image dehazing. With the proposed Fusion-discriminator, which takes frequency information as additional priors, our model can generate more natural and realistic dehazed images with less color distortion and fewer artifacts. Moreover, we synthesize a large-scale training dataset including various indoor and outdoor hazy images to boost the performance, and we reveal that for learning-based dehazing methods, the performance is strictly influenced by the training data. Experiments show that our method reaches state-of-the-art performance on both public synthetic datasets and real-world images with more visually pleasing dehazed results.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Subhayan Mukherjee; Guan-Ming Su; Irene Cheng

High Dynamic Range (HDR) imaging is gaining increased attention due to its realistic content, not only for regular displays but also for smartphones. Until sufficient HDR content is distributed, HDR visualization still relies mostly on converting Standard Dynamic Range (SDR) content. SDR images are often quantized, or bit-depth reduced, before SDR-to-HDR conversion, e.g. for video transmission. Quantization can easily lead to banding artefacts. In some environments with limited computing and/or memory I/O, the traditional solution using spatial neighborhood information is not feasible. Our method includes noise generation (offline) and noise injection (online), and operates on pixels of the quantized image. We vary the magnitude and structure of the noise pattern adaptively based on the luma of the quantized pixel and the slope of the inverse tone mapping function. Subjective user evaluations confirm the superior performance of our technique.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Renoh Johnson Chalakkal; Faizal Hafiz; Waleed Abdulla; Akshya Swain

The present study proposes a new approach to automated screening of Clinically Significant Macular Edema (CSME) and addresses two major challenges associated with such screenings, i.e., exudate segmentation and imbalanced datasets. The proposed approach replaces the conventional exudate-segmentation-based feature extraction by combining a pre-trained deep neural network with meta-heuristic feature selection. A feature space over-sampling technique is used to overcome the effects of skewed datasets, and the screening is accomplished by a k-NN based classifier. The role of each data-processing step (e.g., class balancing, feature selection) and the effects of limiting the region of interest to the fovea on the classification performance are critically analyzed. Finally, the selection and implications of the operating point on the Receiver Operating Characteristic curve are discussed. The results of this study convincingly demonstrate that by following these fundamental practices of machine learning, a basic k-NN based classifier can effectively accomplish CSME screening.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Daniel J. Trosten; Michael C. Kampffmeyer; Robert Jenssen

In this paper we develop a new model for deep image clustering, using convolutional neural networks and tensor kernels. The proposed Deep Tensor Kernel Clustering (DTKC) consists of a convolutional neural network (CNN), which is trained to reflect a common cluster structure at the output of its intermediate layers. Encouraging a consistent cluster structure throughout the network has the potential to guide it towards meaningful clusters, even though these clusters might appear to be nonlinear in the input space. The cluster structure is enforced through the idea of unsupervised companion objectives, where separate loss functions are attached to layers in the network. These unsupervised companion objectives are constructed based on a proposed generalization of the Cauchy-Schwarz (CS) divergence, from vectors to tensors of arbitrary rank. Generalizing the CS divergence to tensor-valued data is a crucial step, due to the tensorial nature of the intermediate representations in the CNN. Several experiments are conducted to thoroughly assess the performance of the proposed DTKC model. The results indicate that the model outperforms, or performs comparably to, a wide range of baseline algorithms. We also empirically demonstrate that our model does not suffer from objective function mismatch, which can be a problematic artifact in autoencoder-based clustering models.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Adrien Kaiser; José Alonso Ybanez Zepeda; Tamy Boubekeur

We present a novel method to estimate the motion matrix between overlapping pairs of 3D views in the context of indoor scenes. We use the Manhattan world assumption to introduce lightweight geometric constraints in the form of planes into the problem, which reduces complexity by taking into account the structure of the scene. In particular, we define a stochastic framework to categorize planes as vertical or horizontal and parallel or non-parallel. We leverage this classification to match pairs of planes in overlapping views with point-of-view agnostic structural metrics. We propose to split the motion computation using the classification and to estimate the rotation and translation of the sensor separately, using a quadric minimizer. We validate our approach on a toy example and present quantitative experiments on a public RGB-D dataset, comparing against recent state-of-the-art methods. Our evaluation shows that planar constraints add only low computational overhead while improving precision when applied after a prior coarse estimate. We conclude by giving hints towards extensions and improvements of current results.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Moshiur R. Farazi; Salman H. Khan; Nick Barnes

Visual Question Answering (VQA) has emerged as a Visual Turing Test to validate the reasoning ability of AI agents. The pivot of existing VQA models is the joint embedding that is learned by combining the visual features from an image and the semantic features from a given question. Consequently, a large body of literature has focused on developing complex joint embedding strategies coupled with visual attention mechanisms to effectively capture the interplay between these two modalities. However, modelling the visual and semantic features in a high-dimensional (joint embedding) space is computationally expensive, and more complex models often result in trivial improvements in VQA accuracy. In this work, we systematically study the trade-off between model complexity and performance on the VQA task. VQA models have a diverse architecture comprising pre-processing, feature extraction, multimodal fusion, attention and final classification stages. We specifically focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline. Our thorough experimental evaluation leads us to two proposals, one optimized for minimal complexity and the other optimized for state-of-the-art VQA performance.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Zi-Qi Li; Jun Sun; Xiao-Jun Wu; He-Feng Yin

Representation-based classification methods have become a hot research topic during the past few years, and the two most prominent approaches are sparse representation based classification (SRC) and collaborative representation based classification (CRC). CRC reveals that it is the collaborative representation rather than the sparsity that makes SRC successful. Nevertheless, the dense representation of CRC may not be discriminative, which degrades its performance in classification tasks. To alleviate this problem to some extent, we propose a new method called sparse and collaborative-competitive representation based classification (SCCRC) for image classification. First, the coefficients of the test sample are obtained by SRC and CCRC, respectively. Then the fused coefficient is derived by multiplying the coefficients of SRC and CCRC. Finally, the test sample is assigned to the class that has the minimum residual. Experimental results on several benchmark databases demonstrate the efficacy of our proposed SCCRC. The source code of SCCRC is accessible at https://github.com/li-zi-qi/SCCRC.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Zhen-Liang Ni; Gui-Bin Bian; Guan-An Wang; Xiao-Hu Zhou; Zeng-Guang Hou; Xiao-Liang Xie; Zhen Li; Yu-Han Wang

Surgical instrument segmentation is extremely important for computer-assisted surgery. Unlike common object segmentation, it is more challenging due to the large illumination and scale variation caused by the special surgical scenes. In this paper, we propose a novel bilinear attention network with an adaptive receptive field to solve these two challenges. For the illumination variation, the bilinear attention module can capture second-order statistics to encode global contexts and semantic dependencies between local pixels. With them, semantic features in challenging areas can be inferred from their neighbors and the distinction between various semantics can be boosted. For the scale variation, our adaptive receptive field module aggregates multi-scale features and automatically fuses them with different weights. Specifically, it encodes the semantic relationship between channels to emphasize feature maps with appropriate scales, changing the receptive field of subsequent convolutions. The proposed network achieves the best performance, 97.47% mean IOU, on Cata7 and takes first place on EndoVis 2017, overtaking the second-ranked method by 10.10% IOU.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Clemens-Alexander Brust; Christoph Käding; Joachim Denzler

Large amounts of labeled training data are one of the main contributors to the great success that deep models have achieved in the past. Label acquisition for tasks other than benchmarks can pose a challenge due to requirements of both funding and expertise. By selecting unlabeled examples that are promising in terms of model improvement and only asking for respective labels, active learning can increase the efficiency of the labeling process in terms of time and cost. In this work, we describe combinations of an incremental learning scheme and methods of active learning. These allow for continuous exploration of newly observed unlabeled data. We describe selection criteria based on model uncertainty as well as expected model output change (EMOC). An object detection task is evaluated in a continuous exploration context on the PASCAL VOC dataset. We also validate a weakly supervised system based on active and incremental learning in a real-world biodiversity application where images from camera traps are analyzed. Labeling only 32 images by accepting or rejecting proposals generated by our method yields an increase in accuracy from 25.4% to 42.6%.
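To make the idea of uncertainty-based selection concrete, the sketch below ranks unlabeled examples by the entropy of the model's predicted class distribution and picks the most uncertain ones for labeling. Entropy is only one common uncertainty measure and is an assumption here; the abstract does not say which measure the authors use, and EMOC is not shown. The names `entropy`, `select_most_uncertain` and the example identifiers are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(predictions, budget):
    """Return the ids of the `budget` examples with the highest entropy.

    predictions : dict mapping an example id to its class probabilities
    """
    ranked = sorted(predictions, key=lambda name: entropy(predictions[name]), reverse=True)
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled camera-trap images.
predictions = {
    "img_a": [0.98, 0.01, 0.01],  # confident prediction, low entropy
    "img_b": [0.40, 0.30, 0.30],  # near-uniform prediction, high entropy
    "img_c": [0.70, 0.20, 0.10],
}
print(select_most_uncertain(predictions, budget=2))
```

In an incremental setting, the selected examples would be labeled, added to the training set, and the model updated before the next round of selection.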

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Tinghuai Wang; Guangming Wang; Kuan Eeik Tan; Donghui Tan

Convolutional neural networks (CNNs) have made significant advances in hyperspectral image (HSI) classification. However, standard convolutional kernels neglect the intrinsic connections between data points, resulting in poor region delineation and small spurious predictions. Furthermore, HSIs have a unique continuous data distribution along the high-dimensional spectrum domain, and much remains to be addressed in characterizing the spectral contexts given the prohibitively high dimensionality and in improving reasoning capability in light of the limited amount of labelled data. This paper presents a novel architecture which explicitly addresses these two issues. Specifically, we design an architecture to encode multiple kinds of spectral contextual information in the form of a spectral pyramid of multiple embedding spaces. In each spectral embedding space, we propose a graph attention mechanism to explicitly perform interpretable reasoning in the spatial domain based on the connections in the spectral feature space. Experiments on three HSI datasets demonstrate that the proposed architecture can significantly improve the classification accuracy compared with existing methods.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Wei Wang; Xiang-Gen Xia; Chuanjiang He; Zemin Ren; Jian Lu; Tianfu Wang; Baiying Lei

A CT image can be well reconstructed when the sampling rate of the sinogram satisfies the Nyquist criterion and the sampled signal is noise-free. However, in practice, the sinogram is usually contaminated by noise, which degrades the quality of the reconstructed CT image. In this paper, we design a deep network for sinogram and CT image reconstruction. The network consists of two cascaded blocks that are linked by a filtered backprojection (FBP) layer, where the former block is responsible for denoising and completing the sinograms while the latter is used to remove the noise and artifacts of the CT images. Experimental results show that the CT images reconstructed by our method have the highest PSNR and SSIM on average compared to state-of-the-art methods.
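As a reminder of how one of the reported quality metrics is usually defined, here is a minimal sketch of the peak signal-to-noise ratio (PSNR) between a reference image and a reconstruction. The flat-list image representation and the default peak value of 255 are assumptions for illustration; the abstract does not specify the intensity range used in the authors' evaluation, and SSIM is not shown.

```python
import math


def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images given as flat lists.

    peak is the maximum possible pixel value (assumed 255 for 8-bit images).
    """
    if len(reference) != len(reconstruction):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error between the two images.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


reference = [100.0, 100.0]
noisy = [110.0, 90.0]
print(psnr(reference, noisy, peak=100.0))
```

Higher PSNR indicates a reconstruction closer to the reference, and identical images give an infinite value under this definition.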

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Lucas Mansilla; Diego H. Milone; Enzo Ferrante

Deformable image registration is a fundamental problem in the field of medical image analysis. In recent years, we have witnessed the advent of deep learning-based image registration methods which achieve state-of-the-art performance and drastically reduce the required computational time. However, little work has been done on how we can encourage our models to produce results that are not only accurate but also anatomically plausible, which is still an open question in the field. In this work, we argue that incorporating anatomical priors in the form of global constraints into the learning process of these models will further improve their performance and boost the realism of the warped images after registration. We learn global non-linear representations of image anatomy using segmentation masks, and employ them to constrain the registration process. The proposed AC-RegNet architecture is evaluated in the context of chest X-ray image registration using three different datasets, where the high anatomical variability makes the task extremely challenging. Our experiments show that the proposed anatomically constrained registration model produces more realistic and accurate results than state-of-the-art methods, demonstrating the potential of this approach.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Yichao Zhou; Shaunak Mishra; Manisha Verma; Narayan Bhamidipati; Wei Wang

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Aman Gajendra Jain; Nicolas Saunier

We propose a method for automatic calibration of a traffic surveillance camera with wide-angle lenses. Video footage of a few minutes is sufficient for the entire calibration process to take place. The method takes the height of the camera above the ground plane as the only user input to overcome the scale ambiguity. The calibration is performed in two stages: (1) intrinsic calibration and (2) extrinsic calibration. Intrinsic calibration is achieved by assuming an equidistant fisheye distortion and an ideal camera model. Extrinsic calibration is accomplished by estimating the two vanishing points, on the ground plane, from the motion of vehicles at perpendicular intersections. The first stage, intrinsic calibration, is also valid for thermal cameras. Experiments have been conducted to demonstrate the effectiveness of this approach on visible as well as thermal cameras. Index Terms: fish-eye, calibration, thermal camera, intelligent transportation systems, vanishing points
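The equidistant fisheye model mentioned above maps a ray at angle θ from the optical axis to an image radius r = f·θ, where f is the focal length in pixels. The sketch below projects a 3D point in camera coordinates under that model. The function name, the parameter names, and the reading of "ideal camera" as square pixels with no skew and a known principal point (cx, cy) are assumptions for illustration, not the authors' calibration procedure.

```python
import math


def project_equidistant(point, focal, cx, cy):
    """Project a 3D camera-frame point with the equidistant model r = f * theta.

    point  : (x, y, z) in camera coordinates, z along the optical axis
    focal  : focal length in pixels
    cx, cy : principal point in pixels (assumed known)
    """
    x, y, z = point
    rho = math.hypot(x, y)  # distance of the point from the optical axis
    if rho == 0.0:
        return (cx, cy)  # a point on the axis maps to the principal point
    theta = math.atan2(rho, z)  # angle between the ray and the optical axis
    r = focal * theta  # equidistant mapping from angle to image radius
    return (cx + r * x / rho, cy + r * y / rho)


# A ray 45 degrees off-axis lands focal * pi / 4 pixels from the centre.
print(project_equidistant((1.0, 0.0, 1.0), focal=400.0, cx=640.0, cy=480.0))
```

Unlike the pinhole model, where r = f·tan θ grows without bound, this mapping stays finite at 90 degrees off-axis, which is why it suits wide-angle lenses.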

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Tsun-Yi Yang; Duy-Kien Nguyen; Huub Heijnen; Vassileios Balntas

In this paper, we explore how three related tasks, namely keypoint detection, description, and image retrieval, can be jointly tackled using a single unified framework, which is trained without the need for training data with point-to-point correspondences. By leveraging diverse information from sequential layers of a standard ResNet-based architecture, we are able to extract keypoints and descriptors that encode local information using generic techniques such as local activation norms, channel grouping and dropping, and self-distillation. Subsequently, global information for image retrieval is encoded in an end-to-end pipeline, based on pooling of the aforementioned local responses. In contrast to previous methods in local matching, our method does not depend on pointwise/pixelwise correspondences, and requires no such supervision at all, i.e., neither depth maps from an SfM model nor manually created synthetic affine transformations. We illustrate that this simple and direct paradigm is able to achieve very competitive results against the state-of-the-art methods in various challenging benchmark conditions such as viewpoint changes, scale changes, and day-night shifting localization.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Jane Wu; Yongxu Jin; Zhenglin Geng; Hui Zhou; Ronald Fedkiw

Regularization is used to avoid overfitting when training a neural network; unfortunately, this reduces the attainable level of detail hindering the ability to capture high-frequency information present in the training data. Even though various approaches may be used to re-introduce high-frequency detail, it typically does not match the training data and is often not time coherent. In the case of network inferred cloth, these sentiments manifest themselves via either a lack of detailed wrinkles or unnaturally appearing and/or time incoherent surrogate wrinkles. Thus, we propose a general strategy whereby high-frequency information is procedurally embedded into low-frequency data so that when the latter is smeared out by the network the former still retains its high-frequency detail. We illustrate this approach by learning texture coordinates which when smeared do not in turn smear out the high-frequency detail in the texture itself but merely smoothly distort it. Notably, we prescribe perturbed texture coordinates that are subsequently used to correct the over-smoothed appearance of inferred cloth, and correcting the appearance from multiple camera views naturally recovers lost geometric information.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-20
Yijie Zhang; Kevin de Haan; Yair Rivenson; Jingxi Li; Apostolos Delis; Aydogan Ozcan

Histological staining is a vital step used to diagnose various diseases and has been used for more than a century to provide contrast to tissue sections, rendering the tissue constituents visible for microscopic analysis by medical experts. However, this process is time-consuming, labor-intensive, expensive and destructive to the specimen. Recently, the ability to virtually-stain unlabeled tissue sections, entirely avoiding the histochemical staining step, has been demonstrated using tissue-stain specific deep neural networks. Here, we present a new deep learning-based framework which generates virtually-stained images using label-free tissue, where different stains are merged following a micro-structure map defined by the user. This approach uses a single deep neural network that receives two different sources of information at its input: (1) autofluorescence images of the label-free tissue sample, and (2) a digital staining matrix which represents the desired microscopic map of different stains to be virtually generated at the same tissue section. This digital staining matrix is also used to virtually blend existing stains, digitally synthesizing new histological stains. We trained and blindly tested this virtual-staining network using unlabeled kidney tissue sections to generate micro-structured combinations of Hematoxylin and Eosin (H&E), Jones silver stain, and Masson's Trichrome stain. Using a single network, this approach multiplexes virtual staining of label-free tissue with multiple types of stains and paves the way for synthesizing new digital histological stains that can be created on the same tissue cross-section, which is currently not feasible with standard histochemical staining methods.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-21
Gantugs Atarsaikhan; Brian Kenji Iwana; Seiichi Uchida

Designing fonts requires a great deal of time and effort. It requires professional skills, such as sketching, vectorizing, and image editing. Additionally, each letter has to be designed individually. In this paper, we will introduce a method to create fonts automatically. In our proposed method, the difference of font styles between two different fonts is found and transferred to another font using neural style transfer. Neural style transfer is a method of stylizing the contents of an image with the styles of another image. We proposed a novel neural style difference and content difference loss for the neural style transfer. With these losses, new fonts can be generated by adding or removing font styles from a font. We provided experimental results with various combinations of input fonts and discussed limitations and future development for the proposed method.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-21
Bahareh Behboodi; Mina Amiri; Rupert Brooks; Hassan Rivaz

Ultrasound (US) is one of the most commonly used imaging modalities in both diagnosis and surgical interventions due to its low cost, safety, and non-invasive nature. US image segmentation is currently a unique challenge because of the presence of speckle noise. As manual segmentation requires considerable effort and time, the development of automatic segmentation algorithms has attracted researchers' attention. Although recent methodologies based on convolutional neural networks have shown promising performance, their success relies on the availability of a large amount of training data, which is prohibitively difficult to obtain for many applications. Therefore, in this study we propose the use of simulated US images and natural images as auxiliary datasets in order to pre-train our segmentation network, and then to fine-tune with limited in vivo data. We show that with as few as 19 in vivo images, fine-tuning the pre-trained network improves the Dice score by 21% compared to training from scratch. We also demonstrate that if the same number of natural and simulated US images is available, pre-training on simulation data is preferable.
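For readers unfamiliar with the metric reported above, the Dice score between a predicted mask A and a reference mask B is 2|A∩B| / (|A| + |B|). The following is a minimal sketch of that standard definition on flat binary masks; the function name and the convention of returning 1.0 for two empty masks are assumptions made here, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for binary masks.

    pred and target are equal-length sequences of 0/1 (or False/True) values.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Sum of the foreground sizes of each mask.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    # One overlapping foreground pixel out of two in each mask.
    print(dice_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

The score ranges from 0 (no overlap) to 1 (identical masks).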

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-21
Ning Yuan; Xiao-Jun Wu; He-Feng Yin

The kernel function is introduced to solve the nonlinear pattern recognition problem. The advantage of a kernel method often depends critically on a proper choice of the kernel function. A promising approach is to learn the kernel from data automatically. The methods proposed over the past few years to learn the kernel have limitations, such as learning only the parameters of a prespecified kernel function. In this paper, nonlinear face verification via learning the kernel matrix is proposed. A new criterion is used in the new algorithm to avoid inverting the possibly singular within-class matrix, which is a computational problem. The experimental results obtained on the facial database XM2VTS using the Lausanne protocol show that the verification performance of the new method is superior to that of the primary method, Client Specific Kernel Discriminant Analysis (CSKDA). CSKDA needs many experiments to choose a proper kernel function, while the new method learns the kernel from data automatically, which saves considerable time and yields robust performance.

Updated: 2020-01-22
• arXiv.cs.CV Pub Date : 2020-01-21
Rajath S; Sumukh Aithal K; Natarajan Subramanyam

A concept of using Neural Ordinary Differential Equations (NODE) for transfer learning is introduced. In this paper we use EfficientNets to explore transfer learning on the CIFAR-10 dataset. We use NODE for fine-tuning our model. Using NODE for fine-tuning provides more stability during training and validation. These continuous-depth blocks also allow a trade-off between numerical precision and speed. Using Neural ODEs for transfer learning has resulted in much more stable convergence of the loss function.
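The precision/speed trade-off of continuous-depth blocks can be illustrated with a toy solver: a block defines dh/dt = f(t, h), and the output is obtained by numerical integration, so fewer solver steps run faster but approximate the solution less precisely. The sketch below uses a fixed-step Euler method on a scalar ODE purely as an assumption for illustration; the paper does not state which solver is used, and `euler_odeint` is a hypothetical helper, not a library function.

```python
import math


def euler_odeint(f, h0, t0, t1, steps):
    """Integrate dh/dt = f(t, h) from t0 to t1 with fixed-step Euler."""
    h = h0
    t = t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(t, h)  # one Euler update
        t += dt
    return h


if __name__ == "__main__":
    # Toy dynamics dh/dt = -h with h(0) = 1; the exact value at t = 1 is exp(-1).
    exact = math.exp(-1.0)
    for steps in (10, 100, 1000):
        approx = euler_odeint(lambda t, h: -h, 1.0, 0.0, 1.0, steps)
        print(steps, abs(approx - exact))
```

Each additional step costs one more evaluation of f, which in a Neural ODE is a forward pass through the block's network, so the step count (or solver tolerance) directly exchanges accuracy for compute.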

Updated: 2020-01-22
Contents have been reproduced by permission of the publishers.
