• Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-22
Hyunsung Park; Daijin Kim

This paper proposes a complementary regression network (CRN) that combines global and local regression methods to align faces. A global regression network (GRN) generates the coordinates of facial landmark points directly, so that all facial feature points are fitted to the input face as a whole, while a local regression network (LRN) generates a heatmap of facial landmark points, so that each channel localizes the detail of its facial landmark point well. The CRN converts the GRN's coordinates into another heatmap and combines it with the LRN's heatmap to obtain the final facial landmark points. The two networks work complementarily: the GRN's overall fitting tendency compensates for the LRN's poor alignment caused by missing local information, whereas the LRN's detailed representation compensates for the GRN's poor alignment caused by global misfitting. We conducted several experiments on the 300-W public dataset, the 300-W private dataset, and the Menpo dataset; the proposed CRN achieved state-of-the-art face alignment accuracy of 3.14%, 3.74%, and 1.996% in terms of normalized mean error, respectively.
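The coordinate-to-heatmap conversion at the heart of the CRN can be sketched as follows (a minimal illustration that renders each landmark as a Gaussian peak; the image size and `sigma` are arbitrary placeholders, not values from the paper):

```python
import numpy as np

def coords_to_heatmap(points, height, width, sigma=2.0):
    """Render each (x, y) landmark as a Gaussian peak in its own channel."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((len(points), height, width))
    for c, (px, py) in enumerate(points):
        heatmap[c] = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
    return heatmap

# The peak of each channel sits exactly at its landmark coordinate.
hm = coords_to_heatmap([(10, 20), (30, 5)], height=48, width=64)
```

A heatmap of this form can then be fused channel-wise with the LRN's heatmap before taking the per-channel argmax.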

Updated: 2020-01-22
• arXiv.cs.AR Pub Date : 2020-01-18
Poulami Das; Christopher A. Pattison; Srilatha Manne; Douglas Carmean; Krysta Svore; Moinuddin Qureshi; Nicolas Delfosse

Quantum computation promises significant computational advantages over classical computation for some problems. However, quantum hardware suffers from much higher error rates than classical hardware. As a result, extensive quantum error correction is required to execute a useful quantum algorithm. The decoder is a key component of the error correction scheme: its role is to identify errors faster than they accumulate in the quantum computer, and it must be implemented with minimal hardware resources in order to scale to the regime of practical applications. In this work, we consider surface code error correction, the most popular family of error correcting codes for quantum computing, and we design a decoder micro-architecture for the Union-Find decoding algorithm. We propose a three-stage fully pipelined hardware implementation that significantly speeds up the decoder. Then, we optimize the amount of decoding hardware required to perform error correction simultaneously over all the logical qubits of the quantum computer. By sharing resources between logical qubits, we reduce the number of hardware units by 67% and the memory capacity by 70%. Moreover, we reduce the bandwidth required for the decoding process by a factor of at least 30x using low-overhead compression algorithms. Finally, we provide numerical evidence that our optimized micro-architecture can be executed fast enough to correct errors in a quantum computer.
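The Union-Find primitive that the decoder micro-architecture pipelines can be sketched in software (a textbook disjoint-set forest with path compression and union by rank; this is not the hardware design from the paper, only the underlying data structure):

```python
class UnionFind:
    """Disjoint-set forest with path compression and union by rank."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shallower tree under the deeper one
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

# Error clusters grow by merging neighbouring syndrome nodes.
uf = UnionFind(6)
uf.union(0, 1); uf.union(1, 2)
```

The near-constant amortized cost of `find`/`union` is what makes the algorithm attractive for a tightly resource-budgeted hardware pipeline.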

Updated: 2020-01-22
• arXiv.cs.AR Pub Date : 2020-01-20
Javier Picorel; Seyed Alireza Sanaee Kohroudi; Zi Yan; Abhishek Bhattacharjee; Babak Falsafi; Djordje Jevdjic

Virtual memory (VM) is critical to the usability and programmability of hardware accelerators. Unfortunately, implementing accelerator VM efficiently is challenging because the area and power constraints make it difficult to employ the large multi-level TLBs used in general-purpose CPUs. Recent research proposals advocate a number of restrictions on virtual-to-physical address mappings in order to reduce the TLB size or increase its reach. However, such restrictions are unattractive because they forgo many of the original benefits of traditional VM, such as demand paging and copy-on-write. We propose SPARTA, a divide-and-conquer approach to address translation. SPARTA splits the address translation into accelerator-side and memory-side parts. The accelerator-side translation hardware consists of a tiny TLB covering only the accelerator's cache hierarchy (if any), while the translation for main memory accesses is performed by shared memory-side TLBs. Performing the translation for memory accesses on the memory side allows SPARTA to overlap data fetch with translation, and avoids the replication of TLB entries for data shared among accelerators. To further improve the performance and efficiency of the memory-side translation, SPARTA logically partitions the memory space, delegating translation to small and efficient per-partition translation hardware. Our evaluation on index-traversal accelerators shows that SPARTA virtually eliminates translation overhead, reducing it by over 30x on average (up to 47x) and improving performance by 57%. At the same time, SPARTA requires minimal accelerator-side translation hardware, reduces the total number of TLB entries in the system, gracefully scales with memory size, and preserves all key VM functionalities.

Updated: 2020-01-22
• arXiv.cs.AR Pub Date : 2020-01-21
Youren Shen; Hongliang Tian; Yu Chen; Kang Chen; Runji Wang; Yi Xu; Yubin Xia

Intel Software Guard Extensions (SGX) enables user-level code to create private memory regions called enclaves, whose code and data are protected by the CPU from software and hardware attacks outside the enclaves. Recent work introduces library operating systems (LibOSes) to SGX so that legacy applications can run inside enclaves with few or even no modifications. As virtually any non-trivial application demands multiple processes, it is essential for LibOSes to support multitasking. However, none of the existing SGX LibOSes support multitasking both securely and efficiently. This paper presents Occlum, a system that enables secure and efficient multitasking on SGX. We implement the LibOS processes as SFI-Isolated Processes (SIPs). SFI is a software instrumentation technique for sandboxing untrusted modules (called domains). We design a novel SFI scheme named MPX-based, Multi-Domain SFI (MMDSFI) and leverage MMDSFI to enforce the isolation of SIPs. We also design an independent verifier to ensure the security guarantees of MMDSFI. With SIPs safely sharing the single address space of an enclave, the LibOS can implement multitasking efficiently. The Occlum LibOS outperforms the state-of-the-art SGX LibOS on multitasking-heavy workloads by up to 6,600X on micro-benchmarks and up to 500X on application benchmarks.

Updated: 2020-01-22
• Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-19

This paper presents a Riemannian approach for free-space extraction and path planning using color catadioptric vision. The problem is formulated by considering color catadioptric images as Riemannian manifolds and solved using the Riemannian Eikonal equation with an anisotropic fast marching numerical scheme. This formulation allows the integration of adapted color and spatial metrics in an incremental process. First, the traversable ground (namely free-space) is delimited using a color structure tensor built on the multi-dimensional components of the catadioptric image. Then, the Eikonal equation is solved in the image plane incorporating a generic metric tensor for central catadioptric systems. The resulting Riemannian metric copes with the geometric distortions introduced in the catadioptric image plane by the curved mirror, in order to compute the geodesic distance map and the shortest path between image points. We present comparative results using Euclidean and Riemannian distance transforms and show the effectiveness of the Riemannian approach in producing the safest path.
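As a rough software analogue of solving the Eikonal equation on a discretized surface, a geodesic distance map can be computed with Dijkstra's algorithm over a grid of local traversal costs (an isotropic 4-connected simplification; the paper uses an anisotropic fast marching scheme on the catadioptric image plane):

```python
import heapq

def geodesic_distance(cost, start):
    """Shortest-path distance from `start` over a grid of per-cell entry costs."""
    h, w = len(cost), len(cost[0])
    dist = [[float("inf")] * w for _ in range(h)]
    dist[start[0]][start[1]] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if d > dist[y][x]:
            continue  # stale heap entry
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + cost[ny][nx]  # cost of entering the neighbour cell
                if nd < dist[ny][nx]:
                    dist[ny][nx] = nd
                    heapq.heappush(heap, (nd, (ny, nx)))
    return dist

# A high-cost cell (obstacle) bends the geodesic around it.
grid = [[1, 1, 1], [1, 100, 1], [1, 1, 1]]
d = geodesic_distance(grid, (0, 0))
```

Backtracking from a goal cell along decreasing distance values then yields the shortest path, which is the discrete counterpart of following geodesics of the distance map.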

Updated: 2020-01-21
• arXiv.cs.AR Pub Date : 2019-01-27
Di Gao; Dayane Reis; Xiaobo Sharon Hu; Cheng Zhuo

Computing-in-Memory (CiM) architectures aim to reduce costly data transfers by performing arithmetic and logic operations in memory, and hence relieve the pressure due to the memory wall. However, determining whether a given workload can really benefit from CiM, and which memory hierarchy and device technology a CiM architecture should adopt, requires an in-depth study that is not only time-consuming but also demands significant expertise in architectures and compilers. This paper presents an energy evaluation framework, Eva-CiM, for systems based on CiM architectures. Eva-CiM encompasses a comprehensive multi-level (device to architecture) tool chain by leveraging existing modeling and simulation tools such as GEM5, McPAT [2] and DESTINY [3]. To support high-confidence prediction, rapid design space exploration and ease of use, Eva-CiM introduces several novel modeling/analysis approaches, including models for capturing memory access and dependency-aware ISA traces, and for quantifying interactions between the host CPU and CiM modules. Eva-CiM can readily produce energy estimates of the entire system for a given program, processor architecture, and CiM array and technology specification. Eva-CiM is validated by comparison with DESTINY [3] and [4], and enables findings including the practical contributions of CiM-supported accesses, CiM-sensitive benchmarking, and the pros and cons of increased memory size for CiM. Eva-CiM also enables exploration over different configurations and device technologies, showing 1.3-6.0X energy improvement for SRAM and 2.0-7.9X for FeFET-RAM.

Updated: 2020-01-16
• Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-15
Ronghua Luo; Huailin Huang; WeiZeng Wu

Convolutional Neural Networks (CNNs) with encoder-decoder architecture have shown powerful ability in semantic segmentation and have also been applied to saliency detection. In most studies, the parameters of the backbone network, pre-trained on the ImageNet dataset, are retrained on the new training dataset so that the CNN adapts better to the new task. However, retraining weakens the generalization of the pre-trained backbone network and results in over-fitting, especially when the new training set is not very large. To balance generalization and precision, and to further improve the performance of encoder-decoder CNNs in salient object detection, we propose a framework with an enhanced backbone network (BENet). An encoder structured as dual backbone networks (DBNs) is adopted in BENet to extract more diverse feature maps. In addition, BENet includes a connection module based on an improved Res2Net to efficiently fuse feature maps from the two backbone networks, and a decoder based on a weighted multi-scale feedback module (WMFM) to perform synchronous learning. Our approach is extensively evaluated on six public datasets, and experimental results show significant and consistent improvements over state-of-the-art methods without any additional supervision.

Updated: 2020-01-15
• Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-15
Jingyi Lv; Zhiyong Li; Ke Nai; Ying Chen; Jin Yuan

In the person re-identification (re-ID) community, pedestrians often exhibit large appearance changes, and many similar-looking persons exist, which degrades accuracy. Re-ranking is an effective way to address these problems, and this paper proposes an expanded neighborhoods distance (END) to re-rank re-ID results. We assume that if two persons in different images are the same, their initial ranking lists and two-level neighborhoods will be very similar when each is taken as the query. Following this similarity principle, our method selects expanded neighborhoods in the initial ranking list to calculate the END distance. The final distance is computed as a combination of the END distance and the Jaccard distance. Experiments on the Market-1501, DukeMTMC-reID and CUHK03 datasets confirm the effectiveness of the proposed re-ranking method. Compared with the re-ID baseline, it increases mAP by 14.2% on Market-1501 and Rank-1 by 12.9% on DukeMTMC-reID.
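The final-distance combination can be sketched on toy neighborhood data (a hedged illustration: both terms are modelled here as set-based Jaccard distances over neighbor lists, and the mixing weight `lam` is a placeholder; the paper's END and Jaccard terms are defined over ranking lists):

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over two neighbor sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def final_distance(query_nbrs, gallery_nbrs, expanded_q, expanded_g, lam=0.5):
    """Weighted combination of a Jaccard term and an expanded-neighborhood term."""
    d_jac = jaccard_distance(query_nbrs, gallery_nbrs)
    d_end = jaccard_distance(expanded_q, expanded_g)  # stand-in for the END term
    return lam * d_jac + (1.0 - lam) * d_end

# Identical neighborhoods give distance 0; disjoint ones give distance 1.
d_same = final_distance([1, 2, 3], [1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4])
d_diff = final_distance([1, 2], [3, 4], [1, 2], [3, 4])
```

Re-ranking then simply sorts the gallery by this combined distance instead of the original appearance distance.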

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2020-01-13
Paul Whatmough; Marco Donato; Glenn Ko; David Brooks; Gu-Yeon Wei

The current trend for domain-specific architectures (DSAs) has led to renewed interest in research test chips to demonstrate new specialized hardware. Tape-outs also offer huge pedagogical value garnered from real hands-on exposure to the whole system stack. However, successful tape-outs demand hard-earned experience, and the design process is time-consuming and fraught with challenges. Therefore, custom chips have remained the preserve of a small number of research groups, typically focused on circuit design research. This paper describes the CHIPKIT framework. We describe a reusable SoC subsystem which provides basic IO, an on-chip programmable host, memory and peripherals. This subsystem can be readily extended with new IP blocks to generate custom test chips. We also present an agile RTL development flow, including a code generation tool called VGEN. Finally, we outline best practices for full-chip validation across the entire design cycle.

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2020-01-13
Yann Kurzo; Andreas Toftegaard Kristensen; Andreas Burg; Alexios Balatsoukas-Stimming

In-band full-duplex systems can transmit and receive information simultaneously on the same frequency band. However, due to the strong self-interference caused by the transmitter to its own receiver, the use of non-linear digital self-interference cancellation is essential. In this work, we describe a hardware architecture for a neural network-based non-linear self-interference (SI) canceller and we compare it with our own hardware implementation of a conventional polynomial-based SI canceller. In particular, we present implementation results for a shallow and a deep neural network SI canceller as well as for a polynomial SI canceller. Our results show that the deep neural network canceller achieves a hardware efficiency of up to $312.8$ Msamples/s/mm$^2$ and an energy efficiency of up to $0.9$ nJ/sample, which is $2.1\times$ and $2\times$ better than the polynomial SI canceller, respectively. These results show that NN-based methods applied to communications are not only useful from a performance perspective, but can also be a very effective means to reduce the implementation complexity.
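The conventional polynomial canceller used as the baseline can be sketched with a memoryless odd-order basis fitted by least squares (a simplification assuming basis terms of the form x·|x|^{2k} and synthetic data; practical SI cancellers also model memory effects across past samples):

```python
import numpy as np

def poly_basis(x, order=7):
    """Odd-order nonlinear basis x * |x|^(2k), a common SI model structure."""
    return np.stack([x * np.abs(x) ** (2 * k) for k in range((order + 1) // 2)], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)               # transmitted baseband samples
si = 0.8 * x + 0.1 * x * np.abs(x) ** 2     # synthetic nonlinear self-interference
B = poly_basis(x)
coeffs, *_ = np.linalg.lstsq(B, si, rcond=None)
residual = si - B @ coeffs                  # what remains after cancellation
```

Because the synthetic SI lies exactly in the span of the basis, the least-squares fit cancels it almost perfectly; the hardware comparison in the paper is about implementing such a model (versus a small NN) at high sample rates.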

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2020-01-14
Jesus Rodriguez Sanchez; Ove Edfors; Fredrik Rusek; Liang Liu

The Large Intelligent Surface (LIS) concept has emerged recently as a new paradigm for wireless communication, remote sensing and positioning. Despite its potential, many challenges remain from an implementation point of view, with the interconnection data-rate and computational complexity being the most relevant. Distributed processing techniques and hierarchical architectures are expected to play a vital role in addressing them. In this paper we perform algorithm-architecture co-design and analyze the hardware requirements and architecture trade-offs for a discrete LIS performing uplink detection. In doing so, we aim to provide concrete case studies and guidelines for the efficient implementation of LIS systems.

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2020-01-14

The success of deep learning has brought forth a wave of interest in computer hardware design to better meet the high demands of neural network inference. In particular, analog computing hardware, based on electronic, optical or photonic devices, has been heavily motivated specifically for accelerating neural networks, as it may well achieve lower power consumption than conventional digital electronics. However, these proposed analog accelerators suffer from the intrinsic noise generated by their physical components, which makes it challenging to achieve high accuracy on deep neural networks. Hence, for successful deployment on analog accelerators, it is essential to be able to train deep neural networks to be robust to random continuous noise in the network weights, which is a somewhat new challenge in machine learning. In this paper, we advance the understanding of noisy neural networks. We show how a noisy neural network has reduced learning capacity as a result of loss of mutual information between its input and output. To combat this, we propose using knowledge distillation combined with noise injection during training to achieve more noise-robust networks, which we demonstrate experimentally across different networks and datasets, including ImageNet. Our method achieves models with as much as two times greater noise tolerance compared with the previous best attempts, which is a significant step towards making analog hardware practical for deep learning.
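The training recipe of weight-noise injection combined with distillation can be sketched as a loss construction (a minimal numpy illustration on a single linear layer; the noise scale `sigma`, the mixing weight `alpha`, and the layer itself are placeholders, not the networks or hyperparameters from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def noisy_forward(w, x, sigma, rng):
    """Forward pass of a linear layer with Gaussian noise injected into the weights."""
    return softmax((w + sigma * rng.standard_normal(w.shape)) @ x)

def distill_loss(student_p, teacher_p, label, alpha=0.5):
    """Mix cross-entropy on the hard label with KL to the noise-free teacher."""
    ce = -np.log(student_p[label])
    kl = np.sum(teacher_p * np.log(teacher_p / student_p))
    return alpha * ce + (1 - alpha) * kl

rng = np.random.default_rng(0)
w = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
teacher = softmax(w @ x)                         # clean-weight teacher prediction
student = noisy_forward(w, x, sigma=0.1, rng=rng)  # noisy-weight student prediction
loss = distill_loss(student, teacher, label=0)
```

Averaging this loss over fresh noise draws each step pushes the weights toward regions where the output is insensitive to weight perturbations.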

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2019-12-11
Jan Moritz Joseph; Lennart Bamberg; Imad Hajjar; Anna Drewes; Behnam Razi Perjikolaei; Alberto García-Ortiz; Thilo Pionteck

We introduce ratatoskr, an open-source framework for in-depth power, performance and area (PPA) analysis of NoCs in 3D-integrated and heterogeneous Systems-on-Chip (SoCs). It covers all layers of abstraction by providing a NoC hardware implementation at RT level, a NoC simulator at cycle-accurate level, and an application model at transaction level. Through this comprehensive approach, ratatoskr provides the following specific PPA analyses: dynamic power of links can be measured within 2.4% accuracy of bit-level simulations while maintaining cycle-accurate simulation speed; router power is determined from RT-level synthesis combined with cycle-accurate simulations; the performance of the whole NoC can be measured via both cycle-accurate and RT-level simulations; the performance of individual routers is obtained from RT level, including gate-level verification; and the NoC area is calculated from RT level. Despite these manifold features, ratatoskr offers simple two-step user interaction: first, a single point of entry allows design parameters to be set; second, PPA reports are generated automatically. For both input and output, different levels of abstraction can be chosen for high-level rapid network analysis or low-level refinement of architectural details. The synthesized NoC model reduces total router power by up to 32% and router area by 3% in comparison to a conventional standard router. As a forward-thinking feature not found in other NoC PPA-measurement tools, ratatoskr supports heterogeneous 3D integration, one of the most promising integration paradigms for upcoming SoCs. Thereby, ratatoskr lays the groundwork for the design of their communication architectures.

Updated: 2020-01-15
• Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-14
Wenhui Zhou; Gaomin Liu; Jiangwei Shi; Hua Zhang; Guojun Dai

Light field imaging has recently become a promising technology for 3D rendering and display. However, capturing real-world light field images still faces many challenges in both quantity and quality. In this paper, we develop a learning-based technique to reconstruct a light field from a single 2D RGB image. It comprises three steps: unsupervised monocular depth estimation, view synthesis, and depth-guided view inpainting. We first propose a novel monocular depth estimation network to predict disparity maps of each sub-aperture view from the central view of the light field. Then we synthesize the initial sub-aperture views using a warping scheme. Considering that occlusion makes synthesis ambiguous for pixels invisible in the central view, we present a simple but effective fully convolutional network (FCN) for view inpainting. Note that the proposed network architecture is a general framework for light field reconstruction, which can be extended to take a sparse set of views as input without changing any structure or parameters of the network. Comparison experiments demonstrate that our method outperforms state-of-the-art light field reconstruction methods with single-view input, and achieves comparable results with multi-input methods.
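The warping step used to synthesize an initial sub-aperture view can be sketched for a single view offset (an idealized nearest-neighbor warp with a per-pixel disparity map; occlusion handling, which the paper's inpainting network addresses, is simply left as zeros here):

```python
import numpy as np

def warp_view(center, disparity, du, dv):
    """Sample the central view at positions shifted by disparity * view offset."""
    h, w = center.shape
    out = np.zeros_like(center)  # unfilled pixels stay 0 (holes for inpainting)
    for y in range(h):
        for x in range(w):
            sy = int(round(y + disparity[y, x] * dv))
            sx = int(round(x + disparity[y, x] * du))
            if 0 <= sy < h and 0 <= sx < w:
                out[y, x] = center[sy, sx]
    return out

# With constant disparity 1 and offset (du=1, dv=0) the image shifts by one pixel.
img = np.arange(16.0).reshape(4, 4)
shifted = warp_view(img, np.ones((4, 4)), du=1, dv=0)
```

Running this for every (du, dv) offset in the angular grid produces the full initial set of sub-aperture views.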

Updated: 2020-01-14
• arXiv.cs.AR Pub Date : 2020-01-11
Yongjune Kim; Yoocharn Jeon; Cyril Guyot; Yuval Cassuto

Magnetic random-access memory (MRAM) is a promising memory technology due to its high density, non-volatility, and high endurance. However, achieving high memory fidelity incurs significant write-energy costs, which should be reduced for large-scale deployment of MRAMs. In this paper, we formulate an optimization problem for maximizing the memory fidelity given energy constraints, and propose a biconvex optimization approach to solve it. The basic idea is to allocate non-uniform write pulses depending on the importance of each bit position. The fidelity measure we consider is minimum mean squared error (MSE), for which we propose an iterative water-filling algorithm. Although the iterative algorithm does not guarantee global optimality, we can choose a proper starting point that decreases the MSE exponentially and guarantees fast convergence. For an 8-bit accessed word, the proposed algorithm reduces the MSE by a factor of 21.
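The classical water-filling pattern behind the proposed allocation can be sketched as follows (the textbook power allocation solved by bisection on the water level; the paper's iterative algorithm allocates write-pulse energy per bit position under an MSE objective, which this only approximates):

```python
def water_fill(noise, budget, tol=1e-10):
    """Allocate `budget` across channels as p_i = max(0, mu - noise_i)."""
    lo, hi = 0.0, max(noise) + budget  # bracket the water level mu
    while hi - lo > tol:
        mu = (lo + hi) / 2
        used = sum(max(0.0, mu - n) for n in noise)
        if used > budget:
            hi = mu
        else:
            lo = mu
    return [max(0.0, lo - n) for n in noise]

# Cheaper (less noisy) channels receive more of the budget; the worst gets none.
alloc = water_fill([1.0, 2.0, 4.0], budget=3.0)
```

In the MRAM setting, the analogue of the "noise" term is the importance of each bit position, so more significant bits receive longer or stronger write pulses.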

Updated: 2020-01-14
• arXiv.cs.AR Pub Date : 2020-01-13
Sai Aparna Aketi; Smriti Gupta; Huimei Cheng; Joycee Mekie; Peter A. Beerel

The risk of soft errors due to radiation continues to be a significant challenge for engineers trying to build systems that can handle harsh environments. Building systems that are Radiation Hardened by Design (RHBD) is the preferred approach, but existing techniques are expensive in terms of performance, power, and/or area. This paper introduces a novel soft-error resilient asynchronous bundled-data design template, SERAD, which uses a combination of temporal and spatial redundancy to mitigate Single Event Transients (SETs) and Single Event Upsets (SEUs). SERAD uses Error Detecting Logic (EDL) to detect SETs at the inputs of sequential elements and correct them via re-sampling. Because SERAD only pays the delay penalty in the presence of an SET, which rarely occurs, its average performance is comparable to the baseline synchronous design. We tested the SERAD design using a combination of Spice and Verilog simulations and evaluated its impact on area, frequency, and power on an open-core MIPS-like processor using an NCSU 45 nm cell library. Our post-synthesis results show that the SERAD design consumes less than half the area of Triple Modular Redundancy (TMR), exhibits significantly less performance degradation than Glitch Filtering (GF), and consumes no more total power than the baseline unhardened design.
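For comparison, the spatial-redundancy baseline (TMR) against which SERAD is measured reduces to a bitwise 2-out-of-3 majority vote over three module copies (a behavioural sketch, not the gate-level circuit):

```python
def tmr_vote(a, b, c):
    """Bitwise 2-out-of-3 majority: a single corrupted copy is out-voted."""
    return (a & b) | (b & c) | (a & c)

# One module hit by an upset still yields the correct word.
good = 0b10110010
flipped = good ^ 0b00001000   # single bit-flip in one copy
result = tmr_vote(good, good, flipped)
```

The roughly 3x area of full triplication is exactly the cost SERAD avoids by detecting errors temporally and re-sampling only when an SET actually occurs.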

Updated: 2020-01-14
• Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-10
Hengcheng Fu; Yihong Zhang; Wuneng Zhou; Xiaofeng Wang; Huanlong Zhang

Single-object tracking is a significant and challenging computer vision problem. Recently, discriminative correlation filters (DCF) have shown excellent performance, but they suffer from a theoretical defect: the boundary effect, caused by the periodic assumption on training samples, greatly limits tracking performance. Spatially regularized DCF (SRDCF) introduces a spatial regularization term that penalizes filter coefficients depending on their spatial location, which improves tracking performance considerably. However, this simple regularization strategy imposes unequal penalties on the filter coefficients within the target area, which makes the filter learn a distorted object appearance model. In this paper, a novel spatial regularization strategy is proposed that utilizes a reliability map to approximate the target area and keeps the penalty coefficients of the relevant region consistent. In addition, we introduce a spatial variation regularization component, the second-order difference of the filter, which smooths changes in the filter coefficients to prevent the filter from over-fitting the current frame. Furthermore, an efficient optimization algorithm based on the alternating direction method of multipliers (ADMM) is developed. Comprehensive experiments are performed on three benchmark datasets: OTB-2013, OTB-2015 and TempleColor-128, and our algorithm achieves more favorable performance than several state-of-the-art methods. Compared with SRDCF, our approach obtains an absolute gain of 6.6% and 5.1% in mean distance precision on OTB-2013 and OTB-2015, respectively. Our approach runs in real-time on a CPU.

Updated: 2020-01-11
• Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-09
Cemil Zalluhoglu; Nazli Ikizler-Cinbis

Collective activity recognition is an important subtask of human action recognition, where the existing datasets are mostly limited. In this paper, we look into this issue and introduce the “Collective Sports (C-Sports)” dataset, which is a novel benchmark dataset for multi-task recognition of both collective activity and sports categories. Various state-of-the-art techniques are evaluated on this dataset, together with multi-task variants which demonstrate increased performance. From the experimental results, we can say that while sports categories of the videos are inferred accurately, there is still room for improvement for collective activity recognition, especially regarding the generalization ability beyond previously unseen sports categories. In order to evaluate this ability, we introduce a novel evaluation protocol called unseen sports, where the training and test are carried out on disjoint sets of sports categories. The relatively lower recognition performances in this evaluation protocol indicate that the recognition models tend to be influenced by the surrounding context, rather than focusing on the essence of the collective activities. We believe that C-Sports dataset will stir further interest in this research direction.

Updated: 2020-01-09
• arXiv.cs.AR Pub Date : 2019-08-19
Karthik Ganesan; Srinivasa Shashank Nuthakki

Existing techniques to ensure functional correctness and hardware trust during pre-silicon verification face severe limitations. In this work, we systematically leverage two key ideas: 1) Symbolic Quick Error Detection (Symbolic QED or SQED), a recent bug detection and localization technique using Bounded Model Checking (BMC); and 2) Symbolic starting states, to present a method that: i) Effectively detects both "difficult" logic bugs and Hardware Trojans, even with long activation sequences where traditional BMC techniques fail; and ii) Does not need skilled manual guidance for writing testbenches, writing design-specific assertions, or debugging spurious counter-examples. Using open-source RISC-V cores, we demonstrate the following: 1. Quick (<5 minutes for an in-order scalar core and <2.5 hours for an out-of-order superscalar core) detection of 100% of hundreds of logic bug and hardware Trojan scenarios from commercial chips and research literature, and 97.9% of "extremal" bugs (randomly-generated bugs requiring ~100,000 activation instructions taken from random test programs). 2. Quick (~1 minute) detection of several previously unknown bugs in open-source RISC-V designs.

Updated: 2020-01-09
• Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-08
Zubair Khan; Jie Yang

Recent advances in system resources provide ease in the applicability of deep learning approaches in computer vision. In this paper, we propose a deep learning-based unsupervised approach for natural image segmentation. Image segmentation aims to transform an image into regions representing the various objects in the image. Our method consists of fully convolutional dense network-based unsupervised deep representation-oriented clustering, followed by shallow-feature-based high-dimensional region merging to produce the final segmented image. We evaluate our proposed approach on the BSD300 database and compare it with several classical and some recent deep learning-based unsupervised segmentation methods. The experimental results show that the proposed method is competitive and confirm its efficacy.

Updated: 2020-01-08
• arXiv.cs.AR Pub Date : 2020-01-03

Stochastic unary computing provides low-area circuits. However, the area-consuming stochastic number generators (SNGs) required by these circuits can diminish their overall area gain, particularly if several SNGs are needed. We propose area-efficient SNGs by sharing the permuted output of one linear feedback shift register (LFSR) among several SNGs. With no hardware overhead, the proposed architecture generates stochastic bit streams with minimum stochastic computing correlation (SCC). Compared to the circular shifting approach presented in prior work, our approach produces stochastic bit streams with 67% less average SCC when a 10-bit LFSR is shared between two SNGs. To generalize our approach, we propose an algorithm to find a set of m permutations (n>m>2) with minimum pairwise SCC for an n-bit LFSR. The search space for finding permutations with the exact minimum SCC grows rapidly as n increases, and it is intractable to run a search algorithm using accurately calculated pairwise SCC values for n>9. We therefore propose a similarity function that the search algorithm can use to quickly find a set of permutations with SCC values close to the minimum. We evaluated our approach on several applications. The results show that, compared to prior work, it achieves lower MSE with the same (or even lower) area. Additionally, based on simulation results, we show that replacing the comparator component of an SNG circuit with a weighted binary generator can further reduce SCC.
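The idea of sharing one LFSR among several comparator-based SNGs via bit permutations can be sketched as follows (a hedged illustration: the 8-bit maximal-length tap set (8, 6, 5, 4) and the bit-reversal permutation are common textbook choices, not the permutations selected by the paper's search algorithm):

```python
def lfsr_states(n=8, taps=(8, 6, 5, 4), seed=1):
    """All states of a maximal-length Fibonacci LFSR (period 2^n - 1)."""
    state, states = seed, []
    for _ in range(2 ** n - 1):
        states.append(state)
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1   # XOR of the tap bits
        state = ((state << 1) | fb) & (2 ** n - 1)
    return states

def permute_bits(value, perm, n=8):
    """Reorder the n bits of `value`: output bit `dst` takes input bit perm[dst]."""
    return sum(((value >> src) & 1) << dst for dst, src in enumerate(perm))

def sng_stream(states, p, perm=None, n=8):
    """Comparator-based SNG: output 1 whenever the (permuted) state is below p."""
    if perm is not None:
        states = [permute_bits(s, perm, n) for s in states]
    return [1 if s < p else 0 for s in states]

states = lfsr_states()
a = sng_stream(states, p=64)                                  # direct bit order
b = sng_stream(states, p=64, perm=[7, 6, 5, 4, 3, 2, 1, 0])   # bit-reversed order
```

Because a bit permutation is a bijection on the nonzero states, both streams encode the same value (identical ones-density) while their bit patterns differ in time, which is what lowers the correlation between the two SNG outputs.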

Updated: 2020-01-08
• Image Vis. Comput. (IF 2.747) Pub Date : 2020-01-07
Junying Chen; Tongyao Bai

Both images and point clouds are beneficial for object detection in a visual navigation module for autonomous driving. The spatial relationships between different objects at different times in a bimodal space can vary significantly, and it is difficult to combine bimodal descriptions into a unified model that detects objects effectively within an acceptable amount of time. In addition, conventional voxelization methods resolve point clouds into voxels at a global level, and often overlook local attributes of the voxels. To address these problems, we propose a novel fusion-based deep framework named SAANet. SAANet utilizes a spatial adaptive alignment (SAA) module to align point cloud features and image features, by automatically discovering the complementary information between point clouds and images. Specifically, we transform the point clouds into 3D voxels, and introduce local orientation encoding to represent the point clouds. Then, we use a sparse convolutional neural network to learn a point cloud feature. Simultaneously, a ResNet-like 2D convolutional neural network is used to extract an image feature. Next, the point cloud feature and image feature are fused by our SAA block to derive a comprehensive feature. Then, the labels and 3D boxes for objects are learned using a multi-task learning network. Finally, an experimental evaluation on the KITTI benchmark demonstrates the advantages of our method in terms of average precision and inference time, as compared to previous state-of-the-art results for 3D object detection.
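The voxelization step that precedes feature learning can be sketched as a plain grid quantization (here with only a per-voxel point count as the feature; the paper's local orientation encoding is not reproduced):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Map each 3D point to an integer voxel index and count points per voxel."""
    idx = np.floor(points / voxel_size).astype(int)
    voxels = {}
    for key in map(tuple, idx):
        voxels[key] = voxels.get(key, 0) + 1
    return voxels

# Two nearby points fall into the same voxel; a distant one gets its own.
pts = np.array([[0.1, 0.2, 0.3], [0.15, 0.25, 0.35], [2.0, 2.0, 2.0]])
vox = voxelize(pts, voxel_size=0.5)
```

Sparse convolutions then operate only on the occupied voxel indices, which is what keeps inference time manageable for large scenes.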

Updated: 2020-01-07
• arXiv.cs.AR Pub Date : 2020-01-06
Mantas Mikaitis

General algorithms and a hardware accelerator for performing stochastic rounding (SR) are presented. The main goal is to augment the ARM M4F-based multi-core processor SpiNNaker 2 with more flexible rounding functionality than is available in the ARM processor itself. The motivation for adding such an accelerator in hardware is based on our previous results, which show improvements in the numerical accuracy of ODE solvers in fixed-point arithmetic with SR, compared to standard round-to-nearest or bit-truncation rounding modes. Furthermore, performing SR purely in software can be expensive, due to the requirement of a pseudo-random number generator (PRNG), multiple masking and shifting instructions, and an addition operation. Saturation of the rounded values is also included, since rounding is usually followed by saturation, which is especially important in fixed-point arithmetic due to its narrow dynamic range of representable values. The main intended use of the accelerator is to round fixed-point multiplier outputs, which the ARM processor returns unrounded in a wider fixed-point format than the arguments.
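The rounding mode the accelerator implements can be sketched in software (a generic fixed-point SR routine that rounds up with probability equal to the truncated fraction; the PRNG choice and bit widths are placeholders, not the hardware's):

```python
import random

def stochastic_round(value, frac_bits, rng):
    """Round `value` to a grid of 2^-frac_bits, unbiased in expectation."""
    scale = 1 << frac_bits
    scaled = value * scale
    floor_val = int(scaled // 1)
    frac = scaled - floor_val                 # probability of rounding up
    return (floor_val + (1 if rng.random() < frac else 0)) / scale

rng = random.Random(42)
# Exactly representable values are returned unchanged ...
exact = stochastic_round(0.25, frac_bits=2, rng=rng)
# ... while in-between values average out to the true value over many trials.
mean = sum(stochastic_round(0.3, 2, rng) for _ in range(10000)) / 10000
```

This zero-mean rounding error is why SR avoids the systematic drift that round-to-nearest or truncation can introduce in long-running fixed-point ODE solves.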

Updated: 2020-01-07
• arXiv.cs.AR Pub Date : 2019-07-11
Ming Ling; Jiancong Ge; Guangmin Wang

To mitigate the performance gap between the CPU and the main memory, multi-level cache architectures are widely used in modern processors. Therefore, modeling the behavior of the downstream caches becomes a critical part of processor performance evaluation in the early stage of Design Space Exploration (DSE). In this paper, we propose a fast and accurate L2 cache reuse distance histogram model, which can be used to predict the behavior of multi-level cache architectures in which the L1 cache uses the LRU replacement policy and the L2 cache uses the LRU or Random replacement policy. As inputs, we use the profiled L1 reuse distance histogram and two newly proposed metrics, the RST table and the Hit-RDH, which describe more detailed information about the software traces. For a given L1 cache configuration, the profiling results can be reused for different configurations of the L2 cache. The output of our model is the L2 cache reuse distance histogram, from which the L2 cache miss rates can be evaluated. We compare the L2 cache miss rates with results from gem5 cycle-accurate simulations of 15 benchmarks chosen from SPEC CPU 2006 and 9 benchmarks from SPEC CPU 2017. The average absolute error is less than 5%, while the evaluation of each L2 configuration is sped up almost 30X for four L2 cache candidates.
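
The final evaluation step, from a reuse distance histogram to a miss rate, follows the classical stack-distance rule: an access hits a fully-associative LRU cache of N lines iff its stack distance is below N. A minimal sketch of that rule (not the paper's RST-table/Hit-RDH model), with cold misses encoded as distance -1:

```python
def lru_miss_rate(rd_histogram, cache_lines):
    """Estimate the miss rate of a fully-associative LRU cache from a
    reuse (stack) distance histogram {distance: count}. An access hits
    iff its distance d satisfies 0 <= d < cache_lines; cold misses are
    encoded with distance -1 (effectively infinite)."""
    hits = sum(c for d, c in rd_histogram.items() if 0 <= d < cache_lines)
    total = sum(rd_histogram.values())
    return 1.0 - hits / total
```

Because the histogram is independent of the cache size, one profiling pass can be evaluated against many candidate cache configurations, which is the reuse property the paper exploits for the L2.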

Updated: 2020-01-07
• arXiv.cs.AR Pub Date : 2020-01-03
Karthikeyan Nagarajan; Asmit De; Mohammad Nasim Imtiaz Khan; Swaroop Ghosh

In this paper, we investigate, from a security perspective, advanced circuit features such as wordline (WL) underdrive (which prevents retention failure) and overdrive (which assists writes) employed in the peripherals of Dynamic RAM (DRAM) memories. In an ideal environment, these features ensure fast and reliable read and write operations. However, an adversary can re-purpose them by inserting Trojans that, when activated, deliver malicious payloads such as fault injection, Denial-of-Service (DoS), and information leakage attacks. Simulation results indicate that the wordline voltage can be increased to cause retention failure and thereby launch a DoS attack on DRAM memory. Furthermore, two wordlines or bitlines can be shorted to leak information or inject faults by exploiting the DRAM's refresh operation. We demonstrate an information leakage system exploit by implementing TrappeD on the RocketChip SoC.

Updated: 2020-01-06
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-02
Yasar Abbas Ur Rehman; Lai-Man Po; Jukka Komulainen

Face presentation attack detection (PAD) in unconstrained conditions is one of the key issues in face biometric-based authentication and security applications. In this paper, we propose a perturbation layer, a learnable pre-processing layer for low-level deep features, to enhance the discriminative ability of deep features in face PAD. The perturbation layer takes the deep features of a candidate layer in a Convolutional Neural Network (CNN) together with the corresponding hand-crafted features of the input image, and produces adaptive convolutional weights for the deep features of the candidate layer. These adaptive convolutional weights determine the importance of the pixels in the deep features of the candidate layer for face PAD. The proposed perturbation layer adds very little overhead to the total trainable parameters in the model. We evaluated the proposed perturbation layer with Local Binary Patterns (LBP), with and without color information, on three publicly available face PAD databases, i.e., the CASIA, Idiap Replay-Attack, and OULU-NPU databases. Our experimental results show that introducing the proposed perturbation layer into the CNN improves face PAD performance in both intra-database and cross-database scenarios. Our results also highlight the attention created by the proposed perturbation layer in the deep features and its effectiveness for face PAD in general.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-16
Maoteng Zheng; Shiguang Wang; Xiaodong Xiong; Junfeng Zhu

This paper presents a fast and accurate iterative method for the camera pose estimation problem. The dependence on initial values is reduced by replacing the unknown angular parameters with three independent non-angular parameters. Image point coordinates are treated as observations with errors, and a new model is built using a conditional adjustment with parameters for relative orientation. This model allows for the estimation of the errors in the observations. The estimated observation errors are then used iteratively to detect and eliminate gross errors in the adjustment. A total of 22 synthetic datasets and 10 real datasets are used to compare the proposed method with the traditional iterative method, the 5-point-RANSAC method and the state-of-the-art 5-point-USAC method. Preliminary results show that our proposed method is not only faster than the other methods but also more accurate and stable.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-10
Amna Shifa; Muhammad Babar Imtiaz; Mamoona Naveed Asghar; Martin Fleury

An individual's privacy is a significant concern in surveillance videos. Existing research into locating individuals on the basis of detecting their skin focuses either on different techniques for detecting human skin or on protecting individuals from the consequences of applying such techniques. This paper considers both lines of research and proposes a hybrid scheme for human skin detection and subsequent privacy protection that utilizes color information under dynamically varying illumination and environmental conditions. For those purposes, dynamic and explicit skin-detection approaches are implemented, simultaneously considering multiple color spaces, i.e. RGB, perceptual (HSV) and orthogonal (YCbCr), and then detecting human skin by the proposed Combined Threshold Rule (CTR)-based segmentation. Comparative qualitative and quantitative detection results, with an average accuracy of 93.73%, imply that the proposed scheme achieves considerable accuracy without incurring a training cost. Once skin detection has been performed, the detected skin pixels (including false positives) are encrypted; standard AES-CFB encryption of the skin pixels is shown to be preferable to selective encryption of a whole video frame. The scheme preserves the behavior of the subjects within the video; hence, subsequent image processing and behavior analysis, if required, can be performed by an authorized user. The experimental results are encouraging: the average encryption time is 8.268 s and the Encryption Space Ratio (ESR) averages 7.25% for a high-definition video (1280 × 720 pixels/frame).
A performance comparison in terms of Correct Detection Rate (CDR) showed an average of 91.5% for CTR-based segmentation, compared to using only one color space for segmentation, such as RGB with 85.86%, HSV with 80.93% and YCbCr with an average of 84.8%, which implies that the proposed method of combining color-space skin identifications detects skin more accurately. Security analysis confirmed that the proposed scheme could be a suitable choice for real-time surveillance applications operating on resource-constrained devices.
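
As a rough illustration of a combined-threshold rule, the sketch below classifies a single pixel as skin only if it passes a rule in each of the three color spaces (a conjunction across RGB, HSV and YCbCr). The specific thresholds are common literature values, not the paper's exact CTR:

```python
import colorsys

def is_skin(r, g, b):
    """Per-pixel skin test that ANDs rules in three color spaces, in the
    spirit of a combined-threshold rule. Thresholds are illustrative
    values from the skin-detection literature, not the paper's CTR."""
    # RGB rule (uniform daylight illumination)
    rgb_ok = (r > 95 and g > 40 and b > 20 and
              max(r, g, b) - min(r, g, b) > 15 and
              abs(r - g) > 15 and r > g and r > b)
    # HSV rule: hue confined to a skin-tone range near red/orange
    h, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    hsv_ok = h <= 50 / 360 or h >= 340 / 360
    # YCbCr rule: chroma components inside a fixed box
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    ycbcr_ok = 77 <= cb <= 127 and 133 <= cr <= 173
    return rgb_ok and hsv_ok and ycbcr_ok
```

Running the test over every pixel yields the binary skin mask that the scheme would then pass to the AES-CFB encryption stage.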

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-12
Leulseged Tesfaye Alemu; Marcello Pelillo

Aggregating different image features for image retrieval has recently shown its effectiveness. While highly effective, the question of how to uplift the impact of the best features for a specific query image remains an open computer vision problem. In this paper, we propose a computationally efficient approach to fuse several hand-crafted and deep features, based on the probabilistic distribution of a given membership score of a constrained cluster, in an unsupervised manner. First, we introduce an incremental nearest neighbor (NN) selection method, whereby we dynamically select the k-NN of the query. We then build several graphs from the obtained NN sets and employ constrained dominant sets (CDS) on each graph G to assign edge weights that consider the intrinsic manifold structure of the graph and to detect false matches to the query. Finally, we elaborate on the computation of the feature positive-impact weight (PIW) based on the dispersive degree of the characteristics vector. To this end, we exploit the entropy of a cluster membership-score distribution. In addition, the final NN set bypasses a heuristic voting scheme. Experiments on several retrieval benchmark datasets show that our method can improve on the state-of-the-art results.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-13
Hang Shi; Chengjun Liu

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-14
Yin Gao; Qiming Li; Jun Li

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-13
Mateusz Trokielewicz; Adam Czajka; Piotr Maciejewicz

This paper proposes the first iris recognition methodology known to us that is designed specifically for post-mortem samples. We propose to use deep learning-based iris segmentation models to extract the highly irregular iris texture areas in post-mortem iris images. We show how to use segmentation masks predicted by neural networks in a conventional, Gabor-based iris recognition method, which employs circular approximations of the pupillary and limbic iris boundaries. As a whole, this method allows for a significant improvement in post-mortem iris recognition accuracy over methods designed only for ante-mortem irises, including the academic OSIRIS and commercial IriCore implementations. The proposed method reaches an EER of less than 1% for samples collected up to 10 hours after death, compared to EERs of 16.89% and 5.37% observed for OSIRIS and IriCore, respectively. For samples collected up to 369 h post-mortem, the proposed method achieves an EER of 21.45%, while 33.59% and 25.38% are observed for OSIRIS and IriCore, respectively. Additionally, the method is tested on a database of iris images collected from ophthalmology clinic patients, for which it also offers an advantage over the two other algorithms. This work is the first step towards post-mortem-specific iris recognition, which increases the chances of identifying deceased subjects in forensic investigations. The new database of post-mortem iris images acquired from 42 subjects, as well as the deep learning-based segmentation models, are made available along with the paper to ensure that all the results presented in this manuscript are reproducible.
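
The EER figures above denote the operating point where the false accept rate equals the false reject rate. A minimal threshold-sweep sketch of that standard metric (not the paper's pipeline, and assuming higher scores mean more similar):

```python
def equal_error_rate(genuine, impostor):
    """Find the threshold where the false accept rate (impostor scores at
    or above the threshold) is closest to the false reject rate (genuine
    scores below it), and return the average of the two rates there."""
    best_gap, eer = 2.0, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

With finite score lists the two curves rarely cross exactly, so reporting the midpoint at the closest crossing is a common interpolation-free approximation.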

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-16
Doyeob Yeo; Chang-Ock Lee

In general images, it is practically hard to distinguish only the desired object using conventional image segmentation methods. In many cases, we can segment the desired object by using the shape information of the object in addition to standard image segmentation. Chan and Zhu's model, however, is not robust to intensity changes of objects. In this paper, we propose a novel model for shape prior segmentation that produces robust results using hierarchical image segmentation and an attraction term. Moreover, we adopt an image registration technique and a multi-region image segmentation to obtain an initialization for a given shape prior. Finally, we consider free-form deformation in obtaining the shape function from the reference shape prior for real-world images. Numerical experiments demonstrate results that are independent of the intensities of objects and of the location of the reference shape prior. All numerical calculations are automatic and proceed without any user input.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-21
Fanqing Lin; Yao Chou; Tony Martinez

We tackle the task of semi-supervised video object segmentation, i.e. pixel-level object classification of the images in a video sequence using very limited ground-truth training data from the corresponding video. We present FLow Adaptive Video Object Segmentation (FLAVOS), an efficient pipeline based on a novel online adaptation algorithm that utilizes optical flow and is capable of tracking objects effectively throughout videos. In contrast to most recent deep learning-based approaches, which trade efficiency for accuracy, we provide extensive complexity analysis and additionally demonstrate that FLAVOS is natural for real-world applications by introducing an interactive pipeline that enables the user to provide feedback for online training. Our method achieves state-of-the-art accuracy on three challenging benchmark datasets and nearly ground-truth-level segmentation results with interactive user feedback.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-13
Min Yang; Mingtao Pei; Yunde Jia

In this paper, we address the problem of online multi-object tracking based on the Maximum a Posteriori (MAP) framework. Given the observations up to the current frame, we estimate the optimal object trajectories via two MAP estimation stages: object detection and data association. By introducing the sequential trajectory prior, i.e., the prior information from previous frames about “good” trajectories, into the two MAP stages, the inference of optimal detections is refined and the association correctness between trajectories and detections is enhanced. Furthermore, the sequential trajectory prior allows the two MAP stages to interact with each other in a sequential manner, which jointly optimizes the detections and trajectories to facilitate online multi-object tracking. Compared with existing methods, our approach is able to alleviate the association ambiguity caused by noisy detections and frequent inter-object interactions without using sophisticated association likelihood models. Experiments on publicly available challenging datasets demonstrate that our approach provides superior tracking performance over state-of-the-art algorithms in various complex scenes.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-12-10
Seyed Mehdi Iranmanesh; Benjamin Riggan; Shuowen Hu; Nasser M. Nasrabadi

The large modality gap between faces captured in different spectra makes heterogeneous face recognition (HFR) a challenging problem. In this paper, we present a coupled generative adversarial network (CpGAN) to address the problem of matching non-visible facial imagery against a gallery of visible faces. Our CpGAN architecture consists of two sub-networks: one dedicated to the visible spectrum and the other to the non-visible spectrum. Each sub-network consists of a generative adversarial network (GAN) architecture. Inspired by dense networks, which are capable of maximizing the information flow among features at different levels, we utilize a densely connected encoder-decoder structure as the generator in each GAN sub-network. The proposed CpGAN framework uses multiple loss functions to force the features from each sub-network to be as close as possible for the same identities in a common latent subspace. To achieve realistic photo reconstruction while preserving the discriminative information, we also add a perceptual loss function to the coupling loss function. An ablation study is performed to show the effectiveness of the different loss functions in optimizing the proposed method. Moreover, the superiority of the model compared to state-of-the-art HFR models is demonstrated on multiple datasets.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-12
Shan Jia; Guodong Guo; Zhengquan Xu; Qiangchang Wang

The vulnerability of face recognition systems to different presentation attacks has aroused increasing concern in the biometric community. Face presentation attack detection (PAD) techniques, which aim to distinguish real face samples from spoof artifacts, are an efficient countermeasure. In recent years, various methods have been proposed to address 2D face presentation attacks, including photo print attacks and video replay attacks. However, it is difficult to tell which methods perform better for these attacks, especially in practical mobile authentication scenarios, since there has been no systematic evaluation or benchmark of the state-of-the-art methods on common ground (i.e., using the same databases and protocols). Therefore, this paper presents a comprehensive evaluation of several representative face PAD methods (30 in total) on three public mobile spoofing datasets to quantitatively compare their detection performance. Furthermore, the generalization ability of the existing methods is tested under cross-database testing scenarios to expose possible database bias. We also summarize meaningful observations and give some insights that will help promote both academic research and practical applications.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-10
Ambroise Moreau; Matei Mancas; Thierry Dutoit

Among the various cues that help us understand and interact with our surroundings, depth is of particular importance. It allows us to move in space and grab objects to complete different tasks. Therefore, depth prediction has been an active research field for decades and many algorithms have been proposed to retrieve depth. Some imitate human vision and compute depth through triangulation on correspondences found between pixels or handcrafted features in different views of the same scene. Others rely on simple assumptions and semantic knowledge of the structure of the scene to get the depth information. Recently, numerous algorithms based on deep learning have emerged from the computer vision community. They implement the same principles as the non-deep-learning methods and leverage the ability of deep neural networks to automatically learn important features that help to solve the task. By doing so, they produce new state-of-the-art results and show encouraging prospects. In this article, we propose a taxonomy of deep learning methods for depth prediction from 2D images, with the training strategy as the sorting criterion. Indeed, some methods are trained in a supervised manner, meaning depth labels are needed during training, while others are trained in an unsupervised manner; in that case, the models learn to perform a different task, such as view synthesis, and depth is only a by-product of this learning. In addition to this taxonomy, we also evaluate nine models on two similar datasets without retraining. Our analysis shows that (i) most models are sensitive to sharp discontinuities created by shadows or colour contrasts and (ii) the post-processing applied to the results before computing the commonly used metrics can change the model ranking. Moreover, we show that most metrics agree with each other and are thus redundant.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-09
Wen Zhang; Xujie He; Wanyi Li; Zhi Zhang; Yongkang Luo; Li Su; Peng Wang

Ship segmentation is an important task in maritime surveillance systems. A great deal of research on image segmentation has been done in the past few years, but problems arise when directly applying it to ship segmentation against complex maritime backgrounds. The interference factors that decrease segmentation performance usually stem from the peculiarities of such backgrounds, for example sea fog, large wakes and large waves. To deal with these interference factors, this paper presents an integrated ship segmentation method based on a discriminator and an extractor (ISDE). Different from traditional segmentation methods, our method consists of two components: an Interference Factor Discriminator (IFD) and a Ship Extractor (SE). SqueezeNet is employed to implement the IFD, which as a first step judges which interference factors are contained in the input image, while DeepLabv3+ and an improved DeepLabv3+ are employed to implement the SE, which as a second step finally extracts the ships. We collected a ship segmentation dataset and conducted intensive experiments on it. The experimental results demonstrate that our method outperforms state-of-the-art methods in terms of segmentation accuracy, especially for images containing sea fog; besides, our method can run in real time.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-09
Qiang Huang; Yongxiong Wang; Zhong Yin

Projective methods have generally achieved better results in 3D object recognition in recent years. This may be analogous to the way human perception of 3D shapes relies on various 2D observations formed unconsciously on the retina. Existing methods treat each projection equally. However, we note that different viewpoint images of the same object have different discriminative features, and only some images are fully significant. We propose a novel View-based Weight Network (VWN) for 3D object recognition in which different view-based weights are assigned to different projections. The trainable view-level weights are incorporated as a pooling layer of a multi-view residual network; the pooling layer contains 7 sub-layers. Meanwhile, we find a simple unsupervised criterion to evaluate the prediction results before they are output. To improve recognition accuracy, a new multi-channel integrated classifier combining an Extreme Learning Machine, KNN, SVM and Random Forest is proposed based on this criterion. The multi-channel classifier brings the Top-1 accuracy close to the Top-2 accuracy. Experiments on the Princeton ModelNet 3D datasets demonstrate that our proposed method significantly outperforms state-of-the-art approaches in recognition accuracy.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-09
Bin Huang; Renwen Chen; Wang Xu; Qinbang Zhou

Conventional head pose estimation methods treat the task as either a classification or a regression problem. The accuracy of classification-based approaches is limited by the pose quantization interval, while regression-based methods are fragile under extremely large poses in non-ideal conditions. In contrast to these methods, this paper introduces a novel head pose estimation method using two-stage ensembles with average top-k regression. The first stage is a binned classification subtask with an optimal pose partition. The second stage performs average top-k regression based on the former prediction. We then combine the two subtasks by considering task-dependent weights instead of setting coefficients by grid search. We conduct several experiments to analyze the optimal pose partition for the classification part and to validate the average top-k loss for the regression part. Furthermore, we report the performance of the proposed method on the AFW, AFLW2000 and BIWI datasets; the results show rather competitive performance in head pose prediction.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-07

The process of splitting an image into specular and diffuse components is a fundamental problem in computer vision: most computer vision algorithms, such as image segmentation and tracking, assume diffuse surfaces, so the existence of specular reflection can mislead an algorithm into making incorrect decisions. Existing decomposition methods tend to work well for images with low specularity and high chromaticity, but they fail in cases of high-intensity specular light and on images with low chromaticity. In this paper, we address the problem of removing high-intensity specularity from low-chromaticity images (faces). We introduce a new dataset, Spec-Face, comprising face images corrupted with specular lighting and corresponding ground-truth diffuse images. We also introduce two deep learning models for specularity removal, Spec-Net and Spec-CGAN. Spec-Net takes an intensity channel as input and produces an output image that is very close to the ground truth, while Spec-CGAN takes an RGB image as input and produces a diffuse image very similar to the ground-truth RGB image. On Spec-Face, with Spec-Net, we obtain a peak signal-to-noise ratio (PSNR) of 3.979, a local mean squared error (LMSE) of 0.000071, a structural similarity index (SSIM) of 0.899, and a Fréchet Inception Distance (FID) of 20.932. With Spec-CGAN, we obtain a PSNR of 3.360, an LMSE of 0.000098, an SSIM of 0.707, and an FID of 31.699. With Spec-Net and Spec-CGAN, it is now feasible to perform specularity removal automatically prior to other critical and complex vision processes on real-world images, i.e., faces. This will potentially improve the performance of algorithms later in the processing stream, such as face recognition and skin cancer detection.
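
For reference, the PSNR reported above is defined from the mean squared error against the ground-truth image. A minimal sketch over images given as flat pixel lists:

```python
import math

def psnr(img_a, img_b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    given here as flat lists of pixel values; higher means closer."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(peak ** 2 / mse)
```

SSIM and FID are computed analogously against the ground truth but compare local structure and deep feature statistics, respectively, rather than raw pixel error.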

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-06
Jiaxing Yang; Xiang Fang; Lihe Zhang; Huchuan Lu; Guohua Wei

In this paper, we propose a novel saliency model based on double random walks with dual restarts. Two agents (also known as walkers), respectively representing the foreground and background properties, simultaneously walk on a graph to explore the saliency distribution. First, we propose a propagation distance measure and use it to calculate the initial distributions of the two agents instead of using geodesic distance. Second, the two agents traverse the graph starting from their own initial distributions and then interact with each other to correct their travel routes via the restart mechanism, which forces the agents to return to specific nodes with a certain probability after every movement. We define the dual restarts to take into account the interaction between, and the weighting of, the two agents. Extensive evaluations demonstrate that the proposed algorithm performs favorably against other state-of-the-art methods on four benchmark datasets.
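
The restart mechanism described above is the classical random walk with restart: after each step the walker returns to its initial distribution with probability alpha. A single-agent sketch of the base iteration (the paper couples two such agents; this shows only the standard building block) might look like:

```python
def random_walk_with_restart(affinity, restart, alpha=0.15, iters=100):
    """Iterate p <- (1 - alpha) * P^T p + alpha * r, where P row-normalizes
    the affinity matrix and r is the restart distribution; alpha is the
    probability of jumping back to r after any step."""
    n = len(affinity)
    # Row-normalize affinities into transition probabilities.
    trans = [[w / (sum(row) or 1.0) for w in row] for row in affinity]
    p = restart[:]
    for _ in range(n and iters):
        p = [(1 - alpha) * sum(p[j] * trans[j][i] for j in range(n))
             + alpha * restart[i] for i in range(n)]
    return p
```

The stationary vector concentrates probability mass near the restart nodes, which is what lets such walks rank graph nodes (here, image regions) by their affinity to seeds.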

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-01
Xianxian Zeng; Yun Zhang; Xiaodong Wang; Kairui Chen; Dong Li; Weijun Yang

Fine-Grained Image Retrieval is an important problem in computer vision. It is more challenging than content-based image retrieval because there is small diversity between different classes but large diversity within the same class. Recently, the cross entropy loss has been utilized to make a Convolutional Neural Network (CNN) generate distinguishing features for Fine-Grained Image Retrieval, and further improvement can be obtained with some extra operations, such as a Normalize-Scale layer. In this paper, we propose a variant of the cross entropy loss, named the Piecewise Cross Entropy loss function, for enhancing model generalization and promoting retrieval performance. Besides, the Piecewise Cross Entropy loss is easy to implement. We evaluate the performance of the proposed scheme on two standard fine-grained retrieval benchmarks and obtain significant improvements over the state of the art, with gains of 11.8% and 3.3% over previous work on CARS196 and CUB-200-2011, respectively.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-01
Ammar Mahmood; Mohammed Bennamoun; Senjian An; Ferdous Sohel; Farid Boussaid

Oceanographers rely on advanced digital imaging systems to assess the health of marine ecosystems. The majority of the imagery collected by these systems does not get annotated due to a lack of resources. Consequently, the expert-labeled data are not enough to train dedicated deep networks. Meanwhile, in the deep learning community, much focus is on how to use pre-trained deep networks to classify out-of-domain images and on transfer learning. In this paper, we leverage these advances to evaluate how well features extracted from deep neural networks transfer to underwater image classification. We propose new image features (called ResFeats) extracted from the different convolutional layers of a deep residual network pre-trained on ImageNet. We further combine the ResFeats extracted from different layers to obtain compact and powerful deep features. Moreover, we show that ResFeats consistently perform better than their CNN counterparts. Experimental results are provided to show the effectiveness of ResFeats, with state-of-the-art classification accuracies on the MLC, Benthoz15, EILAT and RSMAS datasets.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-10-31
Ye Gu; Xiaofeng Ye; Weihua Sheng; Yongsheng Ou; Yongqiang Li

Human action recognition is one of the most important and challenging topics in the field of image processing. Unlike object recognition, action recognition requires motion feature modeling that contains not only spatial but also temporal information. In this paper, we use multiple models to characterize both global and local motion features. Global motion patterns are represented efficiently by depth-based 3-channel motion history images (MHIs), while the local spatial and temporal patterns are extracted from the skeleton graph. The decisions of these two streams are fused. Finally, domain knowledge, namely the object/action dependency, is taken into account. The proposed framework is evaluated on two RGB-D datasets. The experimental results show the effectiveness of our proposed approach; its performance is comparable with the state of the art.
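
A motion history image keeps, per pixel, a decaying timestamp of the most recent motion, so brighter pixels encode more recent movement. One update step can be sketched as follows (illustrative `tau`/`delta` values, flat lists standing in for image planes):

```python
def update_mhi(mhi, motion_mask, tau=255, delta=32):
    """One MHI update step: pixels moving in the current frame are set to
    the maximum timestamp tau; silent pixels decay by delta until they
    reach 0. `mhi` and `motion_mask` are equally sized flat lists."""
    return [tau if moving else max(0, h - delta)
            for h, moving in zip(mhi, motion_mask)]
```

Stacking MHIs computed from depth at several decay rates gives a multi-channel image that summarizes the global motion of a whole clip in a single frame.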

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-10-31
Yong Dai; Yi Li; Shu-Tao Li

In the design field, designers usually retrieve images for reference according to product attributes when designing new proposals. To obtain the attributes of a product, designers spend a great deal of time and effort collecting product images and annotating them with multiple labels. However, the labels of product images represent subjective perceptual concepts, which makes multi-label learning more challenging: the model must imitate human aesthetics rather than merely discriminate appearance. In this paper, a Feature Correlation Learning (FCL) network is proposed to solve this problem by exploiting the potential feature correlations of product images. Given a product image, the FCL network calculates the features of different levels and their correlations via Gram matrices. The FCL is aggregated with DenseNet to predict the labels of the input product image. The proposed method is compared with several outstanding multi-label learning methods, as well as DenseNet. Experimental results demonstrate that the proposed method outperforms the state of the art on the multi-label learning problem for product image data.
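
The Gram matrix mentioned above turns a feature map into channel-by-channel inner products, i.e. feature correlations. A minimal sketch, with each channel given as a flattened list of spatial activations:

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as `channels` rows of flattened
    spatial activations: G[i][j] = <f_i, f_j> / (spatial size). Entries
    measure how strongly channels i and j co-activate across the image."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]
```

Because the spatial dimension is summed out, the matrix captures which feature channels fire together regardless of where, which is why it is a natural carrier of perceptual rather than purely spatial information.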

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-10-31
Costas Panagiotakis; Antonis Argyros

We present RFOVE, a region-based method for approximating an arbitrary 2D shape with an automatically determined number of possibly overlapping ellipses. RFOVE is completely unsupervised, operates without any assumptions or prior knowledge of the object's shape, and extends and improves the Decremental Ellipse Fitting Algorithm (DEFA) [1]. Both RFOVE and DEFA solve the multi-ellipse fitting problem by performing model selection guided by the minimization of the Akaike Information Criterion on a suitably defined shape complexity measure. However, in contrast to DEFA, RFOVE minimizes an objective function that allows for ellipses with a higher degree of overlap and thus achieves better ellipse-based shape approximation. A comparative evaluation of RFOVE with DEFA on several standard datasets shows that RFOVE achieves better shape coverage with simpler models (fewer ellipses). As a practical exploitation of RFOVE, we present its application to the problem of detecting and segmenting potentially overlapping cells in fluorescence microscopy images. Quantitative results obtained on three public datasets (one synthetic and two with more than 4000 actual stained cells) show the superiority of RFOVE over the state of the art in overlapping cell segmentation.
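
Model selection by AIC, as used here to decide how many ellipses to keep, trades parameter count against fit quality: AIC = 2k - 2 ln L, lower is better. A generic sketch of that selection rule (hypothetical candidate list, not DEFA's or RFOVE's exact complexity measure):

```python
def aic(num_params, log_likelihood):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better), so an
    extra ellipse's parameters must buy enough likelihood to pay off."""
    return 2 * num_params - 2 * log_likelihood

def select_model(candidates):
    """Pick the (num_params, log_likelihood) candidate with the lowest AIC."""
    return min(candidates, key=lambda c: aic(*c))
```

In this setting each candidate would be one hypothesized set of ellipses, with the likelihood derived from how well the ellipse union covers the shape.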

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-09-03
Javier Aldana-Iuit; Dmytro Mishkin; Ondřej Chum; Jiří Matas

A novel similarity-covariant feature detector is presented that extracts points whose neighborhoods, treated as a 3D intensity surface, have a saddle-like intensity profile. The saddle condition is verified efficiently by intensity comparisons on two concentric rings, which must exhibit exactly two dark-to-bright and two bright-to-dark transitions satisfying certain geometric constraints. Saddle is a fast approximation of the Hessian detector, much as ORB, which builds on the FAST detector, approximates the Harris detector. We propose using a matching strategy called first-geometric-inconsistent with binary descriptors, which is well suited to our feature detector, and report experiments with both hand-crafted and learned fixed-point descriptors. Experiments show that Saddle features are general, evenly spread, and appear in high density in a range of images. The Saddle detector is among the fastest proposed. Compared with detectors of similar speed, Saddle features show superior matching performance on a number of challenging datasets. Compared to recently proposed deep-learning-based interest point detectors and popular hand-crafted keypoint detectors, evaluated for repeatability on the ApolloScape dataset [1], the Saddle detector shows the best performance in most of the street-level view sequences, a.k.a. traversals.
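
The ring-based transition test can be approximated as follows (a simplified single-ring sketch; the paper's actual test uses two concentric rings plus geometric constraints, and `ring_transitions` with its margin parameter is an assumption):

```python
def ring_transitions(ring, center, margin=1):
    """Count intensity transitions around a circular ring of pixels.

    ring:   sequence of intensities sampled on a circle around a point.
    center: intensity of the candidate point.
    Pixels more than `margin` above center are 'bright' (+1), more than
    `margin` below are 'dark' (-1); ambiguous pixels are ignored.
    Returns (#dark-to-bright, #bright-to-dark); a saddle-like point
    should have exactly two of each.
    """
    labels = [1 if p > center + margin else -1 if p < center - margin else 0
              for p in ring]
    labels = [l for l in labels if l != 0]             # drop ambiguous pixels
    d2b = b2d = 0
    for a, b in zip(labels, labels[1:] + labels[:1]):  # wrap around the ring
        if a == -1 and b == 1:
            d2b += 1
        elif a == 1 and b == -1:
            b2d += 1
    return d2b, b2d

# Toy ring: bright-dark-bright-dark around the center -> saddle-like.
counts = ring_transitions([200, 200, 50, 50, 200, 200, 50, 50], center=120)
```

Because the test uses only comparisons on a small set of pixels, it can reject most candidates without computing second derivatives, which is where the speed comes from.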

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2018-09-26
Andrés Romero; Juán León; Pablo Arbeláez

We propose a novel convolutional neural network approach to address the fine-grained recognition problem of multi-view dynamic facial action unit detection. We leverage recent gains in large-scale object recognition by formulating the task of predicting the presence or absence of a specific action unit in a still image of a human face as holistic classification. We then explore the design space of our approach by considering both shared and independent representations for separate action units, and also different CNN architectures for combining color and motion information. We then move to the novel setup of the FERA 2017 Challenge, in which we propose a multi-view extension of our approach that operates by first predicting the viewpoint from which the video was taken, and then evaluating an ensemble of action unit detectors that were trained for that specific viewpoint. Our approach is holistic, efficient, and modular, since new action units can be easily included in the overall system. Our approach significantly outperforms the baseline of the FERA 2017 Challenge, with an absolute improvement of 14% on the F1-metric. Additionally, it compares favorably against the winner of the FERA 2017 Challenge.

Updated: 2020-01-04
• Image Vis. Comput. (IF 2.747) Pub Date : 2019-11-26
Mercedes Torres Torres, Michel Valstar, Caroline Henry, Carole Ward, Don Sharkey

A baby's gestational age determines whether or not they are premature, which helps clinicians decide on suitable post-natal treatment. The most accurate dating methods use Ultrasound Scan (USS) machines, but these are expensive, require trained personnel and cannot always be deployed to remote areas. In the absence of USS, the Ballard Score, a postnatal clinical examination, can be used. However, this method is highly subjective and results vary widely depending on the experience of the examiner. Our main contribution is a novel system for automatic postnatal gestational age estimation using small sets of images of a newborn's face, foot and ear. Our two-stage architecture makes the most out of Convolutional Neural Networks trained on small sets of images to predict broad classes of gestational age, and then fuses the outputs of these discrete classes with a baby's weight to make fine-grained predictions of gestational age using Support Vector Regression. On a purpose-collected dataset of 130 babies, experiments show that our approach surpasses current automatic state-of-the-art postnatal methods and attains an expected error of 6 days. It is three times more accurate than the Ballard method. Making use of images improves predictions by 33% compared to using weight only. This indicates that even with a very small set of data, our method is a viable candidate for postnatal gestational age estimation in areas where USS is not available.

Updated: 2019-11-01
• Image Vis. Comput. (IF 2.747) Pub Date : 2001-01-01
A W Toga, P M Thompson

Image registration is a key step in a great variety of biomedical imaging applications. It provides the ability to geometrically align one dataset with another, and is a prerequisite for all imaging applications that compare datasets across subjects, imaging modalities, or across time. Registration algorithms also enable the pooling and comparison of experimental findings across laboratories, the construction of population-based brain atlases, and the creation of systems to detect group patterns in structural and functional imaging data. We review the major types of registration approaches used in brain imaging today. We focus on their conceptual basis, the underlying mathematics, and their strengths and weaknesses in different contexts. We describe the major goals of registration, including data fusion, quantification of change, automated image segmentation and labeling, shape measurement, and pathology detection. We indicate that registration algorithms have great potential when used in conjunction with a digital brain atlas, which acts as a reference system in which brain images can be compared for statistical analysis. The resulting armory of registration approaches is fundamental to medical image analysis, and in a brain mapping context provides a means to elucidate clinical, demographic, or functional trends in the anatomy or physiology of the brain.

Updated: 2019-11-01
• Image Vis. Comput. (IF 2.747) Pub Date : 2018-12-14
Wen-Sheng Chu, Fernando De la Torre, Jeffrey F Cohn

Facial action units (AUs) may be represented spatially, temporally, and in terms of their correlation. Previous research focuses on one or another of these aspects or addresses them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations and a Long Short-Term Memory (LSTM) network to model temporal dependencies among them. The outputs of the CNNs and LSTMs are aggregated into a fusion network to produce per-frame predictions of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches on two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to standard multi-label CNN and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and obtained increased accuracy for AU detection. To address class imbalance within and between batches during network training, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualizations of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
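
One plausible way to realize such a multi-label sampling strategy is to over-sample frames that carry rare positive labels (a sketch only, not the paper's exact scheme; `balanced_batches` and its weighting rule are assumptions):

```python
import random

def balanced_batches(labels, batch_size, seed=0):
    """Yield index batches that over-sample frames with rare positive labels.

    labels: list of per-frame multi-label vectors (1 = AU present, 0 = absent).
    Each frame is weighted by the rarity of the positive labels it carries,
    so frames showing sparse AUs enter batches more often than chance.
    """
    rng = random.Random(seed)
    n = len(labels)
    # Fraction of frames in which each label is positive (floored to avoid /0).
    pos_rate = [max(sum(col) / n, 1e-6) for col in zip(*labels)]
    weights = [1.0 + sum(v / pos_rate[j] for j, v in enumerate(lab)) / len(lab)
               for lab in labels]
    while True:
        yield rng.choices(range(n), weights=weights, k=batch_size)
```

Frames with no positive AUs keep the base weight, while a frame containing an AU positive in only 1% of frames receives a large boost, so each batch sees the sparse classes often enough to learn them.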

Updated: 2019-11-01
• Image Vis. Comput. (IF 2.747) Pub Date : 2010-02-18
Wurong Yu, Bugao Xu

This paper presents a whole body surface imaging system based on stereo vision technology. We have adopted a compact and economical configuration which involves only four stereo units to image the frontal and rear sides of the body. The success of the system depends on a stereo matching process that can effectively segment the body from the background in addition to recovering sufficient geometric details. For this purpose, we have developed a novel sub-pixel, dense stereo matching algorithm which includes two major phases. In the first phase, the foreground is accurately segmented with the help of a predefined virtual interface in the disparity space image, and a coarse disparity map is generated with block matching. In the second phase, local least squares matching is performed in combination with global optimization within a regularization framework, so as to ensure both accuracy and reliability. Our experimental results show that the system can realistically capture smooth and natural whole body shapes with high accuracy.
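
The coarse first-phase disparity map can be sketched with plain SSD block matching (a minimal illustration; the paper's system additionally uses the virtual interface in disparity space for segmentation, followed by least-squares matching and regularization):

```python
import numpy as np

def block_match(left, right, patch=3, max_disp=8):
    """Coarse disparity by sum-of-squared-differences block matching.

    For each left-image pixel, search up to `max_disp` pixels to the left
    along the same scanline in the right image and keep the shift with the
    lowest SSD over a (patch x patch) window.
    """
    h, w = left.shape
    r = patch // 2
    disp = np.zeros((h, w), dtype=int)
    for y in range(r, h - r):
        for x in range(r, w - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1].astype(float)
            best, best_d = np.inf, 0
            for d in range(0, min(max_disp, x - r) + 1):
                cand = right[y - r:y + r + 1, x - d - r:x - d + r + 1]
                ssd = np.sum((ref - cand) ** 2)
                if ssd < best:
                    best, best_d = ssd, d
            disp[y, x] = best_d
    return disp
```

Real systems replace this integer-shift search with the sub-pixel least-squares refinement described above, but the block-matching pass shows where the initial disparity estimates come from.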

Updated: 2019-11-01
• Image Vis. Comput. (IF 2.747) Pub Date : 2010-02-18
Liya Ding, Aleix M Martinez

The manual signs in sign languages are generated and interpreted using three basic building blocks: handshape, motion, and place of articulation. When combined, these three components (together with palm orientation) uniquely determine the meaning of the manual sign. This means that pattern recognition techniques employing only a subset of these components are inappropriate for interpreting the sign or for building automatic recognizers of the language. In this paper, we define an algorithm to model these three basic components from a single video sequence of two-dimensional pictures of a sign. The recognition results for these three components are then combined to determine the class of the signs in the videos. Experiments are performed on a database of (isolated) American Sign Language (ASL) signs. The results demonstrate that, using semi-automatic detection, all three components can be reliably recovered from two-dimensional video sequences, allowing for an accurate representation and recognition of the signs.
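
Combining the three component recognizers could look like the following (an illustrative product-of-posteriors fusion under a conditional-independence assumption; the sign names and probabilities are made up, and the paper's actual combination rule may differ):

```python
def fuse_components(handshape_probs, motion_probs, place_probs):
    """Combine per-component class posteriors into a sign-level decision.

    Each argument maps sign class -> probability from one recognizer
    (handshape, motion, place of articulation). Assuming the components
    are conditionally independent, the fused score for each class is the
    product of the three posteriors; the argmax wins.
    """
    scores = {c: handshape_probs[c] * motion_probs[c] * place_probs[c]
              for c in handshape_probs}
    return max(scores, key=scores.get)

# Toy example with two hypothetical sign classes.
hand   = {"MOTHER": 0.6, "FATHER": 0.4}
motion = {"MOTHER": 0.7, "FATHER": 0.3}
place  = {"MOTHER": 0.2, "FATHER": 0.8}
best = fuse_components(hand, motion, place)
```

The example shows why all three components matter: handshape and motion alone favor one sign, but place of articulation flips the fused decision.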

Updated: 2019-11-01
• Image Vis. Comput. (IF 2.747) Pub Date : 2010-02-18
Huiying Shen, James Coughlan, Volodymyr Ivanchenko

Foreground-background segmentation has recently been applied [26,12] to the detection and segmentation of specific objects or structures of interest from the background as an efficient alternative to techniques such as deformable templates [27]. We introduce a graphical model (i.e. Markov random field)-based formulation of structure-specific figure-ground segmentation based on simple geometric features extracted from an image, such as local configurations of linear features, that are characteristic of the desired figure structure. Our formulation is novel in that it is based on factor graphs, which are graphical models that encode interactions among arbitrary numbers of random variables. The ability of factor graphs to express interactions higher than pairwise order (the highest order encountered in most graphical models used in computer vision) is useful for modeling a variety of pattern recognition problems. In particular, we show how this property makes factor graphs a natural framework for performing grouping and segmentation, and demonstrate that the factor graph framework emerges naturally from a simple maximum entropy model of figure-ground segmentation. We cast our approach in a learning framework, in which the contributions of multiple grouping cues are learned from training data, and apply our framework to the problem of finding printed text in natural scenes. Experimental results are described, including a performance analysis that demonstrates the feasibility of the approach.

Updated: 2019-11-01
• Image Vis. Comput. (IF 2.747) Pub Date : 2010-02-18
Maja Pantic, Jeffrey F Cohn

Updated: 2019-11-01
• Image Vis. Comput. (IF 2.747) Pub Date : 2010-01-05
Simon Lucey, Yang Wang, Mark Cox, Sridha Sridharan, Jeffery F Cohn

Active appearance models (AAMs) have demonstrated great utility when being employed for non-rigid face alignment/tracking. The "simultaneous" algorithm for fitting an AAM achieves good non-rigid face registration performance, but has poor real time performance (2-3 fps). The "project-out" algorithm for fitting an AAM achieves faster than real time performance (> 200 fps) but suffers from poor generic alignment performance. In this paper we introduce an extension to a discriminative method for non-rigid face registration/tracking referred to as a constrained local model (CLM). Our proposed method is able to achieve superior performance to the "simultaneous" AAM algorithm along with real time fitting speeds (35 fps). We improve upon the canonical CLM formulation, to gain this performance, in a number of ways by employing: (i) linear SVMs as patch-experts, (ii) a simplified optimization criteria, and (iii) a composite rather than additive warp update step. Most notably, our simplified optimization criteria for fitting the CLM divides the problem of finding a single complex registration/warp displacement into that of finding N simple warp displacements. From these N simple warp displacements, a single complex warp displacement is estimated using a weighted least-squares constraint. Another major advantage of this simplified optimization stems from its ability to be parallelized, a step which we also theoretically explore in this paper. We refer to our approach for fitting the CLM as the "exhaustive local search" (ELS) algorithm. Experiments were conducted on the CMU Multi-PIE database.
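
The weighted least-squares step that fuses the N simple displacements into one global warp can be sketched as follows (a minimal version estimating a 2D similarity transform; `weighted_similarity` and its parameterization are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def weighted_similarity(src, dst, w):
    """Weighted least-squares similarity transform (scale+rotation+shift).

    src, dst: (N, 2) landmark positions before/after local patch search.
    w:        per-landmark confidence weights from the patch experts.
    Solves  dst ~ [a -b; b a] src + [tx, ty]  in the weighted LS sense
    and returns the parameters (a, b, tx, ty).
    """
    src, dst, w = map(np.asarray, (src, dst, w))
    n = len(src)
    A = np.zeros((2 * n, 4))
    b = dst.reshape(-1).astype(float)           # interleaved x0,y0,x1,y1,...
    A[0::2] = np.c_[src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)]
    A[1::2] = np.c_[src[:, 1],  src[:, 0], np.zeros(n), np.ones(n)]
    sw = np.sqrt(np.repeat(w, 2).astype(float)) # weight each residual row
    params, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return params
```

Because each landmark contributes two independent rows, the N local searches can run in parallel and only this small 4-parameter solve is global, which is the parallelization advantage noted above.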

Updated: 2019-11-01
• Image Vis. Comput. (IF 2.747) Pub Date : 2009-10-01
Ahmed Bilal Ashraf, Simon Lucey, Jeffrey F Cohn, Tsuhan Chen, Zara Ambadar, Kenneth M Prkachin, Patricia E Solomon

Pain is typically assessed by patient self-report. Self-reported pain, however, is difficult to interpret and may be impaired or in some circumstances (i.e., young children and the severely ill) not even possible. To circumvent these problems behavioral scientists have identified reliable and valid facial indicators of pain. Hitherto, these methods have required manual measurement by highly skilled human observers. In this paper we explore an approach for automatically recognizing acute pain without the need for human observers. Specifically, our study was restricted to automatically detecting pain in adult patients with rotator cuff injuries. The system employed video input of the patients as they moved their affected and unaffected shoulder. Two types of ground truth were considered. Sequence-level ground truth consisted of Likert-type ratings by skilled observers. Frame-level ground truth was calculated from presence/absence and intensity of facial actions previously associated with pain. Active appearance models (AAM) were used to decouple shape and appearance in the digitized face images. Support vector machines (SVM) were compared for several representations from the AAM and of ground truth of varying granularity. We explored two questions pertinent to the construction, design and development of automatic pain detection systems. First, at what level (i.e., sequence- or frame-level) should datasets be labeled in order to obtain satisfactory automatic pain detection performance? Second, how important is it, at both levels of labeling, that we non-rigidly register the face?

Updated: 2019-11-01
• Image Vis. Comput. (IF 2.747) Pub Date : 2017-02-01
László A Jeni, Jeffrey F Cohn, Takeo Kanade

To enable real-time, person-independent 3D registration from 2D video, we developed a 3D cascade regression approach in which facial landmarks remain invariant across pose over a range of approximately 60 degrees. From a single 2D image of a person's face, a dense 3D shape is registered in real time for each frame. The algorithm utilizes a fast cascade regression framework trained on high-resolution 3D face-scans of posed and spontaneous emotion expression. The algorithm first estimates the location of a dense set of landmarks and their visibility, then reconstructs face shapes by fitting a part-based 3D model. Because no assumptions are required about illumination or surface properties, the method can be applied to a wide range of imaging conditions that include 2D video and uncalibrated multi-view video. The method has been validated in a battery of experiments that evaluate its precision of 3D reconstruction, extension to multi-view reconstruction, temporal integration for videos and 3D head-pose estimation. Experimental findings strongly support the validity of real-time, 3D registration and reconstruction from 2D video. The software is available online at http://zface.org.

Updated: 2019-11-01
Contents have been reproduced by permission of the publishers.
