Current journal: arXiv - CS - Multimedia
  • Robust Wavelet-Based Watermarking Using Dynamic Strength Factor
    arXiv.cs.MM Pub Date : 2020-04-06
    Mahsa Kadkhodaei; Shadrokh Samavi

    In unsecured network environments, ownership protection of digital contents, such as images, is becoming a growing concern. Different watermarking methods have been proposed to address the copyright protection of digital materials. Watermarking methods are challenged by the conflicting requirements of imperceptibility and robustness. While embedding a watermark with a high strength factor increases robustness
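The trade-off governed by the strength factor can be illustrated with a minimal numpy sketch. This is a generic additive wavelet-domain scheme with a *fixed* factor, not the paper's dynamic method; the Haar transform and the `alpha` values are illustrative assumptions:

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2D Haar DWT: returns LL, LH, HL, HH subbands."""
    a = (img[0::2, :] + img[1::2, :]) / 2.0   # row averages
    d = (img[0::2, :] - img[1::2, :]) / 2.0   # row details
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    a = np.empty((ll.shape[0], ll.shape[1] * 2))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    img = np.empty((a.shape[0] * 2, a.shape[1]))
    img[0::2, :], img[1::2, :] = a + d, a - d
    return img

def embed(img, watermark, alpha):
    """Additive embedding in the LL subband, alpha = strength factor."""
    ll, lh, hl, hh = haar_dwt2(img)
    return haar_idwt2(ll + alpha * watermark, lh, hl, hh)

rng = np.random.default_rng(0)
img = rng.uniform(0, 255, (64, 64))
wm = rng.choice([-1.0, 1.0], (32, 32))        # bipolar watermark bits

weak = embed(img, wm, alpha=0.5)
strong = embed(img, wm, alpha=5.0)
mse = lambda a, b: np.mean((a - b) ** 2)
# Larger alpha -> larger distortion: better robustness, worse imperceptibility.
assert mse(strong, img) > mse(weak, img)
```

A dynamic scheme would choose `alpha` per region or per coefficient instead of globally.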

  • Direct Speech-to-image Translation
    arXiv.cs.MM Pub Date : 2020-04-07
    Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

    Direct speech-to-image translation without text is an interesting and useful topic due to potential applications in human-computer interaction, art creation, computer-aided design, etc., not to mention that many languages have no written form. However, as far as we know, it has not been well studied how to translate speech signals into images directly and how well they can be translated. In

  • Exploiting context dependence for image compression with upsampling
    arXiv.cs.MM Pub Date : 2020-04-06
    Jarek Duda

    Image compression with upsampling encodes information to successively increase image resolution, for example by encoding differences in FUIF and JPEG XL. It is useful for progressive decoding and can often improve the compression ratio. However, currently used solutions do not exploit context dependence when encoding such upscaling information. This article discusses simple inexpensive general
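The general idea of context-conditioned coding of upscaling information can be sketched as follows. This is a toy numpy illustration under assumed prediction and context models, not the article's actual scheme; it uses the fact that empirical conditional entropy never exceeds the unconditional one:

```python
import numpy as np

def entropy(x):
    """Empirical entropy in bits per symbol."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(1)
base = np.add.outer(np.arange(64), np.arange(64)).astype(float)
img = np.round(base + rng.normal(0, 1, base.shape)).astype(int)

# Upsampling-style coding: transmit a coarse image, predict the fine
# image from it, and encode only the prediction residuals.
coarse = img[::2, ::2]
pred = np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
res = img - pred

# A crude context: sign of the local horizontal gradient of the prediction.
ctx = np.sign(np.gradient(pred.astype(float), axis=1)).astype(int)

h_uncond = entropy(res)
h_cond = 0.0
for c in np.unique(ctx):
    mask = ctx == c
    h_cond += mask.mean() * entropy(res[mask])
# Conditioning the residual coder on context can only lower the entropy:
assert h_cond <= h_uncond + 1e-9
```

A practical codec would feed such context into the probability model of an entropy coder rather than compute entropies directly.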

  • Temporally Distributed Networks for Fast Video Semantic Segmentation
    arXiv.cs.MM Pub Date : 2020-04-03
    Ping Hu; Fabian Caba Heilbron; Oliver Wang; Zhe Lin; Stan Sclaroff; Federico Perazzi

    We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we distribute these sub-networks over sequential frames. Therefore

  • A Local-to-Global Approach to Multi-modal Movie Scene Segmentation
    arXiv.cs.MM Pub Date : 2020-04-06
    Anyi Rao; Linning Xu; Yu Xiong; Guodong Xu; Qingqiu Huang; Bolei Zhou; Dahua Lin

    A scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes serves as a critical step towards semantic understanding of movies. This is very challenging compared to the videos studied in conventional vision problems, e.g., action recognition, as scenes in movies usually contain

  • Comparing emotional states induced by 360$^{\circ}$ videos via head-mounted display and computer screen
    arXiv.cs.MM Pub Date : 2020-04-03
    Jan-Niklas Voigt-Antons; Eero Lehtonen; Andres Pinilla Palacios; Danish Ali; Tanja Kojić; Sebastian Möller

    In recent years 360$^{\circ}$ videos have become increasingly popular. For traditional media presentations, e.g., on a computer screen, a wide range of assessment methods are available. Different constructs, such as perceived quality or the induced emotional state of viewers, can be reliably assessed by subjective scales. Many of the subjective methods have only been validated using stimuli presented

  • User Experience of Reading in Virtual Reality -- Finding Values for Text Distance, Size and Contrast
    arXiv.cs.MM Pub Date : 2020-04-03
    Tanja Kojić; Danish Ali; Robert Greinacher; Sebastian Möller; Jan-Niklas Voigt-Antons

    Virtual Reality (VR) has an increasing impact on the market in many fields, from education and medicine to engineering and entertainment, through applications that replicate real-life scenarios or, in the case of augmentation, enhance them. Intending to present realistic environments, VR applications include the text that surrounds us every day. However, text can only add value to

  • Impact of Tactile and Visual Feedback on Breathing Rhythm and User Experience in VR Exergaming
    arXiv.cs.MM Pub Date : 2020-04-03
    Robert Greinacher; Tanja Kojić; Luis Meier; Rudresha Gulaganjihalli Parameshappa; Sebastian Möller; Jan-Niklas Voigt-Antons

    Combining interconnected wearables provides fascinating opportunities, such as augmenting exergaming with virtual coaches, or giving feedback on the execution of sports activities and how to improve them. Breathing rhythm is a particularly interesting physiological dimension since it is easy and unobtrusive to measure, and the gathered data provide valuable insights regarding the correct execution of movements, especially

  • Reversible Data Hiding in Encrypted Images Using MSBs Integration and Histogram Modification
    arXiv.cs.MM Pub Date : 2019-12-06
    Ammar Mohammadi

    This paper presents a reversible data hiding (RDH) scheme for encrypted images that employs basic notions of RDH in plain-image schemes, including histogram modification and prediction-error computation. In the proposed method, the original image may be encrypted with any desired encryption algorithm. The most significant bits (MSBs) of encrypted pixels are integrated to vacate room for embedding data bits. Integrated ones will
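For background, classic histogram-shifting RDH on a plain image looks like this. This is a generic sketch of the well-known technique the abstract builds on, not the paper's encrypted-domain scheme; it assumes a zero-count bin exists above the histogram peak:

```python
import numpy as np

def hist_shift_embed(img, bits):
    """Classic histogram-shifting RDH: vacate the bin next to the peak."""
    h = np.bincount(img.ravel(), minlength=256)
    p = int(h.argmax())                       # peak bin
    z = p + 1 + int(h[p + 1:].argmin())       # a zero-count bin above the peak
    out = img.copy()
    out[(img > p) & (img < z)] += 1           # shift to vacate bin p+1
    flat = out.ravel()
    carriers = np.flatnonzero(flat == p)[:len(bits)]
    flat[carriers] += np.asarray(bits)        # bit 0 stays at p, bit 1 -> p+1
    return out, p, z

def hist_shift_extract(stego, p, z, n):
    rec = stego.copy()
    flat = rec.ravel()
    carriers = np.flatnonzero((flat == p) | (flat == p + 1))[:n]
    bits = (flat[carriers] == p + 1).astype(int)
    flat[carriers] = p                        # undo embedding
    flat[(flat > p) & (flat <= z)] -= 1       # undo shifting
    return bits, rec

rng = np.random.default_rng(7)
img = rng.integers(0, 50, (32, 32))           # bins 50..255 empty by design
bits = rng.integers(0, 2, 16)
stego, p, z = hist_shift_embed(img, bits)
out_bits, recovered = hist_shift_extract(stego, p, z, len(bits))
assert np.array_equal(out_bits, bits) and np.array_equal(recovered, img)
```

Reversibility (exact recovery of `img`) is what distinguishes RDH from ordinary steganography.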

  • Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
    arXiv.cs.MM Pub Date : 2020-04-02
    Zhicheng Huang; Zhaoyang Zeng; Bei Liu; Dongmei Fu; Jianlong Fu

    We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision-and-language tasks.

  • Multi-Modal Video Forensic Platform for Investigating Post-Terrorist Attack Scenarios
    arXiv.cs.MM Pub Date : 2020-04-02
    Alexander Schindler; Andrew Lindley; Anahid Jalali; Martin Boyer; Sergiu Gordea; Ross King

    The forensic investigation of a terrorist attack poses a significant challenge to the investigative authorities, as often several thousand hours of video footage must be viewed. Large scale Video Analytic Platforms (VAP) assist law enforcement agencies (LEA) in identifying suspects and securing evidence. Current platforms focus primarily on the integration of different computer vision methods and thus

  • Social-Sensor Composition for Tapestry Scenes
    arXiv.cs.MM Pub Date : 2020-03-28
    Tooba Aamir; Hai Dong; Athman Bouguettaya

    The extensive use of social media platforms and overwhelming amounts of imagery data create unique opportunities for sensing, gathering and sharing information about events. One of its potential applications is to leverage crowdsourced social media images to create a tapestry scene for scene analysis of designated locations and time intervals. The existing attempts, however, ignore the temporal-semantic

  • Fast Session Resumption in DTLS for Mobile Communications
    arXiv.cs.MM Pub Date : 2019-10-08
    Gyordan Caminati; Sara Kiade; Gabriele D'Angelo; Stefano Ferretti; Vittorio Ghini

    DTLS is a protocol that provides security guarantees to Internet communications. It can operate on top of both TCP and UDP transport protocols. Thus, it is particularly suited for peer-to-peer and distributed multimedia applications. The same holds if the endpoints are mobile devices. In this scenario, mechanisms are needed to surmount possible network disconnections, often arising due to the mobility

  • From QoS Distributions to QoE Distributions: a System's Perspective
    arXiv.cs.MM Pub Date : 2020-03-28
    Tobias Hossfeld; Poul E. Heegaard; Martin Varela; Lea Skorin-Kapov; Markus Fiedler

    In the context of QoE management, network and service providers commonly rely on models that map system QoS conditions (e.g., system response time, packet loss, etc.) to estimated end user QoE values. Observable QoS conditions in the system may be assumed to follow a certain distribution, meaning that different end users will experience different conditions. On the other hand, drawing from the results
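The difference between mapping a whole QoS *distribution* through a QoE model and plugging in only the mean QoS can be sketched with a generic IQX-style exponential mapping. The parameters and the packet-loss distribution below are illustrative assumptions, not the paper's models:

```python
import numpy as np

def qoe(qos):
    """IQX-style exponential QoS->QoE mapping; qos is e.g. packet loss in %.
    Returns a MOS-like value in (1, 5]."""
    return 1.0 + 4.0 * np.exp(-0.5 * qos)

rng = np.random.default_rng(2)
# Different users see different conditions: a distribution of QoS values.
qos_samples = rng.exponential(scale=2.0, size=100_000)

mean_of_qoe = qoe(qos_samples).mean()   # distribution mapped, then averaged
qoe_of_mean = qoe(qos_samples.mean())   # naive: map the mean condition only
# qoe() is convex, so by Jensen's inequality E[f(X)] >= f(E[X]):
assert mean_of_qoe > qoe_of_mean
```

The gap between the two numbers is exactly why a system's perspective needs the full QoE distribution, not a point estimate.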

  • Deep Residual Neural Networks for Image in Speech Steganography
    arXiv.cs.MM Pub Date : 2020-03-30
    Shivam Agarwal; Siddarth Venkatraman

    Steganography is the art of hiding a secret message inside a publicly visible carrier message. Ideally, it is done without modifying the carrier, and with minimal loss of information in the secret message. Recently, various deep learning based approaches to steganography have been applied to different message types. We propose a deep learning based technique to hide a source RGB image message inside
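For contrast with such learned approaches, the classical least-significant-bit baseline of hiding a message in a carrier looks like this. This is a naive sketch for context, unrelated to the proposed deep residual network:

```python
import numpy as np

rng = np.random.default_rng(6)
carrier = rng.integers(0, 256, (32, 32), dtype=np.uint8)   # cover image
secret_bits = rng.integers(0, 2, carrier.size, dtype=np.uint8)

# Embed: overwrite each pixel's least significant bit with a message bit.
stego = (carrier & 0xFE) | secret_bits.reshape(carrier.shape)

# Extract: read the LSBs back.
recovered = stego & 1

assert np.array_equal(recovered.ravel(), secret_bits)
# Each pixel changes by at most 1, so the carrier is barely modified:
assert np.max(np.abs(stego.astype(int) - carrier.astype(int))) <= 1
```

Deep approaches aim for much higher payloads (a whole RGB image) than the 1 bit per pixel this baseline offers.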

  • A CNN-RNN Framework for Image Annotation from Visual Cues and Social Network Metadata
    arXiv.cs.MM Pub Date : 2019-10-13
    Tobia Tesan; Pasquale Coscia; Lamberto Ballan

    Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or non-common images that need more context to be correctly annotated. Metadata accompanying images on social media represent an ideal source of additional information for retrieving proper neighborhoods, easing the image annotation task. To this

  • Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text
    arXiv.cs.MM Pub Date : 2020-03-27
    Alexander Schindler; Sergiu Gordea; Peter Knees

    We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness. By applying Latent Semantic Indexing (LSI) we embed corresponding textual information into a latent vector space from which we derive track relatedness for online triplet selection. This LSI
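The LSI step can be sketched in a few lines of numpy: a truncated SVD of a term-document matrix gives latent vectors whose similarities drive triplet selection. The toy metadata and the dimensionality `k` below are illustrative assumptions, not the paper's corpus or settings:

```python
import numpy as np

# Toy term-document matrix: rows = tracks, columns = terms in their metadata.
docs = ["jazz sax solo", "jazz piano trio", "metal guitar riff", "metal drums riff"]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# Latent Semantic Indexing: truncated SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = U[:, :k] * s[:k]                     # k-dim latent vector per track

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Track-relatedness drives triplet selection: for anchor track 0 (jazz),
# the most related other track is the positive, an unrelated one a negative.
rel = [cos(Z[0], Z[j]) for j in range(1, 4)]
pos = 1 + int(np.argmax(rel))
assert pos == 1                          # the other jazz track is the positive
```

In the paper's pipeline these latent similarities select (anchor, positive, negative) triplets for training the audio network.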

  • A General Approach for Using Deep Neural Network for Digital Watermarking
    arXiv.cs.MM Pub Date : 2020-03-08
    Yurui Ming; Weiping Ding; Zehong Cao; Chin-Teng Lin

    Technologies of the Internet of Things (IoT) facilitate digital contents such as images being acquired in a massive way. However, privacy and legislative considerations still demand intellectual content protection. In this paper, we propose a general deep neural network (DNN) based watermarking method to fulfill this goal. Instead of training a neural network for protecting

  • How deep is your encoder: an analysis of features descriptors for an autoencoder-based audio-visual quality metric
    arXiv.cs.MM Pub Date : 2020-03-24
    Helard Martinez; Andrew Hines; Mylene C. Q. Farias

    The development of audio-visual quality assessment models poses a number of challenges in order to obtain accurate predictions. One of these challenges is the modelling of the complex interaction that audio and visual stimuli have and how this interaction is interpreted by human users. The No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd) deals with this problem from a

  • Impact of the Number of Votes on the Reliability and Validity of Subjective Speech Quality Assessment in the Crowdsourcing Approach
    arXiv.cs.MM Pub Date : 2020-03-25
    Babak Naderi; Tobias Hossfeld; Matthias Hirth; Florian Metzger; Sebastian Möller; Rafael Zequeira Jiménez

    The subjective quality of transmitted speech is traditionally assessed in a controlled laboratory environment according to ITU-T Rec. P.800. In turn, with crowdsourcing, crowdworkers participate in a subjective online experiment using their own listening device, and in their own working environment. Despite such less controllable conditions, the increased use of crowdsourcing micro-task platforms for

  • Edge-assisted Viewport Adaptive Scheme for real-time Omnidirectional Video transmission
    arXiv.cs.MM Pub Date : 2020-03-21
    Tao Guo; Xikang Jiang; Bin Xiang; Lin Zhang

    Omnidirectional applications are immersive and highly interactive, which can improve the efficiency of remote collaborative work among factory workers. The transmission of omnidirectional video (OV) is the most important step in implementing virtual remote collaboration. Compared with the ordinary video transmission, OV transmission requires more bandwidth, which is still a huge burden even under 5G

  • JPEG Steganography and Synchronization of DCT Coefficients for a Given Development Pipeline
    arXiv.cs.MM Pub Date : 2020-03-23
    Théo Taburet; Patrick Bas; Wadih Sawaya; Remi Cogranne

    This short paper proposes to use the statistical analysis of the correlation between DCT coefficients to design a new synchronization strategy that can be used for cost-based steganographic schemes in the JPEG domain. First, an analysis is performed on the covariance matrix of DCT coefficients of neighboring blocks after a development similar to the one used to generate BossBase. This analysis exhibits

  • Multi-task U-Net for Music Source Separation
    arXiv.cs.MM Pub Date : 2020-03-23
    Venkatesh S. Kadandale; Juan F. Montesinos; Gloria Haro; Emilia Gómez

    A fairly straightforward approach for music source separation is to train independent models, wherein each model is dedicated to estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation

  • Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency
    arXiv.cs.MM Pub Date : 2020-03-23
    Eric Müller-Budack; Jonas Theiner; Sebastian Diering; Maximilian Idahl; Ralph Ewerth

    The World Wide Web has become a popular source for gathering information and news. Multimodal information, e.g., enriching text with photos, is typically used to convey the news more effectively or to attract attention. Photo content can be merely decorative, can depict additional important information, or can even contain misleading information. Therefore, automatic approaches to quantify cross-modal

  • Viewport-Aware Deep Reinforcement Learning Approach for 360$^o$ Video Caching
    arXiv.cs.MM Pub Date : 2020-03-18
    Pantelis Maniotis; Nikolaos Thomos

    360$^o$ video is an essential component of VR/AR/MR systems that provides immersive experience to the users. However, 360$^o$ video is associated with high bandwidth requirements. The required bandwidth can be reduced by exploiting the fact that users are interested in viewing only a part of the video scene and that users request viewports that overlap with each other. Motivated by the findings of

  • Convolutional Neural Networks for Continuous QoE Prediction in Video Streaming Services
    arXiv.cs.MM Pub Date : 2020-03-19
    Tho Nguyen Duc; Chanh Minh Tran; Phan Xuan Tan; Eiji Kamioka

    In video streaming services, continuously predicting the user's quality of experience (QoE) plays a crucial role in delivering high-quality streaming content to the user. However, the complexity caused by the temporal dependencies in QoE data and the non-linear relationships among QoE influence factors has introduced challenges to continuous QoE prediction. To deal with that, existing studies have utilized

  • FAURAS: A Proxy-based Framework for Ensuring the Fairness of Adaptive Video Streaming over HTTP/2 Server Push
    arXiv.cs.MM Pub Date : 2020-03-19
    Chanh Minh Tran; Tho Nguyen Duc; Phan Xuan Tan; Eiji Kamioka

    HTTP/2 video streaming has attracted a lot of attention in the development of multimedia technologies over the last few years. In HTTP/2, the server push mechanism allows the server to deliver more video segments to the client within a single request in order to deal with the request explosion problem. As a result, recent research efforts have been focusing on utilizing such a feature to enhance the

  • Personalized Taste and Cuisine Preference Modeling via Images
    arXiv.cs.MM Pub Date : 2020-02-26
    Nitish Nag; Bindu Rajanna; Ramesh Jain

    With the exponential growth in the usage of social media to share live updates about life, taking pictures has become an unavoidable phenomenon. Individuals unknowingly create a unique knowledge base with these images. The food images, in particular, are of interest as they contain a plethora of information. From the image metadata and using computer vision tools, we can extract distinct insights for

  • DRST: Deep Residual Shearlet Transform for Densely Sampled Light Field Reconstruction
    arXiv.cs.MM Pub Date : 2020-03-19
    Yuan Gao; Robert Bregovic; Reinhard Koch; Atanas Gotchev

    The Image-Based Rendering (IBR) approach using Shearlet Transform (ST) is one of the most effective methods for Densely-Sampled Light Field (DSLF) reconstruction. The ST-based DSLF reconstruction typically relies on an iterative thresholding algorithm for Epipolar-Plane Image (EPI) sparse regularization in shearlet domain, involving dozens of transformations between image domain and shearlet domain
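The iterative thresholding referred to here is, in its generic form, ISTA. A small numpy sketch of l1-regularized least squares illustrates the per-iteration cost the paper tries to avoid; this is a generic sparse-recovery toy, not the shearlet-domain EPI regularizer:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, steps=1000):
    """Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by iterative soft-thresholding.
    Each step applies the forward and adjoint transform once."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = soft(x - (A.T @ (A @ x - y)) / L, lam / L)
    return x

rng = np.random.default_rng(3)
A = rng.normal(size=(40, 100))           # stand-in for the transform
x_true = np.zeros(100)
x_true[[5, 27, 63]] = [2.0, -1.5, 3.0]   # sparse ground truth
y = A @ x_true

x_hat = ista(A, y, lam=0.05)
# The support of the sparse signal is recovered:
assert set(np.flatnonzero(np.abs(x_hat) > 0.5)) == {5, 27, 63}
```

The dozens of forward/adjoint transform pairs inside the loop are exactly what a learned (deep residual) surrogate can amortize.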

  • Normalized and Geometry-Aware Self-Attention Network for Image Captioning
    arXiv.cs.MM Pub Date : 2020-03-19
    Longteng Guo; Jing Liu; Xinxin Zhu; Peng Yao; Shichen Lu; Hanqing Lu

    Self-attention (SA) network has shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization is previously only applied outside SA, we introduce a novel normalization method and

  • Zero-shot Learning for Audio-based Music Classification and Tagging
    arXiv.cs.MM Pub Date : 2019-07-05
    Jeong Choi; Jongpil Lee; Jiyoung Park; Juhan Nam

    Audio-based music classification and tagging is typically based on categorical supervised learning with a fixed set of labels. This intrinsically cannot handle unseen labels such as newly added music genres or semantic words that users arbitrarily choose for music retrieval. Zero-shot learning can address this problem by leveraging an additional semantic space of labels where side information about
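The core zero-shot mechanism can be sketched with a nearest-neighbor search in a shared semantic space. The 3-d "word vectors" and the audio embedding below are hypothetical stand-ins, not the paper's learned representations:

```python
import numpy as np

# Hypothetical semantic word vectors for genre labels.
word_vec = {
    "rock":  np.array([0.9, 0.1, 0.0]),
    "jazz":  np.array([0.1, 0.9, 0.1]),
    "metal": np.array([0.8, 0.0, 0.3]),   # label never seen in training
}
seen = ["rock", "jazz"]

# Pretend the trained model mapped an audio clip into the same space:
audio_emb = np.array([0.85, 0.05, 0.25])

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Zero-shot classification: rank *all* labels, including unseen ones.
scores = {g: cos(audio_emb, v) for g, v in word_vec.items()}
best = max(scores, key=scores.get)
assert best == "metal"                    # an unseen label can still win
```

Because labels live in a continuous semantic space, newly added genres need only a word vector, not retraining.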

  • Binocular Rivalry Oriented Predictive Auto-Encoding Network for Blind Stereoscopic Image Quality Assessment
    arXiv.cs.MM Pub Date : 2019-09-04
    Jiahua Xu; Wei Zhou; Zhibo Chen; Suiyi Ling; Patrick Le Callet

    Stereoscopic image quality assessment (SIQA) has encountered non-trivial challenges due to the fast proliferation of 3D contents. In the past years, deep learning oriented SIQA methods have emerged and achieved spectacular performance compared to conventional algorithms, which rely only on extracting hand-crafted features. However, most existing deep SIQA evaluators are not specifically built

  • 3D Dynamic Point Cloud Denoising via Spatial-Temporal Graph Learning
    arXiv.cs.MM Pub Date : 2020-03-17
    Wei Hu; Qianjiang Hu; Zehua Wang; Xiang Gao

    The prevalence of accessible depth sensing and 3D laser scanning techniques has enabled the convenient acquisition of 3D dynamic point clouds, which provide efficient representation of arbitrarily-shaped objects in motion. Nevertheless, dynamic point clouds are often perturbed by noise due to hardware, software or other causes. While a plethora of methods have been proposed for static point cloud denoising

  • Remembering Winter Was Coming: Character-Oriented Video Summaries of TV Series
    arXiv.cs.MM Pub Date : 2019-09-05
    Xavier Bost (LIA); Serigne Gueye (LIA); Vincent Labatut (LIA); Martha Larson (DMIR); Georges Linarès (LIA); Damien Malinas (CNELIAS); Raphaël Roth (CNELIAS)

    Today's popular TV series tend to develop continuous, complex plots spanning several seasons, but are often viewed in controlled and discontinuous conditions. Consequently, most viewers need to be re-immersed in the story before watching a new season. Although discussions with friends and family can help, we observe that most viewers make extensive use of summaries to re-engage with the plot. Automatic

  • Hide Secret Information in Blocks: Minimum Distortion Embedding
    arXiv.cs.MM Pub Date : 2020-03-17
    Md Amiruzzaman; Rizal Mohd Nor

    In this paper, a new steganographic method is presented that provides minimum distortion in the stego image. The proposed encoding algorithm focuses on the DCT rounding error and optimizes it to reduce distortion in the stego image, producing less distortion than existing methods (e.g., the F5 algorithm). The proposed method is based on the DCT rounding error, which helps to
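The role of the rounding error can be sketched as follows: changing a quantized coefficient by one step costs least extra distortion where the rounding error is already close to half a quantization step. This is a numpy toy with an orthonormal DCT and a flat quantizer, a generic cost model rather than the exact proposed scheme:

```python
import numpy as np

def dct_mat(n=8):
    """Orthonormal DCT-II matrix (rows = frequencies)."""
    k = np.arange(n)
    M = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    M[0] /= np.sqrt(2)
    return M * np.sqrt(2 / n)

rng = np.random.default_rng(4)
block = rng.uniform(0, 255, (8, 8))       # one 8x8 pixel block
D = dct_mat()
coeffs = D @ block @ D.T                  # 2D DCT of the block
q = 10.0                                  # flat quantization step
raw = coeffs / q
err = raw - np.round(raw)                 # rounding error in (-0.5, 0.5]

# Moving a rounded coefficient one step toward its unrounded value changes
# the quantization error from |err| to 1 - |err|, so the extra distortion
# (1-|err|)^2 - err^2 = 1 - 2|err| is smallest where |err| is near 0.5.
extra = (1 - np.abs(err).ravel()) ** 2 - err.ravel() ** 2
order = np.argsort(-np.abs(err).ravel())  # best embedding positions first
assert extra[order[0]] <= extra[order[-1]]
```

Choosing embedding positions by this cost is what lets rounding-error-aware schemes beat cost-agnostic ones like F5.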

  • Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method
    arXiv.cs.MM Pub Date : 2020-03-17
    Cunhang Fan; Jianhua Tao; Bin Liu; Jiangyan Yi; Zhengqi Wen; Xuefei Liu

    In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. First, a time-frequency domain speech separation method is applied as the pre-separation stage. The aim of the pre-separation stage is to separate the mixture preliminarily. Although this stage can separate the mixture, it still contains the residual interference

  • Reinforcement Learning Driven Adaptive VR Streaming with Optical Flow Based QoE
    arXiv.cs.MM Pub Date : 2020-03-17
    Wei Quan; Yuxuan Pan; Bin Xiang; Lin Zhang

    With the merit of containing full panoramic content in one camera, Virtual Reality (VR) and 360-degree videos have attracted more and more attention in the field of industrial cloud manufacturing and training. Industrial Internet of Things (IoT) scenarios, where many VR terminals need to be online at the same time, can hardly guarantee VR's bandwidth requirement. However, by making use of users' quality of

  • Parameter-Free Style Projection for Arbitrary Style Transfer
    arXiv.cs.MM Pub Date : 2020-03-17
    Siyu Huang; Haoyi Xiong; Tianyang Wang; Qingzhong Wang; Zeyu Chen; Jun Huan; Dejing Dou

    Arbitrary image style transfer is a challenging task which aims to stylize a content image conditioned on an arbitrary style image. In this task the content-style feature transformation is a critical component for a proper fusion of features. Existing feature transformation algorithms often suffer from unstable learning, loss of content and style details, and non-natural stroke patterns. To mitigate
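A widely used baseline for the content-style feature transformation mentioned here is adaptive instance normalization (AdaIN), which aligns channel-wise feature statistics. This numpy sketch shows that baseline, plainly *not* the paper's Style Projection, on random stand-in feature maps:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: give the content features the
    channel-wise mean/std of the style features (C x H x W arrays)."""
    c_mu = content.mean((1, 2), keepdims=True)
    c_sd = content.std((1, 2), keepdims=True)
    s_mu = style.mean((1, 2), keepdims=True)
    s_sd = style.std((1, 2), keepdims=True)
    return s_sd * (content - c_mu) / (c_sd + eps) + s_mu

rng = np.random.default_rng(5)
c = rng.normal(0, 1, (16, 8, 8))   # stand-in content feature maps
s = rng.normal(3, 2, (16, 8, 8))   # stand-in style feature maps
out = adain(c, s)
# The transformed features carry the style's per-channel statistics:
assert np.allclose(out.mean((1, 2)), s.mean((1, 2)))
```

Like Style Projection, AdaIN is parameter-free; the instabilities the abstract mentions arise in *learned* transformation modules.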

  • Characterizing Generalized Rate-Distortion Performance of Video Coding: An Eigen Analysis Approach
    arXiv.cs.MM Pub Date : 2019-12-15
    Zhengfang Duanmu (University of Waterloo, Canada); Wentao Liu (University of Waterloo, Canada); Zhuoran Li (University of Waterloo, Canada); Kede Ma (City University of Hong Kong, Hong Kong, China); Zhou Wang (University of Waterloo, Canada)

    Rate-distortion (RD) theory is at the heart of lossy data compression. Here we aim to model the generalized RD (GRD) trade-off between the visual quality of a compressed video and its encoding profiles (e.g., bitrate and spatial resolution). We first define the theoretical functional space $\mathcal{W}$ of the GRD function by analyzing its mathematical properties. We show that $\mathcal{W}$ is a convex

  • Estimation of Rate Control Parameters for Video Coding Using CNN
    arXiv.cs.MM Pub Date : 2020-03-13
    Maria Santamaria; Ebroul Izquierdo; Saverio Blasi; Marta Mrak

    Rate-control is essential to ensure efficient video delivery. Typical rate-control algorithms rely on bit allocation strategies to appropriately distribute bits among frames. As reference frames are essential for exploiting temporal redundancies, intra frames are usually assigned a larger portion of the available bits. In this paper, an accurate method to estimate the number of bits and the quality of intra
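The bit-allocation idea (intra frames get a larger share of the budget) can be sketched as a complexity-weighted split. The complexity values and intra weight below are toy assumptions, not the paper's CNN-based estimates:

```python
import numpy as np

# Toy GOP: per-frame complexity estimates (e.g. SATD); frame 0 is the intra frame.
complexity = np.array([50.0, 10.0, 12.0, 9.0, 11.0])
intra_weight = 4.0                  # give the intra frame a larger share
budget = 100_000                    # bits available for the whole GOP

w = complexity.copy()
w[0] *= intra_weight                # boost the intra frame's weight
bits = budget * w / w.sum()         # proportional bit allocation
assert np.isclose(bits.sum(), budget) and bits.argmax() == 0
```

A learned estimator replaces the hand-tuned `complexity` and `intra_weight` with per-frame predictions of bits and quality.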

  • Exploring the Role of Visual Content in Fake News Detection
    arXiv.cs.MM Pub Date : 2020-03-11
    Juan Cao; Peng Qi; Qiang Sheng; Tianyun Yang; Junbo Guo; Jintao Li

    The increasing popularity of social media promotes the proliferation of fake news, which has caused significant negative societal effects. Therefore, fake news detection on social media has recently become an emerging research area of great concern. With the development of multimedia technology, fake news attempts to utilize multimedia content with images or videos to attract and mislead consumers

  • Prediction, Communication, and Computing Duration Optimization for VR Video Streaming
    arXiv.cs.MM Pub Date : 2019-10-30
    Xing Wei; Chenyang Yang; Shengqian Han

    Proactive tile-based video streaming can avoid motion-to-photon latency of wireless virtual reality (VR) by computing and delivering the predicted tiles to be requested before playback. All existing works either focus on the task of tile prediction or on the tasks of computing and communications, overlooking the facts that these successively executed tasks have to share the same duration to avoid the

  • Learning to Fuse Music Genres with Generative Adversarial Dual Learning
    arXiv.cs.MM Pub Date : 2017-12-05
    Zhiqian Chen; Chih-Wei Wu; Yen-Cheng Lu; Alexander Lerch; Chang-Tien Lu

    FusionGAN is a novel genre fusion framework for music generation that integrates the strengths of generative adversarial networks and dual learning. In particular, the proposed method offers a dual learning extension that can effectively integrate the styles of the given domains. To efficiently quantify the difference among diverse domains and avoid the vanishing gradient issue, FusionGAN provides

  • Transferring Cross-domain Knowledge for Video Sign Language Recognition
    arXiv.cs.MM Pub Date : 2020-03-08
    Dongxu Li; Xin Yu; Chenchen Xu; Lars Petersson; Hongdong Li

    Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation. It requires models to recognize isolated sign words from videos. However, annotating WSLR data needs expert knowledge, thus limiting WSLR dataset acquisition. On the contrary, there are abundant subtitled sign news videos on the internet. Since these videos have no word-level annotation and exhibit a

  • Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism
    arXiv.cs.MM Pub Date : 2020-03-09
    Hao Wang; Doyen Sahoo; Chenghao Liu; Ke Shu; Palakorn Achananuparp; Ee-peng Lim; Steven C. H. Hoi

    Cross-modal food retrieval is an important task to perform analysis of food-related information, such as food images and cooking recipes. The goal is to learn an embedding of images and recipes in a common feature space, so that precise matching can be realized. Compared with existing cross-modal retrieval approaches, two major challenges in this specific problem are: 1) the large intra-class variance

  • Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds
    arXiv.cs.MM Pub Date : 2020-03-09
    Arun Balajee Vasudevan; Dengxin Dai; Luc Van Gool

    Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight

  • Learning Cross-Modal Embeddings with Adversarial Networks for Cooking Recipes and Food Images
    arXiv.cs.MM Pub Date : 2019-05-03
    Hao Wang; Doyen Sahoo; Chenghao Liu; Ee-peng Lim; Steven C. H. Hoi

    Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and healthy lifestyle. An important task under the food-computing umbrella is retrieval, which is particularly helpful for health related applications, where we are interested in retrieving important information about food (e.g

  • Soft Video Multicasting Using Adaptive Compressed Sensing
    arXiv.cs.MM Pub Date : 2020-03-06
    Hadi Hadizadeh; Ivan V. Bajić

    Recently, soft video multicasting has gained a lot of attention, especially in broadcast and mobile scenarios where the bit rate supported by the channel may differ across receivers, and may vary quickly over time. Unlike the conventional designs that force the source to use a single bit rate according to the receiver with the worst channel quality, soft video delivery schemes transmit the video such

  • Trends and Advancements in Deep Neural Network Communication
    arXiv.cs.MM Pub Date : 2020-03-06
    Felix Sattler; Thomas Wiegand; Wojciech Samek

    Due to their great performance and scalability properties neural networks have become ubiquitous building blocks of many applications. With the rise of mobile and IoT, these models now are also being increasingly applied in distributed settings, where the owners of the data are separated by limited communication channels and privacy constraints. To address the challenges of these distributed environments

  • Cloud Rendering-based Volumetric Video Streaming System for Mixed Reality Services
    arXiv.cs.MM Pub Date : 2020-03-05
    Serhan Gül; Dimitri Podborski; Jangwoo Son; Gurdeep Singh Bhullar; Thomas Buchholz; Thomas Schierl; Cornelius Hellge

    Volumetric video is an emerging technology for immersive representation of 3D spaces that captures objects from all directions using multiple cameras and creates a dynamic 3D model of the scene. However, rendering volumetric content requires high amounts of processing power and is still a very demanding task for today's mobile devices. To mitigate this, we propose a volumetric video streaming system

  • Region Adaptive Graph Fourier Transform for 3D Point Clouds
    arXiv.cs.MM Pub Date : 2020-03-04
    Eduardo Pavez; Benjamin Girault; Antonio Ortega; Philip A. Chou

    We introduce the Region Adaptive Graph Fourier Transform (RA-GFT) for compression of 3D point cloud attributes. We assume the points are organized by a family of nested partitions represented by a tree. The RA-GFT is a multiresolution transform, formed by combining spatially localized block transforms. At each resolution level, attributes are processed in clusters by a set of block transforms. Each
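The building block, a graph Fourier transform of attributes within one block of points, can be sketched in numpy: eigenvectors of the graph Laplacian play the role that DCT basis functions play for images. The points, weights, and neighborhood threshold below are illustrative assumptions, not the RA-GFT construction:

```python
import numpy as np

# One small block of a point cloud and a scalar attribute per point.
pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]], dtype=float)
attr = np.array([100.0, 102.0, 101.0, 180.0])    # e.g. luminance per point

# Weighted adjacency: inverse-distance weights between nearby points.
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
W = np.where((d > 0) & (d < 1.8), 1.0 / d, 0.0)
L = np.diag(W.sum(1)) - W                        # combinatorial graph Laplacian

evals, U = np.linalg.eigh(L)                     # GFT basis, low to high frequency
spec = U.T @ attr                                # forward GFT
assert np.allclose(U @ spec, attr)               # inverse GFT reconstructs exactly
```

Smooth attributes concentrate energy in the low-frequency coefficients, which is what makes such block transforms useful for compression; RA-GFT combines many of them across the levels of a nested partition.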

  • ASMD: an automatic framework for compiling multimodal datasets
    arXiv.cs.MM Pub Date : 2020-03-04
    Federico Simonetta; Stavros Ntalampiras; Federico Avanzini

    This paper describes an open-source Python framework for handling datasets for music processing tasks, built with the aim of improving the reproducibility of research projects in music computing and assessing the generalization abilities of machine learning models. The framework enables the automatic download and installation of several commonly used datasets for multimodal music processing. Specifically

  • Harmonics Based Representation in Clarinet Tone Quality Evaluation
    arXiv.cs.MM Pub Date : 2020-03-01
    Yixin Wang; Xiaohong Guan; Youtian Du; Nan Nan

    Music tone quality evaluation is generally performed by experts. It can be subjective, short of consistency and fairness, and time-consuming. In this paper we present a new method for identifying clarinet reed quality by evaluating tone quality based on the harmonic structure and energy distribution. We first decouple the quality of the reed and the clarinet pipe based on the acoustic harmonics
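Measuring a tone's energy distribution over harmonics can be sketched with a plain FFT. The synthetic odd-harmonic-dominated tone below is an illustrative stand-in for a clarinet recording, not the paper's data or features:

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                  # one second of audio
f0 = 220.0                              # fundamental frequency
# Synthetic clarinet-like tone: odd harmonics dominate.
amps = {1: 1.0, 2: 0.05, 3: 0.6, 4: 0.05, 5: 0.3}
x = sum(a * np.sin(2 * np.pi * k * f0 * t) for k, a in amps.items())

spec = np.abs(np.fft.rfft(x)) ** 2      # power spectrum
freqs = np.fft.rfftfreq(len(x), 1 / fs)

def harmonic_energy(k, width=5.0):
    """Energy in a narrow band around the k-th harmonic."""
    band = np.abs(freqs - k * f0) < width
    return spec[band].sum()

energies = np.array([harmonic_energy(k) for k in range(1, 6)])
ratios = energies / energies.sum()      # energy distribution over harmonics
# Odd harmonics carry most of the energy in this tone:
assert ratios[0] + ratios[2] + ratios[4] > 0.9
```

A quality classifier would use such harmonic-energy ratios (and their deviation from a reference profile) as features.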

  • Towards Automatic Face-to-Face Translation
    arXiv.cs.MM Pub Date : 2020-03-01
    Prajwal K R; Rudrabha Mukhopadhyay; Jerin Philip; Abhishek Jha; Vinay Namboodiri; C. V. Jawahar

    In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work

  • Weak Texture Information Map Guided Image Super-resolution with Deep Residual Networks
    arXiv.cs.MM Pub Date : 2020-03-01
    Bo Fu; Liyan Wang; Yuechu Wu; Yufeng Wu; Shilin Fu; Yonggong Ren

    Single image super-resolution (SISR) is an image processing task that reconstructs a high-resolution (HR) image from a low-resolution (LR) image. Recently, owing to their capability in feature extraction, a series of deep learning methods have brought crucial improvements to SISR. However, we observe that no matter how deep the networks are designed, they usually do not have good generalization ability
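    The paper's exact construction of the weak-texture map is truncated above. A common, simple proxy for texture strength is per-pixel local variance; the sketch below uses that proxy and is a hypothetical illustration, not the authors' definition.

```python
import numpy as np

def local_variance_map(img, patch=3):
    """Per-pixel local variance as a crude texture-strength map.

    img: 2D grayscale array in [0, 1]. Low values mark weak-texture
    (flat) regions; high values mark edges and textured regions.
    """
    pad = patch // 2
    padded = np.pad(img, pad, mode="reflect")
    h, w = img.shape
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].var()
    return out

flat = np.full((8, 8), 0.5)                       # weak-texture region
noisy = np.random.default_rng(0).random((8, 8))   # strongly textured region
```

    A map like this could weight the loss or guide the network toward regions where super-resolution is hardest; the paper's actual guidance mechanism may differ.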

  • An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos
    arXiv.cs.MM Pub Date : 2020-02-12
    Sicheng Zhao; Yunsheng Ma; Yang Gu; Jufeng Yang; Tengfei Xing; Pengfei Xu; Runbo Hu; Hua Chai; Kurt Keutzer

    Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio

  • Improved Image Coding Autoencoder With Deep Learning
    arXiv.cs.MM Pub Date : 2020-02-28
    Licheng Xiao; Hairong Wang; Nam Ling

    In this paper, we build autoencoder-based pipelines for extreme end-to-end image compression based on Ballé's approach, which is the state-of-the-art open-source implementation of image compression using deep learning. We deepened the network by adding one more hidden layer before each strided convolutional layer, keeping exactly the same number of down-samplings and up-samplings. Our approach outperformed
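    Since the deepened network keeps the same number of down- and up-samplings, the spatial bottleneck size is unchanged. The standard strided-convolution output-size formula makes this concrete; the kernel/stride/padding values below are illustrative defaults, not the paper's exact hyperparameters.

```python
def conv_out_size(n, kernel, stride, pad):
    """Output length along one spatial axis of a strided convolution:
    floor((n + 2*pad - kernel) / stride) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

def downsample_chain(n, num_stages, kernel=5, stride=2, pad=2):
    """Spatial sizes after each of num_stages stride-2 conv layers."""
    sizes = [n]
    for _ in range(num_stages):
        sizes.append(conv_out_size(sizes[-1], kernel, stride, pad))
    return sizes

# A 256-pixel axis halves at each of three stride-2 stages: 256 -> 128 -> 64 -> 32.
print(downsample_chain(256, 3))
```

    Extra stride-1 hidden layers (as the paper adds) leave this chain untouched, so capacity grows without changing the compression ratio implied by the bottleneck.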

  • Learning to Shade Hand-drawn Sketches
    arXiv.cs.MM Pub Date : 2020-02-26
    Qingyuan Zheng; Zhuoru Li; Adam Bargteil

    We present a fully automatic method to generate detailed and accurate artistic shadows from pairs of line drawing sketches and lighting directions. We also contribute a new dataset of one thousand examples of pairs of line drawings and shadows that are tagged with lighting directions. Remarkably, the generated shadows quickly communicate the underlying 3D structure of the sketched scene. Consequently

  • BBAND Index: A No-Reference Banding Artifact Predictor
    arXiv.cs.MM Pub Date : 2020-02-27
    Zhengzhong Tu; Jessie Lin; Yilin Wang; Balu Adsumilli; Alan C. Bovik

    Banding artifact, or false contouring, is a common video compression impairment that tends to appear on large flat regions in encoded videos. These staircase-shaped color bands can be very noticeable in high-definition videos. Here we study this artifact, and propose a new distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND
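    The staircase structure the abstract describes can be illustrated in one dimension: a banding edge is a small intensity step flanked by flat runs. The detector below is a minimal hypothetical sketch with invented thresholds, not the BBAND algorithm.

```python
import numpy as np

def band_edges_1d(row, flat_run=4, max_step=4):
    """Indices of candidate banding edges in a 1D pixel row.

    A banding edge is a small intensity step (0 < |step| <= max_step)
    with constant runs of at least flat_run pixels on both sides.
    Illustrative only; the BBAND index uses a more elaborate model.
    """
    diffs = np.diff(row.astype(int))
    edges = []
    for i, d in enumerate(diffs):
        if 0 < abs(d) <= max_step:
            left = diffs[max(0, i - flat_run):i]
            right = diffs[i + 1:i + 1 + flat_run]
            if (len(left) == flat_run and len(right) == flat_run
                    and not left.any() and not right.any()):
                edges.append(i)
    return edges

# Two faint staircase steps in an otherwise flat row.
row = np.array([10] * 6 + [12] * 6 + [14] * 6, dtype=np.uint8)
```

    A large, sharp step (a real edge) fails the `max_step` test, which is the intuition behind separating banding from legitimate image content.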

  • Subjective Quality Assessment for YouTube UGC Dataset
    arXiv.cs.MM Pub Date : 2020-02-27
    Joong Gon Yim; Yilin Wang; Neil Birkbeck; Balu Adsumilli

    Due to the scale of social video sharing, User Generated Content (UGC) is getting more attention from academia and industry. To facilitate compression-related research on UGC, YouTube has released a large-scale dataset. The initial dataset only provided videos, limiting its use in quality assessment. We used a crowd-sourcing platform to collect subjective quality scores for this dataset. We analyzed
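    Crowd-sourced ratings like those described above are conventionally aggregated into a mean opinion score (MOS) with a confidence interval per video. A minimal sketch, using standard statistics rather than the study's specific screening and aggregation procedure:

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and approximate 95% confidence half-width
    for one video's ratings (e.g., 1-5 ACR scale)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mean, half_width

# Hypothetical ratings for one video from eight raters.
scores = [4, 5, 4, 3, 4, 5, 4, 4]
mos, hw = mos_with_ci(scores)
```

    Real subjective studies additionally screen for unreliable raters (e.g., per ITU-R BT.500-style outlier rejection) before computing the MOS.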

Contents have been reproduced by permission of the publishers.