• arXiv.cs.MM Pub Date : 2020-04-06

In unsecured network environments, ownership protection of digital content, such as images, is a growing concern. Different watermarking methods have been proposed to address the copyright protection of digital materials. Watermarking methods are challenged by the conflicting requirements of imperceptibility and robustness. While embedding a watermark with a high strength factor increases robustness
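The strength-factor trade-off above can be sketched with a generic additive spread-spectrum embedder and a correlation detector (a minimal illustration, not the specific method surveyed here; all names and constants are illustrative):

```python
import random

def embed(host, watermark, alpha):
    """Additive embedding: a higher strength factor alpha makes the mark
    more robust to detect, but also more visible (larger distortion)."""
    return [p + alpha * w for p, w in zip(host, watermark)]

def detect(pixels, watermark):
    """Linear correlation detector; a score near alpha indicates presence."""
    return sum(p * w for p, w in zip(pixels, watermark)) / len(pixels)

random.seed(0)
host = [random.randint(0, 255) for _ in range(4096)]
wm = [random.choice([-1, 1]) for _ in range(4096)]

for alpha in (1, 4, 16):
    marked = embed(host, wm, alpha)
    mse = sum((m - h) ** 2 for m, h in zip(marked, host)) / len(host)  # imperceptibility cost
    margin = detect(marked, wm) - detect(host, wm)                     # robustness margin
    print(f"alpha={alpha:2d}  MSE={mse:6.1f}  margin={margin:5.1f}")
```

With a ±1 watermark, the distortion grows as alpha squared while the detection margin grows only linearly in alpha, which is exactly the conflict the abstract describes.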

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-07
Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

Direct speech-to-image translation without text is an interesting and useful topic due to its potential applications in human-computer interaction, art creation, computer-aided design, etc., not to mention that many languages have no written form. However, as far as we know, it has not been well studied how to translate speech signals into images directly and how well they can be translated. In

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-06
Jarek Duda

Image compression with upsampling encodes information to successively increase image resolution, for example by encoding differences, as in FUIF and JPEG XL. It is useful for progressive decoding and can often improve the compression ratio. However, currently used solutions do not fully exploit context dependence for encoding such upscaling information. This article discusses simple, inexpensive, general
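The coarse-plus-differences idea behind progressive decoding can be illustrated with a 1-D pyramid: keep the coarsest sample and, per resolution level, the residuals against a nearest-neighbour prediction (a generic sketch, not the actual FUIF/JPEG XL coding):

```python
def upsample(coarse, n):
    """Nearest-neighbour prediction of a length-n signal from its 2x-decimated version."""
    return [coarse[min(i // 2, len(coarse) - 1)] for i in range(n)]

def encode(signal):
    """Pyramid encoding: coarsest sample plus per-level prediction residuals."""
    residuals = []
    cur = list(signal)
    while len(cur) > 1:
        coarse = cur[::2]
        pred = upsample(coarse, len(cur))
        residuals.append([a - b for a, b in zip(cur, pred)])
        cur = coarse
    return cur, residuals[::-1]  # coarsest level first, for progressive decoding

def decode(base, residuals):
    """Progressively refine: each level doubles resolution by adding residuals."""
    cur = list(base)
    for res in residuals:
        cur = [p + r for p, r in zip(upsample(cur, len(res)), res)]
    return cur

sig = [3, 1, 4, 1, 5, 9, 2, 6]
base, res = encode(sig)
print(decode(base, res))  # lossless round-trip
```

Context modelling, which the article targets, would then condition the entropy coding of each residual on already-decoded neighbours.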

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-03
Ping Hu; Fabian Caba Heilbron; Oliver Wang; Zhe Lin; Stan Sclaroff; Federico Perazzi

We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we distribute these sub-networks over sequential frames. Therefore

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-06
Anyi Rao; Linning Xu; Yu Xiong; Guodong Xu; Qingqiu Huang; Bolei Zhou; Dahua Lin

Scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes serves as a critical step towards semantic understanding of movies. This is very challenging compared to the videos studied in conventional vision problems, e.g., action recognition, as scenes in movies usually contain

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-03
Jan-Niklas Voigt-Antons; Eero Lehtonen; Andres Pinilla Palacios; Danish Ali; Tanja Kojić; Sebastian Möller

In recent years, 360$^{\circ}$ videos have become increasingly popular. For traditional media presentations, e.g., on a computer screen, a wide range of assessment methods are available. Different constructs, such as perceived quality or the induced emotional state of viewers, can be reliably assessed by subjective scales. Many of the subjective methods have only been validated using stimuli presented

Updated: 2020-04-06
• arXiv.cs.MM Pub Date : 2020-04-03
Tanja Kojić; Danish Ali; Robert Greinacher; Sebastian Möller; Jan-Niklas Voigt-Antons

Virtual Reality (VR) has an increasing impact on the market in many fields, from education and medicine to engineering and entertainment, through applications that replicate, or in the case of augmentation enhance, real-life scenarios. Aiming to present realistic environments, VR applications include the text that surrounds us every day. However, text can only add value to

Updated: 2020-04-06
• arXiv.cs.MM Pub Date : 2020-04-03
Robert Greinacher; Tanja Kojić; Luis Meier; Rudresha Gulaganjihalli Parameshappa; Sebastian Möller; Jan-Niklas Voigt-Antons

Combining interconnected wearables provides fascinating opportunities, such as augmenting exergaming with virtual coaches or giving feedback on the execution of sports activities and how to improve them. Breathing rhythm is a particularly interesting physiological dimension, since it is easy and unobtrusive to measure, and the data gained provide valuable insights regarding the correct execution of movements, especially

Updated: 2020-04-06
• arXiv.cs.MM Pub Date : 2019-12-06

This paper presents a reversible data hiding (RDH) scheme for encrypted images that employs basic notions of RDH in plain-image schemes, including histogram modification and prediction-error computation. In the proposed method, the original image may be encrypted with any desired encryption algorithm. The most significant bits (MSBs) of encrypted pixels are integrated to vacate room for embedding data bits. The integrated ones will
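The MSB-vacating step can be sketched as follows (a simplification: the full scheme also preserves the original MSBs, e.g. via prediction errors, so that the cover image remains recoverable):

```python
def embed_bit(pixel, bit):
    """Replace the most significant bit of an 8-bit encrypted pixel with one payload bit."""
    return (pixel & 0x7F) | (bit << 7)

def extract_bit(pixel):
    """Read the payload bit back from the vacated MSB position."""
    return pixel >> 7

encrypted = [0x3A, 0xC1, 0x7F, 0x80]   # toy ciphertext pixels
payload = [1, 0, 1, 1]
stego = [embed_bit(p, b) for p, b in zip(encrypted, payload)]
print([extract_bit(p) for p in stego])  # payload recovered from the stego pixels
```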

Updated: 2020-04-06
• arXiv.cs.MM Pub Date : 2020-04-02
Zhicheng Huang; Zhaoyang Zeng; Bei Liu; Dongmei Fu; Jianlong Fu

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision-and-language tasks.

Updated: 2020-04-03
• arXiv.cs.MM Pub Date : 2020-04-02
Alexander Schindler; Andrew Lindley; Anahid Jalali; Martin Boyer; Sergiu Gordea; Ross King

The forensic investigation of a terrorist attack poses a significant challenge to the investigative authorities, as often several thousand hours of video footage must be viewed. Large-scale Video Analytic Platforms (VAPs) assist law enforcement agencies (LEAs) in identifying suspects and securing evidence. Current platforms focus primarily on the integration of different computer vision methods and thus

Updated: 2020-04-03
• arXiv.cs.MM Pub Date : 2020-03-28
Tooba Aamir; Hai Dong; Athman Bouguettaya

The extensive use of social media platforms and the overwhelming amount of imagery data create unique opportunities for sensing, gathering, and sharing information about events. One potential application is to leverage crowdsourced social media images to create a tapestry scene for scene analysis of designated locations and time intervals. Existing attempts, however, ignore the temporal-semantic

Updated: 2020-04-01
• arXiv.cs.MM Pub Date : 2019-10-08
Gyordan Caminati; Sara Kiade; Gabriele D'Angelo; Stefano Ferretti; Vittorio Ghini

DTLS is a protocol that provides security guarantees to Internet communications. It can operate on top of both TCP and UDP transport protocols. Thus, it is particularly suited for peer-to-peer and distributed multimedia applications. The same holds if the endpoints are mobile devices. In this scenario, mechanisms are needed to surmount possible network disconnections, often arising due to the mobility

Updated: 2020-04-01
• arXiv.cs.MM Pub Date : 2020-03-28
Tobias Hossfeld; Poul E. Heegaard; Martin Varela; Lea Skorin-Kapov; Markus Fiedler

In the context of QoE management, network and service providers commonly rely on models that map system QoS conditions (e.g., system response time, packet loss, etc.) to estimated end-user QoE values. Observable QoS conditions in the system may be assumed to follow a certain distribution, meaning that different end users will experience different conditions. On the other hand, drawing from the results
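Such a mapping can be sketched with an exponential IQX-style QoS-to-QoE relationship (the constants below are illustrative, not from this paper), which also makes the distribution point concrete: mapping the mean QoS is not the same as averaging QoE over the user population.

```python
import math
import random

def qoe(loss_pct, a=3.0, b=0.5, c=1.0):
    """Exponential (IQX-style) mapping from packet loss (%) to a MOS-like
    QoE score in [c, a + c]; constants are illustrative."""
    return a * math.exp(-b * loss_pct) + c

random.seed(1)
loss_samples = [random.expovariate(1.0) for _ in range(10000)]  # per-user packet loss (%)

mean_of_qoe = sum(qoe(x) for x in loss_samples) / len(loss_samples)
qoe_of_mean = qoe(sum(loss_samples) / len(loss_samples))
# The mapping is convex, so by Jensen's inequality E[QoE(X)] >= QoE(E[X]):
print(round(mean_of_qoe, 2), ">", round(qoe_of_mean, 2))
```

This gap between the expected QoE and the QoE of the expected QoS is exactly why per-user QoS distributions matter for provider-side QoE estimates.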

Updated: 2020-03-31
• arXiv.cs.MM Pub Date : 2020-03-30
Shivam Agarwal; Siddarth Venkatraman

Steganography is the art of hiding a secret message inside a publicly visible carrier message. Ideally, it is done without modifying the carrier, and with minimal loss of information in the secret message. Recently, various deep learning based approaches to steganography have been applied to different message types. We propose a deep learning based technique to hide a source RGB image message inside
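For contrast with the learned approach described here, the classical non-learned baseline for hiding one image inside another simply stores the secret's high bits in the carrier's low bits (a generic LSB sketch; the trade-off parameter k is illustrative):

```python
def hide(carrier, secret, k=4):
    """Store the top-k bits of each secret byte in the low-k bits of the
    matching carrier byte: larger k means better secret fidelity but a
    more visibly modified carrier."""
    mask = (1 << k) - 1
    return [(c & ~mask & 0xFF) | (s >> (8 - k)) for c, s in zip(carrier, secret)]

def reveal(stego, k=4):
    """Recover the secret up to the dropped low (8-k) bits."""
    mask = (1 << k) - 1
    return [(p & mask) << (8 - k) for p in stego]

carrier = [200, 100, 55, 0]
secret = [240, 16, 129, 255]
stego = hide(carrier, secret)
print(reveal(stego))  # approximate secret; both losses are bounded by 2**(8-k)
```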

Updated: 2020-03-31
• arXiv.cs.MM Pub Date : 2019-10-13
Tobia Tesan; Pasquale Coscia; Lamberto Ballan

Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or uncommon images that need more context to be correctly annotated. Metadata accompanying images on social media represent an ideal source of additional information for retrieving proper neighborhoods, easing the image annotation task. To this

Updated: 2020-03-31
• arXiv.cs.MM Pub Date : 2020-03-27
Alexander Schindler; Sergiu Gordea; Peter Knees

We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness. By applying Latent Semantic Indexing (LSI), we embed corresponding textual information into a latent vector space from which we derive track relatedness for online triplet selection. This LSI
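The selected triplets feed a standard margin-based triplet objective; a sketch of that loss on toy embeddings (the vectors and margin below are illustrative, not the paper's):

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the related (positive) track is closer to the anchor than
    the unrelated (negative) track by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0]))  # well separated: 0.0
print(triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0]))  # violated: 4.0
```

Online selection then prefers triplets with non-zero loss, which is where track-relatedness estimates from LSI come in.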

Updated: 2020-03-30
• arXiv.cs.MM Pub Date : 2020-03-08
Yurui Ming; Weiping Ding; Zehong Cao; Chin-Teng Lin

Technologies of the Internet of Things (IoT) enable digital content such as images to be acquired on a massive scale. However, considerations of privacy and legislation still demand intellectual content protection. In this paper, we propose a general deep neural network (DNN) based watermarking method to fulfill this goal. Instead of training a neural network for protecting

Updated: 2020-03-30
• arXiv.cs.MM Pub Date : 2020-03-24
Helard Martinez; Andrew Hines; Mylene C. Q. Farias

The development of audio-visual quality assessment models poses a number of challenges in order to obtain accurate predictions. One of these challenges is the modelling of the complex interaction that audio and visual stimuli have and how this interaction is interpreted by human users. The No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd) deals with this problem from a

Updated: 2020-03-26
• arXiv.cs.MM Pub Date : 2020-03-25
Babak Naderi; Tobias Hossfeld; Matthias Hirth; Florian Metzger; Sebastian Möller; Rafael Zequeira Jiménez

The subjective quality of transmitted speech is traditionally assessed in a controlled laboratory environment according to ITU-T Rec. P.800. In turn, with crowdsourcing, crowdworkers participate in a subjective online experiment using their own listening device, and in their own working environment. Despite such less controllable conditions, the increased use of crowdsourcing micro-task platforms for

Updated: 2020-03-26
• arXiv.cs.MM Pub Date : 2020-03-21
Tao Guo; Xikang Jiang; Bin Xiang; Lin Zhang

Omnidirectional applications are immersive and highly interactive, which can improve the efficiency of remote collaborative work among factory workers. The transmission of omnidirectional video (OV) is the most important step in implementing virtual remote collaboration. Compared with the ordinary video transmission, OV transmission requires more bandwidth, which is still a huge burden even under 5G

Updated: 2020-03-24
• arXiv.cs.MM Pub Date : 2020-03-23
Théo Taburet; Patrick Bas; Wadih Sawaya; Remi Cogranne

This short paper proposes to use the statistical analysis of the correlation between DCT coefficients to design a new synchronization strategy that can be used for cost-based steganographic schemes in the JPEG domain. First, an analysis is performed on the covariance matrix of DCT coefficients of neighboring blocks after a development similar to the one used to generate BossBase. This analysis exhibits

Updated: 2020-03-24
• arXiv.cs.MM Pub Date : 2020-03-23
Venkatesh S. Kadandale; Juan F. Montesinos; Gloria Haro; Emilia Gómez

A fairly straightforward approach to music source separation is to train independent models, wherein each model is dedicated to estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation

Updated: 2020-03-24
• arXiv.cs.MM Pub Date : 2020-03-23
Eric Müller-Budack; Jonas Theiner; Sebastian Diering; Maximilian Idahl; Ralph Ewerth

The World Wide Web has become a popular source for gathering information and news. Multimodal information, e.g., text enriched with photos, is typically used to convey news more effectively or to attract attention. Photo content can be decorative, can depict additional important information, or can even be misleading. Therefore, automatic approaches to quantify cross-modal

Updated: 2020-03-24
• arXiv.cs.MM Pub Date : 2020-03-18
Pantelis Maniotis; Nikolaos Thomos

360$^o$ video is an essential component of VR/AR/MR systems that provides immersive experience to the users. However, 360$^o$ video is associated with high bandwidth requirements. The required bandwidth can be reduced by exploiting the fact that users are interested in viewing only a part of the video scene and that users request viewports that overlap with each other. Motivated by the findings of
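The bandwidth saving from overlapping viewports can be illustrated with a simplified tiling: split the 360° yaw range into equal slices and request only the tiles a viewport touches (tile count, field of view, and user orientations below are illustrative):

```python
def requested_tiles(yaw_center, fov_deg, n_tiles=8):
    """Indices of equal-width yaw tiles overlapped by a horizontal viewport,
    handling wrap-around at 360 degrees."""
    lo = (yaw_center - fov_deg / 2) % 360
    hi = (yaw_center + fov_deg / 2) % 360
    spans = [(lo, hi)] if lo < hi else [(lo, 360.0), (0.0, hi)]
    width = 360.0 / n_tiles
    return sorted(i for i in range(n_tiles)
                  if any(i * width < s1 and s0 < (i + 1) * width for s0, s1 in spans))

users = [requested_tiles(yaw, 90) for yaw in (0, 20, 350)]
union = set().union(*users)
print(users, len(union))  # serving the union needs far fewer tiles than per-user unicast
```

Caching or multicasting the union of requested tiles, rather than each user's set separately, is what exploits the overlap the abstract mentions.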

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-19
Tho Nguyen Duc; Chanh Minh Tran; Phan Xuan Tan; Eiji Kamioka

In video streaming services, continuously predicting the user's quality of experience (QoE) plays a crucial role in delivering high-quality streaming content to the user. However, the complexity caused by the temporal dependencies in QoE data and the non-linear relationships among QoE influence factors has introduced challenges to continuous QoE prediction. To deal with this, existing studies have utilized

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-19
Chanh Minh Tran; Tho Nguyen Duc; Phan Xuan Tan; Eiji Kamioka

HTTP/2 video streaming has attracted a lot of attention in the development of multimedia technologies over the last few years. In HTTP/2, the server push mechanism allows the server to deliver more video segments to the client within a single request in order to deal with the request-explosion problem. As a result, recent research efforts have been focusing on utilizing such a feature to enhance the

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-02-26
Nitish Nag; Bindu Rajanna; Ramesh Jain

With the exponential growth in the use of social media to share live updates about life, taking pictures has become an unavoidable phenomenon. Individuals unknowingly create a unique knowledge base with these images. Food images in particular are of interest, as they contain a plethora of information. From the image metadata and using computer vision tools, we can extract distinct insights for

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-19
Yuan Gao; Robert Bregovic; Reinhard Koch; Atanas Gotchev

The Image-Based Rendering (IBR) approach using Shearlet Transform (ST) is one of the most effective methods for Densely-Sampled Light Field (DSLF) reconstruction. The ST-based DSLF reconstruction typically relies on an iterative thresholding algorithm for Epipolar-Plane Image (EPI) sparse regularization in shearlet domain, involving dozens of transformations between image domain and shearlet domain

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-19
Longteng Guo; Jing Liu; Xinxin Zhu; Peng Yao; Shichen Lu; Hanqing Lu

Self-attention (SA) network has shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization is previously only applied outside SA, we introduce a novel normalization method and
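A toy sketch of self-attention with normalization moved inside the attention computation (the precise reparameterization in NSA is the paper's contribution; this only illustrates the mechanics, with all shapes and inputs illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def normalize(v, eps=1e-6):
    """Zero-mean, unit-variance normalization of a single vector."""
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

def attention(Q, K, V):
    """Scaled dot-product attention; here queries and keys are normalized
    inside the attention block rather than outside it."""
    d = len(Q[0])
    out = []
    for q in Q:
        qn = normalize(q)
        scores = [sum(a * b for a, b in zip(qn, normalize(k))) / math.sqrt(d) for k in K]
        w = softmax(scores)  # weights sum to 1
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.0, 10.0], [10.0, 0.0]]
print(attention(Q, K, V))  # output is a convex combination of the V rows
```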

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2019-07-05
Jeong Choi; Jongpil Lee; Jiyoung Park; Juhan Nam

Audio-based music classification and tagging is typically based on categorical supervised learning with a fixed set of labels. This intrinsically cannot handle unseen labels such as newly added music genres or semantic words that users arbitrarily choose for music retrieval. Zero-shot learning can address this problem by leveraging an additional semantic space of labels where side information about

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2019-09-04
Jiahua Xu; Wei Zhou; Zhibo Chen; Suiyi Ling; Patrick Le Callet

Stereoscopic image quality assessment (SIQA) has encountered non-trivial challenges due to the fast proliferation of 3D content. In the past years, deep learning oriented SIQA methods have emerged and achieved spectacular performance compared to conventional algorithms, which rely only on extracting hand-crafted features. However, most existing deep SIQA evaluators are not specifically built

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-17
Wei Hu; Qianjiang Hu; Zehua Wang; Xiang Gao

The prevalence of accessible depth sensing and 3D laser scanning techniques has enabled the convenient acquisition of 3D dynamic point clouds, which provide efficient representation of arbitrarily-shaped objects in motion. Nevertheless, dynamic point clouds are often perturbed by noise due to hardware, software or other causes. While a plethora of methods have been proposed for static point cloud denoising

Updated: 2020-03-19
• arXiv.cs.MM Pub Date : 2019-09-05
Xavier Bost (LIA); Serigne Gueye (LIA); Vincent Labatut (LIA); Martha Larson (DMIR); Georges Linarès (LIA); Damien Malinas (CNELIAS); Raphaël Roth (CNELIAS)

Today's popular TV series tend to develop continuous, complex plots spanning several seasons, but are often viewed in controlled and discontinuous conditions. Consequently, most viewers need to be re-immersed in the story before watching a new season. Although discussions with friends and family can help, we observe that most viewers make extensive use of summaries to re-engage with the plot. Automatic

Updated: 2020-03-19
• arXiv.cs.MM Pub Date : 2020-03-17
Md Amiruzzaman; Rizal Mohd Nor

In this paper, a new steganographic method is presented that provides minimum distortion in the stego image. The proposed encoding algorithm focuses on the DCT rounding error and optimizes it to reduce distortion in the stego image; the proposed algorithm produces less distortion than existing methods (e.g., the F5 algorithm). The proposed method is based on the DCT rounding error, which helps to
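The underlying intuition — a coefficient whose pre-quantization value fell near a half-integer can be re-rounded the other way almost for free — can be sketched as a cost map (a simplification for illustration, not the paper's actual encoding algorithm):

```python
def rerounding_costs(raw_coeffs):
    """Distortion cost of quantizing each raw DCT coefficient to its
    second-nearest integer instead of the nearest: near zero when the
    value sits close to x.5, i.e. the rounding decision was ambiguous."""
    return [round(1.0 - 2.0 * abs(c - round(c)), 6) for c in raw_coeffs]

coeffs = [3.02, -1.49, 0.97, 2.51, -0.06]
costs = rerounding_costs(coeffs)
# Cheapest embedding positions: the coefficients with the largest rounding error.
order = sorted(range(len(coeffs)), key=lambda i: costs[i])
print(costs, order)
```

An embedder would then flip the rounding direction of the cheapest coefficients to encode message bits, keeping total distortion low.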

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2020-03-17
Cunhang Fan; Jianhua Tao; Bin Liu; Jiangyan Yi; Zhengqi Wen; Xuefei Liu

In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. At first, a time-frequency domain speech separation method is applied as the pre-separation stage. The aim of pre-separation stage is to separate the mixture preliminarily. Although this stage can separate the mixture, it still contains the residual interference

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2020-03-17
Wei Quan; Yuxuan Pan; Bin Xiang; Lin Zhang

With the merit of containing full panoramic content in one camera, Virtual Reality (VR) and 360-degree videos have attracted more and more attention in the field of industrial cloud manufacturing and training. The Industrial Internet of Things (IoT), where many VR terminals need to be online at the same time, can hardly guarantee VR's bandwidth requirement. However, by making use of users' quality of

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2020-03-17
Siyu Huang; Haoyi Xiong; Tianyang Wang; Qingzhong Wang; Zeyu Chen; Jun Huan; Dejing Dou

Arbitrary image style transfer is a challenging task which aims to stylize a content image conditioned on an arbitrary style image. In this task the content-style feature transformation is a critical component for a proper fusion of features. Existing feature transformation algorithms often suffer from unstable learning, loss of content and style details, and non-natural stroke patterns. To mitigate

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2019-12-15
Zhengfang Duanmu (University of Waterloo, Canada); Wentao Liu (University of Waterloo, Canada); Zhuoran Li (University of Waterloo, Canada); Kede Ma (City University of Hong Kong, Hong Kong, China); Zhou Wang (University of Waterloo, Canada)

Rate-distortion (RD) theory is at the heart of lossy data compression. Here we aim to model the generalized RD (GRD) trade-off between the visual quality of a compressed video and its encoding profiles (e.g., bitrate and spatial resolution). We first define the theoretical functional space $\mathcal{W}$ of the GRD function by analyzing its mathematical properties. We show that $\mathcal{W}$ is a convex

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2020-03-13
Maria Santamaria; Ebroul Izquierdo; Saverio Blasi; Marta Mrak

Rate control is essential to ensure efficient video delivery. Typical rate-control algorithms rely on bit allocation strategies to appropriately distribute bits among frames. As reference frames are essential for exploiting temporal redundancies, intra frames are usually assigned a larger portion of the available bits. In this paper, an accurate method to estimate the number of bits and the quality of intra

Updated: 2020-03-16
• arXiv.cs.MM Pub Date : 2020-03-11
Juan Cao; Peng Qi; Qiang Sheng; Tianyun Yang; Junbo Guo; Jintao Li

The increasing popularity of social media promotes the proliferation of fake news, which has caused significant negative societal effects. Therefore, fake news detection on social media has recently become an emerging research area of great concern. With the development of multimedia technology, fake news attempts to utilize multimedia content with images or videos to attract and mislead consumers

Updated: 2020-03-12
• arXiv.cs.MM Pub Date : 2019-10-30
Xing Wei; Chenyang Yang; Shengqian Han

Proactive tile-based video streaming can avoid motion-to-photon latency of wireless virtual reality (VR) by computing and delivering the predicted tiles to be requested before playback. All existing works either focus on the task of tile prediction or on the tasks of computing and communications, overlooking the fact that these successively executed tasks have to share the same duration to avoid the

Updated: 2020-03-12
• arXiv.cs.MM Pub Date : 2017-12-05
Zhiqian Chen; Chih-Wei Wu; Yen-Cheng Lu; Alexander Lerch; Chang-Tien Lu

FusionGAN is a novel genre fusion framework for music generation that integrates the strengths of generative adversarial networks and dual learning. In particular, the proposed method offers a dual learning extension that can effectively integrate the styles of the given domains. To efficiently quantify the difference among diverse domains and avoid the vanishing gradient issue, FusionGAN provides

Updated: 2020-03-12
• arXiv.cs.MM Pub Date : 2020-03-08
Dongxu Li; Xin Yu; Chenchen Xu; Lars Petersson; Hongdong Li

Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation. It requires models to recognize isolated sign words from videos. However, annotating WSLR data requires expert knowledge, thus limiting WSLR dataset acquisition. In contrast, there are abundant subtitled sign news videos on the internet. Since these videos have no word-level annotation and exhibit a

Updated: 2020-03-10
• arXiv.cs.MM Pub Date : 2020-03-09
Hao Wang; Doyen Sahoo; Chenghao Liu; Ke Shu; Palakorn Achananuparp; Ee-peng Lim; Steven C. H. Hoi

Cross-modal food retrieval is an important task to perform analysis of food-related information, such as food images and cooking recipes. The goal is to learn an embedding of images and recipes in a common feature space, so that precise matching can be realized. Compared with existing cross-modal retrieval approaches, two major challenges in this specific problem are: 1) the large intra-class variance

Updated: 2020-03-10
• arXiv.cs.MM Pub Date : 2020-03-09
Arun Balajee Vasudevan; Dengxin Dai; Luc Van Gool

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight

Updated: 2020-03-10
• arXiv.cs.MM Pub Date : 2019-05-03
Hao Wang; Doyen Sahoo; Chenghao Liu; Ee-peng Lim; Steven C. H. Hoi

Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and healthy lifestyle. An important task under the food-computing umbrella is retrieval, which is particularly helpful for health related applications, where we are interested in retrieving important information about food (e.g

Updated: 2020-03-10
• arXiv.cs.MM Pub Date : 2020-03-06

Recently, soft video multicasting has gained a lot of attention, especially in broadcast and mobile scenarios where the bit rate supported by the channel may differ across receivers, and may vary quickly over time. Unlike the conventional designs that force the source to use a single bit rate according to the receiver with the worst channel quality, soft video delivery schemes transmit the video such
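One representative scheme of this family is SoftCast-style analog delivery, where transform coefficients are scaled in proportion to variance^(-1/4) under a transmit-power budget so that quality degrades gracefully with each receiver's channel SNR. A sketch of that power allocation (variances and budget are illustrative; the paper's own scheme may differ):

```python
import math

def power_scaling(variances, total_power=1.0):
    """SoftCast-style analog scaling: gain g_i proportional to var_i**(-0.25),
    normalized so the transmitted power sum(g_i^2 * var_i) meets the budget."""
    raw = [v ** -0.25 for v in variances]
    norm = math.sqrt(total_power / sum(g * g * v for g, v in zip(raw, variances)))
    return [norm * g for g in raw]

variances = [16.0, 4.0, 1.0]          # per-chunk coefficient variances
gains = power_scaling(variances, total_power=9.0)
used = sum(g * g * v for g, v in zip(gains, variances))
print([round(g, 3) for g in gains], round(used, 6))  # budget met exactly
```

High-variance chunks get smaller gains, which is what minimizes the expected reconstruction MSE under the budget.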

Updated: 2020-03-09
• arXiv.cs.MM Pub Date : 2020-03-06
Felix Sattler; Thomas Wiegand; Wojciech Samek

Due to their great performance and scalability properties, neural networks have become ubiquitous building blocks of many applications. With the rise of mobile and IoT, these models are now also being increasingly applied in distributed settings, where the owners of the data are separated by limited communication channels and privacy constraints. To address the challenges of these distributed environments

Updated: 2020-03-09
• arXiv.cs.MM Pub Date : 2020-03-05
Serhan Gül; Dimitri Podborski; Jangwoo Son; Gurdeep Singh Bhullar; Thomas Buchholz; Thomas Schierl; Cornelius Hellge

Volumetric video is an emerging technology for immersive representation of 3D spaces that captures objects from all directions using multiple cameras and creates a dynamic 3D model of the scene. However, rendering volumetric content requires high amounts of processing power and is still a very demanding task for today's mobile devices. To mitigate this, we propose a volumetric video streaming system

Updated: 2020-03-06
• arXiv.cs.MM Pub Date : 2020-03-04
Eduardo Pavez; Benjamin Girault; Antonio Ortega; Philip A. Chou

We introduce the Region Adaptive Graph Fourier Transform (RA-GFT) for compression of 3D point cloud attributes. We assume the points are organized by a family of nested partitions represented by a tree. The RA-GFT is a multiresolution transform, formed by combining spatially localized block transforms. At each resolution level, attributes are processed in clusters by a set of block transforms. Each
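For a block of just two points, such a spatially localized transform reduces to a weighted orthonormal butterfly (as in RAHT, the simplest related special case; values and weights below are illustrative). The weights carry how many original points each coefficient represents:

```python
import math

def two_point_transform(a, b, wa, wb):
    """Weighted orthonormal 2-point block transform: 'low' aggregates the two
    attribute values (weighted by subtree size), 'high' carries the detail."""
    s = math.sqrt(wa + wb)
    low = (math.sqrt(wa) * a + math.sqrt(wb) * b) / s
    high = (math.sqrt(wb) * a - math.sqrt(wa) * b) / s
    return low, high

# A cluster of 3 points already merged into attribute value 10.0 (weight 3)
# meets a single point with value 6.0 (weight 1):
low, high = two_point_transform(10.0, 6.0, wa=3, wb=1)
# Orthonormality preserves energy: low^2 + high^2 == a^2 + b^2
print(round(low * low + high * high, 6))
```

Larger RA-GFT blocks replace this butterfly with a graph Fourier transform of the block's points, applied resolution level by resolution level up the partition tree.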

Updated: 2020-03-05
• arXiv.cs.MM Pub Date : 2020-03-04
Federico Simonetta; Stavros Ntalampiras; Federico Avanzini

This paper describes an open-source Python framework for handling datasets for music processing tasks, built with the aim of improving the reproducibility of research projects in music computing and assessing the generalization abilities of machine learning models. The framework enables the automatic download and installation of several commonly used datasets for multimodal music processing. Specifically

Updated: 2020-03-05
• arXiv.cs.MM Pub Date : 2020-03-01
Yixin Wang; Xiaohong Guan; Youtian Du; Nan Nan

Music tone quality evaluation is generally performed by experts. It can be subjective, lacking consistency and fairness, and is time-consuming as well. In this paper, we present a new method for identifying clarinet reed quality by evaluating tone quality based on the harmonic structure and energy distribution. We first decouple the quality of the reed and the clarinet pipe based on the acoustic harmonics
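Extracting a harmonic energy distribution from a recorded tone can be sketched with a naive DFT probe at integer multiples of the fundamental (a generic sketch, not the paper's exact features; sampling rate, fundamental, and amplitudes are illustrative):

```python
import math

def harmonic_energies(signal, f0, sr, n_harmonics=3):
    """Energy at integer multiples of the fundamental f0, via a naive
    single-bin DFT probe per harmonic."""
    N = len(signal)
    energies = []
    for h in range(1, n_harmonics + 1):
        f = h * f0
        re = sum(x * math.cos(2 * math.pi * f * n / sr) for n, x in enumerate(signal))
        im = sum(x * math.sin(2 * math.pi * f * n / sr) for n, x in enumerate(signal))
        energies.append((re * re + im * im) / N)
    return energies

# Synthetic tone: fundamental at 200 Hz plus a weaker 2nd harmonic at 400 Hz.
sr, f0 = 8000, 200
tone = [math.sin(2 * math.pi * f0 * n / sr) + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / sr)
        for n in range(800)]
print([round(e, 1) for e in harmonic_energies(tone, f0, sr)])
```

Comparing how this energy profile differs between reeds (while holding the pipe fixed) is the kind of decoupled feature the abstract describes.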

Updated: 2020-03-03
• arXiv.cs.MM Pub Date : 2020-03-01
Prajwal K R; Rudrabha Mukhopadhyay; Jerin Philip; Abhishek Jha; Vinay Namboodiri; C. V. Jawahar

In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term as "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work

Updated: 2020-03-03
• arXiv.cs.MM Pub Date : 2020-03-01
Bo Fu; Liyan Wang; Yuechu Wu; Yufeng Wu; Shilin Fu; Yonggong Ren

Single image super-resolution (SISR) is an image processing task which obtains a high-resolution (HR) image from a low-resolution (LR) image. Recently, owing to their capability in feature extraction, a series of deep learning methods have brought crucial improvements to SISR. However, we observe that no matter how deep the networks are designed, they usually do not have good generalization ability

Updated: 2020-03-03
• arXiv.cs.MM Pub Date : 2020-02-12
Sicheng Zhao; Yunsheng Ma; Yang Gu; Jufeng Yang; Tengfei Xing; Pengfei Xu; Runbo Hu; Hua Chai; Kurt Keutzer

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio

Updated: 2020-03-03
• arXiv.cs.MM Pub Date : 2020-02-28
Licheng Xiao; Hairong Wang; Nam Ling

In this paper, we build autoencoder-based pipelines for extreme end-to-end image compression based on Ballé's approach, which is the state-of-the-art open-source implementation of image compression using deep learning. We deepened the network by adding one more hidden layer before each strided convolutional layer, with exactly the same number of down-samplings and up-samplings. Our approach outperformed

Updated: 2020-03-02
• arXiv.cs.MM Pub Date : 2020-02-26
Qingyuan Zheng; Zhuoru Li; Adam Bargteil

We present a fully automatic method to generate detailed and accurate artistic shadows from pairs of line drawing sketches and lighting directions. We also contribute a new dataset of one thousand examples of pairs of line drawings and shadows that are tagged with lighting directions. Remarkably, the generated shadows quickly communicate the underlying 3D structure of the sketched scene. Consequently

Updated: 2020-02-28
• arXiv.cs.MM Pub Date : 2020-02-27
Zhengzhong Tu; Jessie Lin; Yilin Wang; Balu Adsumilli; Alan C. Bovik

Banding artifact, or false contouring, is a common video compression impairment that tends to appear on large flat regions in encoded videos. These staircase-shaped color bands can be very noticeable in high-definition videos. Here we study this artifact, and propose a new distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND

Updated: 2020-02-28
• arXiv.cs.MM Pub Date : 2020-02-27
Joong Gon Yim; Yilin Wang; Neil Birkbeck; Balu Adsumilli

Due to the scale of social video sharing, User Generated Content (UGC) is getting more attention from academia and industry. To facilitate compression-related research on UGC, YouTube has released a large-scale dataset. The initial dataset only provided videos, limiting its use in quality assessment. We used a crowd-sourcing platform to collect subjective quality scores for this dataset. We analyzed

Updated: 2020-02-28
Contents have been reproduced by permission of the publishers.
