• arXiv.cs.MM Pub Date : 2020-06-10
Andrew Perkis; Christian Timmerer; Sabina Baraković; Jasmina Baraković Husić; Søren Bech; Sebastian Bosse; Jean Botev; Kjell Brunnström; Luis Cruz; Katrien De Moor; Andrea de Polo Saibanti; Wouter Durnez; Sebastian Egger-Lampl; Ulrich Engelke; Tiago H. Falk; Asim Hameed; Andrew Hines; Tanja Kojic; Dragan Kukolj; Eirini Liotou; Dragorad Milovanovic; Sebastian Möller; Niall Murray; Babak Naderi; Manuela

With the coming of age of virtual/augmented reality and interactive media, numerous definitions, frameworks, and models of immersion have emerged across different fields ranging from computer graphics to literary works. Immersion is oftentimes used interchangeably with presence as both concepts are closely related. However, there are noticeable interdisciplinary differences regarding definitions, scope

Updated: 2020-07-15
• arXiv.cs.MM Pub Date : 2020-07-11
Xianchao Wu; Chengyuan Wang; Qinying Lei

Current state-of-the-art AI-based classical music creation algorithms such as Music Transformer are trained on a single sequence of notes with time-shifts. The major drawback of absolute time interval expression is the difficulty of computing the similarity of notes that share the same note value yet have different tempos, within one MIDI file or across several. In addition, the usage of a single sequence restricts

Updated: 2020-07-15
• arXiv.cs.MM Pub Date : 2020-07-14
Di Ma; Fan Zhang; David R. Bull

In this paper, we propose a novel convolutional neural network (CNN) architecture, MFRNet, for post-processing (PP) and in-loop filtering (ILF) in the context of video compression. This network consists of four Multi-level Feature review Residual dense Blocks (MFRBs), which are connected using a cascading structure. Each MFRB extracts features from multiple convolutional layers using dense connections

Updated: 2020-07-15
• arXiv.cs.MM Pub Date : 2020-07-13
Piyush Yadav; Dhaval Salwala; Edward Curry

Complex Event Processing (CEP) is an event processing paradigm to perform real-time analytics over streaming data and match high-level event patterns. Presently, CEP is limited to processing structured data streams. Video streams are complicated due to their unstructured data model, which hinders CEP systems from performing matching over them. This work introduces a graph-based structure for continuous evolving

Updated: 2020-07-14
• arXiv.cs.MM Pub Date : 2020-07-11
Ankit Sharma; Puneet Kumar; Vikas Maddukuri; Nagasai Madamshettib; Kishore KG; Sahit Sai Sriram Kavurub; Balasubramanian Raman; Partha Pratim Roy

The performance of text-to-speech (TTS) systems heavily depends on spectrogram to waveform generation, also known as the speech reconstruction phase. The time required for the same is known as synthesis delay. In this paper, an approach to reduce speech synthesis delay has been proposed. It aims to enhance the TTS systems for real-time applications such as digital assistants, mobile phones, embedded

Updated: 2020-07-14
• arXiv.cs.MM Pub Date : 2020-07-09
Rohit Agrawal

Steganography plays a vital role in achieving secret data security by embedding it into cover media. The cover media and the secret data can be text or multimedia, such as images, videos, etc. In this paper, we propose a novel $\ell_1$-minimization and sparse approximation based blind multi-image steganography scheme, termed $\ell_1$SABMIS. By using $\ell_1$SABMIS, multiple secret images can be hidden

Updated: 2020-07-13
• arXiv.cs.MM Pub Date : 2020-07-09
Emre Çakır; Konstantinos Drossos; Tuomas Virtanen

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with audio clips and corresponding natural language descriptions (i.e. captions). A significant challenge for audio captioning is the distribution of words in the captions:

Updated: 2020-07-10
• arXiv.cs.MM Pub Date : 2020-07-08
Youqing Wu; Wenjing Ma; Yinyin Peng; Ruiling Zhang; Zhaoxia Yin

As a technology that can protect the information in the original image from being disclosed and accurately extract the embedded information, reversible data hiding in encrypted images (RDHEI) has attracted wide attention from researchers. One of the current challenges is how to further improve the performance of the RDHEI method. In this paper, a high-capacity RDHEI method based on bit plane compression

Updated: 2020-07-09
• arXiv.cs.MM Pub Date : 2020-07-08
Zhaoxia Yin; Xiaomeng She; Jin Tang; Bin Luo

Great concern has arisen in the field of reversible data hiding in encrypted images (RDHEI) due to the development of cloud storage and privacy protection. RDHEI is an effective technology that can embed additional data after image encryption, extract additional data without any errors and reconstruct original images losslessly. In this paper, a high-capacity and fully reversible data hiding in encrypted

Updated: 2020-07-09
• arXiv.cs.MM Pub Date : 2020-07-07
Ping Hu; Federico Perazzi; Fabian Caba Heilbron; Oliver Wang; Zhe Lin; Kate Saenko; Stan Sclaroff

In deep CNN based models for semantic segmentation, high accuracy relies on rich spatial context (large receptive fields) and fine spatial details (high resolution), both of which incur high computational costs. In this paper, we propose a novel architecture that addresses both challenges and achieves state-of-the-art performance for semantic segmentation of high-resolution images and videos in real-time

Updated: 2020-07-09
• arXiv.cs.MM Pub Date : 2020-07-07
Mahmoud Darwich; Yasser Ismail; Talal Darwich; Magdy Bayoumi

A video stream is converted into several formats to support users' devices; this conversion process, called video transcoding, demands large storage and powerful computing resources. With the emergence of cloud technology, video streaming companies have adopted cloud-based video processing. Generally, many formats of the same video are made (pre-transcoded) and streamed to the appropriate user device. However, pre-transcoding

Updated: 2020-07-08
• arXiv.cs.MM Pub Date : 2020-07-01
Katherine McLeod; Petros Spachos; Konstantinos Plataniotis

Mental health and general wellness are becoming a growing concern in our society. Environmental factors contribute to mental illness and have the power to affect a person's wellness. This work presents a smartphone-based wellness assessment system and examines whether there is any correlation between one's environment and one's wellness. The introduced system was initiated in response to a growing need for

Updated: 2020-07-08
• arXiv.cs.MM Pub Date : 2020-07-05
Xin Zhong; Pei-Chi Huang; Spyridon Mastorakis; Frank Y. Shih

Digital image watermarking is the process of embedding and extracting a watermark covertly on a cover-image. To dynamically adapt image watermarking algorithms, deep learning-based image watermarking schemes have attracted increased attention during recent years. However, existing deep learning-based watermarking methods neither fully apply the fitting ability to learn and automate the embedding and

Updated: 2020-07-07
• arXiv.cs.MM Pub Date : 2020-07-05
Seung-Hun Nam; Wonhyuk Ahn; In-Jae Yu; Myung-Joon Kwon; Minseok Son; Heung-Kyu Lee

Seam carving is a representative content-aware image retargeting approach to adjust the size of an image while preserving its visually prominent content. To maintain visually important content, seam-carving algorithms first calculate the connected path of pixels, referred to as the seam, according to a defined cost function and then adjust the size of an image by removing and duplicating repeatedly

Updated: 2020-07-07
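
The seam-carving procedure summarized in the entry above (find a connected path of low-cost pixels, then remove it) can be illustrated with a minimal sketch. It assumes the per-pixel cost map is already given as a list of rows; the function names `find_vertical_seam` and `remove_vertical_seam` are hypothetical, and the dynamic-programming formulation shown is the commonly used one, not necessarily the variant the paper studies.

```python
def find_vertical_seam(energy):
    """Return, for each row, the column of the minimum-cost 8-connected vertical seam.

    `energy` is a list of rows holding a precomputed per-pixel cost
    (the choice of cost function is an assumption left to the caller).
    """
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = cheapest total cost of any seam ending at pixel (r, c)
    cost = [list(energy[0])]
    for r in range(1, rows):
        prev = cost[-1]
        row = []
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            row.append(energy[r][c] + min(prev[lo:hi + 1]))
        cost.append(row)

    # Backtrack from the cheapest pixel in the last row.
    c = min(range(cols), key=cost[-1].__getitem__)
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        c = min(range(lo, hi + 1), key=cost[r - 1].__getitem__)
        seam.append(c)
    seam.reverse()
    return seam


def remove_vertical_seam(image, seam):
    """Delete one pixel per row, narrowing the image by one column."""
    return [row[:c] + row[c + 1:] for row, c in zip(image, seam)]
```

Calling `find_vertical_seam` repeatedly on a recomputed cost map and removing each seam shrinks the image one column at a time; enlarging would duplicate seams instead of deleting them.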
• arXiv.cs.MM Pub Date : 2020-07-05
Lijie Wang; Xueting Wang; Toshihiko Yamasaki

The spread of social networking services has created an increasing demand for selecting, editing, and generating impressive images. This trend increases the importance of evaluating image aesthetics as a complementary function of automatic image processing. We propose a multi-patch method, named MPA-Net (Multi-Patch Aggregation Network), to predict image aesthetics scores by maintaining the original

Updated: 2020-07-07
• arXiv.cs.MM Pub Date : 2020-07-05
R. J. Cintra

Approximate methods have been considered as a means to the evaluation of discrete transforms. In this work, we propose and analyze a class of integer transforms for the discrete Fourier, Hartley, and cosine transforms (DFT, DHT, and DCT), based on simple dyadic rational approximation methods. The introduced method is general, applicable to several block-lengths, whereas existing approaches are usually

Updated: 2020-07-07
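
As a rough illustration of what a dyadic rational approximation of a transform looks like, the sketch below rounds each cosine entry of an unnormalized DCT-II matrix to the nearest multiple of 1/2^bits. Nearest-value rounding, the unnormalized matrix, and the names `dyadic_approx` and `approx_dct_matrix` are all assumptions made for illustration; the excerpt does not say which approximation rule the paper uses.

```python
import math
from fractions import Fraction


def dyadic_approx(x, bits):
    """Nearest dyadic rational k / 2**bits to the real number x."""
    return Fraction(round(x * 2**bits), 2**bits)


def approx_dct_matrix(n, bits):
    """Unnormalized n-point DCT-II matrix with every entry rounded to a dyadic rational."""
    return [
        [dyadic_approx(math.cos(math.pi * (2 * j + 1) * k / (2 * n)), bits)
         for j in range(n)]
        for k in range(n)
    ]
```

Because every entry has a power-of-two denominator, multiplying by the matrix needs only integer additions and bit shifts, which is the usual motivation for this kind of approximation.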
• arXiv.cs.MM Pub Date : 2020-07-04
Hengguan Huang; Fuzhao Xue; Hao Wang; Ye Wang

Lying at the core of human intelligence, relational thinking is characterized by initially relying on innumerable unconscious percepts pertaining to relations between new sensory signals and prior knowledge, consequently becoming a recognizable concept or object through coupling and transformation of these percepts. Such mental processes are difficult to model in real-world problems such as in conversational

Updated: 2020-07-07
• arXiv.cs.MM Pub Date : 2020-07-04
Pranay Gupta; Anirudh Thatipelli; Aditya Aggarwal; Shubh Maheshwari; Neel Trivedi; Sourav Das; Ravi Kiran Sarvadevabhatla

In this paper, we study current and upcoming frontiers across the landscape of skeleton-based human action recognition. To begin with, we benchmark state-of-the-art models on the NTU-120 dataset and provide multi-layered assessment of the results. To examine skeleton action recognition 'in the wild', we introduce Skeletics-152, a curated and 3-D pose-annotated subset of RGB videos sourced from Kinetics-700

Updated: 2020-07-07
• arXiv.cs.MM Pub Date : 2020-07-02
Tamami Nakano; Atsuya Sakata; Akihiro Kishimoto

Highlight detection in sports videos has a broad viewership and huge commercial potential. It is thus imperative to detect highlight scenes more suitably for human interest with high temporal accuracy. Since people instinctively suppress blinks during attention-grabbing events and synchronously generate blinks at attention break points in videos, the instantaneous blink rate can be utilized as a highly

Updated: 2020-07-03
• arXiv.cs.MM Pub Date : 2020-07-02
Xinyu Huang; Lijun He

In traditional communication systems, information from the APP (Application) layer, transport layer and MAC (Media Access Control) layer is not fully exchanged, which inevitably leads to inconsistencies among TCP congestion state, clients' requirements and resource allocation. To solve the problem, we propose a joint optimization framework, which consists of APP layer, transport layer and MAC layer, to

Updated: 2020-07-03
• arXiv.cs.MM Pub Date : 2020-06-29
Zhen Long; Ce Zhu; Jiani Liu; Yipeng Liu

The low-rank tensor ring model is powerful for image completion, which recovers missing entries in data acquisition and transformation. The recently proposed tensor ring (TR) based completion algorithms generally solve the low-rank optimization problem by an alternating least squares method with predefined ranks, which may easily lead to overfitting when the unknown ranks are set too large and only a few measurements

Updated: 2020-07-03
• arXiv.cs.MM Pub Date : 2020-07-01
Pablo Carballeira; Carlos Carmona; César Díaz; Daniel Berjón; Daniel Corregidor; Julián Cabrera; Francisco Morán; Carmen Doblado; Sergio Arnaldo; María del Mar Martín; Narciso García

FVV Live is a novel end-to-end free-viewpoint video system, designed for low cost and real-time operation, based on off-the-shelf components. The system has been designed to yield high-quality free-viewpoint video using consumer-grade cameras and hardware, which enables low deployment costs and easy installation for immersive event-broadcasting or videoconferencing. The paper describes the architecture

Updated: 2020-07-02
• arXiv.cs.MM Pub Date : 2020-06-30
Daniel Berjón; Pablo Carballeira; Julián Cabrera; Carlos Carmona; Daniel Corregidor; César Díaz; Francisco Morán; Narciso García

FVV Live is a novel real-time, low-latency, end-to-end free viewpoint system including capture, transmission, synthesis on an edge server and visualization and control on a mobile terminal. The system has been specially designed for low-cost and real-time operation, only using off-the-shelf components.

Updated: 2020-07-01
• arXiv.cs.MM Pub Date : 2020-06-30
In-Jae Yu; Wonhyuk Ahn; Seung-Hun Nam; Heung-Kyu Lee

Convolutional neural networks (CNNs) for image steganalysis perform better when they employ concepts from high-level vision tasks. The main such concept is data augmentation, used to avoid overfitting due to limited data. To augment data without damaging the message embedding, only rotations by multiples of 90 degrees and horizontal flips are used in steganalysis, which generates

Updated: 2020-07-01
• arXiv.cs.MM Pub Date : 2020-06-29
Zhaoxia Yin; Yang Du; Yuan Ji

Huffman code mapping (HCM) is a recent technique for reversible data hiding (RDH) in JPEG images. The existing HCM-based RDH schemes cause neither file-size increment nor visual distortion for the marked JPEG image, which is their advantage over RDH schemes that use other techniques, such as histogram shifting (HS). However, the embedding capacity achieved by the HCM-based RDH schemes is

Updated: 2020-06-30
• arXiv.cs.MM Pub Date : 2020-06-27
Marc Górriz; Saverio Blasi; Alan F. Smeaton; Noel E. O'Connor; Marta Mrak

Neural networks can be used in video coding to improve chroma intra-prediction. In particular, usage of fully-connected networks has enabled better cross-component prediction with respect to traditional linear models. Nonetheless, state-of-the-art architectures tend to disregard the location of individual reference samples in the prediction process. This paper proposes a new neural network architecture

Updated: 2020-06-30
• arXiv.cs.MM Pub Date : 2020-06-26
Ivan Bacher; Hossein Javidnia; Soumyabrata Dev; Rahul Agrahari; Murhaf Hossari; Matthew Nicholson; Clare Conran; Jian Tang; Peng Song; David Corrigan; François Pitié

Over the past decade, the evolution of video-sharing platforms has attracted a significant amount of investment in contextual advertising. Common contextual advertising platforms utilize the information provided by users to integrate 2D visual ads into videos. The existing platforms face many technical challenges such as ad integration with respect to occluding objects and 3D ad placement. This

Updated: 2020-06-29
• arXiv.cs.MM Pub Date : 2020-06-25
Xiao-Wei Tang; Xin-Lin Huang; Fei Hu

The explosive demands for high quality mobile video services have caused heavy overload to the existing cellular networks. Although the small cell has been proposed to alleviate such a problem, the network operators may not be interested in deploying numerous base stations (BSs) due to expensive infrastructure construction and maintenance. The unmanned aerial vehicles (UAVs) can provide the low-cost

Updated: 2020-06-26
• arXiv.cs.MM Pub Date : 2020-06-25
Navid Mahmoudian Bidgoli; Thomas Maugey; Aline Roumy

In this paper, we propose a new interactive compression scheme for omnidirectional images. This requires two characteristics: efficient compression of data, to lower the storage cost, and random access ability to extract part of the compressed stream requested by the user (for reducing the transmission rate). For efficient compression, data needs to be predicted by a series of references that have

Updated: 2020-06-26
• arXiv.cs.MM Pub Date : 2020-06-23
Kun Su; Xiulong Liu; Eli Shlizerman

We present a novel system that takes as input video frames of a musician playing the piano and generates the music for that video. Generation of music from visual cues is a challenging problem and it is not clear whether it is an attainable goal at all. Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association

Updated: 2020-06-26
• arXiv.cs.MM Pub Date : 2020-06-24
Shengyu Zhang; Ziqi Tan; Jin Yu; Zhou Zhao; Kun Kuang; Tan Jiang; Jingren Zhou; Hongxia Yang; Fei Wu

In e-commerce, consumer-generated videos, which in general deliver consumers' individual preferences for the different aspects of certain products, are massive in volume. To recommend these videos to potential consumers more effectively, diverse and catchy video titles are critical. However, consumer-generated videos are seldom accompanied by appropriate titles. To bridge this gap, we integrate comprehensive

Updated: 2020-06-25
• arXiv.cs.MM Pub Date : 2020-06-23
Huyen T. T. Tran; Nam Pham Ngoc; Truong Cong Thang

HTTP Adaptive Streaming (HAS) has become a cost-effective means for multimedia delivery. However, how the quality of experience (QoE) is jointly affected by 1) varying perceptual quality and 2) interruptions is not well understood. In this paper, we present the first attempt to quantify the relative impacts of these factors on the QoE of streaming sessions. To achieve this purpose

Updated: 2020-06-24
• arXiv.cs.MM Pub Date : 2020-06-23
Tianyi Li; Mai Xu; Runzhi Tang

The latest standard, Versatile Video Coding (VVC), significantly improves the coding efficiency over its predecessor, High Efficiency Video Coding (HEVC), but at the expense of sharply increased complexity. In VVC, the quadtree plus multi-type tree (QTMT) structure of coding unit (CU) partition accounts for most of the encoding time, due to the brute-force search for recursive rate-distortion (RD) optimization

Updated: 2020-06-24
• arXiv.cs.MM Pub Date : 2020-06-20
Yijun Quan; Chang-Tsun Li

Photo Response Non-Uniformity (PRNU) has been used as a powerful device fingerprint for image forgery detection because image forgeries can be revealed by finding the absence of the PRNU in the manipulated areas. The correlation between an image's noise residual and the device's reference PRNU is often compared with a decision threshold to check the existence of the PRNU. A PRNU correlation predictor

Updated: 2020-06-23
• arXiv.cs.MM Pub Date : 2020-06-19
Pavan C. Madhusudana; Neil Birkbeck; Yilin Wang; Balu Adsumilli; Alan C. Bovik

High frame rate videos have become increasingly popular in recent years, driven largely by strong demand from the entertainment and streaming industries to provide a high quality of experience to consumers. To achieve the best trade-off between the bandwidth requirements and video quality in terms of frame rate adaptation, it is imperative to understand the effects of frame rate on video quality

Updated: 2020-06-23
• arXiv.cs.MM Pub Date : 2020-06-21
Purva Tendulkar; Abhishek Das; Aniruddha Kembhavi; Devi Parikh

We present a general computational approach that enables a machine to generate a dance for any input music. We encode intuitive, flexible heuristics for what a 'good' dance is: the structure of the dance should align with the structure of the music. This flexibility allows the agent to discover creative dances. Human studies show that participants find our dances to be more creative and inspiring compared

Updated: 2020-06-23
• arXiv.cs.MM Pub Date : 2020-06-20
Huirong Huang; Zhiyong Wu; Shiyin Kang; Dongyang Dai; Jia Jia; Tianxiao Fu; Deyi Tuo; Guangzhi Lei; Peng Liu; Dan Su; Dong Yu; Helen Meng

Generating 3D speech-driven talking heads has received more and more attention in recent years. Recent approaches have the following main limitations: 1) most speaker-independent methods need handcrafted features that are time-consuming to design or unreliable; 2) there is no convincing method to support multilingual or mixlingual speech as input. In this work, we propose a novel approach using phonetic

Updated: 2020-06-23
• arXiv.cs.MM Pub Date : 2020-06-19
D. R. Canterle; T. L. T. da Silveira; F. M. Bayer; R. J. Cintra

Discrete transforms play an important role in many signal processing applications, and low-complexity alternatives for classical transforms have become popular in recent years. In particular, the discrete cosine transform (DCT) has proven to be convenient for data compression, being employed in well-known image and video coding standards such as JPEG, H.264, and the recent high efficiency video coding (HEVC)

Updated: 2020-06-23
• arXiv.cs.MM Pub Date : 2020-06-03
Chongyang Bai; Haipeng Chen; Srijan Kumar; Jure Leskovec; V. S. Subrahmanian

Identifying persuasive speakers in an adversarial environment is a critical task. In a national election, politicians would like to have persuasive speakers campaign on their behalf. When a company faces adverse publicity, they would like to engage persuasive advocates for their position in the presence of adversaries who are critical of them. Debates represent a common platform for these forms of

Updated: 2020-06-23
• arXiv.cs.MM Pub Date : 2020-06-19
Omid Jafari; Parth Nagarkar

Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many multimedia retrieval applications. Exact tree-based indexing approaches are known to suffer from the notorious curse of dimensionality for high-dimensional data. Approximate searching techniques sacrifice some accuracy while returning good enough results for faster performance. Locality Sensitive Hashing (LSH) is

Updated: 2020-06-23
• arXiv.cs.MM Pub Date : 2020-06-19
Omid Jafari; Parth Nagarkar; Jonathan Montaño

Similarity search in high-dimensional spaces is an important task for many multimedia applications. Due to the notorious curse of dimensionality, approximate nearest neighbor techniques are preferred over exact searching techniques since they can return good enough results at a much better speed. Locality Sensitive Hashing (LSH) is a very popular random hashing technique for finding approximate nearest

Updated: 2020-06-23
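
To make the idea of Locality Sensitive Hashing concrete, here is a minimal sketch of one well-known LSH family, random-hyperplane hashing for cosine similarity. It is offered only as an illustration: the class name `HyperplaneLSH`, its parameters, and the single-table design are assumptions, and the two LSH papers listed above may use a different hash family and index layout.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Single-table random-hyperplane LSH (illustrative sketch)."""

    def __init__(self, dim, num_bits, seed=0):
        rng = random.Random(seed)
        # Each random Gaussian vector defines a hyperplane through the origin.
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.buckets = defaultdict(list)

    def _key(self, vector):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            sum(p * x for p, x in zip(plane, vector)) >= 0
            for plane in self.planes
        )

    def add(self, label, vector):
        self.buckets[self._key(vector)].append(label)

    def candidates(self, query):
        # Only items in the query's bucket are returned, so results are approximate.
        return list(self.buckets.get(self._key(query), []))
```

Vectors separated by a small angle agree on most bits and tend to share a bucket, while dissimilar vectors rarely collide. Practical systems typically query several independent tables and re-rank the candidates by exact distance to trade speed against recall.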
• arXiv.cs.MM Pub Date : 2020-06-13
Aman Chadha; John Britto; M. Mani Roja

Recently, learning-based models have enhanced the performance of single-image super-resolution (SISR). However, applying SISR successively to each video frame leads to a lack of temporal coherency. Convolutional neural networks (CNNs) outperform traditional approaches in terms of image quality metrics such as peak signal to noise ratio (PSNR) and structural similarity (SSIM). However, generative adversarial

Updated: 2020-06-22
• arXiv.cs.MM Pub Date : 2020-06-18
Dhruv Upadhyay; Vaibhav Pandey; Nitish Nag; Ramesh Jain

Sleep is critical to leading a healthy lifestyle. Each day, most people go to sleep without any idea about how their night's rest is going to be. For an activity that humans spend around a third of their life doing, there is a surprising amount of mystery around it. Despite current research, creating personalized sleep models in real-world settings has been challenging. Existing literature provides

Updated: 2020-06-22
• arXiv.cs.MM Pub Date : 2020-06-17
Elad Liebman; Peter Stone

Computers have been used to analyze and create music since they were first introduced in the 1950s and 1960s. Beginning in the late 1990s, the rise of the Internet and large scale platforms for music recommendation and retrieval have made music an increasingly prevalent domain of machine learning and artificial intelligence research. While still nascent, several different approaches have been employed

Updated: 2020-06-19
• arXiv.cs.MM Pub Date : 2020-06-18
Madhawa Vidanapathirana; Supriya Pandhre; Sonia Raychaudhuri; Anjali Khurana

We address the problem of language-based temporal localization of moments in untrimmed videos. Compared to temporal localization with fixed categories, this problem is more challenging as the language-based queries have no predefined activity classes and may also contain complex descriptions. The current state-of-the-art model, MAC, addresses it by mining activity concepts from both video and language modalities

Updated: 2020-06-19
• arXiv.cs.MM Pub Date : 2020-06-16
Hao Hao Tan; Yin-Jyun Luo; Dorien Herremans

We present a controllable neural audio synthesizer based on Gaussian Mixture Variational Autoencoders (GM-VAE), which can generate realistic piano performances in the audio domain that closely follow temporal conditions of two essential style features for piano performances: articulation and dynamics. We demonstrate how the model is able to apply fine-grained style morphing over the course of synthesizing

Updated: 2020-06-18
• arXiv.cs.MM Pub Date : 2020-06-16
Kunal Swami; Prasanna Vishnu Bondada; Pankaj Kumar Bajpai

Single image depth estimation is a challenging problem. The current state-of-the-art method formulates the problem as that of ordinal regression. However, the formulation is not fully differentiable and depth maps are not generated in an end-to-end fashion. The method uses a naïve threshold strategy to determine per-pixel depth labels, which results in significant discretization errors. For the first

Updated: 2020-06-16
• arXiv.cs.MM Pub Date : 2020-06-16
Andrew Rouditchenko; Angie Boggust; David Harwath; Dhiraj Joshi; Samuel Thomas; Kartik Audhkhasi; Rogerio Feris; Brian Kingsbury; Michael Picheny; Antonio Torralba; James Glass

Current methods for learning visually grounded language from videos often rely on time-consuming and expensive data collection, such as human annotated textual summaries or machine generated automatic speech recognition transcripts. In this work, we introduce Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs

Updated: 2020-06-16
• arXiv.cs.MM Pub Date : 2020-06-15
Hana Alghamdi; Rozenn Dahyot

We propose a new method with Nadaraya-Watson that maps one N-dimensional distribution to another taking into account available information about correspondences. We extend the 2D/3D problem to higher dimensions by encoding overlapping neighborhoods of data points and solve the high dimensional problem in 1D space using an iterative projection approach. To show the potential of this mapping, we apply it

Updated: 2020-06-15
• arXiv.cs.MM Pub Date : 2020-06-15
Lukas Stappen; Xinchen Du; Vincent Karas; Stefan Müller; Björn W. Schuller

Systems for the automatic recognition and detection of automotive parts are crucial in several emerging research areas in the development of intelligent vehicles. They enable, for example, the detection and modelling of interactions between humans and the vehicle. In this paper, we present three suitable datasets and explore, quantitatively and qualitatively, the efficacy of state-of-the-art deep

Updated: 2020-06-15
• arXiv.cs.MM Pub Date : 2020-06-15
Ziwei Wang; Zi Huang; Yadan Luo; Huimin Lu

With the rapid advancement of image captioning and visual question answering at single-round level, the question of how to generate multi-round dialogue about visual content has not yet been well explored. Existing visual dialogue methods encode the image into a fixed feature vector directly, concatenated with the question and history embeddings to predict the response. Some recent methods tackle the

Updated: 2020-06-15
• arXiv.cs.MM Pub Date : 2020-06-15
Ruixiang Tang; Mengnan Du; Yuening Li; Zirui Liu; Xia Hu

Recent studies have shown that captioning datasets, such as the COCO dataset, may contain severe social bias which could potentially lead to unintentional discrimination in learning models. In this work, we specifically focus on the gender bias problem. The existing dataset fails to quantify bias because models that intrinsically memorize gender bias from training data could still achieve a competitive

Updated: 2020-06-15
• arXiv.cs.MM Pub Date : 2020-06-12
Susanne M Hoffmann

This paper summarises briefly and in English some of the results of the book Hoffmann: Hipparchs Himmelsglobus, Springer, 2017 that had to be written in German. The globe of Hipparchus is not preserved. For that reason, it has been a source of much speculation and scientific inquiry during the last few centuries. This study presents a new analysis of the data given in the commentary on Aratus' poem

Updated: 2020-06-12
• arXiv.cs.MM Pub Date : 2020-06-11
Luka Murn; Saverio Blasi; Alan F. Smeaton; Noel E. O'Connor; Marta Mrak

Deep learning has shown great potential in image and video compression tasks. However, it brings bit savings at the cost of significant increases in coding complexity, which limits its potential for implementation within practical applications. In this paper, a novel neural network-based tool is presented which improves the interpolation of reference samples needed for fractional precision motion compensation

Updated: 2020-06-11
• arXiv.cs.MM Pub Date : 2020-06-11
David A. Shamma; Tony Dunnigan; Lyndon Kennedy

Photo applications offer tools for annotation via text and stickers. Ideophones, mimetic and onomatopoeic words, which are common in graphic novels, have yet to be explored for photo annotation use. We present a method for automatic ideophone recommendation and positioning of the text on photos. These annotations are accomplished by obtaining a list of ideophones with English definitions and applying

Updated: 2020-06-11
• arXiv.cs.MM Pub Date : 2020-06-09
Huaizheng Zhang; Yuanming Li; Qiming Ai; Yong Luo; Yonggang Wen; Yichao Jin; Nguyen Binh Duong Ta

Combining video streaming and online retailing (V2R) has been a growing trend recently. In this paper, we provide practitioners and researchers in multimedia with a cloud-based platform named Hysia for easy development and deployment of V2R applications. The system consists of: 1) a back-end infrastructure providing optimized V2R related services including data engine, model

Updated: 2020-06-09
• arXiv.cs.MM Pub Date : 2020-06-06
Marcin Plata; Piotr Syga

In this paper we present a novel deep framework for watermarking, a technique of embedding a transparent message into an image in a way that allows retrieving the message from a (perturbed) copy, so that copyright infringement can be tracked. For this technique, it is essential to extract the information from the image even after imposing some digital processing operations on it. Our framework outperforms

Updated: 2020-06-06
• arXiv.cs.MM Pub Date : 2020-06-06
Flavio Bertini; Rajesh Sharma; Danilo Montesi

In the last decade, Social Networks (SNs) have deeply changed many aspects of society, and one of the most widespread behaviours is the sharing of pictures. However, malicious users often exploit shared pictures to create fake profiles leading to the growth of cybercrime. Thus, keeping in mind this scenario, authorship attribution and verification through image watermarking techniques are becoming

Updated: 2020-06-06
• arXiv.cs.MM Pub Date : 2020-06-06
Sachin Singh; Victor Sanchez; Tanaya Guha

We propose a computational framework for ranking images (group photos in particular) taken at the same event within a short time span. The ranking is expected to correspond with human perception of overall appeal of the images. We hypothesize and provide evidence through subjective analysis that the factors that appeal to humans are its emotional content, aesthetics and image quality. We propose a

Updated: 2020-06-06
• arXiv.cs.MM Pub Date : 2020-06-05
Bofan Xue; David Chan; John Canny

We present a new publicly available dataset with the goal of advancing multi-modality learning by offering vision and language data within the same context. This is achieved by obtaining data from a social media website with posts containing multiple paired images/videos and text, along with comment trees containing images/videos and/or text. With a total of 677k posts, 2.9 million post images, 488k

Updated: 2020-06-05
Contents have been reproduced by permission of the publishers.