• arXiv.cs.MM Pub Date : 2020-01-19
Amit Kumar Jaiswal; Haiming Liu; Ingo Frommholz

Implicit user feedback plays an important role in recommender systems. However, finding implicit features is a tedious task. This paper aims to identify users' preferences through implicit behavioural signals for image recommendation based on the Information Scent Model of Information Foraging Theory. In the first part, we hypothesise that users' perception is improved by visual cues in the images, which act as behavioural signals that provide users' information scent during information seeking. We designed a content-based image recommendation system to explore which image attributes (i.e., visual cues or bookmarks) help users find their desired image. We found that users prefer recommendations predicated on visual cues and therefore consider the visual cues to be good information scent for their information seeking. In the second part, we investigated whether visual cues in the images together with the images themselves are better perceived by users than either on its own. We evaluated the information scent artifacts in image recommendation on the Pinterest image collection and the WikiArt dataset. We find that our proposed image recommendation system supports the implicit signals through the Information Foraging explanation of the information scent model.

Updated: 2020-01-22
• arXiv.cs.MM Pub Date : 2020-01-19
Meysam Asgari-Chenaghlu; M. Reza Feizi-Derakhshi; Leili Farzinvash; Cina Motamed

Named Entity Recognition (NER) from social media posts is a challenging task. User-generated content, which is the essence of social media, is noisy and contains grammatical and linguistic errors. This noise makes tasks such as named entity recognition much harder. However, some applications, such as automatic journalism or information retrieval from social media, require more information about the entities mentioned in groups of social media posts. Conventional methods provide acceptable results on structured and well-typed documents, but they are not satisfactory on new user-generated media. One valuable piece of information about an entity is the image associated with the text. Combining this multimodal data reduces ambiguity and provides wider information about the entities mentioned. To address this issue, we propose a novel approach based on multimodal deep learning. Our solution provides more accurate results on the named entity recognition task. Experimental results, namely the precision, recall and F1 score metrics, show the superiority of our work compared to other state-of-the-art NER solutions.
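The abstract reports precision, recall and F1 score. As a minimal sketch of how these metrics are commonly computed at the entity level, the Python below compares a set of predicted entities against a gold set. The tuple layout `(start, end, type)` and the function name `precision_recall_f1` are assumptions made for illustration; the paper's own evaluation script is not described in the abstract.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level scores; both arguments are sets of (start, end, type) tuples.

    An entity counts as correct only on an exact match of span and type
    (an assumption for this sketch; other matching schemes exist).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: three gold entities, two predictions, one exact match.
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")}
predicted = {(0, 2, "PER"), (5, 6, "ORG")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```

The second prediction has the right span but the wrong type, so it counts against both precision and recall under exact matching.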

Updated: 2020-01-22
• arXiv.cs.MM Pub Date : 2020-01-20
Yichao Zhou; Shaunak Mishra; Manisha Verma; Narayan Bhamidipati; Wei Wang

Updated: 2020-01-22
• arXiv.cs.MM Pub Date : 2020-01-21
Adarsh Pyarelal; Marco A. Valenzuela-Escarcega; Rebecca Sharp; Paul D. Hein; Jon Stephens; Pratik Bhandari; HeuiChan Lim; Saumya Debray; Clayton T. Morrison

Models of complicated systems can be represented in different ways - in scientific papers, they are represented using natural language text as well as equations. But to be of real use, they must also be implemented as software, thus making code a third form of representing models. We introduce the AutoMATES project, which aims to build semantically-rich unified representations of models from scientific code and publications to facilitate the integration of computational models from different domains and allow for modeling large, complicated systems that span multiple domains and levels of abstraction.

Updated: 2020-01-22
• arXiv.cs.MM Pub Date : 2020-01-21
Ghalia Merzougui (TECHNÉ - EA 6316); Roumaissa Dehkal (TECHNÉ - EA 6316); Maheiddine Djoudi (TECHNÉ - EA 6316)

Interactive multimedia educational content has recently attracted interest as a way to capture learners' attention and improve their understanding. In parallel, several open source authoring tools offer quick and easy production of this type of content. Our contribution is to mediatize a course, namely 'English', with the authoring system 'Xerte', which is intended both for simple users and for developers in ActionScript. An experiment with the course was conducted on a sample of students from a private school. At the end of the experiment, we administered a questionnaire to evaluate the system; the results show a favorable reception of interactive multimedia integration in educational content.

Updated: 2020-01-22
• arXiv.cs.MM Pub Date : 2019-10-04
Bo Wu; Wen-Huang Cheng; Peiye Liu; Bei Liu; Zhaoyang Zeng; Jiebo Luo

"SMP Challenge" aims to discover novel prediction tasks for the large volume of data on social multimedia and to attract excellent research teams. Making predictions via social multimedia data (e.g. photos, videos or news) not only helps us to make better strategic decisions for the future, but also explores advanced predictive learning and analytic methods on various problems and scenarios, such as multimedia recommendation, advertising systems, fashion analysis etc. In the SMP Challenge at ACM Multimedia 2019, we introduce a novel prediction task, Temporal Popularity Prediction, which focuses on predicting future interaction or attractiveness (in terms of clicks, views or likes etc.) of new online posts in social media feeds before uploading. We also collected and released a large-scale SMPD benchmark with over 480K posts from 69K users. In this paper, we define the challenge problem, give an overview of the dataset, present statistics on the data and annotations, and design the accuracy and correlation evaluation metrics for temporal popularity prediction used in the challenge.

Updated: 2020-01-22
• arXiv.cs.MM Pub Date : 2019-10-24
Dongxu Li; Cristian Rodriguez Opazo; Xin Yu; Hongdong Li

Vision-based sign language recognition aims at helping deaf people to communicate with others. However, most existing sign language datasets are limited to a small number of words. Due to the limited vocabulary size, models learned from those datasets cannot be applied in practice. In this paper, we introduce a new large-scale Word-Level American Sign Language (WLASL) video dataset, containing more than 2000 words performed by over 100 signers. This dataset will be made publicly available to the research community. To our knowledge, it is by far the largest public ASL dataset to facilitate word-level sign recognition research. Based on this new large-scale dataset, we are able to experiment with several deep learning methods for word-level sign recognition and evaluate their performance in large-scale scenarios. Specifically, we implement and compare two different models, i.e., (i) a holistic visual appearance-based approach, and (ii) a 2D human pose-based approach. Both models are valuable baselines that will benefit the community for method benchmarking. Moreover, we also propose a novel pose-based temporal graph convolution network (Pose-TGCN) that models spatial and temporal dependencies in human pose trajectories simultaneously, which further boosts the performance of the pose-based method. Our results show that pose-based and appearance-based models achieve comparable performance, up to 66% top-10 accuracy on 2,000 words/glosses, demonstrating the validity and challenges of our dataset. Our dataset and baseline deep models are available at https://dxli94.github.io/WLASL/.

Updated: 2020-01-22
• arXiv.cs.MM Pub Date : 2020-01-12
Yiyan Chen; Li Tao; Xueting Wang; Toshihiko Yamasaki

Conventional video summarization approaches based on reinforcement learning have the problem that the reward can only be received after the whole summary is generated. This kind of reward is sparse, which makes it hard for reinforcement learning to converge. Another problem is that labelling each frame is tedious and costly, which usually prohibits the construction of large-scale datasets. To solve these problems, we propose a weakly supervised hierarchical reinforcement learning framework, which decomposes the whole task into several subtasks to enhance the summarization quality. This framework consists of a manager network and a worker network. For each subtask, the manager is trained to set a subgoal using only a task-level binary label, which requires far fewer labels than conventional approaches. Guided by the subgoal, the worker predicts the importance scores for video frames in the subtask by policy gradient according to both the global reward and newly defined sub-rewards, which overcomes the sparsity problem. Experiments on two benchmark datasets show that our proposal achieves the best performance, even better than supervised approaches.

Updated: 2020-01-17
• arXiv.cs.MM Pub Date : 2020-01-16
Viet Duong; Phu Pham; Ritwik Bose; Jiebo Luo

Recently, the emergence of the #MeToo trend on social media has empowered thousands of people to share their own sexual harassment experiences. This viral trend, in conjunction with the massive personal information and content available on Twitter, presents a promising opportunity to extract data driven insights to complement the ongoing survey based studies about sexual harassment in college. In this paper, we analyze the influence of the #MeToo trend on a pool of college followers. The results show that the majority of topics embedded in those #MeToo tweets detail sexual harassment stories, and there exists a significant correlation between the prevalence of this trend and official reports on several major geographical regions. Furthermore, we discover the outstanding sentiments of the #MeToo tweets using deep semantic meaning representations and their implications on the affected users experiencing different types of sexual harassment. We hope this study can raise further awareness regarding sexual misconduct in academia.

Updated: 2020-01-17
• arXiv.cs.MM Pub Date : 2019-07-22
Quang-Trung Luu; Sylvaine Kerboeuf; Alexandre Mouradian; Michel Kieffer

With network slicing in 5G networks, Mobile Network Operators can create various slices for Service Providers (SPs) to accommodate customized services. Usually, the various Service Function Chains (SFCs) belonging to a slice are deployed on a best-effort basis. Nothing ensures that the Infrastructure Provider (InP) will be able to allocate enough resources to cope with the increasing demands of some SP. Moreover, in many situations, slices have to be deployed over some geographical area: coverage as well as minimum per-user rate constraints then have to be taken into account. This paper takes the InP perspective and proposes a slice resource provisioning approach to cope with multiple slice demands in terms of computing, storage, coverage, and rate constraints. The resource requirements of the various SFCs within a slice are aggregated within a graph of Slice Resource Demands (SRD). Infrastructure nodes and links then have to be provisioned so as to satisfy all SRDs. This problem leads to a Mixed Integer Linear Programming formulation. A two-step approach is considered, with several variants, depending on whether the constraints of each slice to be provisioned are taken into account sequentially or jointly. Once provisioning has been performed, any slice deployment strategy may be considered on the reduced-size infrastructure graph on which resources have been provisioned. Simulation results demonstrate the effectiveness of the proposed approach compared to a more classical direct slice embedding approach.

Updated: 2020-01-17
• arXiv.cs.MM Pub Date : 2020-01-15
Rabie Hachemi; Ikram Achar; Biasi Wiga; Mahfoud Sidi Ali Mebarek

Humans are capable of identifying a book only by looking at its cover, but how can computers do the same? In this paper, we explore different feature detectors and matching methods for book cover identification, and compare their performance in terms of both speed and accuracy. This will allow, for example, libraries to develop interactive services based on book cover pictures. Only a single image of each book cover needs to be available in a database. Tests have been performed by taking into account different transformations of each book cover image. Encouraging results have been achieved.

Updated: 2020-01-16
• arXiv.cs.MM Pub Date : 2020-01-15
Linsen Song; Wayne Wu; Chen Qian; Ran He; Chen Change Loy

We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic. It does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, therefore preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.

Updated: 2020-01-16
• arXiv.cs.MM Pub Date : 2020-01-14
Xiyang Luo; Ruohan Zhan; Huiwen Chang; Feng Yang; Peyman Milanfar

Watermarking is the process of embedding information into an image that can survive under distortions, while requiring the encoded image to have little or no perceptual difference from the original image. Recently, deep learning-based methods achieved impressive results in both visual quality and message payload under a wide variety of image distortions. However, these methods all require differentiable models for the image distortions at training time, and may generalize poorly to unknown distortions. This is undesirable since the types of distortions applied to watermarked images are usually unknown and non-differentiable. In this paper, we propose a new framework for distortion-agnostic watermarking, where the image distortion is not explicitly modeled during training. Instead, the robustness of our system comes from two sources: adversarial training and channel coding. Compared to training on a fixed set of distortions and noise levels, our method achieves comparable or better results on distortions available during training, and better performance on unknown distortions.

Updated: 2020-01-15
• arXiv.cs.MM Pub Date : 2020-01-14
C. Estelle Smith; Eduardo Nevarez; Haiyi Zhu

Mass media afford researchers critical opportunities to disseminate research findings and trends to the general public. Yet researchers also perceive that their work can be miscommunicated in mass media, thus generating unintended understandings of HCI research by the general public. We conduct a Grounded Theory analysis of interviews with 12 HCI researchers and find that miscommunication can occur at four origins along the socio-technical infrastructure known as the Media Production Pipeline (MPP) for science news. Results yield researchers' perceived hazards of disseminating their work through mass media, as well as strategies for fostering effective communication of research. We conclude with implications for augmenting or innovating new MPP technologies.

Updated: 2020-01-15
• arXiv.cs.MM Pub Date : 2019-11-19
Zhijie Lin; Zhou Zhao; Zhu Zhang; Qi Wang; Huasheng Liu

Video moment retrieval aims to find the moment that is most relevant to a given natural language query. Existing methods are mostly trained in a fully-supervised setting, which requires full annotation of the temporal boundary for each query. However, manually labeling these annotations is time-consuming and expensive. In this paper, we propose a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training. Specifically, we devise a proposal generation module that aggregates the context information to generate and score all candidate proposals in a single pass. We then devise an algorithm that considers both exploitation and exploration to select the top-K proposals. Next, we build a semantic completion module to measure the semantic similarity between the selected proposals and the query, compute a reward and provide feedback to the proposal generation module for scoring refinement. Experiments on the ActivityCaptions and Charades-STA datasets demonstrate the effectiveness of our proposed method.

Updated: 2020-01-15
• arXiv.cs.MM Pub Date : 2020-01-13
Abhinav Shukla; Konstantinos Vougioukas; Pingchuan Ma; Stavros Petridis; Maja Pantic

Self supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. However, most works typically focus on a particular modality or feature alone and there has been very limited work that studies the interaction between the two modalities for learning self supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. Through this process, the audio encoder network learns useful speech representations that we evaluate on emotion recognition and speech recognition. We achieve state of the art results for emotion recognition and competitive results for speech recognition. This demonstrates the potential of visual supervision for learning audio representations as a novel way for self-supervised learning which has not been explored in the past. The proposed unsupervised audio features can leverage a virtually unlimited amount of training data of unlabelled audiovisual speech and have a large number of potentially promising applications.

Updated: 2020-01-14
• arXiv.cs.MM Pub Date : 2020-01-13
Kangle Deng; Aayush Bansal; Deva Ramanan

We present an unsupervised approach that enables us to convert the speech input of any one individual to an output set of potentially infinitely many speakers. One can stand in front of a mic and make their favorite celebrity say the same words. Our approach builds on simple autoencoders that project out-of-sample data to the distribution of the training set (motivated by PCA/linear autoencoders). We use an exemplar autoencoder to learn the voice and specific style (emotions and ambiance) of a target speaker. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers in very little time, using only two to three minutes of audio data from a speaker. We also demonstrate the usefulness of our approach for generating video from audio signals and vice versa. We encourage the reader to check out our project webpage for various synthesized examples: https://dunbar12138.github.io/projectpage/Audiovisual/

Updated: 2020-01-14
• arXiv.cs.MM Pub Date : 2019-07-05
Xavier Bost (LIA); Vincent Labatut (LIA)

A character network is a graph extracted from a narrative, in which vertices represent characters and edges correspond to interactions between them. A number of narrative-related problems can be addressed automatically through the analysis of character networks, such as summarization, classification, or role detection. Character networks are particularly relevant when considering works of fiction (e.g. novels, plays, movies, TV series), as their exploitation allows developing information retrieval and recommendation systems. However, works of fiction possess specific properties that make these tasks harder. This survey aims at presenting and organizing the scientific literature related to the extraction of character networks from works of fiction, as well as their analysis. We first describe the extraction process in a generic way, and explain how its constituent steps are implemented in practice, depending on the medium of the narrative, the goal of the network analysis, and other factors. We then review the descriptive tools used to characterize character networks, with a focus on the way they are interpreted in this context. We illustrate the relevance of character networks by also providing a review of applications derived from their analysis. Finally, we identify the limitations of the existing approaches, and the most promising perspectives.
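To make the graph definition concrete, here is a minimal Python sketch that builds a character network from a list of scenes, assuming that two characters "interact" whenever they appear in the same scene. This co-occurrence rule, the toy scene data and the function names are illustrative assumptions; the survey covers many other ways of detecting interactions.

```python
from collections import Counter
from itertools import combinations


def build_character_network(scenes):
    """Map each unordered character pair to the number of scenes they share."""
    edges = Counter()
    for characters in scenes:
        # sorted(set(...)) removes duplicates and gives each pair a canonical order
        for a, b in combinations(sorted(set(characters)), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum of edge weights per character, a simple descriptive measure."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Hypothetical narrative with three scenes.
scenes = [["Alice", "Bob"], ["Alice", "Bob", "Carol"], ["Carol", "Dave"]]
edges = build_character_network(scenes)
degree = weighted_degree(edges)
print(edges[("Alice", "Bob")], degree["Dave"])
```

Weighted degree is only one of the descriptive tools such a graph supports; centrality or community detection could be computed on the same edge list.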

Updated: 2020-01-14
• arXiv.cs.MM Pub Date : 2020-01-09

Digital watermarking has been widely used in applications such as copyright protection of digital media, including audio, image, and video files. Watermarking methods must balance two opposing criteria: robustness and transparency. In this paper, we propose a framework for determining the appropriate embedding strength factor. The framework can be used with most DWT and DCT based blind watermarking approaches. We use Mask R-CNN on the COCO dataset to find a good strength factor for each sub-block. Experiments show that this method is robust against different attacks and has good transparency.

Updated: 2020-01-13
• arXiv.cs.MM Pub Date : 2020-01-10
Yiwei Zhang; Xueting Wang; Yoshiaki Sakai; Toshihiko Yamasaki

In this paper, we propose a new measure to estimate the similarity between brands via posts of brands' followers on social network services (SNS). Our method was developed with the intention of exploring the brands that customers are likely to jointly purchase. Nowadays, brands use social media for targeted advertising because influencing users' preferences can greatly affect the trends in sales. We assume that data on SNS allows us to make quantitative comparisons between brands. Our proposed algorithm analyzes the daily photos and hashtags posted by each brand's followers. By clustering them and converting them to histograms, we can calculate the similarity between brands. We evaluated our proposed algorithm with purchase logs, credit card information, and answers to questionnaires. The experimental results show that the purchase data maintained by a mall or a credit card company can predict co-purchase very well, but not the customer's willingness to buy products of new brands. On the other hand, our method can predict users' interest in brands with a correlation value over 0.53, which is quite high considering that such interest in brands is highly subjective and individual-dependent.
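The cluster-then-histogram step can be sketched as follows in Python, assuming that each follower post has already been assigned a cluster id and that histograms are compared with cosine similarity. The abstract does not say which clustering method or similarity function the authors use, so both choices, as well as the function names and toy data, are assumptions for illustration.

```python
import math
from collections import Counter


def normalized_histogram(cluster_ids, num_clusters):
    """Fraction of a brand's follower posts that fall into each cluster."""
    total = len(cluster_ids)
    if total == 0:
        return [0.0] * num_clusters
    counts = Counter(cluster_ids)
    return [counts.get(k, 0) / total for k in range(num_clusters)]


def cosine_similarity(u, v):
    """Cosine of the angle between two histograms (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical cluster assignments for posts by followers of three brands.
brand_a = normalized_histogram([0, 0, 1, 2], num_clusters=4)
brand_b = normalized_histogram([0, 0, 1, 2], num_clusters=4)
brand_c = normalized_histogram([3, 3, 3, 3], num_clusters=4)

sim_ab = cosine_similarity(brand_a, brand_b)
sim_ac = cosine_similarity(brand_a, brand_c)
print(sim_ab, sim_ac)
```

Brands whose followers post in the same clusters score near 1, while brands with disjoint cluster usage score 0.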

Updated: 2020-01-13
• arXiv.cs.MM Pub Date : 2020-01-10
Jie Li; Ransheng Feng; Zhi Liu; Wei Sun; Qiyue Li

360-degree video provides an immersive 360-degree viewing experience and has been widely used in many areas. The 360-degree video live streaming systems involve capturing, compression, uplink (camera to video server) and downlink (video server to user) transmissions. However, few studies have jointly investigated such complex systems, especially the rate adaptation for the coupled uplink and downlink in the 360-degree video streaming under limited bandwidth constraints. In this letter, we propose a quality of experience (QoE)-driven 360-degree video live streaming system, in which a video server performs rate adaptation based on the uplink and downlink bandwidths and information concerning each user's real-time field-of-view (FOV). We formulate it as a nonlinear integer programming problem and propose an algorithm, which combines the Karush-Kuhn-Tucker (KKT) condition and branch and bound method, to solve it. The numerical results show that the proposed optimization model can improve users' QoE significantly in comparison with other baseline schemes.

Updated: 2020-01-13
• arXiv.cs.MM Pub Date : 2020-01-10
Jolien De Letter; Anissa All; Lieven De Marez; Vasileios Avramelos; Peter Lambert; Glenn Van Wallendael

In this paper we assess the impact of head movement on users' visual acuity and their quality perception of impaired images. There are physical limitations on the amount of visual information a person can perceive and physical limitations regarding the speed at which our body, and as a consequence our head, can explore a scene. In these limitations lie fundamental solutions for the communication of multimedia systems. As such, subjects were asked to evaluate the perceptual quality of static images presented on a TV screen while their head was in a dynamic (moving) state. The idea is potentially applicable to virtual reality applications and therefore we also measured the image quality perception of each subject on a head-mounted display. Experiments show a significant decrease in visual acuity and quality perception when the user's head is not static, and give an indication of how much the quality can be reduced without the user noticing any impairments.

Updated: 2020-01-13
• arXiv.cs.MM Pub Date : 2020-01-08
Taburet Théo; Bas Patrick; Sawaya Wadih; Jessica Fridrich

In order to achieve high practical security, Natural Steganography (NS) uses cover images captured at ISO sensitivity $ISO_{1}$ and generates stego images mimicking ISO sensitivity $ISO_{2}>ISO_{1}$. This is achieved by adding a stego signal to the cover that mimics the sensor photonic noise. This paper proposes an embedding mechanism to perform NS in the JPEG domain after linear developments by explicitly computing the correlations between DCT coefficients before quantization. In order to compute the covariance matrix of the photonic noise in the DCT domain, we first develop the matrix representation of demosaicking, luminance averaging, pixel section, and 2D-DCT. A detailed analysis of the resulting covariance matrix is done in order to explain the origins of the correlations between the coefficients of $3\times3$ DCT blocks. An embedding scheme is then presented that takes into account all the correlations. It employs 4 sub-lattices and 64 lattices per sub-lattice. The modification probabilities of each DCT coefficient are then derived by computing conditional probabilities from the multivariate Gaussian distribution using the Cholesky decomposition of the covariance matrix. This derivation is also used to compute the embedding capacity of each image. Using a specific database called E1 Base, we show that in the JPEG domain NS (J-Cov-NS) achieves high capacity (more than 2 bits per non-zero AC DCT) with high practical security ($P_{\mathrm{E}}\simeq40\%$ using DCTR from QF 75 to QF 100).
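The role of the Cholesky decomposition can be illustrated with a small, generic Python sketch: given a covariance matrix with factor L (so that the covariance equals L times its transpose), a zero-mean correlated Gaussian vector is obtained as x = L z with z standard normal. Component i then has, conditionally on the earlier components, mean sum over k < i of L[i][k] z[k] and standard deviation L[i][i], which is the kind of sequential conditional distribution the abstract refers to. The 2x2 covariance below is a made-up example, not a photonic-noise covariance from the paper, and the function names are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_correlated_gaussian(lower, rng):
    """Draw x = L z, where z has independent standard normal entries."""
    n = len(lower)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


# Illustrative covariance between two neighbouring coefficients (assumed values).
covariance = [[4.0, 2.0], [2.0, 3.0]]
lower = cholesky(covariance)
sample = sample_correlated_gaussian(lower, random.Random(0))
print(lower, sample)
```

Here the second coefficient has conditional standard deviation `lower[1][1]`, smaller than its marginal standard deviation, because part of its variance is explained by the first coefficient.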

Updated: 2020-01-09
• arXiv.cs.MM Pub Date : 2019-08-04
Linda Woodburn; Yalong Yang; Kim Marriott

We have compared three common visualisations for hierarchical quantitative data (treemaps, icicle plots and sunburst charts) as well as a semicircular variant of the sunburst chart that we call the sundown chart. In a pilot study, we found that the sunburst chart was least preferred. In a controlled study with 12 participants, we compared treemaps, icicle plots and sundown charts. The treemap was the least preferred and was slower on a basic navigation task, and both slower and less accurate on hierarchy understanding tasks. The icicle plot and sundown chart had similar performance, with a slight user preference for the icicle plot.

Updated: 2020-01-09
• arXiv.cs.MM Pub Date : 2020-01-06
Stefan Lattner

In recent years, artificial neural networks (ANNs) have become a universal tool for tackling real-world problems. ANNs have also shown great success in music-related tasks including music summarization and classification, similarity estimation, computer-aided or autonomous composition, and automatic music analysis. As structure is a fundamental characteristic of Western music, it plays a role in all these tasks. Some structural aspects are particularly challenging to learn with current ANN architectures. This is especially true for mid- and high-level self-similarity, tonal and rhythmic relationships. In this thesis, I explore the application of ANNs to different aspects of musical structure modeling, identify some challenges involved and propose strategies to address them. First, using probability estimations of a Restricted Boltzmann Machine (RBM), a probabilistic bottom-up approach to melody segmentation is studied. Then, a top-down method for imposing a high-level structural template in music generation is presented, which combines Gibbs sampling using a convolutional RBM with gradient-descent optimization on the intermediate solutions. Furthermore, I motivate the relevance of musical transformations in structure modeling and show how a connectionist model, the Gated Autoencoder (GAE), can be employed to learn transformations between musical fragments. For learning transformations in sequences, I propose a special predictive training of the GAE, which yields a representation of polyphonic music as a sequence of intervals. Furthermore, the applicability of these interval representations to a top-down discovery of repeated musical sections is shown. Finally, a recurrent variant of the GAE is proposed, and its efficacy in music prediction and modeling of low-level repetition structure is demonstrated.

Updated: 2020-01-08
• arXiv.cs.MM Pub Date : 2020-01-07
Hanhe Lin; Vlad Hosu; Chunling Fan; Yun Zhang; Yuchen Mu; Raouf Hamzaoui; Dietmar Saupe

The Satisfied User Ratio (SUR) curve for a lossy image compression scheme, e.g., JPEG, gives the distribution function of the Just Noticeable Difference (JND), the smallest distortion level that can be perceived by a subject when a reference image is compared to a distorted one. A sequence of JNDs can be defined with a suitable successive choice of reference images. We propose the first deep learning approach to predict SUR curves. We show how to exploit maximum likelihood estimation and the Kolmogorov-Smirnov test to select a suitable parametric model for the distribution function. We then use deep feature learning to predict samples of the SUR curve and apply the method of least squares to fit the parametric model to the predicted samples. Our deep learning approach relies on a Siamese Convolutional Neural Network (CNN), transfer learning, and deep feature learning, using pairs consisting of a reference image and a compressed image for training. Experiments on the MCL-JCI dataset showed state-of-the-art performance. For example, the mean Bhattacharyya distances between the predicted and ground truth first, second, and third JND distributions were 0.0810, 0.0702, and 0.0522, respectively, and the corresponding average absolute differences of the peak signal-to-noise ratio at the median of the distributions were 0.56, 0.65, and 0.53 dB.
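Two quantities from the abstract can be sketched in a few lines of Python. The first function computes an empirical SUR value from per-subject JND samples, assuming that a subject counts as satisfied at a distortion level when their JND lies strictly above that level (the boundary convention is an assumption). The second computes the Bhattacharyya distance between two discrete distributions using the standard definition, the negative logarithm of the sum of sqrt(p_i q_i). The JND values are invented for illustration and are not from MCL-JCI.

```python
import math


def empirical_sur(jnd_samples, level):
    """Fraction of subjects who do not yet notice distortion at `level`."""
    satisfied = sum(1 for jnd in jnd_samples if jnd > level)
    return satisfied / len(jnd_samples)


def bhattacharyya_distance(p, q):
    """Distance between two discrete distributions given as probability lists."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if coefficient == 0:
        return math.inf  # no overlap between the two distributions
    return -math.log(coefficient)


# Hypothetical JND positions (distortion levels) reported by five subjects.
jnds = [20, 25, 30, 30, 40]
curve = [(level, empirical_sur(jnds, level)) for level in (10, 25, 30, 45)]
print(curve)
print(bhattacharyya_distance([0.5, 0.5], [0.5, 0.5]))
```

The curve is non-increasing in the distortion level, which is what a parametric model fitted by least squares would have to reproduce.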

Updated: 2020-01-08
• arXiv.cs.MM Pub Date : 2019-03-26
Jun Yu; Xiao-Jun Wu

Measuring the similarity between heterogeneous data is still an open problem. Many methods have been developed to learn a common subspace where the similarity between different modalities can be calculated directly. However, most existing works focus on learning a latent subspace without preserving the semantic structural information well, and thus cannot achieve the desired results. In this paper, we propose a novel framework, termed Cross-modal subspace learning via Kernel correlation maximization and Discriminative structure-preserving (CKD), to solve this problem in two respects. Firstly, we construct a shared semantic graph so that the data of each modality preserves the semantic neighbor relationships. Secondly, we introduce the Hilbert-Schmidt Independence Criterion (HSIC) to ensure consistency between the feature similarity and the semantic similarity of samples. Our model not only considers the inter-modality correlation by maximizing the kernel correlation but also preserves the semantic structural information within each modality. Extensive experiments are performed to evaluate the proposed framework on three public datasets. The experimental results demonstrate that the proposed CKD is competitive compared with classic subspace learning methods.
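For readers unfamiliar with HSIC, the sketch below computes the standard biased empirical estimator, tr(K H L H) / (n - 1)^2, where K and L are kernel matrices over the same n samples and H = I - (1/n) 1 1^T is the centering matrix. It uses plain Python lists and a linear kernel to stay dependency-free; the kernel choice, function names and toy data are assumptions, and this is not the CKD objective itself, only the dependence measure it builds on.

```python
def linear_kernel(samples):
    """Gram matrix of inner products between feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in samples] for u in samples]


def center(kernel):
    """Compute H K H by removing row means and column means, then adding the grand mean."""
    n = len(kernel)
    row_means = [sum(row) / n for row in kernel]
    col_means = [sum(kernel[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [kernel[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(kernel_x, kernel_y):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(kernel_x)
    centered = center(kernel_x)
    # trace(K H L H) equals trace((H K H) L) by cyclicity of the trace
    trace = sum(centered[i][j] * kernel_y[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Toy data: x varies across samples, y_const carries no information about x.
kx = linear_kernel([[1.0], [2.0], [3.0]])
ky_const = linear_kernel([[5.0], [5.0], [5.0]])
print(hsic(kx, kx), hsic(kx, ky_const))
```

A larger value indicates stronger dependence between the two views, which is why maximizing it encourages feature similarity and semantic similarity to agree.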

Updated: 2020-01-08
• arXiv.cs.MM Pub Date : 2020-01-06
Jie Li; Cong Zhang; Zhi Liu; Wei Sun; Qiyue Li

Point cloud video is the most popular representation of holograms, which are the medium for presenting natural content in VR/AR/MR, and it is expected to be the next generation of video. Point cloud video systems provide users with an immersive viewing experience with six degrees of freedom and have wide applications in fields such as online education and entertainment. To further enhance these applications, point cloud video streaming is in critical demand. The inherent challenges lie in the large data size, caused by the need to record three-dimensional coordinates in addition to color information, and in the associated high computational complexity of encoding. To this end, this paper proposes a communication and computation resource allocation scheme for QoE-driven point cloud video streaming. In particular, we maximize system resource utilization by selecting different quantities, transmission forms, and quality levels of tiles to maximize the quality of experience. Extensive simulations are conducted, and the simulation results show superior performance over existing schemes.

Updated: 2020-01-07
• arXiv.cs.MM Pub Date : 2018-08-13
Jun Yu; Xiao-Jun Wu; Josef Kittler

Hashing techniques have been applied broadly in retrieval tasks due to their low storage requirements and high speed of processing. Many hashing methods based on a single view have been extensively studied for information retrieval. However, the representation capacity of a single view is insufficient and some discriminative information is not captured, which results in limited improvement. In this paper, we employ multiple views to represent images and texts for enriching the feature information. Our framework exploits the complementary information among multiple views to better learn the discriminative compact hash codes. A discrete hashing learning framework that jointly performs classifier learning and subspace learning is proposed to complete multiple search tasks simultaneously. Our framework includes two stages, namely a kernelization process and a quantization process. Kernelization aims to find a common subspace where multi-view features can be fused. The quantization stage is designed to learn discriminative unified hashing codes. Extensive experiments are performed on single-label datasets (WiKi and MMED) and multi-label datasets (MIRFlickr and NUS-WIDE) and the experimental results indicate the superiority of our method compared with the state-of-the-art methods.
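
To make the storage and speed argument concrete, the sketch below shows the generic idea behind hash-based retrieval: real-valued features are quantized to bits and compared by Hamming distance. Sign thresholding is used here only as a simple stand-in; it is an assumption for illustration and not the discrete learning procedure proposed in the paper.

```python
def binarize(features):
    """Quantize real-valued features to bits by sign (illustrative stand-in)."""
    return [1 if value >= 0 else 0 for value in features]


def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


# Hypothetical fused features for a query and a database item.
query_code = binarize([0.3, -1.2, 0.0, 2.5])
item_code = binarize([0.1, 0.4, -0.7, 1.9])
distance = hamming_distance(query_code, item_code)
```

Each item is stored as a few bits rather than a float vector, and ranking by Hamming distance needs only bit comparisons.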

Updated: 2020-01-07
• arXiv.cs.MM Pub Date : 2019-03-26
Jun Yu; Xiao-Jun Wu

Unlike content-based image retrieval methods, cross-modal image retrieval methods uncover the rich semantic-level information of social images to further understand image contents. As multi-modal data depict a common object from multiple perspectives, many works focus on learning a unified subspace representation. Recently, hash representation has received much attention in the retrieval field. How to directly preserve the local manifold structure among objects in a common Hamming space has become an interesting problem. Most unsupervised hashing methods attempt to solve it by constructing a neighborhood graph for each modality separately. However, it is hard to decide the weight factor of each graph to obtain the optimal graph. To overcome this problem, we adopt concatenated features to represent the common object, since the information implied by different modalities is complementary. In our framework, Locally Linear Embedding and Locality Preserving Projection are introduced to reconstruct the manifold structure of the original space. In addition, the $\ell_{2,1}$-norm constraint is imposed on the projection matrices to explore the discriminative hashing functions. Extensive experiments are performed on three public datasets, and the experimental results show that our method outperforms several classic unsupervised hashing models.
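
As a brief aside on the $\ell_{2,1}$-norm mentioned above: it is the sum of the Euclidean norms of a matrix's rows. The sketch below computes it for a small matrix; the function name and the example matrix are hypothetical and only illustrate the definition.

```python
import math


def l21_norm(matrix):
    """Sum over rows of each row's Euclidean (l2) norm."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


# Hypothetical 3x2 projection matrix with one all-zero row.
W = [[3.0, 4.0], [0.0, 0.0], [5.0, 12.0]]
value = l21_norm(W)
```

All-zero rows contribute nothing to the sum, so penalizing this norm tends to push entire rows of the projection matrix toward zero, which acts as a form of feature selection.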

Updated: 2020-01-07
• arXiv.cs.MM Pub Date : 2019-10-11
Wenwu Zhu; Xin Wang; Hongzhi Li

With the rapid development of Internet and multimedia services in the past decade, a huge amount of user-generated and service provider-generated multimedia data has become available. These data are heterogeneous and multi-modal in nature, imposing great challenges for processing and analyzing them. Multi-modal data consist of a mixture of various types of data from different modalities such as text, images, video, and audio. In this article, we present a deep and comprehensive overview of multi-modal analysis in multimedia. We introduce two scientific research problems: data-driven correlational representation and knowledge-guided fusion for multimedia analysis. To address these two problems, we investigate them from the following aspects: 1) multi-modal correlational representation: multi-modal fusion of data across different modalities, and 2) multi-modal data and knowledge fusion: multi-modal fusion of data with domain knowledge. More specifically, on data-driven correlational representation, we highlight three important categories of methods, namely multi-modal deep representation, multi-modal transfer learning, and multi-modal hashing. On knowledge-guided fusion, we discuss approaches for fusing knowledge with data and four exemplar applications that require various kinds of domain knowledge, including multi-modal visual question answering, multi-modal video summarization, multi-modal visual pattern mining, and multi-modal recommendation. Finally, we bring forward our insights and future research directions.

Updated: 2020-01-07
• arXiv.cs.MM Pub Date : 2019-12-21
Huaizheng Zhang; Yong Luo; Qiming Ai; Yonggang Wen

Updated: 2020-01-06
• arXiv.cs.MM Pub Date : 2019-12-16
Maria Mannone; Federico Favali; Balandino Di Donato; Luca Turchet

Mathematics can help analyze the arts and inspire new artwork. Mathematics can also help make transformations from one artistic medium to another, considering exceptions and choices, as well as artists' individual and unique contributions. We propose a method based on diagrammatic thinking and quantum formalism. We exploit decompositions of complex forms into a set of simple shapes, discretization of complex images, and Dirac notation, imagining a world of "prototypes" that can be connected to obtain a fine or coarse-graining approximation of a given visual image. Visual prototypes are exchanged with auditory ones, and the information (position, size) characterizing visual prototypes is connected with the information (onset, duration, loudness, pitch range) characterizing auditory prototypes. The topic is contextualized within a philosophical debate (discreteness and comparison of apparently unrelated objects), it develops through mathematical formalism, and it leads to programming, to spark interdisciplinary thinking and ignite creativity within STEAM.

Updated: 2020-01-04
• arXiv.cs.MM Pub Date : 2020-01-01
Ruben Tolosana; Ruben Vera-Rodriguez; Julian Fierrez; Aythami Morales; Javier Ortega-Garcia

Free access to large-scale public databases, together with the fast progress of deep learning techniques, in particular Generative Adversarial Networks, has led to the generation of very realistic fake content, with corresponding implications for society in this era of fake news. This survey provides a thorough review of techniques for manipulating face images, including DeepFake methods, and of methods to detect such manipulations. In particular, four types of facial manipulation are reviewed: i) entire face synthesis, ii) face identity swap (DeepFakes), iii) facial attributes manipulation, and iv) facial expression manipulation. For each manipulation type, we provide details regarding manipulation techniques, existing public databases, and key benchmarks for technology evaluation of fake detection methods, including a summary of results from those evaluations. Among the different databases available and discussed in the survey, FaceForensics++ is, for example, one of the most widely used for detecting both face identity swap and facial expression manipulations, with reported manipulation detection accuracy in the literature in the range of 90-100%. In addition to the survey information, we also discuss trends and provide an outlook on the ongoing work in this field, e.g., the recently announced DeepFake Detection Challenge (DFDC).

Updated: 2020-01-04
• arXiv.cs.MM Pub Date : 2018-04-30
Yongfeng Zhang; Xu Chen

Explainable recommendation attempts to develop models that generate not only high-quality recommendations but also intuitive explanations. The explanations may either be post-hoc or directly come from an explainable model. Explainable recommendation tries to address the problem of why: by providing explanations to users or system designers, it helps humans to understand why certain items are recommended by the algorithm, where the human can either be users or system designers. Explainable recommendation helps to improve the transparency, persuasiveness, effectiveness, trustworthiness, and satisfaction of recommendation systems. It also facilitates system designers for better system debugging. In recent years, a large number of explainable recommendation approaches -- especially model-based methods -- have been proposed and applied in real-world systems. In this survey, we provide a comprehensive review for the explainable recommendation research. We highlight the position of explainable recommendation in recommender system research by categorizing recommendation problems into the 5W, i.e., what, when, who, where, and why. We then conduct a comprehensive survey of explainable recommendation on three perspectives: 1) We provide a chronological research timeline of explainable recommendation, including user study approaches in the early years and more recent model-based approaches. 2) We provide a two-dimensional taxonomy to classify existing explainable recommendation research: one dimension is the information source of the explanations, and the other dimension is the algorithmic mechanism to generate explainable recommendations. 3) We summarize how explainable recommendation applies to different recommendation tasks, such as product, social, and POI recommendations. We also devote a section to discuss the future directions to promote the explainable recommendation research.

Updated: 2020-01-04
• arXiv.cs.MM Pub Date : 2019-02-21
Ivo Trowitzsch; Jalil Taghia; Youssef Kashef; Klaus Obermayer

Computational auditory scene analysis has been gaining interest in recent years. Trailing behind the more mature field of speech recognition, it is particularly general sound event detection that is attracting increasing attention. Having enough suitable data available is crucial for training and testing reasonable models, yet until recently general sound event databases were hard to find. We release and present a database with 714 wav files containing isolated high quality sound events of 14 different types, plus 303 'general' wav files of anything but these 14 types. All sound events are strongly labeled with perceptual on- and offset times, paying attention to omitting in-between silences. The number of isolated sound events, the quality of annotations, and the particular general sound class distinguish NIGENS from other databases.

Updated: 2020-01-04
• arXiv.cs.MM Pub Date : 2019-12-22
Kartik Sharma; Ashutosh Aggarwal; Tanay Singhania; Deepak Gupta; Ashish Khanna

Steganography is the art of hiding data inside another ordinary file of a similar or different type. Hiding data has always been of significant importance to digital forensics. Previously, steganography has been combined with cryptography and with neural networks separately. In contrast, this research combines steganography, cryptography, and neural networks to hide an image inside a container image of the same or larger size. Although the cryptographic technique used is quite simple, it is effective when combined with deep neural nets. Other steganography techniques hide data efficiently but in a uniform pattern, which makes them less secure. This method addresses both challenges and makes data hiding secure and non-uniform.

Updated: 2020-01-04
Contents have been reproduced by permission of the publishers.
