• arXiv.cs.MM Pub Date : 2020-04-06

In unsecured network environments, ownership protection of digital content, such as images, is a growing concern. Different watermarking methods have been proposed to address the copyright protection of digital materials. Watermarking methods are challenged by the conflicting requirements of imperceptibility and robustness. While embedding a watermark with a high strength factor increases robustness
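The strength-factor trade-off above can be sketched with a generic additive spread-spectrum embedder and a correlation detector (a minimal illustration, not the specific method surveyed here; all names and constants are illustrative):

```python
import random

def embed(host, watermark, alpha):
    """Additive embedding: a higher strength factor alpha makes the mark
    more robust to detect, but also more visible (larger distortion)."""
    return [p + alpha * w for p, w in zip(host, watermark)]

def detect(pixels, watermark):
    """Linear correlation detector; a score near alpha indicates presence."""
    return sum(p * w for p, w in zip(pixels, watermark)) / len(pixels)

random.seed(0)
host = [random.randint(0, 255) for _ in range(4096)]
wm = [random.choice([-1, 1]) for _ in range(4096)]

for alpha in (1, 4, 16):
    marked = embed(host, wm, alpha)
    mse = sum((m - h) ** 2 for m, h in zip(marked, host)) / len(host)  # imperceptibility cost
    margin = detect(marked, wm) - detect(host, wm)                     # robustness margin
    print(f"alpha={alpha:2d}  MSE={mse:6.1f}  margin={margin:5.1f}")
```

With a ±1 watermark, the distortion grows as alpha squared while the detection margin grows only linearly in alpha, which is exactly the conflict the abstract describes.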

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-07
Jiguo Li; Xinfeng Zhang; Chuanmin Jia; Jizheng Xu; Li Zhang; Yue Wang; Siwei Ma; Wen Gao

Direct speech-to-image translation without text is an interesting and useful topic due to its potential applications in human-computer interaction, art creation, computer-aided design, etc., not to mention that many languages have no written form. However, as far as we know, it has not been well studied how to translate speech signals into images directly and how well they can be translated. In

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-06
Jarek Duda

Image compression with upsampling encodes information to successively increase image resolution, for example by encoding differences, as in FUIF and JPEG XL. It is useful for progressive decoding and can often improve the compression ratio. However, currently used solutions do not fully exploit context dependence for encoding such upscaling information. This article discusses simple, inexpensive, general
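The coarse-plus-differences idea behind progressive decoding can be illustrated with a 1-D pyramid: keep the coarsest sample and, per resolution level, the residuals against a nearest-neighbour prediction (a generic sketch, not the actual FUIF/JPEG XL coding):

```python
def upsample(coarse, n):
    """Nearest-neighbour prediction of a length-n signal from its 2x-decimated version."""
    return [coarse[min(i // 2, len(coarse) - 1)] for i in range(n)]

def encode(signal):
    """Pyramid encoding: coarsest sample plus per-level prediction residuals."""
    residuals = []
    cur = list(signal)
    while len(cur) > 1:
        coarse = cur[::2]
        pred = upsample(coarse, len(cur))
        residuals.append([a - b for a, b in zip(cur, pred)])
        cur = coarse
    return cur, residuals[::-1]  # coarsest level first, for progressive decoding

def decode(base, residuals):
    """Progressively refine: each level doubles resolution by adding residuals."""
    cur = list(base)
    for res in residuals:
        cur = [p + r for p, r in zip(upsample(cur, len(res)), res)]
    return cur

sig = [3, 1, 4, 1, 5, 9, 2, 6]
base, res = encode(sig)
print(decode(base, res))  # lossless round-trip
```

Context modelling, which the article targets, would then condition the entropy coding of each residual on already-decoded neighbours.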

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-03
Ping Hu; Fabian Caba Heilbron; Oliver Wang; Zhe Lin; Stan Sclaroff; Federico Perazzi

We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks. Leveraging the inherent temporal continuity in videos, we distribute these sub-networks over sequential frames. Therefore

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-06
Anyi Rao; Linning Xu; Yu Xiong; Guodong Xu; Qingqiu Huang; Bolei Zhou; Dahua Lin

Scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes serves as a critical step towards semantic understanding of movies. This is very challenging compared to the videos studied in conventional vision problems, e.g., action recognition, as scenes in movies usually contain

Updated: 2020-04-08
• arXiv.cs.MM Pub Date : 2020-04-03
Jan-Niklas Voigt-Antons; Eero Lehtonen; Andres Pinilla Palacios; Danish Ali; Tanja Kojić; Sebastian Möller

In recent years, 360$^{\circ}$ videos have become increasingly popular. For traditional media presentations, e.g., on a computer screen, a wide range of assessment methods are available. Different constructs, such as perceived quality or the induced emotional state of viewers, can be reliably assessed by subjective scales. Many of the subjective methods have only been validated using stimuli presented

Updated: 2020-04-06
• arXiv.cs.MM Pub Date : 2020-04-03
Tanja Kojić; Danish Ali; Robert Greinacher; Sebastian Möller; Jan-Niklas Voigt-Antons

Virtual Reality (VR) has an increasing impact on the market in many fields, from education and medicine to engineering and entertainment, through applications that replicate, or in the case of augmentation enhance, real-life scenarios. Aiming to present realistic environments, VR applications include the text that surrounds us every day. However, text can only add value to

Updated: 2020-04-06
• arXiv.cs.MM Pub Date : 2020-04-03
Robert Greinacher; Tanja Kojić; Luis Meier; Rudresha Gulaganjihalli Parameshappa; Sebastian Möller; Jan-Niklas Voigt-Antons

Combining interconnected wearables provides fascinating opportunities, such as augmenting exergaming with virtual coaches or giving feedback on the execution of sports activities and how to improve them. Breathing rhythm is a particularly interesting physiological dimension, since it is easy and unobtrusive to measure, and the data gained provide valuable insights regarding the correct execution of movements, especially

Updated: 2020-04-06
• arXiv.cs.MM Pub Date : 2019-12-06

This paper presents a reversible data hiding (RDH) scheme for encrypted images that employs basic notions of RDH in plain-image schemes, including histogram modification and prediction-error computation. In the proposed method, the original image may be encrypted with any desired encryption algorithm. The most significant bits (MSBs) of encrypted pixels are integrated to vacate room for embedding data bits. The integrated ones will
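The MSB-vacating step can be sketched as follows (a simplification: the full scheme also preserves the original MSBs, e.g. via prediction errors, so that the cover image remains recoverable):

```python
def embed_bit(pixel, bit):
    """Replace the most significant bit of an 8-bit encrypted pixel with one payload bit."""
    return (pixel & 0x7F) | (bit << 7)

def extract_bit(pixel):
    """Read the payload bit back from the vacated MSB position."""
    return pixel >> 7

encrypted = [0x3A, 0xC1, 0x7F, 0x80]   # toy ciphertext pixels
payload = [1, 0, 1, 1]
stego = [embed_bit(p, b) for p, b in zip(encrypted, payload)]
print([extract_bit(p) for p in stego])  # payload recovered from the stego pixels
```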

Updated: 2020-04-06
• arXiv.cs.MM Pub Date : 2020-04-02
Zhicheng Huang; Zhaoyang Zeng; Bei Liu; Dongmei Fu; Jianlong Fu

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision-and-language tasks.

Updated: 2020-04-03
• arXiv.cs.MM Pub Date : 2020-04-02
Alexander Schindler; Andrew Lindley; Anahid Jalali; Martin Boyer; Sergiu Gordea; Ross King

The forensic investigation of a terrorist attack poses a significant challenge to the investigative authorities, as often several thousand hours of video footage must be viewed. Large-scale Video Analytic Platforms (VAPs) assist law enforcement agencies (LEAs) in identifying suspects and securing evidence. Current platforms focus primarily on the integration of different computer vision methods and thus

Updated: 2020-04-03
• arXiv.cs.MM Pub Date : 2020-03-28
Tooba Aamir; Hai Dong; Athman Bouguettaya

The extensive use of social media platforms and the overwhelming amount of imagery data create unique opportunities for sensing, gathering, and sharing information about events. One potential application is to leverage crowdsourced social media images to create a tapestry scene for scene analysis of designated locations and time intervals. Existing attempts, however, ignore the temporal-semantic

Updated: 2020-04-01
• arXiv.cs.MM Pub Date : 2019-10-08
Gyordan Caminati; Sara Kiade; Gabriele D'Angelo; Stefano Ferretti; Vittorio Ghini

DTLS is a protocol that provides security guarantees to Internet communications. It can operate on top of both TCP and UDP transport protocols. Thus, it is particularly suited for peer-to-peer and distributed multimedia applications. The same holds if the endpoints are mobile devices. In this scenario, mechanisms are needed to surmount possible network disconnections, often arising due to the mobility

Updated: 2020-04-01
• arXiv.cs.MM Pub Date : 2020-03-28
Tobias Hossfeld; Poul E. Heegaard; Martin Varela; Lea Skorin-Kapov; Markus Fiedler

In the context of QoE management, network and service providers commonly rely on models that map system QoS conditions (e.g., system response time, packet loss, etc.) to estimated end-user QoE values. Observable QoS conditions in the system may be assumed to follow a certain distribution, meaning that different end users will experience different conditions. On the other hand, drawing from the results
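Such a mapping can be sketched with an exponential IQX-style QoS-to-QoE relationship (the constants below are illustrative, not from this paper), which also makes the distribution point concrete: mapping the mean QoS is not the same as averaging QoE over the user population.

```python
import math
import random

def qoe(loss_pct, a=3.0, b=0.5, c=1.0):
    """Exponential (IQX-style) mapping from packet loss (%) to a MOS-like
    QoE score in [c, a + c]; constants are illustrative."""
    return a * math.exp(-b * loss_pct) + c

random.seed(1)
loss_samples = [random.expovariate(1.0) for _ in range(10000)]  # per-user packet loss (%)

mean_of_qoe = sum(qoe(x) for x in loss_samples) / len(loss_samples)
qoe_of_mean = qoe(sum(loss_samples) / len(loss_samples))
# The mapping is convex, so by Jensen's inequality E[QoE(X)] >= QoE(E[X]):
print(round(mean_of_qoe, 2), ">", round(qoe_of_mean, 2))
```

This gap between the expected QoE and the QoE of the expected QoS is exactly why per-user QoS distributions matter for provider-side QoE estimates.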

Updated: 2020-03-31
• arXiv.cs.MM Pub Date : 2020-03-30
Shivam Agarwal; Siddarth Venkatraman

Steganography is the art of hiding a secret message inside a publicly visible carrier message. Ideally, it is done without modifying the carrier, and with minimal loss of information in the secret message. Recently, various deep learning based approaches to steganography have been applied to different message types. We propose a deep learning based technique to hide a source RGB image message inside
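For contrast with the learned approach described here, the classical non-learned baseline for hiding one image inside another simply stores the secret's high bits in the carrier's low bits (a generic LSB sketch; the trade-off parameter k is illustrative):

```python
def hide(carrier, secret, k=4):
    """Store the top-k bits of each secret byte in the low-k bits of the
    matching carrier byte: larger k means better secret fidelity but a
    more visibly modified carrier."""
    mask = (1 << k) - 1
    return [(c & ~mask & 0xFF) | (s >> (8 - k)) for c, s in zip(carrier, secret)]

def reveal(stego, k=4):
    """Recover the secret up to the dropped low (8-k) bits."""
    mask = (1 << k) - 1
    return [(p & mask) << (8 - k) for p in stego]

carrier = [200, 100, 55, 0]
secret = [240, 16, 129, 255]
stego = hide(carrier, secret)
print(reveal(stego))  # approximate secret; both losses are bounded by 2**(8-k)
```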

Updated: 2020-03-31
• arXiv.cs.MM Pub Date : 2019-10-13
Tobia Tesan; Pasquale Coscia; Lamberto Ballan

Images represent a commonly used form of visual communication among people. Nevertheless, image classification may be a challenging task when dealing with unclear or uncommon images that need more context to be correctly annotated. Metadata accompanying images on social media represent an ideal source of additional information for retrieving proper neighborhoods, easing the image annotation task. To this

Updated: 2020-03-31
• arXiv.cs.MM Pub Date : 2020-03-27
Alexander Schindler; Sergiu Gordea; Peter Knees

We present an approach to unsupervised audio representation learning. Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness. By applying Latent Semantic Indexing (LSI), we embed corresponding textual information into a latent vector space from which we derive track relatedness for online triplet selection. This LSI
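The selected triplets feed a standard margin-based triplet objective; a sketch of that loss on toy embeddings (the vectors and margin below are illustrative, not the paper's):

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the related (positive) track is closer to the anchor than
    the unrelated (negative) track by at least the margin."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0]))  # well separated: 0.0
print(triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0]))  # violated: 4.0
```

Online selection then prefers triplets with non-zero loss, which is where track-relatedness estimates from LSI come in.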

Updated: 2020-03-30
• arXiv.cs.MM Pub Date : 2020-03-08
Yurui Ming; Weiping Ding; Zehong Cao; Chin-Teng Lin

Technologies of the Internet of Things (IoT) enable digital content such as images to be acquired on a massive scale. However, considerations of privacy and legislation still demand intellectual content protection. In this paper, we propose a general deep neural network (DNN) based watermarking method to fulfill this goal. Instead of training a neural network for protecting

Updated: 2020-03-30
• arXiv.cs.MM Pub Date : 2020-03-24
Helard Martinez; Andrew Hines; Mylene C. Q. Farias

The development of audio-visual quality assessment models poses a number of challenges in order to obtain accurate predictions. One of these challenges is the modelling of the complex interaction that audio and visual stimuli have and how this interaction is interpreted by human users. The No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd) deals with this problem from a

Updated: 2020-03-26
• arXiv.cs.MM Pub Date : 2020-03-25
Babak Naderi; Tobias Hossfeld; Matthias Hirth; Florian Metzger; Sebastian Möller; Rafael Zequeira Jiménez

The subjective quality of transmitted speech is traditionally assessed in a controlled laboratory environment according to ITU-T Rec. P.800. In turn, with crowdsourcing, crowdworkers participate in a subjective online experiment using their own listening device, and in their own working environment. Despite such less controllable conditions, the increased use of crowdsourcing micro-task platforms for

Updated: 2020-03-26
• arXiv.cs.MM Pub Date : 2020-03-21
Tao Guo; Xikang Jiang; Bin Xiang; Lin Zhang

Omnidirectional applications are immersive and highly interactive, which can improve the efficiency of remote collaborative work among factory workers. The transmission of omnidirectional video (OV) is the most important step in implementing virtual remote collaboration. Compared with the ordinary video transmission, OV transmission requires more bandwidth, which is still a huge burden even under 5G

Updated: 2020-03-24
• arXiv.cs.MM Pub Date : 2020-03-23
Théo Taburet; Patrick Bas; Wadih Sawaya; Remi Cogranne

This short paper proposes to use the statistical analysis of the correlation between DCT coefficients to design a new synchronization strategy that can be used for cost-based steganographic schemes in the JPEG domain. First, an analysis is performed on the covariance matrix of DCT coefficients of neighboring blocks after a development similar to the one used to generate BossBase. This analysis exhibits

Updated: 2020-03-24
• arXiv.cs.MM Pub Date : 2020-03-23
Venkatesh S. Kadandale; Juan F. Montesinos; Gloria Haro; Emilia Gómez

A fairly straightforward approach to music source separation is to train independent models, wherein each model is dedicated to estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation

Updated: 2020-03-24
• arXiv.cs.MM Pub Date : 2020-03-23
Eric Müller-Budack; Jonas Theiner; Sebastian Diering; Maximilian Idahl; Ralph Ewerth

The World Wide Web has become a popular source for gathering information and news. Multimodal information, e.g., text enriched with photos, is typically used to convey news more effectively or to attract attention. Photo content can be decorative, can depict additional important information, or can even be misleading. Therefore, automatic approaches to quantify cross-modal

Updated: 2020-03-24
• arXiv.cs.MM Pub Date : 2020-03-18
Pantelis Maniotis; Nikolaos Thomos

360$^o$ video is an essential component of VR/AR/MR systems that provides immersive experience to the users. However, 360$^o$ video is associated with high bandwidth requirements. The required bandwidth can be reduced by exploiting the fact that users are interested in viewing only a part of the video scene and that users request viewports that overlap with each other. Motivated by the findings of
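The bandwidth saving from overlapping viewports can be illustrated with a simplified tiling: split the 360° yaw range into equal slices and request only the tiles a viewport touches (tile count, field of view, and user orientations below are illustrative):

```python
def requested_tiles(yaw_center, fov_deg, n_tiles=8):
    """Indices of equal-width yaw tiles overlapped by a horizontal viewport,
    handling wrap-around at 360 degrees."""
    lo = (yaw_center - fov_deg / 2) % 360
    hi = (yaw_center + fov_deg / 2) % 360
    spans = [(lo, hi)] if lo < hi else [(lo, 360.0), (0.0, hi)]
    width = 360.0 / n_tiles
    return sorted(i for i in range(n_tiles)
                  if any(i * width < s1 and s0 < (i + 1) * width for s0, s1 in spans))

users = [requested_tiles(yaw, 90) for yaw in (0, 20, 350)]
union = set().union(*users)
print(users, len(union))  # serving the union needs far fewer tiles than per-user unicast
```

Caching or multicasting the union of requested tiles, rather than each user's set separately, is what exploits the overlap the abstract mentions.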

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-19
Tho Nguyen Duc; Chanh Minh Tran; Phan Xuan Tan; Eiji Kamioka

In video streaming services, continuously predicting the user's quality of experience (QoE) plays a crucial role in delivering high-quality streaming content to the user. However, the complexity caused by the temporal dependencies in QoE data and the non-linear relationships among QoE influence factors has introduced challenges to continuous QoE prediction. To deal with this, existing studies have utilized

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-19
Chanh Minh Tran; Tho Nguyen Duc; Phan Xuan Tan; Eiji Kamioka

HTTP/2 video streaming has attracted a lot of attention in the development of multimedia technologies over the last few years. In HTTP/2, the server push mechanism allows the server to deliver more video segments to the client within a single request in order to deal with the request-explosion problem. As a result, recent research efforts have been focusing on utilizing such a feature to enhance the

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-02-26
Nitish Nag; Bindu Rajanna; Ramesh Jain

With the exponential growth in the use of social media to share live updates about life, taking pictures has become an unavoidable phenomenon. Individuals unknowingly create a unique knowledge base with these images. Food images in particular are of interest, as they contain a plethora of information. From the image metadata and using computer vision tools, we can extract distinct insights for

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-19
Yuan Gao; Robert Bregovic; Reinhard Koch; Atanas Gotchev

The Image-Based Rendering (IBR) approach using Shearlet Transform (ST) is one of the most effective methods for Densely-Sampled Light Field (DSLF) reconstruction. The ST-based DSLF reconstruction typically relies on an iterative thresholding algorithm for Epipolar-Plane Image (EPI) sparse regularization in shearlet domain, involving dozens of transformations between image domain and shearlet domain

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-19
Longteng Guo; Jing Liu; Xinxin Zhu; Peng Yao; Shichen Lu; Hanqing Lu

Self-attention (SA) network has shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization is previously only applied outside SA, we introduce a novel normalization method and
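A toy sketch of self-attention with normalization moved inside the attention computation (the precise reparameterization in NSA is the paper's contribution; this only illustrates the mechanics, with all shapes and inputs illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def normalize(v, eps=1e-6):
    """Zero-mean, unit-variance normalization of a single vector."""
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

def attention(Q, K, V):
    """Scaled dot-product attention; here queries and keys are normalized
    inside the attention block rather than outside it."""
    d = len(Q[0])
    out = []
    for q in Q:
        qn = normalize(q)
        scores = [sum(a * b for a, b in zip(qn, normalize(k))) / math.sqrt(d) for k in K]
        w = softmax(scores)  # weights sum to 1
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.0, 10.0], [10.0, 0.0]]
print(attention(Q, K, V))  # output is a convex combination of the V rows
```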

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2019-07-05
Jeong Choi; Jongpil Lee; Jiyoung Park; Juhan Nam

Audio-based music classification and tagging is typically based on categorical supervised learning with a fixed set of labels. This intrinsically cannot handle unseen labels such as newly added music genres or semantic words that users arbitrarily choose for music retrieval. Zero-shot learning can address this problem by leveraging an additional semantic space of labels where side information about

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2019-09-04
Jiahua Xu; Wei Zhou; Zhibo Chen; Suiyi Ling; Patrick Le Callet

Stereoscopic image quality assessment (SIQA) has encountered non-trivial challenges due to the fast proliferation of 3D content. In the past years, deep learning oriented SIQA methods have emerged and achieved spectacular performance compared to conventional algorithms, which rely only on extracting hand-crafted features. However, most existing deep SIQA evaluators are not specifically built

Updated: 2020-03-20
• arXiv.cs.MM Pub Date : 2020-03-17
Wei Hu; Qianjiang Hu; Zehua Wang; Xiang Gao

The prevalence of accessible depth sensing and 3D laser scanning techniques has enabled the convenient acquisition of 3D dynamic point clouds, which provide efficient representation of arbitrarily-shaped objects in motion. Nevertheless, dynamic point clouds are often perturbed by noise due to hardware, software or other causes. While a plethora of methods have been proposed for static point cloud denoising

Updated: 2020-03-19
• arXiv.cs.MM Pub Date : 2019-09-05
Xavier Bost (LIA); Serigne Gueye (LIA); Vincent Labatut (LIA); Martha Larson (DMIR); Georges Linarès (LIA); Damien Malinas (CNELIAS); Raphaël Roth (CNELIAS)

Today's popular TV series tend to develop continuous, complex plots spanning several seasons, but are often viewed in controlled and discontinuous conditions. Consequently, most viewers need to be re-immersed in the story before watching a new season. Although discussions with friends and family can help, we observe that most viewers make extensive use of summaries to re-engage with the plot. Automatic

Updated: 2020-03-19
• arXiv.cs.MM Pub Date : 2020-03-17
Md Amiruzzaman; Rizal Mohd Nor

In this paper, a new steganographic method is presented that provides minimum distortion in the stego image. The proposed encoding algorithm focuses on the DCT rounding error and optimizes it to reduce distortion in the stego image; the proposed algorithm produces less distortion than existing methods (e.g., the F5 algorithm). The proposed method is based on the DCT rounding error, which helps to
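The underlying intuition — a coefficient whose pre-quantization value fell near a half-integer can be re-rounded the other way almost for free — can be sketched as a cost map (a simplification for illustration, not the paper's actual encoding algorithm):

```python
def rerounding_costs(raw_coeffs):
    """Distortion cost of quantizing each raw DCT coefficient to its
    second-nearest integer instead of the nearest: near zero when the
    value sits close to x.5, i.e. the rounding decision was ambiguous."""
    return [round(1.0 - 2.0 * abs(c - round(c)), 6) for c in raw_coeffs]

coeffs = [3.02, -1.49, 0.97, 2.51, -0.06]
costs = rerounding_costs(coeffs)
# Cheapest embedding positions: the coefficients with the largest rounding error.
order = sorted(range(len(coeffs)), key=lambda i: costs[i])
print(costs, order)
```

An embedder would then flip the rounding direction of the cheapest coefficients to encode message bits, keeping total distortion low.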

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2020-03-17
Cunhang Fan; Jianhua Tao; Bin Liu; Jiangyan Yi; Zhengqi Wen; Xuefei Liu

In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. At first, a time-frequency domain speech separation method is applied as the pre-separation stage. The aim of pre-separation stage is to separate the mixture preliminarily. Although this stage can separate the mixture, it still contains the residual interference

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2020-03-17
Wei Quan; Yuxuan Pan; Bin Xiang; Lin Zhang

With the merit of containing full panoramic content in one camera, Virtual Reality (VR) and 360-degree videos have attracted more and more attention in the field of industrial cloud manufacturing and training. The Industrial Internet of Things (IoT), where many VR terminals need to be online at the same time, can hardly guarantee VR's bandwidth requirement. However, by making use of users' quality of

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2020-03-17
Siyu Huang; Haoyi Xiong; Tianyang Wang; Qingzhong Wang; Zeyu Chen; Jun Huan; Dejing Dou

Arbitrary image style transfer is a challenging task which aims to stylize a content image conditioned on an arbitrary style image. In this task the content-style feature transformation is a critical component for a proper fusion of features. Existing feature transformation algorithms often suffer from unstable learning, loss of content and style details, and non-natural stroke patterns. To mitigate

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2019-12-15
Zhengfang Duanmu (University of Waterloo, Canada); Wentao Liu (University of Waterloo, Canada); Zhuoran Li (University of Waterloo, Canada); Kede Ma (City University of Hong Kong, Hong Kong, China); Zhou Wang (University of Waterloo, Canada)

Rate-distortion (RD) theory is at the heart of lossy data compression. Here we aim to model the generalized RD (GRD) trade-off between the visual quality of a compressed video and its encoding profiles (e.g., bitrate and spatial resolution). We first define the theoretical functional space $\mathcal{W}$ of the GRD function by analyzing its mathematical properties. We show that $\mathcal{W}$ is a convex

Updated: 2020-03-18
• arXiv.cs.MM Pub Date : 2020-03-13
Maria Santamaria; Ebroul Izquierdo; Saverio Blasi; Marta Mrak

Rate control is essential to ensure efficient video delivery. Typical rate-control algorithms rely on bit allocation strategies to appropriately distribute bits among frames. As reference frames are essential for exploiting temporal redundancies, intra frames are usually assigned a larger portion of the available bits. In this paper, an accurate method to estimate the number of bits and the quality of intra

Updated: 2020-03-16
• arXiv.cs.MM Pub Date : 2020-03-11
Juan Cao; Peng Qi; Qiang Sheng; Tianyun Yang; Junbo Guo; Jintao Li

The increasing popularity of social media promotes the proliferation of fake news, which has caused significant negative societal effects. Therefore, fake news detection on social media has recently become an emerging research area of great concern. With the development of multimedia technology, fake news attempts to utilize multimedia content with images or videos to attract and mislead consumers

Updated: 2020-03-12
• arXiv.cs.MM Pub Date : 2019-10-30
Xing Wei; Chenyang Yang; Shengqian Han

Proactive tile-based video streaming can avoid motion-to-photon latency of wireless virtual reality (VR) by computing and delivering the predicted tiles to be requested before playback. All existing works either focus on the task of tile prediction or on the tasks of computing and communications, overlooking the fact that these successively executed tasks have to share the same duration to avoid the

Updated: 2020-03-12
• arXiv.cs.MM Pub Date : 2017-12-05
Zhiqian Chen; Chih-Wei Wu; Yen-Cheng Lu; Alexander Lerch; Chang-Tien Lu

FusionGAN is a novel genre fusion framework for music generation that integrates the strengths of generative adversarial networks and dual learning. In particular, the proposed method offers a dual learning extension that can effectively integrate the styles of the given domains. To efficiently quantify the difference among diverse domains and avoid the vanishing gradient issue, FusionGAN provides

Updated: 2020-03-12
• arXiv.cs.MM Pub Date : 2020-03-08
Dongxu Li; Xin Yu; Chenchen Xu; Lars Petersson; Hongdong Li

Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation. It requires models to recognize isolated sign words from videos. However, annotating WSLR data requires expert knowledge, thus limiting WSLR dataset acquisition. In contrast, there are abundant subtitled sign news videos on the internet. Since these videos have no word-level annotation and exhibit a

Updated: 2020-03-10
• arXiv.cs.MM Pub Date : 2020-03-09
Hao Wang; Doyen Sahoo; Chenghao Liu; Ke Shu; Palakorn Achananuparp; Ee-peng Lim; Steven C. H. Hoi

Cross-modal food retrieval is an important task to perform analysis of food-related information, such as food images and cooking recipes. The goal is to learn an embedding of images and recipes in a common feature space, so that precise matching can be realized. Compared with existing cross-modal retrieval approaches, two major challenges in this specific problem are: 1) the large intra-class variance

Updated: 2020-03-10
• arXiv.cs.MM Pub Date : 2020-03-09
Arun Balajee Vasudevan; Dengxin Dai; Luc Van Gool

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight

Updated: 2020-03-10
• arXiv.cs.MM Pub Date : 2019-05-03
Hao Wang; Doyen Sahoo; Chenghao Liu; Ee-peng Lim; Steven C. H. Hoi

Food computing is playing an increasingly important role in human daily life, and has found tremendous applications in guiding human behavior towards smart food consumption and healthy lifestyle. An important task under the food-computing umbrella is retrieval, which is particularly helpful for health related applications, where we are interested in retrieving important information about food (e.g

Updated: 2020-03-10
• arXiv.cs.MM Pub Date : 2020-03-06

Recently, soft video multicasting has gained a lot of attention, especially in broadcast and mobile scenarios where the bit rate supported by the channel may differ across receivers, and may vary quickly over time. Unlike the conventional designs that force the source to use a single bit rate according to the receiver with the worst channel quality, soft video delivery schemes transmit the video such
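One representative scheme of this family is SoftCast-style analog delivery, where transform coefficients are scaled in proportion to variance^(-1/4) under a transmit-power budget so that quality degrades gracefully with each receiver's channel SNR. A sketch of that power allocation (variances and budget are illustrative; the paper's own scheme may differ):

```python
import math

def power_scaling(variances, total_power=1.0):
    """SoftCast-style analog scaling: gain g_i proportional to var_i**(-0.25),
    normalized so the transmitted power sum(g_i^2 * var_i) meets the budget."""
    raw = [v ** -0.25 for v in variances]
    norm = math.sqrt(total_power / sum(g * g * v for g, v in zip(raw, variances)))
    return [norm * g for g in raw]

variances = [16.0, 4.0, 1.0]          # per-chunk coefficient variances
gains = power_scaling(variances, total_power=9.0)
used = sum(g * g * v for g, v in zip(gains, variances))
print([round(g, 3) for g in gains], round(used, 6))  # budget met exactly
```

High-variance chunks get smaller gains, which is what minimizes the expected reconstruction MSE under the budget.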

Updated: 2020-03-09
• arXiv.cs.MM Pub Date : 2020-03-06
Felix Sattler; Thomas Wiegand; Wojciech Samek

Due to their great performance and scalability properties, neural networks have become ubiquitous building blocks of many applications. With the rise of mobile and IoT, these models are now also being increasingly applied in distributed settings, where the owners of the data are separated by limited communication channels and privacy constraints. To address the challenges of these distributed environments

Updated: 2020-03-09
• arXiv.cs.MM Pub Date : 2020-03-05
Serhan Gül; Dimitri Podborski; Jangwoo Son; Gurdeep Singh Bhullar; Thomas Buchholz; Thomas Schierl; Cornelius Hellge

Volumetric video is an emerging technology for immersive representation of 3D spaces that captures objects from all directions using multiple cameras and creates a dynamic 3D model of the scene. However, rendering volumetric content requires high amounts of processing power and is still a very demanding task for today's mobile devices. To mitigate this, we propose a volumetric video streaming system

Updated: 2020-03-06
• arXiv.cs.MM Pub Date : 2020-03-04
Eduardo Pavez; Benjamin Girault; Antonio Ortega; Philip A. Chou

We introduce the Region Adaptive Graph Fourier Transform (RA-GFT) for compression of 3D point cloud attributes. We assume the points are organized by a family of nested partitions represented by a tree. The RA-GFT is a multiresolution transform, formed by combining spatially localized block transforms. At each resolution level, attributes are processed in clusters by a set of block transforms. Each
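For a block of just two points, such a spatially localized transform reduces to a weighted orthonormal butterfly (as in RAHT, the simplest related special case; values and weights below are illustrative). The weights carry how many original points each coefficient represents:

```python
import math

def two_point_transform(a, b, wa, wb):
    """Weighted orthonormal 2-point block transform: 'low' aggregates the two
    attribute values (weighted by subtree size), 'high' carries the detail."""
    s = math.sqrt(wa + wb)
    low = (math.sqrt(wa) * a + math.sqrt(wb) * b) / s
    high = (math.sqrt(wb) * a - math.sqrt(wa) * b) / s
    return low, high

# A cluster of 3 points already merged into attribute value 10.0 (weight 3)
# meets a single point with value 6.0 (weight 1):
low, high = two_point_transform(10.0, 6.0, wa=3, wb=1)
# Orthonormality preserves energy: low^2 + high^2 == a^2 + b^2
print(round(low * low + high * high, 6))
```

Larger RA-GFT blocks replace this butterfly with a graph Fourier transform of the block's points, applied resolution level by resolution level up the partition tree.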

Updated: 2020-03-05
• arXiv.cs.MM Pub Date : 2020-03-04
Federico Simonetta; Stavros Ntalampiras; Federico Avanzini

This paper describes an open-source Python framework for handling datasets for music processing tasks, built with the aim of improving the reproducibility of research projects in music computing and assessing the generalization abilities of machine learning models. The framework enables the automatic download and installation of several commonly used datasets for multimodal music processing. Specifically

Updated: 2020-03-05
• arXiv.cs.MM Pub Date : 2020-03-01
Yixin Wang; Xiaohong Guan; Youtian Du; Nan Nan

Music tone quality evaluation is generally performed by experts. It can be subjective, lacking consistency and fairness, and is time-consuming as well. In this paper, we present a new method for identifying clarinet reed quality by evaluating tone quality based on the harmonic structure and energy distribution. We first decouple the quality of the reed and the clarinet pipe based on the acoustic harmonics
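Extracting a harmonic energy distribution from a recorded tone can be sketched with a naive DFT probe at integer multiples of the fundamental (a generic sketch, not the paper's exact features; sampling rate, fundamental, and amplitudes are illustrative):

```python
import math

def harmonic_energies(signal, f0, sr, n_harmonics=3):
    """Energy at integer multiples of the fundamental f0, via a naive
    single-bin DFT probe per harmonic."""
    N = len(signal)
    energies = []
    for h in range(1, n_harmonics + 1):
        f = h * f0
        re = sum(x * math.cos(2 * math.pi * f * n / sr) for n, x in enumerate(signal))
        im = sum(x * math.sin(2 * math.pi * f * n / sr) for n, x in enumerate(signal))
        energies.append((re * re + im * im) / N)
    return energies

# Synthetic tone: fundamental at 200 Hz plus a weaker 2nd harmonic at 400 Hz.
sr, f0 = 8000, 200
tone = [math.sin(2 * math.pi * f0 * n / sr) + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / sr)
        for n in range(800)]
print([round(e, 1) for e in harmonic_energies(tone, f0, sr)])
```

Comparing how this energy profile differs between reeds (while holding the pipe fixed) is the kind of decoupled feature the abstract describes.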

Updated: 2020-03-03
• arXiv.cs.MM Pub Date : 2020-03-01
Prajwal K R; Rudrabha Mukhopadhyay; Jerin Philip; Abhishek Jha; Vinay Namboodiri; C. V. Jawahar

In light of the recent breakthroughs in automatic machine translation systems, we propose a novel approach that we term as "Face-to-Face Translation". As today's digital communication becomes increasingly visual, we argue that there is a need for systems that can automatically translate a video of a person speaking in language A into a target language B with realistic lip synchronization. In this work

Updated: 2020-03-03
• arXiv.cs.MM Pub Date : 2020-03-01
Bo Fu; Liyan Wang; Yuechu Wu; Yufeng Wu; Shilin Fu; Yonggong Ren

Single image super-resolution (SISR) is an image processing task which obtains a high-resolution (HR) image from a low-resolution (LR) image. Recently, owing to their capability in feature extraction, a series of deep learning methods have brought crucial improvements to SISR. However, we observe that no matter how deep the networks are designed, they usually do not have good generalization ability

Updated: 2020-03-03
• arXiv.cs.MM Pub Date : 2020-02-12
Sicheng Zhao; Yunsheng Ma; Yang Gu; Jufeng Yang; Tengfei Xing; Pengfei Xu; Runbo Hu; Hua Chai; Kurt Keutzer

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio

Updated: 2020-03-03
• arXiv.cs.MM Pub Date : 2020-02-28
Licheng Xiao; Hairong Wang; Nam Ling

In this paper, we build autoencoder-based pipelines for extreme end-to-end image compression based on Ballé's approach, which is the state-of-the-art open-source implementation of image compression using deep learning. We deepened the network by adding one more hidden layer before each strided convolutional layer, with exactly the same number of down-samplings and up-samplings. Our approach outperformed

Updated: 2020-03-02
• arXiv.cs.MM Pub Date : 2020-02-26
Qingyuan Zheng; Zhuoru Li; Adam Bargteil

We present a fully automatic method to generate detailed and accurate artistic shadows from pairs of line drawing sketches and lighting directions. We also contribute a new dataset of one thousand examples of pairs of line drawings and shadows that are tagged with lighting directions. Remarkably, the generated shadows quickly communicate the underlying 3D structure of the sketched scene. Consequently

Updated: 2020-02-28
• arXiv.cs.MM Pub Date : 2020-02-27
Zhengzhong Tu; Jessie Lin; Yilin Wang; Balu Adsumilli; Alan C. Bovik

Banding artifact, or false contouring, is a common video compression impairment that tends to appear on large flat regions in encoded videos. These staircase-shaped color bands can be very noticeable in high-definition videos. Here we study this artifact, and propose a new distortion-specific no-reference video quality model for predicting banding artifacts, called the Blind BANding Detector (BBAND

Updated: 2020-02-28
• arXiv.cs.MM Pub Date : 2020-02-27
Joong Gon Yim; Yilin Wang; Neil Birkbeck; Balu Adsumilli

Due to the scale of social video sharing, User Generated Content (UGC) is getting more attention from academia and industry. To facilitate compression-related research on UGC, YouTube has released a large-scale dataset. The initial dataset only provided videos, limiting its use in quality assessment. We used a crowd-sourcing platform to collect subjective quality scores for this dataset. We analyzed

Updated: 2020-02-28
Contents have been reproduced by permission of the publishers.
