-
Generative Adversarial Networks with Learnable Auxiliary Module for Image Synthesis ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-17 Yan Gan, Chenxue Yang, Mao Ye, Renjie Huang, Deqiang Ouyang
Training generative adversarial networks (GANs) for noise-to-image synthesis is a challenging task, primarily due to the instability of GANs’ training process. One of the key issues is the generator’s sensitivity to input data, which can cause sudden fluctuations in the generator’s loss value with certain inputs. This sensitivity suggests an inadequate ability to resist disturbances in the generator
-
Multi-Agent DRL-based Multipath Scheduling for Video Streaming with QUIC ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-15 Xueqiang Han, Biao Han, Jinrong Li, Congxi Song
The popularization of video streaming brings challenges in satisfying diverse Quality of Service (QoS) requirements. The multipath extension of the Quick UDP Internet Connection (QUIC) protocol, also called MPQUIC, has the potential to improve video streaming performance with multiple simultaneously transmitting paths. The multipath scheduler of MPQUIC determines how to distribute the packets onto
-
Realizing Efficient On-Device Language-based Image Retrieval ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-15 Zhiming Hu, Mete Kemertas, Lan Xiao, Caleb Phillips, Iqbal Mohomed, Afsaneh Fazly
Advances in deep learning have enabled accurate language-based search and retrieval, e.g., over user photos, in the cloud. Many users prefer to store their photos in the home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art cross-modal retrieval models achieve high accuracy through learning entangled representations
-
Invisible Adversarial Watermarking: A Novel Security Mechanism for Enhancing Copyright Protection ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-14 Jinwei Wang, Haihua Wang, Jiawei Zhang, Hao Wu, Xiangyang Luo, Bin Ma
Invisible watermarking can be used as an important tool for copyright certification in the Metaverse. However, with the advent of deep learning, Deep Neural Networks (DNNs) have posed new threats to this technique. For example, artificially trained DNNs can perform unauthorized content analysis and achieve illegal access to protected images. Furthermore, some specially crafted DNNs may even erase invisible
-
Audio-Visual Contrastive Pre-train for Face Forgery Detection ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-13 Hanqing Zhao, Wenbo Zhou, Dongdong Chen, Weiming Zhang, Ying Guo, Zhen Cheng, Pengfei Yan, Nenghai Yu
The highly realistic avatar in the metaverse may lead to severe leakage of facial privacy. Malicious users can more easily obtain the 3D structure of faces, thus using Deepfake technology to create counterfeit videos with higher realism. To automatically discern facial videos forged with the advancing generation techniques, deepfake detectors need to achieve stronger generalization abilities. Inspired
-
Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-12 Xiangming Gu, Longshen Ou, Wei Zeng, Jianan Zhang, Nicholas Wong, Ye Wang
Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite these two tasks having significant potential for practical application, they are still nascent. This is because the transcription of lyrics and note events solely from singing audio is
-
Suitable and Style-consistent Multi-texture Recommendation for Cartoon Illustrations ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-12 Huisi Wu, Zhaoze Wang, Yifan Li, Xueting Liu, Tong-Yee Lee
Texture plays an important role in cartoon illustrations to display object materials and enrich visual experiences. Unfortunately, manually designing and drawing an appropriate texture is not easy even for proficient artists, let alone novice or amateur people. While there exist tons of textures on the Internet, it is not easy to pick an appropriate one using traditional text-based search engines.
-
Mastering Deepfake Detection: A Cutting-Edge Approach to Distinguish GAN and Diffusion-Model Images ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-09 Luca Guarnera, Oliver Giudice, Sebastiano Battiato
Detecting and recognizing deepfakes is a pressing issue in the digital age. In this study, we first collected a dataset of pristine images and fake ones properly generated by nine different Generative Adversarial Network (GAN) architectures and four Diffusion Models (DM). The dataset contained a total of 83,000 images, with equal distribution between the real and deepfake data. Then, to address different
-
Backdoor Two-Stream Video Models on Federated Learning ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-07 Jing Zhao, Hongwei Yang, Hui He, Jie Peng, Weizhe Zhang, Jiangqun Ni, Arun Kumar Sangaiah, Aniello Castiglione
Video models on federated learning (FL) enable continual learning of the involved models for video tasks on end-user devices while protecting the privacy of end-user data. As a result, security issues in FL, e.g., backdoor attacks on FL and their defenses, have increasingly become domains of extensive research in recent years. The backdoor attacks on FL are a class of poisoning attacks
-
Review and Analysis of RGBT Single Object Tracking Methods: A Fusion Perspective ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-07 ZhiHao Zhang, Jun Wang, Zhuli Zang, Lei Jin, Shengjie Li, Hao Wu, Jian Zhao, Zhang Bo
Visual tracking is a fundamental task in computer vision with significant practical applications in various domains, including surveillance, security, robotics, and human-computer interaction. However, it may face limitations in visible light data, such as low-light environments, occlusion, and camouflage, which can significantly reduce its accuracy. To cope with these challenges, researchers have
-
A Bitcoin-based Secure Outsourcing Scheme for Optimization Problem in Multimedia Internet of Things ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Wenyuan Yang, Shaocong Wu, Jianwei Fei, Xianwang Zeng, Yuemin Ding, Zhihua Xia
With the development of the Internet of Things (IoT) and cloud computing, various multimedia data such as audio, video, and images have experienced explosive growth, ushering in the era of big data. Large-scale computing tasks in the Multimedia Internet of Things (M-IoT), such as mathematical optimization problems, have begun to be outsourced from IoT devices with limited computing power to cloud servers
-
Privacy and Integrity Protection for IoT Multimodal Data Using Machine Learning and Blockchain ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Qingzhi Liu, Yuchen Huang, Chenglu Jin, Xiaohan Zhou, Ying Mao, Cagatay Catal, Long Cheng
With the wide application of Internet of Things (IoT) technology, large volumes of multimodal data are collected and analyzed for various diagnoses, analyses, and predictions to help in decision-making and management. However, the research on protecting data integrity and privacy is quite limited, while the lack of proper protection for sensitive data may have significant impacts on the benefits and
-
Detecting Post Editing of Multimedia Images using Transfer Learning and Fine Tuning ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Simon Jonker, Malthe Jelstrup, Weizhi Meng, Brooke Lampe
In the domain of general image forgery detection, a myriad of different classification solutions have been developed to distinguish a “tampered” image from a “pristine” image. In this work, we aim to develop a new method to tackle the problem of binary image forgery detection. Our approach builds upon the extensive training that state-of-the-art image classification models have undergone on regular
-
IoT-enabled Biometric Security: Enhancing Smart Car Safety with Depth-based Head Pose Estimation ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Carmen Bisogni, Lucia Cascone, Michele Nappi, Chiara Pero
Advanced Driver Assistance Systems (ADAS) are experiencing higher levels of automation, facilitated by the synergy among various sensors integrated within vehicles, thereby forming an Internet of Things (IoT) framework. Among these sensors, cameras have emerged as valuable tools for detecting driver fatigue and distraction. This study introduces HYDE-F, a Head Pose Estimation (HPE) system exclusively
-
Trustworthy and Efficient Digital Twins in Post-Quantum Era with Hybrid Hardware-Assisted Signatures ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Saif E. Nouma, Attila A. Yavuz
Digital Twins (DT) virtually model cyber-physical objects via sensory inputs by simulating or monitoring their behavior. Therefore, DTs usually harbor vast quantities of Internet of Things (IoT) components (e.g., sensors) that gather, process, and offload sensitive information (e.g., healthcare) to the cloud. It is imperative to ensure the trustworthiness of such sensitive information with long-term
-
Vocoder Detection of Spoofing Speech Based on GAN Fingerprints and Domain Generalization ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Fan Li, Yanxiang Chen, Haiyang Liu, Zuxing Zhao, Yuanzhi Yao, Xin Liao
As an important part of the text-to-speech (TTS) system, vocoders convert acoustic features into speech waveforms. The difference in vocoders is key to producing different types of forged speech in the TTS system. With the rapid development of generative adversarial networks (GANs), an increasing number of GAN vocoders have been proposed. Detectors often encounter vocoders of unknown types, which leads
-
Incomplete Multiview Clustering via Semidiscrete Optimal Transport for Multimedia Data Mining in IoT ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Jing Gao, Peng Li, Asif Ali Laghari, Gautam Srivastava, Thippa Reddy Gadekallu, Sidra Abbas, Jianing Zhang
With the wide deployment of the Internet of Things (IoT), large volumes of incomplete multiview data that violates data integrity is generated by various applications, which inevitably produces negative impacts on the quality of service of IoT systems. Incomplete multiview clustering (IMC), as an essential technique of data processing, has the potential for mining patterns of incomplete IoT data. However
-
Pedestrian Attribute Recognition via Spatio-temporal Relationship Learning for Visual Surveillance ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Zhenyu Liu, Da Li, Xinyu Zhang, Zhang Zhang, Peng Zhang, Caifeng Shan, Jungong Han
Pedestrian attribute recognition (PAR) aims at predicting the visual attributes of a pedestrian image. PAR has been used as soft biometrics for visual surveillance and IoT security. Most of the current PAR methods are developed based on discrete images. However, it is challenging for the image-based method to handle the occlusion and action-related attributes in real-world applications. Recently, video-based
-
Diverse Visual Question Generation Based on Multiple Objects Selection ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Wenhao Fang, Jiayuan Xie, Hongfei Liu, Jiali Chen, Yi Cai
The visual question generation task aims at generating high-quality questions about a given image. To make this task applicable to various scenarios, e.g., the growing demand for exams, it is important to generate diverse questions. The existing methods for this task control diverse question generation based on different question types, e.g., “what” and “when.” Although different question types lead to
-
A Reconfigurable Framework for Neural Network Based Video In-Loop Filtering ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Yichi Zhang, Dandan Ding, Zhan Ma, Zhu Li
This article proposes a reconfigurable framework for neural network based video in-loop filtering to guide large-scale models for content-aware processing. Specifically, the backbone neural model is decomposed into several convolutional groups and the encoder systematically traverses all candidate configurations combined by these groups to find the best one. The selected configuration index is then
-
Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Ronglai Zuo, Brian Mak
Deep-learning-based continuous sign language recognition (CSLR) models typically consist of a visual module, a sequential module, and an alignment module. However, the effectiveness of training such CSLR backbones is hindered by limited training samples, rendering the use of a single connectionist temporal classification loss insufficient. To address this limitation, we propose three auxiliary tasks
-
Human Selective Matting ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Qinglin Liu, Quanling Meng, Xiaoqian Lv, Zonglin Li, Wei Yu, Shengping Zhang
Existing human matting methods are incapable of accurately estimating the alpha mattes of arbitrarily selected humans from a group photo. An alternative solution is to apply them to the corresponding cropped image patches. However, this option obtains an inaccurate alpha estimation due to the interference of the body parts of the neighboring humans. In addition, these methods are only trained on finely
-
Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Shenshen Li, Xing Xu, Xun Jiang, Fumin Shen, Zhe Sun, Andrzej Cichocki
In this article, we study the challenging cross-modal image retrieval task, Composed Query-Based Image Retrieval (CQBIR), in which the query is not a single text query but a composed query, i.e., a reference image, and a modification text. Compared with the conventional cross-modal image-text retrieval task, the CQBIR is more challenging as it requires properly preserving and modifying the specific
-
MCFNet: Multi-Attentional Class Feature Augmentation Network for Real-Time Scene Parsing ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Xizhong Wang, Rui Liu, Xin Yang, Qiang Zhang, Dongsheng Zhou
For real-time scene parsing tasks, capturing multi-scale semantic features and performing effective feature fusion is crucial. However, many existing solutions ignore stripe-shaped objects such as poles and traffic lights, and are so computationally expensive that they cannot meet the high real-time requirements. This article presents a novel model, the Multi-Attention Class Feature Augmentation Network (MCFNet)
-
Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Weichao Zhao, Hezhen Hu, Wengang Zhou, Li Li, Houqiang Li
Reconstructing interacting hands from monocular RGB data is a challenging task, as it involves many interfering factors, e.g., self- and mutual occlusion and similar textures. Previous works only leverage information from a single RGB image without modeling their physically plausible relation, which leads to inferior reconstruction results. In this work, we are dedicated to explicitly exploiting spatial-temporal
-
Immersive Multimedia Service Caching in Edge Cloud with Renewable Energy ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 M. Shamim Hossain, Yixue Hao, Long Hu, Jia Liu, Gang Wei, Chen Min
Immersive service caching, based on the intelligent edge cloud, can meet delay-sensitive service requirements. Although numerous service caching solutions for edge clouds have been designed, they have not been well explored. Moreover, to the best of our knowledge, there is no work to consider the immersive service caching scheme under the supply of renewable energy. In this article, we investigate
-
Perceptual Quality Assessment of Omnidirectional Images: A Benchmark and Computational Model ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Xuelin Liu, Jiebin Yan, Liping Huang, Yuming Fang, Zheng Wan, Yang Liu
Compared with traditional 2D images, omnidirectional images (also referred to as 360° images) have more complicated perceptual characteristics due to the particularities of imaging and display. How humans perceive omnidirectional images in an immersive environment and form the immersive quality of experience are important problems. Thus, it is crucial to measure the quality of omnidirectional images
-
Multi-Content Interaction Network for Few-Shot Segmentation ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Hao Chen, Yunlong Yu, Yonghan Dong, Zheming Lu, Yingming Li, Zhongfei Zhang
Few-Shot Segmentation (FSS) poses significant challenges due to limited support images and large intra-class appearance discrepancies. Most existing approaches focus on aligning the support-query correlations from the same layer of the frozen backbone while neglecting the bias between different tasks and different layers. In this article, we propose a Multi-Content Interaction Network (MCINet) to remedy
-
Iterative Temporal-spatial Transformer-based Cardiac T1 Mapping MRI Reconstruction ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-08 Jun Lyu, Guangming Wang, M. Shamim Hossain
The precise reconstruction of accelerated magnetic resonance imaging (MRI) brings about notable advantages, such as enhanced diagnostic precision and decreased examination costs. In contrast, traditional cardiac MRI necessitates repetitive acquisitions across multiple heartbeats, resulting in prolonged acquisition times. Significant strides have been made in accelerating MRI through deep learning-based
-
A Quality of Experience and Visual Attention Evaluation for 360° videos with non-spatial and spatial audio ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-06 Amit Hirway, Yuansong Qiao, Niall Murray
This article presents the results of an empirical study that aimed to investigate the influence of various types of audio (spatial and non-spatial) on the user quality of experience (QoE) of and visual attention in 360° videos. The study compared the head pose, eye gaze, pupil dilations, heart rate and subjective responses of 73 users who watched ten 360° videos with different sound configurations
-
Delay threshold for social interaction in volumetric eXtended Reality communication ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-06 Carlos Cortés, Irene Viola, Jesús Gutiérrez, Jack Jansen, Shishir Subramanyam, Evangelos Alexiou, Pablo Pérez, Narciso García, Pablo César
Immersive technologies like eXtended Reality (XR) are the next step in videoconferencing. In this context, understanding the effect of delay on communication is crucial. This paper presents the first study on the impact of delay on collaborative tasks using a realistic Social XR system. Specifically, we design an experiment and evaluate the impact of end-to-end delays of 300, 600, 900, 1200, and 1500
-
Enhanced Video Super-Resolution Network Towards Compressed Data ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-06 Feng Li, Yixuan Wu, Anqi Li, Huihui Bai, Runmin Cong, Yao Zhao
Video super-resolution (VSR) algorithms aim at recovering a temporally consistent high-resolution (HR) video from its corresponding low-resolution (LR) video sequence. Due to the limited bandwidth during video transmission, most available videos on the internet are compressed. Nevertheless, few existing algorithms consider the compression factor in practical applications. In this paper, we propose
-
GreenABR+: Generalized Energy-Aware Adaptive Bitrate Streaming ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-05 Bekir Oguzhan Turkkan, Ting Dai, Adithya Raman, Tevfik Kosar, Changyou Chen, Muhammed Fatih Bulut, Jaroslaw Zola, Daby Sow
Adaptive bitrate (ABR) algorithms play a critical role in video streaming by making optimal bitrate decisions in dynamically changing network conditions to provide a high quality of experience (QoE) for users. However, most existing ABRs suffer from limitations such as predefined rules and incorrect assumptions about streaming parameters. They often prioritize higher bitrates and ignore the corresponding
-
WaRENet: A Novel Urban Waterlogging Risk Evaluation Network ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-05 Xiaoya Yu, Kejun Wu, You Yang, Qiong Liu
In this paper, we propose a novel urban waterlogging risk evaluation network (WaRENet) to evaluate the risk of waterlogging. WaRENet determines whether an urban image involves waterlogging with a classification module, and estimates the waterlogging risk level with a multi-class reference object detection (MCROD) module. Firstly, in the waterlogging scene classification, ResNet combined with Se-block
-
Learning Nighttime Semantic Segmentation the Hard Way ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-04 Wenxi Liu, Jiaxin Cai, Qi Li, Chenyang Liao, Jingjing Cao, Shengfeng He, Yuanlong Yu
Nighttime semantic segmentation is an important but challenging research problem for autonomous driving. The major challenges lie in small objects or regions that fall in under-/over-exposed areas or suffer from motion blur caused by cameras deployed on moving vehicles. To resolve this, we propose a novel hard-class-aware module that bridges the main network for full-class segmentation and the hard-class
-
Continuous Image Outpainting with Neural ODE ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-02 Penglei Gao, Xi Yang, Rui Zhang, Kaizhu Huang
Generalised image outpainting is an important and active research topic in computer vision, which aims to extend appealing content all-side around a given image. Existing state-of-the-art outpainting methods often rely on discrete extrapolation to extend the feature map in the bottleneck. They thus suffer from content unsmoothness, especially in circumstances where the outlines of objects in the extrapolated
-
Learning scene representations for human-assistive displays using self-attention networks ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-02 Jaime Ruiz-Serra, Jack White, Stephen Petrie, Tatiana Kameneva, Chris McCarthy
Video-see-through (VST) augmented reality (AR) is widely used to present novel augmentative visual experiences by processing video frames for viewers. Among VST AR systems, assistive vision displays aim to compensate for low vision or blindness, presenting enhanced visual information to support activities of daily living for the vision impaired/deprived. Despite progress, current assistive displays
-
Robust Image Hashing via CP Decomposition and DCT for Copy Detection ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-03-01 Xiaoping Liang, Wanting Liu, Xianquan Zhang, Zhenjun Tang
Copy detection is a key task of image copyright protection. This paper proposes a robust image hashing algorithm by CP decomposition and discrete cosine transform (DCT) for copy detection. The first contribution is the third-order tensor construction with low-frequency coefficients in the DCT domain. Since the low-frequency DCT coefficients contain most of the image energy, they can reflect the basic
-
Exploring the Facets of the Multiplayer VR Gaming Experience ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-29 Sara Vlahovic, Ivan Slivar, Matko Silic, Lea Skorin-Kapov, Mirko Suznjevic
While the topic of investigating user experience with immersive services, such as Social Virtual Reality (VR), is starting to gain traction in the research community, the unique case of multiplayer VR games requires a more specific approach. Attempts to investigate user experiences with this complex, multidimensional service are hindered by the absence of specific standards and guidelines going beyond
-
Perceptual Quality-Oriented Rate Allocation via Distillation from End-to-End Image Compression ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-29 Runyu Yang, Dong Liu, Siwei Ma, Feng Wu, Wen Gao
Mainstream image/video coding standards, exemplified by the state-of-the-art H.266/VVC, AVS3, and AV1, follow the block-based hybrid coding framework. Due to the block-based framework, encoders designed for these standards are easily optimized for peak signal-to-noise ratio (PSNR) but have difficulties optimizing for the metrics more aligned to perceptual quality, e.g. multi-scale structural similarity
-
Joint Distortion Restoration and Quality Feature Learning for No-Reference Image Quality Assessment ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-28 Jifan Yang, Zhongyuan Wang, Baojin Huang, Jiaxin Ai, Yuhong Yang, Zixiang Xiong
No-reference image quality assessment (NR-IQA) methods inspired by the free energy principle, improve the accuracy of image quality prediction by simulating the human brain’s repair process for distorted images. However, existing methods use separate optimisation schemes for distortion restoration and quality prediction, which undermines the accurate mapping of feature representations to quality scores
-
An Optimal Edge-weighted Graph Semantic Correlation Framework for Multi-view Feature Representation Learning ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-27 Lei Gao, Zheng Guo, Ling Guan
In this paper, we present an optimal edge-weighted graph semantic correlation (EWGSC) framework for multi-view feature representation learning. Different from most existing multi-view representation methods, local structural information and global correlation in multi-view feature spaces are exploited jointly in the EWGSC framework, leading to a new and high quality multi-view feature representation
-
Instance-level Adversarial Source-free Domain Adaptive Person Re-identification ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-27 Xiaofeng Qu, Li Liu, Lei Zhu, Liqiang Nie, Huaxiang Zhang
Domain adaptation (DA) for person re-identification (ReID) has attained considerable progress by transferring knowledge from a source domain with labels to a target domain without labels. Nonetheless, most existing methods require access to source data, which raises privacy concerns. Source-free DA has recently emerged as a response to these privacy challenges, yet its direct application to open-set
-
Deep Network for Image Compressed Sensing Coding Using Local Structural Sampling ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-26 Wenxue Cui, Xingtao Wang, Xiaopeng Fan, Shaohui Liu, Xinwei Gao, Debin Zhao
Existing image compressed sensing (CS) coding frameworks usually solve an inverse problem based on measurement coding and optimization-based image reconstruction, which still face the following two challenges: 1) The widely used random sampling matrix, such as the Gaussian Random Matrix (GRM), usually leads to low measurement coding efficiency. 2) The optimization-based reconstruction methods generally
-
Scene Graph Lossless Compression with Adaptive Prediction for Objects and Relations ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-26 Weiyao Lin, Yufeng Zhang, Wenrui Dai, Huabin Liu, John See, Hongkai Xiong
The scene graph is a novel data structure describing objects and their pairwise relationship within image scenes. As the size of scene graphs in vision and multimedia applications increases, the need for lossless storage and transmission of such data becomes more critical. However, the compression of scene graphs is less studied because of the complicated data structures involved and complex distributions
-
Action Segmentation Through Self-Supervised Video Features and Positional-Encoded Embeddings ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-24 Guilherme de A. P. Marques, José Matheus C. Boaro, Antonio José G. Busson, Alan L. V. Guedes, Julio Cesar Duarte, Sérgio Colcher
Action segmentation consists of temporally segmenting a video and labeling each segmented interval with a specific action label. In this work, we propose a novel action segmentation method that requires no initial video analysis and no annotated data. Our proposal involves extracting features from videos using several pre-trained deep-learning models, including spatiotemporal and self-supervised methods
-
Tensorial Evolutionary Optimization for Natural Image Matting ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-23 Si-chao Lei, Yue-Jiao Gong, Xiao-Lin Xiao, Yi-Cong Zhou, Jun Zhang
Natural image matting has garnered increasing attention in various computer vision applications. The matting problem aims to find the optimal foreground/background (F/B) color pair for each unknown pixel, and thus obtain an alpha matte indicating the opacity of the foreground object. This problem is typically modeled as a large-scale pixel pair combinatorial optimization (PPCO) problem. Heuristic optimization
-
VertexShuffle-Based Spherical Super-Resolution for 360-Degree Videos ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-23 Na Li, Yao Liu
360-degree video is an emerging form of media that encodes information about all directions surrounding a camera, offering an immersive experience to the users. Unlike traditional 2D videos, visual information in 360-degree videos can be naturally represented as pixels on a sphere. Inspired by state-of-the-art deep-learning-based 2D image super-resolution models and spherical CNNs, in this paper, we
-
Recipe Generation from Unsegmented Cooking Videos ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-21 Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku, Hirotaka Kameko, Shinsuke Mori
This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract key events in completing the dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims at detecting events thoroughly and generating sentences for them. However, unlike DVC, in recipe generation, recipe story awareness is crucial
-
Optimizing Camera Motion with MCTS and Target Motion Modeling in Multi-Target Active Object Tracking ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-21 Zheng Chen, Jian Zhao, Mingyu Yang, Wengang Zhou, Houqiang Li
In this work, we are dedicated to multi-target active object tracking (AOT), where the goal is to achieve continuous tracking of targets through real-time control of the camera. This form of active camera control can be applied to unmanned aerial vehicles (UAV), intelligent robots, and sports events. Our work is conducted in an environment featuring multiple cameras and targets, where our goal is to maximize
-
Meetor: A Human-Centered Automatic Video Editing System for Meeting Recordings ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-19 Haihan Duan, Junhua Liao, Lehao Lin, Abdulmotaleb El Saddik, Wei Cai
Widely adopted digital cameras and smartphones have generated a large number of videos, which have brought a tremendous workload to video editors. Recently, a variety of automatic/semi-automatic video editing methods have been proposed to tackle these issues in some specific areas. However, for the production of meeting recordings, the existing studies highly depend on extra equipment in the conference
-
MS-GDA: Improving Heterogeneous Recipe Representation via Multinomial Sampling Graph Data Augmentation ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-20 Liangzhe Chen, Wei Li, Xiaohui Cui, Zhenyu Wang, Stefano Berretti, Shaohua Wan
We study the problem of classifying cooking styles based on the recipe. The difficulty is that the same ingredients and seasonings, with very similar instructions, can result in different flavors under different cooking styles. Existing methods have a limitation: they mainly focus on homogeneous data (e.g., instruction or image), ignoring heterogeneous data (e.g., flavor compound or ingredient)
-
Multimodal Visual-Semantic Representations Learning for Scene Text Recognition ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-19 Xinjian Gao, Ye Pang, Yuyu Liu, MaoKun Han, Jun Yu, Wei Wang, Yuanxu Chen
Scene Text Recognition (STR), a critical step in OCR systems, has attracted much attention in computer vision. Recent research on modeling textual semantics with a Language Model (LM) has made remarkable progress. However, the LM only optimizes the joint probability of the estimated characters generated from the Vision Model (VM) in a single language modality, ignoring the visual-semantic relations
-
DATRA-MIV: Decoder-Adaptive Tiling and Rate Allocation for MPEG Immersive Video ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-19 Jong-Beom Jeong, Soonbin Lee, Eun-Seok Ryu
The emerging immersive video coding standard, Moving Picture Experts Group (MPEG) immersive video (MIV), which is undergoing standardization by the MPEG-Immersive (MPEG-I) group, enables six degrees of freedom (6DoF) in a virtual reality (VR) environment that represents both natural and computer-generated scenes using multi-view video compression. MIV eliminates the redundancy between multi-view videos
-
MultiMatch: Multi-task Learning for Semi-supervised Domain Generalization ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-19 Lei Qi, Hongpeng Yang, Yinghuan Shi, Xin Geng
Domain generalization (DG) aims to learn a model on source domains that generalizes well to an unseen target domain. Although DG has achieved great success, most existing methods require label information for all training samples in the source domains, which is time-consuming and expensive to obtain in real-world applications. In this paper, we address the semi-supervised domain generalization
-
ReFID: Reciprocal Frequency-aware Generalizable Person Re-identification via Decomposition and Filtering ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-16 Jinjia Peng, Pengpeng Song, Hui Li, Huibing Wang
Domain generalization (DG) of person re-identification (ReID) aims to conduct testing across domains that have not been previously encountered, without utilizing target domain data during the training stage. As the number of source domains increases, the relationships between training samples become more complex. This can lead to domain-invariant features that include certain instance-level spurious
-
Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-16 Yucheng Suo, Zhedong Zheng, Xiaohan Wang, Bang Zhang, Yi Yang
Sign language provides a way for differently-abled individuals to express their feelings and emotions. However, learning sign language can be challenging and time-consuming. An alternative approach is to animate user photos using sign language videos of specific words, which can be achieved using existing image animation methods. However, the finger motions in the generated videos are often not ideal
-
Detection of Adversarial Facial Accessory Presentation Attacks Using Local Face Differential ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-15 Fei Peng, Le Qin, Min Long, Jin Li
To counter adversarial facial accessory presentation attacks (PAs), a detection method based on local face differential is proposed in this paper. It extracts the local face differential features from a suspected face image and a reference face image, and then adaptively fuses the differential features of different local face regions to detect adversarial facial accessory PAs. Meanwhile, the principle
-
Region-Focused Network for Dense Captioning ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-15 Qingbao Huang, Pijian Li, Youji Huang, Feng Shuang, Yi Cai
Dense captioning is a critical but under-explored task that aims to densely detect and localize regions of interest (RoIs) in a given image and describe them in natural language. Although recent studies have tried to fuse multi-scale features from different visual instances to generate more accurate descriptions, their methods still suffer from a lack of exploration of relational semantic information
-
SWRM: Similarity Window Reweighting and Margin for Long-Tailed Recognition ACM Trans. Multimed. Comput. Commun. Appl. (IF 5.1) Pub Date : 2024-02-07 Qiong Chen, Tianlin Huang, Qingfa Liu
Real-world data usually follows a long-tailed distribution, where a few classes have a higher number of samples than the other classes. Recent studies have sought to alleviate this extreme data imbalance from different perspectives. In this paper, we experimentally find that, because the visual features of some head and tail classes are easily confused, the cross-entropy model is prone to