
-
Diagnosing Human-Object Interaction Detectors Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-02-16 Fangrui Zhu, Yiming Xie, Weidi Xie, Huaizu Jiang
We have witnessed significant progress in human-object interaction (HOI) detection. However, relying solely on mAP (mean Average Precision) scores as a summary metric does not provide sufficient insight into the nuances of model performance (e.g., why one model outperforms another), which can hinder further innovation in this field. To address this issue, we introduce a diagnosis toolbox in this paper
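As context for why mAP is such a coarse summary, the metric averages per-class Average Precision, and AP itself collapses an entire precision-recall curve into one number. A minimal sketch of that computation (function and variable names are illustrative, and the IoU matching step is assumed done upstream; this is not the paper's toolbox code):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """AP as the area under the precision-recall curve.

    scores: confidence of each detection; is_true_positive: whether each
    detection matched a ground-truth instance (e.g., IoU >= 0.5);
    num_gt: number of ground-truth instances. All names are illustrative.
    """
    order = np.argsort(-np.asarray(scores))          # rank by confidence
    tp = np.asarray(is_true_positive, float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / (cum_tp + cum_fp)
    # integrate precision over recall (precision-envelope interpolation omitted)
    return float(np.trapz(precision, recall))

# mAP averages AP over categories (for HOI, over interaction classes), which
# is exactly why it hides *which* error modes a model makes.
```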
-
Fusion4DAL: Offline Multi-modal 3D Object Detection for 4D Auto-labeling Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-02-15 Zhiyuan Yang, Xuekuan Wang, Wei Zhang, Xiao Tan, Jincheng Lu, Jingdong Wang, Errui Ding, Cairong Zhao
Integrating LiDAR and camera information has been a widely adopted approach for 3D object detection in autonomous driving. Nevertheless, the potential of multi-modal fusion remains largely unexplored in the realm of offline 4D detection. We experimentally find that this stems from two causes: (1) the sparsity of point clouds poses a challenge in extracting long-term image features and thereby results in
-
An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-training Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-02-13 Jin Gao, Shubo Lin, Shaoru Wang, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm, which has been considerably less studied, in contrast to the well-established lightweight
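For readers unfamiliar with MIM, the core pre-training step masks a large fraction of image patches and trains the ViT to reconstruct them. A minimal MAE-style masking sketch, assuming square images divisible into non-overlapping patches (not the paper's implementation):

```python
import torch

def random_patch_mask(images, patch=16, mask_ratio=0.75):
    """Randomly mask patches for MIM pre-training (MAE-style sketch).

    Returns flattened patch tokens and a boolean mask marking which
    patches the model must reconstruct.
    """
    B, C, H, W = images.shape
    # unfold into non-overlapping patches: (B, num_patches, C*patch*patch)
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    num = patches.shape[1]
    noise = torch.rand(B, num)                      # independent score per patch
    cut = torch.quantile(noise, mask_ratio, dim=1, keepdim=True)
    mask = noise < cut                              # ~mask_ratio of patches hidden
    return patches, mask                            # loss is taken on patches[mask]
```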
-
Smaller But Better: Unifying Layout Generation with Smaller Large Language Models Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-02-12 Peirong Zhang, Jiaxin Zhang, Jiahuan Cao, Hongliang Li, Lianwen Jin
We propose LGGPT, an LLM-based model tailored for unified layout generation. First, we propose Arbitrary Layout Instruction (ALI) and Universal Layout Response (ULR) as the uniform I/O template. ALI accommodates arbitrary layout generation task inputs across multiple layout domains, enabling LGGPT to unify both task-generic and domain-generic layout generation, which has hitherto been unexplored. Collectively, ALI
-
LiDAR-guided Geometric Pretraining for Vision-Centric 3D Object Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-02-09 Linyan Huang, Huijie Wang, Jia Zeng, Shengchuan Zhang, Liujuan Cao, Junchi Yan, Hongyang Li
Multi-camera 3D object detection for autonomous driving is a challenging problem that has garnered notable attention from both academia and industry. A key obstacle for vision-based techniques is the precise extraction of geometry-aware features from RGB images. Recent approaches have utilized geometry-aware image backbones pretrained on depth-related tasks to acquire spatial information
-
Learning Structure-Supporting Dependencies via Keypoint Interactive Transformer for General Mammal Pose Estimation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-02-07 Tianyang Xu, Jiyong Rao, Xiaoning Song, Zhenhua Feng, Xiao-Jun Wu
General mammal pose estimation is an important and challenging task in computer vision, essential for understanding mammal behaviour in real-world applications. However, existing studies are still at a preliminary stage, focusing on only a few specific mammal species. In principle, moving from specific to general mammal pose estimation, the biggest issue is how
-
Towards Boosting Out-of-Distribution Detection from a Spatial Feature Importance Perspective Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-02-05 Yao Zhu, Xiu Yan, Chuanlong Xie
To ensure the reliable and secure operation of models, Out-of-Distribution (OOD) detection has gained widespread attention in recent years. Researchers have proposed various promising detection criteria to construct the rejection region of a model, treating samples falling into this region as out-of-distribution. However, these detection criteria are computed using all dense features of the model
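For concreteness, one widely used detection criterion of the kind the abstract refers to is the energy score computed from classifier logits (Liu et al., 2020). The sketch below shows that baseline criterion only; it is not the paper's proposed method:

```python
import torch

def energy_score(logits, temperature=1.0):
    """Energy-based OOD criterion: one common choice of detection score.

    Lower energy -> more in-distribution. The rejection region is then
    {x : -energy(x) < threshold}, with the threshold tuned on ID data.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)
```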
-
Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-02-04 Tianshui Chen, Jianman Lin, Zhijing Yang, Chumei Qing, Yukai Shi, Liang Lin
Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of source spoken contents. Thus, emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models. However, the intrinsic intertwining of these elements during the
-
TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-31 Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang
Virtual try-on focuses on adjusting given clothes to fit a specific person seamlessly while avoiding any distortion of the garment's patterns and textures. However, the uncontrollable clothing identity and training inefficiency of existing diffusion-based methods, which struggle to maintain garment identity even with full-parameter training, are significant limitations that hinder their widespread
-
DiffuVolume: Diffusion Model for Volume based Stereo Matching Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-02-01 Dian Zheng, Xiao-Ming Wu, Zuhao Liu, Jingke Meng, Wei-Shi Zheng
Stereo matching is a significant component of many computer vision tasks and driving-based applications. Recently, cost volume-based methods have achieved great success, benefiting from the rich geometry information in paired images. However, the redundancy of the cost volume also interferes with model training and limits performance. To construct a more precise cost volume, we pioneeringly apply the
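For background, a cost volume stacks one matching slice per candidate disparity so that a 3D network can regress depth. A standard concatenation-style construction (PSMNet-style; shown as background, not the paper's diffusion formulation):

```python
import torch

def build_cost_volume(feat_left, feat_right, max_disp):
    """Concatenation-style cost volume for rectified stereo pairs.

    feat_*: (B, C, H, W) image features. Returns (B, 2C, max_disp, H, W),
    pairing each left-image column with the right image shifted by d.
    """
    B, C, H, W = feat_left.shape
    volume = feat_left.new_zeros(B, 2 * C, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, :C, d] = feat_left
            volume[:, C:, d] = feat_right
        else:
            volume[:, :C, d, :, d:] = feat_left[..., d:]
            volume[:, C:, d, :, d:] = feat_right[..., :-d]
    return volume
```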
-
Self-supervised Shutter Unrolling with Events Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-29 Mingyuan Lin, Yangguang Wang, Xiang Zhang, Boxin Shi, Wen Yang, Chu He, Gui-song Xia, Lei Yu
Continuous-time Global Shutter Video Recovery (CGVR) faces a substantial challenge in recovering undistorted high frame-rate Global Shutter (GS) videos from distorted Rolling Shutter (RS) images. This problem is severely ill-posed due to the absence of temporal dynamic information within RS intra-frame scanlines and inter-frame exposures, particularly when prior knowledge about camera/object motions
-
Image Synthesis Under Limited Data: A Survey and Taxonomy Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-27 Mengping Yang, Zhe Wang
Deep generative models, which aim to reproduce the data distribution in order to produce novel images, have made unprecedented advancements in recent years. However, one critical prerequisite for their tremendous success is the availability of a sufficient number of training samples, which requires massive computational resources. When trained on limited data, generative models tend to suffer from severe performance
-
Dual-Space Video Person Re-identification Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-27 Jiaxu Leng, Changjiang Kuang, Shuang Li, Ji Gan, Haosheng Chen, Xinbo Gao
Video person re-identification (VReID) aims to recognize individuals across video sequences. Existing methods primarily use Euclidean space for representation learning but struggle to capture complex hierarchical structures, especially in scenarios with occlusions and background clutter. In contrast, hyperbolic space, with its negatively curved geometry, excels at preserving hierarchical relationships
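For reference, distances in the Poincaré ball model of hyperbolic space grow rapidly near the boundary, which is what lets it embed tree-like hierarchies with low distortion. A sketch of the standard geodesic distance for curvature -1 (not necessarily the paper's exact parameterization):

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    """Geodesic distance in the Poincare ball (curvature -1).

    d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    Points must lie inside the unit ball; `eps` guards the boundary.
    """
    sq = (u - v).pow(2).sum(-1)
    un = u.pow(2).sum(-1).clamp(max=1 - eps)
    vn = v.pow(2).sum(-1).clamp(max=1 - eps)
    x = 1 + 2 * sq / ((1 - un) * (1 - vn))
    return torch.acosh(x.clamp(min=1 + eps))
```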
-
Learning with Enriched Inductive Biases for Vision-Language Models Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-28 Lingxiao Yang, Ru-Yuan Zhang, Qi Chen, Xiaohua Xie
Vision-Language Models, pre-trained on large-scale image-text pairs, serve as strong foundation models for transfer learning across a variety of downstream tasks. For few-shot generalization tasks, i.e., when the model is trained on few-shot samples and then tested on unseen categories or datasets, there is a balance to be struck between generalization and discrimination when tweaking these models
-
Sample-Cohesive Pose-Aware Contrastive Facial Representation Learning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-28 Yuanyuan Liu, Shaoze Feng, Shuyang Liu, Yibing Zhan, Dapeng Tao, Zijing Chen, Zhe Chen
Self-supervised facial representation learning (SFRL) methods, especially contrastive learning (CL) methods, have become increasingly popular due to their ability to perform face understanding without relying heavily on large-scale, well-annotated datasets. However, our analysis shows that current CL-based SFRL methods still perform unsatisfactorily in learning facial representations due to their tendency to learn
-
SeaFormer++: Squeeze-Enhanced Axial Transformer for Mobile Visual Recognition Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-25 Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang
Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), long dominated by CNNs, has been significantly revolutionized. However, the computational cost and memory requirements render these methods unsuitable for mobile devices. In this paper, we introduce a new method, the squeeze-enhanced Axial Transformer (SeaFormer)
-
DeepFake-Adapter: Dual-Level Adapter for DeepFake Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-24 Rui Shao, Tianxing Wu, Liqiang Nie, Ziwei Liu
Existing deepfake detection methods fail to generalize well to unseen or degraded samples, which can be attributed to the over-fitting of low-level forgery patterns. Here we argue that high-level semantics are also indispensable recipes for generalizable forgery detection. Recently, large pre-trained Vision Transformers (ViTs) have shown promising generalization capability. In this paper, we propose
-
MoonShot: Towards Controllable Video Generation and Editing with Motion-Aware Multimodal Conditions Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-24 David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, Doyen Sahoo
Current video diffusion models (VDMs) mostly rely on text conditions, limiting control over video appearance and geometry. This study introduces a new model, MoonShot, conditioning on both image and text for enhanced control. It features the Multimodal Video Block (MVB), integrating the motion-aware dual cross-attention layer for precise appearance and motion alignment with provided prompts, and the
-
A Mutual Supervision Framework for Referring Expression Segmentation and Generation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-23 Shijia Huang, Feng Li, Hao Zhang, Shilong Liu, Lei Zhang, Liwei Wang
Referring Expression Segmentation (RES) and Referring Expression Generation (REG) are mutually inverse tasks that can naturally be jointly trained. Though recent work has explored such joint training, the mechanism by which RES and REG benefit each other is still unclear. In this paper, we propose a mutual supervision framework that enables the two tasks to improve each other. Our mutual supervision
-
GL-MCM: Global and Local Maximum Concept Matching for Zero-Shot Out-of-Distribution Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-22 Atsuyuki Miyai, Qing Yu, Go Irie, Kiyoharu Aizawa
Zero-shot OOD detection is the task of detecting OOD images during inference given only in-distribution (ID) class names. Existing methods assume ID images contain a single, centered object and do not consider the more realistic multi-object scenarios in which both ID and OOD objects are present. To meet the needs of many users, a detection method must be flexible enough to adapt to the type of ID images
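For background, the MCM score this work extends matches a CLIP image embedding against embeddings of the ID class names and takes the maximum softmax similarity. A sketch of that base score (the global-plus-local extension is the paper's contribution and is only noted in a comment):

```python
import torch
import torch.nn.functional as F

def mcm_score(image_emb, text_embs, temperature=0.01):
    """Maximum Concept Matching (MCM) score for zero-shot OOD detection.

    image_emb: (D,) CLIP image embedding; text_embs: (K, D) embeddings of
    the K in-distribution class names. High score -> likely ID.
    GL-MCM additionally applies the same matching to local patch features.
    """
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    sims = txt @ img / temperature          # (K,) scaled cosine similarities
    return sims.softmax(dim=-1).max().item()
```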
-
Pre-trained Trojan Attacks for Visual Recognition Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-22 Aishan Liu, Xianglong Liu, Xinwei Zhang, Yisong Xiao, Yuguang Zhou, Siyuan Liang, Jiakai Wang, Xiaochun Cao, Dacheng Tao
Pre-trained vision models (PVMs) have become a dominant component due to their exceptional performance when fine-tuned for downstream tasks. However, the presence of backdoors within PVMs poses significant threats. Unfortunately, existing studies primarily focus on backdooring PVMs for the classification task, neglecting potential inherited backdoors in downstream tasks such as detection and segmentation
-
Rethinking Generalizability and Discriminability of Self-Supervised Learning from Evolutionary Game Theory Perspective Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-19 Jiangmeng Li, Zehua Zang, Qirui Ji, Chuxiong Sun, Wenwen Qiang, Junge Zhang, Changwen Zheng, Fuchun Sun, Hui Xiong
Representations learned by self-supervised approaches are generally considered to possess sufficient generalizability and discriminability. However, we disclose a nontrivial mutual-exclusion relationship between these critical representation properties through an exploratory demonstration on self-supervised learning. State-of-the-art self-supervised methods tend to enhance either generalizability or
-
Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-15 Antonin Vobecky, David Hurych, Oriane Siméoni, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Josef Sivic
Semantic image segmentation models typically require extensive pixel-wise annotations, which are costly to obtain and prone to biases. Our work investigates learning semantic segmentation in urban scenes without any manual annotation. We propose a novel method for learning pixel-wise semantic segmentation using raw, uncurated data from vehicle-mounted cameras and LiDAR sensors, thus eliminating the
-
UniCanvas: Affordance-Aware Unified Real Image Editing via Customized Text-to-Image Generation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-14 Jian Jin, Yang Shen, Xinyang Zhao, Zhenyong Fu, Jian Yang
The demand for assorted conditional edits on a single real image is becoming increasingly prevalent. We focus on two dominant editing tasks that respectively condition on image and text input, namely subject-driven editing and semantic editing. Previous studies typically tackle these two editing tasks separately, thereby demanding multiple editing processes to achieve versatile edits on a single image
-
Generalized Robotic Vision-Language Learning Model via Linguistic Foreground-Aware Contrast Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-14 Kangcheng Liu, Chaoqun Wang, Xiaodong Han, Yong-Jin Liu, Baoquan Chen
Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points that often dominate in 3D scenes. Also, object awareness and foreground-to-background discrimination are neglected, making contrastive
-
RIGID: Recurrent GAN Inversion and Editing of Real Face Videos and Beyond Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-13 Yangyang Xu, Shengfeng He, Kwan-Yee K. Wong, Ping Luo
GAN inversion is essential for harnessing the editability of GANs on real images, yet existing methods that invert video frames individually often yield temporally inconsistent results. To address this issue, we present a unified recurrent framework, Recurrent vIdeo GAN Inversion and eDiting (RIGID), designed to explicitly enforce temporally coherent GAN inversion and facial editing in real videos
-
Learning Meshing from Delaunay Triangulation for 3D Shape Representation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-10 Chen Zhang, Wenbing Tao
Recently, there has been a growing interest in learning-based explicit methods due to their ability to respect the original input and preserve details. However, the connectivity on complex structures is still difficult to infer due to the limited local shape perception, resulting in artifacts and non-watertight triangles. In this paper, we present a novel learning-based method with Delaunay triangulation
-
LDTrack: Dynamic People Tracking by Service Robots Using Diffusion Models Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-08 Angus Fung, Beno Benhabib, Goldie Nejat
Tracking of dynamic people in cluttered and crowded human-centered environments is a challenging robotics problem due to the presence of intraclass variations including occlusions, pose deformations, and lighting variations. This paper introduces a novel deep learning architecture, using conditional latent diffusion models, the Latent Diffusion Track (LDTrack), for tracking multiple dynamic people
-
Context-Aware Multi-view Stereo Network for Efficient Edge-Preserving Depth Estimation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-07 Wanjuan Su, Wenbing Tao
Learning-based multi-view stereo methods have achieved great progress in recent years by employing the coarse-to-fine depth estimation framework. However, existing methods still encounter difficulties in recovering depth in featureless areas, at object boundaries, and on thin structures, mainly due to the poor distinguishability of matching clues in low-textured regions, the inherently smooth properties
-
PICK: Predict and Mask for Semi-supervised Medical Image Segmentation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-05 Qingjie Zeng, Zilin Lu, Yutong Xie, Yong Xia
Pseudo-labeling and consistency-based co-training are established paradigms in semi-supervised learning. Pseudo-labeling focuses on selecting reliable pseudo-labels, while co-training emphasizes sub-network diversity for complementary information extraction. However, both paradigms struggle with the inevitable erroneous predictions from unlabeled data, which poses a risk to task-specific decoders and
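For context, the reliability-selection step shared by pseudo-labeling paradigms is typically a confidence threshold over softmax predictions. A minimal sketch of that generic step (not PICK's predict-and-mask scheme):

```python
import torch
import torch.nn.functional as F

def select_pseudo_labels(logits, threshold=0.9):
    """Confidence-based pseudo-label selection (the standard recipe).

    Keeps only unlabeled samples/pixels whose predicted class probability
    exceeds `threshold`; the rest are masked out of the unsupervised loss.
    """
    probs = F.softmax(logits, dim=1)        # (N, num_classes, ...)
    conf, pseudo = probs.max(dim=1)         # confidence and hard label
    keep = conf >= threshold                # reliability mask
    return pseudo, keep
```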
-
Delving Deep into Simplicity Bias for Long-Tailed Image Recognition Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-06 Xiu-Shen Wei, Xuhao Sun, Yang Shen, Peng Wang
Simplicity Bias (SB) is the phenomenon whereby deep neural networks tend to rely on simpler predictive patterns while ignoring more complex features when applied to supervised discriminative tasks. In this work, we investigate SB in long-tailed image recognition and find that the tail classes suffer more severely from SB, which harms the generalization performance of such underrepresented classes. We empirically
-
HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-04 Zengxi Zhang, Zhiying Jiang, Long Ma, Jinyuan Liu, Xin Fan, Risheng Liu
Underwater images are often affected by light refraction and absorption, reducing visibility and interfering with subsequent applications. Existing underwater image enhancement methods primarily focus on improving visual quality while overlooking practical implications. To strike a balance between visual quality and application, we propose a heuristic invertible network for underwater perception enhancement
-
General Class-Balanced Multicentric Dynamic Prototype Pseudo-Labeling for Source-Free Domain Adaptation Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-05 Sanqing Qu, Guang Chen, Jing Zhang, Zhijun Li, Wei He, Dacheng Tao
Source-free Domain Adaptation aims to adapt a pre-trained source model to an unlabeled target domain while circumventing access to well-labeled source data. To compensate for the absence of source data, most existing approaches employ prototype-based pseudo-labeling strategies to facilitate self-training model adaptation. Nevertheless, these methods commonly rely on instance-level predictions for direct
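For background, the instance-level, single-prototype pseudo-labeling baseline this work improves upon can be sketched as below (a SHOT-style single-centric version, shown to contrast with the paper's multicentric, class-balanced design):

```python
import torch
import torch.nn.functional as F

def prototype_pseudo_labels(features, probs):
    """Single-centric prototype pseudo-labeling (SHOT-style baseline).

    features: (N, D) target-domain features; probs: (N, K) current soft
    predictions. Each class prototype is the prediction-weighted feature
    mean; samples are relabeled by their nearest prototype.
    """
    feats = F.normalize(features, dim=1)
    protos = F.normalize(probs.t() @ feats, dim=1)   # (K, D) class centers
    return (feats @ protos.t()).argmax(dim=1)        # nearest-prototype label
```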
-
Relation-Guided Versatile Regularization for Federated Semi-Supervised Learning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-05 Qiushi Yang, Zhen Chen, Zhe Peng, Yixuan Yuan
Federated semi-supervised learning (FSSL) aims to address the increasing privacy concerns of practical scenarios in which data holders have limited labeling capability. The latest FSSL approaches leverage the prediction consistency between the local model and the global model to exploit knowledge from partially labeled or completely unlabeled clients. However, they merely utilize data-level augmentation
-
Robust Sequential DeepFake Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-04 Rui Shao, Tianxing Wu, Ziwei Liu
Since photorealistic faces can now be readily generated by facial manipulation technologies, potential malicious abuse of these technologies has drawn great concern, and numerous deepfake detection methods have been proposed. However, existing methods only focus on detecting one-step facial manipulation. With the emergence of easily accessible facial editing applications, people can easily manipulate
-
RepSNet: A Nucleus Instance Segmentation Model Based on Boundary Regression and Structural Re-Parameterization Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-02 Shengchun Xiong, Xiangru Li, Yunpeng Zhong, Wanfen Peng
Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets are the major challenges in studies of this problem. To this end, we design a neural network model, RepSNet, based on a nucleus boundary
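As background on the structural re-parameterization in the title, the general trick (RepVGG-style; the details here are illustrative, not RepSNet's architecture) folds parallel training-time branches into a single inference-time convolution:

```python
import torch
import torch.nn.functional as F

def fuse_branches(w3x3, b3x3, w1x1, b1x1):
    """Structural re-parameterization, the general idea only.

    A parallel 3x3 + 1x1 conv pair trained for accuracy is folded into one
    3x3 conv for fast inference: pad the 1x1 kernel to 3x3, add the weights.
    """
    w1_pad = F.pad(w1x1, [1, 1, 1, 1])     # (O, I, 1, 1) -> (O, I, 3, 3)
    return w3x3 + w1_pad, b3x3 + b1x1

# Sanity check: the fused conv equals the sum of the two branches.
x = torch.randn(1, 8, 16, 16)
w3, b3 = torch.randn(8, 8, 3, 3), torch.randn(8)
w1, b1 = torch.randn(8, 8, 1, 1), torch.randn(8)
wf, bf = fuse_branches(w3, b3, w1, b1)
y = F.conv2d(x, w3, b3, padding=1) + F.conv2d(x, w1, b1)
assert torch.allclose(F.conv2d(x, wf, bf, padding=1), y, atol=1e-4)
```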
-
Blind Image Quality Assessment: Exploring Content Fidelity Perceptibility via Quality Adversarial Learning Int. J. Comput. Vis. (IF 11.6) Pub Date : 2025-01-03 Mingliang Zhou, Wenhao Shen, Xuekai Wei, Jun Luo, Fan Jia, Xu Zhuang, Weijia Jia
In deep learning-based no-reference image quality assessment (NR-IQA) methods, the absence of reference images limits their ability to assess content fidelity, making it difficult to distinguish between original content and distortions that degrade quality. To address this issue, we propose a quality adversarial learning framework emphasizing both content fidelity and prediction accuracy. The main
-
Pseudo-Plane Regularized Signed Distance Field for Neural Indoor Scene Reconstruction Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-31 Jing Li, Jinpeng Yu, Ruoyu Wang, Shenghua Gao
Given only a set of images, neural implicit surface representations have shown their capability in 3D surface reconstruction. However, because per-scene optimization is based on the volumetric rendering of color, previous neural implicit surface reconstruction methods usually fail in low-textured regions, including floors, walls, etc., which commonly exist in indoor scenes. Being aware of
-
CSFRNet: Integrating Clothing Status Awareness for Long-Term Person Re-identification Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-30 Yan Huang, Yan Huang, Zhang Zhang, Qiang Wu, Yi Zhong, Liang Wang
Addressing the dynamic nature of long-term person re-identification (LT-reID) amid varying clothing conditions necessitates a departure from conventional methods. Traditional LT-reID strategies, mainly biometrics-based and data adaptation-based, each have their pitfalls. The former falters in environments lacking high-quality biometric data, while the latter loses efficacy with minimal or subtle clothing
-
AniClipart: Clipart Animation with Text-to-Video Priors Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-27 Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao
Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless
-
Combating Label Noise with a General Surrogate Model for Sample Selection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-27 Chao Liang, Linchao Zhu, Humphrey Shi, Yi Yang
Modern deep learning systems are data-hungry. Learning from web data is one feasible solution, but it inevitably introduces label noise, which can hinder the performance of deep neural networks. Sample selection is an effective way to deal with label noise. The key is to separate clean samples based on some criterion. Previous methods pay more attention to the small-loss criterion, where small-loss
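For context, the small-loss criterion mentioned above is often instantiated by fitting a two-component mixture to per-sample losses (DivideMix-style; shown as background, not this paper's surrogate-model approach):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def small_loss_selection(losses):
    """Small-loss criterion via a 2-component GMM (DivideMix-style).

    Per-sample losses are modeled as a mixture of a 'clean' (low-mean) and
    a 'noisy' (high-mean) component; the posterior of the clean component
    serves as the probability that each label is correct.
    """
    losses = np.asarray(losses, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, reg_covar=1e-4).fit(losses)
    clean = np.argmin(gmm.means_.ravel())        # low-loss component = clean
    return gmm.predict_proba(losses)[:, clean]   # P(clean | loss) per sample
```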
-
Exploring Homogeneous and Heterogeneous Consistent Label Associations for Unsupervised Visible-Infrared Person ReID Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-26 Lingfeng He, De Cheng, Nannan Wang, Xinbo Gao
Unsupervised visible-infrared person re-identification (USL-VI-ReID) endeavors to retrieve pedestrian images of the same identity from different modalities without annotations. While prior work focuses on establishing cross-modality pseudo-label associations to bridge the modality-gap, they ignore maintaining the instance-level homogeneous and heterogeneous consistency between the feature space and
-
SLIDE: A Unified Mesh and Texture Generation Framework with Enhanced Geometric Control and Multi-view Consistency Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-23 Jinyi Wang, Zhaoyang Lyu, Ben Fei, Jiangchao Yao, Ya Zhang, Bo Dai, Dahua Lin, Ying He, Yanfeng Wang
The generation of textured mesh is crucial for computer graphics and virtual content creation. However, current generative models often struggle with challenges such as irregular mesh structures and inconsistencies in multi-view textures. In this study, we present a unified framework for both geometry generation and texture generation, utilizing a novel sparse latent point diffusion model that specifically
-
LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-23 Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose
-
AutoStory: Generating Diverse Storytelling Images with Minimal Human Efforts Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-23 Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen
Story visualization aims to generate a series of images that match the story described in texts, and it requires the generated images to satisfy high quality, alignment with the text description, and consistency in character identities. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or requiring
-
FusionBooster: A Unified Image Fusion Boosting Paradigm Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-23 Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Hui Li, Xi Li, Josef Kittler
In recent years, numerous ideas have emerged for designing mutually reinforcing mechanisms or extra stages for the image fusion task, ignoring the inevitable gaps between different vision tasks as well as the computational burden. We argue that there is scope to improve fusion performance with the help of FusionBooster, a model specifically designed for fusion tasks. In particular, our booster
-
Noise-Resistant Multimodal Transformer for Emotion Recognition Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-22 Yuanyuan Liu, Haoyu Zhang, Yibing Zhan, Zijing Chen, Guanghao Yin, Lin Wei, Zhe Chen
Multimodal emotion recognition identifies human emotions from various data modalities like video, text, and audio. However, we found that this task can be easily affected by noisy information that does not contain useful semantics and may occur at different locations of a multimodal input sequence. To this end, we present a novel paradigm that attempts to extract noise-resistant features in its pipeline
-
Deep Attention Learning for Pre-operative Lymph Node Metastasis Prediction in Pancreatic Cancer via Multi-object Relationship Modeling Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-20 Zhilin Zheng, Xu Fang, Jiawen Yao, Mengmeng Zhu, Le Lu, Yu Shi, Hong Lu, Jianping Lu, Ling Zhang, Chengwei Shao, Yun Bian
Lymph node (LN) metastasis status is one of the most critical prognostic and cancer staging clinical factors for patients with resectable pancreatic ductal adenocarcinoma (PDAC, generally for any types of solid malignant tumors). Pre-operative prediction of LN metastasis from non-invasive CT imaging is highly desired, as it might be directly and conveniently used to guide the follow-up neoadjuvant
-
Learning Discriminative Features for Visual Tracking via Scenario Decoupling Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-19 Yinchao Ma, Qianjin Yu, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang
Visual tracking aims to estimate object state automatically in a video sequence, which is challenging especially in complex scenarios. Recent Transformer-based trackers enable the interaction between the target template and search region in the feature extraction phase for target-aware feature learning, which have achieved superior performance. However, visual tracking is essentially a task to discriminate
-
Polynomial Implicit Neural Framework for Promoting Shape Awareness in Generative Models Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-20 Utkarsh Nath, Rajhans Singh, Ankita Shukla, Kuldeep Kulkarni, Pavan Turaga
Polynomial functions have been employed to represent shape-related information in 2D and 3D computer vision, even from the very early days of the field. In this paper, we present a framework using polynomial-type basis functions to promote shape awareness in contemporary generative architectures. The benefits of using a learnable form of polynomial basis functions as drop-in modules into generative
-
Hard-Normal Example-Aware Template Mutual Matching for Industrial Anomaly Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-18 Zixuan Chen, Xiaohua Xie, Lingxiao Yang, Jian-Huang Lai
Anomaly detectors are widely used in industrial manufacturing to detect and localize unknown defects in query images. These detectors are trained on anomaly-free samples and have successfully distinguished anomalies from most normal samples. However, hard-normal examples are scattered and far apart from most normal samples, and thus they are often mistaken for anomalies by existing methods. To address
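For background, template-based anomaly scoring of the kind this paper refines typically measures each query patch's distance to its nearest anomaly-free template feature (a PatchCore-style baseline, not the proposed mutual matching):

```python
import torch

def template_matching_score(query_feats, template_feats, k=1):
    """Nearest-template anomaly score (PatchCore-style baseline).

    query_feats: (Nq, D) patch features of the query image; template_feats:
    (Nt, D) features from anomaly-free templates. A patch far from all
    templates scores high. Hard-normal templates, themselves far from the
    rest of the bank, are exactly what trips up such one-sided matching.
    """
    d = torch.cdist(query_feats, template_feats)     # (Nq, Nt) pairwise L2
    knn = d.topk(k, dim=1, largest=False).values     # distance to k nearest
    patch_scores = knn.mean(dim=1)
    return patch_scores.max()                        # image-level score
```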
-
Beyond Talking – Generating Holistic 3D Human Dyadic Motion for Communication Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-17 Mingze Sun, Chao Xu, Xinyu Jiang, Yang Liu, Baigui Sun, Ruqi Huang
In this paper, we introduce an innovative task focused on human communication, aiming to generate 3D holistic human motions for both speakers and listeners. Central to our approach is the incorporation of factorization to decouple audio features and the combination of textual semantic information, thereby facilitating the creation of more realistic and coordinated movements. We separately train VQ-VAEs
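For readers unfamiliar with the VQ-VAEs mentioned here, the quantization step snaps each encoder output to its nearest codebook entry, with a straight-through estimator for gradients. A generic sketch (not the paper's motion-specific model):

```python
import torch

def vector_quantize(z, codebook):
    """VQ-VAE quantization: snap each latent to its nearest code.

    z: (N, D) encoder outputs; codebook: (K, D) learned embeddings. Returns
    the quantized latents and their code indices; the straight-through
    estimator lets gradients bypass the non-differentiable argmin.
    """
    d = torch.cdist(z, codebook)          # (N, K) distances to all codes
    idx = d.argmin(dim=1)                 # nearest code per latent
    z_q = codebook[idx]
    z_q = z + (z_q - z).detach()          # straight-through gradient
    return z_q, idx
```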
-
Hyper-3DG: Text-to-3D Gaussian Generation via Hypergraph Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-16 Donglin Di, Jiahui Yang, Chaofan Luo, Zhou Xue, Wei Chen, Xun Yang, Yue Gao
Text-to-3D generation represents an exciting field that has seen rapid advancements, facilitating the transformation of textual descriptions into detailed 3D models. However, current progress often neglects the intricate high-order correlation of geometry and texture within 3D objects, leading to challenges such as over-smoothness, over-saturation and the Janus problem. In this work, we propose a method
-
Relation-Guided Adversarial Learning for Data-Free Knowledge Transfer Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-13 Yingping Liang, Ying Fu
Data-free knowledge distillation transfers knowledge by recovering training data from a pre-trained model. Despite the recent success of seeking global data diversity, the diversity within each class and the similarity among different classes are largely overlooked, resulting in data homogeneity and limited performance. In this paper, we introduce a novel Relation-Guided Adversarial Learning method
-
CMAE-3D: Contrastive Masked AutoEncoders for Self-Supervised 3D Object Detection Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-11 Yanan Zhang, Jiaxin Chen, Di Huang
LiDAR-based 3D object detection is a crucial task for autonomous driving, owing to its accurate object recognition and localization capabilities in the 3D real-world space. However, existing methods heavily rely on time-consuming and laborious large-scale labeled LiDAR data, posing a bottleneck for both performance improvement and practical applications. In this paper, we propose Contrastive Masked
-
Structured Generative Models for Scene Understanding Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-12 Christopher K. I. Williams
This position paper argues for the use of structured generative models (SGMs) for the understanding of static scenes. This requires the reconstruction of a 3D scene from an input image (or a set of multi-view images), whereby the contents of the image(s) are causally explained in terms of models of instantiated objects, each with their own type, shape, appearance and pose, along with global variables
-
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-12 Yupeng Zhou, Daquan Zhou, Yaxing Wang, Jiashi Feng, Qibin Hou
Recent advancements in diffusion models have showcased their impressive capacity to generate visually striking images. However, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify that a crucial factor leading to the erroneous generation of objects and their attributes is the inadequate cross-modality relation learning between
-
MoDA: Modeling Deformable 3D Objects from Casual Videos Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-12 Chaoyue Song, Jiacheng Wei, Tianyi Chen, Yiwen Chen, Chuan-Sheng Foo, Fayao Liu, Guosheng Lin
In this paper, we focus on the challenges of modeling deformable 3D objects from casual videos. With the popularity of NeRF, many works extend it to dynamic scenes with a canonical NeRF and a deformation model that achieves 3D point transformation between the observation space and the canonical space. Recent works rely on linear blend skinning (LBS) to achieve the canonical-observation transformation
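For background, linear blend skinning (LBS) deforms each canonical-space point by a weighted combination of bone transforms. A minimal sketch of the standard formulation:

```python
import torch

def linear_blend_skinning(points, weights, transforms):
    """LBS: each point is moved by a weighted blend of bone transforms.

    points: (N, 3) canonical-space points; weights: (N, B) skinning weights
    (rows sum to 1); transforms: (B, 4, 4) per-bone rigid transforms.
    x' = sum_b w_b(x) * (T_b @ x_homogeneous).
    """
    N = points.shape[0]
    homo = torch.cat([points, points.new_ones(N, 1)], dim=1)   # (N, 4)
    per_bone = torch.einsum('bij,nj->nbi', transforms, homo)   # (N, B, 4)
    blended = (weights.unsqueeze(-1) * per_bone).sum(dim=1)    # (N, 4)
    return blended[:, :3]
```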
-
Language-Guided Hierarchical Fine-Grained Image Forgery Detection and Localization Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-10 Xiao Guo, Xiaohong Liu, Iacopo Masi, Xiaoming Liu
Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make a unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels
-
Image-Based Virtual Try-On: A Survey Int. J. Comput. Vis. (IF 11.6) Pub Date : 2024-12-10 Dan Song, Xuanpu Zhang, Juan Zhou, Weizhi Nie, Ruofeng Tong, Mohan Kankanhalli, An-An Liu
Image-based virtual try-on aims to synthesize a naturally dressed person image from a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potential. However, there is a gap between current research progress and commercial applications, and an absence of a comprehensive overview of this field to accelerate