P-MapNet: Far-seeing Map Generator Enhanced by both SDMap and HDMap Priors arXiv.cs.CV Pub Date : 2024-03-15 Zhou Jiang, Zhenxin Zhu, Pengfei Li, Huan-ang Gao, Tianyuan Yuan, Yongliang Shi, Hang Zhao, Hao Zhao
Autonomous vehicles are gradually entering city roads today, with the help of high-definition maps (HDMaps). However, the reliance on HDMaps prevents autonomous vehicles from stepping into regions without this expensive digital infrastructure. This fact drives many researchers to study online HDMap generation algorithms, but the performance of these algorithms in far regions is still unsatisfactory.
-
Frozen Feature Augmentation for Few-Shot Image Classification arXiv.cs.CV Pub Date : 2024-03-15 Andreas Bär, Neil Houlsby, Mostafa Dehghani, Manoj Kumar
Training a linear classifier or lightweight model on top of pretrained vision model outputs, so-called 'frozen features', leads to impressive performance on a number of downstream few-shot tasks. Currently, frozen features are not modified during training. On the other hand, when networks are trained directly on images, data augmentation is a standard recipe that improves performance with no substantial
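As a generic illustration of the linear-probe setup described here (not the paper's method; the features, the noise-jitter augmentation, and all hyperparameters below are stand-ins), training a softmax classifier on augmented frozen features might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen features from a pretrained backbone (synthetic clusters).
n_classes, shots, dim = 5, 10, 64
centers = rng.normal(size=(n_classes, dim))
X = np.repeat(centers, shots, axis=0) + 0.3 * rng.normal(size=(n_classes * shots, dim))
y = np.repeat(np.arange(n_classes), shots)

def augment(feats, noise_scale=0.1):
    """Toy 'frozen feature augmentation': jitter the features instead of the images."""
    return feats + noise_scale * rng.normal(size=feats.shape)

# Train a linear (softmax) classifier on augmented frozen features.
W = np.zeros((dim, n_classes))
for _ in range(200):
    Xa = augment(X)
    logits = Xa @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(n_classes)[y]
    W -= 0.1 * Xa.T @ (p - onehot) / len(y)   # cross-entropy gradient step

acc = (np.argmax(X @ W, axis=1) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

The backbone itself never receives gradients; only the linear head `W` is updated, which is what makes the setup cheap for few-shot tasks.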
-
A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction arXiv.cs.CV Pub Date : 2024-03-15 Anshul Gupta, Samy Tafasca, Arya Farkhondeh, Pierre Vuillecard, Jean-Marc Odobez
Gaze following and social gaze prediction are fundamental tasks providing insights into human communication behaviors, intent, and social interactions. Most previous approaches addressed these tasks separately, either by designing highly specialized social gaze models that do not generalize to other social gaze tasks or by considering social gaze inference as an ad-hoc post-processing of the gaze following
-
Mitigating Dialogue Hallucination for Large Multi-modal Models via Adversarial Instruction Tuning arXiv.cs.CV Pub Date : 2024-03-15 Dongmin Park, Zhaofang Qian, Guangxing Han, Ser-Nam Lim
Mitigating hallucinations of Large Multi-modal Models (LMMs) is crucial to enhance their reliability for general-purpose assistants. This paper shows that such hallucinations of LMMs can be significantly exacerbated by preceding user-system dialogues. To precisely measure this, we first present an evaluation benchmark by extending popular multi-modal benchmark datasets with prepended hallucinatory dialogues
-
Robust Shape Fitting for 3D Scene Abstraction arXiv.cs.CV Pub Date : 2024-03-15 Florian Kluger, Eric Brachmann, Michael Ying Yang, Bodo Rosenhahn
Humans perceive and construct the world as an arrangement of simple parametric models. In particular, we can often describe man-made environments using volumetric primitives such as cuboids or cylinders. Inferring these primitives is important for attaining high-level, abstract scene descriptions. Previous approaches for primitive-based abstraction estimate shape parameters directly and are only able
-
Using an LLM to Turn Sign Spottings into Spoken Language Sentences arXiv.cs.CV Pub Date : 2024-03-15 Ozge Mercanoglu Sincan, Necati Cihan Camgoz, Richard Bowden
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos. In this paper, we introduce a hybrid SLT approach, Spotter+GPT, that utilizes a sign spotter and a pretrained large language model to improve SLT performance. Our method builds upon the strengths of both components. The videos are first processed by the spotter, which is
-
SWAG: Splatting in the Wild images with Appearance-conditioned Gaussians arXiv.cs.CV Pub Date : 2024-03-15 Hiba Dahmani, Moussab Bennehar, Nathan Piasco, Luis Roldao, Dzmitry Tsishkou
Implicit neural representation methods have shown impressive advancements in learning 3D scenes from unstructured in-the-wild photo collections but are still limited by the large computational cost of volumetric rendering. More recently, 3D Gaussian Splatting emerged as a much faster alternative with superior rendering quality and training efficiency, especially for small-scale and object-centric scenarios
-
Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search arXiv.cs.CV Pub Date : 2024-03-15 Hongyuan Yu, Cheng Wan, Mengchen Liu, Dongdong Chen, Bin Xiao, Xiyang Dai
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require lots of trials by human experts. In this paper, we address the challenge of integrating multi-head
-
CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning arXiv.cs.CV Pub Date : 2024-03-15 Hyuck Lee, Heeyoung Kim
Pseudo-label-based semi-supervised learning (SSL) algorithms trained on a class-imbalanced set face two cascading challenges: 1) Classifiers tend to be biased towards majority classes, and 2) Biased pseudo-labels are used for training. It is difficult to appropriately re-balance the classifiers in SSL because the class distribution of an unlabeled set is often unknown and could be mismatched with that
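For context, the generic confidence-thresholded pseudo-labeling step that such SSL algorithms build on can be sketched as follows (a FixMatch-style illustration, not CDMAD itself; the probabilities below are hypothetical classifier outputs):

```python
import numpy as np

def pseudo_labels(probs, threshold=0.95):
    """Keep only unlabeled samples whose max class probability clears the threshold."""
    conf = probs.max(axis=1)
    keep = conf >= threshold
    return np.argmax(probs[keep], axis=1), keep

# Hypothetical classifier outputs on 4 unlabeled samples over 3 classes.
probs = np.array([
    [0.97, 0.02, 0.01],   # confident -> pseudo-labeled as class 0
    [0.50, 0.30, 0.20],   # low confidence -> discarded
    [0.01, 0.98, 0.01],   # confident -> pseudo-labeled as class 1
    [0.40, 0.35, 0.25],   # low confidence -> discarded
])
labels, mask = pseudo_labels(probs)
print(labels, mask)
```

Under class imbalance, a biased classifier pushes the confident predictions (and hence the pseudo-labels) toward majority classes, which is exactly the cascading failure the abstract describes.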
-
Evaluating Perceptual Distances by Fitting Binomial Distributions to Two-Alternative Forced Choice Data arXiv.cs.CV Pub Date : 2024-03-15 Alexander Hepburn, Raul Santos-Rodriguez, Javier Portilla
The two-alternative forced choice (2AFC) experimental setup is popular in the visual perception literature, where practitioners aim to understand how human observers perceive distances within triplets that consist of a reference image and two distorted versions of that image. In the past, this had been conducted in controlled environments, with a tournament-style algorithm dictating which images are
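A minimal sketch of fitting a binomial choice probability to 2AFC responses by maximum likelihood (illustrative only: the trial counts are hypothetical, and the grid search stands in for the closed-form MLE k/n):

```python
import numpy as np

def binomial_loglik(p, k, n):
    """Log-likelihood of observing k 'image A' choices in n 2AFC trials."""
    return k * np.log(p) + (n - k) * np.log(1 - p)

k, n = 14, 20                               # hypothetical: A preferred 14/20 times
ps = np.linspace(0.01, 0.99, 99)            # candidate choice probabilities
p_hat = ps[np.argmax(binomial_loglik(ps, k, n))]
print(f"fitted choice probability: {p_hat:.2f}")
```

The fitted probability per triplet is what a perceptual distance is then evaluated against: a good metric should rank the two distortions consistently with p_hat.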
-
PASTA: Towards Flexible and Efficient HDR Imaging Via Progressively Aggregated Spatio-Temporal Alignment arXiv.cs.CV Pub Date : 2024-03-15 Xiaoning Liu, Ao Li, Zongwei Wu, Yapeng Du, Le Zhang, Yulun Zhang, Radu Timofte, Ce Zhu
Leveraging Transformer attention has led to great advancements in HDR deghosting. However, the intricate nature of self-attention introduces practical challenges, as existing state-of-the-art methods often demand high-end GPUs or exhibit slow inference speeds, especially for high-resolution images like 2K. Striking an optimal balance between performance and latency remains a critical concern. In response
-
Open Stamped Parts Dataset arXiv.cs.CV Pub Date : 2024-03-15 Sara Antiles, Sachin S. Talathi
We present the Open Stamped Parts Dataset (OSPD), featuring synthetic and real images of stamped metal sheets for auto manufacturing. The real part images, captured from 7 cameras, consist of 7,980 unlabeled images and 1,680 labeled images. In addition, we have compiled a defect dataset by overlaying synthetically generated masks on 10% of the holes. The synthetic dataset replicates the real manufacturing
-
Testing MediaPipe Holistic for Linguistic Analysis of Nonmanual Markers in Sign Languages arXiv.cs.CV Pub Date : 2024-03-15 Anna Kuznetsova, Vadim Kimmelman
Advances in Deep Learning have made possible reliable landmark tracking of human bodies and faces that can be used for a variety of tasks. We test a recent Computer Vision solution, MediaPipe Holistic (MPH), to find out if its tracking of the facial features is reliable enough for a linguistic analysis of data from sign languages, and compare it to an older solution (OpenFace, OF). We use an existing
-
SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras arXiv.cs.CV Pub Date : 2024-03-15 Yingqi Tang, Zhaotie Meng, Guoliang Chen, Erkang Cheng
The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird's Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors from single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized
-
ParaPoint: Learning Global Free-Boundary Surface Parameterization of 3D Point Clouds arXiv.cs.CV Pub Date : 2024-03-15 Qijian Zhang, Junhui Hou, Ying He
Surface parameterization is a fundamental geometry processing problem with rich downstream applications. Traditional approaches are designed to operate on well-behaved mesh models with high-quality triangulations that are laboriously produced by specialized 3D modelers, and thus unable to meet the processing demand for the current explosion of ordinary 3D data. In this paper, we seek to perform UV
-
SCILLA: SurfaCe Implicit Learning for Large Urban Area, a volumetric hybrid solution arXiv.cs.CV Pub Date : 2024-03-15 Hala Djeghim, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Désiré Sidibé
Neural implicit surface representation methods have recently shown impressive 3D reconstruction results. However, existing solutions struggle to reconstruct urban outdoor scenes due to their large, unbounded, and highly detailed nature. Hence, to achieve accurate reconstructions, additional supervision data such as LiDAR, strong geometric priors, and long training times are required. To tackle such
-
How Powerful Potential of Attention on Image Restoration? arXiv.cs.CV Pub Date : 2024-03-15 Cong Wang, Jinshan Pan, Yeying Jin, Liyan Wang, Wei Wang, Gang Fu, Wenqi Ren, Xiaochun Cao
Transformers have demonstrated their effectiveness in image restoration tasks. Existing Transformer architectures typically comprise two essential components: multi-head self-attention and feed-forward network (FFN). The former captures long-range pixel dependencies, while the latter enables the model to learn complex patterns and relationships in the data. Previous studies have demonstrated that FFNs
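The two components named here can be sketched as a minimal single-head Transformer block in NumPy (an illustration of the generic architecture, not any specific restoration model; all weights are random stand-ins and normalization is omitted for brevity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    """Single-head self-attention: every token attends to every other token."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def ffn(x, W1, W2):
    """Position-wise feed-forward network with ReLU."""
    return np.maximum(x @ W1, 0) @ W2

rng = np.random.default_rng(0)
tokens, d = 6, 8                         # 6 "pixel tokens", 8-dim embeddings
x = rng.normal(size=(tokens, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

out = x + attention(x, Wq, Wk, Wv)       # residual around attention
out = out + ffn(out, W1, W2)             # residual around FFN
print(out.shape)
```

The attention term supplies the long-range pixel dependencies; the FFN transforms each token independently, which is why both parts appear in essentially every restoration Transformer.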
-
NECA: Neural Customizable Human Avatar arXiv.cs.CV Pub Date : 2024-03-15 Junjin Xiao, Qing Zhang, Zhan Xu, Wei-Shi Zheng
Human avatar has become a novel type of 3D asset with various applications. Ideally, a human avatar should be fully customizable to accommodate different settings and environments. In this work, we introduce NECA, an approach capable of learning versatile human representation from monocular or sparse-view videos, enabling granular customization across aspects such as pose, shadow, shape, lighting and
-
Context-Semantic Quality Awareness Network for Fine-Grained Visual Categorization arXiv.cs.CV Pub Date : 2024-03-15 Qin Xu, Sitong Li, Jiahui Wang, Bo Jiang, Jinhui Tang
Exploring and mining subtle yet distinctive features between sub-categories with similar appearances is crucial for fine-grained visual categorization (FGVC). However, less effort has been devoted to assessing the quality of extracted visual representations. Intuitively, the network may struggle to capture discriminative features from low-quality samples, which leads to a significant decline in FGVC
-
Leveraging Neural Radiance Field in Descriptor Synthesis for Keypoints Scene Coordinate Regression arXiv.cs.CV Pub Date : 2024-03-15 Huy-Hoang Bui, Bach-Thuan Bui, Dinh-Tuan Tran, Joo-Ho Lee
Classical structure-based visual localization methods offer high accuracy but face trade-offs in terms of storage, speed, and privacy. A recent innovation, a keypoint scene coordinate regression (KSCR) method named D2S, addresses these issues by leveraging graph attention networks to enhance keypoint relationships and predict their 3D coordinates using a simple multilayer perceptron (MLP). Camera pose is then
-
Deep Learning for Multi-Level Detection and Localization of Myocardial Scars Based on Regional Strain Validated on Virtual Patients arXiv.cs.CV Pub Date : 2024-03-15 Müjde Akdeniz, Claudia Alessandra Manetti, Tijmen Koopsen, Hani Nozari Mirar, Sten Roar Snare, Svein Arne Aase, Joost Lumens, Jurica Šprem, Kristin Sarah McLeod
How well the heart is functioning can be quantified through measurements of myocardial deformation via echocardiography. Clinical assessment of cardiac function is generally focused on global indices of relative shortening, however, territorial, and segmental strain indices have shown to be abnormal in regions of myocardial disease, such as scar. In this work, we propose a single framework to predict
-
Few-Shot Image Classification and Segmentation as Visual Question Answering Using Vision-Language Models arXiv.cs.CV Pub Date : 2024-03-15 Tian Meng, Yang Tao, Ruilin Lyu, Wuliang Yin
The task of few-shot image classification and segmentation (FS-CS) involves classifying and segmenting target objects in a query image, given only a few examples of the target classes. We introduce the Vision-Instructed Segmentation and Evaluation (VISE) method that transforms the FS-CS problem into the Visual Question Answering (VQA) problem, utilising Vision-Language Models (VLMs), and addresses
-
Local positional graphs and attentive local features for a data and runtime-efficient hierarchical place recognition pipeline arXiv.cs.CV Pub Date : 2024-03-15 Fangming Yuan, Stefan Schubert, Peter Protzel, Peer Neubert
Large-scale applications of Visual Place Recognition (VPR) require computationally efficient approaches. Further, a well-balanced combination of data-based and training-free approaches can decrease the required amount of training data and effort and can reduce the influence of distribution shifts between the training and application phases. This paper proposes a runtime and data-efficient hierarchical
-
Towards Generalizable Deepfake Video Detection with Thumbnail Layout and Graph Reasoning arXiv.cs.CV Pub Date : 2024-03-15 Yuting Xu, Jian Liang, Lijun Sheng, Xiao-Yu Zhang
The deepfake threats to society and cybersecurity have provoked significant public apprehension, driving intensified efforts within the realm of deepfake video detection. Current video-level methods are mostly based on 3D CNNs, which achieve good performance but incur high computational demands. This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL)
-
Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder arXiv.cs.CV Pub Date : 2024-03-15 Jinseok Kim, Tae-Kyun Kim
Super-resolution (SR) and image generation are important tasks in computer vision and are widely adopted in real-world applications. Most existing methods, however, generate images only at fixed-scale magnification and suffer from over-smoothing and artifacts. Additionally, they do not offer enough diversity of output images nor image consistency at different scales. Most relevant work applied Implicit
-
Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning arXiv.cs.CV Pub Date : 2024-03-15 Meixuan Li, Tianyu Li, Guoqing Wang, Peng Wang, Yang Yang, Heng Tao Shen
In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our
-
FDGaussian: Fast Gaussian Splatting from Single Image via Geometric-aware Diffusion Model arXiv.cs.CV Pub Date : 2024-03-15 Qijun Feng, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang
Reconstructing detailed 3D objects from single-view images remains a challenging task due to the limited information available. In this paper, we introduce FDGaussian, a novel two-stage framework for single-image 3D reconstruction. Recent methods typically utilize pre-trained 2D diffusion models to generate plausible novel views from the input image, yet they encounter issues with either multi-view
-
A Fixed-Point Approach to Unified Prompt-Based Counting arXiv.cs.CV Pub Date : 2024-03-15 Wei Lin, Antoni B. Chan
Existing class-agnostic counting models typically rely on a single type of prompt, e.g., box annotations. This paper aims to establish a comprehensive prompt-based counting framework capable of generating density maps for concerned objects indicated by various prompt types, such as box, point, and text. To achieve this goal, we begin by converting prompts from different modalities into prompt masks
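For background, density-map counting recovers the object count as the integral (sum) of the predicted map, regardless of which prompt type produced it; a toy sketch (the map below is hand-placed point masses, not a model prediction):

```python
import numpy as np

def count_from_density(density):
    """In density-map counting, the object count is the integral (sum) of the map."""
    return density.sum()

# Hypothetical 8x8 density map: three objects encoded as unit point masses.
# A real model would predict unit-mass Gaussian blobs, which also sum to ~3.
density = np.zeros((8, 8))
density[1, 2] = density[4, 5] = density[6, 6] = 1.0
print(count_from_density(density))
```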
-
HawkEye: Training Video-Text LLMs for Grounding Text in Videos arXiv.cs.CV Pub Date : 2024-03-15 Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos. However, they perform almost the same as random on grounding text queries in long and complicated videos, having little ability to understand and reason about temporal information, which is the most fundamental difference between videos and images.
-
Exploring Optical Flow Inclusion into nnU-Net Framework for Surgical Instrument Segmentation arXiv.cs.CV Pub Date : 2024-03-15 Marcos Fernández-Rodríguez, Bruno Silva, Sandro Queirós, Helena R. Torres, Bruno Oliveira, Pedro Morais, Lukas R. Buschle, Jorge Correia-Pinto, Estevão Lima, João L. Vilaça
Surgical instrument segmentation in laparoscopy is essential for computer-assisted surgical systems. Despite the Deep Learning progress in recent years, the dynamic setting of laparoscopic surgery still presents challenges for precise segmentation. The nnU-Net framework excelled in semantic segmentation analyzing single frames without temporal information. The framework's ease of use, including its
-
BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution arXiv.cs.CV Pub Date : 2024-03-15 Feng Li, Yixuan Wu, Zichao Liang, Runmin Cong, Huihui Bai, Yao Zhao, Meng Wang
Diffusion models (DM) have achieved remarkable promise in image super-resolution (SR). However, most of them are tailored to solving non-blind inverse problems with fixed known degradation settings, limiting their adaptability to real-world applications that involve complex unknown degradations. In this work, we propose BlindDiff, a DM-based blind SR method to tackle the blind degradation settings
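The "fixed known degradation" setting contrasted here usually follows the classical model y = (x * k)↓s + n: blur with a kernel k, downsample by a scale s, add noise n. A minimal sketch with an assumed 2x2 box blur as the kernel (illustrative only, not BlindDiff's degradation modelling):

```python
import numpy as np

def degrade(x, scale=2, noise_std=0.05, rng=None):
    """Classical SR degradation y = (x * k)↓s + n, with a 2x2 box blur as k."""
    rng = rng or np.random.default_rng(0)
    # 2x2 box blur via shifted averages (output is (H-1, W-1)).
    blurred = 0.25 * (x[:-1, :-1] + x[1:, :-1] + x[:-1, 1:] + x[1:, 1:])
    low = blurred[::scale, ::scale]          # downsample by the scale factor
    return low + noise_std * rng.normal(size=low.shape)

hr = np.arange(64, dtype=float).reshape(8, 8)   # toy "high-resolution" image
lr = degrade(hr)
print(lr.shape)
```

Blind SR is precisely the case where k and n are unknown at test time, which is what a method like the one described must estimate or marginalize over.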
-
Generative Region-Language Pretraining for Open-Ended Object Detection arXiv.cs.CV Pub Date : 2024-03-15 Chuang Lin, Yi Jiang, Lizhen Qu, Zehuan Yuan, Jianfei Cai
In recent research, significant attention has been devoted to the open-vocabulary object detection task, aiming to generalize beyond the limited number of classes labeled during training and detect objects described by arbitrary category names at inference. Compared with conventional object detection, open-vocabulary object detection largely extends the object detection categories. However, it relies
-
Animate Your Motion: Turning Still Images into Dynamic Videos arXiv.cs.CV Pub Date : 2024-03-15 Mingxiao Li, Bo Wan, Marie-Francine Moens, Tinne Tuytelaars
In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs to more accurately reflect user intentions. Traditional efforts predominantly focus on employing either semantic cues, like images or depth maps, or motion-based conditions, like moving sketches or object bounding boxes. Semantic inputs offer a rich scene
-
A Hybrid SNN-ANN Network for Event-based Object Detection with Spatial and Temporal Attention arXiv.cs.CV Pub Date : 2024-03-15 Soikat Hasan Ahmed, Jan Finkbeiner, Emre Neftci
Event cameras offer high temporal resolution and dynamic range with minimal motion blur, making them promising for object detection tasks. While Spiking Neural Networks (SNNs) are a natural match for event-based sensory data and enable ultra-energy efficient and low latency inference on neuromorphic hardware, Artificial Neural Networks (ANNs) tend to display more stable training dynamics and faster
-
Computer User Interface Understanding. A New Dataset and a Learning Framework arXiv.cs.CV Pub Date : 2024-03-15 Andrés Muñoz, Daniel Borrajo
User Interface (UI) understanding has been an increasingly popular topic over the last few years. So far, there has been a vast focus solely on web and mobile applications. In this paper, we introduce the harder task of computer UI understanding. With the goal of enabling research in this field, we have generated a dataset with a set of videos where a user is performing a sequence of actions and each
-
SemanticHuman-HD: High-Resolution Semantic Disentangled 3D Human Generation arXiv.cs.CV Pub Date : 2024-03-15 Peng Zheng, Tao Liu, Zili Yi, Rui Ma
With the development of neural radiance fields and generative models, numerous methods have been proposed for learning 3D human generation from 2D images. These methods allow control over the pose of the generated 3D human and enable rendering from different viewpoints. However, none of these methods explore semantic disentanglement in human image synthesis, i.e., they cannot disentangle the generation
-
GGRt: Towards Generalizable 3D Gaussians without Pose Priors in Real-Time arXiv.cs.CV Pub Date : 2024-03-15 Hao Li, Yuanyuan Gao, Dingwen Zhang, Chenming Wu, Yalun Dai, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han
This paper presents GGRt, a novel approach to generalizable novel view synthesis that alleviates the need for real camera poses, complexity in processing high-resolution images, and lengthy optimization processes, thus facilitating stronger applicability of 3D Gaussian Splatting (3D-GS) in real-world scenarios. Specifically, we design a novel joint learning framework that consists of an Iterative Pose
-
E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance arXiv.cs.CV Pub Date : 2024-03-15 Tianrui Huang, Pu Cao, Lu Yang, Chun Liu, Mengjie Hu, Zhiwei Liu, Qing Song
Diffusion-based image editing is a composite process of preserving the source image content and generating new content or applying modifications. While current editing approaches have made improvements under text guidance, most of them have only focused on preserving the information of the input image, disregarding the importance of editability and alignment to the target prompt. In this paper, we
-
TransLandSeg: A Transfer Learning Approach for Landslide Semantic Segmentation Based on Vision Foundation Model arXiv.cs.CV Pub Date : 2024-03-15 Changhong Hou, Junchuan Yu, Daqing Ge, Liu Yang, Laidian Xi, Yunxuan Pang, Yi Wen
Landslides are one of the most destructive natural disasters in the world, posing a serious threat to human life and safety. The development of foundation models has provided a new research paradigm for large-scale landslide detection. The Segment Anything Model (SAM) has garnered widespread attention in the field of image segmentation. However, our experiment found that SAM performed poorly in the
-
Depth-induced Saliency Comparison Network for Diagnosis of Alzheimer's Disease via Jointly Analysis of Visual Stimuli and Eye Movements arXiv.cs.CV Pub Date : 2024-03-15 Yu Liu, Wenlin Zhang, Shaochu Wang, Fangyu Zuo, Peiguang Jing, Yong Ji
Early diagnosis of Alzheimer's Disease (AD) is very important for subsequent medical treatment, and eye movements under special visual stimuli may serve as a potential non-invasive biomarker for detecting cognitive abnormalities in AD patients. In this paper, we propose a Depth-induced Saliency Comparison Network (DISCN) for eye movement analysis, which may be used for diagnosing Alzheimer's Disease
-
URS-NeRF: Unordered Rolling Shutter Bundle Adjustment for Neural Radiance Fields arXiv.cs.CV Pub Date : 2024-03-15 Bo Xu, Ziao Liu, Mengqi Guo, Jiancheng Li, Gim Hee Lee
We propose a novel rolling shutter bundle adjustment method for neural radiance fields (NeRF), which utilizes unordered rolling shutter (RS) images to obtain the implicit 3D representation. Existing NeRF methods suffer from low-quality images and inaccurate initial camera poses due to the RS effect in the image, whereas the previous method that incorporates the RS into NeRF requires strict sequential
-
CSDNet: Detect Salient Object in Depth-Thermal via A Lightweight Cross Shallow and Deep Perception Network arXiv.cs.CV Pub Date : 2024-03-15 Xiaotong Yu, Ruihan Xie, Zhihe Zhao, Chang-Wen Chen
While we enjoy the richness and informativeness of multimodal data, it also introduces interference and redundancy of information. To achieve optimal domain interpretation with limited resources, we propose CSDNet, a lightweight Cross Shallow and Deep Perception Network designed to integrate two modalities with less coherence, thereby discarding redundant information
-
DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video arXiv.cs.CV Pub Date : 2024-03-15 Huiqiang Sun, Xingyi Li, Liao Shen, Xinyi Ye, Ke Xian, Zhiguo Cao
Recent advancements in dynamic neural radiance field methods have yielded remarkable outcomes. However, these approaches rely on the assumption of sharp input images. When faced with motion blur, existing dynamic NeRF methods often struggle to generate high-quality novel views. In this paper, we propose DyBluRF, a dynamic radiance field approach that synthesizes sharp novel views from a monocular video
-
KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation arXiv.cs.CV Pub Date : 2024-03-15 Ruida Zhang, Chenyangguang Zhang, Yan Di, Fabian Manhardt, Xingyu Liu, Federico Tombari, Xiangyang Ji
In this paper, we present KP-RED, a unified KeyPoint-driven REtrieval and Deformation framework that takes object scans as input and jointly retrieves and deforms the most geometrically similar CAD models from a pre-processed database to tightly match the target. Unlike existing dense matching based methods that typically struggle with noisy partial scans, we propose to leverage category-consistent
-
DiffMAC: Diffusion Manifold Hallucination Correction for High Generalization Blind Face Restoration arXiv.cs.CV Pub Date : 2024-03-15 Nan Gao, Jia Li, Huaibo Huang, Zhi Zeng, Ke Shang, Shuwu Zhang, Ran He
Blind face restoration (BFR) is a highly challenging problem due to the uncertainty of degradation patterns. Current methods have low generalization across photorealistic and heterogeneous domains. In this paper, we propose a Diffusion-Information-Diffusion (DID) framework to tackle diffusion manifold hallucination correction (DiffMAC), which achieves high-generalization face restoration in diverse
-
Monkeypox disease recognition model based on improved SE-InceptionV3 arXiv.cs.CV Pub Date : 2024-03-15 Junzhuo Chen, Zonghan Lu, Shitong Kang
In the wake of the global spread of monkeypox, accurate disease recognition has become crucial. This study introduces an improved SE-InceptionV3 model, embedding the SENet module and incorporating L2 regularization into the InceptionV3 framework to enhance monkeypox disease detection. Utilizing the Kaggle monkeypox dataset, which includes images of monkeypox and similar skin conditions, our model demonstrates
-
VRHCF: Cross-Source Point Cloud Registration via Voxel Representation and Hierarchical Correspondence Filtering arXiv.cs.CV Pub Date : 2024-03-15 Guiyu Zhao, Zewen Du, Zhentao Guo, Hongbin Ma
Addressing the challenges posed by the substantial gap in point cloud data collected from diverse sensors, achieving robust cross-source point cloud registration becomes a formidable task. In response, we present a novel framework for point cloud registration with broad applicability, suitable for both homologous and cross-source registration scenarios. To tackle the issues arising from different densities
-
CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner arXiv.cs.CV Pub Date : 2024-03-15 Tingbing Yan, Wenzheng Zeng, Yang Xiao, Xingyu Tong, Bo Tan, Zhiwen Fang, Zhiguo Cao, Joey Tianyi Zhou
Most existing one-shot skeleton-based action recognition focuses on raw low-level information (e.g., joint location), and may suffer from local information loss and low generalization ability. To alleviate these, we propose to leverage text description generated from large language models (LLM) that contain high-level human knowledge, to guide feature learning, in a global-local-global way. Particularly
-
Learning Physical Dynamics for Object-centric Visual Prediction arXiv.cs.CV Pub Date : 2024-03-15 Huilin Xu, Tao Chen, Feng Xu
The ability to model the underlying dynamics of visual scenes and reason about the future is central to human intelligence. Many attempts have been made to empower intelligent systems with such physical understanding and prediction abilities. However, most existing methods focus on pixel-to-pixel prediction, which suffers from heavy computational costs while lacking a deep understanding of the physical
-
Benchmarking Adversarial Robustness of Image Shadow Removal with Shadow-adaptive Attacks arXiv.cs.CV Pub Date : 2024-03-15 Chong Wang, Yi Yu, Lanqing Guo, Bihan Wen
Shadow removal is a task aimed at erasing regional shadows present in images and reinstating visually pleasing natural scenes with consistent illumination. While recent deep learning techniques have demonstrated impressive performance in image shadow removal, their robustness against adversarial attacks remains largely unexplored. Furthermore, many existing attack frameworks typically allocate a uniform
-
Revisiting Adversarial Training under Long-Tailed Distributions arXiv.cs.CV Pub Date : 2024-03-15 Xinli Yue, Ningping Mou, Qian Wang, Lingchen Zhao
Deep neural networks are vulnerable to adversarial attacks, often leading to erroneous outputs. Adversarial training has been recognized as one of the most effective methods to counter such attacks. However, existing adversarial training techniques have predominantly been tested on balanced datasets, whereas real-world data often exhibit a long-tailed distribution, casting doubt on the efficacy of
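As a reminder of the attack side, the single-step FGSM perturbation commonly used inside adversarial training can be sketched as follows (a toy linear model with hinge loss; the weights and inputs are hypothetical, and this is not the paper's long-tailed setup):

```python
import numpy as np

def fgsm(x, grad, eps=0.1):
    """Fast Gradient Sign Method: step along the sign of the input-loss gradient."""
    return x + eps * np.sign(grad)

w = np.array([1.0, -2.0, 0.5])   # hypothetical linear classifier weights
x = np.array([0.2, 0.4, -0.1])   # hypothetical input
y = 1.0                          # true label in {-1, +1}

# Hinge loss L = max(0, 1 - y * w.x); its gradient w.r.t. x is -y*w when active.
margin = y * (w @ x)
grad = -y * w if margin < 1 else np.zeros_like(w)
x_adv = fgsm(x, grad)
print(y * (w @ x_adv) < margin)  # the attack lowers the margin
```

Adversarial training replaces clean inputs with such perturbed ones during optimization; the paper's question is how well that recipe holds up when the class frequencies are long-tailed rather than balanced.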
-
Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling arXiv.cs.CV Pub Date : 2024-03-15 Baoquan Zhang, Huaibin Wang, Luo Chuyao, Xutao Li, Liang Guotao, Yunming Ye, Xiaochen Qi, Yao He
Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in image synthesis, which aims to represent an image with a discrete token sequence. Existing studies effectively address this problem by learning a discrete codebook from scratch and in a code-independent manner to quantize continuous representations into discrete tokens. However, learning a codebook from scratch and in a code-independent
-
Boundary Matters: A Bi-Level Active Finetuning Framework arXiv.cs.CV Pub Date : 2024-03-15 Han Lu, Yichen Xie, Xiaokang Yang, Junchi Yan
The pretraining-finetuning paradigm has gained widespread adoption in vision tasks and other fields, yet it faces the significant challenge of high sample annotation costs. To mitigate this, the concept of active finetuning has emerged, aiming to select the most appropriate samples for model finetuning within a limited budget. Traditional active learning methods often struggle in this setting due to
-
RID-TWIN: An end-to-end pipeline for automatic face de-identification in videos arXiv.cs.CV Pub Date : 2024-03-15 Anirban Mukherjee, Monjoy Narayan Choudhury, Dinesh Babu Jayagopi
Face de-identification in videos is a challenging task in the domain of computer vision, primarily used in privacy-preserving applications. Despite the considerable progress achieved through generative vision models, there remain multiple challenges in the latest approaches. They lack a comprehensive discussion and evaluation of aspects such as realism, temporal coherence, and preservation of non-identifiable
-
Group-Mix SAM: Lightweight Solution for Industrial Assembly Line Applications arXiv.cs.CV Pub Date : 2024-03-15 Wu Liang, X. -G. Ma
Since the advent of the Segment Anything Model (SAM) approximately one year ago, it has engendered significant academic interest and has spawned a large number of investigations and publications from various perspectives. However, the deployment of SAM in practical assembly line scenarios has yet to materialize due to its large image encoder, which weighs in at an imposing 632M. In this study, we have
-
T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory arXiv.cs.CV Pub Date : 2024-03-15 Daehee Park, Jaeseok Jeong, Sung-Hoon Yoon, Jaewoo Jeong, Kuk-Jin Yoon
Trajectory prediction is a challenging problem that requires considering interactions among multiple actors and the surrounding environment. While data-driven approaches have been used to address this complex problem, they suffer from unreliable predictions under distribution shifts during test time. Accordingly, several online learning methods have been proposed using regression loss from the ground
-
Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing arXiv.cs.CV Pub Date : 2024-03-15 Tian-Xing Xu, Wenbo Hu, Yu-Kun Lai, Ying Shan, Song-Hai Zhang
3D Gaussian splatting, emerging as a groundbreaking approach, has drawn increasing attention for its capabilities of high-fidelity reconstruction and real-time rendering. However, it couples the appearance and geometry of the scene within the Gaussian attributes, which hinders the flexibility of editing operations, such as texture swapping. To address this issue, we propose a novel approach, namely
-
TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model arXiv.cs.CV Pub Date : 2024-03-15 Jiahao Lyu, Jin Wei, Gangyan Zeng, Zeng Li, Enze Xie, Wei Wang, Yu Zhou
Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise
-
SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model arXiv.cs.CV Pub Date : 2024-03-15 Tao Wu, Xuewei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li
Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce a novel framework of SphereDiffusion to address these unique challenges, for better generating high-quality
-
Rethinking Low-quality Optical Flow in Unsupervised Surgical Instrument Segmentation arXiv.cs.CV Pub Date : 2024-03-15 Peiran Wu, Yang Liu, Jiayu Huo, Gongyu Zhang, Christos Bergeles, Rachel Sparks, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin
Video-based surgical instrument segmentation plays an important role in robot-assisted surgeries. Unlike supervised settings, unsupervised segmentation relies heavily on motion cues, which are challenging to discern due to the typically lower quality of optical flow in surgical footage compared to natural scenes. This presents a considerable burden for the advancement of unsupervised segmentation techniques