• arXiv.cs.LG Pub Date : 2020-11-30
Willem van Jaarsveld

Recent literature established that neural networks can represent good MDP policies across a range of stochastic dynamic models in supply chain and logistics. To overcome limitations of the model-free algorithms typically employed to learn/find such neural network policies, a model-based algorithm is proposed that incorporates variance reduction techniques. For the classical lost sales inventory model

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Xuefeng Du; Pengtao Xie

Learning through tests is a broadly used methodology in human learning and shows great effectiveness in improving learning outcomes: a sequence of tests is made with increasing levels of difficulty; the learner takes these tests to identify his/her weak points in learning and continuously addresses these weak points to successfully pass these tests. We are interested in investigating whether this powerful

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Anirudh Goyal; Yoshua Bengio

A fascinating hypothesis is that human and animal intelligence could be explained by a few principles (rather than an encyclopedic list of heuristics). If that hypothesis was correct, we could more easily both understand our own intelligence and build intelligent machines. Just like in physics, the principles themselves would not be sufficient to predict the behavior of complex systems like brains

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Rémy Brossard; Oriel Frigo; David Dehaene

Despite quick progress in the last few years, recent studies have shown that modern graph neural networks can still fail at very simple tasks, like detecting small cycles. This hints at the fact that current networks fail to catch information about the local structure, which is problematic if the downstream task heavily relies on graph substructure analysis, as in the context of chemistry. We propose

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Jakub M. Tomczak

In this paper, we present a new class of invertible transformations. We indicate that many well-known invertible transformations in reversible logic and reversible neural networks could be derived from our proposition. Next, we propose two new coupling layers that are important building blocks of flow-based generative models. In the preliminary experiments on toy digit data, we present how these new

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Runzhong Wang; Tianqi Zhang; Tianshu Yu; Junchi Yan; Xiaokang Yang

Graph Edit Distance (GED) is a popular similarity measurement for pairwise graphs, and it also refers to the recovery of the edit path from the source graph to the target graph. The traditional A* algorithm suffers from scalability issues due to its exhaustive nature, and its search heuristics rely heavily on human prior knowledge. This paper presents a hybrid approach by combining the interpretability of traditional

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Brady Neal; Chin-Wei Huang; Sunand Raghupathi

There are many different causal effect estimators in causal inference. However, it is unclear how to choose between these estimators because there is no ground-truth for causal effects. A commonly used option is to simulate synthetic data, where the ground-truth is known. However, the best causal estimators on synthetic data are unlikely to be the best causal estimators on realistic data. An ideal

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Hrithwik Shalu; Harikrishnan P; Hari Sankar CN; Akash Das; Saptarshi Majumder; Arnhav Datar; Subin Mathew MS; Anugyan Das; Juned Kadiwala

Preliminary detection of mild depression could immensely help in effective treatment of the common mental health disorder. Due to the lack of proper awareness and the ample mix of stigmas and misconceptions present within the society, mental health status estimation has become a truly difficult task. Due to the immense variations in character level traits from person to person, traditional deep learning

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-26
Jiawei Zhu; Xin Han; Hanhan Deng; Chao Tao; Ling Zhao; Lin Tao; Haifeng Li

When considering the spatial and temporal features of traffic, capturing the impacts of various external factors on travel is an important step towards achieving accurate traffic forecasting. The impacts of external factors on the traffic flow have complex correlations. However, existing studies seldom consider external factors or neglect the effect of the complex correlations among external factors

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-25
Kamil Deja; Paweł Wawrzyński; Daniel Marczak; Wojciech Masarczyk; Tomasz Trzciński

We introduce a binary latent space autoencoder architecture to rehearse training samples for the continual learning of neural networks. The ability to extend the knowledge of a model with new data without forgetting previously learned samples is a fundamental requirement in continual learning. Existing solutions address it by either replaying past data from memory, which is unsustainable with growing

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-25
Yongquan Yang; Yiming Yang; Jie Chen; Jiayi Zheng; Zhongxi Zheng

Learning from noisy labels is an important concern because of the lack of accurate ground-truth labels in plenty of real-world scenarios. In practice, various approaches for this concern first make corrections corresponding to potentially noisy-labeled instances, and then update the predictive model with information from the corrections made. However, in specific areas, such as medical histopathology whole

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-26
Minji Yoon; Théophile Gervet; Bryan Hooi; Christos Faloutsos

Graph data is ubiquitous in academia and industry, from social networks to bioinformatics. The pervasiveness of graphs today has raised the demand for algorithms that can answer various questions: Which products would a user like to purchase given her order list? Which users are buying fake followers to increase their public reputation? Myriads of new graph mining algorithms are proposed every year

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-26

One of the more challenging real-world problems in computational intelligence is to learn from non-stationary streaming data, also known as concept drift. Perhaps even a more challenging version of this scenario is when -- following a small set of initial labeled data -- the data stream consists of unlabeled data only. Such a scenario is typically referred to as learning in initially labeled nonstationary

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Luis Felipe M. O. Henriques; Eduardo Morgan; Sergio Colcher; Ruy Luiz Milidiú

Non-Intrusive Load Monitoring (NILM) is a computational technique to estimate appliance-by-appliance power loads from the aggregate consumption measured by a single meter. In this paper, we propose a conditional density estimation model, based on deep neural networks, that joins a Conditional Variational Autoencoder with a Conditional Invertible Normalizing Flow model to estimate the individual appliance's

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-21
Ian Covert; Scott Lundberg; Su-In Lee

Researchers have proposed a wide variety of model explanation approaches, but it remains unclear how most methods are related or when one method is preferable to another. We establish a new class of methods, removal-based explanations, that are based on the principle of simulating feature removal to quantify each feature's influence. These methods vary in several respects, so we develop a framework
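The feature-removal principle described above can be illustrated with a minimal sketch (not the paper's framework): score each feature by how much the prediction changes when that feature is replaced with a baseline value. The `removal_importance` helper, the baseline choice, and the linear toy model are all illustrative assumptions.

```python
import numpy as np

def removal_importance(model, x, baseline):
    """Score each feature by the prediction change when it is
    replaced with a baseline value -- a simple stand-in for the
    feature-removal principle."""
    full = model(x)
    scores = np.empty(len(x))
    for i in range(len(x)):
        x_masked = x.copy()
        x_masked[i] = baseline[i]       # "remove" feature i
        scores[i] = full - model(x_masked)
    return scores

# Toy linear model: only the first two features matter.
weights = np.array([2.0, -1.0, 0.0])
model = lambda v: float(weights @ v)

x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
scores = removal_importance(model, x, baseline)
```

On this toy model the scores recover the weights exactly; methods in the removal-based family differ mainly in how "removal" is defined and how the per-feature effects are aggregated.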

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Derek Lim; René Vidal; Benjamin D. Haeffele

Many state-of-the-art subspace clustering methods follow a two-step process by first constructing an affinity matrix between data points and then applying spectral clustering to this affinity. Most of the research into these methods focuses on the first step of generating the affinity matrix, which often exploits the self-expressive property of linear subspaces, with little consideration typically

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-20
Johan S. Obando-Ceron; Pablo Samuel Castro

Since the introduction of DQN, the vast majority of reinforcement learning research has focused on reinforcement learning with deep neural networks as function approximators. New methods are typically evaluated on a set of environments that have now become standard, such as Atari 2600 games. While these benchmarks help standardize evaluation, their computational cost has the unfortunate side effect of

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-24
Wenyu Zhao; Teli Ma; Xuan Gong; Baochang Zhang; David Doermann

Edge computing is promising to become one of the next hottest topics in artificial intelligence because it benefits various evolving domains such as real-time unmanned aerial systems, industrial applications, and the demand for privacy protection. This paper reviews recent advances in binary neural network (BNN) and 1-bit CNN technologies that are well suited to front-end, edge-based computing.

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-23
Nicolas Brodu; James P. Crutchfield

We merge computational mechanics' definition of causal states (predictively-equivalent histories) with reproducing-kernel Hilbert space (RKHS) representation inference. The result is a widely-applicable method that infers causal structure directly from observations of a system's behaviors whether they are over discrete or continuous events or time. A structural representation -- a finite- or infinite-state

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-25
Chandra Thapa; M. A. P. Chamikara; Seyit A. Camtepe

In the distributed collaborative machine learning (DCML) paradigm, federated learning (FL) recently attracted much attention due to its applications in health, finance, and the latest innovations such as Industry 4.0 and smart vehicles. FL provides privacy-by-design. It trains a machine learning model collaboratively over several distributed clients (ranging from two to millions) such as mobile phones

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-25
Di Liu; Hao Kong; Xiangzhong Luo; Weichen Liu; Ravi Subramaniam

Edge computing and artificial intelligence (AI), especially deep learning, are nowadays gradually intersecting to build a novel system called edge intelligence. However, the development of edge intelligence systems encounters some challenges, and one of these challenges is the computational gap between computation-intensive deep learning algorithms and less-capable edge systems. Due to

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Jean-Baptiste Truong; Pratyush Maini; Robert Walls; Nicolas Papernot

Current model extraction attacks assume that the adversary has access to a surrogate dataset with characteristics similar to the proprietary data used to train the victim model. This requirement precludes the use of existing model extraction techniques on valuable models, such as those trained on rare or hard-to-acquire datasets. In contrast, we propose data-free model extraction methods that do not

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Peter Bellmann; Heinke Hihn; Daniel A. Braun; Friedhelm Schwenker

In many real-world pattern recognition scenarios, such as in medical applications, the corresponding classification tasks can be of an imbalanced nature. In the current study, we focus on binary, imbalanced classification tasks, i.e. binary classification tasks in which one of the two classes is under-represented (minority class) in comparison to the other class (majority class). In the literature

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Verónica Álvarez; Santiago Mazuelas; José A. Lozano

Load forecasting is crucial for multiple energy management tasks such as scheduling generation capacity, planning supply and demand, and minimizing energy trade costs. Such relevance has increased even more in recent years due to the integration of renewable energies, electric cars, and microgrids. Conventional load forecasting techniques obtain single-value load forecasts by exploiting consumption

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Akshay L Chandra; Sai Vikas Desai; Chaitanya Devaguptapu; Vineeth N Balasubramanian

Active Learning (AL) techniques aim to minimize the training data required to train a model for a given task. Pool-based AL techniques start with a small initial labeled pool and then iteratively pick batches of the most informative samples for labeling. Generally, the initial pool is sampled randomly and labeled to seed the AL iterations. While recent studies have focused on evaluating the robustness
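The batch-selection step of the pool-based loop described above can be sketched with the most common informativeness criterion, uncertainty sampling: pick the pool points whose predicted class probability is closest to 0.5. The `select_batch` helper and the probability values are illustrative assumptions, not the paper's method.

```python
import numpy as np

def select_batch(pool_probs, k):
    """Uncertainty sampling: choose the k pool points whose predicted
    positive-class probability is closest to 0.5 (smallest margin).
    In a full AL loop this runs once per iteration, after retraining
    the model on the labeled pool seeded by random sampling."""
    margin = np.abs(pool_probs - 0.5)
    return np.argsort(margin)[:k]

# Predicted probabilities for five unlabeled pool points.
probs = np.array([0.95, 0.51, 0.10, 0.48, 0.70])
batch = select_batch(probs, 2)   # indices of the 2 most uncertain points
```

Here the two points with probabilities 0.51 and 0.48 are selected; the open question the abstract gestures at is how sensitive such loops are to the randomly sampled initial pool.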

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Het Shah; Avishree Khare; Neelay Shah; Khizir Siddiqui

In recent years, the growing size of neural networks has led to a vast amount of research concerning compression techniques to mitigate the drawbacks of such large sizes. Most of these research works can be categorized into three broad families: Knowledge Distillation, Pruning, and Quantization. While there has been steady research in this domain, adoption and commercial usage of the proposed techniques

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Guido Montúfar; Nina Otter; Yuguang Wang

Topological data analysis uses tools from topology -- the mathematical area that studies shapes -- to create representations of data. In particular, in persistent homology, one studies one-parameter families of spaces associated with data, and persistence diagrams describe the lifetime of topological invariants, such as connected components or holes, across the one-parameter family. In many applications
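For the 0-dimensional invariants mentioned above (connected components), persistence over a one-parameter family can be sketched with a union-find over the sorted edges of a point cloud's Vietoris-Rips filtration: every component is born at scale 0 and dies when a growing edge merges it into another. The `h0_persistence` helper is an illustrative sketch, not code from the paper.

```python
import numpy as np
from itertools import combinations

def h0_persistence(points):
    """Death times of 0-dimensional features (connected components) in
    the Vietoris-Rips filtration of a point cloud: process pairwise
    edges in increasing length order with a union-find; each merge
    kills one component at that scale. Returns n-1 death times; one
    component lives forever (refinements like the elder rule are
    omitted for brevity)."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    edges = sorted(
        (float(np.linalg.norm(points[i] - points[j])), i, j)
        for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)    # one component dies at this scale
    return deaths

# Three collinear points: two nearby, one far away.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
deaths = h0_persistence(pts)
```

The two finite bars (0, 1.0) and (0, 9.0) would be the points of the resulting persistence diagram; the long second bar reflects the well-separated cluster.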

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Simone Angarano; Vittorio Mazzia; Francesco Salvetti; Giovanni Fantin; Marcello Chiaberge

Ultra-wideband (UWB) is the state-of-the-art and most popular technology for wireless localization. Nevertheless, precise ranging and localization in non-line-of-sight (NLoS) conditions is still an open research topic. Indeed, multipath effects, reflections, refractions and complexity of the indoor radio environment can easily introduce a positive bias in the ranging measurement, resulting in highly

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Xu Chen; Yuanxing Zhang; Lun Du; Zheng Fang; Yi Ren; Kaigui Bian; Kunqing Xie

Traffic flow forecasting is of great significance for improving the efficiency of transportation systems and preventing emergencies. Due to the high non-linearity and intricate evolutionary patterns of short-term and long-term traffic flow, existing methods often fail to take full advantage of spatial-temporal information, especially the various temporal patterns with different period shifting and

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
N. Mazyavkina; S. Moustafa; I. Trofimov; E. Burnaev

Reinforcement learning (RL) has enjoyed significant progress in recent years. One of the most important steps forward was the wide application of neural networks. However, the architectures of these neural networks are typically constructed manually. In this work, we study recently proposed neural architecture search (NAS) methods for optimizing the architecture of RL agents. We carry out experiments on

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Maciej Zięba; Marcin Przewięźlikowski; Marek Śmieja; Jacek Tabor; Tomasz Trzcinski; Przemysław Spurek

Predicting future states or actions of a given system remains a fundamental, yet unsolved challenge of intelligence, especially in the scope of complex and non-deterministic scenarios, such as modeling behavior of humans. Existing approaches provide results under strong assumptions concerning unimodality of future states, or, at best, assuming specific probability distributions that often poorly fit

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Ziyang Wu; Christina Baek; Chong You; Yi Ma

Current deep learning architectures suffer from catastrophic forgetting, a failure to retain knowledge of previously learned classes when incrementally trained on new classes. The fundamental roadblock faced by deep learning methods is that deep learning models are optimized as "black boxes," making it difficult to properly adjust the model parameters to preserve knowledge about previously seen data

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Badih Ghazi; Ravi Kumar; Pasin Manurangsi; Thao Nguyen

In this work, we study the trade-off between differential privacy and adversarial robustness under L2-perturbations in the context of learning halfspaces. We prove nearly tight bounds on the sample complexity of robust private learning of halfspaces for a large regime of parameters. A highlight of our results is that robust and private learning is harder than robust or private learning alone. We complement

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Stone Yun; Alexander Wong

With the proliferation of deep convolutional neural network (CNN) algorithms for mobile processing, limited precision quantization has become an essential tool for CNN efficiency. Consequently, various works have sought to design fixed precision quantization algorithms and quantization-focused optimization techniques that minimize quantization induced performance degradation. However, there is little

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30

We use gradient sparsification to reduce the adverse effect of differential privacy noise on performance of private machine learning models. To this aim, we employ compressed sensing and additive Laplace noise to evaluate differentially-private gradients. Noisy privacy-preserving gradients are used to perform stochastic gradient descent for training machine learning models. Sparsification, achieved
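The paper's compressed-sensing construction is not reproduced here, but the general recipe the abstract describes — sparsify the gradient, bound its sensitivity, add Laplace noise, then take an SGD step — might look like the following minimal sketch. The function name and the noise scale are illustrative; a real mechanism must calibrate the scale to its true L1 sensitivity.

```python
import numpy as np

rng = np.random.default_rng(1)

def private_sparse_gradient(grad, k, clip, epsilon):
    """Illustrative sketch: keep the k largest-magnitude coordinates,
    clip each to [-clip, clip] to bound sensitivity, then perturb the
    kept coordinates with Laplace noise. Noise scale clip/epsilon is
    for illustration only -- not a calibrated privacy accounting."""
    g = np.clip(grad, -clip, clip)
    sparse = np.zeros_like(g)
    top = np.argsort(np.abs(g))[-k:]            # indices of top-k entries
    sparse[top] = g[top]
    sparse[top] += rng.laplace(scale=clip / epsilon, size=k)
    return sparse

grad = np.array([0.9, -0.1, 0.05, -2.0, 0.3])
g_priv = private_sparse_gradient(grad, k=2, clip=1.0, epsilon=1.0)
# g_priv would then drive a plain SGD update: w -= lr * g_priv
```

Sparsification helps because Laplace noise is added only to the k retained coordinates rather than to the full gradient vector.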

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Jeong-Hoe Ku; JiHun Oh; YoungYoon Lee; Gaurav Pooniwala; SangJeong Lee

This paper aims to provide a selective survey about the knowledge distillation (KD) framework for researchers and practitioners to take advantage of it for developing new optimized models in the deep neural network field. To this end, we give a brief overview of knowledge distillation and some related works including learning using privileged information (LUPI) and generalized distillation (GD). Even though

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Greg Yang; Edward J. Hu

As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30

In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion, which minimizes the probability of a catastrophic failure. Unfortunately, such policies are typically overly conservative as the percentile criterion is non-convex, difficult to optimize, and ignores the mean performance. To overcome these

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Benoit Steiner; Chris Cummins; Horace He; Hugh Leather

As machine learning techniques become ubiquitous, the efficiency of neural network implementations is becoming correspondingly paramount. Frameworks, such as Halide and TVM, separate out the algorithmic representation of the network from the schedule that determines its implementation. Finding good schedules, however, remains extremely challenging. We model this scheduling problem as a sequence of

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-30
Gabriel S. Gusmão; Adhika P. Retnanto; Shashwati C. da Cunha; Andrew J. Medford

Chemical kinetics constitutes the phenomenological framework for the disentanglement of reaction mechanisms, the optimization of reaction performance, and the rational design of chemical processes. Here, we utilize feed-forward artificial neural networks as basis functions for the construction of surrogate models to solve ordinary differential equations (ODEs) that describe microkinetic models (MKMs). We

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Sam Greydanus

Though deep learning models have taken on commercial and political relevance, many aspects of their training and operation remain poorly understood. This has sparked interest in "science of deep learning" projects, many of which are run at scale and require enormous amounts of time, money, and electricity. But how much of this research really needs to occur at scale? In this paper, we introduce MNIST-1D:

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
George Cazenavette; Calvin Murdock; Simon Lucey

Despite their unmatched performance, deep neural networks remain susceptible to targeted attacks by nearly imperceptible levels of adversarial noise. While the underlying cause of this sensitivity is not well understood, theoretical analyses can be simplified by reframing each layer of a feed-forward network as an approximate solution to a sparse coding problem. Iterative solutions using basis pursuit

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Weijun Luo

Neural networks form the foundation of deep learning and numerous AI applications. Classical neural networks are fully connected, expensive to train, and prone to overfitting. Sparse networks tend to have convoluted structure search, suboptimal performance, and limited usage. We propose a novel uniform sparse network (USN) with even and sparse connectivity within each layer. USN has one striking property

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Andrii Zadaianchuk; Maximilian Seitzer; Georg Martius

Autonomous agents need large repertoires of skills to act reasonably on new tasks that they have not seen before. However, acquiring these skills using only a stream of high-dimensional, unstructured, and unlabeled observations is a tricky challenge for any autonomous agent. Previous methods have used variational autoencoders to encode a scene into a low-dimensional vector that can be used as a goal

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Louis Monier; Jakub Kmec; Alexandre Laterre; Thomas Pierrot; Valentin Courgeau; Olivier Sigaud; Karim Beguir

Offline Reinforcement Learning (RL) aims to turn large datasets into powerful decision-making engines without any online interactions with the environment. This great promise has motivated a large amount of research that hopes to replicate the success RL has experienced in simulation settings. This work aims to reflect on these efforts from a practitioner's viewpoint. We start by discussing the

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Hadia Mohmmed Osman Ahmed Samil; Annabelle Martin; Arnav Kumar Jain; Susan Amin; Samira Ebrahimi Kahou

Locust infestation of some regions of the world, including Africa, Asia, and the Middle East, has become a concerning issue that can affect the health and lives of millions of people. In this respect, there have been attempts to resolve or reduce the severity of this problem via detection and monitoring of locust breeding areas using satellites and sensors, or the use of chemicals to prevent the formation

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Sarah; S. Sidhartha Narayan; Irfaan Arif; Hrithwik Shalu; Juned Kadiwala

We suggest a low-cost, non-invasive healthcare system that measures haemoglobin levels in patients and can be used as a preliminary diagnostic test for anaemia. A combination of image processing, machine learning, and deep learning techniques is employed to develop predictive models to measure haemoglobin levels. This is achieved through the color analysis of the fingernail beds, palpebral conjunctiva

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Jiazhu Dai; Weifeng Zhu; Xiangfeng Luo

Graph-structured data exist in numerous applications in real life. As a state-of-the-art graph neural network, the graph convolutional network (GCN) plays an important role in processing graph-structured data. However, a recent study reported that GCNs are also vulnerable to adversarial attacks, which means that GCN models may suffer malicious attacks with unnoticeable modifications of the data. Among

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Jinlin Lai; Lixin Zou; Jiaxing Song

Off-policy evaluation is a key component of reinforcement learning which evaluates a target policy with offline data collected from behavior policies. It is a crucial step towards safe reinforcement learning and has been used in advertisement, recommender systems and many other applications. In these applications, sometimes the offline data is collected from multiple behavior policies. Previous works

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Arindam Bhattacharya; Sumanth Varambally; Amitabha Bagchi; Srikanta Bedathur

We present Fast Random projection-based One-Class Classification (FROCC), an extremely efficient method for one-class classification. Our method is based on a simple idea of transforming the training data by projecting it onto a set of random unit vectors that are chosen uniformly and independently from the unit sphere, and bounding the regions based on separation of the data. FROCC can be naturally
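The recipe in the abstract — random unit directions, training-data projections, interval bounds — can be sketched directly. The class below is an illustrative reading of that description, not the authors' implementation; refinements such as splitting an interval into sub-intervals are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

class FROCC:
    """Sketch of random-projection one-class classification: project
    training data onto m random unit directions and accept a test
    point only if every one of its projections falls inside the
    interval of training projections along that direction."""

    def __init__(self, m, dim):
        v = rng.normal(size=(m, dim))
        self.dirs = v / np.linalg.norm(v, axis=1, keepdims=True)

    def fit(self, X):
        P = X @ self.dirs.T                       # (n, m) projections
        self.lo, self.hi = P.min(axis=0), P.max(axis=0)
        return self

    def predict(self, X):
        P = X @ self.dirs.T
        inside = (P >= self.lo) & (P <= self.hi)
        return inside.all(axis=1).astype(int)     # 1 = inlier

X_train = rng.normal(size=(500, 2))               # one "normal" class
clf = FROCC(m=20, dim=2).fit(X_train)

inlier = clf.predict(np.array([[0.0, 0.0]]))
outlier = clf.predict(np.array([[100.0, 100.0]]))
```

Both fitting and prediction are a single matrix multiply plus comparisons, which is what makes the approach fast.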

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29

Active learning shows promise to decrease test-bench time for model-based drivability calibration. This paper presents a new strategy for active output selection that suits the needs of calibration tasks. The strategy actively learns multiple outputs in the same input space, choosing the output model with the highest cross-validation error as the leading one. The presented method is applied to three

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Qiwen Cui; Lin F. Yang

The empirical success of multi-agent reinforcement learning is encouraging, while few theoretical guarantees have been revealed. In this work, we prove that the plug-in solver approach, probably the most natural reinforcement learning algorithm, achieves minimax sample complexity for turn-based stochastic games (TBSG). Specifically, we plan in an empirical TBSG by utilizing a 'simulator' that allows

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Hongseok Namkoong; Samuel Daulton; Eytan Bakshy

Thompson sampling (TS) has emerged as a robust technique for contextual bandit problems. However, TS requires posterior inference and optimization for action generation, prohibiting its use in many internet applications where latency and ease of deployment are of concern. We propose a novel imitation-learning-based algorithm that distills a TS policy into an explicit policy representation by performing
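For context, plain Thompson sampling for a Bernoulli bandit — the kind of posterior-sampling policy the paper proposes to distill into an explicit one — keeps a Beta posterior per arm, draws one posterior sample per step, and acts greedily on the sample. The two-armed setup below is an illustrative toy, not the paper's contextual setting.

```python
import numpy as np

rng = np.random.default_rng(0)

true_rates = [0.3, 0.7]                  # two-armed Bernoulli bandit
alpha = np.ones(2)                       # Beta(1, 1) priors per arm
beta = np.ones(2)

pulls = np.zeros(2, dtype=int)
for _ in range(2000):
    theta = rng.beta(alpha, beta)        # one posterior sample per arm
    arm = int(np.argmax(theta))          # act greedily on the sample
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward                 # conjugate posterior update
    beta[arm] += 1 - reward
    pulls[arm] += 1
```

The per-step posterior sampling and argmax are exactly the latency cost the abstract points to: distilling the resulting policy replaces this loop with a single forward pass at serving time.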

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29

We study generalization under label shift in domain adaptation where the learner has access to labeled samples from the source domain but unlabeled samples from the target domain. Prior works deploy label classifiers and introduce various methods to estimate the importance weights from source to target domains. They use these estimates in importance weighted empirical risk minimization to learn classifiers
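In its simplest form, the importance-weighting step described above reduces to per-class weights w(y) = p_T(y)/p_S(y) applied inside the source empirical risk; under label shift the class priors change while the class-conditional feature distributions do not. The helper below is a minimal sketch of that reweighting, assuming the target priors have already been estimated.

```python
import numpy as np

def label_shift_weights(p_source, p_target):
    """Per-class importance weights w(y) = p_T(y) / p_S(y) used to
    reweight the source empirical risk under label shift."""
    return np.asarray(p_target) / np.asarray(p_source)

# Source: balanced classes; estimated target: 80/20 split.
w = label_shift_weights([0.5, 0.5], [0.8, 0.2])

# Importance-weighted empirical risk over labeled source examples.
labels = np.array([0, 0, 1, 1])
losses = np.array([1.0, 0.0, 1.0, 1.0])
weighted_risk = np.mean(w[labels] * losses)
```

How well classifiers trained this way generalize depends on the quality of the estimated weights, which is the question the abstract takes up.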

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-29
Elana Kozak; Scott Hottovy

Monte Carlo Tree Search (MCTS) is a branch of stochastic modeling that utilizes decision trees for optimization, mostly applied to artificial intelligence (AI) game players. This project imagines a game in which an AI player searches for a stationary target within a 2-D lattice. We analyze its behavior with different target distributions and compare its efficiency to the Levy Flight Search, a model
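The Lévy Flight Search used as the comparison model can be sketched as a 2-D walk with heavy-tailed step lengths and uniformly random directions. The Pareto tail exponent below is an illustrative choice, not a value from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def levy_flight(n_steps, mu=2.0):
    """2-D Levy flight: heavy-tailed (Pareto) step lengths with
    uniformly random directions. mu is the tail exponent of the
    step-length distribution; smaller mu gives rarer, longer jumps."""
    lengths = rng.pareto(mu - 1, size=n_steps) + 1.0
    angles = rng.uniform(0, 2 * np.pi, size=n_steps)
    steps = np.stack([lengths * np.cos(angles),
                      lengths * np.sin(angles)], axis=1)
    return steps.cumsum(axis=0)          # walker positions over time

path = levy_flight(1000)
displacement = np.linalg.norm(path[-1])  # net distance from the origin
```

Comparing an MCTS searcher against such a walk comes down to measuring which one reaches the target region in fewer steps under a given target distribution.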

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-28
Hongbin Pei; Bingzhe Wei; Kevin Chen-Chuan Chang; Chunxu Zhang; Bo Yang

Recent research on graph embedding has achieved success in various applications. Most graph embedding methods preserve the proximity in a graph into a manifold in an embedding space. We argue an important but neglected problem about this proximity-preserving strategy: Graph topology patterns, while preserved well into an embedding manifold by preserving proximity, may distort in the ambient embedding

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-28
Richard Archibald; Feng Bao; Yanzhao Cao; He Zhang

We develop a probabilistic machine learning method, which formulates a class of stochastic neural networks by a stochastic optimal control problem. An efficient stochastic gradient descent algorithm is introduced under the stochastic maximum principle framework. Convergence analysis for stochastic gradient descent optimization and numerical experiments for applications of stochastic neural networks

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-28
Abdul Wahab; Muhammad Anas Tahir; Naveed Iqbal; Faisal Shafait; Syed Muhammad Raza Kazmi

Electricity load forecasting enables the grid operators to optimally implement the smart grid's most essential features such as demand response and energy efficiency. Electricity demand profiles can vary drastically from one region to another on diurnal, seasonal and yearly scale. Hence to devise a load forecasting technique that can yield the best estimates on diverse datasets, especially when the

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-28
Zakaria Mhammedi; Hisham Husain

Acquisition of data is a difficult task in most applications of Machine Learning (ML), and it is only natural that one hopes and expects lower population risk (better performance) with increasing data points. It turns out, somewhat surprisingly, that this is not the case even for the most standard algorithms such as the Empirical Risk Minimizer (ERM). Non-monotonic behaviour of the risk and instability

Updated: 2020-12-01
• arXiv.cs.LG Pub Date : 2020-11-28
Johannes Klicpera; Shankari Giri; Johannes T. Margraf; Stephan Günnemann

Many important tasks in chemistry revolve around molecules during reactions. This requires predictions far from the equilibrium, while most recent work in machine learning for molecules has been focused on equilibrium or near-equilibrium states. In this paper we aim to extend this scope in three ways. First, we propose the DimeNet++ model, which is 8x faster and 10% more accurate than the original

Updated: 2020-12-01
Contents have been reproduced by permission of the publishers.
