-
Accelerating Regular Path Queries over Graph Database with Processing-in-Memory arXiv.cs.DB Pub Date : 2024-03-15 Ruoyan Ma, Shengan Zheng, Guifeng Wang, Jin Pu, Yifan Hua, Wentao Wang, Linpeng Huang
Regular path queries (RPQs) in graph databases are bottlenecked by the memory wall. Emerging processing-in-memory (PIM) technologies offer a promising solution to dispatch and execute path matching tasks in parallel within PIM modules. We present Moctopus, a PIM-based data management system for graph databases that supports efficient batch RPQs and graph updates. Moctopus employs a PIM-friendly dynamic
-
Interactive Trimming against Evasive Online Data Manipulation Attacks: A Game-Theoretic Approach arXiv.cs.DB Pub Date : 2024-03-15 Yue Fu, Qingqing Ye, Rong Du, Haibo Hu
With the exponential growth of data and its crucial impact on our lives and decision-making, the integrity of data has become a significant concern. Malicious data poisoning attacks, where false values are injected into the data, can disrupt machine learning processes and lead to severe consequences. To mitigate these attacks, distance-based defenses, such as trimming, have been proposed, but they
-
KIF: A Framework for Virtual Integration of Heterogeneous Knowledge Bases using Wikidata arXiv.cs.DB Pub Date : 2024-03-15 Guilherme Lima, Marcelo Machado, Elton Soares, Sandro R. Fiorini, Raphael Thiago, Leonardo G. Azevedo, Viviane T. da Silva, Renato Cerqueira
We present a knowledge integration framework (called KIF) that uses Wikidata as a lingua franca to integrate heterogeneous knowledge bases. These can be triplestores, relational databases, CSV files, etc., which may or may not use the Wikidata dialect of RDF. KIF leverages Wikidata's data model and vocabulary plus user-defined mappings to expose a unified view of the integrated bases while keeping
-
Query Rewriting via Large Language Models arXiv.cs.DB Pub Date : 2024-03-14 Jie Liu, Barzan Mozafari
Query rewriting is one of the most effective techniques for coping with poorly written queries before passing them down to the query optimizer. Manual rewriting is not scalable, as it is error-prone and requires deep expertise. Similarly, traditional query rewriting algorithms can only handle a small subset of queries: rule-based techniques do not generalize to new query patterns and synthesis-based
-
Reconciling Conflicting Data Curation Actions: Transparency Through Argumentation arXiv.cs.DB Pub Date : 2024-03-13 Yilin Xia, Shawn Bowers, Lan Li, Bertram Ludäscher
We propose a new approach for modeling and reconciling conflicting data cleaning actions. Such conflicts arise naturally in collaborative data curation settings where multiple experts work independently and then aim to put their efforts together to improve and accelerate data cleaning. The key idea of our approach is to model conflicting updates as a formal argumentation framework (AF). Such
-
OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories arXiv.cs.DB Pub Date : 2024-03-12 Christos Koutras, Jiani Zhang, Xiao Qin, Chuan Lei, Vasileios Ioannidis, Christos Faloutsos, George Karypis, Asterios Katsifodimos
How can we discover join relationships among columns of tabular data in a data repository? Can this be done effectively when metadata is missing? Traditional column matching works mainly rely on similarity measures based on exact value overlaps, hence missing important semantics or failing to handle noise in the data. At the same time, recent dataset discovery methods focusing on deep table representation
-
Generalised Graph Grammars for Natural Language Processing arXiv.cs.DB Pub Date : 2024-03-12 Oliver Robert Fox, Giacomo Bergami
This seminal paper proposes a new query language for graph matching and rewriting, overcoming the declarative limitation of Cypher while outperforming Neo4j on graph matching and rewriting by at least one order of magnitude. We exploited columnar databases (KnoBAB) to represent graphs using the Generalised Semistructured Model.
-
Evaluating Large Language Models in Process Mining: Capabilities, Benchmarks, Evaluation Strategies, and Future Challenges arXiv.cs.DB Pub Date : 2024-03-11 Alessandro Berti, Humam Kourani, Hannes Hafke, Chiao-Yun Li, Daniel Schuster
Using Large Language Models (LLMs) for Process Mining (PM) tasks is becoming increasingly essential, and initial approaches yield promising results. However, little attention has been given to developing strategies for evaluating and benchmarking the utility of incorporating LLMs into PM tasks. This paper reviews the current implementations of LLMs in PM and reflects on three different questions. 1)
-
BoostER: Leveraging Large Language Models for Enhancing Entity Resolution arXiv.cs.DB Pub Date : 2024-03-11 Huahang Li, Shuangyin Li, Fei Hao, Chen Jason Zhang, Yuanfeng Song, Lei Chen
Entity resolution, which involves identifying and merging records that refer to the same real-world entity, is a crucial task in areas like Web data integration. This importance is underscored by the presence of numerous duplicated and multi-version data resources on the Web. However, achieving high-quality entity resolution typically demands significant effort. The advent of Large Language Models
-
Evaluation of NoSQL in the Energy Marketplace with GraphQL Optimization arXiv.cs.DB Pub Date : 2024-03-07 Michael Howard
The growing popularity of electric vehicles in the United States requires an ever-expanding infrastructure of commercial DC fast charging stations. The U.S. Department of Energy estimates 33,355 publicly available DC fast charging stations as of September 2023. Range anxiety is an important impediment to the adoption of electric vehicles and is even more relevant in underserved regions in the country
-
Schema-Aware Multi-Task Learning for Complex Text-to-SQL arXiv.cs.DB Pub Date : 2024-03-09 Yangjun Wu, Han Wang
Conventional text-to-SQL parsers are not good at synthesizing complex SQL queries that involve multiple tables or columns, due to the challenges inherent in identifying the correct schema items and performing accurate alignment between question and schema items. To address the above issue, we present a schema-aware multi-task learning framework (named MTSQL) for complicated SQL queries. Specifically
-
What Cannot be Skipped About the Skiplist: A Survey of Skiplists and Their Applications in Big Data Systems arXiv.cs.DB Pub Date : 2024-03-07 Venkata Sai Pavan Kumar Vadrevu, Lu Xing, Walid G. Aref
Skiplists have become prevalent in systems. The main advantages of skiplists are their simplicity and ease of implementation, and the ability to support operations in the same asymptotic complexities as their tree-based counterparts. In this survey, we explore skiplists and their many variants. We highlight many scenarios of how skiplists are useful and fit well in these usage scenarios. We study several
-
ProMoAI: Process Modeling with Generative AI arXiv.cs.DB Pub Date : 2024-03-07 Humam Kourani, Alessandro Berti, Daniel Schuster, Wil M. P. van der Aalst
ProMoAI is a novel tool that leverages Large Language Models (LLMs) to automatically generate process models from textual descriptions, incorporating advanced prompt engineering, error handling, and code generation techniques. Beyond automating the generation of complex process models, ProMoAI also supports process model optimization. Users can interact with the tool by providing feedback on the generated
-
Mining Transactional Data To Produce Extended Association Rules Using Collaborative Apriori, Fsa-Red And M5p Predictive Algorithm As A Basis Of Business Actions arXiv.cs.DB Pub Date : 2024-03-07 Feri Sulianta, Laksana Eka Angga, Thee Houw Liong
There are large amounts of transactional data that show consumer shopping carts at a store that sells more than 150 types of products. In this case, the company utilizes these data to decide on business actions. In previous studies, the data that has a lot of attributes and record data reduction algorithms handled by the FSA Red (Feature Selection for Association Rules) are then mined using Apriori
-
Spanning Tree-based Query Plan Enumeration arXiv.cs.DB Pub Date : 2024-03-06 Yesdaulet Izenov, Asoke Datta, Brian Tsan, Abylay Amanbayev, Florin Rusu
In this work, we define the problem of finding an optimal query plan as finding spanning trees with low costs. This approach empowers the utilization of a series of spanning tree algorithms, thereby enabling systematic exploration of the plan search space over a join graph. Capitalizing on the polynomial time complexity of spanning tree algorithms, we present the Ensemble Spanning Tree Enumeration
-
Development and evaluation of Artificial Intelligence techniques for IoT data quality assessment and curation arXiv.cs.DB Pub Date : 2024-03-06 Laura Martín, Luis Sánchez, Jorge Lanza, Pablo Sotres
Nowadays, data is becoming the new fuel for economic wealth and creation of novel and profitable business models. A multitude of technologies are contributing to an abundance of information sources which are already the baseline for multi-millionaire services and applications. The Internet of Things (IoT) is probably the most representative one. However, for an economy of data to actually flourish there
-
iSummary: Workload-based, Personalized Summaries for Knowledge Graphs arXiv.cs.DB Pub Date : 2024-03-05 Giannis Vassiliou, Fanouris Alevizakis, Nikolaos Papadakis, Haridimos Kondylakis
The explosion in the size and the complexity of the available Knowledge Graphs on the web has led to the need for efficient and effective methods for their understanding and exploration. Semantic summaries have recently emerged as methods to quickly explore and understand the contents of various sources. However, in most cases they are static, not incorporating user needs and preferences, and cannot scale
-
Model Lakes arXiv.cs.DB Pub Date : 2024-03-04 Koyena Pal, David Bau, Renée J. Miller
Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models are different one from another. Currently, practitioners rely on manually-written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of machine learning models increases, this issue of
-
Stage: Query Execution Time Prediction in Amazon Redshift arXiv.cs.DB Pub Date : 2024-03-04 Ziniu Wu, Ryan Marcus, Zhengchun Liu, Parimarjan Negi, Vikram Nathan, Pascal Pfeil, Gaurav Saxena, Mohammad Rahman, Balakrishnan Narayanaswamy, Tim Kraska
Query performance (e.g., execution time) prediction is a critical component of modern DBMSes. As a pioneering cloud data warehouse, Amazon Redshift relies on an accurate execution time prediction for many downstream tasks, ranging from high-level optimizations, such as automatically creating materialized views, to low-level tasks on the critical path of query execution, such as admission, scheduling
-
OCEL 2.0 Resources -- www.ocel-standard.org arXiv.cs.DB Pub Date : 2024-03-04 Istvan Koren, Niklas Adams, Alessandro Berti
Process mining has become a cornerstone of process analysis and improvement over the last few years. With the widespread adoption of process mining tools and libraries, the limitations of traditional process mining to deal with event data with multiple case identifiers, i.e., object-centric event data, have become apparent. As a response, the subfield of object-centric process mining has formed, including
-
OCEL (Object-Centric Event Log) 2.0 Specification arXiv.cs.DB Pub Date : 2024-03-04 Alessandro Berti, Istvan Koren, Jan Niklas Adams, Gyunam Park, Benedikt Knopp, Nina Graves, Majid Rafiei, Lukas Liß, Leah Tacke Genannt Unterberg, Yisong Zhang, Christopher Schwanen, Marco Pegoraro, Wil M. P. van der Aalst
Object-Centric Event Logs (OCELs) form the basis for Object-Centric Process Mining (OCPM). OCEL 1.0 was first released in 2020 and triggered the development of a range of OCPM techniques. OCEL 2.0 forms the new, more expressive standard, allowing for more extensive process analyses while remaining in an easily exchangeable format. In contrast to the first OCEL standard, it can depict changes in objects
-
Schema-Based Query Optimisation for Graph Databases arXiv.cs.DB Pub Date : 2024-03-04 Chandan Sharma (TYREX), Pierre Genevès (TYREX), Nils Gesbert (TYREX), Nabil Layaïda (TYREX)
Recursive graph queries are increasingly popular for extracting information from interconnected data found in various domains such as social networks, life sciences, and business analytics. Graph data often come with schema information that describes how nodes and edges are organized. We propose a type inference mechanism that enriches recursive graph queries with relevant structural information contained
-
TreeTracker Join: Turning the Tide When a Tuple Fails to Join arXiv.cs.DB Pub Date : 2024-03-03 Zeyuan Hu, Daniel P. Miranker
Many important query processing methods proactively use semijoins or semijoin-like filters to delete dangling tuples, i.e., tuples that do not appear in the final query result. Semijoin methods can achieve formal optimality but have high upfront cost in practice. Filter methods reduce the cost but lose the optimality guarantee. We propose a new join algorithm, TreeTracker Join (TTJ), that
-
Relational to RDF Data Migration by Query Co-Evaluation arXiv.cs.DB Pub Date : 2024-03-03 Ryan Wisnesky, Daniel Filonik
In this paper we define a new algorithm to convert an input relational database to an output set of RDF triples. The algorithm can be used to e.g. load CSV data into a financial OWL ontology such as FIBO. The algorithm takes as input a set of relational conjunctive (select-from-where) queries, one for each input table, from the three column (subject, predicate, object) output RDF schema to the input
-
ReMatch: Retrieval Enhanced Schema Matching with LLMs arXiv.cs.DB Pub Date : 2024-03-03 Eitam Sheetrit, Menachem Brief, Moshik Mishaeli, Oren Elisha
Schema matching is a crucial task in data integration, involving the alignment of a source database schema with a target schema to establish correspondence between their elements. This task is challenging due to textual and semantic heterogeneity, as well as differences in schema sizes. Although machine-learning-based solutions have been explored in numerous studies, they often suffer from low accuracy
-
A Conceptual Model for Data Storytelling Highlights in Business Intelligence Environments arXiv.cs.DB Pub Date : 2024-03-01 Panos Vassiliadis, Patrick Marcel, Faten El Outa, Veronika Peralta, Dimos Gkitsakis
We introduce a conceptual model for highlights to support data analysis and storytelling in the domain of Business Intelligence, via the automated extraction, representation, and exploitation of highlights revealing key facts that are hidden in the data with which a data analyst works. The model builds on the concepts of Holistic and Elementary Highlights, along with their context, constituents and
-
DFIN-SQL: Integrating Focused Schema with DIN-SQL for Superior Accuracy in Large-Scale Databases arXiv.cs.DB Pub Date : 2024-03-01 Shai Volvovsky, Marco Marcassa, Mustafa Panbiharwala
The task of converting natural language queries into SQL queries is intricate, necessitating a blend of precise techniques for an accurate translation. The DIN-SQL (Decomposed-In-Context SQL) methodology represents a significant development in this domain. This paper introduces DFIN (Decomposed Focused-In-Context), an innovative extension of DIN-SQL that enhances Text-to-SQL conversion by addressing
-
Data Quality Assessment: Challenges and Opportunities arXiv.cs.DB Pub Date : 2024-03-01 Sedir Mohammed, Hazar Harmouch, Felix Naumann, Divesh Srivastava
Data-oriented applications, their users, and even the law require data of high quality. Research has broken down the rather vague notion of data quality into various dimensions, such as accuracy, consistency, and reputation, to name but a few. To achieve the goal of high data quality, many tools and techniques exist to clean and otherwise improve data. Yet, systematic research on actually assessing
-
DynaWarp -- Efficient, large-scale log storage and retrieval arXiv.cs.DB Pub Date : 2024-02-28 Julian Reichinger, Thomas Krismayer, Jan Rellermeyer
Modern, large scale monitoring systems have to process and store vast amounts of log data in near real-time. At query time the systems have to find relevant logs based on the content of the log message using support structures that can scale to these amounts of data while still being efficient to use. We present our novel DynaWarp membership sketch, capable of answering Multi-Set Multi-Membership-Queries
-
Play like a Vertex: A Stackelberg Game Approach for Streaming Graph Partitioning arXiv.cs.DB Pub Date : 2024-02-28 Zezhong Ding, Yongan Xiang, Shangyou Wang, Xike Xie, S. Kevin Zhou
In the realm of distributed systems tasked with managing and processing large-scale graph-structured data, optimizing graph partitioning stands as a pivotal challenge. The primary goal is to minimize communication overhead and runtime cost. However, alongside the computational complexity associated with optimal graph partitioning, a critical factor to consider is memory overhead. Real-world graphs
-
Certain and Approximately Certain Models for Statistical Learning arXiv.cs.DB Pub Date : 2024-02-27 Cheng Zhen, Nischal Aryal, Arash Termehchy, Amandeep Singh Chabada
Real-world data is often incomplete and contains missing values. To train accurate models over real-world datasets, users need to spend a substantial amount of time and resources imputing and finding proper values for missing data items. In this paper, we demonstrate that it is possible to learn accurate models directly from data with missing values for certain training data and target models. We propose
-
Metasql: A Generate-then-Rank Framework for Natural Language to SQL Translation arXiv.cs.DB Pub Date : 2024-02-27 Yuankai Fan, Zhenying He, Tonghui Ren, Can Huang, Yinan Jing, Kai Zhang, X. Sean Wang
The Natural Language Interface to Databases (NLIDB) empowers non-technical users with database access through intuitive natural language (NL) interactions. Advanced approaches, utilizing neural sequence-to-sequence models or large-scale language models, typically employ auto-regressive decoding to generate unique SQL queries sequentially. While these translation models have greatly improved the overall
-
Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries arXiv.cs.DB Pub Date : 2024-02-25 Mike Heddes, Igor Nunes, Tony Givargis, Alex Nicolau
With the increasing rate of data generated by critical systems, estimating functions on streaming data has become essential. This demand has driven numerous advancements in algorithms designed to efficiently query and analyze one or more data streams while operating under memory constraints. The primary challenge arises from the rapid influx of new items, requiring algorithms that enable efficient
-
Aligning Large Language Models to a Domain-specific Graph Database arXiv.cs.DB Pub Date : 2024-02-26 Yuanyuan Liang, Keren Tan, Tingyu Xie, Wenbiao Tao, Siyuan Wang, Yunshi Lan, Weining Qian
Graph Databases (Graph DB) are widely applied in various fields, including finance, social networks, and medicine. However, translating Natural Language (NL) into the Graph Query Language (GQL), commonly known as NL2GQL, proves to be challenging due to its inherent complexity and specialized nature. Some approaches have sought to utilize Large Language Models (LLMs) to address analogous tasks like
-
CodeS: Towards Building Open-source Language Models for Text-to-SQL arXiv.cs.DB Pub Date : 2024-02-26 Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen
Language models have shown promising performance on the task of translating natural language questions into SQL queries (Text-to-SQL). However, most of the state-of-the-art (SOTA) approaches rely on powerful yet closed-source large language models (LLMs), such as ChatGPT and GPT-4, which may have the limitations of unclear model architectures, data privacy risks, and expensive inference overheads.
-
ExaLogLog: Space-Efficient and Practical Approximate Distinct Counting up to the Exa-Scale arXiv.cs.DB Pub Date : 2024-02-21 Otmar Ertl
This work introduces ExaLogLog, a new data structure for approximate distinct counting, which has the same practical properties as the popular HyperLogLog algorithm. It is commutative, idempotent, mergeable, reducible, has a constant-time insert operation, and supports distinct counts up to the exa-scale. At the same time, as theoretically derived and experimentally verified, it requires 43% less space
-
idwMapper: An interactive and data-driven web mapping framework for visualizing and sensing high-dimensional geospatial (big) data arXiv.cs.DB Pub Date : 2024-02-16 Sarigai Sarigai, Liping Yang, Katie Slack, K. Maria D. Lane, Michaela Buenemann, Qiusheng Wu, Gordon Woodhull, Joshua Driscol
We are surrounded by overwhelming big data, which brings substantial advances but meanwhile poses many challenges. Geospatial big data comprises a big portion of big data, and is essential and powerful for decision-making if being utilized strategically. Volumes in size and high dimensions are two of the major challenges that prevent strategic decision-making from (geospatial) big data. Interactive
-
Major TOM: Expandable Datasets for Earth Observation arXiv.cs.DB Pub Date : 2024-02-19 Alistair Francis, Mikolaj Czerkawski
Deep learning models are increasingly data-hungry, requiring significant resources to collect and compile the datasets needed to train them, with Earth Observation (EO) models being no exception. However, the landscape of datasets in EO is relatively atomised, with interoperability made difficult by diverse formats and data structures. If ever larger datasets are to be built, and duplication of effort
-
A survey of LSM-Tree based Indexes, Data Systems and KV-stores arXiv.cs.DB Pub Date : 2024-02-16 Supriya Mishra
Modern databases typically make use of the Log-Structured Merge-Tree, a kind of disk-based data structure, for organizing data in indexes. It was proposed to efficiently handle frequent update queries (also called update-intensive workloads) in databases. In recent years, LSM-Tree has gained popularity and has been adopted by a number of NoSQL databases and key-value stores. Since LSM-Tree was
-
An advanced data fabric architecture leveraging homomorphic encryption and federated learning arXiv.cs.DB Pub Date : 2024-02-15 Sakib Anwar Rieyan, Md. Raisul Kabir News, A. B. M. Muntasir Rahman, Sadia Afrin Khan, Sultan Tasneem Jawad Zaarif, Md. Golam Rabiul Alam, Mohammad Mehedi Hassan, Michele Ianni, Giancarlo Fortino
Data fabric is an automated and AI-driven data fusion approach to accomplish data management unification without moving data to a centralized location for solving complex data problems. In a Federated learning architecture, the global model is trained based on the learned parameters of several local models that eliminate the necessity of moving data to a centralized repository for machine learning
-
Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries arXiv.cs.DB Pub Date : 2024-02-13 Jonathan Fürst, Catherine Kosten, Farhard Nooralahzadeh, Yi Zhang, Kurt Stockinger
Text-to-SQL systems (also known as NL-to-SQL systems) have become an increasingly popular solution for bridging the gap between user capabilities and SQL-based data access. These systems translate user requests in natural language to valid SQL statements for a specific database. Recent Text-to-SQL systems have benefited from the rapid improvement of transformer-based language models. However, while
-
Sampling Space-Saving Set Sketches arXiv.cs.DB Pub Date : 2024-02-13 Homin K. Lee, Charles Masson
Large, distributed data streams are now ubiquitous. High-accuracy sketches with low memory overhead have become the de facto method for analyzing this data. For instance, if we wish to group data by some label and report the largest counts using fixed memory, we need to turn to mergeable heavy hitter sketches that can provide highly accurate approximate counts. Similarly, if we wish to keep track of
-
Intent-Based Access Control: Using LLMs to Intelligently Manage Access Control arXiv.cs.DB Pub Date : 2024-02-11 Pranav Subramaniam, Sanjay Krishnan
In every enterprise database, administrators must define an access control policy that specifies which users have access to which assets. Access control straddles two worlds: policy (organization-level principles that define who should have access) and process (database-level primitives that actually implement the policy). Assessing and enforcing process compliance with a policy is a manual and ad-hoc
-
A harmonized and interoperable format for storing and processing polysomnography data arXiv.cs.DB Pub Date : 2024-02-09 Riku Huttunen, Matias Rusanen, Sami Nikkonen, Henri Korkalainen, Samu Kainulainen
Polysomnography (PSG) data is recorded and stored in various formats depending on the recording software. Although the PSG data can usually be exported to open formats, such as the European Data Format (EDF), they are limited in data types, validation, and readability. Moreover, the exported data is not harmonized, which means different datasets need customized preprocessing to conduct research on
-
Fostering the integration of European Open Data into Data Spaces through High-Quality Metadata arXiv.cs.DB Pub Date : 2024-02-08 Javier Conde, Alejandro Pozo, Andrés Munoz-Arcentales, Johnny Choque, Álvaro Alonso
The term Data Space, understood as the secure exchange of data in distributed systems, ensuring openness, transparency, decentralization, sovereignty, and interoperability of information, has gained importance during the last years. However, Data Spaces are in an initial phase of definition, and new research is necessary to address their requirements. The Open Data ecosystem can be understood as one
-
Approximate Keys and Functional Dependencies in Incomplete Databases With Limited Domains-Algorithmic Perspective arXiv.cs.DB Pub Date : 2024-02-07 Munqath Al-atar, Attila Sali
A possible world of an incomplete database table is obtained by imputing values from the attributes' (infinite) domain in place of NULLs. A table satisfies a possible key or possible functional dependency constraint if there exists a possible world of the table that satisfies the given key or functional dependency constraint. A certain key or functional dependency is satisfied by a table
-
Approximate Integrity Constraints in Incomplete Databases With Limited Domains arXiv.cs.DB Pub Date : 2024-02-07 Munqath Al-atar, Attila Sali
In case of incomplete database tables, a possible world is obtained by replacing any missing value by a value from the corresponding attribute's domain that can be infinite. A possible key or possible functional dependency constraint is satisfied by an incomplete table if we can obtain a possible world that satisfies the given key or functional dependency. On the other hand, a certain key or certain
-
Topological relations in water quality monitoring arXiv.cs.DB Pub Date : 2024-02-07 Bruno Chaves Figueiredo, Maria Alexandra Oliveira, João Nuno Silva
The Alqueva Multi-Purpose Project (EFMA) is a massive abduction and storage infrastructure system in the Alentejo, which has a water quality monitoring network with almost thousands of water quality stations distributed across three subsystems: Alqueva, Pedrogão, and Ardila. Identification of pollution sources in complex infrastructure systems, such as the EFMA, requires recognition of water flow
-
Towards a Flexible Scale-out Framework for Efficient Visual Data Query Processing arXiv.cs.DB Pub Date : 2024-02-05 Rohit Verma, Arun Raghunath
There is growing interest in visual data management systems that support queries with specialized operations ranging from resizing an image to running complex machine learning models. With a plethora of such operations, the basic need to receive query responses in minimal time takes a hit, especially when the client desires to run multiple such operations in a single query. Existing systems provide
-
Mining a Minimal Set of Behavioral Patterns using Incremental Evaluation arXiv.cs.DB Pub Date : 2024-02-05 Mehdi Acheli, Daniela Grigori, Matthias Weidlich
Process mining provides methods to analyse event logs generated by information systems during the execution of processes. It thereby supports the design, validation, and execution of processes in domains ranging from healthcare, through manufacturing, to e-commerce. To explore the regularities of flexible processes that show a large behavioral variability, it was suggested to mine recurrent behavioral
-
LLM-Enhanced Data Management arXiv.cs.DB Pub Date : 2024-02-04 Xuanhe Zhou, Xinyang Zhao, Guoliang Li
Machine learning (ML) techniques for optimizing data management problems have been extensively studied and widely deployed in the past five years. However, traditional ML methods have limitations on generalizability (adapting to different scenarios) and inference ability (understanding the context). Fortunately, large language models (LLMs) have shown high generalizability and human-competitive abilities
-
On the development of an application for the compilation of global sea level changes arXiv.cs.DB Pub Date : 2024-02-04 Mihir Odhavji, Maria Alexandra Oliveira, João Nuno Silva
There is a lot of data about mean sea level variation from studies conducted around the globe. This data is dispersed, lacks organization and standardization, and in most cases is not available online. In some instances, when it is available, it is often in impractical forms and different formats. Analyzing it would be inefficient and very time-consuming. In addition to all of that, to successfully
-
HotRAP: Hot Record Retention and Promotion for LSM-trees with tiered storage arXiv.cs.DB Pub Date : 2024-02-03 Jiansheng Qiu, Fangzhou Yuan, Huanchen Zhang
The multi-level design of Log-Structured Merge-trees (LSM-trees) naturally fits the tiered storage architecture: the upper levels (recently inserted/updated records) are kept in fast storage to guarantee performance while the lower levels (the majority of records) are placed in slower but cheaper storage to reduce cost. However, frequently accessed records may have been compacted and reside in slow
-
PANDA: Query Evaluation in Submodular Width arXiv.cs.DB Pub Date : 2024-02-03 Mahmoud Abo Khamis, Hung Q. Ngo, Dan Suciu
In recent years, several information-theoretic upper bounds have been introduced on the output size and evaluation cost of database join queries. These bounds vary in their power depending on both the type of statistics on input relations and the query plans that they support. This motivated the search for algorithms that can compute the output of a join query in times that are bounded by the corresponding
-
Knowledge Acquisition and Integration with Expert-in-the-loop arXiv.cs.DB Pub Date : 2024-02-05 Sajjadur Rahman, Frederick Choi, Hannah Kim, Dan Zhang, Estevam Hruschka
Constructing and serving knowledge graphs (KGs) is an iterative and human-centered process involving on-demand programming and analysis. In this paper, we present Kyurem, a programmable and interactive widget library that facilitates human-in-the-loop knowledge acquisition and integration to enable continuous curation of a knowledge graph (KG). Kyurem provides a seamless environment within computational
-
Effective Bug Detection in Graph Database Engines: An LLM-based Approach arXiv.cs.DB Pub Date : 2024-02-01 Jiayi Wu, Zhengyu Wu, Ronghua Li, Hongchao Qin, Guoren Wang
Graph database engines play a pivotal role in efficiently storing and managing graph data across various domains, including bioinformatics, knowledge graphs, and recommender systems. Ensuring data accuracy within graph database engines is paramount, as inaccuracies can yield unreliable analytical outcomes. Current bug-detection approaches are confined to specific graph query languages, limiting their
-
Joining Entities Across Relation and Graph with a Unified Model arXiv.cs.DB Pub Date : 2024-01-31 Wenzhi Fu
This paper introduces the RG (Relational Genetic) model, a revised relational model to represent graph-structured data in RDBMS while preserving its topology, for efficiently and effectively extracting data in different formats from disparate sources. Along with: (a) SQL$_\delta$, an SQL dialect augmented with graph pattern queries and tuple-vertex joins, such that one can extract graph properties via
-
A Graph-Native Query Optimization Framework arXiv.cs.DB Pub Date : 2024-01-31 Bingqing Lyu, Xiaoli Zhou, Longbin Lai, Yufan Yang, Yunkai Lou, Wenyuan Yu, Jingren Zhou
Graph queries that combine pattern matching with relational operations, referred to as PatRelQuery, are widely used in many real-world applications. They allow users to identify arbitrary patterns in a graph and further perform in-depth relational analysis on the results. To effectively support PatRelQuery, two key challenges need to be addressed: (1) how to optimize PatRelQuery in a unified framework
-
Performance Comparison Analysis of ArangoDB, MySQL, and Neo4j: An Experimental Study of Querying Connected Data arXiv.cs.DB Pub Date : 2024-01-30 Johan Sandell, Einar Asplund, Workneh Yilma Ayele, Martin Duneld
Choosing and developing performant database solutions helps organizations optimize their operational practices and decision-making. Since graph data is becoming more common, it is crucial to develop and use them in big data with complex relationships with high and consistent performance. However, legacy database technologies such as MySQL are tailored to store relational databases and need to perform
-
Neural Locality Sensitive Hashing for Entity Blocking arXiv.cs.DB Pub Date : 2024-01-31 Runhui Wang, Luyang Kong, Yefan Tao, Andrew Borthwick, Davor Golac, Henrik Johnson, Shadie Hijazi, Dong Deng, Yongfeng Zhang
Locality-sensitive hashing (LSH) is a fundamental algorithmic technique widely employed in large-scale data processing applications, such as nearest-neighbor search, entity resolution, and clustering. However, its applicability in some real-world scenarios is limited due to the need for careful design of hashing functions that align with specific metrics. Existing LSH-based Entity Blocking solutions