• arXiv.cs.DB Pub Date : 2021-01-18
Mónica Figuera; Philipp D. Rohde; Maria-Esther Vidal

Knowledge graphs have emerged as expressive data structures for Web data. The potential of knowledge graphs, and the demand for ecosystems that facilitate their creation, curation, and understanding, are evident in diverse domains, e.g., biomedicine. The Shapes Constraint Language (SHACL) is the W3C recommendation language for integrity constraints over RDF knowledge graphs. Enabling quality assessments of knowledge

Updated: 2021-01-19
• arXiv.cs.DB Pub Date : 2021-01-17
Hemant Saxena; Lukasz Golab; Stratos Idreos; Ihab F. Ilyas

Real-time data analytics systems such as SAP HANA, MemSQL, and IBM Wildfire employ hybrid data layouts, in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high data rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge

Updated: 2021-01-19
• arXiv.cs.DB Pub Date : 2021-01-17
Rabia Azzi; Gayo Diallo

In this paper we present AMALGAM, a matching approach to fairify tabular data with the use of a knowledge graph. The ultimate goal is to provide a fast and efficient approach to annotating tabular data with entities from background knowledge. The approach combines lookup and filtering services with text pre-processing techniques. Experiments conducted in the context of the 2020 Semantic Web

Updated: 2021-01-19
• arXiv.cs.DB Pub Date : 2021-01-18
Masatoshi Hanai; Nikos Tziritas; Toyotaro Suzumura; Wentong Cai; Georgios Theodoropoulos

The dynamic scaling of distributed computations plays an important role in the utilization of elastic computational resources, such as the cloud. It enables the provisioning and de-provisioning of resources to match dynamic resource availability and demands. In the case of distributed graph processing, changing the number of graph partitions while maintaining high partitioning quality imposes serious

Updated: 2021-01-19
• arXiv.cs.DB Pub Date : 2021-01-17
Peng Gao; Fei Shao; Xiaoyuan Liu; Xusheng Xiao; Haoyuan Liu; Zheng Qin; Fengyuan Xu; Prateek Mittal; Sanjeev R. Kulkarni; Dawn Song

Log-based cyber threat hunting has emerged as an important solution to counter sophisticated cyber attacks. However, existing approaches require non-trivial efforts of manual query construction and have overlooked the rich external knowledge about threat behaviors provided by open-source Cyber Threat Intelligence (OSCTI). To bridge the gap, we build ThreatRaptor, a system that facilitates cyber threat

Updated: 2021-01-19
• arXiv.cs.DB Pub Date : 2021-01-17
Massimo Cafaro; Catiuscia Melle; Italo Epicoco; Marco Pulimeno

UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSKETCH. In this paper we show how to compress and fuse data streams (or datasets) by using UDDSKETCH data summaries

Updated: 2021-01-19
• arXiv.cs.DB Pub Date : 2021-01-15

This paper aims at providing extremely efficient algorithms for approximate query enumeration on sparse databases, that come with performance and accuracy guarantees. We introduce a new model for approximate query enumeration on classes of relational databases of bounded degree. We first prove that on databases of bounded degree any local first-order definable query can be enumerated approximately

Updated: 2021-01-18
• arXiv.cs.DB Pub Date : 2021-01-15
Daniel Obraczka; Jonathan Schuchart; Erhard Rahm

Entity Resolution (ER) is a constituent part of integrating different knowledge graphs in order to identify entities referring to the same real-world object. A promising approach is the use of graph embeddings for ER in order to determine the similarity of entities based on the similarity of their graph neighborhood. The similarity computations for such embeddings translate to calculating the

Updated: 2021-01-18
• arXiv.cs.DB Pub Date : 2021-01-13
Adel Ardalan; Derek Paulsen; Amanpreet Singh Saini; Walter Cai; AnHai Doan

Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in

Updated: 2021-01-15
• arXiv.cs.DB Pub Date : 2021-01-13
Maximilian Ernst Tschuchnig; Dejan Radovanovic; Eduard Hirsch; Anna-Maria Oberluggauer; Georg Schäfer

Conventional data storage methods like SQL and NoSQL offer a huge number of possibilities but share one major disadvantage: they have to use a centralized authority. This authority may be in the form of a centralized or decentralized master server or a permissioned peer-to-peer setting. This paper looks at different technologies for persisting data without using a central authority, mainly looking at permissionless

Updated: 2021-01-14
• arXiv.cs.DB Pub Date : 2021-01-13

Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's

Updated: 2021-01-14
• arXiv.cs.DB Pub Date : 2021-01-12
Tanja Auge; Nic Scharlau; Andreas Heuer

Given a query result of a big database, why-provenance can be used to calculate the necessary part of this database, consisting of so-called witnesses. If this database consists of personal data, privacy protection has to prevent the publication of these witnesses. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy). In this

Updated: 2021-01-13
• arXiv.cs.DB Pub Date : 2021-01-11
Arif Usta; Akifhan Karakayali; Özgür Ulusoy

Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed in relational databases is a challenging task, which has been widely studied in the database community recently. Conventional rule-based systems utilize a series of solutions as a pipeline to deal with each step of this task, namely stop word filtering, tokenization, stemming/lemmatization, parsing, tagging

Updated: 2021-01-13
• arXiv.cs.DB Pub Date : 2021-01-11
Shaleen Deep; Xiao Hu; Paraschos Koutris

We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain

Updated: 2021-01-12
• arXiv.cs.DB Pub Date : 2021-01-11
Wilmer Ricciotti; James Cheney

Language-integrated query based on comprehension syntax is a powerful technique for safe database programming, and provides a basis for advanced techniques such as query shredding or query flattening that allow efficient programming with complex nested collections. However, the foundations of these techniques are lacking: although SQL, the most widely-used database query language, supports heterogeneous

Updated: 2021-01-12
• arXiv.cs.DB Pub Date : 2021-01-09
Shuyuan Yan; Bolin Ding; Wei Guo; Jingren Zhou; Zhewei Wei; Xiaowei Jiang; Sheng Xu

Interactive response time is important in analytical pipelines for users to explore a sufficient number of possibilities and make informed business decisions. We consider a forecasting pipeline with large volumes of high-dimensional time series data. Real-time forecasting can be conducted in two steps. First, we specify the portion of data to be focused on and the measure to be predicted by slicing

Updated: 2021-01-12
• arXiv.cs.DB Pub Date : 2021-01-08
Cristina Feier; Carsten Lutz; Marcin Przybyłko

We study the complexity of answer counting for ontology-mediated queries and for querying under constraints, considering conjunctive queries and unions thereof (UCQs) as the query language and guarded TGDs as the ontology and constraint language, respectively. Our main result is a classification according to whether answer counting is fixed-parameter tractable (FPT), W[1]-equivalent, #W[1]-equivalent

Updated: 2021-01-11
• arXiv.cs.DB Pub Date : 2021-01-07
Cyril Cappi; Camille Chapdelaine; Laurent Gardes; Eric Jenn; Baptiste Lefevre; Sylvaine Picard; Thomas Soumarmon

This document gives a set of recommendations to build and manipulate the datasets used to develop and/or validate machine learning models such as deep neural networks. This document is one of the 3 documents defined in [1] to ensure the quality of datasets. This is a work in progress as good practices evolve along with our understanding of machine learning. The document is divided into three main parts

Updated: 2021-01-11
• arXiv.cs.DB Pub Date : 2021-01-08
Meifan Zhang; Hongzhi Wang

The Group-By query is an important kind of query, which is common and widely used in data warehouses, data analytics, and data visualization. Approximate query processing is an effective way to increase the querying efficiency on big data. The answer to a group-by query involves multiple values, which makes it difficult to provide sufficiently accurate estimations for all the groups. Stratified sampling

Updated: 2021-01-11
• arXiv.cs.DB Pub Date : 2021-01-08
Hui Luo; Jingbo Zhou; Zhifeng Bao; Shuangli Li; J. Shane Culpepper; Haochao Ying; Hao Liu; Hui Xiong

Existing spatial object recommendation algorithms generally treat objects identically when ranking them. However, spatial objects often cover different levels of spatial granularity and thereby are heterogeneous. For example, one user may prefer to be recommended a region (say Manhattan), while another user might prefer a venue (say a restaurant). Even for the same user, preferences can change at different

Updated: 2021-01-11
• arXiv.cs.DB Pub Date : 2021-01-07
Miika Hannula; Bor-Kuan Song; Sebastian Link

For years, independence has been considered as an important concept in many disciplines. Nevertheless, we present the first research that investigates the discovery problem of independence in data. In its arguably simplest form, independence is a statement between two sets of columns expressing that for every two rows in a table there is also a row in the table that coincides with the first row on

Updated: 2021-01-08
• arXiv.cs.DB Pub Date : 2021-01-05
William F Godoy; Peter F Peterson; Steven E Hahn; Jay J Billings

Oak Ridge National Laboratory (ORNL) experimental neutron science facilities produce 1.2 TB a day of raw event-based data that is stored using the standard metadata-rich NeXus schema built on top of the HDF5 file format. Performance of several data reduction workflows is largely determined by the amount of time spent on the loading and processing algorithms in Mantid, an open-source data analysis

Updated: 2021-01-08
• arXiv.cs.DB Pub Date : 2021-01-07
Miika Hannula; Xinyi Li; Sebastian Link

Codd's rule of entity integrity stipulates that every table has a primary key. Hence, the attributes of the primary key carry unique and complete value combinations. In practice, data cannot always meet such requirements. Previous work proposed the superior notion of key sets for controlling entity integrity. We establish a linear-time algorithm for validating whether a given key set holds on a given

Updated: 2021-01-08
• arXiv.cs.DB Pub Date : 2021-01-07
Miika Hannula; Juha Kontinen; Sebastian Link

Infamously, the finite and unrestricted implication problems for the classes of i) functional and inclusion dependencies together, and ii) embedded multivalued dependencies alone are each undecidable. Famously, the restriction of i) to functional and unary inclusion dependencies in combination with the restriction of ii) to multivalued dependencies yield implication problems that are still different

Updated: 2021-01-08
• arXiv.cs.DB Pub Date : 2021-01-04
Majid Rafiei; Wil M. P. van der Aalst

Process mining aims to provide insights into the actual processes based on event data. These data are often recorded by information systems and are widely available. However, they often contain sensitive private information that should be analyzed responsibly. Therefore, privacy issues in process mining are recently receiving more attention. Privacy preservation techniques obviously need to modify

Updated: 2021-01-08
• arXiv.cs.DB Pub Date : 2021-01-06
Reza Karegar; Parke Godfrey; Lukasz Golab; Mehdi Kargar; Divesh Srivastava; Jaroslaw Szlichta

Order dependencies (ODs) capture relationships between ordered domains of attributes. Approximate ODs (AODs) capture such relationships even when there exist exceptions in the data. During automated discovery of ODs, validation is the process of verifying whether an OD holds. We present an algorithm for validating approximate ODs with significantly improved runtime performance over existing methods

Updated: 2021-01-07
• arXiv.cs.DB Pub Date : 2021-01-06
Xikui Wang; Michael J. Carey; Vassilis J. Tsotras

In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing Big Data at scale, developers would have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires a significant amount of work from developers. In our

Updated: 2021-01-07
• arXiv.cs.DB Pub Date : 2021-01-06
Katrin Casel; Markus L. Schmid

A regular path query (RPQ) is a regular expression q that returns all node pairs (u, v) from a graph database that are connected by an arbitrary path labelled with a word from L(q). The obvious algorithmic approach to RPQ-evaluation (called PG-approach), i.e., constructing the product graph between an NFA for q and the graph database, is appealing due to its simplicity and also leads to efficient algorithms

Updated: 2021-01-07
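
The product-graph idea named in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' algorithm: the function name `eval_rpq`, the edge-triple format, and the dictionary encoding of the NFA (without epsilon moves) are all hypothetical choices made for this example.

```python
from collections import deque

def eval_rpq(edges, nfa_delta, nfa_start, nfa_accept):
    """Return all pairs (u, v) joined by a path whose labels the NFA accepts.

    edges: iterable of (source, label, target) triples (the graph database).
    nfa_delta: dict mapping (state, label) -> set of successor states.
    The NFA is assumed to have no epsilon transitions.
    """
    adjacency = {}
    nodes = set()
    for source, label, target in edges:
        adjacency.setdefault(source, []).append((label, target))
        nodes.update((source, target))

    answers = set()
    for u in nodes:
        # Breadth-first search over product states (graph node, NFA state).
        seen = {(u, nfa_start)}
        queue = deque(seen)
        while queue:
            v, state = queue.popleft()
            if state in nfa_accept:
                answers.add((u, v))
            for label, w in adjacency.get(v, ()):
                for next_state in nfa_delta.get((state, label), ()):
                    if (w, next_state) not in seen:
                        seen.add((w, next_state))
                        queue.append((w, next_state))
    return answers

# Toy graph and an NFA for the expression a b*
edges = [(1, "a", 2), (2, "b", 3), (3, "b", 2)]
delta = {(0, "a"): {1}, (1, "b"): {1}}
print(sorted(eval_rpq(edges, delta, nfa_start=0, nfa_accept={1})))
```

The product graph is never materialized here; product states are generated on demand during the search, which keeps the sketch short at the cost of one search per start node.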
• arXiv.cs.DB Pub Date : 2021-01-06
Mingxi Wu; Xi Chen

Modern fraudsters write malicious programs to coordinate a group of accounts to commit collective fraud for illegal profits in online platforms. These programs have access to a set of finite resources (a set of IPs, devices, accounts, etc.) and sometimes manipulate fake accounts to collaboratively attack the target system. Inspired by these observations, we share our experience in building two real-time

Updated: 2021-01-07
• arXiv.cs.DB Pub Date : 2021-01-05
Hai Lan; Zhifeng Bao; Yuwei Peng

The query optimizer is at the heart of database systems. The cost-based optimizer studied in this paper is adopted in almost all current database systems. A cost-based optimizer introduces a plan enumeration algorithm to find a (sub)plan, and then uses a cost model to obtain the cost of that plan, and selects the plan with the lowest cost. In the cost model, cardinality, the number of tuples through an

Updated: 2021-01-06
• arXiv.cs.DB Pub Date : 2021-01-05
Xiaoou Ding; Hongzhi Wang; Chen Wang; Zijue Li; Zheng Liang

The demand for high-performance anomaly detection techniques for IoT data has become urgent, especially in industry. Anomaly identification and explanation in time series data is an essential task in IoT data mining. Since existing anomaly detection techniques focus on the identification of anomalies, the explanation of anomalies is not well solved. We address the anomaly explanation

Updated: 2021-01-06
• arXiv.cs.DB Pub Date : 2021-01-05
Maximilian Schleich; Zixuan Geng; Yihong Zhang; Dan Suciu

Machine learning is increasingly applied in high-stakes decision making that directly affects people's lives, and this leads to an increased demand for systems to explain their decisions. Explanations often take the form of counterfactuals, which consist of conveying to the end user what she/he needs to change in order to improve the outcome. Computing counterfactual explanations is challenging, because

Updated: 2021-01-06
• arXiv.cs.DB Pub Date : 2021-01-04
Aman Abidi; Lu Chen; Rui Zhou; Chengfei Liu

There are extensive studies focusing on the application scenario in which all the bipartite cohesive subgraphs need to be discovered in a bipartite graph. However, we observe that, for some applications, one is interested in finding bipartite cohesive subgraphs containing a specific vertex. In this paper, we study a new query-dependent bipartite cohesive subgraph search problem based on the k-wing model

Updated: 2021-01-05
• arXiv.cs.DB Pub Date : 2021-01-04
Yaliang Li; Daoyuan Chen; Bolin Ding; Kai Zeng; Jingren Zhou

Database indexes facilitate data retrieval and benefit broad applications in real-world systems. Recently, a new family of indexes, named learned indexes, has been proposed to learn hidden yet useful data distributions and incorporate such information into the learning of indexes, which leads to promising performance improvements. However, the "learning" process of learned indexes is still under-explored. In

Updated: 2021-01-05
• arXiv.cs.DB Pub Date : 2021-01-02
Olga Poppe; Chuan Lei; Lei Ma; Allison Rozet; Elke A. Rundensteiner

Complex event processing (CEP) systems continuously evaluate large workloads of pattern queries under tight time constraints. Event trend aggregation queries with Kleene patterns are commonly used to retrieve summarized insights about the recent trends in event streams. State-of-the-art methods are limited either due to repetitive computations or unnecessary trend construction. Existing shared approaches

Updated: 2021-01-05
• arXiv.cs.DB Pub Date : 2021-01-01
Daniel Szelogowski

Current open source applications which allow for cross-platform data visualization of OLAP cubes feature issues of high overhead and inconsistency due to data oversimplification. To improve upon this issue, there is a need to cut down the number of pipelines that the data must travel between for these aggregation operations and create a single, unified application which performs efficiently without

Updated: 2021-01-05
• arXiv.cs.DB Pub Date : 2021-01-01
Daniel Szelogowski

With web and mobile platforms becoming more prominent devices utilized in data analysis, there are currently few systems which are not without flaw. In order to increase the performance of these systems and decrease errors of data oversimplification, we seek to understand how other programming languages can be used across these platforms which provide data and type safety, as well as utilizing concurrency

Updated: 2021-01-05
• arXiv.cs.DB Pub Date : 2021-01-04
Alvin Cheung; Natacha Crooks; Joseph M. Hellerstein; Matthew Milano

Nearly twenty years after the launch of AWS, it remains difficult for most developers to harness the enormous potential of the cloud. In this paper we lay out an agenda for a new generation of cloud programming research aimed at bringing research ideas to programmers in an evolutionary fashion. Key to our approach is a separation of distributed programs into a PACT of four facets: Program semantics

Updated: 2021-01-05
• arXiv.cs.DB Pub Date : 2021-01-01
Otmar Ertl

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to

Updated: 2021-01-05
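
The abstract above mentions that MinHash allows estimating the Jaccard similarity of sets. A minimal sketch of classic MinHash (not of SetSketch itself) follows. The helper names, the use of seeded BLAKE2b as the hash family, and the default of 256 hash functions are assumptions made for illustration only.

```python
import hashlib

def _seeded_hash(seed, item):
    # One 64-bit hash per (seed, item); the seed selects the hash function.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=256):
    """Return the per-hash-function minimum over a non-empty collection."""
    items = list(items)
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

a = set(range(0, 100))
b = set(range(50, 150))
# The true Jaccard similarity of a and b is 50 / 150.
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

Each signature position agrees with probability equal to the Jaccard similarity, so the estimate tightens as `num_hashes` grows; signatures are only comparable when built with the same number of hash functions.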
• arXiv.cs.DB Pub Date : 2021-01-01
Daniel Szelogowski

Chunking data is obviously no new concept; however, I had never found any data structures that used chunking as the basis of their implementation. I figured that by using chunking alongside concurrency, I could create an extremely fast run-time with regard to particular methods such as searching and/or sorting. By using chunking and concurrency to my advantage, I came up with the chunk list - a dynamic list-based

Updated: 2021-01-05
• arXiv.cs.DB Pub Date : 2020-12-31
Chang Ge; Shubhankar Mohapatra; Xi He; Ihab F. Ilyas

Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring the privacy of individual data records. Existing differentially private data synthesis methods aim to generate useful data based on applications, but

Updated: 2021-01-01
• arXiv.cs.DB Pub Date : 2020-12-31
Christian Riegger; Arthur Bernhardt; Bernhard Moessner; Ilia Petrov

We introduce bloomRF as a unified method for approximate membership testing that supports both point- and range-queries on a single data structure. bloomRF extends Bloom-Filters with range query support and may replace them. The core idea is to employ a dyadic interval scheme to determine the set of dyadic intervals covering a data point, which are then encoded and inserted. bloomRF introduces Dyadic

Updated: 2021-01-01
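
To make the phrase "the set of dyadic intervals covering a data point" from the abstract above concrete, here is a small sketch that lists those intervals for an integer key. It does not reproduce bloomRF's encoding or insertion scheme, which the truncated abstract does not describe; the function name and the half-open `(lo, hi)` representation are assumptions for this example.

```python
def dyadic_intervals(x, bits):
    """List the dyadic intervals [lo, hi) that contain x, one per level.

    The domain is [0, 2**bits). The interval at level l has width 2**l,
    so level 0 is the point itself and level `bits` is the whole domain.
    """
    if not 0 <= x < (1 << bits):
        raise ValueError("x is outside the domain [0, 2**bits)")
    intervals = []
    for level in range(bits + 1):
        lo = (x >> level) << level  # clear the lowest `level` bits
        intervals.append((lo, lo + (1 << level)))
    return intervals

print(dyadic_intervals(5, 3))
# [(5, 6), (4, 6), (4, 8), (0, 8)]
```

A key therefore belongs to exactly one interval per level, which is what lets a range query be answered by probing a small number of intervals instead of every point in the range.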
• arXiv.cs.DB Pub Date : 2020-12-31
Alexandr Savinov

In this paper we argue that representing entity properties by tuple attributes, as evangelized in most set-oriented data models, is a controversial method conflicting with the principle of tuple immutability. As a principled solution to this problem of tuple immutability on one hand and the need to modify tuple attributes on the other hand, we propose to use mathematical functions for representing

Updated: 2021-01-01
• arXiv.cs.DB Pub Date : 2020-12-30
Hannah Bast; Patrick Brosi; Markus Näther

We study the following problem: given two public transit station identifiers A and B, each with a label and a geographic coordinate, decide whether A and B describe the same station. For example, for "St Pancras International" at (51.5306, -0.1253) and "London St Pancras" at (51.5319, -0.1269), the answer would be "Yes". This problem frequently arises in areas where public transit data is used, for

Updated: 2021-01-01
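
For the station example quoted in the abstract above, the geographic part of the comparison can be illustrated with a plain great-circle distance. This is a naive baseline swapped in for illustration, not the method studied in the paper; the function name and the mean Earth radius constant are assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# "St Pancras International" vs. "London St Pancras", coordinates from the abstract
print(haversine_m(51.5306, -0.1253, 51.5319, -0.1269))
```

Distance alone cannot decide the question, since distinct stations can be close together and the same station can be recorded at differing coordinates, which is why the problem also involves the labels.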
• arXiv.cs.DB Pub Date : 2020-12-29
Ziniu Wu; Amir Shaikhha

Cardinality estimation is one of the fundamental problems in database management systems and it is an essential component in query optimizers. Traditional machine-learning-based approaches use probabilistic models such as Bayesian Networks (BNs) to learn joint distributions on data. Recent research advocates for using deep unsupervised learning and achieves state-of-the-art performance in estimating

Updated: 2021-01-01
• arXiv.cs.DB Pub Date : 2020-12-31
Sergio Cabello

We consider the problem of computing the distance-based representative skyline in the plane, a problem introduced by Tao, Ding, Lin and Pei [Proc. 25th IEEE International Conference on Data Engineering (ICDE), 2009] and independently considered by Dupin, Nielsen and Talbi [Optimization and Learning - Third International Conference, OLA 2020] in the context of multi-objective optimization. Given

Updated: 2021-01-01
• arXiv.cs.DB Pub Date : 2020-12-29
Xiaoou Ding; Hongzhi Wang; Jiaxuan Su; Chen Wang; Hong Gao

Both the volume and the collection velocity of time series generated by monitoring sensors are increasing in the Internet of Things (IoT). Data management and analysis requires high quality and applicability of the IoT data. However, errors are prevalent in original time series data. Inconsistency in time series is a serious data quality problem existing widely in IoT. Such a problem could hardly be

Updated: 2021-01-01
• arXiv.cs.DB Pub Date : 2020-12-29
Anna Fariha; Lucy Cousins; Narges Mahyar; Alexandra Meliou

Traditional data systems require specialized technical skills where users need to understand the data organization and write precise queries to access data. Therefore, novice users who lack technical expertise face hurdles in perusing and analyzing data. Existing tools assist in formulating queries through keyword search, query recommendation, and query auto-completion, but still require some technical

Updated: 2021-01-01
• arXiv.cs.DB Pub Date : 2020-12-28
Junya Arai; Makoto Onizuka; Yasuhiro Fujiwara; Sotetsu Iwamura

Subgraph matching is a compute-intensive problem that asks to enumerate all the isomorphic embeddings of a query graph within a data graph. This problem is generally solved with backtracking, which recursively evolves every possible partial embedding until it becomes an isomorphic embedding or is found unable to become it. While existing methods reduce the search space by analyzing graph structures

Updated: 2020-12-29
• arXiv.cs.DB Pub Date : 2020-12-28
Bowen Hao; Jing Zhang; Cuiping Li; Hong Chen; Hongzhi Yin

The proliferation of massive open online courses (MOOCs) demands an effective way of course recommendation for jobs posted in recruitment websites, especially for the people who take MOOCs to find new jobs. Despite the advances of supervised ranking models, the lack of enough supervised signals prevents us from directly learning a supervised ranking model. This paper proposes a general automated weak

Updated: 2020-12-29
• arXiv.cs.DB Pub Date : 2020-12-26
Xiaoying Wu; Dimitri Theodoratos; Nikos Mamoulis

We address the problem of summarizing embedded tree patterns extracted from large data trees. We do so by defining and mining closed and maximal embedded unordered tree patterns from a single large data tree. We design an embedded frequent pattern mining algorithm extended with a local closedness checking technique. This algorithm is called closedEmbTM-prune as it eagerly eliminates non-closed

Updated: 2020-12-29
• arXiv.cs.DB Pub Date : 2020-12-26
Song-Kyoo (Amang) Kim

Bigdata is a dataset of which size is beyond the ability of handling a valuable raw material that can be refined and distilled into valuable specific insights. Compact data is a method that optimizes the big dataset that gives best assets without handling complex bigdata. The compact dataset contains the maximum knowledge patterns at fine grained level for effective and personalized utilization of

Updated: 2020-12-29
• arXiv.cs.DB Pub Date : 2020-12-24
Leonid Libkin; Liat Peterfreund

The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic with truth values true and false, to accommodate the additional truth value unknown for handling nulls. It is viewed as indispensable for SQL expressiveness but is at the same time much criticized for leading to unintuitive behavior of queries and thus being a source of programmer mistakes. We show that

Updated: 2020-12-25
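
The three-valued logic described in the abstract above can be modelled in a few lines, with Python's `None` standing in for SQL's unknown. The function names are hypothetical and the sketch covers only AND, OR, NOT, equality, and the usual rule that a WHERE clause keeps a row only when its condition is true.

```python
def and3(a, b):
    # False dominates; otherwise any unknown makes the result unknown.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    # True dominates; otherwise any unknown makes the result unknown.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    return None if a is None else (not a)

def eq3(x, y):
    """SQL-style equality: comparing with NULL (None) yields unknown."""
    if x is None or y is None:
        return None
    return x == y

def where(rows, predicate):
    """Keep rows whose predicate is True; False and unknown are both dropped."""
    return [row for row in rows if predicate(row) is True]

values = [1, 2, None]
# x = 1 OR NOT (x = 1) looks like a tautology, yet the NULL row is filtered out.
print(where(values, lambda x: or3(eq3(x, 1), not3(eq3(x, 1)))))
# [1, 2]
```

The last query shows the kind of unintuitive behavior the abstract refers to: a condition that is always true in Boolean logic still rejects rows containing nulls.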
• arXiv.cs.DB Pub Date : 2020-12-23
Hussam Abu-Libdeh; Deniz Altınbüken; Alex Beutel; Ed H. Chi; Lyric Doshi; Tim Kraska; Xiaozhou (Steve) Li; Andy Ly; Christopher Olston

There is great excitement about learned index structures, but understandable skepticism about the practicality of a new method uprooting decades of research on B-Trees. In this paper, we work to remove some of that uncertainty by demonstrating how a learned index can be integrated in a distributed, disk-based database system: Google's Bigtable. We detail several design decisions we made to integrate

Updated: 2020-12-24
• arXiv.cs.DB Pub Date : 2020-12-23
Xi Victoria Lin; Richard Socher; Caiming Xiong

We present BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational databases in cross-DB semantic parsing. BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. The hybrid sequence is encoded by BERT with minimal subsequent layers and the

Updated: 2020-12-24
• arXiv.cs.DB Pub Date : 2020-11-19
Rolysent K Paredes; Alexander A. Hernandez

Purpose: This study proposes an adaptive bandwidth management system which can be explicitly used by educational institutions. The primary goal of the system is to increase the bandwidth of the users who access educational websites more. Through this proposed bandwidth management, the users of the campus networks are encouraged to utilize the internet for educational purposes. Method: The weblog

Updated: 2020-12-24
• arXiv.cs.DB Pub Date : 2020-12-22
Albert Atserias; Phokion G. Kolaitis

Since the early days of relational databases, it was realized that acyclic hypergraphs give rise to database schemas with desirable structural and algorithmic properties. In a by-now classical paper, Beeri, Fagin, Maier, and Yannakakis established several different equivalent characterizations of acyclicity; in particular, they showed that the sets of attributes of a schema form an acyclic hypergraph

Updated: 2020-12-23
• arXiv.cs.DB Pub Date : 2020-12-21
Majid Rafiei; Wil M. P. van der Aalst

Process mining employs event logs to provide insights into the actual processes. Event logs are recorded by information systems and contain valuable information helping organizations to improve their processes. However, these data also include highly sensitive private information which is a major concern when applying process mining. Therefore, privacy preservation in process mining is growing in importance

Updated: 2020-12-23
• arXiv.cs.DB Pub Date : 2020-12-21
Mark P. J. van der Loo; Edwin de Jonge

Data validation is the activity where one decides whether or not a particular data set is fit for a given purpose. Formalizing the requirements that drive this decision process allows for unambiguous communication of the requirements, automation of the decision process, and opens up ways to maintain and investigate the decision process itself. The purpose of this article is to formalize the definition

Updated: 2020-12-23
• arXiv.cs.DB Pub Date : 2020-12-22
Nofar Carmeli; Nikolaos Tziavelis; Wolfgang Gatterbauer; Benny Kimelfeld; Mirek Riedewald

We study the question of when we can answer a Conjunctive Query (CQ) with an ordering over the answers by constructing a structure for direct (random) access to the sorted list of answers, without actually materializing this list, so that the construction time is linear (or quasilinear) in the size of the database. In the absence of answer ordering, such a construction has been devised for the task

Updated: 2020-12-23
Contents have been reproduced by permission of the publishers.
