-
Trav-SHACL: Efficiently Validating Networks of SHACL Constraints arXiv.cs.DB Pub Date : 2021-01-18 Mónica Figuera; Philipp D. Rohde; Maria-Esther Vidal
Knowledge graphs have emerged as expressive data structures for Web data. Knowledge graph potential and the demand for ecosystems to facilitate their creation, curation, and understanding, is testified in diverse domains, e.g., biomedicine. The Shapes Constraint Language (SHACL) is the W3C recommendation language for integrity constraints over RDF knowledge graphs. Enabling quality assessments of knowledge
-
Real-Time LSM-Trees for HTAP Workloads arXiv.cs.DB Pub Date : 2021-01-17 Hemant Saxena; Lukasz Golab; Stratos Idreos; Ihab F. Ilyas
Real-time data analytics systems such as SAP HANA, MemSQL, and IBM Wildfire employ hybrid data layouts, in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high data rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge
-
AMALGAM: A Matching Approach to fairify tabuLar data with knowledGe grAph Model arXiv.cs.DB Pub Date : 2021-01-17 Rabia Azzi; Gayo Diallo
In this paper we present AMALGAM, a matching approach to fairify tabular data with the use of a knowledge graph. The ultimate goal is to provide a fast and efficient approach to annotate tabular data with entities from a background knowledge. The approach combines lookup and filtering services with text pre-processing techniques. Experiments conducted in the context of the 2020 Semantic Web
-
Time-Efficient and High-Quality Graph Partitioning for Graph Dynamic Scaling arXiv.cs.DB Pub Date : 2021-01-18 Masatoshi Hanai; Nikos Tziritas; Toyotaro Suzumura; Wentong Cai; Georgios Theodoropoulos
The dynamic scaling of distributed computations plays an important role in the utilization of elastic computational resources, such as the cloud. It enables the provisioning and de-provisioning of resources to match dynamic resource availability and demands. In the case of distributed graph processing, changing the number of the graph partitions while maintaining high partitioning quality imposes serious
-
A System for Efficiently Hunting for Cyber Threats in Computer Systems Using Threat Intelligence arXiv.cs.DB Pub Date : 2021-01-17 Peng Gao; Fei Shao; Xiaoyuan Liu; Xusheng Xiao; Haoyuan Liu; Zheng Qin; Fengyuan Xu; Prateek Mittal; Sanjeev R. Kulkarni; Dawn Song
Log-based cyber threat hunting has emerged as an important solution to counter sophisticated cyber attacks. However, existing approaches require non-trivial efforts of manual query construction and have overlooked the rich external knowledge about threat behaviors provided by open-source Cyber Threat Intelligence (OSCTI). To bridge the gap, we build ThreatRaptor, a system that facilitates cyber threat
-
Data stream fusion for accurate quantile tracking and analysis arXiv.cs.DB Pub Date : 2021-01-17 Massimo Cafaro; Catiuscia Melle; Italo Epicoco; Marco Pulimeno
UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSKETCH. In this paper we show how to compress and fuse data streams (or datasets) by using UDDSKETCH data summaries
-
Towards Approximate Query Enumeration with Sublinear Preprocessing Time arXiv.cs.DB Pub Date : 2021-01-15 Isolde Adler; Polly Fahey
This paper aims at providing extremely efficient algorithms for approximate query enumeration on sparse databases, that come with performance and accuracy guarantees. We introduce a new model for approximate query enumeration on classes of relational databases of bounded degree. We first prove that on databases of bounded degree any local first-order definable query can be enumerated approximately
-
EAGER: Embedding-Assisted Entity Resolution for Knowledge Graphs arXiv.cs.DB Pub Date : 2021-01-15 Daniel Obraczka; Jonathan Schuchart; Erhard Rahm
Entity Resolution (ER) is a constitutional part for integrating different knowledge graphs in order to identify entities referring to the same real-world object. A promising approach is the use of graph embeddings for ER in order to determine the similarity of entities based on the similarity of their graph neighborhood. The similarity computations for such embeddings translate to calculating the
-
Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization arXiv.cs.DB Pub Date : 2021-01-13 Adel Ardalan; Derek Paulsen; Amanpreet Singh Saini; Walter Cai; AnHai Doan
Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in
-
Immutable and Democratic Data in permissionless Peer-to-Peer Systems arXiv.cs.DB Pub Date : 2021-01-13 Maximilian Ernst Tschuchnig; Dejan Radovanovic; Eduard Hirsch; Anna-Maria Oberluggauer; Georg Schäfer
Conventional data storage methods like SQL and NoSQL offer a huge amount of possibilities with one major disadvantage, having to use a centralized authority. This authority may be in the form of a centralized or decentralized master server or a permissioned peer-to-peer setting. This paper looks at different technologies on how to persist data without using a central authority, mainly looking at permissionless
-
Flow-Loss: Learning Cardinality Estimates That Matter arXiv.cs.DB Pub Date : 2021-01-13 Parimarjan Negi; Ryan Marcus; Andreas Kipf; Hongzi Mao; Nesime Tatbul; Tim Kraska; Mohammad Alizadeh
Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's
-
Privacy Aspects of Provenance Queries arXiv.cs.DB Pub Date : 2021-01-12 Tanja Auge; Nic Scharlau; Andreas Heuer
Given a query result of a big database, why-provenance can be used to calculate the necessary part of this database, consisting of so-called witnesses. If this database consists of personal data, privacy protection has to prevent the publication of these witnesses. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy). In this
-
DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks arXiv.cs.DB Pub Date : 2021-01-11 Arif Usta; Akifhan Karakayali; Özgür Ulusoy
Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed in relational databases is a challenging task, which has been widely studied in database community recently. Conventional rule based systems utilize series of solutions as a pipeline to deal with each step of this task, namely stop word filtering, tokenization, stemming/lemmatization, parsing, tagging
-
Enumeration Algorithms for Conjunctive Queries with Projection arXiv.cs.DB Pub Date : 2021-01-11 Shaleen Deep; Xiao Hu; Paraschos Koutris
We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain
-
Query Lifting: Language-integrated query for heterogeneous nested collections arXiv.cs.DB Pub Date : 2021-01-11 Wilmer Ricciotti; James Cheney
Language-integrated query based on comprehension syntax is a powerful technique for safe database programming, and provides a basis for advanced techniques such as query shredding or query flattening that allow efficient programming with complex nested collections. However, the foundations of these techniques are lacking: although SQL, the most widely-used database query language, supports heterogeneous
-
FlashP: An Analytical Pipeline for Real-time Forecasting of Time-Series Relational Data arXiv.cs.DB Pub Date : 2021-01-09 Shuyuan Yan; Bolin Ding; Wei Guo; Jingren Zhou; Zhewei Wei; Xiaowei Jiang; Sheng Xu
Interactive response time is important in analytical pipelines for users to explore a sufficient number of possibilities and make informed business decisions. We consider a forecasting pipeline with large volumes of high-dimensional time series data. Real-time forecasting can be conducted in two steps. First, we specify the portion of data to be focused on and the measure to be predicted by slicing
-
Answer Counting under Guarded TGDs arXiv.cs.DB Pub Date : 2021-01-08 Cristina Feier; Carsten Lutz; Marcin Przybyłko
We study the complexity of answer counting for ontology-mediated queries and for querying under constraints, considering conjunctive queries and unions thereof (UCQs) as the query language and guarded TGDs as the ontology and constraint language, respectively. Our main result is a classification according to whether answer counting is fixed-parameter tractable (FPT), W[1]-equivalent, #W[1]-equivalent
-
Dataset Definition Standard (DDS) arXiv.cs.DB Pub Date : 2021-01-07 Cyril Cappi; Camille Chapdelaine; Laurent Gardes; Eric Jenn; Baptiste Lefevre; Sylvaine Picard; Thomas Soumarmon
This document gives a set of recommendations to build and manipulate the datasets used to develop and/or validate machine learning models such as deep neural networks. This document is one of the 3 documents defined in [1] to ensure the quality of datasets. This is a work in progress as good practices evolve along with our understanding of machine learning. The document is divided into three main parts
-
Approximate Query Processing for Group-By Queries based on Conditional Generative Models arXiv.cs.DB Pub Date : 2021-01-08 Meifan Zhang; Hongzhi Wang
The Group-By query is an important kind of query, which is common and widely used in data warehouses, data analytics, and data visualization. Approximate query processing is an effective way to increase the querying efficiency on big data. The answer to a group-by query involves multiple values, which makes it difficult to provide sufficiently accurate estimations for all the groups. Stratified sampling
-
Spatial Object Recommendation with Hints: When Spatial Granularity Matters arXiv.cs.DB Pub Date : 2021-01-08 Hui Luo; Jingbo Zhou; Zhifeng Bao; Shuangli Li; J. Shane Culpepper; Haochao Ying; Hao Liu; Hui Xiong
Existing spatial object recommendation algorithms generally treat objects identically when ranking them. However, spatial objects often cover different levels of spatial granularity and thereby are heterogeneous. For example, one user may prefer to be recommended a region (say Manhattan), while another user might prefer a venue (say a restaurant). Even for the same user, preferences can change at different
-
An Algorithm for the Discovery of Independence from Data arXiv.cs.DB Pub Date : 2021-01-07 Miika Hannula; Bor-Kuan Song; Sebastian Link
For years, independence has been considered as an important concept in many disciplines. Nevertheless, we present the first research that investigates the discovery problem of independence in data. In its arguably simplest form, independence is a statement between two sets of columns expressing that for every two rows in a table there is also a row in the table that coincides with the first row on
-
Efficient Data Management in Neutron Scattering Data Reduction Workflows at ORNL arXiv.cs.DB Pub Date : 2021-01-05 William F Godoy; Peter F Peterson; Steven E Hahn; Jay J Billings
Oak Ridge National Laboratory (ORNL) experimental neutron science facilities produce 1.2 TB a day of raw event-based data that is stored using the standard metadata-rich NeXus schema built on top of the HDF5 file format. Performance of several data reduction workflows is largely determined by the amount of time spent on the loading and processing algorithms in Mantid, an open-source data analysis
-
Controlling Entity Integrity with Key Sets arXiv.cs.DB Pub Date : 2021-01-07 Miika Hannula; Xinyi Li; Sebastian Link
Codd's rule of entity integrity stipulates that every table has a primary key. Hence, the attributes of the primary key carry unique and complete value combinations. In practice, data cannot always meet such requirements. Previous work proposed the superior notion of key sets for controlling entity integrity. We establish a linear-time algorithm for validating whether a given key set holds on a given
-
On the Interaction of Functional and Inclusion Dependencies with Independence Atoms arXiv.cs.DB Pub Date : 2021-01-07 Miika Hannula; Juha Kontinen; Sebastian Link
Infamously, the finite and unrestricted implication problems for the classes of i) functional and inclusion dependencies together, and ii) embedded multivalued dependencies alone are each undecidable. Famously, the restriction of i) to functional and unary inclusion dependencies in combination with the restriction of ii) to multivalued dependencies yield implication problems that are still different
-
Privacy-Preserving Data Publishing in Process Mining arXiv.cs.DB Pub Date : 2021-01-04 Majid Rafiei; Wil M. P. van der Aalst
Process mining aims to provide insights into the actual processes based on event data. These data are often recorded by information systems and are widely available. However, they often contain sensitive private information that should be analyzed responsibly. Therefore, privacy issues in process mining are recently receiving more attention. Privacy preservation techniques obviously need to modify
-
Efficient Discovery of Approximate Order Dependencies arXiv.cs.DB Pub Date : 2021-01-06 Reza Karegar; Parke Godfrey; Lukasz Golab; Mehdi Kargar; Divesh Srivastava; Jaroslaw Szlichta
Order dependencies (ODs) capture relationships between ordered domains of attributes. Approximate ODs (AODs) capture such relationships even when there exist exceptions in the data. During automated discovery of ODs, validation is the process of verifying whether an OD holds. We present an algorithm for validating approximate ODs with significantly improved runtime performance over existing methods
-
Bridging BAD Islands: Declarative Data Sharing at Scale arXiv.cs.DB Pub Date : 2021-01-06 Xikui Wang; Michael J. Carey; Vassilis J. Tsotras
In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing Big Data at scale, developers would have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires a significant amount of work from developers. In our
-
Fine-Grained Complexity of Regular Path Queries arXiv.cs.DB Pub Date : 2021-01-06 Katrin Casel; Markus L. Schmid
A regular path query (RPQ) is a regular expression q that returns all node pairs (u, v) from a graph database that are connected by an arbitrary path labelled with a word from L(q). The obvious algorithmic approach to RPQ-evaluation (called PG-approach), i.e., constructing the product graph between an NFA for q and the graph database, is appealing due to its simplicity and also leads to efficient algorithms
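The product-graph idea mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name `eval_rpq`, the edge-triple encoding of the graph, and the dictionary encoding of an epsilon-free NFA are all assumptions made for the example.

```python
from collections import deque


def eval_rpq(edges, nfa_delta, nfa_start, nfa_accept):
    """Return all (u, v) such that some path from u to v spells a word the NFA accepts.

    edges: iterable of (source, label, target) triples (hypothetical graph encoding).
    nfa_delta: dict mapping (state, label) -> set of successor states, no epsilon moves.
    """
    # Adjacency list of the graph: node -> list of (label, neighbour).
    adj = {}
    nodes = set()
    for s, a, t in edges:
        adj.setdefault(s, []).append((a, t))
        nodes.update((s, t))

    answers = set()
    for u in nodes:
        # Breadth-first search over product vertices (graph node, NFA state).
        seen = {(u, nfa_start)}
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state in nfa_accept:
                answers.add((u, node))
            for label, nxt in adj.get(node, []):
                for q2 in nfa_delta.get((state, label), ()):
                    if (nxt, q2) not in seen:
                        seen.add((nxt, q2))
                        queue.append((nxt, q2))
    return answers


# NFA for the regular expression a b*: state 0 is the start, state 1 is accepting.
delta = {(0, "a"): {1}, (1, "b"): {1}}
graph = [("x", "a", "y"), ("y", "b", "z")]
result = eval_rpq(graph, delta, 0, {1})
```

The product graph is never materialised here; each search explores only the product vertices reachable from one source node.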
-
Connecting The Dots To Combat Collective Fraud arXiv.cs.DB Pub Date : 2021-01-06 Mingxi Wu; Xi Chen
Modern fraudsters write malicious programs to coordinate a group of accounts to commit collective fraud for illegal profits in online platforms. These programs have access to a set of finite resources - a set of IPs, devices, and accounts etc. and sometimes manipulate fake accounts to collaboratively attack the target system. Inspired by these observations, we share our experience in building two real-time
-
A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration arXiv.cs.DB Pub Date : 2021-01-05 Hai Lan; Zhifeng Bao; Yuwei Peng
The query optimizer is at the heart of database systems. The cost-based optimizer studied in this paper is adopted in almost all current database systems. A cost-based optimizer introduces a plan enumeration algorithm to find a (sub)plan, and then uses a cost model to obtain the cost of that plan, and selects the plan with the lowest cost. In the cost model, cardinality, the number of tuples through an
-
Exploring Data and Knowledge combined Anomaly Explanation of Multivariate Industrial Data arXiv.cs.DB Pub Date : 2021-01-05 Xiaoou Ding; Hongzhi Wang; Chen Wang; Zijue Li; Zheng Liang
The demand for high-performance anomaly detection techniques for IoT data is becoming urgent, especially in the industrial field. Anomaly identification and explanation in time series data is an essential task in IoT data mining. Since existing anomaly detection techniques focus on the identification of anomalies, the explanation of anomalies is not well-solved. We address the anomaly explanation
-
GeCo: Quality Counterfactual Explanations in Real Time arXiv.cs.DB Pub Date : 2021-01-05 Maximilian Schleich; Zixuan Geng; Yihong Zhang; Dan Suciu
Machine learning is increasingly applied in high-stakes decision making that directly affects people's lives, and this leads to an increased demand for systems to explain their decisions. Explanations often take the form of counterfactuals, which consist of conveying to the end user what she/he needs to change in order to improve the outcome. Computing counterfactual explanations is challenging, because
-
Searching Personalized $k$-wing in Large and Dynamic Bipartite Graphs arXiv.cs.DB Pub Date : 2021-01-04 Aman Abidi; Lu Chen; Rui Zhou; Chengfei Liu
There are extensive studies focusing on the application scenario that all the bipartite cohesive subgraphs need to be discovered in a bipartite graph. However, we observe that, for some applications, one is interested in finding bipartite cohesive subgraphs containing a specific vertex. In this paper, we study a new query dependent bipartite cohesive subgraph search problem based on $k$-wing model
-
A Pluggable Learned Index Method via Sampling and Gap Insertion arXiv.cs.DB Pub Date : 2021-01-04 Yaliang Li; Daoyuan Chen; Bolin Ding; Kai Zeng; Jingren Zhou
Database indexes facilitate data retrieval and benefit broad applications in real-world systems. Recently, a new family of index, named learned index, is proposed to learn hidden yet useful data distribution and incorporate such information into the learning of indexes, which leads to promising performance improvements. However, the "learning" process of learned indexes is still under-explored. In
-
To Share, or not to Share Online Event Trend Aggregation Over Bursty Event Streams arXiv.cs.DB Pub Date : 2021-01-02 Olga Poppe; Chuan Lei; Lei Ma; Allison Rozet; Elke A. Rundensteiner
Complex event processing (CEP) systems continuously evaluate large workloads of pattern queries under tight time constraints. Event trend aggregation queries with Kleene patterns are commonly used to retrieve summarized insights about the recent trends in event streams. State-of-art methods are limited either due to repetitive computations or unnecessary trend construction. Existing shared approaches
-
Optimizing Data Cube Visualization for Web Applications: Performance and User-Friendly Data Aggregation arXiv.cs.DB Pub Date : 2021-01-01 Daniel Szelogowski
Current open source applications which allow for cross-platform data visualization of OLAP cubes feature issues of high overhead and inconsistency due to data oversimplification. To improve upon this issue, there is a need to cut down the number of pipelines that the data must travel between for these aggregation operations and create a single, unified application which performs efficiently without
-
Visualization Techniques with Data Cubes: Utilizing Concurrency for Complex Data arXiv.cs.DB Pub Date : 2021-01-01 Daniel Szelogowski
With web and mobile platforms becoming more prominent devices utilized in data analysis, there are currently few systems which are not without flaw. In order to increase the performance of these systems and decrease errors of data oversimplification, we seek to understand how other programming languages can be used across these platforms which provide data and type safety, as well as utilizing concurrency
-
New Directions in Cloud Programming arXiv.cs.DB Pub Date : 2021-01-04 Alvin Cheung; Natacha Crooks; Joseph M. Hellerstein; Matthew Milano
Nearly twenty years after the launch of AWS, it remains difficult for most developers to harness the enormous potential of the cloud. In this paper we lay out an agenda for a new generation of cloud programming research aimed at bringing research ideas to programmers in an evolutionary fashion. Key to our approach is a separation of distributed programs into a PACT of four facets: Program semantics
-
SetSketch: Filling the Gap between MinHash and HyperLogLog arXiv.cs.DB Pub Date : 2021-01-01 Otmar Ertl
MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to
-
Chunk List: Concurrent Data Structures arXiv.cs.DB Pub Date : 2021-01-01 Daniel Szelogowski
Chunking data is obviously no new concept; however, I had never found any data structures that used chunking as the basis of their implementation. I figured that by using chunking alongside concurrency, I could create an extremely fast run-time in regards to particular methods as searching and/or sorting. By using chunking and concurrency to my advantage, I came up with the chunk list - a dynamic list-based
-
Kamino: Constraint-Aware Differentially Private Data Synthesis arXiv.cs.DB Pub Date : 2020-12-31 Chang Ge; Shubhankar Mohapatra; Xi He; Ihab F. Ilyas
Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring the privacy of individual data records. Existing differentially private data synthesis methods aim to generate useful data based on applications, but
-
bloomRF: On Performing Range-Queries with Bloom-Filters based on Piecewise-Monotone Hash Functions and Dyadic Trace-Trees arXiv.cs.DB Pub Date : 2020-12-31 Christian Riegger; Arthur Bernhardt; Bernhard Moessner; Ilia Petrov
We introduce bloomRF as a unified method for approximate membership testing that supports both point- and range-queries on a single data structure. bloomRF extends Bloom-Filters with range query support and may replace them. The core idea is to employ a dyadic interval scheme to determine the set of dyadic intervals covering a data point, which are then encoded and inserted. bloomRF introduces Dyadic
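As a rough illustration of the dyadic interval scheme mentioned in this abstract, the sketch below lists the dyadic intervals that cover a single integer point. The function name `dyadic_intervals`, the integer domain `[0, 2**bits)`, and the half-open interval representation are assumptions for the example; how bloomRF encodes and inserts these intervals is not shown in the text above and is not modelled here.

```python
def dyadic_intervals(x, bits):
    """List the dyadic intervals [lo, hi) over [0, 2**bits) that contain x, finest first."""
    if not 0 <= x < (1 << bits):
        raise ValueError("x lies outside the domain")
    intervals = []
    for level in range(bits + 1):
        # Clearing the lowest `level` bits gives the left end of the
        # length-2**level dyadic interval that contains x.
        lo = (x >> level) << level
        intervals.append((lo, lo + (1 << level)))
    return intervals


covering = dyadic_intervals(5, 3)
```

Each point lies in exactly one dyadic interval per level, so a domain of `2**bits` values yields `bits + 1` covering intervals.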
-
On the importance of functions in data modeling arXiv.cs.DB Pub Date : 2020-12-31 Alexandr Savinov
In this paper we argue that representing entity properties by tuple attributes, as evangelized in most set-oriented data models, is a controversial method conflicting with the principle of tuple immutability. As a principled solution to this problem of tuple immutability on one hand and the need to modify tuple attributes on the other hand, we propose to use mathematical functions for representing
-
Similarity Classification of Public Transit Stations arXiv.cs.DB Pub Date : 2020-12-30 Hannah Bast; Patrick Brosi; Markus Näther
We study the following problem: given two public transit station identifiers A and B, each with a label and a geographic coordinate, decide whether A and B describe the same station. For example, for "St Pancras International" at (51.5306, -0.1253) and "London St Pancras" at (51.5319, -0.1269), the answer would be "Yes". This problem frequently arises in areas where public transit data is used, for
-
BayesCard: A Unified Bayesian Framework for Cardinality Estimation arXiv.cs.DB Pub Date : 2020-12-29 Ziniu Wu; Amir Shaikhha
Cardinality estimation is one of the fundamental problems in database management systems and it is an essential component in query optimizers. Traditional machine-learning-based approaches use probabilistic models such as Bayesian Networks (BNs) to learn joint distributions on data. Recent research advocates for using deep unsupervised learning and achieves state-of-the-art performance in estimating
-
Faster Distance-Based Representative Skyline and $k$-Center Along Pareto Front in the Plane arXiv.cs.DB Pub Date : 2020-12-31 Sergio Cabello
We consider the problem of computing the distance-based representative skyline in the plane, a problem introduced by Tao, Ding, Lin and Pei [Proc. 25th IEEE International Conference on Data Engineering (ICDE), 2009] and independently considered by Dupin, Nielsen and Talbi [Optimization and Learning - Third International Conference, OLA 2020] in the context of multi-objective optimization. Given
-
Misplaced Subsequences Repairing with Application to Multivariate Industrial Time Series Data arXiv.cs.DB Pub Date : 2020-12-29 Xiaoou Ding; Hongzhi Wang; Jiaxuan Su; Chen Wang; Hong Gao
Both the volume and the collection velocity of time series generated by monitoring sensors are increasing in the Internet of Things (IoT). Data management and analysis requires high quality and applicability of the IoT data. However, errors are prevalent in original time series data. Inconsistency in time series is a serious data quality problem existing widely in IoT. Such problem could be hardly
-
Example-Driven User Intent Discovery: Empowering Users to Cross the SQL Barrier Through Query by Example arXiv.cs.DB Pub Date : 2020-12-29 Anna Fariha; Lucy Cousins; Narges Mahyar; Alexandra Meliou
Traditional data systems require specialized technical skills where users need to understand the data organization and write precise queries to access data. Therefore, novice users who lack technical expertise face hurdles in perusing and analyzing data. Existing tools assist in formulating queries through keyword search, query recommendation, and query auto-completion, but still require some technical
-
Fast Subgraph Matching by Exploiting Search Failures arXiv.cs.DB Pub Date : 2020-12-28 Junya Arai; Makoto Onizuka; Yasuhiro Fujiwara; Sotetsu Iwamura
Subgraph matching is a compute-intensive problem that asks to enumerate all the isomorphic embeddings of a query graph within a data graph. This problem is generally solved with backtracking, which recursively evolves every possible partial embedding until it becomes an isomorphic embedding or is found unable to become it. While existing methods reduce the search space by analyzing graph structures
-
Recommending Courses in MOOCs for Jobs: An Auto Weak Supervision Approach arXiv.cs.DB Pub Date : 2020-12-28 Bowen Hao; Jing Zhang; Cuiping Li; Hong Chen; Hongzhi Yin
The proliferation of massive open online courses (MOOCs) demands an effective way of course recommendation for jobs posted in recruitment websites, especially for the people who take MOOCs to find new jobs. Despite the advances of supervised ranking models, the lack of enough supervised signals prevents us from directly learning a supervised ranking model. This paper proposes a general automated weak
-
Discovering Closed and Maximal Embedded Patterns from Large Tree Data arXiv.cs.DB Pub Date : 2020-12-26 Xiaoying Wu; Dimitri Theodoratos; Nikos Mamoulis
We address the problem of summarizing embedded tree patterns extracted from large data trees. We do so by defining and mining closed and maximal embedded unordered tree patterns from a single large data tree. We design an embedded frequent pattern mining algorithm extended with a local closedness checking technique. This algorithm is called closedEmbTM-prune as it eagerly eliminates non-closed
-
Toward Compact Data from Big Data arXiv.cs.DB Pub Date : 2020-12-26 Song-Kyoo (Amang) Kim
Bigdata is a dataset of which size is beyond the ability of handling a valuable raw material that can be refined and distilled into valuable specific insights. Compact data is a method that optimizes the big dataset that gives best assets without handling complex bigdata. The compact dataset contains the maximum knowledge patterns at fine grained level for effective and personalized utilization of
-
Handling SQL Nulls with Two-Valued Logic arXiv.cs.DB Pub Date : 2020-12-24 Leonid Libkin; Liat Peterfreund
The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic with truth values true and false, to accommodate the additional truth value unknown for handling nulls. It is viewed as indispensable for SQL expressiveness but is at the same time much criticized for leading to unintuitive behavior of queries and thus being a source of programmer mistakes. We show that
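The three-valued logic described in this abstract can be made concrete with a small sketch. It models SQL's `unknown` as Python's `None` and uses hypothetical helper names (`and3`, `or3`, `not3`, `eq3`). It only illustrates the standard 3VL truth tables, not the two-valued alternative that the paper goes on to study.

```python
def and3(a, b):
    """Three-valued AND: False dominates, then unknown (None)."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    """Three-valued OR: True dominates, then unknown (None)."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a):
    """Three-valued NOT: unknown stays unknown."""
    return None if a is None else (not a)


def eq3(x, y):
    """SQL-style equality: comparing with a null yields unknown."""
    return None if x is None or y is None else x == y


# WHERE x = NULL OR NOT (x = NULL) with x = 1: neither true nor false.
condition = or3(eq3(1, None), not3(eq3(1, None)))
```

Because a `WHERE` clause keeps only rows whose condition is true, a row for which this condition evaluates to unknown is filtered out even though the predicate looks like a tautology, which is the kind of unintuitive behavior the abstract refers to.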
-
Learned Indexes for a Google-scale Disk-based Database arXiv.cs.DB Pub Date : 2020-12-23 Hussam Abu-Libdeh; Deniz Altınbüken; Alex Beutel; Ed H. Chi; Lyric Doshi; Tim Kraska; Xiaozhou (Steve) Li; Andy Ly; Christopher Olston
There is great excitement about learned index structures, but understandable skepticism about the practicality of a new method uprooting decades of research on B-Trees. In this paper, we work to remove some of that uncertainty by demonstrating how a learned index can be integrated in a distributed, disk-based database system: Google's Bigtable. We detail several design decisions we made to integrate
-
Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing arXiv.cs.DB Pub Date : 2020-12-23 Xi Victoria Lin; Richard Socher; Caiming Xiong
We present BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational databases in cross-DB semantic parsing. BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. The hybrid sequence is encoded by BERT with minimal subsequent layers and the
-
Designing an Adaptive Bandwidth Management for Higher Education Institutions arXiv.cs.DB Pub Date : 2020-11-19 Rolysent K Paredes; Alexander A. Hernandez
Purpose: This study proposes an adaptive bandwidth management system which can be explicitly used by educational institutions. The primary goal of the system is to increase the bandwidth of the users who access educational websites more. Through this proposed bandwidth management, the users of the campus networks are encouraged to utilize the internet for educational purposes. Method: The weblog
-
Structure and Complexity of Bag Consistency arXiv.cs.DB Pub Date : 2020-12-22 Albert Atserias; Phokion G. Kolaitis
Since the early days of relational databases, it was realized that acyclic hypergraphs give rise to database schemas with desirable structural and algorithmic properties. In a by-now classical paper, Beeri, Fagin, Maier, and Yannakakis established several different equivalent characterizations of acyclicity; in particular, they showed that the sets of attributes of a schema form an acyclic hypergraph
-
Towards Quantifying Privacy in Process Mining arXiv.cs.DB Pub Date : 2020-12-21 Majid Rafiei; Wil M. P. van der Aalst
Process mining employs event logs to provide insights into the actual processes. Event logs are recorded by information systems and contain valuable information helping organizations to improve their processes. However, these data also include highly sensitive private information which is a major concern when applying process mining. Therefore, privacy preservation in process mining is growing in importance
-
Data Validation arXiv.cs.DB Pub Date : 2020-12-21 Mark P. J. van der Loo; Edwin de Jonge
Data validation is the activity where one decides whether or not a particular data set is fit for a given purpose. Formalizing the requirements that drive this decision process allows for unambiguous communication of the requirements, automation of the decision process, and opens up ways to maintain and investigate the decision process itself. The purpose of this article is to formalize the definition
-
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries arXiv.cs.DB Pub Date : 2020-12-22 Nofar Carmeli; Nikolaos Tziavelis; Wolfgang Gatterbauer; Benny Kimelfeld; Mirek Riedewald
We study the question of when we can answer a Conjunctive Query (CQ) with an ordering over the answers by constructing a structure for direct (random) access to the sorted list of answers, without actually materializing this list, so that the construction time is linear (or quasilinear) in the size of the database. In the absence of answer ordering, such a construction has been devised for the task
Contents have been reproduced by permission of the publishers.