Current journal: arXiv - CS - Databases
  • Trav-SHACL: Efficiently Validating Networks of SHACL Constraints
    arXiv.cs.DB Pub Date : 2021-01-18
    Mónica Figuera; Philipp D. Rohde; Maria-Esther Vidal

    Knowledge graphs have emerged as expressive data structures for Web data. Knowledge graph potential and the demand for ecosystems to facilitate their creation, curation, and understanding, is testified in diverse domains, e.g., biomedicine. The Shapes Constraint Language (SHACL) is the W3C recommendation language for integrity constraints over RDF knowledge graphs. Enabling quality assessments of knowledge

    Updated: 2021-01-19
  • Real-Time LSM-Trees for HTAP Workloads
    arXiv.cs.DB Pub Date : 2021-01-17
    Hemant Saxena; Lukasz Golab; Stratos Idreos; Ihab F. Ilyas

    Real-time data analytics systems such as SAP HANA, MemSQL, and IBM Wildfire employ hybrid data layouts, in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high data rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge

    Updated: 2021-01-19
  • AMALGAM: A Matching Approach to fairify tabuLar data with knowledGe grAph Model
    arXiv.cs.DB Pub Date : 2021-01-17
    Rabia Azzi; Gayo Diallo

    In this paper we present AMALGAM, a matching approach to fairify tabular data with the use of a knowledge graph. The ultimate goal is to provide a fast and efficient approach to annotate tabular data with entities from a background knowledge. The approach combines lookup and filtering services with text pre-processing techniques. Experiments conducted in the context of the 2020 Semantic Web

    Updated: 2021-01-19
  • Time-Efficient and High-Quality Graph Partitioning for Graph Dynamic Scaling
    arXiv.cs.DB Pub Date : 2021-01-18
    Masatoshi Hanai; Nikos Tziritas; Toyotaro Suzumura; Wentong Cai; Georgios Theodoropoulos

    The dynamic scaling of distributed computations plays an important role in the utilization of elastic computational resources, such as the cloud. It enables the provisioning and de-provisioning of resources to match dynamic resource availability and demands. In the case of distributed graph processing, changing the number of the graph partitions while maintaining high partitioning quality imposes serious

    Updated: 2021-01-19
  • A System for Efficiently Hunting for Cyber Threats in Computer Systems Using Threat Intelligence
    arXiv.cs.DB Pub Date : 2021-01-17
    Peng Gao; Fei Shao; Xiaoyuan Liu; Xusheng Xiao; Haoyuan Liu; Zheng Qin; Fengyuan Xu; Prateek Mittal; Sanjeev R. Kulkarni; Dawn Song

    Log-based cyber threat hunting has emerged as an important solution to counter sophisticated cyber attacks. However, existing approaches require non-trivial efforts of manual query construction and have overlooked the rich external knowledge about threat behaviors provided by open-source Cyber Threat Intelligence (OSCTI). To bridge the gap, we build ThreatRaptor, a system that facilitates cyber threat

    Updated: 2021-01-19
  • Data stream fusion for accurate quantile tracking and analysis
    arXiv.cs.DB Pub Date : 2021-01-17
    Massimo Cafaro; Catiuscia Melle; Italo Epicoco; Marco Pulimeno

    UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSKETCH. In this paper we show how to compress and fuse data streams (or datasets) by using UDDSKETCH data summaries

    Updated: 2021-01-19
  • Towards Approximate Query Enumeration with Sublinear Preprocessing Time
    arXiv.cs.DB Pub Date : 2021-01-15
    Isolde Adler; Polly Fahey

    This paper aims at providing extremely efficient algorithms for approximate query enumeration on sparse databases, that come with performance and accuracy guarantees. We introduce a new model for approximate query enumeration on classes of relational databases of bounded degree. We first prove that on databases of bounded degree any local first-order definable query can be enumerated approximately

    Updated: 2021-01-18
  • EAGER: Embedding-Assisted Entity Resolution for Knowledge Graphs
    arXiv.cs.DB Pub Date : 2021-01-15
    Daniel Obraczka; Jonathan Schuchart; Erhard Rahm

    Entity Resolution (ER) is a constitutional part for integrating different knowledge graphs in order to identify entities referring to the same real-world object. A promising approach is the use of graph embeddings for ER in order to determine the similarity of entities based on the similarity of their graph neighborhood. The similarity computations for such embeddings translate to calculating the

    Updated: 2021-01-18
  • Toward Data Cleaning with a Target Accuracy: A Case Study for Value Normalization
    arXiv.cs.DB Pub Date : 2021-01-13
    Adel Ardalan; Derek Paulsen; Amanpreet Singh Saini; Walter Cai; AnHai Doan

    Many applications need to clean data with a target accuracy. As far as we know, this problem has not been studied in depth. In this paper we take the first step toward solving it. We focus on value normalization (VN), the problem of replacing all strings that refer to the same entity with a unique string. VN is ubiquitous, and we often want to do VN with 100% accuracy. This is typically done today in

    Updated: 2021-01-15
  • Immutable and Democratic Data in permissionless Peer-to-Peer Systems
    arXiv.cs.DB Pub Date : 2021-01-13
    Maximilian Ernst Tschuchnig; Dejan Radovanovic; Eduard Hirsch; Anna-Maria Oberluggauer; Georg Schäfer

    Conventional data storage methods like SQL and NoSQL offer a huge amount of possibilities with one major disadvantage, having to use a centralized authority. This authority may be in the form of a centralized or decentralized master server or a permissioned peer-to-peer setting. This paper looks at different technologies on how to persist data without using a central authority, mainly looking at permissionless

    Updated: 2021-01-14
  • Flow-Loss: Learning Cardinality Estimates That Matter
    arXiv.cs.DB Pub Date : 2021-01-13
    Parimarjan Negi; Ryan Marcus; Andreas Kipf; Hongzi Mao; Nesime Tatbul; Tim Kraska; Mohammad Alizadeh

    Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's

    Updated: 2021-01-14
  • Privacy Aspects of Provenance Queries
    arXiv.cs.DB Pub Date : 2021-01-12
    Tanja Auge; Nic Scharlau; Andreas Heuer

    Given a query result of a big database, why-provenance can be used to calculate the necessary part of this database, consisting of so-called witnesses. If this database consists of personal data, privacy protection has to prevent the publication of these witnesses. This implies a natural conflict of interest between publishing original data (provenance) and protecting these data (privacy). In this

    Updated: 2021-01-13
  • DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using Bi-Directional Recurrent Neural Networks
    arXiv.cs.DB Pub Date : 2021-01-11
    Arif Usta; Akifhan Karakayali; Özgür Ulusoy

    Translating Natural Language Queries (NLQs) to Structured Query Language (SQL) in interfaces deployed in relational databases is a challenging task, which has been widely studied in database community recently. Conventional rule based systems utilize series of solutions as a pipeline to deal with each step of this task, namely stop word filtering, tokenization, stemming/lemmatization, parsing, tagging

    Updated: 2021-01-13
  • Enumeration Algorithms for Conjunctive Queries with Projection
    arXiv.cs.DB Pub Date : 2021-01-11
    Shaleen Deep; Xiao Hu; Paraschos Koutris

    We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain

    Updated: 2021-01-12
  • Query Lifting: Language-integrated query for heterogeneous nested collections
    arXiv.cs.DB Pub Date : 2021-01-11
    Wilmer Ricciotti; James Cheney

    Language-integrated query based on comprehension syntax is a powerful technique for safe database programming, and provides a basis for advanced techniques such as query shredding or query flattening that allow efficient programming with complex nested collections. However, the foundations of these techniques are lacking: although SQL, the most widely-used database query language, supports heterogeneous

    Updated: 2021-01-12
  • FlashP: An Analytical Pipeline for Real-time Forecasting of Time-Series Relational Data
    arXiv.cs.DB Pub Date : 2021-01-09
    Shuyuan Yan; Bolin Ding; Wei Guo; Jingren Zhou; Zhewei Wei; Xiaowei Jiang; Sheng Xu

    Interactive response time is important in analytical pipelines for users to explore a sufficient number of possibilities and make informed business decisions. We consider a forecasting pipeline with large volumes of high-dimensional time series data. Real-time forecasting can be conducted in two steps. First, we specify the portion of data to be focused on and the measure to be predicted by slicing

    Updated: 2021-01-12
  • Answer Counting under Guarded TGDs
    arXiv.cs.DB Pub Date : 2021-01-08
    Cristina Feier; Carsten Lutz; Marcin Przybyłko

    We study the complexity of answer counting for ontology-mediated queries and for querying under constraints, considering conjunctive queries and unions thereof (UCQs) as the query language and guarded TGDs as the ontology and constraint language, respectively. Our main result is a classification according to whether answer counting is fixed-parameter tractable (FPT), W[1]-equivalent, #W[1]-equivalent

    Updated: 2021-01-11
  • Dataset Definition Standard (DDS)
    arXiv.cs.DB Pub Date : 2021-01-07
    Cyril Cappi; Camille Chapdelaine; Laurent Gardes; Eric Jenn; Baptiste Lefevre; Sylvaine Picard; Thomas Soumarmon

    This document gives a set of recommendations to build and manipulate the datasets used to develop and/or validate machine learning models such as deep neural networks. This document is one of the 3 documents defined in [1] to ensure the quality of datasets. This is a work in progress as good practices evolve along with our understanding of machine learning. The document is divided into three main parts

    Updated: 2021-01-11
  • Approximate Query Processing for Group-By Queries based on Conditional Generative Models
    arXiv.cs.DB Pub Date : 2021-01-08
    Meifan Zhang; Hongzhi Wang

    The Group-By query is an important kind of query, which is common and widely used in data warehouses, data analytics, and data visualization. Approximate query processing is an effective way to increase the querying efficiency on big data. The answer to a group-by query involves multiple values, which makes it difficult to provide sufficiently accurate estimations for all the groups. Stratified sampling

    Updated: 2021-01-11
  • Spatial Object Recommendation with Hints: When Spatial Granularity Matters
    arXiv.cs.DB Pub Date : 2021-01-08
    Hui Luo; Jingbo Zhou; Zhifeng Bao; Shuangli Li; J. Shane Culpepper; Haochao Ying; Hao Liu; Hui Xiong

    Existing spatial object recommendation algorithms generally treat objects identically when ranking them. However, spatial objects often cover different levels of spatial granularity and thereby are heterogeneous. For example, one user may prefer to be recommended a region (say Manhattan), while another user might prefer a venue (say a restaurant). Even for the same user, preferences can change at different

    Updated: 2021-01-11
  • An Algorithm for the Discovery of Independence from Data
    arXiv.cs.DB Pub Date : 2021-01-07
    Miika Hannula; Bor-Kuan Song; Sebastian Link

    For years, independence has been considered as an important concept in many disciplines. Nevertheless, we present the first research that investigates the discovery problem of independence in data. In its arguably simplest form, independence is a statement between two sets of columns expressing that for every two rows in a table there is also a row in the table that coincides with the first row on

    Updated: 2021-01-08
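The abstract above is cut off mid-definition. As a rough illustration only, the sketch below checks the standard independence statement X ⊥ Y from the dependency-theory literature (for every two rows t1, t2 there is a row t3 that agrees with t1 on X and with t2 on Y). That definition is assumed here to be the one the authors intend. The function name `independence_holds` and the list-of-dicts table layout are hypothetical, and this is a naive validity check, not the paper's discovery algorithm.

```python
from itertools import product


def independence_holds(rows, xs, ys):
    """Check whether the independence statement X ⊥ Y holds on a table.

    rows: list of dicts mapping column name -> value (assumed layout).
    xs, ys: lists of column names forming X and Y.
    """
    def proj(row, cols):
        return tuple(row[c] for c in cols)

    # All (X-value, Y-value) combinations that actually occur in some row.
    present = {(proj(r, xs), proj(r, ys)) for r in rows}
    x_values = {x for x, _ in present}
    y_values = {y for _, y in present}

    # X ⊥ Y holds exactly when every X-value is paired with every Y-value.
    return all((x, y) in present for x, y in product(x_values, y_values))
```

On a table that contains every combination of its A and B values the check succeeds; removing one combination makes it fail.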
  • Efficient Data Management in Neutron Scattering Data Reduction Workflows at ORNL
    arXiv.cs.DB Pub Date : 2021-01-05
    William F Godoy; Peter F Peterson; Steven E Hahn; Jay J Billings

    Oak Ridge National Laboratory (ORNL) experimental neutron science facilities produce 1.2 TB a day of raw event-based data that is stored using the standard metadata-rich NeXus schema built on top of the HDF5 file format. Performance of several data reduction workflows is largely determined by the amount of time spent on the loading and processing algorithms in Mantid, an open-source data analysis

    Updated: 2021-01-08
  • Controlling Entity Integrity with Key Sets
    arXiv.cs.DB Pub Date : 2021-01-07
    Miika Hannula; Xinyi Li; Sebastian Link

    Codd's rule of entity integrity stipulates that every table has a primary key. Hence, the attributes of the primary key carry unique and complete value combinations. In practice, data cannot always meet such requirements. Previous work proposed the superior notion of key sets for controlling entity integrity. We establish a linear-time algorithm for validating whether a given key set holds on a given

    Updated: 2021-01-08
  • On the Interaction of Functional and Inclusion Dependencies with Independence Atoms
    arXiv.cs.DB Pub Date : 2021-01-07
    Miika Hannula; Juha Kontinen; Sebastian Link

    Infamously, the finite and unrestricted implication problems for the classes of i) functional and inclusion dependencies together, and ii) embedded multivalued dependencies alone are each undecidable. Famously, the restriction of i) to functional and unary inclusion dependencies in combination with the restriction of ii) to multivalued dependencies yield implication problems that are still different

    Updated: 2021-01-08
  • Privacy-Preserving Data Publishing in Process Mining
    arXiv.cs.DB Pub Date : 2021-01-04
    Majid Rafiei; Wil M. P. van der Aalst

    Process mining aims to provide insights into the actual processes based on event data. These data are often recorded by information systems and are widely available. However, they often contain sensitive private information that should be analyzed responsibly. Therefore, privacy issues in process mining are recently receiving more attention. Privacy preservation techniques obviously need to modify

    Updated: 2021-01-08
  • Efficient Discovery of Approximate Order Dependencies
    arXiv.cs.DB Pub Date : 2021-01-06
    Reza Karegar; Parke Godfrey; Lukasz Golab; Mehdi Kargar; Divesh Srivastava; Jaroslaw Szlichta

    Order dependencies (ODs) capture relationships between ordered domains of attributes. Approximate ODs (AODs) capture such relationships even when there exist exceptions in the data. During automated discovery of ODs, validation is the process of verifying whether an OD holds. We present an algorithm for validating approximate ODs with significantly improved runtime performance over existing methods

    Updated: 2021-01-07
  • Bridging BAD Islands: Declarative Data Sharing at Scale
    arXiv.cs.DB Pub Date : 2021-01-06
    Xikui Wang; Michael J. Carey; Vassilis J. Tsotras

    In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing Big Data at scale, developers would have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires a significant amount of work from developers. In our

    Updated: 2021-01-07
  • Fine-Grained Complexity of Regular Path Queries
    arXiv.cs.DB Pub Date : 2021-01-06
    Katrin Casel; Markus L. Schmid

    A regular path query (RPQ) is a regular expression q that returns all node pairs (u, v) from a graph database that are connected by an arbitrary path labelled with a word from L(q). The obvious algorithmic approach to RPQ-evaluation (called PG-approach), i.e., constructing the product graph between an NFA for q and the graph database, is appealing due to its simplicity and also leads to efficient algorithms

    Updated: 2021-01-07
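The product-graph idea mentioned in the abstract above can be sketched as a breadth-first search over pairs (graph node, NFA state). This is a minimal illustration under assumed data layouts, not the paper's algorithm or its complexity analysis. The name `evaluate_rpq`, the edge-triple format, and the dictionary encoding of the NFA are all hypothetical, and the NFA is assumed to have no epsilon transitions.

```python
from collections import deque


def evaluate_rpq(edges, nfa_delta, nfa_start, nfa_accepting):
    """Return all pairs (u, v) joined by a path whose label word the NFA accepts.

    edges: iterable of (source, label, target) triples (assumed layout).
    nfa_delta: dict mapping (state, label) -> set of successor states.
    """
    adjacency = {}
    nodes = set()
    for u, label, v in edges:
        adjacency.setdefault(u, []).append((label, v))
        nodes.update((u, v))

    answers = set()
    for source in nodes:
        # Explore the product graph starting from (source, start state).
        start = (source, nfa_start)
        seen = {start}
        queue = deque([start])
        while queue:
            node, state = queue.popleft()
            if state in nfa_accepting:
                answers.add((source, node))
            for label, target in adjacency.get(node, []):
                for next_state in nfa_delta.get((state, label), ()):
                    pair = (target, next_state)
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    return answers
```

For example, with edges a -x-> b -y-> c -y-> d and an NFA for the expression `x y*`, the sketch reports the pairs (a, b), (a, c), and (a, d).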
  • Connecting The Dots To Combat Collective Fraud
    arXiv.cs.DB Pub Date : 2021-01-06
    Mingxi Wu; Xi Chen

    Modern fraudsters write malicious programs to coordinate a group of accounts to commit collective fraud for illegal profits in online platforms. These programs have access to a set of finite resources - a set of IPs, devices, and accounts etc. and sometimes manipulate fake accounts to collaboratively attack the target system. Inspired by these observations, we share our experience in building two real-time

    Updated: 2021-01-07
  • A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration
    arXiv.cs.DB Pub Date : 2021-01-05
    Hai Lan; Zhifeng Bao; Yuwei Peng

    Query optimizer is at the heart of the database systems. Cost-based optimizer studied in this paper is adopted in almost all current database systems. A cost-based optimizer introduces a plan enumeration algorithm to find a (sub)plan, and then uses a cost model to obtain the cost of that plan, and selects the plan with the lowest cost. In the cost model, cardinality, the number of tuples through an

    Updated: 2021-01-06
  • Exploring Data and Knowledge combined Anomaly Explanation of Multivariate Industrial Data
    arXiv.cs.DB Pub Date : 2021-01-05
    Xiaoou Ding; Hongzhi Wang; Chen Wang; Zijue Li; Zheng Liang

    The demand for high-performance anomaly detection techniques for IoT data has become urgent, especially in the industrial field. Anomaly identification and explanation in time series data is one essential task in IoT data mining. Since existing anomaly detection techniques focus on the identification of anomalies, the explanation of anomalies is not well-solved. We address the anomaly explanation

    Updated: 2021-01-06
  • GeCo: Quality Counterfactual Explanations in Real Time
    arXiv.cs.DB Pub Date : 2021-01-05
    Maximilian Schleich; Zixuan Geng; Yihong Zhang; Dan Suciu

    Machine learning is increasingly applied in high-stakes decision making that directly affects people's lives, and this leads to an increased demand for systems to explain their decisions. Explanations often take the form of counterfactuals, which consist of conveying to the end user what she/he needs to change in order to improve the outcome. Computing counterfactual explanations is challenging, because

    Updated: 2021-01-06
  • Searching Personalized $k$-wing in Large and Dynamic Bipartite Graphs
    arXiv.cs.DB Pub Date : 2021-01-04
    Aman Abidi; Lu Chen; Rui Zhou; Chengfei Liu

    There are extensive studies focusing on the application scenario that all the bipartite cohesive subgraphs need to be discovered in a bipartite graph. However, we observe that, for some applications, one is interested in finding bipartite cohesive subgraphs containing a specific vertex. In this paper, we study a new query dependent bipartite cohesive subgraph search problem based on $k$-wing model

    Updated: 2021-01-05
  • A Pluggable Learned Index Method via Sampling and Gap Insertion
    arXiv.cs.DB Pub Date : 2021-01-04
    Yaliang Li; Daoyuan Chen; Bolin Ding; Kai Zeng; Jingren Zhou

    Database indexes facilitate data retrieval and benefit broad applications in real-world systems. Recently, a new family of index, named learned index, is proposed to learn hidden yet useful data distribution and incorporate such information into the learning of indexes, which leads to promising performance improvements. However, the "learning" process of learned indexes is still under-explored. In

    Updated: 2021-01-05
  • To Share, or not to Share Online Event Trend Aggregation Over Bursty Event Streams
    arXiv.cs.DB Pub Date : 2021-01-02
    Olga Poppe; Chuan Lei; Lei Ma; Allison Rozet; Elke A. Rundensteiner

    Complex event processing (CEP) systems continuously evaluate large workloads of pattern queries under tight time constraints. Event trend aggregation queries with Kleene patterns are commonly used to retrieve summarized insights about the recent trends in event streams. State-of-the-art methods are limited either due to repetitive computations or unnecessary trend construction. Existing shared approaches

    Updated: 2021-01-05
  • Optimizing Data Cube Visualization for Web Applications: Performance and User-Friendly Data Aggregation
    arXiv.cs.DB Pub Date : 2021-01-01
    Daniel Szelogowski

    Current open source applications which allow for cross-platform data visualization of OLAP cubes feature issues of high overhead and inconsistency due to data oversimplification. To improve upon this issue, there is a need to cut down the number of pipelines that the data must travel between for these aggregation operations and create a single, unified application which performs efficiently without

    Updated: 2021-01-05
  • Visualization Techniques with Data Cubes: Utilizing Concurrency for Complex Data
    arXiv.cs.DB Pub Date : 2021-01-01
    Daniel Szelogowski

    With web and mobile platforms becoming more prominent devices utilized in data analysis, there are currently few systems which are not without flaw. In order to increase the performance of these systems and decrease errors of data oversimplification, we seek to understand how other programming languages can be used across these platforms which provide data and type safety, as well as utilizing concurrency

    Updated: 2021-01-05
  • New Directions in Cloud Programming
    arXiv.cs.DB Pub Date : 2021-01-04
    Alvin Cheung; Natacha Crooks; Joseph M. Hellerstein; Matthew Milano

    Nearly twenty years after the launch of AWS, it remains difficult for most developers to harness the enormous potential of the cloud. In this paper we lay out an agenda for a new generation of cloud programming research aimed at bringing research ideas to programmers in an evolutionary fashion. Key to our approach is a separation of distributed programs into a PACT of four facets: Program semantics

    Updated: 2021-01-05
  • SetSketch: Filling the Gap between MinHash and HyperLogLog
    arXiv.cs.DB Pub Date : 2021-01-01
    Otmar Ertl

    MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to

    Updated: 2021-01-05
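To make concrete the abstract's remark that MinHash allows estimating the Jaccard similarity, here is a minimal textbook-style MinHash sketch. It is not SetSketch and not the author's code. The helper names, the use of seeded BLAKE2b as the hash family, and the default of 128 hash functions are assumptions chosen for illustration.

```python
import hashlib


def _hash(item, seed):
    """Seeded 64-bit hash of an item (assumed hash family: BLAKE2b with a seed prefix)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Keep, for each of num_hashes hash functions, the minimum hash over a non-empty set."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of positions where two signatures agree estimates |A ∩ B| / |A ∪ B|."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Signatures of the same set always agree in every position, so the estimate is exactly 1.0. For other pairs the estimate is approximate, and its error shrinks as `num_hashes` grows.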
  • Chunk List: Concurrent Data Structures
    arXiv.cs.DB Pub Date : 2021-01-01
    Daniel Szelogowski

    Chunking data is obviously no new concept; however, I had never found any data structures that used chunking as the basis of their implementation. I figured that by using chunking alongside concurrency, I could create an extremely fast run-time in regards to particular methods as searching and/or sorting. By using chunking and concurrency to my advantage, I came up with the chunk list - a dynamic list-based

    Updated: 2021-01-05
  • Kamino: Constraint-Aware Differentially Private Data Synthesis
    arXiv.cs.DB Pub Date : 2020-12-31
    Chang Ge; Shubhankar Mohapatra; Xi He; Ihab F. Ilyas

    Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring the privacy of individual data records. Existing differentially private data synthesis methods aim to generate useful data based on applications, but

    Updated: 2021-01-01
  • bloomRF: On Performing Range-Queries with Bloom-Filters based on Piecewise-Monotone Hash Functions and Dyadic Trace-Trees
    arXiv.cs.DB Pub Date : 2020-12-31
    Christian Riegger; Arthur Bernhardt; Bernhard Moessner; Ilia Petrov

    We introduce bloomRF as a unified method for approximate membership testing that supports both point- and range-queries on a single data structure. bloomRF extends Bloom-Filters with range query support and may replace them. The core idea is to employ a dyadic interval scheme to determine the set of dyadic intervals covering a data point, which are then encoded and inserted. bloomRF introduces Dyadic

    Updated: 2021-01-01
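The dyadic interval scheme named in the abstract above can be illustrated for integer keys. The sketch below lists, for one key in a domain of size 2^bits, the dyadic interval that covers it at each level. It shows only the generic decomposition. How bloomRF encodes and inserts these intervals is not described in the truncated abstract and is not modelled here, and the function name `dyadic_intervals` is hypothetical.

```python
def dyadic_intervals(key, bits):
    """Return (level, lo, hi) for every dyadic interval of [0, 2**bits) that contains key.

    At level l the domain is split into aligned blocks of width 2**l,
    so exactly one block per level covers the key.
    """
    if not 0 <= key < (1 << bits):
        raise ValueError("key outside the domain [0, 2**bits)")
    intervals = []
    for level in range(bits + 1):
        lo = (key >> level) << level   # clear the lowest `level` bits
        hi = lo + (1 << level) - 1     # inclusive upper bound of the block
        intervals.append((level, lo, hi))
    return intervals
```

For key 5 in a 3-bit domain this yields [5, 5], [4, 5], [4, 7], and [0, 7] at levels 0 through 3.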
  • On the importance of functions in data modeling
    arXiv.cs.DB Pub Date : 2020-12-31
    Alexandr Savinov

    In this paper we argue that representing entity properties by tuple attributes, as evangelized in most set-oriented data models, is a controversial method conflicting with the principle of tuple immutability. As a principled solution to this problem of tuple immutability on one hand and the need to modify tuple attributes on the other hand, we propose to use mathematical functions for representing

    Updated: 2021-01-01
  • Similarity Classification of Public Transit Stations
    arXiv.cs.DB Pub Date : 2020-12-30
    Hannah Bast; Patrick Brosi; Markus Näther

    We study the following problem: given two public transit station identifiers A and B, each with a label and a geographic coordinate, decide whether A and B describe the same station. For example, for "St Pancras International" at (51.5306, -0.1253) and "London St Pancras" at (51.5319, -0.1269), the answer would be "Yes". This problem frequently arises in areas where public transit data is used, for

    Updated: 2021-01-01
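A naive baseline for the problem stated in the abstract above combines geographic distance with label overlap. The sketch below is an assumption-laden illustration, not the authors' classifier. The thresholds (`max_dist_m`, `min_label_sim`), the token-level Jaccard measure, and all function names are hypothetical choices.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6,371 km."""
    r = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def token_jaccard(label_a, label_b):
    """Jaccard similarity of the lowercased, whitespace-split label tokens."""
    ta, tb = set(label_a.lower().split()), set(label_b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def same_station(a, b, max_dist_m=250.0, min_label_sim=0.4):
    """a, b: (label, lat, lon) tuples; both thresholds are illustrative assumptions."""
    close = haversine_m(a[1], a[2], b[1], b[2]) <= max_dist_m
    similar = token_jaccard(a[0], b[0]) >= min_label_sim
    return close and similar
```

For the abstract's own example, the two St Pancras coordinates lie roughly 180 m apart and the labels share two of four distinct tokens, so this baseline answers "Yes" under the assumed thresholds.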
  • BayesCard: A Unified Bayesian Framework for Cardinality Estimation
    arXiv.cs.DB Pub Date : 2020-12-29
    Ziniu Wu; Amir Shaikhha

    Cardinality estimation is one of the fundamental problems in database management systems and it is an essential component in query optimizers. Traditional machine-learning-based approaches use probabilistic models such as Bayesian Networks (BNs) to learn joint distributions on data. Recent research advocates for using deep unsupervised learning and achieves state-of-the-art performance in estimating

    Updated: 2021-01-01
  • Faster Distance-Based Representative Skyline and $k$-Center Along Pareto Front in the Plane
    arXiv.cs.DB Pub Date : 2020-12-31
    Sergio Cabello

    We consider the problem of computing the distance-based representative skyline in the plane, a problem introduced by Tao, Ding, Lin and Pei [Proc. 25th IEEE International Conference on Data Engineering (ICDE), 2009] and independently considered by Dupin, Nielsen and Talbi [Optimization and Learning - Third International Conference, OLA 2020] in the context of multi-objective optimization. Given

    Updated: 2021-01-01
  • Misplaced Subsequences Repairing with Application to Multivariate Industrial Time Series Data
    arXiv.cs.DB Pub Date : 2020-12-29
    Xiaoou Ding; Hongzhi Wang; Jiaxuan Su; Chen Wang; Hong Gao

    Both the volume and the collection velocity of time series generated by monitoring sensors are increasing in the Internet of Things (IoT). Data management and analysis requires high quality and applicability of the IoT data. However, errors are prevalent in original time series data. Inconsistency in time series is a serious data quality problem existing widely in IoT. Such problem could be hardly

    Updated: 2021-01-01
  • Example-Driven User Intent Discovery: Empowering Users to Cross the SQL Barrier Through Query by Example
    arXiv.cs.DB Pub Date : 2020-12-29
    Anna Fariha; Lucy Cousins; Narges Mahyar; Alexandra Meliou

    Traditional data systems require specialized technical skills where users need to understand the data organization and write precise queries to access data. Therefore, novice users who lack technical expertise face hurdles in perusing and analyzing data. Existing tools assist in formulating queries through keyword search, query recommendation, and query auto-completion, but still require some technical

    Updated: 2021-01-01
  • Fast Subgraph Matching by Exploiting Search Failures
    arXiv.cs.DB Pub Date : 2020-12-28
    Junya Arai; Makoto Onizuka; Yasuhiro Fujiwara; Sotetsu Iwamura

    Subgraph matching is a compute-intensive problem that asks to enumerate all the isomorphic embeddings of a query graph within a data graph. This problem is generally solved with backtracking, which recursively evolves every possible partial embedding until it becomes an isomorphic embedding or is found unable to become it. While existing methods reduce the search space by analyzing graph structures

    Updated: 2020-12-29
  • Recommending Courses in MOOCs for Jobs: An Auto Weak Supervision Approach
    arXiv.cs.DB Pub Date : 2020-12-28
    Bowen Hao; Jing Zhang; Cuiping Li; Hong Chen; Hongzhi Yin

    The proliferation of massive open online courses (MOOCs) demands an effective way of course recommendation for jobs posted in recruitment websites, especially for the people who take MOOCs to find new jobs. Despite the advances of supervised ranking models, the lack of enough supervised signals prevents us from directly learning a supervised ranking model. This paper proposes a general automated weak

    Updated: 2020-12-29
  • Discovering Closed and Maximal Embedded Patterns from Large Tree Data
    arXiv.cs.DB Pub Date : 2020-12-26
    Xiaoying Wu; Dimitri Theodoratos; Nikos Mamoulis

    We address the problem of summarizing embedded tree patterns extracted from large data trees. We do so by defining and mining closed and maximal embedded unordered tree patterns from a single large data tree. We design an embedded frequent pattern mining algorithm extended with a local closedness checking technique. This algorithm is called closedEmbTM-prune as it eagerly eliminates non-closed

    Updated: 2020-12-29
  • Toward Compact Data from Big Data
    arXiv.cs.DB Pub Date : 2020-12-26
    Song-Kyoo (Amang) Kim

    Bigdata is a dataset of which size is beyond the ability of handling a valuable raw material that can be refined and distilled into valuable specific insights. Compact data is a method that optimizes the big dataset that gives best assets without handling complex bigdata. The compact dataset contains the maximum knowledge patterns at fine grained level for effective and personalized utilization of

    Updated: 2020-12-29
  • Handling SQL Nulls with Two-Valued Logic
    arXiv.cs.DB Pub Date : 2020-12-24
    Leonid Libkin; Liat Peterfreund

    The design of SQL is based on a three-valued logic (3VL), rather than the familiar Boolean logic with truth values true and false, to accommodate the additional truth value unknown for handling nulls. It is viewed as indispensable for SQL expressiveness but is at the same time much criticized for leading to unintuitive behavior of queries and thus being a source of programmer mistakes. We show that

    Updated: 2020-12-25
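The three-valued logic described in the abstract above can be modelled in a few lines, with Python's `None` standing in for the truth value unknown. This is a sketch of standard SQL semantics for illustration only. It does not reproduce the paper's two-valued proposal, and the function names are hypothetical.

```python
def and3(a, b):
    """Three-valued AND: False dominates, then unknown (None)."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def or3(a, b):
    """Three-valued OR: True dominates, then unknown (None)."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def not3(a):
    """Three-valued NOT: unknown stays unknown."""
    return None if a is None else (not a)


def eq3(x, y):
    """SQL-style equality: any comparison involving NULL (None) is unknown."""
    return None if x is None or y is None else x == y


def where(values, predicate):
    """A WHERE clause keeps only rows whose condition evaluates to true."""
    return [v for v in values if predicate(v) is True]
```

One source of the unintuitive behavior the abstract mentions: the condition `x = 1 OR NOT (x = 1)` looks like a tautology, yet `where([1, None, 2], lambda x: or3(eq3(x, 1), not3(eq3(x, 1))))` drops the NULL row, because the condition evaluates to unknown rather than true.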
  • Learned Indexes for a Google-scale Disk-based Database
    arXiv.cs.DB Pub Date : 2020-12-23
    Hussam Abu-Libdeh; Deniz Altınbüken; Alex Beutel; Ed H. Chi; Lyric Doshi; Tim Kraska; Xiaozhou (Steve) Li; Andy Ly; Christopher Olston

    There is great excitement about learned index structures, but understandable skepticism about the practicality of a new method uprooting decades of research on B-Trees. In this paper, we work to remove some of that uncertainty by demonstrating how a learned index can be integrated in a distributed, disk-based database system: Google's Bigtable. We detail several design decisions we made to integrate

    Updated: 2020-12-24
  • Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing
    arXiv.cs.DB Pub Date : 2020-12-23
    Xi Victoria Lin; Richard Socher; Caiming Xiong

    We present BRIDGE, a powerful sequential architecture for modeling dependencies between natural language questions and relational databases in cross-DB semantic parsing. BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. The hybrid sequence is encoded by BERT with minimal subsequent layers and the

    Updated: 2020-12-24
  • Designing an Adaptive Bandwidth Management for Higher Education Institutions
    arXiv.cs.DB Pub Date : 2020-11-19
    Rolysent K Paredes; Alexander A. Hernandez

    Purpose: This study proposes an adaptive bandwidth management system which can be explicitly used by educational institutions. The primary goal of the system is to increase the bandwidth of the users who access more on educational websites. Through this proposed bandwidth management, the users of the campus networks is encouraged to utilize the internet for educational purposes. Method: The weblog

    Updated: 2020-12-24
  • Structure and Complexity of Bag Consistency
    arXiv.cs.DB Pub Date : 2020-12-22
    Albert Atserias; Phokion G. Kolaitis

    Since the early days of relational databases, it was realized that acyclic hypergraphs give rise to database schemas with desirable structural and algorithmic properties. In a by-now classical paper, Beeri, Fagin, Maier, and Yannakakis established several different equivalent characterizations of acyclicity; in particular, they showed that the sets of attributes of a schema form an acyclic hypergraph

    Updated: 2020-12-23
  • Towards Quantifying Privacy in Process Mining
    arXiv.cs.DB Pub Date : 2020-12-21
    Majid Rafiei; Wil M. P. van der Aalst

    Process mining employs event logs to provide insights into the actual processes. Event logs are recorded by information systems and contain valuable information helping organizations to improve their processes. However, these data also include highly sensitive private information which is a major concern when applying process mining. Therefore, privacy preservation in process mining is growing in importance

    Updated: 2020-12-23
  • Data Validation
    arXiv.cs.DB Pub Date : 2020-12-21
    Mark P. J. van der Loo; Edwin de Jonge

    Data validation is the activity where one decides whether or not a particular data set is fit for a given purpose. Formalizing the requirements that drive this decision process allows for unambiguous communication of the requirements, automation of the decision process, and opens up ways to maintain and investigate the decision process itself. The purpose of this article is to formalize the definition

    Updated: 2020-12-23
  • Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
    arXiv.cs.DB Pub Date : 2020-12-22
    Nofar Carmeli; Nikolaos Tziavelis; Wolfgang Gatterbauer; Benny Kimelfeld; Mirek Riedewald

    We study the question of when we can answer a Conjunctive Query (CQ) with an ordering over the answers by constructing a structure for direct (random) access to the sorted list of answers, without actually materializing this list, so that the construction time is linear (or quasilinear) in the size of the database. In the absence of answer ordering, such a construction has been devised for the task

    Updated: 2020-12-23
Contents have been reproduced by permission of the publishers.