Current journal: arXiv - CS - Databases
  • GeoFlink: A Framework for the Real-time Processing of Spatial Streams
    arXiv.cs.DB Pub Date : 2020-04-07
    Salman Ahmed Shaikh; Komal Mariam; Hiroyuki Kitagawa; Kyoung-Sook Kim

    Apache Flink is an open-source system for the scalable processing of batch and streaming data. Flink does not natively support the efficient processing of spatial data streams, which is a requirement of many applications dealing with spatial data. Besides Flink, other scalable spatial data processing platforms, including GeoSpark, Spatial Hadoop, GeoMesa and Parallel Secondo, do not support

  • Forecasting in multivariate irregularly sampled time series with missing values
    arXiv.cs.DB Pub Date : 2020-04-06
    Shivam Srivastava; Prithviraj Sen; Berthold Reinwald

    Sparse and irregularly sampled multivariate time series are common in clinical, climate, financial and many other domains. Most recent approaches focus on classification, regression or forecasting tasks on such data. In forecasting, it is necessary to not only forecast the right value but also to forecast when that value will occur in the irregular time series. In this work, we present an approach

  • Learning Individual Models for Imputation (Technical Report)
    arXiv.cs.DB Pub Date : 2020-04-07
    Aoqian Zhang; Shaoxu Song; Yu Sun; Jianmin Wang

    Missing numerical values are prevalent, e.g., owing to unreliable sensor reading, collection and transmission among heterogeneous sources. Unlike categorized data imputation over a limited domain, the numerical values suffer from two issues: (1) sparsity problem, the incomplete tuple may not have sufficient complete neighbors sharing the same/similar values for imputation, owing to the (almost) infinite

  • An Algorithm for Context-Free Path Queries over Graph Databases
    arXiv.cs.DB Pub Date : 2020-04-07
    Ciro M. Medeiros; Martin A. Musicante; Umberto S. Costa

    RDF (Resource Description Framework) is a standard language to represent graph databases. Query languages for RDF databases usually include primitives to support path queries, linking pairs of vertices of the graph that are connected by a path of labels belonging to a given language. Languages such as SPARQL include support for paths defined by regular languages (by means of Regular Expressions). A

  • Modularis: Modular Data Analytics for Hardware, Software, and Platform Heterogeneity
    arXiv.cs.DB Pub Date : 2020-04-07
    Dimitrios Koutsoukos; Ingo Müller; Renato Marroquín; Gustavo Alonso

    Today's data analytics displays an overwhelming diversity along many dimensions: data types, platforms, hardware acceleration, etc. As a result, system design often has to choose between depth and breadth: high efficiency for a narrow set of use cases or generality at a lower performance. In this paper, we pave the way to get the best of both worlds: We present Modularis, an execution layer for data

  • Usable & Scalable Learning Over Relational Data With Automatic Language Bias
    arXiv.cs.DB Pub Date : 2017-10-03
    Jose Picado; Arash Termehchy; Sudhanshu Pathak; Alan Fern; Praveen Ilango; Yunqiao Cai

    Relational databases are valuable resources for learning novel and interesting relations and concepts. In order to constrain the search through the large space of candidate definitions, users must tune the algorithm by specifying a language bias. Unfortunately, specifying the language bias is done via trial and error and is guided by the expert's intuitions. We propose AutoBias, a system that leverages

  • Possible/Certain Functional Dependencies
    arXiv.cs.DB Pub Date : 2019-09-27
    Lhouari Nourine; Jean Marc Petit

    Incomplete information makes it possible to deal with data containing errors, uncertainty or inconsistencies, and has been studied in different application areas such as query answering or data integration. In this paper, we investigate classical functional dependencies in the presence of incomplete information. To do so, we associate each attribute with a comparability function which maps every pair of domain values to

  • RisGraph: A Real-Time Streaming System for Evolving Graphs
    arXiv.cs.DB Pub Date : 2020-04-02
    Guanyu Feng; Zixuan Ma; Daixuan Li; Xiaowei Zhu; Yanzheng Cai; Wentao Han; Wenguang Chen

    Graphs in the real world are constantly changing and of large scale. In processing these evolving graphs, the combination of update workloads (updating vertices and edges in a streaming manner) and analytical (performing graph algorithms incrementally) workloads is ubiquitous. Throughput, latency, and granularity are three key requirements in processing evolving graphs with such combined workloads

  • A County-level Dataset for Informing the United States' Response to COVID-19
    arXiv.cs.DB Pub Date : 2020-04-01
    Benjamin D. Killeen; Jie Ying Wu; Kinjal Shah; Anna Zapaishchykova; Philipp Nikutta; Aniruddha Tamhane; Shreya Chakraborty; Jinchi Wei; Tiger Gao; Mareike Thies; Mathias Unberath

    As the coronavirus disease 2019 (COVID-19) becomes a global pandemic, policy makers must enact interventions to stop its spread. Data driven approaches might supply information to support the implementation of mitigation and suppression strategies. To facilitate research in this direction, we present a machine-readable dataset that aggregates relevant data from governmental, journalistic, and academic

  • Approximate Selection with Guarantees using Proxies
    arXiv.cs.DB Pub Date : 2020-04-02
    Daniel Kang; Edward Gan; Peter Bailis; Tatsunori Hashimoto; Matei Zaharia

    Due to the falling costs of data acquisition and storage, researchers and industry analysts often want to find all instances of rare events in large datasets. For instance, scientists can cheaply capture thousands of hours of video, but are limited by the need to manually inspect all the video to identify relevant objects and events. To reduce this cost, recent work proposes to use cheap proxy models

  • Nass: A New Approach to Graph Similarity Search
    arXiv.cs.DB Pub Date : 2020-04-02
    Jongik Kim

    In this paper, we study the problem of graph similarity search with graph edit distance (GED) constraints. Due to the NP-hardness of GED computation, existing solutions to this problem adopt the filtering-and-verification framework with a main focus on the filtering phase to generate a small number of candidate graphs. However, they have a limitation that the number of candidates grows extremely rapidly

  • Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries
    arXiv.cs.DB Pub Date : 2019-10-20
    Maciej Besta; Emanuel Peter; Robert Gerstenberger; Marc Fischer; Michał Podstawski; Claude Barthels; Gustavo Alonso; Torsten Hoefler

    Graph processing has become an important part of multiple areas of computer science, such as machine learning, computational sciences, medical applications, social network analysis, and many others. Numerous graphs such as web or social networks may contain up to trillions of edges. Often, these graphs are also dynamic (their structure changes over time) and have domain-specific rich data associated

  • A+ Indexes: Lightweight and Highly Flexible Adjacency Lists for Graph Database Management Systems
    arXiv.cs.DB Pub Date : 2020-03-31
    Amine Mhedhbi; Pranjal Gupta; Shahid Khaliq; Semih Salihoglu

    Graph database management systems (GDBMSs) are highly optimized to perform very fast joins of vertices by indexing the neighbourhoods of vertices in adjacency list indexes. However, existing GDBMSs have system-specific and fixed adjacency list index structures, which makes each system highly efficient on only a fixed set of workloads. We describe a highly flexible and lightweight indexing sub-system

  • Graph Summarization Methods and Applications: A Survey
    arXiv.cs.DB Pub Date : 2016-12-14
    Yike Liu; Tara Safavi; Abhilash Dighe; Danai Koutra

    While advances in computing resources have made processing enormous amounts of data possible, human ability to identify patterns in such data has not scaled accordingly. Efficient computational methods for condensing and simplifying data are thus becoming vital for extracting actionable insights. In particular, while data summarization techniques have been studied extensively, only recently has summarizing

  • Consistency and Certain Answers in Relational to RDF Data Exchange with Shape Constraints
    arXiv.cs.DB Pub Date : 2020-03-30
    Iovka Boneva; Jose Lozano; Sławek Staworko

    We investigate the data exchange from relational databases to RDF graphs inspired by R2RML with the addition of target shape schemas. We study the problems of consistency i.e., checking that every source instance admits a solution, and certain query answering i.e., finding answers present in every solution. We identify the class of constructive relational to RDF data exchange that uses IRI constructors

  • Towards Effective Differential Privacy Communication for Users' Data Sharing Decision and Comprehension
    arXiv.cs.DB Pub Date : 2020-03-31
    Aiping Xiong; Tianhao Wang; Ninghui Li; Somesh Jha

    Differential privacy protects an individual's privacy by perturbing data on an aggregated level (DP) or individual level (LDP). We report four online human-subject experiments investigating the effects of using different approaches to communicate differential privacy techniques to laypersons in a health app data collection setting. Experiments 1 and 2 investigated participants' data disclosure decisions

  • Towards Productionizing Subjective Search Systems
    arXiv.cs.DB Pub Date : 2020-03-31
    Aaron Feng; Shuwei Chen; Yuliang Li; Hiroshi Matsuda; Hidekazu Tamaki; Wang-Chiew Tan

    Existing e-commerce search engines typically support search only over objective attributes, such as price and locations, leaving the more desirable subjective attributes, such as romantic vibe and worklife balance unsearchable. We found that this is also the case for Recruit Group, which operates a wide range of online booking and search services, including jobs, travel, housing, bridal, dining, beauty

  • The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle
    arXiv.cs.DB Pub Date : 2020-03-31
    Xinyue Wang; Zhiwu Xie

    The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest to reuse these archives as big data sources for statistical and analytical research, the speed to turn these data into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for

  • IBM Functional Genomics Platform, A Cloud-Based Platform for Studying Microbial Life at Scale
    arXiv.cs.DB Pub Date : 2019-11-05
    Edward E. Seabolt; Gowri Nayar; Harsha Krishnareddy; Akshay Agarwal; Kristen L. Beck; Ignacio Terrizzano; Eser Kandogan; Mary Roth; Vandana Mukherjee; James H. Kaufman

    The rapid growth in biological sequence data is revolutionizing our understanding of genotypic diversity and challenging conventional approaches to informatics. With the increasing availability of genomic data, traditional bioinformatic tools require substantial computational time and the creation of ever-larger indices each time a researcher seeks to gain insight from the data. To address these challenges

  • word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data
    arXiv.cs.DB Pub Date : 2020-03-27
    Martin Grohe

    Vector representations of graphs and relational structures, whether hand-crafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively

  • Blockchain-enabled Resource Management and Sharing for 6G Communications
    arXiv.cs.DB Pub Date : 2020-03-29
    Hao Xu; Paulo Valente Klaine; Oluwakayode Onireti; Bin Cao; Muhammad Imran; Lei Zhang

    The sixth generation (6G) network must provide performance superior to previous generations in order to meet the requirements of emerging services and applications, such as multi-gigabit transmission rate, even higher reliability, sub 1 millisecond latency and ubiquitous connection for Internet of Everything. However, with the scarcity of spectrum resources, efficient resource management and sharing

  • Best Practices for Implementing FAIR Vocabularies and Ontologies on the Web
    arXiv.cs.DB Pub Date : 2020-03-29
    Daniel Garijo; María Poveda-Villalón

    With the adoption of Semantic Web technologies, an increasing number of vocabularies and ontologies have been developed in different domains, ranging from Biology to Agronomy or Geosciences. However, many of these ontologies are still difficult to find, access and understand by researchers due to a lack of documentation, URI resolving issues, versioning problems, etc. In this chapter we describe guidelines

  • Dealer: End-to-End Data Marketplace with Model-based Pricing
    arXiv.cs.DB Pub Date : 2020-03-29
    Jinfei Liu

    Data-driven machine learning (ML) has witnessed great successes across a variety of application domains. Since ML model training relies crucially on a large amount of data, there is a growing demand for high quality data to be collected for ML model training. However, from data owners' perspective, it is risky for them to contribute their data. To incentivize data contribution, it would be ideal

  • A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching
    arXiv.cs.DB Pub Date : 2020-03-29
    Venkata Vamsikrishna Meduri; Lucian Popa; Prithviraj Sen; Mohamed Sarwat

    Entity Matching (EM) is a core data cleaning task, aiming to identify different mentions of the same real-world entity. Active learning is one way to address the challenge of scarce labeled data in practice, by dynamically collecting the necessary examples to be labeled by an Oracle and refining the learned model (classifier) upon them. In this paper, we build a unified active learning benchmark framework

  • Bag Query Containment and Information Theory
    arXiv.cs.DB Pub Date : 2019-06-24
    Mahmoud Abo Khamis; Phokion G. Kolaitis; Hung Q. Ngo; Dan Suciu

    The query containment problem is a fundamental algorithmic problem in data management. While this problem is well understood under set semantics, it is by far less understood under bag semantics. In particular, it is a long-standing open question whether or not the conjunctive query containment problem under bag semantics is decidable. We unveil tight connections between information theory and the

  • Area Queries Based on Voronoi Diagrams
    arXiv.cs.DB Pub Date : 2019-12-01
    Yang Li

    The area query, to find all elements contained in a specified area from a certain set of spatial objects, is a very important spatial query widely required in various fields. A number of approaches have been proposed to implement this query, the best known of which is to obtain a rough candidate set through spatial indexes and then refine the candidates through geometric validations to get the final

  • Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing (Technical Report)
    arXiv.cs.DB Pub Date : 2020-03-27
    Aoqian Zhang; Shaoxu Song; Jianmin Wang; Philip S. Yu

    Errors are prevalent in time series data, such as GPS trajectories or sensor readings. Existing methods focus more on anomaly detection but not on repairing the detected anomalies. By simply filtering out the dirty data via anomaly detection, applications could still be unreliable over the incomplete time series. Instead of simply discarding anomalies, we propose to (iteratively) repair them in time

  • MultiRI: Fast Subgraph Matching in Labeled Multigraphs
    arXiv.cs.DB Pub Date : 2020-03-25
    Giovanni Micale; Vincenzo Bonnici; Alfredo Ferro; Dennis Shasha; Rosalba Giugno; Alfredo Pulvirenti

    The Subgraph Matching (SM) problem consists of finding all the embeddings of a given small graph, called the query, into a large graph, called the target. The SM problem has been widely studied for simple graphs, i.e. graphs where there is exactly one edge between two nodes and nodes have single labels, but few approaches have been devised for labeled multigraphs, i.e. graphs having possibly multiple

  • A Survey on Trajectory Data Management, Analytics, and Learning
    arXiv.cs.DB Pub Date : 2020-03-25
    Sheng Wang; Zhifeng Bao; J. Shane Culpepper; Gao Cong

    Recent advances in sensor and mobile devices have enabled an unprecedented increase in the availability and collection of urban trajectory data, thus increasing the demand for more efficient ways to manage and analyze the data being produced. In this survey, we comprehensively review recent research trends in trajectory data management, ranging from trajectory pre-processing, storage, common trajectory

  • Property Graph Schema Optimization for Domain-Specific Knowledge Graphs
    arXiv.cs.DB Pub Date : 2020-03-25
    Chuan Lei; Rana Alotaibi; Abdul Quamar; Vasilis Efthymiou; Fatma Özcan

    Enterprises are creating domain-specific knowledge graphs by curating and integrating their business data from multiple sources. The data in these knowledge graphs can be described using ontologies, which provide a semantic abstraction to define the content in terms of the entities and the relationships of the domain. The rich semantic relationships in an ontology contain a variety of opportunities

  • Founded Semantics and Constraint Semantics of Logic Rules
    arXiv.cs.DB Pub Date : 2016-06-20
    Yanhong A. Liu; Scott D. Stoller

    Logic rules and inference are fundamental in computer science and have been studied extensively. However, prior semantics of logic languages can have subtle implications and can disagree significantly, on even very simple programs, including in attempting to solve the well-known Russell's paradox. These semantics are often non-intuitive and hard-to-understand when unrestricted negation is used in recursion

  • Counting Problems over Incomplete Databases
    arXiv.cs.DB Pub Date : 2019-12-23
    Marcelo Arenas; Pablo Barceló; Mikaël Monet

    We study the complexity of various fundamental counting problems that arise in the context of incomplete databases, i.e., relational databases that can contain unknown values in the form of labeled nulls. Specifically, we assume that the domains of these unknown values are finite and, for a Boolean query $q$, we consider the following two problems: given as input an incomplete database $D$, (a) return

  • Implementing Suffix Array Algorithm Using Apache Big Table Data Implementation
    arXiv.cs.DB Pub Date : 2020-03-24
    Piero Giacomelli

    In this paper we will describe a new approach to the well-known suffix-array algorithm using Big Table Data Technology. We will demonstrate how it is possible to refactor a well-known algorithm by taking advantage of a high-performance distributed datastore, to illustrate the advantages of using cloud datastore technology for storing large text sequences and retrieving them. A case

  • EQL -- an extremely easy to learn knowledge graph query language, achieving highspeed and precise search
    arXiv.cs.DB Pub Date : 2020-03-19
    Han Liu; Shantao Liu

    EQL, also named as Extremely Simple Query Language, can be widely used in the fields of knowledge graphs, precise search, strong artificial intelligence, databases, smart speakers, patent search and others. EQL adopts the principle of minimalism in design and pursues simplicity and ease of learning so that everyone can master it quickly. The EQL language and lambda calculus are interconvertible, which reveals

  • FITing-Tree: A Data-aware Index Structure
    arXiv.cs.DB Pub Date : 2018-01-30
    Alex Galakatos; Michael Markovitch; Carsten Binnig; Rodrigo Fonseca; Tim Kraska

    Index structures are one of the most important tools that DBAs leverage to improve the performance of analytics and transactional workloads. However, building several indexes over large datasets can often become prohibitive and consume valuable system resources. In fact, a recent study showed that indexes created as part of the TPC-C benchmark can account for 55% of the total memory available in a

  • Efficient Oblivious Database Joins
    arXiv.cs.DB Pub Date : 2020-03-20
    Simeon Krastnikov; Florian Kerschbaum; Douglas Stebila

    A major algorithmic challenge in designing applications intended for secure remote execution is ensuring that they are oblivious to their inputs, in the sense that their memory access patterns do not leak sensitive information to the server. This problem is particularly relevant to cloud databases that wish to allow queries over the client's encrypted data. One of the major obstacles to such a goal

  • A Framework for Generating Explanations from Temporal Personal Health Data
    arXiv.cs.DB Pub Date : 2020-03-20
    Jonathan J. Harris; Ching-Hua Chen; Mohammed J. Zaki

    Whereas it has become easier for individuals to track their personal health data (e.g., heart rate, step count, food log), there is still a wide chasm between the collection of data and the generation of meaningful explanations to help users better understand what their data means to them. With an increased comprehension of their data, users will be able to act upon the newfound information and work

  • Covering the Relational Join
    arXiv.cs.DB Pub Date : 2020-03-21
    Shi Li; Sai Vikneshwar Mani Jayaraman; Atri Rudra

    In this paper, we initiate a theoretical study of what we call the join covering problem. We are given a natural join query instance $Q$ on $n$ attributes and $m$ relations $(R_i)_{i \in [m]}$. Let $J_{Q} = \ \Join_{i=1}^m R_i$ denote the join output of $Q$. In addition to $Q$, we are given a parameter $\Delta: 1\le \Delta\le n$ and our goal is to compute the smallest subset $\mathcal{T}_{Q, \Delta}

  • Causality-Guided Adaptive Interventional Debugging
    arXiv.cs.DB Pub Date : 2020-03-21
    Anna Fariha; Suman Nath; Alexandra Meliou

    Runtime nondeterminism is a fact of life in modern database applications. Previous research has shown that nondeterminism can cause applications to intermittently crash, become unresponsive, or experience data corruption. We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures. AID combines existing statistical debugging, causal analysis, fault injection, and group

  • A Synopses Data Engine for Interactive Extreme-Scale Analytics
    arXiv.cs.DB Pub Date : 2020-03-21
    Antonis Kontaxakis; Nikos Giatrakos; Antonios Deligiannakis

    In this work, we detail the design and structure of a Synopses Data Engine (SDE) which combines the virtues of parallel processing and stream summarization towards delivering interactive analytics at extreme scale. Our SDE is built on top of Apache Flink and implements a synopsis-as-a-service paradigm. In that it achieves (a) concurrently maintaining thousands of synopses of various types for thousands

  • ARDA: Automatic Relational Data Augmentation for Machine Learning
    arXiv.cs.DB Pub Date : 2020-03-21
    Nadiia Chepurko; Ryan Marcus; Emanuel Zgraggen; Raul Castro Fernandez; Tim Kraska; David Karger

    Automatic machine learning (AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation

  • The Solution Distribution of Influence Maximization: A High-level Experimental Study on Three Algorithmic Approaches
    arXiv.cs.DB Pub Date : 2020-03-22
    Naoto Ohsaka

    Influence maximization is among the most fundamental algorithmic problems in social influence analysis. Over the last decade, a great effort has been devoted to developing efficient algorithms for influence maximization, so that identifying the "best" algorithm has become a demanding task. In SIGMOD'17, Arora, Galhotra, and Ranu reported benchmark results on eleven existing algorithms and demonstrated

  • Translation of Array-Based Loops to Distributed Data-Parallel Programs
    arXiv.cs.DB Pub Date : 2020-03-21
    Leonidas Fegaras; Md Hasanuzzaman Noor

    Large volumes of data generated by scientific experiments and simulations come in the form of arrays, while programs that analyze these data are frequently expressed in terms of array operations in an imperative, loop-based language. But, as datasets grow larger, new frameworks in distributed Big Data analytics have become essential tools to large-scale scientific computing. Scientists, who are typically

  • A Transactional Perspective on Execute-order-validate Blockchains
    arXiv.cs.DB Pub Date : 2020-03-23
    Pingcheng Ruan; Dumitrel Loghin; Quang-Trung Ta; Meihui Zhang; Gang Chen; Beng Chin Ooi

    Smart contracts have enabled blockchain systems to evolve from simple cryptocurrency platforms, such as Bitcoin, to general transactional systems, such as Ethereum. Catering for emerging business requirements, a new architecture called execute-order-validate has been proposed in Hyperledger Fabric to support parallel transactions and improve the blockchain's throughput. However, this new architecture

  • Absolute Shapley Value
    arXiv.cs.DB Pub Date : 2020-03-23
    Jinfei Liu

    Shapley value is a concept in cooperative game theory for measuring the contribution of each participant, which was named in honor of Lloyd Shapley. Shapley value has been recently applied in data marketplaces for compensation allocation based on their contribution to the models. Shapley value is the only value division scheme used for compensation allocation that meets three desirable criteria: group

  • KloakDB: A Platform for Analyzing Sensitive Data with $K$-anonymous Query Processing
    arXiv.cs.DB Pub Date : 2019-03-31
    Madhav Suresh; Zuohao She; William Wallace; Adel Lahlou; Jennie Rogers

    A private data federation enables data owners to pool their information for querying without disclosing their secret tuples to one another. Here, a client queries the union of the records of all data owners. The data owners work together to answer the query using privacy-preserving algorithms that prevent them from learning unauthorized information about the inputs of their peers. Only the client,

  • Optimal Algorithms for Ranked Enumeration of Answers to Full Conjunctive Queries
    arXiv.cs.DB Pub Date : 2019-11-13
    Nikolaos Tziavelis; Deepak Ajwani; Wolfgang Gatterbauer; Mirek Riedewald; Xiaofeng Yang

    We study ranked enumeration of join-query results according to very general orders defined by selective dioids. Our main contribution is a framework for ranked enumeration over a class of dynamic programming problems that generalizes seemingly different problems that had been studied in isolation. To this end, we extend classic algorithms that find the k-shortest paths in a weighted graph. For full

  • Hihooi: A Database Replication Middleware for Scaling Transactional Databases Consistently
    arXiv.cs.DB Pub Date : 2020-03-16
    Michael A. Georgiou; Aristodemos Paphitis; Michael Sirivianos; Herodotos Herodotou

    With the advent of the Internet and Internet-connected devices, modern business applications can experience rapid increases as well as variability in transactional workloads. Database replication has been employed to scale performance and improve availability of relational databases but past approaches have suffered from various issues including limited scalability, performance versus consistency tradeoffs

  • Equivalent Rewritings on Path Views with Binding Patterns
    arXiv.cs.DB Pub Date : 2020-03-16
    Julien Romero; Nicoleta Preda; Antoine Amarilli; Fabian Suchanek

    A view with a binding pattern is a parameterized query on a database. Such views are used, e.g., to model Web services. To answer a query on such views, the views have to be orchestrated together in execution plans. We show how queries can be rewritten into equivalent execution plans, which are guaranteed to deliver the same results as the query on all databases. We provide a correct and complete algorithm

  • PolyFit: Polynomial-based Indexing Approach for Fast Approximate Range Aggregate Queries
    arXiv.cs.DB Pub Date : 2020-03-18
    Zhe Li; Tsz Nam Chan; Man Lung Yiu; Christian S. Jensen

    Range aggregate queries find frequent application in data analytics. In some use cases, approximate results are preferred over accurate results if they can be computed rapidly and satisfy approximation guarantees. Inspired by a recent indexing approach, we provide means of representing a discrete point data set by continuous functions that can then serve as compact index structures. More specifically

  • Discovering Business Area Effects to Process Mining Analysis Using Clustering and Influence Analysis
    arXiv.cs.DB Pub Date : 2020-03-18
    Teemu Lehto; Markku Hinkka

    A common challenge for improving business processes in large organizations is that business people in charge of the operations are lacking a fact-based understanding of the execution details, process variants, and exceptions taking place in business operations. While existing process mining methodologies can discover these details based on event logs, it is challenging to communicate the process mining

  • Multi-dimensional Skyline Query to Find Best Shopping Mall for Customers
    arXiv.cs.DB Pub Date : 2020-03-17
    Md Amiruzzaman; Suphanut Jamonnak

    This paper presents a new application for multi-dimensional Skyline query. The idea presented in this paper can be used to find best shopping malls based on users requirements. A web-based application was used to simulate the problem and proposed solution. Also, a mathematical definition was developed to define the problem and show how multi-dimensional Skyline query can be used to solve complex problems

  • Duoquest: A Dual-Specification System for Expressive SQL Queries
    arXiv.cs.DB Pub Date : 2020-03-16
    Christopher Baik; Zhongjun Jin; Michael Cafarella; H. V. Jagadish

    Querying a relational database is difficult because it requires users to know both the SQL language and be familiar with the schema. On the other hand, many users possess enough domain familiarity or expertise to describe their desired queries by alternative means. For such users, two major alternatives to writing SQL are natural language interfaces (NLIs) and programming-by-example (PBE). Both of

  • Evolution of the ROOT Tree I/O
    arXiv.cs.DB Pub Date : 2020-03-17
    Jakob Blomer; Philippe Canal; Axel Naumann; Danilo Piparo

    The ROOT TTree data format encodes hundreds of petabytes of High Energy and Nuclear Physics events. Its columnar layout drives rapid analyses, as only those parts ("branches") that are really used in a given analysis need to be read from storage. Its unique feature is the seamless C++ integration, which allows users to directly store their event classes without explicitly defining data schemas. In

  • A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs
    arXiv.cs.DB Pub Date : 2020-03-10
    Zequn Sun; Qingheng Zhang; Wei Hu; Chengming Wang; Muhao Chen; Farahnaz Akrami; Chengkai Li

    Entity alignment seeks to find entities in different knowledge graphs (KGs) that refer to the same real-world object. Recent advancement in KG embedding impels the advent of embedding-based entity alignment, which encodes entities in a continuous embedding space and measures entity similarities based on the learned embeddings. In this paper, we conduct a comprehensive experimental study of this emerging

  • When is Ontology-Mediated Querying Efficient?
    arXiv.cs.DB Pub Date : 2020-03-17
    Pablo Barcelo; Cristina Feier; Carsten Lutz; Andreas Pieris

    In ontology-mediated querying, description logic (DL) ontologies are used to enrich incomplete data with domain knowledge which results in more complete answers to queries. However, the evaluation of ontology-mediated queries (OMQs) over relational databases is computationally hard. This raises the question when OMQ evaluation is efficient, in the sense of being tractable in combined complexity or

  • Approximate Query Service on Autonomous IoT Cameras
    arXiv.cs.DB Pub Date : 2019-09-02
    Mengwei Xu; Xiwen Zhang; Yunxin Liu; Xuanzhe Liu; Felix Xiaozhu Lin

    Today's analytics-powered cameras are still limited to urban, residential areas where power/network resources abound. To expand them to more diverse environments, especially those that are off-grid and highly network-constrained, the cameras shall be "autonomous", i.e., independent from external power supply and compute infrastructure. Can autonomous cameras do any useful analytics? Our response is iCam

  • Understanding and Benchmarking the Impact of GDPR on Database Systems
    arXiv.cs.DB Pub Date : 2019-10-02
    Supreeth Shastri; Vinay Banakar; Melissa Wasserman; Arun Kumar; Vijay Chidambaram

    The General Data Protection Regulation (GDPR) provides new rights and protections to European people concerning their personal data. We analyze GDPR from a systems perspective, translating its legal articles into a set of capabilities and characteristics that compliant systems must support. Our analysis reveals the phenomenon of metadata explosion, wherein large quantities of metadata need to be stored

  • A Fault-Tolerance Shim for Serverless Computing
    arXiv.cs.DB Pub Date : 2020-03-12
    Vikram Sreekanti; Chenggang Wu; Saurav Chhatrapati; Joseph E. Gonzalez; Joseph M. Hellerstein; Jose M. Faleiro

    Serverless computing has grown in popularity in recent years, with an increasing number of applications being built on Functions-as-a-Service (FaaS) platforms. By default, FaaS platforms support retry-based fault tolerance, but this is insufficient for programs that modify shared state, as they can unwittingly persist partial sets of updates in case of failures. To address this challenge, we would

  • mmLSH: A Practical and Efficient Technique for Processing Approximate Nearest Neighbor Queries on Multimedia Data
    arXiv.cs.DB Pub Date : 2020-03-13
    Omid Jafari; Parth Nagarkar; Johnathan Montaño

    Many large multimedia applications require efficient processing of nearest neighbor queries. Often, multimedia data are represented as a collection of important high-dimensional feature vectors. Locality Sensitive Hashing (LSH) is a very popular approximate technique for finding nearest neighbors in high-dimensional spaces. In order to find top-k similar multimedia objects, existing LSH techniques

Contents have been reproduced by permission of the publishers.