Current journal: arXiv - CS - Databases
  • DrugDBEmbed : Semantic Queries on Relational Database using Supervised Column Encodings
    arXiv.cs.DB Pub Date : 2020-07-05
    Bortik Bandyopadhyay; Pranav Maneriker; Vedang Patel; Saumya Yashmohini Sahai; Ping Zhang; Srinivasan Parthasarathy

    Traditional relational databases contain a lot of latent semantic information that has largely remained untapped due to the difficulty involved in automatically extracting such information. Recent works have proposed unsupervised machine learning approaches to extract such hidden information by textifying the database columns and then projecting the text tokens onto a fixed dimensional semantic vector

    Updated: 2020-07-07
  • Detecting Opportunities for Differential Maintenance of Extracted Views
    arXiv.cs.DB Pub Date : 2020-07-04
    Besat Kassaie; Frank Wm. Tompa

    Semi-structured and unstructured data management is challenging, but many of the problems encountered are analogous to problems already addressed in the relational context. In the area of information extraction, for example, the shift from engineering ad hoc, application-specific extraction rules towards using expressive languages such as CPSL and AQL creates opportunities to propose solutions that

    Updated: 2020-07-07
  • CICLAD: A Fast and Memory-efficient Closed Itemset Miner for Streams
    arXiv.cs.DB Pub Date : 2020-07-03
    Tomas Martin; Guy Francoeur; Petko Valtchev

    Mining association rules from data streams is a challenging task due to the (typically) limited resources available vs. the large size of the result. Frequent closed itemsets (FCI) enable an efficient first step, yet current FCI stream miners are not optimal on resource consumption, e.g. they store a large number of extra itemsets at an additional cost. In a search for a better storage-efficiency trade-off

    Updated: 2020-07-07
  • PPaaS: Privacy Preservation as a Service
    arXiv.cs.DB Pub Date : 2020-07-04
    Pathum Chamikara Mahawaga Arachchige; Peter Bertok; Ibrahim Khalil; Dongxi Liu; Seyit Camtepe

    Personally identifiable information (PII) can find its way into cyberspace through various channels, and many potential sources can leak such information. To preserve user privacy, researchers have devised different privacy-preserving approaches; however, the usability of these methods, in terms of practical use, needs careful analysis due to the high diversity and complexity of the methods. This paper

    Updated: 2020-07-07
  • ER model Partitioning: Towards Trustworthy Automated Systems Development
    arXiv.cs.DB Pub Date : 2020-07-02
    Dhammika Pieris; M. C Wijegunesekera; N. G. J Dias

    In database development, a conceptual model is created, in the form of an Entity-relationship (ER) model, and transformed to a relational database schema (RDS) to create the database. However, some important information represented on the ER model may not be transformed and represented on the RDS. This situation causes a loss of information during the transformation process. With a view to preserving

    Updated: 2020-07-03
  • Query Based Access Control for Linked Data
    arXiv.cs.DB Pub Date : 2020-07-01
    Sabrina Kirrane; Alessandra Mileo; Axel Polleres; Stefan Decker

    In recent years we have seen significant advances in the technology used to both publish and consume Linked Data. However, in order to support the next generation of ebusiness applications on top of interlinked machine-readable data, suitable forms of access control need to be put in place. Although a number of access control models and frameworks have been put forward, very little research has been

    Updated: 2020-07-02
  • FathomNet: An underwater image training database for ocean exploration and discovery
    arXiv.cs.DB Pub Date : 2020-06-30
    Océane Boulais; Ben Woodward; Brian Schlining; Lonny Lundsten; Kevin Barnard; Katy Croff Bell; Kakani Katija

    Thousands of hours of marine video data are collected annually from remotely operated vehicles (ROVs) and other underwater assets. However, current manual methods of analysis impede the full utilization of collected data for real time algorithms for ROV and large biodiversity analyses. FathomNet is a novel baseline image training set, optimized to accelerate development of modern, intelligent

    Updated: 2020-07-02
  • Hierarchical Graph Matching Network for Graph Similarity Computation
    arXiv.cs.DB Pub Date : 2020-06-30
    Haibo Xiu; Xiao Yan; Xiaoqiang Wang; James Cheng; Lei Cao

    Graph edit distance / similarity is widely used in many tasks, such as graph similarity search, binary function analysis, and graph clustering. However, computing the exact graph edit distance (GED) or maximum common subgraph (MCS) between two graphs is known to be NP-hard. In this paper, we propose the hierarchical graph matching network (HGMN), which learns to compute graph similarity from data.

    Updated: 2020-07-01
  • Lachesis: Automated Generation of Persistent Partitionings for Big Data Applications
    arXiv.cs.DB Pub Date : 2020-06-30
    Jia Zou; Pratik Barhate; Amitabh Das; Arun Iyengar; Binhang Yuan; Dimitrije Jankov; Chris Jermaine

    Persistent partitioning is effective in improving performance by avoiding the expensive shuffling operation, while incurring relatively small overhead. However, it remains a significant challenge to automate this process for UDF-centric analytics workloads, which are closely integrated with a high-level programming language such as Python, Scala, or Java. That is because user defined functions (UDFs) in

    Updated: 2020-07-01
  • Hands-off Model Integration in Spatial Index Structures
    arXiv.cs.DB Pub Date : 2020-06-29
    Ali Hadian; Ankit Kumar; Thomas Heinis

    Spatial indexes are crucial for the analysis of the increasing amounts of spatial data, for example generated through IoT applications. The plethora of indexes that has been developed in recent decades has primarily been optimised for disk. With increasing amounts of memory even on commodity machines, however, moving them to main memory is an option. Doing so opens up the opportunity to use additional

    Updated: 2020-07-01
  • Leveraging Soft Functional Dependencies for Indexing Multi-dimensional Data
    arXiv.cs.DB Pub Date : 2020-06-29
    Behzad Ghaffari; Ali Hadian; Thomas Heinis

    A new proposal in database indexing has been for index structures to automatically learn and use the distribution of the underlying data to improve their performance. Initial work on learned indexes has repeatedly shown that by learning the distribution of the data, index structures such as the B-Tree can boost their performance by an order of magnitude while using a smaller memory footprint

    Updated: 2020-07-01
  • Mining Documentation to Extract Hyperparameter Schemas
    arXiv.cs.DB Pub Date : 2020-06-30
    Guillaume Baudart; Peter D. Kirchner; Martin Hirzel; Kiran Kate

    AI automation tools need machine-readable hyperparameter schemas to define their search spaces. At the same time, AI libraries often come with good human-readable documentation. While such documentation contains most of the necessary information, it is unfortunately not ready to consume by tools. This paper describes how to automatically mine Python docstrings in AI libraries to extract JSON Schemas

    Updated: 2020-07-01
  • On Finite Entailment of Non-Local Queries in Description Logics
    arXiv.cs.DB Pub Date : 2020-06-30
    Tomasz Gogacz; Víctor Gutiérrez-Basulto; Albert Gutowski; Yazmín Ibáñez-García; Filip Murlak

    We study the problem of finite entailment of ontology-mediated queries. Going beyond local queries, we allow transitive closure over roles. We focus on ontologies formulated in the description logics ALCOI and ALCOQ, extended with transitive closure. For both logics, we show 2EXPTIME upper bounds for finite entailment of unions of conjunctive queries with transitive closure. We also provide a matching

    Updated: 2020-07-01
  • Neural Datalog Through Time: Informed Temporal Modeling via Logical Specification
    arXiv.cs.DB Pub Date : 2020-06-30
    Hongyuan Mei; Guanghui Qin; Minjie Xu; Jason Eisner

    Learning how to predict future events from patterns of past events is difficult when the set of possible event types is large. Training an unrestricted neural model might overfit to spurious patterns. To exploit domain-specific knowledge of how past events might affect an event's present probability, we propose using a temporal deductive database to track structured facts over time. Rules serve to

    Updated: 2020-07-01
  • On the Privacy-Utility Tradeoff in Peer-Review Data Analysis
    arXiv.cs.DB Pub Date : 2020-06-29
    Wenxin Ding; Nihar B. Shah; Weina Wang

    A major impediment to research on improving peer review is the unavailability of peer-review data, since any release of such data must grapple with the sensitivity of the peer review data in terms of protecting identities of reviewers from authors. We posit the need to develop techniques to release peer-review data in a privacy-preserving manner. Identifying this problem, in this paper we propose a

    Updated: 2020-07-01
  • Parallel Betweenness Computation in Graph Database for Contingency Selection
    arXiv.cs.DB Pub Date : 2020-06-29
    Yongli Zhu; Renchang Dai; Guangyi Liu

    Parallel betweenness computation algorithms are proposed and implemented in a graph database for power system contingency selection. Principles of the graph database and graph computing are investigated for both node and edge betweenness computation. Experiments on the 118-bus system and a real power system show that speed-up can be achieved for both node and edge betweenness computation while the

    Updated: 2020-07-01
  • Differential Privacy of Hierarchical Census Data: An Optimization Approach
    arXiv.cs.DB Pub Date : 2020-06-28
    Ferdinando Fioretto; Pascal Van Hentenryck; Keyu Zhu

    This paper is motivated by applications of a Census Bureau interested in releasing aggregate socio-economic data about a large population without revealing sensitive information about any individual. The released information can be the number of individuals living alone, the number of cars they own, or their salary brackets. Recent events have identified some of the privacy challenges faced by these

    Updated: 2020-06-30
  • Efficient Matrix Factorization on Heterogeneous CPU-GPU Systems
    arXiv.cs.DB Pub Date : 2020-06-24
    Yuanhang Yu; Dong Wen; Ying Zhang; Xiaoyang Wang; Wenjie Zhang; Xuemin Lin

    Matrix Factorization (MF) has been widely applied in machine learning and data mining. A large number of algorithms have been studied to factorize matrices. Among them, stochastic gradient descent (SGD) is a commonly used method. Heterogeneous systems with multi-core CPUs and GPUs have become more and more promising recently due to the prevalence of GPUs in general-purpose data-parallel applications

    Updated: 2020-06-30
  • SPIDER: Selective Plotting of Interconnected Data and Entity Relations
    arXiv.cs.DB Pub Date : 2020-06-25
    Pranav Addepalli; Eric Wu; Douglas Bossart; Christina Lin; Allistar Smith

    Intelligence analysts have long struggled with an abundance of data that must be investigated on a daily basis. In the U.S. Army, this activity involves reconciling information from various sources, a process that has been automated to a certain extent, but which remains highly manual. To promote automation, a semantic analysis prototype was designed to aid in the intelligence analysis process. This

    Updated: 2020-06-26
  • Coconut: a scalable bottom-up approach for building data series indexes
    arXiv.cs.DB Pub Date : 2020-06-20
    Haridimos Kondylakis; Niv Dayan; Kostas Zoumpatianos; Themis Palpanas

    Many modern applications produce massive amounts of data series that need to be analyzed, requiring efficient similarity search operations. However, the state-of-the-art data series indexes that are used for this purpose do not scale well for massive datasets in terms of performance, or storage costs. We pinpoint the problem to the fact that existing summarizations of data series used for indexing

    Updated: 2020-06-25
  • Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads
    arXiv.cs.DB Pub Date : 2020-06-23
    Jialin Ding; Vikram Nathan; Mohammad Alizadeh; Tim Kraska

    Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent

    Updated: 2020-06-25
  • Coconut Palm: Static and Streaming Data Series Exploration Now in your Palm
    arXiv.cs.DB Pub Date : 2020-06-20
    Haridimos Kondylakis; Niv Dayan; Kostas Zoumpatianos; Themis Palpanas

    Many modern applications produce massive streams of data series and maintain them in indexes to be able to explore them through nearest neighbor search. Existing data series indexes, however, are expensive to operate as they issue many random I/Os to storage. To address this problem, we recently proposed Coconut, a new infrastructure that organizes data series based on a new sortable format. In this

    Updated: 2020-06-24
  • PRIPEL: Privacy-Preserving Event Log Publishing Including Contextual Information
    arXiv.cs.DB Pub Date : 2020-06-23
    Stephan A. Fahrenkrog-Petersen; Han van der Aa; Matthias Weidlich

    Event logs capture the execution of business processes in terms of executed activities and their execution context. Since logs contain potentially sensitive information about the individuals involved in the process, they should be pre-processed before being published to preserve the individuals' privacy. However, existing techniques for such pre-processing are limited to a process' control-flow and

    Updated: 2020-06-24
  • An Efficient Index for Contact Tracing Query in a Large Spatio-Temporal Database
    arXiv.cs.DB Pub Date : 2020-06-23
    Mohammed Eunus Ali; Shadman Saqib Eusuf; Kazi Ashik Islam

    In this paper, we study a novel contact tracing query (CTQ) that finds users who have been in direct contact with the query user or in contact with the already contacted users in subsequent timestamps from a large spatio-temporal database. The CTQ is of paramount importance in the era of the COVID-19 pandemic for finding a list of patients potentially exposed to COVID-19. A straightforward

    Updated: 2020-06-24
  • Benchmarking Learned Indexes
    arXiv.cs.DB Pub Date : 2020-06-23
    Ryan Marcus; Andreas Kipf; Alexander van Renen; Mihail Stoian; Sanchit Misra; Alfons Kemper; Thomas Neumann; Tim Kraska

    Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art "traditional" baselines. Using four real-world datasets, we demonstrate that learned index structures can indeed

    Updated: 2020-06-24
  • Database Optimization to Recommend Software Developers using Canonical Order Tree
    arXiv.cs.DB Pub Date : 2020-06-21
    T. M. Amir-Ul-Haque Bhuiyan; Mehedi Hasan Talukdar; Ziaur Rahman; Dr. Mohammad Motiur Rahman

    Recently, frequent and sequential pattern mining algorithms have been widely used in the field of software engineering to mine various source code or specification patterns. In practice, software must evolve from one version to another to provide extra facilities to users. This kind of task is challenging in this domain since the database is usually updated in all kinds of manners such as insertion

    Updated: 2020-06-24
  • Distributed Subgraph Enumeration via Backtracking-based Framework
    arXiv.cs.DB Pub Date : 2020-06-23
    Zhaokang Wang; Weiwei Hu; Chunfeng Yuan; Rong Gu; Yihua Huang

    Given a small pattern graph and a large data graph, the task of subgraph enumeration is to find all subgraphs of the data graph that are isomorphic to the pattern graph. When the data graph is dynamic, the task of continuous subgraph enumeration is to detect the changes in the matching results caused by the edge updates at each time step. The two tasks are fundamental in many graph analysis applications

    Updated: 2020-06-24
  • AOT: Pushing the Efficiency Boundary of Main-memory Triangle Listing
    arXiv.cs.DB Pub Date : 2020-06-20
    Michael Yu; Lu Qin; Ying Zhang; Wenjie Zhang; Xuemin Lin

    Triangle listing is an important topic that is significant in many practical applications. Efficient algorithms exist for the task of triangle listing. Recent algorithms leverage an orientation framework, which can be thought of as mapping an undirected graph to a directed acyclic graph, namely an oriented graph, with respect to any global vertex order. In this paper, we propose an adaptive orientation technique

    Updated: 2020-06-23
  • Coconut: sortable summarizations for scalable indexes over static and streaming data series
    arXiv.cs.DB Pub Date : 2020-06-20
    Haridimos Kondylakis; Niv Dayan; Kostas Zoumpatianos; Themis Palpanas

    Many modern applications produce massive streams of data series that need to be analyzed, requiring efficient similarity search operations. However, the state-of-the-art data series indexes that are used for this purpose do not scale well for massive datasets in terms of performance, or storage costs. We pinpoint the problem to the fact that existing summarizations of data series used for indexing

    Updated: 2020-06-23
  • Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search
    arXiv.cs.DB Pub Date : 2020-06-20
    Karima Echihabi; Kostas Zoumpatianos; Themis Palpanas; Houda Benbrahim

    Data series are a special type of multidimensional data present in numerous domains, where similarity search is a key operation that has been extensively studied in the data series literature. In parallel, the multidimensional community has studied approximate similarity search techniques. We propose a taxonomy of similarity search techniques that reconciles the terminology used in these two domains

    Updated: 2020-06-23
  • The Lernaean Hydra of Data Series Similarity Search: An Experimental Evaluation of the State of the Art
    arXiv.cs.DB Pub Date : 2020-06-20
    Karima Echihabi; Kostas Zoumpatianos; Themis Palpanas; Houda Benbrahim

    Increasingly large data series collections are becoming commonplace across many different domains and applications. A key operation in the analysis of data series collections is similarity search, which has attracted lots of attention and effort over the past two decades. Even though several relevant approaches have been proposed in the literature, none of the existing studies provides a detailed evaluation

    Updated: 2020-06-23
  • Experimental Analysis of Locality Sensitive Hashing Techniques for High-Dimensional Approximate Nearest Neighbor Searches
    arXiv.cs.DB Pub Date : 2020-06-19
    Omid Jafari; Parth Nagarkar

    Finding nearest neighbors in high-dimensional spaces is a fundamental operation in many multimedia retrieval applications. Exact tree-based indexing approaches are known to suffer from the notorious curse of dimensionality for high-dimensional data. Approximate searching techniques sacrifice some accuracy while returning good enough results for faster performance. Locality Sensitive Hashing (LSH) is

    Updated: 2020-06-23
  • Improving Locality Sensitive Hashing by Efficiently Finding Projected Nearest Neighbors
    arXiv.cs.DB Pub Date : 2020-06-19
    Omid Jafari; Parth Nagarkar; Jonathan Montaño

    Similarity search in high-dimensional spaces is an important task for many multimedia applications. Due to the notorious curse of dimensionality, approximate nearest neighbor techniques are preferred over exact searching techniques since they can return good enough results at a much better speed. Locality Sensitive Hashing (LSH) is a very popular random hashing technique for finding approximate nearest

    Updated: 2020-06-23
  • P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model
    arXiv.cs.DB Pub Date : 2020-06-22
    Shun Takagi; Tsubasa Takahashi; Yang Cao; Masatoshi Yoshikawa

    How can we release a massive volume of sensitive data while mitigating privacy risks? Privacy-preserving data synthesis enables the data holder to outsource analytical tasks to an untrusted third party. The state-of-the-art approach for this problem is to build a generative model under differential privacy, which offers a rigorous privacy guarantee. However, the existing method cannot adequately handle

    Updated: 2020-06-23
  • Overlook: Differentially Private Exploratory Visualization for Big Data
    arXiv.cs.DB Pub Date : 2020-06-22
    Pratiksha Thaker; Mihai Budiu; Parikshit Gopalan; Udi Wieder; Matei Zaharia

    Data exploration systems that provide differential privacy must manage a privacy budget that measures the amount of privacy lost across multiple queries. One effective strategy to manage the privacy budget is to compute a one-time private synopsis of the data, to which users can make an unlimited number of queries. However, existing systems using synopses are built for offline use cases, where a set

    Updated: 2020-06-23
  • Sorting-based Interactive Regret Minimization
    arXiv.cs.DB Pub Date : 2020-06-19
    Jiping Zheng; Chen Chen

    As an important tool for multi-criteria decision making in database systems, the regret minimization query is shown to have the merits of top-k and skyline queries: it controls the output size while not requiring users to provide any preferences. Existing research verifies that the regret ratio can be greatly decreased when interaction is available. In this paper, we study how to enhance current interactive

    Updated: 2020-06-22
  • Record fusion: A learning approach
    arXiv.cs.DB Pub Date : 2020-06-18
    Alireza Heidari; George Michalopoulos; Shrinu Kushagra; Ihab F. Ilyas; Theodoros Rekatsinas

    Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value for each attribute for each entity. Given a database, we use a combination of attribute-level, record-level, and database-level signals to construct a feature vector for each cell

    Updated: 2020-06-19
  • Wide-Area Data Analytics
    arXiv.cs.DB Pub Date : 2020-06-17
    Rachit Agarwal (workshop co-chair); Jen Rexford (workshop co-chair); with contributions from numerous workshop attendees

    We increasingly live in a data-driven world, with diverse kinds of data distributed across many locations. In some cases, the datasets are collected from multiple locations, such as sensors (e.g., mobile phones and street cameras) spread throughout a geographic region. The data may need to be analyzed close to where they are produced, particularly when the applications require low latency, high, low

    Updated: 2020-06-19
  • Incremental Lossless Graph Summarization
    arXiv.cs.DB Pub Date : 2020-06-17
    Jihoon Ko; Yunbum Kook; Kijung Shin

    Given a fully dynamic graph, represented as a stream of edge insertions and deletions, how can we obtain and incrementally update a lossless summary of its current snapshot? As large-scale graphs are prevalent, concisely representing them is inevitable for efficient storage and analysis. Lossless graph summarization is an effective graph-compression technique with many desirable properties. It aims

    Updated: 2020-06-18
  • Index Selection for NoSQL Database with Deep Reinforcement Learning
    arXiv.cs.DB Pub Date : 2020-06-16
    Shun Yao; Hongzhi Wang; Yu Yan

    We propose a new approach to NoSQL database index selection. For different workloads, we select different indexes and their different parameters to optimize the database performance. The approach builds a deep reinforcement learning model to select an optimal index for a given fixed workload and adapts to a changing workload. Experimental results show that Deep Reinforcement Learning Index Selection

    Updated: 2020-06-16
  • MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining
    arXiv.cs.DB Pub Date : 2020-06-16
    Leonardo Pellegrina; Cyrus Cousins; Fabio Vandin; Matteo Riondato

    We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both statistically-significant

    Updated: 2020-06-16
  • Comparing Alternative Route Planning Techniques: A Web-based Demonstration and User Study
    arXiv.cs.DB Pub Date : 2020-06-15
    Lingxiao Li; Muhammad Aamir Cheema; Hua Lu; Mohammed Eunus Ali; Adel N. Toosi

    Due to the popularity of smartphones, cheap wireless networks and availability of road network data, navigation applications have become a part of our everyday life. Many modern navigation systems and map-based services do not only provide the fastest route from a source location s to a target location t but also provide a few alternative routes to the users as more options to choose from. Consequently

    Updated: 2020-06-15
  • Needles in the 'Sheet'stack: Augmented Analytics to get Insights from Spreadsheets
    arXiv.cs.DB Pub Date : 2020-06-15
    Medha Atre; Anand Deshpande; Reshma Godse; Pooja Deokar; Sandip Moharir; Dhruva Ray; Akshay Chitlangia; Trupti Phadnis; Yugansh Goyal

    Business intelligence (BI) tools for database analytics have come a long way and nowadays also provide ready insights or visual query explorations, e.g. QuickInsights by Microsoft Power BI, SpotIQ by ThoughtSpot, Zenvisage, etc. In this demo, we focus on providing insights by examining periodic spreadsheets of different reports (aka views), without prior knowledge of the schema of the database or reports

    Updated: 2020-06-15
  • NeuroCard: One Cardinality Estimator for All Tables
    arXiv.cs.DB Pub Date : 2020-06-15
    Zongheng Yang; Amog Kamsetty; Sifei Luan; Eric Liang; Yan Duan; Xi Chen; Ion Stoica

    Query optimizers rely on accurate cardinality estimates to produce good execution plans. Despite decades of research, existing cardinality estimators are inaccurate for complex queries, due to making lossy modeling assumptions and not capturing inter-table correlations. In this work, we show that it is possible to learn the correlations across all tables in a database without any independence assumptions

    Updated: 2020-06-15
  • Oblivious and Semi-Oblivious Boundedness for Existential Rules
    arXiv.cs.DB Pub Date : 2020-06-15
    Pierre Bourhis; Michel Leclère; Marie-Laure Mugnier; Sophie Tison; Federico Ulliana; Lily Galois

    We study the notion of boundedness in the context of positive existential rules, that is, whether there exists an upper bound to the depth of the chase procedure, that is independent from the initial instance. By focussing our attention on the oblivious and the semi-oblivious chase variants, we give a characterization of boundedness in terms of FO-rewritability and chase termination. We show that it

    Updated: 2020-06-15
  • CoT: Decentralized Elastic Caches for Cloud Environments
    arXiv.cs.DB Pub Date : 2020-06-15
    Victor Zakhary; Lawrence Lim; Divyakant Agrawal; Amr El Abbadi

    Distributed caches are widely deployed to serve social networks and web applications at billion-user scales. This paper presents Cache-on-Track (CoT), a decentralized, elastic, and predictive caching framework for cloud environments. CoT proposes a new cache replacement policy specifically tailored for small front-end caches that serve skewed workloads. Front-end servers use a heavy hitter tracking

    Updated: 2020-06-15
  • Categorical anomaly detection in heterogeneous data using minimum description length clustering
    arXiv.cs.DB Pub Date : 2020-06-14
    James Cheney; Xavier Gombau; Ghita Berrada; Sidahmed Benabderrahmane

    Fast and effective unsupervised anomaly detection algorithms have been proposed for categorical data based on the minimum description length (MDL) principle. However, they can be ineffective when detecting anomalies in heterogeneous datasets representing a mixture of different sources, such as security scenarios in which system and user processes have distinct behavior patterns. We propose a meta-algorithm

    Updated: 2020-06-14
  • Solos: A Dataset for Audio-Visual Music Analysis
    arXiv.cs.DB Pub Date : 2020-06-14
    Juan F. Montesinos; Olga Slizovskaia; Gloria Haro

    In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task. These videos, gathered from YouTube, consist of solo musical performances of 13 different

    Updated: 2020-06-14
  • High-Level ETL for Semantic Data Warehouses -- Full Version
    arXiv.cs.DB Pub Date : 2020-06-12
    Rudra Pratap Deb Nath; Oscar Romero; Torben Bach Pedersen; Katja Hose

    The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements to Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. The incorporation of semantic data into a Data Warehouse (DW) is not supported by the traditional Extract-Transform-Load

    Updated: 2020-06-12
  • Indexing Data on the Web: A Comparison of Schema-level Indices for Data Search -- Extended Technical Report
    arXiv.cs.DB Pub Date : 2020-06-12
    Till Blume; Ansgar Scherp

    Indexing the Web of Data offers many opportunities, in particular, to find and explore data sources. One major design decision when indexing the Web of Data is to find a suitable index model, i.e., how to index and summarize data. Various efforts have been conducted to develop specific index models for a given task. With each index model designed, implemented, and evaluated independently, it remains

    Updated: 2020-06-12
  • Hindsight Logging for Model Training
    arXiv.cs.DB Pub Date : 2020-06-12
    Rolando Garcia; Eric Liu; Vikram Sreekanti; Bobby Yan; Anusha Dandamudi; Joseph E. Gonzalez; Joseph M. Hellerstein; Koushik Sen

    Due to the long time-lapse between the triggering and detection of a bug in the machine learning lifecycle, model developers favor data-centric logfile analysis over traditional interactive debugging techniques. But when useful execution data is missing from the logs after training, developers have little recourse beyond re-executing training with more logging statements, or guessing. In this paper

    Updated: 2020-06-12
  • Google Dataset Search by the Numbers
    arXiv.cs.DB Pub Date : 2020-06-12
    Omar Benjelloun; Shiyu Chen; Natasha Noy

    Scientists, governments, and companies increasingly publish datasets on the Web. Google's Dataset Search extracts dataset metadata -- expressed using schema.org and similar vocabularies -- from Web pages in order to make datasets discoverable. Since we started the work on Dataset Search in 2016, the number of datasets described in schema.org has grown from about 500K to almost 30M. Thus, this corpus

    Updated: 2020-06-12
  • EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal In GPUs
    arXiv.cs.DB Pub Date : 2020-06-12
    Seung Won Min; Vikram Sharma Mailthody; Zaid Qureshi; Jinjun Xiong; Eiman Ebrahimi; Wen-mei Hwu

    Modern analytics and recommendation systems are increasingly based on graph data that capture the relations between entities being analyzed. Practical graphs come in huge sizes, offer massive parallelism, and are stored in sparse-matrix formats such as CSR. To exploit the massive parallelism, developers are increasingly interested in using GPUs for graph traversal. However, due to their sizes, graphs

    Updated: 2020-06-12
  • TableQA: a Large-Scale Chinese Text-to-SQL Dataset for Table-Aware SQL Generation
    arXiv.cs.DB Pub Date : 2020-06-10
    Ningyuan Sun; Xuefeng Yang; Yunfeng Liu

    Parsing natural language to corresponding SQL (NL2SQL) with data-driven approaches like deep neural networks has attracted much attention in recent years. Existing NL2SQL datasets assume that condition values should appear exactly in natural language questions and the queries are answerable given the table. However, these assumptions may fail in practical scenarios, because users may use different expressions

    Updated: 2020-06-10
  • Fair Data Integration
    arXiv.cs.DB Pub Date : 2020-06-10
    Sainyam Galhotra; Karthikeyan Shanmugam; Prasanna Sattigeri; Kush R. Varshney

    The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps to generate high quality training data, most of the fairness literature ignores this stage. In this work, we consider fairness in the integration component of data management, aiming to identify features that

    Updated: 2020-06-10
  • Fast Subtrajectory Similarity Search in Road Networks under Weighted Edit Distance Constraints
    arXiv.cs.DB Pub Date : 2020-06-09
    Satoshi Koide; Chuan Xiao; Yoshiharu Ishikawa

    In this paper, we address a similarity search problem for spatial trajectories in road networks. In particular, we focus on the subtrajectory similarity search problem, which involves finding in a database the subtrajectories similar to a query trajectory. A key feature of our approach is that we do not focus on a specific similarity function; instead, we consider weighted edit distance (WED), a class

    Updated: 2020-06-09
  • Dynamic Interleaving of Content and Structure for Robust Indexing of Semi-Structured Hierarchical Data (Extended Version)
    arXiv.cs.DB Pub Date : 2020-06-09
    Kevin Wellenzohn; Michael H. Böhlen; Sven Helmer

    We propose a robust index for semi-structured hierarchical data that supports content-and-structure (CAS) queries specified by path and value predicates. At the heart of our approach is a novel dynamic interleaving scheme that merges the path and value dimensions of composite keys in a balanced way. We store these keys in our trie-based Robust Content-And-Structure index, which efficiently supports

    Updated: 2020-06-09
  • Lethe: A Tunable Delete-Aware LSM Engine
    arXiv.cs.DB Pub Date : 2020-06-08
    Subhadeep Sarkar; Tarikul Islam Papon; Dimitris Staratzis; Manos Athanassoulis

    Data-intensive applications fueled the evolution of log structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as a second-class citizen. A delete inserts a tombstone that invalidates older instances of the deleted key

    Updated: 2020-06-08
  • Toward a Better Understanding and Evaluation of Tree Structures on Flash SSDs
    arXiv.cs.DB Pub Date : 2020-06-08
    Diego Didona; Nikolas Ioannou; Radu Stoica; Kornilios Kourtis

    Solid-state drives (SSDs) are extensively used to deploy persistent data stores, as they provide low latency random access, high write throughput, high data density, and low cost. Tree-based data structures are widely used to build persistent data stores, and indeed they lie at the backbone of many of the data management systems used in production and research today. In this paper, we show that benchmarking

    Updated: 2020-06-08
  • Blockchain-Based Differential Privacy Cost Management System
    arXiv.cs.DB Pub Date : 2020-06-08
    Leong Mei Han; Yang Zhao; Jun Zhao

    Privacy preservation is a big concern for various sectors. To protect individual user data, one emerging technology is differential privacy. However, it still has limitations for datasets with frequent queries, such as the fast accumulation of privacy cost. To tackle this limitation, this paper explores the integration of a secured decentralised ledger, blockchain. Blockchain will be able to keep track

    Updated: 2020-06-08
Contents have been reproduced by permission of the publishers.