• arXiv.cs.DB Pub Date : 2020-01-17
Evgeny Krivosheev; Mattia Atzeni; Katsiaryna Mirylenka; Paolo Scotton; Fabio Casati

Data integration has been studied extensively for decades and approached from different angles. However, this domain remains largely rule-driven and lacks universal automation. Recent developments in machine learning, and in particular deep learning, have opened the way to more general and more efficient solutions to data integration problems. In this work, we propose a general approach to modeling and integrating entities from structured data, such as relational databases, as well as from unstructured sources, such as free text from news articles. Our approach is designed to explicitly model and leverage relations between entities, thereby using all available information and preserving as much context as possible. This is achieved by combining siamese and graph neural networks to propagate information between connected entities and support high scalability. We evaluate our method on the task of integrating data about business entities, and we demonstrate that it outperforms standard rule-based systems, as well as other deep learning approaches that do not use graph-based representations.
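
The two ingredients can be illustrated with a toy sketch (the graph, entity names, and identity encoder below are hypothetical stand-ins, not the paper's architecture): one round of mean-neighbor message passing injects relational context into each entity's features, and a shared (siamese) encoder plus cosine similarity scores candidate matches:

```python
import math

def propagate(feats, edges):
    # One round of mean-neighbor message passing: each entity's vector
    # is averaged with its neighbors', injecting relational context.
    out = {}
    for v, f in feats.items():
        nbrs = [feats[b] for a, b in edges if a == v]
        nbrs += [feats[a] for a, b in edges if b == v]
        vecs = [f] + nbrs
        out[v] = [sum(xs) / len(vecs) for xs in zip(*vecs)]
    return out

def siamese_score(x, y):
    # The same encoder (here: identity) is applied to both sides and
    # similarity is measured by cosine -- the siamese part of the model.
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

feats = {"acme_db": [1.0, 0.0], "acme_news": [0.9, 0.1], "other": [0.0, 1.0]}
edges = [("acme_db", "acme_news")]
h = propagate(feats, edges)
match = siamese_score(h["acme_db"], h["acme_news"])
nonmatch = siamese_score(h["acme_db"], h["other"])
```

After propagation, the two connected mentions of the same business end up with near-identical representations, while the unrelated entity stays far away.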

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2020-01-18
Xinxun Zeng; Shiqi Zhang; Bo Tang

The Influence Maximization Problem (IMP) asks for a seed set of nodes in a social network that spreads influence as widely as possible. It has applications in multiple domains; for example, viral marketing is frequently used to advertise new products or activities. While IMP is a classic and well-studied problem in computer science, existing techniques unfortunately trade off among time efficiency, memory consumption, and result quality. In this paper, we conduct comprehensive experimental studies of state-of-the-art approximate IMP approaches to reveal their underlying trade-off strategies. Interestingly, we find that even the state-of-the-art approaches become impractical once the propagation probabilities of the network are taken into consideration. Building on these findings, we propose a novel residual-based approach (RCELF) for IMP, which i) overcomes the deficiencies of existing approximate approaches, and ii) provides theoretically guaranteed results with high efficiency in both time and space. We demonstrate the superiority of our proposal by extensive experimental evaluation on real datasets.
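
As background for the residual idea, the classic lazy-greedy (CELF) skeleton that RCELF builds on can be sketched as follows (a generic illustration with a toy coverage-style spread function, not the paper's RCELF):

```python
import heapq

reach = {1: {1, 2, 3}, 2: {2, 3}, 3: {3, 4, 5, 6}, 4: {4}}

def spread(S):
    # Toy monotone submodular "spread": nodes covered by the seed set.
    return len(set().union(*(reach[v] for v in S))) if S else 0

def celf(candidates, spread, k):
    # Lazy-greedy selection: cached marginal gains are re-evaluated
    # only when a candidate reaches the top of the max-heap.
    heap = [(-spread({v}), v) for v in candidates]
    heapq.heapify(heap)
    seeds, current = set(), 0
    while heap and len(seeds) < k:
        _, v = heapq.heappop(heap)
        fresh = spread(seeds | {v}) - current   # recompute marginal gain
        if heap and fresh < -heap[0][0]:
            heapq.heappush(heap, (-fresh, v))   # stale gain: defer v
        else:
            seeds.add(v)
            current += fresh
    return seeds

best = celf(list(reach), spread, 2)
```

Submodularity guarantees a stale cached gain only over-estimates the true marginal gain, which is what makes the lazy re-evaluation safe.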

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2020-01-18
Shuo Wang; Tianle Chen; Shangyu Chen; Carsten Rudolph; Surya Nepal; Marthie Grobler

Anomaly detection aims to recognize samples with anomalous and unusual patterns with respect to a set of normal data, which is significant for numerous applications, e.g., industrial inspection, medical imaging, and security enforcement. There are two key research challenges associated with existing anomaly detection approaches: (1) many of them perform well on low-dimensional problems, but their performance on high-dimensional instances, such as images, is limited; (2) many of them still rely on traditional supervised approaches and manual feature engineering, while the topic has not yet been fully explored with modern deep learning approaches, especially when well-labeled samples are limited. In this paper, we propose a One-for-all Image Anomaly Detection system (OIAD) based on disentangled learning using only clean samples. Our key insight is that the impact of a small perturbation on the latent representation can be bounded for normal samples, while anomalous images usually fall outside such bounded intervals; we call this property structure consistency. We implement this idea and evaluate its performance for anomaly detection. Our experiments with three datasets show that OIAD can detect over $90\%$ of anomalies while maintaining a low false alarm rate. It can also detect suspicious samples among samples labeled as clean, coinciding with what humans would deem unusual.
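
The structure-consistency test can be sketched in a few lines (the encoder below is a hypothetical stand-in for a trained disentangled model, chosen only so that normal inputs have a bounded latent shift):

```python
def latent_shift(encode, x, eps=0.01):
    # Perturb the input slightly and measure how far the latent
    # representation moves; for normal samples the shift stays within
    # a bound learned from clean data, for anomalies it does not.
    z0 = encode(x)
    z1 = encode([xi + eps for xi in x])
    return sum((a - b) ** 2 for a, b in zip(z0, z1)) ** 0.5

# Hypothetical encoder: smooth on the value range it was "trained" on
# ([0, 1]), unstable outside it.
encode = lambda x: [xi if 0.0 <= xi <= 1.0 else xi * 100.0 for xi in x]
shift_normal = latent_shift(encode, [0.2, 0.5])
shift_anomaly = latent_shift(encode, [0.2, 5.0])
```

Thresholding the shift then separates the two cases without ever training on anomalous samples.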

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2020-01-18
Jeremy Kepner; Vijay Gadepally; Hayden Jananthan; Lauren Milechin; Siddharth Samsi

The AI revolution is data driven. AI "data wrangling" is the process by which unusable data is transformed to support AI algorithm development (training) and deployment (inference). Significant time is devoted to translating diverse data representations supporting the many query and analysis steps found in an AI pipeline. Rigorous mathematical representations of these data enable data translation and analysis optimization within and across steps. Associative array algebra provides a mathematical foundation that naturally describes the tabular structures and set mathematics that are the basis of databases. Likewise, the matrix operations and corresponding inference/training calculations used by neural networks are also well described by associative arrays. More surprisingly, a general denormalized form of hierarchical formats, such as XML and JSON, can be readily constructed. Finally, pivot tables, which are among the most widely used data analysis tools, naturally emerge from associative array constructors. A common foundation in associative arrays provides interoperability guarantees, proving that their operations are linear systems with rigorous mathematical properties, such as associativity, commutativity, and distributivity, that are critical to reordering optimizations.
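
A minimal sketch of an associative array as a sparse (row, col) -> value map (illustrative only, not the GraphBLAS/D4M API): element-wise addition and a semiring-parameterized multiply are enough to see how the same algebra covers tables and matrices:

```python
class AssocArray:
    # Sparse associative array keyed by (row, col) pairs; rows and
    # columns may be arbitrary strings, so the same object models a
    # database table, a matrix, or a denormalized hierarchy.
    def __init__(self, data=None):
        self.data = dict(data or {})

    def __add__(self, other):
        # Element-wise addition: union of keys, summed values.
        out = dict(self.data)
        for k, v in other.data.items():
            out[k] = out.get(k, 0) + v
        return AssocArray(out)

    def matmul(self, other, plus=lambda a, b: a + b,
               times=lambda a, b: a * b):
        # Semiring matrix multiply: (i,j) accumulates times(v, w) over
        # matching inner keys, combined with plus.
        out = {}
        for (i, k1), v in self.data.items():
            for (k2, j), w in other.data.items():
                if k1 == k2:
                    term = times(v, w)
                    out[(i, j)] = plus(out[(i, j)], term) if (i, j) in out else term
        return AssocArray(out)

A = AssocArray({("alice", "db"): 1, ("bob", "ml"): 2})
B = AssocArray({("db", "2020"): 3})
C = A.matmul(B)
```

Swapping in a different plus/times pair (e.g. `min` and `+`) reuses the same multiply for shortest-path-style computations, which is what makes the algebraic reordering properties above meaningful.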

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2020-01-19
Yueji Yang; Anthony K. H. Tung

Keyword search on Knowledge Graphs (KGs) has recently become popular. Typical keyword search approaches aim at finding a concise subgraph of a KG that reflects a close relationship among all input keywords. The connection paths between keywords are selected so as to produce a result subgraph with a better semantic score. However, such a result may not meet the user's information need, because it relies on the scoring function to decide which keywords to link more closely, and may therefore miss close connections among the keywords on which users intend to focus. In this paper, we propose a parallel keyword search engine, called RAKS. It allows users to specify a query as two sets of keywords: central keywords and marginal keywords. Central keywords are those on which users focus more; their relationships are desired in the results. Marginal keywords are less-focused keywords; their connections to the central keywords are desired, and they provide additional information that helps discover better results in terms of user intent. To improve efficiency, we propose novel weighting and scoring schemes that boost parallel execution during search while retrieving semantically relevant results. We conduct extensive experiments to validate that RAKS works efficiently and effectively on large, diverse open KGs.

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2020-01-19
Yi Wang; Yang Yang; Weiguo Zhu; Yi Wu; Xu Yan; Yongfeng Liu; Yu Wang; Liang Xie; Ziyao Gao; Wenjing Zhu; Xiang Chen; Wei Yan; Mingjie Tang; Yuan Tang

Industrial AI systems are mostly end-to-end machine learning (ML) workflows. A typical recommendation or business intelligence system includes many online micro-services and offline jobs. We describe SQLFlow for developing such workflows efficiently in SQL. SQL enables developers to write short programs focusing on the purpose (what) and ignoring the procedure (how). Previous database systems extended their SQL dialect to support ML. SQLFlow (https://sqlflow.org/sqlflow ) takes another strategy: it works as a bridge over various database systems, including MySQL, Apache Hive, and Alibaba MaxCompute, and ML engines like TensorFlow, XGBoost, and scikit-learn. We extended the SQL syntax carefully so that the extension works with various SQL dialects, and we implement it with a novel collaborative parsing algorithm. SQLFlow is efficient and expressive enough to support a wide variety of ML techniques -- supervised and unsupervised learning; deep networks and tree models; visual model explanation in addition to training and prediction; data processing and feature extraction in addition to ML. SQLFlow compiles a SQL program into a Kubernetes-native workflow for fault-tolerant execution and on-cloud deployment. Current industrial users include Ant Financial, DiDi, and Alibaba Group.

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2020-01-20
Sujaya Maiyya; Danny Hyun Bum Cho; Divyakant Agrawal; Amr El Abbadi

Significant amounts of data are currently being stored and managed on third-party servers. It is impractical for many small scale enterprises to own their private datacenters, hence renting third-party servers is a viable solution for such businesses. But the increasing number of malicious attacks, both internal and external, as well as buggy software on third-party servers is causing clients to lose their trust in these external infrastructures. While small enterprises cannot avoid using external infrastructures, they need the right set of protocols to manage their data on untrusted infrastructures. In this paper, we propose TFCommit, a novel atomic commitment protocol that executes transactions on data stored across multiple untrusted servers. To our knowledge, TFCommit is the first atomic commitment protocol to execute transactions in an untrusted environment without using expensive Byzantine replication. Using TFCommit, we propose an auditable data management system, Fides, residing completely on untrustworthy infrastructure. As an auditable system, Fides guarantees the detection of potentially malicious failures occurring on untrusted servers using tamper-resistant logs with the support of cryptographic techniques. The experimental evaluation demonstrates the scalability and the relatively low overhead of our approach that allows executing transactions on untrusted infrastructure.

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2020-01-20
Jeremy Kepner; Tim Davis; Chansup Byun; William Arcand; David Bestor; William Bergeron; Vijay Gadepally; Matthew Hubbell; Michael Houle; Michael Jones; Anna Klein; Peter Michaleas; Lauren Milechin; Julie Mullen; Andrew Prout; Antonio Rosa; Siddharth Samsi; Charles Yee; Albert Reuther

The SuiteSparse GraphBLAS C-library implements high performance hypersparse matrices with bindings to a variety of languages (Python, Julia, and Matlab/Octave). GraphBLAS provides a lightweight in-memory database implementation of hypersparse matrices that is ideal for analyzing many types of network data, while providing rigorous mathematical guarantees, such as linearity. Streaming updates of hypersparse matrices put enormous pressure on the memory hierarchy. This work benchmarks an implementation of hierarchical hypersparse matrices that reduces memory pressure and dramatically increases the update rate into a hypersparse matrix. Hierarchical hypersparse matrices are parameterized by the number of entries allowed in each level of the hierarchy before an update is cascaded, and these parameters are easily tunable to achieve optimal performance for a variety of applications. Hierarchical hypersparse matrices achieve over 1,000,000 updates per second in a single instance. Scaling to 31,000 instances of hierarchical hypersparse matrices on 1,100 server nodes of the MIT SuperCloud achieved a sustained update rate of 75,000,000,000 updates per second. This capability allows the MIT SuperCloud to analyze extremely large streaming network data sets.
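
The cascading idea can be sketched with two levels (illustrative Python only; the benchmarked implementation sits on the SuiteSparse GraphBLAS C-library):

```python
class HierSparse:
    # Two-level sketch of hierarchical hypersparse accumulation:
    # updates land in a small fast level; once it exceeds `cutoff`
    # entries, they cascade (are merged) into the larger level below,
    # keeping the hot level small and cache-friendly.
    def __init__(self, cutoff=4):
        self.cutoff = cutoff
        self.fast = {}
        self.slow = {}

    def update(self, key, val):
        self.fast[key] = self.fast.get(key, 0) + val
        if len(self.fast) > self.cutoff:
            for k, v in self.fast.items():   # cascade to the next level
                self.slow[k] = self.slow.get(k, 0) + v
            self.fast.clear()

    def get(self, key):
        # A read sums the contributions from every level.
        return self.fast.get(key, 0) + self.slow.get(key, 0)

h = HierSparse(cutoff=2)
for key in ["a", "b", "c", "a"]:
    h.update(key, 1)
```

The `cutoff` parameter plays the role of the per-level entry bound the abstract describes: small values keep the fast level tiny, large values amortize the cascade cost.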

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2020-01-20
Suhas Thejaswi; Aristides Gionis

In this paper we study a family of pattern-detection problems in vertex-colored temporal graphs. In particular, given a vertex-colored temporal graph and a multiset of colors as a query, we search for temporal paths in the graph that contain the colors specified in the query. These types of problems have several interesting applications, for example, recommending tours for tourists or searching for abnormal behavior in a network of financial transactions. For the family of pattern-detection problems we define, we establish complexity results and design an algebraic-algorithmic framework based on constrained multilinear sieving. We demonstrate that our solution can scale to massive graphs with up to a hundred million edges, despite the problems being NP-hard. Our implementation, which is publicly available, is highly optimized and exhibits practical edge-linear scalability. For example, in a real-world graph dataset with more than six million edges and a multiset query with ten colors, we can extract an optimal solution in less than eight minutes on a Haswell desktop with four cores.

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2019-06-21
Wei Wang; Meihui Zhang; Gang Chen; H. V. Jagadish; Beng Chin Ooi; Kian-Lee Tan

Deep learning has recently become very popular on account of its incredible success in many complex data-driven applications, such as image classification and speech recognition. The database community has worked on data-driven applications for many years, and therefore should be playing a lead role in supporting this new wave. However, databases and deep learning are different in terms of both techniques and applications. In this paper, we discuss research problems at the intersection of the two fields. In particular, we discuss possible improvements for deep learning systems from a database perspective, and analyze database applications that may benefit from deep learning techniques.

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2019-12-07
Dawei Huang; Dong Young Yoon; Seth Pettie; Barzan Mozafari

Despite decades of research on approximate query processing (AQP), our understanding of sample-based joins has remained limited and, to some extent, even superficial. The common belief in the community is that joining random samples is futile. This belief is largely based on an early result showing that the join of two uniform samples is not an independent sample of the original join, and that it leads to quadratically fewer output tuples. Unfortunately, this result has little applicability to the key questions practitioners face. For example, the success metric is often the final approximation's accuracy, rather than output cardinality. Moreover, there are many non-uniform sampling strategies that one can employ. Is sampling for joins still futile in all of these settings? If not, what is the best sampling strategy in each case? To the best of our knowledge, there is no formal study answering these questions. This paper aims to improve our understanding of sample-based joins and offer a guideline for practitioners building and using real-world AQP systems. We study limitations of offline samples in approximating join queries: given an offline sampling budget, how well can one approximate the join of two tables? We answer this question for two success metrics: output size and estimator variance. We show that maximizing output size is easy, while there is an information-theoretic lower bound on the lowest variance achievable by any sampling strategy. We then define a hybrid sampling scheme that captures all combinations of stratified, universe, and Bernoulli sampling, and show that this scheme with our optimal parameters achieves the theoretical lower bound within a constant factor. Since computing these optimal parameters requires shuffling statistics across the network, we also propose a decentralized variant where each node acts autonomously using minimal statistics.
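
One of the non-uniform strategies in the hybrid scheme, universe sampling, is easy to sketch: both tables keep a row iff a hash of its join key falls below the sampling rate, so matching keys survive or vanish together (an illustration of one ingredient, not the paper's optimal hybrid scheme):

```python
import hashlib

def universe_sample(rows, key, p):
    # Keep a row iff a deterministic hash of its join key maps below p.
    # Applying the SAME hash to both tables keeps matching keys
    # together, so the join of the samples equals the original join
    # restricted to a p-fraction of the key universe.
    def keep(k):
        h = int(hashlib.md5(str(k).encode()).hexdigest(), 16)
        return (h % 10**6) / 10**6 < p
    return [r for r in rows if keep(r[key])]

R = [{"k": i % 10, "a": i} for i in range(100)]
S = [{"k": i % 10, "b": i} for i in range(50)]
Rs = universe_sample(R, "k", 0.5)
Ss = universe_sample(S, "k", 0.5)
joined = [(r, s) for r in Rs for s in Ss if r["k"] == s["k"]]
```

Contrast with independent Bernoulli sampling, where a key surviving in one table says nothing about the other, which is exactly what causes the quadratic output-tuple loss the early result describes.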

Updated: 2020-01-22
• arXiv.cs.DB Pub Date : 2020-01-15
Tobias Emrich; Hans-Peter Kriegel; Andreas Züfle; Peer Kröger; Matthias Renz

Rectangles are used to approximate objects, or sets of objects, in a plethora of applications, systems, and index structures. Many tasks, such as nearest neighbor search and similarity ranking, require deciding if objects in one rectangle A may, must, or must not be closer to objects in a second rectangle B than objects in a third rectangle R are. It can be shown that minimum and maximum distances alone are often insufficient to decide this relation of "spatial domination". This spatial gem provides a necessary and sufficient decision criterion for spatial domination that can be computed efficiently even in higher-dimensional space. In addition, this spatial gem provides an example, pseudocode, and an implementation in Python.
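
Under the common reading that A dominates B with respect to R when every point of A is closer to every point of R than any point of B is, a corner-based criterion can be sketched as follows (a Python sketch of the idea; see the gem itself for the formal criterion and proof):

```python
def interval_min_dist(lo, hi, x):
    # 1-D minimum distance from point x to interval [lo, hi].
    return max(lo - x, x - hi, 0.0)

def interval_max_dist(lo, hi, x):
    # 1-D maximum distance from point x to interval [lo, hi].
    return max(abs(x - lo), abs(x - hi))

def dominates(A, B, R):
    # A, B, R: lists of per-dimension (lo, hi) intervals. Because
    # squared Euclidean distance decomposes per dimension, it suffices
    # to check, in each dimension, the corner of R that is worst for A:
    # A dominates B w.r.t. R iff the summed worst-case difference
    # MaxDist(A,r)^2 - MinDist(B,r)^2 stays negative.
    total = 0.0
    for (a_lo, a_hi), (b_lo, b_hi), (r_lo, r_hi) in zip(A, B, R):
        total += max(
            interval_max_dist(a_lo, a_hi, r) ** 2
            - interval_min_dist(b_lo, b_hi, r) ** 2
            for r in (r_lo, r_hi)
        )
    return total < 0.0
```

Note how comparing plain MinDist/MaxDist aggregates over whole rectangles would be strictly weaker than this per-dimension corner check.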

Updated: 2020-01-17
• arXiv.cs.DB Pub Date : 2020-01-16
Shuhao Zhang; Feng Zhang; Yingjun Wu; Bingsheng He; Paul Johns

Data stream processing systems (DSPSs) enable users to express and run stream applications that continuously process data streams. To achieve real-time data analytics, recent research has focused on optimizing system latency and throughput. Motivated by recent achievements in the computer architecture community, researchers and practitioners have investigated the potential of hardware-conscious stream processing, which better utilizes modern hardware capacity in DSPSs. In this paper, we conduct a systematic survey of recent work in the field along three directions: 1) computation optimization, 2) stream I/O optimization, and 3) query deployment. Finally, we advise on potential future research directions.

Updated: 2020-01-17
• arXiv.cs.DB Pub Date : 2020-01-16
Yvonne Mülle; Michael H. Böhlen

The ongoing time point now is used to state that a tuple is valid from its start point onward. For database systems ongoing time points have far-reaching implications since they change continuously as time passes by. State-of-the-art approaches deal with ongoing time points by instantiating them to the reference time. The instantiation yields query results that are only valid at the chosen time and get invalidated as time passes by. We propose a solution that keeps ongoing time points uninstantiated during query processing. We do so by evaluating predicates and functions at all possible reference times. This renders query results independent of a specific reference time and yields results that remain valid as time passes by. As query results, we propose ongoing relations that include a reference time attribute. The value of the reference time attribute is restricted by predicates and functions on ongoing attributes. We describe and evaluate an efficient implementation of ongoing data types and operations in PostgreSQL.

Updated: 2020-01-17
• arXiv.cs.DB Pub Date : 2020-01-16
Mahdi Bohlouli; Jens Dalter; Mareike Dornhöfer; Johannes Zenkert; Madjid Fathi

In today's competitive business world, being aware of customer needs and market-oriented production is a key success factor for industries. To this aim, the use of efficient analytic algorithms ensures a better understanding of customer feedback and improves the next generation of products. The dramatic increase in the use of social media in daily life provides beneficial sources for market analytics, but how traditional analytic algorithms and methods can scale up to such disparate and multi-structured data sources is the main challenge. This paper presents and discusses the technological and scientific focus of SoMABiT, a social media analysis platform built on big data technology. Sentiment analysis is employed in order to discover knowledge from social media. The use of MapReduce and the development of a distributed algorithm towards an integrated platform that can scale to any data volume and provide social-media-driven knowledge is the main novelty of the proposed concept in comparison to state-of-the-art technologies.

Updated: 2020-01-17
• arXiv.cs.DB Pub Date : 2020-01-16
Islam Samy; Mohamed A. Attia; Ravi Tandon; Loukas Lazos

In many applications, content accessed by users (movies, videos, news articles, etc.) can leak sensitive latent attributes, such as religious and political views, sexual orientation, ethnicity, gender, and others. To prevent such information leakage, the goal of classical private information retrieval (PIR) is to hide the identity of the content/message being accessed, which subsequently also hides the latent attributes. This solution, while private, can be too costly, particularly when perfect (information-theoretic) privacy constraints are imposed. For instance, for a single database holding $K$ messages, privately retrieving one message is possible if and only if the user downloads the entire database of $K$ messages. Retrieving content privately, however, may not be necessary to perfectly hide the latent attributes. Motivated by the above, we formulate and study the problem of latent-variable private information retrieval (LV-PIR), which aims at allowing the user to efficiently retrieve one out of $K$ messages (indexed by $\theta$) without revealing any information about the latent variable (modeled by $S$). We focus on the practically relevant setting of a single database and show that one can significantly reduce the download cost of LV-PIR (compared to classical PIR) based on the correlation between $\theta$ and $S$. We present a general scheme for LV-PIR as a function of the statistical relationship between $\theta$ and $S$, and also provide new results on the capacity/download cost of LV-PIR. Several open problems and new directions are also discussed.

Updated: 2020-01-17
• arXiv.cs.DB Pub Date : 2019-04-08
Shuhao Zhang; Yingjun Wu; Feng Zhang; Bingsheng He

Recent data stream processing systems (DSPSs) can achieve excellent performance when processing large volumes of data under tight latency constraints. However, they sacrifice support for concurrent state access, which eases the burden of developing stateful stream applications. Recently, some have proposed managing concurrent state access during stream processing by modeling state accesses as transactions. However, these proposals are realized with locks, which incur serious contention overhead, and their coarse-grained processing paradigm further magnifies contention issues and tends to poorly utilize modern multicore architectures. This paper introduces TStream, a novel DSPS supporting efficient concurrent state access on multicore processors. Transactional semantics are employed as in previous work, but scalability is greatly improved due to two novel designs: 1) dual-mode scheduling, which exposes more parallelism opportunities, and 2) dynamic restructuring execution, which aggressively exploits the parallelism opportunities from dual-mode scheduling without centralized lock contention. To validate our proposal, we evaluate TStream with a benchmark of four applications on a modern multicore machine. The experimental results show that 1) TStream achieves up to 4.8 times higher throughput with similar processing latency compared to the state-of-the-art and 2) unlike prior solutions, TStream is highly tolerant of varying application workloads such as key skewness and multi-partition state accesses.

Updated: 2020-01-17
• arXiv.cs.DB Pub Date : 2020-01-15
Jiaqi Dong; Runyu Zhang; Chaoshu Yang; Yujuan Tan; Duo Liu

Frequent-pattern mining is a common approach to reveal the valuable hidden trends behind data. However, existing frequent-pattern mining algorithms are designed for DRAM rather than persistent memories (PMs), which can lead to severe performance and energy overheads when they run on PMs, due to the utterly different characteristics of DRAM and PMs. In this paper, we propose an efficient and Wear-leveling-aware Frequent-Pattern Mining scheme, WFPM, to solve this problem. The proposed WFPM is evaluated by a series of experiments based on realistic datasets from diversified application scenarios, where WFPM achieves a 32.0% performance improvement and prolongs the NVM lifetime of the header table by 7.4x over the EvFP-Tree.

Updated: 2020-01-16
• arXiv.cs.DB Pub Date : 2017-10-12
Lior Kogan

The property graph is an increasingly popular data model. An important task when dealing with property graphs is pattern matching. Given a property graph schema S, a property graph G and a query pattern P, all expressed in language L, pattern matching is the process of finding, merging and annotating subgraphs of G that match P. Expressive pattern languages support topological constraints and property values constraints, as well as negation, quantification, grouping, aggregation, and path semantics. Calculated properties may be defined for vertices, edges, and subgraphs, and constraints may be imposed on their evaluation result. Query-posers would like to pose complex queries in a manner that is coherent with the way they think. They want to do it with minimal technical training, minimal effort, and minimal trial and error. The ability to express patterns in a way that is aligned with their mental processes is crucial to the flow of their work, and to the quality of the insights they can draw. Since the capabilities of the human visual system with respect to pattern perception are remarkable, it is natural to express query patterns visually. Visual query languages have the potential to be much more 'user-friendly' than their textual counterparts in the sense that patterns may be constructed and understood much more quickly and with much less mental effort. A long-standing challenge is to design a visual query language that is generic, has rich expressive power, and is highly receptive and productive. V1 is a declarative visual pattern query language for schema-based property graphs. V1 supports property graphs with mixed (both directed and undirected) edges and half-edges, with multivalued and composite properties, and with null property values. V1 is generic, concise, has rich expressive power, and is highly receptive and productive.

Updated: 2020-01-16
• arXiv.cs.DB Pub Date : 2020-01-14
Alejandro Cortiñas; Miguel R. Luaces; Tirso V. Rodeiro

Lately, many companies are using Mobile Workforce Management technologies combined with information collected by sensors from mobile devices in order to improve their business processes. Even for small companies, the information that needs to be handled grows at a high rate, and most of the data collected has a geographic dimension. Being able to visualize this data in real time within a map viewer is very important for these companies. In this paper we focus on this topic, presenting a case study on visualizing large spatial datasets. In particular, since most Mobile Workforce Management software is web-based, we propose a solution suitable for this environment.

Updated: 2020-01-15
• arXiv.cs.DB Pub Date : 2020-01-14
Henrik Forssell; Evgeny Kharlamov; Evgenij Thorstensen

Data exchange heavily relies on the notion of incomplete database instances. Several semantics for such instances have been proposed, including open (OWA), closed (CWA), and open-closed (OCWA) world semantics. For all these semantics, important questions are: whether one incomplete instance semantically implies another; when two are semantically equivalent; and whether a smaller or smallest semantically equivalent instance exists. For OWA and CWA these questions are fully answered. For several variants of OCWA, however, they remain open. In this work we address these questions for Closed Powerset semantics and the OCWA semantics of Libkin and Sirangelo (2011). We define a new OCWA semantics, called OCWA*, in terms of homomorphic covers that subsumes both semantics, and characterize semantic implication and equivalence in terms of such covers. This characterization yields a guess-and-check algorithm to decide equivalence, and shows that the problem is NP-complete. For the minimization problem we show that for several common notions of minimality there is in general no unique minimal equivalent instance for Closed Powerset semantics, and consequently not for the more expressive OCWA* either. However, for Closed Powerset semantics we show that one can find, for any incomplete database, a unique finite set of its subinstances which are subinstances (up to renaming of nulls) of all instances semantically equivalent to the original incomplete one. We study properties of this set, and extend the analysis to OCWA*.
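
Homomorphisms between instances with labeled nulls are the basic tool behind such characterizations; a brute-force check (instances as sets of tuples over string constants and integer nulls; an illustration of the notion, not the paper's algorithm) looks like this:

```python
from itertools import product

def homomorphism_exists(src, dst):
    # A homomorphism maps each labeled null (int) of src to a term of
    # dst while fixing constants (strings), such that the image of
    # every src tuple is a tuple of dst. Brute force: try all
    # assignments of nulls to dst terms.
    nulls = sorted({t for row in src for t in row if isinstance(t, int)})
    terms = sorted({t for row in dst for t in row}, key=str)
    for assign in product(terms, repeat=len(nulls)):
        m = dict(zip(nulls, assign))
        image = {tuple(m.get(t, t) for t in row) for row in src}
        if image <= dst:
            return True
    return False
```

The exponential search over assignments is exactly the "guess" in a guess-and-check procedure, matching the NP upper bound flavor of the equivalence problem.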

Updated: 2020-01-15
• arXiv.cs.DB Pub Date : 2020-01-14
Stefan Böttcher; Rita Hartel; Sven Peeters

Like [1], we present an algorithm to compute the simulation of a query pattern in a graph of labeled nodes and unlabeled edges. However, our algorithm works on a compressed graph grammar, instead of on the original graph. The speed-up of our algorithm compared to the algorithm in [1] grows with the size of the graph and with the compression strength.

Updated: 2020-01-15
• arXiv.cs.DB Pub Date : 2020-01-14
Chaitanya K. Ryali; John J. Hopfield; Leopold Grinberg; Dmitry Krotov

The fruit fly Drosophila's olfactory circuit has inspired a new locality sensitive hashing (LSH) algorithm, FlyHash. In contrast with classical LSH algorithms that produce low dimensional hash codes, FlyHash produces sparse high-dimensional hash codes and has also been shown to have superior empirical performance compared to classical LSH algorithms in similarity search. However, FlyHash uses random projections and cannot learn from data. Building on inspiration from FlyHash and the ubiquity of sparse expansive representations in neurobiology, our work proposes a novel hashing algorithm BioHash that produces sparse high dimensional hash codes in a data-driven manner. We show that BioHash outperforms previously published benchmarks for various hashing methods. Since our learning algorithm is based on a local and biologically plausible synaptic plasticity rule, our work provides evidence for the proposal that LSH might be a computational reason for the abundance of sparse expansive motifs in a variety of biological systems. We also propose a convolutional variant BioConvHash that further improves performance. From the perspective of computer science, BioHash and BioConvHash are fast, scalable and yield compressed binary representations that are useful for similarity search.
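
The FlyHash construction is simple to sketch: a sparse random projection followed by a winner-take-all step that keeps the k most active units (the parameters below are arbitrary illustrative choices; BioHash replaces the random projection with synapses learned by a local plasticity rule):

```python
import random

def flyhash(x, num_neurons=64, k=4, fan_in=3, seed=0):
    # Sparse random projection: each of num_neurons "Kenyon cells"
    # sums fan_in randomly chosen input dimensions.
    rng = random.Random(seed)
    conns = [rng.sample(range(len(x)), fan_in) for _ in range(num_neurons)]
    acts = [sum(x[i] for i in idx) for idx in conns]
    # Winner-take-all: only the k most active cells fire, yielding a
    # sparse high-dimensional binary code.
    top = sorted(range(num_neurons), key=lambda j: -acts[j])[:k]
    code = [0] * num_neurons
    for j in top:
        code[j] = 1
    return code

h = flyhash([0.1, 0.9, 0.3, 0.7, 0.5, 0.2])
```

Note the contrast with classical LSH: the code is higher-dimensional than the input but contains exactly k ones, so it stores compactly and compares quickly.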

Updated: 2020-01-15
• arXiv.cs.DB Pub Date : 2020-01-12
Sen Zheng; Renate A. Schmidt

We consider the following query answering problem: Given a Boolean conjunctive query and a theory in the Horn loosely guarded fragment, the aim is to determine whether the query is entailed by the theory. In this paper, we present a resolution decision procedure for the loosely guarded fragment, and use such a procedure to answer Boolean conjunctive queries against the Horn loosely guarded fragment. The Horn loosely guarded fragment subsumes classes of rules that are prevalent in ontology-based query answering, such as Horn ALCHOI and guarded existential rules. Additionally, we identify star queries and cloud queries, which using our procedure, can be answered against the loosely guarded fragment.

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2020-01-13
Manuel Rigger; Zhendong Su

Relational databases are used ubiquitously. They are managed by database management systems (DBMS), which allow inserting, modifying, and querying data using a domain-specific language called Structured Query Language (SQL). Popular DBMS have been extensively tested by fuzzers, which have been successful in finding crash bugs. However, approaches to finding logic bugs, such as when a DBMS computes an incorrect result set, have remained mostly untackled. Differential testing is an effective technique to test systems that support a common language by comparing the outputs of these systems. However, this technique is ineffective for DBMS, because each DBMS typically supports its own SQL dialect. To this end, we devised a novel and general approach that we have termed Pivoted Query Synthesis. The core idea of this approach is to automatically generate queries that are guaranteed to fetch a specific, randomly selected row, called the pivot row. If the DBMS fails to fetch the pivot row, the likely cause is a bug in the DBMS. We tested our approach on three widely-used and mature DBMS, namely SQLite, MySQL, and PostgreSQL. In total, we reported 123 bugs in these DBMS, 99 of which have been fixed or verified, demonstrating that the approach is highly effective and general. We expect that the wide applicability and simplicity of our approach will enable the improvement of the robustness of many DBMS.
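
Pivoted Query Synthesis in miniature (a toy sketch of the core idea using SQLite's in-memory mode, not the authors' tool): pick a random pivot row, synthesize a predicate that is true for it by construction, and check that the DBMS returns it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(a INTEGER, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y"), (3, "z")])

# Pick a random pivot row.
pivot = conn.execute("SELECT a, b FROM t ORDER BY RANDOM() LIMIT 1").fetchone()

# Synthesize a predicate that evaluates to TRUE on the pivot row by
# construction (a real tool would compose random expressions and then
# rectify them against the pivot's values).
predicate = f"a >= {pivot[0]} AND a <= {pivot[0]} AND b = '{pivot[1]}'"
rows = conn.execute(f"SELECT a, b FROM t WHERE {predicate}").fetchall()
assert pivot in rows, "pivot row not fetched: possible logic bug"
```

Because the oracle is the single pivot row rather than a full result set, no second DBMS or common dialect is needed, which is what sidesteps the differential-testing limitation.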

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2020-01-13
Martin Thomas Horsch; Silvia Chiacchiera; Youness Bami; Georg J. Schmitz; Gabriele Mogni; Gerhard Goldbeck; Emanuele Ghedini

The European Materials and Modelling Ontology (EMMO) is a top-level ontology designed by the European Materials Modelling Council to facilitate semantic interoperability between platforms, models, and tools in computational molecular engineering, integrated computational materials engineering, and related applications of materials modelling and characterization. Additionally, domain ontologies exist that are based on data technology developments from specific platforms. The present work discusses the ongoing effort to establish a European Virtual Marketplace Framework, into which diverse platforms can be integrated. It addresses common challenges that arise when marketplace-level domain ontologies are combined with a top-level ontology like the EMMO by ontology alignment.

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2020-01-13
Hiba Arnaout; Simon Razniewski; Gerhard Weikum

Knowledge bases (KBs), pragmatic collections of knowledge about notable entities, are an important asset in applications such as search, question answering and dialogue. Rooted in a long tradition in knowledge representation, all popular KBs only store positive information, while they abstain from taking any stance towards statements not contained in them. In this paper, we make the case for explicitly stating interesting statements which are not true. Negative statements would be important to overcome current limitations of question answering, yet due to their potential abundance, any effort towards compiling them needs a tight coupling with ranking. We introduce two approaches towards compiling negative statements. (i) In peer-based statistical inferences, we compare entities with highly related entities in order to derive potential negative statements, which we then rank using supervised and unsupervised features. (ii) In query-log-based text extraction, we use a pattern-based approach for harvesting search engine query logs. Experimental results show that both approaches hold promising and complementary potential. Along with this paper, we publish the first datasets on interesting negative information, containing over 1.1M statements for 100K popular Wikidata entities.
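The peer-based inference step can be illustrated with a minimal sketch. The entity names, the set-of-pairs fact representation, and the plain frequency ranking are assumptions standing in for the paper's supervised and unsupervised ranking features:

```python
def negative_candidates(entity, peers, facts):
    """Statements that peers assert but the entity lacks, ranked by
    the fraction of peers asserting them (a stand-in for the paper's
    learned ranking features)."""
    counts = {}
    for p in peers:
        for stmt in facts.get(p, set()):
            counts[stmt] = counts.get(stmt, 0) + 1
    own = facts.get(entity, set())
    ranked = [(stmt, n / len(peers)) for stmt, n in counts.items() if stmt not in own]
    return sorted(ranked, key=lambda t: -t[1])

# Hypothetical facts about three highly related entities.
facts = {
    "physicist_a": {("won", "Nobel Prize"), ("field", "physics")},
    "physicist_b": {("won", "Nobel Prize"), ("field", "physics")},
    "physicist_c": {("field", "physics")},  # entity of interest
}
cands = negative_candidates("physicist_c", ["physicist_a", "physicist_b"], facts)
# Top candidate negative statement: physicist_c has not won the Nobel Prize.
```

The ranking matters because almost everything is false of any given entity; only candidates supported by many peers are interesting enough to state explicitly.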

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2020-01-13
Abdulwahab Alazeb; Brajendra Panda

The growing use of technology in nearly all aspects of life has produced an enormous amount of data essential to a smart city. Therefore, maximizing the benefits of technologies such as cloud computing, fog computing, and the Internet of Things is important for managing and manipulating data in smart cities. However, certain types of data are sensitive and risky and may be infiltrated by malicious attacks. As a result, such data may be corrupted, thereby causing concern. The damage inflicted by an attacker on a set of data can spread through an entire database: valid transactions that have read corrupted data may update other data items based on the values read. In this study, we introduce a unique model that uses fog computing in smart cities to manage utility service companies and consumer data. We also propose a novel technique to assess damage to data caused by an attack, so that the original data can be recovered and the database returned to its consistent state, as if no attack had occurred.

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2019-09-24
Rasmus Pagh; Johan Sivertsen

Motivated by the problem of filtering candidate pairs in inner product similarity joins we study the following inner product estimation problem: Given parameters $d\in {\bf N}$, $\alpha>\beta\geq 0$ and unit vectors $x,y\in {\bf R}^{d}$ consider the task of distinguishing between the cases $\langle x, y\rangle\leq\beta$ and $\langle x, y\rangle\geq \alpha$ where $\langle x, y\rangle = \sum_{i=1}^d x_i y_i$ is the inner product of vectors $x$ and $y$. The goal is to distinguish these cases based on information on each vector encoded independently in a bit string of the shortest length possible. In contrast to much work on compressing vectors using randomized dimensionality reduction, we seek to solve the problem deterministically, with no probability of error. Inner product estimation can be solved in general via estimating $\langle x, y\rangle$ with an additive error bounded by $\varepsilon = \alpha - \beta$. We show that $d \log_2 \left(\tfrac{\sqrt{1-\beta}}{\varepsilon}\right) \pm \Theta(d)$ bits of information about each vector is necessary and sufficient. Our upper bound is constructive and improves a known upper bound of $d \log_2(1/\varepsilon) + O(d)$ by up to a factor of 2 when $\beta$ is close to $1$. The lower bound holds even in a stronger model where one of the vectors is known exactly, and an arbitrary estimation function is allowed.
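The general route mentioned above (estimate $\langle x, y\rangle$ with additive error below $\varepsilon = \alpha - \beta$) can be sketched with naive deterministic coordinate-wise quantization; the grid width `delta` chosen below is a simple assumption, not the paper's tighter construction:

```python
import math
import random

def quantize(v, delta):
    # Deterministically round each coordinate to the nearest multiple of delta.
    return [round(x / delta) * delta for x in v]

d, alpha, beta = 64, 0.6, 0.4
eps = alpha - beta
delta = eps / (2 * math.sqrt(d))  # per-coordinate error <= delta/2,
                                  # so total error is roughly delta*sqrt(d) = eps/2

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(d)]
y = [rng.gauss(0, 1) for _ in range(d)]
nx = math.sqrt(sum(v * v for v in x))
ny = math.sqrt(sum(v * v for v in y))
x = [v / nx for v in x]  # normalize to unit vectors
y = [v / ny for v in y]

exact = sum(a * b for a, b in zip(x, y))
approx = sum(a * b for a, b in zip(quantize(x, delta), quantize(y, delta)))
assert abs(exact - approx) < eps  # enough to separate the two cases
```

This naive scheme spends roughly $d \log_2(1/\varepsilon) + O(d)$ bits per vector; the point of the paper is that the constant can be improved, by up to a factor of 2 when $\beta$ is close to 1.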

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2019-10-04
Antoine Amarilli; İsmail İlkan Ceylan

We study the problem of probabilistic query evaluation (PQE) over probabilistic graphs, namely, tuple-independent probabilistic databases (TIDs) on signatures of arity two. Our focus is the class of queries that is closed under homomorphisms, or equivalently, the infinite unions of conjunctive queries, denoted UCQ^\infty. Our main result states that all unbounded queries in UCQ^\infty are #P-hard for PQE. As bounded queries in UCQ^\infty are already classified by the dichotomy of Dalvi and Suciu [17], our results and theirs imply a complete dichotomy on PQE for UCQ^\infty queries over probabilistic graphs. This dichotomy covers in particular all fragments in UCQ^\infty such as negation-free (disjunctive) Datalog, regular path queries, and a large class of ontology-mediated queries on arity-two signatures. Our result is shown by reducing from counting the valuations of positive partitioned 2-DNF formulae (#PP2DNF) for some queries, or from the source-to-target reliability problem in an undirected graph (#U-ST-CON) for other queries, depending on properties of minimal models.

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2019-10-16
Guangyan Hu; Yongfeng Zhang; Sandro Rigo; Thu D. Nguyen

Text analytics has become an important part of business intelligence as enterprises increasingly seek to extract insights for decision making from text data sets. Processing large text data sets can be computationally expensive, however, especially if it involves sophisticated algorithms. This challenge is exacerbated when it is desirable to run different types of queries against a data set, making it expensive to build multiple indices to speed up query processing. In this paper, we propose and evaluate a framework called EmApprox that uses approximation to speed up the processing of a wide range of queries over large text data sets. The key insight is that different types of queries can be approximated by processing subsets of data that are most similar to the queries. EmApprox builds a general index for a data set by learning a natural language processing model, producing a set of highly compressed vectors representing words and subcollections of documents. Then, at query processing time, EmApprox uses the index to guide sampling of the data set, with the probability of selecting each subcollection of documents being proportional to its {\em similarity} to the query as computed using the vector representations. We have implemented a prototype of EmApprox as an extension of the Apache Spark system, and used it to approximate three types of queries: aggregation, information retrieval, and recommendation. Experimental results show that EmApprox's similarity-guided sampling achieves much better accuracy than random sampling. Further, EmApprox can achieve significant speedups if users can tolerate small amounts of inaccuracies. For example, when sampling at 10\%, EmApprox speeds up a set of queries counting phrase occurrences by almost 10x while achieving estimated relative errors of less than 22\% for 90\% of the queries.
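The similarity-guided sampling step can be sketched as follows. The toy two-dimensional vectors and the shift to non-negative weights are assumptions; the real EmApprox index uses learned, highly compressed embeddings of words and document subcollections:

```python
import math
import random

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def sample_subcollections(query_vec, subcollections, k, seed=0):
    """Pick k subcollections with probability proportional to their
    (clamped non-negative) cosine similarity to the query vector."""
    weights = [max(cosine(query_vec, vec), 0.0) + 1e-9 for _, vec in subcollections]
    rng = random.Random(seed)
    return rng.choices([name for name, _ in subcollections], weights=weights, k=k)

# Hypothetical subcollection embeddings.
subs = [("sports", [0.9, 0.1]), ("finance", [0.1, 0.9]), ("mixed", [0.5, 0.5])]
picks = sample_subcollections([1.0, 0.0], subs, k=100)
# "sports" should dominate the sample for a sports-like query vector.
```

Because sampling is biased toward relevant subcollections, an answer estimated from the sample converges to the true answer much faster than under uniform random sampling.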

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2019-10-24
Maximiliaan Huisman; Mathias Hammer; Alex Rigano; Farzin Farzam; Renu Gopinathan; Carlas Smith; David Grunwald; Caterina Strambio-De-Castillia

The application of microscopy in biomedical research has come a long way since Antonie van Leeuwenhoek's discovery of unicellular organisms through his hand-crafted microscope. Countless innovations have propelled imaging techniques and have positioned fluorescence microscopy as a cornerstone of modern biology and as a method of choice for connecting omics datasets to their biological and clinical correlates. Still, regardless of how convincing imaging results look, they do not always convey meaningful information about the conditions in which they were acquired, processed and analyzed to achieve the presented results. Adequate record-keeping and quality control are therefore essential to ensure experimental rigor and data fidelity, to allow experiments to be reproducibly repeated and to promote the proper evaluation, interpretation, comparison and re-use of the results. Microscopy images must be accompanied by complete descriptions of experimental procedures, biological samples, microscope hardware specifications, image acquisition parameters, and metrics detailing instrument performance and calibration. Despite considerable effort (Goldberg et al., 2005; Linkert et al., 2010), however, universal data standards and reporting guidelines for the FAIR (Wilkinson et al., 2016) assessment and comparison of microscopy data have not been established. To understand this discrepancy and propose a way forward, we examine the benefits, pitfalls, and limitations of shared standards for fluorescence microscopy.

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2019-12-12
Dominik D. Freydenberger; Liat Peterfreund

We propose FC, a logic on words that combines the previous approaches of finite-model theory and the theory of concatenation, and that has immediate applications in information extraction and database theory in the form of document spanners. Like the theory of concatenation, FC is built around word equations; in contrast to it, its semantics are defined to only allow finite models, by limiting the universe to a word and all its subwords. As a consequence of this, FC has many of the desirable properties of FO[<], while being far more expressive. Most noteworthy among these desirable properties are sufficient criteria for efficient model checking and capturing various complexity classes by extending the logic with appropriate closure or iteration operators. These results allow us to obtain new insights into and techniques for the expressive power and efficient evaluation of document spanners. In fact, FC provides us with a general framework for reasoning about words that has potential applications far beyond document spanners.

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2019-12-17
Laurel Orr; Samuel Ainsworth; Walter Cai; Kevin Jamieson; Magda Balazinska; Dan Suciu

Data scientists have relied on samples to analyze populations of interest for decades. Recently, with the increase in the number of public data repositories, sample data has become easier to access. It has not, however, become easier to analyze. This sample data is arbitrarily biased with an unknown sampling probability, meaning data scientists must manually debias the sample with custom techniques to avoid inaccurate results. In this vision paper, we propose Mosaic, a database system that treats samples as first-class citizens and allows users to ask questions over populations represented by these samples. Answering queries over biased samples is non-trivial as there is no existing, standard technique to answer population queries when the sampling probability is unknown. In this paper, we show how our envisioned system solves this problem by having a unique sample-based data model with extensions to the SQL language. We propose how to perform population query answering using biased samples and give preliminary results for one of our novel query answering techniques.
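For context, the classical Horvitz-Thompson estimator shows how debiasing works when inclusion probabilities are known; Mosaic's setting is precisely the harder one where they are not. This background sketch is not Mosaic itself:

```python
def horvitz_thompson_total(sample):
    """Unbiased population total from a sample of (value, inclusion_probability)
    pairs: each sampled unit stands in for 1/p units of the population."""
    return sum(v / p for v, p in sample)

# Population of 100 units of value 1 each (true total = 100), of which
# 10 units were sampled, each with inclusion probability 0.1.
sample = [(1, 0.1)] * 10
estimate = horvitz_thompson_total(sample)
assert estimate == 100.0
```

When the inclusion probabilities are unknown and the bias is arbitrary, as in the samples Mosaic targets, this inversion is unavailable, which is why the system needs its own sample-based data model and query answering techniques.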

Updated: 2020-01-14
• arXiv.cs.DB Pub Date : 2020-01-10
Sultan Almakdi; Brajendra Panda

Database users have started moving toward the use of cloud computing as a service because it provides computation and storage needs at affordable prices. However, for most users, privacy is a major concern, as they cannot control data access once their data are outsourced, especially if the cloud provider is curious about their data. Data encryption is an effective way to address privacy concerns, but executing queries over encrypted data is a problem that needs attention. In this research, we introduce a bit-based model to execute different relational algebra operators over encrypted databases in the cloud without decrypting the data. To encrypt data, we use a randomized encryption algorithm (Advanced Encryption Standard in CBC mode) to provide the maximum security level. The idea is based on classifying attributes as sensitive and non-sensitive, where only sensitive attributes are encrypted. For each sensitive attribute, the table owner predefines the possible partition domains, on which the tuples are encoded into bit vectors before encryption. We store the bit vectors in one or more additional columns in the encrypted table in the cloud. We use those bits to retrieve only the subset of encrypted records that are candidates for a specific query. We implemented and evaluated our model and found that it is practical and succeeds in reducing the retrieved encrypted records to less than 30 percent of the whole set of encrypted records in a table.
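The bit-vector filtering idea can be sketched as follows. The partition boundaries, the one-hot encoding, and the range-query overlap test are illustrative simplifications of the paper's model (which operates on encrypted tables; encryption is omitted here):

```python
def encode(value, partitions):
    """One bit per predefined partition domain; set the bit whose
    half-open range [lo, hi) contains the plaintext value."""
    return [1 if lo <= value < hi else 0 for lo, hi in partitions]

def candidates(rows, partitions, query_lo, query_hi):
    """Return row ids whose bit vector overlaps any partition that
    intersects the query range; only these records need to be
    retrieved and decrypted."""
    query_bits = [0 if (hi <= query_lo or lo >= query_hi) else 1
                  for lo, hi in partitions]
    return [rid for rid, bits in rows
            if any(q and b for q, b in zip(query_bits, bits))]

# Sensitive attribute "salary" with owner-predefined partition domains.
parts = [(0, 50), (50, 100), (100, 150)]
rows = [(rid, encode(sal, parts)) for rid, sal in [(1, 10), (2, 60), (3, 120)]]
hits = candidates(rows, parts, 55, 70)  # only partition [50, 100) overlaps
```

Note the filter is conservative: every true match is a candidate, but a candidate may still fail the exact predicate after decryption, which is why the evaluation measures how small the candidate set is.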

Updated: 2020-01-13
• arXiv.cs.DB Pub Date : 2020-01-10
Jang You Park; YongHee Jung; Wei Ding; Kwang Woo Nam

In this paper, we propose the design and implementation of a new geotagged media management system. A large amount of geotagged media data is generated daily by users' smartphones, mobile devices, dash cams, and cameras. Geotagged media, such as geovideos and geophotos, can be captured with spatio-temporal information such as time, location, visible area, camera direction, moving direction, and visible distance. Owing to this growth in geotagged multimedia data, research on efficiently managing and mining such data is expected to become a new area in databases and data mining. This paper proposes a geotagged media management system called Open GeoCMS (Geotagged media Contents Management System), a new framework for managing geotagged media data on the web. Our framework supports various types: moving point, moving photo (a sequence of photos taken by a drone), moving double, and moving video. GeoCMS also provides a label viewer and editor for photos and videos. Open GeoCMS has been developed as an open-source system.

Updated: 2020-01-13
• arXiv.cs.DB Pub Date : 2020-01-10
Xixi Lu; Seyed Amin Tabatabaei; Mark Hoogendoorn; Hajo A. Reijers

Trace clustering has increasingly been applied to find homogeneous process executions. However, current techniques have difficulties in finding a meaningful and insightful clustering of patients on the basis of healthcare data. The resulting clusters are often not in line with those of medical experts, nor are the clusters guaranteed to yield meaningful process maps of patients' clinical pathways. After all, a single hospital may conduct thousands of distinct activities and generate millions of events per year. In this paper, we propose a novel trace clustering approach that uses sample sets of patients provided by medical experts. More specifically, we learn frequent sequence patterns on a sample set, rank each patient based on the patterns, and use an automated approach to determine the corresponding cluster. We find each cluster separately, while the frequent sequence patterns are used to discover a process map. The approach is implemented in ProM and evaluated using a large data set obtained from a university medical center. The evaluation shows F1-scores of 0.7 for grouping kidney injury, 0.9 for diabetes, and 0.64 for head/neck tumor, while the process maps show meaningful behavioral patterns of the clinical pathways of these groups, according to the domain experts.

Updated: 2020-01-13
• arXiv.cs.DB Pub Date : 2020-01-10
Amir Shaikhha; Maximilian Schleich; Alexandru Ghita; Dan Olteanu

We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over the input database and then use a statistical software package of choice to train the model. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, data layout optimizations, and finally compilation into efficient low-level C++ code specialized for the given workload and data. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.

Updated: 2020-01-13
• arXiv.cs.DB Pub Date : 2020-01-09
Ida Mele; Nicola Tonellotto; Ophir Frieder; Raffaele Perego

Caching search results is employed in information retrieval systems to expedite query processing and reduce back-end server workload. Motivated by the observation that queries belonging to different topics have different temporal-locality patterns, we investigate a novel caching model called STD (Static-Topic-Dynamic cache). It improves on the traditional SDC (Static-Dynamic Cache), which stores the results of popular queries in a static cache and manages the dynamic cache with a replacement policy to intercept temporal variations in the query stream. Our proposed caching scheme includes another layer for topic-based caching, where the entries are allocated to different topics (e.g., weather, education). The results of queries characterized by a topic are kept in the fraction of the cache dedicated to it. This makes it possible to adapt cache-space utilization to the temporal locality of the various topics, and reduces cache misses due to queries that are neither sufficiently popular to be in the static portion nor requested within short time intervals to be in the dynamic portion. We simulate different configurations of STD using two real-world query streams. Experiments demonstrate that our approach outperforms SDC, increasing hit rates by up to 3% and closing up to 36% of the gap between SDC and the theoretically optimal caching algorithm.
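A minimal sketch of the three-layer lookup; the class name, the LRU policy for the topic and dynamic layers, and the admit-on-miss behavior are assumptions rather than the paper's exact configuration:

```python
from collections import OrderedDict

class STDCache:
    """Static layer (frozen popular queries), one LRU per topic,
    and a catch-all dynamic LRU."""

    def __init__(self, static_queries, topics, topic_cap, dynamic_cap):
        self.static = set(static_queries)                # frozen layer
        self.topic = {t: OrderedDict() for t in topics}  # per-topic LRUs
        self.dynamic = OrderedDict()                     # global LRU
        self.topic_cap, self.dynamic_cap = topic_cap, dynamic_cap

    def get(self, query, topic=None):
        """Return True on a cache hit; on a miss, admit the query."""
        if query in self.static:
            return True
        if topic in self.topic:
            layer, cap = self.topic[topic], self.topic_cap
        else:
            layer, cap = self.dynamic, self.dynamic_cap
        if query in layer:
            layer.move_to_end(query)      # refresh recency
            return True
        layer[query] = None               # admit on miss
        if len(layer) > cap:
            layer.popitem(last=False)     # evict least recently used
        return False

cache = STDCache(["top query"], ["weather"], topic_cap=2, dynamic_cap=2)
assert cache.get("top query")                        # static hit
assert not cache.get("rain today", topic="weather")  # first time: miss
assert cache.get("rain today", topic="weather")      # now a topic-layer hit
```

The benefit over SDC comes from the middle layer: a topic with slow-burning popularity keeps its entries even while bursty queries churn the global dynamic layer.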

Updated: 2020-01-10
• arXiv.cs.DB Pub Date : 2020-01-09
Ioannis Kouvatis; Konstantinos Semertzidis; Maria Zerva; Evaggelia Pitoura; Panayiotis Tsaparas

The problem of team formation in a social network asks for a set of individuals who not only have the required skills to perform a task but who can also communicate effectively with each other. Existing work assumes that all links in a social network are positive, that is, they indicate friendship or collaboration between individuals. However, it is often the case that the network is signed, that is, it contains both positive and negative links, corresponding to friend and foe relationships. Building on the concept of structural balance, we provide definitions of compatibility between pairs of users in a signed network, and algorithms for computing it. We then define the team formation problem in signed networks, where we ask for a compatible team of individuals that can perform a task with small communication cost. We show that the problem is NP-hard even when there are no communication cost constraints, and we provide heuristic algorithms for solving it. We present experimental results with real data to investigate the properties of the different compatibility definitions, and the effectiveness of our algorithms.

Updated: 2020-01-10
• arXiv.cs.DB Pub Date : 2020-01-09
Ariel Shtul; Carlos Baquero; Paulo Sérgio Almeida

Bloom filters (BF) are widely used for approximate membership queries over a set of elements. BF variants allow removals, sets of unbounded size or querying a sliding window over an unbounded stream. However, for this last case the best current approaches are dictionary based (e.g., based on Cuckoo Filters or TinyTable), and it may seem that BF-based approaches will never be competitive to dictionary-based ones. In this paper we present Age-Partitioned Bloom Filters, a BF-based approach for duplicate detection in sliding windows that not only is competitive in time-complexity, but has better space usage than current dictionary-based approaches (e.g., SWAMP), at the cost of some moderate slack. APBFs retain the BF simplicity, unlike dictionary-based approaches, important for hardware-based implementations, and can integrate known improvements such as double hashing or blocking. We present an Age-Partitioned Blocked Bloom Filter variant which can operate with 2-3 cache-line accesses per insertion and around 2-4 per query, even for high accuracy filters.
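A heavily simplified sketch of the age-partitioned idea: real APBFs insert each item into k consecutive slices and shift partitions as they age, whereas in this assumption-laden toy each generation is a whole Bloom filter and aging simply drops the oldest generation:

```python
import hashlib
from collections import deque

class AgedBloom:
    """Sliding-window duplicate detection: recent items are always found
    (no false negatives within the window); old items age out."""

    def __init__(self, generations=4, bits=1024, hashes=3, per_gen=100):
        self.bits, self.hashes, self.per_gen = bits, hashes, per_gen
        self.gens = deque([0], maxlen=generations)  # newest generation at index 0
        self.count = 0

    def _positions(self, item):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.gens[0] |= 1 << pos        # set bits in the newest generation
        self.count += 1
        if self.count % self.per_gen == 0:
            self.gens.appendleft(0)         # age: oldest generation falls off

    def __contains__(self, item):
        union = 0
        for g in self.gens:                 # query the union of live generations
            union |= g
        return all(union >> pos & 1 for pos in self._positions(item))

f = AgedBloom(generations=2, per_gen=3)
for w in ["a", "b", "c", "d", "e", "f"]:
    f.add(w)
assert "f" in f       # within the window: guaranteed present
```

The real construction is tighter: sharing slices between overlapping age ranges is what lets APBFs beat dictionary-based designs on space while keeping cache-friendly access patterns.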

Updated: 2020-01-10
• arXiv.cs.DB Pub Date : 2018-03-23
Timnit Gebru; Jamie Morgenstern; Briana Vecchione; Jennifer Wortman Vaughan; Hanna Wallach; Hal Daumé III; Kate Crawford

Currently there is no standard way to identify how a dataset was created, and what characteristics, motivations, and potential skews it represents. To begin to address this issue, we propose the concept of a datasheet for datasets, a short document to accompany public datasets, commercial APIs, and pretrained models. The goal of this proposal is to enable better communication between dataset creators and users, and help the AI community move toward greater transparency and accountability. By analogy, in computer hardware, it has become industry standard to accompany everything from the simplest components (e.g., resistors) to the most complex microprocessor chips with datasheets detailing standard operating characteristics, test results, recommended usage, and other information. We outline some of the questions a datasheet for datasets should answer. These questions focus on when, where, and how the training data was gathered, its recommended use cases, and, in the case of human-centric datasets, information regarding the subjects' demographics and consent as applicable. We develop prototypes of datasheets for two well-known datasets: Labeled Faces in the Wild and the Pang & Lee Polarity Dataset.

Updated: 2020-01-10
• arXiv.cs.DB Pub Date : 2019-08-05
Gonzalo Navarro; Juan L. Reutter; Javiel Rojas-Ledesma

Worst-case optimal join algorithms have gained a lot of attention in the database literature. Several algorithms are now known to be optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality we either need to build completely new indexes, or we must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that may be non-negligible. We show that optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need for extra storage. Our representation is a compact quadtree for the static indexes, and a dynamic quadtree sharing subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, and show that the running time of this algorithm is worst-case optimal in data complexity. Remarkably, we can extend our framework to evaluate more expressive queries from relational algebra by introducing a lazy version of qdags (lqdags). Once again, we can show that the running time of our algorithms is worst-case optimal.

Updated: 2020-01-10
• arXiv.cs.DB Pub Date : 2020-01-07
Renzo Angles; János Benjamin Antal; Alex Averbuch; Peter Boncz; Orri Erling; Andrey Gubichev; Vlad Haprian; Moritz Kaufmann; Josep Lluís Larriba Pey; Norbert Martínez; József Marton; Marcus Paradies; Minh-Duc Pham; Arnau Prat-Pérez; Mirko Spasić; Benjamin A. Steer; Gábor Szárnyas; Jack Waudby

The Linked Data Benchmark Council's Social Network Benchmark (LDBC SNB) is an effort intended to test various functionalities of systems used for graph-like data management. For this, LDBC SNB uses the recognizable scenario of operating a social network, characterized by its graph-shaped data. LDBC SNB consists of two workloads that focus on different functionalities: the Interactive workload (interactive transactional queries) and the Business Intelligence workload (analytical queries). This document contains the definition of the Interactive Workload and the first draft of the Business Intelligence Workload. This includes a detailed explanation of the data used in the LDBC SNB benchmark, a detailed description for all queries, and instructions on how to generate the data and run the benchmark with the provided software.

Updated: 2020-01-09
• arXiv.cs.DB Pub Date : 2020-01-08
Bo Jiang; Ming Li; Ravi Tandon

In this paper, we study local information privacy (LIP), and design LIP-based mechanisms for statistical aggregation while protecting users' privacy without relying on a trusted third party. The notion of context-awareness is incorporated in LIP, which can be viewed as explicit modeling of the adversary's background knowledge. It enables the design of privacy-preserving mechanisms leveraging the prior distribution, which can potentially achieve a better utility-privacy tradeoff than context-free notions such as Local Differential Privacy (LDP). We present an optimization framework to minimize the mean square error of the data aggregation while protecting the privacy of each individual user's input data, or of a correlated latent variable, subject to LIP constraints. Then, we study two different types of applications, (weighted) summation and histogram estimation, and derive the optimal context-aware data perturbation parameters for each case, based on a randomized-response type of mechanism. We further compare the utility-privacy tradeoff between LIP and LDP and theoretically explain why the incorporation of prior knowledge enlarges the feasible region of the perturbation parameters, which thereby leads to higher utility. We also extend the LIP-based privacy mechanisms to the more general case where exact prior knowledge is not available. Finally, we validate our analysis by simulations using both synthetic and real-world data. Results show that our LIP-based privacy mechanism provides better utility-privacy tradeoffs than LDP, and the advantage of LIP is even more significant when the prior distribution is more skewed.
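The randomized-response building block can be sketched in its context-free form. The flip probability 0.75 is an arbitrary assumption; the point of the paper is precisely that a context-aware mechanism would optimize such parameters using the prior distribution:

```python
import random

def perturb(bit, p_keep, rng):
    """Report the true bit with probability p_keep, else flip it."""
    return bit if rng.random() < p_keep else 1 - bit

def estimate_sum(reports, p_keep):
    """Unbiased estimate of the true sum: E[report] = (2p-1)*x + (1-p),
    so invert the affine map over all n reports."""
    n = len(reports)
    return (sum(reports) - n * (1 - p_keep)) / (2 * p_keep - 1)

rng = random.Random(0)
data = [1] * 600 + [0] * 400            # true sum = 600
reports = [perturb(b, 0.75, rng) for b in data]
est = estimate_sum(reports, 0.75)
# est is close to 600; the estimator is unbiased and its variance
# shrinks as the number of users grows.
```

Under LIP, knowing that e.g. bits are mostly 1 a priori lets the mechanism keep more signal at the same privacy level, which is where the utility gain over this context-free LDP-style choice comes from.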

Updated: 2020-01-09
• arXiv.cs.DB Pub Date : 2020-01-08
Alessandro Berti; Wil van der Aalst

Much time in process mining projects is spent on finding and understanding data sources and extracting the event data needed. As a result, only a fraction of time is spent actually applying techniques to discover, control and predict the business process. Moreover, current process mining techniques assume a single case notion. However, in real-life processes often different case notions are intertwined. For example, events of the same order handling process may refer to customers, orders, order lines, deliveries, and payments. Therefore, we propose to use Multiple Viewpoint (MVP) models that relate events through objects and that relate activities through classes. The required event data are much closer to existing relational databases. MVP models provide a holistic view on the process, but also allow for the extraction of classical event logs using different viewpoints. This way existing process mining techniques can be used for each viewpoint without the need for new data extractions and transformations. We provide a toolchain allowing for the discovery of MVP models (annotated with performance and frequency information) from relational databases. Moreover, we demonstrate that classical process mining techniques can be applied to any selected viewpoint.

Updated: 2020-01-09
• arXiv.cs.DB Pub Date : 2020-01-08
Benjamin Nguyen; Claude Castelluccia

In this document, we present a state of the art of anonymization techniques for classical tabular datasets. This article is geared towards a general audience with some background in mathematics and computer science, but no specific knowledge of anonymization. The objective of this document is to explain anonymization concepts so that the reader can sanitize a dataset and compute re-identification risk. The document contains a large number of examples to help the reader understand the calculations.

Updated: 2020-01-09
• arXiv.cs.DB Pub Date : 2019-04-14
Martin Grohe; Peter Lindner

Probabilistic databases (PDBs) are used to model uncertainty in data in a quantitative way. In the standard formal framework, PDBs are finite probability spaces over relational database instances. It has been argued convincingly that this is not compatible with an open world semantics (Ceylan et al., KR 2016) and with application scenarios that are modeled by continuous probability distributions (Dalvi et al., CACM 2009). We recently introduced a model of PDBs as infinite probability spaces that addresses these issues (Grohe and Lindner, PODS 2019). While that work was mainly concerned with countably infinite probability spaces, our focus here is on uncountable spaces. Such an extension is necessary to model typical continuous probability distributions that appear in many applications. However, an extension beyond countable probability spaces raises nontrivial foundational issues concerned with the measurability of events and queries and ultimately with the question whether queries have a well-defined semantics. It turns out that so-called finite point processes are the appropriate model from probability theory for dealing with probabilistic databases. This model allows us to construct suitable (uncountable) probability spaces of database instances in a systematic way. Our main technical results are measurability statements for relational algebra queries as well as aggregate queries and datalog queries.

Updated: 2020-01-09
• arXiv.cs.DB Pub Date : 2019-09-06
Matthias Boehm; Iulian Antonov; Sebastian Baunsgaard; Mark Dokter; Robert Ginthoer; Kevin Innerebner; Florijan Klezin; Stefanie Lindstaedt; Arnab Phani; Benjamin Rath; Berthold Reinwald; Shafaq Siddiqi; Sebastian Benjamin Wrede

Machine learning (ML) applications become increasingly common in many domains. ML systems to execute these workloads include numerical computing frameworks and libraries, ML algorithm libraries, and specialized systems for deep neural networks and distributed ML. These systems focus primarily on efficient model training and scoring. However, the data science process is exploratory, and deals with underspecified objectives and a wide variety of heterogeneous data sources. Therefore, additional tools are employed for data engineering and debugging, which requires boundary crossing and unnecessary manual effort, and lacks optimization across the lifecycle. In this paper, we introduce SystemDS, an open source ML system for the end-to-end data science lifecycle from data integration, cleaning, and preparation, over local, distributed, and federated ML model training, to debugging and serving. To this end, we aim to provide a stack of declarative language abstractions for the different lifecycle tasks, and users with different expertise. We describe the overall system architecture, explain major design decisions (motivated by lessons learned from Apache SystemML), and discuss key features and research directions. Finally, we provide preliminary results that show the potential of end-to-end lifecycle optimization.

Updated: 2020-01-09
• arXiv.cs.DB Pub Date : 2020-01-07
Yiru Chen; Eugene Wu

Interactive tools like user interfaces help democratize data access for end users by hiding underlying programming details and exposing only the necessary widget interface. Since customized interfaces are costly to build, automated interface generation is desirable. SQL is the dominant way to analyze data, and query logs of such analyses already exist. Previous work proposed a syntactic approach that analyzes structural changes in SQL query logs and automatically generates a set of widgets to express those changes. However, it does not consider layout usability or the sequential order of queries in the log. We propose to adopt Monte Carlo Tree Search (MCTS) to search for the optimal interface, accounting for the hierarchical layout as well as usability in terms of how easily the query log can be expressed.
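The abstract does not spell out the search formulation, so as a rough illustration only, a generic MCTS loop with UCB1 selection might look like the sketch below; the interface states, the candidate-layout `expand` function, and the usability `score` are all hypothetical placeholders, not the paper's actual model:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb1(node, parent_visits, c=1.4):
    # Unvisited children are explored first; otherwise balance the mean
    # reward against an exploration bonus.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(parent_visits) / node.visits)

def mcts(root_state, expand, score, iterations=200):
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: descend via UCB1 until reaching a leaf
        node = root
        while node.children:
            parent = node
            node = max(parent.children, key=lambda n: ucb1(n, parent.visits))
        # 2. Expansion: add successor states (candidate layout refinements)
        for s in expand(node.state):
            node.children.append(Node(s, node))
        if node.children:
            node = random.choice(node.children)
        # 3. Evaluation: usability score of the reached interface state
        reward = score(node.state)
        # 4. Backpropagation
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited first move, as is conventional in MCTS
    return max(root.children, key=lambda n: n.visits).state
```

In the paper's setting the reward would encode how easily the logged queries can be expressed under a given layout; here it is an arbitrary stand-in.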

Updated: 2020-01-08
• arXiv.cs.DB Pub Date : 2020-01-07
Philipp Götze; Arun Kumar Tharanatha; Kai-Uwe Sattler

Persistent Memory (PM), already available e.g. as Intel Optane DC Persistent Memory, represents a very promising next-generation memory solution with a significant impact on database architectures. Several data structures for this new technology and its properties have already been proposed. However, mostly only complete structures have been presented and evaluated, hiding the impact of the individual ideas and PM characteristics. Therefore, in this paper, we disassemble the structures presented so far, identify their underlying design primitives, and assign them to appropriate design goals regarding PM. As a result of our comprehensive experiments on real PM hardware, we were able to reveal the trade-offs of the primitives at the micro level. From this, performance profiles could be derived for selected primitives, making it possible to precisely identify their best use cases as well as their vulnerabilities. Besides our general insights regarding PM-based data structure design, we also discovered new promising combinations not considered in the literature so far.

Updated: 2020-01-08
• arXiv.cs.DB Pub Date : 2019-05-15
George Papadakis; Dimitrios Skoutas; Emmanouil Thanos; Themis Palpanas

Efficiency techniques have been an integral part of Entity Resolution since its infancy. In this survey, we organize the bulk of works in the field into Blocking, Filtering, and hybrid techniques, facilitating their understanding and use. We also provide in-depth coverage of each category, further classifying the corresponding works into novel sub-categories. Lately, efficiency techniques have received more attention due to the rise of Big Data. This includes large volumes of semi-structured data, which pose challenges not only to the scalability of efficiency techniques, but also to their core assumptions: Blocking's requirement of schema knowledge and Filtering's requirement of high similarity thresholds. The former led to the introduction of schema-agnostic Blocking in conjunction with Block Processing techniques, while the latter led to more relaxed similarity criteria. Our survey covers these new fields in detail, putting all relevant works in context.
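Schema-agnostic Blocking, mentioned above, can be illustrated with a minimal token-blocking sketch: every token from every attribute value becomes a blocking key, regardless of which attribute it came from, so no schema knowledge is needed. The record layout below is made up for the example:

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(records):
    """Schema-agnostic token blocking: each token appearing in any
    attribute value of a record becomes a blocking key for that record."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for value in record.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # Keep only blocks that can actually produce comparisons
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}

def candidate_pairs(blocks):
    # Union of all within-block pairs; these are the only comparisons
    # a matcher would subsequently perform
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Note that two records with entirely different schemas (`name` vs. `fullname`) still land in the same block whenever they share a token, which is exactly the point of dropping the schema-knowledge assumption.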

Updated: 2020-01-08
• arXiv.cs.DB Pub Date : 2019-12-01
Xinyu Fan; Faen Zhang; Jianfei Song; Jingming Guo; Fujie Gao

Even though the idea of partitioning provenance graphs for access control was previously proposed, employing segments of the provenance DAG for fine-grained access control to provenance data has not been thoroughly explored. Hence, we take segments of a provenance graph, based on the extended Open Provenance Model (OPM) and defined using a variant of regular expressions, and utilize them in our fine-grained access control language. The language can not only return partial graphs to answer access requests but also introduce segments as restrictions in order to screen targeted data.

Updated: 2020-01-08
• arXiv.cs.DB Pub Date : 2020-01-03
Mahmoud Barhamgi; Charith Perera; Chia-Mu Yu; Djamal Benslimane; David Camacho; Christine Bonnet

In modern information systems, different information features about the same individual are often collected and managed by autonomous data collection services that may have different privacy policies. Answering many end-users' legitimate queries requires the integration of data from multiple such services. However, data integration is often hindered by the lack of a trusted entity, often called a mediator, with which the services can share their data and to which they can delegate the enforcement of their privacy policies. In this paper, we propose a flexible privacy-preserving data integration approach for answering data integration queries without the need for a trusted mediator. In our approach, services are allowed to enforce their privacy policies locally. The mediator is considered untrusted and only has access to encrypted information that allows it to link data subjects across the different services. By virtue of a new privacy requirement, dubbed k-Protection, which limits privacy leaks, services cannot infer information about the data held by each other. End-users, in turn, have access to privacy-sanitized data only. We evaluated our approach using an example and a real dataset from the healthcare application domain. The results are promising from both the privacy-preservation and the performance perspectives.
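The paper's actual scheme rests on encryption and the k-Protection requirement; purely as an illustration of the weaker idea of an untrusted mediator linking subjects it cannot identify, the keyed-hash pseudonym join below shows the mechanics. The shared key, service payloads, and subject identifiers are all invented for the example:

```python
import hashlib
import hmac

# Hypothetical key shared among the services but withheld from the mediator
SHARED_KEY = b"demo-service-key"

def pseudonym(subject_id):
    # A keyed hash lets the mediator recognize that two records refer to
    # the same subject without learning the underlying identifier
    return hmac.new(SHARED_KEY, subject_id.encode(), hashlib.sha256).hexdigest()

def mediator_join(service_a, service_b):
    """The untrusted mediator sees only (pseudonym, sanitized payload)
    pairs from each service and joins on the pseudonym."""
    index = {p: row for p, row in service_b}
    return [(row_a, index[p]) for p, row_a in service_a if p in index]
```

A real deployment would additionally sanitize the payloads locally before release, which is where the k-Protection guarantee of the paper comes in.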

Updated: 2020-01-07
• arXiv.cs.DB Pub Date : 2020-01-05
Xinying Wang; Olamide Timothy Tawose; Feng Yan; Dongfang Zhao

Interoperability across multiple blockchains will play a critical role in the future blockchain-based data management paradigm. Existing techniques either work only for two blockchains or require a centralized component to govern cross-blockchain transaction execution, neither of which meets the scalability requirement. This paper proposes a new distributed commit protocol, namely cross-blockchain transaction (CBT), for conducting transactions across an arbitrary number of blockchains without any centralized component. The key idea of CBT is to extend the two-phase commit protocol with a heartbeat mechanism that ensures the liveness of CBT without introducing additional nodes or blockchains. We have implemented CBT and compared it to the state-of-the-art protocols, demonstrating CBT's low overhead (3.6% between two blockchains, less than 1% among 32 or more blockchains) and high scalability (linear scalability on up to 64-blockchain transactions). In addition, we developed a graphical user interface for users to visually monitor the status of cross-blockchain transactions.
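The abstract only names the building blocks, so the sketch below is a plain two-phase commit skeleton with a heartbeat-staleness check folded into the vote phase, loosely in the spirit described; the `Chain` class, timeout value, and all method names are invented, and the consensus and ledger details of real blockchains are omitted entirely:

```python
import time

class Chain:
    """Stand-in for one participating blockchain."""
    def __init__(self, name, alive=True):
        self.name, self.alive = name, alive
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        if self.alive:  # a failed chain stops refreshing its heartbeat
            self.last_heartbeat = time.monotonic()

    def prepare(self, txn):
        # Phase 1: lock the assets involved and vote yes/no
        return self.alive

    def commit(self, txn):
        return self.alive

    def abort(self, txn):
        pass

def cross_chain_commit(chains, txn, timeout=1.0):
    # Phase 1: collect votes; a stale heartbeat counts as a "no" vote,
    # so a crashed chain cannot block the transaction forever
    votes = []
    for c in chains:
        c.heartbeat()
        fresh = time.monotonic() - c.last_heartbeat <= timeout
        votes.append(fresh and c.prepare(txn))
    # Phase 2: unanimous yes commits everywhere, otherwise abort everywhere
    if all(votes):
        for c in chains:
            c.commit(txn)
        return "committed"
    for c in chains:
        c.abort(txn)
    return "aborted"
```

The heartbeat is what distinguishes this from textbook 2PC: it turns an unbounded wait on a dead participant into a bounded one, which is the liveness property the paper attributes to CBT.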

Updated: 2020-01-07
• arXiv.cs.DB Pub Date : 2020-01-06

Big Data, which may contain sensitive information, is used by data miners for analysis purposes. This raises privacy challenges for researchers. Existing privacy-preserving methods use different algorithms that limit data reconstruction while securing the sensitive data. This paper presents a clustering-based probabilistic model for privacy preservation in big data that aims to attain minimum perturbation and maximum privacy. In our model, sensitive information is secured by identifying the sensitive data within data clusters and modifying or generalizing it. The resulting dataset is analysed to calculate the accuracy of our model in terms of hidden data and data lost as a result of reconstruction. Extensive experiments are carried out to demonstrate the results of our proposed model. Clustering-based privacy preservation of individual data with minimum perturbation and successful reconstruction highlights the significance of our model, in addition to the use of standard performance evaluation measures.
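The paper's model is only summarized above; a minimal sketch of the underlying idea, assuming a toy one-dimensional attribute and a fixed two-cluster k-means, is to replace each sensitive value with its cluster centroid, trading precision (perturbation) for privacy:

```python
def two_means(values, iters=10):
    """Toy 1-D k-means with k=2; returns the two centroids."""
    c = [min(values), max(values)]
    for _ in range(iters):
        groups = ([], [])
        for v in values:
            # index 0 if v is closer to c[0], else 1 (bool indexes the tuple)
            groups[abs(v - c[0]) > abs(v - c[1])].append(v)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return c

def generalize(values):
    """Replace each sensitive value with its nearest cluster centroid,
    so individual values are hidden behind a cluster-level summary."""
    c = two_means(values)
    return [c[abs(v - c[0]) > abs(v - c[1])] for v in values]
```

After generalization, all records in a cluster are indistinguishable on this attribute, which is the privacy gain; the distance between each value and its centroid is the perturbation cost the paper seeks to minimize.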

Updated: 2020-01-07
• arXiv.cs.DB Pub Date : 2020-01-03
Ruoyu Wang (Shanghai Jiao Tong University; University of New South Wales); Daniel Sun (University of New South Wales; CSIRO); Guoqiang Li (Shanghai Jiao Tong University)

Statistical divergence is widely applied in multimedia processing, largely due to the regularity and explainable features displayed in such data. However, in a broader range of data realms these advantages may not stand out, and a more general approach is therefore required. In data detection, statistical divergence can be used as a similarity measurement based on collective features. In this paper, we present a collective detection technique based on statistical divergence. The technique extracts distribution similarities among data collections and then uses the statistical divergence to detect collective anomalies. Our technique continuously evaluates metrics as evolving features and calculates an adaptive threshold to meet the best mathematical expectation. To illustrate the details of the technique and explore its efficiency, we conducted a case study on a real-world problem: detecting click farming by malicious online sellers. The evaluation shows that these techniques provide efficient classifiers. They are also sensitive to a much smaller magnitude of data alteration than real-world malicious behaviours exhibit, making them applicable in practice.
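The abstract does not specify which divergence or threshold rule is used; a minimal sketch of the general pattern, assuming KL divergence over histograms and a mean-plus-k-sigma adaptive threshold (both my choices, not necessarily the paper's), could look like this:

```python
import math

def kl_divergence(p, q, eps=1e-9):
    # D_KL(p || q) over discrete distributions, smoothed to avoid log(0)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def histogram(sample, bins):
    counts = [0] * len(bins)
    for x in sample:
        for i, (lo, hi) in enumerate(bins):
            if lo <= x < hi:
                counts[i] += 1
                break
    total = sum(counts) or 1
    return [c / total for c in counts]

def detect(reference, windows, bins, k=3.0):
    """Flag windows whose divergence from the reference distribution exceeds
    an adaptive threshold: mean + k * stddev of the divergences seen so far."""
    ref = histogram(reference, bins)
    divs, flags = [], []
    for w in windows:
        d = kl_divergence(histogram(w, bins), ref)
        mu = sum(divs) / len(divs) if divs else d
        sd = (sum((x - mu) ** 2 for x in divs) / len(divs)) ** 0.5 if divs else 0.0
        # require at least two past observations before trusting the threshold
        flags.append(len(divs) > 1 and d > mu + k * sd)
        divs.append(d)
    return flags
```

Because the threshold is recomputed from the evolving divergence history, it adapts as the "normal" behaviour of the stream drifts, which matches the continuous-evaluation aspect described above.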

Updated: 2020-01-06
• arXiv.cs.DB Pub Date : 2020-01-03
Devin Petersohn; William Ma; Doris Lee; Stephen Macke; Doris Xin; Xiangxi Mo; Joseph E. Gonzalez; Anthony D. Joseph; Joseph M. Hellerstein; Aditya Parameswaran

Dataframes are a popular and convenient abstraction for representing, structuring, cleaning, and analyzing data during exploratory data analysis. Despite the success of dataframe libraries in R and Python (pandas), dataframes face performance issues even on moderately large datasets. In this vision paper, we take the first steps towards formally defining dataframes, characterizing their properties, and outlining a research agenda towards making dataframes more interactive at scale. We draw on tools and techniques from the database community and describe ways they may be adapted to serve dataframe systems, as well as the new challenges therein. We also describe our current progress toward a scalable dataframe system, Modin, which is already up to 30× faster than pandas in preliminary case studies, while enabling unmodified pandas code to run as-is. In its first 18 months, Modin has already been used by over 60 downstream projects and has over 250 forks and 3,900 stars on GitHub, indicating the pressing need for pursuing this agenda.

Updated: 2020-01-06
• arXiv.cs.DB Pub Date : 2020-01-02
Dongjing Miao; Zhipeng Cai; Jianzhong Li; Xiangyu Gao; Xianmin Liu

Data inconsistency evaluation and repairing are major concerns in data quality management. As a basic computing task, optimal subset repair is not only applied for cost estimation during database repairing, but is also directly used to derive the evaluation of database inconsistency. Computing an optimal subset repair means finding a minimum set of tuples whose removal from an inconsistent database leaves a consistent subset. Tight bounds on the complexity and efficient algorithms are still unknown. In this paper, we improve the existing complexity and algorithmic results, together with a fast estimation of the size of an optimal subset repair. We first strengthen the dichotomy for the optimal subset repair computation problem, showing that it is not only APX-complete, but also NP-hard to approximate an optimal subset repair within a factor better than $17/16$ for most cases. We then show a $(2-0.5^{\sigma-1})$-approximation whenever given $\sigma$ functional dependencies, and a $(2-\eta_k+\frac{\eta_k}{k})$-approximation when an $\eta_k$-portion of tuples have the $k$-quasi-Turán property for some $k>1$. We finally show a sublinear estimator of the size of the optimal S-repair for subset queries; it outputs an estimate within a ratio of $2n+\epsilon n$ with high probability, thus deriving an estimate of the FD-inconsistency degree within a ratio of $2+\epsilon$. To support a variety of subset queries for FD-inconsistency evaluation, we unify them as the $\subseteq$-oracle, which can answer membership queries and return $p$ uniformly sampled tuples for any given number $p$. Experiments are conducted on range queries as an implementation of the $\subseteq$-oracle, and the results show the efficiency of our FD-inconsistency degree estimator.
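The paper's own algorithms are more refined than this, but the classic factor-2 baseline they improve upon can be sketched for FD violations: build the conflict graph over tuple pairs and, matching-style, remove both endpoints of every remaining conflict (the standard 2-approximation for vertex cover). Tuple and FD encodings here are my own illustration:

```python
from itertools import combinations

def violates(t1, t2, fd):
    """True if the tuple pair violates the FD lhs -> rhs:
    they agree on every lhs attribute but differ on some rhs attribute."""
    lhs, rhs = fd
    return all(t1[a] == t2[a] for a in lhs) and any(t1[a] != t2[a] for a in rhs)

def greedy_subset_repair(tuples, fds):
    """2-approximate subset repair: whenever an uncovered conflicting pair
    is found, remove both tuples. The removed set is at most twice the
    size of an optimal subset repair."""
    removed = set()
    for i, j in combinations(range(len(tuples)), 2):
        if i in removed or j in removed:
            continue
        if any(violates(tuples[i], tuples[j], fd) for fd in fds):
            removed.update((i, j))
    return removed
```

The $(2-0.5^{\sigma-1})$ and $(2-\eta_k+\frac{\eta_k}{k})$ factors in the abstract are precisely improvements on the factor 2 that this naive endpoint-removal strategy guarantees.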

Updated: 2020-01-04
• arXiv.cs.DB Pub Date : 2020-01-02
Eric Daimler; Ryan Wisnesky

In this paper we take the common position that AI systems are limited more by the integrity of the data they learn from than by the sophistication of their algorithms, and the uncommon position that the solution to achieving better data integrity in the enterprise is not to clean and validate data ex post facto whenever needed (the so-called data lake approach to data management, which can lead to data scientists spending 80% of their time cleaning data), but rather to formally and automatically guarantee that data integrity is preserved as data is transformed (migrated, integrated, composed, queried, viewed, etc.) throughout the enterprise, so that data and the programs that depend on it need not constantly be re-validated for every particular use.

Updated: 2020-01-04
Contents have been reproduced by permission of the publishers.
