Elsevier

Journal of Web Semantics

Volume 59, December 2019, 100486

User-centric pattern mining on knowledge graphs: An archaeological case study

https://doi.org/10.1016/j.websem.2018.12.004

Abstract

In recent years, the digital humanities have shown a growing interest in knowledge graphs as a data modelling paradigm. This has already led to the creation of many such knowledge graphs, many of which are now available as part of the Linked Open Data cloud. This presents new opportunities for data mining. In this work, we develop, implement, and evaluate (in both a data-driven and a user-driven manner) an end-to-end pipeline for user-centric pattern mining on knowledge graphs in the humanities. This pipeline combines constrained generalized association rule mining with natural language output and facet rule browsing to allow for transparency and interpretability, two key domain requirements. Experiments in the archaeological domain show that domain experts were positively surprised by the range of patterns that were discovered and were overall optimistic about the future potential of this approach.

Introduction

Digital humanities communities have shown a growing interest in the knowledge graph as a data modelling paradigm [1]. Already, this interest has inspired several large-scale international projects – amongst which are Europeana, CARARE, and ARIADNE – to actively explore the creation and publication of knowledge graphs in their respective domains. These knowledge graphs, and many others like them, have been made available as part of the Linked Open Data (LOD) cloud – a vast and internationally distributed network of heterogeneous knowledge – bringing large amounts of structured data within arm’s reach of humanities researchers, who are now looking for ways to analyse this wealth of knowledge. This presents new opportunities for data mining [2].

Data mining is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [3]. These patterns describe regularities in a dataset which can help researchers gain more insight into their data. Researchers can then use this insight as a starting point to form new research hypotheses, as support for existing ones, or simply to get a better understanding of their data [4]. In the traditional setting, this entire process can take weeks or even months of hard work. However, by incorporating data mining into the workflow, much of this time can be saved through the automatic discovery of potentially relevant patterns. This makes data mining interesting as a support tool for humanities researchers.

Of course, the idea of using data mining as a support tool in the humanities is, in itself, not novel. There have been various attempts before, for instance to classify coins [5] or to cluster cultural heritage [6]. However, the majority of these studies involved mining unstructured data, most commonly in the form of text mining, whereas mining structured data has thus far been largely limited to tabular data and tailored to specific use cases. With the growing popularity of knowledge graphs in the humanities, mining patterns from these structures becomes ever more important to researchers in this domain.

This work presents the MINing On Semantics pipeline (MINOS) for pattern mining on knowledge graphs in the humanities. Its aim is to support domain experts in their analyses of such knowledge graphs by helping them discover useful and interesting patterns in their data. To this end, the MINOS pipeline places users in the centre by letting them guide the mining process towards their topics of interest and by letting them focus the results via a facet pattern browser.

Under the hood, MINOS employs generalized association rule mining (ARM). An association rule is an implication of the form X ⇒ y, where the presence of a set of items X implies the presence of another item y. These implications are learned by iterating over a dataset of examples, called transactions. Generalized ARM works largely the same, except that the antecedent X holds the item classes rather than the items themselves.
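To make the notion of an association rule concrete, the following minimal sketch computes the support and confidence of a rule over a handful of transactions, first on the items themselves and then on their classes. It is an illustration only: the toy transactions, the flat item-to-class mapping `ITEM_CLASS`, and the function names are assumptions made for this example and are not taken from MINOS, which relies on an off-the-shelf ARM algorithm.

```python
from typing import Dict, FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: str,
               transactions: List[Transaction]) -> float:
    """Of the transactions that contain the antecedent, the fraction
    that also contain the consequent."""
    ante = frozenset(antecedent)
    covered = [t for t in transactions if ante <= t]
    if not covered:
        return 0.0
    return sum(1 for t in covered if consequent in t) / len(covered)


def generalize(transaction: Transaction, item_class: Dict[str, str]) -> Transaction:
    """Replace each item by its class; items without a known class are kept."""
    return frozenset(item_class.get(item, item) for item in transaction)


# Invented toy data: each transaction lists what was recorded together.
TRANSACTIONS: List[Transaction] = [
    frozenset({"axe", "bowl", "iron_age"}),
    frozenset({"sword", "jar", "iron_age"}),
    frozenset({"axe", "jar", "iron_age"}),
    frozenset({"bowl", "bronze_age"}),
]

# Hypothetical flat item -> class mapping (a real graph would use its ontology).
ITEM_CLASS: Dict[str, str] = {
    "axe": "Tool", "sword": "Weapon", "bowl": "Pottery", "jar": "Pottery",
}

GENERALIZED = [generalize(t, ITEM_CLASS) for t in TRANSACTIONS]

# Item-level rule {bowl} => iron_age versus class-level rule {Pottery} => iron_age.
item_rule_conf = confidence({"bowl"}, "iron_age", TRANSACTIONS)
class_rule_conf = confidence({"Pottery"}, "iron_age", GENERALIZED)
```

In this toy setting the class-level rule covers all four transactions, whereas the item-level rule covers only two, which illustrates why generalizing the antecedent can surface patterns that are too sparse at the level of individual items.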

This method was specifically chosen to help overcome two key issues with technological acceptance in the humanities, namely transparency and interpretability [7], [8]. By transparency, we mean the ease with which a method and its underlying theory can be understood: a black box method, for example, is less transparent than a glass box one. By interpretability, we mean how easily one can interpret the results of a method: it is, for instance, typically more difficult to interpret an n-order tensor than it is to interpret a set of symbolic statements.

Generalized ARM satisfies both of these constraints: it employs basic statistical know-how to produce human-readable rules in an overall deterministic process. A limited background in statistics, which most humanities researchers possess, therefore already suffices to understand how these rules map back to the input data and to check whether they are valid. This allows humanities researchers to put their trust in both the method and its results [9].

Of course, this trust is only gained if the produced rules provide useful and interesting patterns which can help these researchers to get a better understanding of their data. We call this the effectiveness of the approach. To assess this effectiveness, we conducted experiments in the archaeological domain, specifically on data from various excavations, during which domain experts were asked to evaluate a set of candidate rules on interestingness.

By placing domain experts at the centre of both the pipeline and its evaluation, as opposed to data scientists, we contribute to an as yet largely unexplored niche in this intersecting field of data mining, knowledge graphs, and humanities. Concretely, our main contributions are (1) insight into some of the challenges and possible solutions for introducing data science tools to the humanities, (2) a pipeline design for pattern mining on knowledge graphs which is tailored to domain experts rather than to data scientists, and (3) a user-driven evaluation of our design choices instead of only a data-driven one.

With these contributions, our research aims to add to the interdisciplinary field of the Digital Humanities. For this reason, we will refrain from developing an ARM algorithm from scratch, but instead focus on how we can augment such an algorithm with complementary components to make it into an effective tool for Humanities researchers.

A concise overview of related work is given next, followed by an overview of the pipeline, the dataset, and the experimental setup. This paper then discusses the results from the user-driven evaluation, and concludes with a reflection on the chosen approach in light of these results.

Related work

Studies on data mining in the humanities have thus far largely focussed on unstructured data (text mining), whereas data mining on semi-structured or structured data has been explored less frequently [4], [10]. An example of the latter kind is discussed in [11], in which the authors propose mining association rules from excavation data – sites as transactions, artefacts as items – using the proven Apriori algorithm. This task is similar to that described in this work, but it is executed on a

The MINOS pipeline

The MINOS pattern mining pipeline combines an off-the-shelf ARM algorithm with a simple facet rule browser and a number of crucial pre- and post-processing components. These components enable users to integrate their interests into the process by restricting the search space beforehand, and by filtering the results afterwards. To this end, the pre-processing components translate user-provided target patterns into SPARQL queries, use these queries to retrieve relevant resources from the LOD cloud,
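As a rough illustration of the pre-processing step described above, the sketch below turns a simple target pattern into a SPARQL query string. The representation of a target pattern (a class IRI plus a mapping from variable names to predicate IRIs), the function name `build_query`, and the `example.org` IRIs are all assumptions made for this example; the excerpt does not specify the format MINOS actually uses.

```python
from typing import Dict, Optional


def build_query(target_class: str,
                properties: Dict[str, str],
                limit: Optional[int] = None) -> str:
    """Build a SPARQL SELECT query for a hypothetical target pattern.

    target_class: IRI of the class whose instances are of interest.
    properties:   maps a variable name to the IRI of the predicate to follow.
    limit:        optional cap on the number of result rows.
    """
    variables = ["?entity"] + [f"?{name}" for name in properties]
    lines = [
        "SELECT " + " ".join(variables),
        "WHERE {",
        # `a` is SPARQL shorthand for rdf:type.
        f"  ?entity a <{target_class}> .",
    ]
    for name, predicate in properties.items():
        # OPTIONAL keeps entities that lack a value for this predicate.
        lines.append(f"  OPTIONAL {{ ?entity <{predicate}> ?{name} . }}")
    lines.append("}")
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


# Hypothetical pattern: finds together with their material and period.
query = build_query(
    "http://example.org/Find",
    {
        "material": "http://example.org/hasMaterial",
        "period": "http://example.org/hasPeriod",
    },
    limit=100,
)
```

Each row returned by such a query could then be read as one transaction for the mining step. The use of `OPTIONAL` is a design choice of this sketch: it avoids silently dropping entities with incomplete descriptions, which tend to be common in heterogeneous linked data.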

The package-slip knowledge graph

Excavation data is a valuable source of information in many archaeological studies [9]. These studies are therefore likely to benefit from pattern mining on this type of data, and thus make it a suitable choice to base our case study on. In agreement with domain experts, we therefore selected the package-slip knowledge graph to run our experiments with.

Package slips are detailed summarizations of entire excavation projects. They are structured as specified by the SIKB Protocol 0102, which is a

Experiments

To assess the effectiveness of the MINOS pipeline, we have conducted four experiments on the package-slip knowledge graph (see Section 4). Each of these experiments addressed a different granularity of the package-slip graph to investigate the effects of these different granularities on the usefulness of the discovered patterns for domain experts. In order from coarse-grained to fine-grained, these are

    Projects, which, amongst others, are of a project class, are held at a specific location, and

Discussion

Our analysis of the survey’s results indicates that the panel of experts was cautiously positive about the plausibility of the produced patterns. This (slight) positiveness does not come as a surprise, as association rules describe the actual patterns which exist in the data, rather than predict new ones. We can even further explain this observation by our decision to order the candidate rules on confidence – favouring accuracy above coverage – and because the package-slip knowledge graph only

Conclusion

In this work, we introduced the user-centric MINOS pipeline for pattern mining on knowledge graphs in the humanities. With this pipeline, we aim to support domain experts in their analyses of such knowledge graphs by helping them discover useful and interesting patterns in their data. Our pipeline therefore emphasizes the importance of these experts and their requirements, rather than those of the usual data scientists. This has led to several design choices, most particular of which is the use

Acknowledgements

We wish to express our deep gratitude to domain experts Milco Wansleeben and Rein van ’t Veer for their enthusiastic encouragement and useful critiques during the various steps that have led to this work. We also wish to thank all domain experts who participated in our survey for their willingness to sacrifice their free time, and without whom we would not have been able to complete this research.

This research has been partially funded by the ARIADNE project through the European Commission

References (31)

  • Nebot V. et al. Finding association rules in semantic web data. Knowl.-Based Syst. (2012)
  • Freitas A.A. On rule interestingness measures. Knowl.-Based Syst. (1999)
  • Hallo M. et al. Current state of linked data in digital libraries. J. Inf. Sci. (2016)
  • Rapti A. et al. A survey: Mining linked cultural heritage data.
  • Fayyad U. et al. From data mining to knowledge discovery in databases. AI Mag. (1996)
  • Hagood J. A brief introduction to data mining projects in the humanities. Bull. Am. Soc. Inf. Sci. Tech. (2012)
  • Velu C. et al. Indian coin recognition and sum counting system of image data mining using artificial neural networks. Int. J. Adv. Sci. Tech. (2011)
  • Makantasis K. et al. In the wild image retrieval and clustering for 3D cultural heritage landmarks reconstruction. Multimedia Tools Appl. (2016)
  • Manovich L. Trending: the promises and the challenges of big social data. Debates in the Digital Humanities (2011)
  • Röhle B.R.T. Digital methods: Five challenges.
  • Selhofer H. et al. D2.1: first report on users needs. Tech. rep. (2014)
  • Kamada H. Digital humanities roles for libraries? College Res. Lib. News (2010)
  • Kriegel H.-P. et al. Towards archaeo-informatics: scientific data management for archaeobiology.
  • Tresp V. et al. Towards machine learning on the semantic web.
  • Galárraga L. et al. Fast rule mining in ontological knowledge bases with AMIE+. VLDB J. (2015)