User-centric pattern mining on knowledge graphs: An archaeological case study

doi:10.1016/j.websem.2018.12.004

Journal of Web Semantics

Volume 59, December 2019, 100486

https://doi.org/10.1016/j.websem.2018.12.004

Abstract

In recent years, there has been a growing interest from the digital humanities in knowledge graphs as a data modelling paradigm. This has already led to the creation of many such knowledge graphs, many of which are now available as part of the Linked Open Data cloud. This presents new opportunities for data mining. In this work, we develop, implement, and evaluate (in both a data-driven and a user-driven manner) an end-to-end pipeline for user-centric pattern mining on knowledge graphs in the humanities. This pipeline combines constrained generalized association rule mining with natural language output and facet rule browsing to allow for transparency and interpretability, two key domain requirements. Experiments in the archaeological domain show that domain experts were positively surprised by the range of patterns that were discovered and were overall optimistic about the future potential of this approach.

Introduction

Digital humanities communities have shown a growing interest in the knowledge graph as a data modelling paradigm [1]. Already, this interest has inspired several large-scale international projects – amongst which are Europeana, CARARE, and ARIADNE – to actively explore the creation and publication of knowledge graphs in their respective domains. These knowledge graphs, and many others like them, have been made available as part of the Linked Open Data (LOD) cloud – a vast and internationally distributed network of heterogeneous knowledge – bringing large amounts of structured data within arm’s reach of humanities researchers, who are now looking for ways to analyse this wealth of knowledge. This presents new opportunities for data mining [2].

Data mining is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [3]. These patterns describe regularities in a dataset which can help researchers gain more insight into their data. Researchers can then use this insight as a starting point to form new research hypotheses, as support for existing ones, or simply to get a better understanding of their data [4]. In the traditional setting, this entire process can take weeks or even months of hard work. By incorporating data mining into the workflow, however, much of this time can be saved through the automatic discovery of potentially relevant patterns. This makes data mining interesting as a support tool for humanities researchers.

Of course, the idea of using data mining as a support tool in the humanities is, in itself, not novel. There have been various attempts before, for instance to classify coins [5] or to cluster cultural heritage [6]. However, the majority of these studies involved mining unstructured data, most commonly in the form of text mining, whereas mining structured data has thus far been largely limited to tabular data and tailored to specific use cases. With the growing popularity of knowledge graphs in the humanities, mining patterns from these structures becomes ever more important to researchers in this domain.

This work presents the MINing On Semantics pipeline (MINOS) for pattern mining on knowledge graphs in the humanities. Its aim is to support domain experts in their analyses of such knowledge graphs by helping them discover useful and interesting patterns in their data. To this end, the MINOS pipeline places users at the centre by letting them guide the mining process towards their topics of interest and by letting them focus the results via a facet pattern browser.

Under the hood, MINOS employs generalized association rule mining (ARM). An association rule is an implication of the form $X \Rightarrow y$, where the presence of a set of items $X$ implies the presence of another item $y$. These implications are learned by iterating over a dataset of examples, called transactions. Generalized ARM works in largely the same way, except that the antecedent $X$ holds the item classes rather than the items themselves.
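To make the difference between plain and generalized rules concrete, the following minimal sketch scores a single rule over a toy set of transactions. Everything in it is an assumption made for illustration: the transactions, the item-to-class mapping `ITEM_CLASS`, and the helper names are hypothetical, and support and confidence are the standard ARM measures rather than a description of how MINOS computes its rules.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy data: each transaction is the set of items found together.
TRANSACTIONS: List[Set[str]] = [
    {"amphora", "coin"},
    {"jug", "coin"},
    {"amphora", "bead"},
    {"axe", "bead"},
]

# Hypothetical taxonomy mapping every item to its class.
ITEM_CLASS: Dict[str, str] = {
    "amphora": "Pottery",
    "jug": "Pottery",
    "axe": "Tool",
    "coin": "Coin",
    "bead": "Ornament",
}


def classes_of(transaction: Set[str]) -> Set[str]:
    """Return the set of classes of the items in a transaction."""
    return {ITEM_CLASS[item] for item in transaction}


def item_rule_stats(antecedent: Set[str], consequent: str,
                    transactions: List[Set[str]]) -> Tuple[float, float]:
    """Support and confidence of a plain rule: items X => item y."""
    matches = [t for t in transactions if antecedent <= t]
    both = [t for t in matches if consequent in t]
    support = len(both) / len(transactions)
    confidence = len(both) / len(matches) if matches else 0.0
    return support, confidence


def class_rule_stats(antecedent: Set[str], consequent: str,
                     transactions: List[Set[str]]) -> Tuple[float, float]:
    """Support and confidence of a generalized rule: classes X => item y."""
    matches = [t for t in transactions if antecedent <= classes_of(t)]
    both = [t for t in matches if consequent in t]
    support = len(both) / len(transactions)
    confidence = len(both) / len(matches) if matches else 0.0
    return support, confidence


if __name__ == "__main__":
    # Plain rule {amphora} => coin versus generalized rule {Pottery} => coin.
    print(item_rule_stats({"amphora"}, "coin", TRANSACTIONS))
    print(class_rule_stats({"Pottery"}, "coin", TRANSACTIONS))
```

In this toy data the generalized rule matches three transactions instead of two, because both amphorae and jugs count as pottery. Lifting the antecedent to the class level in this way lets a rule cover items that individually occur too rarely to support a pattern.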

This method was specifically chosen to help overcome two key issues with technological acceptance in the humanities, namely transparency and interpretability [7], [8]. With transparency, we refer to the ease with which a method and its underlying theory can be understood: a black box method, for example, is less transparent than a glass box one. With interpretability, we mean how easily one can interpret the results of a method: it is, for instance, typically more difficult to interpret an $n$-order tensor than it is to interpret a set of symbolic statements.

Generalized ARM satisfies both of these constraints: it employs basic statistical know-how to produce human-readable rules in an overall deterministic process. A limited background in statistics, which most humanities researchers possess, therefore already suffices to understand how these rules map back to the input data and to check whether they are valid. This allows humanities researchers to put their trust in both the method and its results [9].

Of course, this trust is only gained if the produced rules provide useful and interesting patterns which can help these researchers to get a better understanding of their data. We call this the effectiveness of the approach. To assess this effectiveness, we conducted experiments in the archaeological domain, specifically on data from various excavations, during which domain experts were asked to evaluate a set of candidate rules on interestingness.

By placing domain experts at the centre of both the pipeline and its evaluation, as opposed to data scientists, we contribute to an as yet largely unexplored niche in this intersecting field of data mining, knowledge graphs, and humanities. Concretely, our main contributions are (1) insight into some of the challenges and possible solutions for introducing data science tools to the humanities, (2) a pipeline design for pattern mining on knowledge graphs which is tailored to domain experts rather than to data scientists, and (3) a user-driven evaluation of our design choices instead of only a data-driven one.

With these contributions, our research aims to add to the interdisciplinary field of the Digital Humanities. For this reason, we will refrain from developing an ARM algorithm from scratch, but instead focus on how we can augment such an algorithm with complementary components to make it into an effective tool for Humanities researchers.

A concise overview of related work is given next, followed by an overview of the pipeline, the dataset, and the experimental setup. This paper then discusses the results from the user-driven evaluation, and concludes with a reflection on the chosen approach in light of these results.

Section snippets

Related work

Studies on data mining in the humanities have thus far largely focussed on unstructured data (text mining), whereas data mining on semi-structured or structured data has been explored less frequently [4], [10]. An example of the latter kind is discussed in [11], in which the authors propose mining association rules from excavation data – sites as transactions, artefacts as items – using the proven Apriori algorithm. This task is similar to that described in this work, but it is executed on a

The MINOS pipeline

The MINOS pattern mining pipeline combines an off-the-shelf ARM algorithm with a simple facet rule browser and a number of crucial pre- and post-processing components. These components enable users to integrate their interests into the process by restricting the search space beforehand and by filtering the results afterwards. To this end, the pre-processing components translate user-provided target patterns into SPARQL queries, use these queries to retrieve relevant resources from the LOD cloud,

The package-slip knowledge graph

Excavation data is a valuable source of information in many archaeological studies [9]. These studies are therefore likely to benefit from pattern mining on this type of data, and thus make it a suitable choice to base our case study on. In agreement with domain experts, we therefore selected the package-slip knowledge graph to run our experiments with.

Package slips are detailed summarizations of entire excavation projects. They are structured as specified by the SIKB Protocol 0102, which is a

Experiments

To assess the effectiveness of the MINOS pipeline, we have conducted four experiments on the package-slip knowledge graph (see Section 4). Each of these experiments addressed a different granularity of the package-slip graph to investigate the effects of these different granularities on the usefulness of the discovered patterns for domain experts. In order from coarse-grained to fine-grained, these are

Projects, which, amongst others, are of a project class, are held at a specific location, and

Discussion

Our analysis of the survey’s results indicates that the panel of experts was cautiously positive about the plausibility of the produced patterns. This (slight) positiveness does not come as a surprise, as association rules describe the actual patterns which exist in the data, rather than predict new ones. We can even further explain this observation by our decision to order the candidate rules on confidence – favouring accuracy above coverage – and because the package-slip knowledge graph only

Conclusion

In this work, we introduced the user-centric MINOS pipeline for pattern mining on knowledge graphs in the humanities. With this pipeline, we aim to support domain experts in their analyses of such knowledge graphs by helping them discover useful and interesting patterns in their data. Our pipeline therefore emphasizes the importance of these experts and their requirements, rather than those of the usual data scientists. This has led to several design choices, most particular of which is the use

Acknowledgements

We wish to express our deep gratitude to domain experts Milco Wansleeben and Rein van ’t Veer for their enthusiastic encouragement and useful critiques during the various steps that have led to this work. We also wish to thank all domain experts who participated in our survey for their willingness to sacrifice their free time, and without whom we would not have been able to complete this research.

This research has been partially funded by the ARIADNE project through the European Commission

References (31)

Nebot V. et al. Finding association rules in semantic web data. Knowl.-Based Syst. (2012)
Freitas A.A. On rule interestingness measures. Knowl.-Based Syst. (1999)
Hallo M. et al. Current state of linked data in digital libraries. J. Inf. Sci. (2016)
Rapti A. et al. A survey: Mining linked cultural heritage data
Fayyad U. et al. From data mining to knowledge discovery in databases. AI Mag. (1996)
Hagood J. A brief introduction to data mining projects in the humanities. Bull. Am. Soc. Inf. Sci. Tech. (2012)
Velu C. et al. Indian coin recognition and sum counting system of image data mining using artificial neural networks. Int. J. Adv. Sci. Tech. (2011)
Makantasis K. et al. In the wild image retrieval and clustering for 3D cultural heritage landmarks reconstruction. Multimedia Tools Appl. (2016)
Manovich L. Trending: the promises and the challenges of big social data. Debates in the Digital Humanities (2011)
Röhle B.R.T. Digital methods: Five challenges
Selhofer H. et al. D2.1: first report on users needs. Tech. rep. (2014)
Kamada H. Digital humanities roles for libraries? College Res. Lib. News (2010)
Kriegel H.-P. et al. Towards archaeo-informatics: scientific data management for archaeobiology
Tresp V. et al. Towards machine learning on the semantic web
Galárraga L. et al. Fast rule mining in ontological knowledge bases with AMIE+. VLDB J. (2015)

Cited by (6)

Towards defining data interpretability in open data portals: Challenges and research opportunities
2022, Information Systems
Relationship Prediction in a Knowledge Graph Embedding Model of the Illicit Antiquities Trade
2023, Advances in Archaeological Practice
Extracting Top-κ Frequent and Diversified Patterns in Knowledge Graphs
2024, IEEE Transactions on Knowledge and Data Engineering
Explainable Drug Repurposing in Context via Deep Reinforcement Learning
2023, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Correction Tower: A General Embedding Method of the Error Recognition for the Knowledge Graph Correction
2020, International Journal of Pattern Recognition and Artificial Intelligence
Decision Tree and Knowledge Graph Based on Grain Loss Prediction
2020, Communications in Computer and Information Science
