Labelling the past: data set creation and multi-label classification of Dutch archaeological excavation reports

Brandsen, Alex; Koole, Martin

doi:10.1007/s10579-021-09552-6

Labelling the past: data set creation and multi-label classification of Dutch archaeological excavation reports

Original Paper
Open access
Published: 28 July 2021

Volume 56, pages 543–572, (2022)
Cite this article

Download PDF

You have full access to this open access article

Language Resources and Evaluation Aims and scope Submit manuscript

Labelling the past: data set creation and multi-label classification of Dutch archaeological excavation reports

Download PDF

2118 Accesses
4 Citations
6 Altmetric
Explore all metrics

Abstract

The extraction of information from Dutch archaeological grey literature has recently been investigated by the AGNES project. AGNES aims to disclose relevant information by means of a web search engine, to enable researchers to search through excavation reports. In this paper, we focus on the multi-labelling of archaeological excavation reports with time periods and site types, and provide a manually labelled reference set to this end. We propose a series of approaches, pre-processing methods, and various modifications of the training set to address the often low quality of both texts and labels. We find that despite those issues, our proposed methods lead to promising results.

Text Mining Oral Histories in Historical Archaeology

Article 13 January 2023

The West Bank and East Jerusalem Archaeological Database: Narratives of Archaeology and Archaeological Practices

Semantic Information Extraction in Archaeology: Challenges in the Construction of a Portuguese Corpus of Megalithism

1 Introduction

Over the past decades, the archaeological domain has produced a large quantity of literature in the form of excavation reports, scholarly articles, and books. The Archaeological Grey literature Named Entity Search (AGNES) project (Brandsen et al., 2019) aims to uncover any relevant information from Dutch archaeological excavation reports. Such reports are often grey literature: material that is either unpublished, or published in a non-traditional manner. Information uncovered by AGNES will be made easily accessible through a specifically designed search engine, enabling researchers to search for relevant texts.

In this search engine, certain aspects of documents are used for faceted search, allowing archaeologists to filter search results on site type and time period metadata fields. This information need is further detailed by Brandsen et al. (2019). AGNES currently only indexes documents with manually assigned metadata, but in the near future, documents without metadata will be added. To allow for faceted search on these documents as well, we propose to automatically assign metadata. Manual labelling is an unfeasible task due to the amount of texts: there are currently an estimated 70,000 documents and four to five thousand are added each year. Due to this volume, using text mining and machine learning techniques becomes a necessity.

In this paper, the labelling of Dutch archaeological excavation reports with time periods and site types^{Footnote 1} will be addressed in the form of a multi-label classification task.

We first create a manually labelled reference set, and then define a collection of pre-processing steps, classification methods, further text formatting and sampling techniques that lead to a multitude of different combinations. We determine which approaches are suitable for this particular type of data, and we discuss how these methods could be further improved.

Although reports are typically freely available in online repositories and archives, processing the documents proves to be rather difficult for four main reasons:

1.
Some of the documents are scanned hard copies, and the OCR process introduces noise
2.
The documents are only available in PDF format, and conversion to plain text introduces noise
3.
The training data labels are derived from the metadata values which has been added through a free text field, leading to highly diverse and inaccurate metadata
4.
There is a large number of target labels (146 site types, 42 time periods) with a strong class imbalance

See Table 1 for examples of point 1 to 3, and see Figs. 2 and 3 for point 4.

Table 1 Examples of noise introduced by (1) OCR mistakes, (2) PDF to text conversion and (3) manual metadata entry in free text fields (locations in time period field)

Labelling the past: data set creation and multi-label classification of Dutch archaeological excavation reports

Abstract

Similar content being viewed by others

Text Mining Oral Histories in Historical Archaeology

The West Bank and East Jerusalem Archaeological Database: Narratives of Archaeology and Archaeological Practices

Semantic Information Extraction in Archaeology: Challenges in the Construction of a Portuguese Corpus of Megalithism

1 Introduction

2 Related work

2.1 Text mining in the archaeological domain

2.2 Multi-label text classification

3 Data

3.1 Source data

3.2 ABR ontology

3.3 Definition of categories

3.4 Obtaining the document labels from the data

3.5 Exploration of the extracted labels

3.6 Pre-processing the metadata

4 Methods

4.1 Document pre-processing

4.2 Document filtering

4.3 Balancing the training set

4.4 Construction of a manually labelled reference set

4.5 Classification methods

4.5.1 Baseline

4.5.2 Binary relevance

4.5.3 Direct multi-labelling

4.6 Selection round

5 Results

5.1 Selection round

5.2 Pre-processing optimisation

5.3 Best methods per category

6 Conclusion

6.1 Future work

Notes

References

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendices

Appendices

1.1 A: Category frequencies

1.2 B: Filter list

1.3 C: Category frequencies test set

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation