Machine Learning Based Assembly of Fragments of Ancient Papyrus,ACM Journal on Computing and Cultural Heritage



Machine Learning Based Assembly of Fragments of Ancient Papyrus
ACM Journal on Computing and Cultural Heritage (IF 2.1), Pub Date: 2021-07-01, DOI: 10.1145/3460961
Roy Abitbol ₁ , Ilan Shimshoni ₁ , Jonathan Ben-Dov ₂


Assembling fragments into a composite picture, much like solving a puzzle, plays a significant role in archaeology because it supports researchers in their attempts to reconstruct historic artifacts. In this article, we propose a method for matching and assembling pairs of ancient papyrus fragments containing mostly unknown scriptures. Papyrus is manufactured from papyrus plants and therefore exhibits characteristic thread patterns that originate in the plant's stems. The proposed algorithm is founded on the hypothesis that these thread patterns carry unique local attributes, so that neighboring fragments show similar patterns reflecting the continuation of the threads. We posit that these patterns can be exploited with image processing and machine learning techniques to identify matching fragments. The algorithm and system we present support quick, automated classification of matching pairs of papyrus fragments, as well as the geometric alignment of the fragments in each pair. The algorithm consists of a series of steps based on deep learning and machine learning methods. The first step reduces the problem of matching fragments to the smaller problem of finding thread-continuation matches in local edge areas (squares) between pairs of fragments. This step is solved with a convolutional neural network that ingests raw images of the edge areas and produces local matching scores. This stage yields very high recall but low precision. We therefore use these scores to decide whether entire fragment pairs match by establishing an elaborate voting mechanism. We enhance this voting with geometric alignment techniques, from which we extract additional spatial information. Finally, we feed all the data collected in these steps into a Random Forest classifier to produce a higher-order classifier capable of predicting whether a pair of fragments is a match.
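The abstract does not spell out how the voting mechanism works. As a rough illustration only, the sketch below shows one common way to turn many noisy local scores into pair-level evidence: each local match votes for the edge offset it implies, and the offset with the most score-weighted votes wins. The `LocalMatch` type, the `vote_for_offset` function, the one-dimensional offset, the score threshold, and the bin width are all assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalMatch:
    """One scored pairing of an edge square on fragment A with one on fragment B.

    Hypothetical structure: the paper's actual representation is not given.
    """
    pos_a: float  # position of the square along fragment A's edge (pixels)
    pos_b: float  # position of the square along fragment B's edge (pixels)
    score: float  # local matching score in [0, 1], e.g. from a CNN


def vote_for_offset(matches, min_score=0.5, bin_width=10.0):
    """Accumulate score-weighted votes for the relative offset between two edges.

    Returns a dict of pair-level features, or None if no local match
    reaches min_score. Thresholds here are illustrative assumptions.
    """
    votes = defaultdict(float)   # offset bin -> summed scores
    support = defaultdict(int)   # offset bin -> number of supporting squares
    for m in matches:
        if m.score < min_score:
            continue  # drop weak local matches before voting
        bin_index = round((m.pos_b - m.pos_a) / bin_width)
        votes[bin_index] += m.score
        support[bin_index] += 1
    if not votes:
        return None
    best = max(votes, key=votes.get)
    total = sum(votes.values())
    return {
        "offset": best * bin_width,           # most supported offset
        "top_votes": votes[best],             # summed score behind that offset
        "vote_share": votes[best] / total,    # how dominant that offset is
        "support": support[best],             # count of agreeing local matches
    }


if __name__ == "__main__":
    matches = [
        LocalMatch(0, 100, 0.9),
        LocalMatch(50, 150, 0.8),
        LocalMatch(100, 200, 0.7),
        LocalMatch(0, 300, 0.6),   # spurious match implying a different offset
        LocalMatch(50, 20, 0.3),   # below the score threshold, ignored
    ]
    print(vote_for_offset(matches))
```

Features of this kind (dominant offset, its vote weight, its share of all votes) are the sort of aggregated evidence that could be passed on to a pair-level classifier such as the Random Forest mentioned above; which features the authors actually use is not stated in the abstract.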
Our algorithm was trained on a batch of fragments excavated from the Dead Sea caves and dated to circa the 1st century BCE. The algorithm shows excellent results on a validation set of similar origin and condition. We then ran the algorithm on a real-life set of fragments for which we have no prior knowledge or labeling of matches. This test batch is considered extremely challenging because of its poor condition and the small size of its fragments. Indeed, numerous researchers have tried to find matches within this batch with very little success. Our algorithm's performance on this batch was sub-optimal, returning a relatively large proportion of false positives. However, the algorithm proved quite useful: it eliminated 98% of the possible matches, reducing the amount of work needed for manual inspection. Experts who reviewed the results have identified some of the positive matches as potentially true and referred them for further investigation.
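To give a feel for what eliminating 98% of the possible matches means for manual inspection, the following sketch counts candidate pairs under the assumption that every unordered pair of fragments is one possible match. The abstract does not report the size of the test batch, so the fragment count used below is purely hypothetical, as are the function names.

```python
from math import comb


def candidate_pairs(num_fragments):
    """Number of unordered fragment pairs, assuming each pair is one candidate."""
    return comb(num_fragments, 2)


def pairs_left_for_review(num_fragments, eliminated_percent=98):
    """Candidate pairs remaining after the given percentage is filtered out."""
    total = candidate_pairs(num_fragments)
    return total * (100 - eliminated_percent) // 100


if __name__ == "__main__":
    n = 1000  # hypothetical batch size, not a figure from the paper
    print(candidate_pairs(n))        # 499500
    print(pairs_left_for_review(n))  # 9990
```

Because the number of pairs grows quadratically with the number of fragments, even a filter with many false positives can cut the inspection workload substantially.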


Updated: 2021-07-01
