Latin BERT: A Contextual Language Model for Classical Philology
arXiv - CS - Computation and Language. Pub Date: 2020-09-21. DOI: arxiv-2009.10053
David Bamman and Patrick J. Burns

We present Latin BERT, a contextual language model for the Latin language, trained on 642.7 million words from a variety of sources spanning the Classical era to the 21st century. In a series of case studies, we illustrate the affordances of this language-specific model both for work in natural language processing for Latin and in using computational methods for traditional scholarship: we show that Latin BERT achieves a new state of the art for part-of-speech tagging on all three Universal Dependency datasets for Latin and can be used for predicting missing text (including critical emendations); we create a new dataset for assessing word sense disambiguation for Latin and demonstrate that Latin BERT outperforms static word embeddings; and we show that it can be used for semantically-informed search by querying contextual nearest neighbors. We publicly release trained models to help drive future work in this space.
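The "predicting missing text" use case is ordinary masked-language-model inference. Below is a minimal sketch, not the authors' released code: it assumes a Hugging Face-compatible Latin BERT checkpoint at a hypothetical local path latin-bert/ (the released model ships with its own Latin subword tokenizer, so the exact loading step may differ in practice).

# Masked-token prediction with a (hypothetical) HF-compatible Latin BERT checkpoint.
from transformers import BertForMaskedLM, BertTokenizerFast, pipeline

MODEL_PATH = "latin-bert/"  # hypothetical local path to a converted checkpoint

tokenizer = BertTokenizerFast.from_pretrained(MODEL_PATH)
model = BertForMaskedLM.from_pretrained(MODEL_PATH)
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Restore the final word of Caesar's opening line, "... divisa in partes tres."
for cand in fill("Gallia est omnis divisa in partes [MASK].", top_k=5):
    print(cand["token_str"], round(cand["score"], 3))

The word sense disambiguation and semantic search results both rest on the same primitive: a contextual vector for one word occurrence, compared by cosine similarity. A sketch under the same assumptions, using the ambiguous noun ius ("law" vs. "broth"); the example sentences are illustrative, not drawn from the paper's dataset:

# Contextual nearest neighbors: which sense of "ius" is a query occurrence closest to?
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

MODEL_PATH = "latin-bert/"  # same hypothetical path as above
tokenizer = BertTokenizerFast.from_pretrained(MODEL_PATH)
encoder = BertModel.from_pretrained(MODEL_PATH)

def occurrence_vector(sentence: str, target: str) -> torch.Tensor:
    """Last-layer vector for `target`, assuming it survives as a single subword piece."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    i = tokens.index(target)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[0, i]

examples = {
    "law":   "ius est ars boni et aequi",     # Celsus' definition of law
    "broth": "coquus ius in patinam fundit",  # invented culinary sentence
}
query = occurrence_vector("iudex ius dicit", "ius")  # "the judge pronounces the law"
for sense, sentence in examples.items():
    sim = F.cosine_similarity(query, occurrence_vector(sentence, "ius"), dim=0)
    print(sense, round(sim.item(), 3))

Applied over a whole corpus with the vectors indexed in advance, the same occurrence_vector primitive yields the semantically-informed search the abstract describes: embed every occurrence once, then rank occurrences by cosine similarity to the query.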

Updated: 2020-09-22