Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model
arXiv - CS - Digital Libraries. Pub Date: 2021-04-19. DOI: arXiv:2104.09617
Per E Kummervold, Javier de la Rosa, Freddy Wetjen, Svein Arne Brygfjeld

In this work, we show the process of building a large-scale training set from digital and digitized collections at a national library. The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks for both Norwegian Bokmål and Norwegian Nynorsk. Our model also improves the mBERT performance for other languages present in the corpus such as English, Swedish, and Danish. For languages not included in the corpus, the weights degrade moderately while keeping strong multilingual properties. Therefore, we show that building high-quality models within a memory institution using somewhat noisy optical character recognition (OCR) content is feasible, and we hope to pave the way for other memory institutions to follow.

Updated: 2021-04-21