Universal Lemmatizer: A sequence-to-sequence model for lemmatizing Universal Dependencies treebanks
Natural Language Engineering (IF 2.3), Pub Date: 2020-05-27, DOI: 10.1017/s1351324920000224
Jenna Kanerva, Filip Ginter, Tapio Salakoski

In this paper, we present a novel lemmatization method based on a sequence-to-sequence neural network architecture and a morphosyntactic context representation. In the proposed method, our context-sensitive lemmatizer generates the lemma one character at a time, conditioned on the surface form characters and the word's morphosyntactic features obtained from a morphological tagger. We argue that a sliding-window context representation suffers from sparseness, whereas in the majority of cases the morphosyntactic features of a word carry enough information to resolve lemma ambiguities, while keeping the context representation dense and more practical for machine learning systems. Additionally, we study two different data augmentation methods, utilizing autoencoder training and morphological transducers, which are especially beneficial for low-resource languages. We evaluate our lemmatizer on 52 different languages and 76 different treebanks, showing that our system outperforms all recent baseline systems. Compared to the best overall baseline, UDPipe Future, our system performs better on 62 out of 76 treebanks, reducing errors by 19% relative on average. The lemmatizer, together with all trained models, is made available as part of the Turku-neural-parsing-pipeline under the Apache 2.0 license.
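The design described above is easy to picture concretely: the encoder consumes the surface form as a character sequence, extended with the tagger's morphosyntactic feature tokens, and the decoder emits the lemma one character at a time. The sketch below illustrates that input/output encoding; the function names and the UPOS=/Feature=Value token format are illustrative assumptions loosely modeled on CoNLL-U fields, not the authors' actual implementation.

```python
# A minimal sketch (not the authors' code) of the encoding the abstract
# describes: lemma characters are decoded from the surface form characters
# plus morphosyntactic features supplied by a morphological tagger.
# build_source_sequence and autoencoder_pair are hypothetical names.

def build_source_sequence(form: str, upos: str, feats: str) -> list[str]:
    """Serialize one token into a character+feature sequence for the encoder."""
    chars = list(form)                        # surface form, one character per step
    tags = [f"UPOS={upos}"]                   # part of speech from the tagger
    tags += [f for f in feats.split("|") if f and f != "_"]  # e.g. Number=Plur
    return chars + tags

def autoencoder_pair(form: str) -> tuple[list[str], list[str]]:
    """Augmentation sketch: train the network to copy raw word forms,
    one of the two data augmentation ideas mentioned in the abstract."""
    return list(form), list(form)

if __name__ == "__main__":
    src = build_source_sequence("dogs", "NOUN", "Number=Plur")
    print(src)  # ['d', 'o', 'g', 's', 'UPOS=NOUN', 'Number=Plur']
    # A trained decoder would then emit the lemma characters: ['d', 'o', 'g']
    print(autoencoder_pair("walking"))
```

Because the feature tokens stand in for a sliding window of neighboring words, the source sequence stays short and dense regardless of sentence length, which is the sparseness argument made in the abstract.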
