Comparing neural- and N-gram-based language models for word segmentation
Journal of the Association for Information Science and Technology (IF 2.8), Pub Date: 2018-12-02, DOI: 10.1002/asi.24082
Yerai Doval, Carlos Gómez-Rodríguez

Word segmentation is the task of inserting or deleting word boundary characters in order to separate character sequences that correspond to words in some language. In this article we propose an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n-gram model or as a recurrent neural network. The resulting system analyzes the input text, which contains no word boundaries, one token (a character or a byte) at a time, and uses the information gathered by the language model to determine whether a boundary must be placed at the current position. Our aim is to use this system as a preprocessing step for a microtext normalization system, which means that it needs to cope effectively with the data sparsity present in this kind of text. We also strove to surpass the performance of two readily available word segmentation systems: the well-known and accessible Word Breaker by Microsoft, and the Python module WordSegment by Grant Jenks. The results show that we have met our objectives, and we hope to continue improving both the precision and the efficiency of our system in the future.
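The abstract describes the method only at a high level. The following is a minimal sketch of the general technique it names (beam search guided by a character-level language model), not the authors' implementation: a toy add-one-smoothed character n-gram model stands in for the trained n-gram/RNN components of the paper, and all names (CharNGramLM, segment), parameters (beam_width), and the training snippet are illustrative assumptions.

```python
import math
from collections import defaultdict

class CharNGramLM:
    """Toy character-level n-gram language model with add-one smoothing.
    A hypothetical stand-in for the paper's trained n-gram/RNN models."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(int)          # (context, char) -> count
        self.context_counts = defaultdict(int)  # context -> count
        self.vocab = set()

    def train(self, text):
        padded = " " + text + " "
        for i in range(len(padded)):
            self.vocab.add(padded[i])
            context = padded[max(0, i - self.n + 1):i]  # previous n-1 chars
            self.counts[(context, padded[i])] += 1
            self.context_counts[context] += 1

    def logprob(self, context, char):
        # Truncate to the last n-1 characters, then apply add-one smoothing.
        context = context[-(self.n - 1):] if self.n > 1 else ""
        num = self.counts[(context, char)] + 1
        den = self.context_counts[context] + len(self.vocab)
        return math.log(num / den)


def segment(text, lm, beam_width=8):
    """Beam search over an unsegmented string: at each input character,
    either continue the current word or insert a space before it.
    Each hypothesis is a pair (output string, cumulative log-probability)."""
    beams = [("", 0.0)]
    for ch in text:
        candidates = []
        for out, score in beams:
            # Option 1: continue the current word with this character.
            candidates.append((out + ch, score + lm.logprob(out, ch)))
            # Option 2: place a word boundary, then emit the character.
            if out and not out.endswith(" "):
                s = score + lm.logprob(out, " ") + lm.logprob(out + " ", ch)
                candidates.append((out + " " + ch, s))
        # Keep only the beam_width highest-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]


if __name__ == "__main__":
    lm = CharNGramLM(n=3)
    lm.train("the cat sat on the mat the cat ate")
    print(segment("thecatsat", lm))
```

With this toy training text, the boundary hypotheses score higher wherever the n-gram statistics favor a space (e.g. after "the"), so the search recovers the spaced form of the input; in the paper's setting the same search is driven by a language model trained on large corpora, at either the byte or the character level.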
