Normalizing Text using Language Modelling based on Phonetics and String Similarity,arXiv - CS - Computation and Language


arXiv - CS - Computation and Language. Pub Date: 2020-06-25, DOI: arxiv-2006.14116
Fenil Doshi, Jimit Gandhi, Deep Gosalia and Sudhir Bagul

Social media networks and chat platforms often use an informal version of natural language. Adversarial spelling attacks likewise alter input text by modifying its characters. Normalizing such text is an essential step for applications such as language translation and text-to-speech synthesis, where the models are trained on clean, regular English. We propose a new robust model for text normalization. Our system uses the BERT language model to predict the masked words that correspond to the unnormalized words. We propose two masking strategies that try to replace the unnormalized words in the text with their root form, using a score based on phonetic and string similarity metrics. We use human-centric evaluations in which volunteers were asked to rank the normalized text. Our strategies yield accuracies of 86.7% and 83.2%, which indicates the effectiveness of our system at text normalization.
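The abstract does not give the scoring formula or the details of the two masking strategies, so the following is only a minimal sketch of the general idea of ranking candidate replacements by a combined phonetic and string similarity score. Everything in it is an assumption made for illustration: the candidate list is hard-coded where a masked language model such as BERT would normally supply predictions, Soundex stands in for the phonetic metric, `difflib.SequenceMatcher` stands in for the string metric, and the names `soundex`, `similarity`, `pick_replacement`, and the equal `weight` are hypothetical.

```python
from difflib import SequenceMatcher


def soundex(word: str) -> str:
    """Return a 4-character Soundex code (assumed stand-in for a phonetic metric)."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            encoded += digit
        if ch not in "hw":  # 'h' and 'w' do not separate repeated codes
            prev = digit
    return (encoded + "000")[:4]


def similarity(noisy: str, candidate: str, weight: float = 0.5) -> float:
    """Blend phonetic and string similarity; the 50/50 weighting is an assumption."""
    string_sim = SequenceMatcher(None, noisy.lower(), candidate.lower()).ratio()
    phonetic_sim = SequenceMatcher(None, soundex(noisy), soundex(candidate)).ratio()
    return weight * phonetic_sim + (1 - weight) * string_sim


def pick_replacement(noisy: str, candidates: list[str]) -> str:
    """Choose the candidate most similar to the unnormalized word."""
    return max(candidates, key=lambda c: similarity(noisy, c))


# Hypothetical candidates for the masked slot in "see you [MASK]";
# in the described system these would come from the language model.
candidates = ["tomorrow", "today", "tonight", "later"]
best = pick_replacement("tmrw", candidates)
```

Here `best` is the candidate that both sounds and looks most like the noisy token `tmrw`. Restricting the choice to words the language model already considers plausible in context is what keeps a purely surface-level similarity score from picking an unrelated but similar-looking word.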


Updated: 2020-06-26
