Semi-supervised Thai Sentence Segmentation Using Local and Distant Word Representations
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2). Pub Date: 2020-07-07. DOI: 10.1145/3389037
Chanatip Saetia, Tawunrat Chalothorn, Ekapol Chuangsuwanich, Peerapon Vateekul

A sentence is typically treated as the minimal syntactic unit used for extracting valuable information from a longer piece of text. However, written Thai has no explicit sentence markers. We propose a deep learning model for the task of sentence segmentation that includes three main contributions. First, we integrate n-gram embedding as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, because labeled data is scarce and annotation is difficult and time-consuming, we also investigate and adapt Cross-View Training (CVT) as a semi-supervised learning technique, allowing us to utilize unlabeled data to improve the model representations. In the Thai sentence segmentation experiments, our model reduced the relative error by 7.4% and 10.5% compared with the baseline models on the Orchid and UGWC datasets, respectively. We also applied our model to the task of punctuation recovery on the IWSLT English dataset, where it outperformed the prior sequence tagging models, achieving a relative error reduction of 2.5%. Ablation studies revealed that utilizing n-gram representations was the main contributing factor for Thai, while the semi-supervised training helped the most for English.
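The abstract pairs a local representation (n-grams around each position) with a distant one (self-attention over the whole sequence) and feeds both to a boundary tagger. The sketch below is a minimal, illustrative numpy version of that pairing, not the paper's implementation: the toy word vectors, the hash-bucketed n-gram ids, and all function names here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_ngrams(tokens, n=2):
    # Local representation: for each position, the n-gram of tokens
    # ending there (padded at the start), capturing word groups that
    # tend to appear near sentence boundaries.
    padded = ["<pad>"] * (n - 1) + tokens
    return [tuple(padded[i:i + n]) for i in range(len(tokens))]

def self_attention(X):
    # Distant representation: scaled dot-product self-attention lets
    # each position mix in context from arbitrarily far-away words.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

tokens = ["I", "went", "home", "it", "was", "late"]
ngrams = local_ngrams(tokens, n=2)

# Toy random word vectors; a real model would use trained embeddings.
X = rng.normal(size=(len(tokens), 8))
distant = self_attention(X)

# Per-position feature = a hash-bucketed n-gram id plus the distant
# representation; a sequence tagger would score each position as
# boundary / non-boundary from features like these.
local_ids = np.array([hash(g) % 1000 for g in ngrams], dtype=float)
features = np.concatenate([local_ids[:, None], distant], axis=1)
print(features.shape)  # (6, 9)
```

A trained system would replace the hash buckets with learned n-gram embeddings and stack several attention layers, but the combination step (concatenating local and distant features per position) is the same shape of computation.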

Updated: 2020-07-07