Learning Syllables Using Conv-LSTM Model for Swahili Word Representation and Part-of-speech Tagging
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2). Pub Date: 2021-05-26. DOI: 10.1145/3445975
Casper Shikali Shivachi, Refuoe Mokhosi, Zhou Shijie, Liu Qihe

The need to capture intra-word information in natural language processing (NLP) tasks has inspired research into learning word representations at the word, character, or morpheme level, but little attention has been given to syllables from a syllabic alphabet. Motivated by the success of compositional models for morphologically rich languages, we present a convolutional long short-term memory (Conv-LSTM) model that constructs Swahili word representation vectors from syllables. The unified architecture addresses the agglutinative and polysemous nature of Swahili by extracting high-level syllable features with a convolutional neural network (CNN) and then composing quality word embeddings with a long short-term memory (LSTM) network. The word embeddings are validated on a syllable-aware language-modeling task (perplexity of 31.267) and a part-of-speech (POS) tagging task (accuracy of 98.78%), both yielding results very competitive with state-of-the-art models in their respective domains. We further validate the language model on Xhosa and Shona, which are also syllabic-based languages. The novelty of the study lies in constructing quality word embeddings from syllables with a hybrid model that avoids the max-over-time pooling common in CNNs, and in exploiting these embeddings for POS tagging. The study therefore contributes to the processing of agglutinative and syllabic-based languages: quality word embeddings built from syllable embeddings, and a robust Conv-LSTM model that learns syllables not only for language modeling and POS tagging but also for other downstream NLP tasks.
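The composition pipeline described above can be sketched in miniature. The following NumPy snippet is a hypothetical illustration, not the paper's implementation: it runs a narrow 1-D convolution over a word's syllable embeddings, keeps every convolution output (no max-over-time pooling, mirroring the paper's design choice), and feeds the resulting feature sequence to a plain LSTM whose final hidden state serves as the word embedding. All sizes and weights are toy values.

```python
import numpy as np

# Hypothetical sketch of CNN-then-LSTM composition of a word vector from
# syllable embeddings. Dimensions and weights are illustrative only.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d(x, kernels, width=2):
    # x: (num_syllables, emb_dim); kernels: (num_filters, width * emb_dim).
    # Slide a window of `width` syllables and apply every filter; the full
    # feature sequence is kept (no max-over-time pooling).
    n, _ = x.shape
    windows = np.stack([x[i:i + width].ravel() for i in range(n - width + 1)])
    return np.tanh(windows @ kernels.T)            # (n - width + 1, num_filters)

def lstm_last_state(seq, Wx, Wh, b):
    # Plain LSTM over the feature sequence; the final hidden state is taken
    # as the word embedding. Gate weights are stacked as [i, f, o, g].
    h_dim = Wh.shape[0]
    h = np.zeros(h_dim)
    c = np.zeros(h_dim)
    for x_t in seq:
        z = x_t @ Wx + h @ Wh + b
        i, f, o = (sigmoid(z[k * h_dim:(k + 1) * h_dim]) for k in range(3))
        g = np.tanh(z[3 * h_dim:])
        c = f * c + i * g                          # update cell state
        h = o * np.tanh(c)                         # update hidden state
    return h

# Toy sizes (not the paper's hyperparameters).
emb_dim, n_filters, h_dim, width = 8, 16, 12, 2
syllables = rng.normal(size=(4, emb_dim))          # e.g. a 4-syllable word
kernels = rng.normal(size=(n_filters, width * emb_dim)) * 0.1
Wx = rng.normal(size=(n_filters, 4 * h_dim)) * 0.1
Wh = rng.normal(size=(h_dim, 4 * h_dim)) * 0.1
b = np.zeros(4 * h_dim)

features = conv1d(syllables, kernels, width)       # (3, n_filters)
word_vec = lstm_last_state(features, Wx, Wh, b)    # (h_dim,) word embedding
print(word_vec.shape)
```

Keeping the whole convolution output rather than pooling lets the LSTM see the order of syllable-level features, which matters for an agglutinative language where affix position carries grammatical information.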
