Named Entity Recognition and Classification for Punjabi Shahmukhi
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2020-05-04, DOI: 10.1145/3383306
Muhammad Tayyab Ahmad, Muhammad Kamran Malik, Khurram Shahzad, Faisal Aslam, Asif Iqbal, Zubair Nawaz, Faisal Bukhari

Named entity recognition (NER) refers to identifying proper nouns in natural language text and classifying them into named entity types, such as person, location, and organization. Due to the widespread applications of NER, numerous NER techniques and benchmark datasets have been developed for both Western and Asian languages. Even though the Shahmukhi script of the Punjabi language is used by nearly three-fourths of Punjabi speakers worldwide, Gurmukhi has been the main focus of research activities. In particular, no benchmark NER corpus exists for Shahmukhi, which has prevented NER research for the Shahmukhi script from getting started. To this end, this article presents the development and specifications of the first-ever NER corpus for Shahmukhi. The newly developed corpus is composed of 318,275 tokens and 16,300 named entities, including 11,147 persons, 3,140 locations, and 2,013 organizations. To establish the strength of our corpus, we compare its specifications with those of its Gurmukhi counterparts. Furthermore, we demonstrate the usability of our corpus using five supervised learning techniques, including two state-of-the-art deep learning techniques. The results are compared, and valuable insights about the behavior of the most effective technique are discussed.

Updated: 2020-05-04