Factorization of Fact-Checks for Low Resource Indian Languages
arXiv - CS - Computation and Language Pub Date : 2021-02-23 , DOI: arxiv-2102.11276
Shivangi Singhal, Rajiv Ratn Shah, Ponnurangam Kumaraguru

Advances in technology and the accessibility of the internet to each individual are revolutionizing real-time information. The freedom to express one's thoughts without passing through any credibility check is leading to the dissemination of fake content in the ecosystem, which can have disastrous effects on both individuals and society as a whole. The amplification of fake news is becoming rampant in India too. Debunked information often gets republished with a replacement description, claiming that it depicts a different incident. To curb such fabricated stories, it is necessary to investigate such duplicates and false claims made in public. The majority of studies on automatic fact-checking and fake news detection are restricted to English only. But for a country like India, where only 10% of the literate population speaks English, the role of regional languages in spreading falsehoods cannot be underestimated. In this paper, we introduce FactDRIL: the first large-scale multilingual Fact-checking Dataset for Regional Indian Languages. We collect an exhaustive dataset across 7 months covering 11 low-resource languages. Our proposed dataset consists of 9,058 samples in English, 5,155 samples in Hindi, and the remaining 8,222 samples distributed across various regional languages, i.e. Bangla, Marathi, Malayalam, Telugu, Tamil, Oriya, Assamese, Punjabi, Urdu, Sinhala and Burmese. We also present a detailed characterization of the three M's (multi-lingual, multi-media, multi-domain) in FactDRIL, accompanied by the complete list of other varied attributes that make it a unique dataset to study. Lastly, we present some potential use cases of the dataset. We expect this dataset will be a valuable resource and serve as a starting point to fight the proliferation of fake news in low-resource languages.

Updated: 2021-02-24