Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
Cognitive Computation (IF 4.3), Pub Date: 2021-01-18, DOI: 10.1007/s12559-020-09800-x
Juan Pablo Tessore, Leonardo Martín Esnaola, Laura Lanzarini, Sandra Baldassarri

Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time-consuming, and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity; thus, redundant classification is required to validate the assigned tag. Even though the number of emotion-tagged text datasets in Spanish has been growing in recent years, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss' kappa inter-rater agreement measure. Error analysis is performed using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundred thousand tagged comments and does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, the dataset is suitable for training classification algorithms in the sentiment analysis field.
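
The validation step relies on Fleiss' kappa, which measures how far a fixed number of raters agree on categorical labels beyond what chance would produce. The following is a minimal sketch of the standard formula, not the authors' code. The function name `fleiss_kappa` and the input layout (one row per comment, one column per emotion category, each cell holding the number of raters who chose that category) are assumptions made for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (hypothetical layout)."""
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: for each item, the share of rater pairs that agree,
    # averaged over all items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")

    return (p_observed - p_chance) / (1.0 - p_chance)
```

A value of 1 means perfect agreement, and values at or below 0 mean agreement no better than chance. On the commonly cited Landis and Koch scale, values between 0.41 and 0.60 are labeled moderate; whether the article uses exactly this scale is an assumption here.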




Updated: 2021-01-18