Gender identification on Twitter
Journal of the Association for Information Science and Technology (IF 3.5), Pub Date: 2021-06-14, DOI: 10.1002/asi.24541
Catherine Ikae, Jacques Savoy

To determine the gender of a text's author, various feature types have been suggested (e.g., function words, letter n-grams), leading to a huge number of stylistic markers. To predict the target category, different machine learning models have been suggested (e.g., logistic regression, decision tree, k-nearest neighbors, support vector machine, naïve Bayes, neural networks, and random forest). In this study, our first objective is to determine whether the same model always achieves the best effectiveness when applied to similar corpora under the same conditions. To this end, this study analyzes the effectiveness of 10 different classifiers on 7 CLEF-PAN collections. Our second aim is to propose a 2-stage feature selection that reduces the feature set to a few hundred terms without any significant loss in performance compared to approaches using all the attributes (an increase of around 5% is observed after applying the proposed feature selection). Based on our experiments, neural networks or random forests tend, on average, to produce the highest effectiveness. Moreover, empirical evidence indicates that the feature set can be reduced to around 300 terms without penalizing effectiveness. Finally, based on such reduced feature sets, an analysis reveals some of the specific terms that clearly discriminate between the 2 genders.
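
The abstract does not say what the two feature-selection stages are, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a first stage that drops terms occurring in too few documents and a second stage that ranks the survivors by a chi-square score against the two classes and keeps the top k. The function names, the `min_df` threshold, the tokenizer, and the choice of chi-square are all hypothetical; only the target of roughly 300 terms comes from the abstract.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class contingency table.

    n11: class-1 docs containing the term, n10: class-2 docs containing it,
    n01: class-1 docs without the term,    n00: class-2 docs without it.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # term occurs in every document or in none
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def two_stage_selection(texts, labels, min_df=2, top_k=300):
    """Return up to top_k (term, score) pairs, best first.

    Stage 1 (assumption): keep terms found in at least min_df documents.
    Stage 2 (assumption): rank the remaining terms by chi-square.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are expected")
    first = classes[0]

    # Represent each document by its set of distinct terms.
    docs = [set(tokenize(t)) for t in texts]

    # Stage 1: document-frequency filter.
    df = Counter()
    for terms in docs:
        df.update(terms)
    candidates = [t for t, c in df.items() if c >= min_df]

    # Stage 2: score each surviving term against the class labels.
    n_first = sum(1 for y in labels if y == first)
    n_second = len(labels) - n_first
    scored = []
    for term in candidates:
        n11 = sum(1 for d, y in zip(docs, labels) if y == first and term in d)
        n10 = df[term] - n11
        score = chi_square(n11, n10, n_first - n11, n_second - n10)
        scored.append((term, score))

    # Sort by descending score, breaking ties alphabetically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]
```

Called as `two_stage_selection(texts, labels, min_df=2, top_k=300)` on a list of tweets and their two-valued labels, the function returns a short ranked vocabulary that could then be fed to any of the classifiers the abstract lists. A term that appears equally often in both classes receives a score of 0 and sinks to the bottom of the ranking.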

Updated: 2021-06-14