Richer Document Embeddings for Author Profiling tasks based on a heuristic search
Information Processing & Management (IF 7.4), Pub Date: 2020-02-29, DOI: 10.1016/j.ipm.2020.102227
Roberto López-Santillán, Manuel Montes-Y-Gómez, Luis Carlos González-Gurrola, Graciela Ramírez-Alonso, Olanda Prieto-Ordaz

In this study we propose a novel method for generating Document Embeddings (DEs) by evolving mathematical equations that integrate classical term-frequency statistics. To accomplish this, we employ a Genetic Programming (GP) strategy to build competitive formulae that weight custom Word Embeddings (WEs) produced by cutting-edge feature extraction techniques (e.g., word2vec, fastText, BERT), and we then create DEs as the weighted average of those WEs. We exhaustively evaluate the proposed method on 9 datasets drawn from several multilingual social media sources, with the aim of predicting personal attributes of authors (e.g., gender, age, personality traits) across 17 tasks. On each dataset we compare the results obtained by our method against state-of-the-art competitors; our approach ranks in the top quartile in all cases. Furthermore, we introduce a new numerical statistic called Relevance Topic Value (rtv), which could be used to improve the prediction of author characteristics by numerically describing the topic of a document and each user's personal use of words. Interestingly, a frequency analysis of the terminals used by GP shows that rtv is the feature most likely to appear alone in a single equation, which suggests its usefulness as a WE weighting scheme.
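The weighted-averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the GP-evolved formulae and the definition of rtv are not given here, so a plain tf-idf weight with a smoothed idf stands in as an assumed placeholder. The function names `tfidf_weights` and `document_embedding`, the toy corpus, and the two-dimensional vectors are all hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus):
    """Assumed stand-in weighting: relative term frequency times a smoothed idf.

    In the paper this role is played by a GP-evolved formula, which is
    not reproduced here.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def document_embedding(word_vectors, weights):
    """Weighted average of word vectors; terms without a vector are skipped."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term, w in weights.items():
        vec = word_vectors.get(term)
        if vec is None:
            continue
        for i, x in enumerate(vec):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        # No weighted term had a vector: fall back to the zero vector.
        return total
    return [x / weight_sum for x in total]


# Toy data (hypothetical): three tokenized posts and 2-d word vectors.
corpus = [
    ["i", "love", "football"],
    ["i", "love", "cooking"],
    ["football", "football", "scores"],
]
word_vectors = {
    "i": [1.0, 0.0],
    "love": [0.0, 1.0],
    "football": [1.0, 1.0],
    "cooking": [0.5, 0.5],
}

weights = tfidf_weights(corpus[0], corpus)
embedding = document_embedding(word_vectors, weights)
```

Any other per-term weighting can be swapped in by passing a different `weights` dictionary; with all weights equal, the result reduces to the plain average of the word vectors.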




Updated: 2020-04-21