Generalized word shift graphs: a method for visualizing and explaining pairwise comparisons between texts
EPJ Data Science (IF 3.0), Pub Date: 2021-01-19, DOI: 10.1140/epjds/s13688-021-00260-3
Ryan J. Gallagher, Morgan R. Frank, Lewis Mitchell, Aaron J. Schwartz, Andrew J. Reagan, Christopher M. Danforth, Peter Sheridan Dodds

A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts’ rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or measurement validity. To better capture fine-grained differences between texts, we introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts for any measure that can be formulated as a weighted average. We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback–Leibler and Jensen–Shannon divergences. Through a diverse set of case studies ranging from presidential speeches to tweets posted in urban green spaces, we demonstrate how generalized word shift graphs can be flexibly applied across domains for diagnostic investigation, hypothesis generation, and substantive interpretation. By providing a detailed lens into textual shifts between corpora, generalized word shift graphs help computational social scientists, digital humanists, and other text analysis practitioners fashion more robust scientific narratives.
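
To make the weighted-average idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the simplest case described in the abstract: a single dictionary of word scores shared by both texts, with each text's measure being the frequency-weighted average of those scores. The function names, the toy scores, and the example texts are all hypothetical. Under these assumptions, the difference between the two averages decomposes exactly into one contribution per word, which is the kind of quantity a word shift graph would rank and plot.

```python
from collections import Counter


def relative_freqs(tokens):
    """Map each word to its relative frequency in the token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def weighted_average(tokens, scores):
    """Frequency-weighted average score over the words that have a score."""
    p = relative_freqs([w for w in tokens if w in scores])
    return sum(p[w] * scores[w] for w in p)


def word_contributions(tokens_1, tokens_2, scores):
    """Per-word contribution to (average of text 2) - (average of text 1).

    Assumes the same score applies to a word in both texts, so each
    contribution is (p2 - p1) * score.
    """
    p1 = relative_freqs([w for w in tokens_1 if w in scores])
    p2 = relative_freqs([w for w in tokens_2 if w in scores])
    vocab = set(p1) | set(p2)
    return {w: (p2.get(w, 0.0) - p1.get(w, 0.0)) * scores[w] for w in vocab}


# Hypothetical toy data, for illustration only.
toy_scores = {"happy": 8.0, "sad": 2.0, "rain": 4.0}
text_1 = "sad rain sad happy".split()
text_2 = "happy happy rain sad".split()

contributions = word_contributions(text_1, text_2, toy_scores)
# Rank words by the magnitude of their contribution, largest first.
for word, delta in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{word:>6}  {delta:+.2f}")
```

Because the contributions sum to the overall difference between the two averages, a single summary number can be traced back to the words that drive it. Measures whose per-word scores differ between the two texts, such as the entropy-based divergences mentioned above, would need a more general decomposition than this sketch provides.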




Updated: 2021-01-20