

Pull out all the stops: Textual analysis via punctuation sequences
European Journal of Applied Mathematics (IF 2.3), Pub Date: 2020-09-21, DOI: 10.1017/s0956792520000157
Alexandra N. M. Darmon, Marya Bazzi, Sam D. Howison, Mason A. Porter

Whether enjoying the lucid prose of a favourite author or slogging through some other writer’s cumbersome, heavy-set prattle (full of parentheses, em dashes, compound adjectives, and Oxford commas), readers will notice stylistic signatures not only in word choice and grammar but also in punctuation itself. Indeed, visual sequences of punctuation from different authors produce marvellously different (and visually striking) sequences. Punctuation is a largely overlooked stylistic feature in stylometry, the quantitative analysis of written text. In this paper, we examine punctuation sequences in a corpus of literary documents and ask the following questions: Are the properties of such sequences a distinctive feature of different authors? Is it possible to distinguish literary genres based on their punctuation sequences? Do the punctuation styles of authors evolve over time? Are we on to something interesting in trying to do stylometry without words, or are we full of sound and fury (signifying nothing)?

In our investigation, we examine a large corpus of documents from Project Gutenberg (a digital library with many possible editorial influences). We extract punctuation sequences from each document in our corpus and record the number of words that separate punctuation marks. Using such information about punctuation-usage patterns, we attempt both author and genre recognition, and we also examine the evolution of punctuation usage over time. Our efforts at author recognition are particularly successful. Among the features that we consider, the one that seems to carry the most explanatory power is an empirical approximation of the joint probability of the successive occurrence of two punctuation marks. In our conclusions, we suggest several directions for future work, including the application of similar analyses for investigating translations and other types of categorical time series.
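
To make the features named in the abstract concrete, here is a minimal Python sketch of how one might extract a punctuation sequence, count the words between marks, and estimate the joint frequency of successive mark pairs. It is an illustration under stated assumptions, not the authors' implementation: the punctuation set `PUNCTUATION`, the whitespace-based notion of a word, and all three function names are hypothetical choices made for this example.

```python
from collections import Counter

# Assumed punctuation alphabet; the abstract does not list the marks that are tracked.
PUNCTUATION = set(".,;:!?()")


def punctuation_sequence(text):
    """Return the punctuation marks of `text` in order of appearance."""
    return [ch for ch in text if ch in PUNCTUATION]


def words_between_marks(text):
    """For each mark, count the words since the previous mark (or the start).

    A word is assumed to be any run of characters that are neither
    whitespace nor in PUNCTUATION. Words after the final mark are ignored.
    """
    counts = []
    words = 0
    in_word = False
    for ch in text:
        if ch in PUNCTUATION:
            if in_word:
                words += 1
                in_word = False
            counts.append(words)
            words = 0
        elif ch.isspace():
            if in_word:
                words += 1
                in_word = False
        else:
            in_word = True
    return counts


def successive_pair_frequencies(marks):
    """Empirical relative frequency of each ordered pair of consecutive marks."""
    pairs = Counter(zip(marks, marks[1:]))
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in pairs.items()}


sample = "To be, or not to be: that is the question."
marks = punctuation_sequence(sample)
gaps = words_between_marks(sample)
freqs = successive_pair_frequencies(marks)
```

For the sample sentence, `marks` holds the three marks in order, `gaps` holds the word counts preceding each of them, and `freqs` assigns each of the two observed pairs half of the total mass. Comparing such pair-frequency tables across documents is one plausible way to use the feature for author recognition; the abstract does not say which distance or classifier is applied.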


Updated: 2020-09-21
