Introduction to a special section on ‘Computational Methods for Literary–Historical Textual Scholarship’
Digital Scholarship in the Humanities (IF 1.299). Pub Date: 2019-12-01. DOI: 10.1093/llc/fqz071
Gabriel Egan

All sorts of surprising discoveries about literary and historical texts have been made in the past 30 years or so by investigators employing new computational methods unavailable to previous generations. One landmark publication was John Burrows's book Computation into Criticism (1987), which showed that literary scholars had simply been ignoring most of the available evidence, as expressed in its celebrated opening sentence: 'It is a truth not generally acknowledged that, in most discussions of works of English fiction, we proceed as if a third, two-fifths, a half of our material were not really there'. Burrows showed that the function words—the 100 or so words comprising articles, conjunctions, prepositions, and the other linguistic 'glue' holding our sentences together—are just as amenable to literary criticism as the more visible, rarer lexical words. Burrows could undertake his innovative research because digital transcriptions of literary works made it possible to count the function words, and he developed a series of algorithms for processing the resulting counts that are now widely used in the field. Since 1987, many more texts have been digitized and many more algorithms have been invented to process them in various ways. A conference at De Montfort University, Leicester, in July 2018, generously funded by the UK's Arts and Humanities Research Council and by the host university, was an opportunity to take stock of where these three decades of work had brought those interested in analysing texts using computers. This special section of Digital Scholarship in the Humanities presents a selection of the best articles from the conference; other fine articles had already been committed to other outlets.
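The core move Burrows made—counting the relative frequencies of function words in digital transcriptions—can be sketched in a few lines. The word list below is a tiny illustrative sample, not Burrows's actual list, which was drawn from the most frequent words of his corpus:

```python
from collections import Counter

# Illustrative sample only; Burrows's analyses used roughly the 100
# most frequent words of the corpus under study.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "is", "was"]

def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = text.lower().split()
    total = len(tokens)
    counts = Counter(tokens)
    return {w: counts[w] / total for w in FUNCTION_WORDS}

profile = function_word_profile(
    "It is a truth not generally acknowledged that "
    "in most discussions of works of English fiction "
    "we proceed as if half of our material were not there"
)
```

Profiles like this, computed per text or per author, are the raw input to the distance measures (such as Burrows's later Delta) that the field now uses routinely.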
The expansion in digital texts available to investigators over the past 30 years has come from two means: the keyboarding of existing nondigital texts, and the transformation of images of printed pages into digital texts by optical character recognition (OCR) of the letter shapes in those images. The former approach, involving human labour, is several orders of magnitude more expensive than the latter but produces more accurate representations of the original writing. Because it is relatively inexpensive, OCR has been the means by which most of the expansion of our digital text collections has taken place. How much does its inaccuracy matter? In 'Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study', Mark J. Hill and Simon Hengchen get a handle on just how good or bad OCR is and how much its errors affect the applications to which we put the texts. They compared the part of the dataset Eighteenth Century Collections Online (ECCO), sold by Gale Cengage, that was manually keyboarded for the Text Creation Partnership (TCP) project with the part that was merely OCR'd, to judge how bad the OCR really is. To determine what difference OCR makes, they ran standard tests in topic modelling, collocation analysis, vector space modelling, and authorial attribution, using the keyboarded and OCR'd versions of the same books.
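A toy illustration of the kind of comparison involved: treating a keyboarded text and its OCR'd counterpart as word-frequency vectors and measuring their cosine similarity shows how recognition errors (here simulated by hand, not drawn from ECCO) pull the two representations apart. This is a minimal sketch of the general idea, not the tests Hill and Hengchen actually ran:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-frequency vectors of two texts."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    vocab = set(ca) | set(cb)
    dot = sum(ca[w] * cb[w] for w in vocab)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

clean = "the quick brown fox jumps over the lazy dog"
dirty = "tlie quick brovvn fox jumps over the lazy dog"  # simulated OCR errors
sim = cosine_similarity(clean, dirty)
```

Misrecognized tokens such as 'tlie' and 'brovvn' fall out of the shared vocabulary entirely, which is why dirty OCR degrades vocabulary-based methods like topic modelling and vector space modelling in particular.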
