"Q i-jtb the Raven": Taking Dirty OCR Seriously
Book History (IF 0.5), Pub Date: 2017-01-01, DOI: 10.1353/bh.2017.0006
Ryan Cordell

This article argues that scholars must understand mass digitized texts as assemblages of new editions, subsidiary editions, and impressions of their historical sources, and that these various parts require sustained bibliographic analysis and description. To adequately theorize any research conducted in large-scale text archives—including research that includes primary or secondary sources discovered through keyword search—we must avoid the myth of surrogacy proffered by page images and instead consider directly the text files they overlay. Focusing on the OCR (optical character recognition) from which most large-scale historical text data derives, this article argues that the results of this "automatic" process are in fact new editions of their source texts that offer unique insights into both the historical texts they remediate and the more recent era of their remediation. The constitution and provenance of digitized archives are, to some extent at least, knowable and describable. Just as details of type, ink, or paper, or paratext such as printer's records can help us establish the histories under which a printed book was created, details of format, interface, and even grant proposals can help us establish the histories of corpora created under conditions of mass digitization.

Updated: 2017-01-01