Comparing web-crawled and traditional corpora, Language Resources and Evaluation



Comparing web-crawled and traditional corpora
Language Resources and Evaluation (IF 1.7), Pub Date: 2020-03-19, DOI: 10.1007/s10579-020-09487-4
Václav Cvrček, Zuzana Komrsková, David Lukeš, Petra Poukarová, Anna Řehořková, Adrian Jan Zasina, Vladimír Benko

Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a “traditional” corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the “searchable” web. Both types of corpora are projected onto the space induced by the MD model, with the main objective being to find out whether they overlap in the linguistic variation they cover, or whether one introduces some specific variation which cannot be found in the other. We also document a crucial methodological point which has broader relevance for MD analyses in general, namely that texts have to be of similar lengths in order for their scores on the dimensions to be comparable. Results indicate that some traditional text categories, such as journalism or non-fiction, are characterized by language phenomena which are equally well covered by web-crawled data, though of course traditional corpora keep their edge in terms of the richness of the accompanying metadata. But overall, the range of variation in Koditex is broader as it contains texts which have no adequate substitute (i.e. texts with a comparable set of linguistic characteristics, regardless of their extratextual label) in data acquired through general-purpose web-crawling techniques. These include informal conversations, private correspondence, some types of fiction, but also user-generated content (comments on Facebook, forums etc.).
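The two methodological ideas in the abstract, projecting a second corpus onto dimensions derived from a first one and comparing only texts of similar length, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the helper names, the set-of-word-forms feature definition, and the loadings passed to `project` are all hypothetical, and the abstract does not state how the MD model itself was fitted.

```python
from statistics import mean, pstdev


def chunk_tokens(tokens, size):
    """Split a token list into consecutive chunks of exactly `size` tokens.

    The trailing remainder is dropped, so every chunk has the same length
    and scores computed on the chunks are comparable.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]


def feature_rates(tokens, features):
    """Relative frequency of each feature in `tokens`.

    `features` maps a feature name to a set of word forms; this is a
    hypothetical stand-in for real linguistic feature extraction.
    """
    n = len(tokens)
    return [sum(t in forms for t in tokens) / n for forms in features.values()]


def fit_scaler(rows):
    """Column-wise mean and standard deviation from the reference corpus."""
    cols = list(zip(*rows))
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def project(row, means, stds, loadings):
    """Score of one text on one dimension.

    The text is z-scored with the *reference* corpus statistics and then
    combined with that dimension's loadings (assumed to be given).
    """
    return sum(w * (x - m) / s for x, m, s, w in zip(row, means, stds, loadings))
```

In this sketch, texts from both corpora would first be cut with `chunk_tokens`, the scaler would be fitted on feature rates from the reference corpus only, and chunks from the other corpus would then be passed through `project` with the same means, standard deviations, and loadings, so that both corpora end up in the same space.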


Updated: 2020-03-19
