Potential of natural language processing for metadata extraction from environmental scientific publications
Soil ( IF 6.8 ) Pub Date : 2023-03-14 , DOI: 10.5194/soil-9-155-2023
Guillaume Blanchy, Lukas Albrecht, John Koestel, Sarah Garré

Abstract. Summarizing information from large bodies of scientific literature is an essential but work-intensive task. This is especially true in environmental studies, where multiple factors (e.g., soil, climate, vegetation) can contribute to the effects observed. Meta-analyses, studies that quantitatively summarize the findings of a large body of literature, rely on manually curated databases built upon primary publications. However, given the increasing amount of literature, this manual work is likely to require more and more effort in the future. Natural language processing (NLP) facilitates this task, but it is not yet clear to what extent the extraction process is reliable or complete. In this work, we explore three NLP techniques that can help support this task: topic modeling, tailored regular expressions and the shortest dependency path method. We apply these techniques in a practical and reproducible workflow on two corpora of documents: the Open Tension-disk Infiltrometer Meta-database (OTIM) corpus and the Meta corpus. The OTIM corpus contains the source publications of the entries of the OTIM database of near-saturated hydraulic conductivity from tension-disk infiltrometer measurements (https://github.com/climasoma/otim-db, last access: 1 March 2023). The Meta corpus consists of all primary studies from 36 selected meta-analyses on the impact of agricultural practices on sustainable water management in Europe. As a first step of our practical workflow, we identified different topics from the individual source publications of the Meta corpus using topic modeling. This enabled us to distinguish well-researched topics (e.g., conventional tillage, cover crops), where meta-analysis would be useful, from neglected topics (e.g., effect of irrigation on soil properties), showing potential knowledge gaps.
Then, we used tailored regular expressions to extract coordinates, soil texture, soil type, rainfall, disk diameter and tensions from the OTIM corpus to build a quantitative database. Depending on the variable, we retrieved between 56 % and 100 % of all relevant information (recall), with a precision between 83 % and 100 %. Finally, we extracted relationships between a set of drivers corresponding to different soil management practices or amendments (e.g., “biochar”, “zero tillage”) and target variables (e.g., “soil aggregate”, “hydraulic conductivity”, “crop yield”) from the abstracts of the Meta corpus source publications, using the shortest dependency path between them. These relationships were further classified according to positive, negative or absent correlations between the driver and the target variable. This quickly provided an overview of the different driver–variable relationships and their abundance for an entire body of literature. Overall, we found that all three tested NLP techniques were able to support evidence synthesis tasks. While human supervision remains essential, NLP methods have the potential to support automated evidence synthesis that can be continuously updated as new publications become available.
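
To make the regular-expression step and its recall and precision figures more concrete, here is a minimal Python sketch. It is not the authors' code: the pattern, the helper names (`extract_diameter_cm`, `precision_recall`) and the example sentences are all invented for illustration, and the scoring assumes one hand-labeled value per document.

```python
import re

# Hypothetical pattern (not the authors'): a number with a cm/mm unit next to
# the word "diameter", e.g. "diameter of 20 cm" or "200 mm diameter".
DIAMETER = re.compile(
    r"diameter\s+(?:of\s+)?(\d+(?:\.\d+)?)\s*(cm|mm)"
    r"|(\d+(?:\.\d+)?)\s*(cm|mm)[\s-]+(?:in\s+)?diameter",
    re.IGNORECASE,
)


def extract_diameter_cm(text):
    """Return the first diameter found in text, converted to cm, or None."""
    m = DIAMETER.search(text)
    if m is None:
        return None
    # Only one of the two alternatives captures, so pick the non-empty pair.
    value, unit = (m.group(1), m.group(2)) if m.group(1) else (m.group(3), m.group(4))
    cm = float(value) / 10 if unit.lower() == "mm" else float(value)
    return round(cm, 2)


def precision_recall(extracted, expected):
    """Score extracted values against hand-labeled ones, document by document."""
    found = [(e, t) for e, t in zip(extracted, expected) if e is not None]
    correct = sum(1 for e, t in found if e == t)
    relevant = sum(1 for t in expected if t is not None)
    precision = correct / len(found) if found else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# Invented sentences with hand-labeled disk diameters in cm (None: not reported).
docs = [
    ("A tension-disk infiltrometer with a disk diameter of 20 cm was used.", 20.0),
    ("Measurements were made with a 200 mm diameter disc.", 20.0),
    ("The infiltrometer disc (8 cm in diameter) was placed on sand.", 8.0),
    ("Infiltration was measured at three tensions.", None),
    ("The disc had a radius of 10 cm.", 20.0),               # missed by the pattern
    ("Soil cores of 5 cm diameter were also taken.", None),  # false positive
]
extracted = [extract_diameter_cm(text) for text, _ in docs]
precision, recall = precision_recall(extracted, [label for _, label in docs])
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The last two sentences show the two failure modes that recall and precision measure: a value the pattern misses because it is phrased as a radius, and a diameter that is matched but belongs to a soil core rather than to the infiltrometer disk. Both are cases where human checking of the extracted values would still be needed.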

Updated: 2023-03-14