State Tagging for Improved Earth and Environmental Data Quality Assurance
Frontiers in Environmental Science (IF 3.3), Pub Date: 2020-05-06, DOI: 10.3389/fenvs.2020.00046
Chak-Hau Michael Tso, Peter Henrys, Susannah Rennie, John Watkins

Environmental data allow us to monitor the constantly changing environment we live in. They allow us to study trends and help us develop better models of environmental processes, which in turn can provide information to improve management practices. To ensure that the data are reliable for analysis and interpretation, they must undergo quality assurance procedures. Such procedures generally include standard operating procedures during sampling and laboratory measurement (if applicable), as well as data validation upon entry to databases. The latter usually involves compliance (i.e., format) and conformity (i.e., value) checks, most often in the form of single-parameter range tests. Such tests take no account of the system state in which each measurement is made, and give the user little contextual information on the probable cause when a measurement is flagged as out of range. We propose the use of data science techniques to tag each measurement with an identified system state. The term “state” is defined loosely here; states are identified using k-means clustering, an unsupervised machine learning method, and their meaning is open to specialist interpretation. Once the states are identified, state-dependent prediction intervals can be calculated for each observational variable. This approach gives the user more contextual information to resolve out-of-range flags and yields prediction intervals for observational variables that account for changes in system state. Users can then apply further analysis and filtering as they see fit. We illustrate our approach with two well-established long-term monitoring datasets in the UK: moth and butterfly data from the UK Environmental Change Network (ECN), and the UK CEH Cumbrian Lakes monitoring scheme. Our work contributes to the ongoing development of a better data science framework that allows researchers and other stakeholders to find and use the data they need more readily.
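
The state-tagging idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a plain Lloyd's k-means with a deterministic farthest-point initialisation, and it uses empirical 2.5% and 97.5% quantiles within each state as a stand-in for state-dependent prediction intervals. The data are synthetic, and the names `kmeans`, `state_intervals`, and `is_flagged` are hypothetical.

```python
import math
import statistics


def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means on a list of equal-length tuples.

    Initialisation is a deterministic farthest-point rule, chosen here
    only to keep the sketch reproducible (an assumption of this example).
    """
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


def state_intervals(values, labels):
    """Empirical 95% interval of an observed variable within each state.

    Quantile-based intervals are an assumption of this sketch, not
    necessarily how the paper computes its prediction intervals.
    """
    by_state = {}
    for value, lab in zip(values, labels):
        by_state.setdefault(lab, []).append(value)
    intervals = {}
    for state, vals in by_state.items():
        # n=40 gives cut points every 2.5%, so the first and last are 2.5% and 97.5%.
        cuts = statistics.quantiles(vals, n=40, method="inclusive")
        intervals[state] = (cuts[0], cuts[-1])
    return intervals


def is_flagged(value, state, intervals):
    """True if the value lies outside the interval of its tagged state."""
    low, high = intervals[state]
    return not (low <= value <= high)


# Synthetic covariates (hypothetical): two well-separated regimes.
points = [(5.0 + 0.1 * i, 8.0 + 0.05 * i) for i in range(20)]
points += [(18.0 + 0.1 * i, 15.0 + 0.05 * i) for i in range(20)]
# Synthetic observed variable whose typical range differs between regimes.
values = [11.0 + 0.05 * i for i in range(20)]
values += [7.0 + 0.05 * i for i in range(20)]

labels, centroids = kmeans(points, k=2)
intervals = state_intervals(values, labels)

# 9.0 lies inside the global range of `values`, so a single range test
# would pass it, yet it falls outside the interval of either state.
global_ok = min(values) <= 9.0 <= max(values)
flagged_in_first_state = is_flagged(9.0, labels[0], intervals)
flagged_in_second_state = is_flagged(9.0, labels[-1], intervals)
```

In this toy example the state tag supplies the context that a single-parameter range test lacks: the same reading can be plausible in one state and suspicious in another. Choosing the number of clusters and interpreting what each state means remain matters for specialist judgement, as the abstract notes.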

Updated: 2020-05-06