Standardised Versioning of Datasets: a FAIR–compliant Proposal
Scientific Data ( IF 9.8 ) Pub Date : 2024-04-09 , DOI: 10.1038/s41597-024-03153-y
Alba González–Cebrián , Michael Bradford , Adriana E. Chis , Horacio González–Vélez

This paper presents a standardised dataset versioning framework for improved reusability, recognition and data version tracking, facilitating comparisons and informed decision-making for data usability and workflow integration. The framework adopts a software engineering-like data versioning nomenclature (“major.minor.patch”) and incorporates data schema principles to promote reproducibility and collaboration. To quantify changes in statistical properties over time, the concept of data drift metrics (d) is introduced. Three metrics (d_P, d_{E,PCA}, and d_{E,AE}) based on unsupervised Machine Learning techniques (Principal Component Analysis and Autoencoders) are evaluated for dataset creation, update, and deletion. The optimal choice is the d_{E,PCA} metric, which combines PCA models with splines. It is computationally efficient, yields values below 50 for new dataset batches, and produces values consistent with seasonal or trend variations. It signals a major update (i.e., a value of 100) when scaling transformations are applied to over 30% of variables, and it handles information loss efficiently, yielding values close to 0. This metric achieved a favourable trade-off between interpretability, robustness against information loss, and computation time.
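
As an illustration of how a drift value could drive the “major.minor.patch” nomenclature, the sketch below bumps a dataset version from a drift score d that is assumed to lie on a 0 to 100 scale. The class `DatasetVersion`, the function `bump`, and the cut-offs `major_at` and `patch_below` are hypothetical names and values chosen for this example. The abstract does not state the exact mapping from d to version components, so this is a minimal sketch under those assumptions, not the authors' procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """A 'major.minor.patch' dataset version (hypothetical helper class)."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "DatasetVersion":
        parts = text.strip().split(".")
        if len(parts) != 3:
            raise ValueError(f"expected 'major.minor.patch', got {text!r}")
        major, minor, patch = (int(p) for p in parts)
        return cls(major, minor, patch)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def bump(version: DatasetVersion, d: float,
         major_at: float = 100.0, patch_below: float = 1.0) -> DatasetVersion:
    """Return the next version given a drift score d in [0, 100].

    Assumed rule (not taken from the paper):
      d >= major_at            -> major update, minor and patch reset to 0
      patch_below <= d < major -> minor update, patch reset to 0
      d < patch_below          -> patch update
    """
    if not 0.0 <= d <= 100.0:
        raise ValueError("drift score d is assumed to lie in [0, 100]")
    if d >= major_at:
        return DatasetVersion(version.major + 1, 0, 0)
    if d >= patch_below:
        return DatasetVersion(version.major, version.minor + 1, 0)
    return DatasetVersion(version.major, version.minor, version.patch + 1)


if __name__ == "__main__":
    current = DatasetVersion.parse("1.2.3")
    for score in (100.0, 42.0, 0.2):
        print(score, "->", bump(current, score))
```

Under these assumed cut-offs, a new batch with d below 50 would register as a minor update, while a value of 100 (for example, after rescaling many variables) would register as a major one. The thresholds are parameters so that they can be replaced with whatever rule a given project adopts.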




Updated: 2024-04-09