Semi-automated Conversion of Clinical Trial Legacy Data into CDISC SDTM Standards Format Using Supervised Machine Learning
Methods of Information in Medicine (IF 1.3), Pub Date: 2021-07-08, DOI: 10.1055/s-0041-1731388
Takuma Oda, Shih-Wei Chiu, Takuhiro Yamaguchi

Objective This study aimed to develop a semi-automated process for converting legacy data into the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) format by combining human verification with three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning.

Materials and Methods Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are strings, they were converted to a distributed representation to make them usable as features. For this purpose, we used three methods: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. We examined five algorithms (decision tree, random forest, gradient boosting, neural network, and an ensemble combining the four) to identify the one that generated the best prediction model.
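As a rough illustration of the string-similarity features described above, the following Python sketch computes a Gestalt-pattern-matching style score with the standard library's `difflib.SequenceMatcher` (whose algorithm is closely related to Ratcliff and Obershelp's "gestalt pattern matching") and a cosine similarity between two vectors. Doc2vec itself is not reproduced here: the character-bigram count vectors are a stand-in chosen only to keep the example self-contained, and the function names and example labels are hypothetical, not taken from the study.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List
import math


def gestalt_similarity(a: str, b: str) -> float:
    """Gestalt-pattern-matching style ratio in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def bigram_vector(text: str) -> Counter:
    """Stand-in for a Doc2vec embedding: character-bigram counts."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_u * norm_v)


def similarity_features(legacy_label: str, sdtm_label: str) -> List[float]:
    """Two numeric features comparing a legacy label with a candidate SDTM label."""
    return [
        gestalt_similarity(legacy_label, sdtm_label),
        cosine_similarity(bigram_vector(legacy_label), bigram_vector(sdtm_label)),
    ]


if __name__ == "__main__":
    # Hypothetical legacy label compared with a hypothetical target label.
    print(similarity_features("Date of Birth", "Birth Date"))
```

Features of this kind could then be fed, alongside the other inputs, to any of the classifiers named above; which features each model in the study actually received is not specified in the abstract.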

Results The accuracy rate was highest for the neural network, and the distribution of prediction probabilities also showed a separation between correct and incorrect predictions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.
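One way to pair such a classifier with human verification is to accept predictions whose probability clears a cutoff and route the rest to a reviewer. The sketch below assumes this routing rule; the abstract does not state how predictions were selected for review, and the 0.9 cutoff, the `Prediction` record, and the example variable names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    legacy_variable: str  # variable name in the legacy dataset (illustrative)
    sdtm_variable: str    # SDTM variable predicted by the model (illustrative)
    probability: float    # model's predicted probability for that mapping


def triage(
    predictions: List[Prediction], threshold: float = 0.9
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into (auto-accepted, needs human review).

    The threshold is an assumed cutoff, not a value from the study.
    """
    accepted: List[Prediction] = []
    review: List[Prediction] = []
    for p in predictions:
        if p.probability >= threshold:
            accepted.append(p)
        else:
            review.append(p)
    return accepted, review


if __name__ == "__main__":
    preds = [
        Prediction("BIRTHDT", "BRTHDTC", 0.97),
        Prediction("SEXCD", "SEX", 0.55),
    ]
    accepted, review = triage(preds)
    print("auto-accepted:", [p.legacy_variable for p in accepted])
    print("human review:", [p.legacy_variable for p in review])
```

A rule like this is only useful when high-probability predictions are rarely wrong, which is why the reported separation between correct and incorrect predictions matters for the workflow.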

Conclusion By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.




Updated: 2021-07-09