Assessing data quality from the Clinical Practice Research Datalink: a methodological approach applied to the full blood count blood test
Journal of Big Data (IF 8.1), Pub Date: 2020-11-10, DOI: 10.1186/s40537-020-00375-w
Pradeep S. Virdee, Alice Fuller, Michael Jacobs, Tim Holt, Jacqueline Birks

A Full Blood Count (FBC) is a common blood test comprising 20 parameters, such as haemoglobin and platelets. FBCs from Electronic Health Record (EHR) databases provide a large sample of anonymised individual patient data and are increasingly used in research. We describe the quality of the FBC data in one EHR. We accessed the Test dataset from the Clinical Practice Research Datalink (CPRD), which contains results of tests performed in primary care, such as FBC blood tests. Medical codes and entity codes, the two coding systems used within CPRD to identify FBC records, were compared, and we report the level of mismatched coding and the number of mismatches that could be rectified. The reliability of units of measurement is also described and missing data are discussed. There were 14 entity codes and 138 medical codes for the FBC in the data. Medical and entity codes corresponded to the same FBC parameter in 95.2% (n = 217,752,448) of parameters. Among the 4.8% (n = 10,955,006) of parameters with mismatched codes, the most commonly rectified parameter was mean platelet volume (n = 2,041,360); 1,191,540 could not be rectified and were removed. Units of measurement were often missing, partially entered, or did not appear to correspond to the blood value. The final dataset contained 16,537,017 FBC tests. Applying mathematical equations to derive some missing parameters in these FBCs resulted in 15 of 20 parameters being available per FBC on average, with 0.3% of FBCs having all 20 parameters. Performing data quality checks can help to understand the extent of any issues in the dataset. We emphasise balancing large sample sizes with reliability of the data.
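The two checks described in the abstract, comparing the two coding systems and deriving a missing parameter from others, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the code labels (`MED_HB`, `ENT_HB`, and so on), the lookup tables, and the function names are all hypothetical, and the abstract does not say which equations were used. The derivation shown is the standard haematology identity MCHC (g/dL) = haemoglobin (g/dL) × 100 / haematocrit (%), chosen only as an example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical lookup tables mapping each code to the FBC parameter it denotes.
# The real CPRD medical and entity code lists are not reproduced here.
MEDICAL_CODE_TO_PARAM = {
    "MED_HB": "haemoglobin",
    "MED_PLT": "platelets",
    "MED_MPV": "mean platelet volume",
}
ENTITY_CODE_TO_PARAM = {
    "ENT_HB": "haemoglobin",
    "ENT_PLT": "platelets",
    "ENT_MPV": "mean platelet volume",
}


def classify_record(medical_code: str, entity_code: str) -> str:
    """Return 'match' if both codes point to the same known parameter."""
    med_param = MEDICAL_CODE_TO_PARAM.get(medical_code)
    ent_param = ENTITY_CODE_TO_PARAM.get(entity_code)
    if med_param is not None and med_param == ent_param:
        return "match"
    return "mismatch"


def agreement_rate(records: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (medical_code, entity_code) pairs that agree."""
    if not records:
        raise ValueError("no records supplied")
    matches = sum(1 for med, ent in records if classify_record(med, ent) == "match")
    return matches / len(records)


def derive_mchc(hb_g_dl: Optional[float], hct_percent: Optional[float]) -> Optional[float]:
    """Derive MCHC in g/dL from haemoglobin (g/dL) and haematocrit (%).

    Returns None when an input is missing or the haematocrit is not positive,
    so the parameter simply stays missing rather than being guessed.
    """
    if hb_g_dl is None or hct_percent is None or hct_percent <= 0:
        return None
    return hb_g_dl * 100.0 / hct_percent


def percent_matched(n_match: int, n_mismatch: int) -> float:
    """Percentage of parameters with matching codes, rounded to one decimal."""
    return round(100.0 * n_match / (n_match + n_mismatch), 1)
```

Applied to the counts reported in the abstract, `percent_matched(217_752_448, 10_955_006)` reproduces the stated 95.2% agreement. In practice the mismatch branch would also need a rule for deciding which of the two codes to trust, which the abstract describes only as "rectified" without giving details.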




Updated: 2020-11-12