Best Practices for Binary and Ordinal Data Analyses
Behavior Genetics (IF 2.6) Pub Date: 2021-01-05, DOI: 10.1007/s10519-020-10031-x
Brad Verhulst, Michael C. Neale

The measurement of many human traits, states, and disorders begins with a set of items on a questionnaire. The response format for these questions is often simply binary (e.g., yes/no) or ordered (e.g., high, medium or low). During data analysis, these items are frequently summed or used to estimate factor scores. In clinical applications, such assessments are often non-normally distributed in the general population because many respondents are unaffected, and therefore asymptomatic. As a result, in many cases these measures violate the statistical assumptions required for subsequent analyses. To reduce the influence of non-normality and quasi-continuous assessment, variables are frequently recoded into binary (affected–unaffected) or ordinal (mild–moderate–severe) diagnoses. Ordinal data therefore present challenges at multiple levels of analysis. Categorizing continuous variables into ordered categories typically results in a loss of statistical power, which creates an incentive for the data analyst to assume that the data are normally distributed, even when they are not. Despite the prior zeitgeist suggesting that variables with, e.g., more than 10 ordered categories may be regarded as continuous and analyzed as if they were, we show via simulation studies that this is not generally the case. In particular, using Pearson product-moment correlations instead of maximum likelihood estimates of polychoric correlations biases the estimated correlations towards zero. This bias is especially severe when a plurality of the observations falls into a single observed category, such as a score of zero. By contrast, estimating the ordinal correlation by maximum likelihood yields no estimation bias, although standard errors are (appropriately) larger. We also illustrate how odds ratios depend critically on the proportion or prevalence of affected individuals in the population, and are therefore sub-optimal for studies in which association metrics must be compared. Finally, we extend these analyses to the classical twin model and demonstrate that treating binary data as continuous will underestimate the genetic and common environmental variance components, and overestimate the unique environment (residual) variance. These biases increase as prevalence declines. While modeling ordinal data appropriately may be more computationally intensive and time consuming, failing to do so will likely yield biased correlations and, consequently, biased parameter estimates in any models fitted to them.
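The attenuation described in the abstract is straightforward to reproduce in a small simulation. The sketch below uses Python/SciPy rather than the R/OpenMx tooling typical of this literature, and the sample size, latent correlation, and 10% prevalence threshold are illustrative assumptions, not values from the paper. It dichotomizes a bivariate-normal liability and compares the Pearson correlation of the resulting binary items against a maximum-likelihood tetrachoric estimate (the two-category case of the polychoric correlation).

```python
import numpy as np
from scipy.stats import norm, multivariate_normal, pearsonr
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2021)
TRUE_RHO, N, PREVALENCE = 0.6, 20_000, 0.10  # illustrative values

# Bivariate-normal "liability" with a known latent correlation.
x, y = rng.multivariate_normal([0, 0], [[1, TRUE_RHO], [TRUE_RHO, 1]], N).T

# Dichotomize at a high threshold so only ~10% are "affected":
# most observations pile up in the zero category.
tau = norm.ppf(1 - PREVALENCE)
bx, by = (x > tau).astype(int), (y > tau).astype(int)

# Pearson product-moment correlation on the binary items
# (the phi coefficient): biased toward zero.
r_pearson = pearsonr(bx, by)[0]

# Tetrachoric correlation by maximum likelihood: thresholds fixed at
# the observed margins, rho chosen to maximize the multinomial
# likelihood of the 2x2 cell counts.
t1, t2 = norm.ppf(1 - bx.mean()), norm.ppf(1 - by.mean())
n11 = np.sum((bx == 1) & (by == 1))
n10 = np.sum((bx == 1) & (by == 0))
n01 = np.sum((bx == 0) & (by == 1))
n00 = np.sum((bx == 0) & (by == 0))

def neg_loglik(rho):
    # P(X > t1, Y > t2) under the bivariate normal = Phi2(-t1, -t2; rho).
    p11 = multivariate_normal.cdf([-t1, -t2], cov=[[1, rho], [rho, 1]])
    p10 = (1 - norm.cdf(t1)) - p11
    p01 = (1 - norm.cdf(t2)) - p11
    p00 = 1 - p11 - p10 - p01
    return -(n11 * np.log(p11) + n10 * np.log(p10)
             + n01 * np.log(p01) + n00 * np.log(p00))

r_tetra = minimize_scalar(neg_loglik, bounds=(-0.95, 0.95),
                          method="bounded").x

print(f"true latent rho: {TRUE_RHO:.2f}")
print(f"Pearson on binary items: {r_pearson:.2f}  (attenuated)")
print(f"ML tetrachoric estimate: {r_tetra:.2f}  (approximately unbiased)")
```

With these settings the Pearson estimate lands well below the latent correlation (on the order of half its value), while the ML estimate recovers it to within sampling error, mirroring the bias pattern the abstract reports.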

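The prevalence dependence of the odds ratio can be shown without simulation: once the latent (tetrachoric) correlation is held fixed, the 2×2 cell probabilities implied by a diagnostic threshold already determine the odds ratio. A minimal sketch, assuming an illustrative latent correlation of 0.5:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

RHO = 0.5  # fixed latent (tetrachoric) correlation -- illustrative

for prev in (0.50, 0.25, 0.10, 0.01):
    tau = norm.ppf(1 - prev)             # diagnostic threshold
    # 2x2 cell probabilities implied by dichotomizing both liabilities.
    p11 = multivariate_normal.cdf([-tau, -tau], cov=[[1, RHO], [RHO, 1]])
    p10 = p01 = prev - p11               # both margins equal the prevalence
    p00 = 1 - p11 - p10 - p01
    odds_ratio = (p11 * p00) / (p10 * p01)
    print(f"prevalence {prev:>5.0%}: odds ratio = {odds_ratio:.2f}")
```

The latent association is identical in every row, yet the implied odds ratio climbs steeply as prevalence falls, which is why the abstract cautions against comparing odds ratios across samples or traits with different prevalences.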

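For the twin-model claim, the authors' full maximum-likelihood twin model is beyond a short example, but Falconer's approximation (a² = 2(rMZ − rDZ), c² = 2rDZ − rMZ, e² = 1 − a² − c²) applied to simulated twin liabilities shows the direction of the bias. Everything here, from the variance components to the 5% prevalence, is an illustrative assumption:

```python
import numpy as np
from scipy.stats import norm, pearsonr

rng = np.random.default_rng(7)
A, C, E = 0.5, 0.3, 0.2                  # illustrative ACE components
r_mz, r_dz = A + C, 0.5 * A + C          # expected twin correlations
N, PREV = 50_000, 0.05                   # pairs per group; 5% prevalence
tau = norm.ppf(1 - PREV)                 # diagnostic threshold

def falconer(rmz, rdz):
    a2 = 2 * (rmz - rdz)                 # additive genetic variance
    c2 = 2 * rdz - rmz                   # common environmental variance
    return a2, c2, 1 - a2 - c2           # unique environmental variance

for label, dichotomize in (("continuous liability", False),
                           ("binary (Pearson r)  ", True)):
    est = []
    for r in (r_mz, r_dz):
        t1, t2 = rng.multivariate_normal([0, 0], [[1, r], [r, 1]], N).T
        if dichotomize:
            t1, t2 = (t1 > tau).astype(int), (t2 > tau).astype(int)
        est.append(pearsonr(t1, t2)[0])
    print(label, "-> a2=%.2f  c2=%.2f  e2=%.2f" % falconer(*est))
```

The continuous analysis recovers the generating components; the dichotomized analysis shrinks a² and c² and inflates e², and the distortion worsens as the threshold moves further into the tail, consistent with the prevalence effect the abstract describes.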


Updated: 2021-01-05