Finite-size effects in transcript sequencing count distribution: its power-law correction necessarily precedes downstream normalization and comparative analysis.
Biology Direct (IF 5.7), Pub Date: 2018-02-12, DOI: 10.1186/s13062-018-0204-y
Wing-Cheong Wong 1, Hong-Kiat Ng 2, Erwin Tantoso 1, Richie Soong 2, Frank Eisenhaber 1,3

BACKGROUND Though earlier works on modelling transcript abundance, from vertebrates to lower eukaryotes, have specifically singled out Zipf's law, the observed distributions often deviate from a single power-law slope. In hindsight, while the power laws of critical phenomena are derived asymptotically under the condition of infinite observations, real-world observations are finite, so finite-size effects set in, force a power-law distribution into an exponential decay and consequently manifest as a curvature (i.e., varying exponent values) in a log-log plot. If transcript abundance is truly power-law distributed, the varying exponent signifies changing mathematical moments (e.g., mean, variance) and creates heteroskedasticity, which compromises statistical rigor in analysis. The impact of this deviation from the asymptotic power law on sequencing count data has never truly been examined and quantified. RESULTS The anecdotal description of transcript abundance as almost Zipf's law-like distributed can be conceptualized as the imperfect mathematical rendition of the Pareto power-law distribution when subjected to finite-size effects in the real world. This holds regardless of advances in sequencing technology, since sampling is finite in practice. Our conceptualization agrees well with our empirical analysis of two modern-day NGS (next-generation sequencing) datasets: an in-house generated dilution miRNA study of two gastric cancer cell lines (NUGC3 and AGS) and publicly available spike-in miRNA data. Firstly, the finite-size effects cause the deviations of sequencing count data from Zipf's law and issues of reproducibility in sequencing experiments. Secondly, they manifest as heteroskedasticity among experimental replicates, which brings about statistical problems.
Surprisingly, a straightforward power-law correction that restores the distorted distribution to a single exponent value can dramatically reduce data heteroskedasticity, yielding an immediate increase in signal-to-noise ratio of 50% and in statistical/detection sensitivity of as much as 30%, regardless of the downstream mapping and normalization methods. Most importantly, the power-law correction improves concordance in significant calls among different normalization methods of a data series by 22% on average. When presented with a higher sequencing depth (a 4-fold difference), the improvement in concordance is asymmetrical (32% for the higher sequencing depth instance versus 13% for the lower instance), which demonstrates that the simple power-law correction can increase significant detection at higher sequencing depths. Finally, the correction dramatically enhances the statistical conclusions and elucidates the metastatic potential of the NUGC3 cell line against AGS in our dilution analysis. CONCLUSIONS The finite-size effects due to undersampling generally plague transcript count data with reproducibility issues but can be minimized through a simple power-law correction of the count distribution. This distribution correction has direct implications for the biological interpretation of the study and the rigor of the scientific findings. REVIEWERS This article was reviewed by Oliviero Carugo, Thomas Dandekar and Sandor Pongor.
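
The curvature described in the background can be illustrated with a small numerical sketch. It is not the authors' correction procedure, which the abstract does not spell out. It assumes, purely for illustration, that the finite-size effect takes the common form of an exponential cutoff, p(x) proportional to x^(-alpha) * exp(-x / cutoff). The names alpha, cutoff, truncated_power_law and local_exponent are hypothetical. Under that assumption the local slope in a log-log plot is -alpha - x / cutoff, so the apparent exponent drifts away from -alpha as x approaches the cutoff.

```python
import math


def truncated_power_law(x, alpha, cutoff):
    """Unnormalised density x^(-alpha) * exp(-x / cutoff).

    Assumed illustrative form: a power law whose tail is bent into an
    exponential decay by a finite cutoff scale.
    """
    return x ** (-alpha) * math.exp(-x / cutoff)


def local_exponent(x, alpha, cutoff, step=1e-4):
    """Slope d(log p) / d(log x) at x, by central difference in log space."""
    lo = x * math.exp(-step)
    hi = x * math.exp(step)
    log_p_hi = math.log(truncated_power_law(hi, alpha, cutoff))
    log_p_lo = math.log(truncated_power_law(lo, alpha, cutoff))
    return (log_p_hi - log_p_lo) / (2.0 * step)


if __name__ == "__main__":
    # Hypothetical parameters: pure exponent -2, cutoff at 1000 counts.
    for x in (1.0, 10.0, 100.0, 1000.0):
        print(x, local_exponent(x, alpha=2.0, cutoff=1000.0))
```

With a very large cutoff the slope stays close to -alpha over the whole range, which corresponds to the single-exponent behaviour that a power-law correction aims to restore.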

Updated: 2019-11-01