Clustering genomic words in human DNA using peaks and trends of distributions
Advances in Data Analysis and Classification (IF 1.6), Pub Date: 2019-05-31, DOI: 10.1007/s11634-019-00362-x
Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Paula Brito, Vera Afreixo

In this work we seek clusters of genomic words in human DNA by studying their inter-word lag distributions. Due to the particularly spiked nature of these histograms, a clustering procedure is proposed that first decomposes each distribution into a baseline and a peak distribution. An outlier-robust fitting method is used to estimate the baseline distribution (the ‘trend’), and a sparse vector of detrended data captures the peak structure. A simulation study demonstrates the effectiveness of the clustering procedure in grouping distributions with similar peak behavior and/or baseline features. The procedure is applied to investigate similarities between the distribution patterns of genomic words of lengths 3 and 5 in the human genome. These experiments demonstrate the potential of the new method for identifying words with similar distance patterns.
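
The decomposition step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the robust fitting method, so the code below swaps in a running median for the baseline and a median-absolute-deviation (MAD) threshold for the peak vector; both choices are assumptions made for illustration. The function name `decompose`, its parameters `window` and `k`, and the synthetic histogram are likewise hypothetical and do not come from the paper.

```python
import numpy as np

def decompose(freq, window=11, k=3.0):
    """Split a lag histogram into a smooth baseline and a sparse peak vector.

    Illustrative stand-in only: running median + MAD threshold,
    not the fitting method used in the paper.
    """
    freq = np.asarray(freq, dtype=float)
    half = window // 2
    # Running median as an outlier-robust trend estimate (edge values repeated).
    padded = np.pad(freq, half, mode="edge")
    baseline = np.array(
        [np.median(padded[i:i + window]) for i in range(freq.size)]
    )
    resid = freq - baseline
    # Robust noise scale of the detrended data via the MAD.
    mad = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    # Keep only residuals well above the noise level; all others become zero,
    # which makes the peak vector sparse.
    peaks = np.where(resid > k * mad, resid, 0.0)
    return baseline, peaks

# Synthetic lag distribution (assumed shape): exponential decay plus small
# noise, with two artificial spikes at lags 50 and 100.
lags = np.arange(1, 201)
rng = np.random.default_rng(0)
hist = 0.02 * np.exp(-0.02 * lags) + rng.normal(0.0, 1e-4, lags.size)
hist[[49, 99]] += 0.01

baseline, peaks = decompose(hist)
print("lags flagged as peaks:", lags[peaks > 0])
```

Words could then be compared on the baseline and on the peak vector separately, for example by feeding both parts to a standard clustering routine; how the paper combines them is not stated in the abstract.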

Updated: 2019-05-31