


Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons
arXiv - CS - Performance. Pub Date: 2019-11-11, DOI: arxiv-1911.04200
Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Rätsch, Torsten Hoefler, Edgar Solomonik

The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
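
The encoding of Jaccard similarity as a sparse matrix product can be illustrated on a single machine. The sketch below is not the SimilarityAtScale algorithm and involves no distribution or communication optimization; it only shows the underlying idea under stated assumptions. Each sample is treated as a set, the samples together form a sparse 0/1 indicator matrix A (one row per attribute, one column per sample), the product B = AᵀA holds the pairwise intersection sizes, and J(i, j) = B[i][j] / (|S_i| + |S_j| - B[i][j]). The function name `jaccard_all_pairs`, the dictionary-based sparse layout, and the convention that two empty sets have similarity 1.0 are all assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_all_pairs(samples):
    """Return {(i, j): Jaccard similarity} for all i < j.

    `samples` is a list of sets. This is a single-node illustration
    of the matrix-product view, not a distributed implementation.
    """
    # Sparse indicator matrix A, stored row by row:
    # attribute -> indices of the samples (columns) that contain it.
    rows = defaultdict(list)
    for j, sample in enumerate(samples):
        for attribute in sample:
            rows[attribute].append(j)

    # B = A^T A restricted to i < j. Each sparse row contributes one
    # to B[i][j] for every pair of samples sharing that attribute.
    intersections = defaultdict(int)
    for columns in rows.values():
        for i, j in combinations(sorted(columns), 2):
            intersections[(i, j)] += 1

    # |S_i| is the column sum of A, which is simply the set size here.
    sizes = [len(sample) for sample in samples]

    result = {}
    for i, j in combinations(range(len(samples)), 2):
        shared = intersections.get((i, j), 0)
        union = sizes[i] + sizes[j] - shared
        # Assumed convention: two empty sets are treated as identical.
        result[(i, j)] = shared / union if union else 1.0
    return result


if __name__ == "__main__":
    demo = [{1, 2, 3}, {2, 3, 4}, {5}]
    print(jaccard_all_pairs(demo))
```

For the demo input, samples 0 and 1 share two of four distinct elements, so their similarity is 0.5, while sample 2 shares nothing with either and scores 0.0. The corresponding Jaccard distance is one minus the similarity.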


Updated: 2020-11-12
