High throughput BLAST algorithm using spark and cassandra

Cores, Fernando; Guirado, Fernando; Lerida, Josep Lluis

doi:10.1007/s11227-020-03338-3

Published: 28 May 2020

Volume 77, pages 1879–1896, (2021)

The Journal of Supercomputing

Abstract

The rise of high-resolution and high-throughput sequencing technologies has driven the emergence of new fields of application such as precision medicine. However, it has also increased the storage and processing requirements of bioinformatics tools, which can only be met by high-performance, massive data processing infrastructures. Such infrastructures allow the development of scalable, efficient and reliable bioinformatics tools. In this paper, a new implementation of the Basic Local Alignment Search Tool (BLAST) algorithm is presented. Our proposal, named Sparky-Blast, uses the Cassandra database to store the different reference datasets and the Apache Spark processing framework to calculate the indexes and process the queries. This approach avoids the bottleneck of the original BLAST version, which is limited to the resources of a single machine. Sparky-Blast can use the distributed resources of a Big Data cluster to process queries in parallel, thus improving both the response time and the system throughput. At the same time, the use of a distributed architecture like Hadoop lets the system scale out in terms of both hardware infrastructure and performance.
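To make the index-then-query idea in the abstract concrete, the following is a minimal single-machine sketch of BLAST-style seeding, not the Sparky-Blast implementation. It assumes exact k-mer matching only, uses a plain Python dictionary where the paper uses Cassandra, and runs sequentially where the paper uses Spark. The function names, the toy sequences and the word length of 4 are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map every k-mer to the (sequence id, offset) pairs where it occurs.

    In a distributed setting this table would live in a database;
    here it is an in-memory dict (an assumption of this sketch).
    """
    index = defaultdict(list)
    for seq_id, seq in references.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((seq_id, offset))
    return index


def find_seeds(index, query, k):
    """Return (seq_id, ref_offset, query_offset) for every exact k-mer match."""
    seeds = []
    for q_off in range(len(query) - k + 1):
        word = query[q_off:q_off + k]
        # .get avoids inserting empty entries into the defaultdict
        for seq_id, r_off in index.get(word, []):
            seeds.append((seq_id, r_off, q_off))
    return seeds


# Toy reference set and query (illustrative data only)
references = {"ref1": "ACGTACGTGA", "ref2": "TTGACGTT"}
index = build_kmer_index(references, k=4)
seeds = find_seeds(index, "GACGT", k=4)
print(seeds)
```

Each seed is only a candidate location; a full BLAST pipeline would go on to extend and score it. Because every query word is looked up independently, both the index build and the lookups are natural candidates for partitioning across workers.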

Notes

The maximum row/record size that can be written in Cassandra is 16 MB.
The Sparky-Blast code is available in the following GitHub repository: https://github.com/Sherynan/SparkyBlast.
Executors are processes on the worker nodes whose job is to execute the tasks assigned to them for a Spark job. An executor runs tasks and keeps data in memory or disk storage across them. Each Spark application has its own executors, which are launched at the beginning of the application and typically run for its entire lifetime. A single node can run multiple executors, and the executors of an application can span multiple worker nodes.
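The 16 MB row limit in the first note means that a long reference sequence cannot be written as a single record. One way a client could respect such a limit is to split the sequence into fixed-size, overlapping chunks keyed by their start offset. The sketch below illustrates that idea only. Whether Sparky-Blast stores sequences this way is not stated here, and the function name, the chunk size and the overlap are hypothetical. It also assumes one byte per residue (ASCII).

```python
MAX_ROW_BYTES = 16 * 1024 * 1024  # row limit quoted in the note above


def split_into_rows(seq_id, sequence, chunk_size, overlap):
    """Yield (seq_id, start_offset, chunk) rows of at most chunk_size residues.

    Consecutive chunks share `overlap` residues so that a word spanning a
    chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size > MAX_ROW_BYTES:
        raise ValueError("chunk_size exceeds the assumed row limit")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    start = 0
    while True:
        yield seq_id, start, sequence[start:start + chunk_size]
        if start + chunk_size >= len(sequence):
            break  # the chunk just emitted reached the end of the sequence
        start += step


# Tiny sizes so the result is easy to inspect (illustrative only)
rows = list(split_into_rows("ref1", "ACGTACGTGA", chunk_size=4, overlap=1))
print(rows)
```

With a word length of k, an overlap of at least k - 1 residues keeps every k-mer intact in some chunk, and storing the start offset lets a match be mapped back to its position in the original sequence.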

Acknowledgements

This work has been supported by the MINECO-Spain under contract TIN2017-84553-C2-2-R.

Author information

Authors and Affiliations

INSPIRES Research Center, Universitat de Lleida, Lleida, Spain
Fernando Cores, Fernando Guirado & Josep Lluis Lerida

Corresponding author

Correspondence to Fernando Cores.

About this article

Cite this article

Cores, F., Guirado, F. & Lerida, J.L. High throughput BLAST algorithm using spark and cassandra. J Supercomput 77, 1879–1896 (2021). https://doi.org/10.1007/s11227-020-03338-3

Issue Date: February 2021
