Abstract
The rise of high-resolution, high-throughput sequencing technologies has driven the emergence of new fields of application such as precision medicine. However, it has also increased the storage and processing requirements of bioinformatics tools, which can only be met by high-performance, massive data processing infrastructures. Such infrastructures allow the development of scalable, efficient and reliable bioinformatics tools. In this paper, a new implementation of the Basic Local Alignment Search Tool (BLAST) algorithm is presented. Our proposal, named Sparky-Blast, uses the Cassandra database to store the different reference datasets and the Apache Spark processing framework to calculate the indexes and process the queries. This approach avoids the bottleneck suffered by the original BLAST version, which is limited to the resources of a single machine. Sparky-Blast can use the distributed resources of a Big-Data cluster to process queries in parallel, thus improving both the response time and the system throughput. At the same time, the use of a distributed architecture like Hadoop provides unlimited scalability from the point of view of both the hardware infrastructure and performance.
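To make the index-then-query idea concrete, the following is a minimal single-machine sketch of the word-seeding step that BLAST-style tools rely on: fixed-length words (k-mers) of the reference sequences are indexed, and each query word is looked up to find candidate match positions. It is not the Sparky-Blast implementation. A plain in-memory dictionary stands in for the Cassandra-backed index, no Spark parallelism is shown, and the names `build_kmer_index` and `find_seeds` as well as the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map every k-mer to the (reference id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for ref_id, seq in references.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((ref_id, offset))
    return index


def find_seeds(index, query, k):
    """Return (ref_id, ref_offset, query_offset) for every exact k-mer match."""
    seeds = []
    for q_off in range(len(query) - k + 1):
        word = query[q_off:q_off + k]
        # .get avoids inserting empty entries into the defaultdict
        for ref_id, r_off in index.get(word, []):
            seeds.append((ref_id, r_off, q_off))
    return seeds


# Toy data, for illustration only
references = {"ref1": "ACGTACGTGA", "ref2": "TTGACGTT"}
index = build_kmer_index(references, k=4)
seeds = find_seeds(index, "GACGT", k=4)
```

In a full BLAST pipeline each seed would then be extended into a scored local alignment; that stage is omitted here.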
Notes
The maximum row/record size that can be written in Cassandra is 16 MB.
The Sparky-Blast code is available in the following GitHub repository: https://github.com/Sherynan/SparkyBlast.
Executors are processes on the worker nodes that execute the tasks assigned to them for a Spark job. An executor runs tasks and keeps data in memory or on disk across them. Each Spark application has its own executors, which are launched at the beginning of the application and typically run for its entire lifetime. A single node can run multiple executors, and the executors of an application can span multiple worker nodes.
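Taking the 16 MB row limit stated in the first note as given, a long reference sequence has to be split into several rows before it can be stored. The sketch below shows one way this could be done: fixed-size chunks that overlap slightly, so that a word spanning a chunk boundary still appears whole in at least one chunk. This is an assumption for illustration, not the partitioning scheme used by Sparky-Blast; `split_sequence`, `chunk_size` and `overlap` are hypothetical names.

```python
MAX_ROW_BYTES = 16 * 1024 * 1024  # row limit quoted in the first note


def split_sequence(sequence, chunk_size, overlap):
    """Yield (start_offset, chunk) pairs.

    Consecutive chunks share `overlap` characters, so any word of length
    overlap + 1 or less is fully contained in at least one chunk.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not sequence:
        return
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk
    for start in range(0, max(len(sequence) - overlap, 1), step):
        yield start, sequence[start:start + chunk_size]


# Toy example; a real chunk_size would be chosen below MAX_ROW_BYTES
chunks = list(split_sequence("ABCDEFGHIJ", chunk_size=4, overlap=1))
```

With a word length of k, an overlap of k - 1 is the smallest value that keeps every word intact in some chunk.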
Acknowledgements
This work has been supported by MINECO-Spain under contract TIN2017-84553-C2-2-R.
About this article
Cite this article
Cores, F., Guirado, F. & Lerida, J.L. High throughput BLAST algorithm using spark and cassandra. J Supercomput 77, 1879–1896 (2021). https://doi.org/10.1007/s11227-020-03338-3
DOI: https://doi.org/10.1007/s11227-020-03338-3