High throughput BLAST algorithm using spark and cassandra

The Journal of Supercomputing

Abstract

The rise of high-resolution, high-throughput sequencing technologies has driven the emergence of new application fields such as precision medicine. However, it has also increased the storage and processing requirements of bioinformatics tools, requirements that can only be met by high-performance, massive data processing infrastructures. Such infrastructures allow the development of scalable, efficient and reliable bioinformatics tools. In this paper, a new implementation of the Basic Local Alignment Search Tool (BLAST) algorithm is presented. Our proposal, named Sparky-Blast, uses the Cassandra database to store the different reference datasets and the Apache Spark processing framework to calculate the indexes and process the queries. This approach avoids the bottleneck of the original BLAST version, which is limited to the resources of a single machine. Sparky-Blast can use the distributed resources of a Big Data cluster to process queries in parallel, thus improving both response time and system throughput. At the same time, the use of a distributed architecture like Hadoop allows the system to scale in terms of both hardware infrastructure and performance.
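
The index-then-query flow described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. This is not the Sparky-Blast implementation: the function names (`build_kmer_index`, `find_seeds`, `extend_seed`) are hypothetical, the in-memory dictionary only stands in for an index that a distributed system might compute with Spark and store in Cassandra, and the extension step uses exact matching rather than BLAST's scored extension.

```python
from collections import defaultdict


def build_kmer_index(references, k):
    """Map each k-mer to the (reference id, offset) pairs where it occurs.

    Assumption: `references` is a dict of id -> sequence held in memory.
    """
    index = defaultdict(list)
    for ref_id, seq in references.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((ref_id, pos))
    return index


def find_seeds(index, query, k):
    """Return (ref_id, ref_pos, query_pos) for every exact k-mer match."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for ref_id, rpos in index.get(query[qpos:qpos + k], []):
            seeds.append((ref_id, rpos, qpos))
    return seeds


def extend_seed(ref, query, rpos, qpos, k):
    """Grow an exact seed in both directions while the characters match.

    Simplification: ungapped and exact, with no substitution scores.
    Returns (reference start, query start, match length).
    """
    left = 0
    while (rpos - left > 0 and qpos - left > 0
           and ref[rpos - left - 1] == query[qpos - left - 1]):
        left += 1
    right = k
    while (rpos + right < len(ref) and qpos + right < len(query)
           and ref[rpos + right] == query[qpos + right]):
        right += 1
    return rpos - left, qpos - left, left + right


if __name__ == "__main__":
    references = {"ref1": "ACGTACGTGACC", "ref2": "TTGACCGTA"}
    k = 4
    index = build_kmer_index(references, k)
    for ref_id, rpos, qpos in find_seeds(index, "GTGACCG", k):
        print(ref_id, extend_seed(references[ref_id], "GTGACCG", rpos, qpos, k))
```

In this sketch the index is built once and each query only looks up its own k-mers, so independent queries share no mutable state. That independence is what would let a distributed framework process many queries in parallel.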

Notes

  1. The maximum row/record size that can be written in Cassandra is 16 MB.

  2. The Sparky-Blast code is available in the following GitHub repository: https://github.com/Sherynan/SparkyBlast.

  3. Executors are processes on the worker nodes that execute the tasks assigned to them for a Spark job. An executor runs tasks and keeps data in memory or on disk across them. Each Spark application has its own executors, which are launched at the start of the application and typically run for its entire lifetime. A single node can run multiple executors, and the executors of an application can span multiple worker nodes.
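
The row-size limit in note 1 suggests that a long reference sequence has to be split before it is stored. The sketch below shows one possible way to do this and is not taken from Sparky-Blast: `split_into_chunks` and its parameters are hypothetical. It assumes that consecutive chunks overlap by k - 1 characters, so that every k-mer appears whole in at least one chunk.

```python
def split_into_chunks(sequence, chunk_size, overlap):
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` characters.

    Assumption: chunk_size is chosen so that one chunk fits in a single row.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    for offset in range(0, len(sequence), step):
        yield offset, sequence[offset:offset + chunk_size]
        if offset + chunk_size >= len(sequence):
            break  # the last chunk already reaches the end of the sequence


if __name__ == "__main__":
    # With k = 2, an overlap of k - 1 = 1 keeps every 2-mer inside some chunk.
    for offset, chunk in split_into_chunks("ACGTACGTAC", chunk_size=4, overlap=1):
        print(offset, chunk)
```

Storing the offset with each chunk lets a match position inside a chunk be mapped back to a position in the full sequence.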

Acknowledgements

This work has been supported by the MINECO-Spain under contract TIN2017-84553-C2-2-R.

Author information

Corresponding author

Correspondence to Fernando Cores.

About this article

Cite this article

Cores, F., Guirado, F. & Lerida, J.L. High throughput BLAST algorithm using spark and cassandra. J Supercomput 77, 1879–1896 (2021). https://doi.org/10.1007/s11227-020-03338-3

  • DOI: https://doi.org/10.1007/s11227-020-03338-3
