
WBTC: a new approach for efficient storage of genomic data

  • Original Research
International Journal of Information Technology

Abstract

With improvements in high-throughput genome sequencing technology, huge amounts of genomic data are generated every day. These data are used in numerous applications, such as sequence alignment, drug discovery, and personalized medicine. Efficient storage, processing, and transmission of genome data therefore call for a dedicated genomic data compression approach. This paper proposes WBTC (Word Based Compression Technique), a hybrid approach to genome compression based on a statistical and substitution model. WBTC supports genomic data in raw form as well as in FASTA/Multi-FASTA file formats. It is a lossless genome compression algorithm that allows searching without full decompression. Experiments show that WBTC outperforms other state-of-the-art algorithms with respect to compression ratio, compression time, decompression time, compression memory, and decompression memory.




Author information


Corresponding author

Correspondence to Sanjeev Kumar.


Cite this article

Kumar, S., Agarwal, S. & Ranvijay. WBTC: a new approach for efficient storage of genomic data. Int. j. inf. tecnol. 12, 915–921 (2020). https://doi.org/10.1007/s41870-020-00472-2


  • DOI: https://doi.org/10.1007/s41870-020-00472-2
