Abstract
With improvements in high-throughput genome sequencing technology, huge amounts of genomic data are generated every day. These data are used in numerous applications, including sequence alignment, drug discovery, and personalized medicine. Handling genome data efficiently for storage, processing, and transmission requires compression approaches designed specifically for genomic data. In this paper, a hybrid approach, WBTC (Word Based Compression Technique), based on statistical and substitution models is proposed for genome compression. WBTC supports genomic data in raw form as well as in FASTA and multi-FASTA file formats. It is a lossless genome compression algorithm that allows searching without full decompression. Experiments show that WBTC outperforms other state-of-the-art algorithms with respect to compression ratio, compression time, decompression time, compression memory, and decompression memory.
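The abstract does not give the details of WBTC, so the following is only a minimal sketch of the general idea of word-based compression that combines a statistical step with a substitution step; it is not the authors' algorithm. The fixed word length `WORD_LEN`, the integer codes, and all function names are assumptions made for illustration. The sequence is cut into fixed-length words, distinct words are ranked by frequency, each word is replaced by its rank, and a word-aligned pattern can then be located by scanning the codes without rebuilding the sequence.

```python
from collections import Counter

WORD_LEN = 4  # hypothetical word length; the abstract does not specify one


def split_words(seq, k=WORD_LEN):
    """Cut the sequence into non-overlapping k-base words (the last may be shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def build_dictionary(words):
    """Statistical step: rank distinct words by frequency, most frequent first."""
    ranked = [word for word, _ in Counter(words).most_common()]
    return {word: code for code, word in enumerate(ranked)}


def compress(seq, k=WORD_LEN):
    """Substitution step: replace every word by its integer code."""
    words = split_words(seq, k)
    table = build_dictionary(words)
    return [table[w] for w in words], table


def decompress(codes, table):
    """Invert the substitution to recover the original sequence exactly."""
    inverse = {code: word for word, code in table.items()}
    return "".join(inverse[c] for c in codes)


def find_word(codes, table, word):
    """Return word indices where `word` occurs, scanning codes only."""
    code = table.get(word)
    if code is None:
        return []
    return [i for i, c in enumerate(codes) if c == code]


if __name__ == "__main__":
    sequence = "ACGTACGTTTGAACGT"
    codes, table = compress(sequence)
    print(codes)
    print(decompress(codes, table) == sequence)
    print(find_word(codes, table, "TTGA"))
```

In this sketch frequent words receive small codes, which a later entropy or variable-length coder could store in fewer bits. The search finds only occurrences aligned to word boundaries; a match at word index `i` starts at base offset `i * WORD_LEN`.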
Cite this article
Kumar, S., Agarwal, S. & Ranvijay. WBTC: a new approach for efficient storage of genomic data. Int. j. inf. tecnol. 12, 915–921 (2020). https://doi.org/10.1007/s41870-020-00472-2