WBTC: a new approach for efficient storage of genomic data

Kumar, Sanjeev; Agarwal, Suneeta; Ranvijay

doi:10.1007/s41870-020-00472-2

Original Research
Published: 13 June 2020

Volume 12, pages 915–921, (2020)

International Journal of Information Technology

Sanjeev Kumar¹,
Suneeta Agarwal¹ &
Ranvijay¹

Abstract

With improvements in high-throughput genome sequencing technology, huge amounts of genomic data are generated every day. These data are used in numerous applications, including sequence alignment, drug discovery, and personalized medicine. Handling genome data efficiently for storage, processing, and transmission calls for compression approaches designed specifically for genomic data. In this paper, a hybrid approach, WBTC (Word Based Compression Technique), based on statistical and substitution models, is proposed for genome compression. WBTC supports genomic data in raw form as well as in FASTA and multi-FASTA file formats. It is a lossless genome compression algorithm that allows searching without full decompression. Experiments show that WBTC outperforms other state-of-the-art algorithms with respect to compression ratio, compression time, decompression time, and memory use during compression and decompression.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, MNNIT Allahabad, Allahabad, India
Sanjeev Kumar, Suneeta Agarwal & Ranvijay

Corresponding author

Correspondence to Sanjeev Kumar.

About this article

Cite this article

Kumar, S., Agarwal, S. & Ranvijay. WBTC: a new approach for efficient storage of genomic data. Int. j. inf. tecnol. 12, 915–921 (2020). https://doi.org/10.1007/s41870-020-00472-2

Received: 21 February 2019
Accepted: 09 May 2020
Published: 13 June 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s41870-020-00472-2
