Single-Read Reconstruction for DNA Data Storage Using Transformers
arXiv - CS - Emerging Technologies. Pub Date: 2021-09-12, DOI: arxiv-2109.05478
Yotam Nahum, Eyar Ben-Tolila, Leon Anavy

As the global need for large-scale data storage rises exponentially, existing storage technologies are approaching their theoretical and functional limits in terms of density and energy consumption, making DNA-based storage a potential solution for the future of data storage. Several studies have introduced DNA-based storage systems with high information density (petabytes/gram). However, DNA synthesis and sequencing technologies yield erroneous outputs. Algorithmic approaches for correcting these errors depend on reading multiple copies of each sequence and result in excessive reading costs. The unprecedented success of Transformers as a deep learning architecture for language modeling has led to their repurposing for a variety of tasks across various domains. In this work, we propose a novel approach for single-read reconstruction using an encoder-decoder Transformer architecture for DNA-based data storage. We address the error correction process as a self-supervised sequence-to-sequence task and use synthetic noise injection to train the model using only the decoded reads. Our approach exploits the inherent redundancy of each decoded file to learn its underlying structure. To demonstrate our proposed approach, we encode text, image and code-script files to DNA, produce errors with a high-fidelity error simulator, and reconstruct the original files from the noisy reads. Our model achieves lower error rates when reconstructing the original data from a single read of each DNA strand compared to state-of-the-art algorithms using 2-3 copies. This is the first demonstration of using deep learning models for single-read reconstruction in DNA-based storage, which allows for a reduction in the overall cost of the process. We show that this approach is applicable to various domains and can be generalized to new domains as well.
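
The synthetic noise injection step can be illustrated with a minimal sketch. It is not the authors' implementation: the two-bits-per-base mapping, the function names (`encode_bytes`, `inject_noise`, `make_training_pairs`) and the independent per-base substitution, insertion and deletion rates are all assumptions made for illustration, and the paper's high-fidelity error simulator is not reproduced here. The sketch only shows how (noisy input, target) pairs for a sequence-to-sequence model could be produced from the sequences at hand.

```python
import random

BASES = "ACGT"

def encode_bytes(data: bytes) -> str:
    """Map each byte to four bases, two bits per base (assumed mapping)."""
    return "".join(BASES[(b >> shift) & 0b11]
                   for b in data for shift in (6, 4, 2, 0))

def inject_noise(strand, p_sub=0.01, p_ins=0.01, p_del=0.01, rng=None):
    """Apply independent per-base deletions, substitutions and insertions.

    The rates are hypothetical placeholders, not values from the paper.
    """
    rng = rng or random.Random()
    out = []
    for base in strand:
        r = rng.random()
        if r < p_del:
            continue  # deletion: drop this base entirely
        if r < p_del + p_sub:
            # substitution: replace with one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < p_ins:
            out.append(rng.choice(BASES))  # insertion after this base
    return "".join(out)

def make_training_pairs(strands, copies=1, seed=0, **rates):
    """Return (noisy_input, target) pairs for sequence-to-sequence training."""
    rng = random.Random(seed)
    return [(inject_noise(s, rng=rng, **rates), s)
            for s in strands for _ in range(copies)]

if __name__ == "__main__":
    strand = encode_bytes(b"DNA storage")
    for noisy, target in make_training_pairs([strand], copies=2, p_sub=0.05):
        print(noisy, "->", target)
```

In the setting described above, the targets would be the decoded reads themselves, so a model trained on such pairs learns the structure of the stored file without access to error-free originals.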

Updated: 2021-09-14