AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models
Entropy (IF 2.1), Pub Date: 2021-04-26, DOI: 10.3390/e23050530
Milton Silva, Diogo Pratas, Armando J. Pinho

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, raising challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, the literature contains relatively few compressors designed specifically for protein sequences. Moreover, these specialized compressors improve the compression ratio only marginally over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach, and uses individual cache-hash memory models for the highest context orders. Compared to the previous compressor (AC), we show gains of 2–9% and 6–7% in reference-free and reference-based modes, respectively. These gains come at the cost of computations that are three times slower. AC2 also improves on the memory usage of AC, with requirements about seven times lower that are unaffected by the size of the input sequences. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence and each viral protein sequence from the whole UniProt database. The results consistently show the highest similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing critical results to a currently controversial subject. AC2 is available for free download under the GPLv3 license.
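To illustrate the general idea of using a compressor as a similarity measure, here is a minimal Python sketch. It is not AC2 and not necessarily the measure used in the paper: it assumes the Normalized Compression Distance (NCD) as the similarity measure and uses the general-purpose zlib compressor as a stand-in. The names `compressed_size`, `ncd`, and `AMINO_ACIDS`, as well as the randomly generated test sequences, are hypothetical and chosen only for illustration.

```python
import random
import zlib

# The 20 standard amino acid one-letter codes (illustrative alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression (stand-in for a protein compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance: lower values mean more shared information."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # cost of compressing both sequences together
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic data (an assumption for the demo, not real protein sequences).
rng = random.Random(0)
seq = "".join(rng.choice(AMINO_ACIDS) for _ in range(2000))
unrelated = "".join(rng.choice(AMINO_ACIDS) for _ in range(2000))
# Related sequence: a copy of seq with roughly 5% of residues redrawn at random.
related = "".join(
    rng.choice(AMINO_ACIDS) if rng.random() < 0.05 else c for c in seq
)

print("related:  ", ncd(seq, related))
print("unrelated:", ncd(seq, unrelated))
```

A stronger compressor captures more of the shared structure between two sequences, which is why a specialized tool would be expected to give a sharper similarity signal than zlib does in this sketch.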

Updated: 2021-04-26