Using blockchain to log genome dataset access: efficient storage and query.
BMC Medical Genomics ( IF 2.1 ) Pub Date : 2020-07-21 , DOI: 10.1186/s12920-020-0716-z
Gamze Gürsoy 1, 2 , Robert Bjornson 3, 4 , Molly E Green 1, 2 , Mark Gerstein 1, 2, 4

Genomic variants are considered sensitive information, revealing potentially private facts about individuals. Therefore, it is important to control access to such data. A key aspect of controlled access is the secure storage and efficient querying of access logs, so that potential misuse can be detected. However, there are challenges to securing logs, such as designing against the consequences of “single points of failure”. A potential approach to circumvent these challenges is blockchain technology, which is currently popular in cryptocurrency due to its properties of security, immutability, and decentralization. One of the tasks of the iDASH (Integrating Data for Analysis, Anonymization, and Sharing) Secure Genome Analysis Competition in 2018 was to develop time- and space-efficient blockchain-based ledgering solutions to log and query user activity accessing genomic datasets across multiple sites, using MultiChain. MultiChain is a specific blockchain platform that offers “data streams” embedded in the chain for rapid and secure data storage.

We devised a storage protocol that takes advantage of the keys in the MultiChain data streams, and created a data frame from the chain that allows efficient querying. Our solution to the iDASH competition was selected as the winner at a workshop held in San Diego, CA in October 2018. Although our solution worked well in the challenge, it has the drawback that it requires downloading all the data from the chain and keeping it locally in memory for fast querying. To address this, we provide an alternate “bigmem” solution that uses indices rather than local storage for rapid queries.

We profiled the performance of both of our solutions using logs with 100,000 to 600,000 entries, both for querying the chain and for inserting data into it. The challenge solution requires 12 seconds and 120 MB of memory to query a chain with 100,000 entries. The memory requirement increases linearly and reaches 470 MB for a chain with 600,000 entries. Although our alternate bigmem solution is slower and requires more memory (408 seconds and 250 MB, respectively, for 100,000 entries), its memory requirement increases at a slower rate and reaches only 360 MB for 600,000 entries.

Overall, we demonstrate that genomic access log files can be stored and queried efficiently with blockchain. Beyond this, our protocol could potentially be applied to other types of health data, such as electronic health records.
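
The two query strategies described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation and it does not talk to a MultiChain node: `ToyStream` is a hypothetical in-memory stand-in for a data stream, the field names in `FIELDS` are assumed for illustration, and the `field=value` key scheme is only one plausible way to use stream keys. `query_table` mimics the challenge-style approach (copy everything into local memory, then filter), while `query_by_keys` mimics an index-style approach that looks items up by key, loosely in the spirit of a per-key lookup such as MultiChain's `liststreamkeyitems`.

```python
# Hypothetical log fields; the abstract does not list the real schema.
FIELDS = ("timestamp", "node", "user", "activity", "resource")


class ToyStream:
    """In-memory stand-in for a data stream: append-only items with keys."""

    def __init__(self):
        self.items = []      # every published item, in publication order
        self._by_key = {}    # key -> positions of the items carrying that key

    def publish(self, keys, data):
        position = len(self.items)
        self.items.append({"keys": list(keys), "data": dict(data)})
        for key in keys:
            self._by_key.setdefault(key, []).append(position)

    def positions_for_key(self, key):
        return list(self._by_key.get(key, []))


def log_access(stream, entry):
    """Publish one log entry, keyed by every field except the timestamp."""
    keys = [f"{field}={entry[field]}" for field in FIELDS if field != "timestamp"]
    stream.publish(keys, entry)


def load_table(stream):
    """Challenge-style: copy the whole stream into a local table."""
    return [item["data"] for item in stream.items]


def query_table(table, **conditions):
    """Filter the local table; each row must match every condition."""
    return [row for row in table
            if all(row[field] == value for field, value in conditions.items())]


def query_by_keys(stream, **conditions):
    """Index-style: look up each key and intersect the matching positions."""
    if not conditions:
        return [item["data"] for item in stream.items]
    position_sets = [set(stream.positions_for_key(f"{field}={value}"))
                     for field, value in conditions.items()]
    hits = set.intersection(*position_sets)
    return [stream.items[position]["data"] for position in sorted(hits)]


stream = ToyStream()
for entry in [
    {"timestamp": 1, "node": "n1", "user": "u1", "activity": "view", "resource": "r1"},
    {"timestamp": 2, "node": "n2", "user": "u2", "activity": "view", "resource": "r1"},
    {"timestamp": 3, "node": "n1", "user": "u1", "activity": "download", "resource": "r2"},
]:
    log_access(stream, entry)

local_hits = query_table(load_table(stream), user="u1", activity="view")
keyed_hits = query_by_keys(stream, user="u1", activity="view")
```

Both calls return the same single entry here. The sketch only shows where the trade-off comes from: the first approach holds a copy of every entry locally, whereas the second touches only the items that carry the requested keys.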

Updated: 2020-07-21