On Divide&Conquer in Image Processing of Data Monster
Big Data Research (IF 3.3) Pub Date: 2021-02-11, DOI: 10.1016/j.bdr.2021.100214
Hermann Heßling , Marco Strutz , Elsa Irmgard Buchholz , Peter Hufnagl

The steadily improving resolution power of sensors results in larger and larger data objects, which cannot be analysed in a reasonable amount of time on a single workstation. To speed up the analysis, the Divide and Conquer method can be used: a (large) data object is split into smaller pieces, each piece is analysed on a single node, and finally the partial results are collected and combined. We apply this method to the validated bio-medical framework Ki67-Analysis, which determines the number of cancer cells in high-resolution images from breast examinations.
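The splitting and recombination step can be sketched in a few lines. The following Python sketch is illustrative only: analyse_tile is a hypothetical stand-in for the Ki67-Analysis framework (here it merely thresholds pixel brightness), and the tiling ignores the overlap and border handling a real pipeline would need.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def split_into_tiles(image, tile_size):
    """Split a 2D image into square subtiles (edge tiles may be smaller)."""
    h, w = image.shape[:2]
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            yield image[y:y + tile_size, x:x + tile_size]

def analyse_tile(tile):
    """Stand-in for the Ki67-Analysis framework: counts dark pixels as
    'cells' and very dark pixels as 'cancer cells'; returns the pair
    (cancer cells, total cells) for one subtile."""
    return int((tile < 80).sum()), int((tile < 200).sum())

def divide_and_conquer(image, tile_size=512):
    """Analyse the subtiles in parallel and combine the partial counts."""
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(analyse_tile, split_into_tiles(image, tile_size)))
    cancer = sum(c for c, _ in results)
    total = sum(t for _, t in results)
    return cancer, total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(2048, 2048), dtype=np.uint8)
    print(divide_and_conquer(image))
```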

In previous work, we observed anomalous behaviour when the framework is applied to subtiles of an image. For each subtile, we determined a so-called Ki67-Analysis score parameter, given by the ratio of the number of identified cancer cells to the total number of cells. It turns out that the smaller the subtiles, the more this parameter is underestimated. The anomaly prevents a direct application of the Divide and Conquer method.
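In formulas: if subtile i contains c_i identified cancer cells out of t_i cells in total, its score is s_i = c_i / t_i, and the correctly recombined score of the whole image is (sum of c_i) / (sum of t_i), not the mean of the s_i. The snippet below illustrates the difference with invented counts (they are not from the paper); the anomaly described above is a separate, framework-level bias that this arithmetic does not capture.

```python
# Each tuple: (identified cancer cells, total cells) of one subtile.
tiles = [(12, 40), (3, 10), (25, 150)]

cancer = sum(c for c, _ in tiles)
total = sum(t for _, t in tiles)
global_score = cancer / total          # 40 / 200 = 0.20

# Averaging the per-tile scores instead would weight small tiles too much:
naive_mean = sum(c / t for c, t in tiles) / len(tiles)   # ~0.256

print(global_score, naive_mean)
```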

In this work, we suggest a novel grey-box testing method for understanding the origin of the anomaly. It allows us to identify a class of subtiles for which the Ki67-Analysis score parameter can be determined reasonably well, i.e. for which the Divide and Conquer method can be applied. By demanding that the framework be stable under small additive noise in brightness, "ghost cells" are identified that turn out to be an artefact of the framework.
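A minimal sketch of the stability criterion, assuming a hypothetical detect_cells() wrapper around the framework that returns the positions of detected cells in a tile (a real implementation would match detections with a small spatial tolerance rather than by exact set membership):

```python
import numpy as np

def perturb(tile, eps, rng):
    """Add small additive noise in brightness, clipped to the valid range."""
    noise = rng.uniform(-eps, eps, size=tile.shape)
    return np.clip(tile.astype(float) + noise, 0, 255).astype(tile.dtype)

def ghost_cells(tile, detect_cells, eps=1.0, seed=0):
    """Cells reported on the original tile but not on a slightly perturbed
    copy are unstable under noise -- candidate 'ghost cells'."""
    rng = np.random.default_rng(seed)
    original = set(detect_cells(tile))
    perturbed = set(detect_cells(perturb(tile, eps, rng)))
    return original - perturbed
```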

Finally, the challenge of analysing huge single data objects is considered. The upcoming observatory Square Kilometre Array (SKA) will consist of thousands of antennas and telescopes. Due to the exceptional resolution power of SKA, single images of the Universe may be as large as one Petabyte. "Data monsters" of that size cannot be analysed reasonably fast on traditional computing architectures: the relatively small throughput rates when reading data from disk are a serious bottleneck (the memory-wall problem). Memory-based computing offers a change in paradigm: the current processor-centric architecture is replaced by a memory-based architecture. Hewlett Packard Enterprise (HPE) has developed a prototype with 48 Terabytes of memory, called the Sandbox. Counting words in large files can be considered a first step towards simulating the image processing of "data monsters" at SKA. We run the big data framework Thrill on the Sandbox and determine the speedup of different setups for distributed word counting.
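Thrill expresses word counting as a map/reduce pipeline in C++. The Python sketch below mimics only the underlying Divide and Conquer pattern (split the input into chunks, count per chunk in parallel, merge the partial counters); it does not model Thrill's API or the memory architecture of the Sandbox. Varying the number of workers gives a crude analogue of the speedup measurement.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def count_chunk(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def word_count(path, workers=4, chunk_size=100_000):
    """Split the file into chunks, count in parallel, merge the results."""
    with open(path) as f:
        lines = f.readlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    total = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)   # reduce step: merge the partial counts
    return total
```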


