G-Tric: generating three-way synthetic datasets with triclustering solutions
BMC Bioinformatics (IF 3), Pub Date: 2021-01-07, DOI: 10.1186/s12859-020-03925-4
João Lobo, Rui Henriques, Sara C Madeira

Three-way data have gained popularity because of their growing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions over time, urban dynamics, or complex geophysical phenomena. Triclustering, the subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) whose values are correlated across the three dimensions (observations × features × contexts). With an increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed on real data without a known ground truth, which limits the assessments. In this context, we propose a synthetic data generator, G-Tric, which allows the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is designed to create datasets resembling real three-way data from biomedical and social data domains, with the additional advantage of providing the ground truth (triclustering solution) as output.

G-Tric can replicate real-world datasets and create new ones that match researchers' needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled by defining the amount of missing values, noise, or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters.

Triclustering evaluation using G-Tric makes it possible to combine intrinsic and extrinsic metrics when comparing solutions, producing more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties, was generated and made available, highlighting G-Tric's potential to advance the triclustering state of the art by easing the evaluation of new triclustering approaches.
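
To make the idea of a planted tricluster and an extrinsic metric concrete, the following is a minimal sketch in plain Python. It is not G-Tric's interface, which the abstract does not describe. The function names (`generate_dataset`, `cells`, `jaccard`), the uniform background, the constant pattern, the Gaussian noise model, and the use of `None` for missing cells are all illustrative assumptions.

```python
import random


def generate_dataset(n_obs, n_feat, n_ctx, tricluster_shape, value=5.0,
                     noise_sd=0.0, missing_rate=0.0, seed=0):
    """Return (data, planted): a 3-way dataset and its planted tricluster.

    Hypothetical helper for illustration only; not part of G-Tric.
    """
    rng = random.Random(seed)
    # Background: every cell drawn from a uniform distribution on [0, 1).
    data = [[[rng.random() for _ in range(n_ctx)]
             for _ in range(n_feat)]
            for _ in range(n_obs)]
    # Choose the subspace: a random subset of indices along each dimension.
    k_obs, k_feat, k_ctx = tricluster_shape
    planted = (sorted(rng.sample(range(n_obs), k_obs)),
               sorted(rng.sample(range(n_feat), k_feat)),
               sorted(rng.sample(range(n_ctx), k_ctx)))
    # Plant a constant pattern, optionally perturbed by Gaussian noise.
    for i in planted[0]:
        for j in planted[1]:
            for k in planted[2]:
                data[i][j][k] = value + rng.gauss(0.0, noise_sd)
    # Degrade data quality: mark a fraction of cells as missing (None).
    for i in range(n_obs):
        for j in range(n_feat):
            for k in range(n_ctx):
                if rng.random() < missing_rate:
                    data[i][j][k] = None
    return data, planted


def cells(tricluster):
    """Expand (observations, features, contexts) into a set of cell indices."""
    obs, feat, ctx = tricluster
    return {(i, j, k) for i in obs for j in feat for k in ctx}


def jaccard(found, truth):
    """Extrinsic score: overlap between a found and a planted tricluster."""
    a, b = cells(found), cells(truth)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    data, planted = generate_dataset(20, 10, 5, (4, 3, 2),
                                     noise_sd=0.1, missing_rate=0.05)
    print("planted tricluster:", planted)
    print("score of a perfect recovery:", jaccard(planted, planted))
```

Because the generator returns the planted subspace alongside the data, the output of any triclustering algorithm can be scored against it directly. That is the extrinsic side of the evaluation, and it is exactly what real data without a ground truth cannot offer.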

Updated: 2021-01-07