For real: a thorough look at numeric attributes in subgroup discovery
Data Mining and Knowledge Discovery (IF 2.8), Pub Date: 2020-09-21, DOI: 10.1007/s10618-020-00703-x
Marvin Meeng, Arno Knobbe

Subgroup discovery (SD) is an exploratory pattern mining paradigm that comes into its own when dealing with large real-world data, which typically involves many attributes, of a mixture of data types. Essential is the ability to deal with numeric attributes, whether they concern the target (a regression setting) or the description attributes (by which subgroups are identified). Various specific algorithms have been proposed in the literature for both cases, but a systematic review of the available options is missing. This paper presents a generic framework that can be instantiated in various ways in order to create different strategies for dealing with numeric data. The bulk of the work in this paper describes an experimental comparison of a considerable range of numeric strategies in SD, where these strategies are organised according to four central dimensions. These experiments are furthermore repeated for both the classification task (target is nominal) and regression task (target is numeric), and the strategies are compared based on the quality of the top subgroup, and the quality and redundancy of the top-k result set. Results of three search strategies are compared: traditional beam search, complete search, and a variant of diverse subgroup set discovery called cover-based subgroup selection. Although there are various subtleties in the outcome of the experiments, the following general conclusions can be drawn: it is often best to determine numeric thresholds dynamically (locally), in a fine-grained manner, with binary splits, while considering multiple candidate thresholds per attribute.
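The general conclusion above (local thresholds, fine-grained, binary splits, several candidates per attribute) can be illustrated with a minimal sketch. It is not the authors' framework or implementation. It assumes a binary target and uses weighted relative accuracy (WRAcc) as the quality measure, which the abstract does not name. The function names `wracc` and `candidate_splits` and the parameter `top_n` are hypothetical.

```python
from typing import List, Tuple


def wracc(covered: List[bool], target: List[bool]) -> float:
    """Weighted relative accuracy of a subgroup for a binary target.

    Assumption: WRAcc is used here only as an example quality measure.
    """
    n = len(target)
    n_cov = sum(covered)
    if n == 0 or n_cov == 0:
        return 0.0
    p_all = sum(target) / n
    p_sub = sum(t for c, t in zip(covered, target) if c) / n_cov
    return (n_cov / n) * (p_sub - p_all)


def candidate_splits(
    values: List[float],
    target: List[bool],
    members: List[bool],
    top_n: int = 3,
) -> List[Tuple[float, str, float]]:
    """Score binary splits on one numeric attribute inside a subgroup.

    Local (dynamic): thresholds come only from records in `members`,
    the subgroup being refined, not from the whole dataset.
    Fine-grained: every distinct value is a candidate threshold.
    Binary: each threshold yields the conditions 'x <= t' and 'x >= t'.
    Multiple candidates: the best `top_n` splits are returned, not one.
    """
    thresholds = sorted({v for v, m in zip(values, members) if m})
    scored = []
    for t in thresholds:
        for op in ("<=", ">="):
            covered = [
                m and (v <= t if op == "<=" else v >= t)
                for v, m in zip(values, members)
            ]
            scored.append((wracc(covered, target), op, t))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    target = [True, True, True, False, False, False]
    everyone = [True] * len(values)
    for quality, op, t in candidate_splits(values, target, everyone):
        print(f"x {op} {t}: WRAcc = {quality:.3f}")
```

In a beam search, each returned candidate would be turned into a refined subgroup and the same function called again with the refined membership, so thresholds are recomputed at every search level. A global (pre-discretised) strategy would instead compute `thresholds` once from the full dataset.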




Updated: 2020-09-22