Parallel and Distributed Powerset Generation Using Big Data Processing, Applied Artificial Intelligence



Parallel and Distributed Powerset Generation Using Big Data Processing
Applied Artificial Intelligence (IF 2.9), Pub Date: 2019-10-01, DOI: 10.1080/08839514.2019.1665262
Youssef M. Essa ₁, Ahmed El-Mahalawy ₂, Gamal Attiya ₂, Ayman El-Sayed ₂


ABSTRACT: Data mining algorithms are increasingly important because they give stakeholders a 360-degree view of their customers. The powerset has recently become a core building block for many algorithms and techniques across data mining domains, since it provides optimal solutions to many data mining problems. It is nevertheless hard to use in many cases because the size of the powerset grows exponentially with the number of elements in the input set. Constructing the powerset of a huge dataset on a single machine causes an out-of-memory exception. From a business perspective, enterprises running very large data projects therefore have to invest heavily in high-performance infrastructure for powerset generation, and invest further in a standby system that keeps the service running if the high-performance machines fail. Furthermore, existing powerset techniques are designed for structured data and are of little use for intensive processing on in-memory unstructured data stores. This paper therefore tackles most of the problems that hinder deploying the powerset algorithm on Big Data and presents a series of pruning techniques that can greatly improve the efficiency of powerset construction. The approach lets enterprises explore huge data volumes, gain business insights in near real time, and reduce infrastructure cost.
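To make the exponential growth concrete, here is a minimal Python sketch. It is not the paper's algorithm, and it does not implement the pruning techniques the abstract mentions. It only illustrates two general ideas: generating subsets lazily so they are never all held in memory at once, and splitting the 2^n subset indices into contiguous ranges that independent workers could process. The function names `powerset`, `powerset_slice`, and `split_ranges` are hypothetical and chosen for illustration.

```python
from itertools import chain, combinations


def powerset(items):
    """Lazily yield every subset of `items` as a tuple, smallest subsets first."""
    pool = list(items)
    return chain.from_iterable(
        combinations(pool, r) for r in range(len(pool) + 1)
    )


def powerset_slice(items, start, stop):
    """Yield the subsets whose bitmask index lies in [start, stop).

    Bit i of the index decides whether pool[i] is in the subset, so any
    index range can be generated without looking at the other ranges.
    """
    pool = list(items)
    for mask in range(start, stop):
        yield tuple(x for i, x in enumerate(pool) if (mask >> i) & 1)


def split_ranges(n_items, n_workers):
    """Split the 2**n_items subset indices into contiguous ranges.

    Illustrative partitioning only (an assumption, not the paper's scheme).
    """
    total = 1 << n_items
    step = -(-total // n_workers)  # ceiling division
    return [(lo, min(lo + step, total)) for lo in range(0, total, step)]
```

Under these assumptions, each `(start, stop)` pair from `split_ranges` could be handed to a separate worker that consumes `powerset_slice` as a stream. The total amount of work is still 2^n subsets, which is why the abstract treats pruning as essential for large inputs.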


Updated: 2019-10-01
