A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training
Journal of Big Data (IF 8.1), Pub Date: 2019-10-31, DOI: 10.1186/s40537-019-0259-3
Sumedh Yadav , Mathis Bode

A scalable graphical method is presented for selecting and partitioning datasets for the training phase of a classification task. The heuristic first applies a clustering algorithm so that its computational cost stays in reasonable proportion to the training task itself. This step is followed by the construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method consists of two approaches, one for reducing a given training set and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is a significant reduction in training run-time without compromising prediction accuracy. Test results show that both approaches significantly speed up training compared against the state-of-the-art shrinking heuristics available in LIBSVM, while closely matching or even exceeding their prediction accuracy. A network design is also presented for a partitioning-based distributed training formulation, which yields an additional speed-up in training run-time over the serial implementation of the approaches.
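
The abstract outlines a cluster-then-graph pipeline: a cheap clustering step, a nearest-neighbor "information graph" of the classification pattern, reduction of the training set to its informative points, and SVM training on the reduced set. The Python sketch below illustrates that flow under simplifying assumptions, and is not the authors' heuristic: it uses exact k-NN in place of an approximate nearest-neighbor method, MiniBatchKMeans as the clustering step, and a simple neighbor-disagreement rule (points whose neighbors carry mixed class labels are kept as boundary points) as a stand-in selection criterion.

```python
# Hypothetical reduce-then-train sketch; the selection rule and clustering
# choices are illustrative assumptions, not the paper's exact method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Step 1: cheap clustering; here the cluster labels are only used later to
# keep a small sample of "interior" points from every region of the data.
cluster_ids = MiniBatchKMeans(n_clusters=50, n_init=3,
                              random_state=0).fit_predict(X_train)

# Step 2: k-NN graph of the classification pattern (exact k-NN stands in for
# the approximate nearest-neighbor search). Points whose neighbors disagree
# on the class label are treated as informative boundary points.
k = 10
nn = NearestNeighbors(n_neighbors=k).fit(X_train)
_, nbr_idx = nn.kneighbors(X_train)
disagreement = (y_train[nbr_idx] != y_train[:, None]).mean(axis=1)
boundary_mask = disagreement > 0.1

# Step 3: reduce the training set, keeping all boundary points plus a random
# 10% sample of interior points per cluster to preserve coverage.
keep = boundary_mask.copy()
for c in np.unique(cluster_ids):
    interior = np.where((cluster_ids == c) & ~boundary_mask)[0]
    if interior.size:
        keep[rng.choice(interior, size=max(1, interior.size // 10),
                        replace=False)] = True
X_red, y_red = X_train[keep], y_train[keep]
print(f"reduced {len(X_train)} -> {len(X_red)} training points")

# Step 4: train an SVM (scikit-learn's SVC wraps LIBSVM) on the reduced set
# and compare its test accuracy against a full-set baseline.
acc_red = SVC(kernel="rbf").fit(X_red, y_red).score(X_test, y_test)
acc_full = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)
print(f"accuracy reduced={acc_red:.3f} full={acc_full:.3f}")
```

On a synthetic problem like this, the reduced set is typically a fraction of the original training data, so the final SVM fit is correspondingly cheaper; how much accuracy is retained depends on the selection thresholds, which here are arbitrary illustrative values.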

Updated: 2019-10-31