Evaluating the Effects of Modern Storage Devices on the Efficiency of Parallel Machine Learning Algorithms, International Journal on Artificial Intelligence Tools



International Journal on Artificial Intelligence Tools (IF 1.0), Pub Date: 2020-06-17, DOI: 10.1142/s0218213020600088
Leonidas Akritidis, Athanasios Fevgas, Panagiota Tsompanopoulou, Panayiotis Bozanis


Big Data analytics is currently one of the fastest-growing research areas for both organizations and enterprises. The need to deploy efficient machine learning algorithms over huge amounts of data has led to the development of parallelization frameworks and of specialized libraries (such as Mahout and MLlib) that implement the most important of these algorithms. Moreover, recent advances in storage technology have introduced high-performance devices broadly known as Solid State Drives (SSDs). Compared to traditional Hard Disk Drives (HDDs), SSDs offer considerably higher performance and lower power consumption. Motivated by these appealing features and the growing need for efficient large-scale data processing, we compare the performance of several machine learning algorithms on MapReduce clusters whose nodes are equipped with HDDs, SSDs, and devices that implement the latest 3D XPoint technology. In particular, we evaluate several dataset preprocessing methods, such as vectorization and dimensionality reduction; two supervised classifiers, Naive Bayes and Linear Regression; and the popular k-Means clustering algorithm. We use an experimental cluster equipped with the three aforementioned storage devices under different configurations, and two large datasets, Wikipedia and HIGGS. The experiments show that the benefits of using SSDs depend on the cluster setup and the nature of the applied algorithms.


Updated: 2020-06-17
