Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics,arXiv - CS - Distributed, Parallel, and Cluster Computing

Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2019-09-23, DOI: arxiv-1909.10389
Matteo Migliorini, Riccardo Castellotti, Luca Canali, Marco Zanetti

The effective utilization at scale of complex machine learning (ML) techniques for HEP use cases poses several technological challenges, most importantly in the actual implementation of dedicated end-to-end data pipelines. A solution to these challenges is presented that allows training neural network classifiers using solutions from the Big Data and data science ecosystems, integrated with tools, software, and platforms common in the HEP environment. In particular, Apache Spark is exploited for data preparation and feature engineering, running the corresponding (Python) code interactively in Jupyter notebooks. Key integrations and libraries that make Spark capable of ingesting data stored in the ROOT format and accessed via the XRootD protocol are described and discussed. Training of the neural network models, defined using the Keras API, is performed in a distributed fashion on Spark clusters by using BigDL with Analytics Zoo and also by using TensorFlow, notably for distributed training on CPU and GPU resources. The implementation and the results of the distributed training are described in detail in this work.
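As a rough illustration of what training "in a distributed fashion" can mean, the sketch below shows synchronous data-parallel training in plain Python: each partition computes a gradient on its own rows, and the averaged gradient updates a shared model. This is an assumption made for illustration only. The abstract does not state which parallelization scheme is used, and the sketch does not use Spark, BigDL, Analytics Zoo, Keras, or TensorFlow. The toy one-feature logistic model, the synthetic data, and all function names (`make_partitions`, `local_gradient`, `train`, `accuracy`) are hypothetical.

```python
import math
import random


def make_partitions(n_parts=4, rows_per_part=50, seed=0):
    """Synthetic stand-in for data split across workers (hypothetical toy data)."""
    rng = random.Random(seed)
    parts = []
    for _ in range(n_parts):
        rows = []
        for _ in range(rows_per_part):
            x = rng.uniform(-1.0, 1.0)
            y = 1.0 if x > 0 else 0.0  # toy label: sign of the single feature
            rows.append((x, y))
        parts.append(rows)
    return parts


def local_gradient(w, b, rows):
    """Mean log-loss gradient of a 1-feature logistic model on one partition."""
    gw = gb = 0.0
    for x, y in rows:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += p - y
    n = len(rows)
    return gw / n, gb / n


def train(parts, epochs=200, lr=1.0):
    """Synchronous data-parallel gradient descent over equal-size partitions."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # On a real cluster each partition's gradient would be computed in parallel.
        grads = [local_gradient(w, b, rows) for rows in parts]
        gw = sum(g[0] for g in grads) / len(grads)
        gb = sum(g[1] for g in grads) / len(grads)
        w -= lr * gw
        b -= lr * gb
    return w, b


def accuracy(w, b, parts):
    """Fraction of rows whose predicted class matches the label."""
    rows = [r for part in parts for r in part]
    correct = sum((w * x + b > 0) == (y == 1.0) for x, y in rows)
    return correct / len(rows)


if __name__ == "__main__":
    partitions = make_partitions()
    w, b = train(partitions)
    print("accuracy:", accuracy(w, b, partitions))
```

Because the partitions here have equal size, averaging the per-partition gradients gives the same update as a single full-batch gradient step, which is why the result does not depend on how many partitions the rows are split into.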

Updated: 2020-06-17
