Scaling-Up Distributed Processing of Data Streams for Machine Learning
Proceedings of the IEEE (IF 23.2). Pub date: 2020-09-29. DOI: 10.1109/jproc.2020.3021381
Matthew Nokleby, Haroon Raja, Waheed U. Bajwa

Emerging applications of machine learning in numerous areas, including online social networks, remote sensing, Internet-of-Things (IoT) systems, smart grids, and more, involve continuous gathering of and learning from streams of data samples. Real-time incorporation of streaming data into the learned machine learning models is essential for improved inference in these applications. Furthermore, these applications often involve data that are either inherently gathered at geographically distributed entities for physical reasons, as in IoT systems and smart grids, or intentionally distributed across multiple computing machines for memory, storage, computational, and/or privacy reasons. Training of machine learning models in this distributed, streaming setting requires solving stochastic optimization (SO) problems in a collaborative manner over communication links between the physical entities. When the streaming data rate is high compared with the processing capabilities of individual computing entities and/or the rate of the communication links, this poses a challenging question: how can one best leverage the incoming data for distributed training of machine learning models under constraints on computing capabilities and/or communication rate? A large body of research in distributed online optimization has emerged in recent decades to tackle this and related problems. This article reviews recently developed methods for large-scale distributed SO in the compute- and bandwidth-limited regimes, with an emphasis on convergence analysis that explicitly accounts for the mismatch between computation, communication, and streaming rates and that provides sufficient conditions for order-optimal convergence. In particular, it focuses on methods that solve: 1) distributed stochastic convex problems and 2) distributed principal component analysis, a nonconvex problem whose geometric structure permits global convergence. For such methods, this article discusses recent advances in distributed algorithmic design in the face of high-rate streaming data. Furthermore, it reviews the theoretical guarantees underlying these methods, which show that there exist regimes in which systems can learn from distributed processing of streaming data at order-optimal rates, nearly as fast as if all the data were processed on a single, super-powerful machine.
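To make the first class of problems concrete, below is a minimal sketch of synchronous distributed mini-batch SGD on a streaming stochastic convex (least-squares) problem. The simulated M-node setup, the linear-regression stream, the 1/t step size, and all names are illustrative assumptions, not specifics taken from the article; averaging the M local gradients each iteration plays the role of one communication round and yields an effective mini-batch of M*B fresh samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, B, T = 10, 4, 8, 500   # dimension, nodes, per-node minibatch, iterations
w_true = rng.normal(size=d)  # hypothetical ground-truth model

def sample_stream(n):
    """Draw n fresh (x, y) pairs from the stream (linear model plus noise)."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

w = np.zeros(d)
for t in range(1, T + 1):
    grads = []
    for _ in range(M):                       # each node processes its own fresh minibatch
        X, y = sample_stream(B)
        grads.append(X.T @ (X @ w - y) / B)  # local gradient of the squared loss
    w -= (1.0 / t) * np.mean(grads, axis=0)  # averaging step = one communication round

print("estimation error:", np.linalg.norm(w - w_true))
```

Under this synchronous-averaging assumption, the variance of the averaged gradient shrinks by a factor of M, which is the mechanism behind the "nearly as fast as a single machine" speedup claims discussed in the article.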
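For the second class, here is a similarly hedged sketch of distributed streaming PCA using averaged Oja-style iterations to track the top eigenvector of the data covariance. The synthetic covariance, node count, step-size schedule, and normalization choice are again assumptions made for illustration, not the article's specific algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
d, M, B, T = 10, 4, 16, 1000
# Synthetic covariance with a dominant eigenvector (eigengap of 0.1)
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
Sigma = U @ np.diag(np.linspace(1.0, 0.1, d)) @ U.T
top = U[:, 0]

def node_samples(n):
    """Fresh zero-mean samples with covariance Sigma, as seen at one node."""
    return rng.multivariate_normal(np.zeros(d), Sigma, size=n)

w = rng.normal(size=d)
w /= np.linalg.norm(w)
for t in range(1, T + 1):
    updates = []
    for _ in range(M):                         # each node uses its own streaming minibatch
        X = node_samples(B)
        updates.append(X.T @ (X @ w) / B)      # local sample-covariance action on w
    w += (1.0 / t) * np.mean(updates, axis=0)  # averaged Oja step (one comm. round)
    w /= np.linalg.norm(w)                     # project back to the unit sphere

print("alignment with top eigenvector:", abs(w @ top))
```

Although this objective is nonconvex, its geometric structure (a single dominant eigendirection separated by an eigengap) is what permits the global convergence guarantees the article highlights.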
