A Fault-Tolerant Distributed Framework for Asynchronous Iterative Computations
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2021-02-15, DOI: 10.1109/tpds.2021.3059420
Tian Zhou, Lixin Gao, Xiaohong Guan

Asynchronous iterative computations (AIC) are common in machine learning and data mining systems. However, the lack of synchronization barriers in asynchronous processing makes it challenging to keep processing continuously when workers may fail: there is no global synchronization point that all workers can roll back to. In this article, we propose a fault-tolerant framework for asynchronous iterative computations (FAIC). Our framework takes a virtual snapshot of the AIC system without halting the computation of any worker. We prove that the virtual snapshot captured by FAIC can recover the AIC system correctly. We evaluate our FAIC framework on two existing AIC systems, Maiter and NOMAD. Our experimental results show that the checkpoint overhead of FAIC is more than 50 percent lower than that of the synchronous checkpoint method. FAIC is around 10 percent faster than other asynchronous snapshot algorithms, such as the Chandy-Lamport algorithm. Our experiments on a large cluster demonstrate that FAIC scales with the number of workers.

Updated: 2021-03-02