High-Performance Training by Exploiting Hot-Embeddings in Recommendation Systems
arXiv - CS - Hardware Architecture Pub Date : 2021-03-01 , DOI: arxiv-2103.00686
Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, Prashant J. Nair

Recommendation models are widely used learning models that suggest relevant items to users in e-commerce and online-advertising applications. Current recommendation models include deep-learning-based (DLRM) and time-based sequence (TBSM) models. These models use massive embedding tables to store numerical representations of items' and users' categorical features (memory-bound) while also using neural networks to generate outputs (compute-bound). Due to these conflicting compute and memory requirements, the training process for recommendation models is divided across CPU and GPU for embedding and neural network executions, respectively. Such a training process naively assigns the same level of importance to each embedding entry. This paper observes that some training inputs and their accesses into the embedding tables are heavily skewed, with certain entries being accessed up to 10000x more often than others. This paper leverages skewed embedding table accesses to use GPU resources efficiently during training. To this end, this paper proposes a Frequently Accessed Embeddings (FAE) framework that exposes a dynamic knob to the software based on the GPU memory capacity and the input popularity index. This framework efficiently estimates and varies the size of the hot portions of the embedding tables kept within GPUs and allocates the rest of the embeddings on the CPU. Overall, our framework speeds up the training of recommendation models on the Kaggle, Terabyte, and Alibaba datasets by 2.34x compared to a baseline that uses Intel Xeon CPUs and Nvidia Tesla V100 GPUs, while maintaining accuracy.
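The core idea, selecting the most frequently accessed embedding rows that fit within a GPU memory budget and leaving the rest on the CPU, can be sketched as follows. This is an illustrative simplification under assumed names (`partition_hot_embeddings`, a flat byte budget), not the paper's actual FAE implementation, which sizes the hot set dynamically from the popularity index:

```python
import numpy as np

def partition_hot_embeddings(access_counts, embedding_dim, gpu_budget_bytes,
                             bytes_per_float=4):
    """Sketch: split embedding rows into a GPU-resident 'hot' set and a
    CPU-resident 'cold' set based on observed access frequency."""
    row_bytes = embedding_dim * bytes_per_float
    max_hot_rows = gpu_budget_bytes // row_bytes
    # Rank row indices by access frequency, most popular first.
    order = np.argsort(access_counts)[::-1]
    hot = set(order[:max_hot_rows].tolist())   # cached on GPU
    cold = set(order[max_hot_rows:].tolist())  # stay on CPU
    return hot, cold

# Toy access profile: rows 1, 6, and 3 dominate (heavily skewed).
counts = np.array([5, 10000, 3, 800, 1, 2, 950, 7])
# Budget holds three 16-dim float32 rows.
hot, cold = partition_hot_embeddings(counts, embedding_dim=16,
                                     gpu_budget_bytes=3 * 16 * 4)
print(hot)   # the three hottest rows: {1, 3, 6}
```

In the actual framework this split is re-estimated as the popularity profile and available GPU capacity change, so the "knob" is the fraction of table entries granted GPU residency.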

Updated: 2021-03-02