Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale
arXiv - CS - Hardware Architecture. Pub Date: 2021-05-26, DOI: arxiv-2105.12676. Zhaoxia (Summer) Deng, Jongsoo Park, Ping Tak Peter Tang, Haixin Liu, Jie (Amy) Yang, Hector Yuen, Jianyu Huang, Daya Khudia, Xiaohan Wei, Ellie Wen, Dhruv Choudhary, Raghuraman Krishnamoorthi, Carole-Jean Wu, Satish Nadathur, Changkyu Kim, Maxim Naumov, Sam Naghshineh, Mikhail Smelyanskiy
The tremendous success of machine learning (ML) and the unabated growth in ML
model complexity have motivated many ML-specific designs in both CPU and
accelerator architectures to speed up model inference. While these
architectures are diverse, highly optimized low-precision arithmetic is a
component shared by most. These architectures indeed often exhibit impressive
compute throughput on benchmark ML models. Nevertheless, production models
such as the recommendation systems important to Facebook's personalization
services are demanding and complex: these systems must serve billions of
users per month responsively with low latency while maintaining high
prediction accuracy, notwithstanding computations with many tens of billions
of parameters per inference. Do these low-precision architectures work well
with our production recommendation systems? They do. But not without
significant effort. In this paper we share our search strategies for adapting
reference recommendation models to low-precision hardware, our optimization
of low-precision compute kernels, and the design and development of a tool
chain that maintains our models' accuracy throughout their lifespan, during
which topic trends and users' interests inevitably evolve. Practicing these
low-precision technologies helped us save datacenter capacity while deploying
models with up to 5X the complexity that would otherwise not be deployable on
traditional general-purpose CPUs. We believe these lessons from the trenches
promote better co-design between hardware architecture and software
engineering and advance the state of the art of ML in industry.
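The abstract does not detail how the models are adapted to low-precision hardware, but a common technique for memory-bound recommendation inference is row-wise asymmetric 8-bit quantization of embedding tables, where each row carries its own scale and bias. The sketch below is illustrative only (NumPy, all names are ours, not the paper's) and shows the basic quantize/lookup/dequantize round trip, not the fused production kernels the authors describe.

```python
import numpy as np

def quantize_rowwise_int8(table):
    """Row-wise asymmetric 8-bit quantization of an embedding table.

    Each row gets its own scale and bias so that an outlier row does not
    degrade the precision of every other row.
    """
    mins = table.min(axis=1, keepdims=True)
    maxs = table.max(axis=1, keepdims=True)
    scales = (maxs - mins) / 255.0
    scales = np.where(scales == 0, 1.0, scales)  # guard constant rows
    q = np.clip(np.round((table - mins) / scales), 0, 255).astype(np.uint8)
    return q, scales.astype(np.float32), mins.astype(np.float32)

def embedding_lookup_dequant(q, scales, mins, indices):
    """Gather quantized rows and dequantize on the fly, as a fused
    low-precision lookup kernel would."""
    rows = q[indices].astype(np.float32)
    return rows * scales[indices] + mins[indices]

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 64)).astype(np.float32)
q, s, b = quantize_rowwise_int8(table)
approx = embedding_lookup_dequant(q, s, b, np.array([3, 17, 999]))
exact = table[[3, 17, 999]]
# Per-element quantization error is bounded by about half a quantization
# step, i.e. scale/2 for that row.
max_err = np.abs(approx - exact).max()
```

Storing uint8 codes plus a float32 scale and bias per row cuts embedding-table memory roughly 4x versus float32, which is one way such systems fit larger (here, "up to 5X complexity") models into the same datacenter capacity.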
Updated: 2021-05-27