Feature Engineering for Scalable Application-Level Post-Silicon Debugging, arXiv - CS - Hardware Architecture


arXiv - CS - Hardware Architecture, Pub Date: 2021-02-08, DOI: arxiv-2102.04554
Debjit Pal, Shobha Vasudevan

We present systematic and efficient solutions for both observability enhancement and root-cause diagnosis in post-silicon validation of System-on-Chips (SoCs) with diverse usage scenarios. We model the specification of interacting flows in typical applications for message selection. Our message selection method optimizes flow specification coverage and trace buffer utilization. We define the diagnosis problem as identifying buggy traces as outliers and bug-free traces as inliers (normal behavior), and we use unsupervised learning algorithms for outlier detection. Instead of applying machine learning algorithms directly to trace data with the signals as raw features, we use feature engineering to transform raw features into more sophisticated features using domain-specific operations. The engineered features are highly relevant to the diagnosis task and generic enough to be applied across any hardware design. We present debugging and root-cause analysis of subtle post-silicon bugs in the industry-scale OpenSPARC T2 SoC. We achieve a trace buffer utilization of 98.96% with an average flow specification coverage of 94.3%. Our diagnosis method diagnosed up to 66.7% more bugs and took up to 847× less diagnosis time than manual debugging, with a diagnosis precision of 0.769.
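
The diagnosis idea in the abstract (engineer features from traces, then treat buggy traces as outliers) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' method: the trace format `(cycle, message)`, the message names, the two engineered features (per-message counts and the largest gap between consecutive messages), and the median/MAD scoring rule, which stands in for whatever unsupervised outlier detectors the paper actually uses.

```python
from statistics import median


def engineer_features(trace, message_types):
    """Map a raw trace [(cycle, message), ...] to a fixed-length feature vector.

    Hypothetical features: one count per message type, plus the largest
    cycle gap between consecutive messages in the trace.
    """
    counts = [sum(1 for _, msg in trace if msg == t) for t in message_types]
    cycles = [cycle for cycle, _ in trace]
    gaps = [b - a for a, b in zip(cycles, cycles[1:])]
    return counts + [max(gaps) if gaps else 0]


def outlier_scores(vectors):
    """Score each vector by its largest robust z-score over all features.

    Uses the median and the median absolute deviation (MAD) per feature.
    When the MAD is zero, a scale of 1.0 is used to avoid division by zero.
    """
    scores = [0.0] * len(vectors)
    for j in range(len(vectors[0])):
        column = [v[j] for v in vectors]
        med = median(column)
        mad = median(abs(x - med) for x in column)
        scale = mad if mad > 0 else 1.0
        for i, x in enumerate(column):
            scores[i] = max(scores[i], abs(x - med) / scale)
    return scores


def flag_outliers(traces, message_types, threshold=3.5):
    """Return indices of traces whose outlier score exceeds the threshold."""
    vectors = [engineer_features(t, message_types) for t in traces]
    return [i for i, s in enumerate(outlier_scores(vectors)) if s > threshold]
```

Calling `flag_outliers(traces, ["req", "ack", "data"])` on a set of mostly similar traces returns the indices of traces that deviate strongly in at least one engineered feature, for example a trace with a missing acknowledgement and an unusually long stall. The threshold of 3.5 is an arbitrary illustrative default and would need tuning on real trace data.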


Updated: 2021-02-10
