psc2code,ACM Transactions on Software Engineering and Methodology

psc2code
ACM Transactions on Software Engineering and Methodology (IF 6.6), Pub Date: 2020-06-01, DOI: 10.1145/3392093
Lingfeng Bao, Zhenchang Xing, Xin Xia, David Lo, Minghui Wu, Xiaohu Yang

Programming screencasts have become a pervasive resource on the Internet, helping developers learn new programming technologies and skills. The source code in programming screencasts is important and valuable information for developers. However, the streaming nature of programming screencasts (i.e., a sequence of screen-captured images) limits the ways in which developers can interact with that source code. Many studies use Optical Character Recognition (OCR) to convert screen images (also referred to as video frames) into textual content, which can then be indexed and searched easily. However, noisy screen images significantly degrade the quality of the source code extracted by OCR. Sources of noise include no-code frames (e.g., PowerPoint slides, web pages of API specifications), non-code regions (e.g., Package Explorer view, Console view), and noisy code regions that contain code in completion suggestion popups. Furthermore, because of the characteristics of code (e.g., long compound identifiers such as ItemListener), even professional OCR tools cannot extract source code from screen images without errors. Noisy OCRed source code negatively affects downstream applications, such as effective search and navigation of the source code content in programming screencasts. In this article, we propose an approach named psc2code to denoise the process of extracting source code from programming screencasts. First, psc2code leverages Convolutional Neural Network (CNN) based image classification to remove non-code and noisy-code frames. Then, psc2code performs edge detection and clustering-based image segmentation to detect sub-windows in a code frame and, based on the detected sub-windows, identifies and crops the screen region that is most likely to be a code editor.
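The sub-window step can be pictured with a small sketch. It is not the psc2code implementation: it assumes an edge detector has already produced the pixel coordinates of vertical and horizontal lines, merges nearby coordinates with a simple gap-based 1-D clustering, and then picks the largest grid cell as the candidate code editor. The function names (`cluster_1d`, `largest_subwindow`), the `max_gap` parameter, and the "largest area" heuristic are all assumptions made for illustration.

```python
from typing import List, Tuple


def cluster_1d(values: List[int], max_gap: int = 5) -> List[int]:
    """Merge coordinates that lie within max_gap of their neighbor.

    Returns one representative (the rounded mean) per cluster, in ascending order.
    """
    centers: List[int] = []
    group: List[int] = []
    for v in sorted(values):
        # A jump larger than max_gap closes the current cluster.
        if group and v - group[-1] > max_gap:
            centers.append(round(sum(group) / len(group)))
            group = []
        group.append(v)
    if group:
        centers.append(round(sum(group) / len(group)))
    return centers


def largest_subwindow(
    xs: List[int], ys: List[int], max_gap: int = 5
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest cell in the line grid.

    xs: x-coordinates of detected vertical lines (hypothetical detector output).
    ys: y-coordinates of detected horizontal lines (hypothetical detector output).
    """
    cols = cluster_1d(xs, max_gap)
    rows = cluster_1d(ys, max_gap)
    best = None
    best_area = -1
    # Each pair of consecutive boundaries delimits one candidate sub-window.
    for left, right in zip(cols, cols[1:]):
        for top, bottom in zip(rows, rows[1:]):
            area = (right - left) * (bottom - top)
            if area > best_area:
                best_area = area
                best = (left, top, right, bottom)
    if best is None:
        raise ValueError("need at least two distinct lines in each direction")
    return best


if __name__ == "__main__":
    # Made-up line coordinates for a 1280x720 frame with a side panel and a console.
    xs = [0, 2, 198, 200, 202, 1278, 1280]
    ys = [0, 2, 60, 62, 598, 600, 720]
    print(largest_subwindow(xs, ys))
```

A real pipeline would need a more careful criterion than area alone, since a large web page or slide region could win over the editor; the abstract states only that the selected region is the one most likely to be a code editor.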
Finally, psc2code calls the API of a professional OCR tool to extract source code from the cropped code regions. It then uses the OCRed cross-frame information in the programming screencast, together with a statistical language model of a large corpus of source code, to correct errors in the OCRed source code. We conduct an experiment on 1,142 programming screencasts from YouTube. We find that our CNN-based image classification technique effectively removes non-code and noisy-code frames, achieving an F1-score of 0.95 on the valid code frames. We also find that psc2code significantly improves the quality of the OCRed source code by truly correcting about half of the incorrectly OCRed words. Based on the source code denoised by psc2code, we implement two applications: (1) a programming screencast search engine; (2) an interaction-enhanced programming screencast watching tool. Based on the source code extracted from the 1,142 collected programming screencasts, our experiments show that the search engine achieves precision@5, precision@10, and precision@20 of 0.93, 0.81, and 0.63, respectively. We also conduct a user study of the interaction-enhanced watching tool with 10 participants. The study shows that the tool can help participants learn the knowledge in a programming video more efficiently and effectively.
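To make the correction idea concrete, here is a minimal sketch of how cross-frame evidence and corpus statistics might be combined to repair a single OCRed token. It is an illustration under stated assumptions, not the method from the paper: a plain token-frequency table stands in for the statistical language model, `difflib.SequenceMatcher` stands in for whatever similarity measure psc2code uses, and the names `correct_token`, `frame_tokens`, `corpus_counts`, and the 0.8 threshold are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, Iterable


def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True when two tokens are close enough to be OCR variants of each other."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def correct_token(
    token: str,
    frame_tokens: Iterable[str],
    corpus_counts: Dict[str, int],
    threshold: float = 0.8,
) -> str:
    """Return a likely correction for token, or token itself if none is found.

    frame_tokens: tokens OCRed from other frames of the same screencast (assumed input).
    corpus_counts: token frequencies from a source code corpus (assumed stand-in
                   for a statistical language model).
    """
    # A token the corpus already knows is treated as correct.
    if corpus_counts.get(token, 0) > 0:
        return token
    # Candidates are similar-looking tokens seen elsewhere in the screencast.
    votes = Counter(
        t for t in frame_tokens if t != token and is_similar(token, t, threshold)
    )
    if not votes:
        return token
    # Prefer candidates that are frequent in the corpus, then frequent across frames.
    return max(votes, key=lambda t: (corpus_counts.get(t, 0), votes[t]))


if __name__ == "__main__":
    corpus = {"ItemListener": 120, "ActionListener": 300}
    other_frames = ["ItemListener", "ItemListener", "ltemListener"]
    print(correct_token("ltemListener", other_frames, corpus))
```

Ranking by corpus frequency first means a spelling that many frames agree on but the corpus has never seen loses to a rarer but known identifier. Whether that trade-off is appropriate depends on how well the corpus covers the screencast's code.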

Last updated: 2020-06-01
