Speeding Up Data Manipulation Tasks with Alternative Implementations
ACM Transactions on Software Engineering and Methodology (IF 4.4), Pub Date: 2021-07-23, DOI: 10.1145/3456873
Yida Tao 1 , Shan Tang 1 , Yepang Liu 2 , Zhiwu Xu 1 , Shengchao Qin 3

As data volume and complexity grow at an unprecedented rate, the performance of data manipulation programs is becoming a major concern for developers. In this article, we study how alternative API choices could improve data manipulation performance while preserving task-specific input/output equivalence. We propose a lightweight approach that leverages the comparative structures in Q&A sites to extract alternative implementations. On a large dataset of Stack Overflow posts, our approach extracts 5,080 pairs of alternative implementations that invoke different data manipulation APIs to solve the same tasks, with an accuracy of 86%. Experiments show that for 15% of the extracted pairs, the faster implementation achieved >10x speedup over its slower alternative. We also characterize 68 recurring alternative API pairs from the extraction results to understand the types of APIs that can be used alternatively. To put these findings into practice, we implement a tool, AlterApi, to automatically optimize real-world data manipulation programs. Of the 1,267 optimization attempts on the Kaggle dataset, 76% achieved desirable performance improvements, with speedups of up to orders of magnitude. Finally, we discuss notable challenges of using alternative APIs for optimizing data manipulation programs. We hope that our study offers a new perspective on API recommendation and automatic performance optimization.
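
To make the idea of "alternative implementations" concrete, the following is a minimal illustrative sketch in Python. It is not taken from the article and is not how AlterApi works. It assumes a toy task (counting value frequencies) solved with two different standard-library APIs, checks input/output equivalence on a few sample inputs, and measures the relative running time. The names `count_slow`, `count_fast`, `equivalent_on`, and `speedup` are hypothetical and chosen only for this example.

```python
import timeit
from collections import Counter


def count_slow(values):
    # Hypothetical "slower" implementation: one full pass per distinct value.
    return {v: values.count(v) for v in set(values)}


def count_fast(values):
    # Hypothetical "faster" alternative: a single pass with a different API.
    return dict(Counter(values))


def equivalent_on(inputs, impl_a, impl_b):
    # Task-specific equivalence: both implementations must return the same
    # output on every sample input (this is a check, not a proof).
    return all(impl_a(x) == impl_b(x) for x in inputs)


def speedup(impl_slow, impl_fast, data, repeat=3):
    # Ratio of the best observed running times; the value depends on the
    # machine and on the shape of the input data.
    t_slow = min(timeit.repeat(lambda: impl_slow(data), number=1, repeat=repeat))
    t_fast = min(timeit.repeat(lambda: impl_fast(data), number=1, repeat=repeat))
    return t_slow / t_fast


sample_inputs = [[], [1], [3, 1, 3, 2, 1, 3], list(range(50)) * 4]
data = [i % 500 for i in range(20_000)]

# Only compare performance once the outputs agree on the sample inputs.
if equivalent_on(sample_inputs + [data], count_slow, count_fast):
    print(f"speedup: {speedup(count_slow, count_fast, data):.1f}x")
```

The measured ratio varies with input size and the number of distinct values, so equivalence and speedup in a sketch like this hold only for the inputs actually tested.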

Updated: 2021-07-23