Speeding Up Data Manipulation Tasks with Alternative Implementations: An Exploratory Study

Authors:
Yida Tao, Shenzhen University, China (ORCID: 0000-0001-6866-336X)
Shan Tang, Shenzhen University, China
Yepang Liu, Southern University of Science and Technology, China (ORCID: 0000-0001-8147-8126)
Zhiwu Xu, Shenzhen University, China (ORCID: 0000-0001-6727-440X)
Shengchao Qin, Teesside University, UK, and Shenzhen University, China (ORCID: 0000-0003-3028-8191)

ACM Transactions on Software Engineering and Methodology, Volume 30, Issue 4, Article No. 49, pp. 1–28. https://doi.org/10.1145/3456873

Published: 23 July 2021

Abstract

As data volume and complexity grow at an unprecedented rate, the performance of data manipulation programs is becoming a major concern for developers. In this article, we study how alternative API choices could improve data manipulation performance while preserving task-specific input/output equivalence. We propose a lightweight approach that leverages the comparative structures in Q&A sites to extract alternative implementations. On a large dataset of Stack Overflow posts, our approach extracts 5,080 pairs of alternative implementations that invoke different data manipulation APIs to solve the same tasks, with an accuracy of 86%. Experiments show that for 15% of the extracted pairs, the faster implementation achieved a speedup of more than 10x over its slower alternative. We also characterize 68 recurring alternative API pairs from the extraction results to understand the types of APIs that can be used alternatively. To put these findings into practice, we implement a tool, AlterApi, to automatically optimize real-world data manipulation programs. In the 1,267 optimization attempts on the Kaggle dataset, 76% achieved desirable performance improvements with up to orders-of-magnitude speedup. Finally, we discuss notable challenges of using alternative APIs for optimizing data manipulation programs. We hope that our study offers a new perspective on API recommendation and automatic performance optimization.
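To make the notion of "alternative implementations" concrete, the following is a minimal sketch, not the authors' tooling. It compares two ways of solving the same task (counting occurrences), checks that they agree on a handful of sample inputs, and measures their relative runtime with `timeit`. The task, the function names (`count_with_loop`, `count_with_counter`, `equivalent_on`, `speedup`), and the use of standard-library APIs in place of the data manipulation libraries studied in the article are all assumptions made for illustration.

```python
import timeit
from collections import Counter


def count_with_loop(values):
    """Count occurrences with an explicit Python loop (hypothetical baseline)."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts


def count_with_counter(values):
    """Count occurrences with collections.Counter (hypothetical alternative)."""
    return dict(Counter(values))


def equivalent_on(inputs, impl_a, impl_b):
    """Check input/output equivalence on sample inputs only.

    Agreement on samples is evidence of task-specific equivalence,
    not a proof that the two implementations always agree.
    """
    return all(impl_a(x) == impl_b(x) for x in inputs)


def speedup(impl_slow, impl_fast, data, repeat=5, number=20):
    """Return the best-of-`repeat` time of impl_slow divided by that of impl_fast."""
    t_slow = min(timeit.repeat(lambda: impl_slow(data), repeat=repeat, number=number))
    t_fast = min(timeit.repeat(lambda: impl_fast(data), repeat=repeat, number=number))
    return t_slow / t_fast


if __name__ == "__main__":
    data = [i % 97 for i in range(100_000)]
    samples = [[], [1], [1, 1, 2], data]
    # Only compare runtimes once the two implementations agree on the samples.
    if equivalent_on(samples, count_with_loop, count_with_counter):
        ratio = speedup(count_with_loop, count_with_counter, data)
        print(f"loop / Counter time ratio: {ratio:.2f}x")
```

The measured ratio depends on the input size, the Python version, and the machine, so no particular figure is implied here.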

Index Terms

Software and its engineering → Software notations and tools → Software libraries and repositories

Published in
ACM Transactions on Software Engineering and Methodology Volume 30, Issue 4
Continuous Special Section: AI and SE
October 2021
613 pages
ISSN:1049-331X
EISSN:1557-7392
DOI:10.1145/3461694
Editor:
Mauro Pezzè
Università della Svizzera italiana and Università di Milano-Bicocca, Switzerland
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 July 2021
- Revised: 1 March 2021
- Accepted: 1 March 2021
- Received: 1 September 2020

Author Tags
API selection
data manipulation
empirical study
mining software repository
performance optimization
Qualifiers
- research-article
- Research
- Refereed