research-article

Speeding Up Data Manipulation Tasks with Alternative Implementations: An Exploratory Study

Published: 23 July 2021

Abstract

As data volume and complexity grow at an unprecedented rate, the performance of data manipulation programs is becoming a major concern for developers. In this article, we study how alternative API choices could improve data manipulation performance while preserving task-specific input/output equivalence. We propose a lightweight approach that leverages the comparative structures in Q&A sites to extract alternative implementations. On a large dataset of Stack Overflow posts, our approach extracts 5,080 pairs of alternative implementations that invoke different data manipulation APIs to solve the same tasks, with an accuracy of 86%. Experiments show that for 15% of the extracted pairs, the faster implementation achieved >10x speedup over its slower alternative. We also characterize 68 recurring alternative API pairs from the extraction results to understand the types of APIs that can be used interchangeably. To put these findings into practice, we implement a tool, AlterApi, to automatically optimize real-world data manipulation programs. In the 1,267 optimization attempts on the Kaggle dataset, 76% achieved desirable performance improvements with up to orders-of-magnitude speedup. Finally, we discuss notable challenges of using alternative APIs for optimizing data manipulation programs. We hope that our study offers a new perspective on API recommendation and automatic performance optimization.
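
To make the notion of "alternative implementations" concrete, the following is a minimal sketch, not taken from the article or from its tool. It assumes a hypothetical task (return the k smallest values of a list) solved with two different standard-library APIs, checks input/output equivalence on a few sample inputs only, and measures the timing ratio with `timeit`. All function names here are illustrative assumptions, and the measured ratio will vary by machine and input.

```python
import heapq
import random
import timeit


def k_smallest_sort(values, k):
    """Implementation A (hypothetical): sort everything, then slice."""
    return sorted(values)[:k]


def k_smallest_heap(values, k):
    """Implementation B (hypothetical): heap-based selection."""
    return heapq.nsmallest(k, values)


def equivalent_on(inputs, impl_a, impl_b):
    """Check input/output equivalence on the given sample inputs only.

    Passing this check is evidence, not proof, of equivalence.
    """
    return all(impl_a(*args) == impl_b(*args) for args in inputs)


def speedup(impl_slow, impl_fast, args, repeat=5, number=10):
    """Ratio of best-of-`repeat` timings; a value > 1 means impl_fast was faster."""
    t_slow = min(timeit.repeat(lambda: impl_slow(*args), repeat=repeat, number=number))
    t_fast = min(timeit.repeat(lambda: impl_fast(*args), repeat=repeat, number=number))
    return t_slow / t_fast


random.seed(0)
data = [random.random() for _ in range(100_000)]
samples = [(data, 10), ([3, 1, 2], 2), ([], 5)]

same_output = equivalent_on(samples, k_smallest_sort, k_smallest_heap)
ratio = speedup(k_smallest_sort, k_smallest_heap, (data, 10))
print(f"equivalent on samples: {same_output}, measured ratio: {ratio:.2f}x")
```

Equivalence here is task-specific in the abstract's sense: the two functions agree for this task and these inputs, which does not imply that `sorted` and `heapq.nsmallest` are interchangeable in general.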

    • Published in

      ACM Transactions on Software Engineering and Methodology, Volume 30, Issue 4
      Continuous Special Section: AI and SE
      October 2021
      613 pages
      ISSN: 1049-331X
      EISSN: 1557-7392
      DOI: 10.1145/3461694
      Editor: Mauro Pezzè

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 July 2021
      • Revised: 1 March 2021
      • Accepted: 1 March 2021
      • Received: 1 September 2020

      Qualifiers

      • research-article
      • Research
      • Refereed