Abstract
Crowd-based multimedia documents such as screencasts have emerged as a source for documenting the requirements, workflows, and implementation issues of open source and agile software projects. For example, users can show and narrate how they manipulate an application’s GUI to perform a certain function, and a bug reporter can visually explain how to trigger a bug or a security vulnerability. Unfortunately, the streaming nature of programming screencasts and their binary format limit how developers can interact with a screencast’s content. In this research, we present an automated approach for mining the multimedia content found in screencasts and linking it to relevant software artifacts, more specifically to source code. We apply LDA-based mining approaches that take as input a set of screencast artifacts, such as GUI text and spoken words, to make the screencast content accessible and searchable to users and to link it to the relevant source code artifacts. To evaluate the applicability of our approach, we report on results from case studies that we conducted on existing WordPress and Mozilla Firefox screencasts. We found that our automated approach can significantly speed up the feature location process. For WordPress, our approach using screencast speech and GUI text can successfully link relevant source code files within the top 10 hits of the result set, with a median Reciprocal Rank (RR) of 50% (rank 2) and 100% (rank 1). For Firefox, our approach can identify relevant source code directories within the top 100 hits using screencast speech and GUI text, with a median RR of 20%, meaning that the first true positive is ranked fifth or higher in more than 50% of the cases. In addition, source code related to the frontend implementation, which handles high-level or GUI-related aspects of an application, is located with higher accuracy.
We also found that term frequency rebalancing can further improve the linking results when the scenarios are less noisy or when locating less technical implementations of scenarios. Comparing the original and weighted screencast data sources (speech, GUI text, and speech combined with GUI text) that yield the highest median RR values in both case studies shows that speech is an important information source, one that can yield an RR of 100%.
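The Reciprocal Rank figures quoted above can be read as follows: an RR of 50% means the first relevant hit is at rank 2, and an RR of 20% means it is at rank 5. The minimal sketch below illustrates this relationship only; the function name, file names, and result lists are hypothetical and are not taken from the case studies.

```python
from statistics import median


def reciprocal_rank(ranked_items, relevant_items):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0


# Hypothetical result lists (one per screencast query) with their
# hypothetical ground-truth files; the names are illustrative only.
queries = [
    (["a.php", "b.php", "c.php"], {"b.php"}),                    # first hit at rank 2
    (["x.php", "y.php"], {"x.php"}),                             # first hit at rank 1
    (["p.php", "q.php", "r.php", "s.php", "t.php"], {"t.php"}),  # first hit at rank 5
]

rr_values = [reciprocal_rank(ranked, relevant) for ranked, relevant in queries]
median_rr = median(rr_values)

print(rr_values)   # one RR value per query
print(median_rr)   # median RR across the queries
```

Reporting the median rather than the mean keeps a few very poorly ranked queries from dominating the summary value.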
Notes
Portals such as https://www.wikipedia.org/ and https://stackoverflow.com/ contain crowd-based textual documentation.
Portals such as https://commons.wikimedia.org/wiki/Main_Page and https://www.youtube.com/ contain crowd-based multimedia documents.
Additional information
Communicated by: Gabriele Bavota
About this article
Cite this article
Moslehi, P., Adams, B. & Rilling, J. A feature location approach for mapping application features extracted from crowd-based screencasts to source code. Empir Software Eng 25, 4873–4926 (2020). https://doi.org/10.1007/s10664-020-09874-z
DOI: https://doi.org/10.1007/s10664-020-09874-z