Assisting Static Compiler Vectorization with a Speculative Dynamic Vectorizer in an HW/SW Codesigned Environment

Published: 04 January 2016

Abstract

Compiler-based static vectorization is used widely to extract data-level parallelism from computation-intensive applications. Static vectorization is very effective in vectorizing traditional array-based applications. However, compilers’ inability to do accurate interprocedural pointer disambiguation and interprocedural array dependence analysis severely limits vectorization opportunities. HW/SW codesigned processors provide an excellent opportunity to optimize the applications at runtime. The availability of dynamic application behavior at runtime helps in capturing vectorization opportunities generally missed by the compilers.

This article proposes to complement static vectorization with a speculative dynamic vectorizer in an HW/SW codesigned processor. We present a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities. The speculative reordering of memory instructions avoids the need for accurate interprocedural pointer disambiguation and interprocedural array dependence analysis. The hardware checks for any memory dependence violation due to speculative vectorization and takes corrective action in case of violation. Our experiments show that the combined (static + dynamic) vectorization approach provides a 2× performance benefit compared to the static GCC vectorization alone, for SPECFP2006. Furthermore, the speculative dynamic vectorizer is able to vectorize 48% of the loops that ICC failed to vectorize due to conservative dependence analysis in the TSVC benchmark suite. Moreover, the dynamic vectorization scheme is as effective in vectorization of pointer-based applications as for the array-based ones, whereas compilers lose significant vectorization opportunities in pointer-based applications. Finally, we show that speculation is not only a luxury but also a necessity for runtime vectorization.
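
The speculate, check, and recover idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the article's algorithm: the names `scalar_loop` and `speculative_vector_loop`, the list used as memory, the chunk width of 4, and the software conflict test are all hypothetical stand-ins. In the article the violation check is performed by hardware.

```python
def scalar_loop(mem, dst, src, n, k):
    """Reference semantics: mem[dst+i] = mem[src+i] + k, one iteration at a time."""
    for i in range(n):
        mem[dst + i] = mem[src + i] + k


def speculative_vector_loop(mem, dst, src, n, k, width=4):
    """Run the same loop in chunks of `width`, hoisting all loads above all stores.

    Returns the number of chunks that had to be replayed in scalar order.
    `width` and the list-based memory are illustrative assumptions.
    """
    rollbacks = 0
    for base in range(0, n, width):
        lanes = range(base, min(base + width, n))
        # Speculation: issue every load of the chunk before any store,
        # without first proving that dst and src do not overlap.
        loaded = [mem[src + i] for i in lanes]
        # Violation check (modelled in software here): a load of iteration j
        # was hoisted above the store of an earlier iteration i that writes
        # the same address, so the load saw a stale value.
        violation = any(dst + i == src + j
                        for i in lanes for j in lanes if i < j)
        if violation:
            # Corrective action: discard speculative values, replay in order.
            rollbacks += 1
            for i in lanes:
                mem[dst + i] = mem[src + i] + k
        else:
            # No conflict: commit the whole chunk as one "vector" store.
            for i, value in zip(lanes, loaded):
                mem[dst + i] = value + k
    return rollbacks
```

When `dst` and `src` are far apart, every chunk commits without a rollback. When `dst = src + 1`, each iteration reads the value written by the previous one, so every chunk is replayed in scalar order and the result still matches `scalar_loop`.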

Published in

ACM Transactions on Computer Systems, Volume 33, Issue 4
January 2016
125 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/2841315

Copyright © 2016 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 4 January 2016
• Accepted: 1 July 2015
• Revised: 1 January 2015
• Received: 1 June 2014

Qualifiers

• research-article
• Research
• Refereed
