Abstract
Compiler-based static vectorization is widely used to extract data-level parallelism from computation-intensive applications, and it is very effective on traditional array-based applications. However, compilers’ inability to perform accurate interprocedural pointer disambiguation and interprocedural array dependence analysis severely limits vectorization opportunities. HW/SW codesigned processors provide an excellent opportunity to optimize applications at runtime, where the availability of dynamic application behavior helps capture vectorization opportunities that compilers generally miss.
This article proposes complementing static vectorization with a speculative dynamic vectorizer in an HW/SW codesigned processor. We present a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities. This speculative reordering of memory instructions avoids the need for accurate interprocedural pointer disambiguation and interprocedural array dependence analysis. The hardware checks for memory dependence violations caused by speculative vectorization and takes corrective action when one occurs. Our experiments show that the combined (static + dynamic) vectorization approach provides a 2× performance benefit over static GCC vectorization alone for SPECFP2006. In the TSVC benchmark suite, the speculative dynamic vectorizer vectorizes 48% of the loops that ICC failed to vectorize due to conservative dependence analysis. Moreover, the dynamic vectorization scheme is as effective on pointer-based applications as on array-based ones, whereas compilers lose significant vectorization opportunities in pointer-based applications. Finally, we show that speculation is not merely a luxury but a necessity for runtime vectorization.
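The idea of speculatively reordering ambiguous memory references, checking for a dependence violation, and taking corrective action can be illustrated with a small software model. This is only a sketch under stated assumptions, not the article's algorithm: the article performs the check in hardware on translated binary code, whereas here memory is a flat Python list, pointers are integer offsets, and the names `VL`, `scalar_loop`, and `speculative_vector_loop`, as well as the `+ 1` loop body, are hypothetical and chosen purely for illustration.

```python
VL = 4  # assumed vector length (elements per vector operation); illustrative only


def scalar_loop(mem, dst, src, n):
    """Reference semantics: mem[dst+i] = mem[src+i] + 1, one element at a time."""
    for i in range(n):
        mem[dst + i] = mem[src + i] + 1


def speculative_vector_loop(mem, dst, src, n):
    """Run the same loop VL iterations at a time, hoisting loads above stores.

    Returns the number of groups that had to be rolled back.
    """
    rollbacks = 0
    i = 0
    while i < n:
        width = min(VL, n - i)
        # Speculation: perform every load of the group before any of its stores.
        loaded = [mem[src + i + k] for k in range(width)]
        # Dependence check (hardware in the article, modelled in software here):
        # the reordering is unsafe if the store of iteration j writes an address
        # that a later iteration k > j has already loaded.
        violation = any(
            dst + i + j == src + i + k
            for j in range(width)
            for k in range(j + 1, width)
        )
        if violation:
            # Corrective action: drop the speculative loads and rerun in order.
            # Nothing has been stored yet, so there is no state to undo.
            rollbacks += 1
            scalar_loop(mem, dst + i, src + i, width)
        else:
            # No conflict: commit the whole group as one vector store.
            for k in range(width):
                mem[dst + i + k] = loaded[k] + 1
        i += width
    return rollbacks
```

In this model, a call with disjoint regions (for example `dst=8, src=0, n=8`) commits every group without a rollback, while `dst=1, src=0` makes each iteration depend on the previous one, so every group is detected as a violation and falls back to in-order execution. In both cases the final memory contents match `scalar_loop`, which is the property the speculation must preserve.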
Index Terms
- Assisting Static Compiler Vectorization with a Speculative Dynamic Vectorizer in an HW/SW Codesigned Environment