research-article

Assisting Static Compiler Vectorization with a Speculative Dynamic Vectorizer in an HW/SW Codesigned Environment

Authors:
Rakesh Kumar, UPC Barcelona
Alejandro Martínez, Intel Barcelona Research Center, Intel Labs
Antonio González, Intel Barcelona Research Center, Intel Labs and UPC Barcelona, Barcelona, Spain

ACM Transactions on Computer Systems, Volume 33, Issue 4, Article No. 12, pp. 1–33. https://doi.org/10.1145/2807694

Published: 04 January 2016

Abstract

Compiler-based static vectorization is widely used to extract data-level parallelism from computation-intensive applications, and it is very effective on traditional array-based applications. However, compilers' inability to perform accurate interprocedural pointer disambiguation and interprocedural array dependence analysis severely limits vectorization opportunities. HW/SW codesigned processors provide an excellent opportunity to optimize applications at runtime, where the availability of dynamic application behavior helps capture vectorization opportunities that compilers generally miss.
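To make the aliasing problem concrete, the following minimal sketch models memory as a flat Python list and compares a loop executed in program order with the same loop executed in vector order (all loads of a chunk before any of its stores). The names `scalar_loop`, `vector_loop`, and the vector length `VL` are illustrative assumptions and do not come from the article. When the two regions are disjoint the results agree; when the destination overlaps the source at distance 1, the reordering changes the result, which is why a compiler that cannot prove the pointers independent must stay conservative.

```python
# Illustrative model only: 'mem' is a flat memory, 'dst' and 'src' are base addresses.
VL = 4  # assumed vector length (elements per vector operation)


def scalar_loop(mem, dst, src, n):
    """Reference semantics: one load and one store per iteration, in program order."""
    mem = list(mem)
    for i in range(n):
        mem[dst + i] = mem[src + i] + 1
    return mem


def vector_loop(mem, dst, src, n):
    """Vector order: each chunk performs all of its loads before any of its stores."""
    mem = list(mem)
    for base in range(0, n, VL):
        width = min(VL, n - base)
        loaded = [mem[src + base + k] for k in range(width)]  # vector load
        for k in range(width):                                # vector store
            mem[dst + base + k] = loaded[k] + 1
    return mem


memory = [0] * 16

# Disjoint regions: hoisting loads above stores is harmless.
disjoint_agrees = scalar_loop(memory, 8, 0, 4) == vector_loop(memory, 8, 0, 4)

# dst = src + 1: iteration i reads the value iteration i-1 wrote, so hoisting
# the loads above the stores yields a different memory state.
overlap_agrees = scalar_loop(memory, 1, 0, 4) == vector_loop(memory, 1, 0, 4)
```

Here `disjoint_agrees` is true and `overlap_agrees` is false. Whether `dst` and `src` overlap is often only known at the call site, which is the interprocedural information the abstract says compilers frequently lack.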

This article proposes complementing static vectorization with a speculative dynamic vectorizer in an HW/SW codesigned processor. We present a speculative dynamic vectorization algorithm that speculatively reorders ambiguous memory references to uncover vectorization opportunities. This speculative reordering of memory instructions avoids the need for accurate interprocedural pointer disambiguation and interprocedural array dependence analysis. The hardware checks for memory dependence violations caused by speculative vectorization and takes corrective action when one occurs. Our experiments show that the combined (static + dynamic) vectorization approach provides a 2× performance benefit over static GCC vectorization alone for SPECFP2006. Furthermore, the speculative dynamic vectorizer vectorizes 48% of the loops in the TSVC benchmark suite that ICC failed to vectorize due to conservative dependence analysis. Moreover, the dynamic vectorization scheme is as effective on pointer-based applications as on array-based ones, whereas compilers lose significant vectorization opportunities in pointer-based applications. Finally, we show that speculation is not merely a luxury but a necessity for runtime vectorization.
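The speculate, check, and recover idea can be sketched in software as follows. This is a toy emulation under stated assumptions, not the article's algorithm or hardware: memory is a flat Python list, the loop body is fixed to `mem[dst + i] = mem[src + i] + 1`, and the function names, the vector length `VL`, and the rollback counter are all hypothetical. Each chunk is executed with its loads hoisted above its stores and the results are buffered. A check then looks for an earlier lane's store that targets an address a later lane has already loaded. On a violation the buffered results are discarded and the chunk is re-executed in program order.

```python
VL = 4  # assumed vector length (elements per vector operation)


def scalar_loop(mem, dst, src, n):
    """Reference semantics: iterations execute strictly in program order."""
    mem = list(mem)
    for i in range(n):
        mem[dst + i] = mem[src + i] + 1
    return mem


def speculative_vector_loop(mem, dst, src, n):
    """Run the loop chunk by chunk, speculating that loads may pass stores."""
    mem = list(mem)
    rollbacks = 0
    for base in range(0, n, VL):
        width = min(VL, n - base)
        load_addrs = [src + base + k for k in range(width)]
        store_addrs = [dst + base + k for k in range(width)]

        # Speculative execution: all loads are hoisted above the stores and
        # the results are buffered instead of being written to memory.
        buffered = [mem[a] + 1 for a in load_addrs]

        # Stand-in for the dependence check: an earlier lane j stores to an
        # address that a later lane k > j has already (speculatively) loaded.
        violation = any(
            store_addrs[j] == load_addrs[k]
            for j in range(width)
            for k in range(j + 1, width)
        )

        if violation:
            # Corrective action: drop the buffer, redo the chunk in order.
            rollbacks += 1
            for k in range(width):
                mem[store_addrs[k]] = mem[load_addrs[k]] + 1
        else:
            # No conflict: commit the buffered vector store.
            for k in range(width):
                mem[store_addrs[k]] = buffered[k]
    return mem, rollbacks
```

With `memory = list(range(16))`, calling `speculative_vector_loop(memory, 8, 0, 8)` commits both chunks without rollback, while `speculative_vector_loop(memory, 1, 0, 8)` rolls back both chunks. In either case the final memory matches `scalar_loop`. No alias analysis is needed up front; the cost of being wrong is paid only when a conflict actually occurs.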

Index Terms

Assisting Static Compiler Vectorization with a Speculative Dynamic Vectorizer in an HW/SW Codesigned Environment
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published in
ACM Transactions on Computer Systems Volume 33, Issue 4
January 2016
125 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/2841315
Editor:
Todd C. Mowry
Carnegie Mellon University, Pittsburgh, PA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 January 2016
- Accepted: 1 July 2015
- Revised: 1 January 2015
- Received: 1 June 2014
Author Tags
Hardware/software codesigned processors
dynamic optimizations
speculation
vectorization
Qualifiers
- research-article
- Research
- Refereed