Early Address Prediction: Efficient Pipeline Prefetch and Reuse

Authors:
Ricardo Alves, Uppsala University, Uppsala, Sweden
Stefanos Kaxiras, Uppsala University, Uppsala, Sweden
David Black-Schaffer, Uppsala University, Uppsala, Sweden

ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 3, Article No. 39, pp. 1–22. https://doi.org/10.1145/3458883

Published: 08 June 2021

Abstract

Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead.

In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead.
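As a rough illustration of the idea described above, the following sketch models a register file whose entries carry an address tag, so that a later load to the same address can be served from a register instead of accessing the L1 cache. It is a toy model under stated assumptions, not the design from the article: the names (`TaggedRegisterFile`, `load`, `store`), the dictionary standing in for memory, the round-robin register allocation, and the invalidate-on-store policy are all hypothetical simplifications. In particular, the article validates forwards with the existing memory order violation detection hardware, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Register:
    value: int = 0
    # Address the value was loaded from; None means "no valid tag".
    addr_tag: Optional[int] = None


class TaggedRegisterFile:
    """Toy register file with address tags (hypothetical, for illustration only)."""

    def __init__(self, num_regs: int, memory: Dict[int, int]) -> None:
        self.regs: List[Register] = [Register() for _ in range(num_regs)]
        self.memory = memory      # stand-in for the L1 cache and memory hierarchy
        self.next_victim = 0      # round-robin allocation (an assumption)
        self.l1_accesses = 0      # loads that had to access the cache
        self.forwarded = 0        # loads served from a tagged register

    def _find(self, addr: int) -> Optional[int]:
        """Return the index of the register tagged with addr, if any."""
        for idx, reg in enumerate(self.regs):
            if reg.addr_tag == addr:
                return idx
        return None

    def load(self, addr: int) -> int:
        hit = self._find(addr)
        if hit is not None:
            # Tag match: reuse the register contents, no cache access needed.
            self.forwarded += 1
            return self.regs[hit].value
        # Tag miss: access the cache and tag the destination register.
        self.l1_accesses += 1
        value = self.memory.get(addr, 0)
        victim = self.next_victim
        self.next_victim = (self.next_victim + 1) % len(self.regs)
        self.regs[victim] = Register(value=value, addr_tag=addr)
        return value

    def store(self, addr: int, value: int) -> None:
        # Simplified stand-in for validation: a store to a tagged address
        # clears the tag so that stale data is never forwarded.
        self.memory[addr] = value
        hit = self._find(addr)
        if hit is not None:
            self.regs[hit].addr_tag = None
```

In this model, two back-to-back calls to `load` with the same address access the cache only once; the second is counted in `forwarded`. A `store` to that address clears the tag, so the next `load` goes back to the cache and observes the new value.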

Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.

Index Terms

1. Computer systems organization
  1. Architectures
    1. Serial architectures
      1. Pipeline computing
      2. Superscalar architectures

Published in
ACM Transactions on Architecture and Code Optimization, Volume 18, Issue 3
September 2021
370 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3460978
Editor: David Kaeli, Northeastern University, USA
Copyright © 2021 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher: Association for Computing Machinery, New York, NY, United States
Publication History
- Published: 8 June 2021
- Revised: 1 March 2021
- Accepted: 1 March 2021
- Received: 1 December 2020

Author Tags
Pipeline prefetching
address prediction
energy efficient computing
first level cache
register sharing
Qualifiers
- research-article
- Research
- Refereed