research-article
Open Access

Early Address Prediction: Efficient Pipeline Prefetch and Reuse

Published: 08 June 2021

Abstract

Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead.

In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead.

Our design achieves the performance of existing pipeline prefetching while also forwarding 32% of the loads from the register file (compared to 15% in state-of-the-art register sharing), delivering a 16% reduction in L1 dynamic energy (1.6% total processor energy), with an area overhead of less than 0.5%.
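
The idea of tagging register file entries with the address their data was loaded from can be illustrated with a small software model. The sketch below is only a toy under stated assumptions, not the paper's hardware design: the class and method names (`TaggedRegisterFile`, `find`, `load`, `store`) are hypothetical, a forwarded load copies the value rather than sharing a physical register, and clearing tags on a store is a simplification standing in for the validation that the abstract attributes to the existing memory order violation detection hardware.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PhysReg:
    value: int = 0
    addr_tag: Optional[int] = None  # address the value was loaded from (None = untagged)


class TaggedRegisterFile:
    """Toy register file whose entries carry an address tag (hypothetical model)."""

    def __init__(self, num_regs: int) -> None:
        self.regs: List[PhysReg] = [PhysReg() for _ in range(num_regs)]
        self.l1_accesses = 0  # loads that had to read the (modelled) L1 cache
        self.forwards = 0     # loads served from an already-tagged register

    def find(self, addr: int) -> Optional[int]:
        """Return the index of the first register tagged with addr, or None."""
        for idx, reg in enumerate(self.regs):
            if reg.addr_tag == addr:
                return idx
        return None

    def load(self, dest: int, addr: int, memory: Dict[int, int]) -> int:
        """Serve a load from a tagged register if possible, else from memory."""
        src = self.find(addr)
        if src is not None:
            # Tag hit: reuse data already in the register file.
            # Assumption: a copy stands in for physical register sharing.
            self.forwards += 1
            value = self.regs[src].value
        else:
            # Tag miss: access the cache (modelled here as a plain dict).
            self.l1_accesses += 1
            value = memory[addr]
        self.regs[dest] = PhysReg(value, addr)
        return value

    def store(self, addr: int, value: int, memory: Dict[int, int]) -> None:
        """Write memory and clear matching tags (assumed simplification)."""
        memory[addr] = value
        for reg in self.regs:
            if reg.addr_tag == addr:
                reg.addr_tag = None


memory = {0x100: 7, 0x108: 9}
rf = TaggedRegisterFile(num_regs=4)
rf.load(0, 0x100, memory)   # no tag matches: read from the modelled L1
rf.load(1, 0x100, memory)   # tag hit on register 0: forwarded, no L1 access
rf.store(0x100, 8, memory)  # stale tags for 0x100 are cleared
rf.load(2, 0x100, memory)   # must read the modelled L1 again
```

In this run, two of the three loads access the modelled L1 and one is forwarded from the register file. Because the tag stays attached to the register after the load that filled it, later loads to the same address can keep reusing the data, which is the temporal reuse the abstract describes.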

      • Published in

        ACM Transactions on Architecture and Code Optimization  Volume 18, Issue 3
        September 2021
        370 pages
        ISSN: 1544-3566
        EISSN: 1544-3973
        DOI: 10.1145/3460978

        Copyright © 2021 Owner/Author

        This work is licensed under a Creative Commons Attribution International 4.0 License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 8 June 2021
        • Revised: 1 March 2021
        • Accepted: 1 March 2021
        • Received: 1 December 2020

        Qualifiers

        • research-article
        • Research
        • Refereed
