Abstract
Achieving low load-to-use latency with low energy and storage overheads is critical for performance. Existing techniques either prefetch into the pipeline (via address prediction and validation) or provide data reuse in the pipeline (via register sharing or L0 caches). These techniques provide a range of tradeoffs between latency, reuse, and overhead.
In this work, we present a pipeline prefetching technique that achieves state-of-the-art performance and data reuse without additional data storage, data movement, or validation overheads by adding address tags to the register file. Our addition of register file tags allows us to forward (reuse) load data from the register file with no additional data movement, keep the data alive in the register file beyond the instruction’s lifetime to increase temporal reuse, and coalesce prefetch requests to achieve spatial reuse. Further, we show that we can use the existing memory order violation detection hardware to validate prefetches and data forwards without additional overhead.
Our design matches the performance of existing pipeline prefetching while also forwarding 32% of loads from the register file (compared to 15% in state-of-the-art register sharing). This delivers a 16% reduction in L1 dynamic energy (1.6% of total processor energy) with an area overhead of less than 0.5%.
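The register-file tagging idea described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the paper's design: the names `TaggedRegisterFile`, `execute_load`, and `execute_store` are hypothetical, tags are modeled as exact addresses, and correctness is kept by invalidating tags on every store, a simplification standing in for the memory order violation detection hardware that the abstract says is reused for validation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhysReg:
    value: int = 0
    # Address the value was loaded from; None means the register is untagged.
    tag: Optional[int] = None


class TaggedRegisterFile:
    """Hypothetical model of a register file extended with address tags."""

    def __init__(self, num_regs: int) -> None:
        self.regs = [PhysReg() for _ in range(num_regs)]

    def fill_from_load(self, reg: int, addr: int, value: int) -> None:
        # A load that accesses the L1 writes its data and records the address.
        self.regs[reg] = PhysReg(value=value, tag=addr)

    def lookup(self, addr: int) -> Optional[int]:
        # Return the index of a register already holding data for addr, if any.
        for idx, reg in enumerate(self.regs):
            if reg.tag == addr:
                return idx
        return None

    def invalidate(self, addr: int) -> None:
        # Assumption: a store to addr simply clears matching tags.
        for reg in self.regs:
            if reg.tag == addr:
                reg.tag = None


def execute_load(rf, memory, addr, dest_reg, stats):
    """Return the register that holds the loaded value."""
    hit = rf.lookup(addr)
    if hit is not None:
        # Forward from the register file: no L1 access, no data movement.
        stats["forwarded"] += 1
        return hit
    stats["l1_accesses"] += 1
    rf.fill_from_load(dest_reg, addr, memory[addr])
    return dest_reg


def execute_store(rf, memory, addr, value):
    memory[addr] = value
    rf.invalidate(addr)


if __name__ == "__main__":
    rf = TaggedRegisterFile(num_regs=4)
    memory = {0x100: 7}
    stats = {"forwarded": 0, "l1_accesses": 0}
    first = execute_load(rf, memory, 0x100, dest_reg=0, stats=stats)
    second = execute_load(rf, memory, 0x100, dest_reg=1, stats=stats)
    print(first, second, stats)
```

In this model the second load to the same address is served by the register that the first load filled, so only one L1 access is counted. Because the tag stays attached to the register after the load completes, later loads can still reuse the value, which is the temporal reuse the abstract refers to. Prefetch coalescing and address prediction are not modeled here.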
Index Terms
- Early Address Prediction: Efficient Pipeline Prefetch and Reuse