Skip to main content
Log in

DySHARQ: Dynamic Software-Defined Hardware-Managed Queues for Tile-Based Architectures

  • Published:
International Journal of Parallel Programming Aims and scope Submit manuscript

Abstract

The recent trend towards tile-based manycore architectures has helped to tackle the memory wall by physically distributing memories and processing nodes. However, this introduced a data-to-task locality challenge and inter-tile communication thus often imposes significant software overhead. Thus, we proposed software-defined hardware-managed SHARQ queues that enable efficient inter-tile communication by leveraging user-defined queues with arbitrarily sized elements. To ensure (remote) processing of queued elements, SHARQ introduces an optional handler task, which is scheduled by hardware on demand. Queue management, intra- and inter-tile data transfer, and handler task invocation are entirely handled by hardware. Only rare tasks, like the dynamic queue creation at run-time, are performed in software. DySHARQ, an extension of SHARQ, enables dynamic and concurrent queue memory management and queue length adjustments to be able to adapt to application and resource requirement changes. The DySHARQ hardware is able to monitor the queue memory requirements at run-time and conditionally schedules a software-defined memory management task. It further optimizes the hardware-software interaction for local queue operations. We integrated DySHARQ into the MPI library used by the NAS benchmarks. The evaluation shows a reduction in execution time by up to 43% (compared to software) for the communication intense IS kernel in a 4 \(\times\) 4 tile design on an FPGA platform with a total of 80 LEON3 cores. The dynamic memory management reduces the memory footprint by 3.75\(\times\) in a 2 \(\times\) 2 design.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Similar content being viewed by others

Notes

  1. Note that SHARQ is a subset of DySHARQ, which contains all features already present in SHARQ. We refer to this subset as SHARQ, base SHARQ or even base DySHARQ.

  2. The changes to the queue descriptor for the extended DySHARQ version, which are shown in Fig. 3b, are later described in Sect. 4

  3. Similar to the handler task invocation, described in Sect. 3, DySHARQ invokes the memory management task by inserting its task descriptor (defined in the queue header) into the hardware scheduler.

References

  1. Parkhurst, J., Darringer, J., Grundmann, B.: From single core to multi-core: preparing for a new exponential. In: 2006 IEEE/ACM International Conference on Computer Aided Design, pp. 67–72 (2006). https://doi.org/10.1109/ICCAD.2006.320067

  2. Wulf, W.A., McKee, S.A.: Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News 23(1), 20–24 (1995). https://doi.org/10.1145/216585.216588

    Article  Google Scholar 

  3. Patterson, D.A., Anderson, T.E., Cardwell, N., Fromm, R., Keeton, K., Kozyrakis, C.E., Thomas, R., Yelick, K.A.: A case for intelligent RAM. IEEE Micro 17(2), 34–44 (1997). https://doi.org/10.1109/40.592312

    Article  Google Scholar 

  4. Teich, J., Henkel, J., Herkersdorf, A., Schmitt-Landsiedel, D., Schröder-Preikschat, W., Snelting, G.: Invasive computing: an overview. In: Multiprocessor System-on-Chip, pp. 241–268 (2011). https://doi.org/10.1007/978-1-4419-6460-1_11

  5. Wentzlaff, D., Griffin, P., Hoffmann, H., Bao, L., Edwards, B., Ramey, C., Mattina, M., Miao III, C., Brown, J.F., Agarwal, A.: On-chip interconnection architecture of the tile processor. IEEE Micro 27(5), 15–31 (2007). https://doi.org/10.1109/MM.2007.89

  6. Bell, S., Edwards, B., Amann, J., Conlin, R., Joyce, K., Leung, V., MacKay, J., Reif, M., Bao, L., III JFB, Mattina, M., Miao, C., Ramey, C., Wentzlaff, D., Anderson, W., Berger, E., Fairbanks, N., Khan, D., Montenegro, F., Stickney, J., Zook, J.: TILE64 - processor: a 64-Core SoC with mesh interconnect. In: 2008 IEEE International Solid-State Circuits Conference, ISSCC 2008, Digest of Technical Papers, San Francisco, CA, USA, February 3–7, 2008, IEEE, San Francisco, CA, pp 88–89 (2008). https://doi.org/10.1109/ISSCC.2008.4523070

  7. Lotfi-Kamran, P., Grot, B., Ferdman, M., Volos, S., Kocberber, O., Picorel, J., Adileh, A., Jevdjic, D., Idgunji, S., Ozer, E., Falsafi, B.: Scale-out Processors. In: Proceedings of the 39th Annual International Symposium on Computer Architecture, IEEE Computer Society, USA, ISCA ’12, pp. 500–511 (2012)

  8. Howard, J., Dighe, S., Hoskote, Y., Vangal, S., Finan, D., Ruhl, G., Jenkins, D., Wilson, H., Borkar, N., Schrom, G., Pailet, F., Jain, S., Jacob, T., Yada, S., Marella, S., Salihundam, P., Erraguntla, V., Konow, M., Riepen, M., Droege, G., Lindemann, J., Gries, M., Apel, T., Henriss, K., Lund-Larsen, T., Steibl, S., Borkar, S., De, V., Wijngaart, R.V.D., Mattson, T.: A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS. In: 2010 IEEE International Solid-State Circuits Conference—(ISSCC), pp. 108–109 (2010). https://doi.org/10.1109/ISSCC.2010.5434077

  9. Mittal, S.: A survey on evaluating and optimizing performance of Intel Xeon Phi. Practice and Experience, Concurrency and Computation (2020)

  10. Siegl, P., Buchty, R., Berekovic, M.: Data-centric computing frontiers: a survey on processing-in-memory. In: Jacob, B. (ed) Proceedings of the Second International Symposium on Memory Systems, MEMSYS 2016, Alexandria, VA, USA, 2016, ACM, pp. 295–308 (2016). https://doi.org/10.1145/2989081.2989087

  11. Kogge, P.: Memory Intensive Computing, the 3rd Wall, and the Need for Innovation in Architecture. (2017) https://memsys.io/wp-content/uploads/2017/12/The_Wall.pdf

  12. Oechslein, B., Schedel, J., Kleinöder, J., Bauer, L., Henkel, J., Lohmann, D., Schröder-Preikschat, W.: OctoPOS: a parallel operating system for invasive computing. In: Proceedings of the International Workshop on Systems for Future Multi-Core Architectures. EuroSys, pp. 9–14 (2011)

  13. Kranz, D.A., Johnson, K.L., Agarwal, A., Kubiatowicz, J., Lim, B.: Integrating message-passing and shared-memory: early experience. In: Chen, M.C., Halstead, R. (eds) Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), San Diego, California, USA, 1993, ACM, pp. 54–63, (1993). https://doi.org/10.1145/155332.155338

  14. Moir, M., Shavit, N.: Concurrent data structures. In: Handbook of Data Structures and Applications (2004)

  15. MPI Forum: MPI: A Message Passing Interface Standard Version 3.1 (2015). https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf

  16. Corbet, J.: Ringing in a new asynchronous I/O API. (2019) https://lwn.net/Articles/776703/

  17. Michael, M.M., Scott, M.L.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: ACM Symposium on Principles of Distributed Computing, pp. 267–275 (1996). https://doi.org/10.1145/248052.248106

  18. Wang, Y., Wang, R., Herdrich, A., Tsai, J., Solihin, Y.: CAF: core to core communication acceleration framework. In: Conference on Parallel Architectures and Compilation (PACT), pp. 351–362 (2016). https://doi.org/10.1145/2967938.2967954

  19. Lee, S., Tiwari, D., Solihin, Y., Tuck, J.: HAQu: hardware-accelerated queueing for fine-grained threading on a chip multiprocessor. In: Conference on High-Performance Computer Architecture (HPCA), pp. 99–110 (2011). https://doi.org/10.1109/HPCA.2011.5749720

  20. Petrovic, D., Ropars, T., Schiper, A.: Leveraging hardware message passing for efficient thread synchronization. TOPC 2(4), 24:1–24:26 (2016). https://doi.org/10.1145/2858652

    Article  Google Scholar 

  21. Sánchez, D., Yoo, R.M., Kozyrakis, C.: Flexible architectural support for fine-grain scheduling. In: ASPLOS Conference Proceedings, pp. 311–322 (2010). https://doi.org/10.1145/1736020.1736055

  22. Lee, J., Nicopoulos, C., Lee, H.G., Panth, S., Lim, S.K., Kim, J.: IsoNet: hardware-based job queue management for many-core architectures. IEEE Trans. VLSI Syst. 21(6), 1080–1093 (2013). https://doi.org/10.1109/TVLSI.2012.2202699

    Article  Google Scholar 

  23. Pujari, R.K., Wild, T., Herkersdorf, A.: TCU: a multi-objective hardware thread mapping unit for HPC clusters. In: High Performance Computing, ISC, pp. 39–58 (2016). https://doi.org/10.1007/978-3-319-41321-1_3

  24. Kumar, S., Hughes, C.J., Nguyen, A.D.: Carbon: architectural support for fine-grained parallelism on chip multiprocessors. In: Symposium on Computer Architecture (ISCA), pp. 162–173 (2007). https://doi.org/10.1145/1250662.1250683

  25. Sharma, R.R., Rajasekhar, Y., Sass, R.: Exploring hardware work queue support for lightweight threads in MPSoCs. In: Conference on Reconfigurable Computing and FPGAs (ReConFig), pp. 1–6 (2012). https://doi.org/10.1109/ReConFig.2012.6416747

  26. Brewer, E.A., Chong, F.T., Liu, L.T., Sharma, S.D., Kubiatowicz, J.: Remote queues: exposing message queues for optimization and atomicity. In: ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 42–53 (1995). https://doi.org/10.1145/215399.215416

  27. Rheindt, S., Schenk, A., Srivatsa, A., Wild, T., Herkersdorf, A.: CaCAO: complex and compositional atomic operations for NoC-based manycore platforms. In: Conference on Architecture of Computing Systems (ARCS), pp 139–152 (2018). https://doi.org/10.1007/978-3-319-77610-1_11

  28. Rheindt, S., Maier, S., Schmaus, F., Wild, T., Schröder-Preikschat, W., Herkersdorf, A.: SHARQ: software-defined hardware-managed queues for tile-based manycore architectures. In: International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIX), Springer, Samos, Greece, pp. 212–225 (2019). https://doi.org/10.1007/978-3-030-27562-4_15

  29. Schmaus, F., Maier, S., Langer, T., Rabenstein, J., Hönig, T., Bauer, L., Henkel, J., Schröder-Preikschat, W.: System software for resource arbitration on future many-* architectures. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, pp. 967–975 (2020). https://doi.org/10.1109/IPDPSW50202.2020.00160

  30. Moerman, F.: Open event machine: a multi-core run-time designed for performance. In: 2014 6th European Embedded Design in Education and Research Conference (EDERC), pp. 41–45 (2014)

  31. Cataldo, R., Fernandes, R., Martin, K.J.M., Sepulveda, J., Susin, A., Marcon, C., Diguet, J.: Subutai: distributed synchronization primitives in NoC interfaces for legacy parallel-applications. In: 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC) (2018)

  32. Zaib, A., Wild, T., Herkersdorf, A., Heisswolf, J., Becker, J., Weichslgartner, A., Teich, J.: Efficient task spawning for shared memory and message passing in many-core architectures. J. Syst. Archit. - Embed. Syst. Des. 77, 72–82 (2017). https://doi.org/10.1016/j.sysarc.2017.03.004

    Article  Google Scholar 

  33. Heisswolf, J., Zaib. A., Weichslgartner, A., Karle, M., Singh, M., Wild, T., Teich, J., Herkersdorf, A., Becker, J.: The invasive network on chip: a multi-objective many-core communication infrastructure. In: Conference on Architecture of Computing Systems (ARCS), Workshop Proceedings, pp. 1–8 (2014)

  34. HkJ, Chu, et al.: Zero-copy TCP in solaris. USENIX Annu. Tech. Conf. 1, 253–264 (1996)

    Google Scholar 

  35. Intel Corporation: Intel 82574 GbE Controller Family—Datasheet. www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf, rev. 3.4 (2014)

  36. Bailey, D.H., Barszcz, E., Barton, J.T., Browning, D.S., Carter, R.L., Dagum, L., Fatoohi, R.A., Frederickson, P.O., Lasinski, T.A., Schreiber, R.S., et al.: The NAS parallel benchmarks. Int. J. Supercomput. Appl. 5(3), 63–73 (1991)

    Google Scholar 

  37. Subhlok, J., Venkataramaiah, S., Singh, A.: Characterizing NAS benchmark performance on shared heterogeneous networks. In: Parallel and Distributed Processing Symposium (IPDPS) (2002). https://doi.org/10.1109/IPDPS.2002.1015659

  38. Maier, S., Hönig, T., Wägemann, P., Schröder-Preikschat, W.: Asynchronous abstract machines: anti-noise system software for many-core processors. In: Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS), ACM, pp. 19–26 (2019). https://doi.org/10.1145/3322789.3328744

Download references

Acknowledgements

This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Project Number 146371743–TRR 89: Invasive Computing. We also thank Gabor Drescher, Jonas Rabenstein, Christoph Erhardt, and Tobias Langer from FAU, as well as Alexander Preißner and Temur Sabirov from TUM for their excellent help.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sven Rheindt.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Rheindt, S., Maier, S., Pohle, N. et al. DySHARQ: Dynamic Software-Defined Hardware-Managed Queues for Tile-Based Architectures. Int J Parallel Prog 49, 506–540 (2021). https://doi.org/10.1007/s10766-020-00687-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10766-020-00687-7

Keywords

Navigation