DySHARQ: Dynamic Software-Defined Hardware-Managed Queues for Tile-Based Architectures

Rheindt, Sven; Maier, Sebastian; Pohle, Nora; Nolte, Lars; Lenke, Oliver; Schmaus, Florian; Wild, Thomas; Schröder-Preikschat, Wolfgang; Herkersdorf, Andreas

doi:10.1007/s10766-020-00687-7

Published: 20 November 2020

Volume 49, pages 506–540 (2021)

International Journal of Parallel Programming

Sven Rheindt¹ (ORCID: orcid.org/0000-0002-9024-5639),
Sebastian Maier²,
Nora Pohle¹,
Lars Nolte¹,
Oliver Lenke¹,
Florian Schmaus²,
Thomas Wild¹,
Wolfgang Schröder-Preikschat² &
Andreas Herkersdorf¹

Abstract

The recent trend towards tile-based manycore architectures has helped to tackle the memory wall by physically distributing memories and processing nodes. However, this distribution introduces a data-to-task locality challenge, and inter-tile communication often imposes significant software overhead. To address this, we previously proposed SHARQ, software-defined hardware-managed queues that enable efficient inter-tile communication by leveraging user-defined queues with arbitrarily sized elements. To ensure (remote) processing of queued elements, SHARQ introduces an optional handler task, which is scheduled by hardware on demand. Queue management, intra- and inter-tile data transfer, and handler task invocation are handled entirely by hardware. Only rare operations, such as dynamic queue creation at run-time, are performed in software. DySHARQ, an extension of SHARQ, enables dynamic and concurrent queue memory management and queue length adjustments in order to adapt to changes in application and resource requirements. The DySHARQ hardware monitors the queue memory requirements at run-time and conditionally schedules a software-defined memory management task. It further optimizes the hardware-software interaction for local queue operations. We integrated DySHARQ into the MPI library used by the NAS benchmarks. The evaluation shows a reduction in execution time of up to 43% (compared to software) for the communication-intensive IS kernel in a 4 × 4 tile design on an FPGA platform with a total of 80 LEON3 cores. The dynamic memory management reduces the memory footprint by 3.75× in a 2 × 2 design.
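To make the queue semantics described in the abstract more concrete, the following is a minimal software model, not the authors' hardware design. It illustrates two ideas only: elements of arbitrary size sharing one queue memory budget, and an optional handler task that is scheduled on demand when data arrives. All names (`Scheduler`, `ModelQueue`, `drain_into`) and the rule that at most one handler invocation is pending at a time are assumptions made for illustration.

```python
from collections import deque


class Scheduler:
    """Stand-in for the hardware scheduler: a FIFO of runnable tasks."""

    def __init__(self):
        self.ready = deque()

    def submit(self, task):
        self.ready.append(task)

    def run_all(self):
        while self.ready:
            self.ready.popleft()()


class ModelQueue:
    """Byte-budgeted FIFO with an optional handler (hypothetical model)."""

    def __init__(self, capacity_bytes, scheduler, handler=None):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.elements = deque()
        self.scheduler = scheduler
        self.handler = handler
        self.handler_pending = False

    def enqueue(self, payload):
        # Elements may have any size as long as they fit the remaining budget.
        if self.used_bytes + len(payload) > self.capacity_bytes:
            return False
        self.elements.append(payload)
        self.used_bytes += len(payload)
        # Assumption: the handler is scheduled only if none is pending yet.
        if self.handler is not None and not self.handler_pending:
            self.handler_pending = True
            self.scheduler.submit(self._run_handler)
        return True

    def dequeue(self):
        if not self.elements:
            return None
        payload = self.elements.popleft()
        self.used_bytes -= len(payload)
        return payload

    def _run_handler(self):
        self.handler_pending = False
        self.handler(self)


def drain_into(sink):
    """Build a handler that moves every queued element into `sink`."""

    def handler(queue):
        item = queue.dequeue()
        while item is not None:
            sink.append(item)
            item = queue.dequeue()

    return handler
```

In this model, a producer calls `enqueue` and never invokes the consumer directly. The consumer code runs only when the scheduler picks up the handler, which loosely mirrors the on-demand handler invocation the abstract attributes to the hardware.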

Notes

Note that SHARQ is a subset of DySHARQ, which contains all features already present in SHARQ. We refer to this subset as SHARQ, base SHARQ or even base DySHARQ.
The changes to the queue descriptor for the extended DySHARQ version, which are shown in Fig. 3b, are described later in Sect. 4.
Similar to the handler task invocation, described in Sect. 3, DySHARQ invokes the memory management task by inserting its task descriptor (defined in the queue header) into the hardware scheduler.
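The last note describes the memory management task being invoked by inserting its task descriptor, taken from the queue header, into the scheduler. A rough software sketch of that interaction is given below. It is an illustration under stated assumptions, not the DySHARQ implementation: the watermark thresholds, the doubling and halving policy, and every name (`QueueHeader`, `update_fill`, `resize`, `run_scheduler`) are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class QueueHeader:
    capacity: int                 # slots currently allocated to the queue
    mm_task: Callable             # stand-in for the task descriptor
    min_capacity: int = 4         # hypothetical lower bound
    high_watermark: float = 0.75  # hypothetical thresholds
    low_watermark: float = 0.25
    fill: int = 0
    mm_pending: bool = False


def needs_resize(header):
    ratio = header.fill / header.capacity
    if ratio >= header.high_watermark:
        return True
    return ratio <= header.low_watermark and header.capacity > header.min_capacity


def update_fill(header, scheduler, delta):
    """Model of the monitoring step; bounds checks are omitted for brevity."""
    header.fill += delta
    # Conditionally schedule: only when a threshold is crossed and no
    # memory management task for this queue is already pending.
    if not header.mm_pending and needs_resize(header):
        header.mm_pending = True
        scheduler.append((header.mm_task, header))


def resize(header):
    """Software-defined policy (assumption): double when full, halve when idle."""
    ratio = header.fill / header.capacity
    if ratio >= header.high_watermark:
        header.capacity *= 2
    elif ratio <= header.low_watermark:
        header.capacity = max(header.min_capacity, header.fill, header.capacity // 2)
    header.mm_pending = False


def run_scheduler(scheduler):
    while scheduler:
        task, arg = scheduler.popleft()
        task(arg)
```

The split in this sketch follows the division of labour the notes describe: the monitoring step only decides whether to schedule, while the resizing policy itself lives in an ordinary software task that can be swapped per queue.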

Acknowledgements

This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—Project Number 146371743–TRR 89: Invasive Computing. We also thank Gabor Drescher, Jonas Rabenstein, Christoph Erhardt, and Tobias Langer from FAU, as well as Alexander Preißner and Temur Sabirov from TUM for their excellent help.

Author information

Authors and Affiliations

Technical University of Munich (TUM), Munich, Germany
Sven Rheindt, Nora Pohle, Lars Nolte, Oliver Lenke, Thomas Wild & Andreas Herkersdorf
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
Sebastian Maier, Florian Schmaus & Wolfgang Schröder-Preikschat

Corresponding author

Correspondence to Sven Rheindt.

About this article

Cite this article

Rheindt, S., Maier, S., Pohle, N. et al. DySHARQ: Dynamic Software-Defined Hardware-Managed Queues for Tile-Based Architectures. Int J Parallel Prog 49, 506–540 (2021). https://doi.org/10.1007/s10766-020-00687-7

Received: 03 April 2020
Accepted: 05 November 2020
Published: 20 November 2020
Issue Date: August 2021
DOI: https://doi.org/10.1007/s10766-020-00687-7
