Abstract
The fast Fourier transform (FFT) is widely used to solve scientific and engineering problems. In particular, it underlies software for speech and image recognition, signal analysis, and modeling the properties of new materials and substances. Newly emerging high-performance hybrid computing systems, as well as systems with alternative architectures, call for research into how efficiently the discrete Fourier transform can be computed on these platforms. The results of such research help assess the feasibility of particular solutions for building modern computing and data processing centers. This paper presents such results for modern hybrid computing systems based on IBM POWER and Intel Xeon processors and NVIDIA Tesla co-processors. Their performance when executing fast Fourier transforms is analyzed, and conclusions are presented. The impact of architectural features of the hardware (CPU simultaneous multithreading mode, GPU data transfer bus, etc.) on transform performance is assessed. The results are used to provide recommendations on the optimal operating modes and settings of the considered mathematical libraries.
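As a rough illustration of what measuring FFT performance involves, the sketch below times a textbook radix-2 Cooley-Tukey transform and converts the wall time into a nominal rate using the conventional 5·N·log2(N) operation count for a complex transform of length N. This is not the benchmark code or the libraries evaluated in the paper; the function names `fft_radix2`, `nominal_gflops`, and `time_fft` are hypothetical, and the choice of metric is an assumption made for illustration.

```python
import cmath
import math
import time


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])  # transform of even-indexed samples
    odd = fft_radix2(x[1::2])   # transform of odd-indexed samples
    half = n // 2
    out = [0j] * n
    for k in range(half):
        # Twiddle factor applied to the odd half (butterfly step).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def nominal_gflops(n, seconds):
    """Nominal rate from the conventional 5*N*log2(N) flop count (assumed metric)."""
    return 5.0 * n * math.log2(n) / seconds / 1e9


def time_fft(n, repeats=3):
    """Best-of-`repeats` wall time for one length-n transform (hypothetical helper)."""
    signal = [complex(math.sin(0.1 * i), 0.0) for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fft_radix2(signal)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    n = 1 << 12
    elapsed = time_fft(n)
    print(f"N={n}: {elapsed:.6f} s, {nominal_gflops(n, elapsed):.4f} nominal GFLOPS")
```

A pure-Python transform is orders of magnitude slower than an optimized library, so the absolute numbers carry no meaning. The point is only the shape of the measurement: repeat the transform, keep the best time, and normalize by a fixed operation count so that different sizes and platforms can be compared.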
This research was partly funded by the Russian Foundation for Basic Research (RFBR), project number 18-29-03196.
Cite this article
Malkovsky, S.I., Sorokin, A.A., Tsoy, G.I. et al. Evaluating the performance of FFT library implementations on modern hybrid computing systems. J Supercomput 77, 8326–8354 (2021). https://doi.org/10.1007/s11227-020-03591-6
DOI: https://doi.org/10.1007/s11227-020-03591-6