Letting future programmers experience performance-related tasks

https://doi.org/10.1016/j.jpdc.2021.04.014

Highlights

  • Presentation of fine-tuned assignments for HPC and parallel programming courses.

  • All assignments tested on students in three interconnected courses.

  • Covering vectorization, caches, multicore CPUs, GPUs, and multi-node problems.

  • Introducing 5 widely used technologies (TBB, OpenMP, CUDA, OpenMPI, and Spark).

Abstract

Programming courses usually focus on software-engineering problems like software decomposition and code maintenance. While computer-science lessons emphasize algorithm complexity, technological problems are usually neglected, although they may significantly affect performance in terms of wall time. As technological problems are best explained by hands-on experience, we present a set of homework assignments covering a range of technologies from instruction-level parallelism to GPU programming to cluster computing. These assignments are the product of a decade of development and testing on live subjects – the students of three performance-related software courses at the Faculty of Mathematics and Physics of Charles University in Prague.

Introduction

Creating a balanced computer science curriculum is becoming increasingly difficult, since computer science is one of the most rapidly expanding sciences. Regardless of their specialization, students are expected to acquire a decent background in mathematics and the theory of computation, algorithms and data structures, sufficient coding and software-engineering skills, as well as fundamental knowledge of hardware principles, operating systems, networking, GUI design, and cloud services.

With all these important topics on the palette, parallel and high-performance computing often seem like a niche confined to specialized curricula. However, understanding all the levels of available parallelism is essential at many stages of software design if the available computing power is to be fully utilized. Furthermore, optimization of code and data structures also contributes to decreasing power consumption, which is becoming important at all scales of computing, from the IoT to the cloud. Therefore, we believe that understanding various levels of parallel processing is beneficial to all software developers, not only to the designers of highly specialized code such as numerical libraries.

Hands-on experience with performance-oriented software development may also become an eye-opener for the students. In the very first run of the High-performance software development course in 2012, we assigned the same task to the same students twice: in the first assignment (at the very beginning of the course), they were instructed to provide a working solution without regard to performance; in the second assignment (in the middle of the course), performance was the sole criterion for the points awarded. The task was Bit-matrix multiplication (Section 4.3.1).

Fig. 1 shows the results of both assignments for the 18 students who delivered working solutions, together with a reference implementation supplied by the teacher. The vertical axis is the decimal logarithm of performance (normalized with respect to the brute-force algorithmic complexity expected for the given size of data); zero indicates the average performance of the reference solution. The quartile boxes show the variability of the performance over 15 different data sets. The differences in variability indicate that the students used different approaches, and some of them apparently changed their approach between the two submissions. The main observation associated with this graph is that the performance of the naïve solutions varies by three orders of magnitude among the students and that, even when they are motivated to produce the fastest possible code, there is still a more than 1:10 performance gap between the best and the worst solutions. Since then, we have used this graph in the lecture to demonstrate to the students that they may be able to improve the performance of their first-thought solutions 100- or even 1000-fold.
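To make the task concrete, the following is a minimal sketch of the word-parallel approach that typically separates the fast submissions from the naïve bit-by-bit loops. Both the Boolean (OR/AND) semantics and the packed representation are our assumptions for illustration; the actual assignment specification (Section 4.3.1) may differ.

    // Bit-matrix multiplication sketch; Boolean semantics assumed (ours, not
    // necessarily the assignment's). Rows are packed into 64-bit words and B is
    // pre-transposed, so each output bit reduces to testing whether two packed
    // rows share a set bit – 64 matrix elements are combined per AND.
    #include <cstdint>
    #include <vector>

    using Row = std::vector<std::uint64_t>;       // n bits in ceil(n/64) words

    static bool shares_bit(const Row& a, const Row& b) {
        for (std::size_t w = 0; w < a.size(); ++w)
            if (a[w] & b[w]) return true;         // any common set bit => 1
        return false;
    }

    // a: rows of A; bt: rows of the transposed B; c: zero-initialized rows of C.
    void multiply(const std::vector<Row>& a, const std::vector<Row>& bt,
                  std::vector<Row>& c, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                if (shares_bit(a[i], bt[j]))
                    c[i][j / 64] |= std::uint64_t(1) << (j % 64);
    }

Even this simple packing may explain a part of the spread in Fig. 1: a bit-by-bit triple loop performs many times more memory accesses and is much harder for the compiler to vectorize.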

In our computer science curricula, parallel and high-performance programming topics are divided into three separate courses, structured to cover related performance issues without significant overlap. All three courses are designed so that their most important pillars are the homework assignments. Selecting optimal assignments which combine appropriate complexity, didactic intent, and manageable scope is quite difficult. Furthermore, the assignments must be easy to explain and their computed results easy to verify, so that the students can test their solutions on their own.

In this paper, we present a set of assignments that emerged and were further tuned during a decade of teaching the three courses mentioned above. Over that time, the assignments were exposed to selection pressure, as the teachers dismissed many assignments which raised too many questions, problems, or failures for the students. The assignments presented in this article are those that passed this natural selection. Although it might be interesting to describe how the assignments evolved over time, we had to refrain from doing so due to limited space.

In the next section, we describe the contents of the three courses and enumerate the assignments used in each of them, providing a bigger picture that captures the main objectives of the whole curriculum as well as the relations between the assignments. This part is, of course, specific to our local conditions and may not be reproducible elsewhere. In Section 3, we introduce three assignments based on problems which are well-known but not traditionally considered a part of high-performance programming – we use them to highlight the effects of various CPU features as well as the conflict between performance and some widely used programming techniques. Three more traditional high-performance computing problems and the six assignments derived from them are described in Section 4, while GPU-related assignments are collected in Section 5. Section 6 explains our approach to evaluating the students' work. In the Conclusion, we summarize the most important achievements with this set of assignments and try to anticipate their future development.

Section snippets

The courses

The authors of this paper are responsible for the following three parallel-programming courses:

    Basic parallel programming

    a course which focuses mainly on parallel technologies and their utilization

    Advanced parallel programming

    a course which also covers parallel hardware and the underlying principles affecting performance

    High-performance software development

    a course which focuses mainly on single-threaded performance (instruction-level parallelism, vector instructions, cache hierarchy)

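As a concrete illustration of the cache-hierarchy effects targeted by the last course, consider the following sketch (ours, not taken from the course materials). Both functions sum the same row-major matrix and perform the same number of additions, yet the column-order traversal typically runs several times slower because consecutive accesses land on different cache lines.

    // Two traversals of the same row-major n x n matrix: identical arithmetic,
    // very different memory behavior.
    #include <cstddef>
    #include <vector>

    long long sum_row_order(const std::vector<long long>& m, std::size_t n) {
        long long s = 0;
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                s += m[i * n + j];        // consecutive addresses: cache friendly
        return s;
    }

    long long sum_col_order(const std::vector<long long>& m, std::size_t n) {
        long long s = 0;
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t i = 0; i < n; ++i)
                s += m[i * n + j];        // stride of n elements: frequent misses
        return s;
    }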

Non-numerical assignments

In this section, we introduce three assignments based on problems which are frequently encountered but not traditionally considered a part of high-performance programming – we use them to highlight the effect of various CPU features as well as the conflict between performance and some widely used programming techniques.
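One frequently encountered instance of such a conflict, shown here purely as our own illustration rather than as one of the assignments, is dynamic dispatch inside a hot loop: an abstraction that is good software-engineering practice may prevent the inlining and vectorization that a direct loop enjoys.

    // Virtual dispatch vs. a direct loop: the same reduction, but the indirect
    // call per element in sum_virtual blocks inlining and vectorization.
    #include <vector>

    struct Op {
        virtual double apply(double x) const = 0;
        virtual ~Op() = default;
    };
    struct Square : Op {
        double apply(double x) const override { return x * x; }
    };

    double sum_virtual(const std::vector<double>& v, const Op& op) {
        double s = 0;
        for (double x : v) s += op.apply(x);      // indirect call per element
        return s;
    }

    double sum_direct(const std::vector<double>& v) {
        double s = 0;
        for (double x : v) s += x * x;            // inlined and vectorizable
        return s;
    }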

Assignments related to CPU parallelism

We have selected the following three problems to demonstrate various aspects and levels of parallelism available in non-GPU environments: Levenshtein distance, K-means clustering, and matrix multiplication.

While real-world high-performance applications usually employ both parallel programming (at the single-node or multi-node level) and SIMD, combining them in a single assignment would be rather difficult for most students. Therefore, two of the problems are split into more than one assignment.
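For the multicore side of such a split, a minimal OpenMP sketch of dense matrix multiplication might look as follows (our illustration, not the assignment's reference code). The classic i-k-j loop order keeps the innermost loop at unit stride, so the compiler can vectorize it, while OpenMP distributes the outer rows across cores.

    // Dense matrix multiplication, C += A * B, with row-major n x n matrices.
    // The outer loop is parallelized; each thread writes disjoint rows of C.
    #include <cstddef>
    #include <vector>

    void matmul(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& c, std::size_t n) {
        #pragma omp parallel for
        for (std::ptrdiff_t i = 0; i < (std::ptrdiff_t)n; ++i)
            for (std::size_t k = 0; k < n; ++k) {
                const double aik = a[i * n + k];
                for (std::size_t j = 0; j < n; ++j)
                    c[i * n + j] += aik * b[k * n + j];  // unit stride: SIMD friendly
            }
    }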

GPU-accelerated assignments

There are two assignments dedicated to GPU programming: Physical simulation and Bucketing (which is divided into four incremental sub-assignments). Table 3 summarizes the GPU-programming features encountered in these two assignments.

We use CUDA [5] as the underlying technology for GPU programming. There are several reasons that influenced our decision:

  • CUDA is rather simple to learn for C++ programmers.

  • Similar alternatives like OpenCL, Vulkan, or DirectCompute are more complex to learn.
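To give a flavor of the Bucketing assignment, the following is a minimal CUDA kernel sketch; it is our simplification for illustration (the actual four sub-assignments are incremental and more elaborate). Each thread classifies one input element and counts it into its bucket with an atomic increment.

    // Count how many input values fall into each of the equally sized buckets
    // spanning [lo, hi); counts must be zeroed before the kernel is launched.
    #include <cstddef>

    __global__ void bucket_count(const float* data, std::size_t n,
                                 unsigned* counts, unsigned buckets,
                                 float lo, float hi) {
        std::size_t i = blockIdx.x * (std::size_t)blockDim.x + threadIdx.x;
        if (i >= n) return;
        float norm = (data[i] - lo) / (hi - lo);   // map the value into [0, 1)
        unsigned b = (unsigned)(norm * buckets);
        if (b >= buckets) b = buckets - 1;         // clamp the upper edge
        atomicAdd(&counts[b], 1u);                 // contended global atomic
    }

Launched, for instance, as bucket_count<<<(n + 255) / 256, 256>>>(...), the kernel works but contends heavily on the global atomics – precisely the kind of bottleneck that GPU assignments of this kind typically lead students to discover and then optimize away.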

Student submissions and evaluation

In our courses on parallel topics, the main evaluation criterion is the speedup, computed as the simple ratio of the measured wall times of the tested solution (submitted by a student) and of a reference baseline solution (usually a naïve serial implementation). In single-threaded assignments, the student solutions are compared against a well-optimized solution by the teacher; therefore, most (but not all) of the student submissions are slower. The measured wall time typically covers the entire execution of the solution.
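In code, the measurement reduces to something like the following sketch (our paraphrase of the described procedure, not the actual evaluation harness): both solutions are timed with a monotonic wall clock, and the speedup is the ratio of the two times.

    // Wall-clock timing of a callable; speedup = time(baseline) / time(tested).
    #include <chrono>

    template <typename F>
    double wall_seconds(F&& f) {
        auto t0 = std::chrono::steady_clock::now();
        f();                                     // run the measured solution once
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(t1 - t0).count();
    }

    // double speedup = wall_seconds(run_baseline) / wall_seconds(run_student);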

Conclusions

As software complexity grows, most introductory as well as advanced programming courses focus on software-engineering problems like software decomposition and code maintenance. The lessons tend to emphasize algorithm complexity but neglect minor technical details that may significantly affect performance in terms of wall time. The students are often encouraged to prefer simplicity over performance, quoting the famous maxim that premature optimization is the root of all evil [11].

CRediT authorship contribution statement

David Bednárek: Conceptualization, Software, Writing – original draft, Writing – review & editing. Martin Kruliš: Conceptualization, Software, Writing – original draft, Writing – review & editing. Jakub Yaghob: Resources, Software, Writing – original draft.

Declaration of Competing Interest

Members of our institution (Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic) should be excluded as possible reviewers, as they may have a conflict of interest. To the best of our knowledge, we have no other conflicts of interest.

Acknowledgements

The work was supported by the PROGRES Q48 programme of the Charles University.



David Bednárek received a doctoral degree in software systems from the Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic, and he is currently working as a lecturer and researcher at the same institution. He is a member of the parallel architectures/applications/algorithms research group, and his research interests include compilers, operating systems, NoSQL databases, and high-performance software development. He teaches many classes related to programming and code optimization, such as principles of compilers, advanced C++, and high-performance programming.

Martin Kruliš received a doctoral degree in software systems from the Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic, and he is currently working as an assistant professor at the same institution. He is a member of the similarity retrieval research group and the parallel architectures/applications/algorithms research group, and his research interests include parallel programming, high-performance computing, and concurrency in database systems. He is also an accomplished lecturer, currently teaching basic and advanced parallel programming courses as well as basic and advanced programming of web applications.

Jakub Yaghob received a doctoral degree in software systems from the Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic, and he is currently working as a lecturer and researcher at the same institution. He is a member of the parallel architectures/applications/algorithms research group, and his research interests include parallel programming, high-performance computing, compilers, and research software engineering. He regularly receives high rankings as a teacher, covering parallel programming courses, principles of compilers, principles of computer systems, and virtualization.
