1 Introduction

Processor vendors can no longer cost-effectively improve performance by simply scaling processor frequencies [1]. Instead, they increase the number of cores per processor and integrate vector units like SSE or AVX that perform multiple arithmetic operations in parallel. Both trends reflect the growing importance of parallel processing in recent years.

To make use of these capabilities, programs have to run ‘as parallel as possible’. One approach is to split programs into many parallel threads and distribute them among the cores of the processor. For this, programmers can use libraries and application programming interfaces like OpenMP, OpenACC, or Cilk++ [2,3,4]. In practice, however, threads and vector units are often underused because programmers need a deep understanding of these libraries and of parallelism in general to develop efficient parallel applications.

To solve this problem, several research projects developed automatic parallelization tools that transparently transform sequential source code into parallel code. Existing tools such as Par4all, the PIPS framework it builds on, and PluTo are able to parallelize sequential program sections under certain conditions [5,6,7,8]. In most cases these sections have to be nested loops, and some of the tools require that the respective sections be marked manually. PluTo, for instance, is capable of transforming a nested loop if the dependencies between the data accesses of loop iterations allow it (for details see Sect. 3.3). Based on the polyhedral model, PluTo transforms a nested loop so that multiple iteration steps can be executed in parallel. Library-specific pragmas for parallelization and vectorization are inserted automatically and, in some cases, even memory accesses can be optimized to achieve better cache usage (e.g., in PluTo-SICA [9, 10]). Strout et al. extended the polyhedral model so that it also allows indirect addressing of arrays [11].

Although there has been a lot of progress on automatic parallelization, most parallelization tools are subject to a number of restrictions:

  1. Current transformers require complete knowledge of memory accesses at compilation time.

  2. In particular, sections to be parallelized are not allowed to contain function calls.

  3. Many transformers require parallel section candidates to be marked manually by the programmer (although this is not always necessary [12]).

The second restriction would not be necessary if the functions were known to be side-effect-free or, as we will call them, pure. Functional languages like Haskell, for example, inherently allow the parallel execution of functions because their paradigm demands that (most) functions have no side-effects [13, 14]. Unfortunately, their performance is low compared to the imperative languages traditionally used in high-performance computing, like Fortran and C [15], whose functions can have side-effects. That is why it is desirable to combine the strengths of both paradigms.

Fortran introduced the keyword pure to mark side-effect-free functions, and the compiler checks that they really are side-effect-free. This makes it possible to parallelize more code segments automatically. Prior to this work, such a feature was not available for C. One reason is that testing whether a function is free of side-effects is more difficult in C than in Fortran.

In this work we extend the C language by the pure keyword and show how programs and source-to-source transformers benefit from it. The new keyword is used similarly to other existing function prefixes or modifiers such as static or inline.

Our contributions are:

  • We have integrated pure functions into C, a programming language more frequently used than Fortran nowadays. This seemingly small extension allows code parallelization similar to Fortran or functional programming languages without influencing or impairing the standard C options.

  • We have developed an additional compiler pass which verifies that functions marked as pure do not change the state of any variable outside of their scope. Thus, it ensures that these functions have no undesired side-effects.

  • Using the PluTo framework [7], we demonstrate the power of this new feature. By allowing pure function calls in polyhedral program loops, these loops—which have not been automatically parallelizable previously—can now transparently be parallelized in the compilation process.

  • We tested our approach with synthetic and real-world applications. The main result of these tests is that our compiler pass helps to detect parallelization opportunities in several applications, which would not be utilized otherwise.

Like other polyhedral code transformers, our solution requires slight code modifications, but using the new keyword has additional benefits. The compiler’s optimizer can, for example, apply the knowledge that parameters and their content will never be modified. Moreover, the pure keyword can also be used in libraries to mark functions as side-effect-free. This has the effect that even library function calls can be used in automatically parallelized program parts.

The article is structured as follows: after discussing the related work in Sect. 2, Sect. 3 presents our extensions to the C language and the compiler chain that allows automatic parallelization. In Sect. 4, we show the gains of the improved parallelization. Section 5 concludes the article.

2 Related Work

In this section, we discuss other C and Fortran language extensions and features that support automatic parallelization. Furthermore, we look at tools that transform sequential program codes into parallel ones.

2.1 Fortran and C Language Extensions

There are different extensions and features of programming languages that enable or support the automatic parallelization of program parts. Fortran provides two concepts: the pure keyword [16]—as we introduce it for C in this work—and co-arrays, which became part of the Fortran 2008 standard [17, 18]. A co-array program runs as if there were many instances of the same program executed asynchronously in parallel, with each instance (denoted as an image) having its own context.

C++ provides the method modifier const to ensure that methods do not change the state of the related class instance. Such const methods can only call other const methods and cannot change members of the related object, which gives the compiler better optimization opportunities. However, const methods can still change other elements and cause side-effects.

The C language provides the function attributes __attribute__((const)) and __attribute__((pure)) [19]. The __attribute__ keyword is used to signal function properties to the compiler. The attribute pure tells the compiler that the function only depends on its parameters and global variables and does not affect ‘external’ variables other than its return value. __attribute__((const)) is stricter than __attribute__((pure)) because const functions only read their parameters, but no global variables. However, these attributes are mere programmer hints to the compiler. In contrast to our compiler chain, no further semantic analysis verifies the ‘const-ness’ of a function. In other words, side-effects would still be possible.
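
As an illustration, a minimal sketch of how these attributes are attached to function declarations; the function names are our own choosing:

    /* Result depends only on the arguments; no global state is read. */
    int square(int x) __attribute__((const));

    /* Result may depend on read-only global state (here: a lookup table),
       but the function has no effect other than its return value. */
    static const double table[256] = { 1.0, 2.0 };
    double lookup(unsigned char i) __attribute__((pure));

    int square(int x) { return x * x; }
    double lookup(unsigned char i) { return table[i]; }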

High Performance Fortran (HPF) is a language extension which is not part of the Fortran standard. HPF directives in the code tell the compiler how data are to be distributed and processed [20]. The related HPF library defines standards for distributed routines, such as parallel gather and parallel scatter.

2.2 Parallelization Tools

Our compiler chain includes tools to perform automatic code transformation, optimization, and parallelization of C source code based on the polyhedral model. Over the last decades, this model has been well studied and numerous source-to-source compilation tools have evolved, for example, PluTo [21], PPCG [8], Par4ALL [5], the Cetus compiler [22], or the ROSE compiler infrastructure [23] with its PolyOpt/C optimizer. These frameworks traditionally aim for an automatic OpenMP and SIMD parallelization of sequential CPU codes; some (like PPCG) are also capable of generating CUDA or OpenCL code for GPUs. Our code generator internally uses PluTo with the SICA extension, denoted as PluTo-SICA in the following, which is available from the ‘SICA’ branch of the PluTo Git repository. PluTo-SICA extends the PluTo framework by adding highly optimized SIMD and multi-core code generation and by enforcing extensive cache usage [9, 10]. PluTo-SICA is well suited for our approach because it parallelizes loops while supporting the compiler in using vector units like SSE or AVX. The produced code optimizes the reuse of data in deeper levels of the memory hierarchy and reduces the inter-core communication in the generated multi-core code. The PluTo framework contains

  • Clan [24] and OpenScop [25] for parsing the source code and modeling the polyhedral representation of nested loops,

  • ClooG [26] for the code generation, and

  • ISL [27], Polylib [28] and Piplib [29] for the transformation steps.

Parts of PluTo are also available as an LLVM plugin for Polly [30].

3 The Integration of Pure Functions into the C Programming Language

When functions in a computer program are executed in parallel, they can interfere with each other. For example, they could alter a shared class variable so that it matters which function is called first. If functions, however, do not have such side-effects, it is safe to run them in parallel. Even so, polyhedral source-to-source transformers are not able to parallelize loops that contain side-effect-free function calls. The reason is that the memory accesses must be known during the transformation process and that function calls mask this information. By introducing the pure keyword, we allow the programmer to mark side-effect-free functions and let the compiler verify that these functions are really pure, i.e., free of side-effects. The transformer can ignore these functions and, as a result, potentially parallelize more loops.

In this section, we explain our extensions to the C programming language, the additional compiler pass, and how it supports automatic parallelization.

3.1 Language Extension

Many functional languages allow automatic parallelization by exploiting the properties inherent in this programming paradigm. In such languages, a function is supplied with input parameters and returns a result, but other than that it does not interact with the calling functions or change the state of any variable outside its scope. Hence, it is free of side-effects. This usually does not hold true for imperative programming languages like C. A C function is not a function in the mathematical sense because it can affect program parts outside of the function’s scope.

We extended C with an additional function modifier: pure. Functions of this type mimic the behavior of functions in functional programming languages. They have no impact on the program’s state except for the results of the performed computation. All elements of the program have the same state before and after the execution of the function.

As shown in Listing 1, the pure keyword is placed in front of a function to label it as pure. Pointers can also be marked as pure, and here again the keyword is placed in front of the declaration of the pointer.

Listing 1 Placement of the pure keyword in front of functions and pointers
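
Since the listing itself is not reproduced here, the following minimal sketch (with identifiers of our own choosing) illustrates the placement of the keyword; it requires our extended compiler chain, as pure is not standard C:

    /* pure in front of the function marks it as side-effect-free. */
    pure double average(pure double *values, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += values[i];
        return sum / n;
    }

    void caller(double *data, int n)
    {
        /* pure in front of a pointer declaration: the pointer can be
           assigned once and its content must not be modified.        */
        pure double *pdata = data;
        double avg = average(pdata, n);
        (void)avg;
    }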

It is important to note that pure pointers and their content cannot be modified. They can only be assigned once. Therefore, it does not make sense to allocate their memory with the malloc function because it would not be possible to alter the memory after the malloc call. Instead, one can create a normal pointer and assign it to the pure pointer.

Listing 2 shows the allowed and denied operations in pure functions. If a pure function calls other functions, these functions must be pure as well. In the body of a pure function, local variables can be declared, initialized, and, if impure, changed. Even the memory for pointers can be allocated and pointers can be freed as long as they have been declared in the function’s scope and do not reference external or global elements. Global pointers (like globalPtr in Listing 2) can be used in pure functions after being type-casted and assigned to a local pointer (extPtr2). Likewise, the return value of a pure function (like func2) can be assigned to a pure pointer (extPtr3).

Listing 2 Allowed and denied operations in pure functions
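
The following sketch, assuming our extended syntax and using the identifiers named above (globalPtr, extPtr2, extPtr3, func2), gives an impression of these rules; the concrete function bodies are our own illustration, not the original listing:

    #include <stdlib.h>

    double *globalPtr;                     /* global data, defined elsewhere */

    pure double *func2(int n)
    {
        /* allowed: allocating and modifying memory that is local
           to this pure function                                   */
        double *tmp = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++)
            tmp[i] = (double)i;
        return tmp;
    }

    pure double func1(int n)
    {
        double local = 0.0;                /* allowed: impure local variable */

        /* allowed: global data may be read after a pure type cast */
        pure double *extPtr2 = (pure double *)globalPtr;

        /* allowed: the return value of a pure function (func2)
           assigned to a pure pointer                             */
        pure double *extPtr3 = func2(n);

        for (int i = 0; i < n; i++)
            local += extPtr2[i] + extPtr3[i];

        /* globalPtr[0] = local;   denied: writes external data
           free(globalPtr);        denied: memory not allocated here */
        return local;
    }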

The effects of pure are similar to those of const in C++, as const methods are also not allowed to call non-const methods or modify the state of the class instance. However, in contrast to pure functions, it is possible to pass pointer parameters to const methods which have not been declared as const, and const methods can change other program elements (e.g., static class members).

3.2 Compiler Pass

The implementation of our additional compiler pass is almost exclusively based on standard tools, e.g., the GCC tool chain and the AntLR 4.5 parser generator. The AntLR repository provides a C grammar built from the C11 specification. We therefore assume that the C programming language standard is not violated (provided that the version from the AntLR repository does not violate it itself). It would also have been possible to integrate our compiler pass directly into a C compiler (e.g., GCC or clang), but we decided against it because it would require significant and error-prone modifications to the original compiler code.

Fig. 1 Source code modification in the extended compiler chain

The extended compiler chain and source code modifications are shown in Fig. 1 where the frames show the results of the transformations. Starting from the C file, our preprocessor (PC-PrePro) removes all system includes in the file before it is preprocessed by the GCC preprocessor (GCC-E). GCC-E resolves all remaining includes and preprocessor directives. Our compiler pass then performs a syntactical and semantical analysis (PC-CC) which determines loops that can be executed in parallel. The marked code is edited by PluTo (polycc) which inserts pragmas for parallelization and vectorization. The system includes removed by our preprocessor are then inserted again (PC-PosPro), and the code is finally compiled as an executable program by GCC.

In our compiler pass, the C file preprocessed by our own and the GCC preprocessor is submitted to AntLR to generate an abstract syntax tree (AST). While the AST is traversed, most of the code is ignored. The pass only analyzes for-loops and function declarations and implementations marked pure.

Each for-loop is checked to determine whether it only calls pure functions. If this is the case, the loop is surrounded by the directives #pragma scop and #pragma endscop, marking a section which can be translated into a polyhedral representation by the PluTo framework.
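
As a sketch (assuming a function filter_pixel that has been declared pure), the wrapping looks as follows:

    pure double filter_pixel(pure double *in, int i);    /* declared pure */

    void apply_filter(int n, double *in, double *out)
    {
    #pragma scop
        for (int i = 0; i < n; i++)
            out[i] = filter_pixel(in, i);   /* only pure calls inside the loop */
    #pragma endscop
    }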

If a function is declared or implemented pure, it must be verified to be pure. For this, the function name is added to a hashset which contains all functions declared or considered pure. The function is then checked to ensure that it calls only pure functions, i.e., only functions from the hashset (including itself). In the beginning, the hashset is initialized with those C standard functions that have no side-effects (e.g., sin, cos, log, etc.). Additionally, we insert malloc and free into the hashset. Although these functions are not strictly free of side-effects, their side-effects do not affect other threads. Enabling malloc is useful because functions might need to return arrays of data which cannot be allocated on the stack, as stack memory would be released after the function returns. Furthermore, our compiler pass checks whether every free call only frees memory which has been allocated within the same pure function, and returns an error if this is not the case.

The compiler pass also verifies that assignments do not modify function-external data: if a pointer is assigned function-external data (e.g., in the form of parameters or global data), it must be declared pure, and the assigned data require a corresponding type cast with the prefix pure (see Listing 3). Only in this way can it be ensured that external and global data can be read, but not manipulated.

Listing 3 Reading external data in a pure function via a pure type cast
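
A minimal sketch of this rule, with identifiers of our own choosing:

    double *globalData;                        /* defined outside the function */

    pure double sum_external(pure double *param, int n)
    {
        /* external data may only be bound to a pure pointer,
           and only via an explicit pure type cast             */
        pure double *g = (pure double *)globalData;

        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += param[i] + g[i];
        return s;
    }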

Whenever data are stored somewhere in the function, our compiler pass additionally checks that the storage target was declared in the function’s scope. If the data are assigned to a target which was declared outside of the scope, the code would imply a side-effect and therefore results in a compilation error (see Listing 4).

Listing 4 Assignment to an externally declared target resulting in a compilation error
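
Again as a sketch with our own identifiers, the following assignment would be rejected:

    double *result;                   /* declared outside the function's scope */

    pure double accumulate(int n)
    {
        double local = 0.0;
        for (int i = 0; i < n; i++)
            local += i;               /* allowed: target declared locally */

        result[0] = local;            /* compilation error: the target was
                                         declared outside the scope, so the
                                         assignment implies a side-effect   */
        return local;
    }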

If our compiler pass finishes without errors, it is ensured that the pure functions do not have any undesired side-effects. But since the pure keyword would cause a compiler error in the classical GCC tool chain, we must replace pure prefixes of pointers in function argument lists and remove the prefixes from functions entirely. The pointer prefixes are replaced with the const keyword, which has similar but weaker limitations than pure in that it only prohibits modifications of the data passed to the function. The function prefix, on the other hand, cannot be replaced with const since that modifier would be bound to the function’s return value and a normal pointer would, for example, be changed to a constant pointer. As a result, we remove the function prefix completely because there is no keyword in C representing a similar feature.
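
For a function signature, this rewriting step can be pictured as follows (a sketch, not the literal output of our tool):

    /* before the rewriting step */
    pure double dot(pure double *a, pure double *b, int n);

    /* after the rewriting step: pointer prefixes become const,
       the function prefix is removed                            */
    double dot(const double *a, const double *b, int n);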

An important property of our extension is that it does not negatively influence the C programming language. Removing it has no effect on the results of a program other than that the program might not be as parallelizable as before. In other words, the pure extension does not restrict the C syntax.

Fig. 2 Iteration points and dependency structure with an invalid (left) and a valid (right) tiling

3.3 Automatic Parallelization

For the automatic parallelization we use PluTo, which relies internally on the polyhedral model for transforming sequential code into parallel code. PluTo is unaware of pure functions and the fact that it can ignore them. For this reason, pure functions that are called in loops marked by #pragma scop and #pragma endscop must be temporarily removed. We substitute function calls in such loops by special, unique identifiers to make the function calls appear as if they were constants (in Fig. 1 the function fnAB() is, for example, replaced by tmpConst_fnAB). After that, PluTo can check whether the marked sections meet the constraints for polyhedral transformation, and possibly optimize the code and insert pragmas for OpenMP and vectorization. Once the polyhedral transformer has finished its tests and transformations, the previously substituted function calls are adapted and reinserted into the source code. Since PluTo inserts new program parts, including preprocessor directives, we restart the GCC toolchain from the beginning with the program file built at the end of our compiler pass.
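
A sketch of this textual substitution (the loop body is our own example; only fnAB and its replacement tmpConst_fnAB are taken from Fig. 1):

    pure double fnAB(void);

    /* loop as marked by our compiler pass */
    #pragma scop
    for (int i = 0; i < n; i++)
        x[i] = fnAB();
    #pragma endscop

    /* version temporarily handed to PluTo: the call looks like a constant */
    #pragma scop
    for (int i = 0; i < n; i++)
        x[i] = tmpConst_fnAB;
    #pragma endscop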

Polyhedral analysis We give a short account of the polyhedral model and analysis. More detailed descriptions can be found in the work referenced in this subsection.

In the polyhedral model, the iteration points of a loop nest and the dependencies between them are represented by a \(\mathbb {Z}\)-polyhedron [31,32,33] like the ones given in Fig. 2. Each dimension of the polyhedron represents the iterator of one loop of the loop nest. A valid or legal transformation results in a new execution order of the iteration points that respects the data dependencies (arrows); backward-facing arrows prevent parallelization. A transformation may manipulate loop dimensions (index variables) and thereby deform the polyhedron such that computations can be processed in parallel. In the example in Fig. 2, the polyhedron in the left diagram is sheared into the polyhedron in the right diagram, in which no arrow points backwards. The iterator variables j and i are transformed into the variables \(j'\) and \(i'\). The left tiling (red), optimal in the sequential case, cannot be used to parallelize the application because iteration points of different blocks have reciprocal dependencies. After the transformation, it is possible to apply a rectangular tiling (green) which allows an order of execution that respects all dependencies, and independent blocks can be processed in parallel. Generally, a transformation must be affine (see, e.g., Bondhugula et al. [34]), which the shearing is.
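
The shearing in Fig. 2 can be pictured with a small stencil loop. The following sketch (our own illustration, without tiling, and not the code PluTo emits) shows how the affine transformation \(t = i + j\) exposes a parallel inner loop over each anti-diagonal:

    /* Original nest: A[i][j] depends on A[i-1][j] and A[i][j-1],
       so neither loop can be parallelized directly.              */
    void sweep(int N, int M, double A[N][M])
    {
        for (int i = 1; i < N; i++)
            for (int j = 1; j < M; j++)
                A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);
    }

    /* After skewing with t = i + j, all points on one anti-diagonal t
       are independent; the inner loop can run in parallel.            */
    void sweep_skewed(int N, int M, double A[N][M])
    {
        for (int t = 2; t <= N + M - 2; t++) {
            int lo = (t - (M - 1) > 1) ? t - (M - 1) : 1;
            int hi = (t - 1 < N - 1) ? t - 1 : N - 1;
            #pragma omp parallel for
            for (int i = lo; i <= hi; i++) {
                int j = t - i;
                A[i][j] = 0.5 * (A[i-1][j] + A[i][j-1]);
            }
        }
    }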

Our compiler chain uses the parallelization tools PluTo and PluTo-SICA which apply the polyhedral representation and analysis to determine if a loop can be processed in parallel. They insert pragmas for parallelization and vectorization and generate cache-aware tilings of loop nests [35].

The current version of our compiler tool chain cannot take full advantage of all SICA functionalities. In Sect. 4 we test our approach by moving inner loops of loop nests to external pure functions. In these cases the code transformer cannot investigate the whole loop nest and therefore does not fully optimize vectorization and cache alignment. In the future, our compiler pass could store metadata from pure functions containing information about array accesses and iteration patterns and use this information to conduct SICA cache-aware transformations.

3.4 Limitations

We have designed our compiler pass to always prioritize safety where possible. Listing 5 shows an example which seems to be fine at first glance: the function func is called within a loop, but it is pure. However, the array array passed to func is also assigned the values of the function call, which can, and here does, result in a side-effect. The computation of array[i] depends on array[i-1], so the order in which the values are computed matters. Our compiler pass therefore checks for each parameter of a pure function whether it appears on the left-hand side of an assignment operator in the loop nest and, if this is the case, throws an error.

Listing 5 A pure function call whose array argument is also the assignment target of the loop
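
A sketch of the situation (our own function bodies; only func and array are named in the text):

    pure double func(pure double *array, int i)
    {
        return 0.5 * array[i - 1];         /* reads the previous element */
    }

    void run(double *array, int n)
    {
        for (int i = 1; i < n; i++)
            /* rejected by our pass: array is a parameter of the pure
               function and also the target of an assignment in the loop,
               so the iterations must not be reordered                     */
            array[i] = func(array, i);
    }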

Yet, similar to other performance-relevant optimizations, our approach can be deceived by aliasing. Listing 6 shows the same example as Listing 5, but now, on the left-hand side of the assignment (line 14), array is replaced by a pointer alias which points to the same memory region. Since it compares only the names of the variables, the compiler pass is not aware of this and does not throw an error. Even though there are static code analyzers for detecting such pointers at compilation time, there are situations where these tools fail, for example, when the alias depends on runtime conditions [36, 37]. Moreover, there also exist tools that instrument the program code and detect at runtime whether an array is accessed via different pointers [38]. But we do not employ any of these tools.

Listing 6 The example of Listing 5 with a pointer alias deceiving the compiler pass
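
Sketched with our own identifiers (the alias name is hypothetical), the aliasing variant looks like this:

    pure double func(pure double *array, int i)
    {
        return 0.5 * array[i - 1];
    }

    void run(double *array, int n)
    {
        double *alias = array;             /* points to the same memory */

        for (int i = 1; i < n; i++)
            /* not detected: textually, array no longer appears on the
               left-hand side, yet the same side-effect occurs via alias */
            alias[i] = func(array, i);
    }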

4 Evaluation

We evaluated our approach by applying it to four program codes:

  1. a matrix–matrix multiplication,

  2. a code iteratively computing the heat distribution of a point-wise heated plate,

  3. an image processor from the field of remote sensing that filters (or processes) satellite images using the hyperspectral observation technique, and

  4. a sparse matrix–vector multiplication using the C version of a function from the linear algebra library LAMA [39].

The first two applications were artificially modified by adding functions to the loop bodies so that they are not parallelizable without marking the functions pure. Without the modifications, they are automatically parallelizable by source-to-source compilers; in our experiments, we ran PluTo/PluTo-SICA on the unmodified code and compiled the transformed code for comparison. The loops in the latter two applications contain too many operations for a proper dependence analysis, so traditional polyhedral source-to-source transformers are unable to parallelize them. By reducing the complexity of the codes with pure functions, it is possible to generate parallel code automatically. The generated code is compared to hand-optimized versions.

Our toolchain currently generates parallel code exclusively through OpenMP, so the parallelized program code must be compiled by a compiler that supports OpenMP. Before we describe the performance tests and comparisons, we briefly explain the codes and the test environment.

4.1 Test Applications

The first application multiplies two matrices with \(4096 \times 4096\) elements each. Listing 7 shows the relevant part of the program. The transformed code of the matrix–matrix multiplication is shown in Listing 8.

Listing 7 Relevant part of the matrix–matrix multiplication
Listing 8 Transformed code of the matrix–matrix multiplication
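
Since the listings themselves are not reproduced here, the following sketch conveys the structure of the kernel: the loop nest only calls the pure function dot, so our pass can mark it as a scop and hand it to PluTo (the exact indexing and names in Listing 7 may differ; for simplicity, B is assumed to be stored transposed):

    #define N 4096

    pure double dot(pure double *row, pure double *col, int n)
    {
        double sum = 0.0;
        for (int k = 0; k < n; k++)
            sum += row[k] * col[k];
        return sum;
    }

    void matmul(double A[N][N], double BT[N][N], double C[N][N])
    {
        /* the loop nest only calls the pure function dot, so our compiler
           pass can mark it as a scop and PluTo can parallelize it          */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = dot(A[i], BT[j], N);
    }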

Polyhedral transformers cannot parallelize the original source of the program because its loop contains nested function calls (the dot function). But the polyhedral transformers can do their job after the source code has been processed by our compiler pass because the pass temporarily substitutes the function call by a constant variable. This temporary substitution is only admissible because the substituted function has no undesired side-effects. Consequently, the parallel execution of the function cannot result in incorrect computations caused by race conditions.

The second application computes the heat distribution of a heated plate which is represented by a regular \(4096 \times 4096\) grid. The plate is permanently heated at one point on one side. The temperature change of each point is iteratively computed as the average temperature of its direct neighbors. We computed 200 time steps in our experiments.

The third application, which we call the satellite application, is a geoscience method for the retrieval of the aerosol optical depth (AOD) [40] from hyperspectral data. The data were obtained from a Moderate Resolution Imaging Spectroradiometer (MODIS) aboard NASA’s Aqua satellite. The application uses hyperspectral observations and additional data arrays to ‘filter’ hyperspectral images and extract information about the atmosphere.

The fourth application is a standalone version of the ELL sparse matrix–vector multiplication function from the LAMA library. The matrix that is multiplied is stored in ELL format. We used the Boeing/pwtk data set as input for the function; it contains a symmetric matrix with over 217K rows and columns, 11.5 million non-zero elements, and a size of about 155 MiB.

It is worth noting that the main loop nests of the first two applications are polyhedral by default, whereas the loops of the last two applications are not.

The main loop of the satellite application includes a call to a complex function performing the filter. This function is called for each pixel and does not introduce side-effects on other iterations, i.e., it is pure (a more detailed description of the function is given by Liu et al. [41]). The function itself contains several hundred lines of code and is not transformable or even processible by static code analyzers like PluTo because of its complexity, for example, dynamic conditional jumps within the function.

The loop nest of the LAMA function (fourth application) also contains a function call and indirect addressing (because of the sparseness of the matrices). The function computes the dot product of two vectors and can also be declared pure.

4.2 Test Environment

We used a computer equipped with four AMD Opteron 6272 processors for our tests where each processor runs 16 cores at 2.1 GHz. In total, the node has 512 GiB of main memory and a memory bandwidth of about 100 GiB/s.

For each application, we provide three different versions of the code:

  1. a sequential version as a baseline,

  2. a version automatically parallelized with PluTo-SICA, and

  3. a version compiled with our compiler chain.

We compiled all applications with GCC 7.2.0 and used the -O2 flag in all cases as one of the competitors must be compiled with this optimization flag. The parallelized versions were started with 1, 2, 4, 8, 16, 32 or 64 cores.

It is important to mention that for the versions generated by PluTo or PluTo-SICA (without our preprocessor), the code of the pure functions must be inlined manually due to the limitations of the polyhedral transformers. It would not be possible to parallelize any of the codes using PluTo and/or PluTo-SICA alone. While the matrix–matrix multiplication and the heat-distribution code can easily be changed to be processible by the polyhedral tools, this is not possible for the two real-world examples.

Additionally, we compiled the matrix–matrix multiplication and the heat distribution application with the Intel C/C++ Compiler (ICC) 16.0.2 because the automatic vectorization and optimization capabilities of this compiler are more elaborate than the corresponding features of the GCC compiler chain. The matrix–matrix multiplication is also compared with the professionally hand-tuned matrix–matrix multiplication implementation of the Intel Math Kernel Library (MKL) to gain an impression of the quality achieved by automatically generated parallel codes.

4.3 Scaling Tests

In the following, we evaluate the scaling behavior of the applications. For each program, we exponentially increased the number of parallel cores from \(2^0\) to \(2^6\) and measured the average runtime over 20 application runs. Since the polyhedrally transformed PluTo-SICA code does not include core pinning, we additionally used numactl in our experiments to ensure that the operating system did not migrate threads from one core to another.

The first two applications, matrix–matrix multiplication and heat distribution, are used to show several aspects of our new pure directive. Although we use the PluTo-SICA tool as the parallelization backend in our compiler chain, the performance differs from that of the programs directly generated with PluTo-SICA (without pure). These performance differences are caused by the following two restrictions:

  • The standalone version of PluTo must inline code, or else it cannot parallelize it.

  • The SIMD/cache optimization of PluTo-SICA could not be used with the pure directive, so the pure code was not optimized for SSE/AVX units and cache usage (see Sect. 3.3).

4.3.1 Matrix–Matrix Multiplication

Using the GCC-based compiler chain for the matrix–matrix multiplication, the first test compares the different parallelization tools with each other and with the corresponding MKL library version compiled with the Intel compiler (see Fig. 3). The runtime of the sequential GCC version was 22.17 s and is plotted as a dashed line in the figure.

Fig. 3 Execution time of the matrix–matrix multiplication using GCC

With an increasing number of utilized cores, the execution time decreases strictly when our compiler chain is used (Fig. 3, pure bars), whereas it temporarily increases when scaling from 16 to 32 cores if the PluTo polyhedral transformer is used on its own (PluTo bars). Surprisingly, the pure version is always significantly faster than the ‘simple’ PluTo parallelization, which is counter-intuitive because our tool chain uses PluTo to parallelize the code. PluTo inlines the function, which is usually faster than calling it [42]. We therefore investigated the automatically parallelized source code and found that a program part was parallelized that was not intended to be parallelized. This program part is a loop that uses malloc to allocate the matrices, and, as mentioned, malloc is one of the standard C functions that we treated as pure. The parallelization of this loop explains the different speedups. The black bars in Fig. 3 show the execution times of the matrix–matrix multiplication for pure when the initialization of the matrices is manually excluded from the parallelization. In this case the runtimes are close to the runtimes of the PluTo-only version.

Our automatic parallelization using the GCC compiler chain cannot compete with PluTo-SICA and is much slower than the hand-tuned MKL library. The reason is that PluTo-SICA’s code and the MKL library implementation are able to make exhaustive use of SSE/AVX instructions, which offer another parallelization opportunity, and are also able to better cache-align the data.

Fig. 4 Execution time of the matrix–matrix multiplication using ICC

The pure program gets a significant performance advantage when Intel’s ICC is used for compilation while the PluTo version cannot benefit from it (see Fig. 4). This is because ICC can vectorize the extracted function that computes the dot product of two vectors. The performance boost is higher for smaller core numbers, while the performance of pure together with the ICC compiler converges to the performance of the GCC compiler chain for core counts higher than 16.

This automatic vectorization is not carried out when the function is inlined, which is why PluTo and PluTo-SICA cannot benefit from the ICC compiler for smaller core counts. These two parallelization approaches are only faster than the single-core ICC version when more than two cores are used, and PluTo-SICA only outperforms the pure directive for eight or more cores.

The comparison with the matrix–matrix multiplication using MKL shows that the professional, hand-tuned program code can still significantly outperform all other program versions. The MKL version is 7.28\(\times \) faster than the pure version for a single core and 5.82\(\times \) faster in the case of 64 cores. This comparison shows that the automatic code parallelization tools leave potential for optimization, although the positive effects of the pure directive, SICA (cache alignment and vectorization) and an optimized compiler like Intel’s ICC slightly reduce this gap.

Next we investigated the speedup gained by the parallelization. The speedup was computed by dividing the runtime of the sequential program \(T_{seq}\) compiled with the GNU GCC compiler chain by the runtime of the parallel program \(T_{par}\):

$$\begin{aligned} Speedup = \frac{T_{seq}}{T_{par}} \end{aligned}$$
Fig. 5 Speedup of the matrix–matrix multiplication application

Figure 5 shows the results. For a small number of cores, the GCC and the ICC versions hardly differ when PluTo or PluTo-SICA is used alone, while the version using pure benefits strongly from the Intel compiler. Unfortunately, this advantage decreases for higher core counts. PluTo-SICA with its vectorization capabilities significantly outperforms both PluTo and pure, while it is still much slower than the optimized MKL code. This hand-optimized code already achieves a speedup of 37.44 for two cores (nota bene, compared to the GCC single-core version), but increases this speedup ‘only’ to 72.16 for 64 cores.

4.3.2 Heat Distribution

The parallelization of the heat distribution computation shows results slightly different from those of the matrix–matrix multiplication. Here, the executable generated with the polyhedral transformer PluTo achieves better results than the optimizations triggered inside the pure compiler chain (see Fig. 6). As the performance difference between PluTo and PluTo-SICA is negligible, we only present results for PluTo-SICA and for the pure version. The sequential GCC version has a runtime of 34.14 s, while the version generated by the Intel ICC compiler requires 31.32 s.

Fig. 6 Execution time of the heat distribution application

We ran both parallel program versions with valgrind using the cachegrind tool to analyze the caching behavior. However, the variation in the number of cache misses is not large enough to explain the significant differences in the execution time.

In the next step, we used the Intel VTune profiler to analyze the runtime variations. The analysis reveals that both versions have no spinning or overhead times and require almost the same time for the OpenMP operations, where OpenMP parallelizes the nested loop that performs the stencil algorithm. The main algorithm is (again) inlined in the PluTo version, whereas it is called as an external pure function when we compile the code with our compiler chain. The profiler also shows that the execution of the nested loop in the PluTo version takes only 64% of the time needed by the loop and the stencil algorithm in the pure version.

To gain a deeper understanding of the underlying reasons, we used the Linux performance analysis tool perf. When the PluTo-SICA version of the stencil program was started with a single thread, perf reported 47.5 billion user-space and 424 million kernel-space instructions. The pure version, on the other hand, executed 87.8 billion user-space and 470 million kernel-space instructions. Hence, inlining the function in the PluTo version saves many function calls. Due to the low number of computations in the function, its optimization cannot compensate for the additional time required by the function calls.

In Fig. 7 the speedups achieved by the tools are displayed. PluTo compiled with GCC performs best for up to 16 threads, but worst for 32 and 64 threads where the programs compiled with Intel’s ICC show the best results. Yet, the speedups of all program versions decrease for more than eight cores. The advanced vectorization capabilities of Intel’s ICC (and of the SICA extension of the PluTo-SICA compiler chain) do not have a positive impact on this application due to the structure of the stencil’s memory accesses.

4.3.3 Satellite

For the satellite application, we can only provide results for the unmodified code and our compiler chain. The reasons are the limitations of the polyhedral transformer, which is not capable of transforming the application’s complex loops, and the complexity of the application’s filter functions, which is clearly too high for the dependence analysis used.

The measurements show an unbalanced behavior of the image filter application in the later program phases, which causes the load distribution over the available cores to become uneven. We manually adapted the code to overcome the imbalance by extending the OpenMP directive #pragma omp for private (...) with the clause schedule(dynamic,1). The results for an increasing number of cores can be seen in Fig. 8.

Fig. 7 Speedup of the heat distribution application

Fig. 8 Execution time of the image filter application

The application scales well and we observe a continuous speedup for all versions of the program when the number of cores is increased (Fig. 9). The only exception is the ICC-compiled code with manually added OpenMP directives for which we see a drop when increasing the number of cores from 32 to 64. The best speedup is accomplished when the unmodified, automatically generated code is compiled with the ICC compiler and run on 64 cores.

Fig. 9 Speedup of the image filter application

This program is a good example for showcasing that internal knowledge about the application can help to obtain better performance by manually adjusting the automatically generated code.

4.3.4 LAMA (ELLMatrix)

Similar to the satellite application, we can only provide results for manually modified code and for code automatically generated with our pure compiler chain. Without the upstream pure stage, the PluTo and PluTo-SICA tools are not able to parallelize this code (as pointed out at the beginning of Sect. 4.3).

The manually written parallelization of the LAMA application uses the directive #pragma omp parallel for private (...) schedule(static) because we expected balanced threads, where all threads require almost the same execution time. The resulting runtimes for the ELL sparse matrix vector multiplication with an increasing number of cores are illustrated in Fig. 10.

In this case, the executable built with our compiler chain performs slightly worse than the manually built executable. This is due to the fact that the thread load differs greatly at the end of the program. The polyhedral transformer does not take this knowledge into account while the manually written parallelization exploits this information. Thus, the manually written version outperforms the automatically parallelized program. However, the performance difference is reduced when the number of cores is increased.

Fig. 10 Execution time of the LAMA application

For all program versions, the speedup increases with the number of cores for up to 32 cores (see Fig. 11). While the versions compiled with Intel’s ICC are more efficient than the corresponding GCC versions for less than 16 cores, they are less efficient for more than 16 cores. For exactly 16 cores, only the manually (but not the automatically) parallelized version is better than the GCC versions. Further, the speedup for ICC-compiled versions decreases considerably at 64 cores. The highest speedup is achieved by the unmodified, automatically parallelized version compiled with the GCC compiler and utilizing 64 cores.

Fig. 11 Speedup of the LAMA application

The performance differences between the manually and the automatically parallelized programs are not significant because the difference in the runtime is at most \(8 \cdot 10^{-4}\) s. As a result for the LAMA application, we notice that the automatically generated version yields about the same performance as the manually parallelized version.

In summary, our automatic parallelization achieves good results compared to the other solutions presented. In contrast to common polyhedral transformations, the pure keyword allows the transformers to parallelize loops that contain function calls. Additionally, our results show that external knowledge about the application helps to achieve better results, particularly knowledge about the partitioning of the parallel subtasks.

5 Conclusion

In this article, we have introduced the pure keyword for the C programming language, which is easily applied and benefits the automatic parallelization of programs. Previously, polyhedral transformers were not able to parallelize any loops containing function calls. By marking side-effect-free functions as pure and checking that they are indeed side-effect-free, automatic parallelizers can now safely ignore such functions and thus parallelize more loop nests. Although there exist similar language extensions, this is the first working C compiler pass which checks that a function is guaranteed to have no side-effects.

Our evaluation shows that our preprocessor significantly improves automatic parallelization. It can even reduce the complexity of the program analysis in some cases so that PluTo parallelizes code that it normally would not be able to analyze.

In the future we will integrate the pure keyword into the C++ programming language and couple pure and PluTo-SICA more tightly to provide better cache alignment and better support for code vectorization. Besides, we are planning to evaluate the potential of other libraries and APIs, for instance, Cilk++ or OpenACC instead of OpenMP.