A Study of Performance Scalability by Parallelizing Loop Iterations on Multi-core SMPs
Prakash Raghavendra, Akshay Kumar Behki, K. Hariprasad, Madhav Mohan, Praveen Jain, Srivatsa S. Bhat, V.M. Thejus, Vishnumurthy Prabhu
Department of Information Technology, National Institute of Technology Karnataka, Surathkal
srp@nitk.ac.in, aksbeks@gmail.com, harinitk2007@gmail.com, madhavmon@gmail.com, prgu_jain@yahoo.com, bhatsrivatsa@gmail.com, thejusvm@gmail.com, prabhuvishnumurthy@gmail.com

Abstract. Today, the challenge is to exploit, in software, the parallelism made available by multi-core architectures. This can be done by re-writing the application, by exploiting the hardware capabilities, or by expecting the compiler and software runtime tools to do the job for us. With the advent of multi-core architectures ([1] [2]), this problem is becoming more and more relevant. Even today, there are not many run-time tools that analyze the behavioral patterns of performance-critical applications and re-compile them. So, techniques like OpenMP for shared-memory programs are still useful for exploiting the parallelism in the machine. This work studies whether loop parallelization (both with and without applying transformations) is a good approach for running scientific programs efficiently on such multi-core architectures. We found the results encouraging, and we believe this could lead to good results if implemented fully in a production compiler for multi-core architectures.

1 Introduction

Parallel processing requires program logic to have zero dependency between successive iterations of a loop. To run a program in parallel, we can divide the task between multiple threads or processes executing in parallel. We can even go to the extent of running these parallel pieces of code simultaneously on different nodes of a high-speed network. However, the amount of parallelization possible depends on the program structure as well as on the hardware configuration. OpenMP [3]
is an Application Program Interface (API) specification for C/C++ and FORTRAN that may be used to explicitly direct multi-threaded, shared-memory parallelism. It is a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms ranging from the desktop to the supercomputer. OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization. It has been implemented in most of the popular compilers, such as those from GNU (gcc), Intel, IBM, HP, and Sun Microsystems.

C.-H. Hsu et al. (Eds.): ICA3PP 2010, Part I, LNCS 6081, 2010. (c) Springer-Verlag Berlin Heidelberg 2010
Most OpenMP parallelism is specified through compiler directives embedded in C/C++ or FORTRAN source code. The pre-processor directive (starting with #), together with the OpenMP directive, instructs the compiler during pre-processing to implement parallel execution of the code that follows. Various directives are available; the one whose implementation was our prime interest in this project is #pragma omp parallel for. The #pragma directive is the method specified by the C standard for providing additional information to the compiler, beyond what is conveyed in the language itself. The #pragma omp parallel for directive tells the compiler that all iterations of the for loop that follows can be executed in parallel. In that case, the OpenMP compiler generates code to spawn an optimized number of threads based on the number of cores available. Consider the following example of a typical C/C++ program using OpenMP pragmas:

#include <omp.h>

main()
{
    int var1, var2, var3;

    /*** Serial code ***/

    /*** Beginning of parallel section: fork a team of threads ***/
    /*** and specify variable scoping ***/
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        /*** Parallel section executed by all threads ***/
    }
    /*** All threads join the master thread and disband ***/

    /*** Resume serial code ***/
}

The variables var1 and var2 are private to each of the spawned threads, and the variable var3 is shared among all the threads. The intent of this work is to study some OpenMP programs and see whether they scale well on multi-core architectures. Further, we would also like to parallelize non-parallel loops by applying transformations (using known techniques like unimodular and GCD transformations [5] [6] [7]) and see whether they too scale well on such architectures. We used some known OpenMP pragmas as case studies and implemented them in our compiler to study the
performance. In Section 2, we describe the way we implemented these OpenMP pragmas. In Section 3, we discuss the unimodular transformations we used to parallelize non-parallel loops. In Section 4, we tabulate and explain all the results. Section 5 concludes the paper and suggests some directions for future work.
2 Implementation of OpenMP Pragmas in gcc

The parallel portions of the program can be run on different threads; for this we have to implement pthread library function calls in C. We considered two approaches: first, using a runtime library, and second, using a wrapper that calls gcc. In the first approach, we planned to have our own library functions, which would be called in the same way as OpenMP. We felt that this might not have much impact on performance, and therefore we resolved to do the job at the source level: we would call gcc on the resultant program for further code generation and optimization for the target platform. Hence we opted for the second approach, in which we developed a wrapper on gcc that takes the file containing OpenMP directives as input and automatically generates code that implements multithreading. The wrapper should be able to search for the OpenMP directives in the input program and replace them with appropriate thread calls. This could easily be achieved using a shell script, since we were working in a Linux environment. The script searches for #pragma omp and removes it from the code, and the loop that follows the directive is placed in a function to be called by a thread. The loop to be parallelized is divided into several segments, each executed by a thread; the kernel implicitly assigns the individual threads to the available cores.

Threads can be assigned work in two ways: static and dynamic assignment. In static assignment, the total number of iterations of the loop is divided equally among the threads. In dynamic assignment, loop iterations are handed to threads as and when the threads complete their previous tasks. The static implementation was observed to be faster than its dynamic counterpart, since there is no explicit need to manage the tasks given to each thread. But in this
case the number of iterations of the loop must be divisible, at compile time itself, by the number of threads we create. We used the following standard POSIX thread calls:

pthread_create(&thread_id, NULL, function, &value);

This call creates a thread with thread ID thread_id and default attributes (specified by NULL as the second parameter), executes function() in the thread, and passes value (usually a structure) to the function.

pthread_join(thread_id, &exit_status);

This call waits for the thread with thread ID thread_id to complete its execution and collects its exit status (the return value, if any) in exit_status. It also helps in synchronizing more than one thread. When the OpenMP pragmas are non-nested, the case is very simple: all we have to do is put the code following the directives into separate functions and call those functions from threads. Whereas when there are nested
OpenMP directives, we have to create threads again inside the outer thread functions, resulting in nested thread functions. As an added feature, the number of threads to be used can be specified as a command-line argument to the script, the default being two. We run the script as follows:

    ./script.sh <input_file> [no_of_threads]

3 Unimodular Transformations

A unimodular transformation is a loop transformation defined by a unimodular matrix. To test the validity of a general unimodular transformation, we need to know the distance vectors of the loop. After a valid transformation, if a loop in the transformed program carries no dependency, then it can execute in parallel. Many researchers have applied unimodular transformations for the parallelization of loops [11] [12]. Such transformations, though limited in applicability by the model on which they work, can be applied very elegantly to loops, yielding various forms of loops executable in parallel. These forms can be generalized as a set of non-parallel loops sandwiched between a set of outer parallel loops and a set of inner parallel loops. The beauty of this technique is that we can control how many such inner and outer loops we want, depending on the number of schedulable resources we have. If the input loop does not fit the model, or if the number of dependencies is more than the model allows, we will not be able to obtain completely parallel loops. In that case, we have to run the loops with explicit communication, as explained in works like [12]: the loops are allowed to have dependencies, yet we still run them in parallel, inserting explicit synchronization wherever necessary so that the loops run correctly, honoring all data dependencies. An extension of this technique is the one in [13], where the layout of arrays is optimized for such loops.

Program Model. Our
model program [6] is the loop nest L = (L1, L2, ..., Lm):

L1: do I1 = p1, q1
L2:   do I2 = p2, q2
        ...
Lm:     do Im = pm, qm
          H(I1, I2, ..., Im)
        enddo
      enddo
    enddo

An iteration H(i) is executed before another iteration H(j) in L iff i < j. Take any m*m unimodular matrix U, and let LU denote the program consisting of the iterations of L such that an iteration H(i) is executed before another iteration
H(j) in LU iff iU < jU. LU can be written as a nest of m loops with an index vector K = (K1, K2, ..., Km) defined by K = IU. The body of the new program is H(KU^-1), which we write as HU(K). The program LU is the transformed program defined by U, and the transformation L -> LU is the unimodular transformation of L defined by U. The idea here is to transform the non-parallel loop into another basis, so that the same loop becomes parallel (and can be executed in parallel on the target machine).

4 Results

We studied two problems: first, OpenMP-ready programs with explicitly parallel loops; second, loops which are not parallel as given, but which may need some transformations (like unimodular transformations) to run in parallel. We show the results for the first case in Section 4.1 and the results for the second case in Section 4.2.

4.1 Study of OpenMP Parallel Programs

We took OpenMP programs for matrix multiplication (array sizes of 1800 * 1800) and LU factorization (array sizes of 1800 * 1800). We tested these programs with different numbers of threads (2, 4, 6, 8, 16, etc.) on an IBM Power 5 server and noted the performance. The IBM Power 5 server we used (IBM A) has 4 physical processing units (8 cores); each processing unit has 2 cores. The maximum available physical processing capacity was 3.8 units, since the rest was used for the VIOS (Virtual I/O Server). The virtual machine on which the programs were tested was configured with two settings:

Profile 1: Virtual Processing Units: 3 (6 cores); Physical Processing Units: 3 (6 cores)
Profile 2: Virtual Processing Units: 8 (16 cores); Physical Processing Units: 3.8 (approx. 8 cores)

Matrix multiplication. The program snippet which performs matrix multiplication, implemented with threads, is as follows:

#pragma omp parallel for private(i,j,k)
for (i = 0; i < 1800; i++) {
    for (j = 0; j < 1800; j++) {
        c[i][j] = 0;
        for (k = 0; k < 1800; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Table 1 shows the statistics for
the above program using Profile-1 and Profile-2. Graph-1 shows the graph corresponding to Profile-1, and Graph-2 shows the graph corresponding to Profile-2.
Table 1. Scale-ups for the above program with Profile-1 and Profile-2 (columns per profile: Threads, Time in s, Scale-up; numeric entries not reproduced).

Fig. 1. Graph-1.

From the observations, it can be concluded that the scale-up increases with the number of threads as long as the number of threads is less than or equal to the number of cores available. We can never approach the ideal speed-up of N, where N is the number of cores, since there is always some synchronization to be done, which slows down program execution. Comparing the maximum scale-ups in the two cases (Profile-1: 3.96, Profile-2: 4.59), even though the number of cores in the second case is twice that of the first, a considerable gain in scale-up is not obtained. This is because in the second case 16 is the number of virtual cores, not physical cores (of which there are actually 8); hence one core serves two threads, which introduces overhead due to thread switching.

LU factorization. The code snippet which performs LU factorization, implemented with threads, is as follows:
#pragma omp parallel for private(i,j,k)
for (k = 0; k < 1800; k++) {
    for (i = k+1; i < 1800; i++) {
        a[i][k] = a[i][k] / a[k][k];
    }
    for (i = k+1; i < 1800; i++) {
        for (j = k+1; j < 1800; j++) {
            a[i][j] = a[i][j] - a[i][k] * a[k][j];
        }
    }
}

Table 2 shows the statistics for the above program using Profile-1 and Profile-2. Graph-3 shows the graph corresponding to Profile-1, and Graph-4 shows the graph corresponding to Profile-2.

Fig. 2. Graph-2.

Table 2. Scale-ups for the above program with Profile-1 and Profile-2 (columns per profile: Threads, Time in s, Scale-up; numeric entries not reproduced).

Fig. 3. Graph-3.

Fig. 4. Graph-4.

It is clearly observed in all of the graphs that when the number of threads is increased beyond a certain limit (the number of cores available), the scale-up remains almost constant. But if it is increased to a much higher value, the scale-up decreases, because of the delay (overhead) produced by switching among the large number of threads.

As we know, the loop following an OpenMP directive is executed in parallel. We studied the effects of parallelizing different loops of a nest. Consider the following code snippet:

for (i = 0; i < 1800; i++) {
    for (j = 0; j < 1800; j++) {
        c[i][j] = 0;
        for (k = 0; k < 1800; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

The time taken for a single thread to complete the above loops was ... seconds. When the outermost for loop (loop counter i) was parallelized with 4 threads, the time taken was found to be 7972 seconds. When the second for loop (loop counter j) was parallelized with 4 threads, the time taken was found to be ... seconds. When the innermost for loop (loop counter k) was parallelized with 4 threads, the time taken was found to be 5266 seconds.
The parallelization achieved by multithreading the outer for loop is called coarse-granular parallelization. It takes the least time because each thread executes a large chunk of the computation, so the overhead of thread switching is minimized: each thread, once created, handles a significant part of the program. The parallelization achieved by multithreading the inner for loop is called fine-granular parallelization. It takes more time because each thread executes a smaller chunk of the computation, so the overhead of repeatedly creating and destroying threads for very small tasks is higher.

When the outermost and the second for loops were parallelized with two threads each, the time taken was found to be 6347 seconds, slightly better than parallelizing the outermost loop alone with four threads (7972 s). When the outermost and the innermost for loops were parallelized with two threads each, the time taken was found to be 1975 seconds. When the second and the innermost for loops were parallelized with two threads each, the time taken was found to be ... seconds. These data further demonstrate the performance advantage of coarse-granular over fine-granular parallelization.

Multithreading has a number of limitations. There is an upper limit on the number of parallel threads we can create, since all the threads belong to a single process and each process has an upper limit on the memory it can use, depending on the hardware architecture. Each thread has its own program stack, so having a large number of threads in a single process may exhaust memory because of the many stacks required. Moreover, the best performance is obtained when the number of threads equals the number of actual cores available on the system.

4.2 Study of Loop Transformations on Multi-core

Loops with independent iterations can be easily parallelized, unlike ones with dependencies. But some loops with
dependencies can still be made independent to some extent by applying mathematical transformations, after which they can be parallelized [7]. The transformations are usually specific to the function the loop body performs. Some specific examples are discussed below.

Inner loop parallelization [6]. Consider the following double loop, which has dependencies:

for (i = 1; i < 1000; i++) {
    for (j = 1; j < 1000; j++) {
        for (l = 0; l < 10000; l++)  /* this loop is just to increase the amount of computation */
            a[i][j] = a[i-1][j] + a[i][j-1];
    }
}

This code takes 8751 seconds to complete. It can be transformed into a loop nest with an independent inner loop, given by:
for (k = 2; k < 1999; k++) {
    k_1 = (1 > k-999) ? 1 : (k-999);    /* k_1 = max(1, k-999) */
    k_2 = (999 > k-1) ? (k-1) : 999;    /* k_2 = min(999, k-1) */
    for (k_1 = k_1; k_1 <= k_2; k_1++) {
        for (l = 0; l < 10000; l++)  /* this loop is just to increase the amount of computation */
            a[k-k_1][k_1] = a[k-k_1-1][k_1] + a[k-k_1][k_1-1];
    }
}

In the above code, only the innermost loop is independent and hence can be run in parallel. When executed with two threads, the time taken is 5217 seconds, which is faster than the dependent version even though the parallelization is fine-granular. In this case, the number of iterations executed by the second for loop is smallest for the first and last iterations of the outermost for loop, highest in between, and the gradient is linear. Hence, to obtain higher efficiency, the first and last few iterations of the outermost for loop (depending on the amount of work being done) can be executed sequentially and the rest in parallel, or we can make use of dynamic threads. The results for dynamic threads are in Table 3.

Table 3. Scale-ups for the dynamic-threads program (columns: Threads, Time in s; numeric entries not reproduced).

Outer loop parallelization [6]. Consider the following code, which has dependencies:

for (i = 6; i <= 500; i++) {
    for (j = i; j <= (2*i)+4; j++) {
        for (l = 0; l < 10000; l++)  /* this loop is just to increase the amount of computation */
            a[i][j] = a[i-2][j-3] + a[i][j-6];
    }
}

This code takes 3296 seconds to complete. It can be transformed into code with independent outer loops, given by:

for (y_1 = 0; y_1 <= 1; y_1++) {        /* parallelizable loop */
    for (y_2 = 0; y_2 <= 1; y_2++) {    /* parallelizable loop */
        for (k_1 = ceil((6-y_1)/2.0); k_1 < floor((10-y_1)/2.0); k_1++) {
            for (k_2 = ceil(y_1+2*k_1-y_2); k_2 < floor((2*y_1+4*k_1-4-y_2)/3.0); k_2++) {
                a[y_1+2*k_1][y_2+3*k_2] = a[y_1+2*k_1-2][y_2+3*k_2-3] + a[y_1+2*k_1][y_2+3*k_2-6];
            }
        }
    }
}
When executed with two threads, the time taken is 2028 seconds, which is faster than the dependent version. Since this is a case of coarse-granular parallelization, increasing the number of threads gives better performance. With a maximum of 6 threads, the above program gives a scale-up of 3 (1335 s), which is below expectations because the number of inner-loop iterations is greater than in the original program.

5 Conclusion

In this study, we did two things. First, we studied the performance of a few OpenMP-ready parallel programs on multi-core machines (up to 8 physical cores); for this, we implemented the OpenMP compiler on top of the gcc compiler. Second, we extended our work to non-parallel loops and used transformations to make these loops run in parallel. Both cases showed that running such loops (with or without transformations) can give significant performance benefits on multi-core architectures. The trend today is towards more and more cores, and in that context this work is quite relevant. In future, we would like to extend our work to more non-parallel loops, which may run in parallel with some explicit synchronization on multi-core.

References

1. AMD Multi-core Products (2006)
2. Multi-core from Intel - Products and Platforms (2006)
3. OpenMP
4. Wolfe, M.J.: Techniques for Improving the Inherent Parallelism in Programs. Technical Report, Department of Computer Science, University of Illinois at Urbana-Champaign (July 1990)
5. Wolfe, M.: High Performance Compilers for Parallel Computing. Addison-Wesley, Reading
6. Banerjee, U.K.: Loop Transformations for Restructuring Compilers: The Foundations. Kluwer Academic Publishers, Norwell (1993)
7. Banerjee, U.K.: Loop Parallelization. Kluwer Academic Publishers, Norwell (1994)
8. Pthreads reference
9. D'Hollander, E.H.: Partitioning and Labelling of Loops by Unimodular Transformation. IEEE Transactions on Parallel and Distributed Systems 3(4) (1992)
10. Sass, R., Mutka, M.: Enabling Unimodular Transformations. In: Supercomputing 1994 (November 1994)
11. Banerjee, U.: Unimodular Transformations of Double Loops. In: Advances in Languages and Compilers for Parallel Processing (1991)
12. Prakash, S.R., Srikant, Y.N.: An Approach to Global Data Partitioning for Distributed Memory Machines. In: IPPS/SPDP (1999)
13. Prakash, S.R., Srikant, Y.N.: Communication Cost Estimation and Global Data Partitioning for Distributed Memory Machines. In: Fourth International Conference on High Performance Computing, Bangalore (1997)
More informationAn Introduction to OpenMP
An Introduction to OpenMP U N C L A S S I F I E D Slide 1 What Is OpenMP? OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism
More informationhttps://www.youtube.com/playlist?list=pllx- Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG
https://www.youtube.com/playlist?list=pllx- Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG OpenMP Basic Defs: Solution Stack HW System layer Prog. User layer Layer Directives, Compiler End User Application OpenMP library
More informationOpenMP 4.0/4.5. Mark Bull, EPCC
OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all
More informationOpenMP 4.0. Mark Bull, EPCC
OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!
More informationParallel Programming. OpenMP Parallel programming for multiprocessors for loops
Parallel Programming OpenMP Parallel programming for multiprocessors for loops OpenMP OpenMP An application programming interface (API) for parallel programming on multiprocessors Assumes shared memory
More informationChap. 6 Part 3. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1
Chap. 6 Part 3 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 OpenMP popular for decade Compiler-based technique Start with plain old C, C++, or Fortran Insert #pragmas into source file You
More informationShared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP
Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor
More informationParallel Computing. Prof. Marco Bertini
Parallel Computing Prof. Marco Bertini Shared memory: OpenMP Implicit threads: motivations Implicit threading frameworks and libraries take care of much of the minutiae needed to create, manage, and (to
More informationS Comparing OpenACC 2.5 and OpenMP 4.5
April 4-7, 2016 Silicon Valley S6410 - Comparing OpenACC 2.5 and OpenMP 4.5 James Beyer, NVIDIA Jeff Larkin, NVIDIA GTC16 April 7, 2016 History of OpenMP & OpenACC AGENDA Philosophical Differences Technical
More informationCS691/SC791: Parallel & Distributed Computing
CS691/SC791: Parallel & Distributed Computing Introduction to OpenMP 1 Contents Introduction OpenMP Programming Model and Examples OpenMP programming examples Task parallelism. Explicit thread synchronization.
More informationAllows program to be incrementally parallelized
Basic OpenMP What is OpenMP An open standard for shared memory programming in C/C+ + and Fortran supported by Intel, Gnu, Microsoft, Apple, IBM, HP and others Compiler directives and library support OpenMP
More informationIntroduction to OpenMP
Introduction to OpenMP Le Yan Objectives of Training Acquaint users with the concept of shared memory parallelism Acquaint users with the basics of programming with OpenMP Memory System: Shared Memory
More informationParallel Programming: OpenMP
Parallel Programming: OpenMP Xianyi Zeng xzeng@utep.edu Department of Mathematical Sciences The University of Texas at El Paso. November 10, 2016. An Overview of OpenMP OpenMP: Open Multi-Processing An
More informationCOMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP
COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including
More informationOpenMP examples. Sergeev Efim. Singularis Lab, Ltd. Senior software engineer
OpenMP examples Sergeev Efim Senior software engineer Singularis Lab, Ltd. OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism.
More informationCSE 160 Lecture 8. NUMA OpenMP. Scott B. Baden
CSE 160 Lecture 8 NUMA OpenMP Scott B. Baden OpenMP Today s lecture NUMA Architectures 2013 Scott B. Baden / CSE 160 / Fall 2013 2 OpenMP A higher level interface for threads programming Parallelization
More informationJukka Julku Multicore programming: Low-level libraries. Outline. Processes and threads TBB MPI UPC. Examples
Multicore Jukka Julku 19.2.2009 1 2 3 4 5 6 Disclaimer There are several low-level, languages and directive based approaches But no silver bullets This presentation only covers some examples of them is
More informationPerformance Issues in Parallelization. Saman Amarasinghe Fall 2010
Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationChapter 4: Threads. Chapter 4: Threads
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples
More informationJANUARY 2004 LINUX MAGAZINE Linux in Europe User Mode Linux PHP 5 Reflection Volume 6 / Issue 1 OPEN SOURCE. OPEN STANDARDS.
0104 Cover (Curtis) 11/19/03 9:52 AM Page 1 JANUARY 2004 LINUX MAGAZINE Linux in Europe User Mode Linux PHP 5 Reflection Volume 6 / Issue 1 LINUX M A G A Z I N E OPEN SOURCE. OPEN STANDARDS. THE STATE
More informationCS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012
CS4961 Parallel Programming Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms Administrative Mailing list set up, everyone should be on it - You should have received a test mail last night
More informationAdvanced C Programming Winter Term 2008/09. Guest Lecture by Markus Thiele
Advanced C Programming Winter Term 2008/09 Guest Lecture by Markus Thiele Lecture 14: Parallel Programming with OpenMP Motivation: Why parallelize? The free lunch is over. Herb
More informationProgramming Shared-memory Platforms with OpenMP. Xu Liu
Programming Shared-memory Platforms with OpenMP Xu Liu Introduction to OpenMP OpenMP directives concurrency directives parallel regions loops, sections, tasks Topics for Today synchronization directives
More informationAlfio Lazzaro: Introduction to OpenMP
First INFN International School on Architectures, tools and methodologies for developing efficient large scale scientific computing applications Ce.U.B. Bertinoro Italy, 12 17 October 2009 Alfio Lazzaro:
More informationParallel Computing Parallel Programming Languages Hwansoo Han
Parallel Computing Parallel Programming Languages Hwansoo Han Parallel Programming Practice Current Start with a parallel algorithm Implement, keeping in mind Data races Synchronization Threading syntax
More informationHigh Performance Computing: Tools and Applications
High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 2 OpenMP Shared address space programming High-level
More informationParallel Processing Top manufacturer of multiprocessing video & imaging solutions.
1 of 10 3/3/2005 10:51 AM Linux Magazine March 2004 C++ Parallel Increase application performance without changing your source code. Parallel Processing Top manufacturer of multiprocessing video & imaging
More informationChe-Wei Chang Department of Computer Science and Information Engineering, Chang Gung University
Che-Wei Chang chewei@mail.cgu.edu.tw Department of Computer Science and Information Engineering, Chang Gung University 1. Introduction 2. System Structures 3. Process Concept 4. Multithreaded Programming
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationProgramming with Shared Memory PART II. HPC Fall 2012 Prof. Robert van Engelen
Programming with Shared Memory PART II HPC Fall 2012 Prof. Robert van Engelen Overview Sequential consistency Parallel programming constructs Dependence analysis OpenMP Autoparallelization Further reading
More informationAcknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text
Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center
More informationCOMP Parallel Computing. SMM (2) OpenMP Programming Model
COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming
More informationOpenMP Shared Memory Programming
OpenMP Shared Memory Programming John Burkardt, Information Technology Department, Virginia Tech.... Mathematics Department, Ajou University, Suwon, Korea, 13 May 2009.... http://people.sc.fsu.edu/ jburkardt/presentations/
More informationDistributed Systems + Middleware Concurrent Programming with OpenMP
Distributed Systems + Middleware Concurrent Programming with OpenMP Gianpaolo Cugola Dipartimento di Elettronica e Informazione Politecnico, Italy cugola@elet.polimi.it http://home.dei.polimi.it/cugola
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationOpenMP threading: parallel regions. Paolo Burgio
OpenMP threading: parallel regions Paolo Burgio paolo.burgio@unimore.it Outline Expressing parallelism Understanding parallel threads Memory Data management Data clauses Synchronization Barriers, locks,
More informationJoe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago.
Joe Hummel, PhD Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago email: joe@joehummel.net stuff: http://www.joehummel.net/downloads.html Async programming:
More informationITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 2016 Solutions Name:...
ITCS 4/5145 Parallel Computing Test 1 5:00 pm - 6:15 pm, Wednesday February 17, 016 Solutions Name:... Answer questions in space provided below questions. Use additional paper if necessary but make sure
More informationQuestions from last time
Questions from last time Pthreads vs regular thread? Pthreads are POSIX-standard threads (1995). There exist earlier and newer standards (C++11). Pthread is probably most common. Pthread API: about a 100
More informationPoint-to-Point Synchronisation on Shared Memory Architectures
Point-to-Point Synchronisation on Shared Memory Architectures J. Mark Bull and Carwyn Ball EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland, U.K. email:
More informationShared Memory Parallelism using OpenMP
Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत SE 292: High Performance Computing [3:0][Aug:2014] Shared Memory Parallelism using OpenMP Yogesh Simmhan Adapted from: o
More informationExample of a Parallel Algorithm
-1- Part II Example of a Parallel Algorithm Sieve of Eratosthenes -2- -3- -4- -5- -6- -7- MIMD Advantages Suitable for general-purpose application. Higher flexibility. With the correct hardware and software
More informationChapter 4: Threads. Operating System Concepts 9 th Edit9on
Chapter 4: Threads Operating System Concepts 9 th Edit9on Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads 1. Overview 2. Multicore Programming 3. Multithreading Models 4. Thread Libraries 5. Implicit
More informationIntroduction to. Slides prepared by : Farzana Rahman 1
Introduction to OpenMP Slides prepared by : Farzana Rahman 1 Definition of OpenMP Application Program Interface (API) for Shared Memory Parallel Programming Directive based approach with library support
More informationCOSC 6374 Parallel Computation. Introduction to OpenMP(I) Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel)
COSC 6374 Parallel Computation Introduction to OpenMP(I) Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel) Edgar Gabriel Fall 2014 Introduction Threads vs. processes Recap of
More informationMPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016
MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared
More informationA brief introduction to OpenMP
A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism
More informationOperating Systems 2 nd semester 2016/2017. Chapter 4: Threads
Operating Systems 2 nd semester 2016/2017 Chapter 4: Threads Mohamed B. Abubaker Palestine Technical College Deir El-Balah Note: Adapted from the resources of textbox Operating System Concepts, 9 th edition
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded
More informationOpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen
OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,
More informationEI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)
EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:
More informationParallel Programming in C with MPI and OpenMP
Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical
More informationDetection and Analysis of Iterative Behavior in Parallel Applications
Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University
More informationThreads. CS3026 Operating Systems Lecture 06
Threads CS3026 Operating Systems Lecture 06 Multithreading Multithreading is the ability of an operating system to support multiple threads of execution within a single process Processes have at least
More informationParallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?
Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More information[Potentially] Your first parallel application
[Potentially] Your first parallel application Compute the smallest element in an array as fast as possible small = array[0]; for( i = 0; i < N; i++) if( array[i] < small ) ) small = array[i] 64-bit Intel
More information