Parallelization. Saman Amarasinghe. Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology


1 Spring 2010 Parallelization Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

2 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

3 Moore's Law From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. [Chart: performance (vs. VAX-11/780) and number of transistors per chip over time, from the 8086 through the 486, Pentium, P2, P3, P4, and Itanium 2, with the 52%/year growth era giving way to ??%/year.] Saman Amarasinghe 6.035 MIT Fall 2006. From David Patterson.

4 Uniprocessor Performance (SPECint) From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006. [Chart: SPECint performance and number of transistors per chip over time, from the 8086 through the Pentium, P2, P3, P4, and Itanium 2.] Saman Amarasinghe 6.035 MIT Fall 2006. From David Patterson.

5 Multicores Are Here! [Chart: number of cores per chip versus year, from single-core processors (Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, PA-8800, Power4, Power6, Opteron, Yonah) up through multicores such as Xbox360, Opteron 4P, Xeon MP, Tanglewood, Cell, Niagara, Raw, Raza XLR, Cavium Octeon, Broadcom 1480, Cisco CSR-1, Intel Tflops, Picochip PC102, and Ambric AM2045.] Saman Amarasinghe 6.035 MIT Fall 2006

6 Issues with Parallelism
Amdahl's Law: Any computation can be analyzed in terms of a portion that must be executed sequentially, Ts, and a portion that can be executed in parallel, Tp. Then for n processors: T(n) = Ts + Tp/n. T(∞) = Ts, thus the maximum speedup is (Ts + Tp)/Ts.
Load Balancing: The work is distributed among processors so that all processors are kept busy while the parallel task is executed.
Granularity: The size of the parallel regions between synchronizations, or the ratio of computation (useful work) to communication (overhead).
Saman Amarasinghe 6.035 MIT Fall 2006
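A quick numeric sketch of Amdahl's Law, assuming an arbitrary 10%/90% split between sequential and parallel work (the numbers are illustrative, not from the slides):

  #include <stdio.h>

  int main(void) {
      double Ts = 10.0, Tp = 90.0;                  /* sequential and parallel portions; T(1) = 100 */
      for (int n = 1; n <= 1024; n *= 4) {
          double Tn = Ts + Tp / n;                  /* T(n) = Ts + Tp/n */
          printf("n=%4d  T(n)=%6.2f  speedup=%5.2f\n", n, Tn, (Ts + Tp) / Tn);
      }
      printf("max speedup = %.2f\n", (Ts + Tp) / Ts);
      return 0;
  }

Even with 1024 processors the speedup stays below (Ts + Tp)/Ts = 10, because the sequential 10% dominates.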

7 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

8 Types of Parallelism Instruction Level Parallelism (ILP): scheduling and hardware. Task Level Parallelism (TLP): mainly by hand. Loop Level Parallelism (LLP) or Data Parallelism: hand or compiler generated. Pipeline Parallelism: hardware or streaming. Divide and Conquer Parallelism: recursive functions. Saman Amarasinghe 6.035 MIT Fall 2006

9 Why Loops? 90% of the execution time is in 10% of the code, mostly in loops. If parallel, can get good performance (load balancing). Relatively easy to analyze. Saman Amarasinghe 6.035 MIT Fall 2006

10 Programmer Defined Parallel Loop FORALL: no loop-carried dependences, fully parallel. FORACROSS: some loop-carried dependences. Saman Amarasinghe 6.035 MIT Fall 2006

11 Parallel Execution Example
FORPAR I = 0 to N
  A[I] = A[I] + 1
Block Distribution: the program gets mapped into
  Iters = ceiling(N/NUMPROC);
  FOR P = 0 to NUMPROC-1
    FOR I = P*Iters to MIN((P+1)*Iters, N)
      A[I] = A[I] + 1
SPMD (Single Program, Multiple Data) Code
  If(myPid == 0) { Iters = ceiling(N/NUMPROC); }
  Barrier();
  FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)
    A[I] = A[I] + 1
  Barrier();
Saman Amarasinghe 6.035 MIT Fall 2006

12 Parallel Execution Example
FORPAR I = 0 to N
  A[I] = A[I] + 1
Block Distribution: the program gets mapped into
  Iters = ceiling(N/NUMPROC);
  FOR P = 0 to NUMPROC-1
    FOR I = P*Iters to MIN((P+1)*Iters, N)
      A[I] = A[I] + 1
Code that forks a function
  Iters = ceiling(N/NUMPROC);
  ParallelExecute(func);
  void func(integer myPid) {
    FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)
      A[I] = A[I] + 1
  }
Saman Amarasinghe 6.035 MIT Fall 2006
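A minimal C sketch of the fork-a-function mapping above, using POSIX threads; N, NUMPROC, and the names func and ParallelExecute mirror the slide's pseudocode and are otherwise arbitrary (compile with -pthread):

  #include <pthread.h>
  #include <stdio.h>

  #define N        1000
  #define NUMPROC  4

  static double A[N];
  static long Iters;                          /* ceiling(N / NUMPROC) */

  /* Worker body: each thread handles one contiguous block of iterations. */
  static void *func(void *arg) {
      long myPid = (long)arg;
      long lo = myPid * Iters;
      long hi = (myPid + 1) * Iters;
      if (hi > N) hi = N;
      for (long i = lo; i < hi; i++)
          A[i] = A[i] + 1;
      return NULL;
  }

  /* ParallelExecute: fork NUMPROC threads running the body, then join them. */
  static void ParallelExecute(void *(*body)(void *)) {
      pthread_t thr[NUMPROC];
      for (long p = 0; p < NUMPROC; p++)
          pthread_create(&thr[p], NULL, body, (void *)p);
      for (long p = 0; p < NUMPROC; p++)
          pthread_join(thr[p], NULL);
  }

  int main(void) {
      Iters = (N + NUMPROC - 1) / NUMPROC;    /* ceiling division */
      ParallelExecute(func);
      printf("A[0] = %f\n", A[0]);
      return 0;
  }

Joining the threads at the end plays the role of the final Barrier() in the SPMD version.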

13 Parallel Execution SPMD: need to get all the processors to execute the control flow; extra synchronization overhead, or redundant computation on all processors, or both; stack: private or shared? Fork: local variables are not visible within the function; either make the variables used/defined in the loop body global, or pass and return them as arguments; function call overhead. Saman Amarasinghe 6.035 MIT Fall 2006

14 Parallel Thread Basics Create separate threads: create an OS thread and (hopefully) it will be run on a separate core; pthread_create(&thr, NULL, &entry_point, NULL). Overhead in thread creation: create a separate stack, get the OS to allocate a thread. Thread pool: create all the threads (= num cores) at the beginning; keep N-1 idling on a barrier during sequential execution; get them to run parallel code by each executing a function; back to the barrier when the parallel region is done. Saman Amarasinghe 6.035 MIT Fall 2006
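A rough C sketch of such a thread pool using POSIX barriers; the thread count, the parallel_region function pointer, and the shutdown flag are illustrative choices, not taken from the slides (compile with -pthread):

  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4

  static pthread_barrier_t start_bar, done_bar;
  static void (*parallel_region)(int id);      /* work the pool runs next */
  static volatile int shutting_down = 0;

  /* Pool workers cycle between two barriers: wake up, run the region, report back. */
  static void *pool_worker(void *arg) {
      int id = (int)(long)arg;
      for (;;) {
          pthread_barrier_wait(&start_bar);    /* idle here during sequential execution */
          if (shutting_down) break;
          parallel_region(id);
          pthread_barrier_wait(&done_bar);     /* parallel region is done */
      }
      return NULL;
  }

  static void say_hello(int id) { printf("hello from worker %d\n", id); }

  int main(void) {
      pthread_t thr[NTHREADS - 1];
      pthread_barrier_init(&start_bar, NULL, NTHREADS);
      pthread_barrier_init(&done_bar, NULL, NTHREADS);
      for (long i = 1; i < NTHREADS; i++)      /* the main thread acts as worker 0 */
          pthread_create(&thr[i - 1], NULL, pool_worker, (void *)i);

      parallel_region = say_hello;             /* run one parallel region */
      pthread_barrier_wait(&start_bar);
      say_hello(0);
      pthread_barrier_wait(&done_bar);

      shutting_down = 1;                       /* release the pool and join */
      pthread_barrier_wait(&start_bar);
      for (int i = 1; i < NTHREADS; i++)
          pthread_join(thr[i - 1], NULL);
      return 0;
  }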

15 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

16 Parallelizing Compilers Finding FORALL loops out of FOR loops. Examples:
FOR I = 0 to 5
  A[I] = A[I] + 1
FOR I = 0 to 5
  A[I] = A[I+6] + 1
FOR I = 0 to 5
  A[2*I] = A[2*I + 1] + 1
Saman Amarasinghe 6.035 MIT Fall 2006

17 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
[Iteration space plot over I and J.]
Iterations are represented as coordinates in iteration space: i = [i1, i2, i3, ..., in]. Saman Amarasinghe 6.035 MIT Fall 2006

18 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
Iterations are represented as coordinates in iteration space. Sequential execution order of iterations: lexicographic order [0,0], [0,1], [0,2], ..., [0,6], [0,7], [1,1], [1,2], ..., [1,6], [1,7], [2,2], ..., [2,6], [2,7], ..., [6,6], [6,7]. Saman Amarasinghe 6.035 MIT Fall 2006

19 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
Iterations are represented as coordinates in iteration space. Sequential execution order of iterations: lexicographic order. Iteration i is lexicographically less than j, i < j, iff there exists c s.t. i1 = j1, i2 = j2, ..., i(c-1) = j(c-1) and ic < jc. Saman Amarasinghe 6.035 MIT Fall 2006
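A small C helper that states the same ordering test operationally; the array representation of an iteration vector is my own assumption:

  #include <stdbool.h>

  /* "Iteration i executes before iteration j" for an n-deep loop nest:
   * true iff the first differing component of i is the smaller one. */
  bool lex_less(const int *i, const int *j, int n) {
      for (int c = 0; c < n; c++) {
          if (i[c] < j[c]) return true;    /* i1..i(c-1) equal and ic < jc */
          if (i[c] > j[c]) return false;
      }
      return false;                        /* identical iterations */
  }

For example, lex_less((int[]){1, 7}, (int[]){2, 2}, 2) is true, matching the order [1,7] before [2,2] listed above.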

20 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
An affine loop nest: loop bounds are integer linear functions of constants, loop-constant variables, and outer loop indexes; array accesses are integer linear functions of constants, loop-constant variables, and loop indexes. Saman Amarasinghe 6.035 MIT Fall 2006

21 Iteration Space N deep loops: n-dimensional discrete cartesian space. Normalized loops: assume step size = 1.
FOR I = 0 to 6
  FOR J = I to 7
Affine loop nest: the iteration space as a set of linear inequalities: 0 ≤ I, I ≤ 6, I ≤ J, J ≤ 7. Saman Amarasinghe 6.035 MIT Fall 2006

22 Data Space M dimensional arrays: m-dimensional discrete cartesian space (a hypercube). Integer A(10); Float B(5, 6). Saman Amarasinghe 6.035 MIT Fall 2006

23 Dependences
True dependence: a = ...; ... = a
Anti dependence: ... = a; a = ...
Output dependence: a = ...; a = ...
Definition: A data dependence exists between dynamic instances i and j iff either i or j is a write operation, i and j refer to the same variable, and i executes before j. How about array accesses within loops? Saman Amarasinghe 6.035 MIT Fall 2006

24 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

25 Array Accesses in a loop
FOR I = 0 to 5
  A[I] = A[I] + 1
[Iteration space and data space diagrams.]
Saman Amarasinghe 6.035 MIT Fall 2006

26 Array Accesses in a loop
FOR I = 0 to 5
  A[I] = A[I] + 1
[Diagram: each iteration I reads and writes only its own element A[I]; no accesses cross between iterations.]
Saman Amarasinghe 6.035 MIT Fall 2006

27 Array Accesses in a loop
FOR I = 0 to 5
  A[I+1] = A[I] + 1
[Diagram: iteration I writes A[I+1], which iteration I+1 then reads; a loop-carried true dependence between consecutive iterations.]
Saman Amarasinghe 6.035 MIT Fall 2006

28 Array Accesses in a loop
FOR I = 0 to 5
  A[I] = A[I+2] + 1
[Diagram: iteration I reads A[I+2], which iteration I+2 later writes; a loop-carried anti dependence at distance 2.]
Saman Amarasinghe 6.035 MIT Fall 2006

29 Array Accesses in a loop
FOR I = 0 to 5
  A[2*I] = A[2*I+1] + 1
[Diagram: writes touch the even elements, reads touch the odd elements; no two iterations access the same location.]
Saman Amarasinghe 6.035 MIT Fall 2006

30 Distance Vectors A loop has distance d if there exists a data dependence from iteration i to j and d = j - i.
dv = [0]: FOR I = 0 to 5: A[I] = A[I] + 1
dv = [1]: FOR I = 0 to 5: A[I+1] = A[I] + 1
dv = [2]: FOR I = 0 to 5: A[I] = A[I+2] + 1
dv = [1], [2], ... = [*]: FOR I = 0 to 5: A[I] = A[0] + 1
Saman Amarasinghe 6.035 MIT Fall 2006

31 Multi-Dimensional Dependence
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I, J-1] + 1
dv = [0, 1]
Saman Amarasinghe 6.035 MIT Fall 2006

32 Multi-Dimensional Dependence
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I, J-1] + 1
dv = [0, 1]
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I+1, J] + 1
dv = [1, 0]
Saman Amarasinghe 6.035 MIT Fall 2006

33 Outline Dependence Analysis Increasing Parallelization Opportunities

34 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
[Iteration space diagram over I and J.]
Saman Amarasinghe 6.035 MIT Fall 2006

35 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
[Iteration space diagram with the dependence arrows drawn in.]
Saman Amarasinghe 6.035 MIT Fall 2006

36 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
dv = [1, -1]
Saman Amarasinghe 6.035 MIT Fall 2006

37 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
FOR I = 1 to n
  FOR J = 1 to n
    A[I] = A[I-1] + 1
Saman Amarasinghe 6.035 MIT Fall 2006

38 What is the Dependence?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
dv = [1, -1]
FOR I = 1 to n
  FOR J = 1 to n
    B[I] = B[I-1] + 1
dv = [1, 0], [1, 1], [1, 2], [1, 3], ... = [1, *]
Saman Amarasinghe 6.035 MIT Fall 2006

39 What is the Dependence?
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j-1] + A[i-1,j];
dv = [0, 1], [1, 0]
Saman Amarasinghe 6.035 MIT Fall 2006

40 Recognizing FORALL Loops Find data dependences in the loop: for every pair of array accesses to the same array, if the first access has at least one dynamic instance (an iteration) in which it refers to a location in the array that the second access also refers to in at least one of the later dynamic instances (iterations), then there is a data dependence between the statements. (Note that the same access can depend on itself: output dependences.) Definition: a loop-carried dependence is a dependence that crosses a loop iteration boundary. If there are no loop-carried dependences, the loop is parallelizable. Saman Amarasinghe 6.035 MIT Fall 2006

41 Data Dependence Analysis I: Distance Vector method II: Integer Programming Saman Amarasinghe 6.035 MIT Fall 2006

42 Distance Vector Method The i-th loop is parallelizable iff, for every dependence d = [d1, ..., di, ..., dn], either one of d1, ..., d(i-1) is > 0, or d1, ..., di are all = 0. Saman Amarasinghe 6.035 MIT Fall 2006
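A compact C sketch of this test; the fixed-width dependence-vector array and the function name are my own choices, and the wildcard * entries from the earlier slides would need a separate encoding:

  #include <stdbool.h>

  /* Loop level `level` (1-based) of an n-deep nest is parallelizable iff every
   * dependence vector d has some positive component among d1..d(level-1), or
   * has d1..d(level) all equal to zero. */
  bool loop_parallelizable(int level, const int dv[][8], int ndeps, int n) {
      for (int k = 0; k < ndeps; k++) {
          bool outer_positive = false;     /* some d1..d(level-1) > 0 */
          bool all_zero = true;            /* d1..d(level) all == 0   */
          for (int c = 0; c < level && c < n; c++) {
              if (c < level - 1 && dv[k][c] > 0) outer_positive = true;
              if (dv[k][c] != 0) all_zero = false;
          }
          if (!outer_positive && !all_zero)
              return false;                /* this dependence is carried by loop `level` */
      }
      return true;
  }

With dv = {{1, -1}} this reports the outer loop as not parallelizable and the inner loop as parallelizable, matching the answers on slide 45 below.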

43 Is the Loop Parallelizable?
dv = [0]: Yes. FOR I = 0 to 5: A[I] = A[I] + 1
dv = [1]: No. FOR I = 0 to 5: A[I+1] = A[I] + 1
dv = [2]: No. FOR I = 0 to 5: A[I] = A[I+2] + 1
dv = [*]: No. FOR I = 0 to 5: A[I] = A[0] + 1
Saman Amarasinghe 6.035 MIT Fall 2006

44 Are the Loops Parallelizable?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I, J-1] + 1
dv = [0, 1]: outer loop Yes, inner loop No.
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I+1, J] + 1
dv = [1, 0]: outer loop No, inner loop Yes.
Saman Amarasinghe 6.035 MIT Fall 2006

45 Are the Loops Parallelizable?
FOR I = 1 to n
  FOR J = 1 to n
    A[I, J] = A[I-1, J+1] + 1
dv = [1, -1]: outer loop No, inner loop Yes.
FOR I = 1 to n
  FOR J = 1 to n
    B[I] = B[I-1] + 1
dv = [1, *]: outer loop No, inner loop Yes.
Saman Amarasinghe 6.035 MIT Fall 2006

46 Integer Programming Method Example:
FOR I = 0 to 5
  A[I+1] = A[I] + 1
Is there a loop-carried dependence between A[I+1] and A[I]? That is, are there two distinct iterations iw and ir such that A[iw+1] is the same location as A[ir]?
  ∃ integers iw, ir: 0 ≤ iw, ir ≤ 5; iw ≠ ir; iw + 1 = ir
Is there a dependence between A[I+1] and A[I+1]? That is, are there two distinct iterations i1 and i2 such that A[i1+1] is the same location as A[i2+1]?
  ∃ integers i1, i2: 0 ≤ i1, i2 ≤ 5; i1 ≠ i2; i1 + 1 = i2 + 1
Saman Amarasinghe 6.035 MIT Fall 2006

47 Integer Programming Method Formulation
FOR I = 0 to 5
  A[I+1] = A[I] + 1
∃ an integer vector i such that Â i ≤ b, where Â is an integer matrix and b is an integer vector.
Saman Amarasinghe 6.035 MIT Fall 2006

48 Iteration Space N deep loops: n-dimensional discrete cartesian space.
FOR I = 0 to 5
  A[I+1] = A[I] + 1
Affine loop nest: the iteration space as a set of linear inequalities: 0 ≤ I, I ≤ 5.
Saman Amarasinghe 6.035 MIT Fall 2006

49 Integer Programming Method Formulation
FOR I = 0 to 5
  A[I+1] = A[I] + 1
∃ an integer vector i such that Â i ≤ b, where Â is an integer matrix and b is an integer vector.
Our problem formulation for A[i] and A[i+1]:
  ∃ integers iw, ir: 0 ≤ iw, ir ≤ 5; iw ≠ ir; iw + 1 = ir
iw ≠ ir is not an affine function: divide into 2 problems, Problem 1 with iw < ir and Problem 2 with ir < iw. If either problem has a solution, there exists a dependence.
How about iw + 1 = ir? Add two inequalities to a single problem: iw + 1 ≤ ir, and ir ≤ iw + 1.
Saman Amarasinghe 6.035 MIT Fall 2006

50 Integer Programming Formulation
FOR I = 0 to 5
  A[I+1] = A[I] + 1
Problem 1:
  0 ≤ iw
  iw ≤ 5
  0 ≤ ir
  ir ≤ 5
  iw < ir
  iw + 1 ≤ ir
  ir ≤ iw + 1
Saman Amarasinghe 6.035 MIT Fall 2006

51 Integer Programming Formulation
FOR I = 0 to 5
  A[I+1] = A[I] + 1
Problem 1 (each constraint rewritten in ≤ form):
  0 ≤ iw         ->  -iw ≤ 0
  iw ≤ 5         ->   iw ≤ 5
  0 ≤ ir         ->  -ir ≤ 0
  ir ≤ 5         ->   ir ≤ 5
  iw < ir        ->   iw - ir ≤ -1
  iw + 1 ≤ ir    ->   iw - ir ≤ -1
  ir ≤ iw + 1    ->  -iw + ir ≤ 1
Saman Amarasinghe 6.035 MIT Fall 2006

52 Integer Programming Formulation
Problem 1 as Â i ≤ b, with i = (iw, ir):
  -iw ≤ 0, iw ≤ 5, -ir ≤ 0, ir ≤ 5, iw - ir ≤ -1, -iw + ir ≤ 1
(Problem 2, with ir < iw, is formulated the same way; the matrix form is written out below.)
Saman Amarasinghe 6.035 MIT Fall 2006
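Spelled out as a math block; this is reconstructed from the inequalities above rather than copied from the slide:

  \hat{A}\, i \le b : \quad
  \begin{pmatrix} -1 & 0 \\ 1 & 0 \\ 0 & -1 \\ 0 & 1 \\ 1 & -1 \\ -1 & 1 \end{pmatrix}
  \begin{pmatrix} i_w \\ i_r \end{pmatrix}
  \le
  \begin{pmatrix} 0 \\ 5 \\ 0 \\ 5 \\ -1 \\ 1 \end{pmatrix}

Any integer solution, for example (iw, ir) = (0, 1), certifies the loop-carried dependence.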

53 Generalization
An affine loop nest:
FOR i1 = fl1(c1...ck) to fu1(c1...ck)
 FOR i2 = fl2(i1, c1...ck) to fu2(i1, c1...ck)
  ...
   FOR in = fln(i1...i(n-1), c1...ck) to fun(i1...i(n-1), c1...ck)
    A[fa1(i1...in, c1...ck), fa2(i1...in, c1...ck), ..., fam(i1...in, c1...ck)]
Solve 2*n problems of the form:
  i1 = j1, i2 = j2, ..., i(n-1) = j(n-1), in < jn
  i1 = j1, i2 = j2, ..., i(n-1) = j(n-1), jn < in
  i1 = j1, i2 = j2, ..., i(n-1) < j(n-1)
  i1 = j1, i2 = j2, ..., j(n-1) < i(n-1)
  ...
  i1 = j1, i2 < j2
  i1 = j1, j2 < i2
  i1 < j1
  j1 < i1
Saman Amarasinghe 6.035 MIT Fall 2006

54 Outline Why Parallelism Parallel Execution Parallelizing Compilers Dependence Analysis Increasing Parallelization Opportunities

55 Increasing Parallelization Opportunities Scalar Privatization, Reduction Recognition, Induction Variable Identification, Array Privatization, Loop Transformations, Granularity of Parallelism, Interprocedural Parallelization. Saman Amarasinghe 6.035 MIT Fall 2006

56 Scalar Privatization Example:
FOR i = 1 to n
  X = A[i] * 3;
  B[i] = X;
Is there a loop-carried dependence? What is the type of dependence? Saman Amarasinghe 6.035 MIT Fall 2006

57 Privatization Analysis: any anti and output loop-carried dependences. Eliminate by assigning in local context:
FOR i = 1 to n
  integer Xtmp;
  Xtmp = A[i] * 3;
  B[i] = Xtmp;
Eliminate by expanding into an array:
FOR i = 1 to n
  Xtmp[i] = A[i] * 3;
  B[i] = Xtmp[i];
Saman Amarasinghe 6.035 MIT Fall 2006

58 Privatization Need a final assignment to maintain the correct value after the loop nest. Eliminate by assigning in local context:
FOR i = 1 to n
  integer Xtmp;
  Xtmp = A[i] * 3;
  B[i] = Xtmp;
  if (i == n) X = Xtmp;
Eliminate by expanding into an array:
FOR i = 1 to n
  Xtmp[i] = A[i] * 3;
  B[i] = Xtmp[i];
X = Xtmp[n];
Saman Amarasinghe 6.035 MIT Fall 2006

59 Another Example How about loop-carried true dependences? Example:
FOR i = 1 to n
  X = X + A[i];
Is this loop parallelizable? Saman Amarasinghe 6.035 MIT Fall 2006

60 Reduction Recognition Analysis: only associative operations, and the result is never used within the loop. Transformation:
Integer Xtmp[NUMPROC];
Barrier();
FOR i = myPid*Iters to MIN((myPid+1)*Iters, n)
  Xtmp[myPid] = Xtmp[myPid] + A[i];
Barrier();
If(myPid == 0) {
  FOR p = 0 to NUMPROC-1
    X = X + Xtmp[p];
}
Saman Amarasinghe 6.035 MIT Fall 2006
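A minimal C sketch of the same reduction transformation with POSIX threads; NUMPROC, the array contents, and the per-thread block size are placeholders (compile with -pthread):

  #include <pthread.h>
  #include <stdio.h>

  #define N        1000
  #define NUMPROC  4

  static double A[N], Xtmp[NUMPROC];   /* one partial sum slot per processor */
  static long Iters;

  static void *partial_sum(void *arg) {
      long myPid = (long)arg;
      long hi = (myPid + 1) * Iters;
      if (hi > N) hi = N;
      for (long i = myPid * Iters; i < hi; i++)
          Xtmp[myPid] += A[i];         /* each thread updates only its own slot */
      return NULL;
  }

  int main(void) {
      double X = 0.0;
      for (long i = 0; i < N; i++) A[i] = 1.0;
      Iters = (N + NUMPROC - 1) / NUMPROC;

      pthread_t thr[NUMPROC];
      for (long p = 0; p < NUMPROC; p++)
          pthread_create(&thr[p], NULL, partial_sum, (void *)p);
      for (long p = 0; p < NUMPROC; p++)
          pthread_join(thr[p], NULL);

      for (int p = 0; p < NUMPROC; p++)   /* sequential combine, as on the slide */
          X += Xtmp[p];
      printf("X = %f\n", X);              /* expect 1000.0 */
      return 0;
  }

The transformation relies on the operation being associative, as the analysis above requires.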

61 Induction Variables Example:
FOR i = 0 to N
  A[i] = 2^i;
After strength reduction:
t = 1
FOR i = 0 to N
  A[i] = t;
  t = t*2;
What happened to the loop-carried dependences? Need to do the opposite of this: perform induction variable analysis, and rewrite IVs as a function of the loop variable. Saman Amarasinghe 6.035 MIT Fall 2006
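A small C sketch contrasting the two forms; the array size and the use of a shift for 2^i are my choices for the example:

  #include <stdio.h>
  #include <stdint.h>

  #define N 32

  /* Strength-reduced form: t is an induction variable that carries a
   * loop dependence from one iteration to the next. */
  void fill_serial(uint64_t A[N + 1]) {
      uint64_t t = 1;
      for (int i = 0; i <= N; i++) { A[i] = t; t = t * 2; }
  }

  /* Rewritten so each iteration computes the value from the loop variable
   * alone (A[i] = 2^i as a shift): no loop-carried dependence, so the loop
   * can become a FORALL. */
  void fill_forall(uint64_t A[N + 1]) {
      for (int i = 0; i <= N; i++) A[i] = (uint64_t)1 << i;
  }

  int main(void) {
      uint64_t a[N + 1], b[N + 1];
      fill_serial(a);
      fill_forall(b);
      for (int i = 0; i <= N; i++)
          if (a[i] != b[i]) { printf("mismatch at %d\n", i); return 1; }
      printf("identical results\n");
      return 0;
  }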

62 Array Privatization Similar to scalar privatization; however, the analysis is more complex. Array Data Dependence Analysis: checks if two iterations access the same location. Array Data Flow Analysis: checks if two iterations access the same value. Transformations: similar to scalar privatization; a private copy for each processor, or expand with an additional dimension. Saman Amarasinghe 6.035 MIT Fall 2006

63 Loop Transformations A loop may not be parallel as is. Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j-1] + A[i-1,j];
Saman Amarasinghe 6.035 MIT Fall 2006

64 Loop Transformations A loop may not be parallel as is. Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j-1] + A[i-1,j];
After loop skewing (i_new = i_old + j_old - 1, j_new = j_old):
FOR i = 1 to 2*N-3
  FORPAR j = max(1, i-N+2) to min(i, N-1)
    A[i-j+1,j] = A[i-j+1,j-1] + A[i-j,j];
Saman Amarasinghe 6.035 MIT Fall 2006
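A plain C rendering of the skewed nest above, with the FORPAR written as an ordinary for loop whose iterations are independent and could be distributed across threads; N is arbitrary:

  #define N 8
  double A[N][N];

  /* Wavefront traversal: for a fixed i (one anti-diagonal), the j iterations
   * do not depend on each other, so the inner loop is parallelizable. */
  void skewed(void) {
      for (int i = 1; i <= 2 * N - 3; i++) {
          int jlo = (1 > i - N + 2) ? 1 : i - N + 2;   /* max(1, i-N+2) */
          int jhi = (i < N - 1) ? i : N - 1;           /* min(i, N-1)   */
          for (int j = jlo; j <= jhi; j++)             /* FORPAR on the slide */
              A[i - j + 1][j] = A[i - j + 1][j - 1] + A[i - j][j];
      }
  }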

65 Granularity of Parallelism Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
Gets transformed into:
FOR i = 1 to N-1
  Barrier();
  FOR j = 1 + myPid*Iters to MIN((myPid+1)*Iters, N-1)
    A[i,j] = A[i,j] + A[i-1,j];
  Barrier();
Inner-loop parallelism can be expensive: startup and teardown overhead of parallel regions, a lot of synchronization; it can even lead to slowdowns. Saman Amarasinghe 6.035 MIT Fall 2006

66 Granularity of Parallelism Inner-loop parallelism can be expensive. Solutions: don't parallelize if the amount of work within the loop is too small, or transform into outer-loop parallelism. Saman Amarasinghe 6.035 MIT Fall 2006

67 Outer Loop Parallelism Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
After loop interchange (transpose):
FOR j = 1 to N-1
  FOR i = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
Gets mapped into:
Barrier();
FOR j = 1 + myPid*Iters to MIN((myPid+1)*Iters, N-1)
  FOR i = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
Barrier();
Saman Amarasinghe 6.035 MIT Fall 2006

68 Unimodular Transformations Interchange, reverse and skew. Use a matrix transformation: I_new = A I_old.
Interchange: [i_new, j_new] = [[0, 1], [1, 0]] [i_old, j_old]
Reverse:     [i_new, j_new] = [[-1, 0], [0, 1]] [i_old, j_old]
Skew:        [i_new, j_new] = [[1, 1], [0, 1]] [i_old, j_old]
Saman Amarasinghe 6.035 MIT Fall 2006

69 Legality of Transformations A unimodular transformation with matrix A is valid iff, for all dependence vectors v, the first non-zero entry in Av is positive. Example:
FOR i = 1 to N-1
  FOR j = 1 to N-1
    A[i,j] = A[i,j] + A[i-1,j];
dv = [1, 0]
Interchange A = [[0, 1], [1, 0]]: A dv = [0, 1], legal.
Reverse A = [[-1, 0], [0, 1]]: A dv = [-1, 0], illegal.
Skew A = [[1, 1], [0, 1]]: A dv = [1, 0], legal.
Saman Amarasinghe 6.035 MIT Fall 2006
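A small C sketch of the legality check for 2-D unimodular transformations; the matrix and dependence-vector representations are my own assumptions:

  #include <stdbool.h>

  /* A 2x2 unimodular transformation T is legal iff, for every dependence
   * vector v, the first non-zero entry of T*v is positive. */
  bool transform_legal(const int T[2][2], const int dv[][2], int ndeps) {
      for (int k = 0; k < ndeps; k++) {
          int tv0 = T[0][0] * dv[k][0] + T[0][1] * dv[k][1];
          int tv1 = T[1][0] * dv[k][0] + T[1][1] * dv[k][1];
          int first = (tv0 != 0) ? tv0 : tv1;   /* first non-zero entry (0 if both zero) */
          if (first < 0)
              return false;                     /* T*v is lexicographically negative */
      }
      return true;
  }

With dv = {{1, 0}}, the interchange matrix above yields [0, 1] (legal) while the reversal matrix yields [-1, 0] (illegal).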

70 Interprocedural Parallelization Function calls will make a loop unparallelizable: a reduction of the available parallelism, leaving mostly inner-loop parallelism. Solutions: interprocedural analysis, or inlining. Saman Amarasinghe 6.035 MIT Fall 2006

71 Interprocedural Parallelization Issues: the same function is reused many times; analyze a function on each trace (possibly exponential), or analyze a function once (the unrealizable-path problem). Interprocedural analysis: needs to update all the analyses, complex analysis, can be expensive. Inlining: works with existing analyses, but large code bloat can be very expensive. Saman Amarasinghe 6.035 MIT Fall 2006

72 Summary Multicores are here; we need parallelism to keep the performance gains. Programmer-defined or compiler-extracted parallelism. Automatic parallelization of loops with arrays requires data dependence analysis: the iteration space and data space abstraction, and an integer programming problem. Many optimizations that'll increase parallelism. Saman Amarasinghe 6.035 MIT Fall 2006

73 MIT OpenCourseWare http://ocw.mit.edu 6.035 Computer Language Engineering, Spring 2010. For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms
