Performance without Pain = Productivity: Data Layout and Collective Communication in UPC

1 Performance without Pain = Productivity Data Layout and Collective Communication in UPC By Rajesh Nishtala (UC Berkeley), George Almási (IBM Watson Research Center), Călin Caşcaval (IBM Watson Research Center)

2 Observations and Experiences
As the number of processors continues to grow at a rapid pace, application scalability takes center stage. Three important considerations for writing optimized, scalable parallel code:
1. How do you efficiently and optimally distribute the data across the processors?
2. How do you write a system that leverages existing serial libraries?
3. What are the simplest communication mechanisms you need to coordinate the processors?

3 Partitioned Global Address Space (PGAS) Languages
Programming model suitable for both shared and distributed memory systems. The language presents a logically shared memory: any thread may directly read/write data located on a remote processor. The address space is partitioned so that each processor has affinity to a memory region, and accesses to local memory are potentially much faster.
[Figure: shared address space partitioned across P0-P3, each thread also having its own private address space]
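To make the model concrete, the following minimal UPC sketch (assuming a standard UPC compiler such as Berkeley UPC; not code from the paper) accesses the local partition and a neighbor's partition through the same array syntax:

    #include <upc.h>
    #include <stdio.h>

    shared [1] int data[THREADS];      /* one element with affinity to each thread */

    int main(void) {
        data[MYTHREAD] = MYTHREAD;     /* access to the local partition: fast */
        upc_barrier;
        /* same syntax, but this element may live on a remote processor */
        int right = data[(MYTHREAD + 1) % THREADS];
        printf("thread %d sees neighbor value %d\n", MYTHREAD, right);
        return 0;
    }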

4 Thread- vs. Data-Centric Communication
Example: send P0's version of A to P1.
MPI code (thread-centric):
    double A;
    MPI_Status stat;
    if (myrank == 0) {
        A = 42.0;
        MPI_Send(&A, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(&A, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
    }
UPC code (data-centric):
    shared [1] double A[4];
    if (MYTHREAD == upc_threadof(&A[0])) {
        A[0] = 42.0;
        /* the cast is legal here because A[0] has affinity to this thread */
        upc_memput(&A[1], (double *)&A[0], sizeof(double));
    }
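For reference, a complete, compilable version of the data-centric fragment might look as follows (a sketch assuming a standard UPC compiler and at least two threads; upc_memput expects a private source pointer, so a local temporary is used here):

    #include <upc.h>
    #include <stdio.h>

    shared [1] double A[THREADS];              /* one element per thread */

    int main(void) {
        if (MYTHREAD == upc_threadof(&A[0])) {
            double local = 42.0;
            A[0] = local;
            upc_memput(&A[1], &local, sizeof(double));   /* push the value into thread 1's element */
        }
        upc_barrier;
        if (MYTHREAD == 1)
            printf("A[1] = %g\n", A[1]);
        return 0;
    }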

6 Data Layout Extensions to UPC
Proposed extensions:
1. Allow support for true shared multi-dimensional arrays
2. Allow creation of a processor topology for these arrays
Ensure that the underlying data layout is suitable for use with existing optimized serial libraries. A checkerboard layout is critical for load balancing dense factorization methods.
The declaration evolves from a plain array to a blocked shared array mapped onto a processor grid:
    double A[16][16];
    shared [2][2] double A[16][16];
    #pragma processors MyDist(2,2)
    shared [2][2] (MyDist) double A[16][16];
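The intended effect of such a declaration is a 2-D block-cyclic ("checkerboard") mapping of 2-by-2 tiles onto the 2-by-2 processor grid. The plain-C helper below only illustrates that mapping under the assumption of row-major tile numbering; it is not part of the proposed interface:

    #include <stdio.h>

    /* Owning thread of element (i, j) of a [b][b]-blocked array on an R-by-C grid,
     * with tiles dealt out cyclically in both dimensions (checkerboard). */
    static int owner(int i, int j, int b, int R, int C) {
        int tile_row = (i / b) % R;
        int tile_col = (j / b) % C;
        return tile_row * C + tile_col;
    }

    int main(void) {
        int b = 2, R = 2, C = 2;       /* matches shared [2][2] on a 2x2 grid */
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++)
                printf("%d ", owner(i, j, b, R, C));
            printf("\n");
        }
        return 0;
    }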

11 Algorithm Example: C = A*B (DGEMM)
Use one-sided communication to handle the data transfers; the initial implementation used uncoordinated gets.
Example: calculate C[0:1][2:3], owned by processor 1:
  needs data owned by 0 and 1 in the first round
  needs data owned by 1 and 3 in the second round
This is a non-scalable approach due to O(P*M*N) uncoordinated gets. Can we do better?
[Figure: M-by-P matrix A, P-by-N matrix B, and M-by-N matrix C distributed across the processor grid]

17 Collective Communication

18 Collective Communication
An operation called by all processes together to perform globally coordinated communication. It may involve a modest amount of computation, e.g. to combine values as they are reduced, and it can be extended to teams (or communicators) that operate on a subset of the processors.
Collectives are an abstraction that passes responsibility for tuning the communication schedule to the runtime system.
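Standard UPC already ships a flat collectives library with these properties. As a baseline (independent of the team extensions proposed later), a minimal upc_all_broadcast example using upc_collective.h might look like this (a sketch assuming a standard UPC compiler):

    #include <upc.h>
    #include <upc_collective.h>
    #include <stdio.h>

    shared double src;                 /* single value, affinity to thread 0 */
    shared [1] double dst[THREADS];    /* one destination element per thread */

    int main(void) {
        if (MYTHREAD == 0) src = 42.0;
        /* every thread calls the collective; IN_ALLSYNC makes the write above visible */
        upc_all_broadcast(dst, &src, sizeof(double),
                          UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
        printf("thread %d received %g\n", MYTHREAD, dst[MYTHREAD]);
        return 0;
    }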

20 Team Construction in MPI
MPI requires explicit team (communicator) objects. They are heavyweight objects that must be constructed outside the critical path, and they require the user to explicitly specify the processor ranks that are involved in the teams.
This is a logical extension of MPI's process-centric programming model.
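For concreteness, building and using a row team in MPI looks roughly like the generic sketch below (not code from the paper; the 4-column grid is an assumption):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int cols = 4;                                   /* assumed processor grid width */
        int myrow = rank / cols, mycol = rank % cols;

        /* explicit, relatively heavyweight team construction, done outside the critical path */
        MPI_Comm row_comm;
        MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);

        double value = (mycol == 0) ? 42.0 : 0.0;
        MPI_Bcast(&value, 1, MPI_DOUBLE, 0, row_comm);  /* root = first rank in each row */
        printf("rank %d (row %d) got %g\n", rank, myrow, value);

        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }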

23 Team Construction in UPC
UPC enables distributed arrays and presents a more data-centric programming model. Can we create a novel collective interface to go along with this programming model? Can we make it scale?
Our approach: have the user specify the blocks of the shared data to operate on, and let the runtime figure out the mapping of data to processors and dynamically construct the teams.

26 Example 1: Broadcast
Representative of one-to-many communication: broadcast one element to every other row and every other column.
Use Matlab-style notation in the interface:
    shared [2][2] double dst[n][n];
    double even, odd;
    upc_stride_broadcast(dst<0:2:n-1, 0:2:n-1>, even, sizeof(double));
    upc_stride_broadcast(dst<1:2:n-1, 1:2:n-1>, odd, sizeof(double));
The collective arguments are the same regardless of the number of processors.
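The <lower:stride:upper> triplets select rows and columns exactly like Matlab ranges. The small loop below (purely illustrative, not part of the proposed API) enumerates the positions selected by dst<0:2:n-1, 0:2:n-1> for n = 8:

    #include <stdio.h>

    int main(void) {
        int n = 8, stride = 2;
        /* positions covered by <0:2:n-1, 0:2:n-1>: every other row and column */
        for (int i = 0; i <= n - 1; i += stride)
            for (int j = 0; j <= n - 1; j += stride)
                printf("selected (%d, %d)\n", i, j);
        return 0;
    }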

29 Example 2: Exchange
Representative of many-to-many communication; the data-centric approach enables multi-dimensional operations.
Example: exchange elements from a particular column into a particular row (owned by P0):
    #pragma processors mydist(2,2)
    shared [2][2] (mydist) double A[16][16];
    upc_stride_exchange(A<0,:>, A<:,0>, sizeof(double));

32 Optimizing the Collectives
The tree topology used for processor communication is critical for performance. The overhead of injecting messages onto the network cannot be parallelized; sending to intermediary nodes alleviates this serial bottleneck and distributes the work across the machine.
[Figure: binary, binomial, and fork broadcast trees, with the best tree shape for BlueGene/L highlighted]
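As one concrete example of such a schedule, the sketch below computes each rank's parent and children in a binomial broadcast tree rooted at 0 (a common construction; the trees actually tuned for BlueGene/L may differ):

    #include <stdio.h>

    static int parent(int r) {              /* clear the highest set bit of r */
        if (r == 0) return -1;
        int m = 1;
        while ((m << 1) <= r) m <<= 1;
        return r - m;
    }

    static void children(int r, int P) {
        int k = 1;
        while (k <= r) k <<= 1;             /* smallest power of two greater than r */
        for (; r + k < P; k <<= 1)
            printf("  %d -> %d\n", r, r + k);
    }

    int main(void) {
        int P = 8;
        for (int r = 0; r < P; r++) {
            printf("rank %d (parent %d) sends to:\n", r, parent(r));
            children(r, P);
        }
        return 0;
    }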

36 Back to DGEMM
The implementation with uncoordinated gets did not scale, so change the implementation to use scalable collectives.
Use simultaneous team broadcasts: the first round goes across processor rows, and the second round goes down processor columns. See the paper for the full code.
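Because the proposed UPC team collectives are not part of standard UPC, the structural sketch below uses plain MPI row and column communicators and a naive local multiply to show the same broadcast pattern (one block per rank on a q-by-q grid; all names and sizes are assumptions, and ESSL DGEMM would replace the triple loop in the real code):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLK 64                              /* local block size (assumption) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int q = 1;                              /* grid side; assume q*q == nprocs */
        while ((q + 1) * (q + 1) <= nprocs) q++;
        int myrow = rank / q, mycol = rank % q;

        MPI_Comm row_comm, col_comm;            /* teams built once, reused every step */
        MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

        double *A = calloc(BLK * BLK, sizeof(double)), *B = calloc(BLK * BLK, sizeof(double));
        double *C = calloc(BLK * BLK, sizeof(double));
        double *Abuf = malloc(BLK * BLK * sizeof(double)), *Bbuf = malloc(BLK * BLK * sizeof(double));
        for (int i = 0; i < BLK * BLK; i++) { A[i] = 1.0; B[i] = 1.0; }

        for (int k = 0; k < q; k++) {
            /* round 1: the owner column k broadcasts its A block across each processor row */
            if (mycol == k) memcpy(Abuf, A, BLK * BLK * sizeof(double));
            MPI_Bcast(Abuf, BLK * BLK, MPI_DOUBLE, k, row_comm);
            /* round 2: the owner row k broadcasts its B block down each processor column */
            if (myrow == k) memcpy(Bbuf, B, BLK * BLK * sizeof(double));
            MPI_Bcast(Bbuf, BLK * BLK, MPI_DOUBLE, k, col_comm);
            /* local update C += Abuf * Bbuf (stand-in for an ESSL/BLAS3 DGEMM call) */
            for (int i = 0; i < BLK; i++)
                for (int j = 0; j < BLK; j++)
                    for (int kk = 0; kk < BLK; kk++)
                        C[i * BLK + j] += Abuf[i * BLK + kk] * Bbuf[kk * BLK + j];
        }

        MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
        free(A); free(B); free(C); free(Abuf); free(Bbuf);
        MPI_Finalize();
        return 0;
    }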

44 DGEMM Performance
Linearized pairs of the 4-D torus dimensions to get a 2-D processor grid.
B^2 doubles are broadcast for every 2B^3 flops, so the communication-to-computation ratio shrinks as the block size B grows.
Experiments scale the problem size with the processor count; we reach 28.8 TFlop/s on 16,384 processors with linear scaling.
[Chart: theoretical peak (no communication) vs. measured performance, GFlop/s vs. processor count]

45 Dense Cholesky Factorization
Factor A into U^T U. Relies on the team broadcast.
Recursive algorithm:
  Factor the upper-left corner of A (A_UL)
  Update A_UR using a triangular solve
  Form the outer product of A_UR^T and A_UR to update A_LR
Full code in the paper; ESSL is used for the local computation.
[Figure: A partitioned into blocks A_UL, A_UR, A_LL, A_LR]
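As a serial reference for these three steps, here is a self-contained blocked A = U^T U factorization in plain C (an illustrative sketch only; the distributed code in the paper performs the same steps with ESSL kernels and team broadcasts):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Factor A = U^T U in place (U stored in the upper triangle), block size b dividing n. */
    static void blocked_cholesky(double *A, int n, int b) {
        for (int k = 0; k < n; k += b) {
            /* 1. factor the diagonal block A(k:k+b, k:k+b) (unblocked) */
            for (int j = k; j < k + b; j++) {
                double d = A[j*n + j];
                for (int i = k; i < j; i++) d -= A[i*n + j] * A[i*n + j];
                A[j*n + j] = sqrt(d);
                for (int c = j + 1; c < k + b; c++) {
                    double s = A[j*n + c];
                    for (int i = k; i < j; i++) s -= A[i*n + j] * A[i*n + c];
                    A[j*n + c] = s / A[j*n + j];
                }
            }
            /* 2. triangular solve for the block row U(k:k+b, k+b:n) */
            for (int c = k + b; c < n; c++)
                for (int r = k; r < k + b; r++) {
                    double s = A[r*n + c];
                    for (int i = k; i < r; i++) s -= A[i*n + r] * A[i*n + c];
                    A[r*n + c] = s / A[r*n + r];
                }
            /* 3. trailing update: A(k+b:, k+b:) -= U(k:k+b, k+b:)^T * U(k:k+b, k+b:) */
            for (int i = k + b; i < n; i++)
                for (int j = i; j < n; j++)
                    for (int r = k; r < k + b; r++)
                        A[i*n + j] -= A[r*n + i] * A[r*n + j];
        }
    }

    int main(void) {
        int n = 8, b = 4;
        double *A = malloc(n*n*sizeof(double)), *A0 = malloc(n*n*sizeof(double));
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A0[i*n + j] = A[i*n + j] = (i == j) ? n : 1.0;   /* SPD test matrix */
        blocked_cholesky(A, n, b);
        double err = 0.0;                                        /* check A0 == U^T U */
        for (int i = 0; i < n; i++)
            for (int j = i; j < n; j++) {
                double s = 0.0;
                for (int r = 0; r <= i; r++) s += A[r*n + i] * A[r*n + j];
                if (fabs(s - A0[i*n + j]) > err) err = fabs(s - A0[i*n + j]);
            }
        printf("max |A - U^T U| = %g\n", err);
        free(A); free(A0);
        return 0;
    }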

46 Cholesky Factorization Performance
The processor layout is identical to DGEMM, but the broadcast is no longer strictly along rows or columns of the processor grid. A rectangular processor grid over a square matrix leads to load-balance issues.
Reaches 8.6 TFlop/s on 8,192 processors with linear scaling.
[Chart: theoretical peak (no communication) vs. measured performance, GFlop/s vs. processor count]

47 3D-FFT
Perform a 3D FFT across a large rectangular prism. With a 2-D processor layout, each processor owns a row of 4 squares (pencils) of the domain.
An FFT must be performed in each of the 3 dimensions, and a team exchange is needed for the other 2 of the 3 dimensions.
Algorithm:
  Perform FFTs across the rows
  Do an exchange within each plane
  Perform FFTs across the columns
  Do an exchange across the planes
  Perform FFTs across the last dimension
[Figure: the A/B/C/D pencils owned by each processor (P0 highlighted) before and after each exchange]
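The communication structure boils down to two all-to-all exchanges over row and column teams. The skeleton below exercises exactly that pattern with plain MPI and dummy buffers (a sketch only; the local 1-D FFTs, done with ESSL/FFTW in a real code, are just marked by comments):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int q = 1;                                         /* grid side; assume q*q == nprocs */
        while ((q + 1) * (q + 1) <= nprocs) q++;
        int myrow = rank / q, mycol = rank % q;

        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

        int chunk = 1024;                                  /* elements sent to each team member */
        double *buf = malloc((size_t)chunk * q * sizeof(double));
        double *tmp = malloc((size_t)chunk * q * sizeof(double));
        for (int i = 0; i < chunk * q; i++) buf[i] = (double)rank;

        /* 1-D FFTs along the locally contiguous dimension would run here */
        MPI_Alltoall(buf, chunk, MPI_DOUBLE, tmp, chunk, MPI_DOUBLE, row_comm);  /* exchange within each plane */
        /* 1-D FFTs along the second dimension would run here */
        MPI_Alltoall(tmp, chunk, MPI_DOUBLE, buf, chunk, MPI_DOUBLE, col_comm);  /* exchange across planes */
        /* 1-D FFTs along the last dimension would run here */

        MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
        free(buf); free(tmp);
        MPI_Finalize();
        return 0;
    }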

55 3D-FFT Performance
Every processor exchanges all of its data in each round, so network bandwidth limits performance. We created a performance model that exposes the bisection bandwidth limits.
Reaches 2.1 TFlop/s on 16,384 processors.
[Chart: performance in GFlop/s vs. processor count]

56 Summary of Results

                            Lines of code   Number of processors   Max performance   % of serial ESSL   % of machine peak
  DGEMM                     14              16k                    28.8 TFlop/s      84%                63%
  Cholesky factorization    28              8k                     8.6 TFlop/s       56%                38%
  3D-FFT                    18              16k                    2.1 TFlop/s       97% of the analytic model

57 Conclusions
Added support for multi-dimensional shared arrays in UPC. Created a novel data-centric collective communication interface in UPC. Implemented 3 important algorithms with the data-centric approach to showcase both the brevity of the code and its performance scalability.
We need to rethink programming models to ensure application scalability as the number of processors continues its rapid growth.

58 Questions?

59 Backup Slides

60 UPC Distributed Arrays
UPC (a popular PGAS language) allows distributed arrays to be declared in a concise way, and any element can be accessed through C-like array operators.
Examples (layouts shown for 4 threads):
    shared []  double A[THREADS];    /* indefinite block size: the whole array has affinity to thread 0 */
    shared [1] double B[THREADS*2];  /* cyclic: thread 0 owns B[0], B[4]; thread 1 owns B[1], B[5]; ... */
    shared [2] double C[THREADS*4];  /* blocks of 2: thread 0 owns C[0..1], C[8..9]; thread 1 owns C[2..3], C[10..11]; ... */
    shared [4] double D[THREADS*4];  /* blocks of 4: thread 0 owns D[0..3]; thread 1 owns D[4..7]; ... */
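A quick way to confirm such layouts is to ask the runtime which thread owns each element; a minimal sketch (assuming a standard UPC compiler, run with 4 threads to match the figures above):

    #include <upc.h>
    #include <stdio.h>

    shared [2] double C[THREADS*4];   /* blocks of 2 elements, dealt out round-robin */

    int main(void) {
        if (MYTHREAD == 0)
            for (int i = 0; i < THREADS * 4; i++)
                printf("C[%d] has affinity to thread %d\n", i, (int)upc_threadof(&C[i]));
        return 0;
    }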

66 Algorithmic Example: DGEMM
Declare the arrays with the proposed multi-dimensional blocked layout:
    shared [b][b] double A[M][P];
    shared [b][b] double B[P][N];
    shared [b][b] double C[M][N];
Run the update (pseudocode):
    for (k = 0; k < P; k += b)
      for (i = 0; i < M; i += b) {
        get block A[i][k] into tempA
        upc_forall (j = 0; j < N; j += b; &C[i][j]) {
          get block B[k][j] into tempB
          C[i][j] += tempA * tempB      /* b-by-b block update */
        }
      }
The internal memory layout lets us directly call a BLAS3 DGEMM for the block update.

70 Motivation for a Checkerboard Layout
Useful for dense factorization algorithms (e.g., dense LU and dense Cholesky), which recursively factor the matrix. Without a checkerboard layout, the bottom corner of the matrix would be heavily loaded onto one processor; the checkerboard ensures good load balance.

77 Scaling Applications for the Future
Future machines are likely to achieve increased performance by adding more processing elements. As the number of processors continues to grow at a rapid pace, application scalability becomes a bigger concern. How can we design a programming model, and the associated runtime systems, to help alleviate these concerns?

78 Collective Model
Every processor with affinity to an active block of data must make a call into the collective; this allows the implementation to build scalable communication schedules.
Alternative model considered (Seidel et al.): have exactly one process make a call into the collective. This requires the user to handle their own synchronization (or wait until the next barrier phase), and requires the implementation to dedicate a thread to building scalable communication schedules amongst the processors.

79 Machine Layouts

  Cores   Torus (X x Y x Z x T)   UPC array mapping/dim   Distribution matrix   FT mapping/dim   Distribution matrix
  -       -                       XT,YZ                   8 x 16                YZ,X             16 x 4
  1k      -                       XT,YZ                   16 x 64               YZ,X             64 x 8
  2k      -                       XT,YZ                   16 x 128              YZ,X             128 x 8
  4k      -                       XT,YZ                   16 x 256              YZ,X             256 x 8
  8k      -                       XT,YZ                   16 x 512              YZ,X             512 x 8

80 FFT Performance Model
The model captures the bandwidth limits for each exchange:
  The first exchange is always done on a linear set of nodes (the X dimension); half the data from each node is transferred across that one limiting link.
  The second exchange is done across a plane (the Y and Z dimensions); MIN(Y,Z) gives the number of bandwidth-limiting links, and half the data again travels across this set of links.
  Add in the serial compute time per CPU.

81 BlueGene/L System Architecture
One chip is a dual-core PowerPC 440. There are two chips per compute card (4 cores), 16 compute cards per node card (64 cores), and 32 node cards per rack (2048 cores). We used up to 16 racks.

82 Full DGEMM Code

83 Full Cholesky Code

84 Full FT Code

85 Broadcast Performance Model

86 Broadcast Performance Model Plot
