Performance without Pain = Productivity: Data Layout and Collective Communication in UPC

1 Performance without Pain = Productivity Data Layout and Collective Communication in UPC By Rajesh Nishtala (UC Berkeley), George Almási (IBM Watson Research Center), Călin Caşcaval (IBM Watson Research Center)

2 Observations and Experiences
As the number of processors continues to grow at a rapid pace, application scalability takes center stage. Three important considerations for writing optimized, scalable parallel code:
1. How do you efficiently and optimally distribute the data across the processors?
2. How do you write a system that leverages existing serial libraries?
3. What are the simplest communication mechanisms you need to coordinate the processors?

3 Partitioned Global Address Space (PGAS) Languages
Programming model suitable for both shared and distributed memory systems. The language presents a logically shared memory: any thread may directly read/write data located on a remote processor. The address space is partitioned so that each processor has affinity to a memory region, and accesses to local memory are potentially much faster.
[Figure: shared address space partitioned across P0-P3, each thread also having its own private address space]
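To make the model concrete, the following minimal UPC sketch (assuming a standard UPC compiler such as Berkeley UPC; not code from the paper) accesses the local partition and a neighbor's partition through the same array syntax:

    #include <upc.h>
    #include <stdio.h>

    shared [1] int data[THREADS];      /* one element with affinity to each thread */

    int main(void) {
        data[MYTHREAD] = MYTHREAD;     /* access to the local partition: fast */
        upc_barrier;
        /* same syntax, but this element may live on a remote processor */
        int right = data[(MYTHREAD + 1) % THREADS];
        printf("thread %d sees neighbor value %d\n", MYTHREAD, right);
        return 0;
    }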

4 Thread- vs. Data-Centric Communication
Example: send P0's version of A to P1.
MPI code (thread-centric):
    double A;
    MPI_Status stat;
    if (myrank == 0) {
        A = 42.0;
        MPI_Send(&A, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        MPI_Recv(&A, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
    }
UPC code (data-centric):
    shared [1] double A[4];
    if (MYTHREAD == upc_threadof(&A[0])) {
        A[0] = 42.0;
        /* the cast is legal here because A[0] has affinity to this thread */
        upc_memput(&A[1], (double *)&A[0], sizeof(double));
    }
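For reference, a complete, compilable version of the data-centric fragment might look as follows (a sketch assuming a standard UPC compiler and at least two threads; upc_memput expects a private source pointer, so a local temporary is used here):

    #include <upc.h>
    #include <stdio.h>

    shared [1] double A[THREADS];              /* one element per thread */

    int main(void) {
        if (MYTHREAD == upc_threadof(&A[0])) {
            double local = 42.0;
            A[0] = local;
            upc_memput(&A[1], &local, sizeof(double));   /* push the value into thread 1's element */
        }
        upc_barrier;
        if (MYTHREAD == 1)
            printf("A[1] = %g\n", A[1]);
        return 0;
    }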

6 Data Layout Extensions to UPC
Proposed extensions:
1. Allow support for true shared multi-dimensional arrays
2. Allow creation of a processor topology for these arrays
Ensure that the underlying data layout is suitable for use with existing optimized serial libraries. A checkerboard layout is critical for load balancing dense factorization methods.
The declaration evolves from a plain array to a blocked shared array mapped onto a processor grid:
    double A[16][16];
    shared [2][2] double A[16][16];
    #pragma processors MyDist(2,2)
    shared [2][2] (MyDist) double A[16][16];
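The intended effect of such a declaration is a 2-D block-cyclic ("checkerboard") mapping of 2-by-2 tiles onto the 2-by-2 processor grid. The plain-C helper below only illustrates that mapping under the assumption of row-major tile numbering; it is not part of the proposed interface:

    #include <stdio.h>

    /* Owning thread of element (i, j) of a [b][b]-blocked array on an R-by-C grid,
     * with tiles dealt out cyclically in both dimensions (checkerboard). */
    static int owner(int i, int j, int b, int R, int C) {
        int tile_row = (i / b) % R;
        int tile_col = (j / b) % C;
        return tile_row * C + tile_col;
    }

    int main(void) {
        int b = 2, R = 2, C = 2;       /* matches shared [2][2] on a 2x2 grid */
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++)
                printf("%d ", owner(i, j, b, R, C));
            printf("\n");
        }
        return 0;
    }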

11 Algorithm Example: C = A*B (DGEMM)
Use one-sided communication to handle the data transfers; the initial implementation used uncoordinated gets.
Example: calculate C[0:1][2:3], owned by processor 1:
  needs data owned by 0 and 1 in the first round
  needs data owned by 1 and 3 in the second round
This is a non-scalable approach due to O(P*M*N) uncoordinated gets. Can we do better?
[Figure: M-by-P matrix A, P-by-N matrix B, and M-by-N matrix C distributed across the processor grid]

17 Collective Communication

18 Collective Communication
An operation called by all processes together to perform globally coordinated communication. It may involve a modest amount of computation, e.g. to combine values as they are reduced, and it can be extended to teams (or communicators) that operate on a subset of the processors.
Collectives are an abstraction that passes responsibility for tuning the communication schedule to the runtime system.
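Standard UPC already ships a flat collectives library with these properties. As a baseline (independent of the team extensions proposed later), a minimal upc_all_broadcast example using upc_collective.h might look like this (a sketch assuming a standard UPC compiler):

    #include <upc.h>
    #include <upc_collective.h>
    #include <stdio.h>

    shared double src;                 /* single value, affinity to thread 0 */
    shared [1] double dst[THREADS];    /* one destination element per thread */

    int main(void) {
        if (MYTHREAD == 0) src = 42.0;
        /* every thread calls the collective; IN_ALLSYNC makes the write above visible */
        upc_all_broadcast(dst, &src, sizeof(double),
                          UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);
        printf("thread %d received %g\n", MYTHREAD, dst[MYTHREAD]);
        return 0;
    }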

20 Team Construction in MPI
MPI requires explicit team (communicator) objects. They are heavyweight objects that must be constructed outside the critical path, and they require the user to explicitly specify the processor ranks that are involved in the teams.
This is a logical extension of MPI's process-centric programming model.
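For concreteness, building and using a row team in MPI looks roughly like the generic sketch below (not code from the paper; the 4-column grid is an assumption):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int cols = 4;                                   /* assumed processor grid width */
        int myrow = rank / cols, mycol = rank % cols;

        /* explicit, relatively heavyweight team construction, done outside the critical path */
        MPI_Comm row_comm;
        MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);

        double value = (mycol == 0) ? 42.0 : 0.0;
        MPI_Bcast(&value, 1, MPI_DOUBLE, 0, row_comm);  /* root = first rank in each row */
        printf("rank %d (row %d) got %g\n", rank, myrow, value);

        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }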

23 Team Construction in UPC
UPC enables distributed arrays and presents a more data-centric programming model. Can we create a novel collective interface to go along with this programming model? Can we make it scale?
Our approach: have the user specify the blocks of the shared data to operate on, and let the runtime figure out the mapping of data to processors and dynamically construct the teams.

26 Example 1: Broadcast
Representative of one-to-many communication: broadcast one element to every other row and every other column.
Use Matlab-style notation in the interface:
    shared [2][2] double dst[n][n];
    double even, odd;
    upc_stride_broadcast(dst<0:2:n-1, 0:2:n-1>, even, sizeof(double));
    upc_stride_broadcast(dst<1:2:n-1, 1:2:n-1>, odd, sizeof(double));
The collective arguments are the same regardless of the number of processors.
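The <lower:stride:upper> triplets select rows and columns exactly like Matlab ranges. The small loop below (purely illustrative, not part of the proposed API) enumerates the positions selected by dst<0:2:n-1, 0:2:n-1> for n = 8:

    #include <stdio.h>

    int main(void) {
        int n = 8, stride = 2;
        /* positions covered by <0:2:n-1, 0:2:n-1>: every other row and column */
        for (int i = 0; i <= n - 1; i += stride)
            for (int j = 0; j <= n - 1; j += stride)
                printf("selected (%d, %d)\n", i, j);
        return 0;
    }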

29 Example 2: Exchange
Representative of many-to-many communication; the data-centric approach enables multi-dimensional operations.
Example: exchange elements from a particular column into a particular row (owned by P0):
    #pragma processors mydist(2,2)
    shared [2][2] (mydist) double A[16][16];
    upc_stride_exchange(A<0,:>, A<:,0>, sizeof(double));

32 Optimizing the Collectives
The tree topology used for processor communication is critical for performance. The overhead of injecting messages onto the network cannot be parallelized; sending to intermediary nodes alleviates this serial bottleneck and distributes the work across the machine.
[Figure: binary, binomial, and fork broadcast trees, with the best tree shape for BlueGene/L highlighted]
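As one concrete example of such a schedule, the sketch below computes each rank's parent and children in a binomial broadcast tree rooted at 0 (a common construction; the trees actually tuned for BlueGene/L may differ):

    #include <stdio.h>

    static int parent(int r) {              /* clear the highest set bit of r */
        if (r == 0) return -1;
        int m = 1;
        while ((m << 1) <= r) m <<= 1;
        return r - m;
    }

    static void children(int r, int P) {
        int k = 1;
        while (k <= r) k <<= 1;             /* smallest power of two greater than r */
        for (; r + k < P; k <<= 1)
            printf("  %d -> %d\n", r, r + k);
    }

    int main(void) {
        int P = 8;
        for (int r = 0; r < P; r++) {
            printf("rank %d (parent %d) sends to:\n", r, parent(r));
            children(r, P);
        }
        return 0;
    }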

36 Back to DGEMM
The implementation with uncoordinated gets did not scale, so change the implementation to use scalable collectives.
Use simultaneous team broadcasts: the first round goes across processor rows, and the second round goes down processor columns. See the paper for the full code.
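Because the proposed UPC team collectives are not part of standard UPC, the structural sketch below uses plain MPI row and column communicators and a naive local multiply to show the same broadcast pattern (one block per rank on a q-by-q grid; all names and sizes are assumptions, and ESSL DGEMM would replace the triple loop in the real code):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLK 64                              /* local block size (assumption) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int q = 1;                              /* grid side; assume q*q == nprocs */
        while ((q + 1) * (q + 1) <= nprocs) q++;
        int myrow = rank / q, mycol = rank % q;

        MPI_Comm row_comm, col_comm;            /* teams built once, reused every step */
        MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

        double *A = calloc(BLK * BLK, sizeof(double)), *B = calloc(BLK * BLK, sizeof(double));
        double *C = calloc(BLK * BLK, sizeof(double));
        double *Abuf = malloc(BLK * BLK * sizeof(double)), *Bbuf = malloc(BLK * BLK * sizeof(double));
        for (int i = 0; i < BLK * BLK; i++) { A[i] = 1.0; B[i] = 1.0; }

        for (int k = 0; k < q; k++) {
            /* round 1: the owner column k broadcasts its A block across each processor row */
            if (mycol == k) memcpy(Abuf, A, BLK * BLK * sizeof(double));
            MPI_Bcast(Abuf, BLK * BLK, MPI_DOUBLE, k, row_comm);
            /* round 2: the owner row k broadcasts its B block down each processor column */
            if (myrow == k) memcpy(Bbuf, B, BLK * BLK * sizeof(double));
            MPI_Bcast(Bbuf, BLK * BLK, MPI_DOUBLE, k, col_comm);
            /* local update C += Abuf * Bbuf (stand-in for an ESSL/BLAS3 DGEMM call) */
            for (int i = 0; i < BLK; i++)
                for (int j = 0; j < BLK; j++)
                    for (int kk = 0; kk < BLK; kk++)
                        C[i * BLK + j] += Abuf[i * BLK + kk] * Bbuf[kk * BLK + j];
        }

        MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
        free(A); free(B); free(C); free(Abuf); free(Bbuf);
        MPI_Finalize();
        return 0;
    }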

44 DGEMM Performance
Linearized pairs of the 4-D torus dimensions to get a 2-D processor grid.
B^2 doubles are broadcast for every 2B^3 flops, so the communication-to-computation ratio shrinks as the block size B grows.
Experiments scale the problem size with the processor count; we reach 28.8 TFlop/s on 16,384 processors with linear scaling.
[Chart: theoretical peak (no communication) vs. measured performance, GFlop/s vs. processor count]

45 Dense Cholesky Factorization
Factor A into U^T U. Relies on the team broadcast.
Recursive algorithm:
  Factor the upper-left corner of A (A_UL)
  Update A_UR using a triangular solve
  Form the outer product of A_UR^T and A_UR to update A_LR
Full code in the paper; ESSL is used for the local computation.
[Figure: A partitioned into blocks A_UL, A_UR, A_LL, A_LR]
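As a serial reference for these three steps, here is a self-contained blocked A = U^T U factorization in plain C (an illustrative sketch only; the distributed code in the paper performs the same steps with ESSL kernels and team broadcasts):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Factor A = U^T U in place (U stored in the upper triangle), block size b dividing n. */
    static void blocked_cholesky(double *A, int n, int b) {
        for (int k = 0; k < n; k += b) {
            /* 1. factor the diagonal block A(k:k+b, k:k+b) (unblocked) */
            for (int j = k; j < k + b; j++) {
                double d = A[j*n + j];
                for (int i = k; i < j; i++) d -= A[i*n + j] * A[i*n + j];
                A[j*n + j] = sqrt(d);
                for (int c = j + 1; c < k + b; c++) {
                    double s = A[j*n + c];
                    for (int i = k; i < j; i++) s -= A[i*n + j] * A[i*n + c];
                    A[j*n + c] = s / A[j*n + j];
                }
            }
            /* 2. triangular solve for the block row U(k:k+b, k+b:n) */
            for (int c = k + b; c < n; c++)
                for (int r = k; r < k + b; r++) {
                    double s = A[r*n + c];
                    for (int i = k; i < r; i++) s -= A[i*n + r] * A[i*n + c];
                    A[r*n + c] = s / A[r*n + r];
                }
            /* 3. trailing update: A(k+b:, k+b:) -= U(k:k+b, k+b:)^T * U(k:k+b, k+b:) */
            for (int i = k + b; i < n; i++)
                for (int j = i; j < n; j++)
                    for (int r = k; r < k + b; r++)
                        A[i*n + j] -= A[r*n + i] * A[r*n + j];
        }
    }

    int main(void) {
        int n = 8, b = 4;
        double *A = malloc(n*n*sizeof(double)), *A0 = malloc(n*n*sizeof(double));
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                A0[i*n + j] = A[i*n + j] = (i == j) ? n : 1.0;   /* SPD test matrix */
        blocked_cholesky(A, n, b);
        double err = 0.0;                                        /* check A0 == U^T U */
        for (int i = 0; i < n; i++)
            for (int j = i; j < n; j++) {
                double s = 0.0;
                for (int r = 0; r <= i; r++) s += A[r*n + i] * A[r*n + j];
                if (fabs(s - A0[i*n + j]) > err) err = fabs(s - A0[i*n + j]);
            }
        printf("max |A - U^T U| = %g\n", err);
        free(A); free(A0);
        return 0;
    }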

46 Cholesky Factorization Performance
The processor layout is identical to DGEMM, but the broadcast is no longer strictly along rows or columns of the processor grid. A rectangular processor grid over a square matrix leads to load-balance issues.
Reaches 8.6 TFlop/s on 8,192 processors with linear scaling.
[Chart: theoretical peak (no communication) vs. measured performance, GFlop/s vs. processor count]

47 3D-FFT
Perform a 3D FFT across a large rectangular prism. With a 2-D processor layout, each processor owns a row of 4 squares (pencils) of the domain.
An FFT must be performed in each of the 3 dimensions, and a team exchange is needed for the other 2 of the 3 dimensions.
Algorithm:
  Perform FFTs across the rows
  Do an exchange within each plane
  Perform FFTs across the columns
  Do an exchange across the planes
  Perform FFTs across the last dimension
[Figure: the A/B/C/D pencils owned by each processor (P0 highlighted) before and after each exchange]
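The communication structure boils down to two all-to-all exchanges over row and column teams. The skeleton below exercises exactly that pattern with plain MPI and dummy buffers (a sketch only; the local 1-D FFTs, done with ESSL/FFTW in a real code, are just marked by comments):

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int q = 1;                                         /* grid side; assume q*q == nprocs */
        while ((q + 1) * (q + 1) <= nprocs) q++;
        int myrow = rank / q, mycol = rank % q;

        MPI_Comm row_comm, col_comm;
        MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &row_comm);
        MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, &col_comm);

        int chunk = 1024;                                  /* elements sent to each team member */
        double *buf = malloc((size_t)chunk * q * sizeof(double));
        double *tmp = malloc((size_t)chunk * q * sizeof(double));
        for (int i = 0; i < chunk * q; i++) buf[i] = (double)rank;

        /* 1-D FFTs along the locally contiguous dimension would run here */
        MPI_Alltoall(buf, chunk, MPI_DOUBLE, tmp, chunk, MPI_DOUBLE, row_comm);  /* exchange within each plane */
        /* 1-D FFTs along the second dimension would run here */
        MPI_Alltoall(tmp, chunk, MPI_DOUBLE, buf, chunk, MPI_DOUBLE, col_comm);  /* exchange across planes */
        /* 1-D FFTs along the last dimension would run here */

        MPI_Comm_free(&row_comm); MPI_Comm_free(&col_comm);
        free(buf); free(tmp);
        MPI_Finalize();
        return 0;
    }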

55 3D-FFT Performance
Every processor exchanges all of its data in each round, so network bandwidth limits performance. We created a performance model that exposes the bisection bandwidth limits.
Reaches 2.1 TFlop/s on 16,384 processors.
[Chart: performance in GFlop/s vs. processor count]

56 Summary of Results

                            Lines of code   Number of processors   Max performance   % of serial ESSL   % of machine peak
  DGEMM                     14              16k                    28.8 TFlop/s      84%                63%
  Cholesky factorization    28              8k                     8.6 TFlop/s       56%                38%
  3D-FFT                    18              16k                    2.1 TFlop/s       97% of the analytic model

57 Conclusions
Added support for multi-dimensional shared arrays in UPC. Created a novel data-centric collective communication interface in UPC. Implemented 3 important algorithms with the data-centric approach to showcase both the brevity of the code and its performance scalability.
We need to rethink programming models to ensure application scalability as the number of processors continues its rapid growth.

58 Questions?

59 Backup Slides

60 UPC Distributed Arrays
UPC (a popular PGAS language) allows distributed arrays to be declared in a concise way, and any element can be accessed through C-like array operators.
Examples (layouts shown for 4 threads):
    shared []  double A[THREADS];    /* indefinite block size: the whole array has affinity to thread 0 */
    shared [1] double B[THREADS*2];  /* cyclic: thread 0 owns B[0], B[4]; thread 1 owns B[1], B[5]; ... */
    shared [2] double C[THREADS*4];  /* blocks of 2: thread 0 owns C[0..1], C[8..9]; thread 1 owns C[2..3], C[10..11]; ... */
    shared [4] double D[THREADS*4];  /* blocks of 4: thread 0 owns D[0..3]; thread 1 owns D[4..7]; ... */
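A quick way to confirm such layouts is to ask the runtime which thread owns each element; a minimal sketch (assuming a standard UPC compiler, run with 4 threads to match the figures above):

    #include <upc.h>
    #include <stdio.h>

    shared [2] double C[THREADS*4];   /* blocks of 2 elements, dealt out round-robin */

    int main(void) {
        if (MYTHREAD == 0)
            for (int i = 0; i < THREADS * 4; i++)
                printf("C[%d] has affinity to thread %d\n", i, (int)upc_threadof(&C[i]));
        return 0;
    }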

66 Algorithmic Example: DGEMM
Declare the arrays with the proposed multi-dimensional blocked layout:
    shared [b][b] double A[M][P];
    shared [b][b] double B[P][N];
    shared [b][b] double C[M][N];
Run the update (pseudocode):
    for (k = 0; k < P; k += b)
      for (i = 0; i < M; i += b) {
        get block A[i][k] into tempA
        upc_forall (j = 0; j < N; j += b; &C[i][j]) {
          get block B[k][j] into tempB
          C[i][j] += tempA * tempB      /* b-by-b block update */
        }
      }
The internal memory layout lets us directly call a BLAS3 DGEMM for the block update.

70 Motivation for a Checkerboard Layout
Useful for dense factorization algorithms (e.g., dense LU and dense Cholesky), which recursively factor the matrix. Without a checkerboard layout, the bottom corner of the matrix would be heavily loaded onto one processor; the checkerboard ensures good load balance.

77 Scaling Applications for the Future
Future machines are likely to achieve increased performance by adding more processing elements. As the number of processors continues to grow at a rapid pace, application scalability becomes a bigger concern. How can we design a programming model, and the associated runtime systems, to help alleviate these concerns?

78 Collective Model
Every processor with affinity to an active block of data must make a call into the collective; this allows the implementation to build scalable communication schedules.
Alternative model considered (Seidel et al.): have exactly one process make a call into the collective. This requires the user to handle their own synchronization (or wait until the next barrier phase), and requires the implementation to dedicate a thread to building scalable communication schedules amongst the processors.

79 Machine Layouts

  Cores   Torus (X x Y x Z x T)   UPC array mapping/dim   Distribution matrix   FT mapping/dim   Distribution matrix
  -       -                       XT,YZ                   8 x 16                YZ,X             16 x 4
  1k      -                       XT,YZ                   16 x 64               YZ,X             64 x 8
  2k      -                       XT,YZ                   16 x 128              YZ,X             128 x 8
  4k      -                       XT,YZ                   16 x 256              YZ,X             256 x 8
  8k      -                       XT,YZ                   16 x 512              YZ,X             512 x 8

80 FFT Performance Model
The model captures the bandwidth limits for each exchange:
  The first exchange is always done on a linear set of nodes (the X dimension); half the data from each node is transferred across that one limiting link.
  The second exchange is done across a plane (the Y and Z dimensions); MIN(Y,Z) gives the number of bandwidth-limiting links, and half the data again travels across this set of links.
  Add in the serial compute time per CPU.

81 BlueGene/L System Architecture
One chip is a dual-core PowerPC 440. There are two chips per compute card (4 cores), 16 compute cards per node card (64 cores), and 32 node cards per rack (2048 cores). We used up to 16 racks.

82 Full DGEMM Code

83 Full Cholesky Code

84 Full FT Code

85 Broadcast Performance Model

86 Broadcast Performance Model Plot
