Performance without Pain = Productivity: Data Layout and Collective Communication in UPC
1 Performance without Pain = Productivity: Data Layout and Collective Communication in UPC. By Rajesh Nishtala (UC Berkeley), George Almási (IBM Watson Research Center), Călin Caşcaval (IBM Watson Research Center)
2 Observations and Experiences
As the number of processors continues to grow at a rapid pace, application scalability takes center stage. Three important considerations for writing optimized, scalable parallel code:
1. How do you efficiently and optimally distribute the data across the processors?
2. How do you write a system that leverages existing serial libraries?
3. What are the simplest communication mechanisms you need to coordinate the processors?
3 Partitioned Global Address Space (PGAS) Languages
A programming model suitable for both shared- and distributed-memory systems. The language presents a logically shared memory: any thread may directly read/write data located on a remote processor. The address space is partitioned so that each processor has affinity to a memory region, and accesses to local memory are potentially much faster.
(Figure: a shared address space on top of per-thread private address spaces across P0–P3)
4 Thread- vs. Data-Centric Communication
Example: send P0's version of A to P1.
MPI Code (Thread-Centric):
double A;
MPI_Status stat;
if (myrank == 0) {
    A = 42.0;
    MPI_Send(&A, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
} else if (myrank == 1)
    MPI_Recv(&A, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &stat);
UPC Code (Data-Centric):
shared [1] double A[4];
if (MYTHREAD == upc_threadof(&A[0])) {
    A[0] = 42.0;
    upc_memput(&A[1], (double *)&A[0], sizeof(double)); /* source block has local affinity */
}
(Figure: one element of A on each of P0–P3)
10 Data Layout Extensions to UPC
Proposed extensions:
1. Allow support for true shared multi-dimensional arrays
2. Allow creation of a processor topology with these arrays
Ensure that the underlying data layout is suitable for use with existing optimized serial libraries. A checkerboard layout is critical for load-balancing dense factorization methods.
#pragma processors MyDist(2,2)
shared [2][2] (MyDist) double A[16][16];
16 Algorithm Example: C = A*B (DGEMM)
Use one-sided communication to handle the data transfers. The initial implementation used uncoordinated gets.
Example: calculate C[0:1][2:3], owned by processor 1:
- Need data owned by 0 and 1 in the first round
- Need data owned by 1 and 3 in the second round
This approach is non-scalable due to O(P*M*N) uncoordinated gets. Can we do better?
(Figure: M×P matrix A times P×N matrix B gives M×N matrix C)
19 Collective Communication
An operation called by all processes together to perform globally coordinated communication. It may involve a modest amount of computation, e.g. to combine values as they are reduced. Collectives can be extended to teams (or communicators), in which case they operate on a subset of the processors. The abstraction passes responsibility for tuning the communication schedule to the runtime system.
22 Team Construction in MPI
MPI requires explicit team (communicator) objects. They are heavyweight objects that must be constructed outside the critical path, and they require the user to explicitly specify the processor ranks involved in each team. This is a logical extension to MPI's process-centric programming model.
25 Team Construction in UPC
UPC enables distributed arrays and presents a more data-centric programming model. Can we create a novel collective interface to go along with this programming model? Can we make it scale?
Our approach: have the user specify the blocks of the shared data to operate on, and let the runtime figure out the mapping of data to processors and dynamically construct the teams.
26 Example 1: Broadcast
Representative of one-to-many. Broadcast one element to every other row and every other column, using Matlab-style triplet notation in the interface:
shared [2][2] double dst[n][n];
double even, odd;
upc_stride_broadcast(dst<0:2:n-1, 0:2:n-1>, &even, sizeof(double));
upc_stride_broadcast(dst<1:2:n-1, 1:2:n-1>, &odd, sizeof(double));
The collective arguments are the same regardless of the number of processors.
29 Example 2: Exchange
Representative of many-to-many. The data-centric approach enables multi-dimensional operations. Example: exchange elements from a particular column into a particular row (both owned by P0):
#pragma processors mydist(2,2)
shared [2][2] (mydist) double A[16][16];
upc_stride_exchange(A<0,:>, A<:,0>, sizeof(double));
35 Optimizing the Collectives
The tree topology used for processor communication is critical for performance. The overhead of injecting messages onto the network cannot be parallelized; sending through intermediary nodes alleviates this serial bottleneck and distributes the work across the machine.
(Figure: binary, binomial, and fork tree topologies over nodes 0–7, with the best tree for BlueGene/L highlighted)
36 Back to DGEMM
The implementation with uncoordinated gets did not scale. Change the implementation to use scalable collectives: use simultaneous team broadcasts, where the first round goes across processor rows and the second round goes down processor columns. See the paper for the full code.
(Figure: broadcasts of A and B blocks feeding the C update)
44 DGEMM Performance
We linearized pairs of the 4-D torus dimensions to get a 2-D processor grid, and use ESSL for the serial computation. b² doubles are broadcast for every 2b³ flops. The experiments scale the problem size with the processor count. Result: 28.8 TFlop/s on 16,384 processors with linear scaling.
(Plot: GFlop/s vs. processor count; theoretical peak (no communication) vs. measured performance)
45 Dense Cholesky Factorization
Factor A into UᵀU. Relies on team broadcast. Recursive algorithm:
- Factor the upper-left corner of A (A_UL)
- Update A_UR using a triangular solve
- Take the outer product of A_URᵀ and A_UR to update A_LR
Full code in the paper; ESSL is used for local computation.
(Figure: A partitioned into A_UL, A_UR, A_LL, A_LR)
46 Cholesky Factorization Performance
The processor layout is identical to DGEMM, but the broadcast is no longer strictly along rows or columns of the processor grid, and a rectangular processor grid over a square matrix leads to load-balance issues. Result: 8.6 TFlop/s on 8,192 processors with linear scaling.
(Plot: GFlop/s vs. processor count; theoretical peak (no communication) vs. measured performance)
54 3D-FFT
Perform a 3D FFT across a large rectangular prism. Each processor owns a row of squares (pencils); an FFT must be performed in each of the 3 dimensions, and with a 2-D processor layout a team exchange is needed for the other two dimensions. Algorithm:
1. Perform the FFT across the rows
2. Do an exchange within each plane
3. Perform the FFT across the columns
4. Do an exchange across planes
5. Perform the FFT across the last dimension
(Figure: a 4×4×4 grid of blocks labeled A–D, showing the pencils P0 owns before and after each exchange)
55 3D-FFT Performance
Every processor exchanges all of its data in each round, so network bandwidth limits performance. We created a performance model that exposes the bisection bandwidth limits. Result: 2.1 TFlop/s on 16,384 processors.
(Plot: GFlop/s vs. processor count; performance model vs. measured performance)
56 Summary of Results

                        Lines of Code | Processors | Max Performance | % of Serial ESSL | % of Machine Peak
DGEMM                        14       |    16k     |  28.8 TFlop/s   |       84%        |        63%
Cholesky Factorization       28       |     8k     |   8.6 TFlop/s   |       56%        |        38%
3D-FFT                       18       |    16k     |   2.1 TFlop/s   |      97% of analytic model
57 Conclusions
We added support for multi-dimensional shared arrays in UPC and created a novel data-centric collective communication interface. We implemented 3 important algorithms with the data-centric approach to showcase both brevity of code and performance scalability. We need to rethink programming models to ensure application scalability as the number of processors continues its rapid growth.
58 Questions?
59 Backup Slides
65 UPC Distributed Arrays
UPC (a popular PGAS language) allows distributed arrays to be declared in a concise way; any element can be accessed through C-like array operators. Examples (4 threads):
shared []  double A[THREADS];    /* indefinite layout: all of A has affinity to thread 0 */
shared [1] double B[THREADS*2];  /* cyclic: T0 gets B[0],B[4]; T1 gets B[1],B[5]; ... */
shared [2] double C[THREADS*4];  /* block-cyclic: T0 gets C[0],C[1],C[8],C[9]; T1 gets C[2],C[3],C[10],C[11]; ... */
shared [4] double D[THREADS*4];  /* blocked: T0 gets D[0..3]; T1 gets D[4..7]; T2 gets D[8..11]; T3 gets D[12..15] */
69 Algorithmic Example: DGEMM
Declare the arrays:
shared [b][b] double A[M][P];
shared [b][b] double B[P][N];
shared [b][b] double C[M][N];
Run the update:
for (k = 0; k < P; k += b)
  for (i = 0; i < M; i += b) {
    get A[i][k] block into tempA;
    upc_forall (j = 0; j < N; j += b; &C[i][j]) {
      get B[k][j] block into tempB;
      C[i][j] += tempA * tempB;   /* b×b block multiply */
    }
  }
The internal memory layout lets us directly call BLAS3 DGEMM.
(Figure: M×P matrix A times P×N matrix B gives M×N matrix C)
70 Motivation for a Checkerboard Layout
Useful for dense factorization algorithms, e.g. dense LU and dense Cholesky. These methods recursively factor the matrix, so without a checkerboard layout the bottom corner of the matrix ends up heavily loaded onto one processor. The checkerboard layout ensures good load balance.
77 Scaling Applications for the Future
Future machines are likely to achieve increased performance by adding more processing elements. As the number of processors continues to grow at a rapid pace, application scalability becomes a bigger concern. How can we design a programming model, and the associated runtime systems, to help alleviate these concerns?
78 Collective Model
Every processor with affinity to an active block of data must make a call into the collective; this allows the implementation to build scalable communication schedules.
Alternative model considered (Seidel et al.): have exactly one process make a call into the collective. This requires the user to handle their own synchronization or wait until the next barrier phase, and requires the implementation to dedicate a thread to building scalable communication schedules amongst processors.
79 Machine Layouts

Cores | UPC Array Mapping/Dim | Distribution Matrix | FT Mapping/Dim | Distribution Matrix
      |        XT, YZ         |       8 x 16        |     YZ, X      |       16 x 4
 1k   |        XT, YZ         |       16 x 64       |     YZ, X      |       64 x 8
 2k   |        XT, YZ         |       16 x 128      |     YZ, X      |       128 x 8
 4k   |        XT, YZ         |       16 x 256      |     YZ, X      |       256 x 8
 8k   |        XT, YZ         |       16 x 512      |     YZ, X      |       512 x 8
80 FFT Performance Model
The model computes the bandwidth limits for each exchange:
- The first exchange is always done on a linear set of nodes (the X dimension): half the data from each node is transferred across that one limiting link.
- The second exchange is done across a plane (the Y, Z dimensions): MIN(Y,Z) gives the number of bandwidth-limiting links, and half the data again travels across this set of links.
- Add in the serial compute time per CPU.
81 BlueGene/L System Architecture (IBM Research)
One chip is a dual-core PowerPC 440. Two chips per compute card (4 cores); 16 compute cards per node card (64 cores); 32 node cards per rack (2048 cores). We used up to 16 racks.
82 Full DGEMM Code
83 Full Cholesky Code
84 Full FT Code
85 Broadcast Performance Model
86 Broadcast Performance Model Plot
More informationTutorial OmpSs: Overlapping communication and computation
www.bsc.es Tutorial OmpSs: Overlapping communication and computation PATC course Parallel Programming Workshop Rosa M Badia, Xavier Martorell PATC 2013, 18 October 2013 Tutorial OmpSs Agenda 10:00 11:00
More informationOptimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides
Optimizing Cache Performance in Matrix Multiplication UCSB CS240A, 2017 Modified from Demmel/Yelick s slides 1 Case Study with Matrix Multiplication An important kernel in many problems Optimization ideas
More informationAccelerating HPL on Heterogeneous GPU Clusters
Accelerating HPL on Heterogeneous GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationScaling Communication-Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap
Scaling Communication-Intensive Applications on BlueGene/P Using One-Sided Communication and Overlap Rajesh Nishtala, Paul H. Hargrove, Dan O. Bonachea and Katherine A. Yelick Computer Science Division,
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2013 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2013 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationParallel Computing. Hwansoo Han (SKKU)
Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo
More informationParallel Languages: Past, Present and Future
Parallel Languages: Past, Present and Future Katherine Yelick U.C. Berkeley and Lawrence Berkeley National Lab 1 Kathy Yelick Internal Outline Two components: control and data (communication/sharing) One
More informationCommunication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.
Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More informationLecture 32: Partitioned Global Address Space (PGAS) programming models
COMP 322: Fundamentals of Parallel Programming Lecture 32: Partitioned Global Address Space (PGAS) programming models Zoran Budimlić and Mack Joyner {zoran, mjoyner}@rice.edu http://comp322.rice.edu COMP
More informationCray Scientific Libraries. Overview
Cray Scientific Libraries Overview What are libraries for? Building blocks for writing scientific applications Historically allowed the first forms of code re-use Later became ways of running optimized
More informationBindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core
Tuning on a single core 1 From models to practice In lecture 2, we discussed features such as instruction-level parallelism and cache hierarchies that we need to understand in order to have a reasonable
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationOpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.
OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 15, 2010 José Monteiro (DEI / IST) Parallel and Distributed Computing
More informationScheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok
Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation
More informationMatrix Multiplication
Matrix Multiplication CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Matrix Multiplication Spring 2018 1 / 32 Outline 1 Matrix operations Importance Dense and sparse
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationAutomatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P
Automatic Generation of the HPC Challenge's Global FFT Benchmark for BlueGene/P Franz Franchetti 1, Yevgen Voronenko 2, Gheorghe Almasi 3 1 University and SpiralGen, Inc. 2 AccuRay, Inc., 3 IBM Research
More informationScaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc
Scaling to Petaflop Ola Torudbakken Distinguished Engineer Sun Microsystems, Inc HPC Market growth is strong CAGR increased from 9.2% (2006) to 15.5% (2007) Market in 2007 doubled from 2003 (Source: IDC
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationA Characterization of Shared Data Access Patterns in UPC Programs
IBM T.J. Watson Research Center A Characterization of Shared Data Access Patterns in UPC Programs Christopher Barton, Calin Cascaval, Jose Nelson Amaral LCPC `06 November 2, 2006 Outline Motivation Overview
More informationCOMPUTATIONAL LINEAR ALGEBRA
COMPUTATIONAL LINEAR ALGEBRA Matrix Vector Multiplication Matrix matrix Multiplication Slides from UCSD and USB Directed Acyclic Graph Approach Jack Dongarra A new approach using Strassen`s algorithm Jim
More informationNew Programming Paradigms: Partitioned Global Address Space Languages
Raul E. Silvera -- IBM Canada Lab rauls@ca.ibm.com ECMWF Briefing - April 2010 New Programming Paradigms: Partitioned Global Address Space Languages 2009 IBM Corporation Outline Overview of the PGAS programming
More informationMapping MPI+X Applications to Multi-GPU Architectures
Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under
More informationX10 and APGAS at Petascale
X10 and APGAS at Petascale Olivier Tardieu 1, Benjamin Herta 1, David Cunningham 2, David Grove 1, Prabhanjan Kambadur 1, Vijay Saraswat 1, Avraham Shinnar 1, Mikio Takeuchi 3, Mandana Vaziri 1 1 IBM T.J.
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationLast Time. Intro to Parallel Algorithms. Parallel Search Parallel Sorting. Merge sort Sample sort
Intro to MPI Last Time Intro to Parallel Algorithms Parallel Search Parallel Sorting Merge sort Sample sort Today Network Topology Communication Primitives Message Passing Interface (MPI) Randomized Algorithms
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationA Uniform Programming Model for Petascale Computing
A Uniform Programming Model for Petascale Computing Barbara Chapman University of Houston WPSE 2009, Tsukuba March 25, 2009 High Performance Computing and Tools Group http://www.cs.uh.edu/~hpctools Agenda
More informationTiled Matrix Multiplication
Tiled Matrix Multiplication Basic Matrix Multiplication Kernel global void MatrixMulKernel(int m, m, int n, n, int k, k, float* A, A, float* B, B, float* C) C) { int Row = blockidx.y*blockdim.y+threadidx.y;
More informationa. Assuming a perfect balance of FMUL and FADD instructions and no pipeline stalls, what would be the FLOPS rate of the FPU?
CPS 540 Fall 204 Shirley Moore, Instructor Test November 9, 204 Answers Please show all your work.. Draw a sketch of the extended von Neumann architecture for a 4-core multicore processor with three levels
More informationLecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC
Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationA Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection
A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF
More informationNumerical Algorithms
Chapter 10 Slide 464 Numerical Algorithms Slide 465 Numerical Algorithms In textbook do: Matrix multiplication Solving a system of linear equations Slide 466 Matrices A Review An n m matrix Column a 0,0
More informationLarge Scale Multiprocessors and Scientific Applications. By Pushkar Ratnalikar Namrata Lele
Large Scale Multiprocessors and Scientific Applications By Pushkar Ratnalikar Namrata Lele Agenda Introduction Interprocessor Communication Characteristics of Scientific Applications Synchronization: Scaling
More informationPrinciple Of Parallel Algorithm Design (cont.) Alexandre David B2-206
Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction
More informationLecture 16. Parallel Matrix Multiplication
Lecture 16 Parallel Matrix Multiplication Assignment #5 Announcements Message passing on Triton GPU programming on Lincoln Calendar No class on Tuesday/Thursday Nov 16th/18 th TA Evaluation, Professor
More informationOvercoming the Barriers to Sustained Petaflop Performance. William D. Gropp Mathematics and Computer Science
Overcoming the Barriers to Sustained Petaflop Performance William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp But First Are we too CPU-centric? What about I/O? What do applications
More informationParallel 3D Sweep Kernel with PaRSEC
Parallel 3D Sweep Kernel with PaRSEC Salli Moustafa Mathieu Faverge Laurent Plagne Pierre Ramet 1 st International Workshop on HPC-CFD in Energy/Transport Domains August 22, 2014 Overview 1. Cartesian
More informationParallelism paradigms
Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization
More informationLecture 20: Distributed Memory Parallelism. William Gropp
Lecture 20: Distributed Parallelism William Gropp www.cs.illinois.edu/~wgropp A Very Short, Very Introductory Introduction We start with a short introduction to parallel computing from scratch in order
More informationLecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1
Lecture 14: Mixed MPI-OpenMP programming Lecture 14: Mixed MPI-OpenMP programming p. 1 Overview Motivations for mixed MPI-OpenMP programming Advantages and disadvantages The example of the Jacobi method
More informationC PGAS XcalableMP(XMP) Unified Parallel
PGAS XcalableMP Unified Parallel C 1 2 1, 2 1, 2, 3 C PGAS XcalableMP(XMP) Unified Parallel C(UPC) XMP UPC XMP UPC 1 Berkeley UPC GASNet 1. MPI MPI 1 Center for Computational Sciences, University of Tsukuba
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationPGAS Languages (Par//oned Global Address Space) Marc Snir
PGAS Languages (Par//oned Global Address Space) Marc Snir Goal Global address space is more convenient to users: OpenMP programs are simpler than MPI programs Languages such as OpenMP do not provide mechanisms
More informationCS 426. Building and Running a Parallel Application
CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations
More informationParallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008
Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared
More informationNetwork-on-chip (NOC) Topologies
Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance
More informationCS Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra
CS 294-73 Software Engineering for Scientific Computing Lecture 10:Dense Linear Algebra Slides from James Demmel and Kathy Yelick 1 Outline What is Dense Linear Algebra? Where does the time go in an algorithm?
More informationDesigning Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters
Designing Optimized MPI Broadcast and Allreduce for Many Integrated Core (MIC) InfiniBand Clusters K. Kandalla, A. Venkatesh, K. Hamidouche, S. Potluri, D. Bureddy and D. K. Panda Presented by Dr. Xiaoyi
More informationIntroduction to OpenMP
Introduction to OpenMP Lecture 9: Performance tuning Sources of overhead There are 6 main causes of poor performance in shared memory parallel programs: sequential code communication load imbalance synchronisation
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationPiz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design
Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design Sadaf Alam & Thomas Schulthess CSCS & ETHzürich CUG 2014 * Timelines & releases are not precise Top 500
More informationX10 for Productivity and Performance at Scale
212 HPC Challenge Class 2 X1 for Productivity and Performance at Scale Olivier Tardieu, David Grove, Bard Bloom, David Cunningham, Benjamin Herta, Prabhanjan Kambadur, Vijay Saraswat, Avraham Shinnar,
More informationChapter 4: Multithreaded Programming
Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading
More informationAccelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University
Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear
More informationChallenges and Advances in Parallel Sparse Matrix-Matrix Multiplication
Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,
More informationLecture 13. Writing parallel programs with MPI Matrix Multiplication Basic Collectives Managing communicators
Lecture 13 Writing parallel programs with MPI Matrix Multiplication Basic Collectives Managing communicators Announcements Extra lecture Friday 4p to 5.20p, room 2154 A4 posted u Cannon s matrix multiplication
More informationParallel Architectures
Parallel Architectures CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Parallel Architectures Spring 2018 1 / 36 Outline 1 Parallel Computer Classification Flynn s
More informationTopologies. Maurizio Palesi. Maurizio Palesi 1
Topologies Maurizio Palesi Maurizio Palesi 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and
More informationEfficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory
Institute of Computational Science Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Juraj Kardoš (University of Lugano) July 9, 2014 Juraj Kardoš Efficient GPU data transfers July 9, 2014
More informationAssignment 3 MPI Tutorial Compiling and Executing MPI programs
Assignment 3 MPI Tutorial Compiling and Executing MPI programs B. Wilkinson: Modification date: February 11, 2016. This assignment is a tutorial to learn how to execute MPI programs and explore their characteristics.
More informationSolving Dense Linear Systems on Platforms with Multiple Hardware Accelerators
Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de
More information