High-Performance Parallel Computing
P. (Saday) Sadayappan, Rupesh Nasre
Course Overview
- Emphasis on algorithm development and programming issues for high performance
- No background in computer architecture is assumed; knowledge of C is assumed
- Grading: 60% Programming Assignments (4 x 15%), 40% Final Exam (July 4)
- Accounts will be provided on the IIT-M system
Course Topics
Architectures:
- Single processor core
- Multi-core and SMP (Symmetric Multi-Processor) systems
- GPUs (Graphics Processing Units)
- Short-vector SIMD instruction set architectures
Programming Models/Techniques:
- Single processor performance issues: caches and data locality; data dependence and fine-grained parallelism; vectorization (SSE, AVX)
- Parallel programming models: shared-memory parallel programming (OpenMP); GPU programming (CUDA)
Class Meeting Schedule
Lecture from 9am-2pm each day, with a mid-class break. Optional lab session from 2pm each day.

Programming Assignments
Topic           Due Date   Weightage
Data Locality   June 24    15%
OpenMP          June 27    15%
CUDA            June 30    15%
Vectorization   July 2     15%
The Good Old Days for Software (Source: J. Birnbaum)
Single-processor performance experienced dramatic improvements from clock-speed increases and architectural improvement (pipelining, instruction-level parallelism). Applications experienced automatic performance improvement.
Hitting the Power Wall
[Figure: power density (Watts/cm^2, log scale) versus process generation (1.5µm down to 0.07µm) for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface.]
Power doubles every 4 years; 5-year projection: 200W total, 25 W/cm^2!
P=VI: 75W @ 1.5V = 50 A!
Source: Fred Pollack, Intel Corp., "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Micro32 conference keynote, 1999. Courtesy Avi
Hitting the Power Wall
[Figure: CPU clock frequency over time, climbing "toward a brighter tomorrow". Source: http://img.tomshardware.com/us/2005//2/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif]
Hitting the Power Wall
[Figure: the same CPU frequency chart as on the previous slide.]
2004: Intel cancels Tejas and Jayhawk due to "heat problems due to the extreme power consumption of the core..."
The Only Option: Use Many Cores
Chip density continues to increase ~2x every 2 years:
- Clock speed is not increasing
- Number of processor cores may double instead
There is little or no hidden parallelism (ILP) left to be found. Parallelism must be exposed to and managed by software.
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Can You Predict Performance?
E1 takes about 0.4s to execute on this laptop. Performance in GFLOPs: billions (Giga) of FLoating point Operations per Second.
About how long does E2 take?
1. [0-0.4s]   2. [0.4s-0.6s]   3. [0.6s-0.8s]   4. More than 0.8s

double W,X,Y,Z;
for(j=0;j<100000000;j++){
  W = 0.999999*X;
  X = 0.999999*W;
} // Example loop E1

for(j=0;j<100000000;j++){
  W = 0.999999*W + 0.000001;
  X = 0.999999*X + 0.000001;
  Y = 0.999999*Y + 0.000001;
  Z = 0.999999*Z + 0.000001;
} // Example loop E2

E2 takes 0.35s => 2.27 GFLOPs
ILP Affects Performance
ILP (Instruction-Level Parallelism): many operations in a sequential code could be executed concurrently if they do not have dependences.
- Pipelined stages in functional units can be exploited by ILP
- Multiple functional units in a CPU can be exploited by ILP
- ILP is automatically exploited by the system when possible
E2's statements are independent and provide ILP, but E1's statements are not independent, and do not provide ILP.

double W,X,Y,Z;
for(j=0;j<100000000;j++){
  W = 0.999999*X;
  X = 0.999999*W;
} // Example loop E1

for(j=0;j<100000000;j++){
  W = 0.999999*W + 0.000001;
  X = 0.999999*X + 0.000001;
  Y = 0.999999*Y + 0.000001;
  Z = 0.999999*Z + 0.000001;
} // Example loop E2
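To observe the ILP difference directly, the two loops can be timed in one program. A minimal sketch, assuming a POSIX system with clock_gettime; the iteration count and constants follow the slide, and the final printf keeps the loop results live so the compiler cannot eliminate them:

#include <stdio.h>
#include <time.h>

#define ITERS 100000000

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    double W = 1, X = 1, Y = 1, Z = 1;
    long j;

    double t0 = now();
    for (j = 0; j < ITERS; j++) {   /* E1: each statement depends on the previous one */
        W = 0.999999 * X;
        X = 0.999999 * W;
    }
    double t1 = now();
    for (j = 0; j < ITERS; j++) {   /* E2: four independent dependence chains expose ILP */
        W = 0.999999 * W + 0.000001;
        X = 0.999999 * X + 0.000001;
        Y = 0.999999 * Y + 0.000001;
        Z = 0.999999 * Z + 0.000001;
    }
    double t2 = now();

    /* E2 does 4x the work of E1 per iteration, yet should take comparable time */
    printf("E1: %.2fs  E2: %.2fs  (W+X+Y+Z = %g)\n", t1 - t0, t2 - t1, W + X + Y + Z);
    return 0;
}

Compile with optimization (e.g., gcc -O2) so the measurement reflects the hardware rather than unoptimized code.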
Example: Memory Access Cost

#define N 32
#define T 1024*1024
// #define N 4096
// #define T 64
double A[N][N];
for (it=0; it<T; it++)
  for (j=0; j<N; j++)
    for (i=0; i<N; i++)
      A[i][j] += 1;

Data movement overheads from main memory to cache can greatly exceed computational costs.
About how long will the code run for the 4Kx4K matrix?
1. [0-0.3s]   2. [0.3s-0.6s]   3. [0.6s-1.2s]   4. More than 1.2s

          32x32   4Kx4K
FLOPs     2^30    2^30
Time      0.33s   15.5s
GFLOPs    3.24    0.07
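For reference, a self-contained version of this experiment; a sketch, assuming a POSIX clock_gettime for timing. The array is static because 4096x4096 doubles is 128 MB, and the inner loop deliberately strides down columns, which is the cache-unfriendly pattern being measured:

#include <stdio.h>
#include <time.h>

#define N 4096   /* or N 32 with T 1024*1024: same 2^30 total FLOPs */
#define T 64

static double A[N][N];

int main(void) {
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    for (int it = 0; it < T; it++)
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)   /* consecutive accesses are N*8 bytes apart */
                A[i][j] += 1;
    clock_gettime(CLOCK_MONOTONIC, &e);
    double secs = (e.tv_sec - s.tv_sec) + 1e-9 * (e.tv_nsec - s.tv_nsec);
    printf("%.2fs, %.2f GFLOPs\n", secs, (double)T * N * N / secs / 1e9);
    return 0;
}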
Cache Memories
Adapted from slides by Prof. Randy Bryant (CMU CS 15-213) and Dr. Jeffrey Jones (OSU CSE 5441)
Topics:
- Generic cache memory organization
- Impact of caches on performance
Cache Memories
Cache memories are small, fast SRAM-based memories managed automatically in hardware; they hold frequently accessed blocks of main memory. The CPU looks first for data in L1, then in L2, then in main memory.
[Figure: typical bus structure. On the CPU chip: register file, ALU, and L1 cache; a cache bus connects to the off-chip L2 cache; the bus interface connects via the system bus to the I/O bridge, which connects via the memory bus to main memory.]
Inserting an L1 Cache Between the CPU and Main Memory
The transfer unit between the CPU register file and the cache is a 4-byte block. The transfer unit between the cache and main memory is a 4-word block (16 bytes).
[Figure: the tiny, very fast CPU register file has room for four 4-byte words; the small fast L1 cache has room for two 4-word blocks (line 0, line 1); the big slow main memory has room for many 4-word blocks (e.g., block 10 "a b c d", block 21 "p q r s", block 30 "w x y z").]
Direct-Mapped Cache
[Figure: a 64-byte memory, byte addresses 000000 through 111111 in binary, viewed as 8 blocks (Memory Block 0 through Memory Block 7) of 8 bytes each, alongside a cache of 4 lines (Line 0 through Line 3), each holding a valid bit, a tag, and a data block. Memory blocks map round-robin to cache lines: each memory block maps to a particular line.]
Direct-Mapped Cache
[Figure: RAM is byte-addressable; a system with 64-bit data paths transfers 8 bytes at a time (e.g., bytes AA AB AC AD AE AF B0 B1) from RAM into a cache line, and the appropriate bytes for the data type (e.g., AC AD) are transferred on to a register.]
Accessing Direct-Mapped Caches
Set selection: use the set (= line) index bits of the address to determine the set of interest.
[Figure: the m-bit address splits into t tag bits, s set-index bits, and b block-offset bits; the set index selects one of sets 0 through S-1, each holding a valid bit, a tag, and a cache block.]
Accessing Direct-Mapped Caches
Line matching and word selection:
- Line matching: find a valid line in the selected set with a matching tag
- Word selection: then extract the word
[Figure: for selected set i, (1) the valid bit must be set, and (2) the tag bits in the cache line must match the tag bits in the address; (3) if (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte (bytes 0-7 holding words w0-w3).]
Direct-Mapped Cache Simulation
t=1, s=2, b=1; M=16 byte addresses, B=2 bytes/block, S=4 sets, E=1 entry/set
Address trace (reads, binary in brackets): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
(1) 0 [0000] miss: set 0 loads tag 0, data M[0-1]
(2) 1 [0001] hit: set 0, tag 0
(3) 13 [1101] miss: set 2 loads tag 1, data M[12-13]
(4) 8 [1000] miss: set 0 replaces its block with tag 1, data M[8-9]
(5) 0 [0000] miss: set 0 replaces again with tag 0, data M[0-1]
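This trace can be replayed in software. A minimal direct-mapped cache simulator, as a sketch using the slide's parameters (B=2 bytes/block, S=4 sets, E=1 entry/set):

#include <stdio.h>

#define B 2   /* bytes per block (b = 1 offset bit) */
#define S 4   /* number of sets  (s = 2 index bits) */

int main(void) {
    int valid[S] = {0}, tag[S];
    int trace[] = {0, 1, 13, 8, 0};   /* address trace from the slide */
    int n = sizeof trace / sizeof trace[0];

    for (int k = 0; k < n; k++) {
        int addr = trace[k];
        int set = (addr / B) % S;     /* set index bits */
        int t   = addr / (B * S);     /* tag bits */
        if (valid[set] && tag[set] == t)
            printf("%2d hit\n", addr);
        else {
            printf("%2d miss\n", addr);
            valid[set] = 1;           /* fetch the block, possibly evicting */
            tag[set] = t;
        }
    }
    return 0;
}

Running it reproduces the slide's outcome: miss, hit, miss, miss, miss; the final miss is a conflict, since addresses 0 and 8 share set 0.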
General Structure of Cache Memory
A cache is an array of S = 2^s sets; each set contains E lines; each line holds a block of B = 2^b bytes of data, plus 1 valid bit and t tag bits.
[Figure: sets 0 through S-1, each with E lines of (valid, tag, bytes 0..B-1).]
Cache size: C = B x E x S data bytes
Addressing Caches
Address A (bits m-1 down to 0) splits into <tag> (t bits), <set index> (s bits), and <block offset> (b bits).
The word at address A is in the cache if the tag bits in one of the <valid> lines in set <set index> match <tag>. The word contents begin at offset <block offset> bytes from the beginning of the block.
Cache Mapping Examples
Address A: <tag> (t bits) | <set index> (s bits) | <block offset> (b bits)

Example   m    C     B    E
1         32   1024  4    1
2         32   1024  8    4
3         32   1024  32   32
4         64   2048  64   16
Cache Mapping Examples
Address A: <tag> (t bits) | <set index> (s bits) | <block offset> (b bits)

Example   m    C     B    E    S     t    s   b
1         32   1024  4    1    256   22   8   2
2         32   1024  8    4    32    24   5   3
3         32   1024  32   32   1     27   0   5
4         64   2048  64   16   2     57   1   6
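The tag/set/offset arithmetic translates directly into shifts and masks. A sketch for the table's first example (m=32, C=1024, B=4, E=1, hence s=8, b=2, t=22); the example address 0x1234 is an arbitrary illustration:

#include <stdio.h>

/* log2 for powers of two */
static unsigned lg(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    unsigned m = 32, C = 1024, B = 4, E = 1;   /* Example 1 parameters */
    unsigned S = C / (B * E);   /* number of sets */
    unsigned b = lg(B);         /* block-offset bits */
    unsigned s = lg(S);         /* set-index bits */
    unsigned t = m - s - b;     /* remaining bits are the tag */

    unsigned addr = 0x1234;
    printf("t=%u s=%u b=%u\n", t, s, b);
    printf("addr 0x%x -> tag 0x%x, set %u, offset %u\n",
           addr, addr >> (b + s), (addr >> b) & (S - 1), addr & (B - 1));
    return 0;
}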
Cache Performance Metrics
Miss Rate
- Fraction of memory references not found in cache (misses/references)
- Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
Hit Time
- Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
- Typical numbers: 1 clock cycle for L1; 3-8 clock cycles for L2
Miss Penalty
- Additional time required because of a miss: typically 25-100 cycles for main memory
Overall Cache Performance
AMAT (Avg. Mem. Access Time) = t_hit + prob_miss * penalty_miss
Three levers: reduce hit time, reduce miss penalty, reduce miss rate.
Cache Performance: One Level
AMAT = t_hit + prob_miss * penalty_miss
L1$ hits in 1 cycle, with 80% hit rate; main memory hits in 1000 cycles:
AMAT = 1 + (1-.8)(1000) = 201
L1$ hits in 1 cycle, with 85% hit rate; main memory hits in 1000 cycles:
AMAT = 1 + (1-.85)(1000) = 151
Cache Performance: Multi-Level
AMAT_i = t_hit_i + prob_miss_i * penalty_miss_i
L1$ hits in 1 cycle, with 50% hit rate; L2$ hits in 10 cycles, with 75% hit rate; L3$ hits in 100 cycles, with 90% hit rate; main memory hits in 1000 cycles:
AMAT1 = 1 + (1-.5)(AMAT2)
      = 1 + (1-.5)(10 + (1-.75)(AMAT3))
      = 1 + (1-.5)(10 + (1-.75)(100 + (1-.9)(AMATm)))
      = 1 + (1-.5)(10 + (1-.75)(100 + (1-.9)(1000)))
      = 31
Cache Performance: Multi-Level
AMAT_i = t_hit_i + prob_miss_i * penalty_miss_i
L1$ hits in 1 cycle, with 25% hit rate; L2$ hits in 6 cycles, with 80% hit rate; L3$ hits in 60 cycles, with 95% hit rate; main memory hits in 1000 cycles:
AMAT1 = 1 + (1-.25)(AMAT2)
      = 1 + (1-.25)(6 + (1-.8)(AMAT3))
      = 1 + (1-.25)(6 + (1-.8)(60 + (1-.95)(AMATm)))
      = 1 + (1-.25)(6 + (1-.8)(60 + (1-.95)(1000)))
      = 22
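The nested AMAT formula evaluates naturally by recursion, since each level's miss penalty is the AMAT of the next level. A sketch using this slide's parameters:

#include <stdio.h>

/* AMAT_i = t_hit_i + prob_miss_i * AMAT_(i+1); the last level always hits */
static double amat(const double hit_time[], const double hit_rate[], int levels) {
    if (levels == 1)
        return hit_time[0];
    return hit_time[0] + (1.0 - hit_rate[0]) * amat(hit_time + 1, hit_rate + 1, levels - 1);
}

int main(void) {
    double hit_time[] = {1, 6, 60, 1000};        /* L1, L2, L3, main memory (cycles) */
    double hit_rate[] = {0.25, 0.80, 0.95, 1.0}; /* last entry unused: memory always hits */
    printf("AMAT = %.1f cycles\n", amat(hit_time, hit_rate, 4));
    return 0;
}

This prints 22.0, matching the hand calculation above; swapping in {1, 10, 100, 1000} and {0.5, 0.75, 0.9, 1.0} reproduces the previous slide's 31.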
Layout of C Arrays in Memory
C arrays are allocated in row-major order: each row is in contiguous memory locations.
Layout of C Arrays in Memory
C arrays are allocated in row-major order: each row is in contiguous memory locations.
Stepping through columns in one row:
  for (i = 0; i < N; i++) sum += a[0][i];
- accesses successive elements
- if block size (B) > 4 bytes, exploits spatial locality: compulsory miss rate = 4 bytes / B
Stepping through rows in one column:
  for (i = 0; i < N; i++) sum += a[i][0];
- accesses distant elements: no spatial locality!
- compulsory miss rate = 1 (i.e., 100%)
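Row-major order means element (i,j) lives i*N + j elements past the start of the array. A small sketch verifying the address arithmetic; the array name and dimensions here are illustrative:

#include <stdio.h>

#define N 4

int main(void) {
    int a[N][N];
    int i = 2, j = 3;
    /* in row-major order, &a[i][j] is i*N + j elements past &a[0][0] */
    printf("&a[%d][%d]            = %p\n", i, j, (void *)&a[i][j]);
    printf("&a[0][0] + (%d*N + %d) = %p\n", i, j, (void *)(&a[0][0] + i * N + j));
    return 0;
}

The two printed addresses are identical, which is why a[i][0], a[i+1][0], ... are N*sizeof(int) bytes apart and defeat spatial locality.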
Writing Cache Friendly Code
Repeated references to variables are good (temporal locality). Stride-1 reference patterns are good (spatial locality).
Example: summing all elements of a 2D array (cold cache, 4-byte words, 4-word cache blocks):

int sumarrayrows(int a[N][N]) {
    int i, j, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Miss rate = 1/4 = 25%

int sumarraycols(int a[N][N]) {
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
Miss rate = 100%
Matrix Multiplication Example
Major cache effects to consider:
- Total cache size: exploit temporal locality and keep the working set small (e.g., by using blocking)
- Block size: exploit spatial locality
Description:
- Multiply N x N matrices: O(N^3) total operations
- Accesses: N reads per source element; N values summed per destination (but may be able to hold the sum in a register)

/* ijk */
for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    sum = 0.0;
    for (k=0; k<N; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
Variable sum held in register
Miss Rate Analysis for Matrix Multiply
Assume:
- Line size = 32B (big enough for 4 64-bit words)
- Matrix dimension (N) is very large: approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows
Analysis method: look at the access pattern of the inner loop.
[Figure: A is indexed by (i,k), B by (k,j), C by (i,j).]
Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    sum = 0.0;
    for (k=0; k<N; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A accessed row-wise at (i,*); B accessed column-wise at (*,j); C fixed at (i,j)
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
Matrix Multiplication (jik)

/* jik */
for (j=0; j<N; j++) {
  for (i=0; i<N; i++) {
    sum = 0.0;
    for (k=0; k<N; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A accessed row-wise at (i,*); B accessed column-wise at (*,j); C fixed at (i,j)
Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
Matrix Multiplication (kij)

/* kij */
for (k=0; k<N; k++) {
  for (i=0; i<N; i++) {
    r = a[i][k];
    for (j=0; j<N; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A fixed at (i,k); B accessed row-wise at (k,*); C accessed row-wise at (i,*)
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<N; i++) {
  for (k=0; k<N; k++) {
    r = a[i][k];
    for (j=0; j<N; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A fixed at (i,k); B accessed row-wise at (k,*); C accessed row-wise at (i,*)
Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (jki)

/* jki */
for (j=0; j<N; j++) {
  for (k=0; k<N; k++) {
    r = b[k][j];
    for (i=0; i<N; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A accessed column-wise at (*,k); B fixed at (k,j); C accessed column-wise at (*,j)
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
Matrix Multiplication (kji)

/* kji */
for (k=0; k<N; k++) {
  for (j=0; j<N; j++) {
    r = b[k][j];
    for (i=0; i<N; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A accessed column-wise at (*,k); B fixed at (k,j); C accessed column-wise at (*,j)
Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
Summary of Matrix Multiplication

ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25
for (i=0; i<N; i++) {
  for (j=0; j<N; j++) {
    sum = 0.0;
    for (k=0; k<N; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

kij (& ikj): 2 loads, 1 store; misses/iter = 0.5
for (k=0; k<N; k++) {
  for (i=0; i<N; i++) {
    r = a[i][k];
    for (j=0; j<N; j++)
      c[i][j] += r * b[k][j];
  }
}

jki (& kji): 2 loads, 1 store; misses/iter = 2.0
for (j=0; j<N; j++) {
  for (k=0; k<N; k++) {
    r = b[k][j];
    for (i=0; i<N; i++)
      c[i][j] += a[i][k] * r;
  }
}
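Beyond choosing a good loop order, the total-cache-size lever mentioned earlier (blocking) reuses each tile of the matrices many times while it is resident in cache. A minimal sketch of a tiled ijk multiply; N and the tile size BS are illustrative assumptions, and BS should be tuned so three BS x BS tiles fit in cache:

#include <stdio.h>

#define N  512
#define BS 32   /* 3 tiles of 32x32 doubles = 24 KB, fits in a typical L1 */

static double a[N][N], b[N][N], c[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 1.0; }

    for (int ii = 0; ii < N; ii += BS)
      for (int jj = 0; jj < N; jj += BS)
        for (int kk = 0; kk < N; kk += BS)
          /* accumulate the product of the (ii,kk) tile of a and the (kk,jj) tile of b */
          for (int i = ii; i < ii + BS; i++)
            for (int j = jj; j < jj + BS; j++) {
              double sum = c[i][j];
              for (int k = kk; k < kk + BS; k++)
                sum += a[i][k] * b[k][j];
              c[i][j] = sum;
            }

    printf("c[0][0] = %f (expect %d)\n", c[0][0], N);
    return 0;
}

N must be a multiple of BS in this sketch; a production version would handle ragged edges and pick BS from the measured cache sizes.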