Advanced Parallel Programming I
1 Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH, Johannes Kepler University Linz
2 Levels of Parallelism
3 Motivation for Parallelism - History
4 Motivation for Parallelism CPU Scaling. Moore's Law: doubling of transistor density every 2 years. Dennard Scaling made Moore's Law work: transistors became smaller and faster at constant power density.
5 Motivation for Parallelism Limits of serial execution. End of classical Dennard Scaling: 1. No tox (gate oxide) scaling 2. No voltage scaling 3. No delay time scaling 4. Leakage. P = C_eff * V_DD^2 * f + I_leak * V_DD. Power wall because of critical heat dissipation. Frequency wall because of gate delay time.
6 Level of Parallelism Code Granularity. Large grain (task level - TLP): tasks (Task i-1, Task i, Task i+1) communicating via messages. Medium grain (control level - CLP): functions func1(), func2(), func3(). Fine grain (data level - DLP): data elements a[0]=, b[0]=, a[1]=, b[1]=, ... Very fine grain (instruction level - ILP): single operations (+, *, /).
7 Level of Parallelism Programmer's control. Task level: programmer (MPI, PVM, ...). Control level: programmer (OpenMP, Pthreads, ...). Data level: programmer/automatic (vector intrinsics/compiler). Instruction level: automatic (CPU).
8 Level of Parallelism - Milestones
9 Instruction Level Parallelism. Also known as multiple issue. Implemented in hardware via a superscalar CPU (parallel pipelines). [Diagram: instructions over time in cycles, overlapping IF ID EX WB stages in parallel pipelines - fully utilized pipeline]
10 Instruction Level Parallelism Out of Order Execution. Input: serial instruction stream; identification and execution of instructions that can run in parallel. Pipeline hazards: structural hazards (resource conflicts), data hazards (data dependencies), control hazards (conditional and unconditional jumps - branch prediction). ILP wall: any further parallelization effort yields only a small increase in resource utilization.
11 Instruction Level Parallelism Sandy Bridge Core Pipeline. [Diagram: in-order front end (instruction fetch, 32 kB L1 instruction cache, pre-decode, instruction queue, branch prediction, 4 decoders, uop cache; 4 uops/cycle issued via rename/allocate/retire) feeding an out-of-order back end (reorder buffer, load/store buffers, reservation station dispatching 6 uops/cycle to ports 0-5: ALU, vector/SSE/AVX multiply and add, divide, shuffle, blend, load and store units; memory control at 48 bytes/cycle to the 32 kB L1 data cache, backed by a 256 kB L2 cache)]
12 Data Level Parallelism. The same instruction is executed simultaneously on N > 1 elements of a vector. [Diagram: scalar processing adds one pair of elements (S1 + S2 = S3); vector processing adds four pairs (V1 + V2 = V3) in one operation]
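As a sketch of data level parallelism (not from the slides): a vectorizing compiler can map the loop below onto SIMD instructions, performing one addition on several elements at once, just like the V1 + V2 = V3 picture above.

```cpp
#include <cstddef>

// Element-wise addition: the same "+" is applied independently at
// every index, so the compiler can process several elements per
// instruction (e.g. 4 doubles per AVX operation).
void vector_add(const double* a, const double* b, double* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

With icc, a vectorization report (e.g. -qopt-report) can confirm whether such a loop was actually vectorized.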
13 Data Level Parallelism
14 Control Level Parallelism Simultaneous Multi Threading (SMT). Also known as Hyper-Threading (HT) or hardware threading. Multiple logical processors per core; each logical processor has its own administrative logic and is fed with an independent serial instruction stream. Mapped onto ILP - increases resource utilization of the superscalar CPU.
15 Control Level Parallelism Simultaneous Multi Threading (SMT). [Diagram: execution unit occupancy over time in cycles, without SMT vs. with SMT; each box represents a processor execution unit]
16 Control Level Parallelism Sandy Bridge Multi-Core CPU (Intel Xeon E5-2600). 2 QPI links at 8.0 GT/s. Ring bus: 32 bytes wide, bandwidth 96 GB/s, 1 cycle per hop. Integrated memory controller (IMC): 4 memory channels of 1600 MHz DDR3, bandwidth 4 * 1600 MHz * 8 Byte = 51.2 GB/s.
17 Control Level Parallelism Multi-Socket CPUs (Sandy Bridge). [Table: memory access latency in ns and bandwidth in GB/s for local vs. remote access; the concrete values were part of the slide graphic. *) Remote bandwidth over QPI: 8.0 GT/s * 2 Byte = 16.0 GB/s]
18 Task Level Parallelism. Intra-node: simultaneous multi-threading, multi-core, multi-socket - shared memory model. Inter-node: different compute nodes connected via a fast interconnect (e.g. InfiniBand) in different connection topologies - distributed memory model.
19 Summary
20 Memory Hierarchies
21 Motivation for Memory Hierarchies. Moore's Law: CPU performance doubles roughly every two years. Memory bandwidth does not keep up (doubles roughly every five years) - memory speed gap (memory wall). Year 1980: CPU and memory cycle both about 1 µs. Year 2000: CPU cycle 1 ns, memory cycle 100 ns.
22 Principles of Locality. Every program has to a certain degree some locality: usage of recently accessed data and instructions. Two kinds of locality. Temporal locality: a recently accessed object is used again in the near future, e.g. x is read - x is read or written again soon. Spatial locality: neighbouring objects are accessed at similar times, e.g. y[i] is read - y[i + 1] is read soon.
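A minimal sketch of both kinds of locality (the matrix layout and loop orders are illustrative, not from the slides): both functions compute the same sum over a row-major matrix, but only the first one walks memory with stride 1.

```cpp
#include <cstddef>
#include <vector>

// Row-major traversal: consecutive iterations touch neighbouring
// addresses (spatial locality), and `sum` is reused every iteration
// (temporal locality).
double sum_row_major(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double sum = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            sum += m[i * cols + j];      // stride-1 access
    return sum;
}

// Column-major traversal of the same row-major array jumps `cols`
// elements per step, so each access may land on a new cache line.
double sum_col_major(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double sum = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            sum += m[i * cols + j];      // stride-`cols` access
    return sum;
}
```

For matrices larger than the cache, the second version is typically much slower even though it does the same arithmetic.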
23 Cache Memory. Sits between processor and main memory. Keeps copies of main memory blocks (data and instructions). Access time is much faster (1 ns vs. 100 ns). Much smaller in size than main memory (costs).
24 Cache Design. Cache block size: Bytes; a cache block is usually called a cache line. Cache design criteria: When to cache? Where to cache? How to find a cache block again? Which cache block is replaced (after a miss)? What to do in case of writes? The methods must be easy to implement (in hardware).
25 When to Cache? Read access: always cache - when a memory location is read and no copy is in the cache (read miss), it is loaded into the cache. Write access: not applicable for instruction caches; different strategies (see later on).
26 Where to Cache? Direct mapped cache: the full address is split into an m-bit block index and an n-bit block offset. Associative cache: the full address is split into an m-bit set index and an n-bit block offset.
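The address split above can be sketched in a few lines of bit manipulation (the concrete bit widths passed in are assumptions for illustration, not values from the slides):

```cpp
#include <cstdint>

// Split a load address into (tag, set index, block offset) for a
// cache with 2^n-byte blocks and 2^m sets.
struct CacheAddress {
    std::uint64_t tag;
    std::uint64_t set;
    std::uint64_t offset;
};

CacheAddress split_address(std::uint64_t addr,
                           unsigned n_offset_bits, unsigned m_set_bits) {
    CacheAddress r;
    r.offset = addr & ((1ULL << n_offset_bits) - 1);               // low n bits
    r.set    = (addr >> n_offset_bits) & ((1ULL << m_set_bits) - 1); // next m bits
    r.tag    = addr >> (n_offset_bits + m_set_bits);               // remaining bits
    return r;
}
```

For example, with 64-byte lines (n = 6) and 64 sets (m = 6), address 0x12345604 has block offset 4, set index 24, and tag 0x12345.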
27 How to Find a Cache Block Again? Load address - find the set for the address. Information per cache block: address tag, valid bit. Check for a matching address tag among the valid blocks. Full address = address tag | m-bit set index | n-bit block offset.
28 Which Cache Block Is Replaced (After a Miss)? Direct mapped cache: no choice, replacement of the indexed block. Associative cache: random replacement, or least recently used (LRU) - better, but more difficult to implement.
29 What to Do in Case of Writes? Write through: write data directly to main memory (no caching); used for data I/O and some multiprocessor cache coherency implementations; the CPU avoids write latencies via a write buffer. Write back: write data only to the cache; on cache block replacement the data is written from cache to main memory (tracked via a dirty bit); reduces memory traffic, but is more difficult to implement than write through.
30 Cache Memory Access Costs. Average memory access time = hit time + miss ratio * miss time. Hit time: time to load data from cache to CPU. Miss ratio: proportion of accesses which cause a miss. Miss time: time to load data from main memory to cache. Goal: optimize all three parameters.
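The formula above is directly computable; the numeric example in the usage note below (1-cycle hit, 25% miss ratio, 100-cycle miss time) is an illustration, not a figure from the slides.

```cpp
// Average memory access time (AMAT):
//   AMAT = hit time + miss ratio * miss time
double amat(double hit_time, double miss_ratio, double miss_time) {
    return hit_time + miss_ratio * miss_time;
}
```

For instance, a 1-cycle hit time with a 25% miss ratio and a 100-cycle miss penalty gives an average of 26 cycles per access, which shows why reducing the miss ratio pays off so strongly.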
31 Three Reasons for Cache Misses (3 Cs). Compulsory (or cold start): first access to data of an address. Capacity: the cache is not big enough to hold all data. Conflict: too many memory addresses are mapped to the same set index of the cache.
32 Memory Hierarchy. Speed (and cost) vs. capacity: CPU registers, ~1 kB, 1 cycle; L1 cache, ~100 kB, 2-3 cycles; L2 cache, ~1-10 MB, ~20 cycles; L3 cache, ~10-50 MB, ~50 cycles; main memory, ~1 GB, ~300 cycles.
33 Example 1 Cache Hierarchy. 1. Go to the directory example_1. 2. Compile the code cache.cpp: icc -std=c++11 cache.cpp. 3. Execute the binary and try to identify the cache boundaries. 4. How big are the different caches?
34 Cache Coherency
35 Shared Memory Model. All processors have access to global memory. Communication with main memory via reads and writes. Caches are automatically kept up to date, i.e. coherent. Scalability is difficult because of the memory bottleneck - low number of processors. Example: multi-core CPU.
36 Shared Memory Model. [Diagram: eight processors P on one chip, each with private L1 and L2 caches, sharing an on-chip L3 cache connected to memory]
37 Cache Coherence. Shared memory model: a shared variable has a unique value at any given point in time. Caching creates multiple copies of a memory location. To avoid different values, caches are kept coherent: a write to a memory location must invalidate all copies in other caches.
38 Coherence Protocols. Bookkeeping of the sharing state of cache blocks: Was the block modified? Is the block stored in more than one cache? Snooping (or broadcast) based protocols: each copy in a cache has a sharing state, there is no centralized state, and each processor sees each request. Directory based protocols: bookkeeping of the sharing state is centralized.
39 Snooping Based Protocols. Usage of the valid tags of cache lines for invalidation, an extra tag for the sharing state, and the dirty bit of write back caches. All processors see all bus transactions. Invalidation message: block in cache - invalidate it. Memory read request: block in cache - provide the data and cancel the memory request. Many different implementations exist.
40 Three State Snoopy Protocol MSI. Simplest protocol allowing multiple copies. Each cache block can have one of the following states. Modified: the only valid copy in any cache, and the value differs from main memory. Shared: several valid copies in caches, and the value is identical to main memory. Invalid: the copy is not valid and cannot be used.
41 MSI Example: P2 Read. [Diagram: P1 holds the value in state S; P2 issues PrRd, its snooper places BusRd on the bus; P2 loads the value in state S] P2 wants to read the value. Its cache does not have the data, so it places a BusRd to notify other processors and ask for the data. The memory controller provides the data.
42 MSI Example: P3 Write. [Diagram: P3 issues PrWr and places BusRdX; P1 and P2 transition S to I; P3 holds the value in state M] P3 wants to write the value. It places a BusRdX to get exclusive access and the most recent copy of the data. The caches of P1 and P2 see the BusRdX and invalidate their copies. Because the value is still up to date in memory, memory provides the data.
43 MSI Example: P2 Read. [Diagram: P2 issues PrRd and places BusRd; P3 transitions M to S and flushes the data; P2 loads the value in state S] P2 wants to read the value. P3's cache has the most up-to-date copy and will provide it. P2's cache puts a BusRd on the bus. P3's cache snoops this and cancels the memory access because it will provide the data. P3's cache flushes the data to the bus.
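The three examples above can be condensed into a next-state function for one cache's copy of a block. This is a simplified sketch of MSI (it ignores bus arbitration, flushes, and upgrade messages), with event names matching the slides:

```cpp
// Simplified MSI next-state function for a single cache's copy.
// Events: own processor read/write (PrRd/PrWr) and snooped bus
// transactions from other caches (BusRd/BusRdX).
enum class State { Modified, Shared, Invalid };
enum class Event { PrRd, PrWr, BusRd, BusRdX };

State msi_next(State s, Event e) {
    switch (e) {
        case Event::PrRd:                       // read miss issues BusRd -> S
            return s == State::Invalid ? State::Shared : s;
        case Event::PrWr:                       // write issues BusRdX if needed -> M
            return State::Modified;
        case Event::BusRd:                      // another reader: flush, demote M -> S
            return s == State::Modified ? State::Shared : s;
        case Event::BusRdX:                     // another writer: invalidate our copy
            return State::Invalid;
    }
    return State::Invalid;                      // unreachable
}
```

Tracing the slides: P2's read takes its copy I -> S (slide 41), P3's write takes the others S -> I and its own copy to M (slide 42), and P2's later read demotes P3's copy M -> S (slide 43).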
44 False Sharing. Coherence operations work on whole cache lines. A cache line can contain multiple units of a data type (e.g. int). When two processors write to different units in the same cache line (the data values are not really shared), this is false sharing: each write invalidates the copies in other caches, causing bus traffic and memory accesses. It is a significant performance problem and difficult to identify. [Diagram: P1 and P2 writing to different words of the same cache line n]
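A common fix is to pad each per-thread counter to its own cache line. This sketch assumes a 64-byte line size (the usual x86 value, to be confirmed with example 2 below) and uses std::thread rather than the OpenMP setup of the example:

```cpp
#include <thread>
#include <cstdint>

// Two counters in one cache line would be "falsely shared": each
// write by one thread invalidates the line in the other core's
// cache. alignas(64) gives every counter its own (assumed 64-byte)
// line, so no coherence ping-pong occurs.
struct Padded {
    alignas(64) std::uint64_t value = 0;   // one counter per cache line
};

void count_padded(Padded* counters, int n_iters) {
    std::thread t0([&] { for (int i = 0; i < n_iters; ++i) ++counters[0].value; });
    std::thread t1([&] { for (int i = 0; i < n_iters; ++i) ++counters[1].value; });
    t0.join();
    t1.join();
}
```

Removing the alignas(64) keeps the program correct but, with the threads pinned to different cores, typically makes it several times slower - which is what makes false sharing so hard to spot.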
45 Example 2 Cache Coherency False Sharing. 1. Go to the directory example_2. 2. Compile the code coherence.cpp: icc -openmp coherence.cpp. 3. Set two execution threads: export OMP_NUM_THREADS=2. 4. Use different cores on the same CPU: export KMP_AFFINITY=granularity=fine,proclist=[0,2],explicit. 5. Execute the binary. 6. Can you see the drop in performance? 7. What is the cache line size?
46 NUMA (Non Uniform Memory Access)
47 Distributed Shared Memory Model. [Diagram: processor pairs P with caches C, each node with its own local memory M, connected via an interconnect]
48 Directory Based Coherence. Scalability: there is no fast bus - snooping is not possible, hence a directory structure. A bit vector for each cache block, one bit per processor (1 = block is in that processor's cache), stored in (distributed) memory. Scalability problem: additional memory for the directory, e.g. a 128 Byte cache block with 256 processors - about 20% overhead.
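The overhead figure on the slide can be checked with a one-line calculation. The sketch below assumes the overhead is measured as the directory's share of total storage (data bits plus directory bits); measured relative to the data alone it would be 25% for the same numbers.

```cpp
// Directory storage overhead: one presence bit per processor for
// every cache-block-sized chunk of memory.
double directory_overhead(int block_bytes, int n_processors) {
    double data_bits = block_bytes * 8.0;                  // bits in one block
    double dir_bits  = static_cast<double>(n_processors);  // one bit per processor
    return dir_bits / (data_bits + dir_bits);              // fraction of total storage
}
```

With 128-byte blocks (1024 data bits) and 256 processors, the directory adds 256 bits per block: 256 / (1024 + 256) = 20% of total storage, matching the slide.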
49 Directory Based Coherence. [Diagram: processors P with caches C; each node holds local memory M plus a directory D, connected via an interconnect]
50 Directory Based Coherence. Home node: the node where the memory (and directory entry) is located. In principle like snoopy based protocols: a cache line has three states (MSI); a directory entry has modified, shared, and uncached states. On a cache miss, the home node is asked for the data, and the directory bits are set according to the read/write miss. The directory can invalidate a copy in a remote cache or fetch the data back from a remote cache. A cache can write back to the home node.
51 NUMA Problem. Access to remote memory takes longer than to local memory. The operating system is responsible for allocating pages. Common allocation policies: first touch (allocate the page on the node which makes the first access) and round robin (allocate cyclically).
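Under a first-touch policy, parallel initialization places each page near the thread that will use it. This is a sketch with std::thread (example 3 below does the same with OpenMP); the thread count and even chunk split are illustrative:

```cpp
#include <vector>
#include <thread>
#include <cstddef>

// First-touch sketch: each thread writes ("touches") its own chunk
// of the array first, so the OS allocates those pages in that
// thread's local NUMA memory. Later accesses with the same chunking
// then hit local memory.
void parallel_first_touch(std::vector<double>& a, unsigned n_threads) {
    std::vector<std::thread> workers;
    std::size_t chunk = a.size() / n_threads;
    for (unsigned t = 0; t < n_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = (t == n_threads - 1) ? a.size() : begin + chunk;
        workers.emplace_back([&a, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                a[i] = 0.0;          // first touch allocates the page locally
        });
    }
    for (auto& w : workers) w.join();
}
```

Note that if the array is initialized serially by one thread instead, all pages land on that thread's node, and the other sockets pay the remote-access penalty for the whole run.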
52 Example 3 Memory Bandwidth Saturation & NUMA. 1. Go to the directory example_3. 2. Compile the file cacheomp.cpp: icc -std=c++11 -openmp cacheomp.cpp. 3. Execute the binary. 4. How big is the main memory bandwidth? 5. Activate the parallel initialization of the array in the file cacheomp.cpp and compile/execute the program again. 6. How big is the improvement in the main memory bandwidth?
53 Roofline Model
54 Basic Idea. Compute the attainable peak floating point performance as a function of the arithmetic intensity. Correlates the peak floating point performance and the peak memory bandwidth of a target machine. Enables identification of whether a certain algorithm is memory bound (bandwidth) or compute bound - localization of optimization potential.
55 Attainable Performance Calculation. Peak floating-point performance = #Sockets * #Cores * #Instructions per cycle * Frequency (#Instructions per cycle from DLP and ILP). Peak memory bandwidth = #Sockets * #Channels * Frequency * #Bytes (bandwidth from RAM to LLC). Operational intensity = Flops / Bytes read. Differentiate between double precision (DP) and single precision (SP).
56 Operational Intensity Example Matrix Vector Multiplication in DP. Matrix: a x b, vector: b x 1. Flops: flops ~ 2 * a * b. Bytes read: mem ~ a * b * 8. Operational intensity: i = flops / mem = (2 * a * b) / (a * b * 8) = 1/4.
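The roofline calculations above are simple enough to script. In this sketch, the sample hardware numbers in the usage note (4 channels of DDR3-1600, a compute roof of 172.8 GFlop/s) are illustrative assumptions, except for the 51.2 GB/s bandwidth, which matches slide 16:

```cpp
#include <algorithm>

// Peak compute roof in GFlop/s:
//   sockets * cores * flops per cycle * frequency (GHz)
double peak_gflops(int sockets, int cores, int flops_per_cycle, double ghz) {
    return sockets * cores * flops_per_cycle * ghz;
}

// Peak memory roof in GB/s: sockets * channels * MT/s * bytes per transfer.
double peak_bandwidth_gbs(int sockets, int channels, double mtps, int bytes) {
    return sockets * channels * mtps * bytes / 1000.0;
}

// Roofline: attainable GFlop/s at operational intensity i (flops/byte)
// is the lower of the compute roof and the memory roof.
double attainable_gflops(double intensity, double gflops, double gbs) {
    return std::min(gflops, intensity * gbs);
}
```

At i = 1/4 (the matrix-vector example above), one socket with 51.2 GB/s and a 172.8 GFlop/s compute roof attains only min(172.8, 0.25 * 51.2) = 12.8 GFlop/s: the kernel sits far left of the ridge point and is firmly memory bound.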
57-61 Roofline Model 2 x Xeon E GHz. [Series of roofline plots: attainable performance vs. operational intensity; algorithms left of the ridge point are memory bound, algorithms right of it are compute bound]
62 Roofline Model Comparison
63 Thank You! [Photo: Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn, Valais, Switzerland]
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationMemory Hierarchy. Reading. Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 (2) Lecture notes from MKP, H. H. Lee and S.
Memory Hierarchy Lecture notes from MKP, H. H. Lee and S. Yalamanchili Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 Reading (2) 1 SRAM: Value is stored on a pair of inerting gates Very fast but
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address
More informationEN164: Design of Computing Systems Lecture 24: Processor / ILP 5
EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationS = 32 2 d kb (1) L = 32 2 D B (2) A = 2 2 m mod 4 (3) W = 16 2 y mod 4 b (4)
1 Cache Design You have already written your civic registration number (personnummer) on the cover page in the format YyMmDd-XXXX. Use the following formulas to calculate the parameters of your caches:
More informationLecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )
Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationIntroduction to cache memories
Course on: Advanced Computer Architectures Introduction to cache memories Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Summary Summary Main goal Spatial and temporal
More informationLecture 20: Multi-Cache Designs. Spring 2018 Jason Tang
Lecture 20: Multi-Cache Designs Spring 2018 Jason Tang 1 Topics Split caches Multi-level caches Multiprocessor caches 2 3 Cs of Memory Behaviors Classify all cache misses as: Compulsory Miss (also cold-start
More informationChapter 6. Parallel Processors from Client to Cloud Part 2 COMPUTER ORGANIZATION AND DESIGN. Homogeneous & Heterogeneous Multicore Architectures
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Part 2 Homogeneous & Heterogeneous Multicore Architectures Intel XEON 22nm
More informationShared Symmetric Memory Systems
Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationAdvanced Memory Organizations
CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationCS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationEECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141
EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role
More informationPage 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence
SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it
More informationCaches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first
Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer
More informationAgenda. System Performance Scaling of IBM POWER6 TM Based Servers
System Performance Scaling of IBM POWER6 TM Based Servers Jeff Stuecheli Hot Chips 19 August 2007 Agenda Historical background POWER6 TM chip components Interconnect topology Cache Coherence strategies
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Daniele Spampinato & Alen Stojanov Left alignment Attractive font (sans serif, avoid Arial) Calibri,
More informationMemory systems. Memory technology. Memory technology Memory hierarchy Virtual memory
Memory systems Memory technology Memory hierarchy Virtual memory Memory technology DRAM Dynamic Random Access Memory bits are represented by an electric charge in a small capacitor charge leaks away, need
More informationLecture-22 (Cache Coherence Protocols) CS422-Spring
Lecture-22 (Cache Coherence Protocols) CS422-Spring 2018 Biswa@CSE-IITK Single Core Core 0 Private L1 Cache Bus (Packet Scheduling) Private L2 DRAM CS422: Spring 2018 Biswabandan Panda, CSE@IITK 2 Multicore
More informationLecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform
More informations complement 1-bit Booth s 2-bit Booth s
ECE/CS 552 : Introduction to Computer Architecture FINAL EXAM May 12th, 2002 NAME: This exam is to be done individually. Total 6 Questions, 100 points Show all your work to receive partial credit for incorrect
More informationLecture-14 (Memory Hierarchy) CS422-Spring
Lecture-14 (Memory Hierarchy) CS422-Spring 2018 Biswa@CSE-IITK The Ideal World Instruction Supply Pipeline (Instruction execution) Data Supply - Zero-cycle latency - Infinite capacity - Zero cost - Perfect
More informationToday. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )
Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture
More informationCommunications and Computer Engineering II: Lecturer : Tsuyoshi Isshiki
Communications and Computer Engineering II: Microprocessor 2: Processor Micro-Architecture Lecturer : Tsuyoshi Isshiki Dept. Communications and Computer Engineering, Tokyo Institute of Technology isshiki@ict.e.titech.ac.jp
More informationENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design
ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More information3Introduction. Memory Hierarchy. Chapter 2. Memory Hierarchy Design. Computer Architecture A Quantitative Approach, Fifth Edition
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 11: Reducing Hit Time Cache Coherence Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer Science Source: Lecture based
More informationShared Memory Multiprocessors
Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O
More informationComputer Architecture CS372 Exam 3
Name: Computer Architecture CS372 Exam 3 This exam has 7 pages. Please make sure you have all of them. Write your name on this page and initials on every other page now. You may only use the green card
More informationHigh performance computing. Memory
High performance computing Memory Performance of the computations For many programs, performance of the calculations can be considered as the retrievability from memory and processing by processor In fact
More information