Modern CPU Architectures
1 Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH Johannes Kepler University Linz
2 Motivation for Parallelism I CPU History
3 Motivation for Parallelism II CPU Scaling Moore's Law (doubling of transistor density every 2 years) Dennard Scaling made Moore's Law work: smaller transistors switch faster at constant power density
4 Motivation for Parallelism III Limits of Serial Code Execution End of classical Dennard Scaling: 1. No t_ox scaling 2. No voltage scaling 3. No delay time scaling 4. Leakage P = C_EFF * V_DD^2 * f + I_LEAK * V_DD Power wall because of critical heat dissipation Frequency wall because of gate delay time
5 Levels of Parallelism I Code Granularity Task level parallelism (TLP): large grain, tasks communicating via messages Control level parallelism (CLP): medium grain, independent functions Data level parallelism (DLP): fine grain, operations on array elements Instruction level parallelism (ILP): very fine grain, individual instructions
6 Levels of Parallelism II Programmer's Control TLP: programmer (MPI, PVM, ...) CLP: programmer (OpenMP, Pthreads, ...) DLP: programmer (vector intrinsics) or automatic (compiler) ILP: automatic (CPU)
7 Levels of Parallelism III Milestones of Parallelization
8 Instruction Level Parallelism I Superscalar Pipeline Also known as multiple issue Implemented in hardware via a superscalar CPU (parallel pipelines) Each instruction passes through the stages IF, ID, EX, WB; with multiple parallel pipelines, several instructions occupy each stage per cycle (fully utilized pipeline)
9 Instruction Level Parallelism II Out of Order Execution Input: serial instruction stream Identification and execution of parallel executable instructions Pipeline hazards: Structural hazards (resource conflicts) Data hazards (data dependencies) Control hazards (conditional and unconditional jumps; mitigated by branch prediction) ILP wall: further effort for parallelization leads to little increase in resource utilization
10 Instruction Level Parallelism III Sandy Bridge Core Pipeline In-order front end: instruction fetch with branch prediction from the L1 instruction cache (32 kB), pre-decode, instruction queue, four decoders, and a uop cache, delivering 4 uops/cycle Out-of-order engine: rename/allocate/retire (4 uops/cycle issued), reorder buffer, load and store buffers, and a reservation station dispatching 6 uops/cycle to ports 0-5 (ALUs, VI MUL/ADD, SSE MUL/ADD, DIV, AVX FP MUL/ADD, shuffle/boolean/blend, JMP, load units, store address and store data units) Memory: L1 data cache (32 kB, 48 bytes/cycle) and L2 cache (256 kB)
11 Data Level Parallelism I Basics The same instruction is executed simultaneously on N > 1 elements of a vector Scalar processing: one operation per pair of scalar operands (S1 + S2 = S3) Vector processing: one operation per pair of operand vectors, applied element-wise (V1 + V2 = V3)
12 Data Level Parallelism II History
13 Control Level Parallelism I Simultaneous Multi Threading (SMT) Also known as Hyper Threading (HT) or hardware threading Multiple logical processors per core Each logical processor has its own administrative logic and is fed an independent serial instruction stream The streams are mapped onto ILP, increasing resource utilization in the superscalar CPU
14 Control Level Parallelism II Simultaneous Multi Threading (SMT) Diagram: execution unit utilization per cycle without and with SMT Note: each box represents a processor execution unit
15 Control Level Parallelism III Sandy Bridge Multi-Core CPU (Intel Xeon E5-2600) 2 QPI links at 8.0 GT/s Ring bus: 32 bytes wide, bandwidth 96 GB/s, 1 cycle per hop Integrated memory controller (IMC): 4 memory channels of 1600 MHz DDR3, bandwidth 4 * 1600 MHz * 8 Byte = 51.2 GB/s, latency 60 ns
16 Control Level Parallelism IV Multi-Socket CPUs Sandy Bridge sockets are connected via QPI Remote memory access has higher latency and lower bandwidth than local access QPI bandwidth: 8.0 GT/s * 2 Byte = 16.0 GB/s
17 Task Level Parallelism Intra-node: simultaneous multi threading, multi core, multi socket; shared memory model Inter-node: different compute nodes connected via a fast interconnect (e.g. InfiniBand) with different connection topologies; distributed memory model
18 Caching
19 Motivation for Caching Moore's Law: CPU performance doubles roughly every two years Memory bandwidth does not keep up (doubles roughly every five years) Memory speed gap (memory wall) Year 1980: CPU and memory cycle both about 1 us Year 2000: CPU cycle 1 ns, memory cycle 100 ns
20 Principles of Locality Every program exhibits some degree of locality: it reuses recently accessed data and instructions Two kinds of locality Temporal locality: a recently accessed object is used again in the near future, e.g. x is read, then x is read or written again soon Spatial locality: neighbouring objects are accessed at similar times, e.g. y[i] is read, then y[i+1] is read soon
21 Cache Memory A cache sits between processor and main memory and keeps copies of main memory blocks (data and instructions) Access time is much faster (1 ns vs. 100 ns) Much smaller in size than main memory (cost)
22 Cache Design Cache block size: e.g. 64 bytes; a cache block is usually called a cache line Cache design criteria When to cache? Where to cache? How to find a cache block again? Which cache block is replaced (after a miss)? What to do in case of writes? Methods must be simple (implemented in hardware)
23 Cache Design When to Cache? Read access: always cached; when a memory location is read and no copy is in the cache (read miss), the block is loaded into the cache Write access: not applicable for instruction caches; different strategies (see later on)
24 Cache Design Where to Cache? Direct mapped cache: the full address is split into a block index (m bits) and a block offset (n bits); each block maps to exactly one cache location Associative cache (e.g. 64-byte lines): the full address is split into a set index (m bits) and a block offset (n bits); a block may be placed in any way of its set
25 Cache Design How to Find a Cache Block Again? Load address and find the set for the address Information per cache block: address tag and valid bit Check for a matching address tag in the valid blocks Full address = address tag + set index (m bits) + block offset (n bits)
26 Cache Design Which Cache Block Is Replaced (After a Miss)? Direct mapped cache: no choice, replacement of the indexed block Associative cache: random replacement, or least recently used (LRU), which is better but more difficult to implement
27 Cache Design What to Do in Case of Writes? Strategy write through: write data directly to main memory (no caching); used for data I/O and some multiprocessor cache coherency implementations; the CPU avoids write latencies via a write buffer Strategy write back: write data only to the cache; on cache block replacement the data is written from cache to main memory (tracked via a dirty bit); reduced memory traffic, but more difficult to implement than write through
28 Cache Memory Access Costs Average memory access time: avg memory access time = hit time + miss ratio * miss time Hit time: time to load data from cache to CPU Miss ratio: proportion of accesses which cause a miss Miss time: time to load data from main memory to cache Goal: optimize all three parameters
29 The Three Reasons for Cache Misses (3 Cs) Compulsory (or cold start): first access to data at an address Capacity: the cache is not big enough to hold all data Conflict: too many memory addresses map to the same set index of the cache
30 Memory Hierarchy Speed (and cost) vs. capacity: CPU registers: 1 cycle, ~1 kB L1 cache: 2-3 cycles, ~100 kB L2 cache: ~20 cycles, ~1-10 MB L3 cache: ~50 cycles, ~10-50 MB Main memory: ~300 cycles, ~1 GB
31 Exercise 1 Cache Hierarchy 1. Look at example_1/cache.c. 2. Compile the code with icc -openmp cache.c. 3. Execute the binary and try to identify the cache boundaries. 4. How big are the different caches?
32 Cache Coherency
33 Shared Memory Model All processors have access to global memory Communication with main memory via reads and writes Caches are automatically kept up to date, i.e. coherent Scaling is difficult because of the memory bottleneck, so the processor count stays low (multi-core CPU)
34 Shared Memory Model Diagram: eight processors (P) on one chip, each with private L1 and L2 caches, sharing an L3 cache connected to memory
35 Cache Coherence In the shared memory model a shared variable has a unique value at any given point in time Caching creates multiple copies of a memory location To avoid diverging values, caches are kept coherent: a write to a memory location must invalidate all copies in other caches
36 Coherence Protocols Bookkeeping of the sharing state of cache blocks: Was the block modified? Is the block stored in more than one cache? Snooping (or broadcast) based protocols: each copy in a cache has a sharing state, there is no centralized state, and each processor sees each request Directory based protocols: bookkeeping of the sharing state is centralized
37 Snooping Based Protocols Usage of the valid tags of cache lines for invalidation, plus an extra tag for the sharing state Usage of the dirty bit of write back caches All processors see all bus transactions Invalidation message: the block in the cache is invalidated Memory read request: if the block is in a cache, that cache provides the data and cancels the memory request Many different implementations
38 Three State Snoopy Protocol MSI Simplest protocol allowing multiple copies Each cache block can have one of the following states: Modified: the only valid copy in any cache, and the value differs from main memory Shared: several valid copies in caches, and the value is identical to main memory Invalid: the copy is not valid and cannot be used
39 MSI Example: R2 P2 wants to read the value while P1 holds it in state S. P2's cache does not have the data, so it places a BusRd to notify other processors and ask for the data. The memory controller provides the data, and P2's copy enters state S.
40 MSI Example: W3 P3 wants to write the value. It places a BusRdX to get exclusive access and the most recent copy of the data. The caches of P1 and P2 see the BusRdX and invalidate their copies (S to I). Because the value is still up-to-date in memory, memory provides the data, and P3's copy enters state M.
41 MSI Example: R2 P2 wants to read the value. P3's cache has the most up-to-date copy (state M) and will provide it. P2's cache puts a BusRd on the bus. P3's cache snoops this and cancels the memory access because it will provide the data. P3's cache flushes the data to the bus; both copies end up in state S.
42 MSI Example: W1 P1 wants to write to its cache. The cache places a BusRdX on the bus to gain exclusive access and the most up-to-date value. Main memory is not stale, so it provides the data. The snoopers of P2 and P3 see the BusRdX and invalidate their copies (S to I); P1's copy enters state M.
43 False Sharing Coherence operations act on whole cache lines A cache line can contain multiple units of a data type (e.g. int) When two processors write to different units in the same cache line, the data values are not really shared: false sharing Each write invalidates the copies in other caches, causing bus traffic and memory accesses A significant performance problem that is difficult to identify
44 Exercise 2 Cache Coherency False Sharing 1. Look at example_2/coherency.c 2. Compile the code with icc -openmp coherency.c location.c 3. Set two execution threads via export OMP_NUM_THREADS=2 4. Use different cores on the same CPU via export KMP_AFFINITY=granularity=fine,proclist=[0,2],explicit 5. Execute the binary. 6. Can you see the drop in performance? 7. What is the cache line size?
45 NUMA (Non Uniform Memory Access)
46 Distributed Shared Memory Model Diagram: processors (P), each with a cache (C) and a local memory (M), connected via an interconnect
47 Directory Based Coherence Scalability: there is no fast bus, so snooping is not possible; hence a directory structure Bit vector for each cache block, one bit per processor (1 = copy in that processor's cache), stored in (distributed) memory Scalability problem: additional memory for the directory, e.g. 128-byte cache blocks and 256 processors give 256 directory bits per 1024 data bits, about 20 % overhead
48 Directory Based Coherence Home node: the node where the memory (and directory entry) is located In principle like snooping based protocols: a cache line has three states (MSI), a directory entry has the states modified, shared, and uncached On a cache miss the home node provides the data, and the directory bits are set accordingly for a read/write miss The directory can invalidate a copy in a remote cache or fetch the data back from a remote cache A cache can write back to the home node
49 NUMA Problem Access to remote memory takes longer than access to local memory The operating system is responsible for allocating pages Common allocation policies are: First touch: allocate the page on the node which makes the first access Round robin: allocate cyclically across the nodes
50 Exercise 3 Memory Bandwidth Saturation & NUMA 1. Look at example_3/cacheomp.c. 2. Compile the file via icc -openmp cacheomp.c location.c 3. Execute the binary. 4. How big is the main memory bandwidth? 5. Activate the parallel initialization of the array in the file cacheomp.c and compile/execute the program again. 6. How big is the improvement in the main memory bandwidth?
51 Roofline Model
52 Roofline Model Basic Idea Computes the attainable peak floating point performance as a function of the arithmetic intensity Correlates peak floating point performance and peak memory bandwidth of a target machine Enables identification of whether a certain algorithm is memory bound (bandwidth) or compute bound, localizing the optimization potential
53 Roofline Model Calculation Peak floating-point performance = #Sockets * #Cores * #Instructions per cycle * Frequency (#Instructions per cycle from DLP and ILP) Peak memory bandwidth = #Sockets * #Channels * Frequency * #Bytes (bandwidth from RAM to LLC) Operational intensity = Flops / Bytes read Differentiation between double precision (DP) and single precision (SP)
54 Roofline Model Example: Matrix Vector Multiplication in DP Matrix: a x b, vector: b x 1 Flops: flops ~ 2 * a * b Bytes read: mem ~ 8 * a * b Operational intensity: i = flops / mem = (2 * a * b) / (8 * a * b) = 1/4
55 Roofline Model Example 2 x Xeon E GHz
56 Roofline Model Example 2 x Xeon E GHz Compute bound
57 Roofline Model Example 2 x Xeon E GHz Memory bound Compute bound
60 Roofline Model Comparison of Different Architectures
61 Thank You! Castor, 4228 m, and Pollux, 4092 m, between the Monte Rosa massif and the Matterhorn, Valais, Switzerland
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationLecture 20: Multi-Cache Designs. Spring 2018 Jason Tang
Lecture 20: Multi-Cache Designs Spring 2018 Jason Tang 1 Topics Split caches Multi-level caches Multiprocessor caches 2 3 Cs of Memory Behaviors Classify all cache misses as: Compulsory Miss (also cold-start
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 11: Reducing Hit Time Cache Coherence Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer Science Source: Lecture based
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More informationShared Symmetric Memory Systems
Shared Symmetric Memory Systems Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering Department University
More informationChapter 6. Parallel Processors from Client to Cloud Part 2 COMPUTER ORGANIZATION AND DESIGN. Homogeneous & Heterogeneous Multicore Architectures
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Part 2 Homogeneous & Heterogeneous Multicore Architectures Intel XEON 22nm
More informationIntroducing Multi-core Computing / Hyperthreading
Introducing Multi-core Computing / Hyperthreading Clock Frequency with Time 3/9/2017 2 Why multi-core/hyperthreading? Difficult to make single-core clock frequencies even higher Deeply pipelined circuits:
More informationPage 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence
SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationSRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design
SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationCS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II
CS 152 Computer Architecture and Engineering Lecture 7 - Memory Hierarchy-II Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste!
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 5 Large and Fast: Exploiting Memory Hierarchy Principle of Locality Programs access a small proportion of their address
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationLecture 14: Cache Innovations and DRAM. Today: cache access basics and innovations, DRAM (Sections )
Lecture 14: Cache Innovations and DRAM Today: cache access basics and innovations, DRAM (Sections 5.1-5.3) 1 Reducing Miss Rate Large block size reduces compulsory misses, reduces miss penalty in case
More informationEN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University
EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,
More informationMultithreaded Processors. Department of Electrical Engineering Stanford University
Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationCSC 631: High-Performance Computer Architecture
CSC 631: High-Performance Computer Architecture Spring 2017 Lecture 10: Memory Part II CSC 631: High-Performance Computer Architecture 1 Two predictable properties of memory references: Temporal Locality:
More informationModule 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT
TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012
More informationIntroduction to cache memories
Course on: Advanced Computer Architectures Introduction to cache memories Prof. Cristina Silvano Politecnico di Milano email: cristina.silvano@polimi.it 1 Summary Summary Main goal Spatial and temporal
More information( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture
( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline
More informationMultiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism
Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,
More informationEC 513 Computer Architecture
EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P
More informationLecture-22 (Cache Coherence Protocols) CS422-Spring
Lecture-22 (Cache Coherence Protocols) CS422-Spring 2018 Biswa@CSE-IITK Single Core Core 0 Private L1 Cache Bus (Packet Scheduling) Private L2 DRAM CS422: Spring 2018 Biswabandan Panda, CSE@IITK 2 Multicore
More informationShared Memory Multiprocessors
Parallel Computing Shared Memory Multiprocessors Hwansoo Han Cache Coherence Problem P 0 P 1 P 2 cache load r1 (100) load r1 (100) r1 =? r1 =? 4 cache 5 cache store b (100) 3 100: a 100: a 1 Memory 2 I/O
More informationAdvanced Memory Organizations
CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU
More informationHigh performance computing. Memory
High performance computing Memory Performance of the computations For many programs, performance of the calculations can be considered as the retrievability from memory and processing by processor In fact
More informationCache Coherence in Bus-Based Shared Memory Multiprocessors
Cache Coherence in Bus-Based Shared Memory Multiprocessors Shared Memory Multiprocessors Variations Cache Coherence in Shared Memory Multiprocessors A Coherent Memory System: Intuition Formal Definition
More informationMemory Hierarchy. Reading. Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 (2) Lecture notes from MKP, H. H. Lee and S.
Memory Hierarchy Lecture notes from MKP, H. H. Lee and S. Yalamanchili Sections 5.1, 5.2, 5.3, 5.4, 5.8 (some elements), 5.9 Reading (2) 1 SRAM: Value is stored on a pair of inerting gates Very fast but
More informationSuggested Readings! What makes a memory system coherent?! Lecture 27" Cache Coherency! ! Readings! ! Program order!! Sequential writes!! Causality!
1! 2! Suggested Readings! Readings!! H&P: Chapter 5.8! Could also look at material on CD referenced on p. 538 of your text! Lecture 27" Cache Coherency! 3! Processor components! Multicore processors and
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationEECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141
EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role
More informationComputer Architecture Memory hierarchies and caches
Computer Architecture Memory hierarchies and caches S Coudert and R Pacalet January 23, 2019 Outline Introduction Localities principles Direct-mapped caches Increasing block size Set-associative caches
More informationCaches. Cache Memory. memory hierarchy. CPU memory request presented to first-level cache first
Cache Memory memory hierarchy CPU memory request presented to first-level cache first if data NOT in cache, request sent to next level in hierarchy and so on CS3021/3421 2017 jones@tcd.ie School of Computer
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Daniele Spampinato & Alen Stojanov Left alignment Attractive font (sans serif, avoid Arial) Calibri,
More informations complement 1-bit Booth s 2-bit Booth s
ECE/CS 552 : Introduction to Computer Architecture FINAL EXAM May 12th, 2002 NAME: This exam is to be done individually. Total 6 Questions, 100 points Show all your work to receive partial credit for incorrect
More informationMemory systems. Memory technology. Memory technology Memory hierarchy Virtual memory
Memory systems Memory technology Memory hierarchy Virtual memory Memory technology DRAM Dynamic Random Access Memory bits are represented by an electric charge in a small capacitor charge leaks away, need
More informationCommunications and Computer Engineering II: Lecturer : Tsuyoshi Isshiki
Communications and Computer Engineering II: Microprocessor 2: Processor Micro-Architecture Lecturer : Tsuyoshi Isshiki Dept. Communications and Computer Engineering, Tokyo Institute of Technology isshiki@ict.e.titech.ac.jp
More informationEN164: Design of Computing Systems Lecture 24: Processor / ILP 5
EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More information