Introduction: Modern computer architecture. The stored-program computer and its inherent bottlenecks. Multi- and manycore chips and nodes.
1 Introduction: Modern computer architecture. The stored-program computer and its inherent bottlenecks. Multi- and manycore chips and nodes.
2 Introduction: Moore's law. In 1965, G. Moore claimed that the number of transistors on a microchip doubles every 12 to 24 months. Transistor counts today: Intel Sandy Bridge EP: 2.3 billion; Nvidia Kepler: 7 billion; Intel Broadwell: 7.2 billion; Nvidia Pascal: 15 billion.
3 Multi-core today: Intel Xeon E5-2600v3 "Haswell EP" (2014). Up to 18 cores running at 2+ GHz (turbo mode: 3.5+ GHz); Simultaneous Multithreading reports it as a 36-way chip; 5.7 billion transistors; 22 nm process; die size: 662 mm². Optional: Cluster on Die (CoD) mode. (Figure: multi-socket server node.)
4 A deeper dive into core and chip architecture
5 General-purpose cache-based microprocessor core. A modern CPU core implements the stored-program computer concept (Turing 1936); similar designs are used on all modern systems; (still) multiple potential bottlenecks; flexible!
6 Basic resources on a stored-program computer: instruction execution and data movement. 1. Instruction execution: this is the primary resource of the processor; all efforts in hardware design are targeted towards increasing the instruction throughput. Instructions are the concept of "work" as seen by processor designers. Not all instructions count as work as seen by application developers! Example: adding two arrays A(:) and B(:):
  do i=1, N
    A(i) = A(i) + B(i)
  enddo
Processor work per iteration: LOAD r1 = A(i); LOAD r2 = B(i); ADD r1 = r1 + r2; STORE A(i) = r1; INCREMENT i; BRANCH to top if i<N. User work: N flops (ADDs).
7 Basic resources on a stored-program computer: instruction execution and data movement. 2. Data transfer: data transfers are a consequence of instruction execution and therefore a secondary resource; the maximum bandwidth is determined by the request rate of executed instructions and by technical limitations (bus width, speed). Example: adding two arrays A(:) and B(:):
  do i=1, N
    A(i) = A(i) + B(i)
  enddo
Data transfers per iteration: 8 bytes for LOAD r1 = A(i), 8 bytes for LOAD r2 = B(i), 8 bytes for STORE A(i) = r1; sum: 24 bytes. Crucial question: what is the bottleneck, data transfer or code execution?
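The bottleneck question can be answered by comparing the loop's code balance (bytes of traffic per flop) with the machine balance (memory bandwidth per peak flop rate). The minimal Fortran sketch below illustrates the comparison; the peak and bandwidth figures are assumed, illustrative values, not taken from the slides or from any specific CPU.
  program bottleneck
    implicit none
    double precision, parameter :: traffic = 24.d0   ! bytes per loop iteration (2 LOADs + 1 STORE, 8 B each)
    double precision, parameter :: flops   = 1.d0    ! one ADD per iteration
    double precision, parameter :: peak    = 500.d9  ! assumed chip peak: 500 GFlop/s (illustrative)
    double precision, parameter :: membw   = 50.d9   ! assumed memory bandwidth: 50 GB/s (illustrative)
    double precision :: b_code, b_machine
    b_code    = traffic / flops      ! bytes/flop required by the code
    b_machine = membw   / peak       ! bytes/flop the machine can deliver (~0.1 here)
    print *, 'code balance    [bytes/flop]:', b_code
    print *, 'machine balance [bytes/flop]:', b_machine
    if (b_code > b_machine) print *, 'the loop is limited by data transfer, not by execution'
  end program bottleneck
With 24 bytes of traffic per flop required but only on the order of 0.1 bytes per flop delivered, such a loop is clearly memory bound.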
8 Microprocessors Pipelining
9 Pipelining of arithmetic/functional units. Idea: split a complex instruction into several simple/fast steps (stages); each step takes the same amount of time, e.g., a single cycle; execute different steps of different instructions at the same time (in parallel). Benefits: the core can work on several (here: 5) independent instructions simultaneously; one instruction finishes each cycle once the pipeline is full. Drawbacks: the pipeline must be filled, so a large number of independent instructions is required; complex instruction scheduling by hardware (out-of-order execution) or by the compiler (software pipelining) is needed. Pipelining is widely used in modern computer architectures.
10 5-stage multiplication pipeline: A(i) = B(i)*C(i), i = 1, ..., n. The first result is available after 5 cycles (= latency of the pipeline)! Wind-up/wind-down phases: pipeline stages are empty while the pipeline fills and drains.
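A simple back-of-the-envelope model makes the wind-up cost concrete: with m pipeline stages and n independent operations, the pipelined execution time is roughly n + m - 1 cycles (m - 1 cycles to fill the pipeline, then one result per cycle), compared with n*m cycles without pipelining. The short Fortran sketch below evaluates this; the loop lengths are arbitrary.
  program pipeline_timing
    implicit none
    integer, parameter :: m = 5        ! pipeline depth, as on the slide
    integer :: n
    do n = 1, 1000, 333                ! arbitrary operation counts
       print '(a,i5,a,i6,a,i7,a,f6.2)', 'n=', n, '  T_pipe=', n+m-1, &
             '  T_serial=', n*m, '  speedup=', dble(n*m)/dble(n+m-1)
    end do
  end program pipeline_timing
For n much larger than m the speedup approaches the pipeline depth m; for short loops the wind-up/wind-down phases dominate.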
11 Microprocessors Superscalarity and Simultaneous Multithreading
12 Superscalar processors: instruction-level parallelism. Multiple execution units enable the use of instruction-level parallelism (ILP): the instruction stream is parallelized on the fly. (Figure: fetch from L1I, decode, and execute stages of a 4-way superscalar core, with four instructions in flight per stage and cycle.) Example: LOAD, STORE, MULT, ADD issued together. Issuing m concurrent instructions per cycle: m-way superscalar. Modern processors are 3- to 6-way superscalar and can perform 2 floating-point instructions per cycle.
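One common way to expose enough independent instructions to a superscalar, pipelined core is to unroll reduction loops with several partial accumulators. The Fortran sketch below is an illustration, not code from the slides; it assumes n is a multiple of 4 and ignores the slightly different rounding of the result.
  program ilp_demo
    implicit none
    integer, parameter :: n = 1000000            ! assumed to be a multiple of 4
    double precision :: a(n), s1, s2, s3, s4, s
    integer :: i
    a = 1.d0
    s1 = 0.d0; s2 = 0.d0; s3 = 0.d0; s4 = 0.d0
    do i = 1, n, 4                               ! four independent accumulators: the adds in
       s1 = s1 + a(i)                            ! one iteration have no mutual dependencies,
       s2 = s2 + a(i+1)                          ! so the core can overlap them
       s3 = s3 + a(i+2)
       s4 = s4 + a(i+3)
    end do
    s = s1 + s2 + s3 + s4
    print *, 'sum =', s
  end program ilp_demo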
13 Core details: simultaneous multithreading (SMT). SMT principle (2-way example). (Figure: standard core vs. 2-way SMT core.)
14 Microprocessors Single Instruction Multiple Data (SIMD) a.k.a. vectorization
15 Core details: SIMD processing. Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on wide registers. x86 SIMD instruction sets: SSE: register width = 128 bit (2 double-precision floating-point operands); AVX: register width = 256 bit (4 double-precision floating-point operands); AVX-512: you guessed it! Example: adding two registers holding double-precision floating-point operands. Scalar execution: ADD [R0,R1] -> R2 operates on one 64-bit operand pair; SIMD execution: V64ADD [R0,R1] -> R2 operates on all packed operands (A[0..3] + B[0..3] = C[0..3]) in a single instruction.
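As a concrete illustration, the following Fortran sketch shows a unit-stride double-precision loop that a compiler can map to packed SIMD instructions; with 256-bit AVX registers, four iterations are processed per instruction. The !$omp simd directive (OpenMP 4.0) is one portable way to request vectorization; this is an illustrative example, not taken from the slides.
  program simd_demo
    implicit none
    integer, parameter :: n = 1000
    double precision :: a(n), b(n), c(n)
    integer :: i
    a = 1.d0; b = 2.d0
    !$omp simd
    do i = 1, n
       c(i) = a(i) + b(i)     ! one packed ADD handles 4 elements with 256-bit AVX
    end do
    print *, c(1), c(n)
  end program simd_demo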
16 Microprocessors Memory Hierarchy
17 Von Neumann bottleneck reloaded: the DRAM gap. DP peak performance and peak main memory bandwidth for a single Intel processor (chip) differ by roughly a factor of 10 flops per byte (approx. 10 F/B): main memory access speed is not sufficient to keep the CPU busy. Remedy: introduce fast on-chip caches holding copies of recently used data items.
18 Registers and caches: data transfers in a memory hierarchy. Caches help with getting instructions and data to the CPU fast. How does data travel from memory to the CPU and back? Remember: caches are organized in cache lines (CL, e.g., 64 bytes), and only complete cache lines are transferred between memory hierarchy levels (except to/from registers). MISS: a load or store instruction does not find the data in a cache level, so a CL transfer is required. Example: array copy A(:) = C(:). LD C(1) misses and loads the CL of C into the cache; ST A(1) misses and triggers a write-allocate (the CL of A is loaded before being modified); LD C(2..N_cl) and ST A(2..N_cl) then hit; the modified CL of A is evicted (written back) later. Result: 3 CL transfers per cache line of copied data.
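The 3-CL-transfer result translates directly into extra memory traffic: the code "sees" 16 bytes of useful traffic per copied element, but the write-allocate transfer raises the actual traffic to 24 bytes. A minimal sketch of that bookkeeping (the array length is an arbitrary assumption):
  program copy_traffic
    implicit none
    integer, parameter :: n = 10000000     ! arbitrary array length
    double precision :: useful, actual
    useful = 16.d0 * n      ! 8 B read of C + 8 B write of A per element, as seen by the code
    actual = 24.d0 * n      ! plus 8 B write-allocate read of A per element
    print *, 'useful traffic [GB]:', useful / 1.d9
    print *, 'actual traffic [GB]:', actual / 1.d9
    print *, 'traffic overhead   :', actual / useful
  end program copy_traffic
Where the hardware offers non-temporal ("streaming") stores, the write-allocate transfer can be avoided; that option is not discussed on this slide.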
19 Parallel Computers Terminology
20 From core to cluster. (Figure: hierarchy from core, to multi-core chip/socket, to shared-memory compute node, to distributed-memory cluster connected by a network; T = hardware thread.)
21 Shared-memory parallel computers: cache coherence; multicore-multisocket architecture; ccNUMA memory organization.
22 Multiple cores and caches: cache coherence. Data in a cache is only a copy of data in memory; on multiprocessor systems there can be multiple copies of the same data, and a cache coherence protocol/hardware ensures a consistent data view. Without cache coherence, shared cache lines can become clobbered. Example (cache line size = 2 words; A1 and A2 are in a single CL): P1 loads A1 and writes A1=0 in its cache C1; P2 loads A2 and writes A2=0 in its cache C2. Each cache now holds a stale copy of the other word, so write-back to memory leads to incoherent data; the C1 and C2 entries cannot be merged into a single correct cache line.
23 Multiple caches: cache coherence. The cache coherence protocol must keep track of cache line status. Example (both caches hold the CL containing A1 and A2): P1 writes A1=0: (1) request exclusive access to the CL, (2) invalidate the CL in C2, (3) modify A1 in C1. P2 then loads A2 and writes A2=0: (1) request exclusive CL access, (2) C1 writes the CL back and invalidates it, (3) the CL is loaded into C2, which becomes its exclusive owner, (4) modify A2 in C2.
24 Parallel computers: cache coherence. Cache coherence can cause substantial overhead and may reduce the available bandwidth. Different implementations: snoop (on modifying a CL, a CPU must broadcast its address to the whole system); directory / snoop filter (the chipset or network keeps track of which CLs are where and filters coherence traffic). Directory-based ccNUMA can reduce the pain of the additional coherence traffic. But always take care: multiple processors should never write frequently to the same cache line ("false sharing")!
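A minimal OpenMP sketch of the false-sharing problem and the usual fix: per-thread counters packed contiguously would share a cache line, while padding each counter to a full 64-byte line keeps the updates in separate lines. The thread-count limit and iteration count below are arbitrary assumptions for illustration.
  program false_sharing
    use omp_lib
    implicit none
    integer, parameter :: pad = 8                ! 8 doubles = 64 bytes = one cache line
    integer, parameter :: maxthreads = 16        ! assumed upper limit on the thread count
    double precision :: s(pad, maxthreads)       ! one padded counter per thread
    integer :: i, tid
    s = 0.d0
    !$omp parallel private(tid)
    tid = omp_get_thread_num() + 1
    !$omp do
    do i = 1, 10000000
       s(1, tid) = s(1, tid) + 1.d0              ! each thread stays within its own cache line
    end do
    !$omp end do
    !$omp end parallel
    ! declaring s(maxthreads) without the padding dimension would pack all counters into
    ! one or two cache lines, which would then ping-pong between the writing cores
    print *, 'total increments:', sum(s(1, :))
  end program false_sharing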
25 Cray XC30 "SandyBridge-EP" 8-core dual-socket node: 8 cores per socket running at 2.7 GHz (plus turbo mode); DDR3 memory interface with 4 channels per chip; two-way SMT; two 256-bit SIMD FP units (SSE4.2, AVX); 32 KiB L1 data cache per core; 256 KiB L2 cache per core; 20 MiB L3 cache per chip; ccNUMA memory architecture: memory is physically distributed but logically shared.
26 Interlude: A glance at current accelerator technology NVidia Pascal GP100 vs. Intel Xeon Phi Knights Landing
27 NVidia Pascal GP100 block diagram. Architecture: 15.3 billion transistors; ~1.4 GHz clock speed; up to 60 SM units with 64 (SP) cores each; 5.7 TFlop/s DP peak (2:1 SP:DP performance ratio); 4 MB L2 cache; 4096-bit HBM2 memory interface; memory bandwidth ~732 GB/s (theoretical), ~510 GB/s (measured). (Block diagram: NVIDIA Corp.)
28 Intel Xeon Phi "Knights Landing" block diagram. Architecture: 8 billion transistors; up to 1.5 GHz clock speed; up to 2x36 cores (36 tiles of two cores on a 2D mesh); two 512-bit SIMD units (VPUs) per core; 4-way SMT; per core: 32 KiB L1, per tile: 1 MiB shared L2; 3.5 TFlop/s DP peak (SP: 2x); 36 MiB L2 cache total; 16 GiB on-package MCDRAM with ~470 GB/s measured bandwidth; large DDR4 main memory with ~90 GB/s measured bandwidth.
29 Trading single-thread performance for parallelism: GPGPUs vs. CPUs. GPU vs. CPU light-speed estimate (per device): memory bandwidth ~5-10x, peak ~6-15x.
                          2x Intel Xeon E5-2697v4   Intel Xeon Phi 7250      NVidia Tesla P100
                          ("Broadwell")             ("Knights Landing")      ("Pascal")
  Cores @ clock           2 x 2.3 GHz               1.4 GHz                  56 SMs @ ~1.3 GHz
  SP performance/core     73.6 GFlop/s              89.6 GFlop/s             ~166 GFlop/s
  Threads @ STREAM        ~8                        ~40                      >8000?
  SP peak                 2.6 TFlop/s               6.1 TFlop/s              ~9.3 TFlop/s
  STREAM BW (measured)    2 x 62.5 GB/s             470 GB/s (HBM)           510 GB/s
  Transistors / TDP       ~2 x 7 billion / 2 x 145 W   8 billion / 215 W     14 billion / 300 W
30 Node topology and programming models
31 Parallel programming models: pure MPI. The machine structure is invisible to the user: a very simple programming model; MPI knows what to do!? Performance issues: intranode vs. internode MPI; node/system topology.
32 Parallel programming models are topology-agnostic. Example: pure threading on the node (relevant for this tutorial). The machine structure is invisible to the user: a very simple programming model; the threading software (OpenMP, pthreads, TBB, ...) should know about the details. Performance issues: synchronization overhead; memory access; node topology.
33 Distributed-memory computers & hybrid systems
34 Parallel distributed-memory computers: basics. In a distributed-memory parallel computer, each processor P is connected to exclusive local memory (M) and a network interface (NI); together they form a node. A (dedicated) communication network connects all nodes. Data exchange between nodes: passing messages via the network ("message passing"). Variants: no global (shared) address space, i.e., no remote memory access (NORMA); or a non-coherent shared address space (NUMA), e.g., Cray systems and PGAS languages (CoArray Fortran, UPC). Prototype of the first PC clusters: node = single-CPU PC, network = Ethernet. First massively parallel processing (MPP) architectures: CRAY T3D/E, Intel Paragon.
35 Parallel distributed-memory computers: hybrid systems. The standard concept of most modern large parallel computers is hybrid/hierarchical: each compute node is a 2- or 4-socket shared-memory node with a network interface (NI), and a communication network (GBit Ethernet, InfiniBand, ...) connects the nodes. Price per (peak) performance is optimal, but network capability per (peak) performance gets worse as nodes grow. Parallel programming? Pure message passing is the standard; hybrid programming is an option. Today, GPUs/accelerators are added to the nodes, which further increases complexity.
36 Networks Basic ideas and performance characteristics of modern networks
37 Networks: basic performance characteristics. To evaluate the capability of a network to transfer data, use the same idea as for main memory access. The total transfer time for a message of V bytes is T_comm = lambda + V / b_network, where lambda is the latency (transfer setup time, in seconds) and b_network is the asymptotic (V -> infinity) network bandwidth in MBytes/s. Consider the simplest case ("ping-pong"): two processes in different nodes communicate via the network (point-to-point), and a single message of N bytes is sent forward and backward, so the overall data transfer is 2N bytes.
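The latency/bandwidth model above yields an effective-bandwidth curve B_eff(V) = V / (lambda + V/b_network); half of the asymptotic bandwidth is reached at V_1/2 = lambda * b_network. The sketch below evaluates the model for assumed GigE-like parameters; the 40-microsecond latency and 111 MB/s bandwidth are illustrative values, not measurements.
  program beff_model
    implicit none
    double precision, parameter :: lambda = 40.d-6    ! assumed latency: 40 microseconds
    double precision, parameter :: bnet   = 111.d6    ! assumed asymptotic bandwidth: 111 MB/s
    double precision :: v, beff
    integer :: k
    print *, 'V_1/2 [bytes]:', lambda * bnet
    do k = 1, 7
       v    = 10.d0**k                                ! message size in bytes
       beff = v / (lambda + v / bnet)
       print '(a,es9.2,a,f8.2,a)', 'V =', v, ' bytes   B_eff =', beff / 1.d6, ' MB/s'
    end do
  end program beff_model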
38 Networks: basic performance characteristics. Ping-pong benchmark (schematic view):
  myid = get_process_id()
  if (myid .eq. 0) then
     targetID = 1
     S = get_walltime()
     call Send_message(buffer, N, targetID)
     call Receive_message(buffer, N, targetID)
     E = get_walltime()
     MBYTES = 2*N/(E-S)/1.d6      ! MBytes/sec transfer rate
     TIME   = (E-S)/2*1.d6        ! transfer time in microseconds for a single message
  else
     targetID = 0
     call Receive_message(buffer, N, targetID)
     call Send_message(buffer, N, targetID)
  endif
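For reference, the schematic benchmark above maps almost one-to-one onto real MPI calls. The following sketch uses blocking MPI_Send/MPI_Recv and MPI_Wtime and must be run with at least two processes; the buffer size is an arbitrary example, and a production benchmark would repeat the exchange many times and report the best value.
  program pingpong
    use mpi
    implicit none
    integer, parameter :: n = 1000000          ! message length in doubles (8*n bytes), illustrative
    double precision :: buf(n), s, e
    integer :: rank, ierr, stat(MPI_STATUS_SIZE)
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    buf = 1.d0
    if (rank == 0) then
       s = MPI_Wtime()
       call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
       call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, stat, ierr)
       e = MPI_Wtime()
       print *, 'effective bandwidth [MB/s]:', 2.d0 * 8.d0 * n / (e - s) / 1.d6
    else if (rank == 1) then
       call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, stat, ierr)
       call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
    end if
    call MPI_Finalize(ierr)
  end program pingpong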
39 Networks: basic performance characteristics. Ping-pong benchmark for a GBit Ethernet (GigE) network, reporting B_eff = 2*N/(E-S)/1.d6. N_1/2 is the message size where 50% of the peak bandwidth is achieved. Asymptotic bandwidth: b_network = 111 MB/s on a 1 GBit/s link. Latency (N -> 0): only qualitative agreement between model and measurement (44 µs vs. 76 µs).
40 Networks: basic performance characteristics. Ping-pong benchmark for a DDR InfiniBand (DDR-IB) network: determine b_network and lambda independently and combine them in the model. (Figure: measured effective bandwidth vs. message size compared with the model.)
41 Networks: basic performance characteristics. First-principles modeling of B_eff(V) provides good qualitative results, but the quantitative description, in particular of the latency-dominated region (small V), may fail because: there is overhead for transmission protocols, e.g., message headers; there is a minimum frame size for message transmission, e.g., TCP/IP over Ethernet always transfers frames of at least a minimum size; message setup/initialization involves multiple software layers and protocols, and each software layer adds to the latency (the hardware-only latency is often small); as the message size increases, the software may switch to a different protocol, e.g., from eager to rendezvous. Typical message sizes in applications are neither small nor large. The V_1/2 value is also important: V_1/2 = lambda * b_network. Network balance: relate the network bandwidth (b_network or B_eff(V_1/2)) to the compute power (or main memory bandwidth) of the nodes.
42 Networks: topologies & bisection bandwidth. The network bisection bandwidth B_b is a general metric for the data transfer capability of a system: the minimum sum of the bandwidths of all connections cut when splitting the system into two equal parts. A more meaningful metric in terms of system scalability is the bisection bandwidth per node, B_b/N_nodes. The bisection bandwidth depends on the bandwidth per link and on the network topology; also note whether uni- or bi-directional bandwidth is quoted!
43 Network topologies: bus. A bus can be used by one connection at a time; bandwidth is shared among all devices; the bisection bandwidth is constant, so B_b/N_nodes ~ 1/N_nodes; collision detection and bus arbitration protocols must be in place. Examples: PCI bus, diagnostic buses. Advantages: low latency; easy to implement. Disadvantages: shared bandwidth, not scalable; problems with failure resiliency (one defective agent may block the bus); fast buses for large N require large signal power.
44 Non-blocking crossbar. A non-blocking crossbar can mediate a number of connections between a group of input and a group of output elements; built from 2x2 switching elements, it can be used as a 4-port non-blocking switch (fold at the secondary diagonal). Switches can be cascaded to form hierarchies (the common case). This allows scalable communication at high hardware/energy cost. Crossbars can be used as interconnects in computer systems, e.g., the NEC SX9 vector system ("IXS").
45 Network topologies: switches and fat trees. Standard clusters are built with switched networks: compute nodes ("devices") are split into groups, and each group is connected to a single (non-blocking crossbar) switch ("leaf switch"); the leaf switches are connected with each other via an additional switch hierarchy ("spine switches") or directly (for small configurations). In switched networks, the distance between any two devices is heterogeneous (number of hops in the switch hierarchy). Diameter of the network: the maximum number of hops required to connect two arbitrary devices (e.g., the diameter of a bus is 1). Perfect world: fully non-blocking, i.e., any choice of N_nodes/2 disjoint node (device) pairs can communicate at full speed.
46 Fat-tree switch hierarchies. Fully non-blocking: N_nodes/2 end-to-end connections with full link bandwidth B; B_b = B * N_nodes/2, so B_b/N_nodes = const. = B/2. Sounds good, but see the next slide. Oversubscribed: the spine does not support N_nodes/2 full-bandwidth end-to-end connections; B_b/N_nodes = const. = B/(2k), with oversubscription factor k (e.g., k = 3). Resource management (job placement) is crucial. (Figure: nodes connected to leaf switches, leaf switches connected to spine switches.)
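The oversubscription formula is easy to evaluate: with link bandwidth B, the per-node bisection bandwidth of a two-level fat tree is B/(2k). A tiny sketch with an assumed 100 GBit/s link (EDR-class; illustrative only):
  program fattree_bisection
    implicit none
    double precision, parameter :: B = 100.d0   ! assumed GBit/s per link (illustrative)
    integer :: k
    do k = 1, 3                                  ! k = 1 is the fully non-blocking case
       print '(a,i0,a,f6.1,a)', 'k=', k, ':  B_b/N_nodes = ', B/(2*k), ' GBit/s per node'
    end do
  end program fattree_bisection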
47 Fat trees and static routing. If all end-to-end data paths are preconfigured ("static routing"), not all possible combinations of N agents will get full bandwidth. Example: (1->5, 2->6, 3->7, 4->8) is a collision-free pattern here; changing 2->6, 3->7 into 2->7, 3->6 causes collisions if no other connections are re-routed at the same time. Static routing is the quasi-standard in commodity interconnects; however, things are slowly starting to improve.
48 Full fat tree: a single 288-port IB DDR switch. Basic building blocks: 24-port switches. Spine switch level: 12 switches. Leaf switch level: 24 switches with 24 x 12 = 288 ports to devices.
49 Fat-tree networks: examples. Ethernet: 1, 10, and 100 GBit/s variants. InfiniBand: the dominant high-performance commodity interconnect; DDR: 20 GBit/s per link and direction (building blocks: 24-port switches); QDR: 40 GBit/s per link and direction (building blocks: 36-port switches, large 36 x 18 = 648-port switches), used in RRZE's LiMa and Emmy clusters; FDR-10 / FDR: 40/56 GBit/s per link and direction; EDR: 100 GBit/s per link and direction. Intel Omni-Path: up to 100 GBit/s per link, 48-port baseline switches, used in the RRZE Meggie cluster. Fat trees are expensive and complex to scale to very high node counts.
50 Meshes. Fat trees can become prohibitively expensive in large systems; the compromise is a mesh: n-dimensional hypercubes, toruses (2D/3D), and many others (including hybrids). Each node is a router, with direct connections only between direct neighbors; this is not a non-blocking crossbar! Intelligent resource management and routing algorithms are essential. Example: 2D torus mesh. Toruses are used in very large systems: Cray XE/XK series, IBM Blue Gene. B_b ~ N_nodes^((d-1)/d), so B_b/N_nodes -> 0 for large N_nodes. Sounds bad, but those machines show good scaling for many codes, with well-defined and predictable bandwidth behavior.
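The torus scaling law can be made concrete for a 3D torus of k x k x k nodes: cutting it into two halves severs 2*k*k links (two cut planes because of the wraparound), so B_b ~ N_nodes^(2/3) and the per-node bisection bandwidth shrinks as the machine grows. A small sketch with an assumed per-link bandwidth:
  program torus_bisection
    implicit none
    double precision, parameter :: B = 10.d0    ! assumed GB/s per link (illustrative)
    integer :: k
    do k = 4, 32, 4
       ! 3D torus with k**3 nodes: bisection cuts 2*k*k links
       print '(a,i6,a,f8.2,a)', 'N_nodes=', k**3, '  B_b/N_nodes =', B*2*k*k/dble(k**3), ' GB/s per node'
    end do
  end program torus_bisection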
51 Conclusions about architecture. Modern computer architecture has a rich topology. Node-level hardware parallelism takes many forms: sockets/devices (CPU: 1-4 or more; GPGPU/Phi: 1-6 or more); cores (moderate on CPUs: 4-24, and on Phi: 64-72; much higher on GPGPUs); SIMD (moderate: CPU 2-8, Phi 8-16; massive: GPGPU 10s-100s); superscalarity (CPU/Phi: 2-6). System-level architecture is mostly defined by the network topology (fat tree, torus, ...). Exploiting performance requires parallelism plus bottleneck awareness: high performance computing == computing at a bottleneck. The performance of programs is sensitive to architecture: topology/affinity influences the overheads of popular programming models; programming standards do not contain (many) topology-aware features, although things are slowly improving (MPI 3.0, OpenMP 4.0); apart from overheads, performance features are largely independent of the programming model.