Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks. Multi- and manycore chips and nodes


1 Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks. Multi- and manycore chips and nodes

2 Multi-core today: Intel Xeon 2600 v4 (2016) Xeon E5-2600 v4 "Broadwell EP": Up to 22 cores running at 2.2+ GHz (+ "Turbo Mode": 3.5+ GHz) Simultaneous Multithreading reports as 44-way chip 7.2 billion transistors / 14 nm Die size: 456 mm² 2-socket server Optional: Cluster on Die (CoD) mode 2017: Skylake architecture Mesh instead of ring interconnect CoD → Sub-NUMA clustering Up to 28 cores, 2+ GHz (c) RRZE 2018 Basic Architecture 2

3 A deeper dive into core and chip architecture

4 General-purpose cache-based microprocessor core The modern CPU core implements the stored-program computer concept (Turing 1936) Similar designs on all modern systems (Still) multiple potential bottlenecks Flexible! (c) RRZE 2018 Basic Architecture 4

5 Basic resources on a stored program computer: instruction execution and data movement 1. Instruction execution This is the primary resource of the processor. All efforts in hardware design are targeted towards increasing the instruction throughput. Instructions are the concept of "work" as seen by processor designers. Not all instructions count as work as seen by application developers! Example: Adding two arrays A(:) and B(:) do i=1, N A(i) = A(i) + B(i) enddo Processor work: LOAD r1 = A(i) LOAD r2 = B(i) ADD r1 = r1 + r2 STORE A(i) = r1 INCREMENT i BRANCH top if i<N User work: N Flops (ADDs) (c) RRZE 2018 Basic Architecture 5

6 Basic resources on a stored program computer: instruction execution and data movement 2. Data transfer Data transfers are a consequence of instruction execution and therefore a secondary resource. Maximum bandwidth is determined by the request rate of executed instructions and technical limitations (bus width, speed). Example: Adding two arrays A(:) and B(:) do i=1, N A(i) = A(i) + B(i) enddo Data transfers: 8 byte: LOAD r1 = A(i) 8 byte: LOAD r2 = B(i) 8 byte: STORE A(i) = r1 Sum: 24 byte Crucial question: What is the bottleneck? Data transfer? Code execution? (c) RRZE 2018 Basic Architecture 6

7 From high level code to actual execution for(i=0; i<n; ++i) sum += a[i]; The compiler generates: ..label: addsd xmm1, [rdi+rdx*8] inc rdx cmp rax, rdx jb ..label addsd: add 2nd argument to 1st argument and store result in 1st argument inc: register increment cmp: compare register contents jb: jump to label if loop continues Register use: &a[0] in rdi, sum in xmm1, i in rdx, n in rax (c) RRZE 2018 Basic Architecture 7

8 Architectural features in the (single) core Pipelining: instruction execution in multiple steps (Fetch from L1I → Decode → Execute) Superscalarity: multiple instructions per cycle [Figure: several pipelined Fetch/Decode/Execute instruction streams running in parallel] Single Instruction Multiple Data (SIMD): multiple operations per instruction, e.g., C[0..3] = A[0..3] + B[0..3] in one instruction Simultaneous Multi-Threading: multiple instruction sequences in parallel (c) RRZE 2018 Basic Architecture 8

9 Microprocessors Pipelining

10 Pipelining of arithmetic/functional units Idea: Split complex instruction into several simple / fast steps (stages) Each step takes the same amount of time, e.g., a single cycle Execute different steps on different instructions at the same time (in parallel) Benefits: Core can work on 5 independent instructions simultaneously (for a 5-stage pipeline) One instruction finished each cycle after the pipeline is full Drawbacks: Pipeline must be filled; a large number of independent instructions is required Requires complex instruction scheduling by hardware (out-of-order execution) or compiler (software pipelining) Pipelining is widely used in modern computer architectures (c) RRZE 2018 Basic Architecture 10

11 5-stage Multiplication Pipeline: A(i)=B(i)*C(i); i=1,...,N First result is available after 5 cycles (= latency of pipeline)! Wind-up/-down phases: empty pipeline stages (c) RRZE 2018 Basic Architecture 11

12 Pipelining: The instruction pipeline Besides the arithmetic & functional units, instruction execution itself is also pipelined, e.g., one instruction performs at least 3 steps: Fetch from L1I → Decode instruction → Execute [Figure: staggered Fetch/Decode/Execute stages for instructions 1-4 over time t] Branches can stall this pipeline! (Speculative Execution, Predication) Each unit is pipelined itself (e.g., Execute = Multiply Pipeline) (c) RRZE 2018 Basic Architecture 12

13 Microprocessors Superscalarity and Simultaneous Multithreading

14 Superscalar Processors Instruction Level Parallelism Multiple units enable use of Instruction Level Parallelism (ILP): the instruction stream is parallelized on the fly [Figure: four pipelined Fetch/Decode/Execute streams running in parallel: 4-way superscalar] Example: LOAD STORE MULT ADD Issuing m concurrent instructions per cycle: m-way superscalar Modern processors are 3- to 6-way superscalar & can perform 2 floating point instructions per cycle (c) RRZE 2018 Basic Architecture 14

15 Superscalar processors executing multiple instructions concurrently for(int i=1; i<n; ++i) a[i] = a[i] + s; [Figure: cycle-by-cycle instruction schedule; LOAD latency: 4 cy, ADD latency: 3 cy, STORE latency: 2 cy; loads, adds and stores of successive iterations overlap] Correct interleaving / reordering of the instruction streams: Out-Of-Order (OOO) execution Steady state: 3 instructions/cy ("3-way superscalar execution") Instructions Per Cycle: IPC=3 Cycles Per Instruction: CPI=0.33 (c) RRZE 2018 Basic Architecture 15

16 Core details: Simultaneous multi-threading (SMT) SMT principle (2-way example): 2-way SMT vs. standard core (c) RRZE 2018 Basic Architecture 16

17 Microprocessors Single Instruction Multiple Data (SIMD) a.k.a. vectorization

18 Core details: SIMD processing Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on "wide" registers x86 SIMD instruction sets: SSE: register width = 128 bit → 2 double precision floating point operands AVX: register width = 256 bit → 4 double precision floating point operands AVX-512: you guessed it! Adding two registers holding double precision floating point operands: Scalar execution: R2 ← ADD [R0,R1] processes one 64-bit operand pair, e.g., C[0] = A[0] + B[0] SIMD execution: V64ADD [R0,R1] → R2 processes all four 64-bit operand pairs of the 256-bit registers at once: C[0..3] = A[0..3] + B[0..3] (c) RRZE 2018 Basic Architecture 18
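The 4-wide AVX pattern can be sketched in plain C; a compiler typically generates the actual vector instructions from the simple scalar loop (e.g., with -O3 and a suitable -march flag), so the manual 4-way unrolling below only illustrates what one 256-bit "vector iteration" covers:

```c
/* Scalar vs. SIMD-style execution of C[i] = A[i] + B[i].
   One AVX vector ADD processes 4 doubles; the 4-way unrolled
   loop below mimics that grouping in portable C. */
void vadd_simd_style(double *c, const double *a, const double *b, long n) {
    long i = 0;
    for (; i + 4 <= n; i += 4) {     /* one "vector iteration" = 4 adds */
        c[i]   = a[i]   + b[i];
        c[i+1] = a[i+1] + b[i+1];
        c[i+2] = a[i+2] + b[i+2];
        c[i+3] = a[i+3] + b[i+3];
    }
    for (; i < n; ++i)               /* scalar remainder loop */
        c[i] = a[i] + b[i];
}
```

Note the remainder loop: when n is not a multiple of the SIMD width, the leftover elements must be handled by scalar (or masked) iterations.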

19 There is no single driving force for single core performance! Maximum floating point (FP) performance: P_core = n_super^FP * n_FMA * n_SIMD * f with n_super^FP [inst./cy]: FP superscalarity, n_FMA: FMA factor, n_SIMD [ops/inst.]: SIMD factor, f [Gcy/s]: clock speed. Typical representatives:

Microarchitecture   n_FMA   n_SIMD   Launch    Model
Nehalem             1       2        Q1/2009   X
Westmere            1       2        Q1/2010   X
Sandy Bridge        1       4        Q1/2012   E5
Ivy Bridge          1       4        Q3/2013   E5-2660 v2
Haswell             2       4        Q3/2014   E5-2695 v3
Broadwell           2       4        Q1/2016   E5-2699 v4
Skylake             2       8        Q3/2017   Gold
IBM POWER8          2       2        Q2/2014   S822LC

(c) RRZE 2018 Basic Architecture 19

20 Microprocessors Memory Hierarchy

21 Von Neumann bottleneck reloaded: the DRAM gap DP peak performance and peak main memory bandwidth for a single Intel processor (chip): approx. 15 F/B Main memory access speed not sufficient to keep the CPU busy Recently: mainly driven by SIMD (and FMA) Remedy: introduce fast on-chip caches, holding copies of recently used data items (c) RRZE 2018 Basic Architecture 21

22 Registers and caches: Data transfers in a memory hierarchy Caches help with getting instructions and data to the CPU fast How does data travel from memory to the CPU and back? Remember: Caches are organized in cache lines (CL, e.g., 64 bytes) Only complete cache lines are transferred between memory hierarchy levels (except registers) MISS: Load or store instruction does not find the data in a cache level → CL transfer required Example: Array copy A(:)=C(:) LD C(1): MISS → CL transfer of C's line ST A(1): MISS → write-allocate CL transfer of A's line LD C(2..N_cl) and ST A(2..N_cl): HIT evict A's modified CL (delayed) → 3 CL transfers per cache line (c) RRZE 2018 Basic Architecture 23

23 New kid on the block: AMD Epyc AMD Epyc 24-core processor ("Naples") 24 cores per socket 4 chips w/ 6 cores each ("Zeppelin" die) 3 cores share 8 MB L3 ("Core Complex", CCX) DDR4-2666 memory interface with 2 channels per chip MemBW per node: 16 ch x 8 byte x 2.666 GHz = 341 GB/s Two-way SMT Two 256-bit (actually 4x 128-bit) SIMD FP units AVX2, 8 flops/cycle 32 KiB L1 data cache per core 512 KiB L2 cache per core 2x 8 MiB L3 cache per chip 64 MiB L3 cache per socket ccNUMA memory architecture Infinity Fabric between CCXs and between chips (c) RRZE 2018 Basic Architecture 28

24 Interlude: A glance at current accelerator technology NVidia Pascal GP100 vs. Intel Xeon Phi Knights Landing vs. NEC SX-Aurora Tsubasa

25 NVidia Pascal GP100 block diagram Architecture 15.3 B transistors ~1.4 GHz clock speed Up to 60 SM units 64 SP "cores" each 32 DP "cores" each 2:1 SP:DP performance 5.7 TFlop/s DP peak 4 MB L2 Cache 4096-bit HBM2 MemBW ~732 GB/s (theoretical) MemBW ~510 GB/s (measured) (c) RRZE 2018 Basic Architecture NVIDIA Corp. 30

26 Intel Xeon Phi Knights Landing block diagram [Block diagram: 36 tiles (72 cores) max. on a 2D mesh; per tile: 2 cores (P) with 4 threads (T) each, 2 VPUs per core, 32 KiB L1 per core, 1 MiB L2 shared per tile; 8 MCDRAM devices, 6 DDR4 channels] Architecture 8 B transistors Up to 1.5 GHz clock speed Up to 36x2 cores (2D mesh) 2x 512-bit SIMD units (VPU) each 4-way SMT 3.5 TFlop/s DP peak (SP 2x) 36 MiB L2 Cache 16 GiB MCDRAM MemBW ~470 GB/s (measured) Large DDR4 main memory MemBW ~90 GB/s (measured) (c) RRZE 2018 Basic Architecture 31

27 Trading single thread performance for parallelism: GPGPUs vs. CPUs GPU vs. CPU light speed estimate (per device): MemBW ~5-10x, Peak ~5-10x

                      2x Intel Xeon E5-2697v4   Intel Xeon Phi 7250   NVidia Tesla P100
                      Broadwell                 Knights Landing       Pascal
Cores@Clock           2x 18 @ 2.3 GHz           68 @ 1.4 GHz          56 @ ~1.3 GHz
SP Performance/core   73.6 GFlop/s              89.6 GFlop/s          ~166 GFlop/s
Threads@STREAM        ~8                        ~50                   >
SP peak               2.6 TFlop/s               6.1 TFlop/s           ~9.3 TFlop/s
Stream BW (meas.)     2x 62.5 GB/s              450 GB/s (MCDRAM)     510 GB/s
Transistors / TDP     ~2x7 Billion / 2x145 W    8 Billion / 215 W     14 Billion / 300 W

(c) RRZE 2018 Basic Architecture 32

28 SX-Aurora TSUBASA Architecture May 30th, 2018 Shintaro Momose NEC Deutschland GmbH The information in this material is public; it may be reused in other material without NDA by citing NEC as the source. Copyright NEC, all rights reserved

29 SIMD vs. vector processing [Figure: scalar, SIMD, and SX vector pipelines processing input to result] SX vector processing is more efficient than (short-)SIMD; SX is a "SIMD-vector" architecture

30 SX-Aurora TSUBASA 2018 Technology: 16 nm FinFET CPU Frequency: 1.4/1.6 GHz CPU Performance: 2150/2457 Gflops CPU Memory Bandwidth: 1228 GB/sec

31 Vector Engine Processor [Block diagram: 8 cores, two 8 MB LLC blocks, six HBM2 stacks attached via HBM I/F, 2D-mesh interconnect] Memory Subsystem Memory Bandwidth: 1.22 TB/s LLC Bandwidth: 3.0 TB/s Bandwidth/core: 400 GB/s LLC/Core: 2 MB Core Vector Length = 256 words (256 x 64 bit = 16 kb/instruction) 307.2 GF @ 1.6 GHz 268.8 GF @ 1.4 GHz Processor 8 cores 2.45 TF @ 1.6 GHz 2.15 TF @ 1.4 GHz Memory bandwidth: 1.22 TB/s

32 SPU (Scalar Processing Unit) Memory bandwidth: 1.22 TB/s per processor (avg. 150 GB/s per core) LLC bandwidth: 400 GB/s per core [Figure: single vector core with SPU and the hierarchy of registers/LLC/memory]

33 Vector Execution [Figure: a vector register holds 256 elements (256e x 64 bit); 64 such registers (128 kB total); 3 FMA units, each 32 elements wide] Vector Length = 256e (32e x 8 cycles) 307.2 GF = 2 Flops (FMA) x 3 FMA units x 32 x 1.6 Gcy/s

34 Node topology and programming models

35 Parallelism in a modern compute node Parallel and shared resources within a shared-memory node [Figure: two sockets with cores, caches and memory, intersocket link, PCIe links to two GPUs and other I/O] Parallel resources: 1 Execution/SIMD units 2 Cores 3 Inner cache levels 4 Sockets / ccNUMA domains 5 Multiple accelerators Shared resources: 6 Outer cache level per socket 7 Memory bus per socket 8 Intersocket link 9 PCIe bus(es) 10 Other I/O resources How does your application react to all of those details? (c) RRZE 2018 Basic Architecture 42

36 Scalable and saturating behavior Clearly distinguish between saturating and scalable performance on the chip level: shared resources may show saturating performance; parallel resources show scalable performance (c) RRZE 2018 Basic Architecture 43

37 Parallel programming models: Pure MPI Machine structure is invisible to the user: Very simple programming model "MPI knows what to do"!? Performance issues: Intranode vs. internode MPI Node/system topology (c) RRZE 2018 Basic Architecture 45

38 Parallel programming models are topology-agnostic: Example: Pure threading on the node (relevant for this tutorial) Machine structure is invisible to the user: Very simple programming model Threading SW (OpenMP, pthreads, TBB, ...) should know about the details, but doesn't Performance issues: Synchronization overhead Memory access Node topology (c) RRZE 2018 Basic Architecture 46

39 Conclusions about architecture Modern computer architecture has a rich "topology" Node-level hardware parallelism takes many forms: Sockets/devices CPU: 1-4 or more, GPGPU/Phi: 1-6 or more Cores moderate (CPU: 4-24, Phi: 64-72) SIMD moderate (CPU: 2-8, Phi: 8-16) to massive (GPGPU: 10's-100's) Superscalarity (CPU/Phi: 2-6) Exploiting performance: parallelism + bottleneck awareness "High Performance Computing == computing at a bottleneck" Performance of programs is sensitive to architecture: Topology/affinity influences overheads of popular programming models Standards do not contain (many) topology-aware features Things are starting to improve slowly (MPI 3.0, OpenMP 4.0) Apart from overheads, performance features are largely independent of the programming model (c) RRZE 2018 Basic Architecture 47


More information

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. phi 1 Manycore Processors phi 1 Definition Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. Manycore Accelerator: [Definition only for this

More information

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017 Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference London, 2017 Agenda Vectorization is becoming more and more important What is

More information

Bei Wang, Dmitry Prohorov and Carlos Rosales

Bei Wang, Dmitry Prohorov and Carlos Rosales Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512

More information

Programming Techniques for Supercomputers: Modern processors. Architecture of the memory hierarchy

Programming Techniques for Supercomputers: Modern processors. Architecture of the memory hierarchy Programming Techniques for Supercomputers: Modern processors Architecture of the memory hierarchy Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), Dr. M. Wittmann (a) (a) HPC Services Regionales Rechenzentrum

More information

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

Tools and techniques for optimization and debugging. Fabio Affinito October 2015

Tools and techniques for optimization and debugging. Fabio Affinito October 2015 Tools and techniques for optimization and debugging Fabio Affinito October 2015 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object, made of different

More information

CS 152, Spring 2011 Section 10

CS 152, Spring 2011 Section 10 CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past,

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

Benchmark results on Knight Landing (KNL) architecture

Benchmark results on Knight Landing (KNL) architecture Benchmark results on Knight Landing (KNL) architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Roma 23/10/2017 KNL, BDW, SKL A1 BDW A2 KNL A3 SKL cores per node 2 x 18 @2.3

More information

Trends in systems and how to get efficient performance

Trends in systems and how to get efficient performance Trends in systems and how to get efficient performance Martin Hilgeman HPC Consultant martin.hilgeman@dell.com The landscape is changing We are no longer in the general purpose era the argument of tuning

More information

Scaling Throughput Processors for Machine Intelligence

Scaling Throughput Processors for Machine Intelligence Scaling Throughput Processors for Machine Intelligence ScaledML Stanford 24-Mar-18 simon@graphcore.ai 1 MI The impact on humanity of harnessing machine intelligence will be greater than the impact of harnessing

More information

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra

Computer Systems. Binary Representation. Binary Representation. Logical Computation: Boolean Algebra Binary Representation Computer Systems Information is represented as a sequence of binary digits: Bits What the actual bits represent depends on the context: Seminar 3 Numerical value (integer, floating

More information

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level

More information

INF5063: Programming heterogeneous multi-core processors Introduction

INF5063: Programming heterogeneous multi-core processors Introduction INF5063: Programming heterogeneous multi-core processors Introduction Håkon Kvale Stensland August 19 th, 2012 INF5063 Overview Course topic and scope Background for the use and parallel processing using

More information

Basics of performance modeling for numerical applications: Roofline model and beyond

Basics of performance modeling for numerical applications: Roofline model and beyond Basics of performance modeling for numerical applications: Roofline model and beyond Georg Hager, Jan Treibig, Gerhard Wellein SPPEXA PhD Seminar RRZE April 30, 2014 Prelude: Scalability 4 the win! Scalability

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming

More information

Thread and Data parallelism in CPUs - will GPUs become obsolete?

Thread and Data parallelism in CPUs - will GPUs become obsolete? Thread and Data parallelism in CPUs - will GPUs become obsolete? USP, Sao Paulo 25/03/11 Carsten Trinitis Carsten.Trinitis@tum.de Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR) Institut für

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know. Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs

More information

EPYC VIDEO CUG 2018 MAY 2018

EPYC VIDEO CUG 2018 MAY 2018 AMD UPDATE CUG 2018 EPYC VIDEO CRAY AND AMD PAST SUCCESS IN HPC AMD IN TOP500 LIST 2002 TO 2011 2011 - AMD IN FASTEST MACHINES IN 11 COUNTRIES ZEN A FRESH APPROACH Designed from the Ground up for Optimal

More information

Tools and techniques for optimization and debugging. Andrew Emerson, Fabio Affinito November 2017

Tools and techniques for optimization and debugging. Andrew Emerson, Fabio Affinito November 2017 Tools and techniques for optimization and debugging Andrew Emerson, Fabio Affinito November 2017 Fundamentals of computer architecture Serial architectures Introducing the CPU It s a complex, modular object,

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information

Introducing Sandy Bridge

Introducing Sandy Bridge Introducing Sandy Bridge Bob Valentine Senior Principal Engineer 1 Sandy Bridge - Intel Next Generation Microarchitecture Sandy Bridge: Overview Integrates CPU, Graphics, MC, PCI Express* On Single Chip

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Parallelism, Multicore, and Synchronization

Parallelism, Multicore, and Synchronization Parallelism, Multicore, and Synchronization Hakim Weatherspoon CS 3410 Computer Science Cornell University [Weatherspoon, Bala, Bracy, McKee, and Sirer, Roth, Martin] xkcd/619 3 Big Picture: Multicore

More information

Introduction to tuning on KNL platforms

Introduction to tuning on KNL platforms Introduction to tuning on KNL platforms Gilles Gouaillardet RIST gilles@rist.or.jp 1 Agenda Why do we need many core platforms? KNL architecture Post-K overview Single-thread optimization Parallelization

More information

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 18 Multicore Computers William Stallings Computer Organization and Architecture 8 th Edition Chapter 18 Multicore Computers Hardware Performance Issues Microprocessors have seen an exponential increase in performance Improved

More information

Parallel Computing. November 20, W.Homberg

Parallel Computing. November 20, W.Homberg Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better

More information

Introduction to tuning on KNL platforms

Introduction to tuning on KNL platforms Introduction to tuning on KNL platforms Gilles Gouaillardet RIST gilles@rist.or.jp 1 Agenda Why do we need many core platforms? KNL architecture Single-thread optimization Parallelization Common pitfalls

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information