IBM Research: Accelerator Technologies in HPC and Cognitive Computing

1 MaRS Workshop, EuroSys 2017, Belgrade, April 23, 2017
IBM Research: Accelerator Technologies in HPC and Cognitive Computing
Christoph Hagleitner, IBM Research - Zurich Lab

2 IBM Research - Zurich
- Established in 1956
- 45+ different nationalities
- Two Nobel Prizes:
  - 1986: Nobel Prize in Physics for the invention of the scanning tunneling microscope by Heinrich Rohrer and Gerd K. Binnig
  - 1987: Nobel Prize in Physics for the discovery of high-temperature superconductivity by K. Alex Müller and J. Georg Bednorz
- Binnig and Rohrer Nanotechnology Centre opened in 2011 (public-private partnership with ETH Zürich and EMPA)
- Open collaboration:
  - Horizon 2020: 28 funded projects and 170+ partners
  - 9 European Research Council grants
- IBM Research THINK Lab Zurich (Client Center)

3 Outline
- Towards exascale computing
- OpenPOWER Foundation
- IBM Systems
- POWER8
- Accelerators
- POWER9
- Accelerator IBM Research - Zurich
- CAPI-attached accelerators
- Near-memory acceleration
- DSS
- Hyperscale FPGAs

4 Towards Exascale: FLOPS
- Increasing gap between performance and power efficiency
- Innovation metric measures the (relative) increase in performance x the increase in power efficiency
- Heterogeneous systems (Cell processor, GPUs)
- Dense systems (Blue Gene/L, TaihuLight)
- Diminishing performance / power-efficiency gains from technology scaling -> heterogeneous systems
[Figure: Performance (Petaflops), Power efficiency (Gigaflops/W), Innovation]
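The innovation metric on this slide can be sketched as a product of two ratios. This is a minimal reading of "increase in performance x increase in power efficiency"; the slide does not give the exact formula, so the function name `innovation` and the sample numbers below are assumptions for illustration only.

```python
def innovation(perf_old, perf_new, eff_old, eff_new):
    """Assumed reading of the metric: relative performance gain
    multiplied by relative power-efficiency gain."""
    perf_gain = perf_new / perf_old   # e.g. petaflops ratio
    eff_gain = eff_new / eff_old      # e.g. gigaflops/W ratio
    return perf_gain * eff_gain


# Hypothetical example: performance doubles (10 -> 20 PFLOPS) while
# efficiency improves by 1.5x (2 -> 3 GFLOPS/W).
score = innovation(10.0, 20.0, 2.0, 3.0)
print(score)
```

Under this reading, a system that only scales performance by adding power scores lower than one that improves both axes, which is the gap the slide points at.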


5 Towards Exascale: Applications
- National labs with monolithic applications drive the HPC roadmap in the US
- Europe is different, with a diverse user / application space
- Convergence of data science and HPC, e.g., cognitive computing:
  - Information extraction
  - Build knowledge
  - Query, act on knowledge base
  - Enhance knowledge

6 Heterogeneous Exascale
Disaggregation
- Hadoop-style workloads... scale-out via network
- Main metrics:
  - cost (capital, energy)
  - compute density
  - scalability
- Node level (CPU / FPGA / NVMe plus compute)
Fat Nodes
- Complex HPC-like workloads... scale-up via high-speed buses
- Main metrics:
  - memory / accelerator / inter-node BW
  - optimal mix of heterogeneous resources (CPU / GPU / FPGA / HBM / DRAM / NVMe)
  - compute density, scalability
- Heterogeneity within nodes
- Data-centric design

7 Outline
- OpenPOWER Foundation
- IBM Systems
- POWER8
- Accelerators
- POWER9

8 OpenPOWER: Five Founding Members in

9 The OpenPOWER Foundation: 230+ Members & Growing

10 OpenPOWER: Endorsing the Strategy

11 OpenPOWER: Going Global

12 OpenPOWER Software Support
- Standard compilers: GCC 4.8.5, MPICH 3.0.4, CUDA 8.0
- AT9.0.3 compilers: GCC 5.3.1, Python 3.4, and more, optimized for POWER
- AT compilers: GCC 6.2.1, Python 3.5, and more, optimized for POWER
- Optimized libraries: MASS (math functions), ESSL (BLAS), and MPI


13 OpenPOWER Roadmap: IBM LC-line
- Mellanox interconnect technology: Connect-IB (FDR InfiniBand, PCIe Gen3); ConnectX-4 (EDR InfiniBand, CAPI over PCIe Gen3); ConnectX-5 (Next-Gen InfiniBand, Enhanced CAPI over PCIe Gen4)
- NVIDIA GPUs: Kepler (PCIe Gen3); Pascal (NVLink); Volta (NVLink Next Gen)
- IBM CPUs: POWER8 (OpenPOWER CAPI interface); POWER8 with NVLink (acceleration: NVLink 1.0, CAPI 1.0, PCIe Gen3); POWER9 (acceleration: CAPI 2.0, NVLink 2.0, OpenCAPI 3.0, PCIe Gen)
- IBM nodes

14 IBM: The LC-line

15 Minsky: The System Architecture

16 S822LC for High Performance Computing (aka Minsky)

17 POWER8+ Processor
- Up to 12 cores (SMT8)
- 8 dispatch, 10 issue, 16 exec pipes
- 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
- 64K data cache, 32K instruction cache
- New NVLink for Minsky s

18 POWER8 Caches
- L2: 1 MB, 8-way, per core
- L3: 96 MB (12 x 8 MB 8-way bank)
- L4: 128 MB (on Centaur)
- NUCA cache policy (Non-Uniform Cache Architecture)
- Cache bandwidth: 4 TB/s L2 BW, 3 TB/s L3 BW

19 POWER8 Memory System
- POWER8 processor: 8 high-speed channels, 230 GB/s sustained memory BW
- 32 total DDR ports yielding 410 GB/s peak at the DRAM
- 1 TB memory capacity per fully configured processor socket
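The bandwidth figures on this slide can be broken down per channel and per DDR port with simple arithmetic. The sketch below uses only the numbers stated on the slide; the helper name `per_unit` and the derived ratio are illustrative, and the slide itself does not state these per-unit values.

```python
def per_unit(total_gb_s, units):
    """Evenly divide an aggregate bandwidth across identical units."""
    return total_gb_s / units


SUSTAINED_GB_S = 230.0   # sustained memory BW per socket (from the slide)
PEAK_DRAM_GB_S = 410.0   # peak BW at the DRAM (from the slide)
CHANNELS = 8             # high-speed channels
DDR_PORTS = 32           # total DDR ports

per_channel = per_unit(SUSTAINED_GB_S, CHANNELS)   # 28.75 GB/s
per_port = per_unit(PEAK_DRAM_GB_S, DDR_PORTS)     # 12.8125 GB/s
sustained_fraction = SUSTAINED_GB_S / PEAK_DRAM_GB_S

print(per_channel, per_port, round(sustained_fraction, 2))
```

The per-port value of roughly 12.8 GB/s is what a 64-bit DDR interface at 1600 MT/s delivers, so the 410 GB/s peak is consistent with 32 such ports.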

20 Accelerator Interfaces: POWER8

21 CAPI... Coherent Accelerator Processor Interface
Standard I/O model flow: DD Call -> Copy/Pin -> MMIO Notify -> Accelerate -> Poll / Int -> Copy/Unpin -> Return DD
Flow with a coherent model: Shared Mem. Notify Accelerator -> Accelerate -> Shared Memory Completion
[Diagram: POWER8 processor with CAPP, connected via PCIe to a CAPI FPGA containing the POWER Service Layer and AFU 0, AFU 1, AFU 2, ... AFU n]
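The difference between the two flows can be illustrated with a toy model. This is only a sketch: plain Python objects stand in for the device driver, pinned buffers, and the accelerator, and none of the function names below correspond to a real CAPI or driver API. It shows that the standard I/O model stages data through a separate buffer, while the coherent model lets the accelerator work on application memory directly.

```python
def accelerate(buf):
    """Hypothetical accelerator function: increments every byte in place."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256


def standard_io_flow(app_data):
    """Standard I/O model: data is copied into and out of a pinned buffer."""
    steps = ["DD call"]
    pinned = bytearray(app_data)          # copy source data (stand-in for copy/pin)
    steps.append("copy/pin")
    steps.append("MMIO notify")
    accelerate(pinned)                    # accelerator sees only the pinned copy
    steps.append("accelerate")
    steps.append("poll / interrupt")
    app_data[:] = pinned                  # copy results back to the application
    steps.append("copy/unpin")
    steps.append("return from DD")
    return steps


def coherent_flow(app_data):
    """Coherent model: the accelerator operates on shared application memory."""
    steps = ["shared-memory notify"]
    accelerate(app_data)                  # no staging copies
    steps.append("accelerate")
    steps.append("shared-memory completion")
    return steps


a = bytearray([1, 2, 3])
b = bytearray([1, 2, 3])
print(len(standard_io_flow(a)), len(coherent_flow(b)), a == b)
```

Both paths produce the same result; the coherent path simply skips the driver call and the two copies, which is where the latency advantage claimed for CAPI comes from.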

22 Accelerator cards announced at OpenPOWER Summit in April
Nallatech team explaining CAPI Flash card:
4/23/2017 IBM Research - Zurich Lab, hle@zurich.ibm.com

23 Alpha Data FPGA CAPI
- ADM-PCIE-7V3
- ADM-PCIE-KU3
- ADM-PCIE-8K5

24 Integrated CAPI Flash
Form factor & attributes
- Standard PCI card, single wide
- Four M.2 NVMe connectors for flash sticks
Systems supported
- Up to 4 cards per Tuleta L
- Up to 2 cards per Firestone LC
Memory
- Up to 4 NVMe sticks of 1 TB (2 supported for 1st GA; 2 TB NVMe sticks in the future)
- Sticks are features; MES adds & upgrades
- 4 GB of on-card DRAM
Firmware / hypervisor / OS environments
- Same as SureLock
- Linux options, no AIX support:
  1) Ubuntu 16.04 (GA 8/26/16)
  2) Red Hat 7.3 (GA 12/06/16)
Performance (per NVMe card): M.2 NVMe specifications (Samsung PM963)
- 1600 MB/s sequential read
- 1200 MB/s sequential write
- 380K random read IOPS
- 35K random write IOPS
Card aggregation
- Application controlled; can use multiple cards as one database or multiple
Integrated flash configuration
- Power S822L / S812L / S822LC

25 Outline
- Accelerator IBM Research - Zurich
- CAPI-attached accelerators
- Near-memory acceleration
- DSS
- Hyperscale FPGAs

26 Heterogeneous Exascale
Fat Nodes
- Complex HPC-like workloads... scale-up via high-speed buses
- Main metrics:
  - memory / accelerator / inter-node BW
  - optimal mix of heterogeneous resources (CPU / GPU / FPGA / HBM / DRAM / NVMe)
  - compute density, scalability
- Heterogeneity within nodes
- Data-centric design

27 Accelerated Fast Fourier Transform Library
FFTs are widely used in cognitive computing...
- Data preparation: spectral analysis, filter banks
- Data compression: MP3, JPEG
- ML: convolutional neural networks [1]
- HPC: partial differential equations, mathematical finance
Common FFT libraries (FFTW, ESSL, MKL, ...)
[1] Mathieu, Henaff, LeCun. Fast Training of Convolutional Networks through FFTs. ICLR 2014
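To make the convolution use case concrete, the following is a minimal pure-Python sketch of a radix-2 FFT and of circular convolution computed through it. It is not the accelerated library from the talk and not how FFTW, ESSL, or MKL are implemented; the function names are chosen here for illustration, and a textbook recursive Cooley-Tukey algorithm is used in place of any optimized kernel.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT (unnormalized).
    len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd half (butterfly step).
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT with 1/n normalization."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]


def circular_convolve_fft(a, b):
    """Circular convolution via the convolution theorem."""
    fa, fb = fft(a), fft(b)
    return ifft([p * q for p, q in zip(fa, fb)])


def circular_convolve_direct(a, b):
    """Reference O(n^2) circular convolution."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


signal = [1, 2, 3, 4, 0, 0, 0, 0]
kernel = [1, 1, 0, 0, 0, 0, 0, 0]
via_fft = [round(v.real, 6) for v in circular_convolve_fft(signal, kernel)]
print(via_fft)
```

The FFT path costs O(n log n) instead of O(n^2), which is why convolution-heavy workloads such as the CNN training cited in [1] are candidates for FFT offload.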

28 FFTW on Heterogeneous Compute Nodes

29 Latency...
- for a single CAPI FFT call is 10% higher than CPU (can be improved, as the AFU is bandwidth optimized)
- 4x better compared to a PCIe version using OpenCL
[Figure: runtime in microseconds for one 4k-input complex FFT from cache, split into compute and copy. Labels and values as extracted: CPU 80; FPGA using CAPI 89; FPGA using PCIe (OpenCL) 124; NVIDIA K80 using cuFFT]

30 Performance & Energy Efficiency
Test case: compute 100 rounds of subsequent 4k-point FFTs in complex single-precision float (1 GB input samples per round)
a) 1 core W = 0.21 GFLOP/W
b) 12 cores 1) W = 0.31 GFLOP/W
c) 12 cores 2) W = 0.12 GFLOP/W
d) 1 AFU W = 3.37 GFLOP/W
e) 1 GPU 3) W = 0.29 GFLOP/W
1) 12 threads, SMT1, DVFS off
2) 96 threads, SMT8, DVFS on
3) NVIDIA K40, CUDA 7.5
Result: one AFU is 2.2x faster and 16x more energy efficient than one core
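The 16x energy-efficiency result can be checked from the GFLOP/W figures listed on the slide. The sketch below also estimates the size of the test case under two assumptions that the slide does not state: the common 5 N log2 N operation-count convention for a complex FFT, and 1 GB taken as 2^30 bytes with 8 bytes per complex single-precision sample.

```python
import math

# GFLOP/W values as listed on the slide.
EFFICIENCY = {
    "1 core": 0.21,
    "12 cores (SMT1, DVFS off)": 0.31,
    "12 cores (SMT8, DVFS on)": 0.12,
    "1 AFU": 3.37,
    "1 GPU (K40)": 0.29,
}


def relative_efficiency(table, baseline):
    """Efficiency of each configuration relative to the baseline entry."""
    base = table[baseline]
    return {name: value / base for name, value in table.items()}


def fft_flops(n):
    """Assumed convention: 5 * N * log2(N) floating-point operations."""
    return 5 * n * math.log2(n)


def ffts_per_round(bytes_per_round, n, bytes_per_sample=8):
    """Number of n-point FFTs that fit in one round of input data."""
    return bytes_per_round // (n * bytes_per_sample)


rel = relative_efficiency(EFFICIENCY, "1 core")
print(round(rel["1 AFU"]))                # AFU vs. one core
print(fft_flops(4096))                    # operations per 4k-point FFT
print(ffts_per_round(2 ** 30, 4096))      # FFTs per 1 GB round
```

The ratio 3.37 / 0.21 rounds to 16, which matches the stated result; the same table shows the AFU ahead of the 12-core and GPU configurations on this metric as well.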

31 Outline
- Near-memory acceleration
- DSS
- Hyperscale FPGAs


32 Integrating Near-data Processing in a (POWER) Server
- Enabling near-data processing capabilities in an existing CPU architecture while being minimally invasive
- Ability to implement a wide range of near-data processing functionality, from optimized fixed-function hardware to a multiprocessor SoC
- Dereferencing all virtual pointers of the host process on the NDP, coherent with the CPU's view of the memory

33 Heterogeneous Nodes: POWER8 Accelerator Interfaces

34 Near-Memory Acceleration on ConTutto

35 Near-Memory Acceleration on ConTutto


36 Near-memory Acceleration
- Big-data analytics, neural networks, cognitive computing, graph algorithms, ... benefit from low latency, small access granularity, and large memories
- Memory performance and power depend on a complex interaction between workload and memory system:
  - locality of reference, access patterns/strides, ...
  - cache size, associativity, replacement policy, ...
  - bank interleaving, refresh, row buffer hits, ...
- Current systems use bare-metal programming to adapt the workload to the memory system
- The memory system should be programmable / adaptive
- It must integrate programmable compute capabilities to achieve substantial performance & power gains for a wide range of workloads

37 Boosting Irregular Applications: Graph500
- A system simulator capable of both functional verification and performance estimation was developed; the benchmark results were obtained on it
- The Graph500 benchmark benefits from low latency and small access granularity:
  - NDP cores four times slower than the CPU cores outperform them for large problems
  - the NDPs show much better bandwidth utilization due to the small access granularity
[Figure: speedup (primary axis) and bytes used per byte fetched from DRAM (secondary axis, 0% to 50%) versus Graph500 scale. Series as extracted: 4 core CPU, 8 core CPU, 4 core NDP, core NDP, 4 core CPU (sec. axis), 4 core NDP (sec. axis)]
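The "bytes used per bytes fetched" metric from the figure can be illustrated with a small model of access granularity. The element size and the granularities below are hypothetical values chosen for illustration; they are not taken from the simulated system in the talk.

```python
def bytes_fetched(bytes_needed, granularity):
    """Bytes transferred from DRAM when every access is rounded up
    to whole transfers of the given granularity."""
    transfers = -(-bytes_needed // granularity)   # ceiling division
    return transfers * granularity


def utilization(bytes_needed, granularity):
    """Fraction of fetched bytes that the workload actually uses."""
    return bytes_needed / bytes_fetched(bytes_needed, granularity)


# Hypothetical irregular access: one 8-byte element per random location.
ELEMENT_BYTES = 8
for granularity in (128, 32, 8):
    share = utilization(ELEMENT_BYTES, granularity)
    print(f"{granularity:>3}-byte accesses: {share:.2%} of fetched bytes used")
```

With random 8-byte reads, a large transfer size wastes most of the fetched data, while a small access granularity close to the memory keeps the utilization high. This is the effect the slide attributes to the NDP cores.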

38 Outline
- DSS

39 Dense Memory (remote access) Prototype
- Dense Memory (DM) integration software stack available
  - byte-addressable, distributed, globally accessible DM resource
  - exports an industry-standard asynchronous RDMA API for DM read and write access
- Implements efficient local and remote DM access
  - zero-copy local access via direct DMA: device - application buffer
  - zero-copy remote access via IB RDMA: remote host - application buffer
- Performance measurements
  - local DM access at NVMe device performance limits (3.5 GB/s read, 1.8 GB/s write of 4k buffers)
  - remote DM access at network (100 Gb/s InfiniBand) and device limits: 12.5 GB/s distributed DM random read with 4 storage nodes, each equipped with one NVMe SSD
  - close to 900k IOPS for single-device short sequential read/write operations
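The performance figures on this slide are consistent with a simple bottleneck calculation, sketched below. The inputs are the numbers stated on the slide; treating "4k buffers" as 4096 bytes and ignoring protocol overhead are assumptions made here for illustration.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert a link rate in Gb/s to GB/s (8 bits per byte, no overhead)."""
    return gbit_per_s / 8


def iops_at_bandwidth(gb_per_s, buffer_bytes):
    """Operations per second that saturate a given bandwidth."""
    return gb_per_s * 1e9 / buffer_bytes


NETWORK_GBIT_S = 100      # InfiniBand link rate (from the slide)
DEVICE_READ_GB_S = 3.5    # local NVMe read limit (from the slide)
STORAGE_NODES = 4         # one NVMe SSD each (from the slide)
BUFFER_BYTES = 4096       # assumed size of a "4k" buffer

network_limit = gbit_to_gbyte(NETWORK_GBIT_S)          # 12.5 GB/s
device_limit = STORAGE_NODES * DEVICE_READ_GB_S        # 14.0 GB/s
remote_read_limit = min(network_limit, device_limit)   # the network is the bottleneck
read_iops = iops_at_bandwidth(DEVICE_READ_GB_S, BUFFER_BYTES)

print(network_limit, device_limit, remote_read_limit, int(read_iops))
```

Four devices could together deliver about 14 GB/s, so the measured 12.5 GB/s matches the 100 Gb/s link rate rather than the device limit. Likewise, 3.5 GB/s of 4096-byte reads corresponds to roughly 850k operations per second, in line with the "close to 900k IOPS" figure.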

40 Flexible DSS Configuration
- Mix of local and shared resources
- Multiple shared DM partitions possible

41 Dense Storage: Software Components
- 3 kernel modules: dsa.ko, sal.ko, sal_blkdev.ko
- DSS GSL part of SAL
- 1 user library: libdsa
- 1 user-level daemon: dssd

42 Outline
- Hyperscale FPGAs

43 Heterogeneous Exascale
Disaggregation
- Hadoop-style workloads... scale-out via network
- Main metrics:
  - cost (capital, energy)
  - compute density
  - scalability
- Node level (CPU / FPGA / NVMe plus compute)


44 ZRL Dome microserver of Hyperscale DCs
- Cloud economics
  - density (>1000 nodes / rack)
  - integrated NICs
  - switch card (backplane, no cables)
  - medium- to low-cost compute chips
- Passive liquid cooling
  - ultimate density (cooling >70 W / node)
  - energy re-use
- Built to integrate heterogeneous resources
  - CPUs
  - Accelerators

45 HyperscaleFPGA: Network-attached FPGAs @ Hyperscale
- Disaggregation of compute resources
  - FPGAs can be deployed independent of:
    - the # of CPUs (respectively servers)
    - the server form factor (which keeps on shrinking)
  - FPGAs can be provisioned / rented similar to other cloud compute, storage, and network resources
- Scalability
  - Users can build SDN fabrics of FPGAs in the cloud
  - FPGAs are promoted to the rank of peer processor (end of slavery)
  - HW-based FPGA-to-FPGA communication provides low latency and high throughput (RDMA NICs)

46 Reference Prototype: FPGA Compute Node
- FPGA card: KU060 FPGA w/ 16 GB memory, 10GbE, PCIe extension, board management controller
- The iNIC enables the FPGA to hook itself to the network and to communicate with other DC resources, such as servers, disks, I/O, and other FPGA appliances
[Diagram labels: FPGA Card, Memory, FPGA, Management Layer (ML), User Logic (vFPGA), iNIC, Network Service Layer (NSL), Data Center Network]

47 But be willing to take incremental steps when you can!
