A Preliminary Evaluation of OpenPOWER through Optimizing Stencil-Based Algorithms

1 A Preliminary Evaluation of OpenPOWER through Optimizing Stencil-Based Algorithms. Speaker: Jingheng Xu, Tsinghua University. Revolutionizing the Datacenter. Join the Conversation #OpenPOWERSummit

2 Contents: 1. About Us; 2. POWER8 Processor; 3. CAPI Technology; 4. Preliminary Example

3 About Us: Tsinghua High Performance Geo-Computing Group (HPGC). Algorithm, Application, Architecture: The Best Computational Solution

6 About Us: Tsinghua High Performance Geo-Computing Group (HPGC). Application: climate simulation, seismic modeling, etc. Platform: CPU, GPU, DFE, etc. Partner:

8 POWER8 Processor
Technology: 22 nm SOI, eDRAM, 15 metal layers (ML), 650 mm²
Cores: 12 cores (SMT8); 8 dispatch, 10 issue, 16 execution pipes; 2x internal data flows/queues; enhanced prefetching; 64 KB data cache, 32 KB instruction cache
Accelerators: crypto and memory expansion; transactional memory; VMM assist; data move/VM mobility
Caches: 512 KB SRAM L2 per core; 96 MB eDRAM shared L3
Memory: up to 230 GB/s sustained bandwidth
Bus Interfaces: durable open memory attach interface; integrated PCIe Gen3; SMP interconnect; CAPI
Energy Management: on-chip power management microcontroller
[Figure: POWER8 Scale-Out Dual Chip Module, showing cores with per-core L2, L3 banks, chip interconnect, memory buses, and SMP, PCIe, and CAPI links]

9 POWER8 Processor: stencils evaluated are Jacobi, FD4, and FD8. [Figure: stencil shapes on x, y, z axes]
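As a rough illustration of the higher-order stencils named on this slide, the sketch below applies a 4th-order central finite-difference approximation of the second derivative along one axis. It assumes that "FD4" denotes a 4th-order finite-difference stencil, which the slides do not spell out; the function name `fd4_second_derivative` and the grid spacing `h` are hypothetical.

```python
# Standard 4th-order central-difference weights for the second derivative.
# Assumption: "FD4" on the slide refers to a stencil of this order.
FD4_WEIGHTS = (-1.0 / 12.0, 4.0 / 3.0, -5.0 / 2.0, 4.0 / 3.0, -1.0 / 12.0)


def fd4_second_derivative(u, h):
    """Approximate d2u/dx2 at interior points of the 1D list u with spacing h.

    Points closer than two cells to either boundary are left at 0.0,
    because the 5-point stencil needs two neighbours on each side.
    """
    n = len(u)
    out = [0.0] * n
    for i in range(2, n - 2):
        acc = 0.0
        for offset, w in zip(range(-2, 3), FD4_WEIGHTS):
            acc += w * u[i + offset]
        out[i] = acc / (h * h)
    return out
```

A 3D stencil of this kind would sum the same 1D pattern along x, y, and z, so each output point reads more neighbours as the order grows, which is one reason such kernels stress memory access.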

10 POWER8 Processor: Jacobi stencil. [Figure: Jacobi stencil on x, y, z axes]
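For readers unfamiliar with the Jacobi stencil, here is a minimal sketch of one sweep on a 3D grid. It assumes the common form in which each interior point is replaced by the average of its six face neighbours; the slides do not give the coefficients, and the helper names `make_grid` and `jacobi_sweep` are hypothetical.

```python
def make_grid(nx, ny, nz, fill=0.0):
    """Allocate an nx * ny * nz grid as nested lists, indexed grid[x][y][z]."""
    return [[[fill] * nz for _ in range(ny)] for _ in range(nx)]


def jacobi_sweep(src):
    """Return a new grid after one Jacobi sweep over src.

    Each interior point becomes the average of its six face neighbours
    (an assumed weighting); boundary points are copied unchanged.
    """
    nx, ny, nz = len(src), len(src[0]), len(src[0][0])
    dst = [[row[:] for row in plane] for plane in src]
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                dst[x][y][z] = (
                    src[x - 1][y][z] + src[x + 1][y][z]
                    + src[x][y - 1][z] + src[x][y + 1][z]
                    + src[x][y][z - 1] + src[x][y][z + 1]
                ) / 6.0
    return dst
```

The sweep reads from `src` and writes to a separate `dst`, which is what distinguishes Jacobi iteration from in-place schemes and makes the loop nest straightforward to parallelize.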

11 POWER8 Processor: Lightweight Tuning. OpenMP & SMT; NUMA control; GPR-oriented SIMD; VSR-oriented SIMD; 3D cache blocking; 2D cache blocking

13 POWER8 Processor: Lightweight Tuning. OpenMP & SMT; NUMA control; GPR-oriented SIMD; VSR-oriented SIMD; 3D cache blocking; 2D cache blocking. GPR: General-Purpose Register. VSR: Vector-Scalar Register

15 POWER8 Processor: Lightweight Tuning. OpenMP & SMT; NUMA control; GPR-oriented SIMD; VSR-oriented SIMD; 3D cache blocking; 2D cache blocking. [Table: performance in GFlops for Jacobi, FD4, and FD8 under 3D blocking and 2D blocking; values not recoverable]
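The difference between 3D and 2D cache blocking can be sketched with a tiled loop nest. The code below is an illustration only, not the authors' kernel: it assumes a six-neighbour Jacobi update, and the function names and tile sizes `bx`, `by`, `bz` are hypothetical. In Python it shows the traversal order rather than any performance gain.

```python
def sweep_point(src, x, y, z):
    """Six-neighbour average at one interior point (assumed Jacobi weights)."""
    return (
        src[x - 1][y][z] + src[x + 1][y][z]
        + src[x][y - 1][z] + src[x][y + 1][z]
        + src[x][y][z - 1] + src[x][y][z + 1]
    ) / 6.0


def blocked_sweep(src, bx, by, bz):
    """One Jacobi sweep over the interior of src, visited tile by tile.

    bx, by, bz are hypothetical tile sizes. Choosing a tile size equal to
    the full extent along one axis leaves that axis unblocked, which turns
    3D blocking into 2D blocking.
    """
    nx, ny, nz = len(src), len(src[0]), len(src[0][0])
    dst = [[row[:] for row in plane] for plane in src]
    for x0 in range(1, nx - 1, bx):
        for y0 in range(1, ny - 1, by):
            for z0 in range(1, nz - 1, bz):
                # Sweep one tile; min() clips tiles at the interior edge.
                for x in range(x0, min(x0 + bx, nx - 1)):
                    for y in range(y0, min(y0 + by, ny - 1)):
                        for z in range(z0, min(z0 + bz, nz - 1)):
                            dst[x][y][z] = sweep_point(src, x, y, z)
    return dst
```

Because every point is still computed from the same inputs, blocking changes only the order of visits; on real hardware the tile sizes would be tuned so that a tile's working set fits in cache.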

19 CAPI Technology

20 CAPI Technology: traditional PCIe-attached FPGA. The application (App) reaches the FPGA through a device driver (DD), so variables, input data, and output data exist in the application's virtual address space, in the device driver storage area, and on the FPGA. 3 versions of the data (not coherent). Thousands of instructions in the device driver. [Figure: POWER8 cores, memory subsystem, PCIe, FPGA]

21 CAPI Technology: CAPI-attached FPGA. Through the PSL (Power Service Layer) over PCIe, the FPGA works on the application's variables, input data, and output data directly in the virtual address space. 1 coherent version of the data. No device driver calls/instructions. [Figure: POWER8 cores, memory subsystem, PCIe, PSL, FPGA]

22 Preliminary Example: stencil-based RTM (Reverse Time Migration). Three main challenges: 1. Memory access pressure 2. Computational pressure 3. I/O pressure

23 Preliminary Example. Three main challenges: memory access (MA), computational pressure (CP), file I/O.

Original version
            MA & CP    Total Time    File I/O & Others
1 core      81.64 s    82.94 s       1.30 s
20 cores     4.54 s     7.88 s       3.34 s

POWER-optimized version
            MA & CP    Total Time    File I/O & Others
1 core      21.55 s    22.59 s       1.04 s
20 cores     1.51 s     4.54 s       3.03 s
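The timings on this slide can be turned into speedups and I/O shares with a few lines of arithmetic. The sketch below only restates the slide's numbers; the dictionary layout and the function names are hypothetical.

```python
# Timings in seconds, copied from the slide: (MA & CP, File I/O & others).
TIMINGS = {
    ("original", 1): (81.64, 1.30),
    ("original", 20): (4.54, 3.34),
    ("power_opt", 1): (21.55, 1.04),
    ("power_opt", 20): (1.51, 3.03),
}


def total(version, cores):
    """Total run time: compute part plus file I/O and other work."""
    compute, io = TIMINGS[(version, cores)]
    return compute + io


def speedup(before, after):
    """How many times faster `after` is than `before`."""
    return before / after


def io_share(version, cores):
    """Fraction of the total run time spent in file I/O and other work."""
    compute, io = TIMINGS[(version, cores)]
    return io / (compute + io)


if __name__ == "__main__":
    print(round(speedup(total("original", 20), total("power_opt", 20)), 2))
    print(round(io_share("power_opt", 20), 2))
```

On these numbers the 20-core total improves by about 1.74x, while file I/O and other work account for about 67% of the optimized 20-core run, so the remaining bottleneck is no longer the stencil computation itself.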

24 Preliminary Example: Hybrid Algorithm
Host (POWER8): takes charge of I/O and other parts; mainly file I/O; only one POWER8 core, to avoid write conflicts.
Device (FPGA): takes charge of the computations; adopts CAPI to avoid long-latency data transfer.
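The host/device split can be pictured as a single-writer pipeline. The sketch below is a loose analogy in plain Python threads and does not use CAPI or an FPGA: `device_compute` is a hypothetical stand-in for the accelerator kernel, and an in-memory list stands in for the output file.

```python
import queue
import threading


def device_compute(steps, out_queue):
    """Stand-in for the FPGA kernel: produce one result per time step."""
    for step in range(steps):
        out_queue.put((step, step * step))  # placeholder computation
    out_queue.put(None)  # sentinel: no more results


def host_run(steps):
    """Host side: start the compute worker and perform all writes itself.

    Only this single host thread appends to `written`, mirroring the
    slide's point that one core does the file I/O to avoid write conflicts.
    """
    results = queue.Queue()
    worker = threading.Thread(target=device_compute, args=(steps, results))
    worker.start()
    written = []  # stands in for the output file
    while True:
        item = results.get()
        if item is None:
            break
        written.append(item)
    worker.join()
    return written
```

With one producer and one consumer the results arrive in time-step order, so the writer needs no locking around the output.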

25 Preliminary Example
Original: 20 POWER8 processor cores, 7.9 s in total
POWER Opt: 20 POWER8 processor cores, 4.5 s in total
CAPI Version: 1 POWER8 processor core & 1 FPGA, 2.4 s in total*
*There is still some accuracy problem with this result.

26 Conclusion: Extremely high performance. An OpenPOWER system with CAPI: powerful host, flexible device, low-latency interface

27 Jingheng Xu, Haohuan Fu, Yu Song, Hongbo Peng, et al. Tsinghua University, Beijing, China; IBM China Systems and Technology Laboratory
