A Preliminary Evaluation of OpenPOWER through Optimizing Stencil-Based Algorithms
Speaker: Jingheng Xu, Tsinghua University
Revolutionizing the Datacenter. Join the Conversation #OpenPOWERSummit
Contents 1 About Us 2 POWER8 Processor 3 CAPI Technology 4 Preliminary Example
About Us: Tsinghua High Performance Geo-Computing Group (HPGC). Algorithm, Application, Architecture: The Best Computational Solution
About Us: Tsinghua High Performance Geo-Computing Group (HPGC)
Application: climate simulation, seismic modeling, etc.
Platform: CPU, GPU, DFE, etc.
Partner: [not recoverable from the slide]
POWER8 Processor
Technology: 22 nm SOI, eDRAM, 15 ML, 650 mm2
Cores: 12 cores (SMT8); 8 dispatch, 10 issue, 16 execution pipes; 2x internal data flows/queues; enhanced prefetching; 64 KB data cache, 32 KB instruction cache
Accelerators: crypto and memory expansion; transactional memory; VMM assist
Caches: 512 KB SRAM L2 per core; 96 MB eDRAM shared L3
Memory: up to 230 GB/s sustained bandwidth
Bus Interfaces: durable open memory attach interface; integrated PCIe Gen3; SMP interconnect; CAPI; data move/VM mobility
Energy Management: on-chip power management microcontroller
[Figure: POWER8 Scale-Out Dual Chip Module. Each chip shows cores with L2, L3 banks, chip interconnect, SMP interconnect, PCIe, CAPI, and memory bus.]
POWER8 Processor
[Figure: 3D stencil shapes on x, y, z axes: Jacobi, FD4, FD8]
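For readers unfamiliar with the Jacobi stencil named on this slide, the following is a minimal sketch of one sweep over a 3D grid. It assumes the common 7-point form (the point plus its six face neighbours) with uniform weights of 1/7; the slides do not give the actual coefficients or the definitions of FD4 and FD8, so the function name `jacobi7_sweep`, the weights, and the boundary handling are illustrative assumptions only.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index of grid point (x, y, z); x is the fastest-varying dimension. */
static size_t idx(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* One Jacobi sweep (assumed 7-point form, uniform weights).
 * Interior points of `out` become the average of the point and its six
 * face neighbours in `in`; boundary points are copied through unchanged. */
void jacobi7_sweep(const double *in, double *out,
                   size_t nx, size_t ny, size_t nz)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3);
    const size_t sy = nx;      /* stride between neighbours in y */
    const size_t sz = nx * ny; /* stride between neighbours in z */

    for (size_t z = 0; z < nz; z++) {
        for (size_t y = 0; y < ny; y++) {
            for (size_t x = 0; x < nx; x++) {
                const size_t c = idx(x, y, z, nx, ny);
                const int boundary = x == 0 || x == nx - 1 ||
                                     y == 0 || y == ny - 1 ||
                                     z == 0 || z == nz - 1;
                if (boundary) {
                    out[c] = in[c];
                    continue;
                }
                out[c] = (in[c] + in[c - 1] + in[c + 1] +
                          in[c - sy] + in[c + sy] +
                          in[c - sz] + in[c + sz]) / 7.0;
            }
        }
    }
}
```

Each output point reads seven input values but performs only a handful of floating-point operations, which is why stencil kernels of this kind tend to be limited by memory access rather than arithmetic.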
POWER8 Processor: Lightweight Tuning
OpenMP & SMT; NUMA Ctrl
GPR-oriented SIMD; VSR-oriented SIMD (GPR: General-Purpose Register; VSR: Vector-Scalar Register)
3D Cache Blocking; 2D Cache Blocking

Performance (unit: GFlops)
               Jacobi   FD4   FD8
3D Blocking    80       95    136
2D Blocking    102      175   161
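As an illustration of how 2D cache blocking and OpenMP threading can be combined, here is a minimal sketch. It assumes one common interpretation of "2D blocking": the grid is tiled in x and y while each tile streams through the full z extent. The slides do not show the authors' loop structure or block sizes, so the function name `jacobi7_blocked2d`, the parameters `bx` and `by`, and the uniform 1/7 weights are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* 7-point Jacobi sweep over interior points, tiled in x and y (assumed
 * meaning of "2D cache blocking"). Boundary points of `out` are not written.
 * Without OpenMP enabled, the pragma is ignored and the code runs serially. */
void jacobi7_blocked2d(const double *in, double *out,
                       size_t nx, size_t ny, size_t nz,
                       size_t bx, size_t by)
{
    assert(nx >= 3 && ny >= 3 && nz >= 3 && bx > 0 && by > 0);
    const size_t sy = nx;      /* stride between neighbours in y */
    const size_t sz = nx * ny; /* stride between neighbours in z */

    /* Tiles write disjoint parts of `out`, so they can run in parallel. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t y0 = 1; y0 < ny - 1; y0 += by) {
        for (size_t x0 = 1; x0 < nx - 1; x0 += bx) {
            const size_t y1 = min_sz(y0 + by, ny - 1);
            const size_t x1 = min_sz(x0 + bx, nx - 1);
            /* Stream through z while the (x, y) tile stays cache-resident. */
            for (size_t z = 1; z < nz - 1; z++) {
                for (size_t y = y0; y < y1; y++) {
                    for (size_t x = x0; x < x1; x++) {
                        const size_t c = (z * ny + y) * nx + x;
                        out[c] = (in[c] + in[c - 1] + in[c + 1] +
                                  in[c - sy] + in[c + sy] +
                                  in[c - sz] + in[c + sz]) / 7.0;
                    }
                }
            }
        }
    }
}
```

A 3D-blocked variant would add a third tile loop over z. Which of the two performs better depends on the stencil radius and the cache sizes, so block sizes are normally chosen by measurement.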
CAPI Technology
CAPI Technology: FPGA attached over PCIe through a device driver (DD)
[Figure: POWER8 cores running the App and DD, with boxes labelled Virt Addr, Device Driver Storage Area, Memory Subsystem, PCIe, and FPGA holding Variables, Input Data, and Output Data]
3 versions of the data (not coherent).
Thousands of instructions in the device driver.
CAPI Technology: FPGA attached through CAPI
[Figure: POWER8 cores running the App, with boxes labelled Virt Addr, Memory Subsystem, PCIe, PSL (Power Service Layer), and FPGA holding Variables, Input Data, and Output Data]
1 coherent version of the data.
No device driver calls/instructions.
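The difference between the two data paths can be mimicked in plain C. The sketch below is a host-only simulation and makes no real CAPI or driver calls: `toy_kernel` is a hypothetical stand-in for the accelerator, and the buffer names are invented for illustration. It only contrasts a staged, copy-based offload with an offload that works directly on the application's own buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the accelerator kernel: out[i] = 2 * in[i]. */
static void toy_kernel(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = 2.0 * in[i];
}

/* Driver-style offload (simulated): input is staged into a driver buffer,
 * copied to a device buffer, processed, and copied back the same way.
 * Returns the number of bulk copies performed. */
int offload_with_copies(const double *app_in, double *app_out, size_t n,
                        double *driver_buf, double *device_in,
                        double *device_out)
{
    assert(n > 0);
    int copies = 0;
    memcpy(driver_buf, app_in, n * sizeof *app_in);      copies++;
    memcpy(device_in, driver_buf, n * sizeof *app_in);   copies++;
    toy_kernel(device_in, device_out, n);
    memcpy(driver_buf, device_out, n * sizeof *app_out); copies++;
    memcpy(app_out, driver_buf, n * sizeof *app_out);    copies++;
    return copies;
}

/* Shared-address-space offload (simulated): the kernel works on the
 * application's buffers directly, so no staging copies are needed. */
int offload_shared(const double *app_in, double *app_out, size_t n)
{
    assert(n > 0);
    toy_kernel(app_in, app_out, n);
    return 0;
}
```

Both functions produce the same output; the first keeps three separate versions of the input (application, driver, device), while the second keeps one.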
Preliminary Example: Stencil-based RTM (Reverse Time Migration)
Three main challenges: 1. Memory access pressure 2. Computational pressure 3. I/O pressure
Preliminary Example
Three main challenges: memory access (MA); computational pressure (CP); file I/O

Original Version
            MA & CP   File I/O & Others   Total Time
1 core      81.64 s   1.30 s              82.94 s
20 cores    4.54 s    3.34 s              7.88 s

POWER Optimized Version
            MA & CP   File I/O & Others   Total Time
1 core      21.55 s   1.04 s              22.59 s
20 cores    1.51 s    3.03 s              4.54 s
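The timings above can be turned into speedups and time fractions with two small helpers. This is only arithmetic on the reported numbers; the function names `speedup` and `fraction` are illustrative.

```c
#include <assert.h>

/* Speedup of a parallel run relative to a serial run. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Share of the total runtime spent in one component. */
double fraction(double t_part, double t_total)
{
    assert(t_part >= 0.0 && t_total > 0.0);
    return t_part / t_total;
}
```

For example, `speedup(81.64, 4.54)` is roughly 18 and `speedup(21.55, 1.51)` is roughly 14.3, so the MA & CP part scales well to 20 cores in both versions. In contrast, `fraction(3.34, 7.88)` is about 0.42 and `fraction(3.03, 4.54)` is about 0.67: once the compute part is optimized, file I/O and other work account for about two thirds of the 20-core runtime.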
Preliminary Example: Hybrid Algorithm
Host (POWER8):
- Takes charge of I/O and the other parts
- Mainly file I/O
- Only one POWER8 core, to avoid write conflicts
Device (FPGA):
- Takes charge of the computation specifically
- Adopts CAPI to avoid long-latency data transfers
Preliminary Example
Original: 20 POWER8 cores, 7.9 s in total
POWER Opt: 20 POWER8 cores, 4.5 s in total
CAPI Version: 1 POWER8 core & 1 FPGA, 2.4 s in total*
*There is still some accuracy problem with this result.
Conclusion: Extremely High Performance
OpenPOWER system with CAPI: Powerful Host, Flexible Device, Low-latency Interface
Jingheng Xu, Haohuan Fu, Yu Song, Hongbo Peng, et al.
18653236889@163.com
Tsinghua University, Beijing, China
IBM China Systems and Technology Laboratory