Energy Efficient Transparent Library Acceleration with CAPI
Heiner Giefers, IBM Research Zurich

Revolutionizing the Datacenter
Join the Conversation #OpenPOWERSummit

Towards highly efficient data centers

[Figure: efficiency roadmap]
Stages: PUE optimization and virtualization; energy-efficient architectures; next-generation devices
Techniques: workload consolidation, efficient cooling, heterogeneous computing, near-memory computing, in-memory computing, beyond CMOS
Efficiency scale: today, 5-50x, >100x

- PUE (power usage effectiveness) is not a measure of efficiency
- Heterogeneous computing improves energy efficiency
- Programming heterogeneous systems is (still) challenging

4/11/16

Enabling FPGAs for software programmers

- Enable hardware accelerators for a larger community
- FPGA development is more complex than software development
  - High-level design tools (e.g. SDAccel for OpenCL)
- Library acceleration
  - Drop-in replacement for a standard software library

[Figure: developer categories Web, Desktop, Embedded, Hardware, Mobile; source: stackoverflow.com/research/developer-survey-2015]

Cross-platform standard software libraries

Example: Fast Fourier Transform

[Figure: amplitude over time -> DFT -> amplitude over frequency]

- FFTs are widely used
  - DSP: spectral analysis, filter banks
  - Data compression: MP3, JPEG
  - ML: convolutional neural networks
  - HPC: partial differential equations, mathematical finance
- Common FFT libraries (FFTW, ESSL, MKL, ...)
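For reference, the transform behind the time-to-frequency figure is the standard N-point discrete Fourier transform. The symbols x_n, X_k and N are notation chosen here and do not appear on the slide:

```latex
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i \, k n / N}, \qquad k = 0, 1, \dots, N-1
```

Evaluating this sum directly takes on the order of N^2 operations; FFT algorithms produce the same X_k in on the order of N log N operations.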

Planning FFTs in software

[Figure: a plan is expanded into two N/2-point DFTs]

- An FFT library consists of many small FFT kernels (codelets)
- Online optimization: a planner picks the best composition (plan) by measuring the speed of different combinations
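The measuring step of such a planner can be sketched as follows. This is a minimal illustration and not FFTW's planner: the names `fft_candidate`, `time_candidate`, and `pick_fastest` are hypothetical, and CPU time from `clock()` stands in for whatever timer a real planner uses.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature: one candidate composition computing the same FFT. */
typedef void (*fft_candidate)(void);

/* Average CPU time in seconds over `reps` runs of one candidate. */
double time_candidate(fft_candidate run, int reps)
{
    assert(run != NULL && reps > 0);
    clock_t start = clock();
    for (int i = 0; i < reps; ++i)
        run();
    return (double)(clock() - start) / CLOCKS_PER_SEC / reps;
}

/* Index of the smallest measured time; ties go to the earlier candidate. */
size_t pick_fastest(const double *seconds, size_t count)
{
    assert(seconds != NULL && count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; ++i)
        if (seconds[i] < seconds[best])
            best = i;
    return best;
}
```

A caller would time each candidate with `time_candidate`, store the results in an array, and keep the candidate returned by `pick_fastest`. In FFTW itself, measured planning corresponds to the `FFTW_MEASURE` flag, whereas `FFTW_ESTIMATE` chooses a plan from heuristics without timing anything.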

Deep pipelining for hardware FFTs

[Figure: N/2-point DFT stages connected by compute, shuffle, expand, and fold steps]

- Reconfigure the FPGA with a deep FFT pipeline
- Fully streamed, with a linear memory access pattern
- Butterfly compute units
- Shuffle units
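The two kinds of pipeline units can be illustrated in software. The slide does not state the radix or the shuffle pattern, so this sketch assumes a radix-2 butterfly and uses bit reversal as one common example of an FFT shuffle; the `cpx` type and both function names are hypothetical.

```c
#include <assert.h>

/* Minimal complex type so the sketch needs no math library. */
typedef struct { float re, im; } cpx;

static cpx cpx_mul(cpx a, cpx b)
{
    cpx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* Radix-2 butterfly: (a, b) -> (a + w*b, a - w*b), with w the twiddle factor. */
void butterfly(cpx a, cpx b, cpx w, cpx *sum, cpx *diff)
{
    assert(sum != 0 && diff != 0);
    cpx t = cpx_mul(w, b);
    sum->re = a.re + t.re;
    sum->im = a.im + t.im;
    diff->re = a.re - t.re;
    diff->im = a.im - t.im;
}

/* Reverse the lowest `bits` bits of an index: one common FFT shuffle pattern. */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned r = 0;
    for (unsigned i = 0; i < bits; ++i) {
        r = (r << 1) | (index & 1u);
        index >>= 1;
    }
    return r;
}
```

With w = 1 the butterfly is exactly a 2-point DFT. In a streamed hardware pipeline, each stage would apply such butterflies to the samples as they arrive, with shuffle units reordering the stream between stages.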

Heterogeneous compute libraries

[Diagram: software stack on a POWER system]
- User application (GNU Radio), dynamically linking fftw
- FFTW interposer library: select optimal platform here
  - FFTW library -> POWER8 CPU
  - Custom FFT API -> user mode driver, device driver -> PCIe FPGA
  - Custom FFT API -> libcxl -> CAPI FPGA
- Model: train mapping strategy using sensors (performance, power)
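One way to read the "select optimal platform" and "train mapping strategy using sensors" labels is as a small model that learns the cost of each platform from measurements. The sketch below is an illustration under that assumption and not the author's model: the platform list follows the diagram labels, while `mapping_model`, `model_train`, `model_select`, and the choice of mean energy per call as the cost are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Platforms named after the diagram labels. */
enum platform {
    PLATFORM_CPU = 0,
    PLATFORM_PCIE_FPGA,
    PLATFORM_CAPI_FPGA,
    PLATFORM_COUNT
};

/* Running mean of energy per call, learned from sensor readings (assumed model). */
struct mapping_model {
    double energy_joules[PLATFORM_COUNT];
    unsigned samples[PLATFORM_COUNT];
};

/* Fold one observation (runtime times average power) into the running mean. */
void model_train(struct mapping_model *m, enum platform p, double seconds, double watts)
{
    assert(m != NULL && (int)p >= 0 && (int)p < PLATFORM_COUNT);
    double energy = seconds * watts;
    m->samples[p] += 1;
    m->energy_joules[p] += (energy - m->energy_joules[p]) / m->samples[p];
}

/* Pick the trained platform with the lowest mean energy; default to the CPU. */
enum platform model_select(const struct mapping_model *m)
{
    assert(m != NULL);
    int best = PLATFORM_CPU;
    for (int p = 0; p < PLATFORM_COUNT; ++p) {
        if (m->samples[p] == 0)
            continue;
        if (m->samples[best] == 0 || m->energy_joules[p] < m->energy_joules[best])
            best = p;
    }
    return (enum platform)best;
}
```

A real mapping strategy would likely also condition on the FFT size and batch count, since the relative cost of CPU and accelerator changes with both.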


FFTW library interposing

    #include <fftw3.h>

    fftwf_complex *in0, *out0, *in1, *out1;
    fftwf_plan p;
    // allocate and initialize...
    p = fftwf_plan_dft_1d(n, in0, out0, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(p);
    fftwf_execute_dft(p, in1, out1);  // reuse plan with new arrays

[Flowchart: user application -> FFTW interposer library -> POWER8 CPU or FPGA]
- fftw_plan: if the call is supported, register the plan; then return to the user application.
- fftw_execute: if the plan is not registered, execute using software FFTW and return. If the plan is registered, assemble the WED (work element descriptor) and notify the AFU (accelerator function unit) with an MMIO write to an AFU register; the FPGA executes the FFT and signals completion; then return.
- Legend: a separate control flow is shown for the batched version.
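The decision logic of the flowchart can be sketched in plain C. Everything here is hypothetical apart from the control flow itself: the registry, the function names, and in particular the capability check, which assumes that only one-dimensional 4096-point transforms are supported because those are the ones benchmarked in this deck.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PLANS 16

enum backend { BACKEND_SOFTWARE, BACKEND_FPGA };

/* Hypothetical registry of plan handles the accelerator can serve. */
static const void *registered[MAX_PLANS];
static size_t registered_count;

/* Assumed capability check; the slides do not list the supported calls. */
int call_supported(int rank, size_t n)
{
    return rank == 1 && n == 4096;
}

/* Planning path: remember the plan only if the FPGA supports this call. */
int on_plan(const void *plan, int rank, size_t n)
{
    if (plan == NULL || !call_supported(rank, n) || registered_count == MAX_PLANS)
        return 0;
    registered[registered_count++] = plan;
    return 1;
}

/* Execute path: registered plans go to the FPGA, all others to software FFTW. */
enum backend on_execute(const void *plan)
{
    assert(plan != NULL);
    for (size_t i = 0; i < registered_count; ++i)
        if (registered[i] == plan)
            return BACKEND_FPGA;
    return BACKEND_SOFTWARE;
}
```

In an actual interposer these checks would sit inside replacement definitions of `fftwf_plan_dft_1d` and `fftwf_execute` in a preloaded shared library. Such a library would typically forward the software path to the original symbol obtained with `dlsym(RTLD_NEXT, ...)`, although the slides do not say how the forwarding is done.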

Latency for a single FFT function call

- Latency for a single CAPI FFT call is about 10% higher than on the CPU (can be improved, as the AFU is bandwidth optimized)
- 4x better compared to a PCIe version using OpenCL

[Bar chart: runtime in microseconds for one 4k-input complex FFT from cache; legend: Compute, Copy; axis 0 to 400]
CPU: 80
FPGA using CAPI: 89
FPGA using PCIe (OpenCL): 124 and 220 (two bar segments)

FFT execution time on P8 and accelerators

[Plot: runtime per FFT in microseconds (0 to 140) versus number of FFTs (1 to 512, in powers of two)]
Series:
- P8 (1 core, FFTW)
- CAPI (2 samples/cycle, non-batched, lock)
- CAPI (1 sample/cycle, batched, irq)
- CAPI (2 samples/cycle, batched, irq)

Power trace for multi-threaded FFT

[Figure: power traces for Total Power, I/O Power, Socket 0 Power (CPU, Memory), and Socket 1 Power (CPU, Memory)]

CAPI FFT processing power on the FPGA card: ~3 W

Energy efficiency

Test case: compute 100 rounds of 32768 consecutive 4k-point FFTs in complex single-precision float (1 GB of input samples per round)

a) 1 core:       10.6 GFLOPS at  50 W = 0.21 GFLOPS/W
b) 12 cores (1): 33.5 GFLOPS at 108 W = 0.31 GFLOPS/W
c) 12 cores (2): 30.6 GFLOPS at 193 W = 0.12 GFLOPS/W
d) 1 AFU:        23.6 GFLOPS at   7 W = 3.37 GFLOPS/W

Result: one AFU is 2.2x faster and 16x more energy efficient than one core.

(1) 12 threads, SMT1, DVFS off
(2) 96 threads, SMT8, DVFS on
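The arithmetic behind these figures can be reproduced with a few lines of C. The sketch reads the throughput values as rates (GFLOP per second); the helper names `gflops_per_watt` and `input_bytes` are hypothetical.

```c
#include <assert.h>

/* Energy efficiency: throughput in GFLOP/s divided by power in watts. */
double gflops_per_watt(double gflops, double watts)
{
    assert(watts > 0.0);
    return gflops / watts;
}

/* Input bytes for `count` FFTs of `n_points` single-precision complex samples. */
unsigned long long input_bytes(unsigned long long count, unsigned long long n_points)
{
    /* Each complex sample holds two floats (real and imaginary part). */
    return count * n_points * 2ULL * sizeof(float);
}
```

With the listed values, `gflops_per_watt(10.6, 50.0)` is about 0.21 and `gflops_per_watt(23.6, 7.0)` is about 3.37. That yields a speedup of 23.6 / 10.6, roughly 2.2x, and an efficiency ratio of roughly 16x, and `input_bytes(32768, 4096)` is exactly 2^30 bytes, matching the stated 1 GB per round. Applied to row c), the same division gives about 0.16 rather than the listed 0.12, so one of the figures in that row may have been recorded differently.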

Conclusion and work in progress

- CAPI enables
  - Offloading of lightweight compute jobs
  - Transparent integration of FPGA accelerators via shared virtual memory
  - Energy-efficient computing for enterprise-class servers
- FPGA accelerators for cognitive computing
  - Regular expression matching
  - Sparse linear algebra for graph analytics
  - General convolution kernels for machine learning
