Energy Efficient Transparent Library Acceleration with CAPI. Heiner Giefers, IBM Research Zurich


Energy Efficient Transparent Library Acceleration with CAPI. Heiner Giefers, IBM Research Zurich. Revolutionizing the Datacenter. Join the Conversation #OpenPOWERSummit

Towards highly efficient data centers
- PUE optimization and virtualization (workload consolidation, efficient cooling): today
- Energy-efficient architectures (heterogeneous computing, near-memory computing, in-memory computing): 5-50x
- Next-generation devices (beyond CMOS): >100x
- PUE is not a measure of efficiency
- Heterogeneous computing improves energy efficiency
- Programming heterogeneous systems is (still) challenging

Enabling FPGAs for software programmers
Enable hardware accelerators for a larger community
- FPGA development is more complex than software development: high-level design tools (e.g. SDAccel for OpenCL)
- Library acceleration: drop-in replacement for a standard software library
(Chart of developer occupations: Web, Desktop, Embedded, Hardware, Mobile; source: stackoverflow.com/research/developer-survey-2015)

Cross-platform standard software libraries

Example: Fast Fourier Transform
(Figure: a time-domain signal (amplitude over time) is mapped by the DFT to the frequency domain (amplitude over frequency))
- FFTs are widely used: DSP (spectral analysis, filter banks), data compression (MP3, JPEG), ML (convolutional neural networks), HPC (partial differential equations, mathematical finance)
- Common FFT libraries: FFTW, ESSL, MKL, ...
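For reference, the transform these libraries compute is the standard discrete Fourier transform of N complex samples (textbook definition, not taken from the slides):

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1$$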

Planning FFTs in software
(Figure: an N-point DFT is expanded into two N/2-point DFTs according to a plan)
- An FFT library consists of many small FFT kernels (codelets)
- On-line optimization: a planner picks the best composition (plan) by measuring the speed of different combinations
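A minimal sketch of what this planning step looks like from the caller's side in FFTW (the transform size and flags are illustrative, not from the slides):

```c
#include <fftw3.h>

int main(void)
{
    const int n = 4096;
    fftwf_complex *in  = fftwf_alloc_complex(n);
    fftwf_complex *out = fftwf_alloc_complex(n);

    /* FFTW_MEASURE makes the planner time several codelet
       compositions and keep the fastest one for this size. */
    fftwf_plan p = fftwf_plan_dft_1d(n, in, out,
                                     FFTW_FORWARD, FFTW_MEASURE);

    /* ... fill 'in' with samples, then run the planned transform ... */
    fftwf_execute(p);

    fftwf_destroy_plan(p);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}
```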

Deep pipelining for hardware FFTs
(Figure: the recursive N/2-point DFT decomposition is expanded and folded into a streaming pipeline of compute and shuffle stages)
- Reconfigure the FPGA with a deep FFT pipeline
- Fully streamed, linear memory access pattern
- Butterfly compute units, shuffle units
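For illustration, the butterfly operation each compute unit implements is the standard radix-2 step; this is a software reference in C, not the FPGA implementation from the talk:

```c
#include <complex.h>

/* Radix-2 decimation-in-time butterfly: combines two half-size
   DFT outputs a and b using the twiddle factor w. */
static inline void butterfly(float complex *a, float complex *b,
                             float complex w)
{
    float complex t = w * (*b);
    *b = *a - t;
    *a = *a + t;
}
```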

Heterogeneous compute libraries
(Figure: software stack on a POWER system. A user application such as GNU Radio dynamically links fftw; an FFTW interposer library sits between the application and the FFTW library / a custom FFT API, above a user mode driver (libcxl) and a device driver, targeting the POWER8 CPU, a PCIe FPGA, or a CAPI FPGA. The interposer selects the optimal platform per call and trains its mapping strategy using performance and power sensors.)


FFTW library interposing

fftwf_complex *in0, *out0, *in1, *out1;
fftwf_plan p;
// allocate and initialize ...
p = fftwf_plan_dft_1d(n, in0, out0, FFTW_FORWARD, FFTW_ESTIMATE);
fftwf_execute(p);
fftwf_execute_dft(p, in1, out1); // reuse plan

(Figure: control flow for the batched version across the user application, the FFTW interposer library, the POWER8 CPU, and the FPGA. On fftwf_plan: if the call is supported, the interposer registers the plan, otherwise the call returns unchanged. On fftwf_execute: if the plan is not registered, the FFT runs in software FFTW; if it is, the interposer assembles the WED, performs an MMIO write to the AFU, the FFT executes on the FPGA, completion is signalled and registered, the application is notified, and the call returns.)
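A minimal sketch of how such an interposer can shadow fftwf_execute through the dynamic linker (LD_PRELOAD plus dlsym is a standard mechanism; plan_is_registered and offload_to_afu are hypothetical placeholders for the decision logic and the CAPI offload path, not the actual implementation from the talk):

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fftw3.h>
#include <stdbool.h>

/* Hypothetical helpers provided by the accelerator runtime. */
bool plan_is_registered(const fftwf_plan p);
void offload_to_afu(const fftwf_plan p);

/* Interposed fftwf_execute: loaded via LD_PRELOAD it shadows the
   real FFTW symbol and decides per call where the FFT runs. */
void fftwf_execute(const fftwf_plan p)
{
    static void (*real_execute)(const fftwf_plan) = NULL;
    if (!real_execute)
        real_execute = (void (*)(const fftwf_plan))
                           dlsym(RTLD_NEXT, "fftwf_execute");

    if (plan_is_registered(p))
        offload_to_afu(p);   /* run on the CAPI-attached FPGA */
    else
        real_execute(p);     /* fall back to software FFTW */
}
```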

Latency for a single FFT function call
- Latency for a single CAPI FFT call is 10% higher than on the CPU (can be improved, as the AFU is bandwidth optimized)
- 4x better compared to a PCIe version using OpenCL
(Chart: runtime in microseconds for one 4k-input complex FFT from cache, split into compute and copy time. CPU: 80; FPGA using CAPI: 89; FPGA using PCIe (OpenCL): 124 compute plus 220 copy)

FFT execution time on P8 and accelerators
(Chart: runtime per FFT [us] versus number of FFTs (1 to 512), comparing P8 (1 core, FFTW), CAPI (2 samples/cycle, non-batched, lock), CAPI (1 sample/cycle, batched, irq), and CAPI (2 samples/cycle, batched, irq))

Power trace for multi-threaded FFT
(Figure: traces of total power, I/O power, socket 0 power (CPU, memory), and socket 1 power (CPU, memory); CAPI FFT processing power on the FPGA card is ~3 W)

Energy efficiency
Test case: compute 100 rounds of 32768 subsequent 4k-point FFTs in complex single-precision float (1 GB of input samples per round)
a) 1 core: 10.6 GFLOP @ 50 W = 0.21 GFLOP/W
b) 12 cores (1): 33.5 GFLOP @ 108 W = 0.31 GFLOP/W
c) 12 cores (2): 30.6 GFLOP @ 193 W = 0.12 GFLOP/W
d) 1 AFU: 23.6 GFLOP @ 7 W = 3.37 GFLOP/W
(1) 12 threads, SMT1, DVFS off; (2) 96 threads, SMT8, DVFS on
Result: one AFU is 2.2x faster and 16x more energy efficient compared to one core
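As a worked check of the metric using the slide's own numbers (the 5 N log2 N flop count is the usual convention for a complex FFT and is an assumption here, not stated on the slide):

$$\text{flops per 4k-point FFT} \approx 5 N \log_2 N = 5 \cdot 4096 \cdot 12 \approx 2.46 \times 10^5$$

$$\text{efficiency}_{\text{AFU}} = \frac{23.6\ \text{GFLOP}}{7\ \text{W}} \approx 3.37\ \text{GFLOP/W}, \qquad \frac{3.37\ \text{GFLOP/W}}{0.21\ \text{GFLOP/W}} \approx 16\times$$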

Conclusion and work in progress
- CAPI enables: offloading of lightweight compute jobs, transparent integration of FPGA accelerators via shared virtual memory, and energy-efficient computing for enterprise-class servers
- FPGA accelerators for cognitive computing: regular expression matching, sparse linear algebra for graph analytics, general convolution kernels for machine learning

Power trace for multi-threaded FFT