FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges
Naoya Maruyama (RIKEN AICS)*
5th ADAC Workshop, Feb 15, 2018
* Current main affiliation: Lawrence Livermore National Laboratory

SIAM PP18: Deep Learning + HPC
MS 26: Deep Learning from HPC Perspectives, Thursday, March 8
Speakers from ORNL, ETH, and the University of Tokyo, as well as LLNL, ANL, etc.

Why FPGA?
- Dark silicon is bringing specialized logic: TPU, Nervana, Graphcore, Tensor Cores, and embedded AI/vision processing units in modern smartphones.
- But only to highly profitable and uniform application domains. HPC is not immediately profitable and is very diverse.
- FPGAs are a middle-ground solution: not quite as efficient as ASICs, but their reconfigurability makes specialization possible for application domains such as HPC.
- There are growing non-HPC markets that HPC could leverage; FPGAs are deployed by default in Microsoft datacenters.
- In short: a poor man's specialization.

But Why Now Again?
There were significant efforts until the late 2000s (the SRC MAP systems, the Cray XD1), largely dwarfed by the emergence of GPGPU. Why now?
- Competitive hardware performance: large arrays of IEEE FP-capable DSPs (e.g., 10 single-precision TFLOPS on Stratix 10), and HBM2-equipped boards from both Altera and Xilinx.
- A lower psychological and practical barrier to accelerator-based computing, thanks to the success of GPUs.
- More mature and standardized programming environments: OpenCL compilers for FPGAs are provided by both Altera and Xilinx.

SRC-6 MAP System (Circa 2004) Source: SRC slides at HPEC 2004

But Why Not GPU?
- If performance is bounded by memory bandwidth, compute reconfigurability won't matter for performance, and many HPC codes are memory bound.
- Baseline: the latest high-end FPGAs should be able to achieve performance comparable to GPUs for those applications.
- Opportunity: communication-avoiding algorithms on FPGAs.
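
To make the baseline claim concrete, a back-of-the-envelope bound (the arithmetic intensity and the FPGA bandwidth below are illustrative assumptions, not numbers from the talk): a memory-bound kernel with arithmetic intensity I (flop/byte) on a device with off-chip bandwidth B attains at most P = B × I.

    P_max = B × I
    FPGA: 512 GB/s × 0.5 flop/byte ≈ 256 GFLOP/s   (assumed HBM2-class board, unblocked stencil)
    GPU:  732 GB/s × 0.5 flop/byte ≈ 366 GFLOP/s   (Tesla P100 peak bandwidth, same intensity)

The devices' very different peak compute rates never enter the bound, which is why on-chip reuse (communication avoidance) is where an FPGA can pull ahead.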

Case Study: 3D Diffusion Stencil
[Chart: performance (GFLOP/s) of the S5 GXA7, A10 GX1150, S10 MX2100, Tesla K40c, GTX 980Ti, and Tesla P100; the plotted values are 101.5, 305.2, 374.7, 515.9, 1205.3, and 1584.8 GFLOP/s.]
Zohouri et al., "Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL," FPGA'18, Feb 2018 (to appear).
Aggressive temporal blocking was applied for the FPGAs: 4-way on the S5 and 12-way on the A10. The GPUs also use temporal blocking, but only 2-way, as the speedup diminished beyond that.
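
For illustration, a minimal single work-item OpenCL sketch of the temporal-blocking idea, reduced to a 1D 3-point diffusion stencil: TDEG chained time steps share one trip through off-chip memory. The kernel name, coefficients, and sizes are assumptions for this sketch; the FPGA'18 kernels are 3D and combine this with spatial blocking.

```c
/* Sketch of temporal blocking, assuming a 1D 3-point diffusion stencil:
 * TDEG time steps are chained as pipeline stages, so the array is read
 * from and written to off-chip memory once per TDEG steps instead of
 * once per step. Boundaries are simplified (treated as zero). */
#define TDEG 4  /* degree of temporal blocking, e.g., 4-way as on the S5 */

__kernel void diffusion_tb(__global const float *restrict in,
                           __global float *restrict out,
                           const int n) {
    float win[TDEG][3];  /* one 3-wide sliding window per chained step */
    for (int t = 0; t < TDEG; ++t)
        for (int j = 0; j < 3; ++j)
            win[t][j] = 0.0f;

    /* feed n + TDEG elements so the last stage drains completely */
    for (int i = 0; i < n + TDEG; ++i) {
        float v = (i < n) ? in[i] : 0.0f;  /* one off-chip read per cycle */
        #pragma unroll
        for (int t = 0; t < TDEG; ++t) {   /* value flows stage to stage */
            win[t][0] = win[t][1];
            win[t][1] = win[t][2];
            win[t][2] = v;
            v = 0.25f * win[t][0] + 0.5f * win[t][1] + 0.25f * win[t][2];
        }
        int o = i - TDEG;                  /* last stage's output lags by TDEG */
        if (o >= 0)
            out[o] = v;                    /* one off-chip write per cycle */
    }
}
```

Each chained stage costs area but no extra bandwidth, which is why the FPGAs can afford 4- to 12-way blocking while the GPU saturates at 2-way.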

Research Agenda
- Performance characterization
- Identifying performance optimization techniques
- Tools to support performance analysis and optimization
- Debugging and validation tools
- Compilation techniques for exploiting reconfigurable dataflow
- Extending existing accelerator-aware programming models for FPGAs
- Performance portability across FPGAs, GPUs, and CPUs
- Multi-FPGA programming
- Fault tolerance
- Programming tools for variable precision

Evaluating FPGAs with OpenCL
- Performance characterization of FPGAs when used with OpenCL.
- Used the Altera (Intel) OpenCL compiler and the Rodinia benchmark suite [Che et al. 2009], which consists of 23 benchmarks, each with OpenMP, CUDA, and OpenCL versions, covering the Berkeley Dwarfs.
- Evaluation approach: use the original OpenCL version as the baseline, then convert the baseline into more FPGA-friendly programs, e.g., shift register-based data forwarding rather than thread barriers (see the sketch below).
- The extended benchmark code is available at https://github.com/fpga-opencl-benchmarks
- For more details, see H. Zohouri et al., "Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs," SC'16.
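
As one concrete example of such a conversion: on a GPU, a reduction is typically staged through local memory with barriers, whereas the FPGA-friendly idiom (described in Altera/Intel's OpenCL optimization material) spreads the loop-carried floating-point add across a small shift register of partial sums so the loop pipelines. A minimal sketch, assuming a sum reduction; the kernel name and DEPTH are illustrative choices:

```c
/* FPGA-friendly sum reduction in a single work-item OpenCL kernel: the
 * shift register holds DEPTH partial sums, so consecutive iterations do
 * not wait on each other's floating-point add and the loop pipelines
 * with an initiation interval of 1. No local memory, no barriers. */
#define DEPTH 8  /* assumed >= latency of one float add on the target FPGA */

__kernel void reduce_sum(__global const float *restrict in,
                         __global float *restrict out,
                         const int n) {
    float shift[DEPTH];
    #pragma unroll
    for (int i = 0; i < DEPTH; ++i)
        shift[i] = 0.0f;

    for (int i = 0; i < n; ++i) {
        /* the tail partial sum absorbs the new element and re-enters at
         * the head; every input lands in exactly one partial sum */
        float head = shift[DEPTH - 1] + in[i];
        #pragma unroll
        for (int j = DEPTH - 1; j > 0; --j)
            shift[j] = shift[j - 1];
        shift[0] = head;
    }

    float sum = 0.0f;  /* collapse the DEPTH partial sums at the end */
    #pragma unroll
    for (int i = 0; i < DEPTH; ++i)
        sum += shift[i];
    *out = sum;
}
```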

Performance and Power Comparison
[Charts: speed-up and power efficiency relative to the Xeon E5-2670 (= 1) for NW, Hotspot, Pathfinder, SRAD, LUD, and CFD on the Tesla K20c and Stratix V. Plotted speed-ups range from 0.4x to 10.25x; plotted power-efficiency gains range from 0.97x to 12.84x.]

Performance Portability Challenge
- On GPUs, OpenCL local memory ("shared memory" in CUDA) is used for data-locality optimization, and barriers are necessary to guarantee local-memory consistency.
- This is also possible on FPGAs with multi-threaded kernels, but barriers result in a pipeline flush, and there is no other way to pass data from one thread to another.
- In single-threaded kernels, iterations are issued sequentially and are pipelined. The issue distance between iterations can be used to pass data: single-clock register reads and writes forward values, so dependencies are mostly penalty-free, while barriers are not.
- Single-threaded kernels are therefore preferred over multi-threaded ones (see the sketch below).
[Diagram: iterations i, i+1, i+2 pipelined over cycles t, t+1, t+2 in the single-threaded case, vs. a barrier stall in the multi-threaded case.]
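
The two styles side by side for a 1D 3-point stencil; everything here is an illustrative sketch, not Rodinia code. Boundaries are treated as zero and n is assumed to be a multiple of the work-group size.

```c
/* GPU style: multi-threaded (NDRange) kernel. Neighbor values are shared
 * through __local memory, which requires a barrier; on an FPGA pipeline
 * this barrier forces a flush. */
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void stencil_ndrange(__global const float *restrict in,
                              __global float *restrict out,
                              const int n) {
    __local float tile[256 + 2];
    int g = get_global_id(0);
    int l = get_local_id(0);
    tile[l + 1] = in[g];                               /* centre point */
    if (l == 0)   tile[0]   = (g > 0)     ? in[g - 1] : 0.0f;  /* halo */
    if (l == 255) tile[257] = (g + 1 < n) ? in[g + 1] : 0.0f;  /* halo */
    barrier(CLK_LOCAL_MEM_FENCE);          /* pipeline flush on an FPGA */
    out[g] = 0.25f * tile[l] + 0.5f * tile[l + 1] + 0.25f * tile[l + 2];
}

/* FPGA style: single work-item kernel. The same neighbor values are
 * forwarded between consecutive loop iterations through registers, so
 * there is no local memory and no barrier, and the loop pipelines. */
__kernel void stencil_swi(__global const float *restrict in,
                          __global float *restrict out,
                          const int n) {
    float w0 = 0.0f, w1 = 0.0f, w2 = 0.0f;  /* 3-wide sliding window */
    for (int i = 0; i < n + 1; ++i) {
        w0 = w1;
        w1 = w2;
        w2 = (i < n) ? in[i] : 0.0f;        /* read one new element */
        if (i >= 1)
            out[i - 1] = 0.25f * w0 + 0.5f * w1 + 0.25f * w2;
    }
}
```

Both kernels compute the same result; the single work-item version simply exploits the fixed issue distance between iterations instead of inter-thread communication.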

Example: Rodinia SRAD
2 preparation kernels and 4 compute kernels:
1: Copies (in) and (in)² to two buffers.
2: Performs a reduction on the two buffers.
3: 5-point stencil computation (like Hotspot); 5 input and 5 output off-chip buffers.
4: Dynamic-programming computation with an iteration dependency (like NW); 5 input and one output off-chip buffer.

SRAD: Original Implementation (Cont.)
The original implementation favors high memory bandwidth:
- Performs the computation in multiple passes over multiple kernels.
- Uses no local-memory-based optimization except for the reduction.
- Passes data between kernels using multiple global buffers.
It does not work well on FPGAs due to:
- Low off-chip memory bandwidth.
- Multiple mathematically complex kernels competing for limited area.

SRAD: Optimization Impact
CPU: Xeon E5 (SNB); GPU: Tesla K20; FPGA: Stratix V.
[Chart: elapsed time in seconds. CPU: 4.93; GPU: 1.26. FPGA, cumulative optimizations: original: 81.6; +SIMD+UR: 80.5; pipelining: 74; +reduction+UR: 15.6; +kernel and loop fusion: 6.94; +memory parameter tuning: 2.7.]
Kernel and loop fusion (see the sketch below):
- Conversion to one kernel with 4 loops: 10% area reduction.
- Combining loops 1 & 2: two fewer off-chip buffers and less area; this also allows a memory optimization (reading neighbors from the same buffer), saving 4 more off-chip buffers.
- Combining loops 3 & 4: blocking + sliding window, 5 fewer off-chip buffers.
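
A minimal before/after sketch of what kernel-and-loop fusion buys on an FPGA; the kernel names and arithmetic are invented for illustration and are not the actual SRAD loops:

```c
/* Before: two kernels communicate through 'tmp', a full-size off-chip
 * buffer, costing one extra buffer and two extra passes over memory. */
__kernel void stage1(__global const float *restrict in,
                     __global float *restrict tmp, const int n) {
    for (int i = 0; i < n; ++i)
        tmp[i] = in[i] * in[i];
}

__kernel void stage2(__global const float *restrict tmp,
                     __global float *restrict out, const int n) {
    for (int i = 0; i < n; ++i)
        out[i] = tmp[i] + 1.0f;
}

/* After: one fused kernel; the intermediate value stays in a register,
 * eliminating the off-chip buffer and the extra memory passes, and the
 * merged pipeline shares control logic (less area). */
__kernel void fused(__global const float *restrict in,
                    __global float *restrict out, const int n) {
    for (int i = 0; i < n; ++i) {
        float t = in[i] * in[i];  /* never leaves the chip */
        out[i] = t + 1.0f;
    }
}
```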

Example: Rodinia NW
Rodinia benchmark version 3.0; Altera OpenCL v15 on a Stratix V FPGA (Terasic DE5-Net).

Performance and Power Comparison
[Repeats the speed-up and power-efficiency charts shown earlier: E5-2670 vs. K20c vs. Stratix V on NW, Hotspot, Pathfinder, SRAD, LUD, and CFD.]
More details in the paper: analytical performance models and optimization strategies.

Next Steps
- Updating performance results with HBM2 memory.
- Identifying when the FPGA can outperform the GPU; promising results with simple stencils using aggressive communication-avoiding (CA) techniques.
- High-level programming systems: automating FPGA-specific optimizations as compiler-based transformations, and extending overlay techniques for faster compilation.