LIQUID METAL Taming Heterogeneity

LIQUID METAL: Taming Heterogeneity. Stephen Fink, IBM Research

IBM Research Liquid Metal Team (IBM T. J. Watson Research Center): Josh Auerbach, Perry Cheng, David Bacon, Stephen Fink, Ioana Baldini, Rodric Rabbah, Sunil Shukla

A Heterogeneous Landscape: multicore, GPU, FPGA

Accelerator Platform: computation starts from the CPU
Examples: HPC workloads, seismic simulations, offline portfolio valuation, non-real-time apps
Devices: CPU, GPU, FPGA

Network Appliance Platform: process data hot off the network
Examples: financial data feeds, business rule matching, smart routers, data filtering, compression, encryption
Devices: CPU, FPGA, GPU

Storage Appliance Platform: push computation near data
Examples: database acceleration, compressed memory, data mining
Devices: CPU, GPU, FPGA

Heterogeneous Programming: Current Reality
[Diagram labels, top to bottom]
Languages: Java, C++, Python, CUDA, OpenCL, VHDL, Verilog, SystemC
Toolchains: CPU compiler, GPU compiler, node compiler, synthesis
Outputs: binary, binary, bitfile
Devices: CPU, GPU, ASIP, FPGA

Goal: Make Heterogeneous Platforms Accessible for Mainstream Use
Vision: Lime, a high-level Java-derived language
Lime toolchain: IDE, compiler, debugger, cloud synthesis
Platform: hardware (CPU, GPU, FPGA) + runtime

Export Model: development for a customer-owned platform [diagram label: Lime]

Lime: The Liquid Metal Programming Language

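Lime Compiler
[Diagram: Lime source feeds the Lime compiler, which has four backends]
- CPU backend: emits bytecode for the CPU
- GPU backend: emits a binary for the GPU
- Node backend: emits a binary for the ASIP
- Verilog backend: emits a bitfile for the FPGA
Compiler philosophy:
- The bytecode backend compiles the full language
- Each device backend compiles a subset of the language
- The compiler gives specific feedback on exclusions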
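The compiler philosophy above can be pictured with a small Java sketch. Everything in it is hypothetical: the `Feature` and `Backend` enums and the feature subsets assigned to each backend are invented for illustration and do not describe which constructs the real Lime backends accept. It only shows the shape of the idea: one backend accepts everything, the others accept subsets, and the report names exactly what each backend excludes.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: these enums and subsets are NOT taken from the Lime compiler.
final class BackendSketch {
    // Invented language features that a task might use.
    enum Feature { ARITHMETIC, BOUNDED_ARRAYS, OBJECT_ALLOCATION, REFLECTION }

    // Invented backends, each with the feature subset it accepts.
    enum Backend {
        BYTECODE(EnumSet.allOf(Feature.class)), // full language
        GPU(EnumSet.of(Feature.ARITHMETIC, Feature.BOUNDED_ARRAYS)),
        VERILOG(EnumSet.of(Feature.ARITHMETIC));

        final Set<Feature> supported;

        Backend(Set<Feature> supported) {
            this.supported = supported;
        }
    }

    // For each backend, lists the used features it excludes (empty set = compiles).
    static Map<Backend, Set<Feature>> exclusions(Set<Feature> used) {
        Map<Backend, Set<Feature>> report = new LinkedHashMap<>();
        for (Backend backend : Backend.values()) {
            Set<Feature> missing = EnumSet.noneOf(Feature.class);
            missing.addAll(used);
            missing.removeAll(backend.supported);
            report.put(backend, missing);
        }
        return report;
    }

    public static void main(String[] args) {
        Set<Feature> used = EnumSet.of(Feature.ARITHMETIC, Feature.OBJECT_ALLOCATION);
        exclusions(used).forEach((backend, missing) ->
                System.out.println(backend + " excludes " + missing));
    }
}
```

Returning the excluded features, instead of a plain yes/no, mirrors the "specific feedback on exclusions" point: the programmer learns what to change to reach a given device.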

What is Lime?
Lime = Java + Isolation + Abstract Parallelism
Isolation: ability to move computation
- local keyword
- Immutable types
Abstract Parallelism: freedom to schedule
- Stream programming model (task, =>)
- Data parallel constructs (@ map, ! reduce)
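As a rough plain-Java analogue of the map and reduce constructs listed above, the sketch below applies a pure function to every element and then combines the results with an associative operator. The class and method names are hypothetical, and Java streams stand in for Lime's `@` and `!` operators; Java itself does not enforce the isolation that Lime's `local` keyword and immutable types provide, so purity here is only a convention.

```java
import java.util.stream.IntStream;

// Hypothetical sketch: Java streams standing in for Lime's map (@) and reduce (!).
final class MapReduceSketch {
    // Pure function: depends only on its argument and touches no shared state,
    // which is the property that makes it safe to run anywhere, in any order.
    static int square(int x) {
        return x * x;
    }

    // Map square over the array, then reduce with integer addition.
    // Integer addition is associative, so the runtime is free to schedule the
    // partial sums however it likes without changing the result.
    static int sumOfSquares(int[] values) {
        return IntStream.of(values)
                .parallel()
                .map(MapReduceSketch::square)
                .reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(new int[] {1, 2, 3, 4}));
    }
}
```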

The Lime Development Experience
1. Prototype in standard Java
- Lime is a superset of Java
- Java-like Eclipse IDE (editors, debugger, navigation)
2. Gentle, incremental migration to parallel Lime code
- Lime language constructs restrict the program to safe parallel structures
- Isolation, immutability, safe deterministic parallelism, bounded space

Example Program: N-body simulation
1. Compute forces using particle states
2. Update particles based on forces
3. Periodic checkpoint to disk or display

Task and Connect

    class NBody {
        static void simulate() {
            float[][4] particles = ...; // initial state
            task nbody =
                task NBody(particles).particleGenerator
                => task forceComputation
                => task NBody(particles).forceAccumulator;
            nbody.finish();
        }

        float[][4] particles;
        NBody(float[][4] particles) { this.particles = particles; }

        static local float[[][3]] forceComputation(float[[][4]] particles) { }
        float[[][4]] particleGenerator() { }
        void forceAccumulator(float[[][3]]) { }
    }
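For readers without a Lime toolchain, the task-and-connect pattern can be approximated in plain Java as three stages joined by bounded queues. This is only a sketch under stated assumptions: `PipelineSketch`, `run`, the queue capacity of 4, and the use of `null` as an end-of-input signal are all hypothetical choices, not part of Lime or its runtime, and the integer stages in `main` are stand-ins for the particle generator, force computation, and force accumulator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names are Lime or LVM APIs.
final class PipelineSketch {
    private static final Object END = new Object(); // end-of-stream marker

    // Runs source => filter => sink with one thread per upstream stage and
    // bounded queues in between. In this sketch a null from the source means
    // "no more items", and the filter must not return null.
    static <A, B> void run(Supplier<A> source, Function<A, B> filter, Consumer<B> sink) {
        BlockingQueue<Object> q1 = new ArrayBlockingQueue<>(4);
        BlockingQueue<Object> q2 = new ArrayBlockingQueue<>(4);

        Thread producer = new Thread(() -> {
            try {
                for (A item = source.get(); item != null; item = source.get()) {
                    q1.put(item);
                }
                q1.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread worker = new Thread(() -> {
            try {
                for (Object o = q1.take(); o != END; o = q1.take()) {
                    @SuppressWarnings("unchecked")
                    A in = (A) o;
                    q2.put(filter.apply(in));
                }
                q2.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        worker.start();
        try {
            // The sink runs on the calling thread, so it needs no locking.
            for (Object o = q2.take(); o != END; o = q2.take()) {
                @SuppressWarnings("unchecked")
                B out = (B) o;
                sink.accept(out);
            }
            producer.join();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
    }

    public static void main(String[] args) {
        int[] next = {0};
        List<Integer> results = new ArrayList<>();
        PipelineSketch.<Integer, Integer>run(
                () -> next[0] < 5 ? next[0]++ : null, // generator stand-in
                x -> x * x,                           // computation stand-in
                results::add);                        // accumulator stand-in
        System.out.println(results);
    }
}
```

The middle stage is a pure function of its input, which is what would let a runtime move it to another device; the first and last stages keep their state on the host side.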

Relocation Brackets

    class NBody {
        static void simulate() {
            float[][4] particles = ...; // initial state
            task nbody =
                task NBody(particles).particleGenerator
                => ([ task forceComputation ])
                => task NBody(particles).forceAccumulator;
            nbody.finish();
        }

        float[][4] particles;
        NBody(float[][4] particles) { this.particles = particles; }

        static local float[[][3]] forceComputation(float[[][4]] particles) { }
        float[[][4]] particleGenerator() { }
        void forceAccumulator(float[[][3]]) { }
    }

    // compute the force on each particle
    static local float[[][3]] forceComputation(float[[][4]] particles) {
        return @ force(particles, #particles); // map operator @
    }

    // compute the force on particle p
    static local float[[3]] force(float[[4]] p, float[[][4]] P) {
        float fx = 0, fy = 0, fz = 0;
        for (float[[4]] Pj : P) {
            float dx = Pj[0] - p[0];
            float dy = Pj[1] - p[1];
            float dz = ...;
            float F = ...; // scalar code, compute force between p and Pj
            fx += dx * F;
            fy += dy * F;
            fz += dz * F;
        }
        return new float[[]] { fx, fy, fz };
    }

Language constructs guarantee safe, deterministic, parallel execution semantics:
- Deeply immutable types (e.g. float[[]])
- Isolation (e.g. local)
- Data parallel function application (map, @)
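A plain-Java approximation of this force computation is sketched below, with a parallel stream playing the role of the map operator. Several things are assumptions: the scalar force code is elided on the slide, so it is passed in through a hypothetical `PairLaw` interface instead of being filled in; `dz` is assumed to follow the same pattern as `dx` and `dy`; each displacement component is assumed to be scaled by the scalar force; and `double` replaces `float` for convenience. The softened inverse-square law in `main`, which reads index 3 as a mass, is invented purely to make the example runnable.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical sketch: a Java stand-in for the Lime code, not a translation of it.
final class ForceSketch {
    // Hook for the scalar code the slide elides ("float F = ...").
    interface PairLaw {
        double scale(double[] p, double[] pj, double dx, double dy, double dz);
    }

    // Force on one particle p from all particles. It reads its inputs and
    // writes only local variables, so calls are independent of each other.
    static double[] force(double[] p, double[][] all, PairLaw law) {
        double fx = 0, fy = 0, fz = 0;
        for (double[] pj : all) {
            double dx = pj[0] - p[0];
            double dy = pj[1] - p[1];
            double dz = pj[2] - p[2]; // assumed by analogy with dx and dy
            double f = law.scale(p, pj, dx, dy, dz);
            fx += dx * f;
            fy += dy * f;
            fz += dz * f;
        }
        return new double[] {fx, fy, fz};
    }

    // Analogue of `@ force(particles, #particles)`: one call per particle,
    // with the whole particle array shared read-only by every call.
    static double[][] forceComputation(double[][] particles, PairLaw law) {
        return IntStream.range(0, particles.length)
                .parallel()
                .mapToObj(i -> force(particles[i], particles, law))
                .toArray(double[][]::new);
    }

    public static void main(String[] args) {
        // Invented law: softened inverse-square attraction, index 3 read as mass.
        PairLaw gravity = (p, pj, dx, dy, dz) -> {
            double r2 = dx * dx + dy * dy + dz * dz + 1e-9;
            return pj[3] / (r2 * Math.sqrt(r2));
        };
        double[][] particles = {{0, 0, 0, 1}, {1, 0, 0, 1}};
        System.out.println(Arrays.deepToString(forceComputation(particles, gravity)));
    }
}
```

Unlike Lime, Java does not check that `force` is free of side effects or that the arrays are immutable; the sketch relies on the code simply not mutating them.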

Data Parallelism via map (@)

    float[] a = ...;
    float[] b = ...;
    float[] c = a @+ b; // map operator + over 2 arrays

    a       b       c
    1.2  +  0.0  =  1.2
    0.0  +  99.0 =  99.0
    3.14 + -3.0  =  0.14

    static local float add(float x, float y) { return x + y; }
    float[] c = @add(a, b);  // map function add() over 2 arrays
    float[] d = @add(a, #1); // map function add() over 1 array, 1 scalar
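The two forms of map shown on this slide, array with array and array with a repeated scalar, can be imitated in plain Java roughly as follows. `MapSketch`, `map`, and `mapScalar` are hypothetical names, and `double` is used instead of `float` because the standard library's functional interfaces are defined for `double`. The input values in `main` are chosen for the example and are not the ones on the slide.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

// Hypothetical helper class; the names below are illustrative, not Lime APIs.
final class MapSketch {
    // Analogue of `@add(a, b)`: apply op to corresponding elements of a and b.
    static double[] map(DoubleBinaryOperator op, double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        double[] out = new double[a.length];
        // Each index is computed independently, so a parallel fill is safe.
        Arrays.parallelSetAll(out, i -> op.applyAsDouble(a[i], b[i]));
        return out;
    }

    // Analogue of `@add(a, #1)`: the scalar is reused for every element.
    static double[] mapScalar(DoubleBinaryOperator op, double[] a, double scalar) {
        double[] out = new double[a.length];
        Arrays.parallelSetAll(out, i -> op.applyAsDouble(a[i], scalar));
        return out;
    }

    public static void main(String[] args) {
        double[] a = {1.5, 0.0, 3.0};
        double[] b = {0.5, 99.0, -3.0};
        System.out.println(Arrays.toString(map(Double::sum, a, b)));
        System.out.println(Arrays.toString(mapScalar(Double::sum, a, 1.0)));
    }
}
```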

What didn't you write? (GPU) (IBM Confidential)

What didn't you write? (FPGA)
- User logic
- Platform runtime and communication HDL
- 3000 lines of platform-independent HDL
- 500 lines of platform-dependent HDL
- 3 months of development time
- ??? months of debugging time

Lime Developer Tools

Compiling the Code (FPGA)
Integrated IDE vs. multiple 3rd-party tools with a long workflow:
1. Create an ISE or Quartus project
2. Write HDL and testbench
3. Simulate design
4. Repeat 2 & 3 until functionality is achieved
5. Implement your top module
6. Follow advanced design guides
7. Modify the design as necessary, simulate, synthesize, and implement until design requirements are met
8. Run timing simulations, verify timing closure
9. Program your Xilinx or Altera device

Debugging: debug with the JVM vs. low-level waveforms

Running the Code: same IDE (same dialog) vs. yet another tool

Limeforge (Synthesis in the Cloud)
[Diagram labels: Generated HDL, Limeforge, bitfile; XUPv5, PCIe, 200 MHz]

Lime Implementation

Lime Virtual Machine
[Diagram labels: Lime, Lime Compiler; bytecode, binary, binary, bitfile, bitfile, bitfile; Artifact Store; Lime Virtual Machine (LVM); CPU, GPU, Bus, ASIP, FPGA]

GPU Backend Structure

How do we evaluate performance?
- Speedup for naïve users: how much faster than Java?
- Slowdown for expert users: how much slower than hand-tuned low-level code?
Current status (bottom line)
- No magical super-compiler
- On some benchmarks Lime performs as well as the hand-coded alternative; on others it does not
- Communication/computation ratio dominates
- Ongoing research into better implementation technology
- Big performance improvements in reach
Good enough for prime time? We're optimistic.
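The two metrics above are simple ratios, and a few lines of Java make concrete why the communication/computation ratio can dominate an end-to-end measurement. All timings below are made-up illustration values, not results from the talk, and `PerfSketch` is a hypothetical helper.

```java
// Hypothetical helper; the timings used with it are made-up illustration values.
final class PerfSketch {
    // Ratio used for both questions: speedup = baseline time / candidate time.
    static double speedup(double baselineSeconds, double candidateSeconds) {
        return baselineSeconds / candidateSeconds;
    }

    // End-to-end accelerator time includes moving data to and from the device.
    static double endToEnd(double computeSeconds, double transferSeconds) {
        return computeSeconds + transferSeconds;
    }

    public static void main(String[] args) {
        double javaSeconds = 8.0;     // made-up baseline
        double kernelSeconds = 0.5;   // made-up device compute time
        double transferSeconds = 1.5; // made-up communication time
        System.out.println("kernel-only speedup: "
                + speedup(javaSeconds, kernelSeconds));
        System.out.println("end-to-end speedup:  "
                + speedup(javaSeconds, endToEnd(kernelSeconds, transferSeconds)));
    }
}
```

With these invented numbers the kernel alone is 16x faster than the baseline, but once transfer time is counted the end-to-end speedup is 4x. The expert-user slowdown is the same ratio with the roles swapped: Lime's time divided by the hand-tuned time.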

Speedup vs. Java [PLDI 12]: end-to-end performance (all overhead included)

Speedup vs. Hand-Tuned OpenCL [PLDI 12] (GTX 580): end-to-end performance (all overhead included)

Lime FPGA Results
- Combinatorial circuits: match hand-coded
  - pipelining
- Complex circuits: goal is 2-4x worse, now 10x
  - I/O pipelining and classic compiler optimizations
- Increasing synthesizable language feature set
  - First-of-its-kind FPGA garbage collector

Liquid Metal Summary
- High-level, concise abstractions for parallelism
- Gentle, incremental migration from Java
- Rich Eclipse-based tool support
- General solution for programming heterogeneous systems
No magic:
- parallel algorithm design is not easier
- one version of a program may not run well on all devices
- vanilla Java programs will not run on accelerators
http://www.research.ibm.com/liquidmetal/
