ScalaPipe: A Streaming Application Generator

Similar documents
ScalaPipe: A Streaming Application Generator

Joe Wingbermuehle, (A paper written under the guidance of Prof. Raj Jain)

ScalaPipe. Contents. From Auto-Pipe Wiki

Delite: A Framework for Heterogeneous Parallel DSLs

Parallel Programming

Low-Impact Profiling of Streaming, Heterogeneous Applications

Application-guided Tool Development for Architecturally Diverse Computation

Superoptimized Memory Subsystems for Streaming Applications

Simulation of Streaming Applications on Multicore Systems

Orchestrating Safe Streaming Computations with Precise Control

OptiML: An Implicitly Parallel Domain-Specific Language for ML

Analysis of Sorting as a Streaming Application

Delite. Hassan Chafi, Arvind Sujeeth, Kevin Brown, HyoukJoong Lee, Kunle Olukotun Stanford University. Tiark Rompf, Martin Odersky EPFL

Communication Library to Overlap Computation and Communication for OpenCL Application

Higher Level Programming Abstractions for FPGAs using OpenCL

A Translation Framework for Automatic Translation of Annotated LLVM IR into OpenCL Kernel Function

A DOMAIN SPECIFIC APPROACH TO HETEROGENEOUS PARALLELISM

Language Virtualization for Heterogeneous Parallel Computing

RaftLib: A C++ Template Library for High Performance Stream Parallel Processing

Kunle Olukotun Pervasive Parallelism Laboratory Stanford University

Custom computing systems

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

Simple Analytic Performance Models for Streaming Data Applications Deployed on Diverse Architectures

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules

Dynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California

Sorting on Architecturally Diverse Computer Systems

Altera SDK for OpenCL

Architectural-Level Synthesis. Giovanni De Micheli Integrated Systems Centre EPF Lausanne

Computer Aided Design Basic Syntax Gate Level Modeling Behavioral Modeling. Verilog

Domain Specific Languages for Financial Payoffs. Matthew Leslie Bank of America Merrill Lynch

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9

Design and Performance of the OP2 Library for Unstructured Mesh Applications

Scala. Fernando Medeiros Tomás Paim

CS153: Compilers Lecture 15: Local Optimization

OptiML: An Implicitly Parallel Domain-Specific Language for ML

A Stream Compiler for Communication-Exposed Architectures

Project Final Report High Performance Pipeline Compiler

Accelerator Spectrum

Code Optimization. Code Optimization

Simplifying Parallel Programming with Domain Specific Languages

Intro to HW Design & Externs for P4àNetFPGA. CS344 Lecture 5

MIT Introduction to Program Analysis and Optimization. Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology

ICS 252 Introduction to Computer Design

New Developments in Spark

What Compilers Can and Cannot Do. Saman Amarasinghe Fall 2009

Auto-Pipe and the X Language: A Pipeline Design Tool and Description Language

Power Efficient Solutions w/ FPGAs. Bill Jenkins Altera Sr. Product Specialist for Programming Language Solutions

VHDL for Synthesis. Course Description. Course Duration. Goals

FPGAs as Components in Heterogeneous HPC Systems: Raising the Abstraction Level of Heterogeneous Programming

Modern Processor Architectures. L25: Modern Compiler Design

Platform-Specific Optimization and Mapping of Stencil Codes through Refinement

Overview of ROCCC 2.0

Verification and Validation of X-Sim: A Trace-Based Simulator

HDL. Operations and dependencies. FSMs Logic functions HDL. Interconnected logic blocks HDL BEHAVIORAL VIEW LOGIC LEVEL ARCHITECTURAL LEVEL

Python VSIP API: A first draft

Lecture 38 VHDL Description: Addition of Two [5 5] Matrices

Design Verification Lecture 01

CS377P Programming for Performance GPU Programming - II

Hardware Acceleration of Edge Detection Algorithm on FPGAs

Automated Reliability Classification of Queueing Models for Streaming Computation

Supporting Data Parallelism in Matcloud: Final Report

Translating Haskell to Hardware. Lianne Lairmore Columbia University

Compiler Code Generation COMP360

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Creating Safe State Machines

General Purpose GPU Programming (1) Advanced Operating Systems Lecture 14

Verilog for High Performance

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

Synthesizing Benchmarks for Predictive Modeling.

structure syntax different levels of abstraction

Here is a list of lecture objectives. They are provided for you to reflect on what you are supposed to learn, rather than an introduction to this

Chapter 3: Dataflow Modeling

[Sub Track 1-3] FPGA/ASIC 을타겟으로한알고리즘의효율적인생성방법및신기능소개

A Simple Path to Parallelism with Intel Cilk Plus

ECE331: Hardware Organization and Design

Introduction to Verilog

What is a compiler? Xiaokang Qiu Purdue University. August 21, 2017 ECE 573

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

RTL Coding General Concepts

High Level Synthesis

HIGH PERFORMANCE PEDESTRIAN DETECTION ON TEGRA X1

High Level Programming for GPGPU. Jason Yang Justin Hensley

Programming in C++ 6. Floating point data types

Loop Optimizations. Outline. Loop Invariant Code Motion. Induction Variables. Loop Invariant Code Motion. Loop Invariant Code Motion

Visions for Application Development on Hybrid Computing Systems

Implementation of DSP Algorithms

NETWORK ON CHIP TO IMPLEMENT THE SYSTEM-LEVEL COMMUNICATION SIMPLIFIES THE DISTRIBUTION OF I/O DATA THROUGHOUT THE CHIP, AND IS ALWAYS

Lecture 15: System Modeling and Verilog

StreamIt on Fleet. Amir Kamil Computer Science Division, University of California, Berkeley UCB-AK06.

Simone Campanoni Loop transformations

Introduction to Multicore Programming

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Scheduling Image Processing Pipelines

Program Optimization. Jo, Heeseung

Using Static Single Assignment Form

FPGA Design Challenge :Techkriti 14 Digital Design using Verilog Part 1

CS 240 Final Exam Review

Mapping Vector Codes to a Stream Processor (Imagine)

Transcription:

ScalaPipe: A Streaming Application Generator Joseph G. Wingbermuehle, Roger D. Chamberlain, Ron K. Cytron This work is supported by the National Science Foundation under grants CNS-09095368 and CNS-0931693. 1

Streaming Computation kernels, or blocks connected by explicit communication channels Advantages: Performance Reuse Abstraction Systems: Auto-Pipe [Fr06] Streams-C [Go00] StreamIT [Th02] Stage 1 Stage 2 Stage 3 2

Example:Solution to Laplace s Equation PDE with several uses, including stationary heat diffusion Solvable using a Monte-Carlo technique 3

Streaming Implementation Random Walk Print 4

Parallel Walks Walk Random Split Average Print Walk 5

Auto-Pipe & X X Description X compiler C Block Application VHDL Block 6

Laplace Application in X e2 walk1 e4 rand e1 split avg e6 print Labels e3 walk2 e5 block top { Random rand; Split split; Walk walk1; Walk walk2; Average avg; Print print; e1: rand -> split; e2: split.y0 -> walk1; e3: split.y1 -> walk2; e4: walk1 -> avg.x0; e5: walk2 -> avg.x1; e6: avg -> print; }; Block instances Edge Connections Blocks are implemented externally in C or an HDL. 7

Observation 1 As the number of Walk blocks increases, the amount of configuration code increases Lines 100 80 60 40 20 Ê Ê Two Walk blocks: e1: rand -> split; e2: split.y0 -> walk1; e3: split.y1 -> walk2; e4: walk1 -> avg.x0; e5: walk2 -> avg.x1; e6: avg -> print; Ê Ê 5 10 15 128 Walk blocks requires 896 lines of X Ê Walk Blocks Four Walk blocks: e1: rand -> split1; e2: split1.y0 -> split2; e3: split1.y1 -> split3; e4: split2.y0 -> walk1; e5: split2.y1 -> walk2; e6: split3.y0 -> walk3; e7: split3.y1 -> walk4; e8: walk1 -> avg1.x0; e9: walk2 -> avg1.x1; e10: walk3 -> avg2.x0; e11: walk4 -> avg2.x1; e12: avg1 -> avg3.x0; e13: avg2 -> avg3.x1; e14: avg3 -> print; 8

Our Approach Type-safe generator language val Laplace = new AutoPipeApp { val random = Random() val splits = iteratedmap(levels, random, SplitU32) val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() } val result = iteratedfold(walks, AverageU32) Print(result) } Same code can generate 1 Walk block or 128 Walk blocks. 9

Observation 2 Moving blocks to a new device requires reimplementation HDL Implementation C Implementation Others 10

Our Approach A single language for block implementations ScalaPipe Block HDL Implementation C Implementation Others 11

Observation 3 Changing the data type requires new block implementations module ShiftRightU32(...); input wire[31:0] input_x; output wire[31:0] output_y;... output_y <= input_x >> 1;... endmodule module ShiftRightS64(...); input wire[63:0] input_x; output wire[63:0] output_y;... output_y <= input_x >>> 1;... endmodule 12

Our Solution Polymorphic block implementations class Average(t: AutoPipeType) extends AutoPipeBlock { val in0 = input(t) val in1 = input(t) val out = output(t) out = (in0 + in1) / 2 } Same implementation works for integral, fixed point, and floating point types. 13

Observation 4 The block interface for blocks on the same resource is a bottleneck Block Interface Block 1 Implementation Runtime System Block Interface Block 2 Implementation 14

Our Approach Single compiler for both the block language and coordination language. Compiler Coordination Language Block Language 15

ScalaPipe Source code (Scala) Scala compiler Generator Application 1 (e.g. 2 Walks) Coordination DSL ScalaPipe Library Block DSL Application 2 (e.g. 8 Walks) 16

AverageU32 Block val AverageU32 extends AutoPipeBlock { val in0 = input(unsigned32) val in1 = input(unsigned32) val out = output(unsigned32) out = (in0 + in1) / 2 } in0 AverageU32 out in1 17

Polymorphic Average Block class Average(t: AutoPipeType) extends AutoPipeBlock { val in0 = input(t) val in1 = input(t) val out = output(t) out = (in0 + in1) / 2 } val AverageU32 = new Average(UNSIGNED32) t can be any of the following: Signed or unsigned integer of any width Fixed point type Floating point type 18

Language Virtualization [Ch10] class Repeat(v: Int, count: Int) extends AutoPipeBlock { val in = input(signed32) val out = output(signed32) val tmp = local(signed32) tmp = in if (tmp == v) { // Evaluated at run time for (i <- 1 to count) { // Expanded at compile time out = tmp } } else { out = tmp } } 19

External AverageU32 Potentially more efficient External and internal blocks can be mixed val AverageU32 = new AutoPipeBlock { val in0 = input(unsigned32) val in1 = input(unsigned32) val out = output(unsigned32) external( HDL, AverageU32 ) // Optional internal implementation } 20

Block Code Generation Internal Block Specification Abstract Syntax Tree C Control Flow Graph OpenCL C External Block Specification Optimizer Verilog 21

HDL Code Optimizer Common subexpression elimination Dead store elimination Dead code elimination Strength reduction Copy propagation ASAP scheduling 22

Coordination DSL Describes the topology and resource mapping val Laplace = new AutoPipeApp { val random = Random() val splits = iteratedmap(levels, random, SplitU32) val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() } val result = iteratedfold(walks, AverageU32) Print(result) } 23

Generating Pipelines Inc Inc Inc Inc X language: block pipeline { input UNSIGNED32 source; output UNSIGNED32 result; Inc inc1; Inc inc2; Inc inc3; Inc inc4; }; source -> inc1; inc1 -> inc2; inc2 -> inc3; inc3 -> inc4; inc4 -> result; ScalaPipe: def pipeline(s: Stream, b: AutoPipeBlock, n: Int): Stream = { if (n > 0) { pipeline(b(s), b, n - 1) } else { s } } val result = pipeline(source, Inc, 4) 24

Aspect-Oriented Resource Mapping map(random -> ANY_BLOCK, CPU2FPGA()) CPU 0 FPGA 0 CPU 0 Walk Random Split Average Print Walk map(any_block -> Print, FPGA2CPU() 25

TimeTrial [La11] How do we find bottlenecks? measure(any_block -> Walk, backpressure) Walk Random Split Average Print % Backpressure 100 80 Walk 60 40 20 10 20 30 40 50 60 Frame 26

Illustration of Use Time HsL 200 CPU 0 RNG Walk Print 150 100 101s 50 CPU FPGA 16 Walks Custom RNG 27

Illustration of Use Time HsL FPGA 0 CPU 0 200 174s RNG Walk Print 150 100 101s 50 83% Backpressure CPU FPGA 16 Walks Custom RNG 28

Illustration of Use Time HsL FPGA 0 Walk CPU 0 200 174s RNG Split Print 150 100 101s Walk 50 41s 0% Backpressure CPU FPGA 16 Walks Custom RNG 29

Illustration of Use Time HsL FPGA 0 Walk CPU 0 200 174s crng Split Print 150 100 101s Walk 50 41s 12s CPU FPGA 16 Walks Custom RNG 30

The Current State of ScalaPipe Code generation for CPUs, FPGAs, and GPUs FPGA and GPU code generation is suboptimal No cross-block optimizations 31

The Future of ScalaPipe Improved code generation - Consume multiple items at a time - More Verilog and OpenCL C optimizations Support for more devices Library generation Cross-block optimizations 32

Conclusion ScalaPipe is a streaming application generator The block DSL allows code reuse across data types and platforms The coordination DSL allows easy generation of large and complex topologies Keeping everything in the same language exposes optimization opportunities ScalaPipe Coordination DSL Block DSL 33

References H. CHAFI, Z. DEVITO, A. MOORS, T. ROMPF, A. K. SUJEETH, P. HANRAHAN, M. ODERSKY, AND K. OLUKOTUN, Language virtualization for hetero- geneous parallel computing, in Proc. of ACM Int l Conf. on Object Oriented Programming Systems, Languages, and Applications, 2010, pp. 835 847. J.M. LANCASTER, J. G. WINGBERMUEHLE, AND R. D. CHAMBERLAIN, Asking for performance: Exploiting developer intuition to guide instrumentation with TimeTrial, in Proc. of IEEE 13th Int l Conf. on High Performance Computing and Communcations, Sep. 2011, pp. 321 330. M. A. FRANKLIN, E. J. TYSON, J. BUCKLEY, P. CROWLEY, AND J. MASCHMEYER, Auto-Pipe and the X language: A pipeline design tool and description language, in Proc. of Int l Parallel and Distributed Processing Symp., Apr. 2006. M. B. GOKHALE, J. M. STONE, J. ARNOLD, AND M. KALINOWSKI, Stream- oriented FPGA computing in the Streams-C high level language, in Proc. of IEEE Symp. on Field-Programmable Custom Computing Machines, Apr. 2000, pp. 49 56. W. THIES, M. KARCZMAREK, AND S. AMARASINGHE, StreamIt: A language for streaming applications, in Proc. of 11th Int l Conf. on Compiler Construction, 2002, pp. 179 196. 34