ScalaPipe: A Streaming Application Generator

ScalaPipe: A Streaming Application Generator Joseph G. Wingbermuehle, Roger D. Chamberlain, Ron K. Cytron This work is supported by the National Science Foundation under grants CNS-09095368 and CNS-0931693. 1

Streaming Computation kernels, or blocks connected by explicit communication channels Advantages: Performance Reuse Abstraction Systems: Auto-Pipe [Fr06] Streams-C [Go00] StreamIT [Th02] Stage 1 Stage 2 Stage 3 2

Example:Solution to Laplace s Equation PDE with several uses, including stationary heat diffusion Solvable using a Monte-Carlo technique 3

Streaming Implementation Random Walk Print 4

Parallel Walks Walk Random Split Average Print Walk 5

Auto-Pipe & X X Description X compiler C Block Application VHDL Block 6

Laplace Application in X e2 walk1 e4 rand e1 split avg e6 print Labels e3 walk2 e5 block top { Random rand; Split split; Walk walk1; Walk walk2; Average avg; Print print; e1: rand -> split; e2: split.y0 -> walk1; e3: split.y1 -> walk2; e4: walk1 -> avg.x0; e5: walk2 -> avg.x1; e6: avg -> print; }; Block instances Edge Connections Blocks are implemented externally in C or an HDL. 7

Observation 1 As the number of Walk blocks increases, the amount of configuration code increases Lines 100 80 60 40 20 Ê Ê Two Walk blocks: e1: rand -> split; e2: split.y0 -> walk1; e3: split.y1 -> walk2; e4: walk1 -> avg.x0; e5: walk2 -> avg.x1; e6: avg -> print; Ê Ê 5 10 15 128 Walk blocks requires 896 lines of X Ê Walk Blocks Four Walk blocks: e1: rand -> split1; e2: split1.y0 -> split2; e3: split1.y1 -> split3; e4: split2.y0 -> walk1; e5: split2.y1 -> walk2; e6: split3.y0 -> walk3; e7: split3.y1 -> walk4; e8: walk1 -> avg1.x0; e9: walk2 -> avg1.x1; e10: walk3 -> avg2.x0; e11: walk4 -> avg2.x1; e12: avg1 -> avg3.x0; e13: avg2 -> avg3.x1; e14: avg3 -> print; 8

Our Approach Type-safe generator language val Laplace = new AutoPipeApp { val random = Random() val splits = iteratedmap(levels, random, SplitU32) val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() } val result = iteratedfold(walks, AverageU32) Print(result) } Same code can generate 1 Walk block or 128 Walk blocks. 9

Observation 2 Moving blocks to a new device requires reimplementation HDL Implementation C Implementation Others 10

Our Approach A single language for block implementations ScalaPipe Block HDL Implementation C Implementation Others 11

Observation 3 Changing the data type requires new block implementations module ShiftRightU32(...); input wire[31:0] input_x; output wire[31:0] output_y;... output_y <= input_x >> 1;... endmodule module ShiftRightS64(...); input wire[63:0] input_x; output wire[63:0] output_y;... output_y <= input_x >>> 1;... endmodule 12

Our Solution Polymorphic block implementations class Average(t: AutoPipeType) extends AutoPipeBlock { val in0 = input(t) val in1 = input(t) val out = output(t) out = (in0 + in1) / 2 } Same implementation works for integral, fixed point, and floating point types. 13

Observation 4 The block interface for blocks on the same resource is a bottleneck Block Interface Block 1 Implementation Runtime System Block Interface Block 2 Implementation 14

Our Approach Single compiler for both the block language and coordination language. Compiler Coordination Language Block Language 15

ScalaPipe Source code (Scala) Scala compiler Generator Application 1 (e.g. 2 Walks) Coordination DSL ScalaPipe Library Block DSL Application 2 (e.g. 8 Walks) 16

AverageU32 Block val AverageU32 extends AutoPipeBlock { val in0 = input(unsigned32) val in1 = input(unsigned32) val out = output(unsigned32) out = (in0 + in1) / 2 } in0 AverageU32 out in1 17

Polymorphic Average Block class Average(t: AutoPipeType) extends AutoPipeBlock { val in0 = input(t) val in1 = input(t) val out = output(t) out = (in0 + in1) / 2 } val AverageU32 = new Average(UNSIGNED32) t can be any of the following: Signed or unsigned integer of any width Fixed point type Floating point type 18

Language Virtualization [Ch10] class Repeat(v: Int, count: Int) extends AutoPipeBlock { val in = input(signed32) val out = output(signed32) val tmp = local(signed32) tmp = in if (tmp == v) { // Evaluated at run time for (i <- 1 to count) { // Expanded at compile time out = tmp } } else { out = tmp } } 19

External AverageU32 Potentially more efficient External and internal blocks can be mixed val AverageU32 = new AutoPipeBlock { val in0 = input(unsigned32) val in1 = input(unsigned32) val out = output(unsigned32) external( HDL, AverageU32 ) // Optional internal implementation } 20

Block Code Generation Internal Block Specification Abstract Syntax Tree C Control Flow Graph OpenCL C External Block Specification Optimizer Verilog 21

HDL Code Optimizer Common subexpression elimination Dead store elimination Dead code elimination Strength reduction Copy propagation ASAP scheduling 22

Coordination DSL Describes the topology and resource mapping val Laplace = new AutoPipeApp { val random = Random() val splits = iteratedmap(levels, random, SplitU32) val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() } val result = iteratedfold(walks, AverageU32) Print(result) } 23

Generating Pipelines Inc Inc Inc Inc X language: block pipeline { input UNSIGNED32 source; output UNSIGNED32 result; Inc inc1; Inc inc2; Inc inc3; Inc inc4; }; source -> inc1; inc1 -> inc2; inc2 -> inc3; inc3 -> inc4; inc4 -> result; ScalaPipe: def pipeline(s: Stream, b: AutoPipeBlock, n: Int): Stream = { if (n > 0) { pipeline(b(s), b, n - 1) } else { s } } val result = pipeline(source, Inc, 4) 24

Aspect-Oriented Resource Mapping map(random -> ANY_BLOCK, CPU2FPGA()) CPU 0 FPGA 0 CPU 0 Walk Random Split Average Print Walk map(any_block -> Print, FPGA2CPU() 25

TimeTrial [La11] How do we find bottlenecks? measure(any_block -> Walk, backpressure) Walk Random Split Average Print % Backpressure 100 80 Walk 60 40 20 10 20 30 40 50 60 Frame 26

Illustration of Use Time HsL 200 CPU 0 RNG Walk Print 150 100 101s 50 CPU FPGA 16 Walks Custom RNG 27

Illustration of Use Time HsL FPGA 0 CPU 0 200 174s RNG Walk Print 150 100 101s 50 83% Backpressure CPU FPGA 16 Walks Custom RNG 28

Illustration of Use Time HsL FPGA 0 Walk CPU 0 200 174s RNG Split Print 150 100 101s Walk 50 41s 0% Backpressure CPU FPGA 16 Walks Custom RNG 29

Illustration of Use Time HsL FPGA 0 Walk CPU 0 200 174s crng Split Print 150 100 101s Walk 50 41s 12s CPU FPGA 16 Walks Custom RNG 30

The Current State of ScalaPipe Code generation for CPUs, FPGAs, and GPUs FPGA and GPU code generation is suboptimal No cross-block optimizations 31

The Future of ScalaPipe Improved code generation - Consume multiple items at a time - More Verilog and OpenCL C optimizations Support for more devices Library generation Cross-block optimizations 32

Conclusion ScalaPipe is a streaming application generator The block DSL allows code reuse across data types and platforms The coordination DSL allows easy generation of large and complex topologies Keeping everything in the same language exposes optimization opportunities ScalaPipe Coordination DSL Block DSL 33

References H. CHAFI, Z. DEVITO, A. MOORS, T. ROMPF, A. K. SUJEETH, P. HANRAHAN, M. ODERSKY, AND K. OLUKOTUN, Language virtualization for hetero- geneous parallel computing, in Proc. of ACM Int l Conf. on Object Oriented Programming Systems, Languages, and Applications, 2010, pp. 835 847. J.M. LANCASTER, J. G. WINGBERMUEHLE, AND R. D. CHAMBERLAIN, Asking for performance: Exploiting developer intuition to guide instrumentation with TimeTrial, in Proc. of IEEE 13th Int l Conf. on High Performance Computing and Communcations, Sep. 2011, pp. 321 330. M. A. FRANKLIN, E. J. TYSON, J. BUCKLEY, P. CROWLEY, AND J. MASCHMEYER, Auto-Pipe and the X language: A pipeline design tool and description language, in Proc. of Int l Parallel and Distributed Processing Symp., Apr. 2006. M. B. GOKHALE, J. M. STONE, J. ARNOLD, AND M. KALINOWSKI, Stream- oriented FPGA computing in the Streams-C high level language, in Proc. of IEEE Symp. on Field-Programmable Custom Computing Machines, Apr. 2000, pp. 49 56. W. THIES, M. KARCZMAREK, AND S. AMARASINGHE, StreamIt: A language for streaming applications, in Proc. of 11th Int l Conf. on Compiler Construction, 2002, pp. 179 196. 34