ScalaPipe: A Streaming Application Generator

1 ScalaPipe: A Streaming Application Generator
Joseph G. Wingbermuehle, Roger D. Chamberlain, Ron K. Cytron
This work is supported by the National Science Foundation under grants CNS and CNS

2 Streaming
Computation kernels, or blocks, connected by explicit communication channels
Advantages: performance, reuse, abstraction
Systems: Auto-Pipe [Fr06], Streams-C [Go00], StreamIt [Th02]
[Figure: Stage 1, Stage 2, Stage 3]

3 Example: Solution to Laplace's Equation
PDE with several uses, including stationary heat diffusion
Solvable using a Monte Carlo technique
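The Monte Carlo technique mentioned above can be sketched in plain Scala, outside of any streaming framework. This is a minimal sketch under stated assumptions, not the Walk block from the talk: the grid layout, the `Boundary` function type, and the names `walk` and `estimate` are all hypothetical. Each random walk starts at an interior cell and steps to a random neighbour until it leaves the grid; the average of the boundary values reached estimates the solution at the starting cell.

```scala
import scala.util.Random

object LaplaceWalk {
  /** Boundary value at a cell just outside the interior (hypothetical signature). */
  type Boundary = (Int, Int) => Double

  /** One random walk from (x0, y0); returns the boundary value where it exits.
    * Interior cells are 1..size in each dimension; 0 and size + 1 are boundary. */
  def walk(x0: Int, y0: Int, size: Int, boundary: Boundary, rng: Random): Double = {
    var x = x0
    var y = y0
    while (x >= 1 && x <= size && y >= 1 && y <= size) {
      rng.nextInt(4) match {
        case 0 => x += 1
        case 1 => x -= 1
        case 2 => y += 1
        case _ => y -= 1
      }
    }
    boundary(x, y)
  }

  /** Average the results of `walks` independent random walks. */
  def estimate(x: Int, y: Int, size: Int, boundary: Boundary, walks: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val total = (1 to walks).map(_ => walk(x, y, size, boundary, rng)).sum
    total / walks
  }
}
```

In the streaming version, the random number source, the walk, and the averaging become separate blocks, which is what makes running several walks in parallel natural.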

4 Streaming Implementation
[Figure: Random -> Walk -> Print]

5 Parallel Walks
[Figure: Random -> Split -> two parallel Walk blocks -> Average -> Print]

6 Auto-Pipe & X
[Figure: X description, C block, VHDL block, X compiler, application]

7 Laplace Application in X
[Figure: rand -> split (e1); split -> walk1 (e2) and split -> walk2 (e3); walk1 -> avg (e4) and walk2 -> avg (e5); avg -> print (e6)]
The declarations are block instances; the lines labeled e1 to e6 are edge connections.
    block top {
        Random rand;
        Split split;
        Walk walk1;
        Walk walk2;
        Average avg;
        Print print;
        e1: rand -> split;
        e2: split.y0 -> walk1;
        e3: split.y1 -> walk2;
        e4: walk1 -> avg.x0;
        e5: walk2 -> avg.x1;
        e6: avg -> print;
    };
Blocks are implemented externally in C or an HDL.

8 Observation 1
As the number of Walk blocks increases, the amount of configuration code increases.
[Figure: plot of lines of X versus number of Walk blocks; 128 Walk blocks require 896 lines of X]
Two Walk blocks:
    e1: rand -> split;
    e2: split.y0 -> walk1;
    e3: split.y1 -> walk2;
    e4: walk1 -> avg.x0;
    e5: walk2 -> avg.x1;
    e6: avg -> print;
Four Walk blocks:
    e1: rand -> split1;
    e2: split1.y0 -> split2;
    e3: split1.y1 -> split3;
    e4: split2.y0 -> walk1;
    e5: split2.y1 -> walk2;
    e6: split3.y0 -> walk3;
    e7: split3.y1 -> walk4;
    e8: walk1 -> avg1.x0;
    e9: walk2 -> avg1.x1;
    e10: walk3 -> avg2.x0;
    e11: walk4 -> avg2.x1;
    e12: avg1 -> avg3.x0;
    e13: avg2 -> avg3.x1;
    e14: avg3 -> print;
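The growth in configuration code can be reproduced with a small generator. The sketch below is illustrative only and assumes one line per block instance, one line per edge, and two lines for the enclosing `block top { ... };`. Its block names (`split1`, `avg1`, numbered as a heap) are this sketch's own choice and differ from the numbering on the slide. Under those assumptions n Walk blocks need 3n instance lines, 4n - 2 edge lines, and 2 wrapper lines, so 7n lines in total, which agrees with the 896 lines quoted for 128 Walk blocks.

```scala
object XGenerator {
  /** Emit X source lines for a Laplace application with 2^levels Walk blocks. */
  def generate(levels: Int): Seq[String] = {
    require(levels >= 0, "levels must be non-negative")
    val n = 1 << levels

    // Trees are heap-indexed: node i has children 2i and 2i + 1.
    // Indices n .. 2n - 1 are the leaves, i.e. walk1 .. walkN.
    def name(prefix: String, i: Int): String =
      if (i >= n) "walk" + (i - n + 1) else prefix + i

    val instances: Seq[String] =
      Seq("Random rand;") ++
        (1 until n).map(i => "Split split" + i + ";") ++
        (1 to n).map(i => "Walk walk" + i + ";") ++
        (1 until n).map(i => "Average avg" + i + ";") ++
        Seq("Print print;")

    // Each Split feeds two children; each Average is fed by two children.
    val fanOut: Seq[String] = (1 until n).flatMap { i =>
      Seq(
        "split" + i + ".y0 -> " + name("split", 2 * i) + ";",
        "split" + i + ".y1 -> " + name("split", 2 * i + 1) + ";")
    }
    val fanIn: Seq[String] = (1 until n).flatMap { i =>
      Seq(
        name("avg", 2 * i) + " -> avg" + i + ".x0;",
        name("avg", 2 * i + 1) + " -> avg" + i + ".x1;")
    }
    val connections: Seq[String] =
      Seq("rand -> " + name("split", 1) + ";") ++ fanOut ++ fanIn ++
        Seq(name("avg", 1) + " -> print;")

    // Label edges e1, e2, ... in emission order.
    val edges = connections.zipWithIndex.map { case (c, k) => "e" + (k + 1) + ": " + c }

    Seq("block top {") ++ instances ++ edges ++ Seq("};")
  }
}
```

Calling `XGenerator.generate(7).length` counts the lines for 128 Walk blocks, and `XGenerator.generate(1).foreach(println)` prints a two-Walk description comparable to the hand-written one.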

9 Our Approach
Type-safe generator language
    val Laplace = new AutoPipeApp {
        val random = Random()
        val splits = iteratedmap(levels, random, SplitU32)
        val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() }
        val result = iteratedfold(walks, AverageU32)
        Print(result)
    }
The same code can generate 1 Walk block or 128 Walk blocks.

10 Observation 2
Moving blocks to a new device requires reimplementation.
[Figure: HDL implementation, C implementation, others]

11 Our Approach
A single language for block implementations
[Figure: one ScalaPipe block leading to the HDL implementation, the C implementation, and others]

12 Observation 3
Changing the data type requires new block implementations.
    module ShiftRightU32(...);
        input wire[31:0] input_x;
        output wire[31:0] output_y;
        ...
        output_y <= input_x >> 1;
        ...
    endmodule

    module ShiftRightS64(...);
        input wire[63:0] input_x;
        output wire[63:0] output_y;
        ...
        output_y <= input_x >>> 1;
        ...
    endmodule

13 Our Solution
Polymorphic block implementations
    class Average(t: AutoPipeType) extends AutoPipeBlock {
        val in0 = input(t)
        val in1 = input(t)
        val out = output(t)
        out = (in0 + in1) / 2
    }
The same implementation works for integral, fixed-point, and floating-point types.

14 Observation 4
The block interface for blocks on the same resource is a bottleneck.
[Figure: Block 1 implementation and Block 2 implementation, each behind its own block interface, connected through the runtime system]

15 Our Approach
A single compiler for both the block language and the coordination language.
[Figure: compiler, coordination language, block language]

16 ScalaPipe
[Figure: source code (Scala), ScalaPipe library (coordination DSL and block DSL), Scala compiler, generator, Application 1 (e.g. 2 Walks), Application 2 (e.g. 8 Walks)]

17 AverageU32 Block
    val AverageU32 = new AutoPipeBlock {
        val in0 = input(unsigned32)
        val in1 = input(unsigned32)
        val out = output(unsigned32)
        out = (in0 + in1) / 2
    }
[Figure: AverageU32 block with inputs in0 and in1 and output out]

18 Polymorphic Average Block
    class Average(t: AutoPipeType) extends AutoPipeBlock {
        val in0 = input(t)
        val in1 = input(t)
        val out = output(t)
        out = (in0 + in1) / 2
    }
    val AverageU32 = new Average(UNSIGNED32)
t can be any of the following:
    Signed or unsigned integer of any width
    Fixed-point type
    Floating-point type
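One way to picture a block parameterized by a type value is as a function from a type descriptor to generated code. The sketch below is not ScalaPipe's implementation: `HwType`, `UInt`, `SInt`, `Float32`, and `averageC` are hypothetical names, and the output is a C function only because that is easy to read. It covers only widths that have a fixed-width C type (8, 16, 32, 64) and ignores overflow of the sum.

```scala
object PolyBlockSketch {
  // Hypothetical stand-in for AutoPipeType: just enough to name a C type.
  sealed trait HwType { def cName: String }
  case class UInt(width: Int) extends HwType { def cName: String = "uint" + width + "_t" }
  case class SInt(width: Int) extends HwType { def cName: String = "int" + width + "_t" }
  case object Float32 extends HwType { def cName: String = "float" }

  /** A single description of Average, rendered as C for the requested type. */
  def averageC(t: HwType): String = {
    val c = t.cName
    c + " average(" + c + " in0, " + c + " in1) { return (in0 + in1) / 2; }"
  }
}
```

For example, `PolyBlockSketch.averageC(PolyBlockSketch.SInt(64))` yields an `int64_t` version from the same description, which is the reuse the slide argues for.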

19 Language Virtualization [Ch10]
    class Repeat(v: Int, count: Int) extends AutoPipeBlock {
        val in = input(signed32)
        val out = output(signed32)
        val tmp = local(signed32)
        tmp = in
        if (tmp == v) {                  // Evaluated at run time
            for (i <- 1 to count) {      // Expanded at compile time
                out = tmp
            }
        } else {
            out = tmp
        }
    }
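The difference between "evaluated at run time" and "expanded at compile time" can be modelled with a tiny statement tree. This is a sketch under stated assumptions rather than how ScalaPipe represents blocks: `Stmt`, `Assign`, `If`, and `repeat` are hypothetical. The comparison on the stream value becomes a node in the generated program, while the loop over `count` is an ordinary Scala loop that runs during generation and therefore leaves `count` copies of its body behind.

```scala
object StagingSketch {
  sealed trait Stmt
  case class Assign(dst: String, src: String) extends Stmt
  case class If(cond: String, thenBranch: List[Stmt], elseBranch: List[Stmt]) extends Stmt

  /** Build the statement tree for a Repeat-like block. */
  def repeat(v: Int, count: Int): List[Stmt] = {
    // Ordinary Scala loop: executes now, so the body is unrolled `count` times.
    val unrolled: List[Stmt] = (1 to count).toList.map(_ => Assign("out", "tmp"))
    List(
      Assign("tmp", "in"),
      // Condition on a stream value: kept as a node, decided when the block runs.
      If("tmp == " + v, unrolled, List(Assign("out", "tmp")))
    )
  }
}
```

Calling `StagingSketch.repeat(7, 3)` produces an `If` whose then-branch holds three `Assign("out", "tmp")` statements and no loop.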

20 External AverageU32
Potentially more efficient
External and internal blocks can be mixed
    val AverageU32 = new AutoPipeBlock {
        val in0 = input(unsigned32)
        val in1 = input(unsigned32)
        val out = output(unsigned32)
        external("HDL", "AverageU32")
        // Optional internal implementation
    }

21 Block Code Generation
[Figure: code-generation flow with the elements Internal Block Specification, External Block Specification, Abstract Syntax Tree, Control Flow Graph, and Optimizer, and the outputs C, OpenCL C, and Verilog]

22 HDL Code Optimizer
    Common subexpression elimination
    Dead store elimination
    Dead code elimination
    Strength reduction
    Copy propagation
    ASAP scheduling
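As an illustration of one pass from this list, strength reduction can be sketched on a small expression tree. This is a minimal sketch, not the ScalaPipe optimizer: the `Expr` hierarchy and `reduce` are hypothetical, and replacing a division by a right shift is only valid as written for unsigned operands. Multiplication or division by a constant power of two becomes a shift, which is cheaper in hardware.

```scala
object StrengthReduction {
  sealed trait Expr
  case class Var(name: String) extends Expr
  case class Const(value: Int) extends Expr
  case class Add(l: Expr, r: Expr) extends Expr
  case class Mul(l: Expr, r: Expr) extends Expr
  case class Div(l: Expr, r: Expr) extends Expr
  case class Shl(e: Expr, bits: Int) extends Expr
  case class Shr(e: Expr, bits: Int) extends Expr

  /** Some(k) when n == 2^k with k >= 1, otherwise None. */
  def log2Exact(n: Int): Option[Int] =
    if (n > 1 && (n & (n - 1)) == 0) Some(Integer.numberOfTrailingZeros(n)) else None

  /** Rewrite x * 2^k to x << k and x / 2^k to x >> k (unsigned operands assumed). */
  def reduce(e: Expr): Expr = e match {
    case Add(l, r) => Add(reduce(l), reduce(r))
    case Mul(l, Const(c)) if log2Exact(c).isDefined => Shl(reduce(l), log2Exact(c).get)
    case Div(l, Const(c)) if log2Exact(c).isDefined => Shr(reduce(l), log2Exact(c).get)
    case Mul(l, r) => Mul(reduce(l), reduce(r))
    case Div(l, r) => Div(reduce(l), reduce(r))
    case Shl(x, b) => Shl(reduce(x), b)
    case Shr(x, b) => Shr(reduce(x), b)
    case other => other // Var and Const are already minimal
  }
}
```

Applied to the body of the Average block, `(in0 + in1) / 2` becomes `(in0 + in1) >> 1`.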

23 Coordination DSL
Describes the topology and resource mapping
    val Laplace = new AutoPipeApp {
        val random = Random()
        val splits = iteratedmap(levels, random, SplitU32)
        val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() }
        val result = iteratedfold(walks, AverageU32)
        Print(result)
    }
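The slides do not define `iteratedmap` and `iteratedfold`, so the following is only a guess at their shape, modelled on plain Scala values instead of streams. The names `iteratedMapModel` and `iteratedFoldModel` are hypothetical. The first applies a one-to-two split `levels` times to obtain 2^levels outputs; the second combines neighbours pairwise until one value remains, which matches the tree of Split and Average blocks in the hand-written X code.

```scala
object TopologySketch {
  /** Apply a one-to-two split `levels` times: one stream becomes 2^levels streams. */
  def iteratedMapModel[A](levels: Int, seed: A, split: A => (A, A)): Vector[A] =
    (1 to levels).foldLeft(Vector(seed)) { (streams, _) =>
      streams.flatMap { s =>
        val (a, b) = split(s)
        Vector(a, b)
      }
    }

  /** Combine neighbours pairwise until a single stream remains. */
  def iteratedFoldModel[A](streams: Vector[A], combine: (A, A) => A): A = {
    require(streams.nonEmpty && (streams.length & (streams.length - 1)) == 0,
      "expects a power-of-two number of streams")
    if (streams.length == 1) streams.head
    else iteratedFoldModel(streams.grouped(2).map(p => combine(p(0), p(1))).toVector, combine)
  }
}
```

With strings standing in for streams, two levels of splitting turn `"s"` into `s00`, `s01`, `s10`, `s11`, and folding four values with an averaging function reduces them through two levels of pairs.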

24 Generating Pipelines
[Figure: four Inc blocks in sequence]
X language:
    block pipeline {
        input UNSIGNED32 source;
        output UNSIGNED32 result;
        Inc inc1;
        Inc inc2;
        Inc inc3;
        Inc inc4;
        source -> inc1;
        inc1 -> inc2;
        inc2 -> inc3;
        inc3 -> inc4;
        inc4 -> result;
    };
ScalaPipe:
    def pipeline(s: Stream, b: AutoPipeBlock, n: Int): Stream = {
        if (n > 0) {
            pipeline(b(s), b, n - 1)
        } else {
            s
        }
    }
    val result = pipeline(source, Inc, 4)
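The recursive `pipeline` function can be tried without ScalaPipe by modelling a stream as a plain value and a block as a function on it. This is a sketch under that assumption; `PipelineSketch` and `pipelineFold` are hypothetical names, and the second function is an equivalent fold-based formulation, not something shown in the talk.

```scala
import scala.annotation.tailrec

object PipelineSketch {
  /** Same recursion as the slide, with a stream modelled as a value of type S. */
  @tailrec
  def pipeline[S](s: S, b: S => S, n: Int): S =
    if (n > 0) pipeline(b(s), b, n - 1) else s

  /** Equivalent formulation: fold the input through n copies of the block. */
  def pipelineFold[S](s: S, b: S => S, n: Int): S =
    Iterator.fill(n)(b).foldLeft(s)((acc, f) => f(acc))
}
```

With `(x: Int) => x + 1` standing in for `Inc`, `PipelineSketch.pipeline(0, (x: Int) => x + 1, 4)` chains four increments, mirroring `pipeline(source, Inc, 4)`.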

25 Aspect-Oriented Resource Mapping
    map(random -> ANY_BLOCK, CPU2FPGA())
    map(any_block -> Print, FPGA2CPU())
[Figure: Random on CPU 0; Split, two Walk blocks, and Average on FPGA 0; Print on CPU 0]

26 TimeTrial [La11]
How do we find bottlenecks?
    measure(any_block -> Walk, backpressure)
[Figure: the Random, Split, Walk, Average, Print topology with a plot of % backpressure versus frame]

27 Illustration of Use
[Figure: bar chart of time (s) for the configurations CPU, FPGA, 16 Walks, and Custom RNG. Topology shown: RNG, Walk, and Print on CPU 0.]

28 Illustration of Use
[Figure: same chart. Topology shown: RNG, Walk, and Print across the CPU and FPGA 0; measured backpressure is 83%.]

29 Illustration of Use
[Figure: same chart, with a bar labeled 41 s. Topology shown: RNG, Split, two Walk blocks, and Print across the CPU and FPGA 0; measured backpressure is 0%.]

30 Illustration of Use
[Figure: same chart, with bars labeled 41 s and 12 s. Topology shown: a custom RNG (crng), Split, two Walk blocks, and Print across the CPU and FPGA 0.]

31 The Current State of ScalaPipe
Code generation for CPUs, FPGAs, and GPUs
FPGA and GPU code generation is suboptimal
No cross-block optimizations

32 The Future of ScalaPipe
Improved code generation
    - Consume multiple items at a time
    - More Verilog and OpenCL C optimizations
Support for more devices
Library generation
Cross-block optimizations

33 Conclusion
ScalaPipe is a streaming application generator.
The block DSL allows code reuse across data types and platforms.
The coordination DSL allows easy generation of large and complex topologies.
Keeping everything in the same language exposes optimization opportunities.
[Figure: ScalaPipe, comprising the coordination DSL and the block DSL]

34 References
[Ch10] H. Chafi, Z. DeVito, A. Moors, T. Rompf, A. K. Sujeeth, P. Hanrahan, M. Odersky, and K. Olukotun, Language virtualization for heterogeneous parallel computing, in Proc. of ACM Int'l Conf. on Object Oriented Programming Systems, Languages, and Applications, 2010, pp
[La11] J. M. Lancaster, J. G. Wingbermuehle, and R. D. Chamberlain, Asking for performance: Exploiting developer intuition to guide instrumentation with TimeTrial, in Proc. of IEEE 13th Int'l Conf. on High Performance Computing and Communications, Sep. 2011, pp
[Fr06] M. A. Franklin, E. J. Tyson, J. Buckley, P. Crowley, and J. Maschmeyer, Auto-Pipe and the X language: A pipeline design tool and description language, in Proc. of Int'l Parallel and Distributed Processing Symp., Apr
[Go00] M. B. Gokhale, J. M. Stone, J. Arnold, and M. Kalinowski, Stream-oriented FPGA computing in the Streams-C high level language, in Proc. of IEEE Symp. on Field-Programmable Custom Computing Machines, Apr. 2000, pp
[Th02] W. Thies, M. Karczmarek, and S. Amarasinghe, StreamIt: A language for streaming applications, in Proc. of 11th Int'l Conf. on Compiler Construction, 2002, pp
