Lecture 8. The StreamIt Language IAP 2007 MIT

Size: px

Start display at page:

Download "Lecture 8. The StreamIt Language IAP 2007 MIT"

Bernadette Dorsey
5 years ago
Views:

1 6.189 IAP 2007 Lecture 8 The StreamIt Language IAP 2007 MIT

architecture with complicated, ad-hoc language Bend

2 Languages Have Not Kept Up C von-neumann machine Modern architecture Two choices: Develop cool architecture with complicated, ad-hoc language Bend over backwards to support old languages like C/C IAP 2007 MIT

3 Why a New Language? #of cores For uniprocessors, Pi hi PC102 C was: Portable High Performance Raza Composable Raw XLR Malleable Niagara Maintainable a abe Broadcom Picochip Ambric AM Cisco CSR-1 Intel Tflops Cavium Octeon Opteron 4P 4 Xeon MP Xbox360 PA-8800 Opteron Tanglewood Power4 2 PExtreme Power6 Yonah Pentium P2 P3 Itanium 1 P Athlon Itanium 2 Cell ?? IAP 2007 MIT

4 Why a New Language? #of cores 512 Picochip Ambric PC102 AM Uniprocessors: C is the common machine language Raw Cisco CSR-1 Intel Tflops Raza XLR Niagara Cavium Octeon Opteron 4P Broadcom Xeon MP Xbox360 PA-8800 Opteron Tanglewood Power4 2 PExtreme Power6 Yonah Pentium P2 P3 Itanium 1 P Athlon Itanium 2 Cell ?? IAP 2007 MIT

5 Why a New Language? #of cores What is the common Pi hi PC102 machine language for multicores? 512 Picochip Ambric AM Raw Cisco CSR-1 Intel Tflops Raza XLR Niagara Cavium Octeon Opteron 4P Broadcom Xeon MP Xbox360 PA-8800 Opteron Tanglewood Power4 2 PExtreme Power6 Yonah Pentium P2 P3 Itanium 1 P Athlon Itanium 2 Cell ?? IAP 2007 MIT

6 Common Machine Languages Uniprocessors: Common Properties Single flow of control Single memory image Differences: Register File ISA Functional Units von-neumann languages represent the common properties and abstract away the differences Multicores: Common Properties Multiple flows of control Multiple local memories Differences: Number and capabilities of cores Communication Model Synchronization Model Need common machine language(s) for multicores IAP 2007 MIT

7 Streaming as a Common Machine Language For programs based on streams of data Audio, video, DSP, networking, and cryptographic processing kernels Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics Several attractive properties Regular and repeating computation Independent filters with explicit communication AtoD FMDemod Scatter LPF 1 LPF 2 LPF 3 HPF 1 HPF 2 HPF 3 Gather Task, data, and pipeline parallelism Adder Speaker IAP 2007 MIT

8 Streaming Models of Computation Many different ways to represent streaming Do senders/receivers block? How much buffering is allowed on channels? Is computation deterministic? Can you avoid deadlock? Three common models: 1. Kahn Process Networks 2. Synchronous Dataflow 3. Communicating Sequential Processes IAP 2007 MIT

9 Streaming Models of Computation Communication Pattern Buffering Notes Kahn process networks (KPN) Synchronous dataflow (SDF) Data-dependent, but deterministic Static Conceptually unbounded Fixed by compiler - UNIX pipes - Ambric (startup) - Static scheduling - Deadlock freedom Communicating Data-dependent, None - Rich synchronization Sequential allows nondeterminism - Occam (Rendesvouz) primitives Processes (CSP) language SDF KPN CSP space of program behaviors IAP 2007 MIT

10 The StreamIt Language A high-level, architecture-independent language for streaming applications Improves programmer productivity (vs. Java, C) Offers scalable performance on multicores Based on synchronous dataflow, with dynamic extensions Compiler determines execution order of filters Many aggressive optimizations possible IAP 2007 MIT

11 The StreamIt Project Applications DES and Serpent [PLDI 05] MPEG-2 [IPDPS 06] SAR, DSP benchmarks, JPEG, Programmability StreamIt Language (CC 02) Teleport Messaging (PPOPP 05) Programming Environment in Eclipse (P-PHEC 05) Domain Specific Optimizations Linear Analysis and Optimization (PLDI 03) Optimizations for bit streaming (PLDI 05) Linear State Space Analysis (CASES 05) Architecture Specific Optimizations Compiling for Communication-Exposed Architectures (ASPLOS 02) Phased Scheduling (LCTES 03) Cache Aware Optimization (LCTES 05) Load-Balanced Rendering (Graphics Hardware 05) Uniprocessor backend Simulator (Java Library) StreamIt Program Front-end Annotated Java Cluster backend Stream-Aware Optimizations Raw backend IBM X10 backend C/C++ MPI-like C per tile + Streaming C/C++ msg code X10 runtime IAP 2007 MIT

12 Example: A Simple Counter void->void pipeline Counter() { add IntSource(); add IntPrinter(); } void->int filter IntSource() { int x; init { x = 0; } work push 1{ push (x++); } } int->void filter IntPrinter() { work pop 1 { print(pop()); } } Counter IntSource IntPrinter % strc Counter.str o counter %./counter i IAP 2007 MIT

13 Representing Streams Conventional wisdom: streams are graphs Graphs have no simple textual representation Graphs are difficult to analyze and optimize Insight: stream programs have structure unstructured structured IAP 2007 MIT

14 Structured Streams filter pipeline splitjoin may be any StreamIt language construct parallel computation Each structure is singleinput, single-output Hierarchical and composable splitter joiner feedback loop joiner splitter IAP 2007 MIT

15 Filter Example: Low Pass Filter float->float filter LowPassFilter (int N, float freq) { float[n] weights; init { weights = calcweights(freq); } N } work peek N push 1 pop 1 { float result = 0; for (int i=0; i<weights.length; i++) { result += weights[i] * peek(i); } push(result); pop(); } filter IAP 2007 MIT

16 Low Pass Filter in C void FIR( int* src, int* dest, int* srcindex, int* destindex, int srcbuffersize, int destbuffersize, int N) { } FIR functionality obscured by buffer management details Programmer must commit to a particular buffer implementation strategy float result = 0.0; for (int i = 0; i < N; i++) { result += weights[i] * src[(*srcindex + i) % srcbuffersize]; } dest[*destindex] = result; *srcindex = (*srcindex + 1) % srcbuffersize; *destindex = (*destindex + 1) % destbuffersize; IAP 2007 MIT

17 Pipeline Example: Band Pass Filter float float pipeline BandPassFilter (int N, float low, float high) { add LowPassFilter(N, low); add HighPassFilter(N, high); LowPassFilter HighPassFilter } IAP 2007 MIT

18 SplitJoin Example: Equalizer float float pipeline Equalizer (int N, float lo, float hi) { Equalizer duplicate add splitjoin { ; for (int i=0; i<n; i++) add BandPassFilter(64, lo + i*(hi - lo)/n); BPF BPF BPF join roundrobin(1); } add Adder(N); roundrobin } Adder IAP 2007 MIT

19 Building Larger Programs: FMRadio void->void pipeline FMRadio(int N, float lo, float hi) { add AtoD(); add FMDemod(); add splitjoin { ; for (int i=0; i<n; i++) { add pipeline { AtoD FMDemod Duplicate add LowPassFilter(lo + i*(hi - lo)/n); LPF 1 LPF 2 LPF 3 add HighPassFilter(lo + i*(hi - lo)/n); } } join roundrobin(); } add Adder(); HPF 1 HPF 2 HPF 3 RoundRobin Adder } add Speaker(); Speaker IAP 2007 MIT

My claim is that it is possible to write grand programs, noble

20 The Beauty of Streaming Some programs are elegant, some are exquisite, it some are sparkling. My claim is that it is possible to write grand programs, noble programs, truly magnificient ones! Don Knuth, ACM Turing Award Lecture IAP 2007 MIT

21 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

22 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

23 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

24 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

25 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

26 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

27 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

28 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

29 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

30 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

31 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

32 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

33 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

34 split roundrobin(n) join roundrobin(n) IAP 2007 MIT

35 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

36 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

37 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

38 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

39 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

40 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

41 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

42 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

43 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

44 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

45 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

46 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

47 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

48 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

49 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

50 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

51 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

52 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

53 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

54 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

55 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

56 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

57 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

58 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

59 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

60 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

61 split roundrobin(1) join roundrobin(1) IAP 2007 MIT

62 split roundrobin(2) join roundrobin(1) IAP 2007 MIT

63 split roundrobin(2) join roundrobin(1) IAP 2007 MIT

64 split roundrobin(2) join roundrobin(1) IAP 2007 MIT

65 split roundrobin(2) join roundrobin(1) IAP 2007 MIT

66 split roundrobin(2) join roundrobin(1) IAP 2007 MIT

67 split roundrobin(2) join roundrobin(1) IAP 2007 MIT

68 split roundrobin(2) join roundrobin(1) IAP 2007 MIT

69 split roundrobin(2) join roundrobin(1) IAP 2007 MIT

70 split roundrobin(2) join roundrobin(1) IAP 2007 MIT

71 split roundrobin(2) join roundrobin(1,2,3) IAP 2007 MIT

72 split roundrobin(2) join roundrobin(1,2,3) IAP 2007 MIT

73 split roundrobin(2) join roundrobin(1,2,3) IAP 2007 MIT

74 split roundrobin(2) join roundrobin(1,2,3) IAP 2007 MIT

75 split roundrobin(2) join roundrobin(1,2,3) IAP 2007 MIT

76 split roundrobin(2) join roundrobin(1,2,3) IAP 2007 MIT

77 split roundrobin(2) join roundrobin(1,2,3) IAP 2007 MIT

78 split roundrobin(2) join roundrobin(1,2,3) IAP 2007 MIT

79 Matrix Transpose N M Transpose M N IAP 2007 MIT

80 Matrix Transpose N M roundrobin(?)? roundrobin(?) M N IAP 2007 MIT

81 Matrix Transpose N M roundrobin(?)? roundrobin(?) M N IAP 2007 MIT

82 Matrix Transpose N M roundrobin(1) N roundrobin(m) M N IAP 2007 MIT

83 Matrix Transpose N M roundrobin(1) N roundrobin(m) M N IAP 2007 MIT

84 Matrix Transpose N M M N roundrobin(1) roundrobin(m) float->float splitjoin Transpose (int M, int N) { split roundrobin(1); for (int i = 0; i<n; i++) { add Identity<float>; } join roundrobin(m); } N IAP 2007 MIT

85 Bit-reversed ordering Many FFT algorithms require a bit-reversal stage If item is at index n (with binary digits b 0 b 1 b k ), then it is transferred to reversed index b k b 1 b 0 For 3-digit binary numbers: IAP 2007 MIT

86 Bit-reversed ordering Many FFT algorithms require a bit-reversal stage If item is at index n (with binary digits b 0 b 1 b k ), then it is transferred to reversed index b k b 1 b 0 For 3-digit binary numbers: RR(w 1 ) RR(w 1 ) RR(w 1 ) RR(w 2 ) RR(w 3 ) RR(w 2 ) IAP 2007 MIT

87 Bit-reversed ordering Many FFT algorithms require a bit-reversal stage If item is at index n (with binary digits b 0 b 1 b k ), then it is transferred to reversed index b k b 1 b 0 For 3-digit binary numbers: RR(1) RR(1) RR(1) RR(2) RR(4) RR(2) IAP 2007 MIT

88 Bit-reversed ordering complex->complex pipeline BitReverse (int N) { } if (N==2) { add Identity<complex>; } else { } rr add splitjoin { split roundrobin(1); rr add BitReverse(N/2); 4 4 rr add BitReverse(N/2); join roundrobin(n/2); } rr rr rr rr rr rr rr rr 4 4 rr rr rr IAP 2007 MIT

89 N/8 N/8 N N/2 N/2 N/4 N/4 N/4 Sort Sort Sort N/8 N/8 Sort N/8 N/8 Sort Sort Sort Sort Merge Merge Merge Merge Merge Merge Merge N/8 N/4 N/8 N-Element Merge Sort int->int pipeline MergeSort (int N) { if (N==2) { add Sort(N); } else { add splitjoin { split roundrobin(n/2); add MergeSort(N/2); add MergeSort(N/2); join roundrobin(n/2); } } add Merge(N); } IAP 2007 MIT

90 N-Element Merge Sort (3-level) N N/2 N/2 N/4 N/4 N/4 N/4 N/8 N/8 N/8 N/8 N/8 N/8 N/8 N/8 Sort Sort Sort Sort Sort Sort Sort Sort Merge Merge Merge Merge Merge Merge Merge IAP 2007 MIT

91 Bitonic Sort IAP 2007 MIT

92 FFT IAP 2007 MIT

93 Block Matrix Multiply IAP 2007 MIT

94 Filterbank IAP 2007 MIT

95 FM Radio with Equalizer IAP 2007 MIT

96 Radar-Array Front End IAP 2007 MIT

97 MP3 Decoder IAP 2007 MIT

98 Case Study: MPEG-2 2Decoder in StreamIt IAP 2007 MIT

99 MPEG-2 Decoder in StreamIt MPEG bit stream picture type <QC> <PT1, PT2> quantization coefficients frequency encoded macroblocks <QC> ZigZag IQuantization IDCT Saturation spatially encoded macroblocks VLD macroblocks, motion vectors splitter joiner differentially coded motion vectors Motion Vector Decode Repeat motion vectors splitter Y Cb Cr Motion Compensation Motion Compensation <PT1> reference <PT1> reference picture picture it Channel Upsample Motion Compensation <PT1> reference picture it Channel Upsample add VLD(QC, PT1, PT2); add splitjoin { split roundrobin(n B, (, V); add pipeline { add ZigZag(B); add IQuantization(B) to QC; add IDCT(B); add Saturation(B); } add pipeline { add MotionVectorDecode(); add Repeat(V, N); } join roundrobin(b, V); } add splitjoin { split roundrobin(4 (B+V), B+V, B+V); add MotionCompensation(4 (B+V)) to PT1; for (int i = 0; i < 2; i++) { add pipeline { add MotionCompensation(B+V) to PT1; add ChannelUpsample(B); } } <PT2> joiner recovered picture Picture Reorder } join roundrobin(1, 1, 1); add PictureReorder(3 W H) to PT2; Color Space Conversion output to player add ColorSpaceConversion(3 W H); IAP 2007 MIT

100 Teleport Messaging in MPEG-2 VLD picture type IQ Y splitter Cb Cr Motion Compensation Motion Compensation Motion Compensation MC MC MC Channel Upsample Channel Upsample joiner recovered picture Order IAP 2007 MIT

101 Messaging Equivalent in C The MPEG Bitstream File Parsing Decode Picture Decode Macroblock Decode Block ZigZagUnordering Inverse Quantization Global Variable Space Decode Motion Vectors Motion Compensation Saturate IDCT Motion Compensation For Single Channel Frame Reordering Output Video IAP 2007 MIT

102 MPEG-2 Implementation Fully-functional MPEG-2 decoder and encoder Developed by 1 programmer in 8 weeks 2257 lines of code Vs lines of C code in MPEG-2 reference 48 static streams, 643 instantiated filters IAP 2007 MIT

103 Conclusions StreamIt language preserves program structure Natural for programmers Parallelism and communication naturally exposed Compiler managed buffers, and portable parallelization technology StreamIt increases programmer productivity, enables parallel performance IAP 2007 MIT

MPEG-2 Decoding in a Stream Programming Language

MPEG-2 Decoding in a Stream Programming Language Matthew Drake, Hank Hoffmann, Rodric Rabbah and Saman Amarasinghe Massachusetts Institute of Technology IPDPS Rhodes, April 2006 1 http://cag.csail.mit.edu/streamit