EECS 583 Class 20 Research Topic 2: Stream Compilation, GPU Compilation

1 EECS 583 Class 20 Research Topic 2: Stream Compilation, GPU Compilation
University of Michigan, December 3, 2012
Guest Speakers Today: Daya Khudia and Mehrzad Samadi

2 Announcements & Reading Material
Exams graded and will be returned in Wednesday's class
This class
» Orchestrating the Execution of Stream Programs on Multicore Platforms, M. Kudlur and S. Mahlke, Proc. ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, Jun
Next class: Research Topic 3: Security
» Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software, James Newsome and Dawn Song, Proceedings of the Network and Distributed System Security Symposium, Feb

3 Stream Graph Modulo Scheduling

4 Stream Graph Modulo Scheduling (SGMS)
Coarse grain software pipelining
» Equal work distribution
» Communication/computation overlap
» Synchronization costs
Target: Cell processor
» Cores with disjoint address spaces
» Explicit copy to access remote data
  DMA engine independent of PEs
Filters = operations, cores = function units
[Figure: Cell processor block diagram. SPE0 through SPE7, each with an SPU, a 256 KB local store (LS), and an MFC (DMA), plus the PPE (PowerPC) and DRAM, all connected by the EIB]

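5 Preliminaries
Synchronous Data Flow (SDF) [Lee 87]
StreamIt [Thies 02]
Push and pop items from input/output FIFOs

int->int filter FIR(int N, int wgts[N]) {
  int wgts[N];                    // field: weights kept across firings make the filter stateful
  work pop 1 push 1 {
    int i, sum = 0;
    wgts = adapt(wgts);           // update the stored weights on every firing
    for (i = 0; i < N; i++)
      sum += peek(i) * wgts[i];   // weighted sum over the next N input items
    push(sum);                    // emit one output item
    pop();                        // consume one input item
  }
}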

6 SGMS Overview
[Figure: execution on PE0 alone (time T1) versus a software pipeline across PE0 to PE3 with DMA transfers between stages (time T4), showing prologue and epilogue; T1/T4 ≈ 4]

7 SGMS Phases
Fission + processor assignment: load balance
Stage assignment: causality, DMA overlap
Code generation

8 Processor Assignment: Maximizing Throughput
Assigns each filter to a processor
[Figure: stream graph of filters A to F with workloads W: A = 20, B = 20, C = 20, D = 30, E = 50, F = 30. On a single PE, T1 = 170. On four processing elements (PE0 to PE3), T2 = 50]
a_{ij} = 1 if filter i is assigned to PE j, 0 otherwise
\sum_{j=1}^{P} a_{ij} = 1 for all filters i = 1, ..., N
\sum_{i=1}^{N} w(i) \, a_{ij} \le II for all PEs j = 1, ..., P
Minimize II
Minimum II: 50. Balanced workload! Maximum throughput
T1/T2 = 3.4
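The assignment formulation on this slide can be checked on the example with a small Python sketch. This is an exhaustive search, not the ILP solver used by the authors, and it assumes the six workloads on the slide (20, 20, 20, 30, 50, 30) belong to filters A to F in that order. The names `min_ii`, `workloads`, and `num_pes` are illustrative.

```python
from itertools import product

def min_ii(workloads, num_pes):
    """Try every filter-to-PE assignment and return (II, assignment).

    II is the load of the most heavily loaded PE, which is the quantity
    the slide's formulation minimizes. Exhaustive search only works for
    tiny graphs; it stands in for an ILP solver here.
    """
    best_ii, best_assign = None, None
    for assign in product(range(num_pes), repeat=len(workloads)):
        loads = [0] * num_pes
        for w, pe in zip(workloads, assign):
            loads[pe] += w          # each filter goes to exactly one PE
        ii = max(loads)             # II must cover the busiest PE
        if best_ii is None or ii < best_ii:
            best_ii, best_assign = ii, assign
    return best_ii, best_assign

# Workloads as read from the slide (assumed mapping to filter names).
workloads = {"A": 20, "B": 20, "C": 20, "D": 30, "E": 50, "F": 30}
ii, assign = min_ii(list(workloads.values()), num_pes=4)
speedup = sum(workloads.values()) / ii   # T1 / T2
```

With these inputs the search finds II = 50, bounded below by filter E alone, which gives T1/T2 = 170/50 = 3.4 as on the slide.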

9 Need More Than Just Processor Assignment
Assign filters to processors
» Goal: Equal work distribution
Graph partitioning? Bin packing?
[Figure: original stream program A (5) -> B (40) -> C (10) -> D (5) mapped to PE0 and PE1: Speedup = 60/40 = 1.5. Modified stream program with B split into B1 and B2 between a splitter S and a joiner J: Speedup = 60/32 ~ 2]

10 Filter Fission Choices
[Figure: alternative ways of splitting filters across PE0 to PE3]
Speedup ~ 4?

11 Integrated Fission + PE Assign
Exact solution based on Integer Linear Programming (ILP)
Split/Join overhead factored in
Objective function: minimize the maximal load on any PE
Result
» Number of times to split each filter
» Filter-processor mapping
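A minimal Python sketch of the integrated idea: choose a split count for each splittable filter and an assignment together, charging a cost for the splitter and joiner. It uses exhaustive search rather than ILP. The workloads (A = 5, B = 40, C = 10, D = 5) follow the earlier two-PE example; `SPLIT_COST`, `JOIN_COST`, and all function names are hypothetical, and the overhead value of 2 is an assumption, not a figure from the slides.

```python
from itertools import product

SPLIT_COST = 2  # hypothetical work of the splitter S
JOIN_COST = 2   # hypothetical work of the joiner J

def expand(workloads, splits):
    """Return the list of work items after splitting each filter splits[name] ways."""
    items = []
    for name, w in workloads.items():
        k = splits.get(name, 1)
        if k == 1:
            items.append(w)
        else:
            items.extend([w / k] * k)              # k equal copies of the filter
            items.extend([SPLIT_COST, JOIN_COST])  # overhead factored in
    return items

def best_max_load(items, num_pes):
    """Smallest achievable maximal PE load, by exhaustive assignment."""
    best = float("inf")
    for assign in product(range(num_pes), repeat=len(items)):
        loads = [0.0] * num_pes
        for w, pe in zip(items, assign):
            loads[pe] += w
        best = min(best, max(loads))
    return best

def fission_and_assign(workloads, splittable, num_pes, choices=(1, 2, 4)):
    """Try every split count per splittable filter; keep the lowest maximal load."""
    best = (float("inf"), None)
    for combo in product(choices, repeat=len(splittable)):
        splits = dict(zip(splittable, combo))
        load = best_max_load(expand(workloads, splits), num_pes)
        if load < best[0]:          # ties keep the earlier, smaller split count
            best = (load, splits)
    return best

workloads = {"A": 5, "B": 40, "C": 10, "D": 5}
unsplit = best_max_load(expand(workloads, {}), num_pes=2)
load, splits = fission_and_assign(workloads, splittable=["B"], num_pes=2)
```

Without fission the maximal load is 40 (filter B alone). Splitting B two ways under the assumed overhead brings it to 32, and a four-way split does no better, so the search keeps the two-way split.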

12 Step 2: Forming the Software Pipeline
To achieve speedup
» All chunks should execute concurrently
» Communication should be overlapped
Processor assignment alone is insufficient information
[Figure: filter A on PE0 feeding filter B on PE1, shown over time. Overlap A_{i+2} with B_i]

13 Stage Assignment
Filters i -> j on the same PE: S_j \ge S_i
» Preserve causality (producer-consumer dependence)
Filters i -> j on different PEs, connected by a DMA: S_{DMA} > S_i and S_j = S_{DMA} + 1
» Communication-computation overlap
Data flow traversal of the stream graph
» Assign stages using above two rules
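The two rules can be sketched as a single topological traversal in Python. This is an illustration under stated assumptions, not the authors' implementation: the graph and PE mapping are taken from the running A, S, B1, B2, C, J, D example, and `assign_stages`, `edges`, and `pe_of` are illustrative names. Each filter takes the smallest stage the rules allow.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def assign_stages(edges, pe_of):
    """Assign pipeline stages to filters and to cross-PE DMA transfers.

    Same PE:       S_j >= S_i                      (causality)
    Different PEs: S_DMA > S_i and S_j = S_DMA + 1 (overlap)
    """
    preds = {}
    for src, dst in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)

    stage, dma_stage = {}, {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        s = 0
        for p in preds[node]:
            if pe_of[p] == pe_of[node]:
                s = max(s, stage[p])        # may share the producer's stage
            else:
                s = max(s, stage[p] + 2)    # leave one full stage for the DMA
        stage[node] = s
        for p in preds[node]:
            if pe_of[p] != pe_of[node]:
                dma_stage[(p, node)] = s - 1  # DMA runs in the stage just before j
    return stage, dma_stage

edges = [("A", "S"), ("A", "C"), ("S", "B1"), ("S", "B2"),
         ("B1", "J"), ("B2", "J"), ("J", "D"), ("C", "D")]
pe_of = {"A": 0, "S": 0, "B1": 0, "D": 0, "B2": 1, "C": 1, "J": 1}
stage, dma_stage = assign_stages(edges, pe_of)
```

For this mapping the traversal puts A, S, B1 in stage 0, the three outgoing DMAs in stage 1, B2, C, J in stage 2, the two returning DMAs in stage 3, and D in stage 4.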

14 Stage Assignment Example
Stage 0: A, S, B1 (PE 0)
Stage 1: DMA, DMA, DMA
Stage 2: B2, C, J (PE 1)
Stage 3: DMA, DMA
Stage 4: D (PE 0)

15 Step 3: Code Generation for Cell
Target the Synergistic Processing Elements (SPEs)
» PS3: up to 6 SPEs
» QS20: up to 16 SPEs
One thread / SPE
Challenge
» Making a collection of independent threads implement a software pipeline
» Adapt kernel-only code schema of a modulo schedule

16 Complete Example
[Figure: schedule over time. SPE1 runs A, S, B1 each iteration and later D; DMA1 carries JtoD and CtoD; SPE2 runs B2, J, C; DMA2 carries B1toJ, StoB2, and AtoC]

void spe1_work()
{
  char stage[5] = {0};   /* one predicate per pipeline stage */
  int i;
  stage[0] = 1;
  for (i = 0; i < MAX; i++) {
    if (stage[0]) { A(); S(); B1(); }
    if (stage[1]) { }
    if (stage[2]) { JtoD(); CtoD(); }   /* DMA transfers of D's inputs */
    if (stage[3]) { }
    if (stage[4]) { D(); }
    /* update of stage[] (pipeline fill and drain) is not shown on the slide */
    barrier();           /* synchronize with the other SPE threads */
  }
}
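The slide's loop leaves out how the stage predicates change between iterations. A minimal Python simulation of a kernel-only schema, assuming the predicates shift by one stage per loop trip (the usual modulo-schedule convention, not something stated on the slide), shows the pipeline filling and draining. `run_pipeline` and `spe1_stages` are illustrative names.

```python
def run_pipeline(stage_ops, num_iterations):
    """Simulate kernel-only code: one loop body guarded by stage predicates.

    stage_ops[s] lists the operations guarded by predicate s.
    Returns, for every loop trip, the operations that fired.
    """
    num_stages = len(stage_ops)
    active = [False] * num_stages
    trace = []
    # Extra trips at the end let the last iterations drain (epilogue).
    for step in range(num_iterations + num_stages - 1):
        # Each in-flight iteration advances one stage; a new iteration
        # enters stage 0 only while iterations remain (prologue/steady state).
        active = [step < num_iterations] + active[:-1]
        fired = [op for s in range(num_stages) if active[s]
                 for op in stage_ops[s]]
        trace.append(fired)
    return trace

# Stage contents of spe1_work() as written on the slide.
spe1_stages = [["A", "S", "B1"], [], ["JtoD", "CtoD"], [], ["D"]]
trace = run_pipeline(spe1_stages, num_iterations=3)
```

With three iterations and five stages the loop runs seven trips: the first two fire only A, S, B1, the fifth fires the DMA transfers together with the first D, and the last two fire only D.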

17 SGMS(ILP) vs. Greedy (MIT method, ASPLOS 06)
[Figure: bar chart of relative speedup per benchmark for ILP partitioning, greedy partitioning, and exposed DMA. Benchmarks: bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar]
Solver time < 30 seconds for 16 processors

18 SGMS Conclusions
Streamroller
» Efficient mapping of stream programs to multicore
» Coarse grain software pipelining
Performance summary
» 14.7x speedup on 16 cores
» Up to 35% better than greedy solution (11% on average)
Scheduling framework
» Tradeoff memory space vs. load balance
  Memory constrained (embedded) systems
  Cache based system

19 Discussion Points
Is it possible to convert stateful filters into stateless?
What if the application does not behave as you expect?
» Filters change execution time?
» Memory faster/slower than expected?
Could this be adapted for a more conventional multiprocessor with caches?
Can C code be automatically streamized?
Now you have seen 3 forms of software pipelining:
» 1) Instruction level modulo scheduling, 2) Decoupled software pipelining, 3) Stream graph modulo scheduling
» Where else can it be used?

20 Compilation for GPUs

21 Why GPUs?
[Figure: theoretical GFLOP/s over time, NVIDIA GPU vs. Intel CPU. Labeled GPU data points: GeForce FX, GeForce 6800 Ultra, GeForce 7800 GTX, GeForce 8800 GTX, GeForce GTX 280, GeForce GTX]

22 Efficiency of GPUs
High Flop Rate: GTX 285: 1062 GFLOPS; i7: 102 GFLOPS
High DP Flop Rate: GTX 285: 88.5 GFLOPS; GTX 480: 168 GFLOPS; i7: 51 GFLOPS
High Memory Bandwidth: GTX 285: 159 GB/s; i7: 32 GB/s
High Flop Per Watt: GTX 285: 5.2 GFLOP/W; i7: 0.78 GFLOP/W
High Flop Per Dollar: GTX 285: 3.54 GFLOP/$; i7: 0.36 GFLOP/$

23 GPU Architecture
[Figure: SM 0 through SM 29, each with its own shared memory and registers, connected through an interconnection network to global memory (device memory); the CPU and host memory attach through a PCIe bridge]

24 CUDA
Compute Unified Device Architecture
General purpose programming model
» User kicks off batches of threads on the GPU
Advantages of CUDA
» Interface designed for compute - graphics free API
» Orchestration of on chip cores
» Explicit GPU memory management
» Full support for Integer and bitwise operations

25 Programming Model
[Figure: timeline. The host launches Kernel 1, which runs on the device as Grid 1, followed by a second kernel that runs as Grid 2]

26 GPU Scheduling
[Figure: the blocks of a grid are distributed across SM 0 through SM 30, each with its own shared memory and registers]

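27 Warp Generation
[Figure: Block 0, Block 1, Block 2, ... are assigned to SM0 with its shared memory and registers. The threads of a block are grouped by ThreadId into warps: ThreadIds 0 to 31 form Warp 0, the following ThreadIds form the next warp]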
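A small Python sketch of the grouping shown on this slide, assuming one-dimensional thread IDs and the 32-thread warp size of NVIDIA GPUs. The function names are illustrative and do not correspond to any CUDA API.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def warps_of_block(threads_per_block):
    """Group consecutive thread IDs of one block into warps of WARP_SIZE.

    The last warp is only partly filled when the block size is not a
    multiple of WARP_SIZE.
    """
    return [list(range(start, min(start + WARP_SIZE, threads_per_block)))
            for start in range(0, threads_per_block, WARP_SIZE)]

def warp_id(thread_id):
    """Warp index of a thread within its block."""
    return thread_id // WARP_SIZE

warps = warps_of_block(100)
```

A block of 100 threads yields four warps; the last one holds only threads 96 to 99, which is one reason block sizes are usually chosen as multiples of 32.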

28 Device Memory Hierarchy
[Figure: host, Grid 0, Block 0, Thread 0, and the memory space each one can reach]
Per-thread register: int RegisterVar
Per-thread local memory: int LocalVarArray[10]
Per-block shared memory: __shared__ int SharedVar
Per-app global memory: __device__ int GlobalVar
Per-app constant memory: __constant__ int ConstVar
Per-app texture memory: texture<float, 1, ReadMode> TextureVar
