TANGRAM: Efficient Kernel Synthesis for Performance Portable Programming
1 TANGRAM Efficient Kernel Synthesis for Performance Portable Programming Li-Wen Chang 1, Izzat El Hajj 1, Christopher Rodrigues 2, Juan Gómez-Luna 3, Wen-Mei Hwu 1 1 University of Illinois, 2 Huawei America Research Lab, 3 Universidad de Córdoba lchang20@illinois.edu
2 Performance Portability Maintaining optimized programs for different devices is costly. Ideally, programs written once should run on different devices with high performance. 2
3 (Diagram: a program running on a CPU)
4 (Diagram: the CPU program is ported to Accelerator A and Accelerator B)
5 (Diagram: each port is then re-optimized for every new device generation: CPU 2.0, Accelerator A 2.0, Accelerator B 2.0)
6 Performance Portability: OpenCL SGEMM Performance (GFLOPS). Devices: Tesla GPU (GTX 280), Fermi GPU (C2050), Sandy Bridge CPU (i7-3820). Versions compared: Parboil (default naive OpenCL version), Parboil (OpenCL version optimized for the Tesla GPU), Reference (MKL for CPU, CUBLAS for GPU). Chart annotations: →60%, →30%. 6
7 (Diagram: a single source program is compiled into optimized versions for CPU/CPU 2.0, Accelerator A/Accelerator A 2.0, and Accelerator B/Accelerator B 2.0)
8 Composition-based Programming Languages NESL, Sequoia, Petabricks. Highly adaptive to hierarchies through composition; usually scale well. Performance relies on base-rule implementations/libraries. 8
9 Performance Sensitivity in Base Rule: DGEMM. Normalized Performance (higher is better), with % loss annotated: MKL, TANGRAM, Petabricks (with MKL), Petabricks (no MKL, at 11% of MKL). Measured on a Sandy Bridge i7. 9
10 TANGRAM A composition-based language. Focuses on high-performance code synthesis within a node. Removes the dependence on high-performance base-rule implementations/libraries. Provides a representation for better SIMD utilization. Provides an architectural hierarchy model to guide composition. 10
11 TANGRAM Language

(a) Atomic autonomous codelet:
codelet
int accum = 0;
for (unsigned i = 0; i < len; ++i) {
    accum += in[i];
}
return accum;

(b) Atomic cooperative codelet:
codelet coop tag(kog)
shared int tmp[coopdim()];
unsigned id = coopidx();
tmp[id] = (id < len) ? in[id] : 0;
for (unsigned s = 1; s < coopdim(); s *= 2) {
    if (id >= s) tmp[id] += tmp[id - s];
}
return tmp[coopdim() - 1];

(c) Compound codelet using adjacent tiling:
codelet tag(asso_tiled)
tunable unsigned p;
unsigned tile = (len + p - 1) / p;
return sum(map(sum,
    partition(in, p, sequence(0, tile, len), sequence(1), sequence(tile, tile, len + 1))));

(d) Compound codelet using strided tiling:
codelet tag(stride_tiled)
tunable unsigned p;
unsigned tile = (len + p - 1) / p;
return sum(map(sum,
    partition(in, p, sequence(0, 1, p), sequence(p), sequence((p - 1) * tile, 1, len + 1))));

11
17 TANGRAM Language (same codelet examples, annotated in the figure with the labels c_a and c_b for the atomic codelets, p_c and p_d for the adjacent and strided partitionings, and S nodes) 17
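The two compound codelets differ only in how partition slices the input. A minimal C++ sketch of the same idea (the helper names sum_adjacent/sum_strided are hypothetical; TANGRAM's partition/sequence semantics are paraphrased here, not reproduced exactly) shows that both tilings compose partial sums into the same result:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Adjacent tiling (codelet c): p contiguous tiles of ceil(len/p) elements.
int sum_adjacent(const std::vector<int>& in, unsigned p) {
    unsigned len = in.size();
    unsigned tile = (len + p - 1) / p;           // ceil(len / p), as in the codelet
    int total = 0;
    for (unsigned t = 0; t < p; ++t) {           // map(sum, partition(...))
        int partial = 0;
        for (unsigned i = t * tile; i < std::min((t + 1) * tile, len); ++i)
            partial += in[i];
        total += partial;                        // outer sum over the partials
    }
    return total;
}

// Strided tiling (codelet d): tile t takes elements t, t+p, t+2p, ...
int sum_strided(const std::vector<int>& in, unsigned p) {
    unsigned len = in.size();
    int total = 0;
    for (unsigned t = 0; t < p; ++t) {
        int partial = 0;
        for (unsigned i = t; i < len; i += p)
            partial += in[i];
        total += partial;
    }
    return total;
}
```

Because sum is associative, either tiling is a legal decomposition; which one wins depends on the target (e.g., strided access maps to coalesced loads on GPUs, adjacent access to cache-friendly streaming on CPUs).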
18 Rule Extraction
Pipeline: TANGRAM Language Codelets + Architectural Hierarchy Model → TANGRAM parser → Rule Extraction → Rule Specialization → Device-specific code gen, producing Program Rules → Specialized Rules → Plans → Kernel Versions. Built on Clang 3.5 with a customized TANGRAM AST builder; outputs a set of TANGRAM ASTs.

Program Composition Rules (sum):
Rule 1: compose(sum, L) → S_L, devolve(l_L), compose(sum, l_L)
Rule 2: compose(sum, L) → compute(c_a, SE_L)
Rule 3: compose(sum, L) → compute(c_b, VE_L)
Rule 4: compose(sum, L) → S_L, regroup(p_c, L), distribute(l_L), compose(sum, l_L), compose(sum, L)
Rule 5: compose(sum, L) → S_L, regroup(p_d, L), distribute(l_L), compose(sum, l_L), compose(sum, L)

Example of deriving composition rules from compound codelets (codelet c):
compose(sum, L)
→ compose(c_c, L)
→ compose(sum(map(sum, partition(…, p_c))), L)
→ compose(map(sum, partition(…, p_c)), L), compose(sum, L)
→ compose(partition(…, p_c), L), compose(map(sum, …), L), compose(sum, L)
→ S_L, regroup(p_c, L), distribute(l_L), compose(sum, l_L), compose(sum, L)
18
19 Architectural Hierarchy Model
Defines, for each level: its computational capability (scalar or vector execution), and its capability to synchronize across the subordinate level.

Device Specification:
G := C_G = none, (l_G, S_G) = (B, terminate/launch) // G: grid
B := C_B = VE_B, (l_B, S_B) = (T, __syncthreads()) // B: block
T := C_T = SE_T, (l_T, S_T) = none // T: thread

Extensible: CPU SIMD, GPU warp, ILP, even GPU dynamic parallelism.
19
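The device specification above is just structured data: each level names its compute capability and the primitive that synchronizes its children. A hypothetical C++ representation (this struct is an illustration, not TANGRAM's actual internal data structure) of the GPU specification from the slide:

```cpp
#include <cassert>
#include <string>
#include <vector>

// One level of the architectural hierarchy model: its computational
// capability (scalar "SE", vector "VE", or "none") and the primitive
// used to synchronize across the subordinate level ("" for a leaf).
struct Level {
    std::string name;        // e.g. "grid", "block", "thread"
    std::string capability;  // "none", "VE", or "SE"
    std::string sync;        // sync over children, or "" if none
};

// The G -> B -> T GPU specification, top level first.
std::vector<Level> gpu_spec() {
    return {
        {"grid",   "none", "terminate/launch"},   // G: can only relaunch
        {"block",  "VE",   "__syncthreads()"},    // B: vector exec, barrier
        {"thread", "SE",   ""},                   // T: scalar exec, leaf
    };
}
```

Extending the model (CPU SIMD lanes, a GPU warp level between block and thread, and so on) then amounts to inserting additional Level entries rather than changing the compiler.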
20 Rule Specialization
TANGRAM analyzer: an AST traverser that outputs a lookup table of the legal codelets for each level, and also prioritizes them.

Specialized Composition Rules:
G rules:
G1: compose(sum, G) → S_G, devolve(B), compose(sum, B)
G4: compose(sum, G) → S_G, regroup(p_c, G), distribute(B), compose(sum, B), compose(sum, G)
G5: compose(sum, G) → S_G, regroup(p_d, G), distribute(B), compose(sum, B), compose(sum, G)
B rules:
B1: compose(sum, B) → S_B, devolve(T), compose(sum, T)
B3: compose(sum, B) → compute(c_b, VE_B)
B4: compose(sum, B) → S_B, regroup(p_c, B), distribute(T), compose(sum, T), compose(sum, B)
B5: compose(sum, B) → S_B, regroup(p_d, B), distribute(T), compose(sum, T), compose(sum, B)
T rules:
T2: compose(sum, T) → compute(c_a, SE_T)
20
21 Planner
TANGRAM planner: an AST traverser/builder that selects codelets or map policies, prunes the search space, and outputs ASTs for code gen. (The figure shows candidate plan trees built from p_c, p_d, c_a, c_b, and S nodes, terminating in terminate/launch.)
21
22 Code gen
TANGRAM codegen: AST traversers that apply conventional optimizations and output C/CUDA source code.
22
23 GPU Code gen Example
(The figure maps the chosen plan onto the GPU hierarchy: 1. Grid, 2. Block, 3. Thread.)

tile = (len + gridDim.x - 1) / gridDim.x;
sub_tile = (tile + blockDim.x - 1) / blockDim.x;
accum = 0;
#pragma unroll
for (unsigned i = 0; i < sub_tile; ++i) {
    accum += in[blockIdx.x * tile + i * blockDim.x + threadIdx.x];
}
tmp[threadIdx.x] = accum;
__syncthreads();
for (unsigned s = 1; s < blockDim.x; s *= 2) {
    if (threadIdx.x >= s) tmp[threadIdx.x] += tmp[threadIdx.x - s];
    __syncthreads();
}
partial[blockIdx.x] = tmp[blockDim.x - 1];
return;
// Launch new kernel to sum up partial
23
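The structure of the synthesized kernel (a per-block tile, strided per-thread sub-tiles, a scan over per-thread accumulators, one partial per block, then a final pass over the partials) can be checked on the CPU. The sketch below serializes that structure (grid and block sizes are plain loop bounds here; on the GPU every b and t iteration runs in parallel, and the tmp copy stands in for the barriers):

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Serialized model of the two-level synthesized reduction.
int reduce_two_level(const std::vector<int>& in, unsigned grid, unsigned block) {
    unsigned len  = in.size();
    unsigned tile = (len + grid - 1) / grid;          // per-"block" tile
    std::vector<int> partial(grid, 0);
    for (unsigned b = 0; b < grid; ++b) {             // one iteration per block
        std::vector<int> tmp(block, 0);               // stands in for shared memory
        unsigned sub_tile = (tile + block - 1) / block;
        for (unsigned t = 0; t < block; ++t) {        // one iteration per thread
            int accum = 0;
            for (unsigned i = 0; i < sub_tile; ++i) { // strided (coalesced) loads
                unsigned idx = b * tile + i * block + t;
                if (idx < len && idx < (b + 1) * tile) accum += in[idx];
            }
            tmp[t] = accum;
        }
        // Scan over tmp; the copy emulates a barrier between scan steps.
        for (unsigned s = 1; s < block; s *= 2) {
            std::vector<int> prev = tmp;
            for (unsigned t = s; t < block; ++t) tmp[t] = prev[t] + prev[t - s];
        }
        partial[b] = tmp[block - 1];                  // block total
    }
    // "Launch new kernel": sum the per-block partials.
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

For any grid and block sizes the result matches a plain sequential sum, which is exactly the property the planner relies on when it swaps one composition plan for another.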
24 Experimental Results TANGRAM delivers 70% or higher performance compared to highly-optimized libraries, such as Intel MKL, NVIDIA CUBLAS, CUSPARSE, or Thrust, and expert-optimized benchmarks (Rodinia). Normalized Performance (higher is better), reference vs. TANGRAM on Kepler, Fermi, and CPU, across: Scan, SGEMV-TS, SGEMV-SF, DGEMM, SpMV, KMeans, BFS. 24
25 FAQ1 Why is TANGRAM better than other composition-based languages? TANGRAM provides an architectural hierarchy model to guide composition. TANGRAM provides a representation of cooperative codelets for better SIMD utilization, especially shuffle instructions and scratchpad. 25
26 FAQ2 Where do optimizations happen? Selection of codelets or map policies in the planner. Conventional optimizations in code gen. Optimizations in backend compilers. 26
27 FAQ3 What? Multiple versions? We did NOT ask users to write multiple versions of kernels. Codelets can be used to synthesize different versions of kernels. Codelets can be reused multiple times: within one kernel, across kernels in a device, across kernels for different devices. 27
28 Takeaways of TANGRAM Performance portability: 70% or higher performance compared to highly-optimized libraries. Extensible architectural hierarchy model: supports CPU SIMD, GPU warp, ILP, even GPU dynamic parallelism. Native description of the algorithmic design space: perfect for domain users. 28
29 Questions 29
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca (Slides http://support.scinet.utoronto.ca/ northrup/westgrid CUDA.pdf) March 12, 2014
More informationSupport Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura
Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview
More informationModule 3: CUDA Execution Model -I. Objective
ECE 8823A GPU Architectures odule 3: CUDA Execution odel -I 1 Objective A more detailed look at kernel execution Data to thread assignment To understand the organization and scheduling of threads Resource
More informationCartoon parallel architectures; CPUs and GPUs
Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationCUDA Performance Considerations (2 of 2)
Administrivia CUDA Performance Considerations (2 of 2) Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011 Friday 03/04, 11:59pm Assignment 4 due Presentation date change due via email Not bonus
More informationCUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION
April 4-7, 2016 Silicon Valley CUDA OPTIMIZATION WITH NVIDIA NSIGHT VISUAL STUDIO EDITION CHRISTOPH ANGERER, NVIDIA JAKOB PROGSCH, NVIDIA 1 WHAT YOU WILL LEARN An iterative method to optimize your GPU
More informationECE 408 / CS 483 Final Exam, Fall 2014
ECE 408 / CS 483 Final Exam, Fall 2014 Thursday 18 December 2014 8:00 to 11:00 Central Standard Time You may use any notes, books, papers, or other reference materials. In the interest of fair access across
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationStanford University. NVIDIA Tesla M2090. NVIDIA GeForce GTX 690
Stanford University NVIDIA Tesla M2090 NVIDIA GeForce GTX 690 Moore s Law 2 Clock Speed 10000 Pentium 4 Prescott Core 2 Nehalem Sandy Bridge 1000 Pentium 4 Williamette Clock Speed (MHz) 100 80486 Pentium
More informationUnrolling parallel loops
Unrolling parallel loops Vasily Volkov UC Berkeley November 14, 2011 1 Today Very simple optimization technique Closely resembles loop unrolling Widely used in high performance codes 2 Mapping to GPU:
More informationDense Linear Algebra. HPC - Algorithms and Applications
Dense Linear Algebra HPC - Algorithms and Applications Alexander Pöppl Technical University of Munich Chair of Scientific Computing November 6 th 2017 Last Tutorial CUDA Architecture thread hierarchy:
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationObject Support in an Array-based GPGPU Extension for Ruby
Object Support in an Array-based GPGPU Extension for Ruby ARRAY 16 Matthias Springer, Hidehiko Masuhara Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology June 14, 2016 Object
More informationIntroduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series
Introduction to GPU Computing Using CUDA Spring 2014 Westgid Seminar Series Scott Northrup SciNet www.scinethpc.ca March 13, 2014 Outline 1 Heterogeneous Computing 2 GPGPU - Overview Hardware Software
More informationLab 1 Part 1: Introduction to CUDA
Lab 1 Part 1: Introduction to CUDA Code tarball: lab1.tgz In this hands-on lab, you will learn to use CUDA to program a GPU. The lab can be conducted on the SSSU Fermi Blade (M2050) or NCSA Forge using
More informationHIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS
HIGH-PERFORMANCE COMPUTING WITH CUDA AND TESLA GPUS Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or
More informationHPC COMPUTING WITH CUDA AND TESLA HARDWARE. Timothy Lanfear, NVIDIA
HPC COMPUTING WITH CUDA AND TESLA HARDWARE Timothy Lanfear, NVIDIA WHAT IS GPU COMPUTING? What is GPU Computing? x86 PCIe bus GPU Computing with CPU + GPU Heterogeneous Computing Low Latency or High Throughput?
More informationn N c CIni.o ewsrg.au
@NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU
More informationCSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Merge
CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Merge Data Parallelism / Data-Dependent Execution Data-Independent Data-Dependent Data Parallel Stencil Histogram SpMV Not Data Parallel
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationHPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming
KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Introduction to CUDA programming 1 Agenda GPU Architecture Overview Tools of the Trade Introduction to CUDA C Patterns of Parallel
More informationGPU Particle-Grid Methods: Electrostatics
GPU Particle-Grid Methods: Electrostatics John Stone Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign Research/gpu/
More information