Programming FPGAs in C/C++ with High Level Synthesis PACAP - HLS 1

Size: px

Start display at page:

Download "Programming FPGAs in C/C++ with High Level Synthesis PACAP - HLS 1"

Godwin Stanley
5 years ago
Views:

1 Programmng FPGAs n C/C wth Hgh Level Synthess PACAP - HLS 1

2 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 2

3 HLS : from algorthm to Slcon N-1 =0 A. x x x x for(=0;<n;) acc=data[]*coef[]; [STMcroelectroncs] x Area Area total S1 S2 S3 S4 S5 S6 S7 S8 S9 Latency tme VHL & Verlog RTL Output SystemC TLM Output Hgh Level Synthess ams at rasng the level of abstracton of hardware desgn

Example: FFT at STMcroelectroncs C Generc Model 2K, 1K, 512 128, 64 ponts

WF/WMax 64/2K,1K, 512 (13bts) 20 Ms/s 65 nm [T.

4 Example: FFT at STMcroelectroncs C Generc Model 2K, 1K, , 64 ponts FFT/FFT (varous bt precson) Dgtal ODU 2K ponts (12bts) 2 Gs/s 65n, 45 nm WF/WMax 64/2K,1K, 512 (13bts) 20 Ms/s 65 nm [T. Mchel, STMcroelectroncs] Better Reusablty of Hgh Level Models MRI-CSS UWB 128 ponts (11bts) 528 Ms/s 90nm

5 HLS : does t really work? Hgh Level Synthess s not a new dea Semnal academc resarch work startng n the early 1990 s Frst commercal tools on the market around 1995 Frst generaton of HLS tools was a complete dsaster Worked only on small toy examples QoR (area/power) way below expectatons Blatantly overhyped by marketng folks It took 15 years to make these tools actually work Now used by worldclass chp vendors (STMcro, Nxt, ) Many desgners stll reject HLS (bad past experence)

6 HLS vs HDL performance Tme (us) 10 to 50 % 25 to 100% HLS desgn HDL desgn Cost (smplfed)

7 HLS n 2016 Major CAD vendors have ther own HLS tools Synphony from Synopsys (C/C) C2S from Cadence Catapult-C from Mentor Graphcs (C/C/SystemC) FPGA vendors have also enterng the game Free Vvado HLx by lnx (C/C) OpenCL compler by Altera (OpenCL)

8 HLS adopton Desgn languages used for FPGA desgn For FPGA desgns, Hgh Level Synthess from C/C wll become manstream wthn 5-10 years from now INSA-EII-5A 8

9 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 9

10 Descrbng dgtal desgns n C/C Is the C/C language suted for hardware desgn? No, t follows a sequental executon model Hardware s ntrnscally parallel (HDLs model concurrency) Dervng effcent hardware nvolves parallel executon No, t has flat/lnear vson of memory In an FPGA/ASIC memory s dstrbuted (memory blocks) No, t supports too few datatypes (char, nt, float) Hardware IP uses customzed wordlength (e.g 12 bts) May use specal arthmetc (ex: custom floatng pont) PACAP - HLS 10

11 How to descrbe crcut wth C, then? Use classes to model crcuts component and structure Instancate components and explctely wre them together Use of template can eases the desgn of complex crcuts Easy to generate a low level representaton of the crcut Stll operates at the Regster to Logc Level Does not really rase the level of abstracton No added value w.r.t HDL C based HLS tool ams at beng used as C complers User wrtes an algorthm n C, the tool produces a crcut. PACAP - HLS 11

12 Where s my clock? In C/C, there s no noton of tme Ths s a problem : we need synchronous (clocked) hardware HLS tools nfer tmng usng two approaches Implct semantcs (meanng) of language constructs User defned compler drectves (annotatons) Extendng C/C wth an mplct semantc??? They add a temporal meanng to some C/C constructs Ex : a loop teraton takes at least one cycle to execute Best way to understand s through examples PACAP - HLS 12

13 HLS and FPGA memory model In C/C memory s a large & flat address space Enablng ponter arthmetc, dynamc allocaton, etc. HLS has strong restrctons on memory management No dynamc memory allocaton (no malloc, no recurson) No global varables (HLS operate at the functon level) At best very lmted support for ponter arthmetcs Mappng of program varables nfered from source code Scalar varables are stored n dstrbuted regsters, Arrays are stored n memory blocks or n external memory Arrays are often parttoned n banks (capacty or #ports) PACAP - HLS 13

HLS bt accurate datatypes C/C standard types lmt hardware desgn choces Hardware mplementatons use a wde range of wordlength Mnmum precson to keep algorthm correctness whle reducng

14 HLS bt accurate datatypes C/C standard types lmt hardware desgn choces Hardware mplementatons use a wde range of wordlength Mnmum precson to keep algorthm correctness whle reducng are mprovng speed. HLS provdes bt accurate types n C and C SystemC and vendor specfc types supported to smulate custom wordlength hardware datapaths n C/C [source lnx] PACAP - HLS 14

15 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 15

The rght use of HLS tools To use HLS, one stll need to «thnk n hardware» The desgner specfes an hgh level archtectural template The HLS tools then nfers the complete archtecture Desgners must fully

16 The rght use of HLS tools To use HLS, one stll need to «thnk n hardware» The desgner specfes an hgh level archtectural template The HLS tools then nfers the complete archtecture Desgners must fully understand how C/C language constructs map to hardware Desgners then use that knowledge to make the tool derve the accelerator they need. Need to be aware of optmzng complers lmtatons How smart a compler dependency analyss can be? How to bypass compler lmtatons (e.g usng #pragmas) PACAP - HLS 16

17 Key thngs to understand 1. How the HLS tool nfer the accelerator nterface (I/Os) Very mportant ssue for system level ntegraton 2. How the HLS tool handles array/memory accesses Key ssue when dealng wth mage processng algorthms 3. How the HLS tool handles (nested) loops Does t performs automatc loop ppelnng/unrollng/mergng? PACAP - HLS 17

18 Outlne Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell PACAP - HLS 18

19 Hgh Level Synthess from C/C HLS tools operate at the functon/procedure level Functons may call other functons (no recurson allowed) Each functon s mapped to a specfc hardware component Syntheszed components are then exported as HDL IP cores The I/O nterface nfered from the functon prototype Specfc ports for controllng the component (start, end, etc.) Smple data bus (scalar arguments of functon results) nt foobar(nt a){ } a 32 foobar done start Ths s not enough, we may need to access memory from our hardware component 19

20 I/O nterface and system level ntegraton HLS can nfer more complex nterfaces/protocols Ponters/arrays as argument = master access on system bus Streamng nterface can also be nfered usng drectves or specfc templates. nt foobar( nt a, nt* ptr, nt tab[], sc_ffo<nt> n){ a addr dout dn ptr addr dout dn tab foobar empty read n dn } = n.read() 20

21 Syntheszng hardware from a functon HLS tools decompose the program nto blocks Each block s realzed n hardware as a FSM datapath A global FSM controls whch block s actve accordng to current program state. The datapath of mutually exclusve blocks can be merged to save hardware resources (very few tools have ths ablty) for (nt =0;<32;) { res = []*Y[]; } f(state==4){ for (nt =0;<32;) { res = []/Y[]; } } else { for (nt =0;<32;) { res = 1/[]; } } TOP FSM FSM PACAP - HLS FSM FSM Datapath [] Y[] res [] Y[] res 1 Datapath 1 Datapath [] Y[] res 1 res res res 21 Mergng s possble

22 Array access management In HLS arrays are mapped to local or external memory Local memory maybe local SRAM or large regster fles External memory = bus nterface or SRAM/SDRAM, etc. External memory access management External memory access takes a ndefnte number of cycles Only one pendng access at a tme (no overlappng) Burst mode accesses are sometmes supported Local memory management Memory access always take one cycle (n read or wrte mode) The number of read/wrte ports per memory can be confgured Memory data type can be any type supported by the HLS tool 22

23 HLS support for condtonals HLS tools have two dfferent ways of handlng them By consderng branches as two dfferent blocks f(state==4){ for (nt =0;<32;) { res = []/Y[]; } } else { for (nt =0;<32;) { res = 1/[]; } } FSM FSM Datapath [] Y[] res 1 Datapath [] Y[] res 1 res res Mergng s possble By executng both branches and select the correct results f then/else do not contan loops functon calls. f(state==4){ res = ab; } else { res = a*b; } a b PACAP - HLS state==4 23

24 HLS support for loops Loops are one of the most mportant program constructs Most sgnal/mage kernel conssts of (nested) loops They are often the reason for resortng to hardware IPs Ther support n HLS tools vary a lot between vendors Best n class : Synopsys Synphony, Mentor Catapult-C Worst n class : C2S from Cadence, Impulse-C HLS users spent most on ther tme optmzng loops Drectly at the source level by tweakng the source code By usng compler/scrpt based optmzaton drectves PACAP - HLS 24

HLS wth SDSOC Hardware-software codesgn from

hardware Automatc synthess of glue logc and devce

25 HLS wth SDSOC Hardware-software codesgn from sngle C/C spec. Key program functons annotated as FPGA accelerators HLS used to synthesze accelerator hardware Automatc synthess of glue logc and devce drvers Wth SDSoC you can desgn and run an accelerator n mnutes (used to take month ) PACAP - HLS 25

26 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 26

27 Loop sngle-cycle executon HLS default behavor s to execute one teraton/cycle Executng the loop wth N teraton takes N1 cycles Extra cycle for evaluatng the ext condton [] float [512],Y[512]; res=0.0; for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } Y[] res res Each operaton s mapped to ts own operator No resource sharng/hardware reuse 1 PACAP - HLS 27

28 Loop sngle-cycle executon Crtcal path (f max ) depends on loop body complexty Let us assume that T mul =3ns and T add =1ns [] Y[] res 1 T add T clk = 4 ns T mul T add res No control on the clock speed! nt [512],Y[512]; nt tmp,res=0; for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } #cycles f clk Exec tme Mhz 2,5 us PACAP - HLS 28

29 Loop sngle-cycle executon Tme (us) 2,5 Sngle cycle 1 *, 2 Cost (smplfed)

30 Mult-cycle executon Most HLS tools handle constrants over T clk Body executon s splt nto several shorter steps (clock-tcks) Needs accurate nformaton about target technology Target T clk = 3ns nt [512],Y[512]; nt tmp,res=0; for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } [] Y[] res res 1 #cycles f clk Exec tme T clk =3ns 512*21 330Mhz 3,07us PACAP - HLS Step 1 Step 2 30

31 Hardware reuse n mult-cycle executon For mult-cycle executon, HLS enable operators reuse HLS tools balance hardware operator usage of over tme Trade-off between operator cost and muxes overhead T clk =3ns Step 1 Step 2 [] Y[] [] res Y[] res 1 res 1 multpler, 2 adder 1 multpler, 1 adder, 2 muxes PACAP - HLS 1 res 31

32 Hardware reuse n mult-cycle executon Hardware reuse rate s controlled by the user Trade-off between parallelsm (performance) and reuse (cost) Users provdes constrants on the # and type of resources Mostly used for multplers, memory banks/ports, etc. The HLS tools decdes what and when to reuse Optmzes for area or speed, followng user constrants Combnatoral optmzaton problem (exponental complexty) nt [512],Y[512]; nt tmp,res=0; #pragma HLS mult=1,adder=1 for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } res PACAP - HLS [] Y[] 1 1 multpler, 1 adder, 2 muxes res 32

33 Loop sngle-cycle executon Tme (us) 3,0 Mult cycle 2,5 Sngle cycle 1*,1 2 mux 1*,2 Cost (smplfed)

34 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 34

35 How to further mprove performance? Improve f max wth arthmetc optmzatons Avod complex/costly operatons as much as possble Aggressvely reduce data wordlength to shorten crtcal path nt [512],Y[512]; nt tmp,res=0; for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } T clk = 4 ns nt12 [512],Y[512]; nt16 res=0; nt24 tmp; for (nt9 =0;<512;){ tmp = []*Y[]; res = res tmp>>8; } T clk = 2,5 ns T mul T add * T add T add T mul T add Very effectve as t mproves performance and reduce cost PACAP - HLS 35

36 Loop sngle-cycle executon Tme (us) 3,0 2,5 Mult cycle (12/24 bt) Mult cycle (32 bts) Sngle cycle (32 bt) Sngle cycle (12/24 bt) 1,25 1*,1 2 mux 1*,2 Cost (smplfed)

37 Outlne Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Performance optmzatons for Hgh Level Synthess PACAP - HLS 37

38 How to further mprove performance? Increase the # of operatons executed per teraton More work performed wthn a clock cycle Ths can be acheved wth partal loop unrollng [3] nt12 [512],Y[512]; nt16 res=0; nt24 tmp; Y[3] for (nt9 =0;<128;=4){ [2] tmp = []*Y[]; Y[2] res = res tmp; [1] tmp = [1]*Y[1]; res = res tmp; [] tmp = [2]*Y[2]; Y[] res = res tmp; tmp = [3]*Y[3]; res res = res tmp; 4 } Y[1] Ths may also ncrease the datapath crtcal path.. and end-up beng counter productve res 38

39 Loop unrollng (partal/full) But, the unrolled datapath can also be optmzed Cascaded nteger adder are reorganzed as adder trees Some array accesses can be removed (replaced by scalars) [3] Y[3] [2] Y[2] [1] Y[1] [] Y[] res 4 res [3] Y[3] [2] Y[2] [1] Y[1] [] Y[] res 4 res Crtcal path reducton T clk =3,5ns T clk =3ns #cycles f clk Exec tme Mhz 0,45us #cycles f clk Exec tme Mhz 0,38 us 39

40 Loop unrollng perf/area trade-off Tme (us) 3,0 2,5 Mult cycle (12/24 bt) Mult cycle (32 bts) Sngle cycle (32 bt) Sngle cycle (12/24 bt) 1,25 Unrolled (12/24 bt) 1*,1 2 mux 1*,2 Cost (smplfed)

41 Loop unrollng mult-cycle [] Y[] Unrollng can help mprovng mult-cycle schedules [1] Y[1] By ncreasng hardware operator utlzaton rate thanks to addtonal reuse opportuntes. [2] Y[2] [3] Y[3] res Same cost as the ntal mult-cycle verson, but more effcent use rate of the multpler [ ] Y[ ] 4 res res 4 x x x x step 1 step 2 step 3 step 4 step 5 #cycles f clk Exec tme 5* Mhz 1,28 us 41

42 Loop unroll mult-cycle executon Tme (us) 3,0 Mult cycle (12/24 bt) Mult cycle (32 bts) 2,5 Sngle cycle (32 bt) Mult cycle 4x unrolled (12/24 bt) 1,25 Sngle cycle (12/24 bt) Unrolled (12/24 bt) 1*,1 2 mux 1*,2 Cost (smplfed)

43 Loop unrollng (partal/full) Most HLS tool can unroll loops automatcally Generally controlled through a compler drectve (see below) Unrollng s often restrcted to loops wth constant bounds Full unrollng s only used for small trp count loops Unrollng does not always mprove performance Loop carred dependences can hurt/annhlate benefts See example next slde 43

44 Loop unrollng (partal/full) Unrollng loops wth carred dependences for(nt =0;<256;=1){ y[1] = a*y[] x[]; } Y[] [] a * Y[1] a * Y[] [1] Y[2] [] a Y[1] 1 1 * * * #cycles T clk Exec tme Mhz 0,64 us #cycles T clk Exec tme Mhz 0.64 us

45 Loop unrollng (partal/full) Older HLS tools would fully unroll all loops even f the goal s not to get the fastest mplementaton! even for loops wth large teraton counts (e.g. mages) Lack of proper support for loops s one of the reason behnd the falure of the frst generaton of HLS tools 45

46 Loop ppelnng Loop Ppelnng = ppelned executon of loop teratons Start teraton 1 before all operaton of teraton are fnshed Prncples smlar to software ppelnng for VLIW/DSPs nt12 [512],Y[512]; nt16 res=0; nt24 tmp; for (nt9 =0;<512;){ tmp = []*Y[]; res = res tmp>>8; } In ths example a new teraton starts every cycle, all operatons n the loop body have ther own operator. #cycles T clk Exec tme Mhz 1,02us [] Y[] [1] Y[1] 1 * * Stage 1 Stage 2 1 res res PACAP - HLS

47 Loop ppelnng perf/area trade-off Tme (us) 3,0 Mult cycle (12/24 bt) Mult cycle (32 bts) 2,5 1,25 Mult cycle 4x unrolled (12/24 bt) Sngle cycle (32 bt) Sngle cycle (12/24 bt) Ppelned (12/24 bt) Unrolled (12/24 bt) 1*,1 2 mux 1*,2 Cost (smplfed)

48 Loop ppelnng A gven ppelned loop schedule s characterzed by Intaton nterval (II) : delay separatng the executon of two successve teratons. A perfect loop ppelnng lead to II=1. Ppelne Latency (L) : number of cycles needed to completely execute a gven teraton (we have L>1). Targetng a value II=1 s common as the number of operators s not constraned by a predefned archtecture as n VLIW/DSPs Loop ppelnng enables very deeply ppelned datapath Operators are also aggressvely ppelned (e.g float) Latency n the order of 100 s of cycles s not uncommon Ppelnng s the most mportant optmzaton n HLS Huge performance mprovements for margnal resource cost (plenty of regsters n FPGA cells). PACAP - HLS 48

49 Ppelnng unrollng = vectorzng Ppelnng unrollng by K further ncrease performance [3] Y[3] [2] Y[2] [1] Y[1] [] Y[] res We execute K teratons followng a SIMD executon model Executon of the K teratons packets are overlapped 4 res However, achevng II=1 s often not straghforward We ll see that n a later slde. II=1, L=3; T clk =max(t mul,2*t add ) Loop executed n 82 cycles PACAP - HLS 49

50 Loop ppelnng perf/area trade-off Tme (us) 3,0 Mult cycle (12/24 bt) Mult cycle (32 bts) 2,5 1,25 Mult cycle 4x unrolled (12/24 bt) Sngle cycle (32 bt) Sngle cycle (12/24 bt) Ppelned (12/24 bt) Unrolled 4x (12/24 bt) Ppelned unrolled 4x (12/24 bt) 1*,1 2 mux 1*,2 Cost (smplfed)

51 Loop fuson Mergng consecutve loops nto a sngle one Everythng happens as f the two loops are executed n parallel It helps reducng the mpact of loop ppelnng fll/flush phases for(=0;<256;){ y[] = foo(x[]); } for(=0;<256;){ z[] = bar(y[]); } for(=0;<256;){ f(<256) y[] = foo(x[]); f(>0) z[-1] = bar(y[-1]); } bar foo bar foo Loop mergng s not alway obvous due to dependences In some cases, mergng must be combned wth loop shftng PACAP - HLS 51

52 Nested loop support n HLS In general HLS tools only ppelne the nnermost loop No bg deal when nnermost loops have large trp counts for(=0;<256;){ } for(j=0;j<m;j){ x[][j] = foo(x[][j]); } There are cases when ths leads to suboptmal desgns For example when processng small 8x8 pxel macro-blocks, or for deeply ppelned datapath wth 100 s of stages M=32 M=8 =0 =1 =2 =0 =1 =2 =3 =4 =5 Ppelne overhead s more sgnfcant for short trp count nnermost loops PACAP - HLS 52

53 How to ppelne outer loops? Completely unroll all nnermost loops to get a sngle loop Supports non perfectly nested nnermost loops Only vable for small constant bounds nner most loops for(=0;<256;){ } for(j=0;j<4;j){ x[][j] = foo(x[][j]); for(j=0;j<2;j){ x[][j] = bar(x[][j]); for(=0;<256;){ x[][0] = foo(x[][0]); x[][1] = foo(x[][1]); x[][2] = foo(x[][2]); x[][3] = foo(x[][3]); x[][0] = bar(x[][0]); x[][1] = bar(x[][1]); } Flattenng/coalescng the loop nest herarchy Only for perfectly nested loops, use wth cauton! for(=0;<m;){ for(j=0;j<n;j){ x[][j] = foo(x[][j]); } for(k=0;k<m*n;k){ =k/n; =k%n; x[][j] = foo(x[][j]); } PACAP - HLS 53

54 Memory port resource conflcts The man lmtng factor for reachng II=1 s memory When II=1, there are generally many concurrent accesses to a same array (and thus to a same memory block). [3] Y[3] [2] Y[2] [1] Y[1] [] Y[] res res 4 For II=1, we must handle 8 read per cycle to the memory blocks contanng array [] and Y[] PACAP - HLS 54

55 Parttonng arrays n banks Duplcatng memory reduces memory port conflcts Naïve duplcaton s not effcent, array data must be allocated to several memory blocks to reduce the number of conflcts. Sngle block Y[2] Y[1] Y[0] [ ] [2] [1] [0] Block 1 [ ] [16] [12] [8] [4] [0] Block 2 [ ] [17] [13] [9] [5] [1] Block 3 [ ] [16] [14] [10] [6] [2] Block 4 [ ] [17] [15] [11] [7] [3] [] [1] [2] [3] Block 5 Y[ ] Y[16] Y[12] Y[8] Y[4] Y[0] Block 6 Y[ ] Y[17] Y[13] Y[9] Y[5] Y[1] Y[ ] Y[16] Y[14] Y[10] Y[6] Y[2] Y[ ] Y[17] Y[15] Y[11] Y[7] Y[3] Y[] Y[1] Y[2] Y[3] [] [1][2][3] Y[] Y[1] Y[2] Y[3] A block cyclng parttonng of data for [ ] and Y[ ] helps solvng all memory resource conflct PACAP - HLS 55

56 Parttonng arrays n memory banks Parttonng can be done at the source level Can be tedous, many HLS tools offer some automaton float [32],Y[32]; res=0.0; for (nt =0;<8;=4){ tmp = []*Y[]; res = res tmp; tmp = [1]*Y[1]; res = res tmp; tmp = [2]*Y[2]; res = res tmp; tmp = [3]*Y[3]; res = res tmp; } float 0[8],1[8],2[8],3[8]; float Y0[8],Y1[8],Y2[8],Y3[8]; res=0.0; for (nt =0;<8;=4){ tmp = 0[]*Y0[]; res = res tmp; tmp = 1[]*Y1[]; res = res tmp; tmp = 2[]*Y2[]; res = res tmp; tmp = 3[]*Y3[]; res = res tmp; } PACAP - HLS 56

57 Elmnatng conflcts through scalarzaton Scalarzaton : copy array cell value n a scalar varable Use the scalar varable nstead of the array whenever possble Obvous reuse s often performed by the compler tself Beware, reuse accross teraton of a loop s not automated float [128]; float res=0.0; for (=1;<128;=1){ res=res*([]-[-1]); } Two memory ports needed to reach II=1 float [128]; float res=0.0,tmp1,tmp0; tmp1 = [0]; tmp0 = [1]; for (=2;<128;=1){ res=res*(tmp0-tmp1); tmp1 =tmp0; tmp0 = []; } Only one memory port s needed to obtan II=1 PACAP - HLS 57

58 Another take on HLS vs HDL Tme (us) 3,0 HLS based desgn 3 The actual best desgn s obtaned through HLS! 2,5 HLS based desgn 2 Desgner perf. safety margng Perf constrant 1,25 HDL desgn 1 HLS desgn 1 HLS based desgn 4 Cost (smplfed)

59 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 59

60 HLS n 2020 HLS tools are focusng on hardware IP desgn In 2016, a whole system can be specfed wthn a sngle flow In 2020 hardware/software boundares wll have faded away Today, poor support for dynamc data-structures Speculatve executon technques from processor desgn may help wden the scope of applcablty of HLS to new domans Most program transformatons are done by hand More complex sem automatc program transformatons! Complex combnaton of program transformatons to obtan the best desgn (teratve complaton Hgh Level Synthess) PhP developpers stll can t desgn hardware Well, that s not lkely to change yet PACAP - HLS 60

61 Loop tlng Break the loops nto «tles» or blocks Expose coarse gran parallelsm (for threads) Improve temporal reuse over loop nests Legal only f all loop are permutable (.e. can be nterchanged) float **A,**B,**C; for(=0;<16;) { for(j=0;j<16;j) { for(k=0;k<16;k) S0: C[,j]=B[,k]*A[k,j]; } } float **A,**B,**C; for(=0;<16;=4) for(jj=0;jj<16;jj=4) for(kk=0;kk<16;kk=4) A 4x4x4 Tle for(=0;<4);) for(j=0;j<4;j) for(k=0;k<4;k) S1: C[,jjj]= B[,kkk]*A[kkk,jjj]; Best understood wth an example We consder a fully assocatve 32 word data cache The cache s organzed as 8 cache lnes of 4 words each M2R-CSS 61

62 Loop Tlng example (matrx product) Tlng helps mprovng spatal and temporal localty One chooses tle sze such that all data fts nto the cache C[16][16] kk jj float **A,**B,**C; for(=0;<16;=4) for(jj=0;jj<16;jj=4) for(kk=0;kk<16;kk=4) A 4x4x4 Tle for(=0;<4);) for(j=0;j<4;j) for(k=0;k<4;k) S1: C[,jjj]= B[,kkk]*A[kkk,jjj]; A[16][16] B[16][16] 62

63 Loop Tlng example (matrx product) Tlng helps mprovng spatal and temporal localty One chooses tle sze such that all data fts nto the cache C[16][16] kk jj float **A,**B,**C; for(=0;<16;=4) for(jj=0;jj<16;jj=4) #pragma ppelne II=1A 4x4x4 Tle for(kk=0;kk<16;kk=4) memcpy(a,la, ) for(k=0;k<4;k) for(=0;<4);) for(j=0;j<4;j) S1: C[,jjj]= B[,kkk]*A[kkk,jjj]; A[16][16] B[16][16] 63

64 Loop tlng Executon of the ntal kernel We show whch parts of A,B,C arrays are n the cache [3] Y[3] [3] Y[3] C[][j] [3] Y[3] C[][j] [3] Y[3] C[][j] C[][j] [3] Y[3] [3] Y[3] C[][j] [3] Y[3] C[][j] [3] Y[3] C[][j] C[][j] [3] Y[3] [3] Y[3] C[][j] [3] Y[3] C[][j] [3] Y[3] C[][j] C[][j] [3] Y[3] [3] Y[3] C[][j] [3] Y[3] C[][j] [3] Y[3] C[][j] C[][j] M2R-CSS 64

65 Loop tlng Executon of the ntal kernel We show whch parts of A,B,C arrays are n the cache B[,j] B[,j1] B[,j2] B[,j3] A[,j] C[][j] C[][j1] C[..][..] C[][j3] A[1,j] C[1][j] C[1][j1] C[..][..] C[1][j3] A[2,j] A[3,j] C[2][j] C[..][..] C[..][..] C[2][j3] C[3][j] C[..][..] C[..][..] C[3][j3] M2R-CSS

.][..] C[..][..] C[2][j3] C[3][j] C[..][..] C[..][..] C[3][j3] M2R-CSS

66 Loop tlng Executon of the ntal kernel We show whch parts of A,B,C arrays are n the cache A[,j] B[][j%4] B[,j] B[,j1] B[,j2] B[,j3] C[][j] C[][j1] C[..][..] C[][j3] A[1,j] C[1][j] C[1][j1] C[..][..] C[1][j3] A[2,j] A[3,j] C[2][j] C[..][..] C[..][..] C[2][j3] C[3][j] C[..][..] C[..][..] C[3][j3] M2R-CSS

67 Loop tlng Executon of the ntal kernel B[,j] B[,j1] B[,j2] B[,j3] B[,j] A[,j] B[,j1] B[,j2] B[,j3] A[,j] A[1,j] A[1,j] A[2,j] A[2,j] A[3,j] A[3,j] We show whch parts of A,B,C arrays are n the cache C[][j] C[][j1] C[..][..] C[][j3] C[][j] C[][j1] C[..][..] C[][j3] C[1][j] C[1][j1] C[1][j] C[1][j1] C[..][..] C[1][j3] C[..][..] C[1][j3] C[2][j] C[..][..] C[..][..] C[2][j3] C[2][j] C[..][..] C[..][..] C[2][j3] C[3][j] C[..][..] C[..][..] C[3][j3] C[3][j] C[..][..] C[..][..] C[3][j3] M2R-CSS

68 Loop tlng Executon of the tled kernel For a sngle tle (=jj=kk=0) B[][] A[][] C[][] B[][] A[][] C[][] B[][] A[][] C[][] M2R-CSS 68

69 Loop Tlng & parallelzaton Smple way of exposng coarse gran parallelsm Tles are executed as atomc executon unts, there s no synchronzaton durng a tle executon. Tlng enables effcent parallelzaton It mproves localty and reduces synchronzaton overhead Fndng the rght tle sze and shape s dffcult (open problem) kk jj float **A,**B,**C; #pragma omp parallel for prvate() float for(=0;<16;=4) **A,**B,**C; for(=0;<16;=4) #pragma omp parallel for prvate(jj) for(jj=0;jj<16;jj=4) for(kk=0;kk<16;kk=4) for(=0;<4);) A 4x4x4 Tle for(j=0;j<4;j) for(k=0;k<4;k) S1: C[,jjj]= B[,kkk]*A[kkk,jjj]; 69

70 Challenge Image processng ppelne example for(=1;<n-1;) for(j=0;j<m;j) S0: tmp[][j]=f(n[][j],n[-1][j],n[1][j]); for(=1;<n-1;) for(j=1;j<m-1;j) S1: out[][j]=f(tmp[][j],tmp[][j-1],tmp[][j1]); In[][] tmp[][] out[][] Fnd a legal fuson and optmze memory accesses PACAP - HLS 70

71 PACAP - HLS 71

72 Exercse 1 (soluton) Straght forward fuson s not possble, needs to be combned wth loop shftng. //llegal transformaton for(=1;<=n-2;=1){ tmp[][0] = 3*n[][0] -n[-1][0]-n[1][0]; for(j=1;j<=m-2;j=1){ tmp[][j] = 3*n[][j] -n[-1][j]-n[1][j]; out[][j] = 3*tmp[][j] -tmp[][j-1]-tmp[][j1]; } tmp[][m-1] = 3*n[][M-1] -n[-1][m-1]-n[1][m-1]; }

73 Exercse 1 (soluton) M2R-CSS 73

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr