Programming FPGAs in C/C++ with High Level Synthesis PACAP - HLS 1



2 Outline. Why High Level Synthesis? Challenges when synthesizing hardware from C/C++. High Level Synthesis from C in a nutshell. Exploring the performance/area trade-off. Program optimizations for hardware synthesis. The future of HLS.

3 HLS: from algorithm to silicon. [Figure: the sum over i = 0 to N-1 of data[i]*coef[i], written as the loop for(i=0;i<N;i++) acc+=data[i]*coef[i]; and scheduled over steps S1 to S9; a plot of area against latency; outputs are VHDL & Verilog RTL and SystemC TLM. Source: STMicroelectronics.] High Level Synthesis aims at raising the level of abstraction of hardware design.

4 Example: FFT at STMicroelectronics [T. Michel, STMicroelectronics]. A generic C model (2K, 1K, ..., 64 points FFT/iFFT, various bit precisions) is reused across designs. Digital ODU: 2K points (12 bits), 2 Gs/s, 65 nm and 45 nm. WiFi/WiMax: 64/2K, 1K, 512 points (13 bits), 20 Ms/s, 65 nm. MRI-CSS UWB: 128 points (11 bits), 528 Ms/s, 90 nm. Better reusability of high level models.

5 HLS: does it really work? High Level Synthesis is not a new idea. Seminal academic research work started in the early 1990s. The first commercial tools came on the market around 1995. The first generation of HLS tools was a complete disaster: it worked only on small toy examples, QoR (area/power) was way below expectations, and it was blatantly overhyped by marketing folks. It took 15 years to make these tools actually work. They are now used by world-class chip vendors (STMicro, Nxt, ...). Many designers still reject HLS (bad past experience).

6 HLS vs HDL performance. [Figure: execution time (us) against cost (simplified) for an HLS design and an HDL design; the annotated gaps between the two are 10 to 50% and 25 to 100%.]

7 HLS in 2016. Major CAD vendors have their own HLS tools: Synphony from Synopsys (C/C++), C2S from Cadence, Catapult-C from Mentor Graphics (C/C++/SystemC). FPGA vendors have also entered the game: the free Vivado HLx by Xilinx (C/C++) and the OpenCL compiler by Altera (OpenCL).

8 HLS adoption. [Figure: design languages used for FPGA design.] For FPGA designs, High Level Synthesis from C/C++ will become mainstream within 5-10 years from now. INSA-EII-5A


10 Describing digital designs in C/C++. Is the C/C++ language suited for hardware design? No, it follows a sequential execution model: hardware is intrinsically parallel (HDLs model concurrency), and deriving efficient hardware involves parallel execution. No, it has a flat/linear vision of memory: in an FPGA/ASIC, memory is distributed (memory blocks). No, it supports too few datatypes (char, int, float): hardware IP uses customized wordlengths (e.g. 12 bits) and may use special arithmetic (e.g. custom floating point).

11 How to describe circuits with C, then? One option is to use classes to model circuit components and structure: instantiate components and explicitly wire them together. Templates can ease the design of complex circuits, and it is easy to generate a low level representation of the circuit. However, this still operates at the Register Transfer Level: it does not really raise the level of abstraction and brings no added value w.r.t. HDL. C-based HLS tools aim instead at being used like C compilers: the user writes an algorithm in C, and the tool produces a circuit.

12 Where is my clock? In C/C++, there is no notion of time. This is a problem: we need synchronous (clocked) hardware. HLS tools infer timing using two approaches: the implicit semantics (meaning) of language constructs, and user-defined compiler directives (annotations). Extending C/C++ with an implicit semantics means that the tools add a temporal meaning to some C/C++ constructs. Example: a loop iteration takes at least one cycle to execute. The best way to understand is through examples.

13 HLS and FPGA memory model. In C/C++, memory is a large and flat address space, enabling pointer arithmetic, dynamic allocation, etc. HLS has strong restrictions on memory management: no dynamic memory allocation (no malloc, no recursion); no global variables (HLS operates at the function level); at best, very limited support for pointer arithmetic. The mapping of program variables is inferred from the source code: scalar variables are stored in distributed registers; arrays are stored in memory blocks or in external memory; arrays are often partitioned into banks (for capacity or number of ports).

14 HLS bit-accurate datatypes. C/C++ standard types limit hardware design choices. Hardware implementations use a wide range of wordlengths: the minimum precision that keeps the algorithm correct while reducing area and improving speed. HLS provides bit-accurate types in C and C++: SystemC and vendor-specific types are supported to simulate custom-wordlength hardware datapaths in C/C++ [source: Xilinx].


16 The right use of HLS tools. To use HLS, one still needs to «think in hardware». The designer specifies a high level architectural template; the HLS tool then infers the complete architecture. Designers must fully understand how C/C++ language constructs map to hardware, and then use that knowledge to make the tool derive the accelerator they need. They also need to be aware of the limitations of optimizing compilers: how smart can a compiler's dependency analysis be? How can compiler limitations be bypassed (e.g. using #pragmas)?

17 Key things to understand. 1. How the HLS tool infers the accelerator interface (I/Os): a very important issue for system level integration. 2. How the HLS tool handles array/memory accesses: a key issue when dealing with image processing algorithms. 3. How the HLS tool handles (nested) loops: does it perform automatic loop pipelining/unrolling/merging?

18 Outline. Challenges when synthesizing hardware from C/C++. High Level Synthesis from C in a nutshell.

19 High Level Synthesis from C/C++. HLS tools operate at the function/procedure level: functions may call other functions (no recursion allowed); each function is mapped to a specific hardware component; synthesized components are then exported as HDL IP cores. The I/O interface is inferred from the function prototype: specific ports for controlling the component (start, end, etc.) and a simple data bus (scalar arguments or function results). Example: int foobar(int a){ } [Figure: component foobar with a 32-bit input a and control ports start and done.] This is not enough: we may need to access memory from our hardware component.

20 I/O interface and system level integration. HLS can infer more complex interfaces/protocols: pointers/arrays as arguments = master access on the system bus; streaming interfaces can also be inferred using directives or specific templates. Example: int foobar(int a, int* ptr, int tab[], sc_fifo<int> in){ ... = in.read() } [Figure: component foobar with the scalar input a; addr, dout and din ports for ptr; addr, dout and din ports for tab; empty, read and din ports for the FIFO in.]

21 Synthesizing hardware from a function. HLS tools decompose the program into blocks. Each block is realized in hardware as an FSM + datapath. A global FSM controls which block is active according to the current program state. The datapaths of mutually exclusive blocks can be merged to save hardware resources (very few tools have this ability). Example: for (int i=0;i<32;i++) { res += X[i]*Y[i]; } if(state==4){ for (int i=0;i<32;i++) { res += X[i]/Y[i]; } } else { for (int i=0;i<32;i++) { res += 1/X[i]; } } [Figure: a top FSM drives three FSM + datapath blocks, one per loop, each reading X[i] and/or Y[i] and updating res; merging is possible for the two mutually exclusive branches.]

22 Array access management. In HLS, arrays are mapped to local or external memory: local memory may be local SRAM or large register files; external memory = bus interface or SRAM/SDRAM, etc. External memory access management: an external memory access takes an indefinite number of cycles; only one access can be pending at a time (no overlapping); burst mode accesses are sometimes supported. Local memory management: a memory access always takes one cycle (in read or write mode); the number of read/write ports per memory can be configured; the memory data type can be any type supported by the HLS tool.

23 HLS support for conditionals. HLS tools have two different ways of handling them. The first is to consider the branches as two different blocks: if(state==4){ for (int i=0;i<32;i++) { res += X[i]/Y[i]; } } else { for (int i=0;i<32;i++) { res += 1/X[i]; } } [Figure: one FSM + datapath per branch; merging is possible.] The second is to execute both branches and select the correct result, which applies if the then/else branches contain neither loops nor function calls: if(state==4){ res = a+b; } else { res = a*b; } [Figure: an adder and a multiplier both fed by a and b, with the result selected by state==4.]
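The second strategy can be mimicked in plain C++. The following is only a sketch: the function name `select_result` is made up for illustration, and it shows the idea (compute both branches, then select), not what any particular tool generates.

```cpp
// Hypothetical illustration of "execute both branches, then select".
// In hardware, the adder and the multiplier would both run every time,
// and a multiplexer controlled by (state == 4) would keep one result.
int select_result(int state, int a, int b) {
    const int sum  = a + b;  // "then" branch, always computed
    const int prod = a * b;  // "else" branch, always computed
    return (state == 4) ? sum : prod;
}
```

For instance, `select_result(4, 2, 3)` keeps the sum and `select_result(0, 2, 3)` keeps the product.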

24 HLS support for loops. Loops are one of the most important program constructs: most signal/image kernels consist of (nested) loops, and they are often the reason for resorting to hardware IPs. Their support in HLS tools varies a lot between vendors. Best in class: Synopsys Synphony, Mentor Catapult-C. Worst in class: C2S from Cadence, Impulse-C. HLS users spend most of their time optimizing loops, either directly at the source level by tweaking the source code, or by using compiler/script-based optimization directives.

25 HLS with SDSoC. Hardware-software codesign from a single C/C++ specification. Key program functions are annotated as FPGA accelerators. HLS is used to synthesize the accelerator hardware. Glue logic and device drivers are synthesized automatically. With SDSoC you can design and run an accelerator in minutes (it used to take months).


27 Loop single-cycle execution. The default HLS behavior is to execute one iteration per cycle. Executing a loop with N iterations takes N+1 cycles (one extra cycle for evaluating the exit condition). Example: float X[512],Y[512]; res=0.0; for (int i=0;i<512;i++){ tmp = X[i]*Y[i]; res = res + tmp; } [Figure: datapath with a multiplier fed by X[i] and Y[i], an adder accumulating into res, and an incrementer (+1) for i.] Each operation is mapped to its own operator: no resource sharing/hardware reuse.

28 Loop single-cycle execution. The critical path (f_max) depends on the loop body complexity. Let us assume that T_mul = 3 ns and T_add = 1 ns. [Figure: the critical path crosses the multiplier and the adder, so T_clk = T_mul + T_add = 4 ns.] No control on the clock speed! int X[512],Y[512]; int tmp,res=0; for (int i=0;i<512;i++){ tmp = X[i]*Y[i]; res = res + tmp; } [Table: #cycles, f_clk (MHz), exec time = 2.5 us.]

29 Loop single-cycle execution. [Figure: execution time (us) against cost (simplified); the single-cycle design sits at 2.5 us with 1 multiplier and 2 adders.]

30 Multi-cycle execution. Most HLS tools handle constraints over T_clk: the body execution is split into several shorter steps (clock ticks). This needs accurate information about the target technology. Target T_clk = 3 ns. int X[512],Y[512]; int tmp,res=0; for (int i=0;i<512;i++){ tmp = X[i]*Y[i]; res = res + tmp; } [Figure: the loop body is split into step 1 and step 2.] [Table: T_clk = 3 ns, #cycles = 512*2+1, f_clk = 330 MHz, exec time = 3.07 us.]
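The arithmetic behind this table can be written down as a small model. This is a sketch under the slide's own assumptions (a fixed number of clock steps per iteration and one extra cycle for the exit test); the function names are invented for the example.

```cpp
// Cycle count of a non-pipelined loop: `steps_per_iteration` clock ticks per
// iteration, plus one extra cycle to evaluate the exit condition.
long loop_cycles(long iterations, long steps_per_iteration) {
    return iterations * steps_per_iteration + 1;
}

// Execution time in microseconds for a clock period given in nanoseconds.
double exec_time_us(long cycles, double t_clk_ns) {
    return cycles * t_clk_ns / 1000.0;
}
```

With 512 iterations, two steps per iteration and T_clk = 3 ns, `exec_time_us(loop_cycles(512, 2), 3.0)` gives 1025 cycles and about 3.07 us, in line with the table.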

31 Hardware reuse in multi-cycle execution. For multi-cycle execution, HLS enables operator reuse. HLS tools balance hardware operator usage over time. There is a trade-off between operator cost and mux overhead. [Figure: for T_clk = 3 ns and a two-step schedule, a datapath with 1 multiplier and 2 adders compared with one using 1 multiplier, 1 adder and 2 muxes.]

32 Hardware reuse in multi-cycle execution. The hardware reuse rate is controlled by the user: it is a trade-off between parallelism (performance) and reuse (cost). Users provide constraints on the number and type of resources; this is mostly used for multipliers, memory banks/ports, etc. The HLS tool decides what and when to reuse: it optimizes for area or speed, following the user constraints. This is a combinatorial optimization problem (exponential complexity). int X[512],Y[512]; int tmp,res=0; #pragma HLS mult=1,adder=1 for (int i=0;i<512;i++){ tmp = X[i]*Y[i]; res = res + tmp; } [Figure: resulting datapath with 1 multiplier, 1 adder and 2 muxes.]

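33 Loop single-cycle execution. [Figure: execution time (us) against cost (simplified); multi-cycle at 3.0 us with 1 multiplier, 1 adder and 2 muxes; single-cycle at 2.5 us with 1 multiplier and 2 adders.]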


35 How to further improve performance? Improve f_max with arithmetic optimizations: avoid complex/costly operations as much as possible, and aggressively reduce the data wordlength to shorten the critical path. Before (T_clk = 4 ns): int X[512],Y[512]; int tmp,res=0; for (int i=0;i<512;i++){ tmp = X[i]*Y[i]; res = res + tmp; } After (T_clk = 2.5 ns): int12 X[512],Y[512]; int16 res=0; int24 tmp; for (int9 i=0;i<512;i++){ tmp = X[i]*Y[i]; res = res + (tmp>>8); } The parentheses around tmp>>8 are needed, since in C/C++ the + operator binds tighter than >>. [Figure: T_mul and T_add are both shorter on the narrower datapath.] This is very effective, as it improves performance and reduces cost.
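The effect of the reduced wordlengths can be emulated in standard C++ before moving to vendor types. The sketch below is an assumption-laden stand-in: `wrap_signed` and `dot_fixed` are hypothetical names, overflow is modelled as silent two's complement wrap-around, and real bit-accurate types may offer other overflow and rounding modes.

```cpp
#include <cstdint>

// Wrap v to a W-bit two's complement value (hypothetical stand-in for a
// bit-accurate integer type; overflow wraps around silently).
template <int W>
std::int64_t wrap_signed(std::int64_t v) {
    const std::uint64_t mask = (std::uint64_t{1} << W) - 1;
    const std::uint64_t sign = std::uint64_t{1} << (W - 1);
    const std::uint64_t u = static_cast<std::uint64_t>(v) & mask;
    // Flip the sign bit, then subtract it again: this sign-extends u.
    return static_cast<std::int64_t>(u ^ sign) - static_cast<std::int64_t>(sign);
}

// Dot product with 12-bit inputs, a 24-bit product and a 16-bit accumulator.
std::int64_t dot_fixed(const int* x, const int* y, int n) {
    std::int64_t res = 0;
    for (int i = 0; i < n; ++i) {
        const std::int64_t tmp =
            wrap_signed<24>(wrap_signed<12>(x[i]) * wrap_signed<12>(y[i]));
        // `res + tmp >> 8` would parse as `(res + tmp) >> 8`, hence the
        // parentheses. Right-shifting a negative value is arithmetic on
        // mainstream compilers and guaranteed to be so from C++20.
        res = wrap_signed<16>(res + (tmp >> 8));
    }
    return res;
}
```

Running the same test vectors through a 32-bit version and through `dot_fixed` is one way to check that the chosen wordlengths keep the algorithm correct.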

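36 Loop single-cycle execution. [Figure: execution time (us, ticks at 3.0, 2.5 and 1.25) against cost (simplified), comparing the multi-cycle (12/24 bit), multi-cycle (32 bit), single-cycle (32 bit) and single-cycle (12/24 bit) designs; costs range from 1 multiplier, 1 adder and 2 muxes to 1 multiplier and 2 adders.]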

37 Outline. Challenges when synthesizing hardware from C/C++. High Level Synthesis from C in a nutshell. Exploring the performance/area trade-off. Performance optimizations for High Level Synthesis.

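38 How to further improve performance? Increase the number of operations executed per iteration, so that more work is performed within a clock cycle. This can be achieved with partial loop unrolling: int12 X[512],Y[512]; int16 res=0; int24 tmp; for (int9 i=0;i<512;i+=4){ tmp = X[i]*Y[i]; res = res + tmp; tmp = X[i+1]*Y[i+1]; res = res + tmp; tmp = X[i+2]*Y[i+2]; res = res + tmp; tmp = X[i+3]*Y[i+3]; res = res + tmp; } The unrolled loop runs 128 iterations of four multiply-accumulate operations each. [Figure: four multipliers fed by X[i] to X[i+3] and Y[i] to Y[i+3], followed by a chain of adders accumulating into res, with i incremented by 4.] This may also increase the datapath critical path, and end up being counterproductive.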

39 Loop unrolling (partial/full). But the unrolled datapath can also be optimized: cascaded integer adders are reorganized as adder trees, and some array accesses can be removed (replaced by scalars). [Figure: the 4x unrolled datapath with a chain of adders (T_clk = 3.5 ns, exec time = 0.45 us) and with an adder tree that reduces the critical path (T_clk = 3 ns, exec time = 0.38 us).]
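At the source level, the unrolling and the adder-tree reorganization look roughly as follows. This is a sketch in plain C++ with made-up function names; it uses ordinary integers rather than bit-accurate types, and an HLS tool may well perform both steps by itself.

```cpp
#include <cassert>

// Reference: one multiply-accumulate per iteration.
long dot_rolled(const int* x, const int* y, int n) {
    long res = 0;
    for (int i = 0; i < n; ++i) res += static_cast<long>(x[i]) * y[i];
    return res;
}

// Unrolled by 4, with the four products summed by a balanced adder tree
// instead of a chain of four dependent additions.
long dot_unrolled4_tree(const int* x, const int* y, int n) {
    assert(n % 4 == 0);  // this sketch assumes a trip count divisible by 4
    long res = 0;
    for (int i = 0; i < n; i += 4) {
        const long p0 = static_cast<long>(x[i])     * y[i];
        const long p1 = static_cast<long>(x[i + 1]) * y[i + 1];
        const long p2 = static_cast<long>(x[i + 2]) * y[i + 2];
        const long p3 = static_cast<long>(x[i + 3]) * y[i + 3];
        const long s01 = p0 + p1;  // first tree level: two independent adds
        const long s23 = p2 + p3;
        res += s01 + s23;          // second level, then the accumulation
    }
    return res;
}
```

The reassociation is exact here because integer addition is associative; with floating point the tree version may round differently from the chain.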

40 Loop unrolling perf/area trade-off. [Figure: execution time (us, ticks at 3.0, 2.5 and 1.25) against cost (simplified), comparing the multi-cycle (12/24 bit), multi-cycle (32 bit), single-cycle (32 bit), single-cycle (12/24 bit) and unrolled (12/24 bit) designs.]

41 Loop unrolling + multi-cycle. Unrolling can help improve multi-cycle schedules, by increasing the hardware operator utilization rate thanks to additional reuse opportunities. The cost is the same as for the initial multi-cycle version, but the multiplier is used more efficiently. [Figure: schedule of the 4x unrolled body over steps 1 to 5, with the four products X[i]*Y[i] to X[i+3]*Y[i+3] sharing one multiplier.] [Table: #cycles = 5*..., f_clk (MHz), exec time = 1.28 us.]

42 Loop unroll + multi-cycle execution. [Figure: execution time (us, ticks at 3.0, 2.5 and 1.25) against cost (simplified), comparing the multi-cycle (12/24 bit), multi-cycle (32 bit), single-cycle (32 bit), multi-cycle 4x unrolled (12/24 bit), single-cycle (12/24 bit) and unrolled (12/24 bit) designs.]

43 Loop unrolling (partial/full). Most HLS tools can unroll loops automatically; this is generally controlled through a compiler directive. Unrolling is often restricted to loops with constant bounds. Full unrolling is only used for loops with small trip counts. Unrolling does not always improve performance: loop-carried dependences can hurt or annihilate the benefits (see the example on the next slide).

44 Loop unrolling (partial/full). Unrolling loops with carried dependences: for(int i=0;i<256;i+=1){ y[i+1] = a*y[i] + x[i]; } [Figure: the rolled datapath computes Y[i+1] from Y[i] and X[i]; the 2x unrolled datapath computes Y[i+1] and then Y[i+2] in a chain.] [Table: the exec time is 0.64 us for both versions.]
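A back-of-the-envelope model shows why both versions end up at about 0.64 us. It is only a sketch: it assumes that the unrolled multiply-add operations stay chained within one clock cycle, and it takes T_mul + T_add = 2.5 ns as an assumed value (the 12/24-bit clock period quoted earlier), since the cycle counts and clock periods of the table are not legible. The function name is hypothetical.

```cpp
// Execution time (us) of the recurrence y[i+1] = a*y[i] + x[i] when the loop
// is unrolled by `factor`. The unrolled multiply-adds depend on each other,
// so the clock period grows with the factor while the cycle count shrinks.
double recurrence_time_us(int iterations, int factor, double t_mul_add_ns) {
    const int cycles = iterations / factor + 1;     // +1 for the exit test
    const double t_clk_ns = factor * t_mul_add_ns;  // chained operators
    return cycles * t_clk_ns / 1000.0;
}
```

Under these assumptions `recurrence_time_us(256, 1, 2.5)` and `recurrence_time_us(256, 2, 2.5)` both come out between 0.64 and 0.65 us: halving the cycle count is cancelled by doubling the critical path.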

45 Loop unrolling (partial/full). Older HLS tools would fully unroll all loops, even if the goal was not to get the fastest implementation, and even for loops with large iteration counts (e.g. images). The lack of proper support for loops is one of the reasons behind the failure of the first generation of HLS tools.

46 Loop pipelining. Loop pipelining = pipelined execution of loop iterations: iteration i+1 starts before all operations of iteration i are finished. The principles are similar to software pipelining for VLIW/DSPs. int12 X[512],Y[512]; int16 res=0; int24 tmp; for (int9 i=0;i<512;i++){ tmp = X[i]*Y[i]; res = res + (tmp>>8); } In this example a new iteration starts every cycle, and all operations in the loop body have their own operator. [Figure: two-stage pipeline; stage 1 multiplies X[i+1] and Y[i+1] while stage 2 accumulates the product of X[i] and Y[i] into res.] [Table: #cycles, T_clk (MHz), exec time = 1.02 us.]

47 Loop pipelining perf/area trade-off. [Figure: execution time (us, ticks at 3.0, 2.5 and 1.25) against cost (simplified), comparing the multi-cycle (12/24 bit), multi-cycle (32 bit), multi-cycle 4x unrolled (12/24 bit), single-cycle (32 bit), single-cycle (12/24 bit), pipelined (12/24 bit) and unrolled (12/24 bit) designs.]

48 Loop pipelining. A given pipelined loop schedule is characterized by two parameters. The initiation interval (II) is the delay separating the execution of two successive iterations; a perfect loop pipelining leads to II=1. The pipeline latency (L) is the number of cycles needed to completely execute a given iteration (we have L>1). Targeting II=1 is common, as the number of operators is not constrained by a predefined architecture as in VLIW/DSPs. Loop pipelining enables very deeply pipelined datapaths: operators are also aggressively pipelined (e.g. float), and latencies in the order of 100s of cycles are not uncommon. Pipelining is the most important optimization in HLS: it brings huge performance improvements for a marginal resource cost (plenty of registers in FPGA cells).
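The role of II and L can be made concrete with the usual cycle-count estimate for a pipelined loop. This is a sketch with invented function names; it ignores the extra exit-test cycle and any stalls, and assumes at least one iteration.

```cpp
// Cycles needed by a pipelined loop: the first iteration completes after
// `latency` cycles, and each later iteration completes `ii` cycles after
// the previous one.
long pipelined_cycles(long iterations, long ii, long latency) {
    return latency + ii * (iterations - 1);
}

// The same loop without any overlap: every iteration pays the full latency.
long sequential_cycles(long iterations, long latency) {
    return iterations * latency;
}
```

For 512 iterations and a latency of 100 cycles, II=1 gives 611 cycles against 51200 without overlap, which is why a deep pipeline costs little as long as the trip count is large.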

49 Pipelining + unrolling = vectorizing. Pipelining + unrolling by K further increases performance. We execute K iterations following a SIMD execution model, and the executions of the K-iteration packets are overlapped. [Figure: 4x unrolled datapath with four multipliers fed by X[i] to X[i+3] and Y[i] to Y[i+3], accumulating into res.] II=1, L=3; T_clk = max(T_mul, 2*T_add). Loop executed in 82 cycles. However, achieving II=1 is often not straightforward, as a later slide shows.

50 Loop pipelining perf/area trade-off. [Figure: execution time (us, ticks at 3.0, 2.5 and 1.25) against cost (simplified), comparing the multi-cycle (12/24 bit), multi-cycle (32 bit), multi-cycle 4x unrolled (12/24 bit), single-cycle (32 bit), single-cycle (12/24 bit), pipelined (12/24 bit), unrolled 4x (12/24 bit) and pipelined + unrolled 4x (12/24 bit) designs.]

51 Loop fusion. Merging consecutive loops into a single one. Everything happens as if the two loops were executed in parallel. It helps reduce the impact of the loop pipelining fill/flush phases. Before: for(i=0;i<256;i++){ y[i] = foo(x[i]); } for(i=0;i<256;i++){ z[i] = bar(y[i]); } After: for(i=0;i<=256;i++){ if(i<256) y[i] = foo(x[i]); if(i>0) z[i-1] = bar(y[i-1]); } The fused loop needs one extra iteration (i=256) so that z[255] is still computed. [Figure: the foo and bar stages overlap in time.] Loop merging is not always obvious due to dependences. In some cases, merging must be combined with loop shifting.
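The equivalence between the two-loop version and the fused, shifted version can be checked in software. In this sketch `foo` and `bar` are arbitrary stand-ins chosen for the example, and the fused loop runs one extra iteration so that the last element of z is produced.

```cpp
#include <vector>

// Hypothetical stand-ins for the two processing stages.
static int foo(int v) { return v + 1; }
static int bar(int v) { return 2 * v; }

// Original form: two consecutive loops.
std::vector<int> two_loops(const std::vector<int>& x) {
    const int n = static_cast<int>(x.size());
    std::vector<int> y(n), z(n);
    for (int i = 0; i < n; ++i) y[i] = foo(x[i]);
    for (int i = 0; i < n; ++i) z[i] = bar(y[i]);
    return z;
}

// Fused form: the second stage is shifted by one iteration, so iteration i
// produces y[i] and consumes y[i-1]; the loop runs n+1 times.
std::vector<int> fused_shifted(const std::vector<int>& x) {
    const int n = static_cast<int>(x.size());
    std::vector<int> y(n), z(n);
    for (int i = 0; i <= n; ++i) {
        if (i < n) y[i] = foo(x[i]);
        if (i > 0) z[i - 1] = bar(y[i - 1]);
    }
    return z;
}
```

Both functions return the same vector for any input, which is a cheap sanity check to run before handing the fused loop to the synthesis tool.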

52 Nested loop support in HLS. In general, HLS tools only pipeline the innermost loop. This is no big deal when innermost loops have large trip counts: for(i=0;i<256;i++){ for(j=0;j<M;j++){ x[i][j] = foo(x[i][j]); } } There are cases when this leads to suboptimal designs, for example when processing small 8x8 pixel macro-blocks, or for deeply pipelined datapaths with 100s of stages. [Figure: pipeline occupancy over time for M=32 (i=0 to 2) and for M=8 (i=0 to 5).] The pipeline overhead is more significant for innermost loops with short trip counts.

53 How to pipeline outer loops? First option: completely unroll all innermost loops to get a single loop. This supports non-perfectly nested innermost loops, but is only viable for innermost loops with small constant bounds. Before: for(i=0;i<256;i++){ for(j=0;j<4;j++){ x[i][j] = foo(x[i][j]); } for(j=0;j<2;j++){ x[i][j] = bar(x[i][j]); } } After: for(i=0;i<256;i++){ x[i][0] = foo(x[i][0]); x[i][1] = foo(x[i][1]); x[i][2] = foo(x[i][2]); x[i][3] = foo(x[i][3]); x[i][0] = bar(x[i][0]); x[i][1] = bar(x[i][1]); } Second option: flatten/coalesce the loop nest hierarchy. This works only for perfectly nested loops; use with caution! Before: for(i=0;i<M;i++){ for(j=0;j<N;j++){ x[i][j] = foo(x[i][j]); } } After: for(k=0;k<M*N;k++){ i=k/N; j=k%N; x[i][j] = foo(x[i][j]); }
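The flattening transformation can be checked against the nested form in a few lines. In this sketch the body `v + 1` stands in for foo, the function names are made up, and the array is assumed rectangular and non-empty.

```cpp
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Perfectly nested form.
void apply_nested(Grid& x) {
    const int m = static_cast<int>(x.size());
    const int n = static_cast<int>(x[0].size());
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            x[i][j] = x[i][j] + 1;  // stand-in for foo(x[i][j])
}

// Flattened (coalesced) form: a single loop of m*n iterations, with the
// original indices recovered by a division and a modulo.
void apply_flattened(Grid& x) {
    const int m = static_cast<int>(x.size());
    const int n = static_cast<int>(x[0].size());
    for (int k = 0; k < m * n; ++k) {
        const int i = k / n;
        const int j = k % n;
        x[i][j] = x[i][j] + 1;
    }
}
```

The division and modulo are cheap in software but can be costly operators in hardware when N is not a power of two, which may be one reason for the "use with caution" remark; keeping two counters that are incremented and reset by hand is a common alternative.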

54 Memory port resource conflicts. The main limiting factor for reaching II=1 is memory. When II=1, there are generally many concurrent accesses to the same array (and thus to the same memory block). [Figure: 4x unrolled datapath reading X[i] to X[i+3] and Y[i] to Y[i+3].] For II=1, we must handle 8 reads per cycle to the memory blocks containing the arrays X[] and Y[].

55 Partitioning arrays in banks. Duplicating memory reduces memory port conflicts. Naïve duplication is not efficient: the array data must be allocated to several memory blocks to reduce the number of conflicts. [Figure: instead of a single block holding X[ ] and Y[ ], X is spread over four blocks (block 1: X[0], X[4], X[8], X[12], X[16], ...; block 2: X[1], X[5], X[9], X[13], X[17], ...; block 3: X[2], X[6], X[10], X[14], ...; block 4: X[3], X[7], X[11], X[15], ...) and Y over four further blocks in the same way, so that X[i] to X[i+3] and Y[i] to Y[i+3] are each read from a different block.] A block-cyclic partitioning of the data for X[ ] and Y[ ] helps solve all memory resource conflicts.

56 Partitioning arrays in memory banks. Partitioning can be done at the source level. This can be tedious; many HLS tools offer some automation. Before: float X[32],Y[32]; res=0.0; for (int i=0;i<32;i+=4){ tmp = X[i]*Y[i]; res = res + tmp; tmp = X[i+1]*Y[i+1]; res = res + tmp; tmp = X[i+2]*Y[i+2]; res = res + tmp; tmp = X[i+3]*Y[i+3]; res = res + tmp; } After: float X0[8],X1[8],X2[8],X3[8]; float Y0[8],Y1[8],Y2[8],Y3[8]; res=0.0; for (int i=0;i<8;i++){ tmp = X0[i]*Y0[i]; res = res + tmp; tmp = X1[i]*Y1[i]; res = res + tmp; tmp = X2[i]*Y2[i]; res = res + tmp; tmp = X3[i]*Y3[i]; res = res + tmp; } Here Xk[i] holds X[4*i+k], and likewise for Y.
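The mapping from the original array to its banks can be spelled out as code. This is a sketch of a cyclic partitioning over four banks; `partition_cyclic` and `dot_banked` are hypothetical names, and in a real flow the banks would be separate memories rather than vectors.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBanks = 4;

// Cyclic partitioning: element i goes to bank i % 4, at offset i / 4.
std::array<std::vector<float>, kBanks> partition_cyclic(const std::vector<float>& a) {
    assert(a.size() % kBanks == 0);  // sketch: size assumed divisible by 4
    std::array<std::vector<float>, kBanks> banks;
    for (std::size_t i = 0; i < a.size(); ++i) banks[i % kBanks].push_back(a[i]);
    return banks;
}

// 4x unrolled dot product on the partitioned arrays.
float dot_banked(const std::vector<float>& x, const std::vector<float>& y) {
    const auto xb = partition_cyclic(x);
    const auto yb = partition_cyclic(y);
    float res = 0.0f;
    for (std::size_t i = 0; i < xb[0].size(); ++i) {
        // The 8 reads of one iteration each target a different bank.
        res += xb[0][i] * yb[0][i];
        res += xb[1][i] * yb[1][i];
        res += xb[2][i] * yb[2][i];
        res += xb[3][i] * yb[3][i];
    }
    return res;
}
```

With this layout the accumulation order is the same as in the unpartitioned loop, so the result is unchanged; only the memory that serves each read differs.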

57 Eliminating conflicts through scalarization. Scalarization: copy an array cell value into a scalar variable, and use the scalar variable instead of the array whenever possible. Obvious reuse is often performed by the compiler itself. Beware, reuse across the iterations of a loop is not automated. Before (two memory ports are needed to reach II=1): float X[128]; float res=0.0; for (i=1;i<128;i+=1){ res=res*(X[i]-X[i-1]); } After (only one memory port is needed to obtain II=1): float X[128]; float res=0.0,tmp1,tmp0; tmp1 = X[0]; tmp0 = X[1]; for (i=2;i<128;i+=1){ res=res*(tmp0-tmp1); tmp1 =tmp0; tmp0 = X[i]; } res=res*(tmp0-tmp1); The final statement after the loop handles the last pair X[127]-X[126], which the shifted loop body does not reach.
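A slightly simpler variant of the same idea keeps only the previous element in a scalar, which avoids the epilogue. This sketch is not the slide's exact code: the function names are invented, and the initial value of the product is passed in as a parameter.

```cpp
#include <cstddef>
#include <vector>

// Array form: two reads of x per iteration (x[i] and x[i-1]).
float diff_product_array(const std::vector<float>& x, float res) {
    for (std::size_t i = 1; i < x.size(); ++i) res = res * (x[i] - x[i - 1]);
    return res;
}

// Scalarized form: x[i-1] is carried over from the previous iteration in a
// scalar, so each iteration reads the array only once.
float diff_product_scalarized(const std::vector<float>& x, float res) {
    if (x.size() < 2) return res;  // nothing to multiply
    float prev = x[0];
    for (std::size_t i = 1; i < x.size(); ++i) {
        const float cur = x[i];    // the single memory read of this iteration
        res = res * (cur - prev);
        prev = cur;
    }
    return res;
}
```

Both forms perform the same multiplications in the same order, so they return identical results; only the number of array reads per iteration changes.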

58 Another take on HLS vs HDL. [Figure: execution time (us, ticks at 3.0, 2.5 and 1.25) against cost (simplified) for HDL design 1 and HLS-based designs 1 to 4, with the performance constraint and the designer's performance safety margin marked.] The actual best design is obtained through HLS!


60 HLS in 2020. HLS tools are focusing on hardware IP design: in 2016, a whole system can be specified within a single flow; in 2020, hardware/software boundaries will have faded away. Today, there is poor support for dynamic data structures: speculative execution techniques from processor design may help widen the scope of applicability of HLS to new domains. Most program transformations are done by hand: expect more complex semi-automatic program transformations, and complex combinations of program transformations to obtain the best design (iterative compilation + High Level Synthesis). PHP developers still can't design hardware. Well, that is not likely to change yet.

61 Loop tiling. Break the loops into «tiles» or blocks. This exposes coarse-grain parallelism (for threads) and improves temporal reuse over loop nests. It is legal only if all loops are permutable (i.e. can be interchanged). Before: float **A,**B,**C; for(i=0;i<16;i++) { for(j=0;j<16;j++) { for(k=0;k<16;k++) S0: C[i][j]+=B[i][k]*A[k][j]; } } After (a 4x4x4 tile): float **A,**B,**C; for(ii=0;ii<16;ii+=4) for(jj=0;jj<16;jj+=4) for(kk=0;kk<16;kk+=4) for(i=0;i<4;i++) for(j=0;j<4;j++) for(k=0;k<4;k++) S1: C[i+ii][j+jj]+= B[i+ii][k+kk]*A[k+kk][j+jj]; This is best understood with an example: we consider a fully associative 32-word data cache, organized as 8 cache lines of 4 words each. M2R-CSS

62 Loop tiling example (matrix product). Tiling helps improve spatial and temporal locality. One chooses the tile size such that all the data of a tile fits into the cache. float **A,**B,**C; for(ii=0;ii<16;ii+=4) for(jj=0;jj<16;jj+=4) for(kk=0;kk<16;kk+=4) for(i=0;i<4;i++) for(j=0;j<4;j++) for(k=0;k<4;k++) S1: C[i+ii][j+jj]+= B[i+ii][k+kk]*A[k+kk][j+jj]; [Figure: a 4x4x4 tile shown on C[16][16], A[16][16] and B[16][16], indexed by ii, jj and kk.]
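A runnable version of the tiled matrix product makes it easy to check that tiling only reorders the computation. This is a sketch: it uses integer matrices and fixed-size `std::array` storage (both choices made for this example) so that the two results can be compared exactly.

```cpp
#include <array>

constexpr int N = 16;  // matrix size
constexpr int T = 4;   // tile size
using Matrix = std::array<std::array<int, N>, N>;

// Reference product C = B * A, with the loop order of the original kernel.
Matrix matmul(const Matrix& B, const Matrix& A) {
    Matrix C{};  // zero-initialized
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < N; ++k)
                C[i][j] += B[i][k] * A[k][j];
    return C;
}

// Tiled product: the three outer loops walk over 4x4x4 tiles, the three
// inner loops work inside one tile.
Matrix matmul_tiled(const Matrix& B, const Matrix& A) {
    Matrix C{};
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < N; kk += T)
                for (int i = 0; i < T; ++i)
                    for (int j = 0; j < T; ++j)
                        for (int k = 0; k < T; ++k)
                            C[i + ii][j + jj] += B[i + ii][k + kk] * A[k + kk][j + jj];
    return C;
}
```

With floating-point data the tiled version would add the partial products in the same k order for each C element here, but other tilings or loop permutations can change rounding, so an exact comparison is safest on integers.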

63 Loop tiling example (matrix product). Tiling helps improve spatial and temporal locality. One chooses the tile size such that all the data of a tile fits into the cache. float **A,**B,**C; for(ii=0;ii<16;ii+=4) for(jj=0;jj<16;jj+=4) #pragma pipeline II=1 for(kk=0;kk<16;kk+=4) memcpy(a,la, ) for(k=0;k<4;k++) for(i=0;i<4;i++) for(j=0;j<4;j++) S1: C[i+ii][j+jj]+= B[i+ii][k+kk]*A[k+kk][j+jj]; [Figure: a 4x4x4 tile shown on C[16][16], A[16][16] and B[16][16], indexed by ii, jj and kk.]

64-67 Loop tiling. Execution of the initial kernel. We show which parts of the A, B and C arrays are in the cache. [Figures: successive cache contents, showing B[i,j] to B[i,j+3], A[i,j] to A[i+3,j] and the block C[i][j] to C[i+3][j+3].]

68 Loop tiling. Execution of the tiled kernel, for a single tile (ii=jj=kk=0). [Figure: the tiles of B[][], A[][] and C[][] held in the cache.]

69 Loop tiling & parallelization. Tiling is a simple way of exposing coarse-grain parallelism: tiles are executed as atomic execution units, and there is no synchronization during a tile execution. Tiling enables efficient parallelization: it improves locality and reduces synchronization overhead. Finding the right tile size and shape is difficult (open problem). First variant, parallelizing the ii loop: float **A,**B,**C; #pragma omp parallel for private(ii) for(ii=0;ii<16;ii+=4) for(jj=0;jj<16;jj+=4) for(kk=0;kk<16;kk+=4) for(i=0;i<4;i++) for(j=0;j<4;j++) for(k=0;k<4;k++) S1: C[i+ii][j+jj]+= B[i+ii][k+kk]*A[k+kk][j+jj]; Second variant: the same loop nest with #pragma omp parallel for private(jj) placed on the jj loop instead.

70 Challenge. Image processing pipeline example: for(i=1;i<n-1;i++) for(j=0;j<m;j++) S0: tmp[i][j]=f(in[i][j],in[i-1][j],in[i+1][j]); for(i=1;i<n-1;i++) for(j=1;j<m-1;j++) S1: out[i][j]=f(tmp[i][j],tmp[i][j-1],tmp[i][j+1]); [Figure: in[][] feeds tmp[][], which feeds out[][].] Find a legal fusion and optimize the memory accesses.


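72 Exercise 1 (solution). Straightforward fusion is not possible: it needs to be combined with loop shifting. The fusion below is illegal because out[i][j] reads tmp[i][j+1], which the fused loop only computes at the next iteration of j. //illegal transformation for(i=1;i<=n-2;i+=1){ tmp[i][0] = 3*in[i][0] -in[i-1][0]-in[i+1][0]; for(j=1;j<=m-2;j+=1){ tmp[i][j] = 3*in[i][j] -in[i-1][j]-in[i+1][j]; out[i][j] = 3*tmp[i][j] -tmp[i][j-1]-tmp[i][j+1]; } tmp[i][m-1] = 3*in[i][m-1] -in[i-1][m-1]-in[i+1][m-1]; }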

73 Exercse 1 (soluton) M2R-CSS 73


More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

LLVM passes and Intro to Loop Transformation Frameworks

LLVM passes and Intro to Loop Transformation Frameworks LLVM passes and Intro to Loop Transformaton Frameworks Announcements Ths class s recorded and wll be n D2L panapto. No quz Monday after sprng break. Wll be dong md-semester class feedback. Today LLVM passes

More information

RADIX-10 PARALLEL DECIMAL MULTIPLIER

RADIX-10 PARALLEL DECIMAL MULTIPLIER RADIX-10 PARALLEL DECIMAL MULTIPLIER 1 MRUNALINI E. INGLE & 2 TEJASWINI PANSE 1&2 Electroncs Engneerng, Yeshwantrao Chavan College of Engneerng, Nagpur, Inda E-mal : mrunalngle@gmal.com, tejaswn.deshmukh@gmal.com

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011 9/8/2 2 Outlne Appendx C: The Bascs of Logc Desgn TDT4255 Computer Desgn Case Study: TDT4255 Communcaton Module Lecture 2 Magnus Jahre 3 4 Dgtal Systems C.2: Gates, Truth Tables and Logc Equatons All sgnals

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

Petri Net Based Software Dependability Engineering

Petri Net Based Software Dependability Engineering Proc. RELECTRONIC 95, Budapest, pp. 181-186; October 1995 Petr Net Based Software Dependablty Engneerng Monka Hener Brandenburg Unversty of Technology Cottbus Computer Scence Insttute Postbox 101344 D-03013

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Storage Binding in RTL synthesis

Storage Binding in RTL synthesis Storage Bndng n RTL synthess Pe Zhang Danel D. Gajsk Techncal Report ICS-0-37 August 0th, 200 Center for Embedded Computer Systems Department of Informaton and Computer Scence Unersty of Calforna, Irne

More information

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique

The Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique //00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy

More information

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface.

Assembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface. IDC Herzlya Shmon Schocken Assembler Shmon Schocken Sprng 2005 Elements of Computng Systems 1 Assembler (Ch. 6) Where we are at: Human Thought Abstract desgn Chapters 9, 12 abstract nterface H.L. Language

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

CHAPTER 4 PARALLEL PREFIX ADDER

CHAPTER 4 PARALLEL PREFIX ADDER 93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum

More information

Harvard University CS 101 Fall 2005, Shimon Schocken. Assembler. Elements of Computing Systems 1 Assembler (Ch. 6)

Harvard University CS 101 Fall 2005, Shimon Schocken. Assembler. Elements of Computing Systems 1 Assembler (Ch. 6) Harvard Unversty CS 101 Fall 2005, Shmon Schocken Assembler Elements of Computng Systems 1 Assembler (Ch. 6) Why care about assemblers? Because Assemblers employ some nfty trcks Assemblers are the frst

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Today Using Fourier-Motzkin elimination for code generation Using Fourier-Motzkin elimination for determining schedule constraints

Today Using Fourier-Motzkin elimination for code generation Using Fourier-Motzkin elimination for determining schedule constraints Fourer Motzkn Elmnaton Logstcs HW10 due Frday Aprl 27 th Today Usng Fourer-Motzkn elmnaton for code generaton Usng Fourer-Motzkn elmnaton for determnng schedule constrants Unversty Fourer-Motzkn Elmnaton

More information

Concurrent models of computation for embedded software

Concurrent models of computation for embedded software Concurrent models of computaton for embedded software and hardware! Researcher overvew what t looks lke semantcs what t means and how t relates desgnng an actor language actor propertes and how to represent

More information

SUMMARY... I TABLE OF CONTENTS...II INTRODUCTION...

SUMMARY... I TABLE OF CONTENTS...II INTRODUCTION... Summary A follow-the-leader robot system s mplemented usng Dscrete-Event Supervsory Control methods. The system conssts of three robots, a leader and two followers. The dea s to get the two followers to

More information

Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings

Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings Loop Ppelnng for Hgh-Throughput Stream Computaton Usng Self-Tmed Rngs Gennette Gll, John Hansen and Montek Sngh Dept. of Computer Scence Unv. of North Carolna, Chapel Hll, NC 27599, USA {gllg,jbhansen,montek}@cs.unc.edu

More information

Cache Memories. Lecture 14 Cache Memories. Inserting an L1 Cache Between the CPU and Main Memory. General Org of a Cache Memory

Cache Memories. Lecture 14 Cache Memories. Inserting an L1 Cache Between the CPU and Main Memory. General Org of a Cache Memory Topcs Lecture 4 Cache Memores Generc cache memory organzaton Drect mapped caches Set assocate caches Impact of caches on performance Cache Memores Cache memores are small, fast SRAM-based memores managed

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introducton 1.1 Parallel Processng There s a contnual demand for greater computatonal speed from a computer system than s currently possble (.e. sequental systems). Areas need great computatonal

More information

System-on-Chip Design Analysis of Control Data Flow. Hao Zheng Comp Sci & Eng U of South Florida

System-on-Chip Design Analysis of Control Data Flow. Hao Zheng Comp Sci & Eng U of South Florida System-on-Chp Desgn Analyss of Control Data Flow Hao Zheng Comp Sc & Eng U of South Florda Overvew DF models descrbe concurrent computa=on at a very hgh level Each actor descrbes non-trval computa=on.

More information

Spatial Computation ABSTRACT 1. INTRODUCTION

Spatial Computation ABSTRACT 1. INTRODUCTION Spatal Computaton Mha Budu, Grsh Venkataraman, Tberu Chelcea and Seth Copen Goldsten {mhab,grsh,tb,seth}@cs.cmu.edu Carnege Mellon Unversty ABSTRACT Ths paper descrbes a computer archtecture, Spatal Computaton

More information

Newton-Raphson division module via truncated multipliers

Newton-Raphson division module via truncated multipliers Newton-Raphson dvson module va truncated multplers Alexandar Tzakov Department of Electrcal and Computer Engneerng Illnos Insttute of Technology Chcago,IL 60616, USA Abstract Reducton n area and power

More information

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction Chapter 2 Desgn for Testablty Dr Rhonda Kay Gaede UAH 2 Introducton Dffcultes n and the states of sequental crcuts led to provdng drect access for storage elements, whereby selected storage elements are

More information

Mallathahally, Bangalore, India 1 2

Mallathahally, Bangalore, India 1 2 7 IMPLEMENTATION OF HIGH PERFORMANCE BINARY SQUARER PRADEEP M C, RAMESH S, Department of Electroncs and Communcaton Engneerng, Dr. Ambedkar Insttute of Technology, Mallathahally, Bangalore, Inda pradeepmc@gmal.com,

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

CS221: Algorithms and Data Structures. Priority Queues and Heaps. Alan J. Hu (Borrowing slides from Steve Wolfman)

CS221: Algorithms and Data Structures. Priority Queues and Heaps. Alan J. Hu (Borrowing slides from Steve Wolfman) CS: Algorthms and Data Structures Prorty Queues and Heaps Alan J. Hu (Borrowng sldes from Steve Wolfman) Learnng Goals After ths unt, you should be able to: Provde examples of approprate applcatons for

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances

Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances In: Proc. 0th Int. Symposum on Hardware/Software Codesgn (CODES 02), Estes Park, Colorado, USA, May 6 8, 2002 Algorthmc Transformaton Technques for Effcent Exploraton of Alternatve Applcaton Instances

More information

Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware

Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware Draft submtted for publcaton. Please do not dstrbute Usng Delayed Addton echnques to Accelerate Integer and Floatng-Pont Calculatons n Confgurable Hardware Zhen Luo, Nonmember and Margaret Martonos, Member,

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

Improved Symoblic Simulation By Dynamic Funtional Space Partitioning

Improved Symoblic Simulation By Dynamic Funtional Space Partitioning Improved Symoblc Smulaton By Dynamc Funtonal Space Parttonng Tao Feng, L-.Wang, Kwang-Tng heng Department of EE, U-Santa Barbara, U.S.A tfeng,lcwang, tmcheng @ece.ucsb.edu Andy -. Ln adence Desgn Systems,

More information

ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE

ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE Yordzhev K., Kostadnova H. Інформаційні технології в освіті ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE Yordzhev K., Kostadnova H. Some aspects of programmng educaton

More information

Solving Planted Motif Problem on GPU

Solving Planted Motif Problem on GPU Solvng Planted Motf Problem on GPU Naga Shalaja Dasar Old Domnon Unversty Norfolk, VA, USA ndasar@cs.odu.edu Ranjan Desh Old Domnon Unversty Norfolk, VA, USA dranjan@cs.odu.edu Zubar M Old Domnon Unversty

More information

Verification by testing

Verification by testing Real-Tme Systems Specfcaton Implementaton System models Executon-tme analyss Verfcaton Verfcaton by testng Dad? How do they know how much weght a brdge can handle? They drve bgger and bgger trucks over

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing

Real-Time Systems. Real-Time Systems. Verification by testing. Verification by testing EDA222/DIT161 Real-Tme Systems, Chalmers/GU, 2014/2015 Lecture #8 Real-Tme Systems Real-Tme Systems Lecture #8 Specfcaton Professor Jan Jonsson Implementaton System models Executon-tme analyss Department

More information

EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION

EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION WITH THE ARMEN ARCHITECTURE C. Beaumont, B. Potter J.M. Flloque LIBr I.U.T. de Brest and LIBr Unversté de Bretagne Occdentale Télécom Bretagne BP

More information

Intro. Iterators. 1. Access

Intro. Iterators. 1. Access Intro Ths mornng I d lke to talk a lttle bt about s and s. We wll start out wth smlartes and dfferences, then we wll see how to draw them n envronment dagrams, and we wll fnsh wth some examples. Happy

More information

Area Efficient Self Timed Adders For Low Power Applications in VLSI

Area Efficient Self Timed Adders For Low Power Applications in VLSI ISSN(Onlne): 2319-8753 ISSN (Prnt) :2347-6710 Internatonal Journal of Innovatve Research n Scence, Engneerng and Technology (An ISO 3297: 2007 Certfed Organzaton) Area Effcent Self Tmed Adders For Low

More information

Efficient Code Generation for Automatic Parallelization and Optimization

Efficient Code Generation for Automatic Parallelization and Optimization Effcent Code Generaton for utomatc Parallelzaton and Optmzaton Cédrc Bastoul Laboratore PRSM, Unversté de Versalles Sant Quentn 5 avenue des États-Uns, 7805 Versalles Cedex, France Emal: cedrcbastoul@prsmuvsqfr

More information

Oracle Database: SQL and PL/SQL Fundamentals Certification Course

Oracle Database: SQL and PL/SQL Fundamentals Certification Course Oracle Database: SQL and PL/SQL Fundamentals Certfcaton Course 1 Duraton: 5 Days (30 hours) What you wll learn: Ths Oracle Database: SQL and PL/SQL Fundamentals tranng delvers the fundamentals of SQL and

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

5 The Primal-Dual Method

5 The Primal-Dual Method 5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*

Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations* Confguraton Management n Mult-Context Reconfgurable Systems for Smultaneous Performance and Power Optmzatons* Rafael Maestre, Mlagros Fernandez Departamento de Arqutectura de Computadores y Automátca Unversdad

More information

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss.

Today s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss. Today s Outlne Sortng Chapter 7 n Wess CSE 26 Data Structures Ruth Anderson Announcements Wrtten Homework #6 due Frday 2/26 at the begnnng of lecture Proect Code due Mon March 1 by 11pm Today s Topcs:

More information

Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden

Optimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden Optmzng for Speed Er Hagersten Uppsala Unversty, Sweden eh@t.uu.se What s the potental gan? Latency dfference L$ and mem: ~5x Bandwdth dfference L$ and mem: ~x Repeated TLB msses adds a factor ~-3x Execute

More information

AADL : about scheduling analysis

AADL : about scheduling analysis AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain AMath 483/583 Lecture 21 May 13, 2011 Today: OpenMP and MPI versons of Jacob teraton Gauss-Sedel and SOR teratve methods Next week: More MPI Debuggng and totalvew GPU computng Read: Class notes and references

More information

CS 534: Computer Vision Model Fitting

CS 534: Computer Vision Model Fitting CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust

More information

On Some Entertaining Applications of the Concept of Set in Computer Science Course

On Some Entertaining Applications of the Concept of Set in Computer Science Course On Some Entertanng Applcatons of the Concept of Set n Computer Scence Course Krasmr Yordzhev *, Hrstna Kostadnova ** * Assocate Professor Krasmr Yordzhev, Ph.D., Faculty of Mathematcs and Natural Scences,

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Combined Functional Partitioning and Communication Speed Selection for Networked Voltage-Scalable Processors Λ

Combined Functional Partitioning and Communication Speed Selection for Networked Voltage-Scalable Processors Λ Combned Functonal Parttonng and Communcaton Speed Selecton for Networked Voltage-Scalable Processors Λ Jnfeng Lu, Pa H. Chou, Nader Bagherzadeh epartment of Electrcal & Computer Engneerng Unversty of Calforna,

More information

Multi-objective Optimization Using Adaptive Explicit Non-Dominated Region Sampling

Multi-objective Optimization Using Adaptive Explicit Non-Dominated Region Sampling 11 th World Congress on Structural and Multdscplnary Optmsaton 07 th -12 th, June 2015, Sydney Australa Mult-objectve Optmzaton Usng Adaptve Explct Non-Domnated Regon Samplng Anrban Basudhar Lvermore Software

More information

Wavefront Reconstructor

Wavefront Reconstructor A Dstrbuted Smplex B-Splne Based Wavefront Reconstructor Coen de Vsser and Mchel Verhaegen 14-12-201212 2012 Delft Unversty of Technology Contents Introducton Wavefront reconstructon usng Smplex B-Splnes

More information

Functional and Timing Validation of Partially Bypassed Processor Pipelines

Functional and Timing Validation of Partially Bypassed Processor Pipelines Functonal and Tmng Valdaton of Partally Bypassed Processor Ppelnes Qang Zhu shyu@labs.fujtsu.com Avral Shrvastava Avral.Shrvastava@asu.edu Nl Dutt dutt@cs.uc.edu Fujtsu Laboratores LTD., Japan 1-1, Kamodanaa

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

GSLM Operations Research II Fall 13/14

GSLM Operations Research II Fall 13/14 GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

Lecture 4: Principal components

Lecture 4: Principal components /3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

CS1100 Introduction to Programming

CS1100 Introduction to Programming Factoral (n) Recursve Program fact(n) = n*fact(n-) CS00 Introducton to Programmng Recurson and Sortng Madhu Mutyam Department of Computer Scence and Engneerng Indan Insttute of Technology Madras nt fact

More information

Brave New World Pseudocode Reference

Brave New World Pseudocode Reference Brave New World Pseudocode Reference Pseudocode s a way to descrbe how to accomplsh tasks usng basc steps lke those a computer mght perform. In ths week s lab, you'll see how a form of pseudocode can be

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

An Associative Processor Array Designed for Computer Vision

An Associative Processor Array Designed for Computer Vision An Assocatve Processor Array Desgned for Computer Vson A.W.G. Duller R.H. Storer A.R. Thomson M.R. Pout E.L. Dagless Dept. of Electrcal and Electronc Engneerng Unversty of Brstol Brstol BS8 1TR, UK. The

More information

Scheduling with Integer Time Budgeting for Low-Power Optimization

Scheduling with Integer Time Budgeting for Low-Power Optimization Schedlng wth Integer Tme Bdgetng for Low-Power Optmzaton We Jang, Zhr Zhang, Modrag Potkonjak and Jason Cong Compter Scence Department Unversty of Calforna, Los Angeles Spported by NSF, SRC. Otlne Introdcton

More information

Notes on Organizing Java Code: Packages, Visibility, and Scope

Notes on Organizing Java Code: Packages, Visibility, and Scope Notes on Organzng Java Code: Packages, Vsblty, and Scope CS 112 Wayne Snyder Java programmng n large measure s a process of defnng enttes (.e., packages, classes, methods, or felds) by name and then usng

More information