Programming FPGAs in C/C++ with High Level Synthesis PACAP - HLS 1
|
|
- Godwin Stanley
- 5 years ago
- Views:
Transcription
1 Programmng FPGAs n C/C wth Hgh Level Synthess PACAP - HLS 1
2 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 2
3 HLS : from algorthm to Slcon N-1 =0 A. x x x x for(=0;<n;) acc=data[]*coef[]; [STMcroelectroncs] x Area Area total S1 S2 S3 S4 S5 S6 S7 S8 S9 Latency tme VHL & Verlog RTL Output SystemC TLM Output Hgh Level Synthess ams at rasng the level of abstracton of hardware desgn
4 Example: FFT at STMcroelectroncs C Generc Model 2K, 1K, , 64 ponts FFT/FFT (varous bt precson) Dgtal ODU 2K ponts (12bts) 2 Gs/s 65n, 45 nm WF/WMax 64/2K,1K, 512 (13bts) 20 Ms/s 65 nm [T. Mchel, STMcroelectroncs] Better Reusablty of Hgh Level Models MRI-CSS UWB 128 ponts (11bts) 528 Ms/s 90nm
5 HLS : does t really work? Hgh Level Synthess s not a new dea Semnal academc resarch work startng n the early 1990 s Frst commercal tools on the market around 1995 Frst generaton of HLS tools was a complete dsaster Worked only on small toy examples QoR (area/power) way below expectatons Blatantly overhyped by marketng folks It took 15 years to make these tools actually work Now used by worldclass chp vendors (STMcro, Nxt, ) Many desgners stll reject HLS (bad past experence)
6 HLS vs HDL performance Tme (us) 10 to 50 % 25 to 100% HLS desgn HDL desgn Cost (smplfed)
7 HLS n 2016 Major CAD vendors have ther own HLS tools Synphony from Synopsys (C/C) C2S from Cadence Catapult-C from Mentor Graphcs (C/C/SystemC) FPGA vendors have also enterng the game Free Vvado HLx by lnx (C/C) OpenCL compler by Altera (OpenCL)
8 HLS adopton Desgn languages used for FPGA desgn For FPGA desgns, Hgh Level Synthess from C/C wll become manstream wthn 5-10 years from now INSA-EII-5A 8
9 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 9
10 Descrbng dgtal desgns n C/C Is the C/C language suted for hardware desgn? No, t follows a sequental executon model Hardware s ntrnscally parallel (HDLs model concurrency) Dervng effcent hardware nvolves parallel executon No, t has flat/lnear vson of memory In an FPGA/ASIC memory s dstrbuted (memory blocks) No, t supports too few datatypes (char, nt, float) Hardware IP uses customzed wordlength (e.g 12 bts) May use specal arthmetc (ex: custom floatng pont) PACAP - HLS 10
11 How to descrbe crcut wth C, then? Use classes to model crcuts component and structure Instancate components and explctely wre them together Use of template can eases the desgn of complex crcuts Easy to generate a low level representaton of the crcut Stll operates at the Regster to Logc Level Does not really rase the level of abstracton No added value w.r.t HDL C based HLS tool ams at beng used as C complers User wrtes an algorthm n C, the tool produces a crcut. PACAP - HLS 11
12 Where s my clock? In C/C, there s no noton of tme Ths s a problem : we need synchronous (clocked) hardware HLS tools nfer tmng usng two approaches Implct semantcs (meanng) of language constructs User defned compler drectves (annotatons) Extendng C/C wth an mplct semantc??? They add a temporal meanng to some C/C constructs Ex : a loop teraton takes at least one cycle to execute Best way to understand s through examples PACAP - HLS 12
13 HLS and FPGA memory model In C/C memory s a large & flat address space Enablng ponter arthmetc, dynamc allocaton, etc. HLS has strong restrctons on memory management No dynamc memory allocaton (no malloc, no recurson) No global varables (HLS operate at the functon level) At best very lmted support for ponter arthmetcs Mappng of program varables nfered from source code Scalar varables are stored n dstrbuted regsters, Arrays are stored n memory blocks or n external memory Arrays are often parttoned n banks (capacty or #ports) PACAP - HLS 13
14 HLS bt accurate datatypes C/C standard types lmt hardware desgn choces Hardware mplementatons use a wde range of wordlength Mnmum precson to keep algorthm correctness whle reducng are mprovng speed. HLS provdes bt accurate types n C and C SystemC and vendor specfc types supported to smulate custom wordlength hardware datapaths n C/C [source lnx] PACAP - HLS 14
15 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 15
16 The rght use of HLS tools To use HLS, one stll need to «thnk n hardware» The desgner specfes an hgh level archtectural template The HLS tools then nfers the complete archtecture Desgners must fully understand how C/C language constructs map to hardware Desgners then use that knowledge to make the tool derve the accelerator they need. Need to be aware of optmzng complers lmtatons How smart a compler dependency analyss can be? How to bypass compler lmtatons (e.g usng #pragmas) PACAP - HLS 16
17 Key thngs to understand 1. How the HLS tool nfer the accelerator nterface (I/Os) Very mportant ssue for system level ntegraton 2. How the HLS tool handles array/memory accesses Key ssue when dealng wth mage processng algorthms 3. How the HLS tool handles (nested) loops Does t performs automatc loop ppelnng/unrollng/mergng? PACAP - HLS 17
18 Outlne Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell PACAP - HLS 18
19 Hgh Level Synthess from C/C HLS tools operate at the functon/procedure level Functons may call other functons (no recurson allowed) Each functon s mapped to a specfc hardware component Syntheszed components are then exported as HDL IP cores The I/O nterface nfered from the functon prototype Specfc ports for controllng the component (start, end, etc.) Smple data bus (scalar arguments of functon results) nt foobar(nt a){ } a 32 foobar done start Ths s not enough, we may need to access memory from our hardware component 19
20 I/O nterface and system level ntegraton HLS can nfer more complex nterfaces/protocols Ponters/arrays as argument = master access on system bus Streamng nterface can also be nfered usng drectves or specfc templates. nt foobar( nt a, nt* ptr, nt tab[], sc_ffo<nt> n){ a addr dout dn ptr addr dout dn tab foobar empty read n dn } = n.read() 20
21 Syntheszng hardware from a functon HLS tools decompose the program nto blocks Each block s realzed n hardware as a FSM datapath A global FSM controls whch block s actve accordng to current program state. The datapath of mutually exclusve blocks can be merged to save hardware resources (very few tools have ths ablty) for (nt =0;<32;) { res = []*Y[]; } f(state==4){ for (nt =0;<32;) { res = []/Y[]; } } else { for (nt =0;<32;) { res = 1/[]; } } TOP FSM FSM PACAP - HLS FSM FSM Datapath [] Y[] res [] Y[] res 1 Datapath 1 Datapath [] Y[] res 1 res res res 21 Mergng s possble
22 Array access management In HLS arrays are mapped to local or external memory Local memory maybe local SRAM or large regster fles External memory = bus nterface or SRAM/SDRAM, etc. External memory access management External memory access takes a ndefnte number of cycles Only one pendng access at a tme (no overlappng) Burst mode accesses are sometmes supported Local memory management Memory access always take one cycle (n read or wrte mode) The number of read/wrte ports per memory can be confgured Memory data type can be any type supported by the HLS tool 22
23 HLS support for condtonals HLS tools have two dfferent ways of handlng them By consderng branches as two dfferent blocks f(state==4){ for (nt =0;<32;) { res = []/Y[]; } } else { for (nt =0;<32;) { res = 1/[]; } } FSM FSM Datapath [] Y[] res 1 Datapath [] Y[] res 1 res res Mergng s possble By executng both branches and select the correct results f then/else do not contan loops functon calls. f(state==4){ res = ab; } else { res = a*b; } a b PACAP - HLS state==4 23
24 HLS support for loops Loops are one of the most mportant program constructs Most sgnal/mage kernel conssts of (nested) loops They are often the reason for resortng to hardware IPs Ther support n HLS tools vary a lot between vendors Best n class : Synopsys Synphony, Mentor Catapult-C Worst n class : C2S from Cadence, Impulse-C HLS users spent most on ther tme optmzng loops Drectly at the source level by tweakng the source code By usng compler/scrpt based optmzaton drectves PACAP - HLS 24
25 HLS wth SDSOC Hardware-software codesgn from sngle C/C spec. Key program functons annotated as FPGA accelerators HLS used to synthesze accelerator hardware Automatc synthess of glue logc and devce drvers Wth SDSoC you can desgn and run an accelerator n mnutes (used to take month ) PACAP - HLS 25
26 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 26
27 Loop sngle-cycle executon HLS default behavor s to execute one teraton/cycle Executng the loop wth N teraton takes N1 cycles Extra cycle for evaluatng the ext condton [] float [512],Y[512]; res=0.0; for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } Y[] res res Each operaton s mapped to ts own operator No resource sharng/hardware reuse 1 PACAP - HLS 27
28 Loop sngle-cycle executon Crtcal path (f max ) depends on loop body complexty Let us assume that T mul =3ns and T add =1ns [] Y[] res 1 T add T clk = 4 ns T mul T add res No control on the clock speed! nt [512],Y[512]; nt tmp,res=0; for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } #cycles f clk Exec tme Mhz 2,5 us PACAP - HLS 28
29 Loop sngle-cycle executon Tme (us) 2,5 Sngle cycle 1 *, 2 Cost (smplfed)
30 Mult-cycle executon Most HLS tools handle constrants over T clk Body executon s splt nto several shorter steps (clock-tcks) Needs accurate nformaton about target technology Target T clk = 3ns nt [512],Y[512]; nt tmp,res=0; for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } [] Y[] res res 1 #cycles f clk Exec tme T clk =3ns 512*21 330Mhz 3,07us PACAP - HLS Step 1 Step 2 30
31 Hardware reuse n mult-cycle executon For mult-cycle executon, HLS enable operators reuse HLS tools balance hardware operator usage of over tme Trade-off between operator cost and muxes overhead T clk =3ns Step 1 Step 2 [] Y[] [] res Y[] res 1 res 1 multpler, 2 adder 1 multpler, 1 adder, 2 muxes PACAP - HLS 1 res 31
32 Hardware reuse n mult-cycle executon Hardware reuse rate s controlled by the user Trade-off between parallelsm (performance) and reuse (cost) Users provdes constrants on the # and type of resources Mostly used for multplers, memory banks/ports, etc. The HLS tools decdes what and when to reuse Optmzes for area or speed, followng user constrants Combnatoral optmzaton problem (exponental complexty) nt [512],Y[512]; nt tmp,res=0; #pragma HLS mult=1,adder=1 for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } res PACAP - HLS [] Y[] 1 1 multpler, 1 adder, 2 muxes res 32
33 Loop sngle-cycle executon Tme (us) 3,0 Mult cycle 2,5 Sngle cycle 1*,1 2 mux 1*,2 Cost (smplfed)
34 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 34
35 How to further mprove performance? Improve f max wth arthmetc optmzatons Avod complex/costly operatons as much as possble Aggressvely reduce data wordlength to shorten crtcal path nt [512],Y[512]; nt tmp,res=0; for (nt =0;<512;){ tmp = []*Y[]; res = res tmp; } T clk = 4 ns nt12 [512],Y[512]; nt16 res=0; nt24 tmp; for (nt9 =0;<512;){ tmp = []*Y[]; res = res tmp>>8; } T clk = 2,5 ns T mul T add * T add T add T mul T add Very effectve as t mproves performance and reduce cost PACAP - HLS 35
36 Loop sngle-cycle executon Tme (us) 3,0 2,5 Mult cycle (12/24 bt) Mult cycle (32 bts) Sngle cycle (32 bt) Sngle cycle (12/24 bt) 1,25 1*,1 2 mux 1*,2 Cost (smplfed)
37 Outlne Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Performance optmzatons for Hgh Level Synthess PACAP - HLS 37
38 How to further mprove performance? Increase the # of operatons executed per teraton More work performed wthn a clock cycle Ths can be acheved wth partal loop unrollng [3] nt12 [512],Y[512]; nt16 res=0; nt24 tmp; Y[3] for (nt9 =0;<128;=4){ [2] tmp = []*Y[]; Y[2] res = res tmp; [1] tmp = [1]*Y[1]; res = res tmp; [] tmp = [2]*Y[2]; Y[] res = res tmp; tmp = [3]*Y[3]; res res = res tmp; 4 } Y[1] Ths may also ncrease the datapath crtcal path.. and end-up beng counter productve res 38
39 Loop unrollng (partal/full) But, the unrolled datapath can also be optmzed Cascaded nteger adder are reorganzed as adder trees Some array accesses can be removed (replaced by scalars) [3] Y[3] [2] Y[2] [1] Y[1] [] Y[] res 4 res [3] Y[3] [2] Y[2] [1] Y[1] [] Y[] res 4 res Crtcal path reducton T clk =3,5ns T clk =3ns #cycles f clk Exec tme Mhz 0,45us #cycles f clk Exec tme Mhz 0,38 us 39
40 Loop unrollng perf/area trade-off Tme (us) 3,0 2,5 Mult cycle (12/24 bt) Mult cycle (32 bts) Sngle cycle (32 bt) Sngle cycle (12/24 bt) 1,25 Unrolled (12/24 bt) 1*,1 2 mux 1*,2 Cost (smplfed)
41 Loop unrollng mult-cycle [] Y[] Unrollng can help mprovng mult-cycle schedules [1] Y[1] By ncreasng hardware operator utlzaton rate thanks to addtonal reuse opportuntes. [2] Y[2] [3] Y[3] res Same cost as the ntal mult-cycle verson, but more effcent use rate of the multpler [ ] Y[ ] 4 res res 4 x x x x step 1 step 2 step 3 step 4 step 5 #cycles f clk Exec tme 5* Mhz 1,28 us 41
42 Loop unroll mult-cycle executon Tme (us) 3,0 Mult cycle (12/24 bt) Mult cycle (32 bts) 2,5 Sngle cycle (32 bt) Mult cycle 4x unrolled (12/24 bt) 1,25 Sngle cycle (12/24 bt) Unrolled (12/24 bt) 1*,1 2 mux 1*,2 Cost (smplfed)
43 Loop unrollng (partal/full) Most HLS tool can unroll loops automatcally Generally controlled through a compler drectve (see below) Unrollng s often restrcted to loops wth constant bounds Full unrollng s only used for small trp count loops Unrollng does not always mprove performance Loop carred dependences can hurt/annhlate benefts See example next slde 43
44 Loop unrollng (partal/full) Unrollng loops wth carred dependences for(nt =0;<256;=1){ y[1] = a*y[] x[]; } Y[] [] a * Y[1] a * Y[] [1] Y[2] [] a Y[1] 1 1 * * * #cycles T clk Exec tme Mhz 0,64 us #cycles T clk Exec tme Mhz 0.64 us
45 Loop unrollng (partal/full) Older HLS tools would fully unroll all loops even f the goal s not to get the fastest mplementaton! even for loops wth large teraton counts (e.g. mages) Lack of proper support for loops s one of the reason behnd the falure of the frst generaton of HLS tools 45
46 Loop ppelnng Loop Ppelnng = ppelned executon of loop teratons Start teraton 1 before all operaton of teraton are fnshed Prncples smlar to software ppelnng for VLIW/DSPs nt12 [512],Y[512]; nt16 res=0; nt24 tmp; for (nt9 =0;<512;){ tmp = []*Y[]; res = res tmp>>8; } In ths example a new teraton starts every cycle, all operatons n the loop body have ther own operator. #cycles T clk Exec tme Mhz 1,02us [] Y[] [1] Y[1] 1 * * Stage 1 Stage 2 1 res res PACAP - HLS
47 Loop ppelnng perf/area trade-off Tme (us) 3,0 Mult cycle (12/24 bt) Mult cycle (32 bts) 2,5 1,25 Mult cycle 4x unrolled (12/24 bt) Sngle cycle (32 bt) Sngle cycle (12/24 bt) Ppelned (12/24 bt) Unrolled (12/24 bt) 1*,1 2 mux 1*,2 Cost (smplfed)
48 Loop ppelnng A gven ppelned loop schedule s characterzed by Intaton nterval (II) : delay separatng the executon of two successve teratons. A perfect loop ppelnng lead to II=1. Ppelne Latency (L) : number of cycles needed to completely execute a gven teraton (we have L>1). Targetng a value II=1 s common as the number of operators s not constraned by a predefned archtecture as n VLIW/DSPs Loop ppelnng enables very deeply ppelned datapath Operators are also aggressvely ppelned (e.g float) Latency n the order of 100 s of cycles s not uncommon Ppelnng s the most mportant optmzaton n HLS Huge performance mprovements for margnal resource cost (plenty of regsters n FPGA cells). PACAP - HLS 48
49 Ppelnng unrollng = vectorzng Ppelnng unrollng by K further ncrease performance [3] Y[3] [2] Y[2] [1] Y[1] [] Y[] res We execute K teratons followng a SIMD executon model Executon of the K teratons packets are overlapped 4 res However, achevng II=1 s often not straghforward We ll see that n a later slde. II=1, L=3; T clk =max(t mul,2*t add ) Loop executed n 82 cycles PACAP - HLS 49
50 Loop ppelnng perf/area trade-off Tme (us) 3,0 Mult cycle (12/24 bt) Mult cycle (32 bts) 2,5 1,25 Mult cycle 4x unrolled (12/24 bt) Sngle cycle (32 bt) Sngle cycle (12/24 bt) Ppelned (12/24 bt) Unrolled 4x (12/24 bt) Ppelned unrolled 4x (12/24 bt) 1*,1 2 mux 1*,2 Cost (smplfed)
51 Loop fuson Mergng consecutve loops nto a sngle one Everythng happens as f the two loops are executed n parallel It helps reducng the mpact of loop ppelnng fll/flush phases for(=0;<256;){ y[] = foo(x[]); } for(=0;<256;){ z[] = bar(y[]); } for(=0;<256;){ f(<256) y[] = foo(x[]); f(>0) z[-1] = bar(y[-1]); } bar foo bar foo Loop mergng s not alway obvous due to dependences In some cases, mergng must be combned wth loop shftng PACAP - HLS 51
52 Nested loop support n HLS In general HLS tools only ppelne the nnermost loop No bg deal when nnermost loops have large trp counts for(=0;<256;){ } for(j=0;j<m;j){ x[][j] = foo(x[][j]); } There are cases when ths leads to suboptmal desgns For example when processng small 8x8 pxel macro-blocks, or for deeply ppelned datapath wth 100 s of stages M=32 M=8 =0 =1 =2 =0 =1 =2 =3 =4 =5 Ppelne overhead s more sgnfcant for short trp count nnermost loops PACAP - HLS 52
53 How to ppelne outer loops? Completely unroll all nnermost loops to get a sngle loop Supports non perfectly nested nnermost loops Only vable for small constant bounds nner most loops for(=0;<256;){ } for(j=0;j<4;j){ x[][j] = foo(x[][j]); for(j=0;j<2;j){ x[][j] = bar(x[][j]); for(=0;<256;){ x[][0] = foo(x[][0]); x[][1] = foo(x[][1]); x[][2] = foo(x[][2]); x[][3] = foo(x[][3]); x[][0] = bar(x[][0]); x[][1] = bar(x[][1]); } Flattenng/coalescng the loop nest herarchy Only for perfectly nested loops, use wth cauton! for(=0;<m;){ for(j=0;j<n;j){ x[][j] = foo(x[][j]); } for(k=0;k<m*n;k){ =k/n; =k%n; x[][j] = foo(x[][j]); } PACAP - HLS 53
54 Memory port resource conflcts The man lmtng factor for reachng II=1 s memory When II=1, there are generally many concurrent accesses to a same array (and thus to a same memory block). [3] Y[3] [2] Y[2] [1] Y[1] [] Y[] res res 4 For II=1, we must handle 8 read per cycle to the memory blocks contanng array [] and Y[] PACAP - HLS 54
55 Parttonng arrays n banks Duplcatng memory reduces memory port conflcts Naïve duplcaton s not effcent, array data must be allocated to several memory blocks to reduce the number of conflcts. Sngle block Y[2] Y[1] Y[0] [ ] [2] [1] [0] Block 1 [ ] [16] [12] [8] [4] [0] Block 2 [ ] [17] [13] [9] [5] [1] Block 3 [ ] [16] [14] [10] [6] [2] Block 4 [ ] [17] [15] [11] [7] [3] [] [1] [2] [3] Block 5 Y[ ] Y[16] Y[12] Y[8] Y[4] Y[0] Block 6 Y[ ] Y[17] Y[13] Y[9] Y[5] Y[1] Y[ ] Y[16] Y[14] Y[10] Y[6] Y[2] Y[ ] Y[17] Y[15] Y[11] Y[7] Y[3] Y[] Y[1] Y[2] Y[3] [] [1][2][3] Y[] Y[1] Y[2] Y[3] A block cyclng parttonng of data for [ ] and Y[ ] helps solvng all memory resource conflct PACAP - HLS 55
56 Parttonng arrays n memory banks Parttonng can be done at the source level Can be tedous, many HLS tools offer some automaton float [32],Y[32]; res=0.0; for (nt =0;<8;=4){ tmp = []*Y[]; res = res tmp; tmp = [1]*Y[1]; res = res tmp; tmp = [2]*Y[2]; res = res tmp; tmp = [3]*Y[3]; res = res tmp; } float 0[8],1[8],2[8],3[8]; float Y0[8],Y1[8],Y2[8],Y3[8]; res=0.0; for (nt =0;<8;=4){ tmp = 0[]*Y0[]; res = res tmp; tmp = 1[]*Y1[]; res = res tmp; tmp = 2[]*Y2[]; res = res tmp; tmp = 3[]*Y3[]; res = res tmp; } PACAP - HLS 56
57 Elmnatng conflcts through scalarzaton Scalarzaton : copy array cell value n a scalar varable Use the scalar varable nstead of the array whenever possble Obvous reuse s often performed by the compler tself Beware, reuse accross teraton of a loop s not automated float [128]; float res=0.0; for (=1;<128;=1){ res=res*([]-[-1]); } Two memory ports needed to reach II=1 float [128]; float res=0.0,tmp1,tmp0; tmp1 = [0]; tmp0 = [1]; for (=2;<128;=1){ res=res*(tmp0-tmp1); tmp1 =tmp0; tmp0 = []; } Only one memory port s needed to obtan II=1 PACAP - HLS 57
58 Another take on HLS vs HDL Tme (us) 3,0 HLS based desgn 3 The actual best desgn s obtaned through HLS! 2,5 HLS based desgn 2 Desgner perf. safety margng Perf constrant 1,25 HDL desgn 1 HLS desgn 1 HLS based desgn 4 Cost (smplfed)
59 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area trade-off Program optmzatons for hardware synthess The future of HLS PACAP - HLS 59
60 HLS n 2020 HLS tools are focusng on hardware IP desgn In 2016, a whole system can be specfed wthn a sngle flow In 2020 hardware/software boundares wll have faded away Today, poor support for dynamc data-structures Speculatve executon technques from processor desgn may help wden the scope of applcablty of HLS to new domans Most program transformatons are done by hand More complex sem automatc program transformatons! Complex combnaton of program transformatons to obtan the best desgn (teratve complaton Hgh Level Synthess) PhP developpers stll can t desgn hardware Well, that s not lkely to change yet PACAP - HLS 60
61 Loop tlng Break the loops nto «tles» or blocks Expose coarse gran parallelsm (for threads) Improve temporal reuse over loop nests Legal only f all loop are permutable (.e. can be nterchanged) float **A,**B,**C; for(=0;<16;) { for(j=0;j<16;j) { for(k=0;k<16;k) S0: C[,j]=B[,k]*A[k,j]; } } float **A,**B,**C; for(=0;<16;=4) for(jj=0;jj<16;jj=4) for(kk=0;kk<16;kk=4) A 4x4x4 Tle for(=0;<4);) for(j=0;j<4;j) for(k=0;k<4;k) S1: C[,jjj]= B[,kkk]*A[kkk,jjj]; Best understood wth an example We consder a fully assocatve 32 word data cache The cache s organzed as 8 cache lnes of 4 words each M2R-CSS 61
62 Loop Tlng example (matrx product) Tlng helps mprovng spatal and temporal localty One chooses tle sze such that all data fts nto the cache C[16][16] kk jj float **A,**B,**C; for(=0;<16;=4) for(jj=0;jj<16;jj=4) for(kk=0;kk<16;kk=4) A 4x4x4 Tle for(=0;<4);) for(j=0;j<4;j) for(k=0;k<4;k) S1: C[,jjj]= B[,kkk]*A[kkk,jjj]; A[16][16] B[16][16] 62
63 Loop Tlng example (matrx product) Tlng helps mprovng spatal and temporal localty One chooses tle sze such that all data fts nto the cache C[16][16] kk jj float **A,**B,**C; for(=0;<16;=4) for(jj=0;jj<16;jj=4) #pragma ppelne II=1A 4x4x4 Tle for(kk=0;kk<16;kk=4) memcpy(a,la, ) for(k=0;k<4;k) for(=0;<4);) for(j=0;j<4;j) S1: C[,jjj]= B[,kkk]*A[kkk,jjj]; A[16][16] B[16][16] 63
64 Loop tlng Executon of the ntal kernel We show whch parts of A,B,C arrays are n the cache [3] Y[3] [3] Y[3] C[][j] [3] Y[3] C[][j] [3] Y[3] C[][j] C[][j] [3] Y[3] [3] Y[3] C[][j] [3] Y[3] C[][j] [3] Y[3] C[][j] C[][j] [3] Y[3] [3] Y[3] C[][j] [3] Y[3] C[][j] [3] Y[3] C[][j] C[][j] [3] Y[3] [3] Y[3] C[][j] [3] Y[3] C[][j] [3] Y[3] C[][j] C[][j] M2R-CSS 64
65 Loop tlng Executon of the ntal kernel We show whch parts of A,B,C arrays are n the cache B[,j] B[,j1] B[,j2] B[,j3] A[,j] C[][j] C[][j1] C[..][..] C[][j3] A[1,j] C[1][j] C[1][j1] C[..][..] C[1][j3] A[2,j] A[3,j] C[2][j] C[..][..] C[..][..] C[2][j3] C[3][j] C[..][..] C[..][..] C[3][j3] M2R-CSS
66 Loop tlng Executon of the ntal kernel We show whch parts of A,B,C arrays are n the cache A[,j] B[][j%4] B[,j] B[,j1] B[,j2] B[,j3] C[][j] C[][j1] C[..][..] C[][j3] A[1,j] C[1][j] C[1][j1] C[..][..] C[1][j3] A[2,j] A[3,j] C[2][j] C[..][..] C[..][..] C[2][j3] C[3][j] C[..][..] C[..][..] C[3][j3] M2R-CSS
67 Loop tlng Executon of the ntal kernel B[,j] B[,j1] B[,j2] B[,j3] B[,j] A[,j] B[,j1] B[,j2] B[,j3] A[,j] A[1,j] A[1,j] A[2,j] A[2,j] A[3,j] A[3,j] We show whch parts of A,B,C arrays are n the cache C[][j] C[][j1] C[..][..] C[][j3] C[][j] C[][j1] C[..][..] C[][j3] C[1][j] C[1][j1] C[1][j] C[1][j1] C[..][..] C[1][j3] C[..][..] C[1][j3] C[2][j] C[..][..] C[..][..] C[2][j3] C[2][j] C[..][..] C[..][..] C[2][j3] C[3][j] C[..][..] C[..][..] C[3][j3] C[3][j] C[..][..] C[..][..] C[3][j3] M2R-CSS
68 Loop tlng Executon of the tled kernel For a sngle tle (=jj=kk=0) B[][] A[][] C[][] B[][] A[][] C[][] B[][] A[][] C[][] M2R-CSS 68
69 Loop Tlng & parallelzaton Smple way of exposng coarse gran parallelsm Tles are executed as atomc executon unts, there s no synchronzaton durng a tle executon. Tlng enables effcent parallelzaton It mproves localty and reduces synchronzaton overhead Fndng the rght tle sze and shape s dffcult (open problem) kk jj float **A,**B,**C; #pragma omp parallel for prvate() float for(=0;<16;=4) **A,**B,**C; for(=0;<16;=4) #pragma omp parallel for prvate(jj) for(jj=0;jj<16;jj=4) for(kk=0;kk<16;kk=4) for(=0;<4);) A 4x4x4 Tle for(j=0;j<4;j) for(k=0;k<4;k) S1: C[,jjj]= B[,kkk]*A[kkk,jjj]; 69
70 Challenge Image processng ppelne example for(=1;<n-1;) for(j=0;j<m;j) S0: tmp[][j]=f(n[][j],n[-1][j],n[1][j]); for(=1;<n-1;) for(j=1;j<m-1;j) S1: out[][j]=f(tmp[][j],tmp[][j-1],tmp[][j1]); In[][] tmp[][] out[][] Fnd a legal fuson and optmze memory accesses PACAP - HLS 70
71 PACAP - HLS 71
72 Exercse 1 (soluton) Straght forward fuson s not possble, needs to be combned wth loop shftng. //llegal transformaton for(=1;<=n-2;=1){ tmp[][0] = 3*n[][0] -n[-1][0]-n[1][0]; for(j=1;j<=m-2;j=1){ tmp[][j] = 3*n[][j] -n[-1][j]-n[1][j]; out[][j] = 3*tmp[][j] -tmp[][j-1]-tmp[][j1]; } tmp[][m-1] = 3*n[][M-1] -n[-1][m-1]-n[1][m-1]; }
73 Exercse 1 (soluton) M2R-CSS 73
Parallelism for Nested Loops with Non-uniform and Flow Dependences
Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr
More informationVectorization in the Polyhedral Model
Vectorzaton n the Polyhedral Model Lous-Noël Pouchet pouchet@cse.oho-state.edu Dept. of Computer Scence and Engneerng, the Oho State Unversty October 200 888. Introducton: Overvew Vectorzaton: Detecton
More informationLoop Transformations, Dependences, and Parallelization
Loop Transformatons, Dependences, and Parallelzaton Announcements Mdterm s Frday from 3-4:15 n ths room Today Semester long project Data dependence recap Parallelsm and storage tradeoff Scalar expanson
More informationPolyhedral Compilation Foundations
Polyhedral Complaton Foundatons Lous-Noël Pouchet pouchet@cse.oho-state.edu Dept. of Computer Scence and Engneerng, the Oho State Unversty Feb 8, 200 888., Class # Introducton: Polyhedral Complaton Foundatons
More informationSome material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier
Some materal adapted from Mohamed Youns, UMBC CMSC 611 Spr 2003 course sldes Some materal adapted from Hennessy & Patterson / 2003 Elsever Scence Performance = 1 Executon tme Speedup = Performance (B)
More informationLecture 15: Memory Hierarchy Optimizations. I. Caches: A Quick Review II. Iteration Space & Loop Transformations III.
Lecture 15: Memory Herarchy Optmzatons I. Caches: A Quck Revew II. Iteraton Space & Loop Transformatons III. Types of Reuse ALSU 7.4.2-7.4.3, 11.2-11.5.1 15-745: Memory Herarchy Optmzatons Phllp B. Gbbons
More informationLoop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont)
Loop Transformatons for Parallelsm & Localty Prevously Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Loop nterchange Loop transformatons and transformaton frameworks
More informationLoop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation
Loop Transformatons for Parallelsm & Localty Last week Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Scalar expanson for removng false dependences Loop nterchange Loop
More informationAssembler. Building a Modern Computer From First Principles.
Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought
More informationVirtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory
Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process
More informationCache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access
Agenda Cache Performance Samra Khan March 28, 217 Revew from last lecture Cache access Assocatvty Replacement Cache Performance Cache Abstracton and Metrcs Address Tag Store (s the address n the cache?
More informationThe Codesign Challenge
ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.
More informationAssignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.
Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton
More informationEfficient Distributed File System (EDFS)
Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate
More informationCompiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz
Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster
More informationImproving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations
Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng
More informationParallel matrix-vector multiplication
Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more
More informationCSCI 104 Sorting Algorithms. Mark Redekopp David Kempe
CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal
More informationComputer Architecture ELEC3441
Causes of Cache Msses: The 3 C s Computer Archtecture ELEC3441 Lecture 9 Cache (2) Dr. Hayden Kwo-Hay So Department of Electrcal and Electronc Engneerng Compulsory: frst reference to a lne (a..a. cold
More informationHigh level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization
What s a Computer Program? Descrpton of algorthms and data structures to acheve a specfc ojectve Could e done n any language, even a natural language lke Englsh Programmng language: A Standard notaton
More informationLecture 3: Computer Arithmetic: Multiplication and Division
8-447 Lecture 3: Computer Arthmetc: Multplcaton and Dvson James C. Hoe Dept of ECE, CMU January 26, 29 S 9 L3- Announcements: Handout survey due Lab partner?? Read P&H Ch 3 Read IEEE 754-985 Handouts:
More informationREDUCING hardware design time is more than ever a
TCAD-2012-0168 1 Polyhedral Bubble Inserton: A Method to Improve Nested Loop Ppelnng for Hgh-Level Synthess Antone Morvan, Steven Derren, and Patrce Qunton Abstract Hgh-Level Synthess (HLS) allows hardware
More informationConditional Speculative Decimal Addition*
Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant
More informationSmoothing Spline ANOVA for variable screening
Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory
More informationLLVM passes and Intro to Loop Transformation Frameworks
LLVM passes and Intro to Loop Transformaton Frameworks Announcements Ths class s recorded and wll be n D2L panapto. No quz Monday after sprng break. Wll be dong md-semester class feedback. Today LLVM passes
More informationRADIX-10 PARALLEL DECIMAL MULTIPLIER
RADIX-10 PARALLEL DECIMAL MULTIPLIER 1 MRUNALINI E. INGLE & 2 TEJASWINI PANSE 1&2 Electroncs Engneerng, Yeshwantrao Chavan College of Engneerng, Nagpur, Inda E-mal : mrunalngle@gmal.com, tejaswn.deshmukh@gmal.com
More informationCMPS 10 Introduction to Computer Science Lecture Notes
CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not
More informationComplex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.
Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal
More informationOutline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011
9/8/2 2 Outlne Appendx C: The Bascs of Logc Desgn TDT4255 Computer Desgn Case Study: TDT4255 Communcaton Module Lecture 2 Magnus Jahre 3 4 Dgtal Systems C.2: Gates, Truth Tables and Logc Equatons All sgnals
More informationCourse Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms
Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques
More informationPetri Net Based Software Dependability Engineering
Proc. RELECTRONIC 95, Budapest, pp. 181-186; October 1995 Petr Net Based Software Dependablty Engneerng Monka Hener Brandenburg Unversty of Technology Cottbus Computer Scence Insttute Postbox 101344 D-03013
More informationNUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS
ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana
More informationProgramming in Fortran 90 : 2017/2018
Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values
More informationStorage Binding in RTL synthesis
Storage Bndng n RTL synthess Pe Zhang Danel D. Gajsk Techncal Report ICS-0-37 August 0th, 200 Center for Embedded Computer Systems Department of Informaton and Computer Scence Unersty of Calforna, Irne
More informationThe Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique
//00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy
More informationAssembler. Shimon Schocken. Spring Elements of Computing Systems 1 Assembler (Ch. 6) Compiler. abstract interface.
IDC Herzlya Shmon Schocken Assembler Shmon Schocken Sprng 2005 Elements of Computng Systems 1 Assembler (Ch. 6) Where we are at: Human Thought Abstract desgn Chapters 9, 12 abstract nterface H.L. Language
More informationHierarchical clustering for gene expression data analysis
Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally
More informationCHAPTER 4 PARALLEL PREFIX ADDER
93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum
More informationHarvard University CS 101 Fall 2005, Shimon Schocken. Assembler. Elements of Computing Systems 1 Assembler (Ch. 6)
Harvard Unversty CS 101 Fall 2005, Shmon Schocken Assembler Elements of Computng Systems 1 Assembler (Ch. 6) Why care about assemblers? Because Assemblers employ some nfty trcks Assemblers are the frst
More informationA Fast Visual Tracking Algorithm Based on Circle Pixels Matching
A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng
More informationToday Using Fourier-Motzkin elimination for code generation Using Fourier-Motzkin elimination for determining schedule constraints
Fourer Motzkn Elmnaton Logstcs HW10 due Frday Aprl 27 th Today Usng Fourer-Motzkn elmnaton for code generaton Usng Fourer-Motzkn elmnaton for determnng schedule constrants Unversty Fourer-Motzkn Elmnaton
More informationConcurrent models of computation for embedded software
Concurrent models of computaton for embedded software and hardware! Researcher overvew what t looks lke semantcs what t means and how t relates desgnng an actor language actor propertes and how to represent
More informationSUMMARY... I TABLE OF CONTENTS...II INTRODUCTION...
Summary A follow-the-leader robot system s mplemented usng Dscrete-Event Supervsory Control methods. The system conssts of three robots, a leader and two followers. The dea s to get the two followers to
More informationLoop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings
Loop Ppelnng for Hgh-Throughput Stream Computaton Usng Self-Tmed Rngs Gennette Gll, John Hansen and Montek Sngh Dept. of Computer Scence Unv. of North Carolna, Chapel Hll, NC 27599, USA {gllg,jbhansen,montek}@cs.unc.edu
More informationCache Memories. Lecture 14 Cache Memories. Inserting an L1 Cache Between the CPU and Main Memory. General Org of a Cache Memory
Topcs Lecture 4 Cache Memores Generc cache memory organzaton Drect mapped caches Set assocate caches Impact of caches on performance Cache Memores Cache memores are small, fast SRAM-based memores managed
More informationMathematics 256 a course in differential equations for engineering students
Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the
More informationLearning the Kernel Parameters in Kernel Minimum Distance Classifier
Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department
More informationChapter 1. Introduction
Chapter 1 Introducton 1.1 Parallel Processng There s a contnual demand for greater computatonal speed from a computer system than s currently possble (.e. sequental systems). Areas need great computatonal
More informationSystem-on-Chip Design Analysis of Control Data Flow. Hao Zheng Comp Sci & Eng U of South Florida
System-on-Chp Desgn Analyss of Control Data Flow Hao Zheng Comp Sc & Eng U of South Florda Overvew DF models descrbe concurrent computa=on at a very hgh level Each actor descrbes non-trval computa=on.
More informationSpatial Computation ABSTRACT 1. INTRODUCTION
Spatal Computaton Mha Budu, Grsh Venkataraman, Tberu Chelcea and Seth Copen Goldsten {mhab,grsh,tb,seth}@cs.cmu.edu Carnege Mellon Unversty ABSTRACT Ths paper descrbes a computer archtecture, Spatal Computaton
More informationNewton-Raphson division module via truncated multipliers
Newton-Raphson dvson module va truncated multplers Alexandar Tzakov Department of Electrcal and Computer Engneerng Illnos Insttute of Technology Chcago,IL 60616, USA Abstract Reducton n area and power
More informationCPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction
Chapter 2 Desgn for Testablty Dr Rhonda Kay Gaede UAH 2 Introducton Dffcultes n and the states of sequental crcuts led to provdng drect access for storage elements, whereby selected storage elements are
More informationMallathahally, Bangalore, India 1 2
7 IMPLEMENTATION OF HIGH PERFORMANCE BINARY SQUARER PRADEEP M C, RAMESH S, Department of Electroncs and Communcaton Engneerng, Dr. Ambedkar Insttute of Technology, Mallathahally, Bangalore, Inda pradeepmc@gmal.com,
More informationSequential search. Building Java Programs Chapter 13. Sequential search. Sequential search
Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to
More informationAn Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation
17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed
More informationOutline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1
4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:
More informationCS221: Algorithms and Data Structures. Priority Queues and Heaps. Alan J. Hu (Borrowing slides from Steve Wolfman)
CS: Algorthms and Data Structures Prorty Queues and Heaps Alan J. Hu (Borrowng sldes from Steve Wolfman) Learnng Goals After ths unt, you should be able to: Provde examples of approprate applcatons for
More informationProblem Definitions and Evaluation Criteria for Computational Expensive Optimization
Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty
More informationAlgorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances
In: Proc. 0th Int. Symposum on Hardware/Software Codesgn (CODES 02), Estes Park, Colorado, USA, May 6 8, 2002 Algorthmc Transformaton Technques for Effcent Exploraton of Alternatve Applcaton Instances
More informationUsing Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware
Draft submtted for publcaton. Please do not dstrbute Usng Delayed Addton echnques to Accelerate Integer and Floatng-Pont Calculatons n Confgurable Hardware Zhen Luo, Nonmember and Margaret Martonos, Member,
More informationELEC 377 Operating Systems. Week 6 Class 3
ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems
More informationImproved Symoblic Simulation By Dynamic Funtional Space Partitioning
Improved Symoblc Smulaton By Dynamc Funtonal Space Parttonng Tao Feng, L-.Wang, Kwang-Tng heng Department of EE, U-Santa Barbara, U.S.A tfeng,lcwang, tmcheng @ece.ucsb.edu Andy -. Ln adence Desgn Systems,
More informationON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE
Yordzhev K., Kostadnova H. Інформаційні технології в освіті ON SOME ENTERTAINING APPLICATIONS OF THE CONCEPT OF SET IN COMPUTER SCIENCE COURSE Yordzhev K., Kostadnova H. Some aspects of programmng educaton
More informationSolving Planted Motif Problem on GPU
Solvng Planted Motf Problem on GPU Naga Shalaja Dasar Old Domnon Unversty Norfolk, VA, USA ndasar@cs.odu.edu Ranjan Desh Old Domnon Unversty Norfolk, VA, USA dranjan@cs.odu.edu Zubar M Old Domnon Unversty
More informationVerification by testing
Real-Tme Systems Specfcaton Implementaton System models Executon-tme analyss Verfcaton Verfcaton by testng Dad? How do they know how much weght a brdge can handle? They drve bgger and bgger trucks over
More informationKent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming
CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems
More informationReal-Time Systems. Real-Time Systems. Verification by testing. Verification by testing
EDA222/DIT161 Real-Tme Systems, Chalmers/GU, 2014/2015 Lecture #8 Real-Tme Systems Real-Tme Systems Lecture #8 Specfcaton Professor Jan Jonsson Implementaton System models Executon-tme analyss Department
More informationEFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION
EFFICIENT SYNCHRONOUS PARALLEL DISCRETE EVENT SIMULATION WITH THE ARMEN ARCHITECTURE C. Beaumont, B. Potter J.M. Flloque LIBr I.U.T. de Brest and LIBr Unversté de Bretagne Occdentale Télécom Bretagne BP
More informationIntro. Iterators. 1. Access
Intro Ths mornng I d lke to talk a lttle bt about s and s. We wll start out wth smlartes and dfferences, then we wll see how to draw them n envronment dagrams, and we wll fnsh wth some examples. Happy
More informationArea Efficient Self Timed Adders For Low Power Applications in VLSI
ISSN(Onlne): 2319-8753 ISSN (Prnt) :2347-6710 Internatonal Journal of Innovatve Research n Scence, Engneerng and Technology (An ISO 3297: 2007 Certfed Organzaton) Area Effcent Self Tmed Adders For Low
More informationEfficient Code Generation for Automatic Parallelization and Optimization
Effcent Code Generaton for utomatc Parallelzaton and Optmzaton Cédrc Bastoul Laboratore PRSM, Unversté de Versalles Sant Quentn 5 avenue des États-Uns, 7805 Versalles Cedex, France Emal: cedrcbastoul@prsmuvsqfr
More informationOracle Database: SQL and PL/SQL Fundamentals Certification Course
Oracle Database: SQL and PL/SQL Fundamentals Certfcaton Course 1 Duraton: 5 Days (30 hours) What you wll learn: Ths Oracle Database: SQL and PL/SQL Fundamentals tranng delvers the fundamentals of SQL and
More informationSimulation Based Analysis of FAST TCP using OMNET++
Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months
More informationSorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions
Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place
More information5 The Primal-Dual Method
5 The Prmal-Dual Method Orgnally desgned as a method for solvng lnear programs, where t reduces weghted optmzaton problems to smpler combnatoral ones, the prmal-dual method (PDM) has receved much attenton
More informationSLAM Summer School 2006 Practical 2: SLAM using Monocular Vision
SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,
More informationConfiguration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*
Confguraton Management n Mult-Context Reconfgurable Systems for Smultaneous Performance and Power Optmzatons* Rafael Maestre, Mlagros Fernandez Departamento de Arqutectura de Computadores y Automátca Unversdad
More informationToday s Outline. Sorting: The Big Picture. Why Sort? Selection Sort: Idea. Insertion Sort: Idea. Sorting Chapter 7 in Weiss.
Today s Outlne Sortng Chapter 7 n Wess CSE 26 Data Structures Ruth Anderson Announcements Wrtten Homework #6 due Frday 2/26 at the begnnng of lecture Proect Code due Mon March 1 by 11pm Today s Topcs:
More informationOptimizing for Speed. What is the potential gain? What can go Wrong? A Simple Example. Erik Hagersten Uppsala University, Sweden
Optmzng for Speed Er Hagersten Uppsala Unversty, Sweden eh@t.uu.se What s the potental gan? Latency dfference L$ and mem: ~5x Bandwdth dfference L$ and mem: ~x Repeated TLB msses adds a factor ~-3x Execute
More informationAADL : about scheduling analysis
AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng
More informationFor instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)
Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A
More informationAMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain
AMath 483/583 Lecture 21 May 13, 2011 Today: OpenMP and MPI versons of Jacob teraton Gauss-Sedel and SOR teratve methods Next week: More MPI Debuggng and totalvew GPU computng Read: Class notes and references
More informationCS 534: Computer Vision Model Fitting
CS 534: Computer Vson Model Fttng Sprng 004 Ahmed Elgammal Dept of Computer Scence CS 534 Model Fttng - 1 Outlnes Model fttng s mportant Least-squares fttng Maxmum lkelhood estmaton MAP estmaton Robust
More informationOn Some Entertaining Applications of the Concept of Set in Computer Science Course
On Some Entertanng Applcatons of the Concept of Set n Computer Scence Course Krasmr Yordzhev *, Hrstna Kostadnova ** * Assocate Professor Krasmr Yordzhev, Ph.D., Faculty of Mathematcs and Natural Scences,
More informationModule Management Tool in Software Development Organizations
Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,
More informationCombined Functional Partitioning and Communication Speed Selection for Networked Voltage-Scalable Processors Λ
Combned Functonal Parttonng and Communcaton Speed Selecton for Networked Voltage-Scalable Processors Λ Jnfeng Lu, Pa H. Chou, Nader Bagherzadeh epartment of Electrcal & Computer Engneerng Unversty of Calforna,
More informationMulti-objective Optimization Using Adaptive Explicit Non-Dominated Region Sampling
11 th World Congress on Structural and Multdscplnary Optmsaton 07 th -12 th, June 2015, Sydney Australa Mult-objectve Optmzaton Usng Adaptve Explct Non-Domnated Regon Samplng Anrban Basudhar Lvermore Software
More informationWavefront Reconstructor
A Dstrbuted Smplex B-Splne Based Wavefront Reconstructor Coen de Vsser and Mchel Verhaegen 14-12-201212 2012 Delft Unversty of Technology Contents Introducton Wavefront reconstructon usng Smplex B-Splnes
More informationFunctional and Timing Validation of Partially Bypassed Processor Pipelines
Functonal and Tmng Valdaton of Partally Bypassed Processor Ppelnes Qang Zhu shyu@labs.fujtsu.com Avral Shrvastava Avral.Shrvastava@asu.edu Nl Dutt dutt@cs.uc.edu Fujtsu Laboratores LTD., Japan 1-1, Kamodanaa
More informationChapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward
More informationGSLM Operations Research II Fall 13/14
GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are
More informationAn Optimal Algorithm for Prufer Codes *
J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,
More informationLecture 4: Principal components
/3/6 Lecture 4: Prncpal components 3..6 Multvarate lnear regresson MLR s optmal for the estmaton data...but poor for handlng collnear data Covarance matrx s not nvertble (large condton number) Robustness
More informationCHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION
24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and
More informationCS1100 Introduction to Programming
Factoral (n) Recursve Program fact(n) = n*fact(n-) CS00 Introducton to Programmng Recurson and Sortng Madhu Mutyam Department of Computer Scence and Engneerng Indan Insttute of Technology Madras nt fact
More informationBrave New World Pseudocode Reference
Brave New World Pseudocode Reference Pseudocode s a way to descrbe how to accomplsh tasks usng basc steps lke those a computer mght perform. In ths week s lab, you'll see how a form of pseudocode can be
More informationHermite Splines in Lie Groups as Products of Geodesics
Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the
More informationAn Associative Processor Array Designed for Computer Vision
An Assocatve Processor Array Desgned for Computer Vson A.W.G. Duller R.H. Storer A.R. Thomson M.R. Pout E.L. Dagless Dept. of Electrcal and Electronc Engneerng Unversty of Brstol Brstol BS8 1TR, UK. The
More informationScheduling with Integer Time Budgeting for Low-Power Optimization
Schedlng wth Integer Tme Bdgetng for Low-Power Optmzaton We Jang, Zhr Zhang, Modrag Potkonjak and Jason Cong Compter Scence Department Unversty of Calforna, Los Angeles Spported by NSF, SRC. Otlne Introdcton
More informationNotes on Organizing Java Code: Packages, Visibility, and Scope
Notes on Organzng Java Code: Packages, Vsblty, and Scope CS 112 Wayne Snyder Java programmng n large measure s a process of defnng enttes (.e., packages, classes, methods, or felds) by name and then usng
More information