Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings

Size: px
Start display at page:

Download "Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings"

Transcription

1 Loop Ppelnng for Hgh-Throughput Stream Computaton Usng Self-Tmed Rngs Gennette Gll, John Hansen and Montek Sngh Dept. of Computer Scence Unv. of North Carolna, Chapel Hll, NC 27599, USA ABSTRACT We present a technque for ncreasng the throughput of stream processng archtectures by removng the bottlenecks caused by loop structures. We mplement loops as self-tmed ppelned rngs that can operate on multple data sets concurrently. Our contrbuton ncludes a transformaton algorthm whch takes as nput a hgh-level program and gves as output the structure of an optmzed ppelne rng. Our technque handles nested loops and s further enhanced by loop unrollng. Smulatons run on benchmark examples show a 1.3 to 4.9x speedup wthout unrollng and a 2.6 to 9.7x speedup wth twofold loop unrollng. 1. INTRODUCTION Ths paper targets the doman of hgh-performance dgtal ICs that are mplemented usng ppelned dataflow archtectures. We focus on stream processng archtectures,.e., those that take a stream of data tems and produce a stream of processed results. Hgh-speed stream processors are a natural match for many hgh-end applcatons, ncludng 3D graphcs renderng, mage and vdeo processng, dgtal flters and DSPs, cryptography, and networkng processors. The development of fast stream processors s lkely to be key to sustanng the explosve growth we are wtnessng n consumer electroncs, multmeda applcatons, and hgh-speed networkng. Whle stream processors are well-suted to mplementng algorthms that are dataflow n character, a key challenge s the effcent mplementaton of control constructs. In partcular, the presence of condtonals ( f-then-else constructs) and loops ( for and whle constructs) n algorthms typcally creates performance bottlenecks that lmt the throughput of the resultng stream processor even f the remander of the algorthm s effcently ppelned. Whle there has been some recent work on addressng ths challenge for condtonals [4, 7], there s no satsfactory approach for effcently handlng loop constructs. Ths paper, therefore, focuses on algorthmc loops, and provdes an effcent approach for ther hgh-throughput mplementaton. Exstng approaches to mplementng loop structures are lmted n the throughputs they can acheve. They focus prmarly on lmted concurrency mprovements: () shavng off delays from the crtcal path of an teraton, e.g. by local transformatons that change sequental operatons nto parallel ones, and () slght overlappng of Permsson to make dgtal or hard copes of all or part of ths work for personal or classroom use s granted wthout fee provded that copes are not made or dstrbuted for proft or commercal advantage and that copes bear ths notce and the full ctaton on the frst page. To copy otherwse, to republsh, to post on servers or to redstrbute to lsts, requres pror specfc permsson and/or a fee. ICCAD 06, November 5-9, 2006, San Jose, CA Copyrght 2006 ACM /06/ $5.00. adjacent teratons,.e. overlappng the end of one teraton wth the start of the next teraton of the same data tem, or overlappng the end of an tem s last teraton slghtly wth the next tem s frst teraton. These approaches can somewhat shorten the executon tme of each data tem s computaton. However, none of these approaches truly allow multple data tems to be processed concurrently; only a sngle data tem s allowed to be processed by the loop hardware at any tme. Thus, n exstng approaches, even f the loop s ppelned at the crcut level, t s effectvely unppelned at the algorthm level. Ths paper presents an approach for effcently mplementng teratve loops n hardware, whch truly acheves loop ppelnng at the algorthmc level. Unlke exstng approaches that only target local concurrency mprovements, our approach focuses nstead on allowng multple successve data tems to be computed concurrently, thereby offerng more substantal throughput benefts. Our approach s applcable to teratve algorthms n whch successve data tems can be computed ndependent of each other, but where successve teratons on the same tem are allowed to be dependent on each other. Ths class of algorthms s qute rch and can be used n several realworld applcatons ncludng networkng, encrypton, multmeda applcatons, and dfferental-equaton solvers. Our contrbuton s twofold: () novel archtectural buldng blocks for mplementng teratve archtectures, and () an automated synthess approach that allows hgh-level teratve algorthms to be mapped onto these buldng blocks. The core archtectural buldng block a self-tmed rng structure drectly mplements teratve computatons n an effcent manner. We ntroduce a novel rng nterface composed of several helper blocks that allow the rng to operate at ts deal throughput,.e. nether under-utlzed nor congested. Our approach takes advantage of one of the key benefts of self-tmed archtectures: modular desgn. The archtectural buldng blocks can be easly composed together because they all have the same flexble external nterface, whch s robust to dfferences n nternal mplementatons and varatons n nternal delays. Our synthess approach takes as nput a hgh-level specfcaton of an teratve algorthm and generates the structure of ts loop-ppelned mplementaton. Our synthess approach also handles condtonals, sequental blocks, and parallel blocks, and s therefore applcable to a rch class of algorthms. The modularty of our buldng blocks allows our synthess approach to easly handle nested loops, wth arbtrary level of nestng. Fnally, our approach s further enhanced by ncorporatng loop unrollng at the hardware level,.e. replcatng loop hardware to further explot data parallelsm. Our approach has been valdated by applyng t to a set of complex examples wth real-world applcatons, ncludng teratve polynomal root-fndng, ordnary dfferental-equaton solvng, greatestcommon-dvsor computaton, a smple encrypton algorthm, etc.

2 Each example was syntheszed usng a mature commercal asynchronous synthess flow Haste/TDE (formerly Tangram) tool sute from Handshake Solutons [1], a Phlps subsdary both wth and wthout our loop ppelnng approach. The results of applyng our approach are qute encouragng: x ncrease n throughput wthout loop unrollng. When a twofold loop unrollng was appled, even hgher speedups were acheved: x. The remander of ths paper s organzed as follows. Secton 2 presents background on self-tmed rng archtectures, and dscusses prevous work on loop optmzaton. Secton 3 then ntroduces our new approach to loop ppelnng; for smplcty of presentaton, ths ntal dscusson focuses only on loops wth a fxed number of teratons. The basc approach s then extended n Secton 4 to provde several advanced features: handlng loops wth varable teraton counts; handlng nested loops; and explotng loop unrollng. 2. PREVIOUS WORK AND BACKGROUND 2.1 Prevous Work Prevous Approaches n Hardware Desgn. Several approaches n hardware desgn have attempted to optmze teratve computaton. However, ther man objectve s to reduce the latency of the loop, as opposed to mprovng ts throughput. One such approach s the CASH compler by Budu and Goldsten [4], whch translates an ANSI C program nto data-flow hardware. A dfferent approach by Theobald and Nowck [9] targets generaton of dstrbuted asynchronous control from control-dataflow graphs, wth the objectve of optmzng communcaton between the controllers, and between a controller and ts assocated datapath object. The above approaches offer very lmted throughput beneft, snce they are manly targeted to mprovng a loop s latency. In partcular, the optmzatons ntroduced by these approaches focus on shavng off some delays from each loop teraton, and offer modest concurrency ncreases: the tal-end of an teraton for a gven data tem s allowed to overlap somewhat wth the start of that same tem s next teraton, or the tal-end of an tem s last teraton overlaps somewhat the start of the next tem s frst teraton. Nether approach allows multple dstnct data tems to be truly processed concurrently. Hence, the throughput benefts of these approaches are modest, and nowhere n the 2 10x range targeted by our proposed approach. Lke our approach, work by Kapas, Dally et al. [7] targets effcent mplementaton of stream algorthms. However, t does not address the crtcal challenge of ppelnng loops and only addresses the problems related to condtonal branches. Wllams and Horowtz [10, 11] ntroduced the classc work on self-tmed rngs. An teratve floatng-pont dvder chp was desgned usng a self-tmed rng structure [11], whch was later adapted for use n the earler HAL processors. However, ths desgn dd not actually allow multple data sets to be nsde the rng concurrently. In addton, ther approach does not target translaton of hgh-level algorthms drectly nto hardware. A key contrbuton of [10] was analyss of the performance of self-tmed rngs whch can be loaded wth multple data sets. Ths analyss s revewed n Secton Prevous Approaches n Software Desgn. There are many loop optmzaton technques for software complers; we menton a only few relevant approaches here. In software ppelnng [3] the goal s to re-structure a loop so as to reduce the mpact of nter-nstructon dependences by mxng nstructons from successve teratons. Parallelzng complers often use some combnaton of loop unrollng and compacton to acheve parallelzaton between successve loop teratons [2]. There are other approaches used by parallelzng complers: loop skewng, ndex-set splttng, and node splttng [8]. However, many of these approaches apply only to loops that terate over Fgure 1: A smple self-tmed ppelne arrays, wth each teraton operatng on a subsequent array element. As such, they have lmted applcablty to algorthms that requre multple teratons for each data tem. 2.2 Background: Self-Tmed Ppelnes and Rngs Ths secton brefly revews asynchronous or self-tmed ppelnes and rngs, ncludng ther structure, operaton and performance Asynchronous or Self-Tmed Ppelnes Structure. Fgure 1 shows the basc structure of a self-tmed ppelne. Each ppelne stage conssts of a controller, a storage element ( data latch ), and processng logc. A key feature s the absence of a system-wde clock for global synchronzaton. Instead, synchronzaton s acheved locally through request/acknowledge handshakng between neghborng stages. Operaton. Data s transferred from one stage to next accordng to the handshake protocol chosen for the ppelne. Typcally, a stage generates a request to ntate a handshake wth ts successor stage, ndcatng that new data s ready. If the successor stage s empty t accepts the data and performs two further actons: () t acknowledges ts predecessor for the data receved and, () ntates a smlar handshake wth ts own successor stage further down the ppelne. Performance. There are three metrcs typcally used to characterze the performance of a self-tmed ppelne: forward latency, reverse latency,andcycle tme. The throughput of the rng as a whole can be derved from these three metrcs. The forward latency s smply the tme t takes one data tem to flow through an ntally empty ppelne. Thus, f the latency of Stage s L Stage, then the latency of the entre ppelne s: L Ppelne = L Stage (1) Smlarly, the reverse latency characterzes the speed at whch empty stages or holes flow backward through an ntally full ppelne. The reverse latency of the entre ppelne s smply the sum of the stage reverse latences: R Ppelne = R Stage (2) The cycle tme of a stage, denoted by T Stage, s the mnmum tme that must elapse after a data tem s produced by that stage before the next data tem wll be generated. Snce a complete cycle of a stage typcally conssts of transmtted a data tem forward, followed by acceptng a hole from the next stage, the followng relatonshp holds: T Stage = L Stage + R Stage (3) The maxmum throughput a stage can support s smply the recprocal of ts cycle tme. The cycle tme of a lnear ppelne s typcally lmted by the cycle tme of ts slowest stage: ( ) T Ppelne = max T Stage (4)

3 problem n soluton out nterface Self-Tmed Rng Fgure 2: A self-tmed rng Self-Tmed Ppelne Rngs Frequently, self-tmed ppelne stages must be arranged n confguratons other than smple lnear ppelnes to meet the requrements of an applcaton. One confguraton of specal nterest n ths paper s a self-tmed rng. The rng structure allows one data tem to repeatedly pass through the same sequence of processng stages, thereby allowng teratve computatons to be mplemented. Structure. Fgure 2 shows the basc structure of a self-tmed rng. The rng contans a number of stages whch perform computaton and an nterface stage whose functon s to load data tems nto the rng and to dran results from the rng. Operaton. Once an tem s ntroduced nto the rng through the nterface stage, t revolves nsde the rng untl some termnatng condton ndcates that the computaton s complete. The result s then draned from the rng through the nterface stage. Performance. The classc work by Wllams [10] ntroduced a useful metrc for measurng the performance of a rng: the number of tmes any data tem crosses the nterface stage per second. We wll refer to ths measure as the rng frequency. The performance of the rng s hghly dependent on ts occupancy,.e., the number of data tems revolvng nsde t. When the number of data tems s small, the rng frequency s low, and the ppelne s sad to be data lmted. On the other hand, when nearly every stage of the ppelne s flled wth data tems, the performance s once agan lmted because holes are needed to allow data tems to flow through the ppelne; n ths scenaro, the ppelne s sad to be congested, or hole lmted. Data Lmted Operaton. Ifthere arek data tems n the rng, then n the tme a partcular data tem completes one revoluton around the rng (.e., L Stage ), all k tems would have crossed the nterface stage. Therefore, the maxmum rng frequency attanable s proportonal to the rng occupancy: / Rng Frequency k L Stage (5) Hole Lmted Operaton. If the rng s flled wth data tems n nearly all stages, then the rng frequency s lmted by the number of holes n the rng. For one data tem to cross the nterface stage, exactly one hole must cross n the reverse drecton. If there are h holes n the rng, then n the tme a partcular hole completes one revoluton around the rng (.e., R Stage ), all h holes would have crossed the nterface stage, thereby allowng h data tems to advance. The maxmum rng frequency s therefore proportonal to the number of holes. Thus, f N s the number of stages n the rng, then h = N k, and we have the followng bound on the performance: / Rng Frequency (N k) R Stage (6) Fgure 3 shows a plot of the rng frequency versus ts occupancy. The rsng porton of the graph represents the data lmted regon, where performance rses lnearly wth the number of data tems. The Rng Frequency 0 Lmted by a slow stage Data lmted regon Slope = 1/ΣL Max deal freq = 1/(L+R) Slope = 1/ΣR Hole lmted regon 1 2 N-2 N-1 Rng Occupancy, k Fgure 3: Upper bounds on the rng frequency fallng porton, smlarly, represents the hole lmted regon, where performance drops lnearly wth a decrease n the number of holes. Lmtatons Due to a Slow Stage. If all the stages n the rng have smlar forward and reverse latences, then the maxmum attanable performance wll be the frequency at whch the rsng the fallng lnes of Fgure 3 ntersect. Ths pont represents a frequency that s the nverse of the cycle tme of each rng stage: 1/T = 1/(L + R). However, f some stages are slower than others, then the rng frequency wll be lmted by the cycle tme of the slowest stage. In the fgure, the horzontal lne represents the maxmum operatng rate that can be sustaned by the slowest stage n the rng [10]: / Rng Frequency 1 max N (T Stage ) (7) The overall rng performance wll always be constraned to le under the canopy formed by the three lnes n Fgure BASIC APPROACH: PIPELINING LOOPS We now ntroduce our basc approach for convertng teratve loops nto self-tmed ppelne rngs wth the beneft of sgnfcantly hgher throughput. In ths secton, we focus on a smple case: a for loop wth a constant teraton count,.e., t terates the same number of tmes on each data set. Our advanced approach s ntroduced n Secton 4, whch s able to handle a wder varety of specfcatons, ncludng loops wth varable teraton counts, nested loops, and unrolled loops. We frst motvate our work n Secton 3.1 by showng how a for loop can be a bottleneck n the compute ppelne, thereby severely lmtng throughput obtaned. Then we ntroduce our approach for elmnatng ths bottleneck, ncludng hardware templates for ppelnng the loop (Secton 3.2), an algorthm for mappng a hgh-level code fragment onto ths template (Secton 3.3), and an analyss of the performance beneft our of approach (Secton 3.4). 3.1 Motvaton: The Loop Bottleneck We llustrate our method usng the smple code example shown n Fgure 4. Ths example represents a generc code fragment for a streamng hardware system that terates on every data set usng a for loop. In the example, the symbols s1 through s8 represent statements. The code conssts of a man procedure that performs communcaton wth the envronment and a compute functon that does the computaton. Once a data set s receved from the envronment by man, t s sent to compute for processng, and the result s then sent to the envronment. These operatons are performed repeatedly by man,.

4 func compute(n context) s1; s2; for = 1 to N s3; s4; s5; s6; end s7; s8; return(out context) proc man whle (true) do read(nput); output = compute(nput); wrte(output); end Fgure 4: Sample code for a stream processor that terates on every data set usng a for loop read() s1 loop block s2 for N s7 s8 s6 s5 s4 s3 wrte() Fgure 5: A smple mplementaton of the compute functon Rather than specfyng varables n ths code, we consder functons to have a context whch s operated upon and modfed by each statement. At the begnnng of the compute functon, the context conssts of the entre nput set sent to the functon call, labelled n context. Ths ntal context s augmented wth any locally defned varables before beng sent nto functon. At the end of the functon, the context holds the set of outputs to be returned to the callng envronment. The operaton of the code fragment s as follows. The man procedure reads a data set from the envronment and nvokes compute functon. The value passed from man forms the ntal context (.e., set of nputs) for compute. Insde the body of compute, a certan number of statements e.g., s1 and s2 operate on the context. Next, the for loop operates on the context for N teratons. Then a certan number of statements, represented by s7 and s8, operate on the context after the for loop. Fnally, the modfed context s returned to man, whch communcates the relevant porton of t to the recevng envronment. A drect translaton of ths code nto data-drven hardware typcally yelds the schematc structure shown n Fgure 5. Each statement n the code s1 through s8 becomes a ppelne stage. Statements s3 through s6 along wth the for statement compose the loop block. Durng operaton, the data streams n from the envronment through the read() stage and s streamed out to the envronment through the wrte() stage. A key observaton s that the loop block has the same external nterface as an ndvdual ppelne stage, even though nternally t contans an entre rng for teratve computaton. Specfcally, the loop block accepts one data set from the predecessor stage, performs a calculaton, and passes the computed result on to a successor stage. It does not accept new data untl the results of the calculaton have been accepted by the successor stage. Interestngly, the presence of a loop n the compute body has the same effect on the performance of the stream processor as a sngle ppelne stage wth long latency. As dscussed n Secton 2.2.1, the throughput of a ppelne s determned by the cycle tme of the slowest stage (Equaton 4), whch n turn drectly depends on that stage s s1 s2 dst s3 arbter s4 s5 nterface s6 f < N ++ Fgure 6: Proposed mplementaton of compute latency (Equaton 3). As a result of ths long effectve latency and cycle tme of the loop block, the throughput of the entre stream processor s severely lmted. Even f each of the ndvdual stages s1 through s8 are mplemented effcently, the presence of a sngle long-latency for loop drastcally dmnshes the throughput obtaned through ppelnng. Fgure 5 llustrates ths bottleneck scenaro by ndcatng the presence of data wth a dot. Stages downstream from the loop block are dle whle the stages upstream from the loop block are stalled. Although some exstng approaches can generate slghtly more optmal mplementatons than that of Fgure 5, they stll suffer from the same bottleneck llustrated by ths example, as dscussed n Secton 2.1. Whle these approaches [4, 9] somewhat ncrease concurrency, they only allow a very slght overlap between successve data sets. 3.2 Proposed Rng Structure In order to overcome the bottleneck of loop computaton, our approach ntroduces a novel structure based on a self-tmed rng that sgnfcantly mproves performance by operatng on several data sets at once. However, there are three restrctons on the class of teratve streamng algorthms that can be translated usng our method: () calculatons on each dstnct data set should not depend on each other, () no communcaton wth the caller envronment should occur wthn the body of the loop, and () dfferent statements n the loop body should not share any resources. Note that our translaton scheme stll apples to algorthms n whch consecutve teratons on the same data set do depend on each other. Our proposed rng structure for the code n Fgure 4 s llustrated n Fgure 6. In our mplementaton, the rng as a whole does not have the same nterface behavor as an ndvdual ppelne stage: the rng can actually accept a new data set whle the prevous one s (or several prevous ones are) stll beng operated on, space permttng. The nterface s composed of specal-purpose helper stages a dstrbutor,a, an arbter, and an f stage whch allow multple data sets to enter the rng wthout nterferng. Fgure 6 llustrates the ablty of the rng to operate on multple data sets by ndcatng the presence of data wth a dot. Note that each data set wthn the rng s n a dfferent stage of computaton. Our ppelned rng should deally be flled wth as many data sets as possble wthout causng throughput degradaton due to congeston. As dscussed n Secton 2.2.2, every self-tmed rng has some deal occupancy, C, at whch t acheves maxmum throughput. Therefore, our strategy s to allow at most C elements to be present nsde the rng at any tme. The deal occupancy can be analytcally computed f the forward and reverse latences of stage nsde the rng are known, as dscussed n Secton If, however, these latences are unknown, or hghly data dependent, the deal rng occupancy s determned by smulaton. The rng operates as follows. When a data set arrves from stage s7 s8

5 (a) IN(S) S IN(P1) Ppe(P1) fork IN(P2) (c) Ppe(P2) OUT(S) OUT(P1) jon OUT(P2) (b) IN(f) (d) IN(P2) Ppe(P1) Ppe(P2) IN(P1) Ppe(P1) OUT(P1) IN(B) bool OUT(f) fork B merge IN(P2) Ppe(P2) OUT(P2) (e) dst token IN(P1) arbter IN(P1) IN(P1) token OUT(P1), bool OUT(P1) Ppe(P1) Ppe(cond) f Fgure 7: A graphcal representaton of each composton functon a. Output of sngle stage(s) b. Output of compose sequental(p1, P2) c. Output of compose parallel(p1, P2) d. Output of compose cond(b, P1, P2) e. Output of compose loop(p1, cond) Ppe (P : program). begn f P s a sngle assgnment statement S then output sngle stage(s) else f P s the sequental block P1; P2 then compose sequental( Ppe(P1), Ppe(P2) ) else f P s the parallel block P1 P2 then compose parallel( Ppe(P1), Ppe(P2) ) else f P s the condtonal f(b) then P1 else P2 then compose condtonal( Ppe(B), Ppe(P1), Ppe(P2) ) else f P s a loop for (n) P1 or whle (cond) do P1 then compose loop( Ppe(P1), Ppe(cond) ) end Fgure 8: Our transformaton algorthm s2 as a new nput to the loop, t s frst sent to the dstrbutor (labelled dst n the fgure). If the rng s at less than deal occupancy (.e f <C), the dstrbutor sends the data set nto the rng and ncrements the to keep track of the rng occupancy. The data set then progresses through the stages of the loop body. At the end of an teraton, the data set passes through an f stage. If the termnatng condton s not met, the f stage passes the data set back to the begnnng of the rng. If the termnatng condton s met, the f stage sends the data set out of the rng and also sends a sgnal to the decrement. If the rng had stopped acceptng new data sets because t had reached deal occupancy, ths decrementng of the wll allow a new data set to enter the rng. One arbter s necessary n ths hardware scheme because data can enter the begnnng of the loop from two places. New data arrves va the dstrbutor, and current data loops back va the f stage. These two events can occur at arbtrary tmes, makng arbtraton necessary. 3.3 Transformaton Algorthm The algorthm for our hardware translaton scheme s shown n Fgure 8. The nput to our algorthm s a hgh-level program; the output s a ppelned structural mplementaton of that program. Currently, our approach handles the followng types of language constructs: sequental blocks, parallel blocks, condtonals, and loops. Our algorthm assumes the ablty to perform lve varable dataflow analyss, n order to fnd IN and OUT sets of sectons of code [8]. In partcular, the IN set s the set of data values that are requred to go nto a code fragment ether because they may be used nsde, or because they may need to be relayed to a successor of that fragment. The OUT set of a code fragment s the unon of the IN sets of ts successors. These sets are used to determne whch values need to be communcated between stages. Fgures 7(a-e) show a graphcal representaton of our ppelne composton functons. In these fgures, each box represents a set of one or more ppelne stages. Each arrow represents a communcaton between ppelne stages. Labels on the arrows ndcate the context, or set of varables, that must be passed between stages. (Note that the context s computed usng the IN and OUT sets of sectons of code.) A stage or mult-stage block that s not yet connected to other stages or blocks ndcates an nput port wth an open crcle,, and an output port wth closed a crcle,. These ports wll be connected to ports of other stages or blocks at the next upper level of herarchy durng the recursve traversal of our algorthm. 3.4 Performance Beneft and Overheads Prevous work on performance analyss of rngs allows us to predct the speedup obtaned by the use of our method. As dscussed n Secton and shown n Fgure 3, the rng frequency s proportonal to the total number of data sets that are revolvng nsde the rng, as long as the rng s not congested (.e., hole lmted ). Therefore, the maxmum speedup of our approach s proportonal to the deal rng occupancy. The speed mprovement of our method s largely due to mproved hardware utlzaton. A rng that holds only one data set has a hgh amount of unused hardware at any gven tme. By allowng multple data sets, we are able to obtan hgh hardware utlzaton from the components wthn the rng. Our approach adds some overhead that decreases the actual speedup and ncreases total area. Certan helper stages such as the dstrbutor, and arbter ncrease the latency of the rng and add a small area overhead. Also, the has some latency and therefore wll not allow new data to enter mmedately after old data leaves. The most notable ncrease to area s the extra storage elements that are requred n order to hold the entre context at each rng stage. Ths overhead s necessary to allow each data set to hold ts own copy of the loop s context, thereby enablng multple data sets to coexst ndependently wthn the rng. 4. ADVANCED APPROACH In Secton 3 we descrbed our basc scheme for translatng a loop wth a fxed number of teratons nto a self-tmed rng wth maxmal throughput. In ths secton, we descrbe three advanced technques: () handlng loops that have a varable (e.g., data-dependent) number of teratons, () handlng nested loops, and () usng loop unrollng to further mprove performance. 4.1 Data-Dependent Loops Many algorthms contan loops that terate a dfferent number of tmes dependng on the value of the nput data set. Our basc approach of translatng the loop nto a self-tmed rng can stll be appled, but one problem arses. Specfcally, f each data set s allowed to leave the loop as soon as ts computaton has fnshed, the data

6 func GCD( a, b ) whle( b!= 0) s=a-b; f (s < 0) then swap(a,b) else a := s return C Fgure 9: Eucld s GCD solver Table 1: Reorderng n GCD computaton. nputs orgnal re-ordered output a b output value tag sets wll ext the loop n some order that may dffer from the order n whch they entered. Thus, the data sets ext the loop out of order. One example of an algorthm that has a data-dependent teraton count s Eucld s algorthm for computng the greatest common dvsor (GCD) of two ntegers. A pseudo-code mplementaton of ths algorthm s shown n Fgure 9. If the approach descrbed n Secton 3 were used to mplement ths code, the values n the output stream would be out of order. Table 1 shows the output n the orgnal order and the output generated by the GCD rng assumng an occupancy of three. Our soluton to the reorderng problem s to append a unque tag to each data set; the tag represents a sequence number. For example, the tags 0 through 7 are used n the GCD example n Table 1. Ths tag becomes a part of the context of the loop, ensurng that even f the results emerge from the loop out of order, they are stll tagged correctly. Ths tag can be used n two dfferent ways: () the out-of-order results are smply passed on to the envronment along wth ther tags, or () a re-order buffer s used to correctly order the tems before sendng them to the envronment. The frst method s preferable f the envronment can handle tagged outputs, whch can be useful n applcatons such as graphc renderers where outputs need not come out n order. The second method, whch ntroduces a re-order buffer, s preferable f the callng envronment must reman naïve to the re-orderng problem. There has been much research on the desgn of re-order buffers n the feld of computer archtecture [6]; our approach smply leverages that work. 4.2 Nested Loops Nested loops are mplctly handled by the recursve ppelnng algorthm shown n Fgure 8. We provde an example here of a nested loop transformaton to llustrate ths capablty. An example of an algorthm that has nested loops s the bsecton algorthm for fndng a zero of a polynomal. Fgure 10 shows the pseudo-code for ths algorthm. The body of functon bsecton contans a whle loop whch successvely halves the search nterval for fndng a zero of the polynomal. Each teraton of ths loop requres the value of the polynomal to be evaluated at the md-pont of ths nterval. Ths evaluaton s carred out by callng the functon poly eval whch, n turn, s an teratve algorthm based on Cramer s rule, and contans a for loop wth data-dependent teraton count. When our algorthm s appled to the bsecton code n Fgure 10, t func bsecton ( coefs, tol, pos, neg ) 1 whle( abs (pos - neg) < tol) 2 md := (pos + neg) / 2; 3 a := poly eval( coefs, md ); 4 fa < 0 5 then neg = mdpt; 6 else pos = mdpt; return mdpt; func poly eval( coefs, x, degree ) a = coefs[degree]; for( = degree-1; 0; --) temp = a * x; a = temp + coefs[]; return a Fgure 10: Code for bsecton and polynomal evaluaton = dst arbter f * + -- >= 0 Fgure 11: Structure for polynomal evaluaton begns by callng the functon compose loop on the entre program. (For convenence the ndvdual statements n the code wll be referred to by ther numbered labels.) The functon compose loop n turn trggers a call to Ppe to ppelne the body of the loop. After handlng lne 2, the algorthm calls compose sequental(ppe(3), Ppe([4-6])). The output of Ppe(3) s a rng structure that mplements the polynomal loop, as shown n Fgure 11. The poly eval rng s fnally composed sequentally wth the rest of the statements wthn the bsecton body to form the structure shown n Fgure Loop Unrollng Unrollng the loop body to form a rng wth a greater number of stages can greatly mprove performance when combned wth our hardware translaton approach. Intutvely, ths mprovement results from the duplcaton of hardware nsde the loop and a correspondng ncrease n the rng occupancy. Thus, the unrolled loop s able to perform more work per unt tme. More formally, the rng frequency (at deal occupancy, cf. Equaton 7) remans farly unchanged when the loop s unrolled, but every tck at the loop s nterface now represents a greater amount of work completed. In partcular, f the loop s unrolled u tmes, every tme a data set crosses the nterface stage, t ndcates that u teratons have just been completed on that data set, rather than just one. Therefore, gnorng overheads, the loop s effectve computaton throughput ncreases by factor equal to the number of tmes t s unrolled, u. As a second-order effect, loop unrollng actually also has the beneft of somewhat reducng the overhead of the specal-purpose helper stages: the dstrbutor, the and the arbter. That overhead s now amortzed over a larger rng. As a result, the latency of each data set wll tend to somewhat decrease and hardware utlzaton wll slghtly ncrease. One possble negatve effect of loop unrollng s that t can cause some data sets to be terated over more tmes than necessary, thereby requrng extra checks wthn the unrolled loop to preserve the semantcs of the computaton. Although loop unrollng s a common technque n both software

7 dst + / 2 arbter = dst * arbter f + -- f f >= 0 Fgure 12: Bsecton loop wth nested polynomal evaluaton and hardware optmzaton, our current use of t has much dfferent goals and performance effects. In software complers, the prmary beneft of loop unrollng s to ntroduce more room to allow nstructons to be reordered, wth the purpose of reducng stalls due to branch and data hazards. In hardware translaton approaches, such as [4, 9], loop unrollng s used n conjuncton wth compacton to ncrease concurrency wthn the loop body. However, these approaches do not allow an ncrease n the occupancy of the loop, thereby obtanng lmted throughput mprovement. In contrast, our approach ncreases the loop occupancy by the same factor as the number of tmes t s unrolled, thereby obtanng dramatcally hgher speedup. 5. RESULTS Benchmarks. Our approach targets teratve algorthms that operate on a stream of ndependent data sets. Secton 4 dscussed several such examples: GCD, BISECT, and POLY. In addton to those examples, we tested mplementatons of the followng algorthms as benchmarks: BTREE: The BTREE algorthm searches a bnary tree resdng n ROM for a gven nput key. A key match returns the data value assocated wth the node that matches the key. Behavor s hghly data-dependent: the number of loop teratons depends on how deep the nput key s node s n the tree. CRC: The cyclc redundancy check (CRC) algorthm calculates an 8-bt checksum for a gven block of nput data. Ths algorthm has a fxed teraton count (16 teratons). ODE: The ODE example s the ordnary dfferental equaton solver from [12, 13, 9], based on the Euler method. It receves as nput the coeffcents of a thrd degree ordnary dfferental equaton, along wth an nterval over whch to ntegrate, ntal condtons, and a step sze. Its output s the fnal value of the dependent varable. The number of loop teratons depends on the sze of the nterval and the step sze. TEA: Tny Encrypton Algorthm (TEA) encrypts a 64-bt block wth a 128-bt key. The number of loop teratons depends on the number of encrypton rounds chosen. For ths example, the number of rounds was fxed at 32 (16 loop teratons). Expermental Setup. The benchmark examples were syntheszed usng the Haste/TDE synthess flow from Handshake Solutons [1], a Phlps subsdary. Haste (formerly Tangram ) s the only mature automated asynchronous synthess flow avalable at present. Whle the control-domnated archtectures that Haste currently produces are not an deal match for ppelned dataflow applcatons, ths expermental setup allows the relatve performance beneft of our approach to be accurately estmated. < - For each of the benchmark examples, three dfferent mplementatons were syntheszed: () a baselne verson ( Orgnal ) that dd not use our approach; () a second verson ( Ppelned ) that used our loop ppelnng approach, but dd not use loop unrollng; and () a fnal verson ( Unrolled ) that employed a twofold unrollng for ts loop along wth our loop ppelnng approach. Smulaton and area estmaton tools from the Haste sute were used to quantfy the latency, throughput and area of the resultng mplementatons. Snce the performance of some of the examples s data-dependent, nput streams contanng as many as 100 data sets were used, and latency and throughput results were averaged over them. Results. Tables 2 3 summarze the results of our experments. Table 2 presents the area, latency and throughput obtaned for each of the benchmarks. The fnal column presents the normalzed throughput (relatve to the Orgnal verson), to llustrate the performance beneft of our approach. Fnally, the area and latency overheads of our approach are summarzed n Table 3, once agan normalzed wth respect to the Orgnal verson. Dscusson. The results demonstrate that a substantal mpact on throughput s acheved by loop ppelnng: up to 9.7x speedup. Wthout the use of loop unrollng, our approach obtans a throughput mprovement by a factor of 1.3 to 4.9. When a twofold loop unrollng was appled along wth loop ppelnng, the speedup obtaned was even hgher: a factor of 2.6 to 9.7x. In greater detal, algorthms that were ppelned nto a relatvely small number of stages (CRC and GCD) had a lmted potental for loop ppelnng because the maxmum capacty of the self-tmed rng structures n these cases was low. As a result, the throughput beneft was around 1.3x (wthout unrollng). On the other hand, algorthms that were hghly ppelned (BISECT, ODE, and TEA), had larger rng structures whch could accommodate a greater number of data sets concurrently. For these benchmarks, the throughput ncrease was substantally hgher: a factor of 2 to 4.9x (wthout unrollng). As expected, the twofold unrollng led to a throughput ncrease n each case by a factor of 1.7 to 2.0x, yeldng an overall combned throughput beneft of 2.6 to 9.7x relatve to the orgnal mplementaton. Although our approach results n a sgnfcant boost n throughput, there are costs assocated wth the performance mprovement. Shown n Table 3 are the ncreases n total area consumed, and n the average latency per data set. In terms of area, the ppelned verson adds the overheads of loop control dscussed n Secton 3.4. Each ppelne stage must latch data from the prevous stage, so algorthms wth many stages or large contexts (e.g. BISECT) wllseealargencreasenarea. Byunrollng the loop twofold, the total area of the mplementaton ncreased by a factor of x. The average latency for a data set also ncreased when loop ppelnng was used. Ths was expected because, lke most tradtonal ppelnng approaches, our approach ncreases throughput at the expense of latency. The latency overhead n most of the benchmarks was n the x range, except for the example that contaned a nested loop, BISECT. For BISECT, the latency overhead reported s substantally hgher (8.2x) due to the compounded overheads of ts nested loops. Whle the area and latency overheads may seem dauntng, for most applcatons the man performance measure s overall executon tme. By ncreasng latency and chp area, a dramatc mprovement n throughput results, reducng executon tme by a large factor. 6. CONCLUSIONS AND ONGOING WORK In ths paper we presented a technque that allows loop hardware to operate on many data sets at the same tme. We proposed a novel loop nterface and a set of transformatons from hgh-level code to

8 Table 2: Synthess Results: Performance Beneft Algorthm/ Area Latency Throughput Normalzed Approach (µm 2 ) (ns) (Mega tems/s) Throughput BISECT Orgnal Ppelned Unrolled BTREE Orgnal Ppelned Unrolled CRC Orgnal Ppelned Unrolled GCD Orgnal Ppelned Unrolled ODE Orgnal Ppelned Unrolled POLY Orgnal Ppelned Unrolled TEA Orgnal Ppelned Unrolled Table 3: Area and Latency (Relatve Overheads) Algorthm Ppelned Unrolled Area Latency Area Latency (Norm.) (Norm.) (Norm.) (Norm.) BISECT BTREE CRC GCD ODE POLY TEA a ppelned loop structure. Our results showed sgnfcant throughput mprovement over the non-ppelned rng mplementaton, wth further benefts through unrollng. More work needs to be done to decrease the area overhead of our technque. Our current approach s conservatve: each stage n the rng structure stores a dstnct copy of the maxmal set of lve varables that t may need. We are explorng more optmal technques for reducng ths overhead. In ongong work, we are re-mplementng the above benchmarks n a true dataflow style, n order to elmnate the unnecessary control overheads ntroduced by the Haste synthess flow. We antcpate sgnfcantly better performance and area results after ths modfcaton. In addton, a loop-ppelned GCD desgn has been mplemented n slcon. Intal results ndcate fully functonal parts, though a complete evaluaton s pendng. A full-featured transformaton algorthm s the next logcal step n ths work. Addtonal features nclude support for resource sharng, unrestrcted communcaton wth the envronment, and handlng dependences across data sets. Other loop optmzng technques, such as compacton, are also beng pursued. 7. REFERENCES [1] Handshake Solutons, a Phlps subsdary. [2] A. Aken and A. Ncolau. Perfect ppelnng: A new loop parallelzaton technque. In European Symposum on Programmng, pages , [3] V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. Software ppelnng. ACM Comput. Surv., 27(3): , [4] M. Budu. Spatal Computaton. PhD thess, Carnege Mellon Unversty, Computer Scence Department, December Techncal report CMU-CS [5] A. Davs and S. M. Nowck. An ntroducton to asynchronous crcut desgn. Techncal Report UUCS , Dept. of Computer Scence, Unversty of Utah, Sept [6] J. L. Hennessy and D. A. Patterson. Computer archtecture: a quanttatve approach. Morgan Kaufmann Publshers Inc., San Francsco, CA, USA, [7] U. J. Kapas, W. J. Dally, S. Rxner, P. R. Mattson, J. D. Owens, and B. Khalany. Effcent condtonal operatons for data-parallel archtectures. In Proc. of IEEE/ACM Intl. Symp. on Mcroarchtecture, [8] K. Kennedy and J. R. Allen. Optmzng complers for modern archtectures: a dependence-based approach. Morgan Kaufmann Publshers Inc., San Francsco, CA, USA, [9] M. Theobald and S. M. Nowck. Transformatons for the synthess and optmzaton of asynchronous dstrbuted control. In Proc. ACM/IEEE Desgn Automaton Conf., June [10] T. E. Wllams. Self-Tmed Rngs and ther Applcaton to Dvson. PhD thess, Stanford Unversty, June [11] T. E. Wllams and M. A. Horowtz. A zero-overhead self-tmed 160ns 54b CMOS dvder. IEEE Journal of Sold-State Crcuts, 26(11): , Nov [12] K. Y. Yun, P. A. Beerel, V. Vaklotojar, A. E. Dooply, and J. Arceo. The desgn and verfcaton of a hgh-performance low-control-overhead asynchronous dfferental equaton solver. In Proc. Int. Symp. on Advanced Research n Asynchronous Crcuts and Systems, pages IEEE Computer Socety Press, Apr [13] K. Y. Yun, P. A. Beerel, V. Vaklotojar, A. E. Dooply, and J. Arceo. The desgn and verfcaton of a hgh-performance low-control-overhead asynchronous dfferental equaton solver. IEEE Trans. on VLSI Systems, 6(4): , Dec

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz

Compiler Design. Spring Register Allocation. Sample Exercises and Solutions. Prof. Pedro C. Diniz Compler Desgn Sprng 2014 Regster Allocaton Sample Exercses and Solutons Prof. Pedro C. Dnz USC / Informaton Scences Insttute 4676 Admralty Way, Sute 1001 Marna del Rey, Calforna 90292 pedro@s.edu Regster

More information

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009.

Assignment # 2. Farrukh Jabeen Algorithms 510 Assignment #2 Due Date: June 15, 2009. Farrukh Jabeen Algorthms 51 Assgnment #2 Due Date: June 15, 29. Assgnment # 2 Chapter 3 Dscrete Fourer Transforms Implement the FFT for the DFT. Descrbed n sectons 3.1 and 3.2. Delverables: 1. Concse descrpton

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Parallel matrix-vector multiplication

Parallel matrix-vector multiplication Appendx A Parallel matrx-vector multplcaton The reduced transton matrx of the three-dmensonal cage model for gel electrophoress, descrbed n secton 3.2, becomes excessvely large for polymer lengths more

More information

The Codesign Challenge

The Codesign Challenge ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier

Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Some materal adapted from Mohamed Youns, UMBC CMSC 611 Spr 2003 course sldes Some materal adapted from Hennessy & Patterson / 2003 Elsever Scence Performance = 1 Executon tme Speedup = Performance (B)

More information

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory

Virtual Memory. Background. No. 10. Virtual Memory: concept. Logical Memory Space (review) Demand Paging(1) Virtual Memory Background EECS. Operatng System Fundamentals No. Vrtual Memory Prof. Hu Jang Department of Electrcal Engneerng and Computer Scence, York Unversty Memory-management methods normally requres the entre process

More information

Loop Transformations, Dependences, and Parallelization

Loop Transformations, Dependences, and Parallelization Loop Transformatons, Dependences, and Parallelzaton Announcements Mdterm s Frday from 3-4:15 n ths room Today Semester long project Data dependence recap Parallelsm and storage tradeoff Scalar expanson

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach Data Representaton n Dgtal Desgn, a Sngle Converson Equaton and a Formal Languages Approach Hassan Farhat Unversty of Nebraska at Omaha Abstract- In the study of data representaton n dgtal desgn and computer

More information

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation

Loop Transformations for Parallelism & Locality. Review. Scalar Expansion. Scalar Expansion: Motivation Loop Transformatons for Parallelsm & Localty Last week Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Scalar expanson for removng false dependences Loop nterchange Loop

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following.

Complex Numbers. Now we also saw that if a and b were both positive then ab = a b. For a second let s forget that restriction and do the following. Complex Numbers The last topc n ths secton s not really related to most of what we ve done n ths chapter, although t s somewhat related to the radcals secton as we wll see. We also won t need the materal

More information

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to:

Motivation. EE 457 Unit 4. Throughput vs. Latency. Performance Depends on View Point?! Computer System Performance. An individual user wants to: 4.1 4.2 Motvaton EE 457 Unt 4 Computer System Performance An ndvdual user wants to: Mnmze sngle program executon tme A datacenter owner wants to: Maxmze number of Mnmze ( ) http://e-tellgentnternetmarketng.com/webste/frustrated-computer-user-2/

More information

CMPS 10 Introduction to Computer Science Lecture Notes

CMPS 10 Introduction to Computer Science Lecture Notes CPS 0 Introducton to Computer Scence Lecture Notes Chapter : Algorthm Desgn How should we present algorthms? Natural languages lke Englsh, Spansh, or French whch are rch n nterpretaton and meanng are not

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

High-Boost Mesh Filtering for 3-D Shape Enhancement

High-Boost Mesh Filtering for 3-D Shape Enhancement Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements Explct Formulas and Effcent Algorthm for Moment Computaton of Coupled RC Trees wth Lumped and Dstrbuted Elements Qngan Yu and Ernest S.Kuh Electroncs Research Lab. Unv. of Calforna at Berkeley Berkeley

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Simulation Based Analysis of FAST TCP using OMNET++

Simulation Based Analysis of FAST TCP using OMNET++ Smulaton Based Analyss of FAST TCP usng OMNET++ Umar ul Hassan 04030038@lums.edu.pk Md Term Report CS678 Topcs n Internet Research Sprng, 2006 Introducton Internet traffc s doublng roughly every 3 months

More information

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011

Outline. Digital Systems. C.2: Gates, Truth Tables and Logic Equations. Truth Tables. Logic Gates 9/8/2011 9/8/2 2 Outlne Appendx C: The Bascs of Logc Desgn TDT4255 Computer Desgn Case Study: TDT4255 Communcaton Module Lecture 2 Magnus Jahre 3 4 Dgtal Systems C.2: Gates, Truth Tables and Logc Equatons All sgnals

More information

Lecture 3: Computer Arithmetic: Multiplication and Division

Lecture 3: Computer Arithmetic: Multiplication and Division 8-447 Lecture 3: Computer Arthmetc: Multplcaton and Dvson James C. Hoe Dept of ECE, CMU January 26, 29 S 9 L3- Announcements: Handout survey due Lab partner?? Read P&H Ch 3 Read IEEE 754-985 Handouts:

More information

Assembler. Building a Modern Computer From First Principles.

Assembler. Building a Modern Computer From First Principles. Assembler Buldng a Modern Computer From Frst Prncples www.nand2tetrs.org Elements of Computng Systems, Nsan & Schocken, MIT Press, www.nand2tetrs.org, Chapter 6: Assembler slde Where we are at: Human Thought

More information

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont)

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont) Loop Transformatons for Parallelsm & Localty Prevously Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Loop nterchange Loop transformatons and transformaton frameworks

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain

AMath 483/583 Lecture 21 May 13, Notes: Notes: Jacobi iteration. Notes: Jacobi with OpenMP coarse grain AMath 483/583 Lecture 21 May 13, 2011 Today: OpenMP and MPI versons of Jacob teraton Gauss-Sedel and SOR teratve methods Next week: More MPI Debuggng and totalvew GPU computng Read: Class notes and references

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation

Quality Improvement Algorithm for Tetrahedral Mesh Based on Optimal Delaunay Triangulation Intellgent Informaton Management, 013, 5, 191-195 Publshed Onlne November 013 (http://www.scrp.org/journal/m) http://dx.do.org/10.36/m.013.5601 Qualty Improvement Algorthm for Tetrahedral Mesh Based on

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility

Evaluation of an Enhanced Scheme for High-level Nested Network Mobility IJCSNS Internatonal Journal of Computer Scence and Network Securty, VOL.15 No.10, October 2015 1 Evaluaton of an Enhanced Scheme for Hgh-level Nested Network Moblty Mohammed Babker Al Mohammed, Asha Hassan.

More information

Concurrent Apriori Data Mining Algorithms

Concurrent Apriori Data Mining Algorithms Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng

More information

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation

An Iterative Solution Approach to Process Plant Layout using Mixed Integer Optimisation 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 An Iteratve Soluton Approach to Process Plant Layout usng Mxed

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

VISUAL SELECTION OF SURFACE FEATURES DURING THEIR GEOMETRIC SIMULATION WITH THE HELP OF COMPUTER TECHNOLOGIES

VISUAL SELECTION OF SURFACE FEATURES DURING THEIR GEOMETRIC SIMULATION WITH THE HELP OF COMPUTER TECHNOLOGIES UbCC 2011, Volume 6, 5002981-x manuscrpts OPEN ACCES UbCC Journal ISSN 1992-8424 www.ubcc.org VISUAL SELECTION OF SURFACE FEATURES DURING THEIR GEOMETRIC SIMULATION WITH THE HELP OF COMPUTER TECHNOLOGIES

More information

X- Chart Using ANOM Approach

X- Chart Using ANOM Approach ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Learning the Kernel Parameters in Kernel Minimum Distance Classifier Learnng the Kernel Parameters n Kernel Mnmum Dstance Classfer Daoqang Zhang 1,, Songcan Chen and Zh-Hua Zhou 1* 1 Natonal Laboratory for Novel Software Technology Nanjng Unversty, Nanjng 193, Chna Department

More information

ELEC 377 Operating Systems. Week 6 Class 3

ELEC 377 Operating Systems. Week 6 Class 3 ELEC 377 Operatng Systems Week 6 Class 3 Last Class Memory Management Memory Pagng Pagng Structure ELEC 377 Operatng Systems Today Pagng Szes Vrtual Memory Concept Demand Pagng ELEC 377 Operatng Systems

More information

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms

Course Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques

More information

3. CR parameters and Multi-Objective Fitness Function

3. CR parameters and Multi-Objective Fitness Function 3 CR parameters and Mult-objectve Ftness Functon 41 3. CR parameters and Mult-Objectve Ftness Functon 3.1. Introducton Cogntve rados dynamcally confgure the wreless communcaton system, whch takes beneft

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

CHAPTER 4 PARALLEL PREFIX ADDER

CHAPTER 4 PARALLEL PREFIX ADDER 93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum

More information

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT Bran J. Wolf, Joseph L. Hammond, and Harlan B. Russell Dept. of Electrcal and Computer Engneerng, Clemson Unversty,

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Analysis of Collaborative Distributed Admission Control in x Networks

Analysis of Collaborative Distributed Admission Control in x Networks 1 Analyss of Collaboratve Dstrbuted Admsson Control n 82.11x Networks Thnh Nguyen, Member, IEEE, Ken Nguyen, Member, IEEE, Lnha He, Member, IEEE, Abstract Wth the recent surge of wreless home networks,

More information

arxiv: v3 [cs.ds] 7 Feb 2017

arxiv: v3 [cs.ds] 7 Feb 2017 : A Two-stage Sketch for Data Streams Tong Yang 1, Lngtong Lu 2, Ybo Yan 1, Muhammad Shahzad 3, Yulong Shen 2 Xaomng L 1, Bn Cu 1, Gaogang Xe 4 1 Pekng Unversty, Chna. 2 Xdan Unversty, Chna. 3 North Carolna

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Solving two-person zero-sum game by Matlab

Solving two-person zero-sum game by Matlab Appled Mechancs and Materals Onlne: 2011-02-02 ISSN: 1662-7482, Vols. 50-51, pp 262-265 do:10.4028/www.scentfc.net/amm.50-51.262 2011 Trans Tech Publcatons, Swtzerland Solvng two-person zero-sum game by

More information

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Load Balancing for Hex-Cell Interconnection Network

Load Balancing for Hex-Cell Interconnection Network Int. J. Communcatons, Network and System Scences,,, - Publshed Onlne Aprl n ScRes. http://www.scrp.org/journal/jcns http://dx.do.org/./jcns.. Load Balancng for Hex-Cell Interconnecton Network Saher Manaseer,

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

APPLICATION OF A COMPUTATIONALLY EFFICIENT GEOSTATISTICAL APPROACH TO CHARACTERIZING VARIABLY SPACED WATER-TABLE DATA

APPLICATION OF A COMPUTATIONALLY EFFICIENT GEOSTATISTICAL APPROACH TO CHARACTERIZING VARIABLY SPACED WATER-TABLE DATA RFr"W/FZD JAN 2 4 1995 OST control # 1385 John J Q U ~ M Argonne Natonal Laboratory Argonne, L 60439 Tel: 708-252-5357, Fax: 708-252-3 611 APPLCATON OF A COMPUTATONALLY EFFCENT GEOSTATSTCAL APPROACH TO

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1)

For instance, ; the five basic number-sets are increasingly more n A B & B A A = B (1) Secton 1.2 Subsets and the Boolean operatons on sets If every element of the set A s an element of the set B, we say that A s a subset of B, or that A s contaned n B, or that B contans A, and we wrte A

More information

Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances

Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances In: Proc. 0th Int. Symposum on Hardware/Software Codesgn (CODES 02), Estes Park, Colorado, USA, May 6 8, 2002 Algorthmc Transformaton Technques for Effcent Exploraton of Alternatve Applcaton Instances

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

CE 221 Data Structures and Algorithms

CE 221 Data Structures and Algorithms CE 1 ata Structures and Algorthms Chapter 4: Trees BST Text: Read Wess, 4.3 Izmr Unversty of Economcs 1 The Search Tree AT Bnary Search Trees An mportant applcaton of bnary trees s n searchng. Let us assume

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introducton 1.1 Parallel Processng There s a contnual demand for greater computatonal speed from a computer system than s currently possble (.e. sequental systems). Areas need great computatonal

More information

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) ,

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) , VRT012 User s gude V0.1 Thank you for purchasng our product. We hope ths user-frendly devce wll be helpful n realsng your deas and brngng comfort to your lfe. Please take few mnutes to read ths manual

More information

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7

Channel 0. Channel 1 Channel 2. Channel 3 Channel 4. Channel 5 Channel 6 Channel 7 Optmzed Regonal Cachng for On-Demand Data Delvery Derek L. Eager Mchael C. Ferrs Mary K. Vernon Unversty of Saskatchewan Unversty of Wsconsn Madson Saskatoon, SK Canada S7N 5A9 Madson, WI 5376 eager@cs.usask.ca

More information

Design of a Real Time FPGA-based Three Dimensional Positioning Algorithm

Design of a Real Time FPGA-based Three Dimensional Positioning Algorithm Desgn of a Real Tme FPGA-based Three Dmensonal Postonng Algorthm Nathan G. Johnson-Wllams, Student Member IEEE, Robert S. Myaoka, Member IEEE, Xaol L, Student Member IEEE, Tom K. Lewellen, Fellow IEEE,

More information

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access Agenda Cache Performance Samra Khan March 28, 217 Revew from last lecture Cache access Assocatvty Replacement Cache Performance Cache Abstracton and Metrcs Address Tag Store (s the address n the cache?

More information

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals

Agenda & Reading. Simple If. Decision-Making Statements. COMPSCI 280 S1C Applications Programming. Programming Fundamentals Agenda & Readng COMPSCI 8 SC Applcatons Programmng Programmng Fundamentals Control Flow Agenda: Decsonmakng statements: Smple If, Ifelse, nested felse, Select Case s Whle, DoWhle/Untl, For, For Each, Nested

More information

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL)

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL) Crcut Analyss I (ENG 405) Chapter Method of Analyss Nodal(KCL) and Mesh(KVL) Nodal Analyss If nstead of focusng on the oltages of the crcut elements, one looks at the oltages at the nodes of the crcut,

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

Intro. Iterators. 1. Access

Intro. Iterators. 1. Access Intro Ths mornng I d lke to talk a lttle bt about s and s. We wll start out wth smlartes and dfferences, then we wll see how to draw them n envronment dagrams, and we wll fnsh wth some examples. Happy

More information

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints

Sum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution

Dynamic Voltage Scaling of Supply and Body Bias Exploiting Software Runtime Distribution Dynamc Voltage Scalng of Supply and Body Bas Explotng Software Runtme Dstrbuton Sungpack Hong EE Department Stanford Unversty Sungjoo Yoo, Byeong Bn, Kyu-Myung Cho, Soo-Kwan Eo Samsung Electroncs Taehwan

More information

Feature Reduction and Selection

Feature Reduction and Selection Feature Reducton and Selecton Dr. Shuang LIANG School of Software Engneerng TongJ Unversty Fall, 2012 Today s Topcs Introducton Problems of Dmensonalty Feature Reducton Statstc methods Prncpal Components

More information

Random Kernel Perceptron on ATTiny2313 Microcontroller

Random Kernel Perceptron on ATTiny2313 Microcontroller Random Kernel Perceptron on ATTny233 Mcrocontroller Nemanja Djurc Department of Computer and Informaton Scences, Temple Unversty Phladelpha, PA 922, USA nemanja.djurc@temple.edu Slobodan Vucetc Department

More information

AADL : about scheduling analysis

AADL : about scheduling analysis AADL : about schedulng analyss Schedulng analyss, what s t? Embedded real-tme crtcal systems have temporal constrants to meet (e.g. deadlne). Many systems are bult wth operatng systems provdng multtaskng

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier

Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier Floatng-Pont Dvson Algorthms for an x86 Mcroprocessor wth a Rectangular Multpler Mchael J. Schulte Dmtr Tan Carl E. Lemonds Unversty of Wsconsn Advanced Mcro Devces Advanced Mcro Devces Schulte@engr.wsc.edu

More information

BANDWIDTH OPTIMIZATION OF INDIVIDUAL HOP FOR ROBUST DATA STREAMING ON EMERGENCY MEDICAL APPLICATION

BANDWIDTH OPTIMIZATION OF INDIVIDUAL HOP FOR ROBUST DATA STREAMING ON EMERGENCY MEDICAL APPLICATION ARPN Journal of Engneerng and Appled Scences 2006-2009 Asan Research Publshng Network (ARPN). All rghts reserved. BANDWIDTH OPTIMIZATION OF INDIVIDUA HOP FOR ROBUST DATA STREAMING ON EMERGENCY MEDICA APPICATION

More information

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming

Kent State University CS 4/ Design and Analysis of Algorithms. Dept. of Math & Computer Science LECT-16. Dynamic Programming CS 4/560 Desgn and Analyss of Algorthms Kent State Unversty Dept. of Math & Computer Scence LECT-6 Dynamc Programmng 2 Dynamc Programmng Dynamc Programmng, lke the dvde-and-conquer method, solves problems

More information

y and the total sum of

y and the total sum of Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton

More information

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007

Synthesizer 1.0. User s Guide. A Varying Coefficient Meta. nalytic Tool. Z. Krizan Employing Microsoft Excel 2007 Syntheszer 1.0 A Varyng Coeffcent Meta Meta-Analytc nalytc Tool Employng Mcrosoft Excel 007.38.17.5 User s Gude Z. Krzan 009 Table of Contents 1. Introducton and Acknowledgments 3. Operatonal Functons

More information

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION

CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION 24 CHAPTER 2 PROPOSED IMPROVED PARTICLE SWARM OPTIMIZATION The present chapter proposes an IPSO approach for multprocessor task schedulng problem wth two classfcatons, namely, statc ndependent tasks and

More information

Non-Split Restrained Dominating Set of an Interval Graph Using an Algorithm

Non-Split Restrained Dominating Set of an Interval Graph Using an Algorithm Internatonal Journal of Advancements n Research & Technology, Volume, Issue, July- ISS - on-splt Restraned Domnatng Set of an Interval Graph Usng an Algorthm ABSTRACT Dr.A.Sudhakaraah *, E. Gnana Deepka,

More information

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints

TPL-Aware Displacement-driven Detailed Placement Refinement with Coloring Constraints TPL-ware Dsplacement-drven Detaled Placement Refnement wth Colorng Constrants Tao Ln Iowa State Unversty tln@astate.edu Chrs Chu Iowa State Unversty cnchu@astate.edu BSTRCT To mnmze the effect of process

More information

A SYSTOLIC APPROACH TO LOOP PARTITIONING AND MAPPING INTO FIXED SIZE DISTRIBUTED MEMORY ARCHITECTURES

A SYSTOLIC APPROACH TO LOOP PARTITIONING AND MAPPING INTO FIXED SIZE DISTRIBUTED MEMORY ARCHITECTURES A SYSOLIC APPROACH O LOOP PARIIONING AND MAPPING INO FIXED SIZE DISRIBUED MEMORY ARCHIECURES Ioanns Drosts, Nektaros Kozrs, George Papakonstantnou and Panayots sanakas Natonal echncal Unversty of Athens

More information

Memory Modeling in ESL-RTL Equivalence Checking

Memory Modeling in ESL-RTL Equivalence Checking 11.4 Memory Modelng n ESL-RTL Equvalence Checkng Alfred Koelbl 2025 NW Cornelus Pass Rd. Hllsboro, OR 97124 koelbl@synopsys.com Jerry R. Burch 2025 NW Cornelus Pass Rd. Hllsboro, OR 97124 burch@synopsys.com

More information