Spatial Computation
Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea and Seth Copen Goldstein
Carnegie Mellon University

ABSTRACT

This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units.

In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration; the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient.

In this work we demonstrate three features of ASH: (1) that such architectures can be built by automatic compilation of C programs; (2) that distributed computation is in some respects fundamentally different from monolithic superscalar processors; and (3) that ASIC implementations of ASH use three orders of magnitude less energy compared to high-end superscalar processors, while being on average only 33% slower in performance (3.5x worst-case).

Categories and Subject Descriptors: B.2.4 arithmetic and logic: cost/performance; B.6.3 automatic synthesis: optimization, simulation; B.7.1 algorithms implemented in hardware; B.7.2 simulation; C.1.3 dataflow architectures: hybrid systems; D.3.2 data-flow languages; D.3.4 code generation: compilers, optimization

General Terms: Measurement, Performance, Design.

Keywords: spatial computation, dataflow machine, application-specific hardware, low-power.
1. INTRODUCTION

The von Neumann computer architecture [08] has proven to be extremely resilient despite numerous perceived shortcomings [7]. Computer architects have continuously enhanced the structure of the central processing unit, taking advantage of Moore's law. Today's superpipelined, superscalar, out-of-order microprocessors are amazing achievements. However, the future scalability of superscalar (and even VLIW) architectures is questionable. Attempting to increase the pipeline width of the processor beyond the current four or five instructions per cycle is difficult since the interconnection networks scale superlinearly. The register file, instruction issue logic, and pipeline forwarding networks grow quadratically with issue width, making the interconnection latency the limiting factor [2]. This problem is compounded by increasing clock rates and shrinking technologies: currently, signal propagation delays on inter-module wires dominate logic delays [56]. Just the distribution of the clock signal is a major undertaking [0]. Wire delays are not the only factor in the way of scaling: power consumption and power density have reached dangerous levels, due to increased amounts of speculative execution, increased logic density and wide issue. Design complexity is yet another limitation: while the number of available transistors grows by 58% annually, designer productivity grows by only 21% []. This exponentially increasing productivity gap has historically been covered by employing larger and larger design and verification teams, but human resources are economically hard to scale.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ASPLOS'04, October 7–13, 2004, Boston, Massachusetts, USA. Copyright 2004 ACM...$5.00.
The research presented in this paper is aimed directly at these problems. We explore Spatial Computation, which is a model of computation optimized for wires. We have previously proposed to use Spatial Computation for mapping programs to nanofabrics [5]; in this paper we evaluate the compiler technology we developed for nanofabrics on a traditional CMOS substrate. Since the class of circuits one could call "spatial" is arguably very large, we focus our attention on a particular set of instances of SC structures, which we call Application-Specific Hardware (ASH). ASH requires no clocks, nor any global signals. The core assumption is that computation gates are cheap, and will become even cheaper compared to the cost of wires (in terms of delay, power and area). ASH is an extreme point in the space of SC architectures: in ASH computation structures are never shared, and each program operation is synthesized as a different functional unit.

We present a complete compiler/CAD tool-chain that bridges both software compilation and microarchitecture. Applications written in high-level languages are compiled into hardware descriptions. These descriptions can either be loaded onto a reconfigurable hardware fabric or synthesized directly into circuits. The resulting circuits use only localized communication, require neither broadcast nor global control, and are self-synchronized. The compiler we have developed is automatic, fast, requires no designer intervention, and exploits instruction-level parallelism (ILP) and pipelining.

The novel research contributions described in this paper are: (1) a compiler tool-chain from ANSI C to asynchronous hardware;

Figure 1: Toolflow used for evaluation (C source, Suif front-end, CFG, Pegasus, asynchronous back-end, Verilog, commercial CAD, ASIC layout; CASH high-level simulation and ModelSim circuit simulation). The right-hand side indicates the sections in this paper that discuss each of the building blocks.

Figure 2: ASH used in tandem with a processor for implementing whole applications. The processor is relegated to low-ILP program fragments and to executing the operating system.

(2) a qualitative comparison of Spatial Computation architectures and superscalar processors; (3) a circuit-level evaluation of the synthesized circuits on C program kernels from the Mediabench suite; (4) a description of a high-level synthesis toolflow that produces extremely energy-efficient program implementations: comparable with custom hand-tuned hardware designs, and three orders of magnitude better than superscalar processors; (5) the first implementation of a C compiler that can target dataflow machines.

2. COMPILING C TO HARDWARE

This section presents our compilation methods as embodied in the CASH compiler (Compiler for ASH). The structure of CASH, and its place within a complete synthesis tool-flow, are illustrated in Figure 1, which also shows the organization of this paper. The circuits generated by CASH cannot handle system calls. For translating whole applications we assume that hardware-software partitioning is performed first, and that part of the application is executed on a traditional processor (e.g., I/O), while the rest is mapped to hardware, as shown in Figure 2. The processor and the hardware have access to the same global memory space, and there is some mechanism to maintain a coherent view of memory. Crossing the hardware-software interface can be hidden by employing a stub compiler, which encapsulates the information transmitted across the interface, as we have proposed in [20], effectively performing Remote Procedure Calls across the hw/sw interface.

2.1 CASH

CASH takes ANSI C as input.
CASH represents the input program using Pegasus [6, 7], a dataflow intermediate representation (IR). The output of CASH is a hardware dataflow machine which directly executes the input program. Currently CASH generates a structural Verilog description of the circuits. CASH has a C front-end based on the Suif compiler []. The front-end performs some optimizations (including procedure inlining, loop unrolling, call-graph computation, and basic control-flow optimizations), intraprocedural pointer analysis, and live-variable analysis. Then the front-end translates the low-Suif intermediate representation into Pegasus. Next, CASH performs a wealth of optimizations on this representation, including scalar, memory and Boolean optimizations. Finally, a back-end performs peephole optimizations and generates code.

The translation of C into hardware is eased by maintaining the same memory layout for all program data structures as implemented in a classical CPU-based system (the heap structure is practically identical, but CASH uses less stack space, since it never needs to spill registers). ASH currently uses a single monolithic memory for this purpose (see Section 4.1.3). There is nothing intrinsic in Spatial Computation that mandates the use of a monolithic memory; on the contrary, using several independent memories (as suggested for example in [94, 9]) would be very beneficial.

2.2 The Pegasus Intermediate Representation

The key technique allowing us to bridge the semantic gap between imperative languages and asynchronous dataflow is Static Single Assignment (SSA) [34]. SSA is an IR for imperative programs in which each variable is assigned only once. As such, it can be seen as a functional program [5]. Pegasus represents the scalar part of the computation of C programs as SSA. Due to space limitations we only briefly describe Pegasus; see [6, 7] for more details. Pegasus seamlessly extends SSA, representing memory dependences, predication, and (forward) speculation in a unified manner.
While other IRs have previously combined some of these aspects, we believe Pegasus is the first to unify them into a coherent, semantically precise representation.

A program is represented by a directed graph in which nodes are operations and edges indicate value flow; an example is shown in Figure 3. Pegasus leverages techniques used in compilers for predicated execution machines [74] by collecting multiple basic blocks into one hyperblock; each hyperblock is transformed into straight-line code through the use of predication, using techniques similar to PSSA [25]. (A hyperblock is a portion of the program control-flow graph having a single entry point and possibly multiple exits.) Instead of SSA φ-nodes, within hyperblocks Pegasus uses explicit decoded multiplexor (MUX) nodes (one example is given in Figure 7). A decoded MUX has n data inputs and n predicates. The data inputs are the reaching definitions. The MUX predicates correspond to path predicates in PSSA; each predicate selects one corresponding data input. The predicates of each MUX are guaranteed to be mutually disjoint (i.e., the predicates are one-hot encoded). CASH uses the espresso [3] Boolean optimizer to simplify the predicate computations. Speculation is introduced by predicate promotion [73]: the predicates that guard instructions without side-effects are weakened to become true, i.e., these instructions are executed unconditionally once a hyperblock is entered. Predication and speculation are thus core constructs in Pegasus. The former is used for translating control-flow constructs into dataflow; the latter for reducing the criticality of control-dependences [63]. They effectively increase the exposed ILP. Note that MUX nodes are natural speculation-squashing points, discarding all of their data inputs corresponding to false predicates (i.e., computed on mis-speculated paths).

int squares() {
    int i = 0, sum = 0;
    for (; i < 10; i++)
        sum += i * i;
    return sum;
}

Figure 3: C program (above) and its representation comprising three hyperblocks; each hyperblock is shown as a numbered rectangle. The dotted lines represent predicate values. (This figure omits the token edges used for memory synchronization.)

Hyperblocks are stitched together into a dataflow graph representing an entire procedure by creating dataflow edges connecting each hyperblock to its successors. Each variable live at the end of a hyperblock is forwarded through an ETA node [82] (also called a "gateway"). ETAs are shown as triangles pointing down in our figures. ETA nodes have two inputs, a value and a predicate, and one output. When the predicate evaluates to true, the ETA node moves the input value to the output; when the predicate evaluates to false, the input value and the predicate are simply consumed, generating no output. A hyperblock with multiple predecessors receives control from one of several different points; inter-hyperblock join points are represented by MERGE nodes, shown as triangles pointing up.

Figure 3 shows a function that uses i as an induction variable and sum to accumulate the sum of the squares of i. On the right is the program's Pegasus representation, which consists of three hyperblocks. Hyperblock 1 initializes i and sum to 0. Hyperblock 2 represents the loop; it contains two MERGE nodes, one for each of the loop-carried values, i and sum. Hyperblock 3 is the function epilog, containing just the RETURN. Back-edges within a hyperblock denote loop-carried values; in this example there are two such edges in hyperblock 2; back-edges always connect an ETA to a MERGE node. Memory accesses are represented through explicit LOAD and STORE nodes.
These and other operations with side-effects (e.g., CALL and DIVISION, which may generate exceptions) also have a predicate input: if the predicate is false, the operation is not executed. In our figures, predicate values are shown as dotted lines.

The compiler adds dependence edges, called token edges, to explicitly synchronize operations whose side-effects may not commute. Operations with memory side-effects (LOAD, STORE, CALL, and RETURN) all have a token input. Token edges explicitly encode data flow through memory. An operation with memory side-effects must collect tokens from all its potentially conflicting predecessors (e.g., a STORE following a set of LOADs). The COMBINE operator is used for this purpose. COMBINE has multiple token inputs and a single token output; it generates an output after it receives all its inputs. It has been noted for the Value Dependence Graph representation [97] that such token networks can be interpreted as SSA for memory, where the COMBINE operator corresponds to a φ-function. Tokens encode true-, output- and anti-dependences, and are "may" dependences. We have devised new algorithms for removing redundant memory accesses which exploit predicates and token edges in concert [8, 9]. As we show later, tokens are also explicitly synthesized as hardware signals, so they are both compile-time and run-time constructs.

Currently the compiler is purely static, i.e., it uses no profiling information. There is no reason that profiling cannot be incorporated in our tool-chain. Section 4.1.2 explains why profiling is less critical for CASH than for traditional ILP compilers [92].

2.3 The Dataflow Semantics of Pegasus

In [6] we have given a precise and concise operational semantics for all Pegasus constructs. At run-time, each edge of the graph either holds a value or is empty. An operation begins computing once all of its required inputs are available. It latches the newly computed value when its output is empty. The computation consumes the input values (setting the input edges to empty) and produces an output value.
This semantics is the one of a static dataflow machine (i.e., each edge can hold a single value at one time). The precise semantics is useful for reasoning about the correctness of compiler optimizations and is a precise specification for compiler back-ends. Currently CASH has three back-ends: (1) a graph-drawing back-end, which generates drawings in the dot language [45], such as in Figure 3; (2) a simulation back-end, which generates an interpreter of the graph structure (used for the analysis in Section 3); and (3) the asynchronous circuits Verilog back-end, described in Section 4.1 (used for the evaluation in Section 4).

2.4 Compiler Status

The core of CASH handles all of ANSI C except longjmp, alloca, and functions with a variable number of arguments. While the latter two constructs are relatively easy to integrate, handling longjmp is substantially more difficult. Strictly speaking, C does not have exceptions ([6], p. 200), and our compiler does not handle them. Recursion is handled in exactly the same way as in software: CASH allocates stack frames for saving and restoring the live local variables around the recursive call. As an optimization, CASH uses the call-graph to detect possibly recursive calls, and avoids saving locals for all non-recursive calls.

The asynchronous back-end is newer and therefore somewhat less complete: it does not yet handle procedure calls and floating-point computations. The latter can easily be handled with a suitable IP core containing implementations of floating-point arithmetic. Currently we handle some procedure calls by inlining. Handling function pointers requires an on-chip network, as the call will need to dynamically route the procedure arguments to the callee circuit depending on the run-time value of the pointer.

3. ASH VERSUS SUPERSCALAR

This section is devoted to a comparison of the properties of ASH and superscalar processors. The comparison is performed by executing whole programs using timing-accurate simulators. Since there are many parameters, one should see this comparison as a limit study.
This study can also be interpreted as being the first head-to-head comparison between an unlimited-resource static dataflow machine and a superscalar processor. Interestingly enough, despite the limited resources of the superscalar, some of its capabilities give it a substantial edge over the static dataflow model, as shown below. These results may be a partial explanation of the demise of the dataflow model of computation, which was a very popular research subject in the seventies and eighties. But first we will briefly discuss the main source of parallelism in dataflow machines.

3.1 Dataflow Software Pipelining

A consequence of the dataflow nature of ASH is the automatic exploitation of pipeline parallelism. This phenomenon has been studied extensively in the dataflow literature under names such as dataflow software pipelining [46] and loop unraveling [33]. As the name suggests, it is closely related to software pipelining [3], a compiler scheduling algorithm used mostly for VLIW processors.

The program in Figure 3 illustrates this phenomenon. Let us assume that the multiplier implementation is pipelined with five stages. In Figure 4 we show a few consecutive snapshots of this circuit as it executes, starting with the initial snapshot in which the two MERGEs contain the initial values of i and sum. (We have implemented a tool that can automatically generate such pictures; the inputs to the tool are pictures generated by the CASH back-end and execution traces generated by the execution simulator.) In the last snapshot (6), the computation of i has already executed two iterations, and two consecutive values of i are injected into the multiplier, while the computation of sum has yet to complete its first iteration. The execution of the multiplier is thus effectively pipelined. A similar effect can be achieved in a statically scheduled computation by explicitly software-pipelining the loop, scheduling the computation of i to occur one iteration ahead of sum. Pipelining also occurs automatically (i.e., without any compiler intervention) in superscalar processors if there are enough resources to simultaneously process instructions from multiple instances of the loop body.
In practice, large loops may not be dynamically pipelined by a superscalar due to in-order instruction fetch, which can prevent some iterations from getting ahead.

Maximizing the throughput of a pipelined computation in ASH requires that the delay of all paths between different strongly connected components in the Pegasus graph be equal. CASH inserts FIFO elements to achieve this, a transformation closely related to pipeline balancing in static dataflow machines [46] and slack matching in asynchronous circuits [70]. The FIFO elements correspond to the reservation stations in superscalar designs, and to the rotating registers in software pipelining.

3.2 ASH Versus Superscalar

For comparing ASH with a superscalar we make the following assumptions: (1) all arithmetic operations have the same latencies on both computational fabrics; (2) MUXes, MERGEs and Boolean operations in ASH have latencies proportional to the log of the number of inputs; (3) ETA has the same latency as an addition; (4) memory operations in ASH incur an additional cost for network arbitration compared to the superscalar; (5) the memory hierarchy used for both models is identical: an LSQ and a two-level cache hierarchy. (For this study we use a very similar LSQ for both ASH and the superscalar. As future work we are exploring the synthesis of program-specific LSQ structures.) The superscalar is a 4-way out-of-order SimpleScalar simulation [2] with the PISA instruction set, using gcc -O as a compiler. ASH is simulated using a high-level simulator which is automatically generated by CASH, as shown in Figure 1. We cannot simulate the execution of libraries in ASH (unless we supply them to the compiler as source code), and thus we have instrumented SimpleScalar to ignore their execution time, in order to have a fair comparison.

Naively, one would expect ASH to execute programs strictly faster than the superscalar (assuming comparable compiler technology) since it benefits from (a) unlimited parallelism, (b) no resource constraints, (c) no instruction fetch/decode/dispatch, and (d) dynamic scheduling.
Simulating whole programs from SpecInt95 under these assumptions results in two programs (099.go and 132.ijpeg) showing a 25% improvement on ASH, while the other programs are between 0% and 40% slower. The speed-ups on ASH are attributable to the increased ILP due to the unlimited number of functional units (for these benchmarks the instruction cache of the processor did not seem to be a bottleneck). In the next section we investigate the slowdowns.

3.3 Superscalar Advantages

In order to understand the advantages of the superscalar processor we have carried out a detailed analysis of code fragments which perform especially poorly on ASH. The main tool we have used for this purpose is the dynamic critical path [43]. In ASH the dynamic critical path is a sequence of last-arrival events. An event is last-arrival if it is the one that enables the computation of a node to proceed. Events correspond to signal transitions on the graph edges. The dynamic critical path is computed by tracing the edges corresponding to last-arrival events backwards from the last operation executed. Most often a last-arrival edge is the last input arriving at an operation. However, for lenient operations (see Section 4.1.2), the last-arrival edge is the edge enabling the computation of the output. Sometimes all the inputs may be present but an operation may be unable to compute because it has not received the acknowledgment signal for its previous computation; in this case the ack is the last-arrival event.

Despite the fact that the superscalar has to time-multiplex a small number of computational units, some of the mechanisms it employs provide clear performance advantages. Below is a brief summary of our findings.

Branch prediction: the ability of a superscalar to predict branch outcomes radically changes the structure of the dynamic dependences: for example, a correctly predicted branch is dynamically independent of the actual branch condition computation.
Unless the in-order commit stage (or some other structural hazard of the processor pipeline) is a bottleneck, the entire computation of the branch condition is removed from the critical path. In contrast, in ASH inter-hyperblock control transfers are never speculative. Often the ETA control predicate computation is on the critical path, e.g., when there is no computation to overlap with the branch condition evaluation, such as in control-intensive code. Such code fragments may be executed faster on a processor. Some branches, such as those testing exceptional conditions (e.g., introduced by the use of assert statements), are never taken, and thus the processor branch prediction does a very good job of handling them. These cases are especially detrimental to ASH. We note that good branch prediction requires global information, aggregating information from multiple branches, and would be very challenging to implement efficiently in Spatial Computation.

Figure 4: Snapshots of the execution of the circuit in Figure 3. The shaded nodes are actively computing; they also indicate the current value of their output latch. We are assuming a 5-stage pipelined multiplier (each stage shown separately); we assume all nodes in these graphs have the same latencies, except the Boolean negation, which takes zero time units (our implementation folds the inverter into the destination pipeline stage). In the last snapshot, two different values of i are simultaneously present in the multiplier pipeline.

Synchronization: MERGE and MUX operations have a non-zero cost, and may translate into overhead in ASH. These operations correspond to labels in machine code, i.e., control-flow join points, which have a zero execution cost on a CPU. This phenomenon is another facet of the tension between synchronization and parallelism. While a processor uses a program counter to sequence through the program, ASH relies on completely distributed control. MERGE and MUX operations are very simple forms of synchronization, used to merge several candidate values for a variable. Thus, the fine-grained parallelism of dataflow requires additional synchronization. This occurs even when the dataflow machine is not executed by an interpreter, but is directly mapped to hardware.

Distance to memory: a superscalar contains a limited number of load/store execution units (usually two). In-flight memory access instructions have to be dynamically scheduled to access these units, but once they get hold of a unit they can initiate a memory access for a constant cost. (For example, the use of an on-processor LSQ allows write operations to complete in essentially zero time.) In contrast, in ASH each memory access operation is synthesized as a distinct hardware entity. Since our current implementation uses a monolithic memory, ASH requires the use of a network to connect the operations to memory. One such network implementation is described in Section 4.1.3. This network requires arbitration for the limited number of memory ports; the total arbitration cost is O(log(n)) (n being the number of memory operations in the program).
The wire length of such a network grows as O(√n). The impact of the complexity of the memory network can be somewhat reduced by fragmenting memory into independent banks connected by separate networks, as we plan to do in future work. Note that the asymptotic complexity of the memory itself has the same behavior: the decoders and selectors for a memory of n bits require O(log n) stages; the worst-case wire length is O(√n). This explains why memory systems grow intrinsically slower than processors in speed: today's memories are also bound by wire delays [4]. While ASH addresses some shortcomings of superscalar processors, it does not directly aim to solve the memory bottleneck problem; both models of computation attack this problem by trying to overlap memory stall time with useful computation.

Static vs. dynamic dataflow: in ASH, at most one instance of an operation may be executing at any given time, because each operation has a single output latch for storing the result. In contrast, a superscalar processor may have multiple instances of any instruction in flight at once, because the register renaming mechanism effectively provides a different storage element for each instance of an in-flight instruction. The only instruction which cannot be effectively pipelined without major changes in implementation is the LOAD: such operations have to wait for the memory access to complete before initiating a new access. A local reorder buffer could be employed for this purpose, but it deviates from the spirit of ASH. In ASH, loop unrolling and pipelining can sometimes provide results similar to the full dynamic dataflow model of superscalars, but they are less general, since they are performed statically: we have seen instances where the CPU's dynamic renaming outperformed the static version of the code for some input set.

Strict procedures: our current implementation of procedures relies on CALL nodes which are strict, i.e., to initiate a procedure all inputs to the node must be available.
The fact that all inputs must be present before initiating a call introduces additional synchronization and puts the slowest argument computation on the dynamic critical path. When applicable, procedure inlining eliminates this problem, as the procedure call network is specialized to become simple point-to-point channels. In contrast, on a superscalar processor, procedure invocation is decoupled from the passing of arguments (which are put into registers or on the stack) and the call is simply a branch. Thus, the code computing the procedure arguments does not need to complete before the procedure body is initiated. In fact, the computation of an unused procedure argument is never on the critical path.

The issues discussed above seem fundamental to ASH. Other shortcomings of ASH are attributable to policies in our compiler, and could be corrected by a more careful implementation.

4. FROM C TO LAYOUT

In this section we describe how Pegasus is translated to asynchronous circuits and we present detailed measurements of the synthesized circuits. We also discuss the reasons for the excellent power efficiency of our circuits.

4.1 CAB: The CASH Asynchronous Back-end

The asynchronous back-end of CASH translates the Pegasus representations into asynchronous circuits. The static dataflow machine semantics of Pegasus makes such a translation fairly straightforward. More details about this process are available in [07].

Figure 5: Signaling protocol between data producers and consumers (a data bus, a ready wire, and an acknowledgment wire between the source register and the destination).

Figure 6: Control circuitry for a pipeline stage. The delay element is matched to the computational unit. The block labeled with "=" is a completion detection block, detecting when the register output has stabilized to a correct value.

Figure 7: Sample program fragment (if (x > 0) y = -x; else y = b*x;) and the corresponding Pegasus circuit, with the static critical path highlighted.

4.1.1 Synthesizing Scalar Computations

Pegasus representations could be mapped to asynchronous circuits in many ways. We have chosen to implement each Pegasus node as a separate hardware structure. Each IR node is implemented as a pipeline stage, using the micropipeline circuit style introduced by Sutherland [98]. Each pipeline stage contains an output register which is used to hold the result of the stage computation. Each edge is synthesized as a channel consisting of three uni-directional signals, as shown in Figure 5: (1) a data bus transfers the data from producer to consumer; (2) a data ready wire from producer to consumer indicates when the data can be safely used by the consumer; (3) an acknowledgment wire, from consumer to producer, indicates when the value has been used and the channel is free. This signaling method, called the bundled data protocol, is widely employed in asynchronous circuits. The control circuitry driving a pipeline stage is shown in Figure 6. The C gate is a Müller C-element [8], which implements the finite-state machine control for properly alternating data ready and acknowledgment signals. When there are multiple consumers, the data bus is used to broadcast the value to all of them, and the channel contains one acknowledgment wire from each consumer. Due to the SSA form of Pegasus, each channel has a single writer. Therefore, there is no need for arbitration, making data transfer a lightweight operation.
Perhaps the most important feature of our implementation is the complete absence of any global control structures. Control is completely embodied in the handshaking signals naturally distributed within the computation. This gives our circuits a very strong datapath orientation, making them amenable to efficient layout. (Unlike Sutherland's micropipelines, which used a 2-phase signalling protocol [2], our stages use 4-phase signaling, in which each signal returns to zero before initiating a new computation cycle.)

4.1.2 Lenient Evaluation

The form of speculative execution employed by Pegasus, which executes all forward branches simultaneously, alleviates the impact of branches, but may be plagued by the problem of unbalanced paths [8], as illustrated in Figure 7: the static critical path of the entire construct is the longest of the critical paths. If the short path is executed frequently, the benefits of speculation may be negated by the cost of the long path. This problem also occurs for machines which employ predicated execution. Traditionally this problem is addressed in two ways: (1) using profiling, only hyperblocks which ensure that the long path is most often executed at run-time are predicated; or (2) excluding certain hyperblock topologies from consideration, disallowing the predication of paths which differ widely in length. Because there is no single PC, we can employ a third, more elegant solution in hardware, by using leniency [9] to solve this problem. By definition, a lenient operation expects all of its inputs to arrive eventually, but it can compute its output using only a subset of its inputs. Lenient operators generate a result as soon as possible. For example, an AND operation can determine that its output is false as soon as one of its inputs is false. While the output can be available before all inputs, our implementation ensures that a lenient operation sends an acknowledgment only after all of its inputs have been received. To obtain the full benefit of leniency one also needs to issue early acknowledgments, as suggested in [4].
In the asynchronous circuits literature, leniency was proposed under the name early evaluation [85]. Forms of lenient evaluation have also been used in the design of arithmetic units for microprocessors: for example, some multiplier designs may generate the result very quickly when an input is zero. MUXes are also implemented leniently: as soon as a selector is true and the corresponding data is available, a MUX generates its output. Note that it is crucial for the MUX to be decoded (see Section 2.2) in order for this scheme to work efficiently. A result of leniency is that the dynamic critical path is the same as in a non-speculative implementation. For example, if the multiplication in Figure 7 is not used, it does not affect the critical path.5 In addition to Booleans and multiplexors, all predicated operations are lenient in their predicate input. For example, if a LOAD operation receives a false predicate input, it can immediately emit an arbitrary output, since the actual output is irrelevant. It cannot, however, output a token until it receives its input token, since memory dependences are transitively implied. The irrelevant output will be discarded downstream by a MUX or ETA node controlled by a false predicate.6

4 Lenient evaluation should not be confused with short-circuit evaluation: a short-circuit evaluation of an AND always evaluates the left operand, and if this one is true, it also evaluates the right one. However, a lenient evaluation generates a false result as soon as either input is known to be false.

5 The multiplier can still be on the critical path because of its late acknowledgments, which may prevent the next wave of computation from propagating forward, as described in Section 3.3. This problem can be alleviated either by using a pipelined multiplier, or by using early acknowledgments [4].

Figure 8: Memory access network and implementation of the value and token forwarding network. The LOAD produces a data value consumed by the oval node. The STORE node may depend on the load (i.e., we have a token edge between the LOAD and the STORE, shown as a dashed line). The token travels to the root of the tree, which is a load-store queue (LSQ).

4.1.3 Memory Access

The most complicated part of the synthesis process is building the network used by the LOAD and STORE operations to access memory. Figure 8 illustrates how a load and a dependent store access memory through this network. Our current implementation consists of a hierarchy of buses and asynchronous arbiters used to mediate access to the buses. Memory instructions which are ready to access memory compete for these buses; the winners of the arbitration inject messages which travel up the hierarchy in a pipelined fashion. A memory operation can produce a token as soon as its effect is guaranteed to occur in the right order with respect to the potentially interfering operations. The network does not guarantee in-order message delivery, so by traveling to the root we maintain the invariant that a dependent operation will be issued only after all operations on which it depends have injected their requests in the LSQ. The root of the tree is a unique serialization point, guaranteeing in-order execution of dependent operations. The LSQ holds the description of the memory operations under execution until their memory effects are completed; it may also perform dynamic disambiguation and act as a small fully-associative cache. In Section 4.2.2 we discuss some disadvantages of this implementation. We currently synthesize a very simple load-store queue (LSQ), which can hold a single operation until its execution completes.
It is worthwhile to notice that this implementation of the memory access network is very much in the spirit of ASH, being completely distributed, composed entirely of pipeline stages, and using only control localized within each stage; it contains no global signals of any kind.

4.2 Low-level Evaluation

In this section we present measurements from a detailed low-level simulation. We synthesize C kernels into ASICs and evaluate their performance on standard data sets. Since CAB generates synthesizable Verilog, FPGAs could in principle be targeted for evaluation. There are two factors that prevent us from doing so: (1) commercial FPGAs are synchronous devices, and mapping some of the features of our asynchronous circuits would be very inefficient;6 (2) commercial FPGAs are not optimized for power [47]; they would thus probably negate one of the main advantages of our implementation scheme, the very low power consumption. We use kernels from the Mediabench suite [66] to generate circuits. From each program we select one hot function (see Table 1) to implement in hardware. (The only exception are the g721 benchmarks, for which the hot function was very small, so we selected the function and one of its callers: we inlined the callee, unrolled the resulting loop, and substituted references to an array of constants with inline constant values. The same code was used on the SimpleScalar simulator in comparisons.) The experimental results presented below are for the entire circuit synthesized by CAB, including the memory access network, but excluding the memory itself or I/O to the circuit. We report data only for the execution of each kernel, ignoring the rest of the program; due to long simulation times, we execute each kernel for the first three invocations in the program and we measure the cumulative values (time, energy, etc.) for all three invocations.

6 The predicated-false operation does not need to swing the output lines, it need only assert the data ready signal (see Section 4.1.1). This will decrease the power consumption.
We do not estimate the overhead of invoking and returning from the kernel, since in this work we aim to understand the behavior of ASH, and not of a whole CPU+ASH system. Since our current back-end does not support the synthesis of floating-point computation we had to omit some kernels, such as the ones from the epic, rasta and mesa benchmarks. The CAB back-end is used to generate a Verilog representation of each kernel. A detailed description of our methodology can be found in [107]. We use a 180nm/2V standard-cell library from STMicroelectronics, optimized for performance. The structural Verilog generated by our tool flow is partially technology-mapped by CAB and partially synthesized with Synopsys Design Compiler SP2. The technology-mapped circuits are placed-and-routed with Silicon Ensemble 5.3 from Cadence. Currently the placement is handled completely by Silicon Ensemble, operating on a flat netlist; we expect that CAB can use knowledge of the circuit structure to automatically generate floor-plans which can improve our results substantially.7 Data collection with the commercial CAD tools for both pegwit benchmarks failed after placement, so we present pre-placement numbers for these. (The performance for the other benchmarks is about 5% better than their pre-placement estimate.) Simulation is performed with Modeltech ModelSim SE 5.7. We assume a perfect L1 cache, with a 600MHz cycle time. We synthesize a one-element LSQ for ASH. Compilation time is on the order of tens of seconds for all these benchmarks, and is thus completely inconsequential compared to hardware synthesis through the commercial tool-chain (the worst-case program takes about 30s through CASH, one hour through synthesis, and more than five hours for place-and-route). The code expansion in terms of lines of code from C to Verilog is a factor of 200x. All the results in this section are obtained without loop unrolling, which can increase circuit area and compilation time.

4.2.1 Area

Figure 9 shows the area required for each of these kernels. The area is broken down into computation and memory tree.
The memory tree is the area of the arbiter circuits used to mediate access to memory and of the hierarchical pipelined buses. For reference, in the same technology a minimal RISC core can be synthesized in 1.3mm², a 16x16 multiplier requires 0.1mm², and a complete Pentium 4 processor die, including all caches, has 217mm². This shows that while the area of our kernels is sometimes substantial, it is certainly affordable, especially in future technologies. Normalizing the area versus the object file size, we require on average 0.87mm²/KB of a gcc-generated MIPS object file.

7 Good placement and physical optimizations can account for as much as a factor of 4x in size and 2.3x in performance [36].

Table 1: Embedded benchmark kernels used for the low-level measurements and their size in original (un-processed) source lines of code. For g721 the function quan was inlined into fmult.

Benchmark   Function                          Lines
adpcm_d     adpcm_decoder                     80
adpcm_e     adpcm_coder                       103
g721_d      fmult+quan                        4
g721_e      fmult+quan                        4
gsm_d       Short_term_synthesis_filtering    24
gsm_e       Short_term_analysis_filtering     45
jpeg_d      jpeg_idct_islow                   24
jpeg_e      jpeg_fdct_islow                   44
mpeg2_d     idctcol                           55
mpeg2_e     dist1                             92
pegwit_d    squareDecrypt                     78
pegwit_e    squareEncrypt                     77

Figure 9: Silicon real-estate in mm² for each kernel.

Figure 10: Kernel slowdown compared to a 4-wide issue 600MHz superscalar processor in 180nm. A value of 1 indicates identical performance; values bigger than 1 indicate slower circuits on ASH.

4.2.2 Execution performance

Figure 10 shows the normalized execution time of each kernel against a baseline 600MHz, 4-wide superscalar processor. While we did not simulate a VLIW, we expect the trends to be similar, since the superscalar maintains a high IPC for these kernels. The processor has the same perfect L1 cache, but a 32-element LSQ. On average, ASH circuits are 1.33 times slower, but 4 kernels are faster than on the processor. Given the unlimited amount of ILP that can be exploited by ASH, these results are somewhat disappointing. An analysis of ASH circuits has pointed out that, although these circuits can be improved in many respects, the main bottleneck of our current design is the memory access protocol. In our current implementation, as described in Section 4.1.3, a memory operation does not release a token until its request has reached memory (i.e., the token must traverse the network end-to-end in both directions). An improved construction would allow an operation to (1) inject requests in the network, allowing them to travel out-of-order, and (2) release the token to the dependent operations immediately.
The network packet can carry enough information to enable the LSQ to buffer out-of-order requests and to execute the memory operations in the original program order. This kind of protocol is actually used by superscalar processors, which inject requests in order in the load-store queue, and can proceed to issue more memory operations before the previous ones have completed.

Figure 11: Evaluating the impact of an ideal memory interconnection protocol. The left bar reproduces the data from Figure 10.

To gauge the impact of the memory network on program performance, we performed a limit study using a behavioral implementation of the network in which each stage has zero latency. The improvement in performance is shown in Figure 11: programs having large memory access networks in Figure 9 display significant improvements (up to 8x for pegwit), which shows that programs which perform many memory accesses are bound by the memory network round-trip time. These numbers are obtained assuming that both value and token travel very quickly through the network; in reality, we can only substantially speed up the token path, so the performance of a better protocol still has to be evaluated. In Figure 12 we measure ASH performance using several MOPS metrics: the bottom bar we labeled MOPS, for millions of useful arithmetic operations per second. The incorrectly speculated arithmetic is accounted for as MOPSspec. Finally, MOPSall includes auxiliary operations, including the MERGE, ETA, MUX, COMBINE, pipeline-balancing FIFOs, and other overhead operations. Although speculative execution sometimes dominates useful work (e.g., g721), on average 1/3 of the executed arithmetic operations are incorrectly speculated.8 For some programs the control operations constitute a substantial fraction of the total number of executed operations. On average our programs sustain 1 GOPS.

8 The compiler can control the degree of speculation by varying the hyperblock construction algorithm; we have not yet explored the trade-offs involved.

Figure 14: Energy efficiency of several computational models using comparable technologies (MOPS/mW or OP/nJ): dedicated hardware, ASH media kernels, asynchronous microcontroller, FPGA, general-purpose DSP, microprocessors.

Figure 12: Computational performance of the ASH Mediabench kernels, expressed in millions of operations per second (MOPS).

Figure 15: Energy-delay ratio between ASH and superscalar.

Figure 13: Energy efficiency of the synthesized kernels, expressed in useful arithmetic operations per nanojoule.

4.2.3 Power and Energy

The power consumption of our circuits ranges between 5.5mW (g721) and 35.9mW (jpeg_d), with an average of 9.3mW. In contrast, a very low power DSP chip in the same technology draws about 110mW. In order to quantify energy efficiency we use the normalized metric proposed by Claasen [29], MOPS/mW or, equivalently, operations/nanojoule (10^6 operations per second per 10^-3 watt is 10^9 operations per joule, i.e., 1 operation/nJ). Figure 13 shows how our kernels fare from this perspective. We count only the useful operations in the MOPS metric, i.e., the bottom bar from Figure 12. The energy efficiency for CASH systems is between 5 and 280 operations/nJ, with an average of 52. g721_d and g721_e are outliers because they do not make any memory accesses, and their implementation is thus extremely efficient. For comparison, Figure 14 shows, on a logarithmic scale, the energy efficiency of microprocessors, digital signal processors, custom hardware circuits from [6], an asynchronous microprocessor [77], and FPGAs. All these circuits use comparable hardware technologies (180 and 250nm). ASH is three to four orders of magnitude better than a superscalar microprocessor and between one and two orders of magnitude better than a DSP. Figure 15 compares ASH with a 4-wide low-power superscalar (modeled with Wattch [5], using aggressive clock-gating for power reduction, and drawing around 5W dynamic power) in terms of energy-delay [53], a metric which is relatively independent of execution speed. ASH circuits are between 6 and 3600 times better. The circuits without memory access are again substantially better (one order of magnitude) than the others.
Power is roughly proportional to the circuit activity; the power wasted on speculation should be proportional to the number of speculative operations executed. Currently we believe that we should negotiate the power-performance trade-off in the direction of expending more power to increase performance: since power is very low, some increase in power is acceptable even for a relatively small increase in performance.

4.2.4 Discussion

There are multiple sources of inefficiency in the implementation of superscalar processors, which account for the huge difference in power and energy. (1) The clock distribution network of a large chip accounts for up to 50% of the total power budget; while some of this power is spent on the latches, which exist also in asynchronous implementations [2], the big clock network and its huge drivers still account for a substantial fraction of the power. (2) Most of the pipeline stages in ASH are usually inactive (see for example Figure 4). Since these circuits are asynchronous, they draw only static power when inactive, which is very small in a 0.18µm technology. While leakage power is a big issue in future technologies, there are many circuit- and architecture-level techniques [103] that can be applied to reduce its impact in ASH. (3) On the Pentium 4 die, even excluding caches, all functional units combined (integer, floating-point and MMX) together take less than 10% of the chip area. The remaining 90% is devoted entirely to mechanisms that only support the computation, without producing useful results. From the point of view of the energy efficiency metric we use, all of this extra activity is pure overhead.

(4) [77] suggests that more than 70% of the power of the asynchronous Lutonium processor is spent in instruction fetch and decode. This shows that there is an inherent overhead for interpreting programs encoded in machine code, which ASH does not have to pay. (5) Finally, in a microprocessor, and even in many ASICs, functional units are heavily optimized for speed at the expense of power and area. Since SC replicates functional units lavishly, the trade-off has to be biased in the reverse direction. It has been known that dedicated hardware chips can be vastly more energy-efficient than processors [6]. But most often the algorithms implemented in dedicated hardware have a naturally high degree of parallelism. This paper shows for the first time that a fully automatic tool-chain starting from C programs can generate results comparable in speed with high-end processors and in power with custom hardware. We are confident that further optimizations, including circuit-level techniques, high-level compiler transformations, a better memory access protocol, and compiler-guided floor-planning, will substantially increase ASH's performance.

5. RELATED WORK

Optimizing compilers. Pegasus is a form of dataflow intermediate language, an idea pioneered by Dennis [39]. The crux of Pegasus is handling memory dependences without excessively inhibiting parallelism. The core ideas on how to use tokens to express fine-grained location-based synchronization were introduced by Beck, et al. [1]. The explicit representation of memory dependences between program operations has been suggested numerous times in the literature, e.g., Pingali's Dependence Flow Graph [83], or Steensgaard's adaptation of Value-Dependence Graphs [97]. Other researchers have also explored extending SSA to handle memory dependences, e.g., [35, 44, 64, 28, 30, 7, 72]. But none of those approaches is as simple as ours. The integration of predication and SSA has also been done in PSSA [24, 25]. PSSA, however, does not use φ functions and therefore loses some of the appealing properties of SSA.
Our use of the hyperblock as a basic optimization unit and our algorithm for computing block and path predicates were inspired by this work.

High-level synthesis. While there is a substantial amount of research on hardware synthesis from high-level languages and dialects of C and C++, none of it supports C as fully as CASH does. One major difference between our approach and the vast majority of other efforts is that we use dusty-deck C, while most other projects employ C as a hardware description language (HDL). They either add constructs (e.g., reactivity, concurrency, and variable bit-width) to C in order to make it more suitable for expressing hardware properties, and/or remove constructs (e.g., pointers, dynamic allocation and recursion) that do not correspond to natural hardware objects, obtaining Register Transfer C. Other efforts impose a strict coding discipline on C in order to obtain a synthesizable program. There are numerous research and commercial products using variants of C as an HDL [80]: PICO from HP [93], Cynlib, originally from CynApps, now at Forte Design Systems [88], Cyber from NEC [109], A|RT Builder, originally from Frontier Design, now at Xilinx [59], Scenic/CoCentric from Synopsys [69], N2C from CoWare [32], compilers for Bach-C (originally based on Occam) from Sharp [60], OCAPI from IMEC [90], Synopsys c2verilog, originally from C-Level Design [96], Celoxica's Handel-C compiler [3], Cadence's ECL (based on C and Esterel) compiler [65], SpC from Stanford [94], and also [48, 6, 54, 55]. Our goals are most closely related to the chip-in-a-day project from Berkeley [37], but our approach is very different: that project starts from parallel Statechart descriptions and employs sophisticated hard macros to synthesize synchronous designs with automatic clock gating. In contrast, we start from C, use a small library of standard cells, and build asynchronous circuits.

Reconfigurable computing.
A completely different approach to hardware design is taken in the research on reconfigurable computing, which relies on automatic compilation of C or other high-level languages to target reconfigurable hardware [2]. Notable projects are: PRISM II [0], PRISC [84], DISC [3], NAPA [49], DEFACTO [40], Chimaera [5], OneChip [4], RaPiD [4], PamDC [102], StreamC [50], WASMII [100], XCC/PACT [23], PipeRench [5], RAW/Virtual Wires [9], the systems described in [95] and [78], compilation of Term-Rewriting Systems [58], and synthesis of dataflow graphs [86]. CASH has been influenced by the work of Callahan on the GARP C compiler [22, 68]. GarpCC is still one of the few systems with the broad scope of targeting C to a reconfigurable fabric. None of these approaches targets a true Spatial Computation model, with completely distributed computation and control.

Dataflow machines. A large number of dataflow machine architectures have been proposed and built; see for example the survey in [106]. All of these were interpreters, executing programs described in a specialized form of machine code. Most of the dataflow machines were programmed in functional languages, such as VAL and Id. CASH could in principle be used to target a traditional dataflow machine. To our knowledge, none of the other efforts to translate C for execution on dataflow machines (e.g., [105]) reached completion.

Asynchronous circuit design. Traditionally, asynchronous synthesis has concentrated on synthesizing individual controllers. However, in the last decade, a number of other approaches have also explored the synthesis of entire systems. The starting point for these related synthesis flows is a high-level language (usually an inherently parallel one based on Hoare's Communicating Sequential Processes [57]) suitable for describing the parallelism of the application, e.g., Tangram [104], Balsa [42], OCCAM [79], CHP [76, 0].
With the exception of [76], synthesis tends to follow a two-step approach: the high-level description is translated in a syntax-directed fashion into an intermediate representation; then, each module in the IR is technology-mapped (template-based) into gates. An optional optimization step [27] may be introduced to improve the performance of the synthesized circuits. In [76], the high-level specification is decomposed into smaller specifications, some dataflow and local optimization techniques are applied, and the resulting low-level specifications are synthesized in a template-based fashion. Our approach is somewhat similar to the syntax-directed flows, with three notable differences. First, our input language is a well-established, imperative, sequential programming language; handling pointers and arrays is central in our approach. Second, the CASH compiler performs extensive analysis and optimization steps before generating the intermediate form. Finally, the target implementation for our intermediate form consists of pipelined circuits, which naturally increase performance when compared with traditional syntax-directed approaches.

Spatial Computation. Various models related to Spatial Computation have been investigated by other research efforts: compiling for finite-sized fabrics has been studied by research on systolic arrays [62], RAW [67], PipeRench [52] and TRIPS [89]. More remotely related are efforts such as SmartMemories [75] and Imagine [87]. Unlimited or virtualized hardware is exploited in proposals such as SCORE [26], [38], and WaveScalar [99]. Among the latter efforts, our research is distinguished by the fact that (1) it targets…


More information

Mallathahally, Bangalore, India 1 2

Mallathahally, Bangalore, India 1 2 7 IMPLEMENTATION OF HIGH PERFORMANCE BINARY SQUARER PRADEEP M C, RAMESH S, Department of Electroncs and Communcaton Engneerng, Dr. Ambedkar Insttute of Technology, Mallathahally, Bangalore, Inda pradeepmc@gmal.com,

More information

CE 221 Data Structures and Algorithms

CE 221 Data Structures and Algorithms CE 1 ata Structures and Algorthms Chapter 4: Trees BST Text: Read Wess, 4.3 Izmr Unversty of Economcs 1 The Search Tree AT Bnary Search Trees An mportant applcaton of bnary trees s n searchng. Let us assume

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings

Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings Loop Ppelnng for Hgh-Throughput Stream Computaton Usng Self-Tmed Rngs Gennette Gll, John Hansen and Montek Sngh Dept. of Computer Scence Unv. of North Carolna, Chapel Hll, NC 27599, USA {gllg,jbhansen,montek}@cs.unc.edu

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach Data Representaton n Dgtal Desgn, a Sngle Converson Equaton and a Formal Languages Approach Hassan Farhat Unversty of Nebraska at Omaha Abstract- In the study of data representaton n dgtal desgn and computer

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT Bran J. Wolf, Joseph L. Hammond, and Harlan B. Russell Dept. of Electrcal and Computer Engneerng, Clemson Unversty,

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization What s a Computer Program? Descrpton of algorthms and data structures to acheve a specfc ojectve Could e done n any language, even a natural language lke Englsh Programmng language: A Standard notaton

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Performance Evaluation

Performance Evaluation Performance Evaluaton [Ch. ] What s performance? of a car? of a car wash? of a TV? How should we measure the performance of a computer? The response tme (or wall-clock tme) t takes to complete a task?

More information

Vectorization in the Polyhedral Model

Vectorization in the Polyhedral Model Vectorzaton n the Polyhedral Model Lous-Noël Pouchet pouchet@cse.oho-state.edu Dept. of Computer Scence and Engneerng, the Oho State Unversty October 200 888. Introducton: Overvew Vectorzaton: Detecton

More information

RADIX-10 PARALLEL DECIMAL MULTIPLIER

RADIX-10 PARALLEL DECIMAL MULTIPLIER RADIX-10 PARALLEL DECIMAL MULTIPLIER 1 MRUNALINI E. INGLE & 2 TEJASWINI PANSE 1&2 Electroncs Engneerng, Yeshwantrao Chavan College of Engneerng, Nagpur, Inda E-mal : mrunalngle@gmal.com, tejaswn.deshmukh@gmal.com

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Design and Analysis of Algorithms

Design and Analysis of Algorithms Desgn and Analyss of Algorthms Heaps and Heapsort Reference: CLRS Chapter 6 Topcs: Heaps Heapsort Prorty queue Huo Hongwe Recap and overvew The story so far... Inserton sort runnng tme of Θ(n 2 ); sorts

More information

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access Agenda Cache Performance Samra Khan March 28, 217 Revew from last lecture Cache access Assocatvty Replacement Cache Performance Cache Abstracton and Metrcs Address Tag Store (s the address n the cache?

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Area Efficient Self Timed Adders For Low Power Applications in VLSI

Area Efficient Self Timed Adders For Low Power Applications in VLSI ISSN(Onlne): 2319-8753 ISSN (Prnt) :2347-6710 Internatonal Journal of Innovatve Research n Scence, Engneerng and Technology (An ISO 3297: 2007 Certfed Organzaton) Area Effcent Self Tmed Adders For Low

More information

CHARUTAR VIDYA MANDAL S SEMCOM Vallabh Vidyanagar

CHARUTAR VIDYA MANDAL S SEMCOM Vallabh Vidyanagar CHARUTAR VIDYA MANDAL S SEMCOM Vallabh Vdyanagar Faculty Name: Am D. Trved Class: SYBCA Subject: US03CBCA03 (Advanced Data & Fle Structure) *UNIT 1 (ARRAYS AND TREES) **INTRODUCTION TO ARRAYS If we want

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware

Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware Draft submtted for publcaton. Please do not dstrbute Usng Delayed Addton echnques to Accelerate Integer and Floatng-Pont Calculatons n Confgurable Hardware Zhen Luo, Nonmember and Margaret Martonos, Member,

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction Chapter 2 Desgn for Testablty Dr Rhonda Kay Gaede UAH 2 Introducton Dffcultes n and the states of sequental crcuts led to provdng drect access for storage elements, whereby selected storage elements are

More information

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach Dstrbuted Resource Schedulng n Grd Computng Usng Fuzzy Approach Shahram Amn, Mohammad Ahmad Computer Engneerng Department Islamc Azad Unversty branch Mahallat, Iran Islamc Azad Unversty branch khomen,

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS CACHE MEMORY DESIGN FOR INTERNET PROCESSORS WE EVALUATE A SERIES OF THREE PROGRESSIVELY MORE AGGRESSIVE ROUTING-TABLE CACHE DESIGNS AND DEMONSTRATE THAT THE INCORPORATION OF HARDWARE CACHES INTO INTERNET

More information

CHAPTER 4 PARALLEL PREFIX ADDER

CHAPTER 4 PARALLEL PREFIX ADDER 93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments Comparson of Heurstcs for Schedulng Independent Tasks on Heterogeneous Dstrbuted Envronments Hesam Izakan¹, Ath Abraham², Senor Member, IEEE, Václav Snášel³ ¹ Islamc Azad Unversty, Ramsar Branch, Ramsar,

More information

Real-Time Guarantees. Traffic Characteristics. Flow Control

Real-Time Guarantees. Traffic Characteristics. Flow Control Real-Tme Guarantees Requrements on RT communcaton protocols: delay (response s) small jtter small throughput hgh error detecton at recever (and sender) small error detecton latency no thrashng under peak

More information

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements Explct Formulas and Effcent Algorthm for Moment Computaton of Coupled RC Trees wth Lumped and Dstrbuted Elements Qngan Yu and Ernest S.Kuh Electroncs Research Lab. Unv. of Calforna at Berkeley Berkeley

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation Precondtonng Parallel Sparse Iteratve Solvers for Crcut Smulaton A. Basermann, U. Jaekel, and K. Hachya 1 Introducton One mportant mathematcal problem n smulaton of large electrcal crcuts s the soluton

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Introduction to Programming. Lecture 13: Container data structures. Container data structures. Topics for this lecture. A basic issue with containers

Introduction to Programming. Lecture 13: Container data structures. Container data structures. Topics for this lecture. A basic issue with containers 1 2 Introducton to Programmng Bertrand Meyer Lecture 13: Contaner data structures Last revsed 1 December 2003 Topcs for ths lecture 3 Contaner data structures 4 Contaners and genercty Contan other objects

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Newton-Raphson division module via truncated multipliers

Newton-Raphson division module via truncated multipliers Newton-Raphson dvson module va truncated multplers Alexandar Tzakov Department of Electrcal and Computer Engneerng Illnos Insttute of Technology Chcago,IL 60616, USA Abstract Reducton n area and power

More information

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL)

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL) Crcut Analyss I (ENG 405) Chapter Method of Analyss Nodal(KCL) and Mesh(KVL) Nodal Analyss If nstead of focusng on the oltages of the crcut elements, one looks at the oltages at the nodes of the crcut,

More information

Random Kernel Perceptron on ATTiny2313 Microcontroller

Random Kernel Perceptron on ATTiny2313 Microcontroller Random Kernel Perceptron on ATTny233 Mcrocontroller Nemanja Djurc Department of Computer and Informaton Scences, Temple Unversty Phladelpha, PA 922, USA nemanja.djurc@temple.edu Slobodan Vucetc Department

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introducton 1.1 Parallel Processng There s a contnual demand for greater computatonal speed from a computer system than s currently possble (.e. sequental systems). Areas need great computatonal

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Brave New World Pseudocode Reference

Brave New World Pseudocode Reference Brave New World Pseudocode Reference Pseudocode s a way to descrbe how to accomplsh tasks usng basc steps lke those a computer mght perform. In ths week s lab, you'll see how a form of pseudocode can be

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont)

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont) Loop Transformatons for Parallelsm & Localty Prevously Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Loop nterchange Loop transformatons and transformaton frameworks

More information

Storage Binding in RTL synthesis

Storage Binding in RTL synthesis Storage Bndng n RTL synthess Pe Zhang Danel D. Gajsk Techncal Report ICS-0-37 August 0th, 200 Center for Embedded Computer Systems Department of Informaton and Computer Scence Unersty of Calforna, Irne

More information

Programming FPGAs in C/C++ with High Level Synthesis PACAP - HLS 1

Programming FPGAs in C/C++ with High Level Synthesis PACAP - HLS 1 Programmng FPGAs n C/C wth Hgh Level Synthess PACAP - HLS 1 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area

More information

A RECONFIGURABLE ARCHITECTURE FOR MULTI-GIGABIT SPEED CONTENT-BASED ROUTING. James Moscola, Young H. Cho, John W. Lockwood

A RECONFIGURABLE ARCHITECTURE FOR MULTI-GIGABIT SPEED CONTENT-BASED ROUTING. James Moscola, Young H. Cho, John W. Lockwood A RECONFIGURABLE ARCHITECTURE FOR MULTI-GIGABIT SPEED CONTENT-BASED ROUTING James Moscola, Young H. Cho, John W. Lockwood Dept. of Computer Scence and Engneerng Washngton Unversty, St. Lous, MO {jmm5,

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) ,

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) , VRT012 User s gude V0.1 Thank you for purchasng our product. We hope ths user-frendly devce wll be helpful n realsng your deas and brngng comfort to your lfe. Please take few mnutes to read ths manual

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information