Spatial Computation
Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea and Seth Copen Goldstein
Carnegie Mellon University

ABSTRACT

This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units.

In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration; the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient.

In this work we demonstrate three features of ASH: (1) that such architectures can be built by automatic compilation of C programs; (2) that distributed computation is in some respects fundamentally different from monolithic superscalar processors; and (3) that ASIC implementations of ASH use three orders of magnitude less energy compared to high-end superscalar processors, while being on average only 33% slower in performance (3.5x worst-case).

Categories and Subject Descriptors: B.2.4 arithmetic and logic: cost/performance; B.6.3 automatic synthesis: optimization, simulation; B.7.1 algorithms implemented in hardware; B.7.2 simulation; C.1.3 dataflow architectures: hybrid systems; D.3.2 data-flow languages; D.3.4 code generation: compilers, optimization

General Terms: Measurement, Performance, Design.

Keywords: spatial computation, dataflow machine, application-specific hardware, low-power.
1. INTRODUCTION

The von Neumann computer architecture [08] has proven to be extremely resilient despite numerous perceived shortcomings [7]. Computer architects have continuously enhanced the structure of the central processing unit, taking advantage of Moore's law. Today's superpipelined, superscalar, out-of-order microprocessors are amazing achievements. However, the future scalability of superscalar (and even VLIW) architectures is questionable. Attempting to increase the pipeline width of the processor beyond the current four or five instructions per cycle is difficult since the interconnection networks scale superlinearly. The register file, instruction issue logic, and pipeline forwarding networks grow quadratically with issue width, making the interconnection latency the limiting factor [2]. This problem is compounded by increasing clock rates and shrinking technologies: currently, signal propagation delays on inter-module wires dominate logic delays [56]. Just the distribution of the clock signal is a major undertaking [0]. Wire delays are not the only factor in the way of scaling: power consumption and power density have reached dangerous levels, due to increased amounts of speculative execution, increased logic density and wide issue. Design complexity is yet another limitation: while the number of available transistors grows by 58% annually, designer productivity grows by only 21% []. This exponentially increasing productivity gap has historically been covered by employing larger and larger design and verification teams, but human resources are economically hard to scale.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ASPLOS'04, October 7–13, 2004, Boston, Massachusetts, USA. Copyright 2004 ACM...$5.00.
The research presented in this paper is aimed directly at these problems. We explore Spatial Computation, which is a model of computation optimized for wires. We have previously proposed to use Spatial Computation for mapping programs to nanofabrics [5]; in this paper we evaluate the compiler technology we developed for nanofabrics on a traditional CMOS substrate. Since the class of circuits one could call "spatial" is arguably very large, we focus our attention on a particular set of instances of SC structures, which we call Application-Specific Hardware (ASH). ASH requires no clocks, nor any global signals. The core assumption is that computation gates are cheap, and will become even cheaper compared to the cost of wires (in terms of delay, power and area). ASH is an extreme point in the space of SC architectures: in ASH computation structures are never shared, and each program operation is synthesized as a different functional unit.

We present a complete compiler/CAD tool-chain that bridges both software compilation and microarchitecture. Applications written in high-level languages are compiled into hardware descriptions. These descriptions can either be loaded onto a reconfigurable hardware fabric or synthesized directly into circuits. The resulting circuits use only localized communication, require neither broadcast nor global control, and are self-synchronized. The compiler we have developed is automatic, fast, requires no designer intervention, and exploits instruction-level parallelism (ILP) and pipelining.

The novel research contributions described in this paper are: (1) a compiler tool-chain from ANSI C to asynchronous hardware;

Figure 1: Toolflow used for evaluation (C source, Suif front-end, CFG, Pegasus, asynchronous back-end, Verilog, commercial CAD, ASIC layout; CASH high-level simulation and ModelSim circuit simulation). The right-hand side indicates the sections in this paper that discuss each of the building blocks.

Figure 2: ASH used in tandem with a processor for implementing whole applications. The processor is relegated to low-ILP program fragments and to executing the operating system.

(2) a qualitative comparison of Spatial Computation architectures and superscalar processors; (3) a circuit-level evaluation of the synthesized circuits on C program kernels from the Mediabench suite; (4) a description of a high-level synthesis toolflow that produces extremely energy-efficient program implementations: comparable with custom hand-tuned hardware designs, and three orders of magnitude better than superscalar processors; (5) the first implementation of a C compiler that can target dataflow machines.

2. COMPILING C TO HARDWARE

This section presents our compilation methods as embodied in the CASH compiler (Compiler for ASH). The structure of CASH, and its place within a complete synthesis tool-flow, are illustrated in Figure 1, which also shows the organization of this paper. The circuits generated by CASH cannot handle system calls. For translating whole applications we assume that hardware-software partitioning is performed first, and that part of the application is executed on a traditional processor (e.g., I/O), while the rest is mapped to hardware, as shown in Figure 2. The processor and the hardware have access to the same global memory space, and there is some mechanism to maintain a coherent view of memory. Crossing the hardware-software interface can be hidden by employing a stub compiler, which encapsulates the information transmitted across the interface, as we have proposed in [20], effectively performing Remote Procedure Calls across the hw/sw interface.

2.1 CASH

CASH takes ANSI C as input.
CASH represents the input program using Pegasus [6, 7], a dataflow intermediate representation (IR). The output of CASH is a hardware dataflow machine which directly executes the input program. Currently CASH generates a structural Verilog description of the circuits. CASH has a C front-end based on the Suif compiler []. The front-end performs some optimizations (including procedure inlining, loop unrolling, call-graph computation, and basic control-flow optimizations), intraprocedural pointer analysis, and live-variable analysis. Then the front-end translates the low-Suif intermediate representation into Pegasus. Next, CASH performs a wealth of optimizations on this representation, including scalar, memory and Boolean optimizations. Finally, a back-end performs peephole optimizations and generates code.

The translation of C into hardware is eased by maintaining the same memory layout for all program data structures as implemented in a classical CPU-based system (the heap structure is practically identical, but CASH uses less stack space, since it never needs to spill registers). ASH currently uses a single monolithic memory for this purpose (see Section 4.1.3). There is nothing intrinsic in Spatial Computation that mandates the use of a monolithic memory; on the contrary, using several independent memories (as suggested for example in [94, 9]) would be very beneficial.

2.2 The Pegasus Intermediate Representation

The key technique allowing us to bridge the semantic gap between imperative languages and asynchronous dataflow is Static Single Assignment (SSA) [34]. SSA is an IR for imperative programs in which each variable is assigned only once. As such, it can be seen as a functional program [5]. Pegasus represents the scalar part of the computation of C programs as SSA. Due to space limitations we only briefly describe Pegasus; see [6, 7] for more details. Pegasus seamlessly extends SSA, representing memory dependences, predication, and (forward) speculation in a unified manner.
While other IRs have previously combined some of these aspects, we believe Pegasus is the first to unify them into a coherent, semantically precise representation.

A program is represented by a directed graph in which nodes are operations and edges indicate value flow; an example is shown in Figure 3. Pegasus leverages techniques used in compilers for predicated execution machines [74] by collecting multiple basic blocks into one hyperblock; each hyperblock is transformed into straight-line code through the use of predication, using techniques similar to PSSA [25]. (A hyperblock is a portion of the program control-flow graph having a single entry point and possibly multiple exits.) Instead of SSA φ-nodes, within hyperblocks Pegasus uses explicit decoded multiplexor (MUX) nodes (one example is given in Figure 7). A decoded MUX has n data inputs and n predicates. The data inputs are the reaching definitions. The MUX predicates correspond to path predicates in PSSA; each predicate selects one corresponding data input. The predicates of each MUX are guaranteed to be mutually disjoint (i.e., the predicates are one-hot encoded). CASH uses the espresso [3] Boolean optimizer to simplify the predicate computations. Speculation is introduced by predicate promotion [73]: the predicates that guard instructions without side-effects are weakened to become true, i.e., these instructions are executed unconditionally once a hyperblock is entered. Predication and speculation are thus core constructs in Pegasus. The former is used for translating control-flow constructs into dataflow; the latter for reducing the criticality of control-dependences [63]. They effectively increase the exposed ILP. Note that MUX nodes are natural speculation-squashing points, discarding all of their data inputs corresponding to false predicates (i.e., computed on mis-speculated paths).

int squares() {
    int i = 0, sum = 0;
    for (; i < 10; i++)
        sum += i * i;
    return sum;
}

Figure 3: C program (above) and its representation comprising three hyperblocks; each hyperblock is shown as a numbered rectangle. The dotted lines represent predicate values. (This figure omits the token edges used for memory synchronization.)

Hyperblocks are stitched together into a dataflow graph representing an entire procedure by creating dataflow edges connecting each hyperblock to its successors. Each variable live at the end of a hyperblock is forwarded through an ETA node [82] (also called a "gateway"). ETAs are shown as triangles pointing down in our figures. ETA nodes have two inputs, a value and a predicate, and one output. When the predicate evaluates to true, the ETA node moves the input value to the output; when the predicate evaluates to false, the input value and the predicate are simply consumed, generating no output. A hyperblock with multiple predecessors receives control from one of several different points; inter-hyperblock join points are represented by MERGE nodes, shown as triangles pointing up.

Figure 3 shows a function that uses i as an induction variable and sum to accumulate the sum of the squares of i. On the right is the program's Pegasus representation, which consists of three hyperblocks. Hyperblock 1 initializes i and sum to 0. Hyperblock 2 represents the loop; it contains two MERGE nodes, one for each of the loop-carried values, i and sum. Hyperblock 3 is the function epilog, containing just the RETURN. Back-edges within a hyperblock denote loop-carried values; in this example there are two such edges in hyperblock 2; back-edges always connect an ETA to a MERGE node. Memory accesses are represented through explicit LOAD and STORE nodes.
These and other operations with side-effects (e.g., CALL and DIVISION, which may generate exceptions) also have a predicate input: if the predicate is false, the operation is not executed. In our figures, predicate values are shown as dotted lines.

The compiler adds dependence edges, called token edges, to explicitly synchronize operations whose side-effects may not commute. Operations with memory side-effects (LOAD, STORE, CALL, and RETURN) all have a token input. Token edges explicitly encode data flow through memory. An operation with memory side-effects must collect tokens from all its potentially conflicting predecessors (e.g., a STORE following a set of LOADs). The COMBINE operator is used for this purpose. COMBINE has multiple token inputs and a single token output; it generates an output after it receives all its inputs. It has been noted for the Value Dependence Graph representation [97] that such token networks can be interpreted as SSA for memory, where the COMBINE operator corresponds to a φ-function. Tokens encode true-, output- and anti-dependences, and are "may" dependences. We have devised new algorithms for removing redundant memory accesses which exploit predicates and token edges in concert [8, 9]. As we show later, tokens are also explicitly synthesized as hardware signals, so they are both compile-time and run-time constructs.

Currently the compiler is purely static, i.e., it uses no profiling information. There is no reason that profiling cannot be incorporated in our tool-chain. Section 4.1.2 explains why profiling is less critical for CASH than for traditional ILP compilers [92].

2.3 The Dataflow Semantics of Pegasus

In [6] we have given a precise and concise operational semantics for all Pegasus constructs. At run-time, each edge of the graph either holds a value or is empty. An operation begins computing once all of its required inputs are available. It latches the newly computed value when its output is empty. The computation consumes the input values (setting the input edges to empty) and produces an output value.
This semantics is the one of a static dataflow machine (i.e., each edge can hold a single value at one time). The precise semantics is useful for reasoning about the correctness of compiler optimizations and is a precise specification for compiler back-ends. Currently CASH has three back-ends: (1) a graph-drawing back-end, which generates drawings in the dot language [45], such as in Figure 3; (2) a simulation back-end, which generates an interpreter of the graph structure (used for the analysis in Section 3); and (3) the asynchronous circuits Verilog back-end, described in Section 4.1 (used for the evaluation in Section 4).

2.4 Compiler Status

The core of CASH handles all of ANSI C except longjmp, alloca, and functions with a variable number of arguments. While the latter two constructs are relatively easy to integrate, handling longjmp is substantially more difficult. Strictly speaking, C does not have exceptions ([6], p. 200), and our compiler does not handle them. Recursion is handled in exactly the same way as in software: CASH allocates stack frames for saving and restoring the live local variables around the recursive call. As an optimization, CASH uses the call-graph to detect possibly recursive calls, and avoids saving locals for all non-recursive calls.

The asynchronous back-end is newer and therefore somewhat less complete: it does not yet handle procedure calls and floating-point computations. The latter can easily be handled with a suitable IP core containing implementations of floating-point arithmetic. Currently we handle some procedure calls by inlining. Handling function pointers requires an on-chip network, as the call will need to dynamically route the procedure arguments to the callee circuit depending on the run-time value of the pointer.

3. ASH VERSUS SUPERSCALAR

This section is devoted to a comparison of the properties of ASH and superscalar processors. The comparison is performed by executing whole programs using timing-accurate simulators. Since there are many parameters, one should see this comparison as a limit study.
This study can also be interpreted as being the first head-to-head comparison between an unlimited-resource static dataflow machine and a superscalar processor. Interestingly enough, despite the limited resources of the superscalar, some of its capabilities give it a substantial edge over the static dataflow model, as shown below. These results may be a partial explanation of the demise of the dataflow model of computation, which was a very popular research subject in the seventies and eighties. But first we will briefly discuss the main source of parallelism in dataflow machines.

3.1 Dataflow Software Pipelining

A consequence of the dataflow nature of ASH is the automatic exploitation of pipeline parallelism. This phenomenon has been studied extensively in the dataflow literature under names such as dataflow software pipelining [46] and loop unraveling [33]. As the name suggests, it is closely related to software pipelining [3], a compiler scheduling algorithm used mostly for VLIW processors.

The program in Figure 3 illustrates this phenomenon. Let us assume that the multiplier implementation is pipelined with five stages. In Figure 4 we show a few consecutive snapshots of this circuit as it executes, starting with the initial snapshot in which the two MERGEs contain the initial values of i and sum. (We have implemented a tool that can automatically generate such pictures; the inputs to the tool are pictures generated by the CASH back-end and execution traces generated by the execution simulator.) In the last snapshot (6), the computation of i has already executed two iterations, and two consecutive values of i are injected into the multiplier, while the computation of sum has yet to complete its first iteration. The execution of the multiplier is thus effectively pipelined. A similar effect can be achieved in a statically scheduled computation by explicitly software-pipelining the loop, scheduling the computation of i to occur one iteration ahead of sum. Pipelining also occurs automatically (i.e., without any compiler intervention) in superscalar processors if there are enough resources to simultaneously process instructions from multiple instances of the loop body.
In practice, large loops may not be dynamically pipelined by a superscalar due to in-order instruction fetch, which can prevent some iterations from getting ahead.

Maximizing the throughput of a pipelined computation in ASH requires that the delay of all paths between different strongly connected components in the Pegasus graph be equal. CASH inserts FIFO elements to achieve this, a transformation closely related to pipeline balancing in static dataflow machines [46] and slack matching in asynchronous circuits [70]. The FIFO elements correspond to the reservation stations in superscalar designs, and to the rotating registers in software pipelining.

3.2 ASH Versus Superscalar

For comparing ASH with a superscalar we make the following assumptions: (1) all arithmetic operations have the same latencies on both computational fabrics; (2) MUXes, MERGEs and Boolean operations in ASH have latencies proportional to the log of the number of inputs; (3) ETA has the same latency as an addition; (4) memory operations in ASH incur an additional cost for network arbitration compared to the superscalar; (5) the memory hierarchy used for both models is identical: an LSQ and a two-level cache hierarchy. (For this study we use a very similar LSQ for both ASH and the superscalar. As future work we are exploring the synthesis of program-specific LSQ structures.) The superscalar is a 4-way out-of-order SimpleScalar simulation [2] with the PISA instruction set, using gcc -O as a compiler. ASH is simulated using a high-level simulator which is automatically generated by CASH, as shown in Figure 1. We cannot simulate the execution of libraries in ASH (unless we supply them to the compiler as source code), and thus we have instrumented SimpleScalar to ignore their execution time, in order to have a fair comparison.

Naively, one would expect ASH to execute programs strictly faster than the superscalar (assuming comparable compiler technology) since it benefits from (a) unlimited parallelism, (b) no resource constraints, (c) no instruction fetch/decode/dispatch, and (d) dynamic scheduling.
Simulating whole programs from SpecInt95 under these assumptions results in two programs (099.go and 132.ijpeg) showing a 25% improvement on ASH, while the other programs are between 0% and 40% slower. The speed-ups on ASH are attributable to the increased ILP due to the unlimited number of functional units (for these benchmarks the instruction cache of the processor did not seem to be a bottleneck). In the next section we investigate the slowdowns.

3.3 Superscalar Advantages

In order to understand the advantages of the superscalar processor we have carried out a detailed analysis of code fragments which perform especially poorly on ASH. The main tool we have used for this purpose is the dynamic critical path [43]. In ASH the dynamic critical path is a sequence of last-arrival events. An event is last-arrival if it is the one that enables the computation of a node to proceed. Events correspond to signal transitions on the graph edges. The dynamic critical path is computed by tracing the edges corresponding to last-arrival events backwards from the last operation executed. Most often a last-arrival edge is the last input arriving at an operation. However, for lenient operations (see Section 4.1.2), the last-arrival edge is the edge enabling the computation of the output. Sometimes all the inputs may be present but an operation may be unable to compute because it has not received the acknowledgment signal for its previous computation; in this case the ack is the last-arrival event.

Despite the fact that the superscalar has to time-multiplex a small number of computational units, some of the mechanisms it employs provide clear performance advantages. Below is a brief summary of our findings.

Branch prediction: the ability of a superscalar to predict branch outcomes radically changes the structure of the dynamic dependences: for example, a correctly predicted branch is dynamically independent of the actual branch condition computation.
Unless the in-order commit stage (or some other structural hazard of the processor pipeline) is a bottleneck, the entire computation of the branch condition is removed from the critical path. In contrast, in ASH inter-hyperblock control transfers are never speculative. Often the ETA control predicate computation is on the critical path, e.g., when there is no computation to overlap with the branch condition evaluation, such as in control-intensive code. Such code fragments may be executed faster on a processor. Some branches, such as those testing exceptional conditions (e.g., introduced by the use of assert statements), are never taken, and thus the processor branch prediction does a very good job of handling them. These cases are especially detrimental to ASH. We note that good branch prediction requires global information, aggregating information from multiple branches, and would be very challenging to implement efficiently in Spatial Computation.

Figure 4: Snapshots of the execution of the circuit in Figure 3. The shaded nodes are actively computing; they also indicate the current value of their output latch. We are assuming a 5-stage pipelined multiplier (each stage shown separately); we assume all nodes in these graphs have the same latencies, except the Boolean negation, which takes zero time units (our implementation folds the inverter into the destination pipeline stage). In the last snapshot, two different values of i are simultaneously present in the multiplier pipeline.

Synchronization: MERGE and MUX operations have a non-zero cost, and may translate into overhead in ASH. These operations correspond to labels in machine code, i.e., control-flow join points, which have a zero execution cost on a CPU. This phenomenon is another facet of the tension between synchronization and parallelism. While a processor uses a program counter to sequence through the program, ASH relies on completely distributed control. MERGE and MUX operations are very simple forms of synchronization, used to merge several candidate values for a variable. Thus, the fine-grained parallelism of dataflow requires additional synchronization. This occurs even when the dataflow machine is not executed by an interpreter, but is directly mapped to hardware.

Distance to memory: a superscalar contains a limited number of load/store execution units (usually two). In-flight memory access instructions have to be dynamically scheduled to access these units, but once they get hold of a unit they can initiate a memory access for a constant cost. (For example, the use of an on-processor LSQ allows write operations to complete in essentially zero time.) In contrast, in ASH each memory access operation is synthesized as a distinct hardware entity. Since our current implementation uses a monolithic memory, ASH requires the use of a network to connect the operations to memory. One such network implementation is described in Section 4.1.3. This network requires arbitration for the limited number of memory ports; the total arbitration cost is O(log(n)) (n being the number of memory operations in the program).
The wire length of such a network grows as O(√n). The impact of the complexity of the memory network can be somewhat reduced by fragmenting memory into independent banks connected by separate networks, as we plan to do in future work. Note that the asymptotic complexity of the memory itself has the same behavior: the decoders and selectors for a memory of n bits require O(log n) stages; the worst-case wire length is O(√n). This explains why memory systems grow intrinsically slower than processors in speed: today's memories are also bound by wire delays [4]. While ASH addresses some shortcomings of superscalar processors, it does not directly aim to solve the memory bottleneck problem; both models of computation attack this problem by trying to overlap memory stall time with useful computation.

Static vs. dynamic dataflow: in ASH, at most one instance of an operation may be executing at any given time, because each operation has a single output latch for storing the result. In contrast, a superscalar processor may have multiple instances of any instruction in flight at once, because the register renaming mechanism effectively provides a different storage element for each instance of an in-flight instruction. The only instruction which cannot be effectively pipelined without major changes in implementation is the LOAD: such operations have to wait for the memory access to complete before initiating a new access. A local reorder buffer could be employed for this purpose, but it deviates from the spirit of ASH. In ASH, loop unrolling and pipelining can sometimes provide results similar to the full dynamic dataflow model of superscalars, but they are less general, since they are performed statically: we have seen instances where the CPU's dynamic renaming outperformed the static version of the code for some input set.

Strict procedures: our current implementation of procedures relies on CALL nodes which are strict, i.e., to initiate a procedure all inputs to the node must be available.
The fact that all inputs must be present before initiating a call introduces additional synchronization and puts the slowest argument computation on the dynamic critical path. When applicable, procedure inlining eliminates this problem, as the procedure call network is specialized to become simple point-to-point channels. In contrast, on a superscalar processor, procedure invocation is decoupled from the passing of arguments (which are put into registers or on the stack) and the call is simply a branch. Thus, the code computing the procedure arguments does not need to complete before the procedure body is initiated. In fact, the computation of an unused procedure argument is never on the critical path.

The issues discussed above seem fundamental to ASH. Other shortcomings of ASH are attributable to policies in our compiler, and could be corrected by a more careful implementation.

4. FROM C TO LAYOUT

In this section we describe how Pegasus is translated to asynchronous circuits and we present detailed measurements of the synthesized circuits. We also discuss the reasons for the excellent power efficiency of our circuits.

4.1 CAB: The CASH Asynchronous Back-end

The asynchronous back-end of CASH translates the Pegasus representations into asynchronous circuits. The static dataflow machine semantics of Pegasus makes such a translation fairly straightforward. More details about this process are available in [07].

Figure 5: Signaling protocol between data producers and consumers (a data bus, a ready wire, and an acknowledgment wire between the source register and the destination).

Figure 6: Control circuitry for a pipeline stage. The delay element is matched to the computational unit. The block labeled with "=" is a completion detection block, detecting when the register output has stabilized to a correct value.

Figure 7: Sample program fragment (if (x > 0) y = -x; else y = b*x;) and the corresponding Pegasus circuit, with the static critical path highlighted.

4.1.1 Synthesizing Scalar Computations

Pegasus representations could be mapped to asynchronous circuits in many ways. We have chosen to implement each Pegasus node as a separate hardware structure. Each IR node is implemented as a pipeline stage, using the micropipeline circuit style introduced by Sutherland [98]. Each pipeline stage contains an output register which is used to hold the result of the stage computation. Each edge is synthesized as a channel consisting of three uni-directional signals, as shown in Figure 5: (1) a data bus transfers the data from producer to consumer; (2) a data ready wire from producer to consumer indicates when the data can be safely used by the consumer; (3) an acknowledgment wire, from consumer to producer, indicates when the value has been used and the channel is free. This signaling method, called the bundled data protocol, is widely employed in asynchronous circuits. The control circuitry driving a pipeline stage is shown in Figure 6. The C gate is a Müller C-element [8], which implements the finite-state machine control for properly alternating data ready and acknowledgment signals. When there are multiple consumers, the data bus is used to broadcast the value to all of them, and the channel contains one acknowledgment wire from each consumer. Due to the SSA form of Pegasus, each channel has a single writer. Therefore, there is no need for arbitration, making data transfer a lightweight operation.
Perhaps the most important feature of our implementation is the complete absence of any global control structures. Control is completely embodied in the handshaking signals naturally distributed within the computation. This gives our circuits a very strong datapath orientation, making them amenable to efficient layout. (Unlike Sutherland's micropipelines, which used a 2-phase signalling protocol [2], our stages use 4-phase signaling, in which each signal returns to zero before initiating a new computation cycle.)

4.1.2 Lenient Evaluation

The form of speculative execution employed by Pegasus, which executes all forward branches simultaneously, alleviates the impact of branches, but may be plagued by the problem of unbalanced paths [8], as illustrated in Figure 7: the static critical path of the entire construct is the longest of the critical paths. If the short path is executed frequently, the benefits of speculation may be negated by the cost of the long path. This problem also occurs for machines which employ predicated execution. Traditionally this problem is addressed in two ways: (1) using profiling, only hyperblocks which ensure that the long path is most often executed at run-time are predicated; or (2) excluding certain hyperblock topologies from consideration, disallowing the predication of paths which differ widely in length. Because there is no single PC, we can employ a third, more elegant solution in hardware, by using leniency [9] to solve this problem. By definition, a lenient operation expects all of its inputs to arrive eventually, but it can compute its output using only a subset of its inputs. Lenient operators generate a result as soon as possible. For example, an AND operation can determine that its output is false as soon as one of its inputs is false. While the output can be available before all inputs, our implementation ensures that a lenient operation sends an acknowledgment only after all of its inputs have been received. To obtain the full benefit of leniency one also needs to issue early acknowledgments, as suggested in [4].
In the asynchronous circuits literature, leniency was proposed under the name early evaluation [85]. Forms of lenient evaluation have also been used in the design of arithmetic units for microprocessors: for example, some multiplier designs may generate the result very quickly when an input is zero. MUXes are also implemented leniently: as soon as a selector is true and the corresponding data is available, a MUX generates its output. Note that it is crucial for the MUX to be decoded (see Section 2.2) in order for this scheme to work efficiently. A result of leniency is that the dynamic critical path is the same as in a non-speculative implementation. For example, if the multiplication in Figure 7 is not used, it does not affect the critical path.5 In addition to Booleans and multiplexors, all predicated operations are lenient in their predicate input. For example, if a LOAD operation receives a false predicate input, it can immediately emit an arbitrary output, since the actual output is irrelevant. It cannot, however, output a token until it receives its input token, since memory dependences are transitively implied. The irrelevant output will be discarded downstream by a MUX or ETA node controlled by a false predicate.6

4 Lenient evaluation should not be confused with short-circuit evaluation: a short-circuit evaluation of an AND always evaluates the left operand, and if this one is true, it also evaluates the right one. However, a lenient evaluation generates a false result as soon as either input is known to be false.

5 The multiplier can still be on the critical path because of its late acknowledgments, which may prevent the next wave of computation from propagating forward, as described in Section 3.3. This problem can be alleviated either by using a pipelined multiplier, or by using early acknowledgments [4].

Figure 8: Memory access network and implementation of the value and token forwarding network. The LOAD produces a data value consumed by the oval node. The STORE node may depend on the load (i.e., we have a token edge between the LOAD and the STORE, shown as a dashed line). The token travels to the root of the tree, which is a load-store queue (LSQ).

4.1.3 Memory Access

The most complicated part of the synthesis process is building the network used by the LOAD and STORE operations to access memory. Figure 8 illustrates how a load and a dependent store access memory through this network. Our current implementation consists of a hierarchy of buses and asynchronous arbiters used to mediate access to the buses. Memory instructions which are ready to access memory compete for these buses; the winners of the arbitration inject messages which travel up the hierarchy in a pipelined fashion. A memory operation can produce a token as soon as its effect is guaranteed to occur in the right order with respect to the potentially interfering operations. The network does not guarantee in-order message delivery, so by traveling to the root we maintain the invariant that a dependent operation will be issued only after all operations on which it depends have injected their requests in the LSQ. The root of the tree is a unique serialization point, guaranteeing in-order execution of dependent operations. The LSQ holds the description of the memory operations under execution until their memory effects are completed; it may also perform dynamic disambiguation and act as a small fully-associative cache. In Section 4.2.2 we discuss some disadvantages of this implementation. We currently synthesize a very simple load-store queue (LSQ), which can hold a single operation until its execution completes.
It is worthwhile to notice that this implementation of the memory access network is very much in the spirit of ASH, being completely distributed, composed entirely of pipeline stages, and using only control localized within each stage; it contains no global signals of any kind.

4.2 Low-level Evaluation

In this section we present measurements from a detailed low-level simulation. We synthesize C kernels into ASICs and evaluate their performance on standard data sets. Since CAB generates synthesizable Verilog, FPGAs could in principle be targeted for evaluation. There are two factors that prevent us from doing so: (1) commercial FPGAs are synchronous devices, and mapping some of the features of our asynchronous circuits would be very inefficient;6 (2) commercial FPGAs are not optimized for power [47]; they would thus probably negate one of the main advantages of our implementation scheme, the very low power consumption. We use kernels from the Mediabench suite [66] to generate circuits. From each program we select one hot function (see Table 1) to implement in hardware. (The only exception are the g721 benchmarks, for which the hot function was very small, so we selected the function and one of its callers: we inlined the callee, unrolled the resulting loop, and substituted references to an array of constants with inline constant values. The same code was used on the SimpleScalar simulator in comparisons.) The experimental results presented below are for the entire circuit synthesized by CAB, including the memory access network, but excluding the memory itself or I/O to the circuit. We report data only for the execution of each kernel, ignoring the rest of the program; due to long simulation times, we execute each kernel for the first three invocations in the program and we measure the cumulative values (time, energy, etc.) for all three invocations.

6 The predicated-false operation does not need to swing the output lines, it need only assert the data ready signal (see Section 4.1.1). This will decrease the power consumption.
We do not estimate the overhead of invoking and returning from the kernel, since in this work we aim to understand the behavior of ASH, and not of a whole CPU+ASH system. Since our current back-end does not support the synthesis of floating-point computation we had to omit some kernels, such as the ones from the epic, rasta and mesa benchmarks. The CAB back-end is used to generate a Verilog representation of each kernel. A detailed description of our methodology can be found in [107]. We use a 180nm/2V standard-cell library from STMicroelectronics, optimized for performance. The structural Verilog generated by our tool flow is partially technology-mapped by CAB and partially synthesized with Synopsys Design Compiler SP2. The technology-mapped circuits are placed-and-routed with Silicon Ensemble 5.3 from Cadence. Currently the placement is handled completely by Silicon Ensemble, operating on a flat netlist; we expect that CAB can use knowledge of the circuit structure to automatically generate floor-plans which can improve our results substantially.7 Data collection with the commercial CAD tools for both pegwit benchmarks failed after placement, so we present pre-placement numbers for these. (The performance for the other benchmarks is about 5% better than their pre-placement estimate.) Simulation is performed with Modeltech ModelSim SE 5.7. We assume a perfect L1 cache, with a 600MHz cycle time. We synthesize a one-element LSQ for ASH. Compilation time is on the order of tens of seconds for all these benchmarks, and is thus completely inconsequential compared to hardware synthesis through the commercial tool-chain (the worst-case program takes about 30s through CASH, one hour through synthesis, and more than five hours for place-and-route). The code expansion in terms of lines of code from C to Verilog is a factor of 200x. All the results in this section are obtained without loop unrolling, which can increase circuit area and compilation time.

4.2.1 Area

Figure 9 shows the area required for each of these kernels. The area is broken down into computation and memory tree.
The memory tree is the area of the arbiter circuits used to mediate access to memory and of the hierarchical pipelined buses. For reference, in the same technology a minimal RISC core can be synthesized in 1.3mm², a 16x16 multiplier requires 0.1mm², and a complete Pentium 4 processor die, including all caches, has 217mm². This shows that while the area of our kernels is sometimes substantial, it is certainly affordable, especially in future technologies. Normalizing the area versus the object file size, we require on average 0.87mm²/KB of a gcc-generated MIPS object file.

7 Good placement and physical optimizations can account for as much as a factor of 4x in size and 2.3x in performance [36].

Table 1: Embedded benchmark kernels used for the low-level measurements and their size in original (un-processed) source lines of code. For g721 the function quan was inlined into fmult.

Benchmark   Function                          Lines
adpcm_d     adpcm_decoder                     80
adpcm_e     adpcm_coder                       103
g721_d      fmult+quan                        4
g721_e      fmult+quan                        4
gsm_d       Short_term_synthesis_filtering    24
gsm_e       Short_term_analysis_filtering     45
jpeg_d      jpeg_idct_islow                   24
jpeg_e      jpeg_fdct_islow                   44
mpeg2_d     idctcol                           55
mpeg2_e     dist1                             92
pegwit_d    squareDecrypt                     78
pegwit_e    squareEncrypt                     77

Figure 9: Silicon real-estate in mm² for each kernel.

Figure 10: Kernel slowdown compared to a 4-wide issue 600MHz superscalar processor in 180nm. A value of 1 indicates identical performance; values bigger than 1 indicate slower circuits on ASH.

4.2.2 Execution performance

Figure 10 shows the normalized execution time of each kernel against a baseline 600MHz, 4-wide superscalar processor. While we did not simulate a VLIW, we expect the trends to be similar, since the superscalar maintains a high IPC for these kernels. The processor has the same perfect L1 cache, but a 32-element LSQ. On average, ASH circuits are 1.33 times slower, but 4 kernels are faster than on the processor. Given the unlimited amount of ILP that can be exploited by ASH, these results are somewhat disappointing. An analysis of ASH circuits has pointed out that, although these circuits can be improved in many respects, the main bottleneck of our current design is the memory access protocol. In our current implementation, as described in Section 4.1.3, a memory operation does not release a token until its request has reached memory (i.e., the token must traverse the network end-to-end in both directions). An improved construction would allow an operation to (1) inject requests in the network, allowing them to travel out-of-order, and (2) release the token to the dependent operations immediately.
The network packet can carry enough information to enable the LSQ to buffer out-of-order requests and to execute the memory operations in the original program order. This kind of protocol is actually used by superscalar processors, which inject requests in order in the load-store queue, and can proceed to issue more memory operations before the previous ones have completed.

Figure 11: Evaluating the impact of an ideal memory interconnection protocol. The left bar reproduces the data from Figure 10.

To gauge the impact of the memory network on program performance, we performed a limit study using a behavioral implementation of the network in which each stage has zero latency. The improvement in performance is shown in Figure 11: programs having large memory access networks in Figure 9 display significant improvements (up to 8x for pegwit), which shows that programs which perform many memory accesses are bound by the memory network round-trip time. These numbers are obtained assuming that both value and token travel very quickly through the network; in reality, we can only substantially speed up the token path, so the performance of a better protocol still has to be evaluated. In Figure 12 we measure ASH performance using several MOPS metrics: the bottom bar we labeled MOPS, for millions of useful arithmetic operations per second. The incorrectly speculated arithmetic is accounted for as MOPSspec. Finally, MOPSall includes auxiliary operations, including the MERGE, ETA, MUX, COMBINE, pipeline-balancing FIFOs, and other overhead operations. Although speculative execution sometimes dominates useful work (e.g., g721), on average 1/3 of the executed arithmetic operations are incorrectly speculated.8 For some programs the control operations constitute a substantial fraction of the total number of executed operations. On average our programs sustain 1 GOPS.

8 The compiler can control the degree of speculation by varying the hyperblock construction algorithm; we have not yet explored the trade-offs involved.

Figure 14: Energy efficiency of several computational models using comparable technologies (MOPS/mW or OP/nJ): dedicated hardware, ASH media kernels, asynchronous microcontroller, FPGA, general-purpose DSP, microprocessors.

Figure 12: Computational performance of the ASH Mediabench kernels, expressed in millions of operations per second (MOPS).

Figure 15: Energy-delay ratio between ASH and superscalar.

Figure 13: Energy efficiency of the synthesized kernels, expressed in useful arithmetic operations per nanojoule.

4.2.3 Power and Energy

The power consumption of our circuits ranges between 5.5mW (g721) and 35.9mW (jpeg_d), with an average of 9.3mW. In contrast, a very low power DSP chip in the same technology draws about 110mW. In order to quantify energy efficiency we use the normalized metric proposed by Claasen [29], MOPS/mW or, equivalently, operations/nanojoule (10^6 operations per second per 10^-3 watt is 10^9 operations per joule, i.e., 1 operation/nJ). Figure 13 shows how our kernels fare from this perspective. We count only the useful operations in the MOPS metric, i.e., the bottom bar from Figure 12. The energy efficiency for CASH systems is between 5 and 280 operations/nJ, with an average of 52. g721_d and g721_e are outliers because they do not make any memory accesses, and their implementation is thus extremely efficient. For comparison, Figure 14 shows, on a logarithmic scale, the energy efficiency of microprocessors, digital signal processors, custom hardware circuits from [6], an asynchronous microprocessor [77], and FPGAs. All these circuits use comparable hardware technologies (180 and 250nm). ASH is three to four orders of magnitude better than a superscalar microprocessor and between one and two orders of magnitude better than a DSP. Figure 15 compares ASH with a 4-wide low-power superscalar (modeled with Wattch [5], using aggressive clock-gating for power reduction, and drawing around 5W dynamic power) in terms of energy-delay [53], a metric which is relatively independent of execution speed. ASH circuits are between 6 and 3600 times better. The circuits without memory access are again substantially better (one order of magnitude) than the others.
Power is roughly proportional to the circuit activity; the power wasted on speculation should be proportional to the number of speculative operations executed. Currently we believe that we should negotiate the power-performance trade-off in the direction of expending more power to increase performance: since power is very low, some increase in power is acceptable even for a relatively small increase in performance.

4.2.4 Discussion

There are multiple sources of inefficiency in the implementation of superscalar processors, which account for the huge difference in power and energy. (1) The clock distribution network of a large chip accounts for up to 50% of the total power budget; while some of this power is spent on the latches, which exist also in asynchronous implementations [2], the big clock network and its huge drivers still account for a substantial fraction of the power. (2) Most of the pipeline stages in ASH are usually inactive (see for example Figure 4). Since these circuits are asynchronous, they draw only static power when inactive, which is very small in a 0.18µm technology. While leakage power is a big issue in future technologies, there are many circuit- and architecture-level techniques [103] that can be applied to reduce its impact in ASH. (3) On the Pentium 4 die, even excluding caches, all functional units combined (integer, floating-point and MMX) together take less than 10% of the chip area. The remaining 90% is devoted entirely to mechanisms that only support the computation, without producing useful results. From the point of view of the energy efficiency metric we use, all of this extra activity is pure overhead.

(4) [77] suggests that more than 70% of the power of the asynchronous Lutonium processor is spent in instruction fetch and decode. This shows that there is an inherent overhead for interpreting programs encoded in machine code, which ASH does not have to pay. (5) Finally, in a microprocessor, and even in many ASICs, functional units are heavily optimized for speed at the expense of power and area. Since SC replicates functional units lavishly, the trade-off has to be biased in the reverse direction. It has been known that dedicated hardware chips can be vastly more energy-efficient than processors [6]. But most often the algorithms implemented in dedicated hardware have a naturally high degree of parallelism. This paper shows for the first time that a fully automatic tool-chain starting from C programs can generate results comparable in speed with high-end processors and in power with custom hardware. We are confident that further optimizations, including circuit-level techniques, high-level compiler transformations, a better memory access protocol, and compiler-guided floor-planning, will substantially increase ASH's performance.

5. RELATED WORK

Optimizing compilers. Pegasus is a form of dataflow intermediate language, an idea pioneered by Dennis [39]. The crux of Pegasus is handling memory dependences without excessively inhibiting parallelism. The core ideas on how to use tokens to express fine-grained location-based synchronization were introduced by Beck, et al. [1]. The explicit representation of memory dependences between program operations has been suggested numerous times in the literature, e.g., Pingali's Dependence Flow Graph [83], or Steensgaard's adaptation of Value-Dependence Graphs [97]. Other researchers have also explored extending SSA to handle memory dependences, e.g., [35, 44, 64, 28, 30, 7, 72]. But none of those approaches is as simple as ours. The integration of predication and SSA has also been done in PSSA [24, 25]. PSSA, however, does not use φ functions and therefore loses some of the appealing properties of SSA.
Our use of the hyperblock as a basic optimization unit and our algorithm for computing block and path predicates were inspired by this work.

High-level synthesis. While there is a substantial amount of research on hardware synthesis from high-level languages and dialects of C and C++, none of it supports C as fully as CASH does. One major difference between our approach and the vast majority of other efforts is that we use dusty-deck C, while most other projects employ C as a hardware description language (HDL). They either add constructs (e.g., reactivity, concurrency, and variable bit-width) to C in order to make it more suitable for expressing hardware properties, and/or remove constructs (e.g., pointers, dynamic allocation and recursion) that do not correspond to natural hardware objects, obtaining Register Transfer C. Other efforts impose a strict coding discipline on C in order to obtain a synthesizable program. There are numerous research and commercial products using variants of C as an HDL [80]: PICO from HP [93], Cynlib, originally from CynApps, now at Forte Design Systems [88], Cyber from NEC [109], A|RT Builder, originally from Frontier Design, now at Xilinx [59], Scenic/CoCentric from Synopsys [69], N2C from CoWare [32], compilers for Bach-C (originally based on Occam) from Sharp [60], OCAPI from IMEC [90], Synopsys c2verilog, originally from C-Level Design [96], Celoxica's Handel-C compiler [3], Cadence's ECL (based on C and Esterel) compiler [65], SpC from Stanford [94], and also [48, 6, 54, 55]. Our goals are most closely related to the chip-in-a-day project from Berkeley [37], but our approach is very different: that project starts from parallel Statechart descriptions and employs sophisticated hard macros to synthesize synchronous designs with automatic clock gating. In contrast, we start from C, use a small library of standard cells, and build asynchronous circuits.

Reconfigurable computing.
A completely different approach to hardware design is taken in the research on reconfigurable computing, which relies on automatic compilation of C or other high-level languages to target reconfigurable hardware [2]. Notable projects are: PRISM II [0], PRISC [84], DISC [3], NAPA [49], DEFACTO [40], Chimaera [5], OneChip [4], RaPiD [4], PamDC [102], StreamC [50], WASMII [100], XCC/PACT [23], PipeRench [5], RAW/Virtual Wires [9], the systems described in [95] and [78], compilation of Term-Rewriting Systems [58], and synthesis of dataflow graphs [86]. CASH has been influenced by the work of Callahan on the GARP C compiler [22, 68]. GarpCC is still one of the few systems with the broad scope of targeting C to a reconfigurable fabric. None of these approaches targets a true Spatial Computation model, with completely distributed computation and control.

Dataflow machines. A large number of dataflow machine architectures have been proposed and built; see for example the survey in [106]. All of these were interpreters, executing programs described in a specialized form of machine code. Most of the dataflow machines were programmed in functional languages, such as VAL and Id. CASH could in principle be used to target a traditional dataflow machine. To our knowledge, none of the other efforts to translate C for execution on dataflow machines (e.g., [105]) reached completion.

Asynchronous circuit design. Traditionally, asynchronous synthesis has concentrated on synthesizing individual controllers. However, in the last decade, a number of other approaches have also explored the synthesis of entire systems. The starting point for these related synthesis flows is a high-level language (usually an inherently parallel one based on Hoare's Communicating Sequential Processes [57]) suitable for describing the parallelism of the application, e.g., Tangram [104], Balsa [42], OCCAM [79], CHP [76, 0].
With the exception of [76], synthesis tends to follow a two-step approach: the high-level description is translated in a syntax-directed fashion into an intermediate representation; then, each module in the IR is technology-mapped (template-based) into gates. An optional optimization step [27] may be introduced to improve the performance of the synthesized circuits. In [76], the high-level specification is decomposed into smaller specifications, some dataflow and local optimization techniques are applied, and the resulting low-level specifications are synthesized in a template-based fashion. Our approach is somewhat similar to the syntax-directed flows, with three notable differences. First, our input language is a well-established, imperative, sequential programming language; handling pointers and arrays is central in our approach. Second, the CASH compiler performs extensive analysis and optimization steps before generating the intermediate form. Finally, the target implementation for our intermediate form consists of pipelined circuits, which naturally increase performance when compared with traditional syntax-directed approaches.

Spatial Computation. Various models related to Spatial Computation have been investigated by other research efforts: compiling for finite-sized fabrics has been studied by research on systolic arrays [62], RAW [67], PipeRench [52] and TRIPS [89]. More remotely related are efforts such as SmartMemories [75] and Imagine [87]. Unlimited or virtualized hardware is exploited in proposals such as SCORE [26], [38], and WaveScalar [99]. Among the latter efforts, our research is distinguished by the fact that (1) it targets…


More information

Mallathahally, Bangalore, India 1 2

Mallathahally, Bangalore, India 1 2 7 IMPLEMENTATION OF HIGH PERFORMANCE BINARY SQUARER PRADEEP M C, RAMESH S, Department of Electroncs and Communcaton Engneerng, Dr. Ambedkar Insttute of Technology, Mallathahally, Bangalore, Inda pradeepmc@gmal.com,

More information

CE 221 Data Structures and Algorithms

CE 221 Data Structures and Algorithms CE 1 ata Structures and Algorthms Chapter 4: Trees BST Text: Read Wess, 4.3 Izmr Unversty of Economcs 1 The Search Tree AT Bnary Search Trees An mportant applcaton of bnary trees s n searchng. Let us assume

More information

Lecture 5: Multilayer Perceptrons

Lecture 5: Multilayer Perceptrons Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Efficient Distributed File System (EDFS)

Efficient Distributed File System (EDFS) Effcent Dstrbuted Fle System (EDFS) (Sem-Centralzed) Debessay(Debsh) Fesehaye, Rahul Malk & Klara Naherstedt Unversty of Illnos-Urbana Champagn Contents Problem Statement, Related Work, EDFS Desgn Rate

More information

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search

Sequential search. Building Java Programs Chapter 13. Sequential search. Sequential search Sequental search Buldng Java Programs Chapter 13 Searchng and Sortng sequental search: Locates a target value n an array/lst by examnng each element from start to fnsh. How many elements wll t need to

More information

Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings

Loop Pipelining for High-Throughput Stream Computation Using Self-Timed Rings Loop Ppelnng for Hgh-Throughput Stream Computaton Usng Self-Tmed Rngs Gennette Gll, John Hansen and Montek Sngh Dept. of Computer Scence Unv. of North Carolna, Chapel Hll, NC 27599, USA {gllg,jbhansen,montek}@cs.unc.edu

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach

Data Representation in Digital Design, a Single Conversion Equation and a Formal Languages Approach Data Representaton n Dgtal Desgn, a Sngle Converson Equaton and a Formal Languages Approach Hassan Farhat Unversty of Nebraska at Omaha Abstract- In the study of data representaton n dgtal desgn and computer

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT

DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT DESIGNING TRANSMISSION SCHEDULES FOR WIRELESS AD HOC NETWORKS TO MAXIMIZE NETWORK THROUGHPUT Bran J. Wolf, Joseph L. Hammond, and Harlan B. Russell Dept. of Electrcal and Computer Engneerng, Clemson Unversty,

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization

High level vs Low Level. What is a Computer Program? What does gcc do for you? Program = Instructions + Data. Basic Computer Organization What s a Computer Program? Descrpton of algorthms and data structures to acheve a specfc ojectve Could e done n any language, even a natural language lke Englsh Programmng language: A Standard notaton

More information

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization Problem efntons and Evaluaton Crtera for Computatonal Expensve Optmzaton B. Lu 1, Q. Chen and Q. Zhang 3, J. J. Lang 4, P. N. Suganthan, B. Y. Qu 6 1 epartment of Computng, Glyndwr Unversty, UK Faclty

More information

Performance Evaluation

Performance Evaluation Performance Evaluaton [Ch. ] What s performance? of a car? of a car wash? of a TV? How should we measure the performance of a computer? The response tme (or wall-clock tme) t takes to complete a task?

More information

Vectorization in the Polyhedral Model

Vectorization in the Polyhedral Model Vectorzaton n the Polyhedral Model Lous-Noël Pouchet pouchet@cse.oho-state.edu Dept. of Computer Scence and Engneerng, the Oho State Unversty October 200 888. Introducton: Overvew Vectorzaton: Detecton

More information

RADIX-10 PARALLEL DECIMAL MULTIPLIER

RADIX-10 PARALLEL DECIMAL MULTIPLIER RADIX-10 PARALLEL DECIMAL MULTIPLIER 1 MRUNALINI E. INGLE & 2 TEJASWINI PANSE 1&2 Electroncs Engneerng, Yeshwantrao Chavan College of Engneerng, Nagpur, Inda E-mal : mrunalngle@gmal.com, tejaswn.deshmukh@gmal.com

More information

Hierarchical clustering for gene expression data analysis

Hierarchical clustering for gene expression data analysis Herarchcal clusterng for gene expresson data analyss Gorgo Valentn e-mal: valentn@ds.unm.t Clusterng of Mcroarray Data. Clusterng of gene expresson profles (rows) => dscovery of co-regulated and functonally

More information

Design and Analysis of Algorithms

Design and Analysis of Algorithms Desgn and Analyss of Algorthms Heaps and Heapsort Reference: CLRS Chapter 6 Topcs: Heaps Heapsort Prorty queue Huo Hongwe Recap and overvew The story so far... Inserton sort runnng tme of Θ(n 2 ); sorts

More information

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access

Cache Performance 3/28/17. Agenda. Cache Abstraction and Metrics. Direct-Mapped Cache: Placement and Access Agenda Cache Performance Samra Khan March 28, 217 Revew from last lecture Cache access Assocatvty Replacement Cache Performance Cache Abstracton and Metrcs Address Tag Store (s the address n the cache?

More information

Hermite Splines in Lie Groups as Products of Geodesics

Hermite Splines in Lie Groups as Products of Geodesics Hermte Splnes n Le Groups as Products of Geodescs Ethan Eade Updated May 28, 2017 1 Introducton 1.1 Goal Ths document defnes a curve n the Le group G parametrzed by tme and by structural parameters n the

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

Area Efficient Self Timed Adders For Low Power Applications in VLSI

Area Efficient Self Timed Adders For Low Power Applications in VLSI ISSN(Onlne): 2319-8753 ISSN (Prnt) :2347-6710 Internatonal Journal of Innovatve Research n Scence, Engneerng and Technology (An ISO 3297: 2007 Certfed Organzaton) Area Effcent Self Tmed Adders For Low

More information

CHARUTAR VIDYA MANDAL S SEMCOM Vallabh Vidyanagar

CHARUTAR VIDYA MANDAL S SEMCOM Vallabh Vidyanagar CHARUTAR VIDYA MANDAL S SEMCOM Vallabh Vdyanagar Faculty Name: Am D. Trved Class: SYBCA Subject: US03CBCA03 (Advanced Data & Fle Structure) *UNIT 1 (ARRAYS AND TREES) **INTRODUCTION TO ARRAYS If we want

More information

User Authentication Based On Behavioral Mouse Dynamics Biometrics

User Authentication Based On Behavioral Mouse Dynamics Biometrics User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Programming in Fortran 90 : 2017/2018

Programming in Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Programmng n Fortran 90 : 2017/2018 Exercse 1 : Evaluaton of functon dependng on nput Wrte a program who evaluate the functon f (x,y) for any two user specfed values

More information

Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware

Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware Draft submtted for publcaton. Please do not dstrbute Usng Delayed Addton echnques to Accelerate Integer and Floatng-Pont Calculatons n Confgurable Hardware Zhen Luo, Nonmember and Margaret Martonos, Member,

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Optimizing Document Scoring for Query Retrieval

Optimizing Document Scoring for Query Retrieval Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng

More information

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction

CPE 628 Chapter 2 Design for Testability. Dr. Rhonda Kay Gaede UAH. UAH Chapter Introduction Chapter 2 Desgn for Testablty Dr Rhonda Kay Gaede UAH 2 Introducton Dffcultes n and the states of sequental crcuts led to provdng drect access for storage elements, whereby selected storage elements are

More information

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach

Distributed Resource Scheduling in Grid Computing Using Fuzzy Approach Dstrbuted Resource Schedulng n Grd Computng Usng Fuzzy Approach Shahram Amn, Mohammad Ahmad Computer Engneerng Department Islamc Azad Unversty branch Mahallat, Iran Islamc Azad Unversty branch khomen,

More information

Chapter 6 Programmng the fnte element method Inow turn to the man subject of ths book: The mplementaton of the fnte element algorthm n computer programs. In order to make my dscusson as straghtforward

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS

CACHE MEMORY DESIGN FOR INTERNET PROCESSORS CACHE MEMORY DESIGN FOR INTERNET PROCESSORS WE EVALUATE A SERIES OF THREE PROGRESSIVELY MORE AGGRESSIVE ROUTING-TABLE CACHE DESIGNS AND DEMONSTRATE THAT THE INCORPORATION OF HARDWARE CACHES INTO INTERNET

More information

CHAPTER 4 PARALLEL PREFIX ADDER

CHAPTER 4 PARALLEL PREFIX ADDER 93 CHAPTER 4 PARALLEL PREFIX ADDER 4.1 INTRODUCTION VLSI Integer adders fnd applcatons n Arthmetc and Logc Unts (ALUs), mcroprocessors and memory addressng unts. Speed of the adder often decdes the mnmum

More information

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations

Improving High Level Synthesis Optimization Opportunity Through Polyhedral Transformations Improvng Hgh Level Synthess Optmzaton Opportunty Through Polyhedral Transformatons We Zuo 2,5, Yun Lang 1, Peng L 1, Kyle Rupnow 3, Demng Chen 2,3 and Jason Cong 1,4 1 Center for Energy-Effcent Computng

More information

Cluster Analysis of Electrical Behavior

Cluster Analysis of Electrical Behavior Journal of Computer and Communcatons, 205, 3, 88-93 Publshed Onlne May 205 n ScRes. http://www.scrp.org/ournal/cc http://dx.do.org/0.4236/cc.205.350 Cluster Analyss of Electrcal Behavor Ln Lu Ln Lu, School

More information

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments

Comparison of Heuristics for Scheduling Independent Tasks on Heterogeneous Distributed Environments Comparson of Heurstcs for Schedulng Independent Tasks on Heterogeneous Dstrbuted Envronments Hesam Izakan¹, Ath Abraham², Senor Member, IEEE, Václav Snášel³ ¹ Islamc Azad Unversty, Ramsar Branch, Ramsar,

More information

Real-Time Guarantees. Traffic Characteristics. Flow Control

Real-Time Guarantees. Traffic Characteristics. Flow Control Real-Tme Guarantees Requrements on RT communcaton protocols: delay (response s) small jtter small throughput hgh error detecton at recever (and sender) small error detecton latency no thrashng under peak

More information

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements

Explicit Formulas and Efficient Algorithm for Moment Computation of Coupled RC Trees with Lumped and Distributed Elements Explct Formulas and Effcent Algorthm for Moment Computaton of Coupled RC Trees wth Lumped and Dstrbuted Elements Qngan Yu and Ernest S.Kuh Electroncs Research Lab. Unv. of Calforna at Berkeley Berkeley

More information

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning

Parallel Inverse Halftoning by Look-Up Table (LUT) Partitioning Parallel Inverse Halftonng by Look-Up Table (LUT) Parttonng Umar F. Sddq and Sadq M. Sat umar@ccse.kfupm.edu.sa, sadq@kfupm.edu.sa KFUPM Box: Department of Computer Engneerng, Kng Fahd Unversty of Petroleum

More information

Module Management Tool in Software Development Organizations

Module Management Tool in Software Development Organizations Journal of Computer Scence (5): 8-, 7 ISSN 59-66 7 Scence Publcatons Management Tool n Software Development Organzatons Ahmad A. Al-Rababah and Mohammad A. Al-Rababah Faculty of IT, Al-Ahlyyah Amman Unversty,

More information

ARTICLE IN PRESS. Signal Processing: Image Communication

ARTICLE IN PRESS. Signal Processing: Image Communication Sgnal Processng: Image Communcaton 23 (2008) 754 768 Contents lsts avalable at ScenceDrect Sgnal Processng: Image Communcaton journal homepage: www.elsever.com/locate/mage Dstrbuted meda rate allocaton

More information

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation

Preconditioning Parallel Sparse Iterative Solvers for Circuit Simulation Precondtonng Parallel Sparse Iteratve Solvers for Crcut Smulaton A. Basermann, U. Jaekel, and K. Hachya 1 Introducton One mportant mathematcal problem n smulaton of large electrcal crcuts s the soluton

More information

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions

Sorting Review. Sorting. Comparison Sorting. CSE 680 Prof. Roger Crawfis. Assumptions Sortng Revew Introducton to Algorthms Qucksort CSE 680 Prof. Roger Crawfs Inserton Sort T(n) = Θ(n 2 ) In-place Merge Sort T(n) = Θ(n lg(n)) Not n-place Selecton Sort (from homework) T(n) = Θ(n 2 ) In-place

More information

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)

Helsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr) Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute

More information

Introduction to Programming. Lecture 13: Container data structures. Container data structures. Topics for this lecture. A basic issue with containers

Introduction to Programming. Lecture 13: Container data structures. Container data structures. Topics for this lecture. A basic issue with containers 1 2 Introducton to Programmng Bertrand Meyer Lecture 13: Contaner data structures Last revsed 1 December 2003 Topcs for ths lecture 3 Contaner data structures 4 Contaners and genercty Contan other objects

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries

Run-Time Operator State Spilling for Memory Intensive Long-Running Queries Run-Tme Operator State Spllng for Memory Intensve Long-Runnng Queres Bn Lu, Yal Zhu, and lke A. Rundenstener epartment of Computer Scence, Worcester Polytechnc Insttute Worcester, Massachusetts, USA {bnlu,

More information

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,

More information

Newton-Raphson division module via truncated multipliers

Newton-Raphson division module via truncated multipliers Newton-Raphson dvson module va truncated multplers Alexandar Tzakov Department of Electrcal and Computer Engneerng Illnos Insttute of Technology Chcago,IL 60616, USA Abstract Reducton n area and power

More information

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL)

Circuit Analysis I (ENGR 2405) Chapter 3 Method of Analysis Nodal(KCL) and Mesh(KVL) Crcut Analyss I (ENG 405) Chapter Method of Analyss Nodal(KCL) and Mesh(KVL) Nodal Analyss If nstead of focusng on the oltages of the crcut elements, one looks at the oltages at the nodes of the crcut,

More information

Random Kernel Perceptron on ATTiny2313 Microcontroller

Random Kernel Perceptron on ATTiny2313 Microcontroller Random Kernel Perceptron on ATTny233 Mcrocontroller Nemanja Djurc Department of Computer and Informaton Scences, Temple Unversty Phladelpha, PA 922, USA nemanja.djurc@temple.edu Slobodan Vucetc Department

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introducton 1.1 Parallel Processng There s a contnual demand for greater computatonal speed from a computer system than s currently possble (.e. sequental systems). Areas need great computatonal

More information

Related-Mode Attacks on CTR Encryption Mode

Related-Mode Attacks on CTR Encryption Mode Internatonal Journal of Network Securty, Vol.4, No.3, PP.282 287, May 2007 282 Related-Mode Attacks on CTR Encrypton Mode Dayn Wang, Dongda Ln, and Wenlng Wu (Correspondng author: Dayn Wang) Key Laboratory

More information

Brave New World Pseudocode Reference

Brave New World Pseudocode Reference Brave New World Pseudocode Reference Pseudocode s a way to descrbe how to accomplsh tasks usng basc steps lke those a computer mght perform. In ths week s lab, you'll see how a form of pseudocode can be

More information

Reducing Frame Rate for Object Tracking

Reducing Frame Rate for Object Tracking Reducng Frame Rate for Object Trackng Pavel Korshunov 1 and We Tsang Oo 2 1 Natonal Unversty of Sngapore, Sngapore 11977, pavelkor@comp.nus.edu.sg 2 Natonal Unversty of Sngapore, Sngapore 11977, oowt@comp.nus.edu.sg

More information

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont)

Loop Permutation. Loop Transformations for Parallelism & Locality. Legality of Loop Interchange. Loop Interchange (cont) Loop Transformatons for Parallelsm & Localty Prevously Data dependences and loops Loop transformatons Parallelzaton Loop nterchange Today Loop nterchange Loop transformatons and transformaton frameworks

More information

Storage Binding in RTL synthesis

Storage Binding in RTL synthesis Storage Bndng n RTL synthess Pe Zhang Danel D. Gajsk Techncal Report ICS-0-37 August 0th, 200 Center for Embedded Computer Systems Department of Informaton and Computer Scence Unersty of Calforna, Irne

More information

Programming FPGAs in C/C++ with High Level Synthesis PACAP - HLS 1

Programming FPGAs in C/C++ with High Level Synthesis PACAP - HLS 1 Programmng FPGAs n C/C wth Hgh Level Synthess PACAP - HLS 1 Outlne Why Hgh Level Synthess? Challenges when syntheszng hardware from C/C Hgh Level Synthess from C n a nutshell Explorng performance/area

More information

A RECONFIGURABLE ARCHITECTURE FOR MULTI-GIGABIT SPEED CONTENT-BASED ROUTING. James Moscola, Young H. Cho, John W. Lockwood

A RECONFIGURABLE ARCHITECTURE FOR MULTI-GIGABIT SPEED CONTENT-BASED ROUTING. James Moscola, Young H. Cho, John W. Lockwood A RECONFIGURABLE ARCHITECTURE FOR MULTI-GIGABIT SPEED CONTENT-BASED ROUTING James Moscola, Young H. Cho, John W. Lockwood Dept. of Computer Scence and Engneerng Washngton Unversty, St. Lous, MO {jmm5,

More information

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe

CSCI 104 Sorting Algorithms. Mark Redekopp David Kempe CSCI 104 Sortng Algorthms Mark Redekopp Davd Kempe Algorthm Effcency SORTING 2 Sortng If we have an unordered lst, sequental search becomes our only choce If we wll perform a lot of searches t may be benefcal

More information

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) ,

VRT012 User s guide V0.1. Address: Žirmūnų g. 27, Vilnius LT-09105, Phone: (370-5) , Fax: (370-5) , VRT012 User s gude V0.1 Thank you for purchasng our product. We hope ths user-frendly devce wll be helpful n realsng your deas and brngng comfort to your lfe. Please take few mnutes to read ths manual

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information