Baring it all to Software: The Raw Machine


Elliot Waingold, Michael Taylor, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Srikrishna Devabhaktuni, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, Anant Agarwal

MIT Laboratory for Computer Science, Cambridge, MA 02139

Abstract

Rapid advances in technology force a quest for computer architectures that exploit new opportunities and shed existing mechanisms that do not scale. Current architectures, such as hardware-scheduled superscalars, are already hitting performance and complexity limits and cannot be scaled indefinitely. The Reconfigurable Architecture Workstation (Raw) is a simple, wire-efficient architecture that scales with increasing VLSI gate densities and attempts to provide performance that is at least comparable to that provided by scaling an existing architecture, but that can achieve orders of magnitude more performance for applications in which the compiler can discover and statically schedule fine-grain parallelism. The Raw microprocessor chip comprises a set of replicated tiles, each tile containing a simple RISC-like processor, a small amount of configurable logic, and a portion of memory for instructions and data. Each tile has an associated programmable switch which connects the tiles in a wide-channel point-to-point interconnect. The compiler statically schedules multiple streams of computations, with one program counter per tile. The interconnect provides register-to-register communication with very low latency and can also be statically scheduled. The compiler is thus able to schedule instruction-level parallelism across the tiles and exploit the large number of registers and memory ports. Of course, Raw provides backup dynamic support in the form of flow control for situations in which the compiler cannot determine a precise static schedule.
The Raw architecture can be viewed as replacing the bus architecture of superscalar processors with a switched interconnect, and as accomplishing at compile time operations such as register renaming and instruction scheduling. This paper makes a case for the Raw architecture and provides early results on the plausibility of static compilation for several small benchmarks. We have implemented a prototype Raw processor (called RawLogic) and an associated compilation system by leveraging commercial FPGA-based logic emulation technology. RawLogic supports many of the features of the Raw architecture and demonstrates that we can write compilers to statically orchestrate all the communication and computation in multiple threads, and that performance levels of 10x-100x over workstations are achievable for many applications.

Also appears as MIT Laboratory for Computer Science TR-709, March 1997.

Figure 1: RawµP composition. Each RawµP comprises multiple tiles; each tile contains IMEM, PC, REGS, DMEM, ALU, and CL on the processor side, and PC and SMEM on the switch side.

1 Introduction

Rapidly evolving technology places a billion transistors on a chip within reach of the computer architect. Although several approaches to utilizing the large amount of silicon resources have been put forth, they share the single important goal of obtaining the most performance out of approximately one square inch of silicon. All approaches exploit more parallelism in one or more instruction streams, and include more aggressive superscalar designs, simultaneous multithreading, multiprocessors on a chip, VLIWs, and integrating a processor with huge amounts of DRAM memory. The Raw project's approach to achieving this goal is to implement a simple, highly parallel VLSI architecture, and to fully expose the low-level details of the hardware architecture to the compiler so that the compiler or the software can determine, and implement, the best allocation of resources for each application. As displayed in Figure 1, the Raw microprocessor (RawµP) is composed of a set of interconnected tiles, each tile comprising instruction and data memory, an ALU, registers, some configurable logic, and a programmable switch. Wide channels couple near-neighbor tiles. The hardware architecture is exposed fully to the compiler and the software system, and the switch is programmable via a sequence of instructions, also determined by the compiler. The Raw design uses both distributed register files and distributed memory. Communication between tiles is slightly less expensive than memory access on the same tile, slightly more expensive than register reads, and occurs over high-bandwidth point-to-point channels. Figure 2 shows a multiprocessor system environment for the RawµPs, where each multiprocessor node is constructed out of a RawµP, a dynamic router, and some high-bandwidth RDRAM.
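As a rough illustration of the tile composition described above, the sketch below models a RawµP as a mesh of tiles, each with its own program counter, register file, instruction and data memory, and switch instruction memory. All sizes and field names here are illustrative assumptions for the sketch, not parameters taken from the Raw design.

```python
# Hypothetical sketch of the RawuP tile mesh. Sizes and field names
# are illustrative assumptions, not Raw's actual parameters.

class Tile:
    def __init__(self, row, col):
        self.row, self.col = row, col
        self.pc = 0                  # per-tile program counter
        self.regs = [0] * 32         # slice of the distributed register file
        self.imem = []               # per-tile instruction memory
        self.dmem = [0] * 1024       # per-tile data memory (software-managed)
        self.switch_imem = []        # switch instruction memory (static routes)

class RawChip:
    """A RawuP: an N x N mesh of tiles with near-neighbor channels."""
    def __init__(self, n):
        self.n = n
        self.tiles = [[Tile(r, c) for c in range(n)] for r in range(n)]

    def neighbors(self, r, c):
        # Wide point-to-point channels couple near-neighbor tiles only.
        cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return [(i, j) for i, j in cand if 0 <= i < self.n and 0 <= j < self.n]

chip = RawChip(4)
```

The point of the model is structural: every resource (PC, registers, memories, switch program) is replicated per tile, with no global shared structure.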
Our approach leverages the same two features that make application-specific custom hardware systems superior in performance, for the given application, to extant general-purpose computers. First, they implement fine-grain communication between large numbers of replicated processing elements and, thereby, are able to exploit huge amounts of fine-grain parallelism in applications, when this parallelism exists. For example, custom vision chips that implement hundreds of small processing tiles interconnected via a point-to-point, near-neighbor interconnect can achieve performance levels far exceeding that of workstations in a cost-effective manner. Second, they expose the complete details of the underlying hardware architecture to the software system (be it the software CAD system, the applications software, or the compiler), so the software can carefully orchestrate the

execution of the application by applying techniques such as pipelining, synchronization elimination through careful cycle counting, and conflict elimination for shared resources by static scheduling and routing.

Figure 2: Raw multiprocessor comprising multiple RawµPs. Each node consists of a Raw microprocessor, a dynamic router, and RDRAM.

Figure 3: Raw microprocessors versus superscalar processors and multiprocessors. The Raw microprocessor distributes the register file and memory ports and communicates between ALUs on a switched, point-to-point interconnect. In contrast, a superscalar contains a single register file and memory port and communicates between ALUs on a global bus. Multiprocessors communicate at a much coarser grain through the memory subsystem.

In keeping with the general philosophy of building a simple replicated architecture and exposing the underlying implementation to the compiler, Raw has the following features.

Simple replicated tile As shown in Figure 1, the Raw machine comprises an interconnected set of tiles. Each tile contains a simple RISC-like pipeline, and is interconnected over a pipelined, point-to-point network with other tiles. Unlike current superscalar architectures, Raw does not bind specialized logic structures such as register renaming logic, dynamic instruction issue logic, or caching into hardware. Instead, it focuses on keeping each tile small, and maximizes the number of tiles it can implement on a chip, thereby increasing the amount of parallelism it can exploit, and the clock speed it can achieve. As Figure 3 displays, Raw does not use buses; rather, it uses a switched interconnect. Communication between tiles is pipelined over this interconnect, and occurs at the register level. Thus the

clock frequency is higher than in chips that use global buses. No signal in Raw travels a distance greater than a single tile width within a clock cycle. Furthermore, because the interconnect allows communication between tiles to occur at nearly the same speed as a register read, compilers can use instruction-level parallelism in addition to coarser forms of parallelism to keep the tiles busy. In addition, memory distributed across the tiles will eliminate the memory bandwidth bottleneck and provide significantly lower latencies to each memory module. Wide channels will allow significantly higher bandwidth for both I/O and on-chip communication. A large number of distributed registers will eliminate the small register name space problem, allowing the exploitation of instruction-level parallelism (ILP) to a greater degree. Significantly higher switching speeds internal to a chip, compared to off-chip communication and DRAM latencies, will allow software sequencing to replace specialized hardware functionality. Software sequencing also exposes the opportunity for customization to applications. As contrasted in Figure 3 with multiprocessors on a chip, inter-tile communication in Raw is at the register level and can be compiler orchestrated, allowing Raw to exploit fine-grain ILP to keep the tiles busy. Of course, a simple architecture also results in faster time to market, and eliminates the verification nightmare that threatens modern-day superscalars. The passage of time has not changed the benefits of simple architectures superbly articulated in the paper by Patterson and Ditzel [8] making a case for RISCs.

Programmable, integrated interconnect The switched interconnect between the tiles provides high-bandwidth, low-latency communication between neighboring tiles. Integrated interconnect allows channels with hundreds of wires instead of the traditional tens. VLSI switches are pad limited and rarely dominated by internal switch or channel area.
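The wire-length constraint above (no signal travels farther than one tile width per cycle) implies that inter-tile communication is pipelined hop by hop across the mesh. Assuming, purely for illustration, a cost of one cycle per tile hop, the latency between two tiles is simply their Manhattan distance:

```python
# Illustrative model: a pipelined near-neighbor interconnect where each
# hop costs one clock cycle (an assumption for this sketch, not a Raw spec).

def hop_latency(src, dst):
    """Cycles for a word to travel from tile src to tile dst on the mesh."""
    (r0, c0), (r1, c1) = src, dst
    return abs(r0 - r1) + abs(c0 - c1)

# Neighboring tiles communicate in a single cycle, close to a register read.
assert hop_latency((0, 0), (0, 1)) == 1
```

Under this model, latency grows with distance on the chip rather than with a fixed global-bus delay, which is what lets the clock period stay at one tile width of wire.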
The switches support dynamic routing, but are also programmable and synchronized so that a static schedule can be maintained across the entire machine should the compiler choose to do so. Static routing permits higher utilization of the available bandwidth of chip-level channel resources and eliminates dynamic synchronization costs.

Multi-sequential statically-scheduled instruction streams Each tile runs a single thread of computation with its own program counter. Switches that can be programmed with static communication schedules enable the compiler to statically schedule the computations in the threads on each tile. Static scheduling eliminates synchronization, and thus provides the cheapest form of synchronization: one that is free. Section 5 discusses how Raw software can handle cases such as dynamic routing and dependence checking, where traditional systems devote specialized hardware resources to them. Of course, Raw also provides basic flow control support on each channel so that a tile can stall if a neighboring tile violates the static schedule due to an unforeseen dynamic event. As discussed in Section 3, Raw also allows apportioning channel bandwidth between static routing and dynamic routing.

Multigranular operations The Raw architecture implements mechanisms such as wide-word ALUs and is thus coarser than traditional FPGA-based computers. In order to achieve the levels of parallelism an FPGA computer can attain, the Raw architecture is multigranular. Multigranular support includes the ability of a single tile to exploit bit- or byte-level parallelism when it exists in an application. Logic simulation applications, for example, require bit-level operations. FPGA-based emulation engines are simply special-purpose computers that can exploit bit-level parallelism. The Intel MMX design and several signal processing architectures have similar mechanisms to allow their ALUs to be used for a few wide-word operations or many narrow-word computations.
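The MMX-style multigranularity just mentioned can be illustrated in software: pack two 16-bit values into one 32-bit word and add both lanes with a single word-wide operation, suppressing the carry between lanes. This is a standard SWAR (SIMD-within-a-register) trick shown as a sketch of the idea, not Raw's actual hardware mechanism:

```python
# Sketch of a multigranular add: two 16-bit lanes packed in a 32-bit word,
# added in one word-wide operation with no carry across the lane boundary.
# (Standard SWAR technique, used here only to illustrate the concept.)

MASK32 = 0xFFFFFFFF

def add2x16(a, b):
    """Add the two packed 16-bit lanes of a and b independently."""
    low15 = 0x7FFF7FFF                 # every bit except each lane's MSB
    s = (a & low15) + (b & low15)      # partial sums cannot cross lane MSBs
    msb = (a ^ b) & 0x80008000         # lane MSBs added without carry-out
    return (s ^ msb) & MASK32
```

The low lane can wrap around without corrupting the high lane, which is exactly the behavior a hardware multigranular ALU provides in one cycle.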

Multigranular support in Raw is achieved through the use of a small amount of configurable logic in each tile, which can be customized to the right datapath width.

Configurability Configurability allows the customization of the hardware to support special operations for specific application demands. Each Raw tile includes a small amount of FPGA-like configurable logic that can be customized at bit-level granularity to perform special operations. The key with configurability is that bit- or byte-level parallelism can be combined with special communication between each of the individual bits or bytes, and it is thus significantly more powerful than MMX-like multigranularity. One way to view configurable logic is as a mechanism that gives the compiler a means to create customized single-cycle or multicycle instructions without resorting to longer software sequences.

1.1 Comparison with other Architectures

Raw builds on several previous architectures. iWarp [3] and NuMesh [4] both share with Raw the philosophy of building point-to-point networks that support static scheduling. However, the cost of initiating a message in both systems was so high that compilers could exploit only coarse-grain parallelism, or had to focus on very regular signal processing applications so the startup cost could be amortized over long messages. Note that even though iWarp supported very low-latency processor access to the message queues, it required an initial setup of a pathway before word-level data could be sent. Due to the difficulty of statically predicting events over long periods of time, static scheduling in a coarse-grain environment is very difficult. The Raw architecture improves on NuMesh and iWarp in several ways. The register-like latencies of communication in Raw allow it to exploit the same kinds of instruction-level parallelism that superscalars can exploit.
Predicting latencies over small instruction sequences, followed by static scheduling of multiple tightly coupled instruction streams, is much simpler. Just like the processor, the switches in Raw are programmable via an instruction stream produced by the compiler, and are thus significantly more flexible than either iWarp or the multiple finite-state-machine architecture of the NuMesh switches.

Like an FPGA-based computer, the Raw architecture achieves the speeds of special-purpose machines by exploiting fine-grain parallelism and fast static communication, and by exposing the low-level hardware details to facilitate compiler orchestration. Compilation in Raw is fast, unlike in FPGA computers, because it binds into hardware commonly used compute mechanisms such as ALUs, registers, word-wide data channels and memory paths, and communication and switching channels, thereby eliminating repeated low-level compilations of these macro units. The onerous compile times of our multi-FPGA-based Raw prototype computer (see Section 6) amply demonstrate this compile-speed problem. Binding of common mechanisms into hardware also yields significantly better execution speed, lower area, and better power efficiency than FPGA systems.

Many features in Raw are inspired by VLIW work. Like the VLIW Multiflow machine [5], Raw has a large register name space and a distributed register file. Like VLIWs, Raw also provides multiple memory ports, and relies heavily on compiler technology to discover parallelism and statically schedule computations. Unlike traditional VLIWs, however, Raw uses multiple instruction streams. Individual instruction streams give Raw significantly more flexibility to perform independent but statically scheduled computations in different tiles, such as loops with different bounds.

It is natural to compare this work with the natural evolution of multiprocessors, viz., integrating on a chip a multiprocessor that uses a simple RISC core [7].
Like Raw, such a multiprocessor would use a simple replicated tile and provide distributed memory. However, the cost of message startup and synchronization would hamper its ability to exploit fine-grain instruction-level parallelism. A

compiler can exploit Raw's static interconnect to schedule ILP across multiple tiles without the need for expensive dynamic synchronization. Raw architectures also promise better balance than the processor-in-memory or IRAM architectures. Current proposals for the latter architectures combine within a DRAM chip a single processor, and as such suffer from the same problems as extant superscalar processors, although to a lesser extent. As Dick Sites notes [2], extant superscalars spend three out of four CPU cycles waiting for memory for many applications. Long memory latencies are one of the causes of this and are not significantly reduced in the IRAM approach. The Raw approach directly addresses the memory latency problem by implementing many smaller memories, so there are multiple independent ports to memory, and each memory has lower latency. The parallelism allowed by multiple independent memory ports lets the compiler issue several memory operations simultaneously and helps mask the latency. Furthermore, if IRAM architectures hold the physical area of memory constant and increase the amount of memory, the latency problem will only be exacerbated. Processing and switching speeds will increase significantly in future-generation chips, making delays due to the long memory bitlines and wordlines relatively much worse. The Raw approach can be viewed as an IRAM approach in which multiple processing units are sprinkled with distributed memory, where each independent memory module is sized so its access speed scales with the processor clock.

1.2 Roadmap

The rest of the paper is arranged as follows. The next section discusses a simple example to demonstrate the benefits that derive from the Raw architecture. The following section describes the Raw architecture in more detail. Section 3 describes a hardware implementation of the Raw architecture.
Section 4 discusses the Raw compiler, and Section 5 discusses how data-dependent events such as dynamic routing and dependence checking can be supported within a statically scheduled framework. In Section 6 we give results from an early Raw prototype which show that Raw can achieve massive instruction-level parallelism with extremely low latency. Section 7 concludes the paper.

2 An Example

This section uses a small Jacobi program fragment to discuss how the features of Raw yield performance improvements over a traditional multiprocessor system. Our discussion focuses on the fine-grain parallelism exploited by a single-chip RawµP. Of course, as in extant multiprocessors, a Raw multiprocessor can exploit coarse-grain parallelism using several RawµP chips, a dynamic interconnect and external DRAM, by partitioning the program into multiple coarse-grain threads. The goal of the Raw microprocessor is to extract a further degree of fine-grain parallelism out of each of these threads. Let us start with a traditional fine-grain implementation of a Jacobi relaxation step in which each processor "owns" a single value, receives a single value from each of four neighbor processors, computes an average value and updates its own value with this average, and transmits its new value to its neighbors. Figure 4 shows successive versions of this Jacobi code, each one yielding progressive improvements in runtime as we integrate various features of Raw. In the first version, most of the time for such a fine-grain step in a traditional multiprocessor implementation would go into communication latency and synchronization delays between the nodes. The Raw architecture significantly reduces communication and memory access delays through its tightly integrated register-level inter-tile communication. Explicit synchronization is also eliminated because the multiple instruction streams are statically scheduled.

Original Code

    ;; MPP VERSION
    loop:
        send(msg, NDIR)         ; send value to the north
        receive(s, SDIR)        ; wait for value from the south
        ...                     ; COMMUNICATION LATENCY
        send(msg, SDIR)         ; send value to the south
        receive(n, NDIR)        ; wait for value from the north
        ...                     ; COMMUNICATION LATENCY
        send(msg, EDIR)         ; send value to the east
        receive(w, WDIR)        ; wait for value from the west
        ...                     ; COMMUNICATION LATENCY
        send(msg, WDIR)         ; send value to the west
        receive(e, EDIR)        ; wait for value from the east
        ...                     ; COMMUNICATION LATENCY
        ;; Move data into registers
        nreg <- [n]             ; n = &msg[x][y-1]
        sreg <- [s]             ; s = &msg[x][y+1]
        ereg <- [e]             ; e = &msg[x+1][y]
        wreg <- [w]             ; w = &msg[x-1][y]
        ;; Do computation
        t0 <- nreg + sreg
        t1 <- ereg + wreg
        t2 <- t0 + t1
        r  <- t2 >> 2           ; r = (((n+s)+(e+w))/4)
        ;; Move result back to memory
        [msg] <- r              ; *to = r
        barrier()
        ...                     ; SYNCHRONIZATION LATENCY
        iter <- iter + 1        ; while (++iter < max_iter)
        t4 <- (iter < max_iter)
        if t4 goto loop

    Second Version
    ;; TIGHTLY COUPLED COMMUNICATION
    ;; and MULTI-SEQUENTIAL STATIC SCHEDULING
    loop:
        send(r, NDIR)
        receive(sreg, SDIR), send(r, SDIR)
        receive(nreg, NDIR), send(r, EDIR)
        receive(wreg, WDIR), send(r, WDIR)
        receive(ereg, EDIR)
        ;; Do computation
        t0 <- nreg + sreg
        t1 <- ereg + wreg
        t2 <- t0 + t1
        r  <- t2 >> 2           ; r = (((n+s)+(e+w))/4)
        ;; NO BARRIER NEEDED, ALL NODES ARE IN LOCKSTEP
        iter <- iter + 1        ; while (++iter < max_iter)
        t4 <- (iter < max_iter)
        if t4 goto loop

    Third Version
    ;; CONFIGURABLE
    ;; (APPLICATION-SPECIFIC OPERATION ((a+b+c+d)/4))
    loop:
        send(r, NDIR)           ; send value to the north
        receive(sreg, SDIR), send(r, SDIR)
        receive(nreg, NDIR), send(r, EDIR)
        receive(wreg, WDIR), send(r, WDIR)
        receive(ereg, EDIR)     ; wait for value from the east
        ;; Do computation
        r <- (((n+s)+(e+w))>>2) ; APPLICATION-SPECIFIC OP
        iter <- iter + 1        ; while (++iter < max_iter)
        t4 <- (iter < max_iter)
        if t4 goto loop

Figure 4: Several variations on the Jacobi kernel.

These two benefits are reflected in the second Jacobi code sequence. The third benefit Raw conveys is support for single-cycle multigranular operations. If we assume that our Jacobi algorithm demanded 16-bit integer precision, we could possibly pack each register with two of the Jacobi values. We could then perform the 16-bit multigranular operation (addition, in this case) which, in one cycle, would perform two separate additions, one on the top halves of the registers, and the other on the bottom halves. The fourth benefit arises from Raw's configurable logic. A custom instruction can be instantiated for the application, as in the final Jacobi code sequence.

3 Raw Hardware

As discussed in Section 1, the approach used by the Raw architecture is to expose all underlying hardware mechanisms and move scheduling decisions into the compiler. This section describes a Raw hardware architecture which directly exposes a set of mechanisms, including ALUs, instruction sequencing hardware, communication interconnect switches, physical pipeline registers, and network flow control bits, to allow high-quality static scheduling. Additional hardware support is provided for handling dynamic events.

Physical Organization Globally, the Raw chip is organized as a mesh of processing tiles. The major datapaths of each tile are shown in Figure 5.

Figure 5: Major datapaths of each individual tile (processor side: instruction memory, data memory, register file, control logic, ALU, configurable logic; switch side: switch instruction memory, switch control logic).

The tile contains all the mechanisms found in a typical pipelined RISC processor. These include a register file, ALU, SRAM for software-managed instruction and data caches, and pipeline registers. In addition, each tile also includes some configurable logic to support bit- and byte-level parallel operations, and a programmable switch to communicate with neighboring tiles. The switch supports both dynamic routing and compiler-orchestrated static routing.
The switch is integrated directly into the processor pipeline to support

single-cycle message injection and receive operations.

Figure 6: Structure of the Raw switch. The switch multiplexes two different networks, one static and one dynamic, across the same physical wires. The physical channels also provide wires for flow control.

Figure 7: Logical information in the communication timetable implemented by the switch program. Each output channel is statically assigned an input (for example, a route from the north input to the west output channel). When no static assignment is pre-scheduled, time slots are available to the dynamic router.

Sequencing Each tile includes two sets of control logic and instruction memories. The first of these controls the operation of the processing elements. The second is dedicated to sequencing routing instructions for the switch. Separating the control of the processor and the switch allows the processor to take arbitrary data-dependent branches without disturbing the routing of independent messages that are passing through the switch. The switch schedule can be changed by loading a different program into the switch instruction memory.

Communication The switch structure is shown in more detail in Figure 6. Physical wires are a critical resource, so the switch multiplexes two logically distinct networks, one static and one dynamic, over the same set of physical wires. The static network is controlled by the sequence of switch instructions (Figure 7) which describe, for each cycle, which switch inputs are connected to each switch output. Not shown are the additional control bits to support static branches in the switch instruction stream. Because the communication patterns are orchestrated at compile time, messages do not require header interpretation.
Thus there is no routing overhead, and efficient single-word data transfers are possible. Single-word messages allow tiles to take advantage of registers and function units in neighboring tiles to support distributed register files and very fine-grain ILP. Any slot in the routing timetable not allocated to static scheduling is available to the dynamic router. The dynamic router is a traditional wormhole router, making routing decisions based on the head of each message. Each word following the header simply follows the same path as the

header. The dynamic router includes the required additional lines for flow control in the case that messages become temporarily blocked due to contention or the need to wait for empty slots in the static schedule. Thus in Figure 6, when a physical channel is reserved for the static router, the dynamic router will see a stalling flow control signal. The processor/network interface is shared between the two networks. The processor uses distinct opcodes to distinguish between accesses to the two networks. The interface performs different actions depending on which network the message is injected into. For the static network, the message is injected directly into, or copied directly out of, the static switch. The internals of the dynamic network include send and receive FIFOs. The flow control bits for the dynamic network are visible to the processor, so the software can determine whether the send queue is full or the receive queue is empty. This allows the processor to utilize the dynamic network without hardware stalls and permits flexibility in the rate at which individual nodes poll for dynamic events. Although compiler orchestration ensures that the static switch stalls infrequently, the switches admit the possibility of failing to meet a static schedule because they allow interaction between statically scheduled and dynamic events. Notice that while the static and dynamic networks are logically disjoint, data dependencies between statically scheduled code and dynamic messages may exist in the processor code. Such a system must support some form of flow control even in the static network for correctness. Our design provides minimal hardware-supported flow control in the static network. This flow-control information is also exported to the software. If the processor or switch attempts to use a busy resource, it is stalled until the resource is free. A stalled processor or switch, in turn, may cause others to stall as well.
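The division of channel slots described above, where the static timetable reserves some (cycle, output) slots and everything left over falls to the dynamic router, can be sketched as a small table. The channel names and the schedule itself are illustrative assumptions, not a Raw switch program:

```python
# Toy model of the switch communication timetable (Figure 7).
# Channel names and the example schedule are illustrative assumptions.

class SwitchSchedule:
    def __init__(self):
        self.table = {}   # (cycle, output_channel) -> input_channel

    def assign(self, cycle, out_ch, in_ch):
        """Record a compile-time static route for one time slot."""
        self.table[(cycle, out_ch)] = in_ch

    def route(self, cycle, out_ch):
        """Return ('static', input) if the slot is reserved; otherwise the
        slot is free and ('dynamic', None) hands it to the wormhole router."""
        if (cycle, out_ch) in self.table:
            return ('static', self.table[(cycle, out_ch)])
        return ('dynamic', None)

sched = SwitchSchedule()
sched.assign(0, 'west', 'north')   # cycle 0: route north input -> west output
```

The flow-control rule follows directly from the lookup: a reserved slot stalls the dynamic router, and an unreserved slot never stalls the static schedule, because the static schedule simply does not use it.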
While the static router is stalled, the dynamic router is allowed to use all timeslots until the static router is able to make forward progress. We are also exploring higher-level end-to-end software flow control strategies to avoid the requirement of flow control on the static network altogether.

Memory Each Raw tile contains a small amount of data SRAM that can be locally addressed. The amount of data memory is chosen to roughly balance the areas devoted to processing and memory, and to match within a factor of two or three the memory access time and the processor clock. Memory accesses will be pipelined and scheduled by the compiler. As with the communication network, the demands of static scheduling require that dynamic events, like cache misses in a conventional processor, do not stall the processor pipeline. The data memory on each tile, therefore, is directly physically addressed. A hardware cache tag file will export a hit/miss status to the software rather than stalling the processor pipeline, so that the software can check the status if it is uncertain a datum is in the cache. Higher-level abstractions like caching, global shared memory and memory protection are implemented by the compiler [10], and dynamic events, like cache misses, are handled entirely in software.

Discussion The Raw hardware architecture provides a number of benefits over conventional superscalar designs. Because of its tiled, distributed structure, Raw can achieve large I-fetch, data memory and register file bandwidth. Since each tile contains its own set of resources, multiple memory units and registers can be accessed in parallel without resorting to building expensive and slow multi-ported register files or memories. Because individual hardware mechanisms are exposed directly to the compiler, applications with static structures can take advantage of the raw speed available from the static network.
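The software-managed caching described in the Memory paragraph above, where a hardware tag file reports hit/miss status instead of stalling the pipeline, might look like the following sketch. The class and method names are hypothetical, invented for this illustration:

```python
# Hypothetical sketch: a tag file exports hit/miss status to software
# instead of stalling the pipeline; the miss handler runs in software.

class SoftwareCache:
    def __init__(self, nlines, backing):
        self.nlines = nlines
        self.tags = [None] * nlines     # the hardware tag file
        self.data = [0] * nlines
        self.backing = backing          # simulated off-tile memory (a dict)

    def probe(self, addr):
        """The exported hit/miss status bit: no stall, just a status."""
        return self.tags[addr % self.nlines] == addr

    def load(self, addr):
        line = addr % self.nlines
        if not self.probe(addr):                 # software checks the status...
            self.data[line] = self.backing[addr] # ...and handles the miss itself
            self.tags[line] = addr
        return self.data[line]
```

The design point this illustrates is that the pipeline never blocks on a miss; the compiler can instead schedule the probe, and only run the (software) miss handler when the status bit says it must.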
Messages that can be statically structured can be injected into and removed from the network with no packaging or routing overhead, and messages move through the network at the lowest possible latency. Because the Raw design exposes all possible dynamic events to the software, dynamic and static communication

patterns can coexist, and unexpected stalls are rare. Finally, the tile organization on a single chip permits extremely wide and short message paths. Unlike traditional routers that face pin limits, the data paths in an on-chip router can be several words wide, permitting extremely high-bandwidth communication. The tiled point-to-point communication organization further permits the avoidance of long chip-crossing datapaths. Each signal need travel no more than the width of a single tile each cycle, reducing the effort required to create very high clock-rate designs.

Figure 8: Compilation phases in the Raw compiler. The phases take a source program through the front end, high-level transformations, configuration selection, logical thread partitioning, logical data partitioning, traditional optimizations, global register allocation, placement, routing/scheduling, and code generation, producing optimized target code for the Raw machine.

4 Raw Compilation System

The goal of the Raw compiler is to take a single-threaded (sequential) or multi-threaded (parallel) program written in a high-level programming language and map it onto Raw hardware. Raw's explicitly parallel model makes it possible for the compiler to find and express program parallelism directly to the hardware. This overcomes a fundamental limitation of superscalar processors: that the hardware must extract instructions for concurrent execution from a sequential instruction stream. The issue logic that dynamically resolves instruction dependencies in superscalars is a system bottleneck, limiting available concurrency [3].
The Raw machine avoids the bottleneck resulting from a single sequential instruction stream by supporting a multi-sequential architecture in which multiple sequential instruction streams execute on multiple tiles (one stream per tile). Unlike the medium-grain or coarse-grain parallelism exploited on multiprocessor systems, the compiler views the collection of N tiles in the Raw machine as a collection of functional units for exploiting instruction-level parallelism. In this section, we outline a sequence of compilation phases for the Raw machine. These phases are shown in Figure 8. While we may consider merging some of the phases in the future, this is the phase ordering that we will use in building the first version of the Raw compiler. The compilation phases are discussed in the following paragraphs, along with the rationale for this phase ordering.
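Treating the N tiles as functional units for ILP, as described above, amounts to placing ready instructions onto tiles cycle by cycle. A greedy list-scheduling sketch over a dependence DAG conveys the idea; it is a toy stand-in (ignoring communication latency between tiles) for the real Placement and Routing/Scheduling phases:

```python
# Toy greedy list scheduler: place ready instructions of an acyclic
# dependence graph onto n_tiles tiles, one instruction per tile per cycle.
# Ignores inter-tile communication latency; illustrative only.

def list_schedule(deps, n_tiles):
    """deps: {instr: set of instrs it depends on}.
    Returns {instr: (cycle, tile)}."""
    done, placed, cycle = set(), {}, 0
    while len(done) < len(deps):
        # An instruction is ready once all of its dependences have issued.
        ready = [i for i in deps if i not in done and deps[i] <= done]
        issued = ready[:n_tiles]          # at most one per tile this cycle
        for tile, instr in enumerate(issued):
            placed[instr] = (cycle, tile)
        done.update(issued)
        cycle += 1
    return placed
```

With two tiles, independent instructions issue in the same cycle on different tiles, which is precisely the ILP-across-tiles picture the text describes.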

Front-end

The job of the front-end is to generate intermediate code for an abstract Raw machine with unbounded resources, i.e., an unbounded number of virtual tiles and virtual registers per tile. The generated code reflects the structure of the source program, and may be sequential with a single address space, or multithreaded with a shared address space or a distributed address space, depending on whether the source program is sequential or parallel. Regardless of the form of the source program, the problems of co-locating instructions and data within a tile (Partitioning) and of mapping instructions and data to specific tiles (Placement) are handled by later phases in the compiler and not by the front-end. Data references are kept at a high level and are classified as private or shared. Private data is only accessible by the virtual tile in which it is allocated. Shared data is accessible by all virtual tiles. Data references are also classified by the compiler as being predictable ("static") or unpredictable ("dynamic") using type information and program analysis information. A predictable data reference is one for which the compiler will be able to predict the destination tile, even though the base address may be unknown. This issue is discussed in detail in the Logical Data Partitioning phase. The conservative assumption for an unanalyzable data reference is to classify it as unpredictable. For a traditional sequential program with no explicit parallelism or data distribution, the intermediate code generated by the front-end for the abstract Raw machine will look very similar to intermediate code generated in compilers for traditional uniprocessors. The restructuring into multi-sequential code for a physical Raw machine is performed by later phases of the Raw compiler.

High-Level Transformations

This phase performs high-level program transformations to expose greater levels of instruction-level parallelism and register locality.
The transformations include loop unrolling, procedure inlining, software pipelining, code duplication, scalar replacement, speculative execution, and iteration-reordering loop transformations such as loop interchange, loop tiling, and unimodular loop transformations [4]. As in the high-level optimizers in today's compilers for high-performance processors, this phase includes a deep intraprocedural and interprocedural analysis of the program's control flow and data flow. Alias analysis is especially important for reducing the number of unpredictable (dynamic) memory references on the Raw machine. Branch instructions and array data references in the intermediate code are lowered to machine-level instructions after this phase, but they still use symbolic labels as address arguments. High-level transformation is the first optimization phase performed after the front-end, so that the remaining compiler phases can operate on the lower-level intermediate code.

Configuration Selection

This phase selects a configuration to be loaded into the configurable logic in the Raw machine for the given input program. For the sake of simplicity, the first version of the Raw compiler assumes that each tile is loaded with the same configuration and only considers implementing combinational logic as custom instructions. Future versions of the compiler may lift these restrictions. The compiler will be capable of performing this phase automatically but will also give the programmer the option of specifying custom instructions. As shown in Figure 9, for each custom operation the configuration phase must both output a specification for the configurable logic and rewrite the intermediate code so that each subtree (compound operation) is replaced by a call to the appropriate custom instruction. The compiler will invoke a logic synthesis tool to translate a custom operation specification into the appropriate bit sequence for the configurable logic in the Raw machine.
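The rewriting half of this step can be sketched as follows — the tuple-based IR, the shift-and-add pattern, and the `custom_shladd` name are invented for illustration, not taken from the Raw system:

```python
# Hypothetical sketch: replace each occurrence of a compound operation,
# here (x << 2) + y, with a call to a single custom instruction that
# the configurable logic would implement in one cycle.

def rewrite(expr):
    """Recursively replace ('add', ('shl', x, 2), y) subtrees with
    ('custom_shladd', x, y) in a tuple-encoded expression tree."""
    if not isinstance(expr, tuple):
        return expr                       # leaf: variable or constant
    op, *args = expr
    args = [rewrite(a) for a in args]     # rewrite children first
    if (op == 'add' and isinstance(args[0], tuple)
            and args[0][0] == 'shl' and args[0][2] == 2):
        return ('custom_shladd', args[0][1], args[1])
    return (op, *args)
```

A production version would select which patterns to match by frequency, as the tree pattern-matching discussion below the figure suggests; this sketch hard-codes one pattern to show the shape of the transformation.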
Our current Raw prototype system, described in Section 6, requires that these special operators be identified by the programmer and specified in behavioral Verilog. The compiler will extend the dynamic programming algorithms used in tree pattern-matching

systems [] to identify commonly occurring subtree patterns that might be good candidates for implementation as custom instructions. To support multigranular operation the compiler will search for important special cases, such as bitwise and multiple short-word (e.g., 8-bit) arithmetic operations that can be packed together into a single custom instruction.

[Figure 9: Compiler interaction with the architecture — the compiler maps the input program to processor code, switch code, and a configurable logic description for each tile.]

Partitioning into Logical Threads

The Partitioning phase generates parallel code for an idealized Raw machine with the same number of tiles as the physical architecture, but assuming an idealized fully-connected switch so that tile placement is not an issue. This idealized Raw architecture also has an unbounded number of virtual registers per tile and supports symbolic data references like the abstract Raw machine. The structure of the code after partitioning is similar to that of a single-program-multiple-data (SPMD) program. A single program is assumed for all threads, along with a "my id" primitive that is used to guide control flow on different threads. Along with traditional multiprocessor and vector loop parallelization techniques [9], this phase identifies and partitions for fine-grain instruction-level parallelism. The partitioner must balance the benefits of parallelism against the overheads of communication and synchronization. Because these overheads are much lower in the Raw microprocessor than in multiprocessors, the partitioning can occur at a much finer grain than in conventional parallelizing compilers. Note that the Raw compiler performs logical thread partitioning before logical data partitioning, which is counter to the approach used by compilers for data-parallel languages such as High Performance Fortran.
This phase ordering is motivated by a desire to target a much wider class of applications with instruction-level parallelism versus static data-parallelism. In the future, we will explore the possibility of combining thread partitioning and data partitioning into a single phase.

Logical Data Partitioning

This phase determines the co-location of data and computation by binding symbolic variables to logical threads. In addition, symbolic names are translated to lower-level offsets and locations in the local address spaces of the logical threads. The storage allocation is selected to enhance data locality as much as possible, and may include replication of frequently-used variables/registers. For array variables that have no storage aliasing, this phase can also include selection of data layouts (e.g., row-major or column-major) and of data distributions (e.g., block, cyclic, etc., as in HPF). This phase assumes that each tile's address space is divided into a private section and a shared section. An access to the private section is always local. The compiler implements a shared memory abstraction on top of the distributed physical memory using techniques similar to those described in [10]. As far as possible, we would like the Raw compiler to statically predict the destination tile for a memory access to the shared section, so that even an access to a remote tile can be statically routed.

We observe that the offset for many memory operations is often known at compile time (e.g., the stack frame offset for a local variable, or the field/element offset in a heap-allocated object) even though the base address may be unknown. Therefore, by using low-order interleaving and N·B alignment we can calculate the destination tile for each memory request.

1. Low-order interleaving. We assume that the address space is divided into blocks of B = 2^b bytes (typical values of B are 8, 16, 32, 64), and that the addressing of blocks is low-order-interleaved across tiles. This means that the bit layout of a data address in Raw is as follows, where N = 2^n is the number of tiles:

   +----------------------------+--------------+--------------+
   | Block address in node/tile | Node/tile id | Block offset |
   +----------------------------+--------------+--------------+
                                 <---- n ----->  <---- b ---->

   This address format makes it easy for the compiler to extract the tile id from a low-order address offset.

2. N·B alignment. We assume the base address of commonly used program entities (stack frames, static arrays, heap-allocated structures, etc.) is aligned with an N·B boundary (i.e., has zeroes in the lowest n + b bits). This enables the compiler to statically predict the destination tile for a known offset, even though the actual base address is unknown. For base pointers, such as function parameters, that may not be known at compile time to be N·B aligned, we suggest generating dual-path (multi-version) code with a runtime test that checks whether all unknown base addresses are N·B aligned. The code for the case when the test returns true will contain only predictable references (this should be the common case). The code for the false case will contain unpredictable references and thus be less efficient, but will serve as correct fallback code for executing with unaligned arguments.

Traditional Optimizations

This phase performs traditional back-end optimizations such as constant propagation, common-subexpression elimination, loop-invariant code motion, strength reduction, and elimination of dead code and unreachable code.
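The destination-tile calculation under low-order interleaving can be sketched as follows — a toy model with invented parameters (N = 4 tiles, B = 16-byte blocks); the real machine's sizes may differ:

```python
# Sketch: with N = 2**n tiles and B = 2**b byte blocks, the tile id
# occupies bits b .. b+n-1 of an address. If a base address is
# N*B-aligned (zeroes in the lowest n+b bits), the tile of base+offset
# depends only on the offset, so the compiler can resolve the
# destination tile even when the base itself is unknown.

N_BITS, B_BITS = 2, 4            # n = 2 (N = 4 tiles), b = 4 (B = 16 bytes)

def tile_of(addr):
    """Extract the node/tile id field from an address."""
    return (addr >> B_BITS) & ((1 << N_BITS) - 1)

def static_tile(offset):
    """Destination tile of base+offset for any N*B-aligned base."""
    return tile_of(offset)

base = 0x1000                    # divisible by N*B = 64, i.e. aligned
assert all(tile_of(base + off) == static_tile(off) for off in range(256))
```

The closing assertion is exactly the property the compiler relies on: for aligned bases, the offset alone determines the tile, so a "predictable" reference can be statically routed.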
The output of this phase is optimized intermediate code with operations at the level of the target instruction set. Traditional optimizations are performed after Logical Data Partitioning so that all address computations in the program are exposed for optimization, but before Global Register Allocation so that register allocation can be performed on optimized code.

Global Register Allocation

This phase assigns the unbounded number of virtual/symbolic registers assumed in previous phases to the fixed N·R registers in the target machine, where R is the number of registers in a single tile. Our motivation in performing Global Register Allocation after Partitioning is to give priority to instruction-level parallelism over register locality. Had we performed Register Allocation earlier, we would be in danger of reducing instruction-level parallelism by adding false dependences on physical registers before the Partitioning phase. The register allocation phase is guided by hardware costs for accessing local/nonlocal registers and local/nonlocal memory locations. Our initial assumption is that the costs will be in increasing order for the following:

1. Access to a local register

2. Access to a nonlocal register (in a neighboring tile)
3. Statically predictable access to a local memory location
4. Statically predictable access to a nonlocal memory location
5. Unpredictable (dynamic) access to a memory location

[Figure 10: Input program communication graph, the Raw microprocessor topology, and the Raw communication channel graph showing the initial values of the available channel capacities.]

Placement

The only remaining idealization in the machine model at this stage of compilation is the hypothetical crossbar interconnect among threads. The Placement phase removes this idealization by selecting a 1-1 mapping from threads to physical tiles based on the inter-thread communication patterns and the physical structure of the static interconnect among tiles in the Raw machine. Because the earlier data partitioning phase co-located data variables and arrays with the threads, the thread placement phase results in physical data placement as well. The placement algorithm attempts to minimize a latency and bandwidth cost measure and is a variant of a VLSI cell placement algorithm. This algorithm is used in our current multi-FPGA compiler system [2]. The placement algorithm attempts to lay close together threads that communicate intensely or threads that are on timing-critical paths. As shown in Figure 10, the Placement algorithm uses a dependence graph for the program to determine critical paths and the volume of communication between threads. Each edge is annotated both with the length of the longest critical path on which it lies (shown in the figure) and with the volume of communication in words it represents (not shown). The placement algorithm also uses a physical channel graph whose edges are annotated with the physical bandwidth they represent. The channel graph captures the topology and the available bandwidth of the Raw microprocessor.
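The kind of latency-and-bandwidth cost measure being minimized can be sketched as follows — the 2D mesh coordinates, uniform bandwidth, and example graphs are invented for illustration, and the latency-related term is omitted:

```python
# Sketch: placement cost = sum over communicating thread pairs of
# (communication volume x hop count / available bandwidth).
# An annealer would perturb `placement` and keep low-cost states.

def hops(t1, t2):
    """Manhattan distance between tiles given as (x, y) mesh coordinates."""
    return abs(t1[0] - t2[0]) + abs(t1[1] - t2[1])

def placement_cost(comm, placement, bandwidth=1.0):
    """comm: {(thread_a, thread_b): words communicated};
    placement: {thread: (x, y) tile}."""
    return sum(vol * hops(placement[a], placement[b]) / bandwidth
               for (a, b), vol in comm.items())

comm = {('t0', 't1'): 100, ('t0', 't3'): 10}
near = {'t0': (0, 0), 't1': (0, 1), 't3': (1, 1)}   # heavy pair adjacent
far  = {'t0': (0, 0), 't1': (1, 1), 't3': (0, 1)}   # heavy pair 2 hops apart
assert placement_cost(comm, near) < placement_cost(comm, far)
```

The assertion captures the objective's intent: placements that put heavily communicating threads on adjacent tiles score lower.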
The goal of placement is to determine an optimal mapping of the input program graph to the channel graph. Our prototype system uses simulated annealing to optimize a placement cost function. A suitable cost function sums, over all pairs of threads, the product of the communication volume between a pair of threads and the number of hops between the pair of tiles on which the individual threads are placed, divided by the available bandwidth between the tiles. To account for latency, the cost function also folds in a latency-related weight.

Routing and Global Scheduling

This phase inserts routing instructions for nonlocal data accesses, and selects the final ordering of computation and routing instructions on individual tiles

with the goal of minimizing the overall estimated completion time. This phase will automatically give priority to routing/compute instructions on a critical path, without requiring special-case treatment for either class of instructions. Routing uses the same input program graph and channel graph used by placement, and assumes that a data value can be communicated between a neighboring pair of tiles in a single cycle. For each value communicated between threads, the scheduler prescribes both a sequence of inter-tile channels and a corresponding sequence of time slots (or cycles) in which each channel will be used. The following is a brief summary of the TIERS (Topology Independent Pipelined Routing and Scheduling) routing and scheduling algorithm that we will use []. This algorithm is implemented in our emulation system, and we propose to adapt it for use in Raw. Given the channel graph and the dependence-analyzed input program communication graph, the TIERS routing approach uses a greedy algorithm to determine a time-space path for each nonlocal communication event and a schedule for the instructions in each thread. The scheduler relies on available channel capacities for each cycle, maintained in each edge of the channel graph. The greedy algorithm performs a topological sort of the communication edges in the input program graph so that edges appear in partial dependence order. The routing algorithm picks edges in order from the sorted list and routes each edge using the sequence of channels in the shortest path between the source and destination tiles. The scheduler determines the cycle on which a value can be scheduled for communication based on three quantities: the arrival times of the edges on which it depends; the amount of time required for any computation or memory accesses that produce the value; and the availability of channel capacity along the shortest path during a subsequent sequence of cycles.
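The core greedy time-space reservation can be sketched as follows — the data structures are simplified stand-ins for the channel graph (unit capacity per channel per cycle, producer time folded into the arrival cycle), not the real system's:

```python
# Sketch: route one communication edge along its precomputed shortest
# path of channels, sliding the start cycle later until every needed
# (channel, cycle) slot is free, then reserve those slots.

def route_edge(path, arrival, reserved):
    """path: list of channel names along the shortest path;
    arrival: earliest cycle the value is ready to leave;
    reserved: set of (channel, cycle) slots already taken (mutated).
    Returns the cycle on which the value reaches its destination."""
    start = arrival
    while any((ch, start + i) in reserved for i, ch in enumerate(path)):
        start += 1                       # slide the whole route later
    for i, ch in enumerate(path):        # claim one slot per hop
        reserved.add((ch, start + i))
    return start + len(path)

reserved = set()
t1 = route_edge(['c01', 'c12'], arrival=0, reserved=reserved)
t2 = route_edge(['c01'], arrival=0, reserved=reserved)  # c01 busy at cycle 0
assert (t1, t2) == (2, 2)
```

Processing edges in topologically sorted order, as the text describes, guarantees that every `arrival` is already known when an edge is routed; the second call above shows a later edge being pushed back a cycle because an earlier edge holds the channel.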
The scheduler decrements the available channel capacities for each channel along the route taken by the edge, and marks a channel unavailable during any cycle on which its capacity is exhausted. Routing of each edge is accompanied by the production of a precise schedule for computations and memory accesses within each thread.

Final Code Generation

This phase emits the final code in the appropriate object code format for the Raw machine. It also emits the appropriate bit sequence for implementing the selected configuration within each tile.

5 Software Support for Dynamic Events

As in any approach advocating exposure of hardware details, one raison d'être of the Raw machine is that compilation techniques exist which can manage, and indeed take advantage of, the resulting explicit control. For static program regions, the compilation process described in Section 4 provides plausible grounds for leveraging this control to achieve attractive performance gains compared to more traditional architectures. For efficient execution of the full spectrum of general-purpose programs, however, we must complement this compilation paradigm with techniques that support dynamic and unpredictable events. Dynamic events typically occur when data dependencies and/or latencies cannot be resolved at compile time. Examples of unresolvable data dependence include undecidable program input and non-linear (or indirect) array indexing. Latencies unknown at compile time may be a function of the memory system (cache misses, page faults), of data-dependent routing destinations, or simply of a region of code which executes for a data-dependent number of cycles. A final class of dynamic events includes unpredictable situations that arise as a function of the external environment of the processor, such as a bus interrupt. Raw's philosophy for handling dynamic situations that occur while running a program is to:

- ensure correct execution of all dynamic events,
- handle infrequent events in software (i.e., via polling), and
- use dynamic hardware as a last resort.

We have discovered that under the right circumstances certain static techniques can actually outperform dynamic protocols at their own game. When static techniques fail, we have discovered software solutions for the Raw architecture which approach the performance of additional hardware without the entailed overhead and complexity. And finally, when we do resort to hardware, such as in the dynamic routing of non-uniform message streams, we must be careful to ensure that the unpredictable dynamic situations are handled without interfering with any carefully predicted static events taking place at the same time. This section continues by describing some of these techniques.

5.1 Software Dynamic Routing

While some applications can be compiled in a completely static fashion, many applications have at least some amount of dynamic, data-dependent communication. We use the term dynamic message to denote any act of communication that cannot be routed and scheduled by the compiler. Raw uses one of two approaches to handle dynamic messages. The first is a software approach that preserves the predictability of the program in the face of dynamic messages by using a static, data-independent routing protocol. This approach statically reserves channel bandwidth between nodes that may potentially communicate. In the worst case, this conservative approach will involve an all-to-all personal communication schedule [6] between the processing elements. All-to-all schedules effectively simulate global communication architectures, such as a crossbar or bus, on our static mesh. This approach is effective for a small number of tiles. It is also effective when messages are short, so that the amount of communication bandwidth wasted is small. The second approach uses the dynamically routed network.
Raw supports a dynamically routed network in which communication is grouped into multi-word packets with a header that includes the destination address. Hardware interprets the message header in a packet at each switching point and dynamically routes the packet in an appropriate direction. By conservatively estimating the delivery time of dynamic messages and allowing enough slack, the compiler can still hope to meet its static schedule. Correctness is preserved in this approach through software checks of the flow control bits. This approach is more effective when the dynamic messages are long, so that the overhead of the message headers and the routing decision making can be amortized over a larger volume of communication.

5.2 Software Memory Dependency Checking

Dynamically checking for dependencies between every pair of memory operations is expensive. Most superscalar processors can sustain only one, or at most two (e.g., Alpha 21264), memory instruction issues per cycle. Since, in most programs, memory operations account for at least 20% of the total dynamic instructions, this limit on memory issue sharply limits the total parallelism that a superscalar processor can achieve. While our pedagogical vanilla Raw machine will include the minimal hardware mechanisms required, our research directions include the evaluation of various levels of hardware support for the kinds of dynamic events found in real applications.

Computer Organization

Computer Organization Computer Organization Douglas Comer Computer Science Department Purue University 250 N. University Street West Lafayette, IN 47907-2066 http://www.cs.purue.eu/people/comer Copyright 2006. All rights reserve.

More information

Compiler Optimisation

Compiler Optimisation Compiler Optimisation Michael O Boyle mob@inf.e.ac.uk Room 1.06 January, 2014 1 Two recommene books for the course Recommene texts Engineering a Compiler Engineering a Compiler by K. D. Cooper an L. Torczon.

More information

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introuction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threas 6. CPU Scheuling 7. Process Synchronization 8. Dealocks 9. Memory Management 10.Virtual Memory

More information

Just-In-Time Software Pipelining

Just-In-Time Software Pipelining Just-In-Time Software Pipelining Hongbo Rong Hyunchul Park Youfeng Wu Cheng Wang Programming Systems Lab Intel Labs, Santa Clara What is software pipelining? A loop optimization exposing instruction-level

More information

Coupling the User Interfaces of a Multiuser Program

Coupling the User Interfaces of a Multiuser Program Coupling the User Interfaces of a Multiuser Program PRASUN DEWAN University of North Carolina at Chapel Hill RAJIV CHOUDHARY Intel Corporation We have evelope a new moel for coupling the user-interfaces

More information

Message Transport With The User Datagram Protocol

Message Transport With The User Datagram Protocol Message Transport With The User Datagram Protocol User Datagram Protocol (UDP) Use During startup For VoIP an some vieo applications Accounts for less than 10% of Internet traffic Blocke by some ISPs Computer

More information

As our industry develops the technology that

As our industry develops the technology that Theme Feature Baring It All to Software: Raw Machines This innovative approach eliminates the traditional instruction set interface and instead exposes the details of a simple replicated architecture directly

More information

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control Almost Disjunct Coes in Large Scale Multihop Wireless Network Meia Access Control D. Charles Engelhart Anan Sivasubramaniam Penn. State University University Park PA 682 engelhar,anan @cse.psu.eu Abstract

More information

6.823 Computer System Architecture. Problem Set #3 Spring 2002

6.823 Computer System Architecture. Problem Set #3 Spring 2002 6.823 Computer System Architecture Problem Set #3 Spring 2002 Stuents are strongly encourage to collaborate in groups of up to three people. A group shoul han in only one copy of the solution to the problem

More information

Overview. Operating Systems I. Simple Memory Management. Simple Memory Management. Multiprocessing w/fixed Partitions.

Overview. Operating Systems I. Simple Memory Management. Simple Memory Management. Multiprocessing w/fixed Partitions. Overview Operating Systems I Management Provie Services processes files Manage Devices processor memory isk Simple Management One process in memory, using it all each program nees I/O rivers until 96 I/O

More information

Loop Scheduling and Partitions for Hiding Memory Latencies

Loop Scheduling and Partitions for Hiding Memory Latencies Loop Scheuling an Partitions for Hiing Memory Latencies Fei Chen Ewin Hsing-Mean Sha Dept. of Computer Science an Engineering University of Notre Dame Notre Dame, IN 46556 Email: fchen,esha @cse.n.eu Tel:

More information

Generalized Edge Coloring for Channel Assignment in Wireless Networks

Generalized Edge Coloring for Channel Assignment in Wireless Networks Generalize Ege Coloring for Channel Assignment in Wireless Networks Chun-Chen Hsu Institute of Information Science Acaemia Sinica Taipei, Taiwan Da-wei Wang Jan-Jan Wu Institute of Information Science

More information

PART 2. Organization Of An Operating System

PART 2. Organization Of An Operating System PART 2 Organization Of An Operating System CS 503 - PART 2 1 2010 Services An OS Supplies Support for concurrent execution Facilities for process synchronization Inter-process communication mechanisms

More information

Online Appendix to: Generalizing Database Forensics

Online Appendix to: Generalizing Database Forensics Online Appenix to: Generalizing Database Forensics KYRIACOS E. PAVLOU an RICHARD T. SNODGRASS, University of Arizona This appenix presents a step-by-step iscussion of the forensic analysis protocol that

More information

An Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique

An Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique International OPEN ACCESS Journal Of Moern Engineering Research (IJMER) An Aaptive Routing Algorithm for Communication Networks using Back Pressure Technique Khasimpeera Mohamme 1, K. Kalpana 2 1 M. Tech

More information

AnyTraffic Labeled Routing

AnyTraffic Labeled Routing AnyTraffic Labele Routing Dimitri Papaimitriou 1, Pero Peroso 2, Davie Careglio 2 1 Alcatel-Lucent Bell, Antwerp, Belgium Email: imitri.papaimitriou@alcatel-lucent.com 2 Universitat Politècnica e Catalunya,

More information

Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines 1 2

Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines 1 2 This paper appears in J. of Parallel an Distribute Computing 10 (1990), pp. 167 181. Intensive Hypercube Communication: Prearrange Communication in Link-Boun Machines 1 2 Quentin F. Stout an Bruce Wagar

More information

Computer Organization

Computer Organization Computer Organization Douglas Comer Computer Science Department Purue University 250 N. University Street West Lafayette, IN 47907-2066 http://www.cs.purue.eu/people/comer Copyright 2006. All rights reserve.

More information

SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH

SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH Galen H Sasaki Dept Elec Engg, U Hawaii 2540 Dole Street Honolul HI 96822 USA Ching-Fong Su Fuitsu Laboratories of America 595 Lawrence Expressway

More information

Skyline Community Search in Multi-valued Networks

Skyline Community Search in Multi-valued Networks Syline Community Search in Multi-value Networs Rong-Hua Li Beijing Institute of Technology Beijing, China lironghuascut@gmail.com Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong, China yu@se.cuh.eu.h

More information

Supporting Fully Adaptive Routing in InfiniBand Networks

Supporting Fully Adaptive Routing in InfiniBand Networks XIV JORNADAS DE PARALELISMO - LEGANES, SEPTIEMBRE 200 1 Supporting Fully Aaptive Routing in InfiniBan Networks J.C. Martínez, J. Flich, A. Robles, P. López an J. Duato Resumen InfiniBan is a new stanar

More information

Queueing Model and Optimization of Packet Dropping in Real-Time Wireless Sensor Networks

Queueing Model and Optimization of Packet Dropping in Real-Time Wireless Sensor Networks Queueing Moel an Optimization of Packet Dropping in Real-Time Wireless Sensor Networks Marc Aoun, Antonios Argyriou, Philips Research, Einhoven, 66AE, The Netherlans Department of Computer an Communication

More information

Optimal Routing and Scheduling for Deterministic Delay Tolerant Networks

Optimal Routing and Scheduling for Deterministic Delay Tolerant Networks Optimal Routing an Scheuling for Deterministic Delay Tolerant Networks Davi Hay Dipartimento i Elettronica olitecnico i Torino, Italy Email: hay@tlc.polito.it aolo Giaccone Dipartimento i Elettronica olitecnico

More information

ACE: And/Or-parallel Copying-based Execution of Logic Programs

ACE: And/Or-parallel Copying-based Execution of Logic Programs ACE: An/Or-parallel Copying-base Execution of Logic Programs Gopal GuptaJ Manuel Hermenegilo* Enrico PontelliJ an Vítor Santos Costa' Abstract In this paper we present a novel execution moel for parallel

More information

On the Placement of Internet Taps in Wireless Neighborhood Networks

On the Placement of Internet Taps in Wireless Neighborhood Networks 1 On the Placement of Internet Taps in Wireless Neighborhoo Networks Lili Qiu, Ranveer Chanra, Kamal Jain, Mohamma Mahian Abstract Recently there has emerge a novel application of wireless technology that

More information

Impact of FTP Application file size and TCP Variants on MANET Protocols Performance

Impact of FTP Application file size and TCP Variants on MANET Protocols Performance International Journal of Moern Communication Technologies & Research (IJMCTR) Impact of FTP Application file size an TCP Variants on MANET Protocols Performance Abelmuti Ahme Abbasher Ali, Dr.Amin Babkir

More information

EDOVE: Energy and Depth Variance-Based Opportunistic Void Avoidance Scheme for Underwater Acoustic Sensor Networks

EDOVE: Energy and Depth Variance-Based Opportunistic Void Avoidance Scheme for Underwater Acoustic Sensor Networks sensors Article EDOVE: Energy an Depth Variance-Base Opportunistic Voi Avoiance Scheme for Unerwater Acoustic Sensor Networks Safar Hussain Bouk 1, *, Sye Hassan Ahme 2, Kyung-Joon Park 1 an Yongsoon Eun

More information

MORA: a Movement-Based Routing Algorithm for Vehicle Ad Hoc Networks

MORA: a Movement-Based Routing Algorithm for Vehicle Ad Hoc Networks : a Movement-Base Routing Algorithm for Vehicle A Hoc Networks Fabrizio Granelli, Senior Member, Giulia Boato, Member, an Dzmitry Kliazovich, Stuent Member Abstract Recent interest in car-to-car communications

More information

MODULE VII. Emerging Technologies

MODULE VII. Emerging Technologies MODULE VII Emerging Technologies Computer Networks an Internets -- Moule 7 1 Spring, 2014 Copyright 2014. All rights reserve. Topics Software Define Networking The Internet Of Things Other trens in networking

More information

Improving Performance of Sparse Matrix-Vector Multiplication

Improving Performance of Sparse Matrix-Vector Multiplication Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science an Center of Simulation of Avance Rockets University of Illinois at Urbana-Champaign

More information

Creating a Scalable Microprocessor:

Creating a Scalable Microprocessor: Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B.

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots Aaptive Loa Balancing base on IP Fast Reroute to Avoi Congestion Hot-spots Masaki Hara an Takuya Yoshihiro Faculty of Systems Engineering, Wakayama University 930 Sakaeani, Wakayama, 640-8510, Japan Email:

More information

Socially-optimal ISP-aware P2P Content Distribution via a Primal-Dual Approach

Socially-optimal ISP-aware P2P Content Distribution via a Primal-Dual Approach Socially-optimal ISP-aware P2P Content Distribution via a Primal-Dual Approach Jian Zhao, Chuan Wu The University of Hong Kong {jzhao,cwu}@cs.hku.hk Abstract Peer-to-peer (P2P) technology is popularly

More information

Offloading Cellular Traffic through Opportunistic Communications: Analysis and Optimization

Vincenzo Sciancalepore, Domenico Giustiniano, Albert Banchs, Andreea Picu. arXiv:1405.3548v1 [cs.NI] 14 May 2014

Disjoint Multipath Routing in Dual Homing Networks using Colored Trees

Preetha Thulasiraman, Srinivasan Ramasubramanian, and Marwan Krunz, Department of Electrical and Computer Engineering, University of Arizona,

Generalized Edge Coloring for Channel Assignment in Wireless Networks

TR-IIS-05-021. Chun-Chen Hsu, Pangfeng Liu, Da-Wei Wang, Jan-Jan Wu. December 2005. Technical Report No. TR-IIS-05-021, http://www.iis.sinica.edu.tw/lib/techreport/tr2005/tr05.html

Demystifying Automata Processing: GPUs, FPGAs or Micron's AP?

Marziyeh Nourian, Xiang Wang, Xiaodong Yu, Wu-chun Feng, Michela Becchi. Department of Electrical and Computer Engineering,

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks

Hindawi Publishing Corporation, International Journal of Distributed Sensor Networks, Volume 2014, Article ID 936379, 17 pages. http://dx.doi.org/10.1155/2014/936379

PAPER. 1. Introduction

IEICE TRANS. COMMUN., VOL. E9x-B, No.8, AUGUST 2010. PAPER: Integrating Overlay Protocols for Providing Autonomic Services in Mobile Ad-hoc Networks. Panagiotis Gouvas, IEICE Student member, Anastasios Zafeiropoulos,

CS269I: Incentives in Computer Science Lecture #8: Incentives in BGP Routing

Tim Roughgarden, October 19, 2016. 1 Routing in the Internet: Last lecture we talked about delay-based (or selfish) routing, which

Yet Another Parallel Hypothesis Search for Inverse Entailment

Hiroyuki Nishiyama and Hayato Ohwada, Faculty of Sci. and Tech., Tokyo University of Science, 2641 Yamazaki, Noda-shi, CHIBA, 278-8510, Japan. hiroyuki@rs.noda.tus.ac.jp,

Non-homogeneous Generalization in Privacy Preserving Data Publishing

W. K. Wong, Nikos Mamoulis and David W. Cheung, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong {wwong2,nikos,dcheung}@cs.hku.hk

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks

Jaewon Kang, John Sucec, Vikram Kaul, Sunil Samtani and Mariusz A. Fecko, Applied Research, Telcordia Technologies, One Telcordia Drive,

Transient analysis of wave propagation in 3D soil by using the scaled boundary finite element method

Southern Cross University, epublications@scu. 23rd Australasian Conference on the Mechanics of Structures and Materials 2014.

DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE. Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose

Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, and Alexander Stolyar. Abstract: Backpressure-based adaptive routing

Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine

Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, Saman Amarasinghe. M.I.T. Laboratory

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources

Anton Andreev, Iurii Bogoiavlenskii, Petrozavodsk State University, Petrozavodsk, Russia {andreev, ybgv}@cs.petrsu.ru

PART 5. Process Coordination And Synchronization

CS 503 - PART 5, 2010. Location Of Process Coordination In The Xinu Hierarchy. Coordination Of Processes: Necessary in a concurrent system

Lab work #8. Congestion control

TELECOMMUNICATION NETWORK THEORY. Degree in Telematics Engineering, Degree in Telecommunication Systems Engineering. Academic year 2015-2016. Lab work #8. Congestion control (1 session). Author: Pablo Pavón Mariño

Recitation Caches and Blocking. 4 March 2019

15-213 Recitation: Caches and Blocking, 4 March 2019. Agenda: Reminders, Revisiting Cache Lab, Caching Review, Blocking to reduce cache misses, Cache alignment. Reminders: Due Dates, Cache Lab (Thursday 3/7), Midterm Exam

Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, MANUSCRIPT ID. Yiming Zhang and Ling Liu, Senior Member,

Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity

IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997. Ted H. Szymanski, Member, IEEE

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

CHAPTER 3. Sub-topics: the topic will cover: Microprocessor architecture, CPU processing methods, Pipelining, Superscalar, RISC, Multiprocessing, Instruction Cycle, Instruction

Architecture Design of Mobile Access Coordinated Wireless Sensor Networks

Mai Abdelhakim, Leonard E. Lightfoot, Jian Ren, Tongtong Li. Department of Electrical & Computer Engineering, Michigan State University,

Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

The speed of an application is determined by more than just processor speed: disk speed, network speed... Multiprocessors typically improve the

Table-based division by small integer constants

Florent de Dinechin, Laurent-Stéphane Didier. LIP, Université de Lyon (ENS-Lyon/CNRS/INRIA/UCBL), 46, allée d'Italie, 69364 Lyon Cedex 07. Florent.de.Dinechin@ens-lyon.fr

Learning Subproblem Complexities in Distributed Branch and Bound

Lars Otten, Department of Computer Science, University of California, Irvine, lotten@ics.uci.edu. Rina Dechter, Department of Computer Science, University

Pairwise alignment using shortest path algorithms, Gunnar Klau, November 29, 2005

We will discuss: edit graph, Dijkstra's algorithm, algorithm (GDU). References

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software


Questions? Post on piazza, or Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)!

EE122 Fall 2013 HW3 Instructions. Record your answers in a file called hw3.pdf. Make sure to write your name and SID at the top of your assignment. For each problem, clearly indicate your final answer, bold and


Improving Spatial Reuse of IEEE 802.11 Based Ad Hoc Networks

Fengji Ye, Su Yi and Biplab Sikdar, ECSE Department, Rensselaer Polytechnic Institute, Troy, NY 12180. Abstract: In this paper, we evaluate and suggest methods

Parallel Computing: Parallel Architectures Jin, Hai

School of Computer Science and Technology, Huazhong University of Science and Technology. Peripherals, Computer Central Processing Unit, Main Memory, Computer

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, and Alexander Stolyar. arXiv:1005.4984v1 [cs.NI] 27 May 2010. Abstract

Optimal Distributed P2P Streaming under Node Degree Bounds

Shaoquan Zhang, Ziyu Shao, Minghua Chen, and Libin Jiang. Department of Information Engineering, The Chinese University of Hong Kong; Department of EECS,

MODULE V. Internetworking: Concepts, Addressing, Architecture, Protocols, Datagram Processing, Transport-Layer Protocols, And End-To-End Services

Computer Networks and Internets -- Module 5, Spring 2014. Copyright

Classifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means

Neil Alldrin, Department of Computer Science, University of California, San Diego, La Jolla, CA 92037. nalldrin@cs.ucsd.edu

Random Clustering for Multiple Sampling Units to Speed Up Run-time Sample Generation

DEIM Forum 2018 I4-4. Yuzuru OKAJIMA and Koichi MARUYAMA, NEC Solution Innovators, Ltd., 1-18-7 Shinkiba, Koto-ku, Tokyo,

Learning Polynomial Functions by Feature Construction

Proceedings of the Eighth International Workshop on Machine Learning, Chicago, Illinois, June 27-29 1991. Richard S. Sutton, GTE Laboratories Incorporated

Optimizing the quality of scalable video streams on P2P Networks

Paper #7. ABSTRACT: The volume of multimedia data, including video, served through Peer-to-Peer (P2P) networks is growing rapidly. Unfortunately, high

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation

James Montgomery, Carole Fayad, and Sanda Petrovic. Faculty of Information & Communication Technologies, Swinburne University

You Can Do That. Unit 16. Motivation. Computer Organization: Design of a Simple Processor. Now that you have some understanding

Cloud & Distributed Computing (CyberPhysical, Databases, Mining, etc.), Applications (AI, Robotics, Graphics, Mobile), Systems & Networking (Embedded Systems,

Study of Network Optimization Method Based on ACL

Available online at www.sciencedirect.com. Procedia Engineering 15 (2011) 3959-3963. Advance in Control Engineering and Information Science. Liu Zhian, Department

CSE 820 Graduate Computer Architecture. Lec 8: Instruction Level Parallelism

Based on slides by David Patterson. Review from Last Time #1: Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

A shortest path algorithm in multimodal networks: a case study with time varying costs

Daniela Ambrosino, Anna Sciomachen. Department of Economics and Quantitative Methods (DIEM), University of Genoa, Via

A Metric for Routing in Delay-Sensitive Wireless Sensor Networks

Zhen Jiang, Jie Wu, Risa Ito. Dept. of Computer Sci., West Chester University; Dept. of Computer & Info. Sciences, Temple

CS 426 Parallel Computing. Parallel Computing Platforms

Ozcan Ozturk, http://www.cs.bilkent.edu.tr/~ozturk/cs426/. Slides are adapted from ``Introduction to Parallel Computing''. Topic Overview: Implicit Parallelism:

Memory Bank Disambiguation using Modulo Unrolling for Raw Machines

Rajeev Barua, Walter Lee, Saman Amarasinghe, Anant Agarwal. M.I.T. Laboratory for Computer Science, Cambridge, MA 02139, U.S.A. {barua,walt,saman,agarwal}@lcs.mit.edu

Professor Lee, Yong Surk. References: Overview of High-Performance Microprocessor Architecture. Topics: Microprocessor & microcontroller

This course was produced with the support of C & S Technology; it carries no copyright, so anyone may copy and distribute it for non-commercial purposes. The lab homepage hosts many lectures on high-performance microprocessors, free for anyone to download. Professor Lee, Yong Surk. 1973: B.S., Electrical Eng., Yonsei Univ. 1981

Preamble. Singly linked lists. Collaboration policy and academic integrity. Getting help

CS2110 Spring 2016, Assignment A: Linked Lists. Due on the CMS by: See the CMS. 1 Preamble: This assignment begins our discussions of structures. In this assignment, you will implement a structure

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism

Based on slides by David Patterson. Review from Last Time #1: Leverage Implicit Parallelism for Performance: Instruction Level

Bends, Jogs, And Wiggles for Railroad Tracks and Vehicle Guide Ways

Louis T. Klauer Jr., PhD, PE. Work Soft, 833 Galer Dr., Newtown Square, PA 19073. lklauer@wsof.com. Preprint, June 4, 2002. Copyright 2002 by Louis

MIT Laboratory for Computer Science

The Raw Processor: A Scalable 32 bit Fabric for General Purpose and Embedded Computing. Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson, Walter Lee, Albert Ma, Nathan Shnidman,

Topics. Computer Networks and Internets -- Module 5, Spring. Copyright. All rights reserved.

Internet concept and architecture; Internet addressing; Internet Protocol packets (datagrams); Datagram forwarding; Address resolution; Error reporting mechanism; Configuration; Network address translation. Computer

NAND flash memory is widely used as a storage

Buffer-Aware Garbage Collection for Flash-Based Storage Systems. Sungjin Lee, Dongkun Shin, Member, IEEE, and Jihong Kim, Member, IEEE. Abstract: NAND flash-based storage device is becoming a viable storage

A Measurement Framework for Pin-Pointing Routing Changes

Renata Teixeira, Univ. Calif. San Diego, La Jolla, CA, teixeira@cs.ucsd.edu; Jennifer Rexford, AT&T Labs Research, Florham Park, NJ, jrex@research.att.com

CMSC 430 Introduction to Compilers. Spring 2016. Register Allocation

Introduction: Change code that uses an unbounded set of virtual registers to code that uses a finite set of actual regs. For bytecode targets, can

THE BAYESIAN RECEIVER OPERATING CHARACTERISTIC CURVE: AN EFFECTIVE APPROACH TO EVALUATE THE IDS PERFORMANCE

BFU International Conference. Evgeniya Nikolova, Veselina Jecheva, Burgas Free University. Abstract:


Scalable Deterministic Scheduling for WDM Slot Switching Xhaul with Zero-Jitter

Regular papers, ONDM 2018. Bogdan Uscumlic, Dominique Chiaroni, Brice Leclerc, Thierry Zami, Annie

Master of Engineering Preliminary Thesis Proposal For 6.191: Prototyping Research Results. December 5, 2002

Cemal Akcaba, Massachusetts Institute of Technology, Cambridge, MA 02139. Thesis Advisor: Prof. Agarwal

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

Motivation behind Multiprocessors: Limitations of ILP (as already discussed); Growing interest in servers and server-performance

Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations

2012 Third International Conference on Networking and Computing. Ryohei Kobayashi, Tokyo Institute of Technology, Japan. E-mail: kobayashi@arch.cs.titech.ac.jp

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park, Dept. of EE, KAIST

Instruction-level parallelism; Loop unrolling; Dependence: data / name / control dependence; Loop level parallelism


A Classification of 3R Orthogonal Manipulators by the Topology of their Workspace

Maher Baili, Philippe Wenger and Damien Chablat, Institut de Recherche en Communications et Cybernétique de Nantes, UMR C.N.R.S.

A Streaming Multi-Threaded Model

Extended Abstract. Eylon Caspi, André DeHon, John Wawrzynek. September 30, 2001. Summary: We present SCORE, a multi-threaded model that relies on streams to expose thread