Baring it all to Software: The Raw Machine


Elliot Waingold, Michael Taylor, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Srikrishna Devabhaktuni, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, Anant Agarwal

MIT Laboratory for Computer Science, Cambridge, MA 02139

Abstract

Rapid advances in technology force a quest for computer architectures that exploit new opportunities and shed existing mechanisms that do not scale. Current architectures, such as hardware-scheduled superscalars, are already hitting performance and complexity limits and cannot be scaled indefinitely. The Reconfigurable Architecture Workstation (Raw) is a simple, wire-efficient architecture that scales with increasing VLSI gate densities and attempts to provide performance that is at least comparable to that provided by scaling an existing architecture, but that can achieve orders of magnitude more performance for applications in which the compiler can discover and statically schedule fine-grain parallelism. The Raw microprocessor chip comprises a set of replicated tiles, each tile containing a simple RISC-like processor, a small amount of configurable logic, and a portion of memory for instructions and data. Each tile has an associated programmable switch which connects the tiles in a wide-channel point-to-point interconnect. The compiler statically schedules multiple streams of computations, with one program counter per tile. The interconnect provides register-to-register communication with very low latency and can also be statically scheduled. The compiler is thus able to schedule instruction-level parallelism across the tiles and exploit the large number of registers and memory ports. Of course, Raw provides backup dynamic support in the form of flow control for situations in which the compiler cannot determine a precise static schedule.
The Raw architecture can be viewed as replacing the bus architecture of superscalar processors with a switched interconnect, and as accomplishing at compile time operations such as register renaming and instruction scheduling. This paper makes a case for the Raw architecture and provides early results on the plausibility of static compilation for several small benchmarks. We have implemented a prototype Raw processor (called RawLogic) and an associated compilation system by leveraging commercial FPGA-based logic emulation technology. RawLogic supports many of the features of the Raw architecture and demonstrates that we can write compilers to statically orchestrate all the communication and computation in multiple threads, and that performance levels of 10x-100x over workstations are achievable for many applications.

Also appears as MIT Laboratory for Computer Science TR-709, March 1997.

Figure 1: RawµP composition. Each RawµP comprises multiple tiles; each tile contains IMEM, PC, REGS, DMEM, ALU, and CL on the processor side, and PC and SMEM on the switch side.

1 Introduction

Rapidly evolving technology places a billion transistors on a chip within reach of the computer architect. Although several approaches to utilizing the large amount of silicon resources have been put forth, they share the single important goal of obtaining the most performance out of approximately one square inch of silicon. All approaches exploit more parallelism in one or more instruction streams, and include more aggressive superscalar designs, simultaneous multithreading, multiprocessors on a chip, VLIWs, and integrating a processor with huge amounts of DRAM memory. The Raw project's approach to achieving this goal is to implement a simple, highly parallel VLSI architecture, and to fully expose the low-level details of the hardware architecture to the compiler so that the compiler or the software can determine, and implement, the best allocation of resources for each application. As displayed in Figure 1, the Raw microprocessor (RawµP) is composed of a set of interconnected tiles, each tile comprising instruction and data memory, an ALU, registers, some configurable logic, and a programmable switch. Wide channels couple near-neighbor tiles. The hardware architecture is exposed fully to the compiler and the software system, and the switch is programmable via a sequence of instructions, also determined by the compiler. The Raw design uses both distributed register files and distributed memory. Communication between tiles is slightly less expensive than memory access on the same tile, slightly more expensive than register reads, and occurs over high-bandwidth point-to-point channels. Figure 2 shows a multiprocessor system environment for the RawµPs, where each multiprocessor node is constructed out of a RawµP, a dynamic router, and some high-bandwidth RDRAM.
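As a rough illustration of the tile composition described above, the sketch below models a RawµP as a mesh of tiles, each with its own program counter, register file, instruction and data memory, and switch instruction memory. All sizes and field names here are illustrative assumptions for the sketch, not parameters taken from the Raw design.

```python
# Hypothetical sketch of the RawuP tile mesh. Sizes and field names
# are illustrative assumptions, not Raw's actual parameters.

class Tile:
    def __init__(self, row, col):
        self.row, self.col = row, col
        self.pc = 0                  # per-tile program counter
        self.regs = [0] * 32         # slice of the distributed register file
        self.imem = []               # per-tile instruction memory
        self.dmem = [0] * 1024       # per-tile data memory (software-managed)
        self.switch_imem = []        # switch instruction memory (static routes)

class RawChip:
    """A RawuP: an N x N mesh of tiles with near-neighbor channels."""
    def __init__(self, n):
        self.n = n
        self.tiles = [[Tile(r, c) for c in range(n)] for r in range(n)]

    def neighbors(self, r, c):
        # Wide point-to-point channels couple near-neighbor tiles only.
        cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        return [(i, j) for i, j in cand if 0 <= i < self.n and 0 <= j < self.n]

chip = RawChip(4)
```

The point of the model is structural: every resource (PC, registers, memories, switch program) is replicated per tile, with no global shared structure.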
Our approach leverages the same two features that make application-specific custom hardware systems superior in performance, for the given application, to extant general-purpose computers. First, they implement fine-grain communication between large numbers of replicated processing elements and, thereby, are able to exploit huge amounts of fine-grain parallelism in applications, when this parallelism exists. For example, custom vision chips that implement hundreds of small processing tiles interconnected via a point-to-point, near-neighbor interconnect can achieve performance levels far exceeding that of workstations in a cost-effective manner. Second, they expose the complete details of the underlying hardware architecture to the software system (be it the software CAD system, the applications software, or the compiler), so the software can carefully orchestrate the

execution of the application by applying techniques such as pipelining, synchronization elimination through careful cycle counting, and conflict elimination for shared resources by static scheduling and routing.

Figure 2: Raw multiprocessor comprising multiple RawµPs. Each node consists of a Raw microprocessor, a dynamic router, and RDRAM.

Figure 3: Raw microprocessors versus superscalar processors and multiprocessors. The Raw microprocessor distributes the register file and memory ports and communicates between ALUs on a switched, point-to-point interconnect. In contrast, a superscalar contains a single register file and memory port and communicates between ALUs on a global bus. Multiprocessors communicate at a much coarser grain through the memory subsystem.

In keeping with the general philosophy of building a simple replicated architecture and exposing the underlying implementation to the compiler, Raw has the following features.

Simple replicated tile As shown in Figure 1, the Raw machine comprises an interconnected set of tiles. Each tile contains a simple RISC-like pipeline, and is interconnected over a pipelined, point-to-point network with other tiles. Unlike current superscalar architectures, Raw does not bind specialized logic structures such as register renaming logic, dynamic instruction issue logic, or caching into hardware. Instead, it focuses on keeping each tile small, and maximizes the number of tiles it can implement on a chip, thereby increasing the amount of parallelism it can exploit, and the clock speed it can achieve. As Figure 3 displays, Raw does not use buses; rather, it uses a switched interconnect. Communication between tiles is pipelined over this interconnect, and occurs at the register level. Thus the

clock frequency is higher than in chips that use global buses. No signal in Raw travels a distance greater than a single tile width within a clock cycle. Furthermore, because the interconnect allows communication between tiles to occur at nearly the same speed as a register read, compilers can use instruction-level parallelism in addition to coarser forms of parallelism to keep the tiles busy. In addition, memory distributed across the tiles will eliminate the memory bandwidth bottleneck and provide significantly lower latencies to each memory module. Wide channels will allow significantly higher bandwidth for both I/O and on-chip communication. A large number of distributed registers will eliminate the small register name space problem, allowing the exploitation of instruction-level parallelism (ILP) to a greater degree. Significantly higher switching speeds internal to a chip, compared to off-chip communication and DRAM latencies, will allow software sequencing to replace specialized hardware functionality. Software sequencing also exposes the opportunity for customization to applications. As contrasted in Figure 3 with multiprocessors on a chip, inter-tile communication in Raw is at the register level and can be compiler orchestrated, allowing Raw to exploit fine-grain ILP to keep the tiles busy. Of course, a simple architecture also results in faster time to market, and eliminates the verification nightmare that threatens modern-day superscalars. The passage of time has not changed the benefits of simple architectures superbly articulated in the paper by Patterson and Ditzel [8] making a case for RISCs.

Programmable, integrated interconnect The switched interconnect between the tiles provides high-bandwidth, low-latency communication between neighboring tiles. Integrated interconnect allows channels with hundreds of wires instead of the traditional tens. VLSI switches are pad limited and rarely dominated by internal switch or channel area.
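The wire-length constraint above (no signal travels farther than one tile width per cycle) implies that inter-tile communication is pipelined hop by hop across the mesh. Assuming, purely for illustration, a cost of one cycle per tile hop, the latency between two tiles is simply their Manhattan distance:

```python
# Illustrative model: a pipelined near-neighbor interconnect where each
# hop costs one clock cycle (an assumption for this sketch, not a Raw spec).

def hop_latency(src, dst):
    """Cycles for a word to travel from tile src to tile dst on the mesh."""
    (r0, c0), (r1, c1) = src, dst
    return abs(r0 - r1) + abs(c0 - c1)

# Neighboring tiles communicate in a single cycle, close to a register read.
assert hop_latency((0, 0), (0, 1)) == 1
```

Under this model, latency grows with distance on the chip rather than with a fixed global-bus delay, which is what lets the clock period stay at one tile width of wire.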
The switches support dynamic routing, but are also programmable and synchronized so that a static schedule can be maintained across the entire machine should the compiler choose to do so. Static routing permits higher utilization of the available bandwidth of chip-level channel resources and eliminates dynamic synchronization costs.

Multi-sequential statically-scheduled instruction streams Each tile runs a single thread of computation with its own program counter. Switches that can be programmed with static communication schedules enable the compiler to statically schedule the computations in the threads on each tile. Static scheduling eliminates synchronization, and thus provides the cheapest form of synchronization: one that is free. Section 5 discusses how Raw software can handle cases such as dynamic routing and dependence checking, where traditional systems devote specialized hardware resources to them. Of course, Raw also provides basic flow control support on each channel so that a tile can stall if a neighboring tile violates the static schedule due to an unforeseen dynamic event. As discussed in Section 3, Raw also allows apportioning channel bandwidth between static routing and dynamic routing.

Multigranular operations The Raw architecture implements mechanisms such as wide-word ALUs and is thus coarser than traditional FPGA-based computers. In order to achieve the levels of parallelism an FPGA computer can attain, the Raw architecture is multigranular. Multigranular support includes the ability of a single tile to exploit bit- or byte-level parallelism when it exists in an application. Logic simulation applications, for example, require bit-level operations. FPGA-based emulation engines are simply special-purpose computers that can exploit bit-level parallelism. The Intel MMX design and several signal processing architectures have similar mechanisms to allow their ALUs to be used for a few wide-word operations or many narrow-word computations.
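The MMX-style multigranularity just mentioned can be illustrated in software: pack two 16-bit values into one 32-bit word and add both lanes with a single word-wide operation, suppressing the carry between lanes. This is a standard SWAR (SIMD-within-a-register) trick shown as a sketch of the idea, not Raw's actual hardware mechanism:

```python
# Sketch of a multigranular add: two 16-bit lanes packed in a 32-bit word,
# added in one word-wide operation with no carry across the lane boundary.
# (Standard SWAR technique, used here only to illustrate the concept.)

MASK32 = 0xFFFFFFFF

def add2x16(a, b):
    """Add the two packed 16-bit lanes of a and b independently."""
    low15 = 0x7FFF7FFF                 # every bit except each lane's MSB
    s = (a & low15) + (b & low15)      # partial sums cannot cross lane MSBs
    msb = (a ^ b) & 0x80008000         # lane MSBs added without carry-out
    return (s ^ msb) & MASK32
```

The low lane can wrap around without corrupting the high lane, which is exactly the behavior a hardware multigranular ALU provides in one cycle.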

Multigranular support in Raw is achieved through the use of a small amount of configurable logic in each tile, which can be customized to the right datapath width.

Configurability Configurability allows the customization of the hardware to support special operations for specific application demands. Each Raw tile includes a small amount of FPGA-like configurable logic that can be customized at bit-level granularity to perform special operations. The key with configurability is that bit- or byte-level parallelism can be combined with special communication between each of the individual bits or bytes, and it is thus significantly more powerful than MMX-like multigranularity. One way to view configurable logic is as a mechanism that gives the compiler a means to create customized single-cycle or multicycle instructions without resorting to longer software sequences.

1.1 Comparison with other Architectures

Raw builds on several previous architectures. iWarp [3] and NuMesh [4] both share with Raw the philosophy of building point-to-point networks that support static scheduling. However, the cost of initiating a message in both systems was so high that compilers could exploit only coarse-grain parallelism, or had to focus on very regular signal processing applications so the startup cost could be amortized over long messages. Note that even though iWarp supported very low-latency processor access to the message queues, it required an initial setup of a pathway before word-level data could be sent. Due to the difficulty of statically predicting events over long periods of time, static scheduling in a coarse-grain environment is very difficult. The Raw architecture improves on NuMesh and iWarp in several ways. The register-like latencies of communication in Raw allow it to exploit the same kinds of instruction-level parallelism that superscalars can exploit.
Predicting latencies over small instruction sequences, followed by static scheduling of multiple tightly coupled instruction streams, is much simpler. Just like the processor, the switches in Raw are programmable via an instruction stream produced by the compiler, and are thus significantly more flexible than either iWarp or the multiple finite-state-machine architecture of the NuMesh switches.

Like an FPGA-based computer, the Raw architecture achieves the speeds of special-purpose machines by exploiting fine-grain parallelism and fast static communication, and by exposing the low-level hardware details to facilitate compiler orchestration. Compilation in Raw is fast, unlike in FPGA computers, because it binds into hardware commonly used compute mechanisms such as ALUs, registers, word-wide data channels and memory paths, and communication and switching channels, thereby eliminating repeated low-level compilations of these macro units. The onerous compile times of our multi-FPGA-based Raw prototype computer (see Section 6) amply demonstrate this compile-speed problem. Binding of common mechanisms into hardware also yields significantly better execution speed, lower area, and better power efficiency than FPGA systems.

Many features in Raw are inspired by VLIW work. Like the VLIW Multiflow machine [5], Raw has a large register name space and a distributed register file. Like VLIWs, Raw also provides multiple memory ports, and relies heavily on compiler technology to discover parallelism and statically schedule computations. Unlike traditional VLIWs, however, Raw uses multiple instruction streams. Individual instruction streams give Raw significantly more flexibility to perform independent but statically scheduled computations in different tiles, such as loops with different bounds.

It is natural to compare this work with the natural evolution of multiprocessors, viz., integrating on a chip a multiprocessor that uses a simple RISC core [7].
Like Raw, such a multiprocessor would use a simple replicated tile and provide distributed memory. However, the cost of message startup and synchronization would hamper its ability to exploit fine-grain instruction-level parallelism. A

compiler can exploit Raw's static interconnect to schedule ILP across multiple tiles without the need for expensive dynamic synchronization. Raw architectures also promise better balance than the processor-in-memory or IRAM architectures. Current proposals for the latter architectures combine within a DRAM chip a single processor, and as such suffer from the same problems as extant superscalar processors, although to a lesser extent. As Dick Sites notes [2], extant superscalars spend three out of four CPU cycles waiting for memory for many applications. Long memory latencies are one of the causes of this and are not significantly reduced in the IRAM approach. The Raw approach directly addresses the memory latency problem by implementing many smaller memories, so there are multiple independent ports to memory, and each memory has lower latency. The parallelism allowed by multiple independent memory ports lets the compiler issue several memory operations simultaneously and helps mask the latency. Furthermore, if IRAM architectures hold the physical area of memory constant and increase the amount of memory, the latency problem will only be exacerbated. Processing and switching speeds will increase significantly in future-generation chips, making delays due to the long memory bitlines and wordlines relatively much worse. The Raw approach can be viewed as an IRAM approach in which multiple processing units are sprinkled with distributed memory, where each independent memory module is sized so its access speed scales with the processor clock.

1.2 Roadmap

The rest of the paper is arranged as follows. The next section discusses a simple example to demonstrate the benefits that derive from the Raw architecture. The following section describes the Raw architecture in more detail. Section 3 describes a hardware implementation of the Raw architecture.
Section 4 discusses the Raw compiler, and Section 5 discusses how data-dependent events such as dynamic routing and dependence checking can be supported within a statically scheduled framework. In Section 6 we give results from an early Raw prototype which show that Raw can achieve massive instruction-level parallelism with extremely low latency. Section 7 concludes the paper.

2 An Example

This section uses a small Jacobi program fragment to discuss how the features of Raw yield performance improvements over a traditional multiprocessor system. Our discussion focuses on the fine-grain parallelism exploited by a single-chip RawµP. Of course, as in extant multiprocessors, a Raw multiprocessor can exploit coarse-grain parallelism using several RawµP chips, a dynamic interconnect and external DRAM, by partitioning the program into multiple coarse-grain threads. The goal of the Raw microprocessor is to extract a further degree of fine-grain parallelism out of each of these threads. Let us start with a traditional fine-grain implementation of a Jacobi relaxation step in which each processor "owns" a single value, receives a single value from each of four neighbor processors, computes an average value and updates its own value with this average, and transmits its new value to its neighbors. Figure 4 shows successive versions of this Jacobi code, each one yielding progressive improvements in runtime as we integrate various features of Raw. In the first version, most of the time for such a fine-grain step in a traditional multiprocessor implementation would go into communication latency and synchronization delays between the nodes. The Raw architecture significantly reduces communication and memory access delays through its tightly integrated register-level inter-tile communication. Explicit synchronization is also eliminated because the multiple instruction streams are statically scheduled.

Original Code

    ;; MPP VERSION
    loop:
        send(msg, NDIR)         ; send value to the north
        receive(s, SDIR)        ; wait for value from the south
        ...                     ; COMMUNICATION LATENCY
        send(msg, SDIR)         ; send value to the south
        receive(n, NDIR)        ; wait for value from the north
        ...                     ; COMMUNICATION LATENCY
        send(msg, EDIR)         ; send value to the east
        receive(w, WDIR)        ; wait for value from the west
        ...                     ; COMMUNICATION LATENCY
        send(msg, WDIR)         ; send value to the west
        receive(e, EDIR)        ; wait for value from the east
        ...                     ; COMMUNICATION LATENCY
        ;; Move data into registers
        nreg <- [n]             ; n = &msg[x][y-1]
        sreg <- [s]             ; s = &msg[x][y+1]
        ereg <- [e]             ; e = &msg[x+1][y]
        wreg <- [w]             ; w = &msg[x-1][y]
        ;; Do computation
        t0 <- nreg + sreg
        t1 <- ereg + wreg
        t2 <- t0 + t1
        r  <- t2 >> 2           ; r = (((n+s)+(e+w))/4)
        ;; Move result back to memory
        [msg] <- r              ; *to = r
        barrier()
        ...                     ; SYNCHRONIZATION LATENCY
        iter <- iter + 1        ; while (++iter < max_iter)
        t4 <- (iter < max_iter)
        if t4 goto loop

    Second Version
    ;; TIGHTLY COUPLED COMMUNICATION
    ;; and MULTI-SEQUENTIAL STATIC SCHEDULING
    loop:
        send(r, NDIR)
        receive(sreg, SDIR), send(r, SDIR)
        receive(nreg, NDIR), send(r, EDIR)
        receive(wreg, WDIR), send(r, WDIR)
        receive(ereg, EDIR)
        ;; Do computation
        t0 <- nreg + sreg
        t1 <- ereg + wreg
        t2 <- t0 + t1
        r  <- t2 >> 2           ; r = (((n+s)+(e+w))/4)
        ;; NO BARRIER NEEDED, ALL NODES ARE IN LOCKSTEP
        iter <- iter + 1        ; while (++iter < max_iter)
        t4 <- (iter < max_iter)
        if t4 goto loop

    Third Version
    ;; CONFIGURABLE
    ;; (APPLICATION-SPECIFIC OPERATION ((a+b+c+d)/4))
    loop:
        send(r, NDIR)           ; send value to the north
        receive(sreg, SDIR), send(r, SDIR)
        receive(nreg, NDIR), send(r, EDIR)
        receive(wreg, WDIR), send(r, WDIR)
        receive(ereg, EDIR)     ; wait for value from the east
        ;; Do computation
        r <- (((n+s)+(e+w))>>2) ; APPLICATION-SPECIFIC OP
        iter <- iter + 1        ; while (++iter < max_iter)
        t4 <- (iter < max_iter)
        if t4 goto loop

Figure 4: Several variations on the Jacobi kernel.

These two benefits are reflected in the second Jacobi code sequence. The third benefit Raw conveys is support for single-cycle multigranular operations. If we assume that our Jacobi algorithm demanded 16-bit integer precision, we could possibly pack each register with two of the Jacobi values. We could then perform the 16-bit multigranular operation (addition, in this case) which, in one cycle, would perform two separate additions, one on the top halves of the registers, and the other on the bottom halves. The fourth benefit arises from Raw's configurable logic. A custom instruction can be instantiated for the application, as in the final Jacobi code sequence.

3 Raw Hardware

As discussed in Section 1, the approach used by the Raw architecture is to expose all underlying hardware mechanisms and move scheduling decisions into the compiler. This section describes a Raw hardware architecture which directly exposes a set of mechanisms, including ALUs, instruction sequencing hardware, communication interconnect switches, physical pipeline registers, and network flow control bits, to allow high-quality static scheduling. Additional hardware support is provided for handling dynamic events.

Physical Organization Globally, the Raw chip is organized as a mesh of processing tiles. The major datapaths of each tile are shown in Figure 5.

Figure 5: Major datapaths of each individual tile (processor side: instruction memory, data memory, register file, control logic, ALU, configurable logic; switch side: switch instruction memory, switch control logic).

The tile contains all the mechanisms found in a typical pipelined RISC processor. These include a register file, ALU, SRAM for software-managed instruction and data caches, and pipeline registers. In addition, each tile also includes some configurable logic to support bit- and byte-level parallel operations, and a programmable switch to communicate with neighboring tiles. The switch supports both dynamic routing and compiler-orchestrated static routing.
The switch is integrated directly into the processor pipeline to support

single-cycle message injection and receive operations.

Figure 6: Structure of the Raw switch. The switch multiplexes two different networks, one static and one dynamic, across the same physical wires. The physical channels also provide wires for flow control.

Figure 7: Logical information in the communication timetable implemented by the switch program. Each output channel is statically assigned an input (for example, a route from the north input to the west output channel). When no static assignment is pre-scheduled, time slots are available to the dynamic router.

Sequencing Each tile includes two sets of control logic and instruction memories. The first of these controls the operation of the processing elements. The second is dedicated to sequencing routing instructions for the switch. Separating the control of the processor and the switch allows the processor to take arbitrary data-dependent branches without disturbing the routing of independent messages that are passing through the switch. The switch schedule can be changed by loading a different program into the switch instruction memory.

Communication The switch structure is shown in more detail in Figure 6. Physical wires are a critical resource, so the switch multiplexes two logically distinct networks, one static and one dynamic, over the same set of physical wires. The static network is controlled by the sequence of switch instructions (Figure 7) which describe, for each cycle, which switch inputs are connected to each switch output. Not shown are the additional control bits to support static branches in the switch instruction stream. Because the communication patterns are orchestrated at compile time, messages do not require header interpretation.
Thus there is no routing overhead, and efficient single-word data transfers are possible. Single-word messages allow tiles to take advantage of registers and function units in neighboring tiles to support distributed register files and very fine-grain ILP. Any slot in the routing timetable not allocated to static scheduling is available to the dynamic router. The dynamic router is a traditional wormhole router, making routing decisions based on the head of each message. Each word following the header simply follows the same path as the

header. The dynamic router includes the required additional lines for flow control in the case that messages become temporarily blocked due to contention or the need to wait for empty slots in the static schedule. Thus in Figure 6, when a physical channel is reserved for the static router, the dynamic router will see a stalling flow control signal. The processor/network interface is shared between the two networks. The processor uses distinct opcodes to distinguish between accesses to the two networks. The interface performs different actions depending on which network the message is injected into. For the static network, the message is injected directly into, or copied directly out of, the static switch. The internals of the dynamic network include send and receive FIFOs. The flow control bits for the dynamic network are visible to the processor, so the software can determine whether the send queue is full or the receive queue is empty. This allows the processor to utilize the dynamic network without hardware stalls and permits flexibility in the rate at which individual nodes poll for dynamic events. Although compiler orchestration ensures that the static switch stalls infrequently, the switches admit the possibility of failing to meet a static schedule because they allow interaction between statically scheduled and dynamic events. Notice that while the static and dynamic networks are logically disjoint, data dependencies between statically scheduled code and dynamic messages may exist in the processor code. Such a system must support some form of flow control even in the static network for correctness. Our design provides minimal hardware-supported flow control in the static network. This flow-control information is also exported to the software. If the processor or switch attempts to use a busy resource, it is stalled until the resource is free. A stalled processor or switch, in turn, may cause others to stall as well.
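The division of channel slots described above, where the static timetable reserves some (cycle, output) slots and everything left over falls to the dynamic router, can be sketched as a small table. The channel names and the schedule itself are illustrative assumptions, not a Raw switch program:

```python
# Toy model of the switch communication timetable (Figure 7).
# Channel names and the example schedule are illustrative assumptions.

class SwitchSchedule:
    def __init__(self):
        self.table = {}   # (cycle, output_channel) -> input_channel

    def assign(self, cycle, out_ch, in_ch):
        """Record a compile-time static route for one time slot."""
        self.table[(cycle, out_ch)] = in_ch

    def route(self, cycle, out_ch):
        """Return ('static', input) if the slot is reserved; otherwise the
        slot is free and ('dynamic', None) hands it to the wormhole router."""
        if (cycle, out_ch) in self.table:
            return ('static', self.table[(cycle, out_ch)])
        return ('dynamic', None)

sched = SwitchSchedule()
sched.assign(0, 'west', 'north')   # cycle 0: route north input -> west output
```

The flow-control rule follows directly from the lookup: a reserved slot stalls the dynamic router, and an unreserved slot never stalls the static schedule, because the static schedule simply does not use it.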
While the static router is stalled, the dynamic router is allowed to use all timeslots until the static router is able to make forward progress. We are also exploring higher-level end-to-end software flow control strategies to avoid the requirement of flow control on the static network altogether.

Memory Each Raw tile contains a small amount of data SRAM that can be locally addressed. The amount of data memory is chosen to roughly balance the areas devoted to processing and memory, and to match within a factor of two or three the memory access time and the processor clock. Memory accesses will be pipelined and scheduled by the compiler. As with the communication network, the demands of static scheduling require that dynamic events, like cache misses in a conventional processor, do not stall the processor pipeline. The data memory on each tile, therefore, is directly physically addressed. A hardware cache tag file will export a hit/miss status to the software rather than stalling the processor pipeline, so that the software can check the status if it is uncertain a datum is in the cache. Higher-level abstractions like caching, global shared memory and memory protection are implemented by the compiler [10], and dynamic events, like cache misses, are handled entirely in software.

Discussion The Raw hardware architecture provides a number of benefits over conventional superscalar designs. Because of its tiled, distributed structure, Raw can achieve large I-fetch, data memory and register file bandwidth. Since each tile contains its own set of resources, multiple memory units and registers can be accessed in parallel without resorting to building expensive and slow multi-ported register files or memories. Because individual hardware mechanisms are exposed directly to the compiler, applications with static structures can take advantage of the raw speed available from the static network.
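The software-managed caching described in the Memory paragraph above, where a hardware tag file reports hit/miss status instead of stalling the pipeline, might look like the following sketch. The class and method names are hypothetical, invented for this illustration:

```python
# Hypothetical sketch: a tag file exports hit/miss status to software
# instead of stalling the pipeline; the miss handler runs in software.

class SoftwareCache:
    def __init__(self, nlines, backing):
        self.nlines = nlines
        self.tags = [None] * nlines     # the hardware tag file
        self.data = [0] * nlines
        self.backing = backing          # simulated off-tile memory (a dict)

    def probe(self, addr):
        """The exported hit/miss status bit: no stall, just a status."""
        return self.tags[addr % self.nlines] == addr

    def load(self, addr):
        line = addr % self.nlines
        if not self.probe(addr):                 # software checks the status...
            self.data[line] = self.backing[addr] # ...and handles the miss itself
            self.tags[line] = addr
        return self.data[line]
```

The design point this illustrates is that the pipeline never blocks on a miss; the compiler can instead schedule the probe, and only run the (software) miss handler when the status bit says it must.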
Messages that can be statically structured can be injected into and removed from the network with no packaging or routing overhead, and messages move through the network at the lowest possible latency. Because the Raw design exposes all possible dynamic events to the software, dynamic and static communication

patterns can coexist, and unexpected stalls are rare. Finally, the tile organization on a single chip permits extremely wide and short message paths. Unlike traditional routers that face pin limits, the data paths in an on-chip router can be several words wide, permitting extremely high-bandwidth communication. The tiled point-to-point communication organization further permits the avoidance of long chip-crossing datapaths. Each signal need travel no more than the width of a single tile each cycle, reducing the effort required to create very high clock-rate designs.

Figure 8: Compilation phases in the Raw compiler. The phases take a source program through the front end, high-level transformations, configuration selection, logical thread partitioning, logical data partitioning, traditional optimizations, global register allocation, placement, routing/scheduling, and code generation, producing optimized target code for the Raw machine.

4 Raw Compilation System

The goal of the Raw compiler is to take a single-threaded (sequential) or multi-threaded (parallel) program written in a high-level programming language and map it onto Raw hardware. Raw's explicitly parallel model makes it possible for the compiler to find and express program parallelism directly to the hardware. This overcomes a fundamental limitation of superscalar processors: that the hardware must extract instructions for concurrent execution from a sequential instruction stream. The issue logic that dynamically resolves instruction dependencies in superscalars is a system bottleneck, limiting available concurrency [3].
The Raw machine avoids the bottleneck resulting from a single sequential instruction stream by supporting a multi-sequential architecture in which multiple sequential instruction streams execute on multiple tiles (one stream per tile). Unlike the medium-grain or coarse-grain parallelism exploited on multiprocessor systems, the compiler views the collection of N tiles in the Raw machine as a collection of functional units for exploiting instruction-level parallelism. In this section, we outline a sequence of compilation phases for the Raw machine. These phases are shown in Figure 8. While we may consider merging some of the phases in the future, this is the phase ordering that we will use in building the first version of the Raw compiler. The compilation phases are discussed in the following paragraphs, along with the rationale for this phase ordering.
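Treating the N tiles as functional units for ILP, as described above, amounts to placing ready instructions onto tiles cycle by cycle. A greedy list-scheduling sketch over a dependence DAG conveys the idea; it is a toy stand-in (ignoring communication latency between tiles) for the real Placement and Routing/Scheduling phases:

```python
# Toy greedy list scheduler: place ready instructions of an acyclic
# dependence graph onto n_tiles tiles, one instruction per tile per cycle.
# Ignores inter-tile communication latency; illustrative only.

def list_schedule(deps, n_tiles):
    """deps: {instr: set of instrs it depends on}.
    Returns {instr: (cycle, tile)}."""
    done, placed, cycle = set(), {}, 0
    while len(done) < len(deps):
        # An instruction is ready once all of its dependences have issued.
        ready = [i for i in deps if i not in done and deps[i] <= done]
        issued = ready[:n_tiles]          # at most one per tile this cycle
        for tile, instr in enumerate(issued):
            placed[instr] = (cycle, tile)
        done.update(issued)
        cycle += 1
    return placed
```

With two tiles, independent instructions issue in the same cycle on different tiles, which is precisely the ILP-across-tiles picture the text describes.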

Front-end

The job of the front-end is to generate intermediate code for an abstract Raw machine with unbounded resources, i.e., an unbounded number of virtual tiles and virtual registers per tile. The generated code reflects the structure of the source program, and may be sequential with a single address space, or multithreaded with a shared address space or a distributed address space, depending on whether the source program is sequential or parallel. Regardless of the form of the source program, the problems of co-locating instructions and data within a tile (Partitioning) and of mapping instructions and data to specific tiles (Placement) are handled by later phases in the compiler and not by the front-end. Data references are kept at a high level and are classified as private or shared. Private data is only accessible by the virtual tile in which it is allocated. Shared data is accessible by all virtual tiles. Data references are also classified by the compiler as being predictable ("static") or unpredictable ("dynamic") using type information and program analysis information. A predictable data reference is one for which the compiler will be able to predict the destination tile, even though the base address may be unknown. This issue is discussed in detail in the Logical Data Partitioning phase. The conservative assumption for an unanalyzable data reference is to classify it as unpredictable. For a traditional sequential program with no explicit parallelism or data distribution, the intermediate code generated by the front-end for the abstract Raw machine will look very similar to intermediate code generated in compilers for traditional uniprocessors. The restructuring into multi-sequential code for a physical Raw machine is performed by later phases of the Raw compiler.

High-Level Transformations

This phase performs high-level program transformations to expose greater levels of instruction-level parallelism and register locality.
The transformations include loop unrolling, procedure inlining, software pipelining, code duplication, scalar replacement, speculative execution, and iteration-reordering loop transformations such as loop interchange, loop tiling, and unimodular loop transformations [4]. As in the high-level optimizers in today's compilers for high-performance processors, this phase includes a deep intraprocedural and interprocedural analysis of the program's control flow and data flow. Alias analysis is especially important for reducing the number of unpredictable (dynamic) memory references on the Raw machine. Branch instructions and array data references in the intermediate code are lowered to machine-level instructions after this phase, but they still use symbolic labels as address arguments. High-level transformation is the first optimization phase performed after the front-end, so that the remaining compiler phases can operate on the lower-level intermediate code.

Configuration Selection

This phase selects a configuration to be loaded into the configurable logic in the Raw machine for the given input program. For the sake of simplicity, the first version of the Raw compiler assumes that each tile is loaded with the same configuration and only considers implementing combinational logic as custom instructions. Future versions of the compiler may lift these restrictions. The compiler will be capable of performing this phase automatically but will also give the programmer the option of specifying custom instructions. As shown in Figure 9, for each custom operation the configuration phase must both output a specification for the configurable logic and rewrite the intermediate code so that each subtree (compound operation) is replaced by a call to the appropriate custom instruction. The compiler will invoke a logic synthesis tool to translate a custom operation specification into the appropriate bit sequence for the configurable logic in the Raw machine.
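The rewriting half of this step can be sketched as follows — the tuple-based IR, the shift-and-add pattern, and the `custom_shladd` name are invented for illustration, not taken from the Raw system:

```python
# Hypothetical sketch: replace each occurrence of a compound operation,
# here (x << 2) + y, with a call to a single custom instruction that
# the configurable logic would implement in one cycle.

def rewrite(expr):
    """Recursively replace ('add', ('shl', x, 2), y) subtrees with
    ('custom_shladd', x, y) in a tuple-encoded expression tree."""
    if not isinstance(expr, tuple):
        return expr                       # leaf: variable or constant
    op, *args = expr
    args = [rewrite(a) for a in args]     # rewrite children first
    if (op == 'add' and isinstance(args[0], tuple)
            and args[0][0] == 'shl' and args[0][2] == 2):
        return ('custom_shladd', args[0][1], args[1])
    return (op, *args)
```

A production version would select which patterns to match by frequency, as the tree pattern-matching discussion below the figure suggests; this sketch hard-codes one pattern to show the shape of the transformation.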
Our current Raw prototype system, described in Section 6, requires that these special operators be identified by the programmer and specified in behavioral Verilog. The compiler will extend the dynamic programming algorithms used in tree pattern-matching

systems [] to identify commonly occurring subtree patterns that might be good candidates for implementation as custom instructions. To support multigranular operation the compiler will search for important special cases, such as bitwise and multiple short-word (e.g., 8-bit) arithmetic operations that can be packed together into a single custom instruction.

[Figure 9: Compiler interaction with the architecture — the compiler maps the input program to processor code, switch code, and a configurable logic description for each tile.]

Partitioning into Logical Threads

The Partitioning phase generates parallel code for an idealized Raw machine with the same number of tiles as the physical architecture, but assuming an idealized fully-connected switch so that tile placement is not an issue. This idealized Raw architecture also has an unbounded number of virtual registers per tile and supports symbolic data references like the abstract Raw machine. The structure of the code after partitioning is similar to that of a single-program-multiple-data (SPMD) program. A single program is assumed for all threads, along with a "my id" primitive that is used to guide control flow on different threads. Along with traditional multiprocessor and vector loop parallelization techniques [9], this phase identifies and partitions for fine-grain instruction-level parallelism. The partitioner must balance the benefits of parallelism against the overheads of communication and synchronization. Because these overheads are much lower in the Raw microprocessor than in multiprocessors, the partitioning can occur at a much finer grain than in conventional parallelizing compilers. Note that the Raw compiler performs logical thread partitioning before logical data partitioning, which is counter to the approach used by compilers for data-parallel languages such as High Performance Fortran.
This phase ordering is motivated by a desire to target a much wider class of applications with instruction-level parallelism versus static data-parallelism. In the future, we will explore the possibility of combining thread partitioning and data partitioning into a single phase.

Logical Data Partitioning

This phase determines the co-location of data and computation by binding symbolic variables to logical threads. In addition, symbolic names are translated to lower-level offsets and locations in the local address spaces of the logical threads. The storage allocation is selected to enhance data locality as much as possible, and may include replication of frequently-used variables/registers. For array variables that have no storage aliasing, this phase can also include selection of data layouts (e.g., row-major or column-major) and of data distributions (e.g., block, cyclic, etc., as in HPF). This phase assumes that each tile's address space is divided into a private section and a shared section. An access to the private section is always local. The compiler implements a shared memory abstraction on top of the distributed physical memory using techniques similar to those described in [10]. As far as possible, we would like the Raw compiler to statically predict the destination tile for a memory access to the shared section, so that even an access to a remote tile can be statically routed.

We observe that the offset for many memory operations is often known at compile time (e.g., the stack frame offset for a local variable, or the field/element offset in a heap-allocated object) even though the base address may be unknown. Therefore, by using low-order interleaving and N·B alignment we can calculate the destination tile for each memory request.

1. Low-order interleaving. We assume that the address space is divided into blocks of B = 2^b bytes (typical values of B are 8, 16, 32, 64), and that the addressing of blocks is low-order-interleaved across tiles. This means that the bit layout of a data address in Raw is as follows, where N = 2^n is the number of tiles:

   +----------------------------+--------------+--------------+
   | Block address in node/tile | Node/tile id | Block offset |
   +----------------------------+--------------+--------------+
                                 <---- n ----->  <---- b ---->

   This address format makes it easy for the compiler to extract the tile id from a low-order address offset.

2. N·B alignment. We assume the base address of commonly used program entities (stack frames, static arrays, heap-allocated structures, etc.) is aligned with an N·B boundary (i.e., has zeroes in the lowest n + b bits). This enables the compiler to statically predict the destination tile for a known offset, even though the actual base address is unknown. For base pointers, such as function parameters, that may not be known at compile time to be N·B aligned, we suggest generating dual-path (multi-version) code with a runtime test that checks whether all unknown base addresses are N·B aligned. The code for the case when the test returns true will contain only predictable references (this should be the common case). The code for the false case will contain unpredictable references and thus be less efficient, but will serve as correct fallback code for executing with unaligned arguments.

Traditional Optimizations

This phase performs traditional back-end optimizations such as constant propagation, common-subexpression elimination, loop-invariant code motion, strength reduction, and elimination of dead code and unreachable code.
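The destination-tile calculation under low-order interleaving can be sketched as follows — a toy model with invented parameters (N = 4 tiles, B = 16-byte blocks); the real machine's sizes may differ:

```python
# Sketch: with N = 2**n tiles and B = 2**b byte blocks, the tile id
# occupies bits b .. b+n-1 of an address. If a base address is
# N*B-aligned (zeroes in the lowest n+b bits), the tile of base+offset
# depends only on the offset, so the compiler can resolve the
# destination tile even when the base itself is unknown.

N_BITS, B_BITS = 2, 4            # n = 2 (N = 4 tiles), b = 4 (B = 16 bytes)

def tile_of(addr):
    """Extract the node/tile id field from an address."""
    return (addr >> B_BITS) & ((1 << N_BITS) - 1)

def static_tile(offset):
    """Destination tile of base+offset for any N*B-aligned base."""
    return tile_of(offset)

base = 0x1000                    # divisible by N*B = 64, i.e. aligned
assert all(tile_of(base + off) == static_tile(off) for off in range(256))
```

The closing assertion is exactly the property the compiler relies on: for aligned bases, the offset alone determines the tile, so a "predictable" reference can be statically routed.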
The output of this phase is optimized intermediate code with operations at the level of the target instruction set. Traditional optimizations are performed after Logical Data Partitioning so that all address computations in the program are exposed for optimization, but before Global Register Allocation so that register allocation can be performed on optimized code.

Global Register Allocation

This phase assigns the unbounded number of virtual/symbolic registers assumed in previous phases to the fixed N·R registers in the target machine, where R is the number of registers in a single tile. Our motivation in performing Global Register Allocation after Partitioning is to give priority to instruction-level parallelism over register locality. Had we performed Register Allocation earlier, we would be in danger of reducing instruction-level parallelism by adding false dependences on physical registers before the Partitioning phase. The register allocation phase is guided by hardware costs for accessing local/nonlocal registers and local/nonlocal memory locations. Our initial assumption is that the costs will be in increasing order for the following:

1. Access to a local register

2. Access to a nonlocal register (in a neighboring tile)
3. Statically predictable access to a local memory location
4. Statically predictable access to a nonlocal memory location
5. Unpredictable (dynamic) access to a memory location

[Figure 10: Input program communication graph, the Raw microprocessor topology, and the Raw communication channel graph showing the initial values of the available channel capacities.]

Placement

The only remaining idealization in the machine model at this stage of compilation is the hypothetical crossbar interconnect among threads. The Placement phase removes this idealization by selecting a 1-1 mapping from threads to physical tiles based on the inter-thread communication patterns and the physical structure of the static interconnect among tiles in the Raw machine. Because the earlier data partitioning phase co-located data variables and arrays with the threads, the thread placement phase results in physical data placement as well. The placement algorithm attempts to minimize a latency and bandwidth cost measure and is a variant of a VLSI cell placement algorithm. This algorithm is used in our current multi-FPGA compiler system [2]. The placement algorithm attempts to lay close together threads that communicate intensely or threads that are on timing-critical paths. As shown in Figure 10, the Placement algorithm uses a dependence graph for the program to determine critical paths and the volume of communication between threads. Each edge is annotated both with the length of the longest critical path on which it lies (shown in the figure) and with the volume of communication in words it represents (not shown). The placement algorithm also uses a physical channel graph whose edges are annotated with the physical bandwidth they represent. The channel graph captures the topology and the available bandwidth of the Raw microprocessor.
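The kind of latency-and-bandwidth cost measure being minimized can be sketched as follows — the 2D mesh coordinates, uniform bandwidth, and example graphs are invented for illustration, and the latency-related term is omitted:

```python
# Sketch: placement cost = sum over communicating thread pairs of
# (communication volume x hop count / available bandwidth).
# An annealer would perturb `placement` and keep low-cost states.

def hops(t1, t2):
    """Manhattan distance between tiles given as (x, y) mesh coordinates."""
    return abs(t1[0] - t2[0]) + abs(t1[1] - t2[1])

def placement_cost(comm, placement, bandwidth=1.0):
    """comm: {(thread_a, thread_b): words communicated};
    placement: {thread: (x, y) tile}."""
    return sum(vol * hops(placement[a], placement[b]) / bandwidth
               for (a, b), vol in comm.items())

comm = {('t0', 't1'): 100, ('t0', 't3'): 10}
near = {'t0': (0, 0), 't1': (0, 1), 't3': (1, 1)}   # heavy pair adjacent
far  = {'t0': (0, 0), 't1': (1, 1), 't3': (0, 1)}   # heavy pair 2 hops apart
assert placement_cost(comm, near) < placement_cost(comm, far)
```

The assertion captures the objective's intent: placements that put heavily communicating threads on adjacent tiles score lower.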
The goal of placement is to determine an optimal mapping of the input program graph to the channel graph. Our prototype system uses simulated annealing to optimize a placement cost function. A suitable cost function sums, over all pairs of threads, the product of the communication volume between a pair of threads and the number of hops between the pair of tiles on which the individual threads are placed, divided by the available bandwidth between the tiles. To account for latency, the cost function also folds in a latency-related weight.

Routing and Global Scheduling

This phase inserts routing instructions for nonlocal data accesses, and selects the final ordering of computation and routing instructions on individual tiles

with the goal of minimizing the overall estimated completion time. This phase will automatically give priority to routing/compute instructions on a critical path, without requiring special-case treatment for either class of instructions. Routing uses the same input program graph and channel graph used by placement, and assumes that a data value can be communicated between a neighboring pair of tiles in a single cycle. For each value communicated between threads, the scheduler prescribes both a sequence of inter-tile channels and a corresponding sequence of time slots (or cycles) in which each channel will be used. The following is a brief summary of the TIERS (Topology Independent Pipelined Routing and Scheduling) routing and scheduling algorithm that we will use []. This algorithm is implemented in our emulation system, and we propose to adapt it for use in Raw. Given the channel graph and the dependence-analyzed input program communication graph, the TIERS routing approach uses a greedy algorithm to determine a time-space path for each nonlocal communication event and a schedule for the instructions in each thread. The scheduler relies on available channel capacities for each cycle, maintained in each edge of the channel graph. The greedy algorithm performs a topological sort of the communication edges in the input program graph so that edges appear in partial dependence order. The routing algorithm picks edges in order from the sorted list and routes each edge using the sequence of channels in the shortest path between the source and destination tiles. The scheduler determines the cycle on which a value can be scheduled for communication based on three quantities: the arrival times of the edges on which it depends; the amount of time required for any computation or memory accesses that produce the value; and the availability of channel capacity along the shortest path during a subsequent sequence of cycles.
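The core greedy time-space reservation can be sketched as follows — the data structures are simplified stand-ins for the channel graph (unit capacity per channel per cycle, producer time folded into the arrival cycle), not the real system's:

```python
# Sketch: route one communication edge along its precomputed shortest
# path of channels, sliding the start cycle later until every needed
# (channel, cycle) slot is free, then reserve those slots.

def route_edge(path, arrival, reserved):
    """path: list of channel names along the shortest path;
    arrival: earliest cycle the value is ready to leave;
    reserved: set of (channel, cycle) slots already taken (mutated).
    Returns the cycle on which the value reaches its destination."""
    start = arrival
    while any((ch, start + i) in reserved for i, ch in enumerate(path)):
        start += 1                       # slide the whole route later
    for i, ch in enumerate(path):        # claim one slot per hop
        reserved.add((ch, start + i))
    return start + len(path)

reserved = set()
t1 = route_edge(['c01', 'c12'], arrival=0, reserved=reserved)
t2 = route_edge(['c01'], arrival=0, reserved=reserved)  # c01 busy at cycle 0
assert (t1, t2) == (2, 2)
```

Processing edges in topologically sorted order, as the text describes, guarantees that every `arrival` is already known when an edge is routed; the second call above shows a later edge being pushed back a cycle because an earlier edge holds the channel.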
The scheduler decrements the available channel capacities for each channel along the route taken by the edge, and marks a channel unavailable during any cycle on which its capacity is exhausted. Routing of each edge is accompanied by the production of a precise schedule for computations and memory accesses within each thread.

Final Code Generation

This phase emits the final code in the appropriate object code format for the Raw machine. It also emits the appropriate bit sequence for implementing the selected configuration within each tile.

5 Software Support for Dynamic Events

As in any approach advocating exposure of hardware details, one raison d'être of the Raw machine is that compilation techniques exist which can manage, and indeed take advantage of, the resulting explicit control. For static program regions, the compilation process described in Section 4 provides plausible grounds for leveraging this control to achieve attractive performance gains compared to more traditional architectures. For efficient execution of the full spectrum of general-purpose programs, however, we must complement this compilation paradigm with techniques that support dynamic and unpredictable events. Dynamic events typically occur when data dependencies and/or latencies cannot be resolved at compile time. Examples of unresolvable data dependence include undecidable program input and non-linear (or indirect) array indexing. Latencies unknown at compile time may be a function of the memory system (cache misses, page faults), of data-dependent routing destinations, or simply of a region of code which executes for a data-dependent number of cycles. A final class of dynamic events includes unpredictable situations that arise as a function of the external environment of the processor, such as a bus interrupt. Raw's philosophy for handling dynamic situations that occur while running a program is to:

- ensure correct execution of all dynamic events,
- handle infrequent events in software (i.e., via polling), and
- use dynamic hardware as a last resort.

We have discovered that under the right circumstances certain static techniques can actually outperform dynamic protocols at their own game. When static techniques fail, we have discovered software solutions for the Raw architecture which approach the performance of additional hardware without the entailed overhead and complexity. And finally, when we do resort to hardware, such as in the dynamic routing of non-uniform message streams, we must be careful to ensure that the unpredictable dynamic situations are handled without interfering with any carefully predicted static events taking place at the same time. This section continues by describing some of these techniques.

5.1 Software Dynamic Routing

While some applications can be compiled in a completely static fashion, many applications have at least some amount of dynamic, data-dependent communication. We use the term dynamic message to denote any act of communication that cannot be routed and scheduled by the compiler. Raw uses one of two approaches to handle dynamic messages. The first is a software approach that preserves the predictability of the program in the face of dynamic messages by using a static, data-independent routing protocol. This approach statically reserves channel bandwidth between nodes that may potentially communicate. In the worst case, this conservative approach will involve an all-to-all personal communication schedule [6] between the processing elements. All-to-all schedules effectively simulate global communication architectures, such as a crossbar or bus, on our static mesh. This approach is effective for a small number of tiles. It is also effective when messages are short, so that the amount of communication bandwidth wasted is small. The second approach uses the dynamically routed network.
Raw supports a dynamically routed network in which communication is grouped into multi-word packets with a header that includes the destination address. Hardware interprets the message header in a packet at each switching point and dynamically routes the packet in an appropriate direction. By conservatively estimating the delivery time of dynamic messages and allowing enough slack, the compiler can still hope to meet its static schedule. Correctness is preserved in this approach through software checks of the flow control bits. This approach is more effective when the dynamic messages are long, so that the overhead of the message headers and the routing decision making can be amortized over a larger volume of communication.

5.2 Software Memory Dependency Checking

Dynamically checking for dependencies between every pair of memory operations is expensive. Most superscalar processors can sustain only one, or at most two (e.g., Alpha 21264), memory instruction issues per cycle. Since, in most programs, memory operations account for at least 20% of the total dynamic instructions, this limit on memory issue sharply limits the total parallelism that a superscalar processor can achieve. While our pedagogical vanilla Raw machine will include the minimal hardware mechanisms required, our research directions include the evaluation of various levels of hardware support for the kinds of dynamic events found in real applications.

Computer Organization

Computer Organization Computer Organization Douglas Comer Computer Science Department Purue University 250 N. University Street West Lafayette, IN 47907-2066 http://www.cs.purue.eu/people/comer Copyright 2006. All rights reserve.

More information

Compiler Optimisation

Compiler Optimisation Compiler Optimisation Michael O Boyle mob@inf.e.ac.uk Room 1.06 January, 2014 1 Two recommene books for the course Recommene texts Engineering a Compiler Engineering a Compiler by K. D. Cooper an L. Torczon.

More information

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introuction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threas 6. CPU Scheuling 7. Process Synchronization 8. Dealocks 9. Memory Management 10.Virtual Memory

More information

Just-In-Time Software Pipelining

Just-In-Time Software Pipelining Just-In-Time Software Pipelining Hongbo Rong Hyunchul Park Youfeng Wu Cheng Wang Programming Systems Lab Intel Labs, Santa Clara What is software pipelining? A loop optimization exposing instruction-level

More information

Coupling the User Interfaces of a Multiuser Program

Coupling the User Interfaces of a Multiuser Program Coupling the User Interfaces of a Multiuser Program PRASUN DEWAN University of North Carolina at Chapel Hill RAJIV CHOUDHARY Intel Corporation We have evelope a new moel for coupling the user-interfaces

More information

Message Transport With The User Datagram Protocol

Message Transport With The User Datagram Protocol Message Transport With The User Datagram Protocol User Datagram Protocol (UDP) Use During startup For VoIP an some vieo applications Accounts for less than 10% of Internet traffic Blocke by some ISPs Computer

More information

As our industry develops the technology that

As our industry develops the technology that Theme Feature Baring It All to Software: Raw Machines This innovative approach eliminates the traditional instruction set interface and instead exposes the details of a simple replicated architecture directly

More information

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control

Almost Disjunct Codes in Large Scale Multihop Wireless Network Media Access Control Almost Disjunct Coes in Large Scale Multihop Wireless Network Meia Access Control D. Charles Engelhart Anan Sivasubramaniam Penn. State University University Park PA 682 engelhar,anan @cse.psu.eu Abstract

More information

6.823 Computer System Architecture. Problem Set #3 Spring 2002

6.823 Computer System Architecture. Problem Set #3 Spring 2002 6.823 Computer System Architecture Problem Set #3 Spring 2002 Stuents are strongly encourage to collaborate in groups of up to three people. A group shoul han in only one copy of the solution to the problem

More information

Overview. Operating Systems I. Simple Memory Management. Simple Memory Management. Multiprocessing w/fixed Partitions.

Overview. Operating Systems I. Simple Memory Management. Simple Memory Management. Multiprocessing w/fixed Partitions. Overview Operating Systems I Management Provie Services processes files Manage Devices processor memory isk Simple Management One process in memory, using it all each program nees I/O rivers until 96 I/O

More information

Loop Scheduling and Partitions for Hiding Memory Latencies

Loop Scheduling and Partitions for Hiding Memory Latencies Loop Scheuling an Partitions for Hiing Memory Latencies Fei Chen Ewin Hsing-Mean Sha Dept. of Computer Science an Engineering University of Notre Dame Notre Dame, IN 46556 Email: fchen,esha @cse.n.eu Tel:

More information

Generalized Edge Coloring for Channel Assignment in Wireless Networks

Generalized Edge Coloring for Channel Assignment in Wireless Networks Generalize Ege Coloring for Channel Assignment in Wireless Networks Chun-Chen Hsu Institute of Information Science Acaemia Sinica Taipei, Taiwan Da-wei Wang Jan-Jan Wu Institute of Information Science

More information

PART 2. Organization Of An Operating System

PART 2. Organization Of An Operating System PART 2 Organization Of An Operating System CS 503 - PART 2 1 2010 Services An OS Supplies Support for concurrent execution Facilities for process synchronization Inter-process communication mechanisms

More information

Online Appendix to: Generalizing Database Forensics

Online Appendix to: Generalizing Database Forensics Online Appenix to: Generalizing Database Forensics KYRIACOS E. PAVLOU an RICHARD T. SNODGRASS, University of Arizona This appenix presents a step-by-step iscussion of the forensic analysis protocol that

More information

An Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique

An Adaptive Routing Algorithm for Communication Networks using Back Pressure Technique International OPEN ACCESS Journal Of Moern Engineering Research (IJMER) An Aaptive Routing Algorithm for Communication Networks using Back Pressure Technique Khasimpeera Mohamme 1, K. Kalpana 2 1 M. Tech

More information

AnyTraffic Labeled Routing

AnyTraffic Labeled Routing AnyTraffic Labele Routing Dimitri Papaimitriou 1, Pero Peroso 2, Davie Careglio 2 1 Alcatel-Lucent Bell, Antwerp, Belgium Email: imitri.papaimitriou@alcatel-lucent.com 2 Universitat Politècnica e Catalunya,

More information

Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines 1 2

Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines 1 2 This paper appears in J. of Parallel an Distribute Computing 10 (1990), pp. 167 181. Intensive Hypercube Communication: Prearrange Communication in Link-Boun Machines 1 2 Quentin F. Stout an Bruce Wagar

More information

Computer Organization

Computer Organization Computer Organization Douglas Comer Computer Science Department Purue University 250 N. University Street West Lafayette, IN 47907-2066 http://www.cs.purue.eu/people/comer Copyright 2006. All rights reserve.

More information

SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH

SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH SURVIVABLE IP OVER WDM: GUARANTEEEING MINIMUM NETWORK BANDWIDTH Galen H Sasaki Dept Elec Engg, U Hawaii 2540 Dole Street Honolul HI 96822 USA Ching-Fong Su Fuitsu Laboratories of America 595 Lawrence Expressway

More information

Skyline Community Search in Multi-valued Networks

Skyline Community Search in Multi-valued Networks Syline Community Search in Multi-value Networs Rong-Hua Li Beijing Institute of Technology Beijing, China lironghuascut@gmail.com Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong, China yu@se.cuh.eu.h

More information

Supporting Fully Adaptive Routing in InfiniBand Networks

Supporting Fully Adaptive Routing in InfiniBand Networks XIV JORNADAS DE PARALELISMO - LEGANES, SEPTIEMBRE 200 1 Supporting Fully Aaptive Routing in InfiniBan Networks J.C. Martínez, J. Flich, A. Robles, P. López an J. Duato Resumen InfiniBan is a new stanar

More information

Queueing Model and Optimization of Packet Dropping in Real-Time Wireless Sensor Networks

Queueing Model and Optimization of Packet Dropping in Real-Time Wireless Sensor Networks Queueing Moel an Optimization of Packet Dropping in Real-Time Wireless Sensor Networks Marc Aoun, Antonios Argyriou, Philips Research, Einhoven, 66AE, The Netherlans Department of Computer an Communication

More information

Optimal Routing and Scheduling for Deterministic Delay Tolerant Networks

Optimal Routing and Scheduling for Deterministic Delay Tolerant Networks Optimal Routing an Scheuling for Deterministic Delay Tolerant Networks Davi Hay Dipartimento i Elettronica olitecnico i Torino, Italy Email: hay@tlc.polito.it aolo Giaccone Dipartimento i Elettronica olitecnico

More information

ACE: And/Or-parallel Copying-based Execution of Logic Programs

ACE: And/Or-parallel Copying-based Execution of Logic Programs ACE: An/Or-parallel Copying-base Execution of Logic Programs Gopal GuptaJ Manuel Hermenegilo* Enrico PontelliJ an Vítor Santos Costa' Abstract In this paper we present a novel execution moel for parallel

More information

On the Placement of Internet Taps in Wireless Neighborhood Networks

On the Placement of Internet Taps in Wireless Neighborhood Networks 1 On the Placement of Internet Taps in Wireless Neighborhoo Networks Lili Qiu, Ranveer Chanra, Kamal Jain, Mohamma Mahian Abstract Recently there has emerge a novel application of wireless technology that

More information

Impact of FTP Application file size and TCP Variants on MANET Protocols Performance

Impact of FTP Application file size and TCP Variants on MANET Protocols Performance International Journal of Moern Communication Technologies & Research (IJMCTR) Impact of FTP Application file size an TCP Variants on MANET Protocols Performance Abelmuti Ahme Abbasher Ali, Dr.Amin Babkir

More information

EDOVE: Energy and Depth Variance-Based Opportunistic Void Avoidance Scheme for Underwater Acoustic Sensor Networks

EDOVE: Energy and Depth Variance-Based Opportunistic Void Avoidance Scheme for Underwater Acoustic Sensor Networks sensors Article EDOVE: Energy an Depth Variance-Base Opportunistic Voi Avoiance Scheme for Unerwater Acoustic Sensor Networks Safar Hussain Bouk 1, *, Sye Hassan Ahme 2, Kyung-Joon Park 1 an Yongsoon Eun

More information

MORA: a Movement-Based Routing Algorithm for Vehicle Ad Hoc Networks

MORA: a Movement-Based Routing Algorithm for Vehicle Ad Hoc Networks : a Movement-Base Routing Algorithm for Vehicle A Hoc Networks Fabrizio Granelli, Senior Member, Giulia Boato, Member, an Dzmitry Kliazovich, Stuent Member Abstract Recent interest in car-to-car communications

More information

MODULE VII. Emerging Technologies

MODULE VII. Emerging Technologies MODULE VII Emerging Technologies Computer Networks an Internets -- Moule 7 1 Spring, 2014 Copyright 2014. All rights reserve. Topics Software Define Networking The Internet Of Things Other trens in networking

More information

Improving Performance of Sparse Matrix-Vector Multiplication

Improving Performance of Sparse Matrix-Vector Multiplication Improving Performance of Sparse Matrix-Vector Multiplication Ali Pınar Michael T. Heath Department of Computer Science an Center of Simulation of Avance Rockets University of Illinois at Urbana-Champaign

More information

Creating a Scalable Microprocessor:

Creating a Scalable Microprocessor: Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B.

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots Aaptive Loa Balancing base on IP Fast Reroute to Avoi Congestion Hot-spots Masaki Hara an Takuya Yoshihiro Faculty of Systems Engineering, Wakayama University 930 Sakaeani, Wakayama, 640-8510, Japan Email:

More information

Socially-optimal ISP-aware P2P Content Distribution via a Primal-Dual Approach

Socially-optimal ISP-aware P2P Content Distribution via a Primal-Dual Approach Socially-optimal ISP-aware P2P Content Distribution via a Primal-Dual Approach Jian Zhao, Chuan Wu The University of Hong Kong {jzhao,cwu}@cs.hku.hk Abstract Peer-to-peer (P2P) technology is popularly

More information

Offloading Cellular Traffic through Opportunistic Communications: Analysis and Optimization

Vincenzo Sciancalepore, Domenico Giustiniano, Albert Banchs, Andreea Picu. arXiv:1405.3548v1 [cs.NI] 14 May 2014

Disjoint Multipath Routing in Dual Homing Networks using Colored Trees

Preetha Thulasiraman, Srinivasan Ramasubramanian, and Marwan Krunz, Department of Electrical and Computer Engineering, University of Arizona,

Generalized Edge Coloring for Channel Assignment in Wireless Networks

TR-IIS-05-021. Chun-Chen Hsu, Pangfeng Liu, Da-Wei Wang, Jan-Jan Wu. December 2005. Technical Report No. TR-IIS-05-021, http://www.iis.sinica.edu.tw/lib/techreport/tr2005/tr05.html

Demystifying Automata Processing: GPUs, FPGAs or Micron's AP?

Marziyeh Nourian, Xiang Wang, Xiaodong Yu, Wu-chun Feng, Michela Becchi. Department of Electrical and Computer Engineering,

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks

Hindawi Publishing Corporation, International Journal of Distributed Sensor Networks, Volume 2014, Article ID 936379, 17 pages. http://dx.doi.org/10.1155/2014/936379

PAPER. 1. Introduction

IEICE TRANS. COMMUN., VOL. E9x-B, No.8, AUGUST 2010. PAPER: Integrating Overlay Protocols for Providing Autonomic Services in Mobile Ad-hoc Networks. Panagiotis Gouvas, IEICE Student member, Anastasios Zafeiropoulos,

CS269I: Incentives in Computer Science Lecture #8: Incentives in BGP Routing

Tim Roughgarden, October 19, 2016. 1 Routing in the Internet: Last lecture we talked about delay-based (or selfish) routing, which

Yet Another Parallel Hypothesis Search for Inverse Entailment

Hiroyuki Nishiyama and Hayato Ohwada, Faculty of Sci. and Tech., Tokyo University of Science, 2641 Yamazaki, Noda-shi, CHIBA, 278-8510, Japan. hiroyuki@rs.noda.tus.ac.jp,

Non-homogeneous Generalization in Privacy Preserving Data Publishing

W. K. Wong, Nikos Mamoulis and David W. Cheung, Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong {wwong2,nikos,dcheung}@cs.hku.hk

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks

Jaewon Kang, John Sucec, Vikram Kaul, Sunil Samtani and Mariusz A. Fecko, Applied Research, Telcordia Technologies, One Telcordia Drive,

Transient analysis of wave propagation in 3D soil by using the scaled boundary finite element method

Southern Cross University, epublications@scu. 23rd Australasian Conference on the Mechanics of Structures and Materials 2014.

DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE. Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose

Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, and Alexander Stolyar. Abstract: Backpressure-based adaptive routing

Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine

Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, Saman Amarasinghe. M.I.T. Laboratory

An Algorithm for Building an Enterprise Network Topology Using Widespread Data Sources

Anton Andreev, Iurii Bogoiavlenskii, Petrozavodsk State University, Petrozavodsk, Russia {andreev, ybgv}@cs.petrsu.ru

PART 5. Process Coordination And Synchronization

CS 503 - PART 5, 2010. Location Of Process Coordination In The Xinu Hierarchy. Coordination Of Processes: Necessary in a concurrent system

Lab work #8. Congestion control

TELECOMMUNICATION NETWORK THEORY. Degree in Telematics Engineering, Degree in Telecommunication Systems Engineering. Academic year 2015-2016. Lab work #8. Congestion control (1 session). Author: Pablo Pavón Mariño

Recitation Caches and Blocking. 4 March 2019

15-213 Recitation: Caches and Blocking, 4 March 2019. Agenda: Reminders, Revisiting Cache Lab, Caching Review, Blocking to reduce cache misses, Cache alignment. Reminders: Due Dates, Cache Lab (Thursday 3/7), Midterm Exam

Distributed Line Graphs: A Universal Technique for Designing DHTs Based on Arbitrary Regular Graphs

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, MANUSCRIPT ID. Yiming Zhang and Ling Liu, Senior Member,

Design Principles for Practical Self-Routing Nonblocking Switching Networks with O(N log N) Bit-Complexity

IEEE TRANSACTIONS ON COMPUTERS, VOL. 46, NO. 10, OCTOBER 1997. Ted H. Szymanski, Member, IEEE

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

CHAPTER 3. Sub-topics: the topic will cover: Microprocessor architecture, CPU processing methods, Pipelining, Superscalar, RISC, Multiprocessing, Instruction Cycle, Instruction

Architecture Design of Mobile Access Coordinated Wireless Sensor Networks

Mai Abdelhakim, Leonard E. Lightfoot, Jian Ren, Tongtong Li. Department of Electrical & Computer Engineering, Michigan State University,

Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

The speed of an application is determined by more than just processor speed: disk speed, network speed... Multiprocessors typically improve the

Table-based division by small integer constants

Florent de Dinechin, Laurent-Stéphane Didier. LIP, Université de Lyon (ENS-Lyon/CNRS/INRIA/UCBL), 46, allée d'Italie, 69364 Lyon Cedex 07. Florent.de.Dinechin@ens-lyon.fr

Learning Subproblem Complexities in Distributed Branch and Bound

Lars Otten, Department of Computer Science, University of California, Irvine, lotten@ics.uci.edu. Rina Dechter, Department of Computer Science, University

Pairwise alignment using shortest path algorithms, Gunnar Klau, November 29, 2005

We will discuss: edit graph, Dijkstra's algorithm, algorithm (GDU). References

Data/Thread Level Speculation (TLS) in the Stanford Hydra Chip Multiprocessor (CMP)

Hydra is a 4-core Chip Multiprocessor (CMP) based microarchitecture/compiler effort at Stanford that provides hardware/software


Questions? Post on piazza, or Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)!

EE122 Fall 2013 HW3 Instructions. Record your answers in a file called hw3.pdf. Make sure to write your name and SID at the top of your assignment. For each problem, clearly indicate your final answer, bold and


Improving Spatial Reuse of IEEE 802.11 Based Ad Hoc Networks

Fengji Ye, Su Yi and Biplab Sikdar, ECSE Department, Rensselaer Polytechnic Institute, Troy, NY 12180. Abstract: In this paper, we evaluate and suggest methods

Parallel Computing: Parallel Architectures Jin, Hai

School of Computer Science and Technology, Huazhong University of Science and Technology. Peripherals, Computer Central Processing Unit, Main Memory, Computer

Backpressure-based Packet-by-Packet Adaptive Routing in Communication Networks

Eleftheria Athanasopoulou, Loc Bui, Tianxiong Ji, R. Srikant, and Alexander Stolyar. arXiv:1005.4984v1 [cs.NI] 27 May 2010. Abstract

Optimal Distributed P2P Streaming under Node Degree Bounds

Shaoquan Zhang, Ziyu Shao, Minghua Chen, and Libin Jiang. Department of Information Engineering, The Chinese University of Hong Kong; Department of EECS,

MODULE V. Internetworking: Concepts, Addressing, Architecture, Protocols, Datagram Processing, Transport-Layer Protocols, And End-To-End Services

Computer Networks and Internets -- Module 5, Spring 2014. Copyright

Classifying Facial Expression with Radial Basis Function Networks, using Gradient Descent and K-means

Neil Alldrin, Department of Computer Science, University of California, San Diego, La Jolla, CA 92037. nalldrin@cs.ucsd.edu

Random Clustering for Multiple Sampling Units to Speed Up Run-time Sample Generation

DEIM Forum 2018 I4-4. Yuzuru OKAJIMA and Koichi MARUYAMA, NEC Solution Innovators, Ltd., 1-18-7 Shinkiba, Koto-ku, Tokyo,

Learning Polynomial Functions by Feature Construction

Proceedings of the Eighth International Workshop on Machine Learning, Chicago, Illinois, June 27-29 1991. Richard S. Sutton, GTE Laboratories Incorporated

Optimizing the quality of scalable video streams on P2P Networks

Paper #7. ABSTRACT: The volume of multimedia data, including video, served through Peer-to-Peer (P2P) networks is growing rapidly. Unfortunately, high

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation

James Montgomery, Carole Fayad, and Sanda Petrovic. Faculty of Information & Communication Technologies, Swinburne University

You Can Do That. Unit 16. Motivation. Computer Organization: Design of a Simple Processor. Now that you have some understanding

Cloud & Distributed Computing (CyberPhysical, Databases, Mining, etc.), Applications (AI, Robotics, Graphics, Mobile), Systems & Networking (Embedded Systems,

Study of Network Optimization Method Based on ACL

Available online at www.sciencedirect.com. Procedia Engineering 15 (2011) 3959-3963. Advance in Control Engineering and Information Science. Liu Zhian, Department

CSE 820 Graduate Computer Architecture. Lec 8: Instruction Level Parallelism

Based on slides by David Patterson. Review from Last Time #1: Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

A shortest path algorithm in multimodal networks: a case study with time varying costs

Daniela Ambrosino, Anna Sciomachen. Department of Economics and Quantitative Methods (DIEM), University of Genoa, Via

A Metric for Routing in Delay-Sensitive Wireless Sensor Networks

Zhen Jiang, Jie Wu, Risa Ito. Dept. of Computer Sci., West Chester University; Dept. of Computer & Info. Sciences, Temple

CS 426 Parallel Computing. Parallel Computing Platforms

Ozcan Ozturk, http://www.cs.bilkent.edu.tr/~ozturk/cs426/. Slides are adapted from ``Introduction to Parallel Computing''. Topic Overview: Implicit Parallelism:

Memory Bank Disambiguation using Modulo Unrolling for Raw Machines

Rajeev Barua, Walter Lee, Saman Amarasinghe, Anant Agarwal. M.I.T. Laboratory for Computer Science, Cambridge, MA 02139, U.S.A. {barua,walt,saman,agarwal}@lcs.mit.edu

Professor Lee, Yong Surk. References: Overview of High-Performance Microprocessor Architecture. Topics: Microprocessor & microcontroller

This course was produced with the support of C & S Technology; it carries no copyright, so anyone may copy and distribute it for non-commercial purposes. The lab homepage hosts many lectures on high-performance microprocessors, free for anyone to download. Professor Lee, Yong Surk. 1973: B.S., Electrical Eng., Yonsei Univ. 1981

Preamble. Singly linked lists. Collaboration policy and academic integrity. Getting help

CS2110 Spring 2016, Assignment A: Linked Lists. Due on the CMS by: See the CMS. 1 Preamble: This assignment begins our discussions of structures. In this assignment, you will implement a structure

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism

Based on slides by David Patterson. Review from Last Time #1: Leverage Implicit Parallelism for Performance: Instruction Level

Bends, Jogs, And Wiggles for Railroad Tracks and Vehicle Guide Ways

Louis T. Klauer Jr., PhD, PE. Work Soft, 833 Galer Dr., Newtown Square, PA 19073. lklauer@wsof.com. Preprint, June 4, 2002. Copyright 2002 by Louis

MIT Laboratory for Computer Science

The Raw Processor: A Scalable 32 bit Fabric for General Purpose and Embedded Computing. Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat, Ben Greenwald, Paul Johnson, Walter Lee, Albert Ma, Nathan Shnidman,

Topics. Computer Networks and Internets -- Module 5, Spring. Copyright. All rights reserved.

Internet concept and architecture; Internet addressing; Internet Protocol packets (datagrams); Datagram forwarding; Address resolution; Error reporting mechanism; Configuration; Network address translation. Computer

NAND flash memory is widely used as a storage

Buffer-Aware Garbage Collection for Flash-Based Storage Systems. Sungjin Lee, Dongkun Shin, Member, IEEE, and Jihong Kim, Member, IEEE. Abstract: NAND flash-based storage device is becoming a viable storage

A Measurement Framework for Pin-Pointing Routing Changes

Renata Teixeira, Univ. Calif. San Diego, La Jolla, CA, teixeira@cs.ucsd.edu; Jennifer Rexford, AT&T Labs Research, Florham Park, NJ, jrex@research.att.com

CMSC 430 Introduction to Compilers. Spring 2016. Register Allocation

Introduction: Change code that uses an unbounded set of virtual registers to code that uses a finite set of actual regs. For bytecode targets, can

THE BAYESIAN RECEIVER OPERATING CHARACTERISTIC CURVE: AN EFFECTIVE APPROACH TO EVALUATE THE IDS PERFORMANCE

BFU International Conference. Evgeniya Nikolova, Veselina Jecheva, Burgas Free University. Abstract:


Scalable Deterministic Scheduling for WDM Slot Switching Xhaul with Zero-Jitter

Regular papers, ONDM 2018. Bogdan Uscumlic, Dominique Chiaroni, Brice Leclerc, Thierry Zami, Annie

Master of Engineering Preliminary Thesis Proposal For 6.191: Prototyping Research Results. December 5, 2002

Cemal Akcaba, Massachusetts Institute of Technology, Cambridge, MA 02139. Thesis Advisor: Prof. Agarwal

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

Motivation behind Multiprocessors: Limitations of ILP (as already discussed); Growing interest in servers and server-performance

Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations

2012 Third International Conference on Networking and Computing. Ryohei Kobayashi, Tokyo Institute of Technology, Japan. E-mail: kobayashi@arch.cs.titech.ac.jp

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park, Dept. of EE, KAIST

Instruction-level parallelism; Loop unrolling; Dependence: data / name / control dependence; Loop level parallelism


A Classification of 3R Orthogonal Manipulators by the Topology of their Workspace

Maher Baili, Philippe Wenger and Damien Chablat, Institut de Recherche en Communications et Cybernétique de Nantes, UMR C.N.R.S.

A Streaming Multi-Threaded Model

Extended Abstract. Eylon Caspi, André DeHon, John Wawrzynek. September 30, 2001. Summary: We present SCORE, a multi-threaded model that relies on streams to expose thread