Design of a Parallel Vector Access Unit for SDRAM Memory Systems


Binu K. Mathew, Sally A. McKee, John B. Carter, Al Davis
Department of Computer Science, University of Utah, Salt Lake City, UT
mbinu sam retrac

This effort was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under agreement number F and DARPA Order Numbers F393/-1 and F376/. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of DARPA, AFRL, or the US Government.

Abstract

We are attacking the memory bottleneck by building a smart memory controller that improves effective memory bandwidth, bus utilization, and cache efficiency by letting applications dictate how their data is accessed and cached. This paper describes a Parallel Vector Access unit (PVA), the vector memory subsystem that efficiently gathers sparse, strided data structures in parallel on a multibank SDRAM memory. We have validated our PVA design via gate-level simulation, and have evaluated its performance via functional simulation and formal analysis. On unit-stride vectors, PVA performance equals or exceeds that of an SDRAM system optimized for cache line fills. On vectors with larger strides, the PVA is up to 32.8 times faster. Our design is up to 3.3 times faster than a pipelined, serial SDRAM memory system that gathers sparse vector data, and the gathering mechanism is two to five times faster than in other PVAs with similar goals. Our PVA only slightly increases hardware complexity with respect to these other systems, and the scalable design is appropriate for a range of computing platforms, from vector supercomputers to commodity PCs.

1. Introduction

Processor speeds are increasing much faster than memory speeds, and this disparity prevents many applications from making effective use of the tremendous computing power of modern microprocessors. In the Impulse project, we are attacking the memory bottleneck by designing and building a smart memory controller [3]. The Impulse memory system can significantly improve the performance of applications with predictable access patterns but poor spatial or temporal locality [3]. Impulse supports an optional extra address translation stage that allows applications to control how their data is accessed and cached. For instance, on a conventional memory system, traversing rows of a FORTRAN matrix wastes bus bandwidth: the cache line fills transfer unneeded data and evict other useful data. Impulse gathers sparse vector elements into dense cache lines, much like the scatter/gather operations supported by the load-store units of vector supercomputers.

Several new instruction set extensions (e.g., Intel's MMX for the Pentium [1], AMD's 3DNow! for the K6-2 [1], MIPS's MDMX [18], Sun's VIS for the UltraSPARC [22], and Motorola's AltiVec for the PowerPC [19]) bring stream and vector processing to the domain of desktop computing. Results for some applications that use these extensions are promising [21, 23], even though the extensions do little to address memory system performance. Impulse can boost the benefit of these vector extensions by optimizing the cache and bus utilization of sparse data accesses.

In this paper we describe a vector memory subsystem that implements both conventional cache line fills and vector-style scatter/gather operations efficiently. Our design incorporates three complementary optimizations for non-unit stride vector requests:

1. We improve memory locality via remapping. Rather than perform a series of regular cache line fills for non-unit stride vectors, our system gathers only the desired elements into dense cache lines.

2. We increase throughput with parallelism. To mitigate the relatively high latency of SDRAM, we operate multiple banks simultaneously, with components working on independent parts of a vector request. Encoding many individual requests in a compound vector command enables this parallelism and reduces communication within the memory controller.

3. We exploit SDRAM's non-uniform access characteristics. We minimize observed precharge latencies and row access delays by overlapping these with other memory activity, and by trying to issue vector references in an order that hits the current row buffers.

Figure 1 illustrates how the components of the Impulse memory controller interact. A small set of remapping controllers supports three types of scatter/gather operations: base-stride, vector indirect, and matrix inversion. Applications configure the memory controller so that requests to certain physical address regions trigger scatter/gather operations (interface and programming details are presented elsewhere [3]). When the processor issues a load falling into such a region, its remapping controller sees the load and broadcasts the appropriate scatter/gather vector command to all bank controller (BC) units via the bank controller bus. In parallel, each BC determines which parts of the command it must perform locally and then gathers the corresponding vector elements into a staging unit. When all BCs have fetched their elements, they signal the remapping controller, which constructs the complete vector from the data in the staging units. In this way, a single cache line fill can load data from a set of sparse addresses, e.g., the row elements of a FORTRAN array.

[Figure 1: block diagram showing the system bus, the Impulse adaptable memory controller (conventional fetch unit, remapping controllers, and vector access unit), the bank controller bus, the bank controllers, and the DRAMs they drive.]
Figure 1. Memory subsystem overview. The configurable remapping controllers broadcast vector commands to all bank controllers. The controllers gather data elements from the SDRAMs into staging units, from which the vector is transferred to the CPU chip.

We have validated the PVA design via gate-level synthesis and simulation, and have evaluated its performance via functional simulation and formal analysis. For the kernels we study, our PVA-based memory system fills normal (unit stride) cache lines as fast as and up to 8% faster than a conventional, cache-line interleaved memory system optimized for line fills. For larger strides, it loads elements up to 32.8 times faster than the conventional memory system. Our system is up to 3.3 times faster than a pipelined, centralized memory access unit that gathers sparse vector data by issuing (up to) one SDRAM access per cycle. Compared to other parallel vector access units with similar goals [5], gather operations are two to five times faster. This improved parallel access algorithm only modestly increases hardware complexity. By localizing all architectural changes within the memory controller, we require no modifications to the processor, system bus, or on-chip memory hierarchy. This scalable solution is applicable to a range of computing platforms, from vector computers with DRAM memories to commodity personal computers.

2. Related Work

We limit our discussion to work that addresses loading vectors from DRAM. Moyer defines access scheduling and access ordering to be techniques that reduce load/store interlock delays by overlapping computation with memory latency, and that change the order of memory requests to increase performance, respectively [2]. Access scheduling attempts to separate the execution of a load/store instruction from that of the instruction that produces/consumes its operand, thereby reducing the processor's observed memory delays. Moyer applies both concepts to compiler algorithms that optimize inner loops, unrolling and grouping stream accesses to amortize the cost of each DRAM page miss over several references to the open page.

Lee mimics Cray instructions on the Intel i860XR using another software approach, treating the cache as a pseudo vector register by reading vector elements in blocks (using non-caching loads) and then writing them to a preallocated portion of cache [11]. Loading a single vector via Moyer's and Lee's schemes on an iPSC/860 node improves performance by 4-45%, depending on the stride [15].

Valero et al. dynamically avoid bank conflicts in vector processors by accessing vector elements out of order. They analyze this system first for single vectors [24], and then extend the design to multiple vectors [25]. del Corral and Llaberia analyze a related hardware scheme for avoiding bank conflicts among multiple vectors in complex memories [7]. These schemes focus on vector computers with (uniform access time) SRAM memory components.

The PVA component presented herein is similar to Corbal et al.'s Command Vector Memory System [6] (CVMS), which exploits parallelism and locality of reference to improve effective bandwidth for out-of-order vector processors with dual-banked SDRAM memories. Instead of sending individual requests to individual devices, the CVMS broadcasts commands requesting multiple independent words, a design idea we adopt. Section controllers receive the broadcasts, compute subcommands for the portion

of the data for which they are responsible, and then issue the addresses to SDRAMs under their control. The memory subsystem orders requests to each dual-banked device, attempting to overlap precharges to each internal bank with accesses to the other. Simulation results demonstrate performance improvements of 15-54% over a serial controller. Our bank controllers behaviorally resemble CVMS section controllers, but our hardware design and parallel access algorithm (see Section 4.3) differ substantially.

The Stream Memory Controller (SMC) of McKee et al. [14] combines programmable stream buffers and prefetching in a memory controller with intelligent DRAM scheduling. Vector data bypass the cache in this system, but the underlying access-ordering concepts can be adapted to systems that cache vectors. The SMC dynamically reorders stream/vector accesses and issues them serially to exploit: a) parallelism across dual banks of fast-page mode DRAM, and b) locality of reference within DRAM page buffers. For most alignments and strides on uniprocessor systems, simple ordering schemes perform competitively with sophisticated ones [13].

Stream detection is an important design issue for these systems. At one end of the spectrum, the application programmer may be required to identify vectors, as is currently the case in Impulse. Alternatively, the compiler can identify vector accesses and specify them to the memory controller, an approach we are pursuing. For instance, Benitez and Davidson present simple and efficient compiler algorithms (whose complexity is similar to strength reduction) to detect and optimize streams [2]. Vectorizing compilers can also provide the needed vector parameters, and can perform extensive loop restructuring and optimization to maximize vector performance [26]. At the other end of the spectrum lie hardware vector or stream detection schemes, as in reference prediction tables [4]. Any of these suffices to provide the information Impulse needs to generate vector accesses.

3. Mathematical Foundations

The Impulse remapping controller gathers strided data structures by broadcasting vector commands to a set of bank controllers (BCs), each of which determines, independently and in tandem with the others, which elements of the vector reside in the SDRAM it manages. This broadcast approach is potentially much more efficient than the straightforward alternative of having a centralized vector controller issue the stream of element addresses, one per cycle. Realizing this performance potential requires a method whereby each bank controller can determine the addresses of the elements that reside on its SDRAM without sequentially expanding the entire vector. The primary advantage of our PVA mechanism over similar designs is the efficiency of our hardware algorithms for computing each bank's subvector.

We first introduce the terminology used in describing these algorithms. Base-stride vector operations are represented by a tuple, V = <B, S, L>, where V.B is the base address, V.S the sequence stride, and V.L the sequence length. We refer to V's i-th element as V[i]. For example, V = <A, S, L> designates vector elements A[0], A[S], ..., A[(L-1)S]. The number of banks, M, is a power of two. The PVA algorithm is based on two functions:

1. FirstHit(V, b) takes a vector V and a bank b and returns either the index of the first element of V that hits in b or a value indicating that no such element exists.

2. NextHit(S) returns an increment d such that if a bank holds V[n], it also holds V[n+d].

Space considerations only permit a simplified explanation here. Our technical report contains complete mathematical details [12]. FirstHit() and first address calculation together can be evaluated in two cycles for power of two strides and at most five cycles for other strides. Their design is scalable and can be implemented in a variety of ways. This paper henceforth assumes that they are implemented as a programmable logic array (PLA). NextHit() is trivial to implement, and takes only a few gate delays to evaluate.
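As an illustration only, the two functions can be modeled in software. The sketch below is not the hardware algorithm from the technical report: it assumes word-granularity addresses and simple word interleaving (bank = address mod M), finds the first hit by brute-force search rather than by the closed-form PLA logic, and uses hypothetical names (`first_hit`, `next_hit`, `bank_indices`). Under those assumptions, two elements n and n+d fall in the same bank exactly when d times the stride is a multiple of M, so the smallest such increment is M / gcd(S, M).

```python
from math import gcd


def first_hit(base, stride, length, bank, num_banks):
    """Index of the first vector element stored in `bank`, or None if no hit.

    Brute-force reference model; assumes word addresses and bank = addr % num_banks.
    """
    for i in range(length):
        if (base + i * stride) % num_banks == bank:
            return i
    return None


def next_hit(stride, num_banks):
    """Smallest d > 0 such that elements n and n + d always share a bank."""
    return num_banks // gcd(stride, num_banks)


def bank_indices(base, stride, length, bank, num_banks):
    """All element indices of the vector <base, stride, length> held by `bank`."""
    i = first_hit(base, stride, length, bank, num_banks)
    if i is None:
        return []
    return list(range(i, length, next_hit(stride, num_banks)))


if __name__ == "__main__":
    # 32-element vector, stride 19, 16 banks: every bank holds two elements.
    for b in range(16):
        print(b, bank_indices(0, 19, 32, b, 16))
```

With a stride of 4 and 16 banks, only every fourth bank is hit and each hit bank holds eight of the 32 elements, which shows why power-of-two strides limit the available bank parallelism.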
Given inputs b, M, V.S mod M, and V.B mod M, each bank controller uses these functions to independently determine the sub-vector elements for which it is responsible. The BC for bank b performs the following operations (concurrently, where possible):

1. calculate i = FirstHit(V, b); if NoHit, continue.
2. while i < V.L do
       access memory location V.B + i × V.S
       i += NextHit(V.S)

4. Vector Access Unit

The design space for a PVA mechanism is enormous: the type of DRAM, number of banks, interleave factor, and implementation strategy for FirstHit() can be varied to trade hardware complexity for performance. For instance, lower-cost solutions might let a set of banks share bank controllers and BC buses, multiplexing the use of these resources. To demonstrate the feasibility of our approach and to derive timing and hardware complexity estimates, we developed and synthesized a Verilog model of one design point in this large space. The implementation uses 16 banks of word-interleaved SDRAM (32-bit wide). Each has a dedicated bank controller that drives 256 Mbit 16-bit wide Micron SDRAM parts, each of which contains four internal banks [17]. The current PVA design assumes an L2 cache line of 128 bytes, and therefore operates on vector commands of

32 single-word elements. We first describe the implementation of the bank-controller bus and the BCs, and then show how the controllers work in tandem.

4.1 Bank Controller Bus

As illustrated in Figure 1, the bank controllers communicate with the rest of the memory controller via a shared, split-transaction bus (BC bus) that multiplexes requests and data. During a vector request cycle, the bus supports a 32-bit address, 32-bit stride, three-bit transaction ID, two-bit command, and some control information. During a data cycle, it supports 64 data bits. The current PVA design targets a MIPS R10000 processor with a 64-bit system bus, on which the PVA unit can send or receive one data word per cycle. No intermediate unit is needed to merge data collected by multiple BCs: when read data is returned to the processor, the BCs take turns driving their part of the cache line onto the system bus. Electrical limitations require a turn-around cycle whenever bus ownership changes, but to avoid this delay, we use a 128-bit BC bus and drive alternate 64-bit halves every other data cycle. In addition to the 128 multiplexed lines, the BC bus includes eight shared transaction-complete indication lines.

4.2 Bank Controllers

For a given vector read or write command, each Bank Controller (BC) is responsible for identifying and accessing the (possibly null) subvector that resides in its bank. Shown in Figure 2, the architecture of this component consists of:

1. a FirstHit Predictor to determine whether elements of a given vector request hit this bank. If there is a hit and the stride is a power of two, this subcomponent performs the FirstHit address calculation;
2. a Request FIFO to queue vector requests for service;
3. a Register File to provide storage for the Request FIFO;
4. a FirstHit Calculate module to determine the address of the first element hitting this bank when the stride is not a power of two;
5. an Access Scheduler to drive the SDRAM, reordering read, write, bank activate and precharge operations to maximize performance;
6. a set of Vector Contexts within the Access Scheduler to represent the current vector requests;
7. a Scheduling Policy Module within each Vector Context to dictate the scheduling policy; and
8. a Staging Unit that consists of (i) a Read Staging Unit to store read-data waiting to be assembled into a cache line, and (ii) a Write Staging Unit to store write-data waiting to be sent to the SDRAMs.

We briefly describe each of these subcomponents. Essential to efficient operation are several bypass paths that reduce communication latency within the BC. Our technical report fleshes out details of these modules and their interactions [12]. The main modules of the BC manage the computations required for parallel vector access, the efficient scheduling of SDRAM, and the data staging.

4.2.1 Parallelizing Logic

The parallelizing logic consists of the FirstHit Predict (FHP) module, the Request FIFO (RQF), the Register File (RF), and the FirstHit Calculate (FHC) modules. The FHP module watches vector requests on the BC bus and determines whether or not any element of a request will hit the bank. The FHP calculates the FirstHit index, the index of the first vector element in the bank. For power-of-two strides that hit, the FHP also calculates the FirstHit address, the bank address of the first element. The FHP then signals the RQF to queue: the request's <B, S, L> tuple (base address, stride, and length); the FirstHit index; the calculated bank address, if ready; and an address calculation complete (ACC) flag indicating the status of the bank address field.

The RF subcomponent contains as many entries as the number of outstanding transactions permitted by the BC bus, which is eight in this implementation. The RQF module implements the state machine and tail pointer to maintain the RF as a queue, storing vector requests in the RF entries until those requests are assigned to vector contexts. Queued requests with a cleared ACC flag require further processing: the FHC module computes the FirstHit address for these requests, whose stride is not a power of two.
The FHC scans the requests between the queue head pointer, which it maintains, and the tail pointer, multiplying the stride by the FirstHit index calculated by the FHP, and then adding that to the base address to generate the FirstHit address. The FHC then writes this address into the register file and sets the entry's ACC flag. Since this calculation requires a multiply and add, it incurs a two-cycle delay, but the FHC works in parallel with the Access Scheduler (SCHED), so when the latter module is busy, this delay is completely hidden. When the SCHED sees the ACC bit set for the entry at the head of the RQF, it knows that there is a vector request ready for issue.

4.2.2 Access Scheduler

The SCHED and its subcomponents, the Vector Contexts (VCs) and Scheduling Policy Unit (SPU) modules, are responsible for: (i) expanding the series of addresses in a vector request, (ii) ordering the stream of reads, writes, bank activates, and precharges so that multiple vector requests can be issued optimally, (iii) making row activate/precharge

decisions, and (iv) driving the SDRAM. The SCHED module decides when to keep an SDRAM row open, and the SPUs within the SCHED's VCs reorder the accesses. The current design contains four VCs, each of which can hold a vector request whose accesses are ready to be issued to the SDRAM. The VC performs a series of shifts and adds to generate the sequence of addresses required to fetch a particular vector. These efficient calculations constitute the crux of our PVA approach, and our technical report explains their details.

[Figure 2: block diagram of the bank controller, showing the Bank Controller Bus, FirstHit Predict, Request FIFO, Register File, FirstHit Calculate, Access Scheduler with Vector Contexts, Staging Unit, and SDRAM Interface.]
Figure 2. Bank controller internal organization

The VCs share the SCHED datapath, and they cooperate to issue the highest priority pending SDRAM operation required by any VC. The VCs arbitrate for this datapath such that at most one can access it in any cycle, and the oldest pending operation has highest priority. Vector operations are injected into VC 0, and whenever one completes (at most one finishes per cycle), other pending operations shift right into the next free VC. To give the oldest pending operation priority, we daisy-chain the SCHED requests from VC N to VC 0 such that a lower numbered VC can place a request on the shared datapath if and only if no higher numbered VC wishes to do so.
The VCs attempt to minimize precharge overhead by giving accesses that hit in an open internal bank priority over requests to a different internal bank on the same SDRAM module. Three lines per internal bank (bank hit predict, bank more hit predict, and bank close predict) coordinate this operation. The SCHED broadcasts to the VCs the current row addresses of the open internal banks. When a VC determines that it has a pending request that would hit an open row, it drives the internal bank's shared line to tell the SCHED not to close the row; in other words, we implement a wired OR operation. Likewise, VCs that have a pending request that misses in the internal bank use the bank close predict line to tell the SCHED to close the row.

The SPUs within each of the VCs decide together which VC can issue an operation during the current cycle. This decision is based on their collective state as observed on the bank hit predict, bank more hit predict, and bank close predict lines. Separate SPUs are used to isolate the scheduling heuristics within the subcomponents, permitting experimentation with the scheduling policy without changing the rest of the BC.

The scheduling algorithm strives to improve performance by maximizing row hits and hiding latencies; it does this by operating other internal banks while a given internal bank is being opened or precharged. We implement a policy that promotes row opens and precharges above read and write operations, as long as the former do not conflict with the open rows in use by another VC. This heuristic opens rows as early as possible. When conflicts or open/precharge latencies prevent higher numbered VCs from issuing a read or write, a lower priority VC may issue its reads or writes. The policy ensures that when an older request completes, a new request will be ready, even if the new one uses a different internal bank. Details of the scheduling algorithm are given in our technical report [12].
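The arbitration and wired-OR signalling described in this subsection can be approximated by a few lines of software. This is a minimal sketch under stated assumptions, not the scheduling algorithm of the technical report: the function names are hypothetical, each VC is reduced to a boolean request, and the rule that a row is closed only when some VC asks for it and no VC predicts a hit is our assumption about how the SCHED combines the two lines. The bank more hit predict line is not modeled.

```python
def daisy_chain_grant(wants_datapath):
    """Return the index of the VC granted the shared datapath, or None.

    wants_datapath[i] is True if VC i wants to issue this cycle. Older
    requests sit in higher-numbered VCs, so the highest requesting index wins.
    """
    for vc in range(len(wants_datapath) - 1, -1, -1):
        if wants_datapath[vc]:
            return vc
    return None


def wired_or(drivers):
    """A shared line is asserted if any VC drives it."""
    return any(drivers)


def should_close_row(hit_predict, close_predict):
    """Assumed combination rule: close only if requested and no VC predicts a hit."""
    return wired_or(close_predict) and not wired_or(hit_predict)


if __name__ == "__main__":
    # VC 2 holds the oldest requesting operation, so it wins over VC 0.
    print(daisy_chain_grant([True, False, True, False]))
    # One VC would still hit the open row, so the row stays open.
    print(should_close_row([False, True, False, False], [True, False, False, False]))
```

Keeping the grant logic separate from the row-management logic mirrors the text's point that the scheduling heuristics are isolated so they can be changed without touching the rest of the bank controller.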

4.2.3 Staging Units

The Staging Units (SUs) store data returned by the SDRAMs for a VC-generated read operation or provided by the memory controller for a write. In the case of a gathered vector read operation, the SUs on the participating BCs cooperate to merge vector elements into a cache line to be sent to the memory controller front end, as described in Section 4.1. In the case of a scattered vector write operation, the SUs at each participating BC buffer the write data sent by the front end. The SUs drive a transaction complete line on the BC bus to signal the completion of a pending vector operation. This line acts as a wired OR that deasserts whenever all BCs have finished a particular gathered vector read or scattered vector write operation. When the line goes low during a read, the memory controller issues a STAGE READ command on the vector bus, indicating which pending vector read operation's data is to be read. When the line goes low during a write, the memory controller knows that the corresponding data has been committed to SDRAM.

4.2.4 Data Hazards

Reordering reads and writes may violate consistency semantics. To maintain acceptable consistency semantics and to avoid turnaround cycles, the following restriction is required: a VC may issue a read/write only if the bus has the same polarity and no polarity reversals have occurred in any preceding (older) VC. The gist of this rule is that elements of different vectors may be issued out of order as long as they are not separated by a request of the opposite polarity. This policy gives rise to two important consistency semantics. First, RAW hazards cannot happen. Second, WAW hazards may happen if two vector write requests not separated by a read happen to write different data to the same location. We assume that the latter event is unlikely to occur in a uniprocessor machine. If the L2 cache has a write-back and write-allocate policy, then any consecutive writes to the same location will be separated by a read.
If stricter consistency semantics are required, a compiler can be made to issue a dummy read to separate the two writes.

4.3 Timing Considerations

SDRAMs define timing restrictions on the sequence of operations that can legally be performed. To maintain these restrictions, we use a set of small counters called restimers, each of which enforces one timing parameter by asserting a resource available line when the corresponding operation is permitted. The control logic of the VC window works like a scoreboard and ensures that all timing restrictions are met by letting a VC issue an operation only when all the resources it needs (including the restimers and the datapath) can be acquired. Electrical considerations require a one-cycle bus turnaround delay whenever the bus polarity is reversed, i.e., when a read is immediately followed by a write or vice versa. The SCHED units attempt to minimize turnaround cycles by reordering accesses.

Kernel    Access Pattern
copy      for (i=0; i < L*S; i+=S) y[i] = x[i];
saxpy     for (i=0; i < L*S; i+=S) y[i] += a * x[i];
scale     for (i=0; i < L*S; i+=S) x[i] = a * x[i];
swap      for (i=0; i < L*S; i+=S) { reg = x[i]; x[i] = y[i]; y[i] = reg; }
tridiag   for (i=0; i < L*S; i+=S) x[i] = z[i] * (y[i] - x[i-1]);
vaxpy     for (i=0; i < L*S; i+=S) y[i] += a[i] * x[i];

Table 1. Inner loops used to evaluate our PVA unit design.

5. Experimental Methodology

This section describes the details and rationale of how we evaluate the PVA design. The initial prototype uses a word-interleaved organization, since block-interleaving complicates address arithmetic and increases the hardware complexity of the memory controller. Our design can be extended for block-interleaved memories, but we have yet to perform price/performance analyses of this design space. Note that Hsu and Smith study interleaving schemes for fast-page mode DRAM memories in vector machines [9], finding cache-line interleaving and block interleaving superior to low-order interleaving for many vector applications.
The systems they examine perform no dynamic access ordering to increase locality, though, and their results thus favor organizations that increase spatial locality within the DRAM page buffers. It remains to be seen whether low-order interleaving becomes more attractive in conjunction with access ordering and scheduling techniques, but our initial results are encouraging.

Table 1 lists the kernels used to generate the results presented here. copy, saxpy and scale are from the BLAS (Basic Linear Algebra Subprograms) benchmark suite [8], and tridiag is a tridiagonal Gaussian elimination fragment, the fifth Livermore Loop [16]. vaxpy denotes a vector axpy operation that occurs in matrix-vector multiplication by diagonals. We choose loop kernels over whole-program benchmarks for this initial study because: (i) our PVA scheduler only speeds up vector accesses, (ii) kernels allow us to examine the performance of our PVA mechanism over a larger experimental design space, and (iii) kernels are small enough to permit the detailed, gate-level simulations required to validate the design and to derive timing

estimates. Evaluating performance on larger, real-world benchmarks, via functional simulation of the whole Impulse system or performance analysis of the hardware prototype we are building, will be necessary to provide the final proof of concept for the design presented here, but these results are not yet available.

Type              Count
AND
D FLIP-FLOP       139
D Latch           32
INV               1627
MUX2              183
NAND
NOR2              843
OR2               194
XOR2              5
PULLDOWN          13
TRISTATE BUFFER   1849
On-chip RAM       2K bytes

Table 2. Complexity of the synthesized bank controller.

Recall that the bus model we target allows only eight outstanding transactions. This limit prevents us from unrolling most of our loops to group multiple commands to a given vector, but we examine performance for this optimization on the two kernels that access only two vectors, copy and scale.

In our experiments, we vary both the vector stride and the relative vector alignments (placement of the base addresses within memory banks, within internal banks for a given SDRAM, and within rows or pages for a given internal bank). All vectors are 1024 elements (32 cache lines) long, and the strides are equal throughout a given loop. In all, we have evaluated PVA performance for 240 data points (eight access patterns × six strides × five relative vector alignments) for each of four different memory system models. We present highlights of these results in the following section; details may be found in our technical report [12].

6. Results

This section presents timing and complexity results from synthesizing the PVA and comparative performance results for our suite of benchmark kernels.

6.1 Synthesis Results

Our end goal is to fabricate a CMOS ASIC of the Impulse memory controller, but we are first validating pieces of the larger design using FPGA (field programmable gate array) technology. We produce an FPGA implementation on an IKOS Hermes emulator with 64 Xilinx FPGAs, and then use this implementation to derive timing estimates. The PVA's Verilog description consists of 36 lines of code.
The types and numbers of components in the synthesized bank controller are given in Table 2. We expect the custom CMOS implementation to be much more efficient than the FPGA implementation. We used the synthesized design to measure delay through the critical path, the multiply-and-add circuit required to calculate FirstHit() for non-power-of-two strides. Our multiply-and-add unit has a delay of 29.5ns. We expect that an optimized CMOS implementation will have a delay of less than 20ns, making it possible to complete this operation in two cycles at 100MHz. Other paths are fast enough to operate at 100MHz even in the FPGA implementation. The FHP unit has a delay of 8.3ns and SCHED has a delay of 9.3ns. CMOS timing considerations are very different from those for FPGAs, and thus the optimization strategies differ significantly. These FPGA delays represent an upper bound; the custom CMOS version will be much faster.

6.2 Performance Results

We compare the performance of the PVA functional model to three other memory systems. Figure 3(a)-(c) show the comparative performance for our four memory models on strides 1, 2, 4, 8, 16, and 19 for the copy, swap, and vaxpy kernels, and Figure 3(d)-(f) show comparative performance across all benchmarks for strides 1, 4, and 16. The annotations above each bar indicate execution time normalized to the minimum PVA SDRAM cycle time for each access pattern. Bars that would be off the y scale are drawn at the maximum y value and annotated with the actual number of cycles spent.

The sets of bars labeled copy2 and scale2 represent unrolled kernels in which read and write vector commands are grouped (so the PVA sees two consecutive vector commands for the first vector, then two for the second, and so on). This optimization only improves performance for the PVA SDRAM systems, yielding a slight advantage over the unoptimized versions of the same benchmark. If more outstanding transactions were allowed on the processor bus, greater unrolling would deliver larger improvements.
The bars labeled cache line interleaved serial SDRAM model the back end of an idealized, 16-module SDRAM system optimized for cache line fills. The memory bus is 64 bits, and L2 cache lines are 128 bytes. The SDRAMs modeled require two cycles for each of RAS and CAS, and are capable of 16-cycle bursts. We optimistically assume that precharge latencies can be overlapped with activity on other SDRAMs (and we ignore the fact that writing lines takes slightly less time than reading), thus each cache line fill takes 20 cycles (two for RAS, two for CAS, and 16 for the data burst). The number of cache lines accessed depends on the length and stride of the vectors; this system makes no attempt to gather sparse data within the memory controller.

The bars labeled gathering pipelined serial SDRAM model the back end of a 16-module, word-interleaved

8 Š Š Š Š Š Ž Š min parallel vetor SDRAM max parallel vetor SDRAM min parallel vetor SRAM max parallel vetor SRAM gathering pipelined serial SDRAM ahe line interleaved serial SDRAM opy opy saxpy sale sale swap tridiag vaxpy stride 1 stride 2 stride 4 stride 8 stride 16 stride 19 (a) opy Figure 4. Comparative performane for a prime stride (19). stride stride opy opy opy opy opy opy2 stride stride saxpy saxpy 1.45 saxpy stride (b) swap stride () vaxpy sale 3.19 (d) stride sale stride stride sale2 sale2 (e) stride sale 248 sale2 (f) stride swap stride swap swap stride tridiag tridiag 1.61 tridiag stride stride Figure 3. Comparative performane for four memory bak ends vaxpy vaxpy 1.48 vaxpy SDRAM system with a losed-page poliy and pipelined preharge. As before, the memory bus is 64 bits, and vetor ommands aess 32 elements (128 bytes, sine the present system uses 4-byte elements). Instead of performing ahe line fills, this system aesses eah vetor element individually. Although aesses are issued serially, we assume that the memory ontroller an overlap RAS latenies with ativity on other banks for all but the first element aessed by eah ommand. We optimistially assume that vetor ommands never ross DRAM pages, and thus DRAM pages are left open during the proessing of eah ommand. Preharge osts are inurred at the beginning of eah vetor ommand. This system requires more to aess unit-stride vetors than the ahe line interleaved system we model, but beause it only aesses the desired vetor elements, its relative performane inreases dramatially as vetor stride goes up. The bars leled min parallel vetor aess SRAM and max parallel vetor aess SRAM model the performane of an idealized SRAM vetor memory system with the same parallel aess sheme but with no preharge or RAS latenies. Comparing PVA SDRAM and PVA SRAM system performanes gives a measure of how well our system hides the extra latenies assoiated with dynami RAM. 
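The cost model for the cache line interleaved system described above can be sketched in a few lines. This is a minimal illustration, not the authors' simulator: the function names are hypothetical, strides are assumed to be expressed in elements, and the cycle count ignores precharge exactly as the text's optimistic assumption does.

```python
RAS_CYCLES = 2      # cycles for RAS, as stated in the text
CAS_CYCLES = 2      # cycles for CAS, as stated in the text
BURST_CYCLES = 16   # cycles for the data burst of one cache line
LINE_BYTES = 128    # L2 cache line size
ELEM_BYTES = 4      # vector element size

def lines_touched(base_addr: int, stride_elems: int, length: int = 32) -> int:
    """Number of distinct cache lines holding the vector's elements."""
    lines = {
        (base_addr + i * stride_elems * ELEM_BYTES) // LINE_BYTES
        for i in range(length)
    }
    return len(lines)

def line_fill_cycles(base_addr: int, stride_elems: int, length: int = 32) -> int:
    """Cycles spent when every touched line is fetched with a full line fill."""
    per_line = RAS_CYCLES + CAS_CYCLES + BURST_CYCLES
    return lines_touched(base_addr, stride_elems, length) * per_line

for stride in (1, 2, 4, 8, 16, 19):
    print(f"stride {stride:2d}: {lines_touched(0, stride):2d} lines, "
          f"{line_fill_cycles(0, stride):3d} cycles")
```

With a line-aligned base address, a unit-stride command of 32 elements touches one line, while a stride-16 command touches 16 lines, which is why the line-fill system degrades quickly as stride grows.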
For unit-stride access patterns (dense vectors or cache-line fills), the PVA performs about the same as a cache-line interleaved system that performs only line fills. As shown in Figure 3, normalized execution time for the latter system is between 1% (for copy and scale) and 18% (for copy2, scale2, vaxpy, and swap) of the PVA's minimum execution time for our kernels. As stride increases, the relative performance of the cache-line interleaved system falls off rapidly: at stride four, normalized execution time rises to between 37% (for scale) and 48% (for vaxpy) of the PVA system's, and at stride 16, normalized execution time reaches 1112% (for tridiag). Figures 3(a), (b), and (c) demonstrate that performance shows similar trends for each benchmark kernel. Figures 3(d), (e), and (f) show performance trends for a given vector stride.

Figure 4 shows performance results for vectors with large strides that still hit all the memory banks. Performance for both our SDRAM PVA system and the SRAM PVA system for stride 19 is similar to the corresponding results for unit-stride access patterns. In contrast, the serial gathering SDRAM and the cache-line interleaved systems yield performance much more like that for stride 16.

Figure 5. Details of the vaxpy kernel performance on the PVA and a similar PVA SRAM system: (a) SRAM PVA, (b) SDRAM PVA. Bars of graph (a) are annotated with normalized execution time with respect to the leftmost bar, and those of (b) with respect to the corresponding bar from (a). Legend: diff mod #, same bank #, diff page #; same mod #, diff bank #, same page #; same mod #, same bank #, diff page #; same mod #, same bank #, same page #.

Some relative vector alignments are more advantageous than others, as evidenced by the variations in the SDRAM PVA performance in Figure 5(b). The SRAM version of the PVA system in Figure 5(a) shows similar trends for the various combinations of vector stride and relative alignments, although its performance is slightly more robust. For small strides that hit more than two SDRAM banks, the minimum and maximum execution times for the PVA system differ only by a few percent. For strides that hit one or two of the SDRAM components, though, relative alignment has a larger impact on overall execution time.

The results highlighted here are representative of those for all our experiments [12]. On dense data, the SDRAM PVA performs like an SDRAM system optimized for cache-line fills. In general, it performs much like an SRAM vector memory system at a fraction of the cost.

7. Discussion

In this paper, we have described the design of a Parallel Vector Access unit (PVA) for the Impulse smart memory controller. The PVA employs a novel parallel access algorithm that allows a collection of bank controllers to determine in tandem which parts of a vector command are located on their SDRAMs.
The BCs optimize low-level access to their SDRAMs to maximize the frequency of open-row hits and overlap accesses to independent banks as much as possible. As a result, on unit-stride accesses the Impulse memory controller is never more than 1% slower (and is up to 8% faster) than a memory system optimized for normal cache line fills. For vector-style accesses, the PVA delivers data up to 32.8 times faster than a conventional memory controller and up to 3.3 times faster than alternative vector access units, for a modest increase in hardware complexity. We are integrating the PVA into the full Impulse simulation environment, so that we can evaluate the performance improvements across whole applications.

Space limitations prevent us from fully addressing a number of important features of the PVA, including scalability, interoperability with virtual memory, and techniques for optimizing other kinds of scatter-gather operations.

Ultimately, the scalability of our memory system depends on the implementation choice of FirstHit(). For systems that use a PLA to compute the first-hit index, the complexity of the PLA grows with the square of the number of banks, which limits the effective size of such a design to around 16 banks. For systems with a small number of banks interleaved at block-size, replicating the FirstHit() logic in each BC is optimal. For very large memory systems, regardless of their interleave factor, it is best to implement a PLA to calculate the successive vector indices within a bank. The complexity of this PLA increases approximately linearly with the number of banks, the rest of the hardware remains unchanged, and the performance is constant, irrespective of the number of banks.

Another design issue is how to handle contiguous data spread across disjoint physical pages. If strided vectors span multiple pages, additional address translation logic is required in the BCs.
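To make the role of FirstHit() in the scalability discussion concrete, the following brute-force sketch shows the kind of value such logic would have to produce. It is an illustration under stated assumptions, not the hardware algorithm: it assumes word-interleaved banks where the bank number is the word address modulo the number of banks, and the names `first_hit` and `banks_hit` are hypothetical. Real bank controllers would use a PLA or a multiply-and-add circuit rather than a loop.

```python
from math import gcd
from typing import Optional

def first_hit(base: int, stride: int, bank: int,
              num_banks: int = 16, length: int = 32) -> Optional[int]:
    """Index of the first vector element that falls in `bank`, or None.

    Assumption: addresses are word addresses and bank = address mod num_banks.
    """
    for i in range(length):
        if (base + i * stride) % num_banks == bank:
            return i
    return None

def banks_hit(stride: int, num_banks: int = 16) -> int:
    """Number of distinct banks a sufficiently long strided vector touches."""
    return num_banks // gcd(stride, num_banks)

# A prime stride such as 19 spreads elements over all 16 banks,
# whereas stride 16 keeps every element in a single bank.
print(banks_hit(19), banks_hit(16))
print(first_hit(base=0, stride=19, bank=3))
```

Under this interleaving assumption, a stride that shares no factor with the number of banks reaches every bank, which is consistent with the observation that stride 19 behaves much like unit stride on the PVA.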
In the current evaluation, we assume the data being gathered into each dense cache line falls within a single page or superpage of physical memory. Working around the limitations of paged virtual memory is discussed in our technical report [12].

Finally, the PVA described here can be extended to handle vector-indirect scatter-gather operations by performing the gather in two phases: (i) loading the indirection vector into the BCs and then (ii) loading the vector elements. The first phase is simply a unit-stride vector load operation. After the indirection vector is loaded, its contents can be broadcast across the BC bus. Each BC determines which elements reside in its SDRAM by snooping this broadcast and performing a simple bit-mask operation on each address. Then each BC performs its part of the gather in parallel, and the results are coalesced from the staging units in much the same way as for strided accesses.
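The two-phase vector-indirect gather outlined above can be sketched in software. This is a hedged illustration only: it assumes 16 word-interleaved banks with 4-byte words, so that the bank number is a bit field of the byte address, and all names (`bank_of`, `snoop`, `gather`) are hypothetical. The loop over banks stands in for work that the bank controllers would do in parallel.

```python
NUM_BANKS = 16              # assumed number of banks (power of two)
WORD_SHIFT = 2              # assumed 4-byte words
BANK_MASK = NUM_BANKS - 1   # bit mask selecting the bank field

def bank_of(addr: int) -> int:
    """Bank holding a byte address, via shift and bit mask."""
    return (addr >> WORD_SHIFT) & BANK_MASK

def snoop(bank_id: int, broadcast: list) -> list:
    """Phase 2a: a BC keeps only the (position, address) pairs it owns."""
    return [(pos, addr) for pos, addr in enumerate(broadcast)
            if bank_of(addr) == bank_id]

def gather(memory: dict, indirection: list) -> list:
    """Phase 2b: each BC reads its elements; results are coalesced by position."""
    staged = [None] * len(indirection)
    for bank_id in range(NUM_BANKS):  # conceptually parallel across BCs
        for pos, addr in snoop(bank_id, indirection):
            staged[pos] = memory[addr]
    return staged

# Toy memory: the word at byte address a holds the value a * 10.
memory = {a: a * 10 for a in range(0, 4096, 4)}
indirection = [40, 4, 1028, 64]  # phase 1: indirection vector already loaded
print(gather(memory, indirection))
```

Because every address belongs to exactly one bank, the per-bank filters partition the indirection vector, and writing each result at its original position reproduces the order the processor expects.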

In summary, we have presented the design of a Parallel Vector Access unit that shows great promise for providing applications with poor locality with vector-machine-like memory performance. Although much work remains to be done, our experience to date indicates that such a system can significantly reduce the memory bottleneck for the kinds of applications that suffer on conventional memory systems. The next steps are to evaluate the PVA design on a suite of whole-program benchmarks and to address the issues raised above, particularly the interaction with virtual memory and supporting other scatter-gather operations.

8. Acknowledgments

Discussions with Mike Parker and Lambert Schaelicke on aspects of the PVA design and its evaluation proved invaluable. Ganesh Gopalakrishnan helped with the IKOS equipment and the Verilog model. Wilson Hsieh, Lixin Zhang, and the other members of the Impulse and Avalanche projects helped shape this work, and Gordon Kindlmann helped us with the figures.

References

[1] Advanced Micro Devices. Inside 3DNow!(tm) technology.
[2] M. Benitez and J. Davidson. Code generation for streaming: An access/execute mechanism. In Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, Apr.
[3] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a smarter memory controller. In Proceedings of the Fifth Annual Symposium on High Performance Computer Architecture, pages 70-79, Jan.
[4] T.-F. Chen. Data Prefetching for High Performance Processors. PhD thesis, Univ. of Washington, July.
[5] J. Corbal, R. Espasa, and M. Valero. Command vector memory systems: High performance at low cost. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, pages 68-77, Oct.
[6] J. Corbal, R. Espasa, and M. Valero. Command vector memory systems: High performance at low cost. Technical Report UPC-DAC, Universitat Politecnica de Catalunya, Jan.
[7] A. del Corral and J. Llaberia. Access order to avoid inter-vector conflicts in complex memory systems. In Proceedings of the Ninth International Parallel Processing Symposium.
[8] J. Dongarra, J. DuCroz, I. Duff, and S. Hammerling. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, Mar.
[9] W. Hsu and J. Smith. Performance of cached DRAM organizations in vector supercomputers. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May.
[10] Intel. MMX programmer's reference manual.
[11] K. Lee. The NAS86 Library User's Manual. NASA Ames Research Center, Mar.
[12] B. Mathew, S. McKee, J. Carter, and A. Davis. Parallel access ordering for SDRAM memories. Technical Report UUCS-99-6, University of Utah Department of Computer Science, June.
[13] S. McKee. Maximizing Memory Bandwidth for Streamed Computations. PhD thesis, School of Engineering and Applied Science, University of Virginia, May.
[14] S. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 10th ACM International Conference on Supercomputing, Philadelphia, PA, May.
[15] S. McKee and W. Wulf. Access ordering and memory-conscious cache utilization. In Proceedings of the First Annual Symposium on High Performance Computer Architecture, Jan.
[16] F. McMahon. The Livermore Fortran kernels: A computer test of the numerical performance range. Technical Report UCRL-53745, Lawrence Livermore National Laboratory, December.
[17] Micron Technology, Inc. 256Mb: SDRAM.
[18] MIPS Technologies, Inc. MIPS extension for digital media with 3D. teh brf.pdf.
[19] Motorola. AltiVec(tm) technology programming interface manual, Apr.
[20] S. Moyer. Access Ordering Algorithms and Effective Memory Bandwidth. PhD thesis, School of Engineering and Applied Science, University of Virginia, May.
[21] Sun. The VIS advantage: Benchmark results chart VIS performance. Whitepaper WPR-12.
[22] Sun. VIS instruction set user's manual.
[23] J. Tyler, J. Lent, A. Mather, and H. Nguyen. AltiVec: Bringing vector technology to the PowerPC processor family. In Proceedings of the 1999 IEEE International Performance, Computing, and Communications Conference, Feb.
[24] M. Valero, T. Lang, J. Llaberia, M. Peiron, E. Ayguade, and J. Navarro. Increasing the number of strides for conflict-free vector access. In Proceedings of the 19th Annual International Symposium on Computer Architecture, May.
[25] M. Valero, T. Lang, M. Peiron, and E. Ayguade. Conflict-free access for streams in multi-module memories. Technical Report UPC-DAC-93-11, Universitat Politecnica de Catalunya, Barcelona, Spain.
[26] M. Wolfe. Optimizing Supercompilers for Supercomputers. MIT Press, Cambridge, Massachusetts, 1989.


More information

Chapter 2: Introduction to Maple V

Chapter 2: Introduction to Maple V Chapter 2: Introdution to Maple V 2-1 Working with Maple Worksheets Try It! (p. 15) Start a Maple session with an empty worksheet. The name of the worksheet should be Untitled (1). Use one of the standard

More information

Plot-to-track correlation in A-SMGCS using the target images from a Surface Movement Radar

Plot-to-track correlation in A-SMGCS using the target images from a Surface Movement Radar Plot-to-trak orrelation in A-SMGCS using the target images from a Surfae Movement Radar G. Golino Radar & ehnology Division AMS, Italy ggolino@amsjv.it Abstrat he main topi of this paper is the formulation

More information

Performance Benchmarks for an Interactive Video-on-Demand System

Performance Benchmarks for an Interactive Video-on-Demand System Performane Benhmarks for an Interative Video-on-Demand System. Guo,P.G.Taylor,E.W.M.Wong,S.Chan,M.Zukerman andk.s.tang ARC Speial Researh Centre for Ultra-Broadband Information Networks (CUBIN) Department

More information

1. Introduction. 2. The Probable Stope Algorithm

1. Introduction. 2. The Probable Stope Algorithm 1. Introdution Optimization in underground mine design has reeived less attention than that in open pit mines. This is mostly due to the diversity o underground mining methods and omplexity o underground

More information

Calculation of typical running time of a branch-and-bound algorithm for the vertex-cover problem

Calculation of typical running time of a branch-and-bound algorithm for the vertex-cover problem Calulation of typial running time of a branh-and-bound algorithm for the vertex-over problem Joni Pajarinen, Joni.Pajarinen@iki.fi Otober 21, 2007 1 Introdution The vertex-over problem is one of a olletion

More information

Accommodations of QoS DiffServ Over IP and MPLS Networks

Accommodations of QoS DiffServ Over IP and MPLS Networks Aommodations of QoS DiffServ Over IP and MPLS Networks Abdullah AlWehaibi, Anjali Agarwal, Mihael Kadoh and Ahmed ElHakeem Department of Eletrial and Computer Department de Genie Eletrique Engineering

More information

Analysis of input and output configurations for use in four-valued CCD programmable logic arrays

Analysis of input and output configurations for use in four-valued CCD programmable logic arrays nalysis of input and output onfigurations for use in four-valued D programmable logi arrays J.T. utler H.G. Kerkhoff ndexing terms: Logi, iruit theory and design, harge-oupled devies bstrat: s in binary,

More information

Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems

Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems Arne Hamann, Razvan Rau, Rolf Ernst Institute of Computer and Communiation Network Engineering Tehnial University of Braunshweig,

More information

Reduced-Complexity Column-Layered Decoding and. Implementation for LDPC Codes

Reduced-Complexity Column-Layered Decoding and. Implementation for LDPC Codes Redued-Complexity Column-Layered Deoding and Implementation for LDPC Codes Zhiqiang Cui 1, Zhongfeng Wang 2, Senior Member, IEEE, and Xinmiao Zhang 3 1 Qualomm In., San Diego, CA 92121, USA 2 Broadom Corp.,

More information

Cluster-Based Cumulative Ensembles

Cluster-Based Cumulative Ensembles Cluster-Based Cumulative Ensembles Hanan G. Ayad and Mohamed S. Kamel Pattern Analysis and Mahine Intelligene Lab, Eletrial and Computer Engineering, University of Waterloo, Waterloo, Ontario N2L 3G1,

More information

SVC-DASH-M: Scalable Video Coding Dynamic Adaptive Streaming Over HTTP Using Multiple Connections

SVC-DASH-M: Scalable Video Coding Dynamic Adaptive Streaming Over HTTP Using Multiple Connections SVC-DASH-M: Salable Video Coding Dynami Adaptive Streaming Over HTTP Using Multiple Connetions Samar Ibrahim, Ahmed H. Zahran and Mahmoud H. Ismail Department of Eletronis and Eletrial Communiations, Faulty

More information

High Speed Area Efficient VLSI Architecture for DCT using Proposed CORDIC Algorithm

High Speed Area Efficient VLSI Architecture for DCT using Proposed CORDIC Algorithm International Journal of Innovative Researh in Siene, Engineering and Tehnology Website: www.ijirset.om High Speed Area Effiient VLSI Arhiteture for DCT using Proposed CORDIC Algorithm Deepnarayan Sinha

More information

Speeding up Consensus by Chasing Fast Decisions

Speeding up Consensus by Chasing Fast Decisions Speeding up Consensus by Chasing Fast Deisions Balaji Arun, Sebastiano Peluso, Roberto Palmieri, Giuliano Losa, Binoy Ravindran ECE, Virginia Teh, USA {balajia,peluso,robertop,giuliano.losa,binoy}@vt.edu

More information

Allocating Rotating Registers by Scheduling

Allocating Rotating Registers by Scheduling Alloating Rotating Registers by Sheduling Hongbo Rong Hyunhul Park Cheng Wang Youfeng Wu Programming Systems Lab Intel Labs {hongbo.rong,hyunhul.park,heng..wang,youfeng.wu}@intel.om ABSTRACT A rotating

More information

Gray Codes for Reflectable Languages

Gray Codes for Reflectable Languages Gray Codes for Refletable Languages Yue Li Joe Sawada Marh 8, 2008 Abstrat We lassify a type of language alled a refletable language. We then develop a generi algorithm that an be used to list all strings

More information

Improved Circuit-to-CNF Transformation for SAT-based ATPG

Improved Circuit-to-CNF Transformation for SAT-based ATPG Improved Ciruit-to-CNF Transformation for SAT-based ATPG Daniel Tille 1 René Krenz-Bååth 2 Juergen Shloeffel 2 Rolf Drehsler 1 1 Institute of Computer Siene, University of Bremen, 28359 Bremen, Germany

More information

Volume 3, Issue 9, September 2013 International Journal of Advanced Research in Computer Science and Software Engineering

Volume 3, Issue 9, September 2013 International Journal of Advanced Research in Computer Science and Software Engineering Volume 3, Issue 9, September 2013 ISSN: 2277 128X International Journal of Advaned Researh in Computer Siene and Software Engineering Researh Paper Available online at: www.ijarsse.om A New-Fangled Algorithm

More information

We don t need no generation - a practical approach to sliding window RLNC

We don t need no generation - a practical approach to sliding window RLNC We don t need no generation - a pratial approah to sliding window RLNC Simon Wunderlih, Frank Gabriel, Sreekrishna Pandi, Frank H.P. Fitzek Deutshe Telekom Chair of Communiation Networks, TU Dresden, Dresden,

More information

Alleviating DFT cost using testability driven HLS

Alleviating DFT cost using testability driven HLS Alleviating DFT ost using testability driven HLS M.L.Flottes, R.Pires, B.Rouzeyre Laboratoire d Informatique, de Robotique et de Miroéletronique de Montpellier, U.M. CNRS 5506 6 rue Ada, 34392 Montpellier

More information

COMBINATION OF INTERSECTION- AND SWEPT-BASED METHODS FOR SINGLE-MATERIAL REMAP

COMBINATION OF INTERSECTION- AND SWEPT-BASED METHODS FOR SINGLE-MATERIAL REMAP Combination of intersetion- and swept-based methods for single-material remap 11th World Congress on Computational Mehanis WCCM XI) 5th European Conferene on Computational Mehanis ECCM V) 6th European

More information

An Efficient and Scalable Approach to CNN Queries in a Road Network

An Efficient and Scalable Approach to CNN Queries in a Road Network An Effiient and Salable Approah to CNN Queries in a Road Network Hyung-Ju Cho Chin-Wan Chung Dept. of Eletrial Engineering & Computer Siene Korea Advaned Institute of Siene and Tehnology 373- Kusong-dong,

More information

Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis

Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis Design Impliations for Enterprise Storage Systems via Multi-Dimensional Trae Analysis Yanpei Chen, Kiran Srinivasan, Garth Goodson, Randy Katz University of California, Berkeley, NetApp In. {yhen2, randy}@ees.berkeley.edu,

More information

RAC 2 E: Novel Rendezvous Protocol for Asynchronous Cognitive Radios in Cooperative Environments

RAC 2 E: Novel Rendezvous Protocol for Asynchronous Cognitive Radios in Cooperative Environments 21st Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communiations 1 RAC 2 E: Novel Rendezvous Protool for Asynhronous Cognitive Radios in Cooperative Environments Valentina Pavlovska,

More information

Unsupervised Stereoscopic Video Object Segmentation Based on Active Contours and Retrainable Neural Networks

Unsupervised Stereoscopic Video Object Segmentation Based on Active Contours and Retrainable Neural Networks Unsupervised Stereosopi Video Objet Segmentation Based on Ative Contours and Retrainable Neural Networks KLIMIS NTALIANIS, ANASTASIOS DOULAMIS, and NIKOLAOS DOULAMIS National Tehnial University of Athens

More information

A Novel Bit Level Time Series Representation with Implication of Similarity Search and Clustering

A Novel Bit Level Time Series Representation with Implication of Similarity Search and Clustering A Novel Bit Level Time Series Representation with Impliation of Similarity Searh and lustering hotirat Ratanamahatana, Eamonn Keogh, Anthony J. Bagnall 2, and Stefano Lonardi Dept. of omputer Siene & Engineering,

More information

Facility Location: Distributed Approximation

Facility Location: Distributed Approximation Faility Loation: Distributed Approximation Thomas Mosibroda Roger Wattenhofer Distributed Computing Group PODC 2005 Where to plae ahes in the Internet? A distributed appliation that has to dynamially plae

More information

Parallelization and Performance of 3D Ultrasound Imaging Beamforming Algorithms on Modern Clusters

Parallelization and Performance of 3D Ultrasound Imaging Beamforming Algorithms on Modern Clusters Parallelization and Performane of 3D Ultrasound Imaging Beamforming Algorithms on Modern Clusters F. Zhang, A. Bilas, A. Dhanantwari, K.N. Plataniotis, R. Abiprojo, and S. Stergiopoulos Dept. of Eletrial

More information

A Dictionary based Efficient Text Compression Technique using Replacement Strategy

A Dictionary based Efficient Text Compression Technique using Replacement Strategy A based Effiient Text Compression Tehnique using Replaement Strategy Debashis Chakraborty Assistant Professor, Department of CSE, St. Thomas College of Engineering and Tehnology, Kolkata, 700023, India

More information

XML Data Streams. XML Stream Processing. XML Stream Processing. Yanlei Diao. University of Massachusetts Amherst

XML Data Streams. XML Stream Processing. XML Stream Processing. Yanlei Diao. University of Massachusetts Amherst XML Stream Proessing Yanlei Diao University of Massahusetts Amherst XML Data Streams XML is the wire format for data exhanged online. Purhase orders http://www.oasis-open.org/ommittees/t_home.php?wg_abbrev=ubl

More information

Accelerating Multiprocessor Simulation with a Memory Timestamp Record

Accelerating Multiprocessor Simulation with a Memory Timestamp Record Aelerating Multiproessor Simulation with a Memory Timestamp Reord Kenneth Barr Heidi Pan Mihael Zhang Krste Asanovi Marh, 5 Massahusetts Institute of Tehnology Intelligent sampling gives est speed-auray

More information

The Tofu Interconnect D

The Tofu Interconnect D 2018 IEEE International Conferene on Cluster Computing The Tofu Interonnet D Yuihiro Ajima, Takahiro Kawashima, Takayuki Okamoto, Naoyuki Shida, Kouihi Hirai, Toshiyuki Shimizu Next Generation Tehnial

More information

Post-K Supercomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA

Post-K Supercomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA Post-K Superomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA Toshiyuki Shimizu Nov. 15th, 2018 Post-K is under development, information in these slides is subjet to hange without notie 0 Agenda

More information

Performance Improvement in a Multi Cluster using a Modified Scheduling and Global Memory Management with a Novel Load Balancing Mechanism

Performance Improvement in a Multi Cluster using a Modified Scheduling and Global Memory Management with a Novel Load Balancing Mechanism Performane Improvement in a Multi Cluster using a Modified Sheduling and Global Memory Management with a Novel Load Balaning Mehanism P. Sammulal, PhD. Assistant Professor, Department of CSE, JNTUH College

More information

Sequential Incremental-Value Auctions

Sequential Incremental-Value Auctions Sequential Inremental-Value Autions Xiaoming Zheng and Sven Koenig Department of Computer Siene University of Southern California Los Angeles, CA 90089-0781 {xiaominz,skoenig}@us.edu Abstrat We study the

More information

Mining effective design solutions based on a model-driven approach

Mining effective design solutions based on a model-driven approach ata Mining VI 463 Mining effetive design solutions based on a model-driven approah T. Katsimpa 2, S. Sirmakessis 1,. Tsakalidis 1,2 & G. Tzimas 1,2 1 Researh ademi omputer Tehnology Institute, Hellas 2

More information

Staircase Join: Teach a Relational DBMS to Watch its (Axis) Steps

Staircase Join: Teach a Relational DBMS to Watch its (Axis) Steps Stairase Join: Teah a Relational DBMS to Wath its (Axis) Steps Torsten Grust Maurie van Keulen Jens Teubner University of Konstanz Department of Computer and Information Siene P.O. Box D 88, 78457 Konstanz,

More information

Detection of RF interference to GPS using day-to-day C/No differences

Detection of RF interference to GPS using day-to-day C/No differences 1 International Symposium on GPS/GSS Otober 6-8, 1. Detetion of RF interferene to GPS using day-to-day /o differenes Ryan J. R. Thompson 1#, Jinghui Wu #, Asghar Tabatabaei Balaei 3^, and Andrew G. Dempster

More information

Drawing lines. Naïve line drawing algorithm. drawpixel(x, round(y)); double dy = y1 - y0; double dx = x1 - x0; double m = dy / dx; double y = y0;

Drawing lines. Naïve line drawing algorithm. drawpixel(x, round(y)); double dy = y1 - y0; double dx = x1 - x0; double m = dy / dx; double y = y0; Naïve line drawing algorithm // Connet to grid points(x0,y0) and // (x1,y1) by a line. void drawline(int x0, int y0, int x1, int y1) { int x; double dy = y1 - y0; double dx = x1 - x0; double m = dy / dx;

More information

splitting tehniques that partition live ranges have been proposed to solve both the spilling problem[5][8] and the assignment problem[8][9]. The parti

splitting tehniques that partition live ranges have been proposed to solve both the spilling problem[5][8] and the assignment problem[8][9]. The parti Load/Store Range Analysis for Global Register Alloation Priyadarshan Kolte and Mary Jean Harrold Department of Computer Siene Clemson University Abstrat Live range splitting tehniques divide the live ranges

More information

the data. Structured Principal Component Analysis (SPCA)

the data. Structured Principal Component Analysis (SPCA) Strutured Prinipal Component Analysis Kristin M. Branson and Sameer Agarwal Department of Computer Siene and Engineering University of California, San Diego La Jolla, CA 9193-114 Abstrat Many tasks involving

More information

Establishing Secure Ethernet LANs Using Intelligent Switching Hubs in Internet Environments

Establishing Secure Ethernet LANs Using Intelligent Switching Hubs in Internet Environments Establishing Seure Ethernet LANs Using Intelligent Swithing Hubs in Internet Environments WOEIJIUNN TSAUR AND SHIJINN HORNG Department of Eletrial Engineering, National Taiwan University of Siene and Tehnology,

More information

Exploiting Enriched Contextual Information for Mobile App Classification

Exploiting Enriched Contextual Information for Mobile App Classification Exploiting Enrihed Contextual Information for Mobile App Classifiation Hengshu Zhu 1 Huanhuan Cao 2 Enhong Chen 1 Hui Xiong 3 Jilei Tian 2 1 University of Siene and Tehnology of China 2 Nokia Researh Center

More information

Trajectory Tracking Control for A Wheeled Mobile Robot Using Fuzzy Logic Controller

Trajectory Tracking Control for A Wheeled Mobile Robot Using Fuzzy Logic Controller Trajetory Traking Control for A Wheeled Mobile Robot Using Fuzzy Logi Controller K N FARESS 1 M T EL HAGRY 1 A A EL KOSY 2 1 Eletronis researh institute, Cairo, Egypt 2 Faulty of Engineering, Cairo University,

More information