Design of a Parallel Vector Access Unit for SDRAM Memory Systems

Binu K. Mathew, Sally A. McKee, John B. Carter, Al Davis
Department of Computer Science
University of Utah
Salt Lake City, UT

Abstract

We are attacking the memory bottleneck by building a "smart" memory controller that improves effective memory bandwidth, bus utilization, and cache efficiency by letting applications dictate how their data is accessed and cached. This paper describes a Parallel Vector Access unit (PVA), the vector memory subsystem that efficiently gathers sparse, strided data structures in parallel on a multibank SDRAM memory. We have validated our PVA design via gate-level simulation, and have evaluated its performance via functional simulation and formal analysis. On unit-stride vectors, PVA performance equals or exceeds that of an SDRAM system optimized for cache line fills. On vectors with larger strides, the PVA is up to 32.8 times faster. Our design is up to 3.3 times faster than a pipelined, serial SDRAM memory system that gathers sparse vector data, and the gathering mechanism is two to five times faster than in other PVAs with similar goals. Our PVA only slightly increases hardware complexity with respect to these other systems, and the scalable design is appropriate for a range of computing platforms, from vector supercomputers to commodity PCs.

This effort was sponsored in part by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL) under agreement number F and DARPA Order Numbers F393/-1 and F376/. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either express or implied, of DARPA, AFRL, or the US Government.

1. Introduction

Processor speeds are increasing much faster than memory speeds, and this disparity prevents many applications from making effective use of the tremendous computing power of modern microprocessors. In the Impulse project, we are attacking the memory bottleneck by designing and building a smart memory controller [3]. The Impulse memory system can significantly improve the performance of applications with predictable access patterns but poor spatial or temporal locality [3]. Impulse supports an optional extra address translation stage allowing applications to control how their data is accessed and cached. For instance, on a conventional memory system, traversing rows of a FORTRAN matrix wastes bus bandwidth: the cache line fills transfer unneeded data and evict other useful data. Impulse gathers sparse vector elements into dense cache lines, much like the scatter/gather operations supported by the load-store units of vector supercomputers.

Several new instruction set extensions (e.g., Intel's MMX for the Pentium [1], AMD's 3DNow! for the K6-2 [1], MIPS's MDMX [18], Sun's VIS for the UltraSPARC [22], and Motorola's AltiVec for the PowerPC [19]) bring stream and vector processing to the domain of desktop computing. Results for some applications that use these extensions are promising [21, 23], even though the extensions do little to address memory system performance. Impulse can boost the benefit of these vector extensions by optimizing the cache and bus utilization of sparse data accesses.

In this paper we describe a vector memory subsystem that implements both conventional cache line fills and vector-style scatter/gather operations efficiently. Our design incorporates three complementary optimizations for nonunit stride vector requests:

1. We improve memory locality via remapping. Rather than perform a series of regular cache line fills for nonunit stride vectors, our system gathers only the desired elements into dense cache lines.

2. We increase throughput with parallelism. To mitigate the relatively high latency of SDRAM, we operate multiple banks simultaneously, with components working on independent parts of a vector request. Encoding many individual requests in a compound vector command enables this parallelism and reduces communication within the memory controller.

3. We exploit SDRAM's non-uniform access characteristics. We minimize observed precharge latencies and row access delays by overlapping these with other memory activity, and by trying to issue vector references in an order that hits the current row buffers.

Figure 1 illustrates how the components of the Impulse memory controller interact. A small set of remapping controllers support three types of scatter/gather operations: base-stride, vector indirect, and matrix inversion. Applications configure the memory controller so that requests to certain physical address regions trigger scatter/gather operations (interface and programming details are presented elsewhere [3]). When the processor issues a load falling into such a region, its remapping controller sees the load and broadcasts the appropriate scatter/gather vector command to all bank controller (BC) units via the bank controller bus. In parallel, each BC determines which parts of the command it must perform locally and then gathers the corresponding vector elements into a staging unit. When all BCs have fetched their elements, they signal the remapping controller, which constructs the complete vector from the data in the staging units. In this way, a single cache line fill can load data from a set of sparse addresses, e.g., the row elements of a FORTRAN array.

Figure 1. Memory subsystem overview. The configurable remapping controllers broadcast vector commands to all bank controllers. The controllers gather data elements from the SDRAMs into staging units, from which the vector is transferred to the CPU chip.

We have validated the PVA design via gate-level synthesis and simulation, and have evaluated its performance via functional simulation and formal analysis. For the kernels we study, our PVA-based memory system fills normal (unit stride) cache lines as fast as and up to 8% faster than a conventional, cache-line interleaved memory system optimized for line fills. For larger strides, it loads elements up to 32.8 times faster than the conventional memory system. Our system is up to 3.3 times faster than a pipelined, centralized memory access unit that gathers sparse vector data by issuing (up to) one SDRAM access per cycle. Compared to other parallel vector access units with similar goals [5], gather operations are two to five times faster. This improved parallel access algorithm only modestly increases hardware complexity. By localizing all architectural changes within the memory controller, we require no modifications to the processor, system bus, or on-chip memory hierarchy. This scalable solution is applicable to a range of computing platforms, from vector computers with DRAM memories to commodity personal computers.

2. Related Work

We limit our discussion to work that addresses loading vectors from DRAM. Moyer defines access scheduling and access ordering to be techniques that reduce load/store interlock delays by overlapping computation with memory latency, and that change the order of memory requests to increase performance, respectively [2]. Access scheduling attempts to separate the execution of a load/store instruction from that of the instruction that produces/consumes its operand, thereby reducing the processor's observed memory delays. Moyer applies both concepts to compiler algorithms that optimize inner loops, unrolling and grouping stream accesses to amortize the cost of each DRAM page miss over several references to the open page.

Lee mimics Cray instructions on the Intel i860XR using another software approach, treating the cache as a pseudo "vector register" by reading vector elements in blocks (using non-caching loads) and then writing them to a preallocated portion of cache [11]. Loading a single vector via Moyer's and Lee's schemes on an iPSC/860 node improves performance by 4-45%, depending on the stride [15].

Valero et al. dynamically avoid bank conflicts in vector processors by accessing vector elements out of order.
They analyze this system first for single vectors [24], and then extend the design to multiple vectors [25]. del Corral and Llaberia analyze a related hardware scheme for avoiding bank conflicts among multiple vectors in complex memories [7]. These schemes focus on vector computers with (uniform access time) SRAM memory components.

The PVA component presented herein is similar to Corbal et al.'s Command Vector Memory System [6] (CVMS), which exploits parallelism and locality of reference to improve effective bandwidth for out-of-order vector processors with dual-banked SDRAM memories. Instead of sending individual requests to individual devices, the CVMS broadcasts commands requesting multiple independent words, a design idea we adopt. Section controllers receive the broadcasts, compute subcommands for the portion of the data for which they are responsible, and then issue the addresses to SDRAMs under their control. The memory subsystem orders requests to each dual-banked device, attempting to overlap precharges to each internal bank with accesses to the other. Simulation results demonstrate performance improvements of 15-54% over a serial controller. Our bank controllers behaviorally resemble CVMS section controllers, but our hardware design and parallel access algorithm (see Section 4.3) differ substantially.

The Stream Memory Controller (SMC) of McKee et al. [14] combines programmable stream buffers and prefetching in a memory controller with intelligent DRAM scheduling. Vector data bypass the cache in this system, but the underlying access-ordering concepts can be adapted to systems that cache vectors. The SMC dynamically reorders stream/vector accesses and issues them serially to exploit: a) parallelism across dual banks of fast-page mode DRAM, and b) locality of reference within DRAM page buffers. For most alignments and strides on uniprocessor systems, simple ordering schemes perform competitively with sophisticated ones [13].

Stream detection is an important design issue for these systems. At one end of the spectrum, the application programmer may be required to identify vectors, as is currently the case in Impulse. Alternatively, the compiler can identify vector accesses and specify them to the memory controller, an approach we are pursuing. For instance, Benitez and Davidson present simple and efficient compiler algorithms (whose complexity is similar to strength reduction) to detect and optimize streams [2]. Vectorizing compilers can also provide the needed vector parameters, and can perform extensive loop restructuring and optimization to maximize vector performance [26]. At the other end of the spectrum lie hardware vector or stream detection schemes, as in reference prediction tables [4]. Any of these suffices to provide the information Impulse needs to generate vector accesses.

3. Mathematical Foundations

The Impulse remapping controller gathers strided data structures by broadcasting vector commands to a set of bank controllers (BCs), each of which determines independently and in tandem with the others which elements of the vector reside in the SDRAM it manages. This broadcast approach is potentially much more efficient than the straightforward alternative of having a centralized vector controller issue the stream of element addresses, one per cycle. Realizing this performance potential requires a method whereby each bank controller can determine the addresses of the elements that reside on its SDRAM without sequentially expanding the entire vector. The primary advantage of our PVA mechanism over similar designs is the efficiency of our hardware algorithms for computing each bank's subvector.

We first introduce the terminology used in describing these algorithms. Base-stride vector operations are represented by a tuple, $V = \langle B, S, L \rangle$, where $V.B$ is the base address, $V.S$ the sequence stride, and $V.L$ the sequence length. We refer to $V$'s $i$th element as $V[i]$. For example, $V = \langle A, S, L \rangle$ designates vector elements $A[0], A[S], \ldots, A[(L-1) \cdot S]$. The number of banks, $M$, is a power of two. The PVA algorithm is based on two functions:

1. FirstHit($V$, $b$) takes a vector $V$ and a bank $b$ and returns either the index of the first element of $V$ that hits in $b$ or a value indicating that no such element exists.

2. NextHit($S$) returns an increment $\delta$ such that if a bank holds $V[n]$, it also holds $V[n + \delta]$.

Space considerations only permit a simplified explanation here. Our technical report contains complete mathematical details [12]. FirstHit() and first address calculation together can be evaluated in two cycles for power of two strides and at most five cycles for other strides. Their design is scalable and can be implemented in a variety of ways. This paper henceforth assumes that they are implemented as a programmable logic array (PLA). NextHit() is trivial to implement, and takes only a few gate delays to evaluate.
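As a rough software illustration of the two functions just introduced, the sketch below computes each bank's subvector by brute force. It is not the PLA implementation the paper assumes: the function names, the `NO_HIT` sentinel, and the word-interleaved mapping (bank = word address mod number of banks) are assumptions made for this example, and the NextHit increment uses the modular-arithmetic identity that an index step of M / gcd(S, M) returns to the same bank.

```python
from math import gcd

NO_HIT = None  # hypothetical sentinel: no element of the vector lives in this bank


def bank_of(word_addr, num_banks):
    """Assumed word-interleaved mapping: consecutive words go to consecutive banks."""
    return word_addr % num_banks


def first_hit(base, stride, length, bank, num_banks):
    """Index of the first element of V = <base, stride, length> stored in `bank`.

    The bank sequence repeats with period num_banks / gcd(stride, num_banks),
    so scanning at most num_banks indices is enough to find a hit if one exists.
    """
    for i in range(min(length, num_banks)):
        if bank_of(base + i * stride, num_banks) == bank:
            return i
    return NO_HIT


def next_hit(stride, num_banks):
    """Index increment between consecutive elements that fall in the same bank."""
    return num_banks // gcd(stride, num_banks)


def bank_subvector(base, stride, length, bank, num_banks=16):
    """Word addresses of the elements one bank controller would fetch."""
    addrs = []
    i = first_hit(base, stride, length, bank, num_banks)
    if i is NO_HIT:
        return addrs
    step = next_hit(stride, num_banks)
    while i < length:
        addrs.append(base + i * stride)
        i += step
    return addrs
```

Each bank controller would run `bank_subvector` with its own bank number; the union of the per-bank results covers every element of the vector exactly once, without any controller expanding the whole vector.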
Given its bank number $b$ and the parameters of the vector command, each bank controller uses these functions to independently determine the subvector elements for which it is responsible. The BC for bank $b$ performs the following operations (concurrently, where possible):

1. Calculate i = FirstHit(V, b); if NoHit, continue.

2. While i < V.L do
     access memory location V.B + i × V.S
     i += NextHit(V.S)

4. Vector Access Unit

The design space for a PVA mechanism is enormous: the type of DRAM, number of banks, interleave factor, and implementation strategy for FirstHit() can be varied to trade hardware complexity for performance. For instance, lower-cost solutions might let a set of banks share bank controllers and BC buses, multiplexing the use of these resources. To demonstrate the feasibility of our approach and to derive timing and hardware complexity estimates, we developed and synthesized a Verilog model of one design point in this large space. The implementation uses 16 banks of word-interleaved SDRAM (32-bit wide). Each has a dedicated bank controller that drives 256 Mbit 16-bit wide Micron SDRAM parts, each of which contains four internal banks [17]. The current PVA design assumes an L2 cache line of 128 bytes, and therefore operates on vector commands of 32 single-word elements. We first describe the implementation of the bank controller bus and the BCs, and then show how the controllers work in tandem.

4.1 Bank Controller Bus

As illustrated in Figure 1, the bank controllers communicate with the rest of the memory controller via a shared, split-transaction bus (BC bus) that multiplexes requests and data. During a vector request cycle, each bus supports a 32-bit address, 32-bit stride, three-bit transaction ID, two-bit command, and some control information. During a data cycle, each supports 64 data bits. The current PVA design targets a MIPS R10000 processor with a 64-bit system bus, on which the PVA unit can send or receive one data word per cycle. No intermediate unit is needed to merge data collected by multiple BCs: when read data is returned to the processor, the BCs take turns driving their part of the cache line onto the system bus. Electrical limitations require a turn-around cycle whenever bus ownership changes, but to avoid these delays, we use a 128-bit BC bus and drive alternate 64-bit halves every other data cycle. In addition to the 128 multiplexed lines, the BC bus includes eight shared transaction-complete indication lines.

4.2 Bank Controllers

For a given vector read or write command, each Bank Controller (BC) is responsible for identifying and accessing the (possibly null) subvector that resides in its bank. Shown in Figure 2, the architecture of this component consists of:

1. a FirstHit Predictor to determine whether elements of a given vector request hit this bank. If there is a hit and the stride is a power of two, this subcomponent performs the FirstHit address calculation;

2. a Request FIFO to queue vector requests for service;

3. a Register File to provide storage for the Request FIFO;

4. a FirstHit Calculate module to determine the address of the first element hitting this bank when the stride is not a power of two;

5. an Access Scheduler to drive the SDRAM, reordering read, write, bank activate and precharge operations to maximize performance;

6. a set of Vector Contexts within the Access Scheduler to represent the current vector requests;

7. a Scheduling Policy Module within each Vector Context to dictate the scheduling policy; and

8. a Staging Unit that consists of (i) a Read Staging Unit to store read-data waiting to be assembled into a cache line, and (ii) a Write Staging Unit to store write-data waiting to be sent to the SDRAMs.

We briefly describe each of these subcomponents. Essential to efficient operation are several bypass paths that reduce communication latency within the BC. Our technical report fleshes out details of these modules and their interactions [12]. The main modules of the BC manage the computations required for parallel vector access, the efficient scheduling of SDRAM, and the data staging.

4.2.1 Parallelizing Logic

The parallelizing logic consists of the FirstHit Predict (FHP) module, the Request FIFO (RQF), the Register File (RF), and the FirstHit Calculate (FHC) modules. The FHP module watches vector requests on the BC bus and determines whether or not any element of a request will hit the bank. The FHP calculates the FirstHit index, the index of the first vector element in the bank. For power-of-two strides that hit, the FHP also calculates the FirstHit address, the bank address of the first element. The FHP then signals the RQF to queue: the request's $\langle B, S, L \rangle$ tuple; the FirstHit index; the calculated bank address, if ready; and an address calculation complete (ACC) flag indicating the status of the bank address field.

The RF subcomponent contains as many entries as the number of outstanding transactions permitted by the BC bus, which is eight in this implementation. The RQF module implements the state machine and tail pointer to maintain the RF as a queue, storing vector requests in the RF entries until those requests are assigned to vector contexts. Queued requests with a cleared ACC flag require further processing: the FHC module computes the FirstHit address for these requests, whose stride is not a power of two.
The FHC scans the requests between the queue head pointer, which it maintains, and the tail pointer, multiplying the stride by the FirstHit index calculated by the FHP, and then adding that to the base address to generate the FirstHit address. The FHC then writes this address into the register file and sets the entry's ACC flag. Since this calculation requires a multiply and add, it incurs a two-cycle delay, but the FHC works in parallel with the Access Scheduler (SCHED), so when the latter module is busy, this delay is completely hidden. When the SCHED sees the ACC bit set for the entry at the head of the RQF, it knows that there is a vector request ready for issue.

4.2.2 Access Scheduler

The SCHED and its subcomponents, the Vector Contexts (VCs) and Scheduling Policy Unit (SPU) modules, are responsible for: (i) expanding the series of addresses in a vector request, (ii) ordering the stream of reads, writes, bank activates, and precharges so that multiple vector requests can be issued optimally, (iii) making row activate/precharge decisions, and (iv) driving the SDRAM. The SCHED module decides when to keep an SDRAM row open, and the SPUs within the SCHED's VCs reorder the accesses.

Figure 2. Bank controller internal organization. (Blocks shown: Request FIFO, FirstHit Predict, Register File, FirstHit Calculate, Access Scheduler with Vector Contexts 0 through n, Staging Unit, SDRAM Interface, and the bank controller bus.)

The current design contains four VCs, each of which can hold a vector request whose accesses are ready to be issued to the SDRAM. The VC performs a series of shifts and adds to generate the sequence of addresses required to fetch a particular vector. These efficient calculations constitute the crux of our PVA approach, and our technical report explains their details. The VCs share the SCHED datapath, and they cooperate to issue the highest priority pending SDRAM operation required by any VC. The VCs arbitrate for this datapath such that at most one can access it in any cycle, and the oldest pending operation has highest priority. Vector operations are injected into VC 0, and whenever one completes (at most one finishes per cycle), other pending operations "shift right" into the next free VC. To give the oldest pending operation priority, we daisy-chain the SCHED requests from VC N to VC 0 such that a lower numbered VC can place a request on the shared datapath if and only if no higher numbered VC wishes to do so.
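The daisy-chained priority described above can be sketched in a few lines. This is a behavioral model only, under the assumption that a VC's request is a single boolean per cycle; the names `grant` and `shift_right` are hypothetical, and `shift_right` packs all pending operations in one step, whereas the hardware may move them one position at a time.

```python
def grant(requests):
    """Pick the VC that may use the shared datapath this cycle.

    requests[k] is True when VC k wants the datapath. Higher-numbered VCs
    hold older operations, so the highest-numbered requester wins.
    Returns the granted VC index, or None if no VC is requesting.
    """
    for k in range(len(requests) - 1, -1, -1):
        if requests[k]:
            return k
    return None


def shift_right(contexts):
    """Pack pending operations toward the high-numbered VCs, keeping age order.

    contexts[k] is the operation held by VC k, or None if VC k is free.
    Free slots end up at the low-numbered end, where new operations are injected.
    """
    pending = [op for op in contexts if op is not None]
    return [None] * (len(contexts) - len(pending)) + pending
```

With four VCs, `grant([True, False, True, False])` selects VC 2, because no higher-numbered VC is requesting.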
The VCs attempt to minimize precharge overhead by giving accesses that hit in an open internal bank priority over requests to a different internal bank on the same SDRAM module. Three lines per internal bank (bank hit predict, bank more hit predict, and bank close predict) coordinate this operation. The SCHED broadcasts to the VCs the current row addresses of the open internal banks. When a VC determines that it has a pending request that would hit an open row, it drives the internal bank's shared line to tell the SCHED not to close the row; in other words, we implement a wired OR operation. Likewise, VCs that have a pending request that misses in the internal bank use the bank close predict line to tell the SCHED to close the row.

The SPUs within each of the VCs decide together which VC can issue an operation during the current cycle. This decision is based on their collective state as observed on the bank hit predict, bank more hit predict, and bank close predict lines. Separate SPUs are used to isolate the scheduling heuristics within the subcomponents, permitting experimentation with the scheduling policy without changing the rest of the BC.

The scheduling algorithm strives to improve performance by maximizing row hits and hiding latencies; it does this by operating other internal banks while a given internal bank is being opened or precharged. We implement a policy that promotes row opens and precharges above read and write operations, as long as the former do not conflict with the open rows in use by another VC. This heuristic opens rows as early as possible. When conflicts or open/precharge latencies prevent higher numbered VCs from issuing a read or write, a lower priority VC may issue its reads or writes. The policy ensures that when an older request completes, a new request will be ready, even if the new one uses a different internal bank. Details of the scheduling algorithm are given in our technical report [12].
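The wired-OR coordination described in this section can be approximated in software as follows. This is only a sketch under stated assumptions: it models the hit and close indications for internal banks that already have an open row, it assumes a hit indication overrides a close indication, and the `Pending` record and the `row_decisions` function are hypothetical names, not part of the design.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    """Next access a vector context wants to issue (hypothetical record)."""
    internal_bank: int
    row: int


def row_decisions(open_rows, pending):
    """Decide, per internal bank with an open row, whether to keep or close it.

    open_rows maps internal bank -> currently open row address.
    pending holds one entry per VC: a Pending access, or None if the VC is idle.
    Any VC that would hit the open row asserts the shared hit line (wired OR),
    which keeps the row open; otherwise any VC that misses requests a close.
    """
    decisions = {}
    for bank, row in open_rows.items():
        wants = [p for p in pending if p is not None and p.internal_bank == bank]
        if any(p.row == row for p in wants):
            decisions[bank] = "keep"
        elif wants:
            decisions[bank] = "close"
        else:
            decisions[bank] = "idle"
    return decisions
```

For example, with rows 7, 3 and 9 open in internal banks 0, 1 and 2, pending accesses to (bank 0, row 7), (bank 0, row 8) and (bank 1, row 4) keep bank 0 open, close bank 1, and leave bank 2 untouched.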

4.2.3 Staging Units

The Staging Units (SUs) store data returned by the SDRAMs for a VC-generated read operation or provided by the memory controller for a write. In the case of a gathered vector read operation, the SUs on the participating BCs cooperate to merge vector elements into a cache line to be sent to the memory controller front end, as described in Section 4.1. In the case of a scattered vector write operation, the SUs at each participating BC buffer the write data sent by the front end. The SUs drive a transaction complete line on the BC bus to signal the completion of a pending vector operation. This line acts as a wired OR that deasserts whenever all BCs have finished a particular gathered vector read or scattered vector write operation. When the line goes low during a read, the memory controller issues a STAGE READ command on the vector bus, indicating which pending vector read operation's data is to be read. When the line goes low during a write, the memory controller knows that the corresponding data has been committed to SDRAM.

4.2.4 Data Hazards

Reordering reads and writes may violate consistency semantics. To maintain acceptable consistency semantics and to avoid turnaround, the following restriction is required: a VC may issue a read/write only if the bus has the same polarity and no polarity reversals have occurred in any preceding (older) VC. The gist of this rule is that elements of different vectors may be issued out-of-order as long as they are not separated by a request of the opposite polarity. This policy gives rise to two important consistency semantics. First, RAW hazards cannot happen. Second, WAW hazards may happen if two vector write requests not separated by a read happen to write different data to the same location. We assume that the latter event is unlikely to occur in a uniprocessor machine. If the L2 cache has a write-back and write-allocate policy, then any consecutive writes to the same location will be separated by a read.
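One possible reading of the issue restriction above is sketched below. It assumes the vector contexts are listed oldest first and that each holds a request of a single polarity ('R' or 'W'); the function name `may_issue` is hypothetical, and the check is a conservative interpretation of the rule rather than the hardware logic.

```python
def may_issue(polarities, k, bus_polarity):
    """Return True if the VC at position k may issue its read or write.

    polarities lists the polarity ('R' or 'W') of each pending vector request,
    oldest first. The request at position k may issue only if it matches the
    current bus polarity and every older request has that same polarity, so
    no request of the opposite polarity is bypassed.
    """
    mine = polarities[k]
    return mine == bus_polarity and all(p == mine for p in polarities[:k])
```

With pending requests `['R', 'R', 'W', 'R']` and the bus in read polarity, the two oldest reads may issue in either order, the write must wait for the bus to turn around, and the youngest read may not bypass the write.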
If stricter consistency semantics are required, a compiler can be made to issue a dummy read to separate two such writes.

4.3 Timing Considerations

SDRAMs define timing restrictions on the sequence of operations that can legally be performed. To maintain these restrictions, we use a set of small counters called restimers, each of which enforces one timing parameter by asserting a "resource available" line when the corresponding operation is permitted. The control logic of the VC window works like a scoreboard and ensures that all timing restrictions are met by letting a VC issue an operation only when all the resources it needs, including the restimers and the datapath, can be acquired. Electrical considerations require a one-cycle bus turnaround delay whenever the bus polarity is reversed, i.e., when a read is immediately followed by a write or vice-versa. The SCHED units attempt to minimize turnaround by reordering accesses.

Table 1. Inner loops used to evaluate our PVA unit design.

Kernel    Access Pattern
copy      for (i=0; i<L*S; i+=S) y[i] = x[i];
saxpy     for (i=0; i<L*S; i+=S) y[i] += a*x[i];
scale     for (i=0; i<L*S; i+=S) x[i] = a*x[i];
swap      for (i=0; i<L*S; i+=S) { reg = x[i]; x[i] = y[i]; y[i] = reg; }
tridiag   for (i=0; i<L*S; i+=S) x[i] = z[i]*(y[i] - x[i-1]);
vaxpy     for (i=0; i<L*S; i+=S) y[i] += a[i]*x[i];

5. Experimental Methodology

This section describes the details and rationale of how we evaluate the PVA design. The initial prototype uses a word-interleaved organization, since block-interleaving complicates address arithmetic and increases the hardware complexity of the memory controller. Our design can be extended for block-interleaved memories, but we have yet to perform price/performance analyses of this design space. Note that Hsu and Smith study interleaving schemes for fast-page mode DRAM memories in vector machines [9], finding cache-line interleaving and block interleaving superior to low-order interleaving for many vector applications.
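To make the interleaving terminology concrete, the sketch below maps word addresses to banks under word (low-order) interleaving and under cache-line interleaving, using the 16 banks and 32-word lines stated in the text. The two mapping functions are the usual textbook definitions, assumed here for illustration, and all names are hypothetical.

```python
NUM_BANKS = 16        # banks in the modeled system (from the text)
WORDS_PER_LINE = 32   # 128-byte L2 line of 4-byte words (from the text)


def word_interleaved_bank(word_addr):
    """Low-order interleaving: consecutive words fall in consecutive banks."""
    return word_addr % NUM_BANKS


def line_interleaved_bank(word_addr):
    """Cache-line interleaving: consecutive lines fall in consecutive banks."""
    return (word_addr // WORDS_PER_LINE) % NUM_BANKS


def banks_touched(base, stride, length, mapping):
    """Sorted list of banks holding at least one element of the strided vector."""
    return sorted({mapping(base + i * stride) for i in range(length)})
```

Under these assumptions, a 32-element unit-stride command spreads over all 16 banks when word-interleaved but sits in a single bank when line-interleaved, while a stride of 16 reverses the picture and a prime stride such as 19 again reaches every word-interleaved bank.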
The systems Hsu and Smith examine perform no dynamic access ordering to increase locality, though, and their results thus favor organizations that increase spatial locality within the DRAM page buffers. It remains to be seen whether low-order interleaving becomes more attractive in conjunction with access ordering and scheduling techniques, but our initial results are encouraging.

Table 1 lists the kernels used to generate the results presented here. copy, saxpy and scale are from the BLAS (Basic Linear Algebra Subprograms) benchmark suite [8], and tridiag is a tridiagonal Gaussian elimination fragment, the fifth Livermore Loop [16]. vaxpy denotes a "vector axpy" operation that occurs in matrix-vector multiplication by diagonals. We choose loop kernels over whole-program benchmarks for this initial study because: (i) our PVA scheduler only speeds up vector accesses, (ii) kernels allow us to examine the performance of our PVA mechanism over a larger experimental design space, and (iii) kernels are small enough to permit the detailed, gate-level simulations required to validate the design and to derive timing estimates. Performance on larger, real-world benchmarks, via functional simulation of the whole Impulse system or performance analysis of the hardware prototype we are building, will be necessary to demonstrate the final proof of concept for the design presented here, but these results are not yet available.

Recall that the bus model we target allows only eight outstanding transactions. This limit prevents us from unrolling most of our loops to group multiple commands to a given vector, but we examine performance for this optimization on the two kernels that access only two vectors, copy and scale.

In our experiments, we vary both the vector stride and the relative vector alignments (placement of the base addresses within memory banks, within internal banks for a given SDRAM, and within rows or pages for a given internal bank). All vectors are 1024 elements (32 cache lines) long, and the strides are equal throughout a given loop. In all, we have evaluated PVA performance for 240 data points (eight access patterns × six strides × five relative vector alignments) for each of four different memory system models. We present highlights of these results in the following section; details may be found in our technical report [12].

Table 2. Complexity of the synthesized bank controller.

Type               Count
AND                -
D FLIP-FLOP        139
D Latch            32
INV                1627
MUX2               183
NAND               -
NOR2               843
OR2                194
XOR2               5
PULLDOWN           13
TRISTATE BUFFER    1849
On-chip RAM        2K bytes

6. Results

This section presents timing and complexity results from synthesizing the PVA and comparative performance results for our suite of benchmark kernels.

6.1 Synthesis Results

Our end goal is to fabricate a CMOS ASIC of the Impulse memory controller, but we are first validating pieces of the larger design using FPGA (field programmable gate array) technology. We produce an FPGA implementation on an IKOS Hermes emulator with 64 Xilinx FPGAs, and then use this implementation to derive timing estimates. The PVA's Verilog description consists of 36 lines of code.
The types and numbers of components in the synthesized bank controller are given in Table 2. We expect the custom CMOS implementation to be much more efficient than the FPGA implementation. We used the synthesized design to measure delay through the critical path, the multiply-and-add circuit required to calculate FirstHit() for non-power-of-two strides. Our multiply-and-add unit has a delay of 29.5ns. We expect that an optimized CMOS implementation will have a delay less than 20ns, making it possible to complete this operation in two cycles at 100MHz. Other paths are fast enough to operate at 100MHz even in the FPGA implementation. The FHP unit has a delay of 8.3ns and SCHED has a delay of 9.3ns. CMOS timing considerations are very different from those for FPGAs, and thus the optimization strategies differ significantly. These FPGA delays represent an upper bound; the custom CMOS version will be much faster.

6.2 Performance Results

We compare the performance of the PVA functional model to three other memory systems. Figure 3(a)-(c) show the comparative performance for our four memory models on strides 1, 2, 4, 8, 16, and 19 for the copy, swap, and vaxpy kernels, and Figure 3(d)-(f) show comparative performance across all benchmarks for strides 1, 4, and 16. The annotations above each bar indicate execution time normalized to the minimum PVA SDRAM cycle time for each access pattern. Bars that would be off the y scale are drawn at the maximum y value and annotated with the actual number of cycles spent. The sets of bars labeled copy2 and scale2 represent unrolled kernels in which read and write vector commands are grouped (so the PVA sees two consecutive vector commands for the first vector, then two for the second, and so on). This optimization only improves performance for the PVA SDRAM systems, yielding a slight advantage over the unoptimized versions of the same benchmark. If more outstanding transactions were allowed on the processor bus, greater unrolling would deliver larger improvements.
The bars labeled cache line interleaved serial SDRAM model the back end of an idealized, 16-module SDRAM system optimized for cache line fills. The memory bus is 64 bits, and L2 cache lines are 128 bytes. The SDRAMs modeled require two cycles for each of RAS and CAS, and are capable of 16-cycle bursts. We optimistically assume that precharge latencies can be overlapped with activity on other SDRAMs (and we ignore the fact that writing lines takes slightly less time than reading), so each cache line fill takes 20 cycles (two for RAS, two for CAS, and 16 for the data burst). The number of cache lines accessed depends on the length and stride of the vectors; this system makes no attempt to gather sparse data within the memory controller.

The bars labeled gathering pipelined serial SDRAM model the back end of a 16-module, word-interleaved SDRAM system with a closed-page policy and pipelined precharge. As before, the memory bus is 64 bits, and vector commands access 32 elements (128 bytes, since the present system uses 4-byte elements). Instead of performing cache line fills, this system accesses each vector element individually. Although accesses are issued serially, we assume that the memory controller can overlap RAS latencies with activity on other banks for all but the first element accessed by each command. We optimistically assume that vector commands never cross DRAM pages, and thus DRAM pages are left open during the processing of each command. Precharge costs are incurred at the beginning of each vector command. This system requires more cycles to access unit-stride vectors than the cache line interleaved system we model, but because it accesses only the desired vector elements, its relative performance increases dramatically as vector stride goes up.

[Figure 3. Comparative performance for four memory back ends. Panels: (a) copy, (b) swap, (c) vaxpy, each over strides 1, 2, 4, 8, 16, and 19; (d), (e), (f) all benchmarks (copy, copy2, saxpy, scale, scale2, swap, tridiag, vaxpy) at a fixed stride. Legend: min parallel vector SDRAM, max parallel vector SDRAM, min parallel vector SRAM, max parallel vector SRAM, gathering pipelined serial SDRAM, cache line interleaved serial SDRAM.]

[Figure 4. Comparative performance for a prime stride (19).]

The bars labeled min parallel vector access SRAM and max parallel vector access SRAM model the performance of an idealized SRAM vector memory system with the same parallel access scheme but with no precharge or RAS latencies. Comparing PVA SDRAM and PVA SRAM system performance gives a measure of how well our system hides the extra latencies associated with dynamic RAM.
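The cost model for the cache line interleaved baseline described above can be approximated with a short calculation. The following is a minimal sketch, not the simulator used for the reported results: the function names, the zero base address, and the assumption that every touched line costs one full 20-cycle fill (two for RAS, two for CAS, 16 for the burst) are illustrative.

```python
def cache_lines_touched(length, stride, elem_bytes=4, line_bytes=128, base=0):
    """Count the distinct cache lines touched by a strided vector.

    `stride` is measured in elements; `base` is a hypothetical byte
    address of the first element (assumed line-aligned by default).
    """
    lines = {(base + i * stride * elem_bytes) // line_bytes
             for i in range(length)}
    return len(lines)


def line_fill_cycles(length, stride, cycles_per_fill=20):
    """Cycles spent by a system that can only perform whole line fills.

    Assumes each touched line costs one full fill and that precharge
    is completely hidden, as in the idealized model in the text.
    """
    return cache_lines_touched(length, stride) * cycles_per_fill


if __name__ == "__main__":
    # 32-element vector commands, as in the evaluated system.
    for s in (1, 2, 4, 8, 16, 19):
        print(s, cache_lines_touched(32, s), line_fill_cycles(32, s))
```

Under these assumptions a 32-element unit-stride command touches a single line, while stride 16 touches 16 lines, which illustrates why a line-fill-only back end degrades as stride grows.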
For unit-stride access patterns (dense vectors or cache line fills), the PVA performs about the same as a cache-line interleaved system that performs only line fills. As shown in Figure 3, normalized execution time for the latter system is between 1% (for copy and scale) and 18% (for copy2, scale2, vaxpy, and swap) of the PVA's minimum execution time for our kernels. As stride increases, the relative performance of the cache-line interleaved system falls off rapidly: at stride four, normalized execution time rises to between 37% (for scale) and 48% (for vaxpy) of the PVA system's, and at stride 16, normalized execution time reaches 1112% (for tridiag). Figure 3(a), (b), and (c) demonstrate that performance shows similar trends for each benchmark kernel. Figure 3(d), (e), and (f) show performance trends for a given vector stride.

Figure 4 shows performance results for vectors with large strides that still hit all the memory banks. Performance for both our SDRAM PVA system and the SRAM PVA system at stride 19 is similar to the corresponding results for unit-stride access patterns. In contrast, the serial gathering SDRAM and the cache-line interleaved systems perform much as they do at stride 16.

[Figure 5. Details of the vaxpy kernel performance on the PVA and a similar PVA SRAM system. Panels: (a) SRAM PVA, (b) SDRAM PVA. Bars of graph (a) are annotated with normalized execution time with respect to the leftmost bar, and those of (b) with respect to the corresponding bar from (a). Legend (relative vector alignments): diff mod #, same bank #, diff page #; same mod #, diff bank #, same page #; same mod #, same bank #, diff page #; same mod #, same bank #, same page #.]

Some relative vector alignments are more advantageous than others, as evidenced by the variations in the SDRAM PVA performance in Figure 5(b). The SRAM version of the PVA system in Figure 5(a) shows similar trends for the various combinations of vector stride and relative alignment, although its performance is slightly more robust. For small strides that hit more than two SDRAM banks, the minimum and maximum execution times for the PVA system differ by only a few percent. For strides that hit one or two of the SDRAM components, though, relative alignment has a larger impact on overall execution time.

The results highlighted here are representative of those for all our experiments [12]. On dense data, the SDRAM PVA performs like an SDRAM system optimized for cache line fills. In general, it performs much like an SRAM vector memory system at a fraction of the cost.

7. Discussion

In this paper, we have described the design of a Parallel Vector Access unit (PVA) for the Impulse smart memory controller. The PVA employs a novel parallel access algorithm that allows a collection of bank controllers to determine in tandem which parts of a vector command are located on their SDRAMs.
The BCs optimize low-level access to their SDRAMs to maximize the frequency of open-row hits and to overlap accesses to independent banks as much as possible. As a result, on unit-stride accesses the Impulse memory controller is never more than 1% slower (and is up to 8% faster) than a memory system optimized for normal cache line fills. For vector-style accesses, the PVA delivers data up to 32.8 times faster than a conventional memory controller and up to 3.3 times faster than alternative vector access units, for a modest increase in hardware complexity.

We are integrating the PVA into the full Impulse simulation environment so that we can evaluate the performance improvements across whole applications. Space limitations prevent us from fully addressing a number of important features of the PVA, including scalability, interoperability with virtual memory, and techniques for optimizing other kinds of scatter-gather operations.

Ultimately, the scalability of our memory system depends on the implementation choice of FirstHit(). For systems that use a PLA to compute the first-hit index, the complexity of the PLA grows with the square of the number of banks, which limits the effective size of such a design to around 16 banks. For systems with a small number of banks interleaved at block size, replicating the FirstHit() logic in each BC is optimal. For very large memory systems, regardless of their interleave factor, it is best to implement a PLA to calculate the successive vector indices within a bank. The complexity of this PLA increases approximately linearly with the number of banks, the rest of the hardware remains unchanged, and the performance is constant, irrespective of the number of banks.

Another design issue is how to handle contiguous data spread across disjoint physical pages. If strided vectors span multiple pages, additional address translation logic is required in the BCs.
In the current evaluation, we assume the data being gathered into each dense cache line falls within a single page or superpage of physical memory. Working around the limitations of paged virtual memory is discussed in our technical report [12].

Finally, the PVA described here can be extended to handle vector-indirect scatter-gather operations by performing the gather in two phases: (i) loading the indirection vector into the BCs and then (ii) loading the vector elements. The first phase is simply a unit-stride vector load operation. After the indirection vector is loaded, its contents can be broadcast across the BC bus. Each BC determines which elements reside in its SDRAM by snooping this broadcast and performing a simple bit-mask operation on each address. Then each BC performs its part of the gather in parallel, and the results are coalesced from the staging units in much the same way as for strided accesses.
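The two-phase vector-indirect gather outlined above can be illustrated in software. This is a minimal sketch under stated assumptions, not a description of the hardware: it assumes 16 word-interleaved banks with 4-byte words, an indirection vector that holds element indices relative to a base address, and a dictionary standing in for SDRAM contents. The names `bank_of` and `gather_indirect` are hypothetical.

```python
NUM_BANKS = 16   # assumed power of two, so a bit mask selects the bank
WORD_BYTES = 4   # assumed element size


def bank_of(addr):
    """Bank that owns a byte address under word interleaving (bit mask)."""
    return (addr // WORD_BYTES) & (NUM_BANKS - 1)


def gather_indirect(base, indices, memory):
    """Software model of a vector-indirect gather.

    Phase (i): the indirection vector `indices` is assumed to be already
    loaded and broadcast to every bank controller.
    Phase (ii): each bank controller keeps only the addresses it owns,
    fetches them, and the results are coalesced in vector order.
    """
    addrs = [base + idx * WORD_BYTES for idx in indices]

    # Each bank controller "snoops" the broadcast and claims its elements.
    per_bank = {b: [] for b in range(NUM_BANKS)}
    for pos, addr in enumerate(addrs):
        per_bank[bank_of(addr)].append((pos, addr))

    # The banks would work in parallel in hardware; here they run in turn.
    result = [None] * len(indices)
    for work in per_bank.values():
        for pos, addr in work:
            result[pos] = memory[addr]
    return result, per_bank


if __name__ == "__main__":
    mem = {a: a * 10 for a in range(0, 4096, WORD_BYTES)}
    values, owners = gather_indirect(0, [5, 21, 0, 16], mem)
    print(values)
    print({b: w for b, w in owners.items() if w})
```

In this model, indices 5 and 21 map to the same bank and indices 0 and 16 to another, so the per-bank work lists show how unevenly an indirection vector can load the bank controllers.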

In summary, we have presented the design of a Parallel Vector Access unit that shows great promise for giving applications with poor locality vector-machine-like memory performance. Although much work remains to be done, our experience to date indicates that such a system can significantly reduce the memory bottleneck for the kinds of applications that suffer on conventional memory systems. The next steps are to evaluate the PVA design on a suite of whole-program benchmarks and to address the issues raised above, particularly the interaction with virtual memory and support for other scatter-gather operations.

8. Acknowledgments

Discussions with Mike Parker and Lambert Schaelicke on aspects of the PVA design and its evaluation proved invaluable. Ganesh Gopalakrishnan helped with the IKOS equipment and the Verilog model. Wilson Hsieh, Lixin Zhang, and the other members of the Impulse and Avalanche projects helped shape this work, and Gordon Kindlmann helped us with the figures.

References

[1] Advanced Micro Devices. Inside 3DNow!(tm) technology.
[2] M. Benitez and J. Davidson. Code generation for streaming: An access/execute mechanism. In Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, Apr.
[3] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a smarter memory controller. In Proceedings of the Fifth Annual Symposium on High Performance Computer Architecture, pages 70-79, Jan.
[4] T.-F. Chen. Data Prefetching for High Performance Processors. PhD thesis, Univ. of Washington, July.
[5] J. Corbal, R. Espasa, and M. Valero. Command vector memory systems: High performance at low cost. In Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, pages 68-77, Oct.
[6] J. Corbal, R. Espasa, and M. Valero. Command vector memory systems: High performance at low cost. Technical Report UPC-DAC, Universitat Politecnica de Catalunya, Jan.
[7] A. del Corral and J. Llaberia. Access order to avoid inter-vector conflicts in complex memory systems. In Proceedings of the Ninth International Parallel Processing Symposium.
[8] J. Dongarra, J. DuCroz, I. Duff, and S. Hammerling. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1-17, Mar.
[9] W. Hsu and J. Smith. Performance of cached DRAM organizations in vector supercomputers. In Proceedings of the 20th Annual International Symposium on Computer Architecture, May.
[10] Intel. MMX programmer's reference manual.
[11] K. Lee. The NAS860 Library User's Manual. NASA Ames Research Center, Mar.
[12] B. Mathew, S. McKee, J. Carter, and A. Davis. Parallel access ordering for SDRAM memories. Technical Report UUCS-99-6, University of Utah Department of Computer Science, June.
[13] S. McKee. Maximizing Memory Bandwidth for Streamed Computations. PhD thesis, School of Engineering and Applied Science, University of Virginia, May.
[14] S. McKee et al. Design and evaluation of dynamic access ordering hardware. In Proceedings of the 10th ACM International Conference on Supercomputing, Philadelphia, PA, May.
[15] S. McKee and W. Wulf. Access ordering and memory-conscious cache utilization. In Proceedings of the First Annual Symposium on High Performance Computer Architecture, Jan.
[16] F. McMahon. The Livermore Fortran kernels: A computer test of the numerical performance range. Technical Report UCRL-53745, Lawrence Livermore National Laboratory, December.
[17] Micron Technology, Inc. 256Mb: SDRAM.
[18] MIPS Technologies, Inc. MIPS extension for digital media with 3D. tech brf.pdf.
[19] Motorola. AltiVec(tm) technology programming interface manual, rev Apr.
[20] S. Moyer. Access Ordering Algorithms and Effective Memory Bandwidth. PhD thesis, School of Engineering and Applied Science, University of Virginia, May.
[21] Sun. The VIS advantage: Benchmark results chart VIS performance. Whitepaper WPR-12.
[22] Sun. VIS instruction set user's manual.
[23] J. Tyler, J. Lent, A. Mather, and H. Nguyen. AltiVec: Bringing vector technology to the PowerPC processor family. In Proceedings of the 1999 IEEE International Performance, Computing, and Communications Conference, Feb.
[24] M. Valero, T. Lang, J. Llaberia, M. Peiron, E. Ayguade, and J. Navarro. Increasing the number of strides for conflict-free vector access. In Proceedings of the 19th Annual International Symposium on Computer Architecture, May.
[25] M. Valero, T. Lang, M. Peiron, and E. Ayguade. Conflict-free access for streams in multi-module memories. Technical Report UPC-DAC-93-11, Universitat Politecnica de Catalunya, Barcelona, Spain.
[26] M. Wolfe. Optimizing Supercompilers for Supercomputers. MIT Press, Cambridge, Massachusetts, 1989.
