A Conflict-Free Memory Banking Architecture for Fast Packet Buffers.


Jorge García, Llorenç Cerdà, and Jesús Corbal
Polytechnic University of Catalonia - Computer Architecture Dept.
{jorge,llorenc,jcorbal}@ac.upc.es
July.

Abstract

Routers and switches operating at high link rates require large and fast packet buffers. In many cases the buffers store cells from different queues which are requested by an arbiter. Such buffers can be built with slow but low cost DRAM coupled with fast but expensive SRAM. The heads and tails of the queues reside in SRAM, while the rest of the queues are stored in DRAM. Periodically, a Memory Management Algorithm requests the transfer of a group of cells of a given queue between DRAM and SRAM. In this paper we propose a memory architecture that exploits memory bank organization in order to achieve conflict-free access. This leads to a reduction of the granularity of DRAM accesses, resulting in a decrease of the storage capacity needed by the SRAM. We carry out an analysis giving the values of the system parameters that guarantee zero miss conditions. Moreover, a technological study of the system implementation shows that, compared with previous designs [], [], our organization gives better results for area, access times, latency, and maximum number of queues.

Index Terms: Memory buffers, high-speed routers and switches, DRAM-SRAM memory organization, memory management algorithms, lookahead buffers.

I. INTRODUCTION

One problem that arises when designing high-speed routers or switches is how to build high-bandwidth packet buffers able to store a large amount of packets. For a router or switch consisting of n_I interfaces with line rate R b/s, an input-buffer architecture would require buffers with a bandwidth of 2R, while an output-buffer architecture would require buffers with a bandwidth of (n_I + 1)R. As a rule of thumb, packet buffers are dimensioned to store the bandwidth-round-trip-time (RTT) product of the network.
For example, for a port operating at 40 Gb/s (OC-768) and a RTT of 0.25 s, the required buffer size would be of 1.25 Gbytes, and the required bandwidth for an input-buffer architecture would be of 80 Gb/s. (This work was supported by the Ministry of Education of Spain under grants TIC-956-C and TIC98-5C.) In many cases, packet buffers consist of Q queues that are accessed in the order determined by an arbiter. For example, this arbiter can be a scheduler of an interconnection network or a scheduler that guarantees the fair share of an output link. A memory architecture for building these buffers using slow but low cost DRAM coupled with fast but costly SRAM is analyzed in []. In this architecture the line rate can be sustained using several DRAM memory devices in parallel, such that b cells of length L_C of the same queue are accessed every random access time of the DRAM. For example, in the case of an input-buffer design, if the random access time of the DRAM is T seconds and the line rate is R b/s, the relation b L_C = 2 T R must hold. Figure 1 shows the memory architecture described in []. In this paper we shall refer to this memory architecture as Random Access DRAM System (RADS).

Fig. 1. RADS memory architecture of the packet buffer (t-SRAM, t-MMA, DRAM, h-MMA, h-SRAM, arbiter).

The tail of each queue is stored in the SRAM indicated as t-SRAM, the head is stored in the h-SRAM and the rest is stored in DRAM. (We assume that the router or switch works internally with fixed-length cells. We shall refer to a slot as the transmission time of a cell.) Every b slots the subsystem indicated as the tail Memory Management Algorithm (t-MMA) selects the queue of the t-SRAM from which b cells are to be transferred to DRAM. Similarly,

the h-MMA decides which queue from the h-SRAM has to be replenished with cells from DRAM. In this paper we will focus on the SRAM size and latency dimensioning. We will assume that all the queues have been initialized and always have cells in the DRAM memory. Furthermore, we shall suppose that the t-SRAM and h-SRAM memories are shared by all the queues, i.e. the cells of a queue can occupy any position in the t-SRAM and h-SRAM. This can be implemented e.g. using a single linked list for the cells of every queue. The t-MMA has to guarantee that cells from the t-SRAM are transferred to DRAM so that we have space available for incoming cells. Using a shared t-SRAM the t-MMA is simple: choose any queue with an occupancy counter higher than or equal to b. Furthermore, the required t-SRAM size is Q (b − 1) + 1. Note that this size guarantees that there are one or more queues having at least b cells. At every slot the arbiter selects the queue that is going to be read. The cell of this queue is taken from the h-SRAM. The h-MMA has to guarantee that cells transferred between DRAM and h-SRAM can accommodate the sequence of cells requested by the arbiter. Otherwise it may happen that the cell requested by the arbiter is not present in the h-SRAM because it has not yet been transferred from the DRAM. We shall refer to this condition as a miss. We will dedicate the rest of the paper to analyzing the h-MMA and h-SRAM, and we shall refer to them simply as MMA and SRAM. There are previous studies related to the analysis carried out in this paper. In [] the authors investigate a scheme that leads to the minimum SRAM size needed for the RADS memory architecture formerly described. In this scheme the requests issued by the arbiter are delayed such that the MMA knows the sequence of requests before deciding which queue has to be replenished. The number of arbiter requests known by the MMA is called the lookahead. The special case when the lookahead is zero is analyzed in [].
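As a minimal illustration of the shared t-SRAM rule just described, the t-MMA can be sketched as a scan for any queue holding a full block of b cells (the code and names below are ours, not the paper's):

```c
/* Sketch of the t-MMA selection rule (illustrative names, not the paper's):
 * with a shared t-SRAM of Q*(b-1)+1 cells, the pigeonhole principle
 * guarantees that, when the SRAM is full, some queue holds at least b
 * cells, so a block of b cells can always be sent to DRAM. */
int tmma_select(const int *count, int Q, int b) {
    for (int q = 0; q < Q; q++)
        if (count[q] >= b)
            return q;   /* first queue with a transferable block */
    return -1;          /* t-SRAM not full enough: nothing to transfer */
}
```

For instance, with Q = 4 and b = 2 the counters {1, 1, 2, 1} sum to Q (b − 1) + 1 = 5 cells, and queue 2 is selectable.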
The main contribution of our work is a conflict-free memory banking architecture that allows further reducing the SRAM size and latency with respect to the RADS architecture studied in []. We shall refer to our memory architecture as Conflict Free DRAM System (CFDS). We analyze CFDS giving the values of the system parameters that guarantee zero miss conditions. Furthermore, by means of a technological study of the system implementation, we show that for the same performance parameters, CFDS achieves considerably better access times and smaller SRAM area than RADS. (If a queue has no cells in the DRAM, cells can be directly granted to the arbiter from t-SRAM or h-SRAM.) There are many proposals dealing with mechanisms to alleviate the problem of memory conflicts [] or to provide conflict-free access [], [5], [6]. This is especially true in the vector processor domain. The novelty of our technique resides in the application of these mechanisms to the context of fast packet buffering. In packet buffering no bank collision may be produced, as packets have to be granted within a bounded delay.

II. RANDOM ACCESS DRAM SYSTEM

In this section the SRAM dimensioning is analyzed when b cells from the same queue are transferred between the SRAM and the DRAM every random access time of the DRAM. Figure 2 shows the MMA scheme considered in this section. In this scheme, the arbiter requests are delayed in the lookahead shift register of l positions. At every slot one cell from the queue demanded by the head of the lookahead is read from the SRAM and granted to the arbiter. Then the request issued by the arbiter is stored in the tail of the lookahead. Every b slots the queue previously selected by the MMA is replenished with b cells. The first of these b cells can be granted to the arbiter if they belong to the requested queue and the queue has no cells in the SRAM. Then, the MMA selects a new queue to be replenished. The decisions of the MMA take into account (i) the SRAM occupancy counters (i.e.
the number of cells of every queue present in the SRAM), and (ii) the arbiter requests stored in the lookahead. The lookahead allows the MMA to select the queue to be replenished knowing the l requests that are going to be issued in the future. Armed with this information, the MMA can take better decisions and hence, the SRAM size can be reduced while still guaranteeing zero miss probability. For example, suppose that the parameters of the system shown in Figure 2 are: Q = 4, b = 2, l = 5. Suppose also that the MMA is called with the SRAM occupancy counters and the lookahead values shown in the figure. The MMA should select the critical queue: this queue would be replenished with b cells after b slots, and would still hold cells after 5 slots. If the MMA had selected another queue, a miss would occur after 5 slots. In [] it is shown that the minimum SRAM size necessary to have zero miss probability is SRAM_min = Q (b − 1) and the required lookahead is l = Q (b − 1) + 1. Furthermore, the MMA necessary to have zero miss probability is the so-called Earliest Critical Queue First (ECQF-MMA). This algorithm works as follows: read the lookahead from the head (slot 0) to the tail (slot l − 1). For every request read from the lookahead, decrease the occupancy counter of the corresponding queue (we shall refer to the resulting value as the excess of the queue). If the

Fig. 2. RADS memory architecture: SRAM, MMA, occupancy counters, lookahead, arbiter.

excess is < 0 then the queue is said to be critical. The first queue found to be critical following this algorithm is the queue selected by the ECQF-MMA. Note that a lookahead l = Q (b − 1) + 1 guarantees that there is always at least one critical queue. The special case of l = 0 is analyzed in []. In this case, the MMA consists of selecting the queue having the lowest occupancy counter. The bound SRAM_max ≤ Q b (2 + ln Q) is given for the maximum SRAM size necessary to have zero miss probability. In the following we analyze the intermediate case: 0 < l < Q (b − 1) + 1. We approach the analysis as follows: let the system have all the queues with I cells in the SRAM. We shall refer to the deficit of a queue as I minus the occupancy counter of the queue. Let D be the maximum deficit that a queue of the system can reach having zero misses. Then, since any queue can reach this deficit, we have to dimension the SRAM size to Q D cells and I = D. We shall use the following MMA:

• Earliest Critical Queue First (ECQF): First look for the first critical queue (as formerly explained). If no critical queue is found, then use the following principle.
• Maximum Deficit Queue First (MDQF): Take the queue with the maximum deficit at the end of the lookahead, i.e. the queue for which the occupancy counter minus the number of requests pending in the lookahead is the lowest. If several queues have the same deficit at the end of the lookahead, choose one randomly.

This MMA tries to minimize the probability of unnecessarily refilling a queue, i.e. a queue for which no requests will arrive in the near future. We now compute the maximum deficit. We compute this deficit by finding out the worst case pattern (WCP) of arbiter requests, i.e. the pattern that most empties the occupancy of any of the queues. Let us refer to the queues as q_i, i ∈ [0, Q − 1].
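A compact sketch of this two-step ECQF/MDQF decision follows (our own code, not the paper's; the counters and lookahead are passed as plain arrays, and Q ≤ 64 is assumed only to avoid dynamic allocation):

```c
/* ECQF/MDQF sketch (illustrative, not the paper's implementation).
 * lookahead[0..l-1] holds the queue index requested at each future slot.
 * ECQF: scan the lookahead, decrementing a copy of the occupancy
 * counters; the first queue whose excess drops below zero is critical.
 * MDQF fallback: no critical queue, so pick the queue with the maximum
 * deficit (lowest excess) at the end of the lookahead. */
int mma_select(const int *count, int Q, const int *lookahead, int l) {
    int excess[64];                      /* sketch assumes Q <= 64 */
    for (int q = 0; q < Q; q++) excess[q] = count[q];
    for (int i = 0; i < l; i++) {
        int q = lookahead[i];
        if (--excess[q] < 0) return q;   /* ECQF: earliest critical queue */
    }
    int best = 0;                        /* MDQF: maximum deficit at end */
    for (int q = 1; q < Q; q++)
        if (excess[q] < excess[best]) best = q;
    return best;
}
```

For counters {1, 1, 1} and lookahead {0, 0}, queue 0 becomes critical at its second request; for counters {2, 1, 1} and lookahead {0, 1}, no queue is critical and queue 1 has the largest deficit at the end of the lookahead.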
For the sake of simplicity, let us assume that when the MDQF principle is used and several queues have the same deficit, the MMA chooses the queue of lowest index (instead of doing a random selection). Note that this assumption implies only a queue reordering and has no influence on the result. Let q_{Q−1} be the queue that reaches the maximum deficit. Without loss of generality, let the system start with all the queues having I cells in the SRAM. Clearly, the WCP consists of requests that empty as much as possible the queue q_{Q−1}, but forcing the MMA to replenish all the other queues before replenishing q_{Q−1}. We claim that the WCP of arbiter requests is given by round robins of the queues q_i, i ∈ [0, Q − 1], such that the MMA is forced to choose the sequence of replenishments q_0, q_1, ..., q_{Q−1}. Every round robin starts with the queue of maximum index such that this sequence is fulfilled and ends with q_{Q−1}. Note that the queue q_{Q−1} reaches the largest deficit since its queue length is the only one decreased in all round robins. Furthermore, the queue is reduced to the maximum value since the queue replenishment is delayed as much as possible. Based on this idea, Figure 3 gives an algorithm in C code that computes the exact maximum deficit for all values of the lookahead.

int maximum_deficit(int L, int Q, int b)
{
  int nextround, maxdeficit, slot;

  if (L > Q * (b - 1)) {
    return -1;
  }
  if (L > 0) {
    maxdeficit = (L - 1) / Q + 1;
    slot = maxdeficit * Q;
  } else {
    maxdeficit = 0;
    slot = 0;
  }
  while (1) {
    nextround = (slot - L) / b + 1;
    if (nextround >= Q - 1) {
      /* There is a RR of only one queue. */
      return maxdeficit + Q * b - 1 - slot;
    }
    slot += Q - nextround;
    if (slot >= Q * b) {
      return maxdeficit;
    }
    ++maxdeficit;
  }
}

Fig. 3. Algorithm to compute the maximum deficit. The parameters of the function are: lookahead size (L), number of queues (Q) and number of cells transferred from DRAM (b).

A closed formula giving a bound for the maximum deficit can be derived based on the WCP.
Assume a fluid model where all the queues can be equally emptied. Figure 4 shows the evolution of the deficit of q_{Q−1} in this situation. Note that during the first l slots, the arbiter requests are round robins of all the queues. This is done to force the MMA to choose q_0 as the first queue to be replenished (this would occur at slot 0). Therefore, during

Fig. 4. Evolution of the deficit of q_{Q−1} in the fluid model of the worst case pattern (slopes 1/Q, ..., 1/(Q − l/b); slots l, ..., Qb).

the first l slots all the queues are emptied with a slope 1/Q. After the slot l, the round robin contains only the Q − l/b queues not yet replenished, and thus the slope grows to 1/(Q − l/b), and so on, until the slot Qb where the queue q_{Q−1} would be replenished. The maximum deficit (D̂) reached by q_{Q−1} in this fluid model would be:

    D̂ = l/Q + Σ_{i=1}^{Q − l/b} b/i,   0 < l ≤ Qb.   (1)

The maximum deficit in this fluid model is an upper bound to the maximum deficit in the discrete model we are approximating (D). Furthermore, since Σ_{i=1}^{Q − l/b} b/i < b (1 + ln Q), we have:

    D < l/Q + b (1 + ln Q),   0 < l ≤ Qb.   (2)

Figure 5 compares the exact maximum deficit computed with the algorithm of Figure 3 and the bound given by equation (2). Remember that the SRAM size in cells is Q times the maximum deficit.

Fig. 5. Tradeoff between maximum deficit (cells) and lookahead (slots): fluid approximation vs. exact (Q = 64, b = 8).

III. POTENTIAL OF BANK INTERLEAVING

In response to the growing gap between processor and memory, DRAM manufacturers have created several new architectures that address the problem of latency, bandwidth and cycle time (e.g. DDR-SDRAM [7] or RAMBUS DRDRAM [8]). All these commercial DRAM solutions implement several memory banks that can be addressed independently. If a conflict-free mechanism can be implemented to avoid memory bank contention, the cycle time of a random DRAM access can be significantly reduced. This translates into the potential for reducing the required data granularity of any DRAM cell access, hence alleviating the requirements of the associated SRAM size. These features are exploited in the memory architecture proposed in section IV. In this section the concept of memory banking is introduced.

Fig. 6. Organization of a DRAM main memory system in banks: (a) internal structure of a memory bank, (b) configuration of a DRDRAM-style memory system.
Figure 6 illustrates the concept of memory bank and the concept of an interleaved memory system. A memory bank is a set of memory modules that are always accessed in parallel with the same memory address. The number of memory modules grouped in parallel is dictated by the size of the data element we want to address. This size is known as data granularity. For instance, in [], a data granularity of b = 8 cells is used for DRAM memories with a random access time of 48 ns. This achieves an effective input/output rate of one cell every 6.4 ns (corresponding to OC768). Figure 6 also shows a possible memory bank configuration (similar to that of a DRDRAM-like memory system [8]). In a conventional DRAM memory system, the data is interleaved across all memory banks using a certain policy, and the memory controller is simply responsible for broadcasting the addresses to all of them. Each memory bank has special logic that determines whether or not the address identifies a data item that the bank contains. Given an array of M memory banks and a random cycle time of T seconds per bank, it is theoretically possible to initiate a new memory access every T/M seconds. Therefore, the data granularity of the packet buffering system (b cells accessed per memory access) can potentially be reduced by a factor of M, hence significantly reducing the need for large and expensive SRAM buffers.
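The arithmetic of this argument can be written down directly (a sketch of ours, with example numbers of our choosing):

```c
/* Interleaving arithmetic sketch: with M independently addressable banks
 * of random cycle time T, a new access can be initiated every T/M, so
 * the b-cell granularity needed to sustain the line rate shrinks by up
 * to a factor of M (rounded up to a whole number of cells). */
double initiation_interval_ns(double T_ns, int M) {
    return T_ns / M;
}
int reduced_granularity(int b, int M) {
    return (b + M - 1) / M;   /* ceil(b / M) */
}
```

For example, a granularity of b = 8 cells with M = 4 conflict-free banks would need only 2 cells per access.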

There are two fundamental limits to the bank interleaving technique. First, the bus address speed: that is, the cycle time required to broadcast an address to all memory banks. Second, the problem of bank collisions. In order to fully exploit the potential bandwidth of an interleaved memory system, we need to guarantee that the same bank is not accessed twice within its random access time (T). The implementation of conflict-free mechanisms is especially sensitive in the context of fast packet buffering, as we need to enforce that no bank collision is ever produced; otherwise, a packet would be lost as a result.

IV. CONFLICT FREE DRAM SYSTEM

In this section we describe a novel DRAM memory system that grants conflict-free access with affordable cost. The basis of this system is a special memory bank organization that minimizes the likelihood of memory conflicts, coupled with a reordering mechanism that schedules the different MMA requests. We shall refer to our memory architecture as Conflict Free DRAM System (CFDS). Figure 7 summarizes the CFDS memory architecture. Four items stand out in CFDS:

• CFDS exploits the DRAM bank organization.
• The Virtual SRAM Subsystem works exactly as the RADS memory architecture described in section II, but using a granularity of b' < b when cells are transferred between DRAM and SRAM.
• The DRAM Scheduler Subsystem hides the implications of the DRAM bank organization from the former Virtual SRAM Subsystem.
• The DRAM bank organization introduces a cost in terms of latency and SRAM size.

These items are discussed in the following subsections.

Fig. 7. CFDS memory architecture of the packet buffer: DRAM Scheduler Subsystem (DSA, Requests Register (RR), Ongoing Requests Register (ORR)) and Virtual SRAM Subsystem (SRAM, latency shift register, V-MMA, virtual occupancy counters, lookahead, arbiter).

A. DRAM Bank Organization

Let M be the number of DRAM banks. We organize these banks in N groups of b/b' banks per group (see Figure 8). Thus, we have N = M/(b/b') groups. Each group stores the cells of Q/N queues. Banks are accessed transferring b' cells of the same queue. Furthermore, in order to avoid bank conflicts, the cells of every queue are stored in round robin fashion across all the banks of the group (low-order interleaving). In this way, we can perform b/b' consecutive accesses to the same queue (transferring b cells overall) without bank conflicts. The distribution of the queues among the maximum number of groups maximizes the likelihood of finding independent accesses, reducing the hardware requirements to provide conflict-free access.

Fig. 8. Proposed memory bank interleaving: the group index uses log N bits and the bank index log (b/b') bits.

In Figure 8 we can observe the mapping function used to obtain the bank and group indexes. A given request address contains two bit fields, one determining the virtual queue and one determining the relative order. The group index is obtained using the low-order bits of the virtual queue field, while the bank inside that group is obtained using the low-order bits of the ordinal field. The rest of the bits are used to determine the row and column addresses of the specified DRAM bank.

B. Virtual SRAM Subsystem

The Virtual SRAM Subsystem shown in Figure 7 works in the same way as the RADS memory architecture described in section II, but assuming that b' < b cells are transferred every DRAM memory access. The requests issued by the arbiter are shifted into a lookahead register. Every b' slots the Virtual MMA (V-MMA) decides the queue to be replenished and releases the request to the DRAM Scheduler Subsystem. The arbiter requests leaving the lookahead register and the V-MMA replenish requests update accordingly the virtual occupancy counters.
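The mapping of Figure 8 can be sketched as follows (field widths and names are ours; powers of two are assumed so that the modulo operations correspond to taking low-order bits):

```c
/* Sketch of the Figure 8 address mapping (our naming). The group index
 * comes from the low-order bits of the virtual queue field and the bank
 * index within the group from the low-order bits of the ordinal (cell
 * sequence number), i.e. low-order interleaving inside each group. */
typedef struct { unsigned group, bank; } bank_addr;

bank_addr map_cell(unsigned queue, unsigned ordinal,
                   unsigned N, unsigned banks_per_group) {
    bank_addr a;
    a.group = queue % N;                  /* low bits of the queue field */
    a.bank  = ordinal % banks_per_group;  /* round robin across the group */
    return a;
}
```

Consecutive cells of one queue (ordinal 0, 1, 2, ...) thus walk round robin over the b/b' banks of the queue's group, so b/b' consecutive accesses to the same queue never collide.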

C. DRAM Scheduler Subsystem

The DRAM Scheduler Subsystem shown in Figure 7 manages the transfers between the DRAM and SRAM. The purpose of this subsystem is to guarantee that the replenish requests issued by the V-MMA are fulfilled. To do so, we have introduced the DRAM Scheduler Algorithm (DSA), responsible for solving bank conflicts that could prevent a replenish request. In the following, we give a detailed description of this algorithm. As explained in the former subsection, every b' slots the V-MMA delivers a replenish request to the DRAM Scheduler Subsystem which, in turn, initiates a transfer of b' cells of a queue chosen by the DSA between DRAM and SRAM. In order to avoid transfers having bank conflicts, the DSA makes use of two registers (see Figure 7): the Requests Register (RR) and the Ongoing Requests Register (ORR). The ORR is a shift register that stores the banks that are being accessed. Hence, the banks stored in the ORR are locked and a new transfer with these banks cannot be initiated. Taking into account that a bank is locked during b slots, we need to consider the latest b/b' − 1 ongoing requests. The size of the ORR is hence b/b' − 1. Every b' slots the DSA stores the replenish request delivered by the V-MMA in the RR and removes one of the pending requests of the RR. This request is passed to the ORR. Obviously, the request removed from the RR must be to a bank different from those stored in the ORR. To do so, the DSA seeks the RR, starting from the oldest request (the head of the RR). In the appendix we demonstrate that a bank satisfying the former condition can always be found if the RR has a size of:

    L = 2 (Q/N) (b/b' − 1) + 1.   (3)

The factor 2 is due to the fact that we have Q incoming streams from the t-SRAM and Q outgoing streams for the h-SRAM. Note that in case the DSA always chooses the head of the RR, the replenish requests delivered by the V-MMA would suffer a delay of (L − 1) b' slots until they are passed to the ORR.
If the head of the RR is not always eligible, some requests will have a delay higher and some lower than (L − 1) b' slots. In the appendix we demonstrate that the maximum number of times a request can be delayed is:

    R_max = 2 (Q/N) (b/b' − 1).   (4)

This delay in slots is b' R_max.

D. Cost of the DRAM Bank Organization

The drawback of our proposed conflict-free access mechanism is that the DRAM may deliver the cells out of order (even those from the same queue that are not located in the same bank). Therefore, we have to introduce some mechanisms to cope with that. The replenish requests issued by the V-MMA have to be already provided to the SRAM before the cells are granted to the arbiter. Therefore, an additional latency equal to the maximum delay that a replenish request can suffer (due to the DSA reordering) has to be added to the lookahead of the V-MMA. This delay is introduced by the latency shift register shown in Figure 7. From the results formerly given, we have that the size of this register is equal to:

    latency (in slots) = ((L − 1) + R_max) b' = 4 (Q/N) (b/b' − 1) b'.   (5)

Finally, note that the replenish requests of the SRAM queues are delivered to DRAM when they leave the RR register, but cells are removed from SRAM when they leave the latency shift register. The mismatch between these two events requires increasing the SRAM size (in order to store the cells downloaded to the SRAM before they are granted to the arbiter). We conclude that two factors contribute to the SRAM size dimensioning: (i) the maximum deficit (D_(L,Q,b')) of the queues in the Virtual SRAM Subsystem (computed as explained in section II), and (ii) the additional SRAM size to cope with the mismatch described above. Summing both terms we have:

    SRAM size (cells) = Q D_(L,Q,b') + b' R_max.   (6)

As we will see, Q D_(L,Q,b') + b' R_max is in general much lower than Q D_(L,Q,b). The issue of out of order insertion in the SRAM buffer will be addressed in section VI.
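The selection step of the DSA can be sketched as a linear scan of the RR against the ORR (our code, with the registers modeled as plain arrays of bank indexes):

```c
/* DSA selection sketch (illustrative): return the index of the oldest
 * request in the Requests Register whose bank does not collide with any
 * bank currently held in the Ongoing Requests Register. With the RR
 * sized as in equation (3), an eligible request always exists. */
int dsa_select(const int *rr_bank, int rr_len,
               const int *orr_bank, int orr_len) {
    for (int i = 0; i < rr_len; i++) {        /* head of RR = oldest */
        int busy = 0;
        for (int j = 0; j < orr_len; j++)
            if (rr_bank[i] == orr_bank[j]) { busy = 1; break; }
        if (!busy) return i;                   /* conflict-free request */
    }
    return -1;                                 /* unreachable if RR sized per (3) */
}
```

For instance, with pending requests to banks {3, 5, 3, 7} and bank 3 locked in the ORR, the request to bank 5 (the oldest eligible one) is selected.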
V. EVALUATION OF RADS

In this section we will perform an evaluation of different implementation issues concerning the design of the RADS memory architecture described in section II. We pose the problem of implementing an SRAM structure able to handle several queues and we propose some design alternatives. We finally estimate the area and access time and determine the viability of the proposed SRAM designs. Throughout this section, we shall assume OC768 and OC3072 links for an input-buffer architecture. For the OC768, we will assume Q = 128. This could correspond to 16 interfaces with 8 service classes. For OC3072 we have assumed a more aggressive design with Q = 512. On the other hand, taking into account the link rates of OC768 and OC3072, we will set the RADS data granularity (b) to 8 for the former system and 32 for the latter.

A. Design of SRAM buffers

In order to show how significant the benefits of our proposed CFDS system are, we can evaluate whether a RADS SRAM buffer will be technologically hindered in the near future. We have used CACTI [9] to estimate the access time (in ns) and the area (in cm²) of different implementations of the t-SRAM and h-SRAM buffers using a 0.13 μm technological process. CACTI is an integrated cache access time, cycle time, area, aspect ratio, and power analytical model. The main advantage of CACTI is that, as all these models are integrated together, tradeoffs between the different parameters are all based on the same assumptions and hence are mutually consistent. As mentioned in section I, we have assumed that the t-SRAM and h-SRAM are shared by all the queues. The design of a unified (shared) SRAM buffer is not as trivial as the design of a distributed (isolated) SRAM buffer, where each queue has its own partition of the available memory. The second kind of SRAM buffer could be easily implemented as a set of circular queues built with simple direct-mapped SRAM structures. On the other hand, in the shared SRAM buffer, we need a mechanism to identify where the n-th element of a given queue q_i is exactly placed. Intuitively, this is analogous to the design of Q linked lists, where the next cell to be accessed by the arbiter is located at the head of the corresponding list, and the next cell to store coming from the DRAM is placed at the tail of the corresponding list. Figure 9 shows different alternatives to implement a unified SRAM buffer. The global CAM consists of a full content-addressable memory where all the cells are stored together. Each cell has a tag that identifies the queue relative to that cell, and the relative order inside the list of cells of that same queue. When the address (queue identifier and order) of the cell is set, the CAM searches across all entries for the related cell.
This implementation requires two ports (one for reading and one for writing) or one read/write port if the operation is going to be time-multiplexed (i.e. the operations of reading and writing from the CAM are serialized). The time-multiplexed alternative is slower but requires less area. Note that we assume that the refreshes from the DRAM are serialized along b time slots at a rate of one cell per slot. The pointer CAM is an alternative to the previous proposal. As a cell is relatively big (64 bytes), for certain configurations it may pay off to store pointers into the CAM instead of whole cells. These pointers are actually the position where the real cell is stored, in a simpler direct-mapped structure. This alternative has the potential to reduce the overall area, but at the cost of increasing the access time (as the CAM has to be accessed to obtain the pointer to the direct-mapped SRAM structure). Again, this implementation requires one read port and one write port (or one single read/write port for a time-multiplexed configuration) for both the CAM and the direct-mapped memories. Finally, the unified linked-list proposal is a straightforward implementation of Q linked lists onto a direct-mapped memory structure. Each entry of that direct-mapped SRAM contains one cell and a pointer to the next cell (another entry of the same structure). In order to be able to identify the head and the tail of each linked list, we have another direct-mapped structure that stores the head and tail pointers for each of the queues. The advantage of this implementation is that it does not require a complex CAM implementation. However, it requires one additional write operation to store the position of the new tail onto the pointer field of the old tail. As a result, the structure needs one read port and two write ports per structure (or one single read/write port for a time-multiplexed configuration).

Fig. 9.
SRAM design alternatives: (a) global CAM, (b) pointer CAM and (c) unified linked-list.

B. RADS SRAM buffer performance

Figure 10 shows the access time and the area of the different SRAM implementations (regular and time-multiplexed) for OC768 and OC3072 as a function of the number of slots of the lookahead. The OC768 system SRAM size ranges between several hundred Kbytes (for the minimum lookahead) and some tens of Kbytes (for the maximum lookahead). The OC3072 system SRAM size ranges between more than 6 Mbytes (for the minimum lookahead) and about 1 Mbyte (for the maximum). For an OC768 system, we need to access a new cell every 12.8 ns (assuming 64-byte wide cells). We can observe

Fig. 10. h-SRAM area and access time as a function of the lookahead for the RADS scheme (Q = 128, b = 8 for OC768 and Q = 512, b = 32 for OC3072; curves: global CAM, pointer CAM and unified linked list, regular and time-multiplexed).

from Figure 10 that the access times of all SRAM alternatives are well under that number, even for the shortest lookahead. Therefore, as access time is not a concern, we shall focus on those implementations with minimum area. For instance, the regular global CAM and unified linked-list implementations have an overall area of more than 0.2 cm² for short lookaheads, which could represent a significant fraction of the overall transistor budget of a medium cost system. On the other hand, time-multiplexed approaches such as the time-mux unified linked list exhibit a much smaller overall area even with very small lookaheads. For the rest of the paper, we will use the time-mux unified linked list SRAM for all OC768 systems. For an OC3072 system, we need to access a new cell every 3.2 ns, which is a significantly harder constraint to meet, taking into account that the SRAM buffers are now larger. Indeed, if we look at Figure 10, we can clearly appreciate the fact that no SRAM implementation is able to comply with the 3.2 ns target (not even for the longest lookaheads). The fastest implementation is the global CAM, which has an access time slightly higher than 7 ns.
Furthermore, if we look at the area results for the different alternatives, we can observe that only the time-multiplexed SRAMs have areas smaller than 1 cm², which is already a significant fraction of the overall transistor budget of a high-end system (as we need both h-SRAM and t-SRAM buffers). As the access time is clearly the constraint for OC3072, for the rest of the paper we will use the global CAM implementation (the fastest) for evaluation purposes.

VI. EVALUATION OF CFDS

In this section we will perform an extensive evaluation of different implementation issues concerning the design of the CFDS memory architecture described in section IV. We will discuss the design of the request register scheduling logic, as well as the required modifications for the SRAM buffers. We finally compare the performance (in terms of area and access time) of the RADS and CFDS memory architectures.

A. Implementation issues of the DRAM Scheduler Subsystem

As already described in section IV-C, the Requests Register (RR) is a special lookahead of requests to the DRAM memory system. The main function of this register is to

store the requests so that the DSA can schedule them in order to grant conflict-free access to the DRAM memory banks. The scheduling algorithm used to select the request to be serviced is very simple. The Ongoing Requests Register (ORR) contains the banks that are currently accessing a block of cells. The scheduler is responsible for finding the oldest request in the RR that does not access any of the banks contained in the ORR. This logic problem is actually equivalent to the design of instruction issue windows in out-of-order superscalar processors [10], [11], where a set of instructions free of dependences must be selected each cycle, based on the broadcast of the results that have already become available.

Fig.: Implementation of the request register: (a) wake-up logic, (b) selection logic.

The figure shows a possible design for the RR scheduling logic, based on a conventional issue-window implementation. The logic is based on sending the tags from the ORR (that is, the indexes of the banks being accessed) across all the entries of the Requests Register. Each entry is responsible for determining whether the bank index corresponding to its request is different from all the indexes coming from the ORR. In this case, the entry is marked as ready. This stage is known as wake-up, and it is performed every time we want to select a new request candidate. After the wake-up stage, the logic needs to select the oldest entry that is in the ready state (i.e., that can access the DRAM array with no bank conflicts). In order to do so, we can use a hierarchical selection logic that first propagates the ready signals up a tree, and then sends the selection signal back down using priority encoders. This stage is known as the selection stage. Finally, once the request has been selected, some mechanism to perform compaction of the requests is required in order to maintain the ordering of the requests by age.
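As a software illustration of the wake-up and selection stages described above, the sketch below models the RR as an age-ordered list. The class and method names are ours (not part of the actual hardware design), and the hierarchical priority encoders are abstracted into a simple linear scan:

```python
# Sketch (not the paper's RTL): software model of the RR wake-up/selection
# logic. Names (RequestScheduler, orr_banks, ...) are illustrative only.
from collections import deque

class RequestScheduler:
    """Selects the oldest pending request whose DRAM bank is not busy."""

    def __init__(self):
        self.rr = deque()  # Requests Register, ordered oldest-first

    def add_request(self, queue_id, bank):
        self.rr.append((queue_id, bank))

    def select(self, orr_banks):
        """Wake-up + selection: return the oldest request whose bank is not
        in the Ongoing Requests Register (ORR). Compaction is implicit in
        removing the chosen entry while preserving the order of the rest."""
        for i, (queue_id, bank) in enumerate(self.rr):  # oldest first
            if bank not in orr_banks:                   # wake-up check
                del self.rr[i]                          # compaction
                return (queue_id, bank)
        return None                                     # all banks busy
```

For example, with pending requests to banks 3, 5, and 3 (oldest first) and bank 3 currently in the ORR, `select({3})` skips the oldest request and returns the one addressed to bank 5, exactly the oldest-ready policy described above.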
Table I shows the RR size for different values of b, and the time available to perform the scheduling of one single request.

TABLE I: REQUESTS REGISTER SIZE AND TIME TO PERFORM THE SCHEDULING OF A NEW REQUEST.

In order to figure out the technological hurdles of those implementations, we may look at a current commercial processor: the Alpha 21264 implements, in a 0.35 µm CMOS process, a 20-entry issue queue able to select up to four instructions in less than 1 ns [12], using .5 cm² of the overall die area. From this base result, we can conclude that the implementation of the RR logic for OC768 is fairly trivial, since even in the most demanding configuration we have .8 ns to search an RR of length 6. For OC3072, the design is attainable for the larger values of b, and possible (yet extremely aggressive) for the next smaller value. The implementation of the RR scheduling logic for OC3072 with the smallest b is certainly of difficult viability.

B. Design of CFDS SRAM Buffers

In order to implement an SRAM buffer for any CFDS configuration, we have to observe two main issues. First, the cells of a given queue coming from the DRAM memory system may arrive out of order. Second, the SRAM must contain R_max additional entries to be able to hold elements before they are scheduled by the MMA. The first problem can be easily overcome by implementing some basic changes in our proposed SRAM structures to allow them to insert cells from a queue out of their natural order:

• global CAM: implementing write operations out of order is trivial in this configuration, as only more bits are required in the ordinal tag field.
• pointer CAM: again, implementing write operations out of order is very simple, by increasing the number of tag bits used to identify the order of a cell inside its queue.
• unified linked list: performing write operations out of order is complex inside a linked list.
However, an easy solution is to implement one linked list per queue and per bank of a group, instead of one per queue, since two operations over the same bank are always performed in strict order.

1) Performance comparison: The corresponding figure allows us to demonstrate the performance benefits of using CFDS instead of a basic RADS approach. The figure shows the

area (of both h-SRAM and t-SRAM) and the most restrictive access time for OC768 and OC3072 as a function of the lookahead delay (measured in µs). Again, the number of queues Q is 8 for the OC768 system and 5 for the OC3072 system. The curves with a data granularity of b = 8 for OC768 correspond to the RADS implementation. The rest of the curves correspond to different CFDS configurations varying the value of b. We assume the number of banks M to be 256.

There are two main conclusions that can be inferred from the results in the figure. First, the advantages of CFDS relative to RADS are evident. For OC768 (where area is the main figure of merit), a CFDS system achieves an area .8× smaller than RADS, with a shorter lookahead delay ( µs instead of 8 µs). The especially significant gains, however, come from the OC3072 system. A CFDS configuration is compliant with the requirements of buffering packets at 160 Gb/s, as its access time is lower than 3.2 ns. Moreover, this is accomplished with a modest lookahead delay ( µs) and an affordable area (.6 cm² overall). This contrasts heavily with its RADS counterpart, which is unable to access data in less than 7 ns, even with a delay of more than 5 µs and a non-negligible area (almost the size of the Pentium 4). The second important conclusion is that there is an optimal value of b for a given CFDS implementation. The reason is the trade-off between the SRAM size required to tolerate the unpredictability of arrivals from the arbiter, D(L,Q,b), which is proportional to b (see equation ()), and the SRAM size required to absorb the level of reordering of the accesses from the DRAM, R_max, which is inversely proportional to b (see equation ()).

Fig.
Maximum number of queues attainable for RADS and CFDS with different configurations (using maximum lookahead and complying with the OC768 and OC3072 time constraints).

2) Potential number of queues: An important parameter is the number of queues that can be supported by our system []. The figure shows the maximum number of queues that the different SRAM buffer approaches can afford, taking into account the access-time constraints (12.8 ns of maximum delay for OC768 and 3.2 ns for OC3072). The first column corresponds to a RADS implementation, while the rest of the columns correspond to CFDS implementations as we reduce the data granularity. As shown in the figure, CFDS allows .5 times more queues for OC768 and 6 times more queues for OC3072.

VII. CONCLUSIONS

In this paper we have proposed a novel architecture targeted at fast packet buffering. In order to overcome the bandwidth problems of current commodity DRAM memory systems, the use of SRAM coupled with DRAM has been proposed in the past. These SRAM memories act as ingress and egress buffers that allow wide transfers between the buffering system and its main DRAM memory. The main drawback of this organization is that the data granularity of the DRAM accesses has to be enlarged to sidestep the high cycle times of a single DRAM bank. As a result, the SRAM memories can become too large and slow for very high link rates. In our proposal, the memory bank organization is exploited in order to achieve conflict-free access to DRAM. This leads to a reduction of the granularity of the DRAM accesses, resulting in a decrease of the storage capacity needed by the SRAM. The main cost of the conflict-free organization scheme is that cells can be transferred from DRAM to SRAM out of order. We carry out an analysis showing that the reordering introduced by the access scheme is bounded and that zero-miss conditions can be guaranteed.
Moreover, a technological study of the system implementation shows that our organization gives better results for area, access time, latency, and maximum number of queues than the previously proposed designs. For instance, our proposed mechanism allows implementing considerably smaller SRAMs than those required by the baseline system for OC768. Moreover, our proposal allows SRAM buffers to be implemented fast enough to perform packet buffering at OC3072 link rates, with a reasonable area cost. Another contribution is the observation that the SRAM size has two components: the first one (Q·D(L,Q,b)) is proportional to the data granularity of the DRAM access; the second, the degree of reordering (R_max), is actually inversely proportional to the data granularity. As a result, there is an optimum value of b (generally greater than 1) that minimizes the SRAM size.
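A quick numerical sketch illustrates the convex trade-off just described. The constants standing in for the Q·D(L,Q,b) and R_max terms are hypothetical placeholders chosen for illustration only, not values taken from our analysis:

```python
# Toy model of the SRAM sizing trade-off: one component grows linearly with
# the DRAM access granularity b, the other (reordering) shrinks with b.
# c_lookahead and c_reorder are made-up illustrative constants.

def sram_size(b, c_lookahead=4.0, c_reorder=64.0):
    """Total SRAM entries as a function of the granularity b."""
    lookahead_part = c_lookahead * b   # stands in for Q * D(L, Q, b)
    reorder_part = c_reorder / b       # stands in for R_max
    return lookahead_part + reorder_part

# Scan power-of-two granularities for the minimum total size.
candidates = [1, 2, 4, 8, 16, 32]
best_b = min(candidates, key=sram_size)   # -> 4 for these constants
```

With these placeholder constants the minimum falls at an intermediate granularity, mirroring the observation that the optimum b is generally greater than the minimum value.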

Fig.: SRAM area (both h-SRAM and t-SRAM) and access time, as a function of the delay, for the RADS scheme and several CFDS configurations, for OC768 and OC3072. Delay for RADS = lookahead delay. Delay for CFDS = lookahead delay + latency.

APPENDIX

In this appendix we derive (i) the size L of the RR register that allows avoiding bank conflicts in the CFDS memory architecture, and (ii) the maximum number of times R_max that a request can be delayed in the RR register (see Section IV). Let us define A(j,n) to be the number of requests issued by the MMA to memory bank j during n consecutive b-slots, where a b-slot is defined as b time-slots. A consequence of the CFDS memory organization proposed in Section IV is the following:

Property 1: For the memory banks belonging to a group of memory banks we have:

max_j {A(j,n)} − min_j {A(j,n)} ≤ min(n, Q′/N),

where Q′ = 2Q, as we are taking into account requests from both the tail and the head V-MMA. We know that A(j,n) ≤ n; hence:

Σ_j A(j,n) ≤ min{n, (Q′/N)( ) + n}.   (7)

Note that this is equivalent to a leaky-bucket rate-controlled source with parameters α = (Q′/N)( ), ρ, and C (see [14], equation 7). A direct consequence of this property is:

Property 2: If A(j,n) = Q′/N + k, then for the other banks of the group (i ≠ j): A(i,n) ≥ k.

Let us now define R(j,n,m) to be the number of requests to memory bank j stored between positions n and m of the registers ORR and RR (we are considering the ORR to be attached to the head of the RR).
The first positions correspond to requests stored in the ORR, while the following positions correspond to requests stored in the RR. We are also assuming that the length of the RR is infinite.

Property 3: For the memory banks belonging to the same group we have:

max_j R(j,n,m) − min_j R(j,n,m) ≤ min(m − n, Q′/N).   (8)

Proof: Let us consider the initial state of the system. In this state the ORR is empty and the first request arrives at the head of the RR. At this instant the RR content is

identical to the sequence of requests issued by the MMA, and thus, using Property 1, it is easy to see that Property 3 holds. Let t be the first b-slot at which Property 3 does not hold. For some banks j and i belonging to the same group of memory banks, and some interval between positions n and m, the following relation holds at b-slot t:

R(j,n,m) − R(i,n,m) = Q′/N + 1.   (9)

Here we can assume that the requests placed at positions n and m are addressed to bank j. Note that if Property 3 was true at b-slot t − 1, then we should have issued a request for i that was placed between positions n and m + 1. However, this is only possible if a request to bank j was at a position p small enough to lie in the ORR. Otherwise we should have chosen the request for memory bank j placed at position n of ORR+RR, instead of the request for i located between positions n and m + 1. Consequently, at b-slot t − 1 we would have R(j, p + 1, m + 1) ≥ R(j, n, m + 1) + 1, and thus R(i, p + 1, n − 1) ≥ 1. However, this is a contradiction, since in this case the request to memory bank i placed between n and m + 1 at b-slot t − 1 would not be eligible. Using Σ_j R(j,n,m) ≤ m − n, we arrive at the following property:

Property 4: If R(j,n,m) = Q′/N + k, then for the other banks of the group (i ≠ j): R(i,n,m) ≥ k.

Now we can prove the following theorem:

Theorem 1: The minimum length L of the RR register that allows avoiding bank conflicts is given by:

L = (Q′/N)( ) + 1.   (10)

Proof: Note that the ORR holds requests to different banks. If there is no eligible request for the next b-slot, then the RR would only contain requests to these memory banks, and for at least one bank we would have R(j, 1, L) ≥ Q′/N + 1. From Property 4 we deduce that for the rest of the banks of the group there would be at least one request in ORR+RR, which is in contradiction with the fact that only the banks in the ORR have requests in ORR+RR.
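As an empirical companion to these bounds, the toy simulation below issues requests oldest-first while avoiding recently used banks, and measures the worst observed overtaking. The model (one issue per slot, a fixed busy window standing in for the ORR) and all names are simplifying assumptions of ours, not the exact CFDS system:

```python
# Toy simulator: oldest-first issue with a short bank "busy" window, used
# to observe that reordering stays small and bounded for a given stream.

def max_reorder(requests, busy_slots=1):
    """requests: bank index of each request, in MMA arrival order.
    One request is issued per slot: the oldest whose bank was not used in
    the previous `busy_slots` slots (a stand-in for the ORR). Returns the
    largest number of younger requests that overtake any single request."""
    pending = list(enumerate(requests))  # (arrival_order, bank)
    recent = []                          # banks issued recently
    issue = {}                           # arrival_order -> issue slot
    slot = 0
    while pending:
        pick = next((r for r in pending if r[1] not in recent), None)
        if pick is not None:
            pending.remove(pick)
            issue[pick[0]] = slot
        recent = (recent + [pick[1] if pick else None])[-busy_slots:]
        slot += 1
    return max(
        sum(1 for j in issue if j > k and issue[j] < issue[k])
        for k in issue
    )

# A burst to the same bank forces one overtake; a conflict-free stream
# is issued strictly in arrival order.
assert max_reorder([0, 0, 1, 2]) == 1
assert max_reorder([0, 1, 2, 3]) == 0
```

Such a simulation only checks particular streams, of course; the theorems above bound the worst case over all admissible MMA request sequences.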
Theorem 2: The maximum number of times R_max that a request can be delayed in the RR register is given by:

R_max = (Q′/N)( ).   (11)

Proof: Assume that in ORR+RR there are no requests to memory bank j. Assume further that the first request to memory bank j enters the RR at b-slot t1, and that the n-th request to bank j enters this register at b-slot t_n, with t1 < t2 < ⋯. The first request to bank j will not suffer any additional delay, and it will be issued to the ORR at most L b-slots after being requested by the MMA. If during (t1, t2) there are requests to k different memory banks, the second request to bank j will suffer a delay lower than ( − k) + 1 b-slots. The maximum delay is produced when t2 = t1 + 1. In this case the second request to memory bank j can suffer a delay of (Q′/N)( ) b-slots. In fact, if n < Q′/N, then the maximum delay occurs when t_n = t1 + n − 1, and this request can suffer a delay of n( ) ≤ (Q′/N)( ) b-slots. If n ≥ Q′/N, then from Property 2 we have at least n − Q′/N requests to different memory banks. Consequently, until request n − Q′/N to bank j is issued, this request cannot suffer any further delay. Hence, when request n − Q′/N is issued, the delay of request n can be at most (Q′/N)( ) b-slots.

REFERENCES

[1] S. Iyer, R. Kompella, and N. McKeown, Analysis of a memory architecture for fast packet buffers, in IEEE Workshop on High Performance Switching and Routing, Dallas, Texas, 2001.
[2] S. Iyer, R. Kompella, and N. McKeown, Techniques for fast packet buffers, in Gigabit Networking Workshop, Anchorage, Alaska, USA, 2001.
[3] B. R. Rau, M. S. Schlansker, and D. W. L. Yen, The Cydra 5 stride-insensitive memory system, ICPP, 1989.
[4] M. Valero, T. Lang, J. M. Llabería, M. Peiron, E. Ayguadé, and J. J. Navarro, Increasing the number of strides for conflict-free vector access, ISCA-19, May 1992.
[5] M. Valero, T. Lang, and E. Ayguadé, Conflict-free access of vectors with power-of-two strides, ICS, July 1992.
[6] D. T. Harper III and J. R.
Jump, Performance evaluation of vector accesses in parallel memories using a skewed storage scheme, ISCA-13, 1986.
[7] Hitachi, Hitachi 166 MHz SDRAM, Hitachi HM557XX series, modules.htm.
[8] R. Crisp, Direct Rambus technology: the new main memory standard, IEEE Micro, vol. 17, pp. 18–28, November/December 1997.
[9] P. Shivakumar and N. P. Jouppi, CACTI 3.0: an integrated cache timing, power and area model, Technical Report, Compaq Computer Corp., August 2001.
[10] S. Palacharla, N. P. Jouppi, and J. E. Smith, Complexity-effective superscalar processors, in ISCA-24, 1997.
[11] D. S. Henry, B. C. Kuszmaul, G. H. Loh, and R. Sami, Circuits for wide-window superscalar processors, ISCA-27, 2000.
[12] R. E. Kessler, The Alpha 21264 microprocessor, IEEE Micro, pp. 24–36, March–April 1999.
[13] W. Bux, W. E. Denzel, T. Engbersen, A. Herkersdorf, and R. P. Luijten, Technologies and building blocks for fast packet forwarding, IEEE Communications Magazine, January 2001.
[14] A. K. Parekh and R. G. Gallager, A generalized processor sharing approach to flow control in integrated services networks: the multiple-node case, IEEE/ACM Transactions on Networking, vol. 2, no. 2, pp. 137–150, 1994.

/06/$20.00 © 2006 IEEE


More information

THERE are a growing number of Internet-based applications

THERE are a growing number of Internet-based applications 1362 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 14, NO. 6, DECEMBER 2006 The Stratified Round Robin Scheduler: Design, Analysis and Implementation Sriram Ramabhadran and Joseph Pasquale Abstract Stratified

More information

RECHOKe: A Scheme for Detection, Control and Punishment of Malicious Flows in IP Networks

RECHOKe: A Scheme for Detection, Control and Punishment of Malicious Flows in IP Networks > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < : A Scheme for Detection, Control and Punishment of Malicious Flows in IP Networks Visvasuresh Victor Govindaswamy,

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Real-time targeted influence maximization for online advertisements

Real-time targeted influence maximization for online advertisements Singapore Management University Institutional nowledge at Singapore Management University Research Collection School Of Information Systems School of Information Systems 9-205 Real-time targeted influence

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Utility-Based Rate Control in the Internet for Elastic Traffic

Utility-Based Rate Control in the Internet for Elastic Traffic 272 IEEE TRANSACTIONS ON NETWORKING, VOL. 10, NO. 2, APRIL 2002 Utility-Based Rate Control in the Internet for Elastic Traffic Richard J. La and Venkat Anantharam, Fellow, IEEE Abstract In a communication

More information

RBFT: Redundant Byzantine Fault Tolerance

RBFT: Redundant Byzantine Fault Tolerance RBFT: Redundant Byzantine Fault Tolerance Pierre-Louis Aulin Grenole University Sonia Ben Mokhtar CNRS - LIRIS Vivien Quéma Grenole INP Astract Byzantine Fault Tolerant state machine replication (BFT)

More information

Switch Architecture for Efficient Transfer of High-Volume Data in Distributed Computing Environment

Switch Architecture for Efficient Transfer of High-Volume Data in Distributed Computing Environment Switch Architecture for Efficient Transfer of High-Volume Data in Distributed Computing Environment SANJEEV KUMAR, SENIOR MEMBER, IEEE AND ALVARO MUNOZ, STUDENT MEMBER, IEEE % Networking Research Lab,

More information

Unleashing the Power of Embedded DRAM

Unleashing the Power of Embedded DRAM Copyright 2005 Design And Reuse S.A. All rights reserved. Unleashing the Power of Embedded DRAM by Peter Gillingham, MOSAID Technologies Incorporated Ottawa, Canada Abstract Embedded DRAM technology offers

More information

Hybrids on Steroids: SGX-Based High Performance BFT

Hybrids on Steroids: SGX-Based High Performance BFT Hyrids on Steroids: SGX-Based High Performance BFT Johannes Behl TU Braunschweig ehl@ir.cs.tu-s.de Toias Distler FAU Erlangen-Nürnerg distler@cs.fau.de Rüdiger Kapitza TU Braunschweig rrkapitz@ir.cs.tu-s.de

More information

Application Layer Multicast Algorithm

Application Layer Multicast Algorithm Application Layer Multicast Algorithm Sergio Machado Universitat Politècnica de Catalunya Castelldefels Javier Ozón Universitat Politècnica de Catalunya Castelldefels Abstract This paper presents a multicast

More information

CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007

CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007 CS 344/444 Computer Network Fundamentals Final Exam Solutions Spring 2007 Question 344 Points 444 Points Score 1 10 10 2 10 10 3 20 20 4 20 10 5 20 20 6 20 10 7-20 Total: 100 100 Instructions: 1. Question

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

Interleaving Schemes on Circulant Graphs with Two Offsets

Interleaving Schemes on Circulant Graphs with Two Offsets Interleaving Schemes on Circulant raphs with Two Offsets Aleksandrs Slivkins Department of Computer Science Cornell University Ithaca, NY 14853 slivkins@cs.cornell.edu Jehoshua Bruck Department of Electrical

More information

FIRM: A Class of Distributed Scheduling Algorithms for High-speed ATM Switches with Multiple Input Queues

FIRM: A Class of Distributed Scheduling Algorithms for High-speed ATM Switches with Multiple Input Queues FIRM: A Class of Distributed Scheduling Algorithms for High-speed ATM Switches with Multiple Input Queues D.N. Serpanos and P.I. Antoniadis Department of Computer Science University of Crete Knossos Avenue

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 19: Main Memory. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 19: Main Memory Prof. Onur Mutlu Carnegie Mellon University Last Time Multi-core issues in caching OS-based cache partitioning (using page coloring) Handling

More information

Heuristic Algorithms for Multiconstrained Quality-of-Service Routing

Heuristic Algorithms for Multiconstrained Quality-of-Service Routing 244 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 10, NO 2, APRIL 2002 Heuristic Algorithms for Multiconstrained Quality-of-Service Routing Xin Yuan, Member, IEEE Abstract Multiconstrained quality-of-service

More information

Caching Policies for In-Network Caching

Caching Policies for In-Network Caching Caching Policies for In-Network Caching Zhe Li, Gwendal Simon, Annie Gravey Institut Mines Telecom - Telecom Bretagne, UMR CNRS 674 IRISA Université Européenne de Bretagne, France {firstname.lastname}@telecom-retagne.eu

More information

Network Model for Delay-Sensitive Traffic

Network Model for Delay-Sensitive Traffic Traffic Scheduling Network Model for Delay-Sensitive Traffic Source Switch Switch Destination Flow Shaper Policer (optional) Scheduler + optional shaper Policer (optional) Scheduler + optional shaper cfla.

More information

Appendix B. Standards-Track TCP Evaluation

Appendix B. Standards-Track TCP Evaluation 215 Appendix B Standards-Track TCP Evaluation In this appendix, I present the results of a study of standards-track TCP error recovery and queue management mechanisms. I consider standards-track TCP error

More information

An Approach to Task Attribute Assignment for Uniprocessor Systems

An Approach to Task Attribute Assignment for Uniprocessor Systems An Approach to ttribute Assignment for Uniprocessor Systems I. Bate and A. Burns Real-Time Systems Research Group Department of Computer Science University of York York, United Kingdom e-mail: fijb,burnsg@cs.york.ac.uk

More information

A Hierarchical Fair Service Curve Algorithm for Link-Sharing, Real-Time, and Priority Services

A Hierarchical Fair Service Curve Algorithm for Link-Sharing, Real-Time, and Priority Services IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 8, NO. 2, APRIL 2000 185 A Hierarchical Fair Service Curve Algorithm for Link-Sharing, Real-Time, and Priority Services Ion Stoica, Hui Zhang, Member, IEEE, and

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Achieve Significant Throughput Gains in Wireless Networks with Large Delay-Bandwidth Product

Achieve Significant Throughput Gains in Wireless Networks with Large Delay-Bandwidth Product Available online at www.sciencedirect.com ScienceDirect IERI Procedia 10 (2014 ) 153 159 2014 International Conference on Future Information Engineering Achieve Significant Throughput Gains in Wireless

More information

Episode 5. Scheduling and Traffic Management

Episode 5. Scheduling and Traffic Management Episode 5. Scheduling and Traffic Management Part 2 Baochun Li Department of Electrical and Computer Engineering University of Toronto Outline What is scheduling? Why do we need it? Requirements of a scheduling

More information

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes Page 1/ 16 by Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 233 6.2 Types of Memory 233 6.3 The Memory Hierarchy 235 6.3.1 Locality of Reference 237 6.4 Cache Memory 237 6.4.1 Cache Mapping Schemes 239 6.4.2 Replacement Policies 247

More information

Design of a Weighted Fair Queueing Cell Scheduler for ATM Networks

Design of a Weighted Fair Queueing Cell Scheduler for ATM Networks Design of a Weighted Fair Queueing Cell Scheduler for ATM Networks Yuhua Chen Jonathan S. Turner Department of Electrical Engineering Department of Computer Science Washington University Washington University

More information

Using Hybrid Algorithm in Wireless Ad-Hoc Networks: Reducing the Number of Transmissions

Using Hybrid Algorithm in Wireless Ad-Hoc Networks: Reducing the Number of Transmissions Using Hybrid Algorithm in Wireless Ad-Hoc Networks: Reducing the Number of Transmissions R.Thamaraiselvan 1, S.Gopikrishnan 2, V.Pavithra Devi 3 PG Student, Computer Science & Engineering, Paavai College

More information

CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF

CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF CS244 Advanced Topics in Computer Networks Midterm Exam Monday, May 2, 2016 OPEN BOOK, OPEN NOTES, INTERNET OFF Your Name: Answers SUNet ID: root @stanford.edu In accordance with both the letter and the

More information

ABSTRACT I. INTRODUCTION

ABSTRACT I. INTRODUCTION International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 1 ISSN : 2456-3307 Distributed Packet Buffers for High Bandwidth Switches

More information

RED behavior with different packet sizes

RED behavior with different packet sizes RED behavior with different packet sizes Stefaan De Cnodder, Omar Elloumi *, Kenny Pauwels Traffic and Routing Technologies project Alcatel Corporate Research Center, Francis Wellesplein, 1-18 Antwerp,

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control,

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control, UNIT - 7 Basic Processing Unit: Some Fundamental Concepts, Execution of a Complete Instruction, Multiple Bus Organization, Hard-wired Control, Microprogrammed Control Page 178 UNIT - 7 BASIC PROCESSING

More information

The Design and Performance Analysis of QoS-Aware Edge-Router for High-Speed IP Optical Networks

The Design and Performance Analysis of QoS-Aware Edge-Router for High-Speed IP Optical Networks The Design and Performance Analysis of QoS-Aware Edge-Router for High-Speed IP Optical Networks E. Kozlovski, M. Düser, R. I. Killey, and P. Bayvel Department of and Electrical Engineering, University

More information

36 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 1, MARCH 2008

36 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 1, MARCH 2008 36 IEEE TRANSACTIONS ON BROADCASTING, VOL. 54, NO. 1, MARCH 2008 Continuous-Time Collaborative Prefetching of Continuous Media Soohyun Oh, Beshan Kulapala, Andréa W. Richa, and Martin Reisslein Abstract

More information

Episode 5. Scheduling and Traffic Management

Episode 5. Scheduling and Traffic Management Episode 5. Scheduling and Traffic Management Part 3 Baochun Li Department of Electrical and Computer Engineering University of Toronto Outline What is scheduling? Why do we need it? Requirements of a scheduling

More information

QoS Provisioning Using IPv6 Flow Label In the Internet

QoS Provisioning Using IPv6 Flow Label In the Internet QoS Provisioning Using IPv6 Flow Label In the Internet Xiaohua Tang, Junhua Tang, Guang-in Huang and Chee-Kheong Siew Contact: Junhua Tang, lock S2, School of EEE Nanyang Technological University, Singapore,

More information

Binary Decision Tree Using Genetic Algorithm for Recognizing Defect Patterns of Cold Mill Strip

Binary Decision Tree Using Genetic Algorithm for Recognizing Defect Patterns of Cold Mill Strip Binary Decision Tree Using Genetic Algorithm for Recognizing Defect Patterns of Cold Mill Strip Kyoung Min Kim 1,4, Joong Jo Park, Myung Hyun Song 3, In Cheol Kim 1, and Ching Y. Suen 1 1 Centre for Pattern

More information

CS244a: An Introduction to Computer Networks

CS244a: An Introduction to Computer Networks Do not write in this box MCQ 9: /10 10: /10 11: /20 12: /20 13: /20 14: /20 Total: Name: Student ID #: CS244a Winter 2003 Professor McKeown Campus/SITN-Local/SITN-Remote? CS244a: An Introduction to Computer

More information

Proper colouring Painter-Builder game

Proper colouring Painter-Builder game Proper colouring Painter-Builder game Małgorzata Bednarska-Bzdęga Michael Krivelevich arxiv:1612.02156v3 [math.co] 7 Nov 2017 Viola Mészáros Clément Requilé Novemer 6, 2018 Astract We consider the following

More information

Chapter 1. Introduction

Chapter 1. Introduction Chapter 1 Introduction In a packet-switched network, packets are buffered when they cannot be processed or transmitted at the rate they arrive. There are three main reasons that a router, with generic

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

CENG3420 Lecture 08: Memory Organization

CENG3420 Lecture 08: Memory Organization CENG3420 Lecture 08: Memory Organization Bei Yu byu@cse.cuhk.edu.hk (Latest update: February 22, 2018) Spring 2018 1 / 48 Overview Introduction Random Access Memory (RAM) Interleaving Secondary Memory

More information