On the Efficient Implementation of Pipelined Heaps for Network Processing
Hao Wang and Bill Lin
University of California, San Diego, La Jolla, CA
{wanghao,billlin}@ucsd.edu

Abstract

Priority queues are used in many network processing applications, including sophisticated per-flow scheduling for providing advanced Quality-of-Service (QoS) guarantees, fast packet buffer memory management, and exact maintenance of statistics counters for real-time network measurements. In all these applications, the priority queues must operate at very high speeds, e.g. at 40 Gbps rates and beyond. One widely used data structure for implementing priority queues is the heap. However, the logarithmic time complexity of heap operations is often too slow for increasingly fast line rates. To achieve constant time complexity, the pipelined heap structure has been proposed. In this paper, we describe new architectural techniques for the efficient implementation of pipelined heaps. In particular, we focus on aggressive memory management and pipelining techniques.

I. INTRODUCTION

Priority queues are used in many network processing applications. For example, per-flow queueing with sophisticated scheduling based on Weighted Fair Queueing (WFQ) service disciplines has been proposed as an approach for providing advanced Quality-of-Service (QoS) guarantees [1], [2]. In advanced WFQ scheduling algorithms, packets are timestamped with a required departure time necessary to meet QoS guarantees. Priority queues are used in these algorithms to ensure that packets are serviced in the order of their departure deadlines. Another application is in the management of per-flow queues using DRAMs. To achieve aggressive line rates, random access to per-flow queues is necessary at high speeds. Therefore, portions of the queues, namely the heads and tails of these queues, must be maintained in SRAM to achieve line rate [3].
However, to provide bulk storage, the middle portion of the queues must be maintained in external DRAMs. One approach to managing the migration of packets between SRAMs and DRAMs is based on replenishing the heads of queues that are the shortest. Priority queues are again used to provide this real-time sorting. Finally, priority queues have been used for the maintenance of exact statistics counters for real-time network measurements [4], [5]. In all these applications, the priority queues must operate at very high speeds, e.g. at 40 Gbps rates and beyond. One widely used data structure for implementing priority queues is the heap. However, operations on a heap require O(log N) time, which is still too slow for fast line rates. To achieve O(1) time complexity, we proposed in our earlier work the pipelined heap structure [6]. In the pipelined heap, each layer of the heap corresponds to a pipeline stage. Unlike a conventional binary heap, the operations on a pipelined heap all work from the top of the heap downwards, progressing through the pipeline stages in one direction. Each pipeline stage has an associated memory structure that stores the nodes in the corresponding layer of the heap. However, since the number of nodes in a layer of the heap is twice the number of nodes in the immediately preceding layer, there is an enormous disparity between the memory sizes at the top layers of the heap and those at the bottom layers, making it difficult to provide any regularity in the memory structures. Moreover, the top layers of the heap are too small for structured memory. In this paper, we explore more efficient memory organizations for pipelined heaps. In addition, we explore implementation details of the pipelined heap to achieve more aggressive pipelining. The remainder of this paper is organized as follows.
In Section II, we provide an overview of the pipelined heap structure and its operations. In Section III, we show how the pipelined heap can be implemented so that a new operation can be initiated on every clock cycle to support very fast line rates. In Section IV, we present our ideas on efficient memory organization and management of pipelined heaps. We first consider the case of a forest of pipelined heaps. We then show how a similar idea can be applied to a single pipelined heap by partitioning the heap into sub-heaps. Finally, we conclude in Section V.

This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2006 proceedings.

II. THE PIPELINED HEAP

In this section, we summarize how a pipelined heap works. The reader is referred to [6] for more details. The basic pipelined heap data structure is shown in Figure 1, where each node stores a value and a capacity field. We assume here a maximum heap where the largest value of the heap is at the top. Our pipelined heap is essentially the same as a conventional binary heap, except that every node has an additional capacity field that indicates the number of unused locations in the sub-heap rooted at that node.

Fig. 1. The pipelined heap structure.

The implementation architecture of the pipelined heap is depicted in Figure 2. In this architecture, each pipeline stage corresponds to a layer in the heap, with a memory M_i holding the nodes of layer i.

Fig. 2. Implementation architecture of the pipelined heap.

In our pipelined heap, the operations ENQUEUE, DEQUEUE, and ENQUEUE-DEQUEUE are supported. For ENQUEUE, a new entry is inserted into the heap. For DEQUEUE, the minimum value of the heap is deleted from the heap and returned to the port. In the ENQUEUE-DEQUEUE operation, the minimum value of the heap is replaced with a new entry. For each stage i, the specified operation op_i, the operation position in the heap p_i, and a data argument arg_i are transferred from the previous stage. For the first stage, the operation is specified externally, and the operation position is always the root. The data argument is the value we want to put into the heap if the operation is ENQUEUE or ENQUEUE-DEQUEUE; it is void for a DEQUEUE operation.

Our pipelined heap differs from a traditional heap in that it always works in a top-down fashion. For an ENQUEUE operation, at stage i, position p_j, we compare the value of the argument arg_i with the values in the memory at this stage at positions p_j and p_j + 1. We find the first node (say at position p_k, where p_k is either p_j or p_j + 1) whose value is greater than arg_i and whose capacity is greater than 0, exchange the two values, and decrease the capacity of node p_k by one. If there is no value greater than arg_i, then we find the first node with capacity greater than zero. We then pass arg_i, the operation op_i, and p_k to the next stage. If there is no node with capacity greater than 0, then we report that the heap is full and terminate the operation. Overall, the ENQUEUE operation takes up to three cycles at each stage: 1) read the corresponding memory in this stage, 2) compare the value to be inserted with the values read, and 3) a possible exchange between a value in the memory and the value to be inserted. For a DEQUEUE operation, at stage i, position p_j, we compare the value of the argument arg_i with the values in the memory locations at the next stage that correspond to the children of node p_j, from left to right.
Let p_k be the position of the smaller of the two children. We then move the value at node p_k to node p_j, and we delete the value stored at node p_k. After this, we increase the capacity of node p_k by one. Finally, we pass arg_i, op_i, and p_k to the next stage. Overall, the DEQUEUE operation takes three cycles at each stage: 1) read the children of node p_j from the memory of the next stage, 2) compare those values and find the minimum, and 3) remove this value and put it into node p_j. For the ENQUEUE-DEQUEUE operation, there is a new entry waiting to be inserted into the heap. At stage 1, we use this new entry to replace the root. Then the children of the root are accessed. We find the minimum of those two entries (at position p_j) and exchange it with the entry at the root. The name of this operation (i.e. ENQUEUE-DEQUEUE) and the position p_j are sent down to stage 2. All other stages follow the same procedure. At stage i, position p_j, we compare the entry of node p_j with the entries of the children of node p_j in the memory at the next stage, and find the minimum of these entries. If the minimum entry is at position p_j, then the operation is done. Otherwise (assuming it is at position p_k), we exchange the entry at position p_j with the one at position p_k. Then arg_i, op_i, and p_k are sent down to the following stage. Overall, the ENQUEUE-DEQUEUE operation takes up to three cycles per stage: 1) read the children of node p_j from the memory of the next stage, 2) compare the value at the current position p_j with the values read and find the minimum, and 3) a possible exchange between a value in the memory and the value at the current position. From the above, we can see that all three operations can be divided into three cycles, so we can advance an operation down the pipeline at a rate of one stage every three cycles.
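The per-stage behavior described above can be summarized in a small software model. This is an illustrative sketch, not the hardware design: the class and helper names are our own, each loop iteration stands in for one pipeline stage, values are kept smallest-on-top to match the DEQUEUE-returns-the-minimum description, and the capacity fields are updated along the traversed path as in the prose.

```python
class PipelinedHeap:
    """Software model of a pipelined heap: every operation walks top-down."""

    def __init__(self, levels):
        size = 2 ** levels - 1
        self.val = [None] * (size + 1)          # 1-indexed node values
        self.cap = [0] * (size + 1)             # free slots in each node's sub-heap
        for i in range(1, size + 1):
            self.cap[i] = 2 ** (levels - i.bit_length() + 1) - 1

    def enqueue(self, x):
        """Insert x; each loop iteration models one pipeline stage."""
        p = 1
        while True:
            if self.cap[p] == 0:
                raise OverflowError("heap full")
            self.cap[p] -= 1                    # one fewer free slot below p
            if self.val[p] is None:             # empty slot: x settles here
                self.val[p] = x
                return
            if x < self.val[p]:                 # smaller value stays on top
                self.val[p], x = x, self.val[p]
            # the displaced value continues into a child with spare capacity
            p = 2 * p if self.cap[2 * p] > 0 else 2 * p + 1

    def dequeue(self):
        """Delete and return the root; smaller children are pulled upward."""
        top, p = self.val[1], 1
        while True:
            self.cap[p] += 1                    # one more free slot below p
            kids = [c for c in (2 * p, 2 * p + 1)
                    if c < len(self.val) and self.val[c] is not None]
            if not kids:                        # reached a hole: stop here
                self.val[p] = None
                return top
            k = min(kids, key=lambda c: self.val[c])
            self.val[p] = self.val[k]           # child moves up one layer
            p = k
```

An ENQUEUE-DEQUEUE can then be modeled as overwriting the root with the new entry and running the same downward fix-up as in dequeue.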
We will show that, in this pipelined heap structure, all operations can be pipelined in a one-cycle-per-stage fashion. That is, we can admit one new operation into the system each cycle, as discussed in the following section.

III. SINGLE-CYCLE OPERATION

The problem with initiating one new operation every clock cycle is that an operation would have to wait at least two cycles for the previous operation to resolve the path it takes down the heap. So, in order to achieve a higher processing rate, more entries need to be read from memory and more memories need to be considered at each stage. In the following, we outline how this can be achieved for each of the pipelined heap operations, using the partial pipelined heap of Figure 3, with root A, children B and C, grandchildren D through G, and leaves H through O.

Fig. 3. Part of a pipelined heap.

ENQUEUE: Suppose we want to insert a new value in place of A, with the value displaced from A then replacing B, the value displaced from B replacing D, and so on. To achieve one stage per cycle, we can follow the pipelined operations depicted in Figure 4. In clock cycle 1, we have to start reading data at the next stage, but at this time we are not sure whether we should use B or C; instead, both of them are read. In cycle 2, the result of the comparison at A is known, so we only need to compare against B. At the same time, D and E are read two stages down, since we do not yet have the result of the comparison at B. In this way, we achieve one stage per cycle. The requirement is that each memory has a read throughput of 2 values per cycle and a write throughput of 1 value per cycle.

Fig. 4. Improved pipelined ENQUEUE operation.

DEQUEUE: Consider again the partial pipelined heap shown in Figure 3, and now suppose we want to delete A, then use B to replace A, and then use D to replace B. Figure 5 illustrates the operation pipeline for achieving one stage per cycle. In clock cycle 1, we have to start reading data, but at this time we are not sure whether B or C is smaller; instead, the children of both B and C are read. In cycle 2, the minimum of B and C is known, so we only need to consider the children of B. At the same time, H, I, J, and K are read two stages down, since we do not yet have the result of the comparison of D and E. In this way, we achieve one stage per cycle. The requirement is that each memory has a read throughput of 4 values per cycle and a write throughput of 1 value per cycle.

Fig. 5. Improved pipelined DEQUEUE operation.

ENQUEUE-DEQUEUE: Following the same procedure, now suppose we want to use a new entry to replace A, use B to replace it, and then use D to replace B. The operation pipeline is illustrated in Figure 6. In clock cycle 1, we have to start reading data, but at this time we are not sure which entry is the minimum; instead, the children of both B and C are read. In cycle 2, the minimum is known, so we only need to compare B, D, and E. At the same time, H, I, J, and K are read two stages down, since we do not yet have the result of the comparison of B, D, and E. In this way, we achieve one stage per cycle. The requirement is that each memory has a read throughput of 4 values per cycle and a write throughput of 1 value per cycle.

Fig. 6. Improved pipelined ENQUEUE-DEQUEUE operation.

Inter-Operation Management: If all three operations are supported in the same heap, inter-operation management must be considered. We can treat a DEQUEUE operation as an ENQUEUE-DEQUEUE operation that inserts an entry with a void value, so the inter-operation management for DEQUEUE and ENQUEUE-DEQUEUE is the same. Hence, there are four cases to consider: ENQUEUE followed by ENQUEUE, ENQUEUE followed by DEQUEUE (or ENQUEUE-DEQUEUE), DEQUEUE (or ENQUEUE-DEQUEUE) followed by ENQUEUE, and DEQUEUE (or ENQUEUE-DEQUEUE) followed by DEQUEUE (or ENQUEUE-DEQUEUE). For the first three cases, we can achieve one operation per cycle: once one operation moves on to the next stage, we can insert another, and each stage completes in one cycle. For the last case, a DEQUEUE followed by another DEQUEUE, we can only insert a new operation every two stages: the next DEQUEUE needs to access the memory at the next stage, so it must wait until the previous one finishes writing to the memory of that stage. Overall, this case sustains one operation every two cycles.

IV. EFFICIENT MEMORY MANAGEMENT

A. Memory management for a forest of pipelined heaps

In applications of pipelined heaps, multiple heap structures are frequently used.
For conventional heaps, all nodes are normally stored in an array in a single memory structure. In a pipelined heap, however, the nodes are partitioned across log N memories, each corresponding to one of the log N pipeline stages. These memories are of different sizes: for stage 1 the size is 1, for stage 2 it is 2, and in general for stage k it is 2^(k-1). For a large heap, the disparity between the memory sizes at the different layers can be enormous, making it difficult to maintain much regularity in the memory structures. Instead, we can manage the memories for a forest of pipelined heaps together. For pipelined heaps with 6 stages, the memory sizes at the 6 stages are 1, 2, 4, 8, 16, and 32, respectively. Assume we have 10 heaps altogether; then the memory sizes at each of the 6 stages for this forest would be 10, 20, 40, 80, 160, and 320, respectively, if we were to implement the forest in a straightforward manner, and there would still be substantial disparity between the first stage and the last stage. Alternatively, we can organize the forest of heaps as shown in Figure 7: half of the heaps are inverted and then interleaved with the other half. This way, the large storage requirements of the bottom layers of one heap are offset by the small storage requirements of the top layers of another heap. Consider again a forest of 10 heaps, each with 6 stages. By inverting half of the heaps in an alternating manner, the memory requirements at the 6 stages become 165, 90, 60, 60, 90, and 165, respectively, which shows much less disparity between stages. Consider another simple example shown in Figure 8.

Fig. 7. Simplified forest of heaps.
Fig. 8. Example of memories of heaps.
By inverting half of the heaps in an alternating manner, the memory for this forest of heaps can be organized as shown in Figure 9. This forest of pipelined heaps needs to work in both directions: for pipelined heaps that are not inverted, the memory is traversed from top to bottom, while for inverted heaps it is traversed from bottom to top. However, in the implementation architecture it is more logical to have the pipeline stages move in only one direction. In the case of a single pipelined heap, each pipeline stage needs to access the memory structure at its own stage as well as at the next stage, as depicted in Figure 2.

Fig. 9. Memory of a forest of pipelined heaps.
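The per-stage totals quoted above are easy to check. The short calculation below (the function name is ours) reproduces the 165, 90, 60, 60, 90, and 165 figures for the 10-heap, 6-stage example by pairing each stage of the upright heaps with the mirrored stage of the inverted heaps.

```python
def forest_stage_sizes(num_heaps, stages):
    """Total memory per stage when half the heaps are stored inverted."""
    upright = [2 ** s for s in range(stages)]   # 1, 2, 4, ... per upright heap
    inverted = upright[::-1]                    # 32, 16, 8, ... per inverted heap
    half = num_heaps // 2
    return [half * inverted[s] + (num_heaps - half) * upright[s]
            for s in range(stages)]
```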
Fig. 10. Implementation structure for a forest of pipelined heaps.
Fig. 11. Memory management for one stage.

To support inverted heaps that alternate with non-inverted heaps in a forest, we extend the implementation architecture so that each pipeline stage can also access the memory structures of the corresponding inverted memory layers, as shown in Figure 10. This way, the logical pipeline can still proceed in one physical direction. To hide these details from the control logic that implements the pipelined heap operations, a level of memory mapping can be inserted in hardware, as depicted in Figure 11. However, to implement this scheme, each memory structure needs to support potentially twice as many memory operations. Therefore, to support one stage per cycle, each memory structure needs to support a read throughput of 8 values and a write throughput of 2 values per cycle.

B. Memory management for a single pipelined heap

Even for a single pipelined heap, there is a better way of organizing the memory that can achieve better performance. For the heap arrangement shown in Figure 12, the memory chips for the left-hand side are of sizes 1, 2, 4, and 8, and the memory chips for the right-hand side are of sizes 7 and 8. For a heap with 6 stages, the per-stage memories of a straightforward pipelined heap are of sizes 1, 2, 4, 8, 16, and 32; after rearrangement, the memories are of sizes 21, 18, and 24. After the rearrangement, the memory sizes are more evenly distributed, which makes them easier to realize using SRAM. However, a price is paid in more complex memory control logic: since at any one time only one operation can be entering the top triangle and only one operation can be entering the set of bottom triangles, the memory must provide a read throughput of 8 values and a write throughput of 2 values per cycle.

Fig. 12. Single pipelined heap arrangement.
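One way to picture the memory-mapping level of Figure 11 is as a simple stage-to-bank translation. As an illustration, assume the odd-numbered heaps are the inverted ones (this numbering is our assumption, not fixed by the architecture); the sketch below then lets the logical pipeline advance through physical banks in a single direction for both orientations.

```python
def physical_bank(heap_id, stage, num_stages):
    """Map (heap, logical stage) to a physical memory bank.

    Assumption for illustration: odd-numbered heaps are the inverted ones,
    so they use the banks in reverse order.
    """
    inverted = heap_id % 2 == 1
    return (num_stages - 1 - stage) if inverted else stage
```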
V. CONCLUSION

In this paper, we presented new memory management techniques that enable a more efficient implementation of pipelined heaps. These techniques are aimed at resolving the disparity in storage requirements between the top and bottom layers of a pipelined heap. In addition, we explored efficient implementation techniques that enable more aggressive pipelining. Together, they enable a faster priority queue implementation, which has important applications in network processing, including advanced QoS-based scheduling, fast packet buffer management, and network measurement.

REFERENCES

[1] A. K. Parekh and R. G. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: The single-node case," IEEE/ACM Transactions on Networking, vol. 1, 1993.
[2] A. Demers, S. Keshav, and S. Shenker, "Analysis and simulation of a fair queueing algorithm," Proceedings of SIGCOMM '89, pp. 1-12, Austin, TX, Sept. 1989.
[3] S. Iyer and N. McKeown, "Designing buffers for router line cards," Stanford University HPNG Technical Report TR02-HPNG, Stanford, CA, Mar. 2002.
[4] D. Shah, "Analysis of a Statistics Counter Architecture," Proc. IEEE Hot Interconnects 9, IEEE CS Press, Los Alamitos, CA, 2001.
[5] M. Roeder and B. Lin, "Maintaining Exact Statistics Counters with a Multi-Level Counter Memory," IEEE Global Communications Conference (Globecom '04), vol. 2, Dallas, Texas, Nov. 2004.
[6] R. Bhagwan and B. Lin, "Fast and scalable priority queue architecture for high-speed network switches," Proceedings of INFOCOM 2000, Tel Aviv, Israel, 2000.
[7] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, McGraw-Hill Book Company.
More informationLecture 8 13 March, 2012
6.851: Advanced Data Structures Spring 2012 Prof. Erik Demaine Lecture 8 13 March, 2012 1 From Last Lectures... In the previous lecture, we discussed the External Memory and Cache Oblivious memory models.
More informationMaster Course Computer Networks IN2097
Chair for Network Architectures and Services Prof. Carle Department for Computer Science TU München Chair for Network Architectures and Services Prof. Carle Department for Computer Science TU München Master
More informationPower Efficient IP Lookup with Supernode Caching
Power Efficient IP Lookup with Supernode Caching Lu Peng, Wencheng Lu * and Lide Duan Department of Electrical & Computer Engineering Louisiana State University Baton Rouge, LA 73 {lpeng, lduan1}@lsu.edu
More informationParallel Databases C H A P T E R18. Practice Exercises
C H A P T E R18 Parallel Databases Practice Exercises 181 In a range selection on a range-partitioned attribute, it is possible that only one disk may need to be accessed Describe the benefits and drawbacks
More informationOperations on Heap Tree The major operations required to be performed on a heap tree are Insertion, Deletion, and Merging.
Priority Queue, Heap and Heap Sort In this time, we will study Priority queue, heap and heap sort. Heap is a data structure, which permits one to insert elements into a set and also to find the largest
More informationQuestion 7.11 Show how heapsort processes the input:
Question 7.11 Show how heapsort processes the input: 142, 543, 123, 65, 453, 879, 572, 434, 111, 242, 811, 102. Solution. Step 1 Build the heap. 1.1 Place all the data into a complete binary tree in the
More informationMark Sandstrom ThroughPuter, Inc.
Hardware Implemented Scheduler, Placer, Inter-Task Communications and IO System Functions for Many Processors Dynamically Shared among Multiple Applications Mark Sandstrom ThroughPuter, Inc mark@throughputercom
More informationImplementations. Priority Queues. Heaps and Heap Order. The Insert Operation CS206 CS206
Priority Queues An internet router receives data packets, and forwards them in the direction of their destination. When the line is busy, packets need to be queued. Some data packets have higher priority
More informationGeneric Architecture. EECS 122: Introduction to Computer Networks Switch and Router Architectures. Shared Memory (1 st Generation) Today s Lecture
Generic Architecture EECS : Introduction to Computer Networks Switch and Router Architectures Computer Science Division Department of Electrical Engineering and Computer Sciences University of California,
More informationMohammad Hossein Manshaei 1393
Mohammad Hossein Manshaei manshaei@gmail.com 1393 Voice and Video over IP Slides derived from those available on the Web site of the book Computer Networking, by Kurose and Ross, PEARSON 2 Multimedia networking:
More informationNetworking Acronym Smorgasbord: , DVMRP, CBT, WFQ
Networking Acronym Smorgasbord: 802.11, DVMRP, CBT, WFQ EE122 Fall 2011 Scott Shenker http://inst.eecs.berkeley.edu/~ee122/ Materials with thanks to Jennifer Rexford, Ion Stoica, Vern Paxson and other
More informationCS 3330 Final Exam Spring 2016 Computing ID:
S 3330 Spring 2016 Final xam Variant O page 1 of 10 mail I: S 3330 Final xam Spring 2016 Name: omputing I: Letters go in the boxes unless otherwise specified (e.g., for 8 write not 8 ). Write Letters clearly:
More informationHeaps. Heaps Priority Queue Revisit HeapSort
Heaps Heaps Priority Queue Revisit HeapSort Heaps A heap is a complete binary tree in which the nodes are organized based on their data values. For each non- leaf node V, max- heap: the value in V is greater
More informationTHERE are a growing number of Internet-based applications
1362 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 14, NO. 6, DECEMBER 2006 The Stratified Round Robin Scheduler: Design, Analysis and Implementation Sriram Ramabhadran and Joseph Pasquale Abstract Stratified
More informationTree-Based Minimization of TCAM Entries for Packet Classification
Tree-Based Minimization of TCAM Entries for Packet Classification YanSunandMinSikKim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington 99164-2752, U.S.A.
More informationRouters & Routing : Computer Networking. Binary Search on Ranges. Speeding up Prefix Match - Alternatives
Routers & Routing -44: omputer Networking High-speed router architecture Intro to routing protocols ssigned reading [McK9] Fast Switched ackplane for a Gigabit Switched Router Know RIP/OSPF L-4 Intra-omain
More informationSystem of Systems Complexity Identification and Control
System of Systems Identification and ontrol Joseph J. Simpson Systems oncepts nd ve NW # Seattle, W jjs-sbw@eskimo.com Mary J. Simpson Systems oncepts nd ve NW # Seattle, W mjs-sbw@eskimo.com bstract System
More informationScalable Schedulers for High-Performance Switches
Scalable Schedulers for High-Performance Switches Chuanjun Li and S Q Zheng Mei Yang Department of Computer Science Department of Computer Science University of Texas at Dallas Columbus State University
More informationComparison Sorts. Chapter 9.4, 12.1, 12.2
Comparison Sorts Chapter 9.4, 12.1, 12.2 Sorting We have seen the advantage of sorted data representations for a number of applications Sparse vectors Maps Dictionaries Here we consider the problem of
More informationFUTURE communication networks are expected to support
1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,
More informationMUD: Send me your top 1 3 questions on this lecture
Administrivia Review 1 due tomorrow Email your reviews to me Office hours on Thursdays 10 12 MUD: Send me your top 1 3 questions on this lecture Guest lectures next week by Prof. Richard Martin Class slides
More informationDynamic Scheduling Algorithm for input-queued crossbar switches
Dynamic Scheduling Algorithm for input-queued crossbar switches Mihir V. Shah, Mehul C. Patel, Dinesh J. Sharma, Ajay I. Trivedi Abstract Crossbars are main components of communication switches used to
More information1 Interlude: Is keeping the data sorted worth it? 2 Tree Heap and Priority queue
TIE-0106 1 1 Interlude: Is keeping the data sorted worth it? When a sorted range is needed, one idea that comes to mind is to keep the data stored in the sorted order as more data comes into the structure
More informationVirtual Memory. Chapter 8
Chapter 8 Virtual Memory What are common with paging and segmentation are that all memory addresses within a process are logical ones that can be dynamically translated into physical addresses at run time.
More informationlecture notes September 2, How to sort?
.30 lecture notes September 2, 203 How to sort? Lecturer: Michel Goemans The task of sorting. Setup Suppose we have n objects that we need to sort according to some ordering. These could be integers or
More informationHeap: A binary heap is a complete binary tree in which each, node other than root is smaller than its parent. Heap example: Fig 1. NPTEL IIT Guwahati
Heap sort is an efficient sorting algorithm with average and worst case time complexities are in O(n log n). Heap sort does not use any extra array, like merge sort. This method is based on a data structure
More informationEfficient pebbling for list traversal synopses
Efficient pebbling for list traversal synopses Yossi Matias Ely Porat Tel Aviv University Bar-Ilan University & Tel Aviv University Abstract 1 Introduction 1.1 Applications Consider a program P running
More informationCOMP3121/3821/9101/ s1 Assignment 1
Sample solutions to assignment 1 1. (a) Describe an O(n log n) algorithm (in the sense of the worst case performance) that, given an array S of n integers and another integer x, determines whether or not
More informationRange Queries. Kuba Karpierz, Bruno Vacherot. March 4, 2016
Range Queries Kuba Karpierz, Bruno Vacherot March 4, 2016 Range query problems are of the following form: Given an array of length n, I will ask q queries. Queries may ask some form of question about a
More informationAlgorithm Analysis Advanced Data Structure. Chung-Ang University, Jaesung Lee
Algorithm Analysis Advanced Data Structure Chung-Ang University, Jaesung Lee Priority Queue, Heap and Heap Sort 2 Max Heap data structure 3 Representation of Heap Tree 4 Representation of Heap Tree 5 Representation
More informationEECS 122: Introduction to Computer Networks Switch and Router Architectures. Today s Lecture
EECS : Introduction to Computer Networks Switch and Router Architectures Computer Science Division Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley,
More informationMaster Course Computer Networks IN2097
Chair for Network Architectures and Services Prof. Carle Department for Computer Science TU München Master Course Computer Networks IN2097 Prof. Dr.-Ing. Georg Carle Christian Grothoff, Ph.D. Chair for
More informationLecture 16: Network Layer Overview, Internet Protocol
Lecture 16: Network Layer Overview, Internet Protocol COMP 332, Spring 2018 Victoria Manfredi Acknowledgements: materials adapted from Computer Networking: A Top Down Approach 7 th edition: 1996-2016,
More informationPRINCIPLES OF COMPILER DESIGN UNIT I INTRODUCTION TO COMPILERS
Objective PRINCIPLES OF COMPILER DESIGN UNIT I INTRODUCTION TO COMPILERS Explain what is meant by compiler. Explain how the compiler works. Describe various analysis of the source program. Describe the
More informationAnalysis of Algorithms
Algorithm An algorithm is a procedure or formula for solving a problem, based on conducting a sequence of specified actions. A computer program can be viewed as an elaborate algorithm. In mathematics and
More informationLECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS
Department of Computer Science University of Babylon LECTURE NOTES OF ALGORITHMS: DESIGN TECHNIQUES AND ANALYSIS By Faculty of Science for Women( SCIW), University of Babylon, Iraq Samaher@uobabylon.edu.iq
More informationLecture 19 Sorting Goodrich, Tamassia
Lecture 19 Sorting 7 2 9 4 2 4 7 9 7 2 2 7 9 4 4 9 7 7 2 2 9 9 4 4 2004 Goodrich, Tamassia Outline Review 3 simple sorting algorithms: 1. selection Sort (in previous course) 2. insertion Sort (in previous
More informationTechnical University of Denmark
Technical University of Denmark Written examination, May 7, 27. Course name: Algorithms and Data Structures Course number: 2326 Aids: Written aids. It is not permitted to bring a calculator. Duration:
More informationPerformance of Multihop Communications Using Logical Topologies on Optical Torus Networks
Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,
More informationOPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS
Chapter 2 OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS Hanan Luss and Wai Chen Telcordia Technologies, Piscataway, New Jersey 08854 hluss@telcordia.com, wchen@research.telcordia.com Abstract:
More informationHierarchically Aggregated Fair Queueing (HAFQ) for Per-flow Fair Bandwidth Allocation in High Speed Networks
Hierarchically Aggregated Fair Queueing () for Per-flow Fair Bandwidth Allocation in High Speed Networks Ichinoshin Maki, Hideyuki Shimonishi, Tutomu Murase, Masayuki Murata, Hideo Miyahara Graduate School
More informationDesigning High-Speed ATM Switch Fabrics by Using Actel FPGAs
pplication Note C105 esigning High-Speed TM Switch Fabrics by Using ctel FPGs The recent upsurge of interest in synchronous Transfer Mode (TM) is based on the recognition that it represents a new level
More informationA Proposal for a High Speed Multicast Switch Fabric Design
A Proposal for a High Speed Multicast Switch Fabric Design Cheng Li, R.Venkatesan and H.M.Heys Faculty of Engineering and Applied Science Memorial University of Newfoundland St. John s, NF, Canada AB X
More informationLecture Notes on Priority Queues
Lecture Notes on Priority Queues 15-122: Principles of Imperative Computation Frank Pfenning Lecture 16 October 18, 2012 1 Introduction In this lecture we will look at priority queues as an abstract type
More informationCounter Braids: A novel counter architecture
Counter Braids: A novel counter architecture Balaji Prabhakar Balaji Prabhakar Stanford University Joint work with: Yi Lu, Andrea Montanari, Sarang Dharmapurikar and Abdul Kabbani Overview Counter Braids
More informationCSCI2100B Data Structures Heaps
CSCI2100B Data Structures Heaps Irwin King king@cse.cuhk.edu.hk http://www.cse.cuhk.edu.hk/~king Department of Computer Science & Engineering The Chinese University of Hong Kong Introduction In some applications,
More informationPi-PIFO: A Scalable Pipelined PIFO Memory Management Architecture
Pi-PIFO: A Scalable Pipelined PIFO Memory Management Architecture Steven Young, Student Member, IEEE, Itamar Arel, Senior Member, IEEE, Ortal Arazi, Member, IEEE Networking Research Group Electrical Engineering
More informationCSE 214 Computer Science II Heaps and Priority Queues
CSE 214 Computer Science II Heaps and Priority Queues Spring 2018 Stony Brook University Instructor: Shebuti Rayana shebuti.rayana@stonybrook.edu http://www3.cs.stonybrook.edu/~cse214/sec02/ Introduction
More informationUNIT 2 TRANSPORT LAYER
Network, Transport and Application UNIT 2 TRANSPORT LAYER Structure Page No. 2.0 Introduction 34 2.1 Objective 34 2.2 Addressing 35 2.3 Reliable delivery 35 2.4 Flow control 38 2.5 Connection Management
More informationAdaptive Multimodule Routers
daptive Multimodule Routers Rajendra V Boppana Computer Science Division The Univ of Texas at San ntonio San ntonio, TX 78249-0667 boppana@csutsaedu Suresh Chalasani ECE Department University of Wisconsin-Madison
More informationImproving QOS in IP Networks. Principles for QOS Guarantees
Improving QOS in IP Networks Thus far: making the best of best effort Future: next generation Internet with QoS guarantees RSVP: signaling for resource reservations Differentiated Services: differential
More informationThe Controlled Delay (CoDel) AQM Approach to fighting bufferbloat
The Controlled Delay (CoDel) AQM Approach to fighting bufferbloat BITAG TWG Boulder, CO February 27, 2013 Kathleen Nichols Van Jacobson Background The persistently full buffer problem, now called bufferbloat,
More informationSimulation of a Scheduling Algorithm Based on LFVC (Leap Forward Virtual Clock) Algorithm
Simulation of a Scheduling Algorithm Based on LFVC (Leap Forward Virtual Clock) Algorithm CHAN-SOO YOON*, YOUNG-CHOONG PARK*, KWANG-MO JUNG*, WE-DUKE CHO** *Ubiquitous Computing Research Center, ** Electronic
More information1 Hazards COMP2611 Fall 2015 Pipelined Processor
1 Hazards Dependences in Programs 2 Data dependence Example: lw $1, 200($2) add $3, $4, $1 add can t do ID (i.e., read register $1) until lw updates $1 Control dependence Example: bne $1, $2, target add
More informationDesign of a Weighted Fair Queueing Cell Scheduler for ATM Networks
Design of a Weighted Fair Queueing Cell Scheduler for ATM Networks Yuhua Chen Jonathan S. Turner Department of Electrical Engineering Department of Computer Science Washington University Washington University
More informationPriority Queues and Heaps. Heaps of fun, for everyone!
Priority Queues and Heaps Heaps of fun, for everyone! Learning Goals After this unit, you should be able to... Provide examples of appropriate applications for priority queues and heaps Manipulate data
More informationCOMP 250 Midterm #2 March 11 th 2013
NAME: STUDENT ID: COMP 250 Midterm #2 March 11 th 2013 - This exam has 6 pages - This is an open book and open notes exam. No electronic equipment is allowed. 1) Questions with short answers (28 points;
More informationEfficient Rectilinear Steiner Tree Construction with Rectangular Obstacles
Proceedings of the th WSES Int. onf. on IRUITS, SYSTEMS, ELETRONIS, ONTROL & SIGNL PROESSING, allas, US, November 1-3, 2006 204 Efficient Rectilinear Steiner Tree onstruction with Rectangular Obstacles
More information