Cache Memory Design for Network Processors


Tzi-Cker Chiueh and Prashant Pradhan
Computer Science Department, State University of New York at Stony Brook, Stony Brook, NY

Abstract

The exponential growth in Internet traffic has motivated the development of a new breed of microprocessors called Network Processors, which are designed to address the performance problems resulting from exploding Internet traffic. Development efforts for these network processors concentrate almost exclusively on streamlining their data paths to speed up network packet processing, which mainly consists of route lookup and data movement. Rather than blindly pushing the performance of packet processing hardware, an alternative approach is to avoid repeated computation by applying the time-tested architectural idea of caching to network packet processing. Because the data streams presented to network processors and to general-purpose CPUs exhibit different characteristics, the detailed cache design tradeoffs for the two also differ considerably. This research focuses on cache memory design specifically for network processors. Using a trace-driven simulation methodology, we evaluate a series of three progressively more aggressive routing-table cache designs. Our simulation results demonstrate that incorporating hardware caches into network processors, combined with efficient caching algorithms, can significantly improve overall packet forwarding performance, owing to a sufficiently high degree of temporal locality in network packet streams. Moreover, different cache designs can produce up to a severalfold difference in the average routing table lookup time, and thus in the packet forwarding rate.

1. Introduction

With the enormous momentum behind Internet-related technologies and applications, demands for data network bandwidth are rising at an astounding rate.
As a result, a growing number of microchips are being designed and fabricated specifically for networking devices rather than for traditional computing applications. In particular, a new breed of microprocessors called Network Processors has emerged, designed specifically to execute network protocols efficiently on various kinds of network devices, such as switches and routers. A major function that network processors perform is packet routing. At the IP level, the routing table lookup problem is equivalent to longest prefix matching. The routing table consists of a set of entries, each containing a destination network address, a network mask, and an output port identifier. Given a destination IP address, routing lookup can logically be thought of as follows. The network mask of an entry selects the N most significant bits of the destination address. If the result matches the destination network address of the entry, the output port identifier in the entry is a potential lookup result. Among such matches, the entry with the longest mask is the final lookup result. Note that the routing table is essentially a set of destination address prefixes, and routing lookup searches this set for the longest matching prefix of a destination address. In the classical addressing scheme, N could take only a fixed set of values, viz. 8, 16, or 24. However, to allow more efficient address space allocation, a technique called Classless Inter-Domain Routing (CIDR) is currently in use that allows N to take any value from 1 to 32. This generality complicates the search for the longest matching prefix of a given destination address. Efficient algorithms to solve this problem have been proposed [5, 22]; the architecture-level research question, however, is how to execute them at wire speed. For example, if a router's performance target is 10 million packets per second, then per-packet processing, including the longest prefix match, must be completed within 100 nsec.
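To make the definition above concrete, the following is a minimal linear-scan sketch of longest prefix matching. The table contents are hypothetical, and a linear scan is far too slow for wire speed; it only pins down the semantics that the faster structures discussed later must preserve.

```python
# Hypothetical routing table: (network address, prefix length, output port).
TABLE = [
    (0xC0A80000, 16, 1),   # 192.168.0.0/16 -> port 1
    (0xC0A80100, 24, 2),   # 192.168.1.0/24 -> port 2
    (0x00000000, 0, 0),    # default route  -> port 0
]

def lookup(dst: int) -> int:
    """Return the output port of the longest matching prefix for dst."""
    best_len, best_port = -1, None
    for addr, plen, port in TABLE:
        # The mask selects the plen most significant bits of dst.
        mask = 0xFFFFFFFF ^ ((1 << (32 - plen)) - 1) if plen else 0
        if (dst & mask) == addr and plen > best_len:
            best_len, best_port = plen, port
    return best_port
```

For 192.168.1.5 both the /16 and /24 entries match, and the /24 entry (port 2) wins, as the longest-mask rule requires.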
While many attempts have been made to build specialized hardware for clever packet routing and filtering algorithms, in this work we choose a time-tested architectural idea, viz. caching, to attack this problem, based on the belief that there is sufficient locality in the packet stream to reuse the results of routing computation. (Unless explicitly indicated otherwise, the term IP refers to Internet Protocol Version 4.)

Caching alone is not sufficient, because packet address streams exhibit less locality than the instruction/data reference streams of program execution. Given caches of a fixed configuration, the only way to improve cache performance is to increase their effective coverage of the IP address space, i.e., to have each cache entry cover a larger portion of the IP address space. Towards this end, this work develops a novel address range merging technique that exploits the fact that there is a limited number of outcomes for routing table lookup (the number of output interfaces in a network device), regardless of the size of the IP address space. Our simulation results demonstrate that address range merging improves caching efficiency by a significant factor over generic IP host address caching, in terms of average routing table lookup time.

The rest of this paper is organized as follows. In Section 2, we review previous work related to network processors. In Section 3, we describe the network packet traces used and the architectural models assumed in this study. Section 4 presents the results for the baseline cache design, which supports routing table lookup based on individual destination host addresses. Section 5 presents the performance results of caching host address ranges rather than individual destination host addresses. Section 6 presents the simulation results of a further optimization that exploits the fact that the number of outcomes of routing table lookup is typically small. Section 7 concludes with a summary of the main results and a brief outline of ongoing work in this project.

2. Related Work

State-of-the-art Internet routing devices use general-purpose CPUs or ASICs for routing.
BBN's MGR router [14] uses a DEC Alpha processor as the routing engine and relies on the on-chip L1 and L2 caches for a software-based routing-table cache. IBM's Integrated Switch Router [1] uses the PowerPC 603e for both the control engines and the forwarding engines on the interface cards. Some IP routers are rooted in massively parallel architectures using general-purpose CPUs [18] or ASIC-based routing engines [12] [17] on each node. SUNY at Stony Brook's Suez router project [15] is based on a cluster architecture that uses a 400-MHz Pentium on each node for both packet routing and real-time packet scheduling. Most high-performance switches and routers implement proprietary routing/filtering algorithms in ASIC chips [20] [9] [11]. Some chipsets [10] can perform lookups based on multiple packet fields in parallel. CAMs are exploited in certain routing co-processors [16] to achieve high performance. Certain protocol processors [4] are based on a distributed pipelined processing architecture and employ a traffic classifier module for packet routing. None of the custom-designed processors explicitly mentions the use of a cache in its chip description, although some [19] do note that the traffic seen by major Internet backbone routers is not expected to exhibit sufficient locality to justify the use of caches. Feldmeier [7] studied the management policy for the routing-table cache and showed that routing-table lookup time can be reduced by up to 65%. Chen [2] investigated the validity and effectiveness of caching for routing-table lookup in multimedia environments. Estrin and Mitzel [6] derived the storage requirements for maintaining state and lookup information on routers, and showed through trace-driven simulations that locality exists. Jain [8] studied cache replacement algorithms for different types of traffic (interactive vs. non-interactive).
More recently, results from Internet traffic studies [13], as well as our own [3], showed that there is sufficient locality in the packet stream that caching can be a simple and powerful technique for addressing the per-packet processing overhead in routers. We show that by increasing every cache entry's coverage of the IP address space, cache performance can be significantly improved and the effects of reduced locality can be mitigated. Pink et al. [23] proposed a technique to compress an expanded trie representation of a routing table so that the result is small enough to fit in the L2 cache of a general-purpose processor, thus reducing lookup time. Our approach is to design efficient ways of caching route lookup results by exploiting the structure of a given routing table, rather than to speed up route lookup by ensuring that the memory accesses involved in the lookup mechanism are cache hits.

3. Performance Evaluation Methodology

3.1. Trace Collection

We use a trace-driven simulation methodology to study the design of network processors' cache memory systems. Although several IP packet traces were available in the public domain, none of them was suitable for our purposes, either because they were out of date or because they had been sanitized by replacing IP addresses with unique integers, rendering them unfit for set-associative cache simulations. As a result, we decided to collect a packet trace from the periphery link that connects the Brookhaven National Laboratory (BNL) to the Internet via ESnet at T3 rate, i.e., 45 Mbits/sec. This is the only link that connects the entire BNL community to the outside world. The trace was collected by setting up a Pentium-II/233 MHz machine to snoop on a mirror port of the Fast Ethernet switch that links

BNL's periphery router with ESnet's router. The packet trace was collected from 9 AM on 3/3/98 to 5 PM on 3/6/98. The total number of packets in the trace is 184,400,259. Because there are only three output interfaces in the BNL router, we used a backbone router's routing table from the IPMA project [21] in the simulations. Recognizing that the BNL network is at the edge of the Internet, and that the collected trace therefore may not reflect the traffic patterns of backbone routers, we multiplexed portions of the original trace to create an aggregated packet stream that emulates the effect of interleaving the traffic from a large number of unrelated traffic flows. Specifically, we extracted from the collected trace several packet sequences that are temporally spaced as far apart as possible, and interleaved them to form a new amalgamated trace. Essentially, we simulate spatial traffic (un-)correlation using temporal traffic (un-)correlation. To further mitigate performance skews due to possibly higher traffic locality in the collected trace, we focus our simulation study mostly on caches that are smaller than what is feasible in current processors.

Figure 1. The network address trie and its corresponding NART data structure. Addresses A and B belong in the level-1 table and have output ports 1 and 2 respectively. Address C falls in one of the level-2 tables and has output port 3. Address D lies in a level-3 table, with output port 4. Note that due to the longest prefix match requirement, part of address A's range in the level-1 table is culled away by address B's range.

3.2. Architectural Assumptions

In the following sections, we will explore three network processor cache designs and their detailed architectural tradeoffs using trace-driven simulations.
The first design is a generic CPU cache used for routing-table lookup, where the destination host address is treated as a memory address. The second design improves on the first by exploiting the fact that each routing table entry corresponds to a contiguous range of the IP address space: instead of caching individual destination host addresses, the network processor cache can cover a larger portion of the IP address space if each cache entry corresponds to a host address range. The third design is a further optimization of the second, exploiting the fact that the number of distinct outcomes of routing-table lookup equals the number of output interfaces in a router and is thus relatively small. As a result, one can combine disjoint host address ranges that share the same routing-table lookup result into larger address ranges by choosing a different hash function than that used in generic CPU caches. This further increases the cache's effective coverage of the IP address space.

3.3. Cache Miss Handling

The average routing table lookup time depends on both the cache hit ratio and the cache miss penalty, the latter being determined by the software algorithm used to perform routing table lookup. A simple and elegant data structure for solving the longest prefix match problem is a binary trie [5] (Figure 1). Once the network addresses in a routing table are inserted into a trie, every node of the trie is either a purely internal node or a node corresponding to a network address (called an address node). Given an input address, one simply walks the trie from the root using the bits of the input, and stops at a node from which further branching is not possible (either because the node is a leaf, or because the branch corresponding to the next input bit does not exist). The last address node encountered along this path is the longest matching prefix of the input address.
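The trie walk just described can be sketched in a few lines. Prefixes are given as bit strings for readability, and the table contents are illustrative only:

```python
class Node:
    def __init__(self):
        self.child = [None, None]  # branch on one address bit
        self.port = None           # set only on "address nodes"

def trie_insert(root, prefix_bits: str, port: int):
    """Insert a prefix (e.g. '1100' for a length-4 prefix) into the trie."""
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.child[i] is None:
            node.child[i] = Node()
        node = node.child[i]
    node.port = port               # this node becomes an address node

def longest_prefix_match(root, addr_bits: str):
    """Walk from the root using the input bits; return the port of the
    last address node encountered along the path."""
    node, best = root, None
    for b in addr_bits:
        node = node.child[int(b)]
        if node is None:
            break                  # no further branch possible
        if node.port is not None:
            best = node.port
    return best
```

The worst-case walk visits one node per address bit, which is exactly why the paper next flattens the trie into a small, fixed number of table lookups.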

Clearly, the worst-case number of memory accesses for the trie-walking algorithm equals the worst-case depth of the trie, viz. 32 for IP addresses. Instead, we use a data structure that reduces the trie lookup to a small number of table lookups. This is done by flattening the trie at a fixed number of levels. We call this data structure the Network Address Routing Table (NART) [3]. To understand NART construction, refer to Figure 1. Suppose we choose to flatten the trie at three levels, corresponding to the first 16 bits (level 1), the next 8 bits (level 2), and the last 8 bits (level 3) of the address. Given an address, the trie node corresponding to it may lie at or above one of these levels. Level 1 can be visualized as a simple table of 2^16 = 64K entries. We will use the notation A[i:j] to represent bits i to j of an N-bit number A, with i being the more significant bit position; the most significant and least significant bits are numbered 0 and N-1, respectively. Note that an address A of length L that lies above level 1 corresponds to a range of 2^(16-L) entries in the level 1 table, starting at entry A[0:L-1] * 2^(16-L). Thus, these entries can be filled with the output port identifier corresponding to address A. Due to the longest prefix match requirement, if a level 1 table entry corresponds to more than one address, the output port identifier in that entry is the one corresponding to the longer address. Whenever an address node A with length L lies between levels 1 and 2, note first that its first 16 bits correspond to entry A[0:15] of the level 1 table. However, since the address continues beyond level 1, it is kept in a level 2 table pointed to by entry A[0:15] of the level 1 table. Since such level 2 tables correspond to trie levels 16 to 23, their size is 256 entries. Thus, address A corresponds to a range of 2^(24-L) entries starting from A[16:L-1] * 2^(24-L) in this level 2 table.
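The fill arithmetic above, together with the resulting multi-level lookup, can be sketched as follows. Table contents are illustrative; entries are tagged as either an output port ('port', p) or a pointer to a deeper table ('table', t):

```python
def fill_level1(level1, prefix: int, plen: int, port: int):
    """Fill the 2^(16-plen) level-1 entries covered by a prefix, plen <= 16.
    Longer prefixes must be inserted after shorter ones, so they
    overwrite and the longest-prefix-match requirement is met."""
    span = 1 << (16 - plen)                 # entries covered: 2^(16-L)
    start = (prefix >> 16) & ~(span - 1)    # A[0:L-1] * 2^(16-L)
    for i in range(start, start + span):
        level1[i] = ('port', port)

def nart_lookup(level1, dst: int):
    """Resolve dst in at most three table accesses (16/8/8 split)."""
    e = level1.get(dst >> 16, ('port', None))     # access 1: bits A[0:15]
    if e[0] == 'table':
        e = e[1][(dst >> 8) & 0xFF]               # access 2: bits A[16:23]
        if e[0] == 'table':
            e = e[1][dst & 0xFF]                  # access 3: bits A[24:31]
    return e[1]

# Example: 10.0.0.0/8 -> port 1; 10.1.0.0/16 handled by a level-2 table
# (default port 2) whose entry 5, i.e. 10.1.5.0/24, points to a level-3
# table where every entry yields port 3.
level1 = {}
fill_level1(level1, 0x0A000000, 8, 1)
level3 = [('port', 3)] * 256
level2 = [('port', 2)] * 256
level2[5] = ('table', level3)
level1[0x0A01] = ('table', level2)
```

A Python dict stands in for the flat 64K-entry array of the hardware description; the access pattern (one, two, or three indexed loads) is the point of the sketch.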
The treatment of address nodes lying between levels 2 and 3 is exactly the same as that of nodes lying between levels 1 and 2, through the use of level 3 tables of 256 entries. The longest prefix requirement is handled in the level 2 and level 3 tables just as in the level 1 table. To perform a lookup for address A in the NART, we first index the level 1 table using bits A[0:15]. If this entry does not contain a pointer to a level 2 table, the lookup is complete and the output port identifier contained therein is the lookup result. Otherwise, the corresponding level 2 table is looked up using bits A[16:23]. If this level 2 table entry does not contain a pointer to a level 3 table, the lookup is complete. Otherwise, a final lookup using bits A[24:31] is performed on the corresponding level 3 table. We call NART table entries that contain output port identifiers leaf NART entries, and those containing pointers to other tables non-leaf entries. The NART data structure effectively reduces the worst-case lookup time to three memory accesses, with a reasonable space overhead. We have implemented the above NART algorithm on a Pentium-II 233 MHz machine with a 16-KByte L1 data cache. The measured software NART lookup time for the packet trace, using the IPMA routing table, is 120 CPU cycles on average.

Figure 2. The baseline network processor cache architecture, which is identical to generic CPU caches. Simulation results show that smaller block sizes are preferred due to the lack of spatial locality in network packet streams.

4. Baseline: Host Address Cache (HAC)

Figure 2 shows the baseline network processor cache architecture, which is identical to a conventional CPU cache.
In this section, we report the results of generic cache simulations in which we vary the cache size, the cache block size, and the degree of associativity, for two reasons: to identify possible differences between the locality characteristics of network packet streams and program reference streams, and to establish the baseline model against which subsequent cache design alternatives are compared. The simulated cache miss ratios in Table 1 show that cache size and degree of associativity have a similar performance effect on the network processor cache as on a CPU cache. However, a distinct difference between network packet streams and program reference streams is that the former lack spatial locality, as evidenced by the fact that for a given cache size and degree of associativity, decreasing the block size monotonically decreases the cache miss ratio (this result holds for the trace even without interleaving).

Table 1. Miss ratios for the baseline host address cache under varying cache sizes, cache block sizes and degrees of associativity. Cache sizes are reported in numbers of entries rather than numbers of bytes.

Intuitively, this behavior is expected, as there is no direct temporal correlation among the network activities of hosts residing in the same subnet. Caches with larger block sizes perform worse because a larger block size leads to inefficient cache space utilization when references to addresses within the same block are not temporally correlated, i.e., when there is low spatial locality. The performance difference between cache configurations that are identical except for block size can be dramatic. For example, the miss ratios of a 4-way set-associative 8K-entry cache with a 32-entry block size and one with a 1-entry block size are nearly an order of magnitude apart: 38.05% vs. 3.29%. As cache size increases, the performance impact of block size decreases (although it remains significant), because space utilization efficiency is less of an issue with larger caches. We conclude that the block size of network processor caches should always be small, preferably one entry wide.

Whenever the base data structure from which a cache is built changes, there is a cache consistency problem. For the host address cache, modification of the routing table due to the routing protocol's message exchanges gives rise to this consistency problem. However, unlike a CPU cache, temporal inconsistency in the host address cache is tolerable, because the routing protocol itself takes time to converge to the new routes. Therefore, there is much more latitude in the timing of consistency maintenance actions.
To simulate the effects of routing table changes, we flush the contents of the host address cache periodically and measure the impact of routing table update frequency on the effectiveness of the host address cache. The results are shown in Table 2, which assumes the cache is direct-mapped and its block size is one entry wide. As the flush interval increases, the miss ratio decreases, as expected. But the performance difference due to flushing, as shown by the ratio of the miss rates at the 100K and infinite flush intervals, increases with the cache size. The reason for this behavior is that larger caches require a longer cold-start time, and therefore tend to suffer more than smaller caches when the flush interval is small. Consequently, the relative performance difference between flush frequencies is more significant for larger caches. To put the flush intervals used in these simulations in perspective, 100K packets is equivalent to 100 msec for a router that can process one million packets per second. In reality, the interval between consecutive routing table changes is on the order of seconds.

    Cache Size   Flush Interval   Miss Ratio
    4K           100K             14.23%
                 400K             13.16%
                 infinity         12.71%
    8K           100K              9.69%
                 400K              8.23%
                 infinity          7.57%
    32K          100K              5.40%
                 400K              3.41%
                 infinity          2.39%

Table 2. The impact of the frequency of routing table updates, which translate into cache flushes, on the miss ratios. The flush interval is the number of packets the host address cache processes between consecutive flushes. An infinite flush interval corresponds to the no-flush case.
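The flush experiment above is easy to reproduce in miniature. The following toy simulator models a direct-mapped, one-entry-block cache; the address stream below is synthetic, so the numbers are only illustrative, not the paper's trace-driven results:

```python
def miss_ratio(addrs, n_sets, flush_interval=None):
    """Miss ratio of a direct-mapped cache with one-entry-wide blocks,
    flushed every flush_interval references (None = never flush)."""
    cache = [None] * n_sets
    misses = 0
    for i, a in enumerate(addrs):
        if flush_interval and i > 0 and i % flush_interval == 0:
            cache = [None] * n_sets   # routing-table update -> full flush
        s = a % n_sets                # direct-mapped index
        if cache[s] != a:
            misses += 1
            cache[s] = a
    return misses / len(addrs)
```

On a perfectly cyclic stream of 100 distinct addresses, flushing every 1,000 references multiplies the miss ratio tenfold, mirroring the trend in Table 2: more frequent flushes cost more, and the cost is pure cold-start misses.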

Figure 3. The network processor cache architecture that caches host address ranges rather than individual host addresses. Range size is a global parameter applied across the entire address space, and is determined by maximally concatenating address ranges in the IP address space.

5. Host Address Range Cache (HARC)

Each routing table entry corresponds to a contiguous range of the IP address space. For example, a routing table entry with a network address field of 0x12340000 and a network mask field of 0xFFFF0000 corresponds to the contiguous range [0x12340000, 0x1234FFFF] of the [0, 2^32 - 1] IP address space. If a network packet's destination address falls within a routing table entry's range, it should be routed to that entry's output interface. One can exploit this fact to increase the effective coverage of a host address cache by caching host address ranges instead of individual addresses. Network addresses must go through two additional processing steps before the host address range cache (HARC) can be put to practical use. First, because of the longest prefix match requirement, it is possible that one routing table entry's address range covers another's. The former is called an encompassing routing table entry and the latter an encompassed entry; an encompassing entry's network address is a prefix of the entries it encompasses. The address range associated with each encompassed routing table entry must be culled away from the address ranges of all entries that encompass it, so that every address range in the IP address space is covered by exactly one routing table entry. This culling step is essential because it ensures that an IP destination address lying in a particular address range has a unique lookup result. Second, adjacent address ranges that share the same output interface should be merged into larger ranges as much as possible.
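After culling, the table is a list of disjoint ranges, and the merge step is a single pass over them in address order. A minimal sketch, with hypothetical ranges:

```python
def merge_ranges(ranges):
    """Coalesce adjacent (start, end, port) ranges that share a port.
    Input ranges are disjoint (the culling step guarantees this)."""
    merged = []
    for start, end, port in sorted(ranges):
        if merged and merged[-1][2] == port and merged[-1][1] + 1 == start:
            prev_start, _, _ = merged[-1]
            merged[-1] = (prev_start, end, port)   # extend previous range
        else:
            merged.append((start, end, port))
    return merged
```

For instance, two abutting ranges that both route to port 1 collapse into one, halving the number of cacheable entities for that region of the address space.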
Once this merging is done, the resulting ranges are aligned; that is, ranges are split where necessary so that all range sizes are powers of 2 and the starting address of every range is aligned with a multiple of its size. Then the minimum of all resulting address range sizes is calculated. This minimum becomes the minimum range granularity parameter of the HARC. Range size, defined as log2(minimum range granularity), thus represents the number of least significant bits of an IP address that can be ignored during routing-table lookup, since destination addresses falling within a minimum-size address range are guaranteed to have the same lookup result. Figure 3 shows the hardware architecture of the HARC, which is the baseline cache augmented with a logical shifter. The destination address of an incoming packet is logically right-shifted by range size before being fed to the baseline cache. Because each address range corresponds to a cacheable entity, HARC's effective coverage of the IP address space is increased by a factor of the minimum range granularity.

Table 3. Cache miss ratio comparisons between the host address range cache (HARC) and the host address cache (HAC), assuming the block size is one entry wide and the range size is 5. The last column is the ratio between HAC's and HARC's average routing-table lookup times, assuming the hit access time is one cycle and the miss penalty is 120 cycles.

We processed the IPMA routing table according to the steps described above and calculated the range size parameter, which turned out to be 5. This means that each HARC entry now corresponds to a contiguous range of

32 addresses, a factor of 32 increase in the cache's effective coverage.

Figure 4. A routing table example that illustrates the usefulness of carefully choosing the index bits. In this case, bit 1 is chosen as the index bit. The total number of distinct address ranges is reduced from 8, under the basic merging operation used in forming the host address range cache, to 3. The three address ranges are labeled A, B, and C.

Table 3 shows the comparison between the cache miss ratios of the HARC and the host address cache (HAC), assuming the block size is one entry wide. HAC's miss ratio is between 1.68 and 2.10 times that of HARC. In terms of average routing-table lookup time, HARC is between 58% and 78% faster than HAC, assuming that the hit access time is one cycle and the miss penalty is 120 cycles. Because the logical right-shifting step in the HARC lookup lowers the degree of variation in the address stream as seen by HARC, the probability of conflict misses increases. As a result, the miss ratio gap between HAC and HARC widens with the degree of associativity, because HARC benefits more from higher degrees of associativity, eliminating more conflict misses than HAC.

6. Intelligent Host Address Range Cache (IHARC)

A traditional CPU cache of size 2^K and block size 1 directly takes the least significant K bits of a given address to index into the data and tag arrays. In this section, we show that by choosing a more appropriate hash function for cache lookup, it is possible to further increase every cache entry's coverage of the IP address space. Consider the example routing table in Figure 4, where there are 16 4-bit host addresses with three distinct output interfaces, 1, 2 and 3. The merging algorithm used in calculating the range size of the HARC stops after all adjacent address ranges with identical output interfaces are combined.
In this case, the total number of address ranges is 8, because the minimum range granularity is 2. To further grow the address range that a cache entry can cover, one can choose the index bits carefully such that when the index bits are ignored, some of the identically labeled address ranges become adjacent and thus can be combined. For example, if bit 1 (with bit 0 being least significant) is chosen as the index bit into the data/tag array, then the host addresses 0000, 0001, 0100, and 0101 can be merged into one address range, because they share the same output interface, 1, and when bit 1 is ignored they form the contiguous sequence 000, 001, 010, 011. Similarly, 1000, 1001, 1100, and 1101 can also be merged into one address range, as can all the host addresses whose output interface is 2. With this choice of index bit, the total number of address ranges to be distinguished during cache lookup is reduced from 8 to 3. Note that index bit 1 induces a partitioning of the address space such that in each partition, some address ranges that were not adjacent in the original address space become adjacent. Intuitively, IHARC provides more opportunities for merging identically labeled address ranges by decomposing the address space into partitions, based upon a set of index bits, and merging identically labeled ranges that are adjacent within a partition. Note that HARC insists on merging ranges that are adjacent in the original IP address space, and is thus a special case of IHARC. IHARC selects a set of K index bits in the destination address, corresponding to 2^K cache sets. Each cache set corresponds to a partition of the IP address space. In a partition, some address ranges that were not originally adjacent in the IP address space will become adjacent. Any adjacent ranges that are identically labeled are then merged into larger ranges. Thus, we get a set of distinct address ranges for every partition (or cache set). Since distinct address ranges in a cache set need unique tags, the number of distinct address ranges in a cache set represents the degree of contention in that cache set. The index bits are therefore selected so that, after the merging operation, both the total number of address ranges and the difference between the numbers of address ranges across cache sets are minimized.

We first describe our index bit selection algorithm. Assume N and K are the number of bits in the input address and the index key, respectively. In general, any subset of K bits in the input addresses could be used as the index bits, except the least significant range size bits as determined by the basic merging step in constructing the HARC. We use a greedy algorithm to select the K index bits, as shown in Figure 5:

    S = {};
    for (i = 1; i <= K; i++) {
        score = infinity; candidate = 0;
        for (j = range_size + 1; j <= N; j++)
            if (!(j in S)) {
                current_score = Score(S, j);
                if (current_score < score) {
                    score = current_score;
                    candidate = j;
                }
            }
        S = S + {candidate};
    }

Figure 5. A greedy index bit selection algorithm used to pick the bits in the input addresses for cache lookup.

S represents the set of index bits chosen by the algorithm so far. Score(S, j) is a heuristic function that calculates the desirability of adding the j-th bit, given that the bits in S have already been chosen for the index bit set. For each partition of the IP address space induced by the bits in S, the algorithm first merges adjacent identically labeled ranges in that partition. This step gives us, for every partition, the number of distinct address ranges that need to be uniquely tagged. As mentioned earlier, the number of distinct address ranges in a partition represents the extent of contention in the corresponding cache set.
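The greedy loop of Figure 5 can be made runnable as below. For illustration, the score here counts only the total number of merged runs over all partitions (i.e., it omits the load-balance term, taking W = 0) and considers every address bit as a candidate (range size 0). The routing table in the test reconstructs the spirit of the Figure 4 example, with hypothetical port assignments:

```python
def score(addr_ports, bits):
    """Total number of merged (address-range, port) runs, summed over the
    partitions induced by the chosen index bits."""
    parts = {}
    for addr, port in addr_ports:
        key = tuple((addr >> b) & 1 for b in bits)          # partition id
        reduced = addr
        for b in sorted(bits, reverse=True):                # delete index bits
            reduced = ((reduced >> (b + 1)) << b) | (reduced & ((1 << b) - 1))
        parts.setdefault(key, []).append((reduced, port))
    total = 0
    for entries in parts.values():
        entries.sort()
        runs, prev = 0, None
        for reduced, port in entries:
            # A new run starts unless this entry extends the previous one.
            if prev is None or port != prev[1] or reduced != prev[0] + 1:
                runs += 1
            prev = (reduced, port)
        total += runs
    return total

def greedy_select(addr_ports, k, n_bits):
    """Pick k index bits one at a time, each minimizing the score."""
    chosen = []
    for _ in range(k):
        best = min((b for b in range(n_bits) if b not in chosen),
                   key=lambda b: score(addr_ports, chosen + [b]))
        chosen.append(best)
    return chosen
```

On the 16-address example, choosing bit 1 indeed yields only 3 ranges, while any other single bit yields 8 or more, so the greedy step selects bit 1.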
Thus, for the current set S and a candidate bit j, we define the i-th partition's metric M_i(S ∪ {j}) as the number of distinct address ranges in partition i. The algorithm then minimizes Score(S, j), given by

    Score(S, j) = Sum_i M_i(S ∪ {j}) + W * Sum_i (M_i(S ∪ {j}) - M_mean(S ∪ {j}))^2

where M_mean(S ∪ {j}) is the mean of M_i(S ∪ {j}) over all partitions i, and W is a parameter that determines the relative weight of the two terms in the minimization. Note that the second term of the weighted sum minimizes the spread of the M_i around their mean, and is included to prevent the occurrence of hot-spot partitions, and thus excessive conflict misses, in the IHARC cache sets.

Figure 6. The intelligent host address range cache architecture, which uses a hash function specific to a routing table in order to combine disjoint address ranges into a logical host address set that is then mapped to a cache entry. The programmable hash engine provides the flexibility needed to tailor the hash function to the routing table, which changes dynamically.

Figure 6 shows the hardware architecture of a host address range cache with a programmable hash function engine that allows the choice of the index bit set to be tailored to individual routing tables. While HAC and HARC use a fixed three-level-table NART structure (16, 8 and 8 bits) that is independent of the hardware cache configuration, the NART associated with IHARC depends on the hardware configuration. In particular, the number of entries in the level 1 table is equal to 2^K, where K is the number of index bits in IHARC,

Number of Bits Chosen | Index Bits | No. of Ranges (without constraint) | No. of Ranges (with constraint)
[numeric entries lost in transcription]

Table 4. The number of address ranges that need to be distinguished, and the bits chosen as index bits, after applying the index bit selection algorithm to a routing table from the IPMA project. The last two columns correspond to the number of address ranges without and with the constraint that each address range's size be a power of 2, respectively.

and the set of K selected index bits is used to index into the level-1 table. As a result, the cache miss penalty for IHARC may differ across cache configurations. However, the miss penalty is dominated by the number of memory accesses made during the software NART lookup (3 lookups in the worst case), which is comparable in all configurations; measurements from our prototype implementation show that the average miss penalty is almost the same for all IHARC cache configurations we experimented with, and moreover is close to that of HAC and HARC, i.e., 120 cycles. Another important difference is that, in addition to the output interface, a leaf NART entry must contain the address range it corresponds to, so that after an NART lookup following a cache miss, the cache set can be populated with the appropriate address range as the cache tag. Given an N-bit address, the K index bits select a particular cache set, say i. The remaining N − K bits of the address form a value, say T, which lies in one of the address ranges of this partition. Initially, when a cache set is not populated, T is looked up in software using the NART, and the address range in which T falls becomes the tag of the cache set. If the cache entry is already populated with an address range R, a range check is required to determine whether the lookup is a hit (which corresponds to checking that T lies in R). However, a general range check is too expensive to be incorporated into caching hardware.
By guaranteeing that each address range's size is a power of two, and that the starting address of each range is aligned to a multiple of its size during the merge step, one can perform the range check with a simple mask-and-compare operation. Therefore, each tag memory entry in the IHARC includes a tag field as well as a mask field, which specifies the bits of the address to be used in the tag match. The price of simplifying the cache lookup hardware is an increase in the number of resulting address ranges, compared to the case when no such alignment requirement is imposed. If the range check results in a miss, the NART data structure is looked up and the cache is populated with the appropriate address range. Compared to the generic host address range cache (HARC), the intelligent host address range cache (IHARC) reduces the number of distinct address ranges that need to be distinguished through a careful choice of the index bits. Table 4 shows the number of distinct address ranges that result after applying the index bit set selection algorithm to the IPMA routing table, for different numbers of index bits. To put these numbers in perspective, the number of entries in the original routing table is 39,681, and the number of address ranges under HARC is 2^27, or 134,217,728. In other words, the index bit set selection algorithm effectively reduces the number of distinct address ranges from HARC to IHARC by three orders of magnitude. Moreover, this number is only 3 to 4 times the number of entries in the original routing table, even though the resultant address ranges can now be looked up efficiently with conventional cache lookup hardware. As mentioned before, for address ranges to be tag-matched using masks, their sizes must be powers of two. Table 4 shows the difference in the number of distinct address ranges with and without this constraint.
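The mask-and-compare check described above can be sketched as follows. This is a minimal software model, assuming 32-bit addresses and power-of-two, size-aligned ranges; the field names are illustrative, not taken from the paper.

```python
def make_entry(range_start, range_size):
    """Build the (tag, mask) pair stored in an IHARC tag-memory entry
    for a size-aligned, power-of-two address range."""
    assert range_size & (range_size - 1) == 0, "size must be a power of two"
    assert range_start % range_size == 0, "range must be aligned to its size"
    mask = ~(range_size - 1) & 0xFFFFFFFF   # ignore the low log2(size) bits
    return range_start & mask, mask

def range_hit(tag, mask, addr):
    """The hardware check: one AND and one compare decide range membership."""
    return (addr & mask) == tag

# Example: a /8-sized range starting at 10.0.0.0.
tag, mask = make_entry(0x0A000000, 1 << 24)
```

Because the range is aligned and power-of-two sized, membership of any address in the range reduces to masking off the low bits and comparing against the stored tag, which is exactly what conventional cache tag-match hardware already does.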
Cache Size | Assoc | Miss Ratio (IHARC) | Miss Ratio (HARC/IHARC) | Lookup Time (HARC/IHARC) | Lookup Time (HAC/IHARC)
[numeric entries lost in transcription]

Table 5. Miss ratios for the intelligent host address range cache (IHARC), assuming that the block size is one entry wide. The last two columns are the ratio between HARC's and IHARC's average routing-table lookup times, and that between HAC's and IHARC's, respectively, with HARC's range_size parameter set to 5.
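The lookup-time ratios reported in Table 5 follow from the standard average-access-time model. A back-of-the-envelope sketch, using the roughly 120-cycle miss penalty measured on the prototype; the 1-cycle hit time and the example miss ratios are assumptions chosen for illustration, not measured values:

```python
def avg_lookup_cycles(miss_ratio, hit_cycles=1, miss_penalty=120):
    """Average routing-table lookup time: hit cost plus expected miss cost."""
    return hit_cycles + miss_ratio * miss_penalty

# Hypothetical miss ratios: a cache missing 5% of lookups vs. one missing 1%.
slow = avg_lookup_cycles(0.05)   # 1 + 0.05 * 120 = 7.0 cycles
fast = avg_lookup_cycles(0.01)   # 1 + 0.01 * 120 = 2.2 cycles
ratio = slow / fast              # roughly 3.2x slower
```

Because the miss penalty dwarfs the hit time, even modest miss-ratio differences between cache designs translate into large differences in average lookup time, which is why the index-bit selection matters so much.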

Table 5 shows the miss ratios for IHARC, assuming that the block size is one entry wide. In terms of average routing-table lookup time, HARC is between 2.24 and 3.18 times slower than IHARC. This is because HARC's miss ratios are 2.91 to 7.09 times larger than IHARC's. In addition, the miss-ratio gap between HARC and IHARC grows with the degree of associativity. This result conclusively demonstrates that there is significant performance improvement to be gained from IHARC over HARC. Compared to HAC, IHARC reduces the average routing-table lookup time by up to a factor of [value lost in transcription].

7. Conclusion

This paper reports the results of one of the first research efforts on cache memory design for emerging network processors. Based on a real packet trace collected from the main router of a national laboratory, we studied a series of routing-table cache designs. The major results of this research are summarized as follows:

- Based upon the interleaved trace used in the study, there is sufficient temporal locality in the packet stream to justify the use of a routing-table cache in network processors. However, spatial locality is weak, and therefore the block size should be small, preferably one entry wide.

- Caching address ranges rather than individual addresses greatly improves the effective coverage of a cache of a given size, and therefore its hit ratio.

- A careful choice of the index bits used in cache lookup is crucial, and can dramatically reduce the number of address ranges that need to be distinguished, and thus the cache miss ratio.

We are currently investigating the performance impact of routing-table updates on HARC and IHARC, both of which exploit the current contents of the routing table to dynamically reconfigure the cache hardware, and therefore need to be updated on the fly. Specifically, developing an incremental version of the index bit selection algorithm would considerably enhance IHARC's practical usability.

References

[1] E. Basturk et al.
Design and implementation of a QoS capable switch-router. Sixth International Conference on Computer Communications and Networks, September.
[2] X. Chen. Effect of caching on routing-table lookup in multimedia environments. IEEE INFOCOM, April.
[3] T. Chiueh and P. Pradhan. High-performance IP routing table lookup using CPU caching. IEEE INFOCOM, April.
[4] Xaqti Corporation. GigaPower protocol processor. ( 01.htm).
[5] W. Doeringer, G. Karjoth, and M. Nassehi. Routing on longest matching prefixes. IEEE/ACM Transactions on Networking, 4(1):86–97, February.
[6] D. Estrin and D. Mitzel. An assessment of state and lookup overhead in routers. IEEE INFOCOM, May.
[7] D. Feldmeier. Improving gateway performance with a routing-table cache. IEEE INFOCOM, March.
[8] R. Jain. Characteristics of destination address locality in computer networks: a comparison of caching schemes. Computer Networks and ISDN Systems, 18(4):243–254, May.
[9] Kawasaki LSI. Longest match engine. ( com/products/lme.html).
[10] Berkeley Networks. The integrated network services switch: A new architecture for emerging applications. ( ).
[11] MMC Networks. 20 Mpps network processor with wire-speed layer 3 processing for building switches and routers. ( ).
[12] Neo Networks. StreamProcessor 2400 backbone switch router. ( 20literature/sp2400.htm#sp2400 top).
[13] C. Partridge. Locality and route caches. NSF Workshop on Internet Statistics Measurement and Analysis. ( ).
[14] C. Partridge, P. Carvey, et al. A fifty gigabit per second IP router. IEEE/ACM Transactions on Networking, 6(3):237–248, June.
[15] P. Pradhan and T. Chiueh. Operating systems support for programmable cluster-based Internet routers. IEEE Workshop on Hot Topics in Operating Systems, pages 76–81, March.
[16] MUSIC Semiconductor. MUAC routing coprocessor (RCP) family. ( ).
[17] Avici Systems. The world of terabit switch/router technology. ( new world 1.html).
[18] Pluris Terabit Network Systems. Next generation Internet router. ( ).
[19] Torrent Network Technologies. High-speed routing table search algorithms. ( general/download/highspeed.pdf).
[20] Torrent Network Technologies. The IP9000 gigabit router architecture. ( ip9000 Arch.pdf).
[21] Michigan University and Merit Network. Internet Performance Management and Analysis (IPMA) project. ( ipma).
[22] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner. Scalable high speed IP routing lookups. ACM SIGCOMM, September.
[23] A. Brodnik, S. Carlsson, M. Degermark, and S. Pink. Small forwarding tables for fast routing lookups. ACM SIGCOMM, September 1997.


More information

August 1994 / Features / Cache Advantage. Cache design and implementation can make or break the performance of your high-powered computer system.

August 1994 / Features / Cache Advantage. Cache design and implementation can make or break the performance of your high-powered computer system. Cache Advantage August 1994 / Features / Cache Advantage Cache design and implementation can make or break the performance of your high-powered computer system. David F. Bacon Modern CPUs have one overriding

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

Configuring BGP. Cisco s BGP Implementation

Configuring BGP. Cisco s BGP Implementation Configuring BGP This chapter describes how to configure Border Gateway Protocol (BGP). For a complete description of the BGP commands in this chapter, refer to the BGP s chapter of the Network Protocols

More information

Parallel-Search Trie-based Scheme for Fast IP Lookup

Parallel-Search Trie-based Scheme for Fast IP Lookup Parallel-Search Trie-based Scheme for Fast IP Lookup Roberto Rojas-Cessa, Lakshmi Ramesh, Ziqian Dong, Lin Cai, and Nirwan Ansari Department of Electrical and Computer Engineering, New Jersey Institute

More information

The Interconnection Structure of. The Internet. EECC694 - Shaaban

The Interconnection Structure of. The Internet. EECC694 - Shaaban The Internet Evolved from the ARPANET (the Advanced Research Projects Agency Network), a project funded by The U.S. Department of Defense (DOD) in 1969. ARPANET's purpose was to provide the U.S. Defense

More information

Three Different Designs for Packet Classification

Three Different Designs for Packet Classification Three Different Designs for Packet Classification HATAM ABDOLI Computer Department Bu-Ali Sina University Shahid Fahmideh street, Hamadan IRAN abdoli@basu.ac.ir http://www.profs.basu.ac.ir/abdoli Abstract:

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Static Routing and Serial interfaces. 1 st semester

Static Routing and Serial interfaces. 1 st semester Static Routing and Serial interfaces 1 st semester 1439-2017 Outline Static Routing Implementation Configure Static and Default Routes Review of CIDR Configure Summary and Floating Static Routes Troubleshoot

More information

Abstract We consider the problem of organizing the Internet routing tables in such a way as to enable fast routing lookup performance. We concentrate

Abstract We consider the problem of organizing the Internet routing tables in such a way as to enable fast routing lookup performance. We concentrate IP Routing Lookups Algorithms Evaluation Lukas Kencl Doctoral School of Communication Systems EPFL Lausanne Supervisor Patrick Droz IBM Laboratories Zurich July 9, 998 Abstract We consider the problem

More information

Experimental Extensions to RSVP Remote Client and One-Pass Signalling

Experimental Extensions to RSVP Remote Client and One-Pass Signalling 1 Experimental Extensions to RSVP Remote Client and One-Pass Signalling Industrial Process and System Communications, Darmstadt University of Technology Merckstr. 25 D-64283 Darmstadt Germany Martin.Karsten@KOM.tu-darmstadt.de

More information

ignored Virtual L1 Cache HAC

ignored Virtual L1 Cache HAC Suez: A Cluster-Based Scalable Real-Time Packet Router Tzi-cker Chiueh? Prashant Pradhan? Computer Science Division, EECS University of California at Berkeley Berkeley, CA 94720-1776 Computer Science Department

More information

This presentation covers Gen Z Memory Management Unit (ZMMU) and memory interleave capabilities.

This presentation covers Gen Z Memory Management Unit (ZMMU) and memory interleave capabilities. This presentation covers Gen Z Memory Management Unit (ZMMU) and memory interleave capabilities. 1 2 Given the operational similarities between a Requester ZMMU and a Responder ZMMU, much of the underlying

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c/su05 CS61C : Machine Structures Lecture #21: Caches 3 2005-07-27 CS61C L22 Caches III (1) Andy Carle Review: Why We Use Caches 1000 Performance 100 10 1 1980 1981 1982 1983

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Parallel Programming Interfaces

Parallel Programming Interfaces Parallel Programming Interfaces Background Different hardware architectures have led to fundamentally different ways parallel computers are programmed today. There are two basic architectures that general

More information

FIGURE 3. Two-Level Internet Address Structure. FIGURE 4. Principle Classful IP Address Formats

FIGURE 3. Two-Level Internet Address Structure. FIGURE 4. Principle Classful IP Address Formats Classful IP Addressing When IP was first standardized in September 1981, the specification required that each system attached to an IP-based Internet be assigned a unique, 32-bit Internet address value.

More information

Lecture 3: Packet Forwarding

Lecture 3: Packet Forwarding Lecture 3: Packet Forwarding CSE 222A: Computer Communication Networks Alex C. Snoeren Thanks: Mike Freedman & Amin Vahdat Lecture 3 Overview Paper reviews Packet Forwarding IP Addressing Subnetting/CIDR

More information

OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS

OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS Chapter 2 OPTIMAL MULTI-CHANNEL ASSIGNMENTS IN VEHICULAR AD-HOC NETWORKS Hanan Luss and Wai Chen Telcordia Technologies, Piscataway, New Jersey 08854 hluss@telcordia.com, wchen@research.telcordia.com Abstract:

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

Introduction to OpenMP. Lecture 10: Caches

Introduction to OpenMP. Lecture 10: Caches Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 Performance Workshop Mark Bull David Henty EPCC, University of Edinburgh Overview Why caches are needed How caches work Cache design and performance. 2 1 The memory speed gap Moore s Law: processors

More information

Dynamic Pipelining: Making IP- Lookup Truly Scalable

Dynamic Pipelining: Making IP- Lookup Truly Scalable Dynamic Pipelining: Making IP- Lookup Truly Scalable Jahangir Hasan T. N. Vijaykumar School of Electrical and Computer Engineering, Purdue University SIGCOMM 05 Rung-Bo-Su 10/26/05 1 0.Abstract IP-lookup

More information

CS61C : Machine Structures

CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures CS61C L22 Caches II (1) CPS today! Lecture #22 Caches II 2005-11-16 There is one handout today at the front and back of the room! Lecturer PSOE,

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

CS 261 Fall Caching. Mike Lam, Professor. (get it??)

CS 261 Fall Caching. Mike Lam, Professor. (get it??) CS 261 Fall 2017 Mike Lam, Professor Caching (get it??) Topics Caching Cache policies and implementations Performance impact General strategies Caching A cache is a small, fast memory that acts as a buffer

More information