CACHE MEMORY DESIGN FOR INTERNET PROCESSORS

WE EVALUATE A SERIES OF THREE PROGRESSIVELY MORE AGGRESSIVE ROUTING-TABLE CACHE DESIGNS AND DEMONSTRATE THAT THE INCORPORATION OF HARDWARE CACHES INTO INTERNET PROCESSORS, COMBINED WITH EFFICIENT CACHING ALGORITHMS, CAN SIGNIFICANTLY IMPROVE OVERALL PACKET FORWARDING PERFORMANCE.

Tzi-cker Chiueh
Prashant Pradhan
State University of New York at Stony Brook

As a result of the exploding bandwidth demand from the Internet, network router and switch designers are designing and fabricating a growing number of microchips specifically for networking devices rather than traditional computing applications. In particular, a new breed of microprocessors, called Internet processors, has emerged that is designed to efficiently execute network protocols on various types of internetworking devices including switches, routers, and application-level gateways.

One of the main tasks on the critical path of packet processing is route lookup. The routing lookup problem is equivalent to finding the longest prefix of a packet's destination address in a table of address prefixes [1]. Although efficient algorithms to solve this problem exist [2,3], the architecture-level research question is how to execute them at wire speed. For example, if the router's performance target is 10 million packets per second, the per-packet processing, including longest prefix match, should be completed within 100 ns.

While router designers have made many attempts to build specialized hardware for clever packet routing and filtering algorithms, in this work we chose a time-tested architectural idea, caching, to attack this problem. This choice is based on the belief that there is sufficient locality in the packet stream for reusing routing computation results. However, caching alone is not sufficient, because packet-address streams exhibit less locality than the instruction- and data-reference streams in program execution.

Given caches of a fixed configuration, the only way to improve the cache performance is to increase their effective coverage of the Internet protocol (IP) address space; that is, each cache entry must cover a larger portion of the IP address space. Toward this end, our work develops a novel address-range-merging technique by exploiting the limited number of outcomes for routing table lookup (the number of output interfaces in a network device) regardless of the size of the IP address space. Our simulation results demonstrate that address-range merging improves the caching efficiency by a factor of five over generic IP host-address caching, in terms of average routing table-lookup time.

0272-1732/00/$10.00 © 2000 IEEE

Architectural assumptions

Because the major goal of our research is to explore the design space of cache subsystems for Internet processors, we did not restrict ourselves to the conventional CPU cache-hardware structures. Here, we explore three Internet processor cache designs and their detailed architectural trade-offs using trace-driven simulations. (We use a packet trace collected from the egress router of a major national laboratory [4,5] and a routing table from the IPMA project [6]. Temporally spaced-apart pieces of the trace were interleaved to further reduce locality.) Note that the average routing table-lookup time depends on both the cache hit ratio and the cache miss penalty, which is determined by the software algorithm used to perform routing-table lookup.

Patricia tries [1] are very suitable data structures for matching the longest prefix without backtracking. Essentially, a trie is a binary decision tree, which can be populated with a set of keys and can then be used to search for a given key or its prefixes. The left and right branches at an internal node correspond to the value at a given bit position of the key being 0 or 1, respectively. Redundant internal nodes that have only one branch may be collapsed together without loss of generality. To find the longest matching prefix of a key, the key is used to trace out a path in the trie, keeping track of the last key encountered along the path. However, tries suffer from high worst-case lookup times, which can require as many memory accesses as there are bits in the address, viz., 32.

In this study, we chose the NART algorithm [4]. The NART data structure is essentially a three-level expanded trie that uses a tree of flat tables to encapsulate a trie. In the worst case, at most three flat table lookups are needed to perform a routing lookup. For the packet trace and routing table used in this study, the average NART lookup time is 120 CPU cycles on a Pentium II 233-MHz machine.

Baseline: Host address cache (HAC)

This design is a generic CPU cache for routing table lookup, where the destination host address is treated as a memory address.
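To make the baseline concrete, here is a minimal trace-driven sketch of a host address cache in Python. It assumes LRU replacement and one-entry blocks; the class and function names, the toy trace, and the stand-in `slow_lookup` callback are illustrative assumptions, not the authors' simulator. The 120-cycle miss penalty is the NART figure quoted above, and the one-cycle hit time is the value the article assumes when it reports results.

```python
from collections import OrderedDict

HIT_CYCLES = 1      # hit access time assumed in the article's reported results
MISS_CYCLES = 120   # average NART lookup time quoted above


class HostAddressCache:
    """Set-associative cache with one-entry blocks and LRU replacement (assumed)."""

    def __init__(self, num_sets, ways):
        assert num_sets > 0 and num_sets & (num_sets - 1) == 0, "power of two"
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set: tag -> output interface, oldest entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def lookup(self, addr, slow_lookup):
        index = addr & (self.num_sets - 1)               # low-order bits pick the set
        tag = addr >> (self.num_sets.bit_length() - 1)   # remaining bits form the tag
        entries = self.sets[index]
        if tag in entries:
            self.hits += 1
            entries.move_to_end(tag)                     # mark as most recently used
            return entries[tag]
        self.misses += 1
        result = slow_lookup(addr)                       # e.g. a software table walk
        if len(entries) >= self.ways:
            entries.popitem(last=False)                  # evict least recently used
        entries[tag] = result
        return result

    def miss_ratio(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


def average_lookup_cycles(miss_ratio, hit=HIT_CYCLES, miss=MISS_CYCLES):
    """Weighted average of the hit time and the miss penalty."""
    return (1.0 - miss_ratio) * hit + miss_ratio * miss


cache = HostAddressCache(num_sets=4, ways=2)
toy_trace = [0x0A000001, 0x0A000002, 0x0A000001, 0x0A000001]
for dst in toy_trace:
    cache.lookup(dst, slow_lookup=lambda a: 1)   # toy table: every address -> port 1
print(cache.miss_ratio())                        # 0.5
print(round(average_lookup_cycles(0.1271), 2))   # 16.12
```

The weighting in `average_lookup_cycles` is an inference rather than a formula stated in the text, but it appears consistent with Table 1: the 12.71% miss ratio reported there for a direct-mapped 4K HAC yields about 16.12 cycles, the value in the table's lookup-time column.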
Figure 1 shows the baseline Internet processor cache architecture, which is identical to a conventional CPU cache. We wanted to identify differences in locality characteristics between network packet streams and program reference streams, and to establish the baseline model against which subsequent cache design alternatives are compared. So we performed generic cache simulations on the trace by varying the cache size, cache block size, and the degree of associativity.

[Figure 1. The baseline Internet processor cache architecture, which is identical to generic CPU caches. Block labels: destination IP address, index, compare, match?, select, data, output.]

IEEE Micro, January-February 2000

The results [5] show that the cache size and degree of associativity have a similar performance effect on the Internet processor cache as on the CPU cache. However, a distinct difference between network-packet streams and program-reference streams is that the former lack spatial locality. Evidence of this is that, for a given cache size and degree of associativity, decreasing the block size monotonically decreases the cache miss ratio. In fact, the performance difference can be dramatic between cache configurations that are identical except for the block size. For example, the miss ratios of a four-way, set-associative, 8-Kbyte-entry cache with a 32-entry block size and one with a one-entry block size are roughly an order of magnitude apart, 38.05% versus 3.29%. From this study, we conclude that the block size of a network processor cache should always be small, preferably one entry wide.

Host address range cache (HARC)

This design is an improvement over HAC. Each routing table entry corresponds to a contiguous range of the IP address space. Therefore, instead of caching individual destination host addresses, an Internet processor cache can cover a larger portion of the IP address space if each cache entry corresponds to a host address range.

[Figure 2. The Internet processor cache architecture that caches host address ranges rather than individual host addresses. Block labels: destination IP address, right shifter, range size, index, compare, match?, select, data, output.]

Each routing table entry corresponds to a contiguous range of the IP address space. If a network packet's destination address falls within a routing table entry's range, the Internet processor should route it to that entry's output interface. A cache design could exploit this to increase the effective coverage of a host address cache, by caching host address ranges instead of individual addresses.

Network addresses need to go through two additional processing steps before HARC can be put to practical use. First, with the longest prefix match requirement, it is possible that the address range corresponding to a routing table entry covers the address range corresponding to some other routing table entry. The former is an encompassing routing table entry, while the latter is an encompassed entry. An encompassing entry's network address is a prefix of those of the entries it encompasses. We need to cull the address range associated with each encompassed routing table entry away from the address ranges of all the entries that encompass it. This ensures that every address range in the IP address space is covered by exactly one routing table entry, and hence a packet whose destination address falls in an address range has a unique lookup result.

Next, we merge adjacent address ranges that share the same output interface into larger ranges. Once this merging is done, these ranges are aligned; that is, ranges are potentially split to make all range sizes powers of two and to make the starting address of each range a multiple of its size. Then, the minimum of all resulting address range sizes is calculated, giving us the minimum_range_granularity parameter of HARC.
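The culling, merging, and alignment steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `(value, length, interface)` prefix encoding, and the 4-bit toy address space are all hypothetical, and the quadratic culling loop is chosen for clarity rather than speed.

```python
def lpm_ranges(prefixes, width):
    """Cull encompassed ranges: return disjoint (lo, hi, iface) half-open ranges.

    prefixes: list of (value, length, iface); value holds the prefix bits
    left-aligned in a `width`-bit address.
    """
    spans = []
    for value, length, iface in prefixes:
        size = 1 << (width - length)
        spans.append((value, value + size, length, iface))
    # Every prefix boundary is a cut point, so each elementary interval lies
    # either entirely inside or entirely outside any given prefix.
    cuts = sorted({0, 1 << width} | {s[0] for s in spans} | {s[1] for s in spans})
    out = []
    for lo, hi in zip(cuts, cuts[1:]):
        covering = [s for s in spans if s[0] <= lo and hi <= s[1]]
        # The longest covering prefix wins; None marks addresses with no route.
        iface = max(covering, key=lambda s: s[2])[3] if covering else None
        out.append((lo, hi, iface))
    return out


def merge_adjacent(ranges):
    """Merge neighbouring ranges that share the same output interface."""
    merged = []
    for lo, hi, iface in ranges:
        if merged and merged[-1][2] == iface and merged[-1][1] == lo:
            merged[-1] = (merged[-1][0], hi, iface)
        else:
            merged.append((lo, hi, iface))
    return merged


def align(ranges, width):
    """Split each range into power-of-two blocks aligned to their own size."""
    blocks = []
    for lo, hi, iface in ranges:
        while lo < hi:
            size = (lo & -lo) or (1 << width)   # largest alignment that lo permits
            while size > hi - lo:
                size >>= 1
            blocks.append((lo, size, iface))
            lo += size
    return blocks


# Toy 4-bit table: 0* -> 1, 01* -> 2, 1* -> 2, 111* -> 3
prefixes = [(0b0000, 1, 1), (0b0100, 2, 2), (0b1000, 1, 2), (0b1110, 3, 3)]
blocks = align(merge_adjacent(lpm_ranges(prefixes, 4)), 4)
print(blocks)                              # [(0, 4, 1), (4, 4, 2), (8, 4, 2), (12, 2, 2), (14, 2, 3)]
print(min(size for _, size, _ in blocks))  # 2, the minimum range granularity here
```

Note that the merged range 4 to 13 for interface 2 has to be split into three aligned blocks, which is the cost of the alignment requirement.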
Range size, which is defined as log2(minimum_range_granularity), thus represents the number of least significant bits of an IP address that can be ignored during routing table lookup, since destination addresses within a minimum-size address range are guaranteed to have the same lookup result. Figure 2 shows the hardware architecture of HARC, which is the baseline cache augmented with a logical shifter. The destination address of an incoming packet is right shifted by range size before it is fed to the baseline cache. Because each address range corresponds to a cacheable item, HARC's effective coverage of the IP address space improves by a factor of minimum_range_granularity.

We processed the routing table according to these steps and calculated the range size parameter, which turned out to be 5. This means that each HARC entry now corresponds to a contiguous range of 32 addresses, a factor of 32 increase in the cache's effective coverage. Evaluating this design against HAC shows that HAC's miss ratio is 1.68 to 2.10 times that of HARC. In terms of average routing table-lookup time, HARC is 58% to 78% faster than HAC, assuming that the hit access time is one cycle and the miss penalty is 120 cycles.

Intelligent host address range cache (IHARC)

This design represents a further performance optimization and an improvement over HARC. The number of distinct outcomes of routing table lookups is equal to the number of output interfaces in a router and is thus relatively small. As a result, we could choose a different hash function than that used in generic CPUs to combine disjoint host address ranges that share the same routing table lookup result into a larger logical host address set. IHARC maps each such logical host address set to one cache entry. This technique thus further increases the Internet processor cache's coverage of the IP address space.

A traditional CPU cache directly takes the least significant bits of a given address, and uses them to index into the data and tag arrays.
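As a rough illustration of the HARC lookup path and the conventional indexing just described, the sketch below right-shifts a destination address by the range size and then splits the result into a set index (low-order bits) and a tag. The function name and the choice of 10 set-index bits are assumptions made for the example; only the range size of 5 comes from the text.

```python
RANGE_SIZE = 5   # low-order address bits ignored by HARC (value reported above)


def harc_index_and_tag(dst_addr, set_bits, range_size=RANGE_SIZE):
    """Return (set index, tag) for a 32-bit destination address."""
    block = dst_addr >> range_size           # right shift drops the ignored bits
    index = block & ((1 << set_bits) - 1)    # least significant bits select the set
    tag = block >> set_bits                  # remaining high-order bits form the tag
    return index, tag


a = 0xC0A80120   # 192.168.1.32
b = 0xC0A8013F   # 192.168.1.63, last address of the same 32-address range
c = 0xC0A80140   # 192.168.1.64, first address of the next range
print(harc_index_and_tag(a, 10) == harc_index_and_tag(b, 10))   # True
print(harc_index_and_tag(a, 10) == harc_index_and_tag(c, 10))   # False
```

Addresses `a` and `b` share one cache entry because they differ only in the five bits that the shifter discards, which is where the factor-of-32 coverage gain comes from.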
Therefore, the corresponding hash function is a simple selector function using the least significant K bits of the input address, where $2^K$ is the number of cache sets. We show that it is possible to further increase every cache entry's coverage of the IP address space by choosing a more appropriate hash function for cache lookup.

Figure 3. A routing table example that illustrates the usefulness of carefully choosing the index bits. The number of distinct address ranges is reduced from 8 to 3 (labeled A, B, and C, and shown grouped using dashed, solid, and dotted lines, respectively, in the original figure).

Host address (bits 3 2 1 0)   Output interface
0000                          1
0001                          1
0010                          2
0011                          2
0100                          1
0101                          1
0110                          2
0111                          2
1000                          3
1001                          3
1010                          2
1011                          2
1100                          3
1101                          3
1110                          2
1111                          2

Consider, for example, the routing table in Figure 3, where there are 16 four-bit host addresses with three distinct output interfaces: 1, 2, and 3. The merging algorithm used in calculating the range size of the HARC will stop after it combines all adjacent address ranges with identical output interfaces. In this case, the total number of address ranges is eight, because the minimum_range_granularity is two. To further grow the address range that a cache entry can cover, we could choose the index bits carefully so that when they are ignored, some of the identically labeled address ranges become adjacent and thus can be combined. For example, if we choose bit 1 as the index bit into the data and tag arrays, we can merge the host addresses 0000, 0001, 0100, and 0101 into an address range. This is because they have the same output interface (1), and when bit 1 is ignored, they form a contiguous sequence: 000, 001, 010, and 011. Similarly, we could also merge 1000, 1001, 1100, and 1101 into an address range, as well as all the host addresses whose corresponding output interface, as shown in Figure 3, is 2. With this choice of the index bit, the total number of address ranges to distinguish during cache lookup is reduced from 8 to 3.

IHARC selects a set of K index bits in the destination address that correspond to $2^K$ cache sets. Each cache set corresponds to a partition of the IP address space.
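The Figure 3 example can be checked with a short sketch that partitions the 16 addresses by a candidate index bit, squeezes that bit out of each address, and counts the runs of identical output interfaces left in each partition. The helper names are hypothetical, and the sketch counts maximal runs without enforcing power-of-two alignment, a simplification that happens not to change the counts for this table.

```python
# Output interface for each 4-bit host address 0000..1111, as listed in Figure 3.
FIG3 = [1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 2, 2]
WIDTH = 4


def drop_bits(addr, index_bits, width=WIDTH):
    """Return (index value, address with the index bits squeezed out)."""
    index = rest = 0
    for pos in range(width - 1, -1, -1):     # scan from the most significant bit
        bit = (addr >> pos) & 1
        if pos in index_bits:
            index = (index << 1) | bit
        else:
            rest = (rest << 1) | bit
    return index, rest


def ranges_per_set(table, index_bits, width=WIDTH):
    """Count maximal runs of equal output interface inside each partition."""
    partitions = {}
    for addr, iface in enumerate(table):
        index, rest = drop_bits(addr, index_bits, width)
        partitions.setdefault(index, {})[rest] = iface
    counts = {}
    for index, labels in partitions.items():
        ordered = [labels[r] for r in sorted(labels)]
        counts[index] = 1 + sum(1 for x, y in zip(ordered, ordered[1:]) if x != y)
    return counts


print(sum(ranges_per_set(FIG3, set()).values()))            # 8 with no index bit
for bit in range(WIDTH):
    print(bit, sum(ranges_per_set(FIG3, {bit}).values()))   # 16, 3, 8, 8 for bits 0..3
```

Bit 1 is the only single-bit choice that brings the total down to 3; choosing bit 0 actually makes things worse, because it separates every adjacent pair.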
In a partition, some address ranges that were not originally adjacent in the IP address space will become adjacent. IHARC then merges any adjacent ranges that are identically labeled into larger ranges. Thus, given a set S containing K bits, we get a set of distinct address ranges for every partition (or cache set). The number of such ranges in partition i becomes the metric $M_i(S)$ of that partition. Since distinct address ranges in a cache set need unique tags, the number of distinct address ranges in a cache set represents the degree of contention in the cache set. Thus, IHARC selects the index bits in such a way that, after the merging operation, both the total number of address ranges in the entire address space and the difference in the number of address ranges across cache sets are minimized.

The index bit selection algorithm is a greedy algorithm that grows the index bit set from 0 to K bits. Given a set S of bits that the algorithm has already included in the index bit set, it computes the desirability of including bit j in the set by computing the metric $M_i(S \cup \{j\})$ for each partition i induced by the bit set $S \cup \{j\}$. The algorithm then chooses the bit j that minimizes

$$\sum_i M_i(S \cup \{j\}) + W \sum_i \left| M_i(S \cup \{j\}) - \bar{M}_{S,j} \right|$$

where $\bar{M}_{S,j}$ is the mean of $M_i(S \cup \{j\})$ over all partitions, and W is a parameter that determines the relative weight of the two terms in the minimization. Note that the second term of the weighted sum minimizes the deviation of each set's metric from the mean, and is included to prevent the occurrence of hot-spot partitions, which potentially could lead to excessive conflict misses in the IHARC cache sets.

[Figure 4. The intelligent host address range cache architecture in which a programmable hash engine provides the flexibility needed to tailor the hash function to the network routing table. Block labels: destination IP address, right shifter, range size, programmable hash engine, index, mask, compare, match?, select, data, output.]

Figure 4 shows the hardware architecture of a host address range cache with a programmable hash function engine that lets us tailor the choice of the index bit set to individual routing tables. Given an N-bit address, the K index bits select a particular cache set, which corresponds to one of the partitions induced by IHARC. The remaining $N - K$ bits of the address form a value that falls within one of the address ranges in this partition. Every cache entry holds a range that acts as the tag for the entry. Thus, a range check is required to figure out whether a lookup is a hit. Since a general range check is too expensive to incorporate into caching hardware, the algorithm guarantees during the merge step that each resulting address range size is a power of two and that the starting address of each range is aligned with a multiple of its size. Hence, the range check is performed simply by a mask-and-compare operation. Therefore, each tag entry in IHARC includes a tag field as well as a mask field. The price of simplifying cache lookup hardware is an increase in the number of resulting address ranges, as compared to the case where no such alignment requirement is imposed. In case of a miss, the algorithm can perform a lookup of the NART data structure to populate the cache set with the appropriate address range as the tag.

Compared to the generic HARC, IHARC reduces the number of distinct address ranges that it needs to distinguish, by a careful choice of the index bits. In particular, for the IPMA routing table [6], the index bit set selection algorithm effectively reduces the number of distinct address ranges from HARC to IHARC by three orders of magnitude.
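The mask-and-compare range check described above can be sketched in a few lines. This is an illustrative software model only: `make_entry` and `lookup` are hypothetical names, the loop stands in for comparators that hardware would evaluate in parallel, and the sample entries reuse the Figure 3 partition in which index bit 1 equals 0.

```python
def make_entry(start, size, iface):
    """Build a (tag, mask, iface) cache entry for an aligned power-of-two range."""
    assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
    assert start % size == 0, "start must be a multiple of size"
    mask = ~(size - 1)               # clears the low bits that vary inside the range
    return (start & mask, mask, iface)


def lookup(cache_set, value):
    """Return the output interface on a hit, or None on a miss."""
    for tag, mask, iface in cache_set:   # hardware would compare all ways at once
        if value & mask == tag:          # mask-and-compare range check
            return iface
    return None


# Reduced 3-bit addresses 000-011 map to interface 1 and 100-111 to interface 3.
cache_set = [make_entry(0b000, 4, 1), make_entry(0b100, 4, 3)]
print(lookup(cache_set, 0b010))       # 1
print(lookup(cache_set, 0b110))       # 3
print(lookup(cache_set[:1], 0b101))   # None, a miss that would trigger a table lookup
```

Because every range is aligned to its own size, a single AND plus an equality test replaces the two comparisons a general range check would need.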
In addition, the resulting number of IHARC address ranges is only three to four times the number of entries in the original routing table, even though conventional cache lookup hardware can now look up the resultant address ranges.

For the packet trace used in the study, Table 1 lists the miss ratio and lookup time comparison between HAC, HARC, and IHARC, assuming that the block size is one entry wide. In terms of average routing table-lookup time, HARC is between 2.24 and 3.18 times slower than IHARC. This is because HARC's miss ratios are 2.91 to 7.09 times larger than IHARC's. In addition, the miss ratio gap between HARC and IHARC increases with the degree of associativity, because the degree of variation in the address stream, as seen by IHARC, is lower than that seen by HARC. Hence, IHARC benefits more from eliminating conflict misses through higher associativity. This result conclusively demonstrates that IHARC yields a significant performance improvement over HARC. Compared to HAC, IHARC reduces the average routing table lookup time by up to a factor of five.

Table 1. Miss ratio and lookup time comparison for IHARC, HARC, and HAC, assuming that the block size is one entry wide.*

Cache                   Miss ratio                    Lookup time (cycles)
size   Associativity    IHARC    HARC     HAC        IHARC    HARC     HAC
4K     1                2.30%    7.5%     12.71%     3.74     9.92     16.12
4K     2                1.12%    4.58%    8.42%      2.33     6.45     11.02
4K     4                0.57%    3.64%    6.86%      1.68     5.33     9.16
8K     1                1.54%    4.48%    7.57%      2.83     6.33     10.01
8K     2                0.48%    2.20%    4.59%      1.57     3.62     6.46
8K     4                0.22%    1.56%    3.29%      1.26     2.85     4.91

* The input to the cache simulator is a packet trace collected from the main router of Brookhaven National Laboratory. The HARC's range size is 5. Lookup times are reported in cycles. Hit access time is one cycle, whereas miss penalty is 120 cycles.

This work is one of the first research efforts on cache designs for emerging Internet processors. Based upon the trace used in our study, it seems that there is sufficient temporal locality in the packet stream to justify the use of a routing table cache in Internet processors. However, due to weak spatial locality, the block size should be as small as possible, preferably one entry wide. Furthermore, caching address ranges rather than individual addresses greatly improves the effective coverage of caches of a given size and therefore their hit ratios. Finally, a careful choice of the index bits during cache lookup is crucial and can dramatically reduce the number of address ranges that need distinguishing, and thus reduce the cache miss ratio.

The main application of the cache design described here is to edge routers, where the temporal locality of network packets is higher than that in backbone routers. Because edge routers account for a majority of commercial routers sold today, we believe that the proposed cache design will have significant practical impact on future network processor products. We are planning to extend the idea of a network processor cache to attack the problem of packet classification, which requires examination of multiple packet header fields and thus a longer lookup key. In addition, we are investigating schemes to incrementally update the network processor cache in the presence of frequent route updates without invalidating the entire cache.

References

1. W. Doeringer, G. Karjoth, and M. Nassehi, "Routing on Longest Matching Prefixes," IEEE/ACM Transactions on Networking, Vol. 4, No. 1, Feb. 1996, pp. 86-97.
2. M. Waldvogel et al., "Scalable High Speed IP Routing Lookups," Proc. ACM Sigcomm 97, ACM Press, New York, 1997, pp. 3-14.
3. V. Srinivasan and G. Varghese, "Faster IP Lookups Using Controlled Prefix Expansion," Proc. ACM Sigmetrics, ACM Press, 1998, pp. 1-10.
4. T. Chiueh and P. Pradhan, "High Performance IP Routing Table Lookup using CPU Caching," Proc. IEEE Infocom 99, IEEE, New York, 1999, pp. 1421-1428.
5. T. Chiueh and P. Pradhan, "Cache Memory Design for Network Processors," Proc. IEEE HPCA-6, IEEE Computer Society, Los Alamitos, Calif., 2000.
6. Michigan University and Merit Network, Internet Performance Measurement and Analysis (IPMA) Project; http://nic.merit.edu/ipma.

Tzi-cker Chiueh is an associate professor in the Computer Science Department of the State University of New York at Stony Brook. His research interests are 3D graphics architecture, scalable and secure network routers and gateways, and high-performance and storage systems. Chiueh received a BS in electrical engineering from National Taiwan University, an MS in computer science from Stanford University, and a PhD in computer science from the University of California, Berkeley. He received a National Science Foundation Career award in 1995.

Prashant Pradhan is a PhD student in the Computer Science Department of the State University of New York at Stony Brook. He received his BTech in computer science and engineering from the Indian Institute of Technology in Delhi, India. His research interests are high-speed networking and operating systems.

Direct questions about this article to Prashant Pradhan, State University of New York at Stony Brook, Computer Science Department, Stony Brook, NY 11794-4400; prashant@cs.sunysb.edu.