FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

SIAM J. SCI. COMPUT. Vol. 22, No. 6, pp. 2113-2134. © 2001 Society for Industrial and Applied Mathematics

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

ZHAO ZHANG AND XIAODONG ZHANG

Abstract. In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals. We evaluate the merits and limits of each technique and its application- and architecture-dependent conditions for developing cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulation and measurements on two commercial symmetric multiprocessors (SMPs) to provide architectural insights into the methods and their implementations. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and translation-lookaside buffer (TLB) cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and multiprocessors.

Key words. cache optimizations, memory hierarchy, bit-reversals, shared-memory multiprocessors, parallel computing

AMS subject classifications. 68P05, 65Y20, 65Y05

PII. S106482759935979

1. Introduction. Many FFT algorithms require the data reordering operation of bit-reversal. If the bit-reversal operations are not implemented properly, those FFT operations can slow down significantly. On the other hand, it is easy to improperly implement bit-reversals on uniprocessors and multiprocessors. This is because the performance of bit-reversals is highly sensitive to how caches and memory hierarchies are used in the implementations. In other words, a fast bit-reversal implementation must be cache effective.
Several papers have well addressed the significance and effects of considering the memory hierarchy in bit-reversals (e.g., [2], [11], and [15]). Besides their important usage for FFT, different versions of bit-reversal implementations can also be used as benchmark programs to evaluate the memory hierarchy of various computer systems. With the rapid development of RISC and VLSI technology, the speed of processors has increased dramatically in the past decade. Processor clock rates have doubled every 1 to 2 years. Nevertheless, memory speed has increased at a much slower pace. Therefore we have seen, and will continue to see, an increasing gap in speed between processor and memory, and this gap makes the performance of application programs on both uniprocessor and multiprocessor systems rely more and more on effective usage of caches. Performance degradation of bit-reversals is mainly caused by cache conflict misses. Bit-reversals are often repeatedly used as fundamental subroutines in scientific programs, such as FFT. Thus, in order to gain the best performance, cache-optimal methods and their implementations should be carefully and precisely done at the programming level. This type of performance programming for some special programs, such as bit-reversals, may significantly outperform an optimization from an automatic tool, such as a compiler.

A standard bit-reversal program is described as follows:

for i = 1, N
    Y[i'] = X[i]

The values of array X in their sequential positions i are copied to array Y in their bit-reversed positions i', for i = 1, ..., N, where N = 2^n. The above program says that X is a bit-reversal reordering of Y. The indices i and i' of X and Y are represented by sequences of binary digits. A position i and its bit-reversal i' are defined in [11] as

    i = Σ_{j=0}^{n-1} a_j 2^j   and   i' = Σ_{j=0}^{n-1} a_j 2^{n-1-j},

where a_j is either 0 or 1. For example, the 5-bit reversal of i = 01101 is i' = 10110. The bit-reversal operations have the following unique characteristics. First, in many implementations, each element in an array is used (read or written) only once, for its copy operation. Thus, the reorderings have only spatial locality but no temporal locality for elements. Second, the loops follow certain sequences with high spatial locality. Bit-reversals are highly sensitive to problem sizes, cache sizes, and cache line sizes. Since the data array sizes are a power of 2, multiple elements stored in different memory locations could map to the same cache line, causing severe cache conflict misses and cache thrashing. The reason is simple: most commercial computers use direct-mapped or n-way set-associative caches, where the mapping functions of cache sizes are also related to powers of 2. We use an identical unit, called an element, to represent the sizes of data arrays, caches, and others such as buffers and blockings. One element may represent a 4-byte integer, a 4-byte floating point number, or an 8-byte double floating point number.

Received by the editors September 17, 1999; accepted for publication (in revised form) November 2, 2000; published electronically April 12, 2001. This work is supported in part by the National Science Foundation under grants CCR-94719 and CCR-9812187, by the Air Force Office of Scientific Research under grant AFOSR-95-1-215, and by Sun Microsystems under grant EDUE-NAFO-9845. Preliminary results of this work were presented at the 1999 Supercomputing Conference, Portland, OR. http://www.siam.org/journals/sisc/22-6/3597.html

Department of Computer Science, College of William and Mary, Williamsburg, VA 23187-8795 (zzhang@cs.wm.edu, zhang@cs.wm.edu).
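The index arithmetic above can be sketched in Python (the helper names are ours, not the paper's; the paper's actual programs appear in its appendix):

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)  # shift the lowest remaining bit of i into r
        i >>= 1
    return r

def naive_bit_reversal(X, n):
    """The standard program: Y[i'] = X[i], with i' the bit-reversal of i."""
    N = 1 << n
    Y = [None] * N
    for i in range(N):
        Y[bit_reverse(i, n)] = X[i]
    return Y
```

Since bit reversal is an involution, applying the reordering twice returns the original vector; each element is read or written exactly once per pass, which is why the reordering has spatial but no temporal locality.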
Because the sizes of caches and cache lines are always a multiple of an element in practice, this identical unit for all sizes is practically meaningful for both architects and application programmers and makes the discussion straightforward. Here are the algorithmic and architectural parameters we will use to describe cache-optimal methods for bit-reversals.

C: data cache size, which could be further defined as C_L1 and C_L2 for the data cache sizes of L1 and L2, respectively.
L: the size of a cache line, which could be further defined as L_L1 and L_L2 for the cache lines of L1 and L2, respectively.
K: cache associativity, which could be further defined as K_L1 and K_L2 for the cache associativity of L1 and L2, respectively.
K_TLB: translation-lookaside buffer (TLB) cache associativity. (A TLB cache is a small buffer that holds the most recent memory page mappings. The concept will be discussed in detail later in the paper.)
T_s: number of entries in the TLB cache.
N: the data size for the bit-reversal vector of size N = 2^n, where n is the number of bits used in the vector index.
B_cache: blocking size of a B × B submatrix for the cache.
B_TLB: blocking size for the TLB.
P_s: the memory page size.

In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations. We evaluate the merits and limits of each technique and its application- and architecture-dependent conditions for developing cache-optimal methods. Although our methods are developed for out-of-place bit-reversals, they are also applicable to in-place bit-reversals where X and Y are the same array. Symmetric multiprocessor (SMP) systems have become practical and cost-effective servers for scientific computing and other applications. Although parallel efficiency and communication latency reduction are major performance concerns, computations on an SMP share many common considerations with uniprocessors. The most important one is the effective usage of memory hierarchies. When the cache locality of each processor is effectively exploited, the memory accesses to the shared memory will be reduced, and so will the memory access contention. People have studied parallel data reordering algorithms on distributed-memory systems with special networks, such as hypercubes (see, e.g., [6] and [9]). In this study, we target parallel bit-reversals on SMPs and show the significant impact of cache and TLB considerations on efficient method development and implementations. We also evaluate the performance impact of SMP interconnection networks. Our algorithm designs and implementations are optimized by considering several nontraditional but practical and performance-effective factors, namely, the programming complexity, memory space requirement, instruction count, cross interference among the data arrays, and program portability. We will summarize the limits and merits of the different bit-reversal methods based on these considerations after we have discussed the designs and presented the performance results, aiming at providing a guideline for performance programming and memory performance optimization for other scientific computing applications.
We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and TLB cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and SMP multiprocessors.

The rest of the paper is organized as follows. We discuss the inherently blocked nature of bit-reversal operations and the effectiveness and limits of blocking techniques for solving the problems in section 2. In section 3, we evaluate a software buffering technique and our methods using existing hardware components for implementing the data reordering. Our new method integrating blocking and padding is presented in section 4. We discuss blocking and padding techniques for the TLB in section 5. The experimental measurements and analyses for evaluating the different methods on uniprocessor workstations and SMP multiprocessors are reported in sections 6 and 7. We summarize the work in section 8.

2. Blocking for bit-reversals. The blocked memory access patterns of bit-reversals can be easily viewed when we convert the one-dimensional vector to a two-dimensional equivalent array in Figure 1. All the reordering elements and elements in other groups will be allocated along the columns of the two-dimensional equivalent array, forming a block. In this blocking method, the bit-reversal reordering is performed block by block, where the operations for each block are implemented similarly to the Evans method

Fig. 1. Memory layout of blocked bit-reversals, where B = B_cache.

[7]. (The Evans method is used to construct a hybrid method in [11].) The program in the appendix presents such an implementation along with the padding technique. (The padding technique will be discussed in section 4.) The blocking algorithm we have used can be classified as a hybrid method. In general, for a bit-reversal vector of N = 2^n elements, the block size B_cache is a power of 2, denoted by B_cache = 2^b. Each of the B_cache elements in X has the address format fg, where g has b bits and f has n − b bits. Each of the corresponding B_cache elements in Y has the address format g'f'. Therefore, the distance between two nearest elements in the same group in Y is 2^{n−b} = N/B_cache. Choosing the cache line size as the minimum blocking size (B_cache = L), we can easily calculate the maximum N's for the bit-reversal vector based on different data cache sizes. For example, for a large cache of 2 MB, the blocking technique is effective up to an 18-bit reversal reordering, which represents 262,144 data elements, where each element is an 8-byte double and the cache line is 32 bytes. In practice, the data size of bit-reversals could easily be larger than n = 20 [11].

3. Blocking with buffers. As we have shown, the effectiveness of blocking is limited by the size of the data arrays. In theory, the smallest blocking size could be 2 × 2. A cache line in a modern processor usually holds more than 2 elements, i.e., is larger than 16 bytes. If we choose a 2 × 2 block, the data in a cache line will not be fully used before their replacement, causing more cache misses in the reorderings. The bit-reversal reordering demands large cache space to make blocking effective. In order to effectively use limited cache space, Gatlin and Carter [8] present an effective method using an additional buffer to first hold the conflict-missed elements of a block in one array temporarily and then copy the block to their reordered positions in the other array.
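A minimal Python sketch of this buffered, block-by-block reordering (the split of the index into high, middle, and low bits follows the fg address format above; the helper names and the n ≥ 2b assumption are ours):

```python
def bit_reverse(i, n):
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def blocked_bit_reversal(X, n, b):
    """Reorder via B x B blocks (B = 2**b) with a small software buffer.

    Split i = a*2**(n-b) + m*2**b + c (a: top b bits, m: middle n-2b bits,
    c: low b bits); then i' = rev(c)*2**(n-b) + rev(m)*2**b + rev(a).
    Gathering a block row by row keeps the reads contiguous; scattering the
    buffer writes each destination row into one contiguous region of Y.
    Assumes n >= 2*b.
    """
    N, B = 1 << n, 1 << b
    Y = [None] * N
    buf = [None] * (B * B)
    for m in range(1 << (n - 2 * b)):
        rm = bit_reverse(m, n - 2 * b) << b
        for a in range(B):                      # gather: contiguous reads from X
            base = (a << (n - b)) | (m << b)
            for c in range(B):
                buf[c * B + a] = X[base | c]
        for c in range(B):                      # scatter: one contiguous region of Y
            dest = (bit_reverse(c, b) << (n - b)) | rm
            for a in range(B):
                Y[dest | bit_reverse(a, b)] = buf[c * B + a]
    return Y
```

The extra pass through `buf` models the doubled copy cost of a software buffer discussed in section 3.1; the register-based variants below remove that cost by folding the buffering into load/store pairs.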
In this section, we discuss implementations of blocking methods supported by both software and hardware buffers.

3.1. Blocking with a software buffer and its limits. Because this buffer is defined in a reordering program, we call it a software buffer. This buffer shares the allocation space with the data arrays X and Y in the cache. There are two major limits in this approach. First, the buffer itself may interfere with the arrays X and Y, causing additional access conflicts. This interference is certain when the sizes of X and Y are larger than the size of the cache, C. Each cache block or set is mapped from arrays X and Y more than once. No matter where the buffer is

located in the cache, it will interfere with them. The larger the buffer size, the more interference will occur. The second limit is the additional copy overhead time involved in moving data from the array X to the buffer and then in moving them to the target array in their reordered positions. This overhead exactly doubles the instruction cycles for data copying. The data copy through a buffer is a worthy investment if the number of cycles lost from cache misses is much higher than the additional CPU cycles for the data copy. To overcome the two limits, we propose several alternatives to eliminate the cache interference caused by the software buffer and to reduce or eliminate the data copy time.

3.2. Cache-structure-dependent blocking. We will present several blocking methods which depend on the cache organization of the running machine. These methods can be implemented at the user programming level.

Blocking based on set associativity. The cache associativity, K, is an important factor to consider for blocking. If K ≥ L, an L × L or a K × K blocking method for bit-reversals would effectively avoid conflict misses. Because the hit time is a less sensitive performance factor than the cache misses in the L2 cache, a higher associativity of the L2 cache is more effective than that of L1. If a cache line holds 4 double floating point elements (L = 4 elements of 32 bytes in Pentium processors), a 4 × 4 blocking method without any data buffer is able to fully use the cache associativity. The blocking method would gain more benefit from caches of associativity higher than 4, such as a design in [2]. What would we do if the associativity is not sufficiently high for the blocking, or K < L? One solution is to make a K × L rectangular blocking. Unfortunately, bit-reversals require an L × L blocking.

Supplement with registers. We may also consider using the available registers to supplement a low-associativity cache. The number of registers available to a user program is limited. Normally, a uniprocessor provides up to 16 registers to users.
For example, for a 2-way associative cache, we need 8 registers to buffer 2 additional cache lines so that we could effectively make a 4 × 4 blocking as if we ran the program on a 4-way associative cache. We develop a more efficient blocking method for bit-reversals, which requires only (L − K) × (L − K) registers. The operation sequence of this method is in three steps: (1) The first L − K cache lines of X are stored in the K cache lines of the set and accessed by copying their (L − K) × K elements to Y in the reordered positions and copying the rest of the (L − K) × (L − K) elements to a buffer consisting of (L − K) × (L − K) registers. (2) The rest of the K lines of X are brought into the cache set, and their K × K elements are copied to Y in the reordered positions. (3) Finally, the (L − K) × (L − K) elements in the register buffer and the rest of the (L − K) × K elements are copied to Y in their reordered positions. A cache set will be used more than twice if K < L/2. Besides the advantage of no access conflicts between the register buffer and the arrays X and Y, there is another advantage of using registers to buffer the data in a load/store processor. A data copy through the registers from X to Y is equivalent to the two-step process of load and store, and thus there is no additional overhead. We will show our experimental performance in section 5.

Using registers as the buffer. If the cache is direct-mapped, we have to fully rely on a buffer for blocking. Here we discuss some ways to use registers to serve as the buffer in order to eliminate the potential cache conflicts and eliminate extra data

copying by taking advantage of the load/store operations. The number of registers for a buffer of L × L elements is determined by the number of elements a cache line can hold. The length of a cache line of the L1 cache in some processors, such as the Sun microSPARC I and II, is L = 2 elements (16 bytes), which holds only two floating point elements. The blocking size could be as small as 2 × 2 using a buffer of 4 registers. The cache line length of the L1 cache in many advanced workstations is 32 bytes, such as in the Sun Ultra and Intel Pentium processors, each of which holds 4 double floating point elements. In this case, we need a buffer of 4 × 4 = 16 registers for a blocking. This would be difficult due to the limited number of available registers. We have two solutions for this. First, we use only the number of registers available to form a smaller buffer than it should be, which will not make each cache line fully used and will cause additional cache misses. Our experiments show that this blocking method using a buffer with an insufficient number of registers still achieves a reasonable performance improvement and outperforms the implementation using a software buffer. The second method is to further reduce the size of the buffer, which reduces the required number of registers, by using our (L − K) × (L − K) blocking method.

L1 cache versus L2 cache. The main objective of building two-level caches is to make the L1 cache small enough to catch up to the cycle time of the fast CPU and to make the L2 cache large enough to capture as many accesses as possible [12]. In practice, the data size of a bit-reversal is larger than the size of the L2 cache. L1 and L2 caches offer different sizes of the cache line, L, and the associativity, K. Both of the following alternatives are effective for blocking. (1) Taking advantage of the short cache line and fast hit time of the L1 cache, we could effectively use limited registers as the buffer and make a small L × L blocking effective.
(2) Taking advantage of the high associativity of the L2 cache, we could effectively use both the associativity and supplemental registers as the buffer and make a large L × L blocking effective.

3.3. Victim-cache-aided blocking. A victim cache [13] is a small, fully associative cache serving as a buffer containing only cache blocks evicted by conflict misses from the L1 cache. This is an on-chip cache connected between L1 and the next-level cache or memory. On a miss in L1, the victim cache is first checked before going to the next level. If the missed block is found there, the victim cache block and the L1 cache block are swapped, and then the block is delivered to the CPU from the L1 cache. Victim caches have been available in some commercial workstations, such as the HP 7200. The minimum number of victim cache lines required for L × L blockings of transpose and bit-reversal reorderings is L − K. In the execution, the L × L elements of each blocking are allocated in a set of K lines in the L1 cache, and the rest of the elements are allocated in the L − K lines of the victim cache. The victim cache is able to hold all the conflict misses in the reorderings by an L × L blocking. In addition, a conflict miss in the L1 cache that hits in the victim cache has only one additional cycle of miss penalty. Thus, a simple L × L blocking method would be effective if such a victim cache is available. However, the victim cache does not have a direct connection with the CPU. When a data hit happens in the victim cache, the data has to be first swapped to the L1 cache and then delivered to the CPU. This swapping operation is unnecessary for our reordering algorithms. Without counting the cold misses of bringing in the elements of the first column for an L × L blocking, and considering the LRU replacement policy, the entire blocking will have L × (L − 1) conflict misses in the L1 cache, which are then found in the victim cache. This also means that each such blocking needs L × (L − 1) additional swapping cycles between the L1 cache and the victim

cache, which is independent of the associativity, K. In contrast with the blocking method based on the associativity supplemented by registers, the swapping cycles in the victim cache are additional overhead. Despite this, a victim-cache-aided blocking is more efficient than a blocking method with a software buffer because there are no cross-interference conflicts between the victim buffer and the arrays X and Y.

4. Blocking with padding. Padding is a technique that modifies the data layout of a program so that conflict misses are reduced or eliminated. The data layout modification can be done at run-time by system software [3, 19] or at compile-time by compiler optimization [16]. Sharing the same objective as compiler optimization, to change the addresses of potentially conflicting cache blocks in the reorderings, we insert padding variables inside the data array. For example, the padding can be done as part of the last butterfly of the decimation in an FFT computation without additional cost, and the output is not padded. However, we notice that this free padding opportunity may not be easily found, and the bit-reversal result may be padded in some cases. For example, the padding of a recursive implementation of the Cooley-Tukey FFT algorithm [5] is more complex than the padding in our implementations. The padding method produces padded results in a vector if the bit-reversals are done in an in-place fashion. The accesses to the padded results need to go through a simple address-converting process with additional CPU cycles. In addition, our methods target bit-reversals based on data sizes of powers of 2. However, FFT algorithms are not limited to this data size. If the data size is not a power of 2, the padding method will be more complex to implement. Poor memory performance of bit-reversals has been reported even for non-power-of-2 data sizes (see, e.g., [2]).
Since the data arrays of bit-reversals form a vector whose size is a power of 2, the padding is highly regular, inserting L elements (a cache line of space) starting at the vector positions N/L, 2N/L, ..., and (L − 1)N/L. Using L elements, or one cache line of data, to separate the vector at these L points can completely eliminate the cache conflicts caused by the address mapping based on powers of 2. Again, during execution, the reordering data copies are directly conducted between the arrays X and Y without going through a data buffer. Another advantage is that the number of padding elements needed is only L × L elements, or L cache lines, and is independent of the data array size, N. Compared with the data size of bit-reversals, the number of padding elements is insignificant. Figure 2 shows how the data layout of a bit-reversal vector is modified by padding so that conflict misses are eliminated. Compiler optimization targets a large range of application programs and automatically inserts padding variables in the programs for users. An optimal padding is application-program dependent. For example, padding positions differ from application to application in order to effectively change the addresses of conflicting cache blocks [18]. Based on the unique nature of the data reordering, the optimal padding unit used by our methods for bit-reversals is a cache line of L elements. In contrast, a compiler optimization normally uses an element as the basic padding unit. How many padding units to use and where to pad in the data arrays are determined by some approximation models, which may not precisely fit the unique memory access patterns of each case. In addition, applying the padding technique to bit-reversals embedded in applications would not increase the complexity of the entire computation. For example, when a padded bit-reversal is performed in an FFT computation, it has little effect on the neighboring butterfly operations.
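The regular padding layout described above amounts to a simple index mapping; a sketch (our naming, assuming the interior cut points at multiples of N/L):

```python
def padded_index(i, N, L):
    """Address of logical element i after inserting one cache line
    (L elements) at each of the positions N/L, 2*N/L, ..., (L-1)*N/L.
    The padded array needs N + (L-1)*L elements in total."""
    return i + (i // (N // L)) * L
```

Two indices that are N/L apart, which map to the same set in a power-of-2 cache whose size divides N/L, end up offset by an extra cache line relative to each other, so the conflict disappears at the cost of only (L − 1) × L extra elements.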

Fig. 2. Data layout of a bit-reversal modified by padding, where B = B_cache = L.

5. Blocking and padding for the TLB. The TLB is a special cache that stores the most recently used virtual-physical page translations for memory accesses. The TLB is a small and usually fully associative cache. Each entry points to a memory page of 4 KB to 64 KB. The page size is normally fixed at the level of the operating system and cannot be changed by user programs. A TLB cache miss will make the system retrieve the missing translation from the page table in memory and then select a TLB entry to replace. When the data to be accessed in our blocking method is larger than the amount of data of all the memory pages that the TLB can hold, we will have TLB thrashing. In this section, we will discuss and present blocking and padding methods for TLB cache optimizations.

5.1. Blocking for a fully associative TLB. Before giving a general model to show how the blocking size is affected by the TLB size, let's go through an example to show that a moderate N for bit-reversals would easily lead to TLB cache thrashing. The 64 pages in the TLB of the Sun UltraSPARC-II processor hold 64 × 1024 = 65,536 elements, which represents a 16-bit reversal of N = 2^16. Since we have two vectors X and Y, the TLB can hold a 15-bit reversal of N = 2^15 elements. This is also consistent with our experiments on this machine, where the execution time per element was a constant until n = 15, but sharply increased at n = 16 bit-reversals, caused by TLB misses. In our cache-optimal methods, we include an outer loop to form a blocking for the TLB, whose size is denoted as B_TLB. The blocking size B_TLB for bit-reversals when N ≥ T_s × P_s is B_TLB ≤ T_s, where P_s is the page size in elements and T_s is the number of entries of the TLB. On the other hand, B_TLB should be chosen as large as possible to make effective use of the page space. When N < T_s × P_s, the data size of a bit-reversal will be less than the data size covered by the TLB. Thus there is no need for TLB optimizations.
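The UltraSPARC-II arithmetic generalizes to a simple capacity check (a sketch with our naming; T_s and P_s as defined in section 1, and two arrays X and Y):

```python
def max_reversal_bits(T_s, P_s, n_arrays=2):
    """Largest n such that n_arrays vectors of 2**n elements fit within
    the pages covered by a fully associative TLB of T_s entries, each
    mapping a page of P_s elements."""
    capacity = T_s * P_s              # elements covered by the TLB
    n = 0
    while n_arrays * (1 << (n + 1)) <= capacity:
        n += 1
    return n
```

For the UltraSPARC-II figures (T_s = 64, P_s = 1024 eight-byte elements, two arrays) this gives n = 15, matching the observed jump in time per element at n = 16.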

FAST BIT-REVERSALS 2121

Fig. 3. Padding for TLB: the data layout is modified by inserting a page space at multiple locations, where B_TLB = 4, K_TLB = 1, T_s = 8.

5.2. Padding for a set-associative TLB. Some processors' TLBs are not fully associative but set-associative. For example, the TLB in the Pentium-II 400 processor is 4-way associative (K_TLB = 4). A simple blocking based on the number of TLB entries is not cache-optimal, because multiple pages within a TLB-sized blocking may map to the same TLB set and cause TLB conflict misses. If the size N of a bit-reversal vector is a multiple of T_s × P_s, where T_s is the number of TLB entries and P_s is the page size in elements, and if K_TLB < B_TLB, then TLB conflict misses will occur. This can easily happen in practice. For example, on the Pentium-II 400, N is equal to 128K elements (one element = 8 bytes) for a 17-bit-reversal, and this N is two times the value T_s × P_s of the machine, where T_s = 64 and P_s = 1024 elements. In a way similar to the technique of padding for the data cache, we insert a page of elements (a page of space) starting at the vector positions N/L, 2N/L, ..., and (L-1)N/L to eliminate the TLB conflict misses. Figure 3 gives an example of the padding for TLB, where the TLB is a direct-mapped cache of 8 entries, the blocking size is B_TLB = 4, and the number of elements in a row is a multiple of 8 pages of elements. Before padding, each blocking row is mapped to the same TLB entry. After padding, these rows are mapped to different TLB entries. Combining padding for the data cache and padding for the TLB, we insert L + P_s elements (a page plus a cache line of space) at L locations separated by a distance of N/L elements. In practice, we selected more than N/L points at which to insert the padding variables, to eliminate both data cache and TLB conflict misses. This approach effectively merges two nested paddings (one for the data cache and the other for the TLB) into a single one.
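The index arithmetic behind this padding can be sketched as follows (a hedged illustration; padded_index and the small example sizes are our own, not the authors'): a page of P_s elements is inserted at each cut point N/L, 2N/L, ..., so a logical index is shifted by one page for every cut point it has passed:

```python
def padded_index(i, n, l, page_elements):
    """Map logical index i of an n-element vector to its position after a
    page of page_elements is inserted at every multiple of n // l."""
    segment = n // l                        # spacing of the insertion points
    return i + (i // segment) * page_elements

# Hypothetical small example: N = 32 elements, L = 4, pages of 8 elements.
print(padded_index(7, 32, 4, 8))    # 7: before the first cut point
print(padded_index(8, 32, 4, 8))    # 16: one padding page now precedes it
```

Rows that previously started T_s pages apart, and therefore collided in the TLB, now start at offsets that differ by one extra page per row.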
An optimal number of insertion points can be easily determined experimentally based on the size of the TLB. The padding optimizations are all based on the L2 cache in our experiments. Partial index-mapping addresses of bit-reversals are precalculated and stored in a small table, as shown in the program in the appendix. This approach further improves

the performance because the table will be accessed in the cache during the computation, and the precalculation overhead is trivial. The time for the precalculation is included in the total execution time.

Table 1
Architectural parameters of the 5 workstations we have used for the experiments. All specifications on L1 cache refer to the L1 data cache, and all L2 caches are unified. Each L2 cache block on the UltraSPARC-IIi consists of two 16-byte subblocks. The hit times of L1, L2, and the main memory are measured by lmbench [14], and their units are converted from nanoseconds (ns) to CPU cycles.

Workstations             SGI O2    Sun Ultra 5     Sun E-450      Pentium    XP1000
Processor type           R10000    UltraSparc-IIi  UltraSparc II  P-II 400   Alpha 21264
Clock rate (MHz)         150       270             300            400        500
L1 cache (KBytes)        32        16              16             16         64
L1 block size (Bytes)    32        32              32             32         64
L1 associativity         2         1               1              4          2
L1 hit time (cycles)     2         2               2              2          3
L2 cache (KBytes)        64        256             2048           256        4096
L2 block size (Bytes)    64        64              64             32         64
L2 associativity         2         2               2              4          1
L2 hit time (cycles)     13        14              10             21         15
TLB size (entries)       64        64              64             64         128
TLB associativity        64        64              64             4          128
Memory latency (cycles)  280       76              73             68         92

6. Experimental results and performance evaluation. We have implemented and tested all the bit-reversal methods discussed in the previous sections on an SGI O2 workstation, a Sun Ultra-5 workstation, a Sun SMP server E-450, a Pentium PC, and a Compaq XP1000 workstation. We present and evaluate the performance of the different methods on the different machines.

6.1. Experimental environment and evaluation methodology. We used lmbench [14] to measure the latencies of the memory hierarchies at different levels on each machine. The architectural parameters of the 5 machines are listed in Table 1. We focus the performance evaluation on methods and implementations of bit-reversals in this paper. We compared all our methods with the method of blocking with a software buffer, which was recently published in [8]. We denote this method as blocking with buffer for bit-reversals.
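The small precalculated table of partial bit-reversals described above can be sketched as follows (our own illustration, not the appendix program): an 8-bit reversal table is computed once, and an n-bit reversal is then composed from two table lookups.

```python
def make_table(bits):
    """Bit-reversals of every value of width `bits`, computed once."""
    table = [0] * (1 << bits)
    for i in range(1 << bits):
        v, r = i, 0
        for _ in range(bits):
            r = (r << 1) | (v & 1)
            v >>= 1
        table[i] = r
    return table

HALF = 8
TABLE = make_table(HALF)      # 256 entries: stays resident in the cache

def bit_reverse(i, n):
    """Reverse the low n bits of i (8 <= n <= 16) via two table lookups."""
    lo = TABLE[i & 0xFF] << (n - HALF)                # low byte -> high bits
    hi = TABLE[(i >> HALF) & 0xFF] >> (2 * HALF - n)  # high bits -> low
    return lo | hi

print(bit_reverse(1, 16))     # 32768: bit 0 has moved to bit 15
```

Because the table is tiny and reused for every index, its accesses hit in the cache, which is why the precalculation overhead noted above is trivial.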
Two of our methods are experimentally compared: breg-br, blocking with associativity and registers for bit-reversals, and blocking with padding for bit-reversals. We have also applied the blocking or padding technique for the TLB in these two methods, based on the TLB associativity. All the programs use a standard subroutine to calculate the bit-reversal value for a given address. The execution times were collected by gettimeofday(), a standard Unix timing function. The resolution of this function is 1 µs on the machines being measured, which is significantly smaller than the execution times of any programs we have measured. A small bit-reversal table is precalculated, and we exclude this calculation time. The reported time unit is cycles per element (CPE):

    CPE = (execution time × clock rate) / N,

where execution time is the measured time in seconds, clock rate is the CPU speed (cycles/second) of the machine on which the program is run, and N is the number of elements of the bit-reversal program. Besides the different methods of bit-reversals, we also measured the execution time of a program copying elements between X and Y. This program has the same number of data-copying operations but a contiguous memory access pattern. We use the execution time of this program to provide a line reference for the bit-reversal programs and show how close a bit-reversal execution is to its ideal time. We denote this reference program as the copy reference. Each method is further divided into a float version, using 4 bytes to represent an element, and a double version, using 8 bytes to represent an element. The data type divisions show the performance impact of the cache line length. For all experiments on the different machines, the bit-reversal programs first call a routine to flush the cache to make sure that all the data are allocated only in memory. All experiments were repeated multiple times.

6.2. Effects of TLB and virtual memory. Before measuring and comparing the performance of the different bit-reversal methods, we experimentally evaluated the effects of the TLB and virtual memory to confirm our assumptions and analyses.

Selection of TLB blocking size. The TLB blocking size is a sensitive performance parameter, determined by the size of the TLB if it is fully associative. We executed the program of blocking with padding for bit-reversals with n = 20 on a single node of the Sun E-450, changing the blocking size for the TLB from 8 to 128. The TLB of the E-450 is a fully associative cache with 64 entries. Figure 4 shows the measured cycles per element of the program for the different blocking sizes on the node. Our experimental results are consistent with our analyses in the previous section. When the blocking size for the TLB reached 64, the execution time curve increased sharply. This is because arrays X and Y together demanded more than 64 pages and caused TLB thrashing.
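The CPE metric defined in section 6.1 reduces to one line of arithmetic; the sample run below uses hypothetical numbers, chosen only to illustrate the units:

```python
def cycles_per_element(seconds, clock_hz, n_elements):
    """CPE = (execution time * clock rate) / N, as defined in the text."""
    return seconds * clock_hz / n_elements

# Hypothetical run: 0.05 s on a 400 MHz machine for N = 2**20 elements.
print(round(cycles_per_element(0.05, 400e6, 2**20), 2))   # 19.07
```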
Virtual memory versus physical memory addresses. All our analyses are based on cache mappings between memory pages in the virtual address space and cache blocks in the physical memory address space. This assumes that contiguous memory pages will be contiguously mapped to the cache. This assumption is guaranteed for virtual-address caches [4]. However, all our experiments have been performed on machines with physically addressed L2 caches. Since the virtual-physical translations for L2 caches are handled by the operating system, our assumptions may sometimes be inaccurate. In order to show that many operating systems attempt to map contiguous virtual pages to cache blocks contiguously, so that our virtual-address-based study is practically meaningful and effective, we conducted a simulation using SimOS [17], together with measurements on different workstations, to observe how an operating system translates virtual memory addresses to physical addresses. SimOS simulates the complete hardware of SGI machines and runs the IRIX 5.3 operating system in the simulation. We executed a blocking-only program of bit-reversals using the cache line L as the blocking size. The bit-reversal vector size was changed from n = 15 to n = 22. We measured the miss rates on array X. The cache size was set to 2 MB, holding the two double-type arrays up to n = 18 in the virtual memory space. Figure 5 gives consistent results from the SimOS simulation: when n > 18, the miss rate on array X increased sharply from 12.5% to 100%. From this experiment, we have observed that the virtual-physical translations from

the IRIX 5.3 operating system are quite consistent with our assumption of contiguous allocations.

Fig. 4. Changing the TLB blocking sizes on a single node of the Sun E-450: when the blocking size for TLB was larger than 32, the execution time curve sharply increased.

Fig. 5. Using SimOS to observe the miss rates while changing the size of the bit-reversal arrays of a blocking-only program: when n > 18, the miss rate sharply increased to 100%.

We have also run similar experiments on the different target workstations with different operating systems, such as Linux and Solaris, to measure the changes in execution times as the data size changes. Our measurements are also consistent

to the SimOS results and indicate that the larger the data arrays used, the more likely an operating system is to allocate the pages contiguously. Because our study targets large data sets, our analyses based on the virtual memory space are reasonably accurate. In addition, our methods assume that the operating system uses a uniform page size for page allocation, which is consistent with most commercial and commonly used operating systems.

6.3. Performance of the hybrid method for bit-reversals. In order to show the effectiveness of our cache optimizations, we first plot the measured execution times of the hybrid method¹ in float data type on the Pentium-II and the Ultra-5 machines in Figure 6. Although the hybrid method did reasonably well for n ≤ 16 on the Pentium-II and n ≤ 12 on the Ultra-5, the execution times significantly increased, due to limited cache performance, after the data size was further increased.

Fig. 6. Execution times of the hybrid method on the Pentium-II (left figure) and on the Ultra-5 machine (right figure).

6.4. Performance comparisons on the SGI O2. The SGI O2 is a 1995 product using an R10000 processor of 150 MHz, a 32 KB 2-way associative L1 cache, and a 64 KB 2-way associative L2 cache. The cache line of the L2 is 64 bytes. Since the associativity of the L2 is low and its cache line is relatively long, it is difficult to do blocking with associativity and available registers. We implemented only the blocking with padding method to compare with blocking with software buffer and the copy reference. We scaled the bit-reversal methods from n = 16 to n = 21. Figure 7 shows the comparisons of CPE among the three programs for both float and double types on the SGI O2 machine. The measurements show that the padding method slightly reduced the execution time compared with the method of blocking with software buffer. The time reduction was up to 6%.
The reason for the small performance improvement is the extremely long memory latency (280 cycles) of the O2 machine. The reduction and saving of instruction cycles for data copies from padding became less significant because the memory latencies caused by the required cold misses in both methods dominated the execution.

¹The program was written in Fortran by Alan Karp.

Fig. 7. Execution comparisons on the SGI O2 workstation: the three curves represent the method of blocking with software buffer, the method of blocking with padding, and the ideal line reference.

6.5. Performance comparisons on the Sun Ultra-5. The Sun Ultra-5 is a 1998 product using an UltraSparc-IIi processor of 270 MHz, a 16 KB direct-mapped L1 cache, and a 256 KB 2-way associative L2 cache. The L1 cache line is 32 bytes, consisting of two 16-byte subblocks, and the L2 cache line is 64 bytes long. As on the SGI O2, the associativity of the L2 on the Ultra-5 is low and its cache line is relatively long, so it is difficult to do blocking with associativity and available registers. We implemented only the blocking with padding method to compare with blocking with software buffer and the copy reference. We scaled the bit-reversal methods from n = 16 to n = 23. Figure 8 shows the comparisons of cycles per element among the three programs for both float and double types on the Ultra-5. The memory latency of the Ultra-5 (76 cycles) is significantly lower than that of the O2. We observed a more significant performance improvement of the method of blocking with padding over that of blocking with software buffer. For example, using float type, the padding program is 14% faster than blocking with buffer for n = 20 or larger. An L2 cache line of the Ultra-5 holds 16 float-type elements (L = 16) or 8 double-type elements (L = 8). The larger the L, the higher the overhead of blocking with software buffer. This has been confirmed by our comparative experiments between the float and double types on the Ultra-5, shown in Figure 8.

6.6. Performance comparisons on the Sun E-450. The Sun E-450 is a 1998 4-processor SMP product. Each of the 4 nodes is an UltraSparc-II processor of 300 MHz, with a 16 KB direct-mapped L1 cache and a 2 MB 2-way associative L2 cache.
The L1 cache line is 32 bytes, consisting of two 16-byte subblocks, and the L2 cache line is 64 bytes long. Due to the limited associativity and the relatively long L2 cache line, we implemented only the blocking with padding method to compare with blocking with software buffer and the copy reference. We scaled the bit-reversal methods from n = 16 to n = 25. Figure 9 shows the comparisons of CPE among blocking with software buffer, blocking with padding, and the copy program on a single node of the E-450, each of which has both float type and

double type.

Fig. 8. Execution comparisons on the Sun Ultra-5 workstation: the three curves represent the method of blocking with software buffer, the method of blocking with padding, and the ideal line reference.

Fig. 9. Execution comparisons on the Sun E-450 SMP: the three curves represent the method of blocking with software buffer, the method of blocking with padding, and the ideal line reference.

The memory latency of the E-450 (73 cycles) is slightly lower than that of the Ultra-5 (76 cycles). On this machine, we observed a higher performance improvement of the method of blocking with padding over that of blocking with software buffer. For example, using float type, the padding program is 22% faster than blocking with buffer for n = 20 or larger. Our comparative experiments between the float and double types on the E-450 in Figure 9 also confirm that the larger the L, the higher the performance the padding method achieves.

6.7. Performance comparisons on the Pentium-II 400. The Pentium PC we used is a 1998 product using a Pentium-II 400 processor of 400 MHz, an 8 KB direct-mapped L1 cache, and a 256 KB 4-way associative L2 cache. The cache lines of both

L1 and L2 are 32 bytes. Since the L2 associativity is high, we are able to implement the method of blocking with associativity and available registers. The L2 cache line holds L = 8 elements of float type, and we need (L - K) × (L - K) = 16 registers to supplement the 4-way associative cache. An L2 cache line holds 4 double-type elements (L = 4); thus we do not need any registers to supplement it but simply make a 4 × 4 blocking. The TLB of the Pentium processor is a 4-way associative cache of 64 entries. We used our padding-for-TLB technique to avoid TLB misses. We implemented the blocking with padding method and the blocking with associativity and registers to compare with blocking with software buffer and the copy reference. We scaled the bit-reversal methods from n = 16 to n = 24. Figure 10 shows the comparisons of cycles per element among the four programs. As we expected, the paddings for both the cache and the TLB were highly effective, and the padding program performed the best. For example, using float type, the padding program is about 40% faster than blocking with buffer for n = 22 or larger. We also show that the method using available registers to supplement associativity is effective. Although it is not as good as the padding program, due to the increased instruction count for the additional data copies, it still achieved up to 12% execution time reduction over the blocking with software buffer program.

Fig. 10. Execution comparisons on the Pentium-II 400 PC: the curves represent the method of blocking with software buffer, the method of blocking with padding, the method of blocking with associativity and registers (breg-br), and the ideal line reference.
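The register-count arithmetic above can be captured in a couple of lines (our own sketch; only the (L - K)² rule is taken from the text):

```python
def supplemental_registers(line_elements, associativity):
    """Registers needed to supplement a K-way associative cache when
    blocking an L x L tile of bit-reversed elements: (L - K)^2."""
    k = min(associativity, line_elements)   # K cannot exceed the tile edge
    return (line_elements - k) ** 2

# Pentium-II 400 cases from the text:
print(supplemental_registers(8, 4))   # 16: float elements, L = 8, K = 4
print(supplemental_registers(4, 4))   # 0: double elements, 4 x 4 blocking
```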
As we expected, the execution time of the method using the 4-way associative L2 cache without the supplement of registers to form a 4 × 4 blocking was delayed mainly by the longer L2 cache hit time. This method still outperformed the method of blocking with a software buffer.

6.8. Performance comparisons on the Compaq XP1000. The Compaq XP1000 is a 1999 product using an Alpha 21264 processor of 500 MHz, a 64 KB 2-way associative L1 cache, and a 4 MB 2-way associative L2 cache. The cache lines of both L1 and L2 are 64 bytes long. As on the SGI and Sun machines, the associativity of the L2 on the XP1000 is low and its cache line is relatively long, so it is difficult to do blocking with associativity and available registers. We implemented only the

blocking with padding method to compare with blocking with software buffer and the copy reference. We scaled the bit-reversal methods from n = 16 to n = 25.

Fig. 11. Execution comparisons on the Compaq XP1000 workstation: the three curves represent the method of blocking with software buffer, the method of blocking with padding, and the ideal line reference.

Figure 11 shows the comparisons of CPE among the three programs for both float and double types on the XP1000 machine. As we expected, we achieved performance better than or comparable to that on the Sun machines. For example, using float type, for n = 24 or larger, the padding program is 30% faster than blocking with buffer, and 15% faster for double type.

7. Performance evaluation on SMP multiprocessors. We implemented the bit-reversal methods on two SMP multiprocessors: the Sun E-450 and the HP 9000 V2200. The parallel bit-reversal program on an SMP with M processors is described using POSIX thread primitives [1] as follows:

    bit_reversal(id)
        my_start = id * (N/M);
        my_end = (id + 1) * (N/M);
        for i = my_start, my_end - 1
            Y[i'] = X[i];      /* i' is the bit-reversal of i */

The bit-reversal operations are evenly distributed among the M processors.

7.1. Performance comparisons on the Sun E-450. The Sun E-450 is a 1998 4-processor SMP product. Each of the 4 nodes is an UltraSparc-II processor of 300 MHz, with a 16 KB direct-mapped L1 cache and a 2 MB 2-way associative L2 cache. The L1 cache line is 32 bytes, consisting of two 16-byte subblocks, and the L2 cache line is 64 bytes. Due to the limited associativity and the relatively long L2 cache line, we implemented only the blocking with padding algorithm to compare with blocking with software buffer and the copy reference. We scaled the bit-reversal algorithms from n = 16 to n = 24. Figure 12 shows the comparisons of CPE among blocking with software buffer, blocking with padding, and the copy program on the 4 nodes of the E-450, each of which has both float type and double type.
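The parallel decomposition shown at the start of this section can be sketched as a runnable program (Python threads stand in for the POSIX pthreads used in the experiments; reverse_bits is a hypothetical helper, not the authors' routine):

```python
from threading import Thread

def reverse_bits(i, n):
    """Reverse the low n bits of i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def parallel_bit_reversal(x, n_bits, m_workers):
    """Y[i'] = X[i], with the iteration space split evenly over M workers."""
    n = len(x)
    y = [0] * n
    def worker(wid):
        start, end = wid * (n // m_workers), (wid + 1) * (n // m_workers)
        for i in range(start, end):
            y[reverse_bits(i, n_bits)] = x[i]
    threads = [Thread(target=worker, args=(w,)) for w in range(m_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y

y = parallel_bit_reversal(list(range(16)), 4, 4)
print(y[1])   # 8: x[8] lands at the bit-reversed index of 8, which is 1
```

Each worker writes a disjoint set of source indices, so no locking is needed; the contention discussed below arises in the memory system, not in the program logic.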
On this machine, we observed some performance improvement

when n ≤ 18 from the algorithm of blocking with padding over that of blocking with software buffer. However, when n > 18 for double type or n > 19 for float type, each processor has to process a data set larger than its cache capacity. Multiple processors simultaneously accessing the memory through a shared data link cause contention that degrades the performance. Since the data to be accessed by the different processors are distributed in different locations, a crossbar interconnection network linking each processor to all the memory modules would significantly reduce the contention. The E-450 does have a 5 × 5 crossbar connecting the 2 pairs of processors, the 2 I/O ports, and the memory, but the communications between the 4 processors and the memory modules go through a single memory data link. Figure 13 shows the crossbar interconnections of the E-450 among the processors, the shared-memory modules, and the 2 I/O ports. Contention occurs on the memory data link when multiple processors request memory accesses simultaneously. We have observed severe performance degradation caused by this memory access contention. Figure 12 shows that the contention makes the execution time curves of the three programs jump sharply and merge together when n > 18 for double type and n > 19 for float type. In contrast, on a single processor of the E-450, accesses to the memory through the memory bus have no contention, so the algorithms scaled well.

Fig. 12. Execution comparisons on the Sun E-450 SMP with 4 processors: the three curves represent the algorithm of blocking with software buffer, the algorithm of blocking with padding, and the ideal line reference.

7.2. Performance comparisons on the HP 9000 V2200. The HP 9000 V2200 is a 1997 SMP product with up to 16 processors. We used 4 processors for the performance comparisons. Each node is an HP PA-8200 processor of 200 MHz with a 2 MB direct-mapped L1 data cache.
The cache line is 32 bytes. Due to the limited associativity, we implemented only the blocking with padding algorithm to compare with blocking with software buffer and the copy reference. The HP SMP has a crossbar interconnection network, the HyperPlane crossbar, connecting up to 8 pairs of processors to 8 memory modules. Multiple pairs of processors can access different memory modules simultaneously. Each pair of processors is

connected to the crossbar through an adapter called the HyperPlane Runway Agent. Figure 13 gives the interconnection structure of the HP 9000 V2200 with 4 processors.

Fig. 13. Architecture comparisons between the Sun E-450 SMP (left) and the HP 9000 V2200 SMP (right): the memory data link of the E-450 may become a bottleneck under simultaneous memory access requests from multiple processors; the HyperPlane crossbar connecting the memory modules and the processors of the HP 9000 V2200 can effectively reduce the contention.

In our experiments, the 4 processors are divided into 2 pairs, which are connected to 2 memory modules by a 2 × 2 HyperPlane crossbar. Each pair of processors may contend for its adapter, but the crossbar allows simultaneous data accesses among the memory modules. The negative performance effect of the data link contention observed on the Sun E-450 was significantly reduced on the HP SMP, which shows the effectiveness of the crossbar.

Fig. 14. Execution comparisons on the HP 9000 V2200: the three curves represent the algorithm of blocking with software buffer, the algorithm of blocking with padding, and the ideal line reference.

Figure 14 shows the comparative execution time curves for the float and double types on the V2200. The execution times of the 3 programs are quite stable and independent of the size of n. The padding programs of both the float type and the double type outperformed

the blocking methods with buffer by up to 40% and 18%, respectively. Their execution curves almost merge with the reference curve.

Table 2
Summary of the blocking methods and their impact on three aspects of performance (cross interference, instruction count, and memory space) and on program complexity. The performance of the blocking-only method is the baseline for the comparisons. Note: + means that the method quantitatively increases the factor and hurts performance; blank means it has no impact. The program complexity is subjective and compared with the blocking-only method, with 1 being slightly more complex and 2 moderately more complex.

Method                          Cross interf.  Instr. count  Memory space  Complexity  Comments
Blocking only                                                                          limited by data sizes
Blocking with software buffer   +              +             +             1           system independent
Blocking with register buffer                                              1           limited by the number of available registers
Blocking with associativity                                                2           works well on high-associativity caches
  and registers
Blocking with padding                                        +             1           works well on all systems
TLB blocking                                                                           a TLB-size-dependent outer loop, effective for fully associative TLBs
TLB padding                                                  +             1           paddings using L pages, effective for set-associative TLBs

8. Conclusion. We have examined and developed cache-optimal methods for bit-reversal data reorderings. These methods have been tested on 5 representative uniprocessor workstations, products of 1995 to 1999, to show their effectiveness. The different methods have their own merits and limits. The blocking-only method is limited by data sizes. Although the blocking-with-software-buffer method is architecture independent, it increases cross interference and instruction count and needs additional memory space. The blocking-with-a-register-buffer method is fast but is limited by the number of available registers. Blocking with associativity and with registers works well on high-associativity caches.
We have shown that the methods of blocking with padding, blocking for the TLB, and padding for the TLB can effectively exploit cache locality and are almost independent of the hardware. Thus, they can be widely used on many uniprocessor workstations and SMP multiprocessors. We summarize the different techniques and their merits and limits in Table 2, which gives a guideline for application users to choose a technique based on the size of the problem and the machines available. The methods have also been tested on two commercial SMP multiprocessors. By exploiting the cache locality of each processor, we have effectively eliminated the conflict misses, so that accesses to the shared memory, and the resulting contention, are minimized. However, another potential bottleneck on SMPs is the data access contention at the shared memory. We show that crossbar interconnections between processors and memory modules play an important role in parallel bit-reversal data reorderings.