Cache-Optimal Methods for Bit-Reversals


Proceedings of the ACM/IEEE Supercomputing Conference, November 1999, Portland, Oregon, U.S.A.

Zhao Zhang and Xiaodong Zhang
Department of Computer Science
College of William and Mary
Williamsburg, VA

Abstract

Bit-reversals are representative and important data reordering operations in many scientific computations. Performance degradation is mainly caused by cache conflict misses. Bit-reversals are often repeatedly used as fundamental subroutines for many scientific programs. Thus, in order to gain the best performance, cache-optimal methods and their implementations should be carefully and precisely done at the programming level. This type of performance programming for some special programs, such as the data reorderings, may significantly outperform an optimization from an automatic tool, such as a compiler. In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations. We evaluate the merits and limits of each technique and their application and architecture-dependent conditions for developing cache-optimal methods. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and TLB cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software oriented methods, and believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and SMP multiprocessors.

1 Introduction

With the rapid development of RISC and VLSI technology, the speed of processors has increased dramatically in the past decade. Processor clock rates doubled every 2-3 years. Nevertheless, the speed of memories has increased at a much slower pace.
Therefore we have seen, and will continue to see, an increasing gap in speed between processor and memory, and this gap makes the performance of application programs on both uniprocessor and multiprocessor systems rely more and more on effective usage of caches.

Bit-reversals are important data reordering operations in many scientific computations. Performance degradation is mainly caused by cache conflict misses. Bit-reversals are often repeatedly used as fundamental subroutines for many scientific programs. Thus, in order to gain the best performance, cache-optimal methods and their implementations should be carefully and precisely done at the programming level. This type of performance programming for some special programs, such as bit-reversals, may significantly outperform an optimization from an automatic tool, such as a compiler.

(This work is supported in part by the National Science Foundation under grants CCR and CCR, by the Air Force Office of Scientific Research under grant AFOSR, and by Sun Microsystems under grant EDUE-NAFO.)

A standard bit-reversal program is described as follows:

    for i = 1, N
        Y[i'] = X[i]

The values of array X in their sequential positions i are copied to array Y in their bit-reversal positions i', for i = 1, ..., N, where N = 2^n. The above program says that X is a bit-reversal reordering of Y. The indices i and i' of X and Y are represented by a sequence of binary digits. Position i and its bit-reversal i' are defined in [5] as:

$$ i = \sum_{j=0}^{n-1} a_j 2^j \quad \text{and} \quad i' = \sum_{j=0}^{n-1} a_j 2^{n-1-j}, $$

where a_j is either 0 or 1. For example, the 5-bit reversal of i = 1 (binary 00001) is i' = 16 (binary 10000).

The bit-reversal operations have the following unique characteristics. First, each element in an array is only used (read or written) once for its copy operation. Thus, the reorderings have only spatial locality but no temporal locality for elements. Second, the loops follow certain sequences with high spatial locality.

Bit-reversals are highly sensitive to problem sizes, cache sizes, and cache line sizes. Since the data array sizes are a power of two, multiple elements stored in different memory locations could map to the same cache line, causing severe cache conflict misses and cache thrashing. The reason is simple. Most commercial computers use direct-mapped or set-associative caches where the mapping functions of cache sizes are also related to powers of two.

We use an identical unit, called an element, to represent the sizes of data arrays, caches and others such as buffers and blocking. One element may represent a 4-byte integer, a 4-byte floating point number, or an 8-byte double floating point number.
Because the sizes of caches and cache lines are always a multiple of an element in practice, this identical unit for all the sizes is practically meaningful for both architects and application programmers, and makes the discussions straightforward.

Here are the algorithmic and architectural parameters we will use to describe cache-optimal methods of bit-reversals:

C: data cache size, which could be further defined as C_L1 and C_L2 for the data cache sizes of L1 and L2 respectively.
L: the size of a cache line, which could be further defined as L_L1 and L_L2 for the cache lines of L1 and L2 respectively.
K: cache associativity, which could be further defined as K_L1 and K_L2 for the cache associativity of L1 and L2 respectively.
K_TLB: TLB cache associativity.
T_s: number of entries in the TLB cache.
N: the data size for the bit-reversal vector, N = 2^n, where n is the number of bits used in the vector index.
B_cache: blocking size of a B × B submatrix for cache.

B_TLB: blocking size for TLB.
P_s: a memory page size.

Figure 1: Memory layout of a blocked bit-reversal, where B = B_cache. The figure shows the distribution of B segments in a vector of N elements for bit-reversals, together with the equivalent 2-D array layout; in the memory layout, the distance between each pair of segments is (N − B)/L cache lines.

In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations. We evaluate the merits and limits of each technique and their application and architecture-dependent conditions for developing cache-optimal methods. Although our methods are developed for out-of-place bit-reversals, they are also applicable to in-place bit-reversals where X and Y are the same array. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and TLB cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software oriented methods, and believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and SMP multiprocessors.

2 Blocking for bit-reversals

The blocked memory access patterns of bit-reversals can be easily viewed when we convert the one dimensional vector to a 2-D equivalent array in Figure 1. All the reordering elements and elements in other groups will be allocated along the column in the 2-D equivalent array, forming a block. In general, for a bit-reversal vector of N = 2^n elements, the block size B_cache is a power of 2, denoted by B_cache = 2^b. Each of the B_cache elements in X has the address format fg, where g has b bits and f has n − b bits. Each of the corresponding B_cache elements in Y has the address format g'f', where g' and f' are the bit-reversals of g and f. Therefore, the distance between two nearest elements in the same group in Y is 2^(n−b) = N/B_cache.
Choosing the cache line size as the minimum blocking size (B_cache = L), we can easily calculate the maximum N for the bit-reversal vector based on different data cache sizes. For example, for a large cache of 2 MBytes, the blocking technique is effective up to an 18-bit-reversal reordering, which represents 262,144 data elements, where each element is an 8-byte double type and the cache line is 32 bytes. In practice, the data size of bit-reversals could easily be larger than n = 20 [5].
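The blocked access pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, indices are 0-based (the paper's pseudocode runs from 1 to N), and the tile size is B = 2^b with 2b ≤ n. Each tile reads B runs of B consecutive elements of X (B cache lines when B = L) and writes B runs of B consecutive elements of Y.

```python
def bit_reverse(i, nbits):
    """Return the nbits-wide bit-reversal of index i."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def naive_bit_reversal(x, nbits):
    """Standard program: Y[i'] = X[i] for every index i."""
    y = [None] * len(x)
    for i in range(len(x)):
        y[bit_reverse(i, nbits)] = x[i]
    return y


def blocked_bit_reversal(x, nbits, b):
    """Same reordering, visited tile by tile (B x B tiles, B = 2**b)."""
    if 2 * b > nbits:
        raise ValueError("tile does not fit: need 2*b <= nbits")
    size = 1 << b
    mid_bits = nbits - 2 * b
    rev_small = [bit_reverse(g, b) for g in range(size)]
    y = [None] * len(x)
    for m in range(1 << mid_bits):          # middle bits select the tile
        m_rev = bit_reverse(m, mid_bits)
        for h in range(size):               # top b bits: which run of X
            for g in range(size):           # low b bits: position in the run
                src = (h << (nbits - b)) | (m << b) | g
                # Reversal swaps the roles of the top and low b-bit fields.
                dst = (rev_small[g] << (nbits - b)) | (m_rev << b) | rev_small[h]
                y[dst] = x[src]
    return y
```

Within one tile the B runs of X (and of Y) start N/B elements apart, which is the stride that causes the conflict misses discussed in the text when N/B is a multiple of the cache size.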

3 Blocking with buffers

As we have shown, the effectiveness of blocking is limited by the size of the data arrays. In theory, the smallest blocking size could be 2 × 2. A cache line in a modern processor usually holds more than 2 elements, i.e., is larger than 16 bytes. If we choose a 2 × 2 block, the data in a cache line will not be fully used before their replacement, causing more cache misses in the reorderings. The bit-reversal reordering demands large cache space to make blocking effective. In order to effectively use limited cache space, Gatlin and Carter [4] present an effective method using an additional buffer to first hold the conflict-missed elements of a block in one array temporarily, and then copy the block to their reordered positions in the other array.

3.1 Blocking with a software buffer and its limits

Because this buffer is defined in a reordering program, we call it a software buffer. This buffer shares the allocation space with the data arrays X and Y in the cache. There are two major limits in this approach. First, the buffer itself may interfere with arrays X and Y, causing additional access conflicts. This interference is certain when the sizes of X and Y are larger than the size of the cache, C. Each cache block or set is mapped from arrays X and Y more than once. No matter where the buffer is located in the cache, it will interfere with them. The larger the buffer size, the more interference will occur. The second limit is the additional copy overhead involved in moving data from the array X to the buffer and then in moving them to the target array in their reordered positions. This overhead exactly doubles the instruction cycles for data copying. The data copy through a buffer is a worthy investment if the number of cycles lost from cache misses is much higher than the additional CPU cycles for the data copy.

To overcome the two limits, we propose several alternatives to eliminate cache interference caused by the software buffer and to reduce or eliminate the data copy time.
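To make the copy overhead concrete, the following Python sketch stages each tile in a small contiguous buffer before writing it to Y. It is written in the spirit of the software-buffer approach described above, not taken from [4]; the function names and the copy counter are assumptions added for illustration, and indices are 0-based.

```python
def bit_reverse(i, nbits):
    """Return the nbits-wide bit-reversal of index i."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def buffered_bit_reversal(x, nbits, b):
    """Blocked bit-reversal through a B x B software buffer (B = 2**b).

    Returns the reordered list and the number of element copies performed.
    """
    if 2 * b > nbits:
        raise ValueError("tile does not fit: need 2*b <= nbits")
    size = 1 << b
    mid_bits = nbits - 2 * b
    rev_small = [bit_reverse(g, b) for g in range(size)]
    y = [None] * len(x)
    buf = [None] * (size * size)
    copies = 0
    for m in range(1 << mid_bits):
        m_rev = bit_reverse(m, mid_bits)
        # Pass 1: read X one run at a time, scattering into the buffer.
        for h in range(size):
            base = (h << (nbits - b)) | (m << b)
            for g in range(size):
                buf[rev_small[g] * size + rev_small[h]] = x[base + g]
                copies += 1
        # Pass 2: write each buffer row to one contiguous run of Y.
        for r in range(size):
            base = (r << (nbits - b)) | (m_rev << b)
            for c in range(size):
                y[base + c] = buf[r * size + c]
                copies += 1
    return y, copies
```

Every element is copied twice (X to buffer, buffer to Y), so the counter comes out at 2N, which is the doubling of copy instructions that the text identifies as the second limit.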
3.2 Cache structure dependent blocking

Blocking based on set associativity

The cache associativity, K, is an important factor to consider for blocking. If K ≥ L, an L × L or a K × K blocking method for bit-reversals would effectively avoid conflict misses. Because the hit time is a less sensitive performance factor than the cache misses in the L2 cache, a higher associativity of the L2 cache is more effective than that of L1. If a cache line holds 4 double floating point elements (L = 4 elements of 32 bytes in Pentium processors), a 4 × 4 blocking method without any data buffer is able to fully use the cache associativity. The blocking method would gain more benefit from caches of associativity higher than 4, such as a design in [11]. What would we do if the associativity is not sufficiently high for the blocking, or K < L? One solution is to make a K × L rectangular blocking. Unfortunately bit-reversals require an L × L blocking.

Supplement with registers

We may also consider using the available registers to supplement a low associativity cache. The number of registers available to a user program is limited. Normally, a uniprocessor provides up to 16 registers to users. For example, for a 2-way associative cache, we need 8 registers to buffer two additional cache lines so that we could effectively make a 4 × 4 blocking as if we ran the program on a 4-way associative cache. We develop a more efficient blocking method for bit-reversals, which requires only (L − K) × (L − K) registers. The operation sequence of this method is in three steps:

(1) L − K cache lines of X and K cache lines of Y are accessed: (L − K) × K elements of X are copied to Y in their reordered positions, and the remaining (L − K) × (L − K) elements are copied to a buffer consisting of (L − K) × (L − K) registers.
(2) The remaining K lines of X are brought to the cache set, and K × K of their elements are copied to Y in their reordered positions.
(3) Finally, the (L − K) × (L − K) elements in the register buffer and the remaining (L − K) × K elements are copied to Y in their reordered positions.

A cache set will be used more than twice if K < L/2. Besides the advantage of no access conflicts between the register buffer and the arrays X and Y, there is another advantage of using registers to buffer the data in a load/store processor. A data copy through the registers from X to Y is equivalent to the two-step process of load and store, and thus there will be no additional overhead. We will show our experimental performance in Section 6.

Using registers as the buffer

If the cache is direct-mapped, we have to fully rely on a buffer for blocking. Here we discuss some ways to use registers to serve as the buffer in order to eliminate the potential cache conflicts and eliminate extra data copying by taking advantage of the load/store operations. The number of registers for a buffer of L × L elements is determined by the number of elements a cache line can hold. The length of a cache line of the L1 cache in some processors, such as Sun SPARC Micro I and II, is L = 2 (16 bytes), which holds only two floating point elements. The blocking size could be as small as 2 × 2 using a buffer of 4 registers.
The cache line length of the L1 cache in many advanced workstations is 32 bytes, such as the Sun Ultra and Intel Pentium processors, each of which holds 4 double floating point elements. In this case, we need a buffer of 4 × 4 = 16 registers for a blocking. This would be difficult due to the limited number of available registers. We have two solutions for this. First, we only use the number of registers available to form a smaller buffer than it should be, which will not make each cache line fully used and will cause additional cache misses. Our experiments show that this blocking method of using a buffer with an insufficient number of registers still achieves a reasonable performance improvement and outperforms the implementation using a software buffer. The second method is to further reduce the size of the buffer, which reduces the required number of registers by using our (L − K) × (L − K) blocking method.

L1 cache versus L2 cache

The main objective of building two-level caches is to make the L1 cache small enough to catch up to the cycle time of the fast CPU, and to make the L2 cache large enough to capture as many accesses as possible [6]. In practice, the data size of a bit-reversal is larger than the size of the L2 cache. L1 and L2 caches offer different sizes of the cache line, L, and the associativity, K. Both of the following alternatives are effective for blocking. (1) Taking advantage of a short cache line and fast hit time of the L1 cache, we could effectively use limited registers as the buffer, and make a small L × L blocking effective. (2) Taking advantage of the high associativity of the L2 cache, we could effectively use both associativity and supplemental registers as the buffer, and make a large L × L blocking effective.
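The register requirements quoted in this section follow from simple counting, which the hypothetical helper below works through. It is a bookkeeping sketch only (the function and key names are assumptions): it compares buffering the L − K extra cache lines whole against the (L − K) × (L − K) scheme, and checks that the three steps together place all L × L elements of a block into Y.

```python
def register_plan(line_elems, assoc):
    """Count elements moved in each step of the three-step register scheme.

    line_elems is L (elements per cache line), assoc is K (cache associativity).
    """
    if not 0 < assoc <= line_elems:
        raise ValueError("expected 0 < K <= L")
    rest = line_elems - assoc
    step1_direct = rest * assoc     # first L-K lines of X, copied straight to Y
    step1_to_regs = rest * rest     # same lines, parked in registers
    step2_direct = assoc * assoc    # remaining K lines of X, first part
    step3_direct = rest * assoc     # remaining K lines of X, second part
    return {
        "registers": step1_to_regs,
        "registers_if_buffering_whole_lines": rest * line_elems,
        "elements_written_to_y": step1_direct + step2_direct
                                 + step1_to_regs + step3_direct,
    }
```

For L = 4 and K = 2 this gives 4 registers instead of 8, and for L = 8 and K = 4 it gives 16; in both cases the number of elements written to Y equals L × L.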

Figure 2: Data layout of a bit-reversal is modified by padding, where B = B_cache = L. Before padding, the segments of the X and Y arrays are N/B elements long; after padding, each segment except the last is N/B + L elements long.

4 Blocking with padding

Padding is a technique that modifies the data layout of a program so that the conflict misses are reduced or eliminated. The data layout modification can be done at run-time by system software [2, 1], or at compile-time by compiler optimization [8]. Sharing the same objective of compiler optimization to change the addresses of potentially conflicting cache blocks in the reorderings, we insert padding variables inside the data array. For example, in the FFT computation, paddings can be combined with the copy operations in the last step of butterfly without additional cost. Since the data arrays of bit-reversals form a vector whose size is a power of 2, the padding is highly regular, inserting L elements or a cache line space starting at the vector positions of N/L, 2N/L, ..., and (L − 1)N/L. Using L elements or a section of data of a cache line to separate the vector at these points can completely eliminate the cache conflicts caused by the reordering. Again during execution, the reordering data copies are directly conducted between the arrays X and Y without going through a data buffer. Another advantage is that the number of padding elements needed is only L × L, or L cache lines, and is independent of the data array size, N. Compared with the data size of bit-reversals, the number of padding elements is insignificant. Figure 2 shows how the data layout of a bit-reversal vector is modified by padding so that conflict misses are eliminated.

Compiler optimization targets a large range of application programs, and automatically inserts padding variables in the programs for users. An optimal padding is application program dependent. For example, padding positions differ between applications in order to effectively change addresses of conflicting cache blocks.
Based on the unique nature of the data reordering, the optimal padding unit used by our methods for bit-reversals is a cache line with L elements. In contrast, a compiler optimization normally uses an element as the basic padding unit. How many padding units to use and where to pad in the data arrays are determined by some approximation models which may not precisely fit the unique memory access patterns of each case. In addition, applying the padding technique to bit-reversals embedded in applications would not increase complexity in the entire computation. For example, when a padded bit-reversal is performed in an FFT computation, it has little effect on the neighboring butterfly operations.
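The effect of the padding rule can be checked with a toy model of a direct-mapped cache. The sketch below is an illustration under simplifying assumptions, not the authors' code: the function names are hypothetical, the array is assumed to start at a line-aligned address 0, and only one array is modeled. It maps a logical index to its padded position (L extra elements inserted at N/L, 2N/L, ...) and lists the cache sets touched by the L lines of one block.

```python
def padded_index(i, n_elems, line_elems):
    """Position of logical element i after inserting line_elems pad elements
    at each multiple of n_elems // line_elems."""
    segment = n_elems // line_elems
    return i + line_elems * (i // segment)


def cache_set(addr, line_elems, num_sets):
    """Set index of an element address in a direct-mapped cache."""
    return (addr // line_elems) % num_sets


def sets_touched(column, n_elems, line_elems, num_sets, padded):
    """Cache sets used by the L lines of one block (one line per segment)."""
    segment = n_elems // line_elems
    sets = []
    for row in range(line_elems):
        i = row * segment + column
        addr = padded_index(i, n_elems, line_elems) if padded else i
        sets.append(cache_set(addr, line_elems, num_sets))
    return sets
```

With N = 2^16, L = 4 and a cache of 256 sets, the four lines of a block all fall into set 0 without padding and into sets 0, 1, 2, 3 with padding, because each segment boundary shifts the following data by exactly one cache line.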

5 Blocking and padding for TLB

The TLB (Translation-Lookaside Buffer) is a special cache that stores the most recently used virtual-physical page translations for memory accesses. The TLB is a small and usually fully associative cache. Each entry points to a memory page of 4 KBytes to 64 KBytes. The page size is normally fixed at the level of operating systems, and cannot be changed by user programs. A TLB cache miss will make the system retrieve the missing translation from the page table in memory, and then select a TLB entry to replace. When the data to be accessed in our blocking method is larger than the amount of data of all the memory pages that the TLB can hold, we will have TLB thrashing.

5.1 Blocking for a fully associative TLB

Before giving a general model to show how the blocking size is affected by the TLB size, let us go through an example to show that a moderate N for bit-reversals would easily lead to TLB cache thrashing. The 64 pages in the TLB of the Sun UltraSparc-II processor hold 64 × 1024 = 65,536 elements, which represents a 16-bit-reversal of N = 2^16. Since we have two vectors X and Y, the TLB can hold a 15-bit-reversal of N = 2^15 elements. This is also consistent with our experiments on this machine, where execution time per element was a constant until n = 15, but sharply increased at n = 16 bit-reversals because of the TLB misses.

In our cache-optimal methods, we include an outer loop to form a blocking for TLB, whose size is denoted as B_TLB. When N/L ≥ P_s, the blocking size B_TLB for bit-reversals is bounded by T_s, where P_s is the page size in elements and T_s is the number of entries of the TLB. On the other hand, B_TLB should be chosen as large as possible to make effective use of the page space.

5.2 Padding for a set-associative TLB

Some processors' TLBs are not fully associative, but set-associative. For example, the TLB in the Pentium II 400 processor is 4-way associative (K_TLB = 4).
A simple blocking based on the number of TLB entries is not cache-optimal, because multiple pages within a TLB-size-based blocking may map to the same TLB cache set and cause TLB cache conflict misses. If the size N of a bit-reversal vector is a multiple of T_s × P_s, where T_s is the number of TLB entries and P_s is the page size in elements, and if K_TLB < B_TLB, then TLB cache conflict misses will occur. This could easily happen in practice. For example, on the Pentium II 400, N is equal to 128K elements (one element = 8 bytes) for a 17-bit-reversal, and this N is two times the value T_s × P_s of the machine, where T_s = 64 and P_s = 1024 elements. In a way similar to the technique of padding for the data cache, we insert a page of elements or a page of space starting at the vector positions of N/L, 2N/L, ..., and (L − 1)N/L to eliminate the TLB cache conflict misses. Figure 3 gives an example of the padding for TLB, where the TLB is a direct-mapped cache of 8 entries, the blocking size is B_TLB = 4, and the number of elements of a row is a multiple of 8 pages. Before padding, each blocking row is mapped to the same cache line of the TLB. After padding, these rows are mapped to different cache lines of the TLB. Combining padding for the data cache and padding for the TLB cache, we are inserting L + P_s elements, or a page plus a cache line space, in L locations separated by a distance of N/L elements.
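The TLB arithmetic in this section can be reproduced with two small helpers. Both are hypothetical illustrations, not part of the paper's code: the first computes the largest bit-reversal whose arrays fit within the TLB reach, and the second models the TLB sets hit by the first page of each blocking row, with and without a page of padding between rows.

```python
def max_reversal_bits(tlb_entries, page_elems, num_arrays=2):
    """Largest n such that num_arrays vectors of 2**n elements fit in the
    memory covered by the TLB (tlb_entries pages of page_elems elements)."""
    reach_per_array = (tlb_entries * page_elems) // num_arrays
    return reach_per_array.bit_length() - 1


def tlb_sets_for_rows(rows, row_pages, num_sets, pad_pages):
    """TLB set index of the first page of each row, when consecutive rows
    start row_pages + pad_pages pages apart."""
    return [(r * (row_pages + pad_pages)) % num_sets for r in range(rows)]
```

With 64 entries and 1024-element pages, `max_reversal_bits` gives 15 for two arrays, matching the UltraSparc-II example. For a direct-mapped TLB with 8 sets and rows that span a multiple of 8 pages, `tlb_sets_for_rows` returns the same set for all four rows without padding and four distinct sets once one page is inserted per row, which mirrors the situation drawn in Figure 3.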

Figure 3: Padding for TLB: the data layout is modified by inserting a page space (P_s) at multiple locations, where B_TLB = 4, K_TLB = 1, T_s = 8. The panels show the TLB and the 2-D memory layout of the array in a bit-reversal before and after padding.

In practice, we selected more than N/L points to insert the padding variables to eliminate both data cache and TLB conflict misses. This approach could effectively merge two nested paddings (one for the data cache and the other for the TLB) into a single one. An optimal number of inserting points can be easily determined experimentally based on the size of the TLB cache.

6 Experimental Results and Performance Evaluation

We have implemented and tested all the bit-reversal methods discussed in the previous sections on an SGI O2 workstation, a Sun Ultra-5 workstation, a Sun SMP server E-450, a Pentium PC, and a Compaq XP1000 workstation. We used lmbench [7] to measure the latencies of memory hierarchies at different levels on each machine. The architectural parameters of the 5 machines are listed in Table 1.

We focus the performance evaluation on methods and implementations of bit-reversals in this paper. We compared all our methods with the method of blocking with a software buffer, which was recently published in [4]. We refer to this method as blocking with buffer for bit-reversals. Two of our methods are experimentally compared: breg-br, blocking with associativity and registers for bit-reversals, and blocking with padding for bit-reversals. We have also applied the blocking or padding technique for the TLB in these two methods based on the TLB associativity. All the programs use a standard subroutine to calculate the bit-reversal value for a given address.

The execution times were collected by gettimeofday(), a standard Unix timing function. The reported time unit is cycles per element (CPE):

$$ \mathrm{CPE} = \frac{\text{execution time} \times \text{clock rate}}{N} $$

where execution time is the measured time in seconds, clock rate is the CPU speed (cycles/second) of the machine where the program is run, and N is the number of elements of the bit-reversal program.
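As a worked instance of the CPE definition, the helper below converts a measured wall-clock time into cycles per element. The function name and the sample numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the paper.

```python
def cycles_per_element(exec_seconds, clock_hz, n_elems):
    """CPE = execution time (s) * clock rate (cycles/s) / number of elements."""
    if n_elems <= 0:
        raise ValueError("n_elems must be positive")
    return exec_seconds * clock_hz / n_elems
```

For instance, a run that takes 0.262144 seconds on a 400 MHz processor for N = 2^20 elements corresponds to about 100 cycles per element.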
Besides the different methods of bit-reversals, we also measured the execution time of a program copying elements between X and Y. This program has the same number of data copying operations with a continuous memory access pattern. We use the execution time of this program to provide a baseline reference for bit-reversal programs and show how close a bit-reversal execution is to its ideal time. We denote this reference program as base. Each method is further divided into a float data type, using 4 bytes to represent an element, and a double type, using 8 bytes to represent an element. The data type divisions will show the performance impact of the cache line length. For all experiments on different machines, the bit-reversal programs first call a routine to flush the cache to make sure that all the data are allocated only in the memory. All experiments were repeated multiple times.

Table 1: Architectural parameters of the 5 workstations we have used for the experiments. The columns are the SGI O2 (R10000), Sun Ultra 5 (UltraSparc-IIi), Sun E-450 (UltraSparc II), Pentium PC (Pentium II 400), and Compaq XP1000 (Alpha). The rows are the clock rate (MHz), L1 cache size (KBytes), L1 block size (bytes), L1 associativity, L1 hit time (cycles), L2 cache size (KBytes), L2 block size (bytes), L2 associativity, L2 hit time (cycles), TLB size (entries), TLB associativity, and memory latency (cycles). All specifications on L1 cache refer to the L1 data cache, and all L2s are uniform. Each L2 cache block on the UltraSPARC-IIi consists of 2 16-byte sub-blocks. The hit times of L1, L2 and the main memory are measured by lmbench [7], and their units are converted from nanoseconds (ns) to CPU cycles.

6.1 Effects of TLB and virtual memory

Before measuring and comparing the performance of different bit-reversal methods, we experimentally evaluated the effects of TLB and virtual memory to confirm our assumptions and analyses.

Selection of TLB blocking size

The TLB blocking size is a sensitive performance parameter to be selected, which is determined by the size of the TLB if it is fully associative. We executed the blocking with padding program for bit-reversals with n = 20 on a single node of the Sun E-450 by changing the blocking sizes for TLB from 8 to 128.
The TLB of the E-450 is a fully associative cache with 64 entries. Figure 4 shows the measured CPE of the program for different blocking sizes on the node. Our experimental results are consistent with our analyses in the previous section. When the blocking size for TLB was 64, the execution time curve increased sharply. This is because arrays X and Y together demanded more than 64 pages and caused TLB thrashing.

Figure 4: Changing the TLB blocking sizes on a single node of the Sun E-450 (double type; horizontal axis: block size of TLB): when the blocking size for TLB was larger than 32, the execution time curve increased sharply.

Virtual memory versus physical memory addresses

All our analyses are based on cache mappings between memory pages in the virtual address space and cache blocks in the physical memory address space. This assumes that contiguous memory pages will be contiguously mapped to the cache. This assumption is guaranteed for virtual-address caches [3]. However, all our experiments have been performed on machines with physical address L2 caches. Since the virtual-physical translations for L2 caches are handled by operating systems, our assumptions may not always be accurate. In order to show that many operating systems attempt to map contiguous virtual pages to cache blocks contiguously, so that our virtual-address-based study is practically meaningful and effective, we conducted a simulation using SimOS [9] and measurements on different workstations to observe how an operating system makes translations from virtual memory addresses to their physical addresses.

SimOS simulates the complete hardware of SGI machines and runs the IRIX 5.3 operating system in the simulation. We executed a blocking-only program of bit-reversals using the cache line L as the blocking size. The bit-reversal vector size was changed from n = 15 to n = 22. We measured the miss rates on array X. The cache size was set to 2 MBytes, holding two double type arrays up to n = 18 in the virtual memory space. Figure 5 gives consistent results from the SimOS simulation: when n > 18, the miss rate on array X sharply increased to 100% from 12.5%. From this experiment, we have observed that virtual-physical translations from the IRIX 5.3 operating system are quite consistent with our assumption of contiguous allocations.

Figure 5: Using SimOS (IRIX 5.3) to observe the miss rates on array X by changing the size of the bit-reversal arrays of a blocking-only program: when n > 18, the miss rate sharply increased to 100%.

We have also run similar experiments on different targeted workstations with different operating systems, such as Linux and Solaris, to measure the changes of execution times when the data size is changed. Our measurements are also consistent with the SimOS results, and indicate that the larger the data arrays to be used, the more likely an operating system will allocate the pages contiguously. Because our study targets large data sets, our analyses based on the virtual memory space are reasonably accurate.

6.2 Performance comparisons on the SGI O2

The SGI O2 is a 1995 product using an R10000 processor of 150 MHz, a 32 KB 2-way associative L1 cache, and a 64 KB 2-way associative L2 cache. The cache line of L2 is 64 bytes. Since the associativity of L2 is low, and the cache line of L2 is relatively long, it is difficult to do blocking with associativity and available registers. We only implemented the blocking with padding method to compare with blocking with software buffer and the base reference. We scaled the bit-reversal methods from n = 16 to n = 21. Figure 6 shows the comparisons of CPE among the three programs of both float type and double type on the SGI O2 machine.

The measurements show that the padding method slightly reduced the execution time compared with the method of blocking with software buffer. The time reduction was up to 6%. The reason for the small performance improvement is the extremely long memory latency (208 cycles) of the O2 machine. The reduction and saving of instruction cycles for data copies from padding became less significant because memory latencies caused by the required cold misses in both methods were dominant in execution.

Figure 6: Execution comparisons on the SGI O2 workstation (panels: float and double) among the method of blocking with software buffer, the method of blocking with padding, and the ideal baseline reference.

6.3 Performance comparisons on the Sun Ultra-5

The Sun Ultra-5 is a 1998 product using an UltraSparc-IIi processor of 275 MHz, a 16 KB direct-mapped L1 cache, and a 256 KB 2-way associative L2 cache. The cache line of L1 is 32 bytes, consisting of two 16-byte sub-blocks, and the cache line of L2 is 64 bytes long. Similar to the SGI O2, the associativity of L2 on the Ultra-5 is low, and the cache line of L2 is relatively long, so it is difficult to do blocking with associativity and available registers. We only implemented the blocking with padding method to compare with blocking with software buffer and the base reference. We scaled the bit-reversal methods from n = 16 to n = 23. Figure 7 shows the comparisons of CPE among the three programs of both float type and double type on the Ultra-5.

The memory latency of the Ultra-5 (76 cycles) is significantly lower than that of the O2. We observed a more significant performance improvement from the method of blocking with padding over that of blocking with software buffer. For example, using float type, the padding program is 14% faster than that of blocking with buffer for n = 20 or larger. An L2 cache line of the Ultra-5 holds 16 float type elements (L = 16), or 8 double type elements (L = 8). The larger the L, the higher the overhead of blocking with software buffer. This has been confirmed by our comparative experiments between the float and double types on the Ultra-5 shown in Figure 7.

6.4 Performance comparisons on the Sun E-450

The Sun E-450 is a 4-processor SMP product. Each of the 4 nodes is an UltraSparc-II processor of 300 MHz with a 16 KB direct-mapped L1 cache and a 2 MB 2-way associative L2 cache. The cache line of L1 is 32 bytes, consisting of two 16-byte sub-blocks, and the cache line of L2 is 64 bytes long.
Due to the limited associativity and a relatively long L2 cache line, we only implemented the blocking with padding method to compare with blocking with software buffer and the baseline reference. We scaled the bit-reversal methods from n = 16 to n = 25. Figure 8 shows the comparisons among blocking with software buffer, blocking with padding, and the baseline program on a single node of the E-450, each for both float and double types. The memory latency of the E-450 (73 cycles) is slightly lower than that of the Ultra-5. On this machine, we observed a higher performance improvement from blocking with padding over blocking with software buffer. For example, using float type, the padding program is 22% faster than blocking with buffer for n = 20 or larger. Our comparative experiments between the float and double types on the E-450 in Figure 8 also confirm that the larger the L, the higher the performance the padding method achieves.

Figure 7: Execution comparisons on the Sun Ultra-5 workstation (panels: float and double), comparing blocking with software buffer, blocking with padding, and the ideal baseline reference.

Figure 8: Execution comparisons on the Sun E-450 SMP (panels: float and double), comparing blocking with software buffer, blocking with padding, and the ideal baseline reference.

6.5 Performance comparisons on the Pentium II 400

The Pentium PC we used is a 1998 product using a Pentium II 400 processor of 400 MHz, an 8 KB direct-mapped L1 cache, and a 256 KB 4-way associative L2 cache. The cache lines of both L1 and L2 are 32 bytes. Since the L2 associativity is high, we are able to implement the method of blocking with associativity and available registers. An L2 cache line holds L = 8 elements of float type, so we need (L - K) × (L - K) = 16 registers to supplement the 4-way associative cache (K = 4). An L2 cache line holds 4 elements of double type (L = 4). Thus, we do not need any registers as a supplement, and simply use a 4 × 4 blocking. The TLB of the Pentium processor is a 4-way associative cache of 64 entries. We used our padding for the TLB technique to avoid TLB misses.

We implemented the blocking with padding method and the blocking with associativity and registers method to compare with blocking with software buffer and the baseline reference. We scaled the bit-reversal methods from n = 16 to n = 24. Figure 9 shows the comparisons among the four programs. As we expected, the paddings for both cache and TLB were highly effective, and the padding program performed the best. For example, using float type, the padding program is about 40% faster than blocking with buffer for n = 22 or larger. We also show that the method using available registers to supplement associativity is effective. Although it is not as good as the padding program due to the increase in instruction count, it still achieved up to a 12% reduction in execution time over the blocking with software buffer program. As we expected, the method that uses the 4-way associative L2 cache without the supplement of registers to form a 4 × 4 blocking was delayed mainly by the longer L2 cache hit time. This method still outperformed blocking with a software buffer.

6.6 Performance comparisons on the Compaq XP-1000

The Compaq XP-1000 is a 1999 product using an Alpha processor of 500 MHz, a 64 KB 2-way associative L1 cache, and a 4 MB 2-way associative L2 cache. The cache lines of both L1 and L2 are 64 bytes long. Similar to the SGI and Sun machines, the associativity of L2 on the XP-1000 is low and the L2 cache line is relatively long, so it is difficult to do blocking with associativity and available registers. We only implemented the blocking with padding method to compare with blocking with software buffer and the baseline reference. We scaled the bit-reversal methods from n = 16 to n = 25. Figure 10 shows the comparisons among the three programs for both float and double types on the XP-1000 machine. As we expected, we achieved performance better than or comparable to that on the Sun machines. For example, for n = 24 or larger, the padding program is 30% faster than blocking with buffer using float type, and 15% faster using double type.
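The register arithmetic quoted for the Pentium II in Section 6.5 can be checked with a few lines of C. This is only a sketch of the stated count (L - K) × (L - K), where L is the number of elements per cache line and K is the cache associativity; the function names `elems_per_line` and `supplement_registers` are hypothetical, and treating the count as zero whenever K >= L is an assumption drawn from the double type case in the text.

```c
#include <stddef.h>

/* Elements per cache line: line size in bytes divided by element size.
 * (Hypothetical helper, not part of the original program.) */
size_t elems_per_line(size_t line_bytes, size_t elem_bytes)
{
    return line_bytes / elem_bytes;
}

/* Registers needed to supplement a K-way associative cache for an
 * L x L block, using the (L - K) * (L - K) count quoted in the text.
 * Assumption: no registers are needed once K >= L. */
size_t supplement_registers(size_t L, size_t K)
{
    if (K >= L)
        return 0;
    return (L - K) * (L - K);
}
```

With the 32-byte lines and 4-way associativity reported for the Pentium II, a 4-byte float gives L = 8 and 16 registers, while an 8-byte double gives L = 4 and no registers, matching the figures in the text.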

Figure 9: Execution comparisons on the Pentium II 400 PC (panels: float and double), comparing blocking with software buffer, blocking with padding, blocking with associativity and registers (breg-br), and the ideal baseline reference.

7 Conclusion

We have examined and developed cache-optimal methods for bit-reversal data reorderings. These methods have been tested on 5 representative processors from products of 1995 to 1999 to show their effectiveness. We summarize the different techniques and their merits and limits in Table 2, which gives a guideline for application users to choose a technique based on the size of the problem and the machines available. We also attach the source code of the padding method at the end of the paper.

Acknowledgement: We thank Kang Su Gatlin for his constructive suggestions on a preliminary version of this paper. Neal Wagner carefully read the manuscript and made constructive comments. Finally, we appreciate the insightful reviews from the anonymous referees.

References

[1] D. F. Bacon, S. L. Graham, and O. J. Sharp, Compiler transformations for high performance computing, ACM Computing Surveys, Vol. 26, No. 4, December 1994.
[2] B. Bershad, D. Lee, T. Romer, and B. Chen, Avoiding conflict misses dynamically in large direct-mapped caches, Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), October.
[3] M. Cekleov and M. Dubois, Virtual-address caches, IEEE Micro, September/October 1997.

[4] K. S. Gatlin and L. Carter, Memory hierarchy considerations for fast transpose and bit-reversals, Proceedings of the 5th International Symposium on High-Performance Computer Architecture (HPCA-5), January.
[5] A. H. Karp, Bit reversal on uniprocessors, SIAM Review, Vol. 38, No. 1, March 1996.
[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann.
[7] L. McVoy and C. Staelin, lmbench: portable tools for performance analysis, Proceedings of the 1996 USENIX Technical Conference, San Diego, California, 1996.
[8] C. Rivera and C.-W. Tseng, Data transformations for eliminating conflict misses, Proceedings of the SIGPLAN 98 Conference on Programming Language Design and Implementation, July.
[9] M. Rosenblum, et al., Using the SimOS machine simulator to study complex computer systems, ACM Transactions on Modeling and Computer Simulation, Vol. 7, No. 1, 1997.
[10] Y. Yan, X. Zhang, and Z. Zhang, A memory-layout oriented run-time technique for locality optimization, Proceedings of the 1998 International Conference on Parallel Processing (ICPP 98), August 1998.
[11] C. Zhang, X. Zhang, and Y. Yan, Two fast and high-associativity cache schemes, IEEE Micro, Vol. 17, No. 5, 1997.

Figure 10: Execution comparisons on the Compaq XP-1000 workstation (panels: float and double), comparing blocking with software buffer, blocking with padding, and the ideal baseline reference.

Method                                     Cross         Instruction  Memory  Program     Comments
                                           interference  count        space   complexity
blocking only                                                                             limited by data sizes.
blocking with software buffer                                                             system independent.
blocking with register buffer                                                 1           limited by the number of available registers.
blocking with associativity and registers                                     2           works well on high associativity caches.
blocking with padding                                                 +       1           works well on all systems.
blocking for TLB                                                                          a TLB size dependent outer loop, effective for fully associative TLBs.
padding for TLB                                                       +       1           paddings by using L pages, effective for set associative TLBs.

Table 2: Summary of the blocking methods and their impact on the three aspects of performance (cross interference, instruction count, and memory space) and on the program complexity. The performance of the blocking only method is the baseline for comparisons. Note: + means that the method quantitatively increases the factor and hurts the performance; blank means it has no impact. The program complexity is subjective and is compared with the blocking only method, with 1 being slightly more complex and 2 moderately more complex.

    /* This is a padded bit-reversal program for cache optimization.
     * N = 2^n is the data size and B = 2^b the block side. X, Y, N, n, b, B,
     * PAD_LENGTH, DATA_TYPE, bitrev_tbl, and the bitrev macro are expected to
     * be defined elsewhere in the program. */
    void bit_reversal()
    {
        int blk, blk_rev, i, i_rev, j, jump = PAD_LENGTH, k;
        int D = N >> 2*b, d = n - 2*b;
        DATA_TYPE *Xp[B];
        DATA_TYPE *Yp, f0, f1, f2, f3;

        /* Xp[i] points to the padded segment of X whose top b bits reverse to i. */
        for (i = 0; i < B; i++)
            Xp[i] = &X[bitrev_tbl[i]*jump];

        for (blk = 0; blk < D; blk++) {
            bitrev(blk, blk_rev, d);   /* blk_rev = reversal of the middle d bits */
            for (i = 0; i < B; i++) {
                i_rev = bitrev_tbl[i];
                k = (blk << b) + i;
                Yp = &Y[(blk_rev<<b) + (i_rev<<(n-b))];
                /* Copy one destination row, unrolled four times. */
                for (j = 0; j < B; j += 4) {
                    f0 = Xp[j][k];
                    f1 = Xp[j+1][k];
                    f2 = Xp[j+2][k];
                    f3 = Xp[j+3][k];
                    Yp[j] = f0;
                    Yp[j+1] = f1;
                    Yp[j+2] = f2;
                    Yp[j+3] = f3;
                }
            }
        }
    }
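The listing above depends on definitions that are not reproduced here, so the following is a self-contained sketch of the same indexing scheme rather than the authors' program. It assumes a source layout of 2^b segments of 2^(n-b) elements, each followed by `pad` unused elements (so the listing's PAD_LENGTH would correspond to 2^(n-b) + pad). The names `reverse_bits`, `padded_index`, `padded_bit_reversal`, and `check_padded_bit_reversal` are hypothetical, and the lookup table and four-way unrolling of the original are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Reverse the low `bits` bits of v (plain loop, not optimized). */
unsigned reverse_bits(unsigned v, int bits)
{
    unsigned r = 0;
    for (int i = 0; i < bits; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Position of logical element x in the padded source array
 * (assumed layout: `pad` unused elements after each 2^(n-b) segment). */
size_t padded_index(size_t x, int n, int b, size_t pad)
{
    size_t seg_len = (size_t)1 << (n - b);
    return (x / seg_len) * (seg_len + pad) + (x % seg_len);
}

/* Writes y[rev_n(x)] = xpad[padded_index(x)] for all x in [0, 2^n),
 * visiting the data in B x B blocks as in the listing above. */
void padded_bit_reversal(const float *xpad, float *y, int n, int b, size_t pad)
{
    assert(b >= 1 && 2 * b <= n);
    int d = n - 2 * b;                        /* width of the middle bit field */
    size_t B = (size_t)1 << b;
    size_t D = (size_t)1 << d;
    size_t jump = ((size_t)1 << (n - b)) + pad;

    for (size_t blk = 0; blk < D; blk++) {
        size_t blk_rev = reverse_bits((unsigned)blk, d);
        for (size_t i = 0; i < B; i++) {
            size_t i_rev = reverse_bits((unsigned)i, b);
            size_t k = (blk << b) + i;        /* offset inside a segment */
            float *yp = &y[(blk_rev << b) + (i_rev << (n - b))];
            for (size_t j = 0; j < B; j++) {
                size_t seg = reverse_bits((unsigned)j, b);
                yp[j] = xpad[seg * jump + k]; /* contiguous writes to y */
            }
        }
    }
}

/* Returns 1 when the padded routine agrees with a direct bit reversal.
 * Keep n below 24 so that indices are exactly representable as float. */
int check_padded_bit_reversal(int n, int b, size_t pad)
{
    size_t N = (size_t)1 << n;
    size_t B = (size_t)1 << b;
    float *xpad = calloc(N + B * pad, sizeof *xpad);
    float *y = malloc(N * sizeof *y);
    int ok = (xpad != NULL && y != NULL);

    if (ok) {
        for (size_t x = 0; x < N; x++)
            xpad[padded_index(x, n, b, pad)] = (float)x;
        padded_bit_reversal(xpad, y, n, b, pad);
        for (size_t x = 0; x < N; x++)
            if (y[reverse_bits((unsigned)x, n)] != (float)x)
                ok = 0;
    }
    free(xpad);
    free(y);
    return ok;
}
```

Calling `check_padded_bit_reversal(10, 3, 8)` exercises a 1024-element reversal with 8 × 8 blocks and 8 elements of padding per segment. The pad size that actually removes conflict misses depends on the cache line length and associativity of the target machine, which this sketch does not model.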
