FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS


SIAM J. SCI. COMPUT., Vol. 22, No. 6. © 2001 Society for Industrial and Applied Mathematics

ZHAO ZHANG AND XIAODONG ZHANG

Abstract. In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations of bit-reversals. We evaluate the merits and limits of each technique and its application and architecture-dependent conditions for developing cache-optimal methods. Besides testing the methods on different uniprocessors, we conducted both simulation and measurements on two commercial symmetric multiprocessors (SMP) to provide architectural insights into the methods and their implementations. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and translation-lookaside buffer (TLB) cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and we believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and multiprocessors.

Key words. cache optimizations, memory hierarchy, bit-reversals, shared-memory multiprocessors, parallel computing

AMS subject classifications. 68P05, 65Y20, 65Y05

1. Introduction. Many FFT algorithms require the data reordering operation of bit-reversal. If the bit-reversal operations are not implemented properly, those FFT computations can slow down significantly. Unfortunately, it is easy to implement bit-reversals poorly on uniprocessors and multiprocessors, because their performance is highly sensitive to how caches and memory hierarchies are used in the implementations. In other words, a fast bit-reversal implementation must be cache effective.
Several papers have addressed the significance and effects of considering the memory hierarchy in bit-reversals (e.g., [2], [11], and [15]). Besides their important use in FFT, different versions of bit-reversal implementations can also be used as benchmark programs to evaluate the memory hierarchy of various computer systems.

With the rapid development of RISC and VLSI technology, the speed of processors has increased dramatically in the past decade. Processor clock rates have doubled every 1-2 years. Nevertheless, memory speed has increased at a much slower pace. Therefore we have seen and will continue to see an increasing gap in speed between processor and memory, and this gap makes the performance of application programs on both uniprocessor and multiprocessor systems rely more and more on effective usage of caches. Performance degradation of bit-reversals is mainly caused by cache conflict misses. Bit-reversals are often repeatedly used as fundamental subroutines for scientific programs, such as FFT. Thus, in order to gain the best performance, cache-optimal methods and their implementations should be carefully and precisely done at the programming level. This type of performance programming for some special programs, such as bit-reversals, may significantly outperform an optimization from an automatic tool, such as a compiler.

(Received by the editors September 17, 1999; accepted for publication in revised form in November 2000; published electronically April 12, 2001. This work is supported in part by the National Science Foundation under CCR grants, by the Air Force Office of Scientific Research under an AFOSR grant, and by Sun Microsystems under grant EDUE-NAFO. Preliminary results of this work were presented in the 1999 Supercomputing Conference, Portland, OR. Department of Computer Science, College of William and Mary, Williamsburg, VA (zzhang@cs.wm.edu, zhang@cs.wm.edu).)

A standard bit-reversal program is described as follows:

    for i = 0, N-1
        Y[i'] = X[i]

The values of array X in their sequential positions $i$ are copied to array Y in their bit-reversal positions $i'$, for $i = 0, \ldots, N-1$, where $N = 2^n$. The above program says that X is a bit-reversal reordering of Y. The indices $i$ and $i'$ of X and Y are represented by a sequence of binary digits. Position $i$ and its bit-reversal $i'$ are defined in [11] as

$$ i = \sum_{j=0}^{n-1} a_j 2^j \quad \text{and} \quad i' = \sum_{j=0}^{n-1} a_j 2^{n-1-j}, $$

where $a_j$ is either 0 or 1. For example, the 5-bit reversal of $i = 11000$ is $i' = 00011$.

The bit-reversal operations have the following unique characteristics. First, in many implementations, each element in an array is used (read or written) only once for its copy operation. Thus, the reorderings have only spatial locality but no temporal locality for elements. Second, the loops follow certain sequences with high spatial locality.

Bit-reversals are highly sensitive to problem sizes, cache sizes, and cache line sizes. Since the data array sizes are a power of 2, multiple elements stored in different memory locations could map to the same cache line, causing severe cache conflict misses and cache thrashing. The reason is simple: most commercial computers use direct-mapped or set-associative caches whose mapping functions are also related to powers of 2.

We use a single unit, called an element, to represent the sizes of data arrays, caches, and other structures such as buffers and blocks. One element may represent a 4-byte integer, a 4-byte floating point number, or an 8-byte double floating point number.
Because the sizes of caches and cache lines are always a multiple of an element in practice, using this single unit for all sizes is meaningful for both architects and application programmers and keeps the discussion straightforward.

Here are the algorithmic and architectural parameters we will use to describe cache-optimal methods of bit-reversals.

$C$: data cache size, which could be further defined as $C_{L1}$ and $C_{L2}$ for data cache sizes of L1 and L2, respectively.
$L$: the size of a cache line, which could be further defined as $L_{L1}$ and $L_{L2}$ for cache lines of L1 and L2, respectively.
$K$: cache associativity, which could be further defined as $K_{L1}$ and $K_{L2}$ for cache associativity of L1 and L2, respectively.
$K_{TLB}$: translation-lookaside buffer (TLB) cache associativity. (A TLB cache is a small buffer that holds the most recent memory page mappings. The concept will be discussed in detail later in the paper.)
$T_s$: number of entries in the TLB cache.
$N$: the data size for the bit-reversal vector of size $N = 2^n$, where $n$ is the number of bits used in the vector index.
$B_{cache}$: blocking size of a $B \times B$ submatrix for cache.
$B_{TLB}$: blocking size for TLB.
$P_s$: a memory page size.
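The standard bit-reversal program and the definition of $i'$ given earlier in this section can be sketched in C as follows. This is a minimal, unoptimized illustration rather than the authors' code: the function names `bit_reverse` and `bit_reversal_copy` are hypothetical, indices are taken as 0-based, and elements are assumed to be 8-byte doubles.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the low n bits of i: bit j of i moves to bit n-1-j.
 * Hypothetical helper; assumes n < 32 and a 32-bit unsigned. */
static unsigned bit_reverse(unsigned i, unsigned n)
{
    unsigned r = 0;
    assert(n < 32);
    for (unsigned j = 0; j < n; j++) {
        r = (r << 1) | (i & 1u);   /* append the current lowest bit of i */
        i >>= 1;
    }
    return r;
}

/* Out-of-place reordering: Y[i'] = X[i] for every i in [0, 2^n).
 * Each element is read once and written once, so there is no
 * temporal locality to exploit. */
static void bit_reversal_copy(const double *x, double *y, unsigned n)
{
    size_t len = (size_t)1 << n;
    for (size_t i = 0; i < len; i++)
        y[bit_reverse((unsigned)i, n)] = x[i];
}
```

Reads from X are sequential, but consecutive writes to Y are far apart, which is the access pattern whose conflict misses the blocking, buffering, and padding techniques aim to remove.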

In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations. We evaluate the merits and limits of each technique and its application and architecture-dependent conditions for developing cache-optimal methods. Although our methods are developed for out-of-place bit-reversals, they are also applicable to in-place bit-reversals where X and Y are the same array.

Symmetric multiprocessor (SMP) systems have become practical and cost-effective servers for scientific computing and other applications. Although parallel efficiency and communication latency reduction are major performance concerns, computations on an SMP share many common considerations with uniprocessors. The most important one is the effective usage of memory hierarchies. When the cache locality of each processor is effectively exploited, the memory accesses to the shared memory will be reduced, and so will the memory access contention. Parallel data reordering algorithms have been studied on distributed-memory systems with special networks, such as hypercubes (see, e.g., [6] and [9]). In this study, we target parallel bit-reversals on SMPs and show the significant impact of the cache and TLB considerations for efficient method development and implementations. We also evaluate the performance impact of SMP interconnection networks.

Our algorithm designs and implementations are optimized by considering several nontraditional but practical and performance-effective factors, namely, the programming complexity, memory space requirement, instruction count, cross interference among the data arrays, and program portability. We will summarize the limits and merits of different bit-reversal methods based on these considerations after we have discussed the designs and presented the performance results, aiming at providing a guideline for performance programming and memory performance optimization for other scientific computing applications.
We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and TLB cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods and believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and SMP multiprocessors.

The rest of the paper is organized as follows. In section 2, we discuss the inherently blocking nature of bit-reversal operations and the effectiveness and limits of blocking techniques. In section 3, we evaluate a software buffering technique and our methods using existing hardware components for implementing the data reordering. Our new method integrating blocking and padding will be presented in section 4. We discuss blocking and padding techniques for TLB in section 5. The experimental measurements and analyses for evaluating different methods on uniprocessor workstations and SMP multiprocessors will be reported in sections 6 and 7. We summarize the work in section 8.

2. Blocking for bit-reversals. The blocked memory access patterns of bit-reversals can be easily viewed when we convert the one-dimensional vector to a two-dimensional equivalent array, as in Figure 1. All the reordering elements and elements in other groups will be allocated along the column in the two-dimensional equivalent array, forming a block. In this blocking method, the bit-reversal reordering is performed block by block, where the operations for each block are implemented similarly to the Evans method [7]. (The Evans method is used to construct a hybrid method in [11].) The program in the appendix presents such an implementation along with the padding technique. (The padding technique will be discussed in section 4.) The blocking algorithm we have used can be classified as a hybrid method.

Fig. 1. Memory layout of a blocked bit-reversal, where $B = B_{cache}$.

In general, for a bit-reversal vector of $N = 2^n$ elements, the block size $B_{cache}$ is a power of 2, denoted by $B_{cache} = 2^b$. Each of the $B_{cache}$ elements in X has the address format $fg$, where $g$ has $b = \log_2 B_{cache}$ bits and $f$ has $n - b$ bits. Each of the corresponding $B_{cache}$ elements in Y has the address format $g'f'$. Therefore, the distance between two nearest elements in the same group in Y is $2^{n-b} = N/B_{cache}$. Choosing the cache line size as the minimum blocking size ($B_{cache} = L$), we can easily calculate the maximum $N$ for the bit-reversal vector based on different data cache sizes. For example, for a large cache of 2 MB, the blocking technique is effective up to an 18-bit-reversal reordering, which represents 262,144 data elements, where each element is an 8-byte double type and the cache line is 32 bytes. In practice, the data size of bit-reversals could easily be larger than $n = 20$ [11].

3. Blocking with buffers. As we have shown, the effectiveness of blocking is limited by the size of the data arrays. In theory, the smallest blocking size could be 2 × 2. A cache line in a modern processor usually holds more than 2 elements, i.e., is larger than 16 bytes. If we choose a 2 × 2 block, the data in a cache line will not be fully used before their replacement, causing more cache misses in the reorderings. The bit-reversal reordering demands large cache space to make blocking effective. In order to effectively use limited cache space, Gatlin and Carter [8] present an effective method using an additional buffer to first hold the conflict-missed elements of a block in one array temporarily and then copy the block to their reordered positions in the other array.
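The idea of reordering block by block through a temporary buffer can be sketched as follows. This is an illustrative C sketch under stated assumptions, not the code of [8] or of the appendix: the index is assumed to split into three fields $a\,|\,m\,|\,c$, with $a$ and $c$ of $b$ bits each, so that $i' = c'\,|\,m'\,|\,a'$. The names `blocked_bit_reversal`, `rev_bits`, and `MAX_B` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_B 16   /* hypothetical upper bound on the tile edge B = 2^b */

/* Reverse the low `width` bits of v. */
static unsigned rev_bits(unsigned v, unsigned width)
{
    unsigned r = 0;
    for (unsigned j = 0; j < width; j++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Blocked out-of-place bit-reversal of 2^n doubles through a B x B buffer.
 * With i = a | m | c (a and c have b bits), i' = rev(c) | rev(m) | rev(a).
 * Requires n >= 2*b. */
static void blocked_bit_reversal(const double *x, double *y,
                                 unsigned n, unsigned b)
{
    assert(n < 32 && n >= 2 * b && ((size_t)1 << b) <= MAX_B);

    size_t B = (size_t)1 << b;
    unsigned mid_bits = n - 2 * b;
    size_t stride = (size_t)1 << (n - b);   /* 2^(n-b) = N/B between tile rows */
    double buf[MAX_B][MAX_B];

    for (size_t m = 0; m < ((size_t)1 << mid_bits); m++) {
        size_t m_rev = rev_bits((unsigned)m, mid_bits);

        /* Step 1: read B contiguous runs of X and store them in the buffer,
         * transposed, with row and column numbers bit-reversed. */
        for (size_t a = 0; a < B; a++)
            for (size_t c = 0; c < B; c++)
                buf[rev_bits((unsigned)c, b)][rev_bits((unsigned)a, b)] =
                    x[a * stride + (m << b) + c];

        /* Step 2: write the buffer to Y as B contiguous runs. */
        for (size_t r = 0; r < B; r++)
            for (size_t s = 0; s < B; s++)
                y[r * stride + (m_rev << b) + s] = buf[r][s];
    }
}
```

Both X and Y are then touched in runs of B consecutive elements, and every element is copied twice (into the buffer and out of it).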
In this section, we discuss implementations of blocking methods supported by both software and hardware buffers.

3.1. Blocking with a software buffer and its limits. Because this buffer is defined in a reordering program, we call it a software buffer. This buffer shares the allocation space with the data arrays X and Y in the cache. There are two major limits in this approach. First, the buffer itself may interfere with arrays X and Y, causing additional access conflicts. This interference is certain when the sizes of X and Y are larger than the size of the cache, $C$: each cache block or set is then mapped from arrays X and Y more than once, so no matter where the buffer is located in the cache, it will interfere with them. The larger the buffer size, the more interference will occur. The second limit is the additional copy overhead involved in moving data from the array X to the buffer and then moving them to the target array in their reordered positions. This overhead exactly doubles the instruction cycles for data copying. The data copy through a buffer is a worthy investment if the number of cycles lost from cache misses is much higher than the additional CPU cycles for the data copy. To overcome the two limits, we propose several alternatives that eliminate the cache interference caused by the software buffer and reduce or eliminate the data copy time.

3.2. Cache structure dependent blocking. We will present several blocking methods which depend on the cache organization of the running machine. These methods can be implemented at the user programming level.

Blocking based on set associativity. The cache associativity, $K$, is an important factor to consider for blocking. If $K \ge L$, an $L \times L$ or a $K \times K$ blocking method for bit-reversals would effectively avoid conflict misses. Because the hit time is a less sensitive performance factor than the cache misses in the L2 cache, a higher associativity of the L2 cache is more effective than that of L1. If a cache line holds 4 double floating point elements ($L = 4$ elements of 32 bytes in Pentium processors), a 4 × 4 blocking method without any data buffer is able to fully use the cache associativity. The blocking method would gain more benefit from caches of associativity higher than 4, such as a design in [2]. What would we do if the associativity is not sufficiently high for the blocking, or $K < L$? One solution is to make a $K \times L$ rectangular blocking. Unfortunately, bit-reversals require an $L \times L$ blocking.

Supplement with registers. We may also consider using the available registers to supplement a low associativity cache. The number of registers available to a user program is limited. Normally, a uniprocessor provides up to 16 registers to users.
For example, for a 2-way associative cache, we need 8 registers to buffer 2 additional cache lines so that we could effectively make a 4 × 4 blocking as if we ran the program on a 4-way associative cache. We develop a more efficient blocking method for bit-reversals, which requires only $(L-K) \times (L-K)$ registers. The operation sequence of this method is in three steps:
(1) $L-K$ cache lines of X are accessed together with $K$ cache lines of Y: $(L-K) \times K$ of their elements are copied to Y in the reordered positions, and the remaining $(L-K) \times (L-K)$ elements are copied to a buffer consisting of $(L-K) \times (L-K)$ registers.
(2) The remaining $K$ lines of X are brought to the cache set, and $K \times K$ of their elements are copied to Y in the reordered positions.
(3) Finally, the $(L-K) \times (L-K)$ elements in the register buffer and the remaining $(L-K) \times K$ elements are copied to Y in their reordered positions.
A cache set will be used more than twice if $K < L/2$. Besides the advantage of no access conflicts between the register buffer and the arrays X and Y, there is another advantage of using registers to buffer the data in a load/store processor. A data copy through the registers from X to Y is equivalent to the two-step process of load and store, and thus there will be no additional overhead. We will show our experimental performance in section 6.

Using registers as the buffer. If the cache is direct-mapped, we have to fully rely on a buffer for blocking. Here we discuss some ways to use registers as the buffer in order to eliminate the potential cache conflicts and to eliminate extra data copying by taking advantage of the load/store operations. The number of registers for a buffer of $L \times L$ elements is determined by the number of elements a cache line can hold. The length of a cache line of the L1 cache in some processors, such as Sun SPARC Micro I and II, is $L = 2$ (16 bytes), which holds only two double floating point elements. The blocking size could be as small as 2 × 2 using a buffer of 4 registers. The cache line length of the L1 cache in many advanced workstations, such as the Sun Ultra and Intel Pentium processors, is 32 bytes, which holds 4 double floating point elements. In this case, we need a buffer of 4 × 4 = 16 registers for a blocking. This would be difficult due to the limited number of available registers. We have two solutions for this. First, we use only the number of registers available to form a smaller buffer than required, which will not make each cache line fully used and will cause additional cache misses. Our experiments show that this blocking method, using a buffer with an insufficient number of registers, still achieves a reasonable performance improvement and outperforms the implementation using a software buffer. The second solution is to further reduce the size of the buffer, and hence the required number of registers, by using our $(L-K) \times (L-K)$ blocking method.

L1 cache versus L2 cache. The main objective of building two-level caches is to make the L1 cache small enough to catch up to the cycle time of the fast CPU and to make the L2 cache large enough to capture as many accesses as possible [12]. In practice, the data size of a bit-reversal is larger than the size of the L2 cache. L1 and L2 caches offer different sizes of the cache line, $L$, and the associativity, $K$. Both of the following alternatives are effective for blocking.
(1) Taking advantage of a short cache line and fast hit time of the L1 cache, we could effectively use limited registers as the buffer and make a small $L \times L$ blocking effective.
(2) Taking advantage of high associativity of the L2 cache, we could effectively use both associativity and supplemental registers as the buffer and make a large $L \times L$ blocking effective.

3.3. Victim-cache-aided blocking. A victim cache [13] is a small fully associative cache serving as a buffer that contains only cache blocks evicted from the L1 cache by conflict misses. It is an on-chip cache connected between L1 and the next level cache or memory. On a miss in L1, the victim cache is checked first before going to the next level. If the missed block is found there, the victim cache block and the L1 cache block are swapped, and then the block is delivered to the CPU from the L1 cache. Victim caches have been available in some commercial workstations, such as the HP7200.

The minimum number of victim cache lines required for $L \times L$ blockings of transpose and bit-reversal reorderings is $L-K$. In the execution, the $L \times L$ elements of each blocking are allocated in a set of $K$ lines in the L1 cache, and the rest of the elements are allocated in the $L-K$ lines of the victim cache. The victim cache is able to hold all the conflict misses in the reorderings by an $L \times L$ blocking. In addition, a conflict miss in the L1 cache that hits in the victim cache has only one additional cycle of miss penalty. Thus, a simple $L \times L$ blocking method would be effective if such a victim cache is available.

However, the victim cache does not have a direct connection with the CPU. When a data hit happens in the victim cache, the block has to be first swapped to the L1 cache and then delivered to the CPU. This swapping operation is unnecessary for our reordering algorithms. Without counting the cold misses of bringing in the elements of the first column for an $L \times L$ blocking, and considering the LRU replacement policy, the entire blocking will have $L \times (L-1)$ conflict misses in the L1 cache, which are then found in the victim cache. This also means that each such blocking needs $L \times (L-1)$ additional swapping cycles between the L1 cache and the victim cache, which is independent of the associativity, $K$. In contrast with the blocking method based on the associativity supplemented by registers, the swapping cycles in the victim cache are additional overhead. Despite this, a victim-cache-aided blocking is more efficient than a blocking method with a software buffer because there are no cross interference conflicts between the victim buffer and arrays X and Y.

4. Blocking with padding. Padding is a technique that modifies the data layout of a program so that the conflict misses are reduced or eliminated. The data layout modification can be done at run-time by system software [3, 19] or at compile-time by compiler optimization [16]. Sharing the objective of compiler optimization, which is to change the addresses of potentially conflicting cache blocks in the reorderings, we insert padding variables inside the data array. For example, the padding can be done as part of the last butterfly for the decimation in an FFT computation without additional cost, and the output is not padded. However, we notice that this free padding opportunity may not be easily found, and the bit-reversal result may be padded in some cases. For example, the padding of a recursive implementation of the Cooley-Tukey FFT algorithm [5] is more complex than the padding in our implementations. The padding method produces padded results in a vector if the bit-reversals are done in an in-place fashion. Accesses to the padded results need to go through a simple address converting process with additional CPU cycles. In addition, our methods target bit-reversals based on data sizes that are powers of 2. However, FFT algorithms are not limited to this data size. If the data size is not a power of 2, the padding method will be more complex to implement. Poor memory performance of bit-reversals has been reported even for non-power-of-2 data sizes (see, e.g., [2]).
Since the data arrays of bit-reversals form a vector whose size is a power of 2, the padding is highly regular: we insert $L$ elements, or one cache line of space, starting at the vector positions $N/L$, $2N/L$, ..., and $(L-1)N/L$. Using $L$ elements (the data of one cache line) to separate the vector at these points can completely eliminate the cache conflicts caused by the address mapping based on powers of 2. Again, during execution the reordering data copies are conducted directly between the arrays X and Y without going through a data buffer. Another advantage is that the number of padding elements needed is only $L \times L$, or $L$ cache lines, and is independent of the data array size, $N$. Compared with the data size of bit-reversals, the number of padding elements is insignificant. Figure 2 shows how the data layout of a bit-reversal vector is modified by padding so that conflict misses are eliminated.

Compiler optimization targets a large range of application programs and automatically inserts padding variables in the programs for users. An optimal padding is application program dependent. For example, padding positions differ from one application to another in order to effectively change the addresses of conflicting cache blocks [18]. Based on the unique nature of the data reordering, the optimal padding unit used by our methods for bit-reversals is a cache line with $L$ elements. In contrast, a compiler optimization normally uses an element as the basic padding unit. How many padding units to use and where to pad in the data arrays are determined by approximation models, which may not precisely fit the unique memory access patterns of each case. In addition, applying the padding technique to bit-reversals embedded in applications would not increase the complexity of the entire computation. For example, when a padded bit-reversal is performed in an FFT computation, it has little effect on the neighboring butterfly operations.
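The address conversion implied by this padding layout can be sketched in C as follows. This is an illustration under stated assumptions rather than the appendix program: the helper names `padded_index`, `padded_length`, and `cache_set` are hypothetical, all sizes are in elements, and `cache_set` models a direct-mapped cache purely to show the effect of the padding.

```c
#include <assert.h>
#include <stddef.h>

/* Map logical index i of an N-element vector to its position in the padded
 * array, where L padding elements are inserted at N/L, 2N/L, ..., (L-1)N/L. */
static size_t padded_index(size_t i, size_t n_elems, size_t line_elems)
{
    assert(line_elems > 0 && n_elems % line_elems == 0 && i < n_elems);
    size_t segment = n_elems / line_elems;    /* N/L elements per segment */
    return i + (i / segment) * line_elems;    /* skip one line per boundary crossed */
}

/* Allocation size: N data elements plus (L-1) gaps of L elements each,
 * which stays within the L x L bound on padding elements. */
static size_t padded_length(size_t n_elems, size_t line_elems)
{
    return n_elems + (line_elems - 1) * line_elems;
}

/* Set index of position pos in a direct-mapped cache holding cache_elems
 * elements with line_elems elements per line (illustrative model only). */
static size_t cache_set(size_t pos, size_t cache_elems, size_t line_elems)
{
    return (pos % cache_elems) / line_elems;
}
```

For instance, with $N = 2^{20}$, $L = 4$, and a cache of $2^{14}$ elements, the four positions $kN/L$ all fall into set 0 without padding, while their padded positions fall into sets 0, 1, 2, and 3.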

Fig. 2. Data layout of a bit-reversal is modified by padding, where $B = B_{cache} = L$.

5. Blocking and padding for TLB. The TLB is a special cache that stores the most recently used virtual-physical page translations for memory accesses. The TLB is a small and usually fully associative cache. Each entry points to a memory page of 4 KB to 64 KB. The page size is normally fixed at the level of operating systems and cannot be changed by user programs. A TLB cache miss will make the system retrieve the missing translation from the page table in memory and then select a TLB entry to replace. When the data to be accessed in our blocking method is larger than the amount of data in all the memory pages that the TLB can hold, we will have TLB thrashing. In this section, we will discuss and present blocking and padding methods for TLB cache optimizations.

5.1. Blocking for a fully associative TLB. Before giving a general model to show how the blocking size is affected by the TLB size, let us go through an example to show that a moderate $N$ for bit-reversals would easily lead to TLB cache thrashing. The 64 pages in the TLB of the Sun UltraSparc-II processor hold 64 × 1024 = 65,536 elements, which represents a 16-bit-reversal of $N = 2^{16}$. Since we have two vectors X and Y, the TLB can hold a 15-bit-reversal of $N = 2^{15}$ elements. This is also consistent with our experiments on this machine, where the execution time per element was constant until $n = 15$ but sharply increased at $n = 16$ because of TLB misses.

In our cache-optimal methods, we include an outer loop to form a blocking for TLB, whose size is denoted as $B_{TLB}$. When $N \ge T_s \times P_s$, the blocking size for bit-reversals must satisfy $B_{TLB} \le T_s$, where $P_s$ is the page size in elements and $T_s$ is the number of entries of the TLB. On the other hand, $B_{TLB}$ should be chosen as large as possible to make effective use of the page space. When $N < T_s \times P_s$, the data size of a bit-reversal is less than the data size covered by the TLB, and there is no need for TLB optimizations.
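The selection rule for $B_{TLB}$ can be written as a small helper. This is a sketch under stated assumptions: the function name `tlb_block_size` is hypothetical, the block size is restricted to a power of 2, and the `n_arrays` parameter is an assumption added here to let X and Y share the TLB entries. It is not part of the rule as stated, which corresponds to `n_arrays = 1`.

```c
#include <assert.h>
#include <stddef.h>

/* Choose a TLB blocking size in pages.
 * Returns 0 when N < T_s * P_s, i.e. no TLB blocking is needed.
 * Otherwise returns the largest power of 2 not exceeding T_s / n_arrays.
 * n_arrays = 1 reproduces the bound B_TLB <= T_s; n_arrays = 2 is an
 * assumption that arrays X and Y split the TLB entries evenly. */
static size_t tlb_block_size(size_t n_elems, size_t tlb_entries,
                             size_t page_elems, size_t n_arrays)
{
    assert(tlb_entries > 0 && page_elems > 0);
    assert(n_arrays > 0 && n_arrays <= tlb_entries);

    if (n_elems < tlb_entries * page_elems)
        return 0;                        /* data is covered by the TLB */

    size_t budget = tlb_entries / n_arrays;
    size_t b = 1;
    while (b * 2 <= budget)              /* as large as possible within budget */
        b *= 2;
    return b;
}
```

With 64 entries and 1024 elements per page, a vector of $2^{15}$ elements needs no TLB blocking, while a vector of $2^{20}$ elements gets a block size of 64 for `n_arrays = 1` and 32 for `n_arrays = 2`.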

Fig. 3. Padding for TLB: the data layout is modified by inserting a page space at multiple locations, where $B_{TLB} = 4$, $K_{TLB} = 1$, $T_s = 8$.

5.2. Padding for a set-associative TLB. Some processors' TLBs are not fully associative, but set-associative. For example, the TLB in the Pentium II 400 processor is 4-way associative ($K_{TLB} = 4$). A simple blocking based on the number of TLB entries is not cache-optimal, because multiple pages within a TLB-size-based blocking may map to the same TLB cache set and cause TLB cache conflict misses. If the size $N$ of a bit-reversal vector is a multiple of $T_s \times P_s$, where $T_s$ is the number of TLB entries and $P_s$ is the page size in elements, and if $K_{TLB} < B_{TLB}$, TLB cache conflict misses will occur. This could easily happen in practice. For example, on the Pentium II 400, $N$ is equal to 128K elements (one element = 8 bytes) for a 17-bit-reversal, and this $N$ is two times the value $T_s \times P_s$ of the machine, where $T_s = 64$ and $P_s = 1024$ elements.

In a way similar to the technique of padding for the data cache, we insert a page of elements, or a page of space, starting at the vector positions $N/L$, $2N/L$, ..., and $(L-1)N/L$ to eliminate the TLB cache conflict misses. Figure 3 gives an example of the padding for TLB, where the TLB is a direct-mapped cache of 8 entries, the blocking size is $B_{TLB} = 4$, and the number of elements of a row is a multiple of 8 pages. Before padding, each blocking row is mapped to the same cache line of the TLB. After padding, these rows are mapped to different cache lines of the TLB.

Combining padding for the data cache and padding for the TLB cache, we insert $L + P_s$ elements, or a page plus a cache line of space, in $L$ locations separated by a distance of $N/L$ elements. In practice, we selected more than $N/L$ points to insert the padding variables to eliminate both data cache and TLB conflict misses. This approach effectively merges two nested paddings (one for the data cache and the other for the TLB) into a single one.
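The effect of page padding on a set-associative TLB, as in the Figure 3 configuration, can be checked with a small model. This is a sketch under stated assumptions: the helper names `tlb_set` and `row_padded_index` are hypothetical, and the TLB set is assumed to be selected by the virtual page number modulo the number of sets, which real processors may not do.

```c
#include <assert.h>
#include <stddef.h>

/* TLB set that a vector position maps to, assuming the set is chosen by
 * (virtual page number) mod (number of sets). All sizes are in elements. */
static size_t tlb_set(size_t pos, size_t page_elems,
                      size_t tlb_entries, size_t tlb_assoc)
{
    assert(page_elems > 0 && tlb_assoc > 0 && tlb_entries >= tlb_assoc);
    size_t sets = tlb_entries / tlb_assoc;
    return (pos / page_elems) % sets;
}

/* Position of logical index i when pad_elems padding elements are inserted
 * after every row of row_elems elements. pad_elems = P_s pads for the TLB
 * only; pad_elems = L + P_s is the combined cache and TLB padding. */
static size_t row_padded_index(size_t i, size_t row_elems, size_t pad_elems)
{
    assert(row_elems > 0);
    return i + (i / row_elems) * pad_elems;
}
```

With a direct-mapped TLB of 8 entries, pages of 1024 elements, and rows of 8 pages, the starts of four consecutive rows all map to TLB set 0 before padding and to sets 0, 1, 2, and 3 after padding with either $P_s$ or $L + P_s$ elements per row.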
An optimal number of inserting points can be easily determined experimentally based on the size of the TLB cache. The padding optimizations are all based on the L2 cache in our experiments.

Partial index mapping addresses of bit-reversals are precalculated and stored in a small table, as shown in the program in the appendix. This approach further improves the performance because the table will be accessed in the cache during the computation, and the precalculation overhead is trivial. The time for the precalculation is included in the total execution time.

Table 1. Architectural parameters of the 5 workstations we have used for the experiments. All specifications on L1 cache refer to the L1 data cache, and all L2s are unified. Each L2 cache block on UltraSPARC-IIi consists of two 16-byte subblocks. The hit times of L1, L2, and the main memory are measured by lmbench [14], and their units are converted from nanoseconds (ns) to CPU cycles.

Workstations      SGI O2    Sun Ultra 5       Sun E-450       Pentium     XP1000
Processor type    R10000    UltraSparc-IIi    UltraSparc II   P-II 400    Alpha

[Remaining rows of Table 1, values not preserved in this transcription: Clock rate (MHz); L1 cache (KBytes); L1 block size (Bytes); L1 associativity; L1 hit time (cycles); L2 cache (KBytes); L2 block size (Bytes); L2 associativity; L2 hit time (cycles); TLB size (entries); TLB associativity; Memory latency (cycles).]

6. Experimental results and performance evaluation. We have implemented and tested all the bit-reversal methods discussed in the previous sections on an SGI O2 workstation, a Sun Ultra-5 workstation, a Sun SMP server E-450, a Pentium PC, and a Compaq XP1000 workstation. We will present and evaluate the performance of different methods on different machines.

6.1. Experimental environment and evaluation methodology. We used lmbench [14] to measure the latencies of memory hierarchies at different levels on each machine. The architectural parameters of the 5 machines are listed in Table 1. We focus the performance evaluation on methods and implementations of bit-reversals in this paper. We compared all our methods with the method of blocking with a software buffer, which was recently published in [8]. We refer to this method as blocking with buffer for bit-reversals. Two of our methods are experimentally compared: breg-br, blocking with associativity and registers for bit-reversals, and blocking with padding for bit-reversals.
We have also applied the blocking or padding technique for the TLB in these two methods, based on the TLB associativity. All the programs use a standard subroutine to calculate the bit-reversal value for a given address. The execution times were collected by gettimeofday(), a standard Unix timing function. The resolution of this function is 1 µs on the machines being measured, which is significantly smaller than the execution times of any programs we have measured. A small bit-reversal table is precalculated, and we exclude this calculation time. The reported time unit is cycles per element (CPE):

$$ \mathrm{CPE} = \frac{\text{execution time} \times \text{clock rate}}{N}, $$

where execution time is the measured time in seconds, clock rate is the CPU speed (cycles/second) of the machine where the program is run, and $N$ is the number of elements of the bit-reversal program.

Besides the different methods of bit-reversals, we also measured the execution time of a program copying elements between X and Y. This program has the same number of data copying operations with a contiguous memory access pattern. We use the execution time of this program to provide a baseline reference for bit-reversal programs and to show how close a bit-reversal execution is to its ideal time. We refer to this program as the baseline reference program. Each method is further divided into a float version, using 4 bytes to represent an element, and a double version, using 8 bytes to represent an element. The data type divisions show the performance impact of the cache line length. For all experiments on different machines, the bit-reversal programs first call a routine to flush the cache to make sure that all the data are allocated only in the memory. All experiments were repeated multiple times.

6.2. Effects of TLB and virtual memory. Before measuring and comparing the performance of different bit-reversal methods, we experimentally evaluated the effects of TLB and virtual memory to confirm our assumptions and analyses.

Selection of TLB blocking size. The TLB blocking size is a sensitive performance parameter, which is determined by the size of the TLB if it is fully associative. We executed the blocking-with-padding program for bit-reversals with $n = 20$ on a single node of the Sun E-450, changing the blocking size for TLB from 8 to 128. The TLB of the E-450 is a fully associative cache with 64 entries. Figure 4 shows the measured cycles per element of the program for different blocking sizes on the node. Our experimental results are consistent with our analyses in the previous section. When the blocking size for TLB was 64, the execution time curve increased sharply. This is because arrays X and Y together demanded more than 64 pages and caused TLB thrashing.
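The CPE metric of section 6.1 can be computed from a wall-clock measurement as in the sketch below. The helper names `cycles_per_element` and `wall_seconds` are hypothetical; `gettimeofday()` is the POSIX call named in the text, and the clock rate has to be supplied by the caller since the sketch does not detect it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* CPE = (execution time in seconds * clock rate in cycles/second) / N. */
static double cycles_per_element(double seconds, double clock_hz, size_t n_elems)
{
    assert(n_elems > 0);
    return seconds * clock_hz / (double)n_elems;
}

/* Current wall-clock time in seconds, read with gettimeofday(). */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}
```

A measurement would read `wall_seconds()` before and after the reordering and pass the difference to `cycles_per_element`. For example, 1 second on a 400 MHz processor for 4,000,000 elements gives 100 cycles per element.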
Virtual memory versus physical memory addresses. All our analyses are based on cache mappings between memory pages in the virtual address space and cache blocks in the physical memory address space. This assumes that contiguous memory pages will be contiguously mapped to the cache. This assumption is guaranteed for virtual-address caches [4]. However, all our experiments have been performed on machines with physical-address L2 caches. Since the virtual-physical translations for L2 caches are handled by operating systems, our assumptions may sometimes be inaccurate. In order to show that many operating systems attempt to map contiguous virtual pages to cache blocks contiguously, so that our virtual-address-based study is practically meaningful and effective, we conducted a simulation using SimOS [17] and measurements on different workstations to observe how an operating system translates virtual memory addresses to physical addresses.

SimOS simulates the complete hardware of SGI machines and runs the IRIX 5.3 operating system in the simulation. We executed a blocking-only program of bit-reversals using the cache line L as the blocking size. The bit-reversal vector size was changed from n = 15 to n = 22. We measured the miss rates on array X. The cache size was set to 2 MB, holding two double type arrays up to n = 18 in the virtual memory space. Figure 5 gives consistent results from the SimOS simulation: when n > 18, the miss rate on array X increased sharply from 12.5% to 100%. From this experiment, we have observed that virtual-physical translations from the IRIX 5.3 operating system are quite consistent with our assumption of contiguous allocations.

We have also run similar experiments on different targeted workstations with different operating systems, such as Linux and Solaris, to measure the changes of execution times when the data size is changed. Our measurements are also consistent with the SimOS results and indicate that the larger the data arrays to be used, the more likely an operating system will allocate the pages contiguously. Because our study targets large data sets, our analysis based on the virtual memory space is reasonably accurate. In addition, our methods assume that the operating system uses a uniform page size for page allocation, which is consistent with most commercial and commonly used operating systems.

Fig. 4. Changing the TLB blocking sizes on a single node of the Sun E-450 (double type; cycles per element versus block size of TLB): when the blocking size for TLB was larger than 32, the execution time curve increased sharply.

Fig. 5. Using SimOS (IRIX 5.3) to observe the miss rates on array X by changing the size of the bit-reversal arrays of a blocking-only program: when n > 18, the miss rate increased sharply to 100%.

Performance of the hybrid method for bit-reversals. In order to show the effectiveness of our cache optimizations, we first plot the measured execution times of the hybrid method¹ in float data types on the Pentium-II and the Ultra-5 machines in Figure 6. Although the hybrid method did reasonably well for n ≤ 16 on the Pentium-II and n ≤ 12 on the Ultra-5, the execution times increased significantly due to limited cache performance after the data size was further increased.

Fig. 6. Execution times (cycles per element) of the hybrid method on the Pentium-II (left figure) and on the Ultra-5 machine (right figure), float type.

Performance comparisons on the SGI O2. The SGI O2 is a 1995 product using an R10000 processor of 150 MHz, 32 KB 2-way associative L1 cache, and 64 KB 2-way associative L2 cache. The cache line of L2 is 64 bytes. Since the associativity of L2 is low and the cache line of L2 is relatively long, it is difficult to do blocking with associativity and available registers. We implemented only the blocking with padding method to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 21. Figure 7 shows the comparisons of CPE among the three programs of both float type and double type on the SGI O2 machine. The measurements show that the padding method slightly reduced the execution time compared with the method of blocking with software buffer. The time reduction was up to 6%.
The reason for the small performance improvement is the extremely long memory latency (28 cycles) of the O2 machine. The reduction and saving of instruction cycles for data copies from padding became less significant because memory latencies caused by the required cold misses in both methods were dominant in execution.

¹ The program was written in Fortran by Alan Karp.

Fig. 7. Execution comparisons (cycles per element, float and double types) on the SGI O2 workstation among the method of blocking with software buffer, the method of blocking with padding, and the ideal baseline reference.

Performance comparisons on the Sun Ultra-5. The Sun Ultra-5 is a 1998 product using an UltraSparc-IIi processor of 275 MHz, 16 KB direct-mapped L1 cache, and 256 KB 2-way associative L2 cache. The cache line of L1 is 32 bytes, consisting of two 16-byte subblocks, and that of L2 is 64 bytes long. Similar to the SGI O2, the associativity of L2 on the Ultra-5 is low and the cache line of L2 is relatively long, so it is difficult to do blocking with associativity and available registers. We implemented only the blocking with padding method to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 23. Figure 8 shows the comparisons of cycles per element among the three programs of both float type and double type on the Ultra-5. The memory latency of the Ultra-5 (76 cycles) is significantly lower than that of the O2. We observed a more significant performance improvement from the method of blocking with padding over that of blocking with software buffer. For example, using float type, the padding program is 14% faster than that of blocking with buffer for n = 20 or larger. An L2 cache line of the Ultra-5 holds 16 float type elements (L = 16) and 8 double type elements (L = 8). The larger the L, the higher the overhead of blocking with software buffer. This has been confirmed by our comparative experiments between the float and double types on the Ultra-5 shown in Figure 8.

Performance comparisons on the Sun E-450. The Sun E-450 is a 4-processor SMP product. Each of the 4 nodes is an UltraSparc-2 processor of 300 MHz, 16 KB direct-mapped L1 cache, and 2 MB 2-way associative L2 cache. The cache line of L1 is 32 bytes, consisting of two 16-byte subblocks, and that of L2 is 64 bytes long.
Due to the limited associativity and a relatively long L2 cache line, we implemented only the blocking with padding method to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 25. Figure 9 shows the comparisons of CPE among blocking with software buffer, blocking with padding, and the base program on a single node of the E-450, each of which has both float type and double type. The memory latency of the E-450 (73 cycles) is slightly lower than that of the Ultra-5. On this machine, we observed a higher performance improvement from the method of blocking with padding over that of blocking with software buffer. For example, using float type, the padding program is 22% faster than that of blocking with buffer for n = 20 or larger. Our comparative experiments between the float and double types on the E-450 in Figure 9 also confirm that the larger the L, the higher the performance the padding method achieves.

Fig. 8. Execution comparisons (cycles per element, float and double types) on the Sun Ultra-5 workstation among the method of blocking with software buffer, the method of blocking with padding, and the ideal baseline reference.

Fig. 9. Execution comparisons (cycles per element, float and double types) on the Sun E-450 SMP among the method of blocking with software buffer, the method of blocking with padding, and the ideal baseline reference.

Performance comparisons on the Pentium-II 400. The Pentium PC we used is a 1998 product using a Pentium-II 400 processor of 400 MHz, 8 KB direct-mapped L1 cache, and 256 KB 4-way associative L2 cache. The cache lines of both L1 and L2 are 32 bytes. Since the L2 associativity is high, we are able to implement the method of blocking with associativity and available registers. An L2 cache line holds L = 8 elements for a float type, and we need $(L - K)(L - K) = 16$ registers to supplement the 4-way associative cache. An L2 cache line holds 4 double type elements (L = 4). Thus, we do not need any registers to supplement but simply make a 4 × 4 blocking. The TLB of the Pentium processor is a 4-way associative cache of 64 entries. We used our padding for the TLB technique to avoid TLB misses.

Fig. 10. Execution comparisons (cycles per element, float and double types) on the Pentium-II 400 PC among the method of blocking with software buffer, the method of blocking with padding, the method of blocking with associativity and registers (breg-br in the float panel, lblk-br in the double panel), and the ideal baseline reference.

We implemented the blocking with padding method and the blocking with associativity and registers to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 24. Figure 10 shows the comparisons of cycles per element among the four programs. As we expected, the paddings for both cache and TLB were highly effective, and the padding program performed the best. For example, using float type, the padding program is about 40% faster than that of blocking with buffer for n = 22 or larger. We also show that the method using available registers to supplement associativity is effective. Although it is not as good as the padding program, due to the increase of the instruction counts for additional data copies, it still achieved up to a 12% execution time reduction over the blocking with software buffer program. As we expected, the execution time of the method using the 4-way associative L2 cache without the supplement of registers to form a 4 × 4 blocking was delayed mainly by the longer L2 cache hit time.
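The register arithmetic for the Pentium-II can be checked with a few lines. This is a minimal sketch under stated assumptions: K is taken to be the L2 associativity (4 on this machine), the helper names elements_per_line and registers_needed are hypothetical, and clamping at zero for L ≤ K is an assumption that matches the double-type case in the text, where no registers are needed.

```python
def elements_per_line(line_bytes, element_bytes):
    """Number of elements L that fit in one cache line."""
    return line_bytes // element_bytes


def registers_needed(L, K):
    """Registers supplementing a K-way cache for an L x L block: (L - K)(L - K).

    Clamped at zero when the associativity already covers the line (L <= K);
    this clamp is an assumption, not a formula stated in the text.
    """
    extra = max(L - K, 0)
    return extra * extra


if __name__ == "__main__":
    line_bytes, K = 32, 4  # Pentium-II L2: 32-byte lines, 4-way associative
    for name, size in (("float", 4), ("double", 8)):
        L = elements_per_line(line_bytes, size)
        print(name, "L =", L, "registers =", registers_needed(L, K))
    # prints: float L = 8 registers = 16
    #         double L = 4 registers = 0
```

The same calculation explains why this method was not attempted on the 2-way associative machines with 64-byte lines: with L = 16 floats and K = 2, the formula would call for 196 registers.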
Even so, the 4 × 4 blocking method without register supplement still outperformed the method of blocking with a software buffer.

Performance comparisons on the Compaq XP-1000. The Compaq XP-1000 is a 1999 product using an Alpha processor of 500 MHz, 64 KB 2-way associative L1 cache, and 4 MB 2-way associative L2 cache. The cache lines of both L1 and L2 are 64 bytes long. Similar to the SGI and Sun machines, the associativity of L2 on the XP-1000 is low and the cache line of L2 is relatively long, so it is difficult to do blocking with associativity and available registers. We implemented only the blocking with padding method to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 25. Figure 11 shows the comparisons of CPE among the three programs of both float type and double type on the XP-1000 machine. As we expected, we achieved better or comparable performance relative to the Sun machines. For example, using float type, for n = 24 or larger, the padding program is 30% faster than that of blocking with buffer, and 15% faster for double type.

Fig. 11. Execution comparisons (cycles per element, float and double types) on the Compaq XP-1000 workstation among the method of blocking with software buffer, the method of blocking with padding, and the ideal baseline reference.

7. Performance evaluation on SMP multiprocessors. We implemented the bit-reversal methods on two SMP multiprocessors: the Sun E-450 and the HP 9000 V2200. The parallel bit-reversal program on an SMP with M processors is described using POSIX thread primitives [1] as follows:

    bit_reversal(id)                      /* id = 0, ..., M-1 */
        my_start = id*(N/M);
        my_end   = (id+1)*(N/M);
        for i = my_start, my_end - 1
            Y[i'] = X[i];                 /* i' is the bit-reversal of i */

The bit-reversal operations are evenly distributed among M processors.

Performance comparisons on the Sun E-450. The Sun E-450 is a 4-processor SMP product. Each of the 4 nodes is an UltraSparc-2 processor of 300 MHz, 16 KB direct-mapped L1 cache, and 2 MB 2-way associative L2 cache. The cache line of L1 is 32 bytes, consisting of two 16-byte subblocks, and the L2 cache line is 64 bytes. Due to the limited associativity and a relatively long L2 cache line, we implemented only the blocking with padding algorithm to compare with blocking with software buffer and the reference. We scaled the bit-reversal algorithms from n = 16 to n = 24. Figure 12 shows the comparisons of CPE among blocking with software buffer, blocking with padding, and the base program on the E-450 of 4 nodes, each of which has both float type and double type. On this machine, we observed some performance improvement when n ≤ 18 from the algorithm of blocking with padding over that of blocking with software buffer. However, when n > 18 for double type or n > 19 for float type, each processor has to process a data set larger than its cache capacity. Simultaneous memory accesses by multiple processors through a shared data link cause contention that degrades the performance. Since the data to be accessed from different processors are distributed in different locations, a crossbar interconnection network linking each processor to all the memory modules would significantly reduce the contention. The E-450 does have a 5 × 5 crossbar to connect 2 pairs of processors, 2 I/O ports, and the memory. However, the communications between the 4 processors and the memory modules go through the single memory data link. Figure 13 shows the crossbar interconnections of the E-450 among the processors, the shared-memory modules, and the 2 I/O ports. The contention occurs in the memory data link when multiple processors request memory accesses simultaneously. We have observed severe performance degradation caused by the memory access contention. Figure 12 shows that this contention makes the execution time curves of the three programs jump sharply and merge together when n > 18 for double type and n > 19 for float type. In contrast, on a single processor of the E-450, accesses to the memory through the memory bus have no contention, so the algorithms scaled well.

Fig. 12. Execution comparisons (cycles per element, float and double types) on the Sun E-450 SMP of 4 processors among the algorithm of blocking with software buffer, the algorithm of blocking with padding, and the ideal baseline reference.

Performance comparisons on the HP 9000 V2200. The HP 9000 V2200 is a 1997 SMP product with up to 16 processors. We used 4 processors for performance comparisons. Each node is an HP PA-8200 processor of 200 MHz with a 2 MB direct-mapped L1 data cache. The cache line is 32 bytes.
Due to limited associativity, we implemented only the blocking with padding algorithm to compare with blocking with software buffer and the reference. The HP SMP has a crossbar interconnection network, the HyperPlane crossbar, to connect up to 8 pairs of processors to 8 memory modules. Multiple pairs of processors can access different memory modules simultaneously. Each pair of processors is connected to the crossbar through an adaptor called the HyperPlane Runway Agent. Figure 13 gives the interconnection structure of the HP 9000 V2200 of 4 processors. In our experiments, the 4 processors are divided into 2 pairs, which are connected to 2 memory modules by a 2 × 2 HyperPlane crossbar. Each pair of processors may contend for the adaptor, but the crossbar allows simultaneous data accesses among the memory modules. The negative performance effect due to the data link contention observed on the Sun E-450 was significantly reduced on the HP SMP, which shows the effectiveness of the crossbar. Figure 14 shows comparative execution time curves for the float and double types on the HP 9000 V2200. The execution times of the 3 programs are quite stable and independent of the size of n. The padding programs of the float type and of the double type outperformed the blocking methods with buffer by up to 40% and 18%, respectively. Their execution curves almost merge with the reference curve.

Fig. 13. Architecture comparisons between the Sun E-450 SMP (left: UltraSPARC-II processors P1 to P4, crossbar, data link, memory banks, and I/O ports) and the HP 9000 V2200 SMP (right: PA-8200 processors, HyperPlane agents, crossbar, and memory): the memory data link of the E-450 may become a bottleneck when multiple processors issue simultaneous memory access requests; the HyperPlane crossbar connected between the memory modules and the processors on the HP 9000 V2200 can effectively reduce the contention.

Fig. 14. Execution comparisons (cycles per element, 4 processors, float and double types) on the HP 9000 V2200 among the algorithm of blocking with software buffer, the algorithm of blocking with padding, and the ideal baseline reference.

Table 2
Summary of the blocking methods and their impact on the three aspects of performance (cross interference, instruction count, and memory space) and on the program complexity. The performance of the blocking-only method is the baseline for comparisons. Note: + means that the method quantitatively increases the factor and hurts the performance, and blank means it has no impact. The program complexity is subjective and compared with the blocking-only method, with 1 being slightly more complex and 2 moderately more complex.

Methods                          Cross          Instruction   Memory   Program      Comments
                                 interference   count         space    complexity
Blocking only                                                                       limited by data sizes.
Blocking with software buffer    +              +             +                     system independent.
Blocking with register buffer                                          1            limited by the number of available registers.
Blocking with associativity                                            2            works well on high associativity caches.
  and registers
Blocking with padding                                         +        1            works well on all systems.
TLB blocking                                                                        a TLB size dependent outer loop, effective for fully associative TLBs.
TLB padding                                                   +        1            paddings by using L pages, effective for set associative TLBs.

8. Conclusion. We have examined and developed cache-optimal methods for bit-reversal data reorderings. These methods have been tested on 5 representative uniprocessor workstations of 1995 to 1999 products to show their effectiveness. Different methods have their own merits and limits. The blocking-only method is limited by data sizes. Although the blocking-with-software-buffer method is architecture independent, it increases cross interference and instruction count and needs additional memory space. The blocking-with-a-register-buffer method is fast but is limited by the number of available registers. Blocking with associativity and with registers works well on high associativity caches.
We have shown that the methods of blocking with padding, blocking for TLB, and padding for TLB can effectively exploit cache locality and are almost independent of hardware. Thus, they could be widely used on many uniprocessor workstations and SMP multiprocessors. We summarize the different techniques and their merits and limits in Table 2, which gives a guideline for application users to choose a technique based on the size of the problem and the machines available.

The methods have also been tested on two commercial SMP multiprocessors. By exploiting the cache locality of each processor, we have effectively eliminated the conflict misses so that accesses to the shared memory and contention are minimized. However, another potential bottleneck on SMPs is the data access contention to the shared memory. We show that crossbar interconnections between processors and memory modules play an important role in parallel bit-reversal data reorderings.
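As a closing illustration, the even distribution of bit-reversal work across M threads described by the pseudocode in Section 7 can be sketched as follows. This is an assumption-laden sketch rather than the authors' program: it uses Python's concurrent.futures thread pool in place of POSIX threads, assumes thread ids 0 to M-1 and that M divides N, uses hypothetical names (worker, parallel_bit_reversal), and applies the plain reordering without the blocking or padding optimizations.

```python
from concurrent.futures import ThreadPoolExecutor


def bit_reverse(i, n):
    """Reverse the low n bits of i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def worker(tid, M, n, X, Y):
    """Thread tid copies its contiguous slice of X into bit-reversed slots of Y."""
    N = 1 << n
    my_start = tid * (N // M)
    my_end = (tid + 1) * (N // M)  # exclusive upper bound
    for i in range(my_start, my_end):
        Y[bit_reverse(i, n)] = X[i]


def parallel_bit_reversal(X, n, M=4):
    """Distribute the N = 2**n copies evenly among M threads (M must divide N)."""
    N = 1 << n
    assert len(X) == N and N % M == 0
    Y = [None] * N
    with ThreadPoolExecutor(max_workers=M) as pool:
        futures = [pool.submit(worker, t, M, n, X, Y) for t in range(M)]
        for f in futures:
            f.result()  # propagate any exception raised in a worker
    return Y


if __name__ == "__main__":
    X = list(range(16))
    print(parallel_bit_reversal(X, 4))
```

Because bit reversal is a permutation, each thread writes a disjoint set of slots in Y and no locking is needed. In CPython the global interpreter lock prevents a real speedup, so the sketch shows only the partitioning, not the memory contention effects measured in Section 7.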

Cache-Optimal Methods for Bit-Reversals

Cache-Optimal Methods for Bit-Reversals Proceedigs of the ACM/IEEE Supercomputig Coferece, November 1999, Portlad, Orego, U.S.A. Cache-Optimal Methods for Bit-Reversals Zhao Zhag ad Xiaodog Zhag Departmet of Computer Sciece College of William

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5. Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple

More information

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 11: More Caches Prof. Yajig Li Uiversity of Chicago Lecture Outlie Caches 2 Review Memory hierarchy Cache basics Locality priciples Spatial ad temporal How to access

More information

Ones Assignment Method for Solving Traveling Salesman Problem

Ones Assignment Method for Solving Traveling Salesman Problem Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history:

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition. Computer Architecture A Quatitative Approach, Sixth Editio Chapter 2 Memory Hierarchy Desig 1 Itroductio Programmers wat ulimited amouts of memory with low latecy Fast memory techology is more expesive

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis

More information

Chapter 3 Classification of FFT Processor Algorithms

Chapter 3 Classification of FFT Processor Algorithms Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

CIS 121 Data Structures and Algorithms with Java Spring Stacks, Queues, and Heaps Monday, February 18 / Tuesday, February 19

CIS 121 Data Structures and Algorithms with Java Spring Stacks, Queues, and Heaps Monday, February 18 / Tuesday, February 19 CIS Data Structures ad Algorithms with Java Sprig 09 Stacks, Queues, ad Heaps Moday, February 8 / Tuesday, February 9 Stacks ad Queues Recall the stack ad queue ADTs (abstract data types from lecture.

More information

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 10: Caches Prof. Yajig Li Uiversity of Chicago Midterm Recap Overview ad fudametal cocepts ISA Uarch Datapath, cotrol Sigle cycle, multi cycle Pipeliig Basic idea,

More information

Fast Fourier Transform (FFT) Algorithms

Fast Fourier Transform (FFT) Algorithms Fast Fourier Trasform FFT Algorithms Relatio to the z-trasform elsewhere, ozero, z x z X x [ ] 2 ~ elsewhere,, ~ e j x X x x π j e z z X X π 2 ~ The DFS X represets evely spaced samples of the z- trasform

More information

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems

More information

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation Improvemet of the Orthogoal Code Covolutio Capabilities Usig FPGA Implemetatio Naima Kaabouch, Member, IEEE, Apara Dhirde, Member, IEEE, Saleh Faruque, Member, IEEE Departmet of Electrical Egieerig, Uiversity

More information

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1 CS200: Hash Tables Prichard Ch. 13.2 CS200 - Hash Tables 1 Table Implemetatios: average cases Search Add Remove Sorted array-based Usorted array-based Balaced Search Trees O(log ) O() O() O() O(1) O()

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

CS61C : Machine Structures

CS61C : Machine Structures CS 61C L24 VM II (1) ist.eecs.berkele.edu/~cs61c/su5 CS61C : Machie Structures Lecture #24: VM II Address Mappig: Virtual Address: VPN offset 25-8-2 Ad Carle idex ito page table located i phsical memor

More information

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1 Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory! Why Care About the Memory Hierarchy? Memory Virtual Memory -DRAM Memory Gap (latecy) Reasos: Multi process systems (abstractio & memory protectio) Solutio: Tables (holdig per process traslatios) Fast traslatio

More information

GPUMP: a Multiple-Precision Integer Library for GPUs

GPUMP: a Multiple-Precision Integer Library for GPUs GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information
