Cache-Optimal Methods for Bit-Reversals


Proceedings of the ACM/IEEE Supercomputing Conference, November 1999, Portland, Oregon, U.S.A.

Zhao Zhang and Xiaodong Zhang
Department of Computer Science
College of William and Mary
Williamsburg, VA

Abstract

Bit-reversals are representative and important data reordering operations in many scientific computations. Their performance degradation is mainly caused by cache conflict misses. Bit-reversals are often used repeatedly as fundamental subroutines in many scientific programs. Thus, in order to gain the best performance, cache-optimal methods and their implementations should be carefully and precisely done at the programming level. This type of performance programming for some special programs, such as data reorderings, may significantly outperform an optimization from an automatic tool, such as a compiler. In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations. We evaluate the merits and limits of each technique and their application and architecture-dependent conditions for developing cache-optimal methods. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and TLB cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and SMP multiprocessors.

1 Introduction

With the rapid development of RISC and VLSI technology, the speed of processors has increased dramatically in the past decade. Processor clock rates have doubled every 2-3 years. Nevertheless, the speed of memories has increased at a much slower pace.
Therefore we have seen and will continue to see an increasing gap in speed between processor and memory, and this gap makes the performance of application programs on both uniprocessor and multiprocessor systems rely more and more on effective usage of caches.

Bit-reversals are important data reordering operations in many scientific computations. Their performance degradation is mainly caused by cache conflict misses. Bit-reversals are often used repeatedly as fundamental subroutines in many scientific programs. Thus, in order to gain the best performance, cache-optimal methods and their implementations should be carefully and precisely done at the programming level. This type of performance programming for some special programs, such as bit-reversals, may significantly outperform an optimization from an automatic tool, such as a compiler.

(This work is supported in part by the National Science Foundation under grants CCR and CCR, by the Air Force Office of Scientific Research under grant AFOSR, and by Sun Microsystems under grant EDUE-NAFO.)

A standard bit-reversal program is described as follows:

    for i = 1, N
        Y[i'] = X[i]

The values of array X in their sequential positions i are copied to array Y in their bit-reversal positions i', for i = 1, ..., N, where $N = 2^n$. The above program says that X is a bit-reversal reordering of Y. The indices i and i' of X and Y are represented by sequences of binary digits. Position i and its bit-reversal i' are defined in [5] as:

$$ i = \sum_{j=0}^{n-1} a_j 2^j \quad \text{and} \quad i' = \sum_{j=0}^{n-1} a_j 2^{n-1-j}, $$

where each $a_j$ is either 0 or 1. For example, the 5-bit reversal of i = 10110 is i' = 01101.

The bit-reversal operations have the following unique characteristics. First, each element in an array is used (read or written) only once for its copy operation. Thus, the reorderings have only spatial locality but no temporal locality for elements. Second, the loops follow certain sequences with high spatial locality.

Bit-reversals are highly sensitive to problem sizes, cache sizes, and cache line sizes. Since the data array sizes are powers of two, multiple elements stored in different memory locations can map to the same cache line, causing severe cache conflict misses and cache thrashing. The reason is simple: most commercial computers use direct-mapped or set-associative caches, whose mapping functions are also based on powers of two.

We use a uniform unit, called an element, to represent the sizes of data arrays, caches, and other structures such as buffers and blocks. One element may represent a 4-byte integer, a 4-byte floating point number, or an 8-byte double floating point number.
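The standard bit-reversal program above can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function names `bit_reverse` and `bit_reversal_copy` are hypothetical, and indices run from 0 to N-1 (an assumption; the loop in the text is written from 1 to N).

```python
def bit_reverse(i: int, n: int) -> int:
    """Return the n-bit reversal i' of index i, for 0 <= i < 2**n."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)  # append the lowest remaining bit of i to r
        i >>= 1
    return r


def bit_reversal_copy(x: list) -> list:
    """Out-of-place reordering: Y[i'] = X[i] for every index i."""
    n = len(x).bit_length() - 1
    assert len(x) == 1 << n, "length must be a power of two"
    y = [None] * len(x)
    for i, value in enumerate(x):
        y[bit_reverse(i, n)] = value
    return y
```

For instance, `bit_reversal_copy(list(range(8)))` places element 1 at position 4 and element 3 at position 6. Each element is touched exactly once, which is why the reordering has no temporal locality.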
Because the sizes of caches and cache lines are always multiples of an element in practice, this uniform unit for all sizes is practically meaningful for both architects and application programmers, and makes the discussion straightforward. Here are the algorithmic and architectural parameters we use to describe cache-optimal methods of bit-reversals:

C: data cache size, which can be further defined as $C_{L1}$ and $C_{L2}$ for the data cache sizes of L1 and L2 respectively.
L: the size of a cache line, which can be further defined as $L_{L1}$ and $L_{L2}$ for the cache lines of L1 and L2 respectively.
K: cache associativity, which can be further defined as $K_{L1}$ and $K_{L2}$ for the cache associativity of L1 and L2 respectively.
$K_{TLB}$: TLB cache associativity.
$T_s$: number of entries in the TLB cache.
N: the data size of the bit-reversal vector, $N = 2^n$, where n is the number of bits used in the vector index.
$B_{cache}$: blocking size of a $B \times B$ submatrix for cache.
$B_{TLB}$: blocking size for TLB.
$P_s$: a memory page size.

Figure 1: Memory layout of a blocked bit-reversal, where $B = B_{cache}$. [Panels: the memory layout, in which the distance between each pair of segments is (N-B)/L cache lines; a 2-D array equivalent layout; the distribution of B segments in a vector of N elements for bit-reversals.]

In this paper, we examine different methods using techniques of blocking, buffering, and padding for efficient implementations. We evaluate the merits and limits of each technique and their application and architecture-dependent conditions for developing cache-optimal methods. Although our methods are developed for out-of-place bit-reversals, they are also applicable to in-place bit-reversals, where X and Y are the same array. We present two contributions in this paper: (1) Our integrated blocking methods, which match cache associativity and TLB cache size and which fully use the available registers, are cache-optimal and fast. (2) We show that our padding methods outperform other software-oriented methods, and believe they are the fastest in terms of minimizing both CPU and memory access cycles. Since the padding methods are almost independent of hardware, they could be widely used on many uniprocessor workstations and SMP multiprocessors.

2 Blocking for bit-reversals

The blocked memory access patterns of bit-reversals can be easily viewed when we convert the one-dimensional vector to a 2-D equivalent array, as in Figure 1. All the reordering elements and elements in other groups are allocated along a column of the 2-D equivalent array, forming a block. In general, for a bit-reversal vector of $N = 2^n$ elements, the block size $B_{cache}$ is a power of 2, denoted by $B_{cache} = 2^b$. Each of the $B_{cache}$ elements in X has the address format f g, where g has b bits and f has n-b bits. Each of the corresponding $B_{cache}$ elements in Y has the address format g' f', where g' and f' are the bit-reversals of g and f. Therefore, the distance between two nearest elements of the same group in Y is $2^{n-b} = N/B_{cache}$.
Choosing the cache line size as the minimum blocking size ($B_{cache} = L$), we can easily calculate the maximum N for the bit-reversal vector based on different data cache sizes. For example, for a large cache of 2 MBytes, the blocking technique is effective up to an 18-bit reversal, which represents 262,144 data elements, where each element is an 8-byte double and the cache line is 32 bytes. In practice, the data size of bit-reversals can easily be larger than n = 20 [5].
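The address argument of Section 2 can be checked numerically. The sketch below is illustrative only: it assumes array Y starts at element address 0 and a direct-mapped cache of 2^18 elements (2 MBytes of 8-byte doubles, as in the example above), and the helper names are hypothetical. It confirms that a block of B consecutive elements of X lands in Y at a stride of N/B, so all of its targets fall into the same cache set.

```python
def bit_reverse(i: int, n: int) -> int:
    """Return the n-bit reversal of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def y_positions_of_block(f: int, n: int, b: int) -> list:
    """Sorted Y indices written for the 2**b consecutive X elements whose high bits are f."""
    start = f << b
    return sorted(bit_reverse(start + g, n) for g in range(1 << b))


def cache_set(addr: int, line: int, cache: int) -> int:
    """Set index of an element address in a direct-mapped cache (all sizes in elements)."""
    return (addr // line) % (cache // line)


n, b = 20, 2                 # N = 2**20 elements, B_cache = L = 4 elements
N, B = 1 << n, 1 << b
ys = y_positions_of_block(f=5, n=n, b=b)
strides = {ys[k + 1] - ys[k] for k in range(B - 1)}
sets = {cache_set(a, line=B, cache=1 << 18) for a in ys}  # assumed cache size
```

Here `strides` contains the single value N/B and `sets` contains a single set index, which is the conflict pattern that blocking alone cannot avoid once N exceeds the cache size.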

3 Blocking with buffers

As we have shown, the effectiveness of blocking is limited by the size of the data arrays. In theory, the smallest blocking size could be $2 \times 2$. A cache line in a modern processor usually holds more than 2 elements, i.e., is larger than 16 bytes. If we choose a $2 \times 2$ block, the data in a cache line will not be fully used before their replacement, causing more cache misses in the reorderings. The bit-reversal reordering demands large cache space to make blocking effective. In order to use limited cache space effectively, Gatlin and Carter [4] present an effective method that uses an additional buffer to first hold the conflict-missed elements of a block of one array temporarily, and then copies the block to its reordered positions in the other array.

3.1 Blocking with a software buffer and its limits

Because this buffer is defined in a reordering program, we call it a software buffer. This buffer shares the allocation space in the cache with the data arrays X and Y. There are two major limits to this approach. First, the buffer itself may interfere with arrays X and Y, causing additional access conflicts. This interference is certain when the sizes of X and Y are larger than the size of the cache, C. Each cache block or set is then mapped from arrays X and Y more than once. No matter where the buffer is located in the cache, it will interfere with them. The larger the buffer, the more interference will occur. The second limit is the additional copy overhead involved in moving data from array X to the buffer and then moving them to the target array in their reordered positions. This overhead exactly doubles the instruction cycles for data copying. The data copy through a buffer is a worthy investment if the number of cycles lost to cache misses is much higher than the additional CPU cycles for the data copy. To overcome the two limits, we propose several alternatives that eliminate the cache interference caused by the software buffer and reduce or eliminate the data copy time.
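The copy overhead of a software buffer can be made concrete with a small sketch. The code below is not the implementation of [4]; it is a generic buffered, blocked bit-reversal written under the assumption that an index splits into b high bits, n-2b middle bits, and b low bits. The function names are hypothetical. It counts element copies to show that every element is moved twice.

```python
def bit_reverse(i: int, n: int) -> int:
    """Return the n-bit reversal of index i (returns 0 when n == 0)."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def buffered_bit_reversal(x: list, b: int):
    """Blocked out-of-place bit-reversal through a B x B buffer, with B = 2**b.

    Returns the reordered list and the number of element copies performed.
    """
    n = len(x).bit_length() - 1
    assert len(x) == 1 << n and n >= 2 * b
    B = 1 << b
    mid_bits = n - 2 * b
    y = [None] * len(x)
    buf = [None] * (B * B)
    copies = 0
    for m in range(1 << mid_bits):
        m_rev = bit_reverse(m, mid_bits)
        # Step 1: read B contiguous runs of X into the buffer, already permuted.
        for a in range(B):
            a_rev = bit_reverse(a, b)
            src = (a << (n - b)) | (m << b)
            for g in range(B):
                buf[(bit_reverse(g, b) << b) | a_rev] = x[src | g]
                copies += 1
        # Step 2: write B contiguous runs of Y from the buffer.
        for g_rev in range(B):
            dst = (g_rev << (n - b)) | (m_rev << b)
            for a_rev in range(B):
                y[dst | a_rev] = buf[(g_rev << b) | a_rev]
                copies += 1
    return y, copies
```

For a vector of N elements the counter ends at 2N, which matches the doubling of copy instructions described above; both X and Y are accessed only in contiguous runs of B elements.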
3.2 Cache structure dependent blocking

Blocking based on set associativity

The cache associativity, K, is an important factor to consider for blocking. If $K \ge L$, an $L \times L$ or a $K \times K$ blocking method for bit-reversals would effectively avoid conflict misses. Because the hit time is a less sensitive performance factor than cache misses in the L2 cache, a higher associativity in the L2 cache is more effective than in L1. If a cache line holds 4 double floating point elements (L = 4 elements of 32 bytes in Pentium processors), a $4 \times 4$ blocking method without any data buffer is able to fully use the cache associativity. The blocking method would gain more benefit from caches of associativity higher than 4, such as the design in [11]. What can we do if the associativity is not sufficiently high for the blocking, i.e., K < L? One solution is to make a $K \times L$ rectangular blocking. Unfortunately, bit-reversals require an $L \times L$ blocking.

Supplement with registers

We may also consider using the available registers to supplement a low-associativity cache. The number of registers available to a user program is limited. Normally, a uniprocessor provides up to 16 registers to users. For example, for a 2-way associative cache, we need 8 registers to buffer two additional cache lines so that we can effectively make a $4 \times 4$ blocking as if we ran the program on a 4-way associative cache. We develop a more efficient blocking method for bit-reversals, which requires only $(L-K) \times (L-K)$ registers. The operation sequence of this method has three steps:

(1) L-K cache lines of X are accessed: $(L-K) \times K$ of their elements are copied to K cache lines of Y in the reordered positions, and the remaining $(L-K) \times (L-K)$ elements are copied to a buffer consisting of $(L-K) \times (L-K)$ registers.
(2) The remaining K lines of X are brought into the cache set, and $K \times K$ of their elements are copied to Y in the reordered positions.
(3) Finally, the $(L-K) \times (L-K)$ elements in the register buffer and the remaining $(L-K) \times K$ elements are copied to Y in their reordered positions.

A cache set will be used more than twice if K < L/2. Besides the advantage of no access conflicts between the register buffer and arrays X and Y, there is another advantage of using registers to buffer the data in a load/store processor. A data copy through registers from X to Y is equivalent to the two-step process of load and store, and thus there is no additional overhead. We show our experimental performance in Section 6.

Using registers as the buffer

If the cache is direct-mapped, we have to rely fully on a buffer for blocking. Here we discuss some ways to use registers as the buffer in order to eliminate the potential cache conflicts and to eliminate extra data copying by taking advantage of the load/store operations. The number of registers for a buffer of $L \times L$ elements is determined by the number of elements a cache line can hold. The length of a cache line of the L1 cache in some processors, such as the Sun SPARC Micro I and II, is L = 2 (16 bytes), which holds only two floating point elements. The blocking size could be as small as $2 \times 2$, using a buffer of 4 registers.
The cache line length of the L1 cache in many advanced workstations, such as those with Sun Ultra and Intel Pentium processors, is 32 bytes, which holds 4 double floating point elements. In this case, we need a buffer of $4 \times 4 = 16$ registers for a blocking. This would be difficult due to the limited number of available registers. We have two solutions. First, we use only the number of registers available to form a smaller buffer than is required, which will not make each cache line fully used and will cause additional cache misses. Our experiments show that this blocking method, using a buffer with an insufficient number of registers, still achieves a reasonable performance improvement and outperforms the implementation using a software buffer. The second method is to further reduce the size of the buffer, and hence the required number of registers, by using our $(L-K) \times (L-K)$ blocking method.

L1 cache versus L2 cache

The main objective of building two-level caches is to make the L1 cache small enough to catch up with the cycle time of the fast CPU, and to make the L2 cache large enough to capture as many accesses as possible [6]. In practice, the data size of a bit-reversal is larger than the size of the L2 cache. L1 and L2 caches offer different cache line sizes, L, and associativities, K. Both of the following alternatives are effective for blocking. (1) Taking advantage of the short cache line and fast hit time of the L1 cache, we can effectively use limited registers as the buffer, and make a small $L \times L$ blocking effective. (2) Taking advantage of the high associativity of the L2 cache, we can effectively use both associativity and supplemental registers as the buffer, and make a large $L \times L$ blocking effective.
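The register counts quoted in this section follow from simple arithmetic on L and K. The helper functions below are hypothetical names introduced only to reproduce those numbers; they are a sketch of the counting argument, not part of the authors' code.

```python
def registers_full_buffer(L: int) -> int:
    """Registers needed to hold a whole L x L block (direct-mapped case)."""
    return L * L


def registers_line_buffering(L: int, K: int) -> int:
    """Registers needed to buffer the L-K cache lines that a K-way set cannot hold."""
    return max(L - K, 0) * L


def registers_reduced_blocking(L: int, K: int) -> int:
    """Registers needed by the (L-K) x (L-K) blocking method."""
    return max(L - K, 0) ** 2


# L = 4 doubles per line on a 2-way associative cache:
naive = registers_line_buffering(4, 2)      # two extra cache lines
reduced = registers_reduced_blocking(4, 2)  # (L-K) x (L-K) method
```

With L = 4 and K = 2 the straightforward scheme needs 8 registers while the reduced scheme needs 4. With L = 8 floats per line and K = 4, the reduced scheme needs 16 registers, which is at the limit of the roughly 16 user registers mentioned above.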

Figure 2: Data layout of a bit-reversal is modified by padding, where $B = B_{cache} = L$. [Panels: "Before padding" and "After padding", each showing the cache, the X-array, and the Y-array; segments are N/B apart before padding and N/B + L apart after padding.]

4 Blocking with padding

Padding is a technique that modifies the data layout of a program so that conflict misses are reduced or eliminated. The data layout modification can be done at run-time by system software [2, 10], or at compile-time by compiler optimization [8]. Sharing the objective of compiler optimization, which is to change the addresses of potentially conflicting cache blocks in the reorderings, we insert padding variables inside the data array. For example, in the FFT computation, paddings can be combined with the copy operations in the last step of the butterfly without additional cost. Since the data arrays of bit-reversals form a vector whose size is a power of 2, the padding is highly regular: we insert L elements, or one cache line of space, starting at the vector positions N/L, 2N/L, ..., and (L-1)N/L. Using L elements, or one cache line of data, to separate the vector at these points can completely eliminate the cache conflicts caused by the reordering. Again, during execution, the reordering data copies are conducted directly between arrays X and Y without going through a data buffer. Another advantage is that the number of padding elements needed is only $L \times L$, or L cache lines, and is independent of the data array size, N. Compared with the data size of bit-reversals, the number of padding elements is insignificant. Figure 2 shows how the data layout of a bit-reversal vector is modified by padding so that conflict misses are eliminated.

Compiler optimization targets a large range of application programs, and automatically inserts padding variables in the programs for users. An optimal padding is application-dependent. For example, padding positions differ between applications in order to effectively change the addresses of conflicting cache blocks.
Based on the unique nature of the data reordering, the optimal padding unit used by our methods for bit-reversals is a cache line of L elements. In contrast, a compiler optimization normally uses an element as the basic padding unit. How many padding units to use and where to pad in the data arrays are determined by approximation models, which may not precisely fit the unique memory access patterns of each case. In addition, applying the padding technique to bit-reversals embedded in applications does not increase the complexity of the entire computation. For example, when a padded bit-reversal is performed in an FFT computation, it has little effect on the neighboring butterfly operations.
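The effect of the padding rule in Section 4 can be illustrated with a small address calculation. This is a sketch under stated assumptions: the array starts at element address 0, the cache is direct-mapped with 2^18 elements and L = 4 elements per line, and `padded_index` and `cache_set` are hypothetical helper names. It maps a logical index to its position after L pad elements are inserted at N/L, 2N/L, ..., (L-1)N/L, then compares the cache sets touched by elements that are N/L apart.

```python
def padded_index(i: int, N: int, L: int) -> int:
    """Position of logical element i after inserting L pad elements at each multiple of N/L."""
    segment = N // L
    return i + L * (i // segment)


def cache_set(addr: int, line: int, cache: int) -> int:
    """Set index of an element address in a direct-mapped cache (all sizes in elements)."""
    return (addr // line) % (cache // line)


N, L, C = 1 << 20, 4, 1 << 18               # assumed sizes, all in elements
column = [j * (N // L) + 8 for j in range(L)]  # elements N/L apart, as in one block
before = {cache_set(a, L, C) for a in column}
after = {cache_set(padded_index(a, N, L), L, C) for a in column}
```

Without padding the L addresses share one cache set; with padding they occupy L consecutive sets, so a block of L lines can stay resident even in a direct-mapped cache.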

5 Blocking and padding for TLB

The TLB (Translation-Lookaside Buffer) is a special cache that stores the most recently used virtual-to-physical page translations for memory accesses. The TLB is a small and usually fully associative cache. Each entry points to a memory page of 4 KBytes to 64 KBytes. The page size is normally fixed at the operating system level, and cannot be changed by user programs. A TLB miss makes the system retrieve the missing translation from the page table in memory, and then select a TLB entry to replace. When the data to be accessed in our blocking method is larger than the amount of data in all the memory pages that the TLB can hold, we will have TLB thrashing.

5.1 Blocking for a fully associative TLB

Before giving a general model to show how the blocking size is affected by the TLB size, let us go through an example showing that a moderate N for bit-reversals can easily lead to TLB thrashing. The 64 pages in the TLB of the Sun UltraSparc-II processor hold $64 \times 1024 = 2^{16}$ elements, which represents a 16-bit reversal of $N = 2^{16}$. Since we have two vectors X and Y, the TLB can hold a 15-bit reversal of $N = 2^{15}$ elements. This is consistent with our experiments on this machine, where the execution time per element was constant until n = 15, but sharply increased at n = 16 because of TLB misses. In our cache-optimal methods, we include an outer loop to form a blocking for TLB, whose size is denoted as $B_{TLB}$. The blocking size for bit-reversals when $N/L \ge P_s$ is $B_{TLB} \le T_s$, where $P_s$ is the page size in elements and $T_s$ is the number of entries of the TLB. On the other hand, $B_{TLB}$ should be chosen as large as possible to make effective use of the page space.

5.2 Padding for a set-associative TLB

Some processors' TLBs are not fully associative, but set-associative. For example, the TLB in the Pentium II 400 processor is 4-way associative ($K_{TLB} = 4$).
A simple blocking based on the number of TLB entries is not cache-optimal, because multiple pages within a TLB-size-based blocking may map to the same TLB cache set and cause TLB conflict misses. If the size N of a bit-reversal vector is a multiple of $T_s \times P_s$, where $T_s$ is the number of TLB entries and $P_s$ is the page size in elements, and if $K_{TLB} < B_{TLB}$, TLB conflict misses will occur. This can easily happen in practice. For example, on the Pentium II 400, N is equal to 128K elements (one element = 8 bytes) for a 17-bit reversal, and this N is twice the value of $T_s \times P_s$ on the machine, where $T_s = 64$ and $P_s = 1024$ elements. In a way similar to padding for the data cache, we insert a page of elements, or a page of space, starting at the vector positions N/L, 2N/L, ..., and (L-1)N/L to eliminate the TLB conflict misses. Figure 3 gives an example of the padding for TLB, where the TLB is a direct-mapped cache of 8 entries, the blocking size is $B_{TLB} = 4$, and the number of elements in a row is a multiple of 8 pages of elements. Before padding, every row of the blocking is mapped to the same cache line of the TLB. After padding, these rows are mapped to different cache lines of the TLB. Combining padding for the data cache and padding for the TLB, we insert $L + P_s$ elements, or a page plus a cache line of space, at L locations separated by a distance of N/L elements.
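The Figure 3 configuration (a direct-mapped TLB of 8 entries and $B_{TLB} = 4$) can be reproduced with a short calculation. The sketch below is illustrative: the page size of 1024 elements, the row length of exactly 8 pages, and the helper name `tlb_set` are assumptions, and the array is assumed to start on a page boundary at address 0.

```python
def tlb_set(addr: int, page: int, entries: int, assoc: int = 1) -> int:
    """TLB set index of an element address (page size in elements)."""
    sets = entries // assoc
    return (addr // page) % sets


P_s, T_s, B_TLB = 1024, 8, 4   # assumed page size; TLB entries and blocking size as in Figure 3
row = 8 * P_s                  # a row spans a multiple of 8 pages

# First element of each of the B_TLB rows, without and with one page of padding per row.
rows_before = [r * row for r in range(B_TLB)]
rows_after = [r * (row + P_s) for r in range(B_TLB)]

sets_before = [tlb_set(a, P_s, T_s) for a in rows_before]
sets_after = [tlb_set(a, P_s, T_s) for a in rows_after]
```

Before padding, all four rows compete for TLB set 0; after one page of padding per row, they fall into sets 0, 1, 2, and 3, so the four translations can coexist.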

Figure 3: Padding for TLB: the data layout is modified by inserting a page of space at multiple locations, where $B_{TLB} = 4$, $K_{TLB} = 1$, $T_s = 8$. [Panels: "Before padding" and "After padding", each showing the TLB and the 2-D memory layout of the array in the bit-reversal; the padding unit is $P_s$.]

In practice, we selected more than N/L points to insert the padding variables to eliminate both data cache and TLB conflict misses. This approach effectively merges two nested paddings (one for the data cache and the other for the TLB) into a single one. An optimal number of insertion points can be easily determined experimentally based on the size of the TLB cache.

6 Experimental Results and Performance Evaluation

We have implemented and tested all the bit-reversal methods discussed in the previous sections on an SGI O2 workstation, a Sun Ultra-5 workstation, a Sun SMP server E-450, a Pentium PC, and a Compaq XP1000 workstation. We used lmbench [7] to measure the latencies of the memory hierarchy at different levels on each machine. The architectural parameters of the 5 machines are listed in Table 1. We focus the performance evaluation in this paper on methods and implementations of bit-reversals. We compared all our methods with the method of blocking with a software buffer, which was recently published in [4]. We refer to this method as blocking with buffer for bit-reversals. Two of our methods are experimentally compared: breg-br, blocking with associativity and registers for bit-reversals, and blocking with padding for bit-reversals. We have also applied the blocking or padding technique for the TLB in these two methods, based on the TLB associativity. All the programs use a standard subroutine to calculate the bit-reversal value for a given address.

The execution times were collected by gettimeofday(), a standard Unix timing function. The reported time unit is cycles per element (CPE):

$$ \mathrm{CPE} = \frac{\text{execution time} \times \text{clock rate}}{N} $$

where execution time is the measured time in seconds, clock rate is the CPU speed (cycles/second) of the machine where the program is run, and N is the number of elements of the bit-reversal program.
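The CPE metric defined above is straightforward to compute from a wall-clock measurement. The sketch below is only an illustration: it uses Python's `time.perf_counter` as a stand-in for gettimeofday(), the clock rate must be supplied by the user, and the function names are hypothetical.

```python
import time


def cycles_per_element(seconds: float, clock_hz: float, n_elements: int) -> float:
    """CPE = execution time (s) x clock rate (cycles/s) / N."""
    return seconds * clock_hz / n_elements


def time_contiguous_copy(n_elements: int) -> float:
    """Wall-clock seconds for a plain element-by-element copy with contiguous access."""
    x = list(range(n_elements))
    y = [0] * n_elements
    start = time.perf_counter()
    for i in range(n_elements):
        y[i] = x[i]
    return time.perf_counter() - start


# Example: half a second on an assumed 400 MHz machine for N = 2**20 elements.
example_cpe = cycles_per_element(0.5, 400e6, 1 << 20)
```

Calling `cycles_per_element(time_contiguous_copy(1 << 20), clock_hz, 1 << 20)` with the machine's clock rate gives a contiguous-copy figure for comparison, although an interpreted loop measures interpreter overhead far more than memory behavior.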
Besides the different methods of bit-reversals, we also measured the execution time of a program copying elements between X and Y. This program has the same number of data copying operations, with a contiguous memory access pattern. We use the execution time of this program to provide a baseline reference for bit-reversal programs and to show how close a bit-reversal execution is to its ideal time. Each method is further divided into a float data type version, using 4 bytes to represent an element, and a double type version, using 8 bytes to represent an element. The data type divisions show the performance impact of the cache line length. For all experiments on different machines, the bit-reversal programs first call a routine to flush the cache to make sure that all the data are allocated only in memory. All experiments were repeated multiple times.

Table 1: Architectural parameters of the 5 workstations we have used for the experiments. All specifications of the L1 cache refer to the L1 data cache, and all L2 caches are unified. Each L2 cache block on the UltraSPARC-IIi consists of 2 16-byte sub-blocks. The hit times of L1, L2, and main memory are measured by lmbench [7], and their units are converted from nanoseconds (ns) to CPU cycles.

    Workstation      SGI O2    Sun Ultra 5      Sun E-450       Pentium          XP1000
    Processor type   R10000    UltraSparc-IIi   UltraSparc II   Pentium II 400   Alpha

[Further rows of Table 1, whose values are not recoverable here: clock rate (MHz), L1 cache (KBytes), L1 block size (Bytes), L1 associativity, L1 hit time (cycles), L2 cache (KBytes), L2 block size (Bytes), L2 associativity, L2 hit time (cycles), TLB size (entries), TLB associativity, memory latency (cycles).]

6.1 Effects of TLB and virtual memory

Before measuring and comparing the performance of different bit-reversal methods, we experimentally evaluated the effects of TLB and virtual memory to confirm our assumptions and analyses.

Selection of TLB blocking size

The TLB blocking size is a sensitive performance parameter, which is determined by the size of the TLB if it is fully associative. We executed the blocking with padding program for bit-reversals with n = 20 on a single node of the Sun E-450, changing the blocking size for TLB from 8 to 128.
The TLB of the E-450 is a fully associative cache with 64 entries. Figure 4 shows the measured CPE of the program for different blocking sizes on the node. Our experimental results are consistent with our analyses in the previous section. When the blocking size for TLB was 64, the execution time curve increased sharply. This is because arrays X and Y together demanded more than 64 pages and caused TLB thrashing.

Figure 4: Changing the TLB blocking sizes on a single node of the Sun E-450: when the blocking size for TLB was larger than 32, the execution time curve increased sharply. [Plot: E450 (double); x-axis: block size of TLB.]

Virtual memory versus physical memory addresses

All our analyses are based on cache mappings between memory pages in the virtual address space and cache blocks in the physical memory address space. This assumes that contiguous memory pages are contiguously mapped to the cache. This assumption is guaranteed for virtual-address caches [3]. However, all our experiments have been performed on machines with physical-address L2 caches. Since the virtual-to-physical translations for L2 caches are handled by operating systems, our assumptions may sometimes be inaccurate. In order to show that many operating systems attempt to map contiguous virtual pages to cache blocks contiguously, so that our virtual-address-based study is practically meaningful and effective, we conducted a simulation using SimOS [9] and measurements on different workstations to observe how an operating system translates virtual memory addresses to physical addresses. SimOS simulates the complete hardware of SGI machines and runs the IRIX 5.3 operating system in the simulation. We executed a blocking-only bit-reversal program using the cache line size L as the blocking size. The bit-reversal vector size was changed from n = 15 to n = 22. We measured the miss rates on array X. The cache size was set to 2 MBytes, holding two double type arrays up to n = 18 in the virtual memory space. Figure 5 gives consistent results from the SimOS simulation: when n > 18, the miss rate on array X increased sharply from 12.5% to 100%. From this experiment, we have observed that the virtual-to-physical translations of the IRIX 5.3 operating system are quite consistent with our assumption of contiguous allocation.

Figure 5: Using SimOS to observe the miss rates while changing the size of the bit-reversal arrays of a blocking-only program: when n > 18, the miss rate increased sharply to 100%. [Plot: SimOS (IRIX 5.3), blocking only; y-axis: miss rate on array X.]

We have also run similar experiments on different target workstations with different operating systems, such as Linux and Solaris, to measure the changes in execution time when the data size is changed. Our measurements are also consistent with the SimOS results, and indicate that the larger the data arrays, the more likely an operating system is to allocate the pages contiguously. Because our study targets large data sets, our analyses based on the virtual memory space are reasonably accurate.

6.2 Performance comparisons on the SGI O2

The SGI O2 is a 1995 product using an R10000 processor of 150 MHz, a 32 KB 2-way associative L1 cache, and a 64 KB 2-way associative L2 cache. The cache line of L2 is 64 bytes. Since the associativity of L2 is low and the cache line of L2 is relatively long, it is difficult to do blocking with associativity and available registers. We only implemented the blocking with padding method to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 21. Figure 6 shows the CPE comparisons among the three programs for both float type and double type on the SGI O2 machine. The measurements show that the padding method slightly reduced the execution time compared with the method of blocking with software buffer. The time reduction was up to 6%. The small performance improvement is explained by the extremely long memory latency (28 cycles) of the O2 machine. The reduction in instruction cycles for data copies gained from padding became less significant because the memory latencies caused by the required cold misses in both methods dominated the execution.

Figure 6: Execution comparisons on the SGI O2 workstation among the method of blocking with software buffer, the method of blocking with padding, and the ideal baseline reference. [Panels: O2 (float), O2 (double).]

6.3 Performance comparisons on the Sun Ultra-5

The Sun Ultra-5 is a 1998 product using an UltraSparc-IIi processor of 275 MHz, a 16 KB direct-mapped L1 cache, and a 256 KB 2-way associative L2 cache. The cache line of L1 is 32 bytes, consisting of two 16-byte subblocks, and the cache line of L2 is 64 bytes long. Similar to the SGI O2, the associativity of L2 on the Ultra-5 is low and the cache line of L2 is relatively long, so it is difficult to do blocking with associativity and available registers. We only implemented the blocking with padding method to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 23. Figure 7 shows the CPE comparisons among the three programs for both float type and double type on the Ultra-5. The memory latency of the Ultra-5 (76 cycles) is significantly lower than that of the O2. We observed a more significant performance improvement from the method of blocking with padding over that of blocking with software buffer. For example, using float type, the padding program is 14% faster than blocking with buffer for n = 20 or larger. An L2 cache line of the Ultra-5 holds 16 float type elements (L = 16) or 8 double type elements (L = 8). The larger the L, the higher the overhead of blocking with software buffer. This has been confirmed by our comparative experiments between the float and double types on the Ultra-5, shown in Figure 7.

6.4 Performance comparisons on the Sun E-450

The Sun E-450 is an SMP product. Each of the 4 nodes is an UltraSparc-II processor of 300 MHz, with a 16 KB direct-mapped L1 cache and a 2 MB 2-way associative L2 cache. The cache line of L1 is 32 bytes, consisting of two 16-byte subblocks, and the cache line of L2 is 64 bytes long.
Due to the limited associativity and the relatively long L2 cache line, we only implemented the blocking with padding method to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 25.

Figure 7: Execution comparisons on the Sun Ultra-5 workstation (panels: float, double) among the method of blocking with software buffer, the method of blocking with padding, and the ideal line reference.

Figure 8 shows the comparisons among blocking with software buffer, blocking with padding, and the reference program on a single node of the E-450, each for both float and double types. The memory latency of the E-450 (73 cycles) is slightly lower than that of the Ultra-5. On this machine, we observed a higher performance improvement from blocking with padding over blocking with software buffer. For example, using float type, the padding program is 22% faster than blocking with buffer for n = 20 or larger. Our comparative experiments between the float and double types on the E-450 in Figure 8 also confirm that the larger L is, the higher the performance the padding method achieves.

Figure 8: Execution comparisons on the Sun E-450 SMP (panels: float, double) among the method of blocking with software buffer, the method of blocking with padding, and the ideal line reference.

6.5 Performance comparisons on the Pentium II 400

The Pentium PC we used is a 1998 product using a Pentium II 400 processor of 400 MHz, with an 8 KB direct-mapped L1 cache and a 256 KB 4-way associative L2 cache. The cache lines of both L1 and L2 are 32 bytes. Since the L2 associativity is high, we are able to implement the method of blocking with associativity and available registers. An L2 cache line holds L = 8 elements of float type, so we need (L - K)(L - K) = 16 registers to supplement the 4-way associative cache (K = 4). An L2 cache line holds 4 elements of double type (L = 4). Thus, we do not need any registers to supplement the cache, but simply make a 4 × 4 blocking.

The TLB of the Pentium processor is a 4-way associative cache of 64 entries. We used our padding for the TLB technique to avoid TLB misses. We implemented the blocking with padding method and the blocking with associativity and registers method to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 24.

Figure 9 shows the comparisons among the four programs. As we expected, the paddings for both cache and TLB were highly effective, and the padding program performed the best. For example, using float type, the padding program is about 40% faster than blocking with buffer for n = 22 or larger. We also show that the method using available registers to supplement associativity is effective. Although it is not as good as the padding program due to the increase in instruction count, it still achieved up to a 12% reduction in execution time over the blocking with software buffer program. As we expected, the method that uses the 4-way associative L2 cache without the supplement of registers to form a 4 × 4 blocking was delayed mainly by the longer L2 cache hit time. This method still outperformed the method of blocking with a software buffer.

6.6 Performance comparisons on the Compaq XP-1000

The Compaq XP-1000 is a 1999 product using an Alpha processor of 500 MHz, with a 64 KB 2-way associative L1 cache and a 4 MB 2-way associative L2 cache. The cache lines of both L1 and L2 are 64 bytes long. Similar to the SGI and Sun machines, the associativity of L2 on the XP-1000 is low and the L2 cache line is relatively long, so it is difficult to do blocking with associativity and available registers. We only implemented the blocking with padding method to compare with blocking with software buffer and the reference. We scaled the bit-reversal methods from n = 16 to n = 25.

Figure 10 shows the comparisons among the three programs for both float and double types on the XP-1000 machine. As we expected, we achieved performance better than or comparable to that on the Sun machines. For example, for n = 24 or larger, the padding program is 30% faster than blocking with buffer using float type, and 15% faster using double type.
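The register count quoted in Section 6.5 can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming that the (L - K)(L - K) expression from the text applies whenever the line length L (in elements) exceeds the associativity K, and that no registers are needed otherwise. The function names `elements_per_line` and `supplement_registers` are hypothetical and do not come from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Number of array elements that fit in one cache line. */
size_t elements_per_line(size_t line_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0 && line_bytes % elem_bytes == 0);
    return line_bytes / elem_bytes;
}

/*
 * Registers assumed necessary to supplement a K-way associative cache
 * when blocking on lines of L elements, using the (L - K)(L - K) count
 * quoted in the text. Returns 0 when the associativity already covers
 * the line length (K >= L).
 */
size_t supplement_registers(size_t L, size_t K)
{
    if (K >= L)
        return 0;
    return (L - K) * (L - K);
}
```

With the Pentium II figures from the text (32-byte lines, 4-way L2), a 4-byte float gives L = 8 and 16 registers, while an 8-byte double gives L = 4 and no registers. Under the same assumed formula, the 64-byte lines and 2-way L2 of the XP-1000 would give L = 16 and 196 registers for float, which is in line with the remark that this approach is difficult on low-associativity machines.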

Figure 9: Execution comparisons on the Pentium II 400 PC (panels: float, double) among the method of blocking with software buffer, the method of blocking with padding, the method of blocking with associativity and registers (breg-br), and the ideal line reference. The plots also carry the label lblk-br.

Figure 10: Execution comparisons on the Compaq XP-1000 workstation (panels: float, double) among the method of blocking with software buffer, the method of blocking with padding, and the ideal line reference.

7 Conclusion

We have examined and developed cache-optimal methods for bit-reversal data reorderings. These methods have been tested on 5 representative processors, products of 1995 to 1999, to show their effectiveness. We summarize the different techniques and their merits and limits in Table 2, which gives a guideline for application users to choose a technique based on the size of the problem and the machines available. We also attach the source code of the padding method at the end of the paper.

Acknowledgement: We thank Kang Su Gatlin for his constructive suggestions on a preliminary version of this paper. Neal Wagner carefully read the manuscript and made constructive comments. Finally, we appreciate the insightful reviews from the anonymous referees.

Table 2: Summary of the blocking methods and their impact on three aspects of performance (cross interference, instruction count, and memory space) and on program complexity. The performance of the blocking only method is the baseline for comparisons. Note: "+" means that the method quantitatively increases the factor and hurts performance; a blank means it has no impact. Program complexity is subjective and is rated relative to the blocking only method, with 1 being slightly more complex and 2 moderately more complex.

| Method | Cross interference | Instruction count | Memory space | Program complexity | Comments |
|---|---|---|---|---|---|
| blocking only | | | | | limited by data sizes. |
| blocking with software buffer | | | | | system independent. |
| blocking with register buffer | | | | 1 | limited by the number of available registers. |
| blocking with associativity and registers | | | | 2 | works well on high associativity caches. |
| blocking with padding | | | + | 1 | works well on all systems. |
| blocking for TLB | | | | | a TLB size dependent outer loop, effective for fully associative TLBs. |
| padding for TLB | | | + | 1 | paddings by using L pages, effective for set associative TLBs. |

    /* This is a padded bit-reversal program for cache optimization. */
    /* N, n, b, B, X, Y, bitrev_tbl, PAD_LENGTH, DATA_TYPE and bitrev()
       are defined outside this listing. */
    void bit_reversal()
    {
        int blk, blk_rev, i, i_rev, j, jump = PAD_LENGTH, k;
        int D = N >> 2*b, d = n - 2*b;
        DATA_TYPE *Xp[B];
        DATA_TYPE *Yp, f0, f1, f2, f3;

        /* Xp[i] points to the padded segment of X selected by bitrev_tbl[i]. */
        for (i = 0; i < B; i ++)
            Xp[i] = &X[bitrev_tbl[i]*jump];

        for (blk = 0; blk < D; blk ++) {
            /* blk_rev receives the d-bit reversal of blk
               (bitrev is presumably a macro). */
            bitrev(blk, blk_rev, d);
            for (i = 0; i < B; i ++) {
                i_rev = bitrev_tbl[i];
                k = (blk << b) + i;
                Yp = &Y[(blk_rev<<b) + (i_rev<<(n-b))];
                /* Copy B elements, unrolled by four. */
                for (j = 0; j < B; j += 4) {
                    f0 = Xp[j][k];   f1 = Xp[j+1][k];
                    f2 = Xp[j+2][k]; f3 = Xp[j+3][k];
                    Yp[j] = f0;   Yp[j+1] = f1;
                    Yp[j+2] = f2; Yp[j+3] = f3;
                }
            }
        }
    }
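The listing above depends on definitions that are not shown (N, n, b, B, X, Y, bitrev_tbl, PAD_LENGTH, DATA_TYPE, and bitrev). As a minimal sketch, assuming N = 2^n elements, B = 2^b, an input stored as B segments of N/B elements with `pad` unused elements after each segment, and an unpadded output, the same index arithmetic can be written in self-contained form. The names `reverse_bits`, `padded_bit_reversal`, and `pad` are hypothetical, the element type is fixed to float for illustration, and the lookup table and four-way unrolling of the original are replaced by direct calls for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Reverse the low `bits` bits of v (returns 0 when bits is 0). */
size_t reverse_bits(size_t v, unsigned bits)
{
    size_t r = 0;
    for (unsigned i = 0; i < bits; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/*
 * Hypothetical self-contained variant of the padded bit-reversal.
 * x_padded: B = 2^b segments of N/B elements, each followed by `pad`
 *           unused elements (assumed layout).
 * y:        N = 2^n elements, unpadded.
 */
void padded_bit_reversal(const float *x_padded, float *y,
                         unsigned n, unsigned b, size_t pad)
{
    assert(n >= 2 * b);
    const size_t B = (size_t)1 << b;
    const size_t seg = (size_t)1 << (n - b);    /* elements per segment, N/B */
    const size_t jump = seg + pad;              /* stride between segments   */
    const size_t D = (size_t)1 << (n - 2 * b);  /* number of blocks          */
    const unsigned d = n - 2 * b;               /* bits in the block index   */

    for (size_t blk = 0; blk < D; blk++) {
        size_t blk_rev = reverse_bits(blk, d);
        for (size_t i = 0; i < B; i++) {
            size_t k = (blk << b) + i;          /* offset inside a segment   */
            float *yp = y + (blk_rev << b) + (reverse_bits(i, b) << (n - b));
            for (size_t j = 0; j < B; j++)
                yp[j] = x_padded[reverse_bits(j, b) * jump + k];
        }
    }
}
```

In this sketch the inner loop reads one element from each of the B segments, so consecutive reads are `jump` elements apart. With `pad` set to 0 that stride is a power of two, which is the access pattern that tends to collide in a low-associativity cache; a nonzero `pad` shifts each segment relative to the others. The appropriate pad length depends on the cache parameters and is not derived here.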


Improving Information Retrieval System Security via an Optimal Maximal Coding Scheme Improvig Iformatio Retrieval System Security via a Optimal Maximal Codig Scheme Dogyag Log Departmet of Computer Sciece, City Uiversity of Hog Kog, 8 Tat Chee Aveue Kowloo, Hog Kog SAR, PRC dylog@cs.cityu.edu.hk

More information

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab2 due toight Exam I: covers lectures 1-9 Ope book, ope otes, close device

More information

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a 4. [10] Usig a combiatorial argumet, prove that for 1: = 0 = Let A ad B be disjoit sets of cardiality each ad C = A B. How may subsets of C are there of cardiality. We are selectig elemets for such a subset

More information

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu UH-MEM: Utility-Based Hybrid Memory Maagemet Yag Li, Saugata Ghose, Jogmoo Choi, Ji Su, Hui Wag, Our Mutlu 1 Executive Summary DRAM faces sigificat techology scalig difficulties Emergig memory techologies

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

Reversible Realization of Quaternary Decoder, Multiplexer, and Demultiplexer Circuits

Reversible Realization of Quaternary Decoder, Multiplexer, and Demultiplexer Circuits Egieerig Letters, :, EL Reversible Realizatio of Quaterary Decoder, Multiplexer, ad Demultiplexer Circuits Mozammel H.. Kha, Member, ENG bstract quaterary reversible circuit is more compact tha the correspodig

More information

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only Edited: Yeh-Liag Hsu (998--; recommeded: Yeh-Liag Hsu (--9; last updated: Yeh-Liag Hsu (9--7. Note: This is the course material for ME55 Geometric modelig ad computer graphics, Yua Ze Uiversity. art of

More information

Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta

Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta USENIX Associatio Proceedigs of the 4th Aual Liux Showcase & Coferece, Atlata Atlata, Georgia, USA October 1 14, 2 THE ADVANCED COMPUTING SYSTEMS ASSOCIATION 2 by The USENIX Associatio All Rights Reserved

More information

Lecture 18. Optimization in n dimensions

Lecture 18. Optimization in n dimensions Lecture 8 Optimizatio i dimesios Itroductio We ow cosider the problem of miimizig a sigle scalar fuctio of variables, f x, where x=[ x, x,, x ]T. The D case ca be visualized as fidig the lowest poit of

More information

Analysis of Documents Clustering Using Sampled Agglomerative Technique

Analysis of Documents Clustering Using Sampled Agglomerative Technique Aalysis of Documets Clusterig Usig Sampled Agglomerative Techique Omar H. Karam, Ahmed M. Hamad, ad Sheri M. Moussa Abstract I this paper a clusterig algorithm for documets is proposed that adapts a samplig-based

More information

Harris Corner Detection Algorithm at Sub-pixel Level and Its Application Yuanfeng Han a, Peijiang Chen b * and Tian Meng c

Harris Corner Detection Algorithm at Sub-pixel Level and Its Application Yuanfeng Han a, Peijiang Chen b * and Tian Meng c Iteratioal Coferece o Computatioal Sciece ad Egieerig (ICCSE 015) Harris Corer Detectio Algorithm at Sub-pixel Level ad Its Applicatio Yuafeg Ha a, Peijiag Che b * ad Tia Meg c School of Automobile, Liyi

More information

A Note on Least-norm Solution of Global WireWarping

A Note on Least-norm Solution of Global WireWarping A Note o Least-orm Solutio of Global WireWarpig Charlie C. L. Wag Departmet of Mechaical ad Automatio Egieerig The Chiese Uiversity of Hog Kog Shati, N.T., Hog Kog E-mail: cwag@mae.cuhk.edu.hk Abstract

More information

Data Structures and Algorithms Part 1.4

Data Structures and Algorithms Part 1.4 1 Data Structures ad Algorithms Part 1.4 Werer Nutt 2 DSA, Part 1: Itroductio, syllabus, orgaisatio Algorithms Recursio (priciple, trace, factorial, Fiboacci) Sortig (bubble, isertio, selectio) 3 Sortig

More information

Data Structures and Algorithms. Analysis of Algorithms

Data Structures and Algorithms. Analysis of Algorithms Data Structures ad Algorithms Aalysis of Algorithms Outlie Ruig time Pseudo-code Big-oh otatio Big-theta otatio Big-omega otatio Asymptotic algorithm aalysis Aalysis of Algorithms Iput Algorithm Output

More information

A Parallel DFA Minimization Algorithm

A Parallel DFA Minimization Algorithm A Parallel DFA Miimizatio Algorithm Ambuj Tewari, Utkarsh Srivastava, ad P. Gupta Departmet of Computer Sciece & Egieerig Idia Istitute of Techology Kapur Kapur 208 016,INDIA pg@iitk.ac.i Abstract. I this

More information

Mobile terminal 3D image reconstruction program development based on Android Lin Qinhua

Mobile terminal 3D image reconstruction program development based on Android Lin Qinhua Iteratioal Coferece o Automatio, Mechaical Cotrol ad Computatioal Egieerig (AMCCE 05) Mobile termial 3D image recostructio program developmet based o Adroid Li Qihua Sichua Iformatio Techology College

More information

Revisiting the performance of mixtures of software reliability growth models

Revisiting the performance of mixtures of software reliability growth models Revisitig the performace of mixtures of software reliability growth models Peter A. Keiller 1, Charles J. Kim 1, Joh Trimble 1, ad Marlo Mejias 2 1 Departmet of Systems ad Computer Sciece 2 Departmet of

More information

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13

CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13 CIS Data Structures ad Algorithms with Java Sprig 08 Stacks ad Queues Moday, February / Tuesday, February Learig Goals Durig this lab, you will: Review stacks ad queues. Lear amortized ruig time aalysis

More information

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem

Exact Minimum Lower Bound Algorithm for Traveling Salesman Problem Exact Miimum Lower Boud Algorithm for Travelig Salesma Problem Mohamed Eleiche GeoTiba Systems mohamed.eleiche@gmail.com Abstract The miimum-travel-cost algorithm is a dyamic programmig algorithm to compute

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must

More information

Instruction and Data Streams

Instruction and Data Streams Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Data Parallelism 1 (vector & SIMD extesios) (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Istructio ad

More information

Lecture 28: Data Link Layer

Lecture 28: Data Link Layer Automatic Repeat Request (ARQ) 2. Go ack N ARQ Although the Stop ad Wait ARQ is very simple, you ca easily show that it has very the low efficiecy. The low efficiecy comes from the fact that the trasmittig

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information

Image Segmentation EEE 508

Image Segmentation EEE 508 Image Segmetatio Objective: to determie (etract) object boudaries. It is a process of partitioig a image ito distict regios by groupig together eighborig piels based o some predefied similarity criterio.

More information