LINPACK Benchmark Optimizations on a Virtual Processor Grid


Edward Anderson, Maynard Brandt and Chao Yang, Cray Inc.

ABSTRACT: The "massively parallel" LINPACK benchmark is a familiar measure of high-performance computing performance and a useful check of the scalability of multi-processor computing systems. The CRAY X1 performs well on this benchmark, achieving in excess of 90% of peak performance per processor and near linear scalability. This paper will outline Cray's implementation of the LINPACK benchmark code and show how it leads to computational kernels that are easier to optimize than those of ScaLAPACK or the High Performance LINPACK code, HPL. Specific areas in which the algorithm has been tuned for the CRAY X1 are in communicating across rows or down columns, in interchanging rows to implement partial pivoting, and in multiplying dense matrices. The Cray LINPACK benchmark code has also been adapted to run on a virtual processor grid, that is, a p-by-q grid where p × q is a multiple of the number of processors. This feature is especially pertinent when the number of processors does not factorize neatly, but it can be used in any situation in which it is desired to assign more processors to row or column operations. The virtual processor grid concept is not specific to the LINPACK benchmark and could be applied to any application that uses a 2-D grid decomposition.

1. Introduction

The LINPACK benchmark attracts a lot of attention from performance analysts because it is the metric used to rank systems for the biannual Top 500 list. It is also a good predictor of performance for certain applications involving the solution of large dense linear systems or eigenvalue problems. The rules for the highly parallel LINPACK benchmark (Table 3 in [5]) allow any problem size and any algorithm to be used as long as it solves a dense 64-bit real linear system, Ax = b. In practice, the algorithm is always some form of Gaussian elimination with partial pivoting, and for maximum efficiency, the problem size is usually the largest problem that will fit in memory.
An excellent implementation of the LINPACK benchmark for distributed memory systems is High Performance LINPACK (HPL) [6], a C code loosely based on ScaLAPACK [3]. In this paper we describe an alternative implementation of the LINPACK benchmark developed at Cray Inc. that has been used on CRAY T3E and CRAY X1 systems. We will refer to this implementation as the Cray LINPACK Benchmark code. The Cray LINPACK Benchmark code is written in a mix of C and Fortran and uses Cray's one-sided message-passing library SHMEM [4] for communication. Like HPL, it assumes a 2-D block cyclic distribution of the matrix A and solves an augmented system using a block right-looking algorithm. Unlike HPL, it stores the distribution blocks by rows, instead of by columns. This storage order optimizes the memory access pattern of the dominant matrix multiply kernel for the highest levels of the memory hierarchy and preserves the logical structure of the distribution blocks. It is also a natural fit for a generalization of the usual 2-D processor grid into a virtual processor grid, that is, a p-by-q grid where p × q is a multiple of the number of processors.

2. Solving linear systems on distributed memory computers

The computation measured in the LINPACK benchmark is the solution of a dense linear system Ax = b, where A is an n-by-n 64-bit real matrix of full rank and b is a vector of length n. The standard technique for this problem uses Gaussian elimination with partial pivoting to compute a factorization A = LU, where L is a product of permutation and unit lower triangular matrices and U is upper triangular, from which the solution can be obtained by solving Ly = b for y and Ux = y for x. When there is only one right-hand side vector b, it is advantageous to solve an augmented system, in which b is stored as the (n+1)st column of the array A. Then the factorization applies L^-1 to b to form y, and the solution can be determined in one additional step by solving the triangular system Ux = y by back substitution.
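The augmented-system approach just described can be sketched in a few lines. The following is our own illustration in plain Python, not the Cray code: Gaussian elimination with partial pivoting is applied to [A | b], so that applying L^-1 to b happens during the factorization and only the back substitution Ux = y remains.

```python
def solve_augmented(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting
    on the augmented matrix [A | b]."""
    n = len(A)
    # Build the augmented matrix: b is stored as column n+1 of A.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        # Partial pivoting: bring the largest |M[i][k]|, i >= k, to row k.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            l = M[i][k] / M[k][k]          # multiplier (an entry of L)
            for j in range(k, n + 1):      # update includes the b column
                M[i][j] -= l * M[k][j]
    # Back substitution on Ux = y, where y now sits in column n+1.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x
```

For example, `solve_augmented([[2.0, 1.0], [1.0, 3.0]], [3.0, 4.0])` returns the solution of 2x + y = 3, x + 3y = 4.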
To solve Ax = b on a distributed memory computer, the matrix A is partitioned among the processors using a block-cyclic distribution on a 2-D processor grid. One possible assignment of blocks to processors on a 2-by-3 processor

grid is shown in Figure 1. The tiling pattern repeats as many times as is necessary, and the last row or column of blocks may be empty on some processors. The distribution block size and the aspect ratio of the processor grid are tunable parameters, although choices for the p-by-q grid are limited by the number of ways to factor the number of processors, N_p.

Figure 1: Block-cyclic distribution on a 2 × 3 processor grid

The right-looking algorithm to compute an LU factorization of a matrix A stored as in Figure 1 is described as follows. First, a panel of column blocks is factored using Gaussian elimination with partial pivoting. Next, information about the sequence of row interchanges that was used is broadcast to the other processors, and the row interchanges are applied to the rest of the matrix. The current block row is then updated by applying the transformations contained in the diagonal block; this operation consists of a triangular solve with multiple right-hand sides and is performed locally by the Level 3 BLAS routine xtrsm. Finally the block column below the diagonal block is communicated across rows and the block row to the right of the diagonal block is communicated down columns, and the rest of the matrix is updated by a rank-b update, performed locally by the Level 3 BLAS routine xgemm.

Block algorithms were first developed for shared memory computers in order to increase the re-use of data at the highest levels of the memory hierarchy and thereby improve performance. In the numerical linear algebra package LAPACK [1], the block size is an internal parameter, used only for tuning; it does not affect how the 2-D matrix is stored. In ScaLAPACK [3], an extension of LAPACK for distributed-memory computers, the distribution block size is analogous to the algorithm block size in LAPACK, and distributed BLAS were developed analogous to the standard BLAS so that much of the algorithmic structure of LAPACK could be preserved.
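The block-cyclic mapping of Figure 1 can be stated compactly: global block (I, J) of the partitioned matrix is owned by processor (I mod p, J mod q) on a p-by-q grid, and is that processor's local block (I div p, J div q). A minimal sketch in Python (our own notation, not code from the paper):

```python
def owner_and_local(I, J, p, q):
    """Return ((prow, pcol), (lrow, lcol)) for global block (I, J),
    0-based, under a 2-D block-cyclic distribution on a p-by-q grid."""
    return (I % p, J % q), (I // p, J // q)

# On the 2-by-3 grid of Figure 1, global block (3, 4) lands on grid
# position (1, 1) and is stored there as local block (1, 1).
grid, local = owner_and_local(3, 4, 2, 3)
```

The same two expressions, run in reverse, recover the global index of any locally stored block, which is all the indexing the factorization needs.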
Since Fortran stores arrays in column-major order, ScaLAPACK stored the distribution blocks column-wise on each processor, making the local portion of the distributed matrix a 2-D matrix as in the shared-memory case. The High Performance LINPACK code (HPL) [6] is a specialized version of the ScaLAPACK routine to solve a dense real linear system, put into a LINPACK Benchmark context with options for tuning. Like ScaLAPACK, HPL stores the distribution blocks column-wise on each processor, but it provides an option for storing the transpose of the matrix for better cache use on some architectures.

3. Row-wise storage of distribution blocks

While the column-wise storage of local distribution blocks is central to the ScaLAPACK design, there are both notational and performance disadvantages to storing these blocks in a 2-D array. First, the blocking factors used in the distribution are lost in this data structure and must be carried along in an array descriptor, separate from the array A. Although a distribution block is a natural piece for the algorithm designer to work with, it exists only as a sub-block of the 2-D array. Elements in the same column of a block of A are accessed with unit stride, but elements in the same row are accessed with a stride that grows with the matrix size. For very large problems, such as those solved in the LINPACK benchmark, even a small sub-block may span several memory pages, increasing the latency to load the sub-block into cache. Also, successively larger row strides will each have a different footprint in the cache, making it difficult to predict the performance of the algorithm for larger sizes based on the performance for smaller problems.

In the Cray LINPACK Benchmark code, the distribution blocks are stored row-wise on the local processors. This organization of blocks is most conveniently described by a 4-D array

   A( b, b, cblks, rblks )

where the first two indices describe a b-by-b distribution block and the last two indices describe the local column index and row index of each block, respectively.
For example, the first element of the last block stored on processor 5 in Figure 1 would be A( 1, 1, 2, 3 ), since it is the (1,1) element in the second column of blocks and the third row of blocks on processor 5.¹ Assuming the blocks are stored one after the other, this data structure can be regarded as rblks block rows of size b-by-(b × cblks) where it is convenient to do so.

¹ In Co-array Fortran notation, we could be even more complete and call the element A( 1, 1, 2, 3; 5 ).
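In Fortran column-major order, the 4-D array A(b, b, cblks, rblks) keeps each b-by-b block contiguous, with the blocks of a block row stored one after another. A small sketch of the resulting linear offset (our own helper, with the paper's 1-based indices):

```python
def offset(i, j, ci, ri, b, cblks):
    """Column-major offset of element (i, j) of local block (ci, ri)
    within A(b, b, cblks, rblks); all indices 1-based."""
    return (i - 1) + (j - 1) * b + (ci - 1) * b * b + (ri - 1) * b * b * cblks

# With b = 48 and cblks = 2, element A(1, 1, 2, 3) starts right after the
# five full blocks that precede it: two block rows of two blocks each,
# plus the first block of the third block row.
o = offset(1, 1, 2, 3, 48, 2)
```

Because the block index contributes whole multiples of b*b, every block occupies one contiguous stretch of b*b words, which is the property the next section exploits.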

Row-wise storage of blocks maintains the natural contiguity of the distribution blocks and facilitates their movement to the highest levels of the memory hierarchy. Because a block is contiguous in memory, its mapping to an associative cache is known a priori and its size can be chosen to fit in a given segment of cache. Also, a block row of local distribution blocks is contiguous, so a rank-b update as in the LINPACK benchmark will result in a unit stride access pattern for the entire block row, as shown in Figure 2. This access pattern can take advantage of stream buffers or hardware prefetch instructions if they are available.

A disadvantage of storing the distribution blocks row-wise is that it requires additional indexing of the local blocks. It also may not be any better than the standard technique unless the library routines have been optimized for it. But it has the potential to be a higher-performing alternative.

4. Optimization of the matrix multiply kernel

In the Cray LINPACK Benchmark code, the main matrix multiply kernel is a b-by-b matrix A times a b-by-n matrix B added to a b-by-n matrix C to produce a b-by-n result. The memory access pattern for this operation is shown in Figure 2. On the CRAY T3E, the innermost kernel was a matrix-vector multiply. The first time the block of A was used, it would be loaded from memory, but subsequently it would likely be found in the on-chip secondary cache (Scache) of the CRAY T3E processor. A would multiply a column of B loaded from memory, and the result vector would be accumulated in registers before being written to C. If the leading dimension of the B array matched the block size b, then the columns of B would be accessed in one continuous stream, activating the stream buffers of the CRAY T3E processor. The matrix C was also accessed as one continuous stream.
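The kernel just described, one matrix-vector product per column of B so that the cached block of A is re-used across all n columns, can be sketched as follows. This is plain Python for illustration only, not the tuned kernel:

```python
def rank_b_update(A, B, C):
    """C = C + A*B, with A b-by-b and B, C b-by-n, computed column by
    column of B so that A is re-used from cache n times."""
    b = len(A)
    n = len(B[0])
    for j in range(n):                 # one column of B at a time
        col = [B[i][j] for i in range(b)]
        for i in range(b):             # accumulate the result vector
            C[i][j] += sum(A[i][k] * col[k] for k in range(b))
    return C
```

Each column of B is touched once per pass over A, which is why matching B's leading dimension to b (one continuous stream) matters for the stream buffers.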
Thus the optimization of the matrix-vector multiply kernel for the CRAY T3E consisted of loading elements of A early enough to hide the Scache latency, spacing the operations far enough apart to hide the floating-point unit latency, and prefetching elements of the B vector to hide its latency from the stream buffers.

Constraints in the CRAY T3E processor design limited the matrix multiply performance to about 75% of peak. First, there were only 31 scalar floating-point registers, which were needed for accumulating the result vector as well as for temporarily holding elements of A and B. The latency of approximately 10 clock periods (CP) to Scache and 4 CP for the floating-point unit made instruction scheduling for cached data a challenge. The compiler and most of the scientific library unrolled loops by 4 or 8; for LINPACK, a special-purpose kernel was created that was unrolled by 12. This design meant that the optimal distribution block size b was a multiple of 12. The b-by-b block had to fit in the Scache of the T3E, which was 96 KB or 12 KW, but because of the random replacement policy of the 3-way set associative Scache, it was better to size the block for only one 4096-word set of the Scache. Thus the biggest possible block of A was 64-by-64, and coupled with the multiple-of-12 requirement and the 8-word Scache line size, the optimal block size for the CRAY T3E was 48. Since the innermost kernel was a 12-by-48 matrix-vector multiply, each vector of B was used four times as it was multiplied by each of the four 12-by-48 sub-blocks of A. The first time, B would be loaded from memory via the stream buffers, but the subsequent uses would hope to find B in the Scache. However, if the vector of B had the same relative cache offset as part of A, the random replacement policy meant that one of them might need to be reloaded from memory.

The CRAY X1 processor design directly addresses many of the limitations in the CRAY T3E memory hierarchy. The CRAY X1 multi-streaming processor (MSP) design is shown in Figure 3.
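The choice b = 48 above follows mechanically from the stated constraints: b must be a multiple of the unrolling factor 12 and of the 8-word Scache line size, and b*b words must fit in one 4096-word set of the Scache. A one-line check of that arithmetic:

```python
# Enumerate block sizes satisfying the T3E constraints stated above:
# multiple of 12 (unrolling), multiple of 8 (Scache line), and b*b words
# fitting in one 4096-word set of the 3-way set associative Scache.
candidates = [b for b in range(1, 65)
              if b % 12 == 0 and b % 8 == 0 and b * b <= 4096]
best = max(candidates)
```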
At the highest level of the hierarchy, there are 32 vector registers, each of length 64, on each of the four single-streaming processors (SSPs) of an MSP. Vector registers are used in the CRAY X1 matrix multiply kernel to hold elements of A and to accumulate results before writing back to C. In fact, there are so many registers available that we can improve the re-use of the A and B elements and keep all the vector functional units busy by computing four columns at a time, instead of just one. Also, the on-chip cache is 2 MB or 256 KW, allowing the block of A to be up to 512-by-512. Finally, the four SSPs on an MSP can perform the computations with the four sub-blocks of A concurrently, accessing elements of B through their shared cache. The matrix multiply kernel on a CRAY X1 processor is shown in Figure 4. Sustained performance of the matrix multiply kernel in the context of the LINPACK benchmark is approximately 95% of peak.

The matrix-multiply routine xgemm and the other Level 3 BLAS kernel, xtrsm, are the only parts of the Cray LINPACK benchmark code that are optimized at the SSP level. A prototype for the matrix multiply kernel showing the parallelization across SSPs is shown in Figure 5. This subroutine is compiled with

   ftn -Oaggress -O3 -s default64 -c sgemm.f

The SSP code, named DMMNN in this prototype, is implemented in assembly language to achieve the best possible instruction scheduling for elements of A that are expected to be found in the cache.

      subroutine sgemm( m, n, k, alpha, a, ira,
     &                  b, irb, beta, c, irc )
      integer m, n, k, ira, irb, irc
      real alpha, beta
      real a(ira,*), b(irb,*), c(irc,*)
      integer i, js, m4
cdir$ SSP_PRIVATE dmmnn

      m4 = (m+3)/4
      do i = 0, 3
         js = min( m4, max( m-i*m4, 0 ) )
         call dmmnn( js, n, k, alpha, a(1+i*m4,1),
     &               ira, b, irb, beta,
     &               c(1+i*m4,1), irc )
      end do
      return
      end

Figure 5: MSP code for matrix multiply, calling SSP subroutine DMMNN

5. Optimization of the row interchanges

The row exchange is a target for optimization because it involves moving data with non-unit stride, all processors participate, and it is all overhead: no floating point computations are performed. The sequence of row interchanges from the panel factorization is described by an index vector IPIV, where IPIV(i) = j indicates that row i was exchanged with row j. The indices are not unique, because a row that is exchanged out of the pivot row block early in the panel factorization may be copied back into the pivot row block in an exchange with a later row.² The challenge is to turn the description of a sequential process (the row exchanges of the panel factorization) into a block update that can be applied to the distributed matrix as a whole. We do this by translating the IPIV vector into index vectors ISCAT and IGATH, where ISCAT contains the destination row indices for the rows of the pivot block row and IGATH contains the indices of rows of the global array that are gathered into the rows of the pivot block row. Row indices in each of these vectors are unique, so the gather or scatter operation can be done as a block.

Since rows of the local array structure are non-unit stride, it is more efficient to copy them to a contiguous buffer before sending them. On processors in the pivot block row, the ISCAT vector is used to copy any rows to be transferred to a remote processor to a send buffer. On the remote processors, the IGATH vector is used to identify local rows involved in the row exchange and copy them to a buffer as well. Each processor participating in the row exchange then synchronizes with the pivot block row, after which each can get their data from the other's buffer.
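The IPIV-to-ISCAT/IGATH translation can be sketched by simulating the sequential swaps on a list that records which original row sits in each position. This is our own reconstruction of the idea, not the Cray routine, with 0-based indices for brevity:

```python
def translate_pivots(ipiv, n):
    """Translate sequential interchanges ipiv (ipiv[i] = j means row i was
    exchanged with row j, applied in order) into block index vectors:
    iscat[i] = final destination of original pivot-block row i, and
    igath[i] = original row gathered into pivot-block row i."""
    nb = len(ipiv)                     # panel width
    pos = list(range(n))               # pos[r] = original row now at row r
    for i, j in enumerate(ipiv):       # apply the exchanges sequentially
        pos[i], pos[j] = pos[j], pos[i]
    igath = pos[:nb]
    where = {row: r for r, row in enumerate(pos)}
    iscat = [where[i] for i in range(nb)]
    return iscat, igath
```

Unlike IPIV, the entries of each returned vector are unique, so the scatter and gather can each be done as a single block operation.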
A final synchronization when the transfer is complete is required to indicate that the buffers can be re-used.

² The most extreme example of row exchanges using partial pivoting is the matrix for which IPIV(1) = ... = IPIV(N) = N, i.e., every row is exchanged with row N.

In the CRAY LINPACK benchmark code, the local data copy of selected rows to a contiguous buffer is called transposeAitoB, and the corresponding routine to copy from columns of the contiguous buffer back to rows of the local array structure is called transposeBtoAi. On the CRAY T3E, these operations were optimized using E-registers, like other transpose operations [2]. On the CRAY X1, the vectorizing compiler handles the indexed transpose without any special effort.

6. The virtual processor grid

One shortcoming of mapping a distributed matrix to a 2-D processor grid is the 2-D grid itself. If the number of processors N_p is a perfect square, then any operations on a block column or block row (such as the panel factorization in the LINPACK benchmark) will involve only sqrt(N_p) processors. Moreover, not every processor count factors neatly into p × q. For example, early CRAY X1 systems were shipped with 32 four-processor nodes, with 31 nodes configured for application use. But the only factorization of 124 is 4 × 31 (or 31 × 4), and 123 = 3 × 41 and 122 = 2 × 61 are no better. Rather than leave 3 processors idle and solve the system on 121 processors, we conceived of using a virtual processor grid, in which each physical processor would handle more than 1 block in the virtual processor grid. More generally, one could solve any 2-D problem using N_p processors on a p × q grid such that p × q = k·N_p, where k ≥ 1, p ≤ N_p, and lcm(p, N_p) = k·N_p. The restriction p ≤ N_p is necessary to prevent assigning more than one block in a column of the virtual processor grid to the same processor, and the constraint lcm(p, N_p) = k·N_p simply guarantees that the virtual processor grid is not a multiple of another smaller virtual processor grid.
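These grid constraints are easy to check mechanically. A small sketch (our own helper; `math.lcm` requires Python 3.9+), exercised on the paper's examples:

```python
from math import lcm

def valid_virtual_grid(p, q, np_):
    """Check the virtual-grid conditions: p*q = k*Np with k >= 1,
    p <= Np (at most one block per grid column on each processor), and
    lcm(p, Np) = k*Np (not a multiple of a smaller virtual grid)."""
    k, rem = divmod(p * q, np_)
    return (rem == 0 and k >= 1
            and p <= np_
            and lcm(p, np_) == k * np_)
```

For the 4 × 3 grid on 6 processors, k = 2 and lcm(4, 6) = 12 = 2·6, so the grid is valid; a 2 × 6 grid on 6 processors fails the lcm test because it merely replicates a smaller grid.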
Using a virtual processor grid is not the same as multithreading, because we create only one process per processor, and the blocks of the virtual processor grid must sometimes be dealt with in a prescribed order, not always from 1 to k. An example using 6 processors on a 4 × 3 virtual processor grid is shown in Figure 6. This tiling pattern has two times the parallelism of a 2 × 3 grid for column operations. The only drawback to the virtual processor grid, other than algorithmic complexity, is a potentially longer synchronization time when one processor must deal with more than one of its blocks at a time. For instance, if column 1 and column 2 of the virtual processor grid in Figure 6 exchanged data, then processor 0 would have to exchange with 4 using its first block before exchanging with 2 using its second block. However, this effect is lessened on larger virtual grids, such as a 32 × 31 grid on 124

processors, where the blocks on the same processor are farther apart.

Figure 6: Six processors in a 4 × 3 virtual processor grid

In the Cray LINPACK benchmark code, the virtual processor indexing is handled by adding a 5th dimension to the local array structure A, which becomes

   A( b, b, cblks, rblks, vpi )

For some operations, such as the rank-b update in the LINPACK benchmark code, the order of processing the distribution blocks is not important, so we can update all the blocks with the same virtual processor index together. In other cases, particularly communication operations, each processor must process all its blocks in the virtual processor grid tile before going on to the next tile. Furthermore, the order of processing the blocks in a tile may change as the algorithm progresses, so it is necessary to maintain the current starting point istart, and process the block numbered 1 + mod( istart + i - 1, vpi ) in a loop from 1 to vpi.

Figure 7 shows the effect of the virtual processor grid algorithm on 124 processors of a CRAY X1. Both the 4 × 31 and 31 × 4 grids show similar performance, while the virtual processor grid is up to 10% faster. In the LINPACK benchmark, which is so dominated by matrix multiply that improvements to other parts of the code are usually in the 1% range, this degree of improvement is significant.

7. Summary

In optimizing the LINPACK benchmark for the CRAY X1, we have found new avenues for performance improvements in the code originally developed for a CRAY T3E. The storage order of the distribution blocks is an often overlooked parameter that affects how the data is mapped to the memory hierarchy in the main computational kernels. By parallelizing these kernels across the SSPs of an MSP, we achieved performance of 95% of peak for matrix multiplication and over 90% for the LINPACK benchmark overall. To further reduce inefficiencies in the row or column operations, which involve only a subset of the processors, we introduce the concept of a virtual processor grid.
A side benefit of this technique is that one can model the solution of a problem on k·N_p processors using only N_p processors if it will fit in memory.

Acknowledgements

The authors wish to acknowledge Jeff Brooks for his insights in the development of the original Cray LINPACK Benchmark code, and Mike Aamodt of the Benchmarking group for his support of the CRAY X1 benchmarking efforts.

References

[1] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, Third Edition, SIAM, Philadelphia.
[2] E. Anderson, J. Brooks, and T. Hewitt, The Benchmarker's Guide to Single-processor Optimization for CRAY T3E Systems, Cray Research technical report. (available at ftp://ftp.arsc.edu/pub/mpp/docs/bmguide.ps.z)
[3] L. S. Blackford et al., ScaLAPACK Users' Guide, SIAM, Philadelphia.
[4] Cray Inc., Man Page Collection: Shared Memory Access (SHMEM).
[5] J. J. Dongarra, Performance of Various Computers Using Standard Linear Equations Software, Technical Report CS-89-85, University of Tennessee.
[6] A. Petitet, R. C. Whaley, J. Dongarra, A. Cleary, HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers, Version 1.0a.

About the Authors

Ed Anderson is a consultant to Cray Inc. working with the Benchmarking group. He worked for Cray Research in the Scientific Libraries and Benchmarking groups. Ed can be reached at eanderson@na-net.ornl.gov. Maynard Brandt retired from Cray Inc. in 2004 after many years in the Benchmarking group. He still lives in the Minneapolis/St. Paul area and can be found at Gopher basketball games. Chao Yang is a Senior Applications Analyst with Cray Inc., specializing in mathematical software development. Chao can be reached at Cray Inc., 1340 Mendota Heights Road, Mendota Heights, MN 55120, cwy@cray.com.

CUG 2004 Proceedings

Figure 2: Matrix-multiply C = A*B on the CRAY T3E, highlighting the matrix-vector kernel

Figure 3: CRAY X1 multi-streaming processor design

Figure 4: Matrix-multiply C = A*B on the CRAY X1, highlighting the SSP parallelism and computation of four vectors concurrently

Figure 7: LINPACK Benchmark performance on a virtual processor grid, compared to a conventional processor grid (Gflop/s vs. N on a CRAY X1 with 124 MSPs; 4x31, 31x4, and 32x31 grids)


More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

A Study on the Performance of Cholesky-Factorization using MPI

A Study on the Performance of Cholesky-Factorization using MPI A Study o the Performace of Cholesky-Factorizatio usig MPI Ha S. Kim Scott B. Bade Departmet of Computer Sciece ad Egieerig Uiversity of Califoria Sa Diego {hskim, bade}@cs.ucsd.edu Abstract Cholesky-factorizatio

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

SPIRAL DSP Transform Compiler:

SPIRAL DSP Transform Compiler: SPIRAL DSP Trasform Compiler: Applicatio Specific Hardware Sythesis Peter A. Milder (peter.milder@stoybroo.edu) Fraz Frachetti, James C. Hoe, ad Marus Pueschel Departmet of ECE Caregie Mello Uiversity

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

How do we evaluate algorithms?

How do we evaluate algorithms? F2 Readig referece: chapter 2 + slides Algorithm complexity Big O ad big Ω To calculate ruig time Aalysis of recursive Algorithms Next time: Litterature: slides mostly The first Algorithm desig methods:

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Ruig Time of a algorithm Ruig Time Upper Bouds Lower Bouds Examples Mathematical facts Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite

More information

Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta

Proceedings of the 4th Annual Linux Showcase & Conference, Atlanta USENIX Associatio Proceedigs of the 4th Aual Liux Showcase & Coferece, Atlata Atlata, Georgia, USA October 1 14, 2 THE ADVANCED COMPUTING SYSTEMS ASSOCIATION 2 by The USENIX Associatio All Rights Reserved

More information

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON Roberto Lopez ad Eugeio Oñate Iteratioal Ceter for Numerical Methods i Egieerig (CIMNE) Edificio C1, Gra Capitá s/, 08034 Barceloa, Spai ABSTRACT I this work

More information

Lecture 18. Optimization in n dimensions

Lecture 18. Optimization in n dimensions Lecture 8 Optimizatio i dimesios Itroductio We ow cosider the problem of miimizig a sigle scalar fuctio of variables, f x, where x=[ x, x,, x ]T. The D case ca be visualized as fidig the lowest poit of

More information

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition. Computer Architecture A Quatitative Approach, Sixth Editio Chapter 2 Memory Hierarchy Desig 1 Itroductio Programmers wat ulimited amouts of memory with low latecy Fast memory techology is more expesive

More information

Improving Template Based Spike Detection

Improving Template Based Spike Detection Improvig Template Based Spike Detectio Kirk Smith, Member - IEEE Portlad State Uiversity petra@ee.pdx.edu Abstract Template matchig algorithms like SSE, Covolutio ad Maximum Likelihood are well kow for

More information

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence _9.qxd // : AM Page Chapter 9 Sequeces, Series, ad Probability 9. Sequeces ad Series What you should lear Use sequece otatio to write the terms of sequeces. Use factorial otatio. Use summatio otatio to

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

Pattern Recognition Systems Lab 1 Least Mean Squares

Pattern Recognition Systems Lab 1 Least Mean Squares Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig

More information

Message Integrity and Hash Functions. TELE3119: Week4

Message Integrity and Hash Functions. TELE3119: Week4 Message Itegrity ad Hash Fuctios TELE3119: Week4 Outlie Message Itegrity Hash fuctios ad applicatios Hash Structure Popular Hash fuctios 4-2 Message Itegrity Goal: itegrity (ot secrecy) Allows commuicatig

More information

A Note on Least-norm Solution of Global WireWarping

A Note on Least-norm Solution of Global WireWarping A Note o Least-orm Solutio of Global WireWarpig Charlie C. L. Wag Departmet of Mechaical ad Automatio Egieerig The Chiese Uiversity of Hog Kog Shati, N.T., Hog Kog E-mail: cwag@mae.cuhk.edu.hk Abstract

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

Introduction to Computing Systems: From Bits and Gates to C and Beyond 2 nd Edition

Introduction to Computing Systems: From Bits and Gates to C and Beyond 2 nd Edition Lecture Goals Itroductio to Computig Systems: From Bits ad Gates to C ad Beyod 2 d Editio Yale N. Patt Sajay J. Patel Origial slides from Gregory Byrd, North Carolia State Uiversity Modified slides by

More information

Polynomial Functions and Models. Learning Objectives. Polynomials. P (x) = a n x n + a n 1 x n a 1 x + a 0, a n 0

Polynomial Functions and Models. Learning Objectives. Polynomials. P (x) = a n x n + a n 1 x n a 1 x + a 0, a n 0 Polyomial Fuctios ad Models 1 Learig Objectives 1. Idetify polyomial fuctios ad their degree 2. Graph polyomial fuctios usig trasformatios 3. Idetify the real zeros of a polyomial fuctio ad their multiplicity

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 6 Defiig Fuctios Pytho Programmig, 2/e 1 Objectives To uderstad why programmers divide programs up ito sets of cooperatig fuctios. To be able to

More information

Lecture 5. Counting Sort / Radix Sort

Lecture 5. Counting Sort / Radix Sort Lecture 5. Coutig Sort / Radix Sort T. H. Corme, C. E. Leiserso ad R. L. Rivest Itroductio to Algorithms, 3rd Editio, MIT Press, 2009 Sugkyukwa Uiversity Hyuseug Choo choo@skku.edu Copyright 2000-2018

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpeCourseWare http://ocw.mit.edu 6.854J / 18.415J Advaced Algorithms Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.415/6.854 Advaced Algorithms

More information

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1 Switchig Hardware Sprig 208 CS 438 Staff, Uiversity of Illiois Where are we? Uderstad Differet ways to move through a etwork (forwardig) Read sigs at each switch (datagram) Follow a kow path (virtual circuit)

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045 Oe Brookigs Drive St. Louis, Missouri 63130-4899, USA jaegerg@cse.wustl.edu

More information

SCI Reflective Memory

SCI Reflective Memory Embedded SCI Solutios SCI Reflective Memory (Experimetal) Atle Vesterkjær Dolphi Itercoect Solutios AS Olaf Helsets vei 6, N-0621 Oslo, Norway Phoe: (47) 23 16 71 42 Fax: (47) 23 16 71 80 Mail: atleve@dolphiics.o

More information

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 9. Pointers and Dynamic Arrays. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 9 Poiters ad Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 9.1 Poiters 9.2 Dyamic Arrays Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Slide 9-3

More information

Math Section 2.2 Polynomial Functions

Math Section 2.2 Polynomial Functions Math 1330 - Sectio. Polyomial Fuctios Our objectives i workig with polyomial fuctios will be, first, to gather iformatio about the graph of the fuctio ad, secod, to use that iformatio to geerate a reasoably

More information

Chapter 24. Sorting. Objectives. 1. To study and analyze time efficiency of various sorting algorithms

Chapter 24. Sorting. Objectives. 1. To study and analyze time efficiency of various sorting algorithms Chapter 4 Sortig 1 Objectives 1. o study ad aalyze time efficiecy of various sortig algorithms 4. 4.7.. o desig, implemet, ad aalyze bubble sort 4.. 3. o desig, implemet, ad aalyze merge sort 4.3. 4. o

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

Computer Architecture ELEC3441

Computer Architecture ELEC3441 CPU-Memory Bottleeck Computer Architecture ELEC44 CPU Memory Lecture 8 Cache Dr. Hayde Kwok-Hay So Departmet of Electrical ad Electroic Egieerig Performace of high-speed computers is usually limited by

More information

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 11 Frieds, Overloaded Operators, ad Arrays i Classes Copyright 2014 Pearso Addiso-Wesley. All rights reserved. Overview 11.1 Fried Fuctios 11.2 Overloadig Operators 11.3 Arrays ad Classes 11.4

More information

Computers and Scientific Thinking

Computers and Scientific Thinking Computers ad Scietific Thikig David Reed, Creighto Uiversity Chapter 15 JavaScript Strigs 1 Strigs as Objects so far, your iteractive Web pages have maipulated strigs i simple ways use text box to iput

More information

DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO

DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO Sagwo Seo, Trevor Mudge Advaced Computer Architecture Laboratory Uiversity of Michiga at A Arbor {swseo, tm}@umich.edu Yumig Zhu, Chaitali

More information

Numerical Methods Lecture 6 - Curve Fitting Techniques

Numerical Methods Lecture 6 - Curve Fitting Techniques Numerical Methods Lecture 6 - Curve Fittig Techiques Topics motivatio iterpolatio liear regressio higher order polyomial form expoetial form Curve fittig - motivatio For root fidig, we used a give fuctio

More information

Speeding-up dynamic programming in sequence alignment

Speeding-up dynamic programming in sequence alignment Departmet of Computer Sciece Aarhus Uiversity Demark Speedig-up dyamic programmig i sequece aligmet Master s Thesis Dug My Hoa - 443 December, Supervisor: Christia Nørgaard Storm Pederse Implemetatio code

More information

One advantage that SONAR has over any other music-sequencing product I ve worked

One advantage that SONAR has over any other music-sequencing product I ve worked *gajedra* D:/Thomso_Learig_Projects/Garrigus_163132/z_productio/z_3B2_3D_files/Garrigus_163132_ch17.3d, 14/11/08/16:26:39, 16:26, page: 647 17 CAL 101 Oe advatage that SONAR has over ay other music-sequecig

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must

More information

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS SIAM J. SCI. COMPUT. Vol. 22, No. 6, pp. 2113 2134 c 21 Society for Idustrial ad Applied Mathematics FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS ZHAO ZHANG AND XIAODONG ZHANG

More information

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software Structurig Redudacy for Fault Tolerace CSE 598D: Fault Tolerat Software What do we wat to achieve? Versios Damage Assessmet Versio 1 Error Detectio Iputs Versio 2 Voter Outputs State Restoratio Cotiued

More information

Accuracy Improvement in Camera Calibration

Accuracy Improvement in Camera Calibration Accuracy Improvemet i Camera Calibratio FaJie L Qi Zag ad Reihard Klette CITR, Computer Sciece Departmet The Uiversity of Aucklad Tamaki Campus, Aucklad, New Zealad fli006, qza001@ec.aucklad.ac.z r.klette@aucklad.ac.z

More information

Panel for Adobe Premiere Pro CC Partner Solution

Panel for Adobe Premiere Pro CC Partner Solution Pael for Adobe Premiere Pro CC Itegratio for more efficiecy The makes video editig simple, fast ad coveiet. The itegrated pael gives users immediate access to all medialoopster features iside Adobe Premiere

More information

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions:

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions: CS 604 Data Structures Midterm Sprig, 00 VIRG INIA POLYTECHNIC INSTITUTE AND STATE U T PROSI M UNI VERSI TY Istructios: Prit your ame i the space provided below. This examiatio is closed book ad closed

More information

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

GPUMP: a Multiple-Precision Integer Library for GPUs

GPUMP: a Multiple-Precision Integer Library for GPUs GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract

More information

IMP: Superposer Integrated Morphometrics Package Superposition Tool

IMP: Superposer Integrated Morphometrics Package Superposition Tool IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College

More information

EE123 Digital Signal Processing

EE123 Digital Signal Processing Last Time EE Digital Sigal Processig Lecture 7 Block Covolutio, Overlap ad Add, FFT Discrete Fourier Trasform Properties of the Liear covolutio through circular Today Liear covolutio with Overlap ad add

More information

ECE4050 Data Structures and Algorithms. Lecture 6: Searching

ECE4050 Data Structures and Algorithms. Lecture 6: Searching ECE4050 Data Structures ad Algorithms Lecture 6: Searchig 1 Search Give: Distict keys k 1, k 2,, k ad collectio L of records of the form (k 1, I 1 ), (k 2, I 2 ),, (k, I ) where I j is the iformatio associated

More information

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a

n n B. How many subsets of C are there of cardinality n. We are selecting elements for such a 4. [10] Usig a combiatorial argumet, prove that for 1: = 0 = Let A ad B be disjoit sets of cardiality each ad C = A B. How may subsets of C are there of cardiality. We are selectig elemets for such a subset

More information

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory! Why Care About the Memory Hierarchy? Memory Virtual Memory -DRAM Memory Gap (latecy) Reasos: Multi process systems (abstractio & memory protectio) Solutio: Tables (holdig per process traslatios) Fast traslatio

More information

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1

CS200: Hash Tables. Prichard Ch CS200 - Hash Tables 1 CS200: Hash Tables Prichard Ch. 13.2 CS200 - Hash Tables 1 Table Implemetatios: average cases Search Add Remove Sorted array-based Usorted array-based Balaced Search Trees O(log ) O() O() O() O(1) O()

More information

Computer Science Foundation Exam. August 12, Computer Science. Section 1A. No Calculators! KEY. Solutions and Grading Criteria.

Computer Science Foundation Exam. August 12, Computer Science. Section 1A. No Calculators! KEY. Solutions and Grading Criteria. Computer Sciece Foudatio Exam August, 005 Computer Sciece Sectio A No Calculators! Name: SSN: KEY Solutios ad Gradig Criteria Score: 50 I this sectio of the exam, there are four (4) problems. You must

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations Applied Mathematical Scieces, Vol. 1, 2007, o. 25, 1203-1215 A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045, Oe

More information

GE FUNDAMENTALS OF COMPUTING AND PROGRAMMING UNIT III

GE FUNDAMENTALS OF COMPUTING AND PROGRAMMING UNIT III GE2112 - FUNDAMENTALS OF COMPUTING AND PROGRAMMING UNIT III PROBLEM SOLVING AND OFFICE APPLICATION SOFTWARE Plaig the Computer Program Purpose Algorithm Flow Charts Pseudocode -Applicatio Software Packages-

More information

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful

More information

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis Outlie ad Readig Aalysis of Algorithms Iput Algorithm Output Ruig time ( 3.) Pseudo-code ( 3.2) Coutig primitive operatios ( 3.3-3.) Asymptotic otatio ( 3.6) Asymptotic aalysis ( 3.7) Case study Aalysis

More information

top() Applications of Stacks

top() Applications of Stacks CS22 Algorithms ad Data Structures MW :00 am - 2: pm, MSEC 0 Istructor: Xiao Qi Lecture 6: Stacks ad Queues Aoucemets Quiz results Homework 2 is available Due o September 29 th, 2004 www.cs.mt.edu~xqicoursescs22

More information

Performance Plus Software Parameter Definitions

Performance Plus Software Parameter Definitions Performace Plus+ Software Parameter Defiitios/ Performace Plus Software Parameter Defiitios Chapma Techical Note-TG-5 paramete.doc ev-0-03 Performace Plus+ Software Parameter Defiitios/2 Backgroud ad Defiitios

More information

3D Model Retrieval Method Based on Sample Prediction

3D Model Retrieval Method Based on Sample Prediction 20 Iteratioal Coferece o Computer Commuicatio ad Maagemet Proc.of CSIT vol.5 (20) (20) IACSIT Press, Sigapore 3D Model Retrieval Method Based o Sample Predictio Qigche Zhag, Ya Tag* School of Computer

More information

Math 10C Long Range Plans

Math 10C Long Range Plans Math 10C Log Rage Plas Uits: Evaluatio: Homework, projects ad assigmets 10% Uit Tests. 70% Fial Examiatio.. 20% Ay Uit Test may be rewritte for a higher mark. If the retest mark is higher, that mark will

More information

ISSN (Print) Research Article. *Corresponding author Nengfa Hu

ISSN (Print) Research Article. *Corresponding author Nengfa Hu Scholars Joural of Egieerig ad Techology (SJET) Sch. J. Eg. Tech., 2016; 4(5):249-253 Scholars Academic ad Scietific Publisher (A Iteratioal Publisher for Academic ad Scietific Resources) www.saspublisher.com

More information

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 10: Caches Prof. Yajig Li Uiversity of Chicago Midterm Recap Overview ad fudametal cocepts ISA Uarch Datapath, cotrol Sigle cycle, multi cycle Pipeliig Basic idea,

More information

COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL CHOLESKY DECOMPOSITION

COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL CHOLESKY DECOMPOSITION SIA J. SCI. COUT. Vol. 2, No. 6, pp. 495 52 c 2010 Society for Idustrial ad Applied athematics COUNICATION-OTIAL ARALLEL AND SEQUENTIAL CHOLESKY DECOOSITION GREY BALLARD, JAES DEEL, OLGA HOLTZ, AND ODED

More information

Cache-Optimal Methods for Bit-Reversals

Cache-Optimal Methods for Bit-Reversals Proceedigs of the ACM/IEEE Supercomputig Coferece, November 1999, Portlad, Orego, U.S.A. Cache-Optimal Methods for Bit-Reversals Zhao Zhag ad Xiaodog Zhag Departmet of Computer Sciece College of William

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 22 Database Recovery Techiques Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Recovery algorithms Recovery cocepts Write-ahead

More information