Cache and Bandwidth Aware Matrix Multiplication on the GPU


Jesse D. Hall, Nathan A. Carr, John C. Hart
University of Illinois

Abstract

Recent advances in the speed and programmability of consumer-level graphics hardware have sparked a flurry of research that goes beyond the realm of image synthesis and computer graphics. We examine the use of the GPU (graphics processing unit) as a tool for scientific computing, by analyzing techniques for performing large matrix multiplies in GPU hardware. An earlier method for multiplying matrices on the GPU suffered from problems of memory bandwidth. This paper examines more efficient algorithms that make the implementation of large matrix multiplication on upcoming GPU architectures more competitive, using only 25% of the memory bandwidth and instructions of previous GPU algorithms.

1 Introduction

The multiplication of matrices is one of the most central operations applied in scientific computing. Recent history has shown continued research into better-tuned algorithms that improve the efficiency of matrix multiplication. The ATLAS system for automatic tuning of matrix multiplication for target CPUs has shown much success [Whaley et al. 2001]. New advances in PC chip design (such as streaming SIMD extensions) have led to research into how best to leverage modern micro-architectures for this task [Aberdeen and Baxter 2000].

Recently, consumer graphics processors (GPUs) have become increasingly powerful and are starting to support programmable features. The parallelism in graphics hardware pipelines makes the GPU a strong candidate for performing many computational tasks, including matrix multiplication [Larsen and McAllister 2001].

We detail a new approach for multiplying matrices on GPU hardware. This approach takes advantage of multiple levels of parallelism found in modern GPU hardware and reduces the bandwidth requirements necessary to make this technique effective.

The architecture of modern GPUs relevant to this paper is described in Section 2.
In summary, GPUs receive 3D graphics primitives (typically triangles or quadrilaterals) specified as a set of vertices from the application. These vertices are transformed into screen coordinates, and a fragment is generated for each pixel covered by the primitive (generating these fragments is called rasterization). Fragments contain information such as colors and texture coordinates, which are interpolated across the primitive from values associated with the vertices. These auxiliary attributes are used to shade each fragment, which results in a final color that is written to a pixel in the framebuffer.

One of the most common ways of shading fragments is to use auxiliary attributes known as texture coordinates to index into a previously supplied image (texture). Multiple sets of texture coordinates can be used to retrieve colors from multiple textures; the results are combined to form the final color for the fragment. This is multitexturing. Finally, a shaded fragment can either replace the current value of the pixel in the framebuffer or be added to the current value. This version of the graphics pipeline is described in [Woo et al. 1997].

The performance of GPUs comes from the fact that large amounts of parallelism are available in this pipeline. In particular, each fragment is independent of all other fragments, so they can be processed in parallel. Processing of fragments can also be overlapped to hide pipeline stalls and memory latencies, resulting in very efficient use of the hardware.

Multitexturing can be used to multiply matrices [Larsen and McAllister 2001]. An $m \times n$ matrix can be represented by a greyscale texture, with each pixel containing an element of the matrix (see footnote 1). These matrices can be displayed on the screen by drawing an $m \times n$-pixel rectangle with the texture coordinates $(0, 0)$, $(0, n-1)$, $(m-1, n-1)$, $(m-1, 0)$ assigned to the vertices (clockwise starting with the upper-left vertex). One benefit of this is that we can access the transpose of a matrix by drawing an $n \times m$-pixel rectangle with texture coordinates $(0, 0)$, $(m-1, 0)$, $(m-1, n-1)$, $(0, n-1)$.
The exact same texture is used for drawing the matrix transposed and untransposed. We have only changed the mapping of the texture image onto the rectangle.

Matrix multiplication performs $C \leftarrow AB$, where $A$ is an $m \times l$-element matrix and $B$ is an $l \times n$-element matrix. By storing matrices $A$ and $B$ as textures, we can compute $C$ in $l$ multitexturing passes as shown in Figure 1.

    Clear the screen.
    Set the drawing mode to overlay.
    Load texture texA with matrix A.
    Load texture texB with matrix B.
    Set the multitexturing mode to modulate.
    Set framebuffer write mode to accumulate.
    for i = 0 ... l-1
        draw an m x n-pixel rectangle with
            texA coords (0, i), (0, i), (m-1, i), (m-1, i), and
            texB coords (i, 0), (i, n-1), (i, n-1), (i, 0).
    Screen contains result of A x B.

Figure 1: The Larsen-McAllister multipass algorithm for multiplying two matrices.

The texture coordinates $(0, i)$, $(0, i)$, $(m-1, i)$, $(m-1, i)$ replicate the $i$th column of $A$ across the entire rectangle, whereas the texture coordinates $(i, 0)$, $(i, n-1)$, $(i, n-1)$, $(i, 0)$ replicate the $i$th row of $B$. These textured rectangles are then combined as demonstrated in Figure 2.

Footnote 1: Textures are typically indexed using $(s, t)$, with $s$ indexing the horizontal axis and $t$ indexing the vertical axis. This is the opposite of standard matrix notation, where the first index represents the row and the second index represents the column. In the rest of this paper, we use the matrix style rather than the texture style. Also, texture coordinates have traditionally been in the range $[0 \ldots 1]$, with various filters used to generate a color for indices that fall between pixels. We use an extension [Kilgard 2001] to allow integer indexing, and disable filtering.
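The multipass scheme of Figure 1 can be emulated on the CPU to make the data flow concrete. The following is a minimal sketch in plain Python, not the authors' implementation: the function name `larsen_mcallister_multiply` and the nested-list "framebuffer" are illustrative assumptions, and each iteration of the outer loop stands in for one rendering pass with modulate and accumulate enabled.

```python
def larsen_mcallister_multiply(A, B):
    """Multiply A (m x l) by B (l x n) using l accumulate passes (CPU emulation)."""
    m, l, n = len(A), len(B), len(B[0])
    # "Clear the screen": the emulated framebuffer starts at zero.
    framebuffer = [[0.0] * n for _ in range(m)]
    for i in range(l):  # one rendering pass per column of A / row of B
        for r in range(m):
            for c in range(n):
                tex_a = A[r][i]  # column i of A, replicated across the rectangle
                tex_b = B[i][c]  # row i of B, replicated across the rectangle
                # Modulate (multiply) the two textures, then accumulate.
                framebuffer[r][c] += tex_a * tex_b
    return framebuffer


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = larsen_mcallister_multiply(A, B)
print(C)  # [[19.0, 22.0], [43.0, 50.0]]
```

Each pass adds one outer product (a column of A times a row of B) to the framebuffer, so every pass reads and rewrites the whole result once.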

[Figure 2 image: four panels, Pass 1 to Pass 4, each combining one column of A (A col. 1 to A col. 4) with one row of B (B row 1 to B row 4), followed by the accumulated result.]

Figure 2: Demonstration of Larsen-McAllister matrix multiplication on a pair of $4 \times 4$ matrices. (The output in this example is saturated, such that results greater than one appear uniformly white.)

2 Modern GPU Organization

The graphics pipeline implemented on graphics acceleration hardware was classically organized as a series of transformation, clipping and rasterization steps. Modern graphics hardware has generalized this pipeline into programmable elements. The modern graphics pipeline consists of vertex processing, rasterization and fragment processing.

The vertex processor performs operations on the individual vertices of triangles sent to the graphics accelerator. Once transformed, these triangles are rasterized into a collection of pixels. Each pixel output by the rasterizer is called a fragment. The rasterization process linearly interpolates attributes, such as texture coordinates, stored at the vertices and stores the interpolated values at each fragment. A fragment processor uses the interpolated texture coordinates to look up texture values from texture memory, and can perform special-purpose arithmetic operations on both the texture addresses and the fetched texture values.

The vertex processor is structured similarly to vector processors (pipelined on a stream of vertices), whereas the fragment processor is structured similarly to a SIMD array processor (one processor per pixel). Our experiments have shown that the vertex processor does not provide much advantage over existing CPU capabilities, whereas the fragment processor already outperforms the CPU on some operations like ray-triangle intersections [Carr et al. 2002].

Because these processors were originally developed for texturing, the programs the GPU executes are called shaders. A fragment shader is a program executed by the fragment processor that determines the color of each fragment before it is possibly assigned to its corresponding pixel. The inputs to a fragment shader are a set of program constants, interpolated attribute data from triangle vertices, and texture maps (blocks of texture memory addressed by the texture coordinates). Before rendering, the fragment shader is compiled and loaded into the graphics hardware. Primitives (in our case quadrilaterals) are then sent down the graphics pipeline, invoking the enabled fragment shader as they are rasterized. The output of a fragment shader is an output color plotted to the screen.

A fragment shader is allotted a fixed set of temporary registers R0...Rn. Each register holds a single 4-vector corresponding to the four color channels red, green, blue, alpha. The color channels of each register may be accessed individually as Ri.c, where $c \in \{r, g, b, a\}$ and $i \in \{0, \ldots, n\}$. Standard arithmetic operations, such as addition and multiplication, are defined over the set of registers. For example,

    R2.b ← R1.a + R0.g

assigns to the blue channel of register R2 the sum of the alpha channel of register R1 and the green channel of R0.

Modern fragment shaders allow up to four such operations to be executed simultaneously, much like the SIMD instructions found in modern PC architectures. For example,

    R2 ← R1.abgr * R0.ggab        (1)

can be issued as a single GPU instruction that performs four simultaneous multiplications, numerically equivalent to:

    R2.r ← R1.a * R0.g
    R2.g ← R1.b * R0.g
    R2.b ← R1.g * R0.a
    R2.a ← R1.r * R0.b            (2)

The SIMD nature of the operation defined in (1) allows the four multiplications to occur in parallel, taking one fourth the computation time of (2). In equation (1), R1's color channels are referenced in arbitrary order. This is referred to as swizzling. The green channel of R0 is referenced multiple times. This is known as smearing. Arbitrary swizzling and smearing (and also negation) of input operands can be done with no performance penalty. This is in contrast to Intel's SSE instructions, where moving data between channels requires additional instructions.
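Swizzling and smearing can be illustrated with a small CPU emulation. This is a sketch only: the helper names `swizzle` and `simd_mul` are hypothetical, registers are modeled as `(r, g, b, a)` tuples, and the operation in (1) is assumed to be a component-wise multiply of `R1.abgr` by `R0.ggab`.

```python
CHANNELS = "rgba"


def swizzle(reg, pattern):
    """Return the channels of reg, an (r, g, b, a) tuple, selected by pattern.

    A pattern such as "abgr" reorders channels (swizzling); a pattern such
    as "ggab" may repeat a channel (smearing).
    """
    return tuple(reg[CHANNELS.index(ch)] for ch in pattern)


def simd_mul(x, y):
    """Component-wise multiply of two 4-vectors, modeling one SIMD instruction."""
    return tuple(p * q for p, q in zip(x, y))


R1 = (1.0, 2.0, 3.0, 4.0)
R0 = (10.0, 20.0, 30.0, 40.0)

# Equation (1): one 4-wide instruction.
R2 = simd_mul(swizzle(R1, "abgr"), swizzle(R0, "ggab"))

# Equation (2): the same result computed channel by channel.
R2_scalar = (R1[3] * R0[1], R1[2] * R0[1], R1[1] * R0[3], R1[0] * R0[2])

print(R2)               # (80.0, 60.0, 80.0, 30.0)
print(R2 == R2_scalar)  # True
```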
The output color of a fragment program is placed in a designated register (usually R0) upon termination of the program. This value is written to the fragment's screen location in the frame buffer.

Fragment shaders have access to three kinds of data: constants, interpolated vertex attribute data (e.g. texture coordinates), and texture data (indexed by texture coordinates). Interpolated vertex attribute data are accessed as the registers T0...Tm, where $m + 1$ is the number of attributes stored with each vertex. Data is fetched from texture memory by the lookup() operation. For example,

    R0 ← lookup(T0.r, T0.g, M)

uses the first two coordinates of T0 to access the texture M. We can also perform arithmetic on the texture coordinates before the fetch, or we can use the result of one texture fetch as the coordinates for a second texture fetch (dependent texturing).

Although fragment shaders provide a very powerful SIMD model for programming, they are currently limited in a number of ways. Modern implementations restrict the number of available registers, the total instruction length, and the number of lookup() operations that may occur in a given fragment shader. Control flow is also restricted in fragment shading. For example, branching is not supported, and conditional execution is limited to predicating instructions on previously set condition codes.

Our model for fragment shaders is based on the one described by the upcoming DirectX 9.0 specification [Marshall 2001]. This model provides capabilities currently found in vertex shaders at the fragment shader level. This model has also been used to describe the implementation of a ray tracer as a fragment shader [Purcell et al. 2002]. This paper assumes similar fragment processor capabilities, specifically fragment shaders of up to 256 instructions, an unrestricted number of texture access operations, a set of at least six registers, and standard single-precision floating point data formats.

3 Cache Aware Matrix Multiply

Suppose we are multiplying two large matrices $X$ and $Y$ whose dimensions are, without loss of generality, a power of two, with $n = 2^i$ rows and $n = 2^i$ columns. A general algorithm for computing $Z = XY$ can be expressed as follows:

    for i = 1 ... n
        for j = 1 ... n
            Z_ij = 0
            for m = 1 ... n
                Z_ij = Z_ij + X_im * Y_mj                  (3)

The outer two loops are implemented on the GPU by rendering a single screen-filling quadrilateral. This implies that everything within the outer two loops must be handled by the fragment shader. Below we have inserted fragment shader pseudo-code in the appropriate places. Matrices $X$ and $Y$ are now assumed to be stored in single-channel texture maps and accessed through the fragment shader lookup() operation.

    for i = 1 ... n
        for j = 1 ... n
            [fragment shader]
            R0.r ← 0
            for m = 1 ... n
                R1.r ← lookup(i, m, X)
                R2.r ← lookup(m, j, Y)
                R0.r ← R0.r + R1.r * R2.r                  (4)

The pseudo-code in (4) requires either that loops are available in fragment shaders or that the fragment shader instruction limit is large enough to allow the innermost loop to be unrolled. We assume that neither is a realistic assumption. To address this issue, we turn to a standard blocking strategy. Blocking has been shown to improve cache performance, but for this application blocking also allows us to work within the constraints of our fragment programming model. The pseudo-code for our blocking strategy is shown in (5). A new matrix $F$ is introduced that is initialized to all zeroes and used as a temporary store by the routine. The value $b$ is a scalar representing the block size.

    for k = 1 ... n step b
        for i = 1 ... n
            for j = 1 ... n
                [fragment shader]
                R3.r ← 0
                for m = k ... k+b-1
                    R1.r ← lookup(i, m, X)
                    R2.r ← lookup(m, j, Y)
                    R3.r ← R3.r + R1.r * R2.r
                R4.r ← lookup(i, j, F)
                R0.r ← R3.r + R4.r
        Copy frame buffer into texture F                   (5)

This new algorithm is a multipass method requiring multiple renderings to the frame buffer. The outer three loops are handled by rendering a screen-filling quadrilateral to the frame buffer $n/b$ times. Between each of the $n/b$ passes, the frame buffer is copied into texture map $F$ to be accumulated with the result of the next pass. This copy operation is required since modern GPU hardware does not support direct lookup operations on the framebuffer. Some graphics hardware does, however, support blending modes that allow fragment values to accumulate directly with the contents of the frame buffer, eliminating the need for the temporary texture $F$ and consequently giving more efficient rendering.

The fragment program (which covers the portion inside the j loop) can now be unrolled by choosing an appropriate value of $b$. For our tests we have chosen $b = 32$. We were able to reduce our fragment program to four instructions per iteration of the m loop, so the unrolled loop contributes $4b = 128$ instructions for $b = 32$.

4 Multi-Channel GPU Matrix Multiplies

Texture map sizes on modern-day GPUs are often restricted. Let $n$ be the maximum size of any dimension. NVidia's GeForce4 Ti4600 has a maximum allowable renderable size of $n = 2048$, limiting multipass programs to two-dimensional textures containing at most $2048 \times 2048$ elements. Texture maps may consist of between one and four channels (luminance, luminance-alpha, RGB or RGBA). This implies that the GeForce4 can handle textures in size up to $2048 \times 2048 \times 4$. Methods have already been presented for handling matrices whose dimensions are at most 2048 using single-channel texture maps.
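The blocked multipass scheme of Section 3, pseudo-code (5), can be emulated on the CPU as follows. This is a minimal sketch under stated assumptions: the function name `blocked_multiply` is illustrative, matrices are square nested lists whose side `n` is a multiple of the block size `b`, and replacing `F` with the new frame models the "copy frame buffer into texture F" step.

```python
def blocked_multiply(X, Y, b):
    """Emulate blocked multipass multiplication; returns (result, number of passes)."""
    n = len(X)
    assert n % b == 0, "this sketch assumes n is a multiple of the block size"
    F = [[0.0] * n for _ in range(n)]  # temporary store, initialized to zero
    passes = 0
    for k in range(0, n, b):  # one rendering pass per block of b inner-loop terms
        frame = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # Body of the emulated fragment shader for fragment (i, j).
                r3 = 0.0
                for m in range(k, k + b):  # the loop that would be unrolled
                    r3 += X[i][m] * Y[m][j]
                frame[i][j] = r3 + F[i][j]  # add the result of the previous pass
        F = frame  # copy frame buffer into texture F
        passes += 1
    return F, passes


X = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
Y = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
Z, passes = blocked_multiply(X, Y, b=2)
print(Z == X, passes)  # True 2
```

With `b = 1` the loop degenerates to one pass per inner-loop term, which is the unblocked multipass case; larger `b` trades longer fragment programs for fewer passes.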
It is naturally desirable to be able to handle larger matrices. The GeForce4, for example, should be able to multiply matrices $4096 \times 4096$ in size by utilizing all four of the color channels. This section describes a matrix multiplication algorithm that takes advantage of this four-component storage capability.

These four-channel textures store matrices of $4096 \times 4096$ 32-bit floating point values, and occupy 64MB. Our GPU matrix multiplication implementation requires four times this space for storing the two operands, a temporary store, and the result. Current consumer-level GPUs such as the GeForce4 ship with 128MB of on-board memory, which limits the largest matrix that can be multiplied entirely in on-board memory. Exceeding this memory threshold under the non-unified memory architecture of present-day PC GPUs results in paging to main system memory and increased traffic over the graphics card bus.
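The storage figures above follow from simple arithmetic, sketched below. The inputs are assumptions taken from the surrounding text: 2048 texels per texture dimension, a 2 x 2 arrangement of sub-matrices across the four channels, 4-byte single-precision values, four matrices resident at once, and 128MB of on-board memory read as 128 MiB. The derived maximum side is a consequence of those assumptions, not a figure quoted from the paper.

```python
import math

BYTES_PER_FLOAT = 4          # single-precision value
MIB = 1024 * 1024

max_texture_dim = 2048       # per-dimension texture limit stated in the text
side = 2 * max_texture_dim   # 2 x 2 sub-matrices packed into four channels

matrix_bytes = side * side * BYTES_PER_FLOAT
working_set_bytes = 4 * matrix_bytes   # two operands, temporary store, result

print(side)                        # 4096
print(matrix_bytes // MIB)         # 64
print(working_set_bytes // MIB)    # 256

# Largest square matrix side whose four copies fit in 128 MiB.
onboard_bytes = 128 * MIB
max_side = math.isqrt(onboard_bytes // (4 * BYTES_PER_FLOAT))
print(max_side)                    # 2896
```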

4.1 Basic Formulation

Suppose we are multiplying two large matrices $X$ and $Y$ whose dimensions are, without loss of generality, a power of two, with $n = 2^i$ rows and $n = 2^i$ columns.

$$XY = Z \tag{6}$$

The matrix multiply in (6) can be expressed as the following series of matrix multiplies of smaller matrices:

$$
\overbrace{\begin{bmatrix} A & B \\ C & D \end{bmatrix}}^{X}
\overbrace{\begin{bmatrix} E & F \\ G & H \end{bmatrix}}^{Y}
=
\overbrace{\begin{bmatrix} AE + BG & AF + BH \\ CE + DG & CF + DH \end{bmatrix}}^{Z}
\tag{7}
$$

Elements $A$, $B$, $C$, $D$, $E$, $F$, $G$, and $H$ are sub-matrices decomposing $X$ and $Y$. Let the dimensions of $A \ldots H$ be $2^{i-1}$ rows by $2^{i-1}$ columns.

4.2 Blocked Matrix Texture Maps

We can store matrices $X$ and $Y$ as $2^{i-1}$ by $2^{i-1}$ texture maps on the GPU by placing the four sub-matrices in the different color channels RGBA as

$$
X = \begin{pmatrix} A_r & B_g \\ C_b & D_a \end{pmatrix}, \qquad
Y = \begin{pmatrix} E_r & F_g \\ G_b & H_a \end{pmatrix}.
\tag{8}
$$

We now introduce subscript notation on matrices $M$, such that $M_i$ for $i \in \{r, g, b, a\}$ refers to the sub-matrices composing $M$. For example, $X_r Y_g = AF$. We have presented our technique for multiplying matrices contained in a single color channel, $X_r Y_r = Z_r$, in Section 3. We can extend this same approach to a SIMD notation by using multiple subscripts from $r, g, b, a$. For example:

$$
X_{rrbb} Y_{rgrg} = \begin{pmatrix} AE_r & AF_g \\ CE_b & CF_a \end{pmatrix}.
\tag{9}
$$

This notation assumes an architecture where four sub-matrices may be operated on in parallel by a single instruction. The notation is useful since graphics hardware is designed in a SIMD manner to work simultaneously on four color channels at a time. Using this notation, we can now concisely express the multiplication of $X$ and $Y$ as follows:

$$
X_{rgba} Y_{rgba}
= X_{rrbb} Y_{rgrg} + X_{ggaa} Y_{baba}
= \begin{pmatrix} AE_r & AF_g \\ CE_b & CF_a \end{pmatrix}
+ \begin{pmatrix} BG_r & BH_g \\ DG_b & DH_a \end{pmatrix}
= Z_{rgba}
\tag{10}
$$

4.3 The Multi-Channel Algorithm

To turn the formulation derived in (10) into an algorithm usable by graphics hardware, we must first re-examine fragment shader programming. As discussed in Section 2, a single lookup() operation can retrieve a 4-vector corresponding to the four color channels. If we store $X$ as a texture map in blocked form (8), then a single lookup R0 ← lookup(i, j, X) retrieves four values, one from each of the sub-matrices: R0 = $(A_{ij}, B_{ij}, C_{ij}, D_{ij})$.
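The channel bookkeeping in (8) and (10) can be checked numerically on the CPU. The sketch below is illustrative only: `pack`, `unpack`, `swizzle`, `multichannel_multiply` and `naive_multiply` are hypothetical helper names, texels are modeled as `(r, g, b, a)` tuples, and the swizzle patterns `rrbb`, `rgrg`, `ggaa`, `baba` are the ones assumed in (10). Integer entries are used so the comparison with a naive multiply is exact.

```python
def swizzle(texel, pattern):
    """Reorder or replicate the channels of an (r, g, b, a) tuple."""
    return tuple(texel["rgba".index(ch)] for ch in pattern)


def pack(M):
    """Store an n x n matrix as an (n/2) x (n/2) RGBA texture, as in (8)."""
    h = len(M) // 2
    return [[(M[i][j], M[i][j + h], M[i + h][j], M[i + h][j + h])
             for j in range(h)] for i in range(h)]


def unpack(T):
    """Inverse of pack: rebuild the n x n matrix from its RGBA texture."""
    h = len(T)
    M = [[0] * (2 * h) for _ in range(2 * h)]
    for i in range(h):
        for j in range(h):
            r, g, b, a = T[i][j]
            M[i][j], M[i][j + h] = r, g
            M[i + h][j], M[i + h][j + h] = b, a
    return M


def multichannel_multiply(X, Y):
    """Multiply two packed textures with two 4-wide multiply-adds per term."""
    h = len(X)
    Z = []
    for i in range(h):
        row = []
        for j in range(h):
            acc = (0, 0, 0, 0)
            for m in range(h):
                x, y = X[i][m], Y[m][j]
                first = zip(swizzle(x, "rrbb"), swizzle(y, "rgrg"))
                second = zip(swizzle(x, "ggaa"), swizzle(y, "baba"))
                acc = tuple(s + p * q + u * v
                            for s, (p, q), (u, v) in zip(acc, first, second))
            row.append(acc)
        Z.append(row)
    return Z


def naive_multiply(P, Q):
    """Reference triple-loop product, as in (3)."""
    n = len(P)
    return [[sum(P[i][m] * Q[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


P = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Q = [[2, 0, 1, 0], [0, 1, 0, 3], [1, 0, 0, 1], [0, 2, 1, 0]]
Z = unpack(multichannel_multiply(pack(P), pack(Q)))
print(Z == naive_multiply(P, Q))  # True
```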
The four matrix multiplies from equation (10) may now be parallelized within a single fragment shader. The matrix swizzling and smearing suggested by (10) is handled at the per-element level utilizing the capabilities of graphics hardware, as shown in (11).

    for k = 1 ... n/2 step b
        for i = 1 ... n/2
            for j = 1 ... n/2
                [fragment shader]
                R3 ← 0
                for m = k ... k+b-1
                    R1 ← lookup(i, m, X)
                    R2 ← lookup(m, j, Y)
                    R3 ← R1.rrbb * R2.rgrg + R3
                    R3 ← R1.ggaa * R2.baba + R3
                R4 ← lookup(i, j, F)
                R0 ← R3 + R4
        copy frame buffer into texture F                   (11)

The above algorithm represents an efficient use of the SIMD computation power of the GPU by working on all four color channels in parallel. This implementation only increases the fragment shader instruction count by one per iteration of m, to five instructions per iteration.

5 Analysis

We have analyzed the new blocked and multi-channel GPU matrix multiplication algorithms with respect to memory bandwidth, instruction count and predicted performance.

5.1 Bandwidth Considerations

To analyze the potential bandwidth limitations of our approach, we first distinguish between the two bandwidth-limited areas of modern GPUs. The external bandwidth is the rate at which data may be transferred between the GPU and the main system memory. On modern PCs this is limited by the speed of the AGP graphics bus, which can transfer data at a rate of 1 GB/sec. The internal bandwidth is the rate at which the GPU may read and write from its own internal memory. The GeForce4 Ti 4600 is currently capable of transferring 10.4 GB/sec.

External bandwidth affects our application in two areas. First, the matrices must be copied into the GPU's memory as texture maps, and the result of the computation must be read back from the card into main memory. These transfers use the AGP bus, which currently has a theoretical bandwidth of about 1 GB/s (for AGP 4x). However, in practice sending data to the GPU is much faster than reading data back from the GPU (since the hardware and drivers are optimized for this case), and the best speed we have measured for reading data back into host memory is 175 MB/s.
Even assuming an average of 200 MB/s transfer for both reads and writes, transferring two $1024 \times 1024$ single-precision matrices to the GPU and reading the result back requires 60 ms. At 4 GFLOPS the actual computation takes about 510 ms. For this problem, transfer time is about 11% of the total time. Thus, external bandwidth is a significant but not overwhelming fraction of the total time.

The second part of the algorithm affected by external bandwidth is the time to send the geometry, the screen-filling quads on which the matrices are texture mapped. In fact this is a very small amount of data (about 48 bytes per pass), and graphics hardware is very good at transferring geometry information in parallel with other tasks, including running fragment shaders. Thus, this cost is negligible.

One of the primary bottlenecks in performing matrix multiplies on the GPU is the internal bandwidth [Larsen and McAllister 2001]. This is also true for CPU implementations. For our analysis we consider multiplying two $n \times n$ matrices. For both the multi-channel and single-channel block-matrix approaches, the processing of each fragment requires two texture lookup() operations per iteration of the inner m loop, plus an additional lookup() to combine it with the results from the previous pass. A single write to output occurs per fragment per pass as its result is written to the frame buffer. Thus, there are $2b + 2$ memory operations per fragment per pass. In our single-channel method, every memory operation transfers 4 bytes of data. Our multi-channel method transfers 16 bytes of data (4 channels, 4 bytes per channel) per memory operation.

    method   passes     frags/pass    bytes/frag      total
    L-M      $n$        $n^2$         $16$            $16n^3$
    single   $n/b$      $n^2$         $4(2b+2)$       $8n^3(1 + 1/b)$
    multi    $n/(2b)$   $(n/2)^2$     $16(2b+2)$      $4n^3(1 + 1/b)$

Table 1: Bytes transferred internally by each GPU matrix multiplication method.

Table 1 summarizes these results and shows the total internal bandwidth in bytes transferred by each method, which is just the product of the number of passes, the fragments per pass and the bytes per fragment. The L-M (Larsen-McAllister) figures assume an implementation based on a single floating-point channel. (Even though multiple channels were mentioned, [Larsen and McAllister 2001] did not describe a multi-channel implementation.) The four-byte floats are accessed four times (two matrices, a temporary store and a result) for a total of 16 bytes transferred per fragment.

Our single-channel block matrix algorithm performs identically to L-M when $b$ is set to one (no blocking). As the block size grows, the bandwidth drops by nearly a factor of two when compared to L-M. The multi-channel method further reduces the memory bandwidth by exactly one half over the single-channel method, reducing the bandwidth to nearly 25% of L-M.
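The totals in Table 1 are products of passes, fragments per pass and bytes per fragment, so they are easy to evaluate for concrete sizes. The sketch below assumes the per-method factors described in the text (16 bytes per fragment for L-M, $2b + 2$ memory operations of 4 or 16 bytes for the blocked methods); the function names are illustrative, and `n` is assumed to be a multiple of `2b`.

```python
def bytes_larsen_mcallister(n):
    """n passes, n^2 fragments per pass, 16 bytes per fragment."""
    return n * n**2 * 16


def bytes_single_channel(n, b):
    """n/b passes, n^2 fragments per pass, (2b + 2) operations of 4 bytes."""
    return (n // b) * n**2 * (2 * b + 2) * 4


def bytes_multi_channel(n, b):
    """n/(2b) passes, (n/2)^2 fragments per pass, (2b + 2) operations of 16 bytes."""
    return (n // (2 * b)) * (n // 2) ** 2 * (2 * b + 2) * 16


GIB = 2**30
n, b = 1024, 32
lm = bytes_larsen_mcallister(n)
single = bytes_single_channel(n, b)
multi = bytes_multi_channel(n, b)

print(lm / GIB)      # 16.0
print(single / GIB)  # 8.25
print(multi / GIB)   # 4.125
print(multi / lm)    # 0.2578125
```

For these sizes the multi-channel method moves about 26% of the bytes of L-M, consistent with the "nearly 25%" figure, and the ratio approaches exactly 25% as `b` grows.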
5.2 Instructions

The GPU uses the fact that the same instructions are being executed for a large number of fragments to overlap the processing of different fragments. As no communication between executions of the fragment shaders is needed, a large amount of parallelism is available. This parallelism is used to hide the latency of memory operations and other causes of stalls. As a result, when enough fragments are available (as in our case), the running time of a fragment shader is approximately linear in the number of instructions executed (assuming computation is the limiting factor). Therefore, it makes sense to analyze the number of instructions used by our algorithm.

Each execution of the fragment shader needs four instructions for setup and for adding in the result of the previous pass. The multi-channel algorithm also needs three instructions per iteration of the inner loop (two multiply-add instructions and one addition to update the texture indices). The single-channel algorithm removes one of the multiply-add instructions. Therefore, the instruction counts are $3b + 4$ for the multi-channel case and $2b + 4$ for the single-channel case.

Note that in the multi-channel case, many of the instructions involve a 4-wide data issue, operating on the four color channels in parallel. Currently there is no performance penalty for this, since modern GPUs are designed to work natively on multiple channels. Table 2 summarizes the analysis and the total GPU floating point instructions required by each method.

    method   passes     frags/pass    inst/frag    total inst
    L-M      $n$        $n^2$         $2$          $2n^3$
    single   $n/b$      $n^2$         $2b+4$       $(2 + 4/b)\,n^3$
    multi    $n/(2b)$   $(n/2)^2$     $3b+4$       $(3 + 4/b)\,n^3/8$

Table 2: Floating point instructions required by each GPU matrix multiplication method.

The additional fragment program overhead makes the single-channel block-structured matrix multiplication longer than the L-M algorithm. As $b$ increases, the instruction count asymptotically approaches that of the L-M algorithm. The multi-channel method executes 3/16ths as many instructions as either of the single-channel methods as the block size grows.
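The instruction totals can be evaluated the same way as the bandwidth totals. This sketch uses the per-fragment counts stated above ($2b + 4$ and $3b + 4$); the L-M count of 2 instructions per fragment per pass is an assumption inferred from the statement that the single-channel count approaches L-M as $b$ grows. Function names are illustrative, and `n` is assumed to be a multiple of `2b`.

```python
from fractions import Fraction


def instructions_larsen_mcallister(n):
    """Assumed: n passes, n^2 fragments, 2 instructions per fragment."""
    return n * n**2 * 2


def instructions_single(n, b):
    """n/b passes, n^2 fragments, 2b + 4 instructions per fragment."""
    return (n // b) * n**2 * (2 * b + 4)


def instructions_multi(n, b):
    """n/(2b) passes, (n/2)^2 fragments, 3b + 4 instructions per fragment."""
    return (n // (2 * b)) * (n // 2) ** 2 * (3 * b + 4)


n, b = 1024, 32
lm = instructions_larsen_mcallister(n)
single = instructions_single(n, b)
multi = instructions_multi(n, b)

print(Fraction(multi, single))  # 25/136
print(Fraction(multi, lm))      # 25/128

# The ratio to the single-channel method approaches 3/16 as b grows.
big_b = 512
print(float(Fraction(instructions_multi(n, big_b), instructions_single(n, big_b))))
```

For $b = 32$ the multi-channel method executes roughly 18% of the single-channel instruction count and just under 20% of the assumed L-M count.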
5.3 Performance

ATLAS has demonstrated 4.0 GFLOPS for matrix multiplication on a 1.5GHz Pentium 4 using Intel's SSE2 SIMD instructions [Dongarra 2001]. Is the GPU comparable to this? For a matrix size of $1024 \times 1024$ and a block size of 32 ($n = 1024$, $b = 32$), our multi-channel algorithm transfers 4.125 GB of data. In order to match the ATLAS P4 SSE numbers, we need to perform this multiplication in 0.5 seconds. This means we will need 8.25 GB/s of bandwidth. Current hardware has a theoretical bandwidth of 10.4 GB/s to its on-board memory; as with CPUs, it also has a cache between the GPU and memory which supports much higher bandwidth. Thus, existing hardware should be able to support our bandwidth needs. Future hardware is likely to improve both cache size and performance, as well as memory bandwidth.

The performance of CPU implementations of matrix multiplication is typically limited by memory bandwidth, not CPU speed. Larsen and McAllister [Larsen and McAllister 2001] reported similar results with their GPU implementation. However, due to much lower clock speeds on GPUs relative to CPUs, the move from fixed-point byte operations to floating point may increase the processing requirements enough to make GPU speed the bottleneck.

6 Conclusion

We have presented a multi-channel block-based GPU matrix multiplication algorithm. The block-structured approach should yield greater cache coherence than previous methods. We also demonstrated that our implementation uses only about 25% of the memory bandwidth and instructions when compared to the previous method.

Our results are currently theoretical, as we anticipate the implementation of graphics hardware that supports the DirectX 9.0 standard. We expect such hardware will be available before the final version of this paper is required. (For example, as of this writing, 3Dlabs has just announced a processor, the P10, that partially satisfies upcoming standards.)

We will then be able to provide actual implementation times comparing our cache and bandwidth aware algorithms to the previous work [Larsen and McAllister 2001].

We have made numerous assumptions about the performance of the upcoming hardware. These assumptions are based on the speed of existing hardware, but with the generality to handle the upcoming standards. These simulations and analyses suggest our method is competitive with modern CPU implementations. The existence of hardware implementations will allow us to perform further tuning and validate our claims with empirical data.

Available hardware would allow us to automatically tune our algorithm for a given GPU in much the same manner as performed by ATLAS. Empirical tests may be run to search the algorithm's parameter space and select the best algorithm for a target GPU. Our algorithm is currently parameterized by its block size, but we could also introduce additional parameters to control the order of rasterization and the blocking of memory layouts.

The increased memory bandwidth and SIMD organization of the GPU should make it a good choice for scientific applications. We have nonetheless found that the GPU remains only about as powerful as the CPU on actual tasks like matrix multiplication and ray tracing [Carr et al. 2002]. This has been disappointing given the increased bandwidth and processing power of the GPU. The constraints of GPU programming, coupled with the low-bandwidth connection from the GPU back to the CPU, have been major obstacles to capitalizing on the GPU for scientific applications.

We are nonetheless encouraged by the potential of the GPU. Whereas future enhancements to the CPU explore parallelism through speculative execution and other probabilistic methods, the GPU can exploit parallelism across the frame buffer and across the geometric data. This has been partially responsible for the dominance of the GPU performance growth rate over that of the CPU. As GPU growth continues to outpace CPU growth, we expect the GPU will become the preferred platform for personal high-performance scientific computing.
Acknowledgments

This research was supported in part by the NSF under an ITR grant (ACI), and by NVidia Corp. Conversations with Jack Dongarra and Jim Demmel (and his students) were also quite helpful.

References

Aberdeen, D., and Baxter, J. 2000. General matrix-matrix multiplication using SIMD features of the PIII (research note). In European Conference on Parallel Processing.

Carr, N. A., Hall, J. D., and Hart, J. C. 2002. The ray engine. Tech. Rep. UIUCDCS-R, University of Illinois at Urbana-Champaign, Mar.

Dongarra, J. 2001. An update of a couple of tools: ATLAS and PAPI. DOE Salishan Meeting (Available from SLIDES/salishan.ps), Apr.

Kilgard, M. J. 2001. GL_NV_texture_rectangle. content/nvopenglspecs/GL_NV_texture_rectangle.txt.

Larsen, S. E., and McAllister, D. 2001. Fast matrix multiplies using graphics hardware. Super Computing (Nov.).

Marshall, B. 2001. DirectX graphics future. Microsoft DirectX Meltdown 2001 (Jul.).

Purcell, T. J., Buck, I., Mark, W. R., and Hanrahan, P. 2002. Ray tracing on programmable graphics hardware. In Proceedings of SIGGRAPH 2002, ACM Press / ACM SIGGRAPH, J. F. Hughes, Ed., Computer Graphics Proceedings, Annual Conference Series, ACM.

Whaley, R. C., Petitet, A., and Dongarra, J. 2001. Automated empirical optimizations of software and the ATLAS project. Parallel Computing 27, 1-2.

Woo, M., Neider, J., Davis, T., and Shreiner, D. 1997. OpenGL Programming Guide. Addison-Wesley, Reading, MA, USA.


More information

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON Roberto Lopez ad Eugeio Oñate Iteratioal Ceter for Numerical Methods i Egieerig (CIMNE) Edificio C1, Gra Capitá s/, 08034 Barceloa, Spai ABSTRACT I this work

More information

GPUMP: a Multiple-Precision Integer Library for GPUs

GPUMP: a Multiple-Precision Integer Library for GPUs GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 11 Frieds, Overloaded Operators, ad Arrays i Classes Copyright 2014 Pearso Addiso-Wesley. All rights reserved. Overview 11.1 Fried Fuctios 11.2 Overloadig Operators 11.3 Arrays ad Classes 11.4

More information

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS)

CSC165H1 Worksheet: Tutorial 8 Algorithm analysis (SOLUTIONS) CSC165H1, Witer 018 Learig Objectives By the ed of this worksheet, you will: Aalyse the ruig time of fuctios cotaiig ested loops. 1. Nested loop variatios. Each of the followig fuctios takes as iput a

More information

Chapter 3 Classification of FFT Processor Algorithms

Chapter 3 Classification of FFT Processor Algorithms Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As

More information

A Resource for Free-standing Mathematics Qualifications

A Resource for Free-standing Mathematics Qualifications Ope.ls The first sheet is show elow. It is set up to show graphs with equatios of the form = m + c At preset the values of m ad c are oth zero. You ca chage these values usig the scroll ars. Leave the

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1 Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig

More information

Alpha Individual Solutions MAΘ National Convention 2013

Alpha Individual Solutions MAΘ National Convention 2013 Alpha Idividual Solutios MAΘ Natioal Covetio 0 Aswers:. D. A. C 4. D 5. C 6. B 7. A 8. C 9. D 0. B. B. A. D 4. C 5. A 6. C 7. B 8. A 9. A 0. C. E. B. D 4. C 5. A 6. D 7. B 8. C 9. D 0. B TB. 570 TB. 5

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

游戏设计与开发. Outline. Game Programming Topics. Building A Game

游戏设计与开发. Outline. Game Programming Topics. Building A Game 1896 1935 1987 2006 Outlie 游戏设计与开发 Real Time Requiremet A Coceptual Rederig Pipelie The Graphics Processig Uit (GPU) Example 技术篇 : 实时图形硬件 Game Programmig Topics Focus: Buildig game ad virtual world High-level

More information

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis

More information

Pattern Recognition Systems Lab 1 Least Mean Squares

Pattern Recognition Systems Lab 1 Least Mean Squares Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig

More information

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions:

Solution printed. Do not start the test until instructed to do so! CS 2604 Data Structures Midterm Spring, Instructions: CS 604 Data Structures Midterm Sprig, 00 VIRG INIA POLYTECHNIC INSTITUTE AND STATE U T PROSI M UNI VERSI TY Istructios: Prit your ame i the space provided below. This examiatio is closed book ad closed

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5. Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple

More information

Improving Template Based Spike Detection

Improving Template Based Spike Detection Improvig Template Based Spike Detectio Kirk Smith, Member - IEEE Portlad State Uiversity petra@ee.pdx.edu Abstract Template matchig algorithms like SSE, Covolutio ad Maximum Likelihood are well kow for

More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

A Very Simple Approach for 3-D to 2-D Mapping

A Very Simple Approach for 3-D to 2-D Mapping A Very Simple Approach for -D to -D appig Sadipa Dey (1 Ajith Abraham ( Sugata Sayal ( Sadipa Dey (1 Ashi Software Private Limited INFINITY Tower II 10 th Floor Plot No. - 4. Block GP Salt Lake Electroics

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

1.2 Binomial Coefficients and Subsets

1.2 Binomial Coefficients and Subsets 1.2. BINOMIAL COEFFICIENTS AND SUBSETS 13 1.2 Biomial Coefficiets ad Subsets 1.2-1 The loop below is part of a program to determie the umber of triagles formed by poits i the plae. for i =1 to for j =

More information

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015.

Hash Tables. Presentation for use with the textbook Algorithm Design and Applications, by M. T. Goodrich and R. Tamassia, Wiley, 2015. Presetatio for use with the textbook Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Hash Tables xkcd. http://xkcd.com/221/. Radom Number. Used with permissio uder Creative

More information

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition. Computer Architecture A Quatitative Approach, Sixth Editio Chapter 2 Memory Hierarchy Desig 1 Itroductio Programmers wat ulimited amouts of memory with low latecy Fast memory techology is more expesive

More information

Computer Systems - HS

Computer Systems - HS What have we leared so far? Computer Systems High Level ENGG1203 2d Semester, 2017-18 Applicatios Sigals Systems & Cotrol Systems Computer & Embedded Systems Digital Logic Combiatioal Logic Sequetial Logic

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

A Study on the Performance of Cholesky-Factorization using MPI

A Study on the Performance of Cholesky-Factorization using MPI A Study o the Performace of Cholesky-Factorizatio usig MPI Ha S. Kim Scott B. Bade Departmet of Computer Sciece ad Egieerig Uiversity of Califoria Sa Diego {hskim, bade}@cs.ucsd.edu Abstract Cholesky-factorizatio

More information

Ones Assignment Method for Solving Traveling Salesman Problem

Ones Assignment Method for Solving Traveling Salesman Problem Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history:

More information

Software development of components for complex signal analysis on the example of adaptive recursive estimation methods.

Software development of components for complex signal analysis on the example of adaptive recursive estimation methods. Software developmet of compoets for complex sigal aalysis o the example of adaptive recursive estimatio methods. SIMON BOYMANN, RALPH MASCHOTTA, SILKE LEHMANN, DUNJA STEUER Istitute of Biomedical Egieerig

More information

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 10: Caches Prof. Yajig Li Uiversity of Chicago Midterm Recap Overview ad fudametal cocepts ISA Uarch Datapath, cotrol Sigle cycle, multi cycle Pipeliig Basic idea,

More information

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 13 Control and Sequencing: Hardwired and Microprogrammed Control

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 13 Control and Sequencing: Hardwired and Microprogrammed Control EE 459/500 HDL Based Digital Desig with Programmable Logic Lecture 13 Cotrol ad Sequecig: Hardwired ad Microprogrammed Cotrol Refereces: Chapter s 4,5 from textbook Chapter 7 of M.M. Mao ad C.R. Kime,

More information

Lecture 18. Optimization in n dimensions

Lecture 18. Optimization in n dimensions Lecture 8 Optimizatio i dimesios Itroductio We ow cosider the problem of miimizig a sigle scalar fuctio of variables, f x, where x=[ x, x,, x ]T. The D case ca be visualized as fidig the lowest poit of

More information

Bayesian approach to reliability modelling for a probability of failure on demand parameter

Bayesian approach to reliability modelling for a probability of failure on demand parameter Bayesia approach to reliability modellig for a probability of failure o demad parameter BÖRCSÖK J., SCHAEFER S. Departmet of Computer Architecture ad System Programmig Uiversity Kassel, Wilhelmshöher Allee

More information

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions Proceedigs of the 10th WSEAS Iteratioal Coferece o APPLIED MATHEMATICS, Dallas, Texas, USA, November 1-3, 2006 316 A Geeralized Set Theoretic Approach for Time ad Space Complexity Aalysis of Algorithms

More information

Fast Fourier Transform (FFT) Algorithms

Fast Fourier Transform (FFT) Algorithms Fast Fourier Trasform FFT Algorithms Relatio to the z-trasform elsewhere, ozero, z x z X x [ ] 2 ~ elsewhere,, ~ e j x X x x π j e z z X X π 2 ~ The DFS X represets evely spaced samples of the z- trasform

More information

One advantage that SONAR has over any other music-sequencing product I ve worked

One advantage that SONAR has over any other music-sequencing product I ve worked *gajedra* D:/Thomso_Learig_Projects/Garrigus_163132/z_productio/z_3B2_3D_files/Garrigus_163132_ch17.3d, 14/11/08/16:26:39, 16:26, page: 647 17 CAL 101 Oe advatage that SONAR has over ay other music-sequecig

More information

3D Model Retrieval Method Based on Sample Prediction

3D Model Retrieval Method Based on Sample Prediction 20 Iteratioal Coferece o Computer Commuicatio ad Maagemet Proc.of CSIT vol.5 (20) (20) IACSIT Press, Sigapore 3D Model Retrieval Method Based o Sample Predictio Qigche Zhag, Ya Tag* School of Computer

More information

Cache-Optimal Methods for Bit-Reversals

Cache-Optimal Methods for Bit-Reversals Proceedigs of the ACM/IEEE Supercomputig Coferece, November 1999, Portlad, Orego, U.S.A. Cache-Optimal Methods for Bit-Reversals Zhao Zhag ad Xiaodog Zhag Departmet of Computer Sciece College of William

More information

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 11: More Caches Prof. Yajig Li Uiversity of Chicago Lecture Outlie Caches 2 Review Memory hierarchy Cache basics Locality priciples Spatial ad temporal How to access

More information

EE123 Digital Signal Processing

EE123 Digital Signal Processing Last Time EE Digital Sigal Processig Lecture 7 Block Covolutio, Overlap ad Add, FFT Discrete Fourier Trasform Properties of the Liear covolutio through circular Today Liear covolutio with Overlap ad add

More information

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS SIAM J. SCI. COMPUT. Vol. 22, No. 6, pp. 2113 2134 c 21 Society for Idustrial ad Applied Mathematics FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS ZHAO ZHANG AND XIAODONG ZHANG

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 6 Defiig Fuctios Pytho Programmig, 2/e 1 Objectives To uderstad why programmers divide programs up ito sets of cooperatig fuctios. To be able to

More information

UNIVERSITY OF MORATUWA

UNIVERSITY OF MORATUWA UNIVERSITY OF MORATUWA FACULTY OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING B.Sc. Egieerig 2014 Itake Semester 2 Examiatio CS2052 COMPUTER ARCHITECTURE Time allowed: 2 Hours Jauary 2016

More information

Computers and Scientific Thinking

Computers and Scientific Thinking Computers ad Scietific Thikig David Reed, Creighto Uiversity Chapter 15 JavaScript Strigs 1 Strigs as Objects so far, your iteractive Web pages have maipulated strigs i simple ways use text box to iput

More information

APPLICATION NOTE. Automated Gain Flattening. 1. Experimental Setup. Scope and Overview

APPLICATION NOTE. Automated Gain Flattening. 1. Experimental Setup. Scope and Overview APPLICATION NOTE Automated Gai Flatteig Scope ad Overview A flat optical power spectrum is essetial for optical telecommuicatio sigals. This stems from a eed to balace the chael powers across large distaces.

More information

1&1 Next Level Hosting

1&1 Next Level Hosting 1&1 Next Level Hostig Performace Level: Performace that grows with your requiremets Copyright 1&1 Iteret SE 2017 1ad1.com 2 1&1 NEXT LEVEL HOSTING 3 Fast page loadig ad short respose times play importat

More information

FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea

FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea FPGA IMPLEMENTATION OF BASE-N LOGARITHM Salvador E. Tropea Electróica e Iformática Istituto Nacioal de Tecología Idustrial Bueos Aires, Argetia email: salvador@iti.gov.ar ABSTRACT I this work, we preset

More information

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation Improvemet of the Orthogoal Code Covolutio Capabilities Usig FPGA Implemetatio Naima Kaabouch, Member, IEEE, Apara Dhirde, Member, IEEE, Saleh Faruque, Member, IEEE Departmet of Electrical Egieerig, Uiversity

More information

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems

More information

Review: The ACID properties

Review: The ACID properties Recovery Review: The ACID properties A tomicity: All actios i the Xactio happe, or oe happe. C osistecy: If each Xactio is cosistet, ad the DB starts cosistet, it eds up cosistet. I solatio: Executio of

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 1 Computers ad Programs 1 Objectives To uderstad the respective roles of hardware ad software i a computig system. To lear what computer scietists

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpeCourseWare http://ocw.mit.edu 6.854J / 18.415J Advaced Algorithms Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.415/6.854 Advaced Algorithms

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Advaced Issues Review: Pipelie Hazards Structural hazards Desig pipelie to elimiate structural hazards.

More information

Isn t It Time You Got Faster, Quicker?

Isn t It Time You Got Faster, Quicker? Is t It Time You Got Faster, Quicker? AltiVec Techology At-a-Glace OVERVIEW Motorola s advaced AltiVec techology is desiged to eable host processors compatible with the PowerPC istructio-set architecture

More information

condition w i B i S maximum u i

condition w i B i S maximum u i ecture 10 Dyamic Programmig 10.1 Kapsack Problem November 1, 2004 ecturer: Kamal Jai Notes: Tobias Holgers We are give a set of items U = {a 1, a 2,..., a }. Each item has a weight w i Z + ad a utility

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Ruig Time of a algorithm Ruig Time Upper Bouds Lower Bouds Examples Mathematical facts Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite

More information

Searching a Russian Document Collection Using English, Chinese and Japanese Queries

Searching a Russian Document Collection Using English, Chinese and Japanese Queries Searchig a Russia Documet Collectio Usig Eglish, Chiese ad Japaese Queries Fredric C. Gey (gey@ucdata.berkeley.edu) UC Data Archive & Techical Assistace Uiversity of Califoria, Berkeley, CA 94720 USA ABSTRACT.

More information

Normals. In OpenGL the normal vector is part of the state Set by glnormal*()

Normals. In OpenGL the normal vector is part of the state Set by glnormal*() Ray Tracig 1 Normals OpeG the ormal vector is part of the state Set by glnormal*() -glnormal3f(x, y, z); -glnormal3fv(p); Usually we wat to set the ormal to have uit legth so cosie calculatios are correct

More information

Appendix A. Use of Operators in ARPS

Appendix A. Use of Operators in ARPS A Appedix A. Use of Operators i ARPS The methodology for solvig the equatios of hydrodyamics i either differetial or itegral form usig grid-poit techiques (fiite differece, fiite volume, fiite elemet)

More information

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA Creatig Exact Bezier Represetatios of CST Shapes David D. Marshall Califoria Polytechic State Uiversity, Sa Luis Obispo, CA 93407-035, USA The paper presets a method of expressig CST shapes pioeered by

More information

Lecture 5. Counting Sort / Radix Sort

Lecture 5. Counting Sort / Radix Sort Lecture 5. Coutig Sort / Radix Sort T. H. Corme, C. E. Leiserso ad R. L. Rivest Itroductio to Algorithms, 3rd Editio, MIT Press, 2009 Sugkyukwa Uiversity Hyuseug Choo choo@skku.edu Copyright 2000-2018

More information

Algorithm. Counting Sort Analysis of Algorithms

Algorithm. Counting Sort Analysis of Algorithms Algorithm Coutig Sort Aalysis of Algorithms Assumptios: records Coutig sort Each record cotais keys ad data All keys are i the rage of 1 to k Space The usorted list is stored i A, the sorted list will

More information

Task scenarios Outline. Scenarios in Knowledge Extraction. Proposed Framework for Scenario to Design Diagram Transformation

Task scenarios Outline. Scenarios in Knowledge Extraction. Proposed Framework for Scenario to Design Diagram Transformation 6-0-0 Kowledge Trasformatio from Task Scearios to View-based Desig Diagrams Nima Dezhkam Kamra Sartipi {dezhka, sartipi}@mcmaster.ca Departmet of Computig ad Software McMaster Uiversity CANADA SEKE 08

More information

Real-Time Simulation of 3D Smoke on GPU

Real-Time Simulation of 3D Smoke on GPU eal-time Simulatio of 3D Smoke o GPU QING YANG Computer Sciece ad Iformatio Techology College Zhejiag Wali Uiversity No.8,South Qia Hu oad Nigbo, Zhejiag P..CHINA http://www.computer.zwu.edu.c Abstract:

More information

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 5: Pipeliig Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab1 Due toight Lab2: out later today; due 2 weeks from ow Review sessio this Friday Turig award

More information

EE260: Digital Design, Spring /16/18. n Example: m 0 (=x 1 x 2 ) is adjacent to m 1 (=x 1 x 2 ) and m 2 (=x 1 x 2 ) but NOT m 3 (=x 1 x 2 )

EE260: Digital Design, Spring /16/18. n Example: m 0 (=x 1 x 2 ) is adjacent to m 1 (=x 1 x 2 ) and m 2 (=x 1 x 2 ) but NOT m 3 (=x 1 x 2 ) EE26: Digital Desig, Sprig 28 3/6/8 EE 26: Itroductio to Digital Desig Combiatioal Datapath Yao Zheg Departmet of Electrical Egieerig Uiversity of Hawaiʻi at Māoa Combiatioal Logic Blocks Multiplexer Ecoders/Decoders

More information

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence

9.1. Sequences and Series. Sequences. What you should learn. Why you should learn it. Definition of Sequence _9.qxd // : AM Page Chapter 9 Sequeces, Series, ad Probability 9. Sequeces ad Series What you should lear Use sequece otatio to write the terms of sequeces. Use factorial otatio. Use summatio otatio to

More information

CIS 121 Data Structures and Algorithms with Java Fall Big-Oh Notation Tuesday, September 5 (Make-up Friday, September 8)

CIS 121 Data Structures and Algorithms with Java Fall Big-Oh Notation Tuesday, September 5 (Make-up Friday, September 8) CIS 11 Data Structures ad Algorithms with Java Fall 017 Big-Oh Notatio Tuesday, September 5 (Make-up Friday, September 8) Learig Goals Review Big-Oh ad lear big/small omega/theta otatios Practice solvig

More information

Computer Architecture ELEC3441

Computer Architecture ELEC3441 CPU-Memory Bottleeck Computer Architecture ELEC44 CPU Memory Lecture 8 Cache Dr. Hayde Kwok-Hay So Departmet of Electrical ad Electroic Egieerig Performace of high-speed computers is usually limited by

More information

EE University of Minnesota. Midterm Exam #1. Prof. Matthew O'Keefe TA: Eric Seppanen. Department of Electrical and Computer Engineering

EE University of Minnesota. Midterm Exam #1. Prof. Matthew O'Keefe TA: Eric Seppanen. Department of Electrical and Computer Engineering EE 4363 1 Uiversity of Miesota Midterm Exam #1 Prof. Matthew O'Keefe TA: Eric Seppae Departmet of Electrical ad Computer Egieerig Uiversity of Miesota Twi Cities Campus EE 4363 Itroductio to Microprocessors

More information

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions U.C. Berkeley CS170 : Algorithms Midterm 1 Solutios Lecturers: Sajam Garg ad Prasad Raghavedra Feb 1, 017 Midterm 1 Solutios 1. (4 poits) For the directed graph below, fid all the strogly coected compoets

More information

Chapter 4. Procedural Abstraction and Functions That Return a Value. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 4. Procedural Abstraction and Functions That Return a Value. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 4 Procedural Abstractio ad Fuctios That Retur a Value Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 4.1 Top-Dow Desig 4.2 Predefied Fuctios 4.3 Programmer-Defied Fuctios 4.4

More information

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago CMSC 22200 Computer Architecture Lecture 2: ISA Prof. Yajig Li Departmet of Computer Sciece Uiversity of Chicago Admiistrative Stuff Lab1 out toight Due Thursday (10/18) Lab1 review sessio Tomorrow, 10/05,

More information

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov

Sorting in Linear Time. Data Structures and Algorithms Andrei Bulatov Sortig i Liear Time Data Structures ad Algorithms Adrei Bulatov Algorithms Sortig i Liear Time 7-2 Compariso Sorts The oly test that all the algorithms we have cosidered so far is compariso The oly iformatio

More information

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software Structurig Redudacy for Fault Tolerace CSE 598D: Fault Tolerat Software What do we wat to achieve? Versios Damage Assessmet Versio 1 Error Detectio Iputs Versio 2 Voter Outputs State Restoratio Cotiued

More information

DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO

DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO Sagwo Seo, Trevor Mudge Advaced Computer Architecture Laboratory Uiversity of Michiga at A Arbor {swseo, tm}@umich.edu Yumig Zhu, Chaitali

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana

The Closest Line to a Data Set in the Plane. David Gurney Southeastern Louisiana University Hammond, Louisiana The Closest Lie to a Data Set i the Plae David Gurey Southeaster Louisiaa Uiversity Hammod, Louisiaa ABSTRACT This paper looks at three differet measures of distace betwee a lie ad a data set i the plae:

More information

Chapter 4 The Datapath

Chapter 4 The Datapath The Ageda Chapter 4 The Datapath Based o slides McGraw-Hill Additioal material 24/25/26 Lewis/Marti Additioal material 28 Roth Additioal material 2 Taylor Additioal material 2 Farmer Tae the elemets that

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

Chapter 8. Strings and Vectors. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 8. Strings and Vectors. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 8 Strigs ad Vectors Overview 8.1 A Array Type for Strigs 8.2 The Stadard strig Class 8.3 Vectors Slide 8-3 8.1 A Array Type for Strigs A Array Type for Strigs C-strigs ca be used to represet strigs

More information

Polynomial Functions and Models. Learning Objectives. Polynomials. P (x) = a n x n + a n 1 x n a 1 x + a 0, a n 0

Polynomial Functions and Models. Learning Objectives. Polynomials. P (x) = a n x n + a n 1 x n a 1 x + a 0, a n 0 Polyomial Fuctios ad Models 1 Learig Objectives 1. Idetify polyomial fuctios ad their degree 2. Graph polyomial fuctios usig trasformatios 3. Idetify the real zeros of a polyomial fuctio ad their multiplicity

More information

Data diverse software fault tolerance techniques

Data diverse software fault tolerance techniques Data diverse software fault tolerace techiques Complemets desig diversity by compesatig for desig diversity s s limitatios Ivolves obtaiig a related set of poits i the program data space, executig the

More information

Lower Bounds for Sorting

Lower Bounds for Sorting Liear Sortig Topics Covered: Lower Bouds for Sortig Coutig Sort Radix Sort Bucket Sort Lower Bouds for Sortig Compariso vs. o-compariso sortig Decisio tree model Worst case lower boud Compariso Sortig

More information