Fast Interpolation of Grid Data at a Non-Grid Point

Size: px
Start display at page:

Download "Fast Interpolation of Grid Data at a Non-Grid Point"

Transcription

1 Fast Iterpolatio of Grid Data at a No-Grid Poit Hiroshi Ioue IBM Research - Tokyo Tokyo, Japa iouehrs@jp.ibm.com Abstract Defiig data at a o-grid poit by iterpolatig grid data is a commo operatio i may workloads icludig scietific applicatios ad imagig applicatios. This paper describes our techique to accelerate this iterpolatio operatio ad show its performace beefit usig D computed tomography recostructio. The D CT is oe of the computeitesive medical imagig applicatios that frequetly iterpolates grid data (D images at a o-grid poit. To efficietly execute this operatio with SIMD istructios, we create a i-memory pre-computed table from the iput D image at rutime before projectig voxels oto each image to reduce the amout of computatio ad avoid o-cotiguous memory accesses that atteuate the beefits of SIMD istructios. We implemeted ad evaluated our precomputatio techique usig a biliear iterpolatio ad a rddegree Lagrage iterpolatio o POWER processors; it yields up to 7% ad 7% performace improvemets i the RabbitCT bechmark for the two iterpolatio algorithms respectively. Keywords - grid-based iterpolatio, D CT recostructio, back projectio, SIMD I. INTRODUCTION Iterpolatio of grid data to obtai data at a o-grid poit is a commo operatio i may workloads icludig imagig applicatios ad scietific stecil applicatios. The iterpolatio operatio is ofte quite costly especially for large grid data (such as high-resolutio images. Oe reaso is the large umber of arithmetic istructios to calculate the iterpolated value. Aother reaso is its scattered memory access patter; readig values from the surroudig pixels requires o-cotiguous memory accesses, which reduce the efficiecy of SIMD istructios eve with the hardware gather/scatter support of the latest processors. I this paper, we aim to accelerate the iterpolatio of grid data usig a ew algorithm. To make the beefits ad overhead of our ew algorithm clear, we coduct throughout evaluatios of our techique usig a three-dimesioal (D computed tomography (CT workload [] with two iterpolatio algorithms. D CT is oe of the medical imagig workloads that frequetly iterpolate grid data (D images. It recostructs a structure i a D volume by backprojectig multiple projectio images from differet agles ito the D volume usig the grid-based iterpolatio. Figure shows a schematic overview of the system of D coe-beam CT. A X-ray poit source ad a flat-pael X-ray detector move aroud the D object to capture D projectio images from differet agles. RabbitCT [] is a ope framework for bechmarkig the backprojectio algorithms i the FDK algorithm [], a widely used o-iterative CT recostructio algorithm. For each projectio image, the FDK algorithm projects each volume elemet (voxel i the D volume oto the projectio image based o the trasformatio matrix (projectio matrix ad the obtais the value at the projected poit by performig a biliear iterpolatio of the eighborig four pixels. The iterpolatio is critically importat for image quality but causes sigificat overheads i the computatio time [4]. Due to the advaces i imagig hardware, the size ad resolutio of iput images are icreasig; hece the computatio performace for the D recostructio is also becomig more importat to keep the overall CT system usability. I this paper, we describe our ew techique to accelerate the iterpolatio operatio ad hece the etire CT workload by sigificatly reducig the umber of arithmetic istructios required ad avoidig o-cotiguous memory accesses. For each iput projectio image, we create a precomputed table i memory at rutime, ad the recostructio loop accesses the pre-computed table istead of the projectio image. For realistic resolutios, the beefit of usig the precomputed table i the recostructio phase is much larger tha the overhead of pre-computatio. We implemeted ad evaluated this ew techique o a system with -socket POWER processors; our techique improved the performace of the RabbitCT bechmark by up to 7%. We also evaluated it usig a rd-degree Lagrage iterpolatio by ehacig RabbitCT ad observed up to 7% speedups. The basic idea of our pre-computatio techique ca be applied to a much wider rage of applicatios ad hardware compared to those we evaluate i this paper. Although we evaluated our techique usig a o-iterative CT recostructio algorithm, it ca be applied to other imagig applicatios. For example, some iterative CT recostructio ad image-registratio algorithms are good targets for our techique because they also iterpolate data frequetly. The key metric of the performace beefits is the ratio of the umber of grid poits (pixels i data ad the umber of executed iterpolatio operatios. Furthermore, certai oimagig stecil applicatios are also potetial targets. For istace, particle-i-cell simulatios of two-phase flows [] frequetly iterpolate the surroudig grid (mesh poits to obtai the flow parameters at the locatios of the particles. Hece, they are iterestig targets for our pre-computatio techique. Our techique does ot use features specific to the processor; hece, it ca be implemeted o other CPUs or GPUs.

2 Flat-pael detector (capturig D projectio images projected poit {u, v } D volume cotaiig object to be imaged j ( i p, i i+ u α β ( α ( u v p ˆ, ( i + p, v voxel at {x, z} ( β j + p j + p ( i +, j + Figure. Overview of D coe-beam CT. Flat-pael detector ad X-ray source move aroud D object to be imaged to capture D projectio images from differet agles. II. OUR PRE-COMPUTATION TECHNIQUE I this sectio, we first describe the baselie backprojectio algorithm ad details o how we ehace it to achieve higher performace by pre-computatio. A. Baselie Recostructio Algorithm [] RabbitCT evaluates the performace ad accuracy of the backprojectio operatio of the FDK algorithm [] for D volume recostructio. RabbitCT provides 49 projectio images, whose size S x S y is 4 9 pixels, with a projectio matrix for each image. The itesity value of each pixel is represeted by a sigle-precisio floatig poit value, ad each elemet of the 4 projectio matrix is represeted by a double-precisio floatig poit value. We deote the - th projectio image as I. For each of the projectio images, the recostructio algorithm projects all voxels oto I ad updates the desity value for the voxel based o the itesity value at the projected poit obtaied by iterpolatig the surroudig pixels. The umber of voxels i the volume, i.e., the, is L =,,, or 4. Amog these four s, L = is the most commo, ad L = is a toy bechmark that is mostly used for debuggig (hece it is ot cosidered for rakig. Although differet umbers of voxels are used, the same I are used as iput for all s. The voxel, whose D positio is deoted as {x, z}, is projected oto poit {u, v } i I as follows. Here, a is a projectio matrix determied by the system geometry ad provided by the framework for each I. u v w z = ( ax + a y + az + a9 w z, z = ( ax + a4 y + a7z + a w z, z = a x + a y + a z + a. I is defied oly at grid poits; hece, the itesity value at the projected poit {u, v }, which we deote as p ˆ ( u, v, is obtaied by performig a biliear iterpolatio from the itesity values p at the eighborig four pixels by poit source of X-ray ( Figure. Overview of projected poit {u, v} ad eighborig four grid poits. pˆ ( u, v = ( α( β p + α( β p( i +, + ( α β p j + + αβ p ( i +, j +, i = u, j = v, α = u u, β = v v. Here, p (i, equals I (i, if positio {i, j} is withi I ; otherwise, p (i, is zero. Figure shows a overview of the projected poit ad the eighborig pixels. The desity value of the voxel f (x, z is calculated as N f z = pˆ ( u, v ( = w z. Sice this recostructio algorithm has iheretly huge parallelism, we ca accelerate the executio of the algorithm by exploitig thread-level parallelism by multiple cores ad data parallelism by SIMD istructios. I this algorithm, computatio performace i processor cores is the primary bottleeck. Sice the accesses to the D images have spatial localit i.e. eighbourig voxels are typically mapped oto the same or a earby pixel i the D image, they do ot cause too frequet cache misses. The accesses to the voxel data, which are mostly sequetial, may become bottleeck depedig o the memory system performace, but we ca reduce this memory badwidth to the voxel data by usig a techique similar to temporal blockig [, 7]. I additio to the executio time, RabbitCT reports o the errors from the referece implemetatio to evaluate the accuracy of the recostructio algorithm. The referece implemetatio is a straightforward implemetatio of the algorithm usig double-precisio floatig-poit umbers durig the computatio. B. Our Pre-Computatio Techique The biliear iterpolatio with Equatio is the most time-cosumig part of the overall executio time of the CT recostructio. However, the iterpolatio is critically importat for the overall quality of the recostructed image; hece, we caot simply skip the iterpolatio. I additio to the large umber of arithmetic istructios to calculate Equatio, vectorizig this equatio is aother high hurdle (

3 with our pre-computatio for each projectio image I ( = ton - // pre-computatio phase for i = to Sx - for j = to Sy - calculate C to C by Equatio ad store ito pre-computed table ed ed // recostructio phase for Iz = to L - for Iy = to L - determie rage of Ix to iterate for Ix = Ix start to Ix ed project voxel at (Ix, I Iz oto I calculate pˆ by Equatio 4 update desity of voxel ed ed ed ed R L : resolutio (size of a voxel, O L : origi without our pre-computatio for each projectio image I ( = ton - // create zero-padded copy [4] for i = to Sx - for j = to Sy - copy I (i, ito separate memory buffer (Sectio. ed ed // recostructio phase for Iz = to L - for Iy = to L - determie rage of Ix to iterate for Ix = Ix start to Ix ed project voxel at (Ix, I Iz oto I calculate pˆ by Equatio update desity of voxel ed ed ed ed ( for higher performace. It is kow that vectorizig Equatio is iefficiet with the SIMD istructios while it is almost straightforward to efficietly vectorize other parts of the algorithm icludig Equatios ad [4]. This iefficiecy is due to the o-cotiguous ad ualiged memory accesses to read I from four pixels. A SIMD load istructio ca load multiple elemets oly whe they are cotiguous i memory; hece, o-cotiguous memory access icurs the overhead of executig multiple load istructios ad mixig the loaded elemets. Although some of the latest processors, such as Itel Haswell, support gather ad scatter memory access istructios to load from or store to o-cotiguous memory area, they are still slower tha cotiguous memory accesses. To efficietly calculate Equatio by reducig the redudat computatio ad o-cotiguous memory accesses, we group terms by u ad v, the factors that deped o the curret voxel locatio; we modify the equatio as follows: pˆ ( u, v = ( α( β p + α( β p( i +, + ( α β p j + + α β p( i +, j + = u v C + u C + v C + C. j Here, the coefficiets C to C are calculated as C C C C = p + p( i +, j + p( i +, p j +, = ( j + ( p ( i +, p + j ( p j + p( i +, j +, = ( i + ( p j + p + i ( p( i +, p( i +, j +, = ( i + ( j + p + i j p( i +, j + ( i + j p j + i ( j + p ( i +,. Figure. Pseudocode of recostructio with ad without our pre-computatio techique (4 ( The key poit is that these coefficiets C to C for each pixel i the D projectio image are idepedet of u ad v (hece idepedet of x, ad z by the defiitio. Hece, we ca pre-compute these coefficiets before executig the recostructio ad store the results i a i-memory precomputed table. Sice the umber of voxels (L is much larger tha the umber of pixels i a projectio image (S x S y, multiple voxels are projected oto the same pixel positio {i, j}. Hece, we have a huge opportuity to reuse the computed coefficiets for iterpolatio. Oce we pre-compute the coefficiets for each pixel of the D iput image, we ca use lightweight iterpolatio with Equatio 4. Equatio 4 ca be computed much more efficietly tha Equatio ; it ca be computed usig oly three multiply-ad-add istructios as ( u, v u ( v C + C + ( v C C ( i pˆ +, =, while Equatio requires istructios to be computed i our implemetatio eve with the multiply-ad-add istructios. Figure shows the pseudocode of the etire recostructio algorithm with ad without our precomputatio. Our pre-computatio techique uses additioal memory space for the pre-computed table. The size of the table is four times as large as I because we hold four values, C (i, to C (i,, per pixel for the biliear iterpolatio. However, the overheads i memory space ad i memory badwidth are ot sigificat compared to the space ad badwidth for other data structures, especially the desity values of voxels. Access to the pre-computed table causes more cache misses due to its larger footprit tha that of I. However, the beefit of reduced computatios was much more sigificat tha the drawback of a icreased umber of cache misses. I additio to the reduced computatio, which is beeficial regardless of the processor type, aother beefit of usig the pre-computed table is SIMD friedliess. By storig C to C for each pixel cotiguously i memory aliged o -bit

4 address boudaries, we ca totally avoid o-cotiguous or ualiged memory accesses for I because these four values are eough to compute p ˆ ( u, v, as show i Equatio 4, ad we do ot eed to access I after the pre-computatio phase. To fully exploit the data parallelism of SIMD istructios, each iteratio of the ier-most loop of our algorithm loads C to C for four poits by usig four vector load istructios ad the trasposes the values i vector registers usig the permutatio istructios to pack the same coefficiet of four poits i oe vector register. This traspositio is also used by a previous implemetatio [4] ad our baselie implemetatio for efficiet use of SIMD istructios. Although C to C stored i the pre-computed table are sigle-precisio umbers, we execute the pre-computatio phase usig double-precisio umbers ad covert the resultig umbers to sigle precisio umbers just before storig them i the pre-computed table. Whe we use sigleprecisio umbers durig the pre-computatio, we observe degradatio i accuracy. This is because roudig errors i the coefficiets are magified by multiplyig by u ad v i Equatio 4. This degradatio is ot sigificat whe we use double-precisio umbers i the pre-computatio phase. C. Applicatio to Higher Degree Iterpolatio Algorithms We ca apply the basic idea of our pre-computatio to higher degree iterpolatio algorithms. I may scietific ad imagig applicatios, biliear iterpolatio caot provide sufficiet accuracy; hece, higher order iterpolatio algorithms are ofte used i real-world applicatios []. I this paper, we implemeted ad evaluated a rd-degree Lagrage polyomial iterpolatio, which uses 4 4 pixels (istead of pixels of the biliear iterpolatio to obtai the iterpolated value. For the rd-degree Lagrage iterpolatio, we eed coefficiets per pixel i the pre-computed table. I geeral, the umber of coefficiets is equivalet to the umber of pixels used i the iterpolatio. Hece, for a Kthdegree polyomial iterpolatio of a D-dimesioal grid, the umber of coefficiets per pixel is (K+ D. Whe we aively apply the trasformatio of Equatio 4 to higher order algorithms, we experiece accuracy problems. With the rd-degree Lagrage iterpolatio, the iterpolated value is calculated as p ˆ ( u, v = u v C (i, + u v C (i, +, but products of u ad v, such as u v, ca potetially become huge; hece, this formula suffers from huge floatigpoit errors. To avoid this problem, we modify the equatio of the iterpolatio by groupig terms by α ad β istead of u ad v as follows: pˆ ( u, v = α β C + α β C + α βc + α C + α β C4 + α β C + α βc + α C7 + αβ C + αβ C9 + αβ C + αc + β C + β C + βc + C. Here, α = u u, β = v v. The coefficiets C to C ca be obtaied by groupig terms i the equatio of the Lagrage iterpolatio by the umbers of α ad β icluded i each term. For example, 4 ( + = ( p( i, j / + p j / p( i+, j / + p( i+, j / ( p( i, / + p / p( i+, / + p ( i+, // ( p( i, j+ / + p j+ / p( i+, j+ / + p( i+, j+ / p ( i, j+ / + p j+ / p ( i+, j+ / + p ( i+, j+ C / + ( / / Sice α ad β are i the rage betwee. ad. by defiitio regardless of u ad v, this equatio does ot suffer from huge floatig-poit errors. However, there is the drawback that this equatio requires two additioal vector arithmetic istructios to calculate α ad β i the ier-most recostructio loop, but this additioal cost of the two istructios does ot matter for higher order iterpolatio algorithms severely because they use a much larger umber of istructios tha simple biliear iterpolatio. Whe we apply this accurate versio of the pre-computatio to the biliear iterpolatio, we observed about % performace reductios i a trade-off for better accuracy. I summar we itroduced two differet approaches to improve the accuracy for the biliear ad the rd-degree Lagrage iterpolatio algorithms by reducig the floatigpoit errors. For the biliear iterpolatio, we use doubleprecisio floatig poit umbers i the pre-computatio phase. It icreases the cost of pre-computatio by up to twice, but it icurs o overhead i the recostructio phase. Sice the cost of the pre-computatio is quite small compared to the recostructio, the additioal cost of usig double precisio i the pre-computatio is less tha % of the total executio time for large s. For the rd-degree Lagrage iterpolatio, we use a alterative equatio ( by groupig terms by α ad β istead of u ad v. This approach yields better accuracy i trade for the higher overhead i the recostructio phase compared to the first approach of usig double-precisio umbers oly i the pre-computatio phase; the alterative equatio requires two additioal istructios i the recostructio loop to calculate α ad β. However, the additioal overhead is about % ad is much less sigificat compared to the beefits of the pre-computatio. D. Performace Modelig As we empirically show later, the vector uit utilizatio is the primary bottleeck i this workload; hece, the umber of executed vector istructios is the key to model the executio performace. We first show the umbers of vector istructios icluded i the ier-most loops of the recostructio phase with ad without pre-computatio to estimate the beefit of our pre-computatio techique. The, we discuss the umber of vector istructios for the pre-computatio phase to evaluate the overhead of our pre-computatio techique. Each iteratio of the ier-most loop of the recostructio phase (lies - i Figure executes the followig steps for four voxels at oce usig SIMD istructios: i project each voxel oto a D image (calculate u, v, w, ii calculate i, j to fid idices i the pre-computed table, iii load data from the pre-computed table (or image I, iv traspose data for four voxels i vector registers, v execute iterpolatio, ad vi update desity values of the voxels. /

5 Table. Number of istructios that cosume vector uit resource i each step of ier-most loop Step i calculate u, v, w (equatio ii calculate i, j iii load from pre-computed table or D image iv traspose i vector registers v execute iterpolatio vi update desity values biliear iterpolatio with precomputatiocomputatio without pre- (4 aliged load ( ualiged istructios load ists rd-degree Lagrage iterpolatio with precomputatiocomputatio without pre- ( aliged ( ualiged load ists load ists 4 total 4 (4 + (9 + Note that some additioal vector istructios ca be iserted by compiler for register copyig etc. / meas for ualiged load istructios. These steps are ot uique to our implemetatio ad quite similar to the previous implemetatios [4]. Table summarizes the umber of istructios that cosume the vector uit resource i each step of our implemetatio for two iterpolatio algorithms. From Table, we estimate that our pre-computatio techique gives.x (=4/ performace gai for the biliear iterpolatio ad.77x (=/ gai for the rd-degree Lagrage iterpolatio with large s. O the little-edia PowerPC platform, a ualiged load istructio cosumes vector uit resource while a aliged load istructio does ot []. If a ualiged load istructio does ot cosume the vector uit resource, the beefit of our pre-computatio will be smaller. However, eve i such a case, our pre-computatio still provides sigificat reductio i vector uit utilizatio; we estimate the performace gai will be.x (=4/ for the biliear iterpolatio ad.x (=9/ for the rd-degree Lagrage iterpolatio. As we discussed later, these estimates explai our experimetal results well especially for biliear iterpolatio. Table assumes a -bit SIMD istructio set such as VSX of POWER or SSE of x. Whe the vector legth becomes loger, such as AVX of x, the relative cost of traspositio is icreased. For the biliear iterpolatio with -bit SIMD istructios, for example, we estimate the traspositio uses istructios while the umbers of istructios for other steps are uchaged (assumig the processor provides sufficietly flexible permutatio istructios. This icreased cost of traspositio may atteuate the beefit of our techique slightl but our techique ca still give good reductio i the vector istructios i the ier-most loop. I the curret implemetatio of the pre-computatio phase for the biliear iterpolatio, we use both sigleprecisio ad double-precisio floatig-poit vector arithmetic istructios to keep the accuracy as already discussed. I total, we use vector istructios (icludig two ualiged load istructios for each iteratio of the ier-most loop, which processes four pixels. For the rddegree Lagrage iterpolatio, we use 4 sigle-precisio arithmetic istructios (icludig four ualiged load istructios per four pixels i the pre-computatio loop. From these umbers ad the sizes of the iput image ad D volume, we estimate the overhead of pre-computatio as follows. For the s of L = ad 4, the ratios of the umbers of vector istructios cosumed for the precomputatio are less tha % ad.%, respectivel with either iterpolatio algorithm. Hece, the overhead caused by the pre-computatio is much smaller tha the beefit of the reduced computatio time i the recostructio phase uless the umber of voxels is urealistically small compared to the size of the projectio images. As we experimetally show later, the ratios of the executio time of the pre-computatio phase reasoably match these estimates. We expect that the overhead of the pre-computatio will become more isigificat o future CT systems that use higher resolutio i both D projectio images ad D volume because the executio time of the pre-computatio phase is proportioal to the square of the D image resolutio while the recostructio phase follows the cube of the D volume resolutio. III. IMPLEMENTATION AND EVALUATIONS I this sectio, we detail the implemetatio of our precomputatio techique for RabbitCT. The, we show the experimetal results o POWER processors to illustrate the beefit of our pre-computatio techique. We implemeted our techique o POWER, but it does ot use POWERspecific features; hece, it is applicable to other processors, such as Itel Xeo, or to GPUs. A. Implemetaio We implemeted the recostructio algorithms i C++ ad vectorized them by had usig the SIMD itrisics provided by the IBM XL C++ compiler.. We used sigleprecisio floatig poit umbers i the implemetatio uless we explicitly deoted aother data type. Because the vector registers of the uderlyig processor are -bit i legth, oe SIMD istructio ca execute four operatios for differet values at oce. Our baselie implemetatio follows FastRabbit [4, 7], a state-of-the-art implemetatio for Itel processors. The baselie optimizatios iclude: replacig divide istructios with a reciprocal estimate istructio, skippig the voxels that caot be projected oto I i the ier-most loop, ad elimiatig the coditioal braches to check the out-of-image-boud accesses by creatig a copy of I with

6 throughput (GUPS 7 4 with pre-computatio without pre-computatio iterpolatio disabled (upper boud for us throughput (GUPS... with pre-computatio without pre-computatio iterpolatio disabled (upper boud for us. L= L= L= L=4 L= L= L= L=4 Figure 4. Performace (throughput i GUPS with ad without our pre-computatio techique for biliear iterpolatio o cores ( NUMA ode of POWER. Table. Accuracy (root-mea-squared error i HU with ad without our pre-computatio techique Root -Mea-Square Error (HU Problem size With precomputatio Without precomputatio Iterpolatio disabled L=.4.. L=..7. L=... L=4.4.. zero paddig aroud the image. These three optimizatios were eabled eve whe we did ot eable our precomputatio techique. I optimizatio, we solve the iequalities to determie the ecessary rage of I x before executig the ier-most loop (lie i Figure. I optimizatio, we prepare a memory buffer of 4 pixels. I is copied ito this buffer before executig the recostructio (lie -7 of without pre-computatio i Figure. We use the same idea of paddig zeros outside the image boudary to avoid coditioal braches i the precomputed table. May recet multi-socket systems have o-uiform memory architecture (NUMA; accesses to the directly attached (local memory are faster tha accesses to the remote memory attached to other sockets. Hece, it is importat to reduce the remote memory accesses to achieve good performace scalability. To reduce remote memory accesses, we allocate the pre-computed table ad desity values of voxels for each NUMA ode, ad each worker thread is boud to a NUMA ode. The, all the threads i each NUMA ode process differet projectio images rather tha multiple NUMA odes cooperatig o oe image. Hece, durig the recostructio phase, we do ot eed to access data i the remote memory. I the pre-computatio phase, we might eed to read a iput D image from remote memory sice we do ot cotrol the locatio of the iput images. However, the size of a iput image is much smaller tha the desity values of the voxels; hece, the accesses to the iput images do ot Figure. Performace (throughput i GUPS with ad without our pre-computatio techique for Lagrage iterpolatio o cores ( NUMA ode of POWER. severely degrade the performace scalability. For better load balacig, each ode picks a ew iput image from the global work queue after processig oe iput image rather tha statically assigig images for each NUMA ode. After processig all the projectio images, we calculate the fial desity value for each voxel by summig up the partial results from all NUMA odes. This fial calculatio is icluded i the executio time. Withi a NUMA ode, all threads cooperate to process oe projectio image; each thread picks a small block of voxels oe by oe ad updates these values. A atomic operatio (atomic add is used oly whe a thread picks the ext block from the ode-local work queue. Because oly oe projectio image is processed per ode at a time, we do ot eed to use atomic operatio to update values for voxels. B. Evaluatios We evaluated our techique o POWER processors. The system has two.9-ghz POWER processors with GB of system memory ruig Ubutu Liux 4. for Little Edia POWER as its OS. The system has processor cores ( cores per socket, ad the total umber of SMT threads (logical CPUs is usig a -way SMT. Sice oe POWER processor is a dual chip module, oe socket is divided ito two NUMA odes; hece, each NUMA ode is equipped with cores (or 4 SMT threads. While we use all SMT threads per core for experimets with the biliear iterpolatio, we use oly 4 SMT threads per core for the rddegree Lagrage iterpolatio. This was because the SMT4 cofiguratio outperformed SMT for the rd-degree Lagrage iterpolatio regardless of the use of the precomputatio. Each core is equipped with KB L cache ad MB L cache memory. Also, the system has MB shared L4 cache memory i total. We disabled the dyamic frequecy scalig to obtai more cosistet ad reproducible results. We measured the performace times ad explai the average results. Figure 4 ad Table illustrate the performace (throughput i GUPS, giga voxels updates per secod ad accuracy (root-mea-square error from the referece result i

7 Table. Executio time breakdow o cores of POWER Problem size Liear iterpolatio With pre-computatio Without precomputatio precomputatio reco structio total L=.9 ( 47%...94 L=.94 ( % L=.94 (4.% L=4.9 (.% Numbers show executio time per projectio image (i msec. - Percetages show i parethesis show ratios of pre-computatio to total executio time. rd-degree Lagrage iterpolatio With pre-computatio Without precomputatio precomputatio reco structio total. ( % ( % (4.9%..9.. (.7% HUs, Housfield uits for biliear iterpolatio with the s of L =,,, ad 4 with ad without our pre-computatio techique usig oe NUMA ode equipped with cores. GUPS is calculated by the umber of voxels i the D volume (L the umber of projectio images (N / executio time / 4. It also illustrates the performace whe we totally disabled the iterpolatio, i.e., we used pˆ ( u, v = p( i, istead of Equatio. This gives the upper-limit performace for our pre-computatio techique, which reduces the overhead of the iterpolatio. Our pre-computatio techique exhibited up to 7% performace improvemets (for L = 4 over the baselie (without pre-computatio without sigificat reductio i accuracy. Whe the iterpolatio was disabled, ulike with our techique, the accuracy was sigificatly degraded as a trade-off for higher performace. As described i Sectio II.B, we used double-precisio umbers i the pre-computatio phase. If we had used sigle-precisio umbers, the accuracy would become as poor as about.7 HU. O the platform we used for evaluatio, a ualiged load istructio cosumes vector uit resource. To estimate the beefit of our precomputatio techique o platforms without such additioal pealty i ualiged load istructios, we evaluated the performace without pre-computatio by replacig all ualiged load istructios i the ier-most loop with aliged load istructios. This resulted i iaccurate results, but it allowed us to estimate the overhead due to the pealty i ualiged load istructios. Compared to this versio, our pre-computatio still gave up to 4% performace gais. These performace gais match the estimatio from the performace model explaied i Sectio II.D. Figure evaluates our pre-computatio techique usig the rd-order Lagrage iterpolatio. The gais due to our pre-computatio were up to 7%. The gais were slightly lower tha the estimatio from the performace model. Oe possible reaso for the differece betwee the measuremets ad the estimatio is the overhead due to additioal vector istructios for register copies iserted by the compiler sice the higher degree iterpolatio algorithm ivolves more values ad icurs higher register pressure. For the rd-order Lagrage iterpolatio, the chages i the accuracy caused by our pre-computatio were egligible sice we used a accurate versio of the pre-computatio with Equatio. The performace improvemets with the pre-computatio are more sigificat for a large. Whe the vector uit utilizatio % 9% % 7% % % 4% % % % % with pre-computatio without pre-computatio iterpolatio disabled L= L= L= L=4 Figure. The vector uit utilizatio with ad without our precomputatio techique for the biliear iterpolatio o cores of POWER. is tiy compared to the iput size, e.g., L = i Figures 4 ad, the pre-computatio degrades performace. Table shows the executio time broke dow ito the pre-computatio phase ad recostructio phase for each. Because the size of the projectio images is the same for all s, the time for the precomputatio phase is almost costat regardless of the, while the larger s icrease the time for the recostructio phase. Therefore, the executio time of pre-computatio is egligible for large s, but it matters for the total performace whe the is very small. These performace overheads almost match the estimatios from the performace model. The estimatios from the model were slightly smaller tha the measured overhead partially because of a performace optimizatio i the recostructio phase. We skip the voxels that caot be projected oto the projectio image i the ier-most loop of the recostructio phase, but we do ot cosider this (datadepedet optimizatio i the performace model. For more isight ito the improvemets with our precomputatio techique, we show the vector uit utilizatio ad the umber of cache misses measured by the performace couter of the processor. We select these two metrics because our techique reduces the umber of vector istructios i a trade-off for icreased cache misses. Figure shows the vector uit utilizatio for the biliear iterpolatio. The utilizatio becomes % whe oe core

8 Total L cache data misses (icludig hardware prefetchig total umber of L cache misses total umber of L cache misses.e+.e+7.e+.e+.e+4.e+.e+7.e+.e+.e+4 biliear iterpolatio with pre-computatio without pre-computatio iterpolatio disabled L= L= L= L=4 rd-degree Lagrage iterpolatio with pre-computatio without pre-computatio iterpolatio disabled L= L= L= L=4 lower is better lower is better umber of demad L misses umber of demad L misses.e+.e+7.e+.e+.e+4.e+.e+7.e+.e+.e+4 Demad L cache data misses biliear iterpolatio with pre-computatio without pre-computatio iterpolatio disabled L= L= L= L=4 rd-degree Lagrage iterpolatio with pre-computatio without pre-computatio iterpolatio disabled L= L= L= L=4 lower is better lower is better Figure 7. Total L cache misses (icludig cache misses caused by hardware prefetcher ad demad misses o cores for biliear iterpolatio ad rd-degree Lagrage iterpolatio per iput image. Y-axis is i logarithmic scale. executes two vector istructios every cycle. As show i the figure, the vector uit utilizatio is quite high: more tha % for L = ad 4 with or without pre-computatio. This meas that the vector uit is the primary bottleeck i this workload eve with the icreased cache misses with the precomputatio; hece, the performace gai due to the reduced vector istructios is much more sigificat tha the overhead due to the icreased cache misses. Figure 7 compares the umbers of total L cache data misses ad L cache demad data misses. We show the total misses, which icludes cache misses caused by the hardware memory prefetcher, to evaluate how our techique affects the total amout of data fetched ito the processors. We estimated the total cache misses caused by usig the demad cache misses whe the hardware memory prefetcher was disabled; hece, these estimates may slightly uderestimate the real values. Our techique icreased the L demad cache misses regardless of the as a trade-off for reducig the umber of executed vector istructios. The pre-computed table is 4x or x as large as the origial D image; hece, the larger memory footprit of the pre-computed table icreased demad cache misses. However, the icreases i the total amout of data trasfer were much less sigificat. These characteristics were commo betwee two iterpolatio algorithms. The curret algorithm reads ad writes the desity values of the all voxels for each projectio image, ad these accesses for the desity values of voxels domiate the memory badwidth of the system memory. Because the accesses to the voxel data are mostly sequetial, the hardware memory prefetcher of the processor works well; the accesses to the voxel data do ot cause frequet demad cache misses. Hece, the effect of our pre-computatio o the amout of the data trasfer betwee processors ad memory is much less sigificat compared to the effects o the demad cache misses. If the memory badwidth becomes the major performace bottleeck, we ca use a techique similar to temporal blockig to reduce memory badwidth requiremets for accessig the voxel data [, 7]. The POWER processor we used i the experimets supports SMT threads per core, while other geeral-purpose processors typically support a smaller umber of SMT threads; e.g. the latest x processors of Itel or AMD support -way SMT by hyper-threadig. By icreasig SMT level (threads per core, it is possible to hide the cache miss latecy ad improve the vector uit utilizatio. To cofirm that our pre-computatio is effective o other processors that oly support a smaller umber of SMT threads per core, we show the performace ad the vector uit utilizatio for the biliear iterpolatio usig cofiguratios with to SMT threads per core o POWER i Figure ad 9. Our pre-computatio improved the performace regardless of the SMT level. The performace gai by the pre-computatio was 7% with

9 throughput (GUPS 7 4 with pre-computatio without pre-computatio iterpolatio disabled (upper boud for us SMT SMT SMT 4 SMT SMT level Figure. Performace (throughput (GUPS with ad without our pre-computatio techique for liear iterpolatio with differet SMT levels o cores of POWER for of L =. SMT level meas umber of threads per core. vector uit utilizatio % 9% % 7% % % 4% % % % % with pre-computatio without pre-computatio iterpolatio disabled SMT SMT SMT 4 SMT SMT level Figure 9. Vector uit utilizatio with ad without our precomputatio techique for liear iterpolatio with differet SMT levels o cores of POWER for of L =. SMT ( SMT thread per core ad 4% with SMT. The best performace without the pre-computatio was achieved with SMT4, ad usig SMT slightly degraded the performace due to icreased cache misses caused by a larger memory footprit. With the pre-computatio, we obtaied the best result with SMT, ad SMT4 was the close secod best. The improvemets i vector uit utilizatio with icreasig SMT level was most sigificat whe icreasig the SMT level from to regardless of the use of the precomputatio. Sice the umbers of executed vector istructios were almost costat whe chagig the SMT level, the chages i the vector uit utilizatio caused by the SMT level show i Figure were correlated with the chages i the executio time. Hece, the performace improvemets were also most sigificat from SMT to SMT. Improvemets from usig more SMT threads over SMT were relatively small compared to the chages betwee SMT ad SMT. For the rd-degree Lagrage iterpolatio, these treds were mostly similar. These results show that our precomputatio techique ca improve the performace of this workload eve o systems with fewer SMT threads per core. Figure illustrates the performace scalability with ad without the optimizatio for the NUMA architecture described i Sectio III.A. With the NUMA optimizatio, the performaces of both iterpolatio algorithms scaled well, higher is better Liear Iterpolatio throughput (GUPS L= L= with NUMA opt L=4 L= L= without NUMA opt L=4 umber of cores ( NUMA ode = cores rd-degree Lagrage Iterpolatio throughput (GUPS L= L= with NUMA opt L=4 L= L= without NUMA opt L=4 umber of cores ( NUMA ode = cores Figure. Performace scalability with ad without optimizatio for NUMA o up to cores (4 NUMA odes of POWER. Sice POWER chip has two dies o oe socket, a system with two sockets has four NUMA odes. eve with cores ( sockets, 4 NUMA odes. The icrease i speed over the sigle core executio was up to.9x ad.7x by cores (for L = 4 for the biliear ad Lagrage iterpolatios, respectively. However, the scalability was much poorer without the NUMA optimizatio whe we used multiple NUMA odes due to the overhead of remote memory accesses. The performace improvemets caused by the NUMA optimizatio were more sigificat for the biliear iterpolatio compared to the rd-degree Lagrage iterpolatio. The umber of total memory accesses were ot that differet for the two algorithms as show i Figure 7, but the executio time was much loger for the rd-degree Lagrage iterpolatio due to higher computatio cost. Hece the higher commuicatio-ad-computatio ratio of the biliear iterpolatio resulted i more sigificat gais caused by the NUMA optimizatio. IV. RELATED WORK I may applicatios, pre-processig of iput data is a commo techique to improve the ed-to-ed performace of the workload; hece, a variety of pre-processig techiques have bee studied. For example, pre-coditioig techiques to improve memory access locality ad vector utilizatio have bee widely studied for sparse matrix systems. However,

10 there is o kow pre-processig techique to make the gridbased iterpolatio more efficiet. Also, the costs of the existig pre-processig techiques are ofte high; hece, the use of such pre-processig techiques may be justified oly whe the iput data is repeatedly used. I the case of the precoditioig for sparse matrix systems, for example, a precoditioed matrix is ofte repeatedly used i a iterative method such as the cojugate gradiet method; hece, the cost of pre-coditioig is amortized. Ulike such pre-processig techiques, our pre-computatio techique icurs a much smaller overhead compared to the beefit we ca obtai i oe executio of the recostructio phase. Hece, we ca use the pre-computatio techique eve if each projectio image is processed oly oce. Due to its importace, the D coe-beam CT recostructio has bee studied for a log time. The FDK algorithm by Feldkamp et al. [] is oe of the most popular recostructio algorithms. The FDK algorithm ad its variats have bee implemeted o various hardware platforms icludig geeral-purpose processors [4,, 9], FPGAs [], Xeo Phi [4], Cell BE processors [], ad GPUs [, -]. However, it was difficult to coduct fair comparisos of such implemetatios o differet hardware with various optimizatio techiques ivolved i terms of the executio performace ad the quality of recostructed images. Recetl RabbitCT [], a ope framework for bechmarkig the D coe-beam CT recostructio, has provided a way to fairly compare algorithms ad implemetatios by providig a bechmark framework ad also high-quality iput images ad referece results. Although our baselie implemetatio follows FastRabbit [4, 9], a state-of-the-art implemetatio for Itel s processors, our implemetatio outperformed FastRabbit ruig o a - socket machie (Itel IvyBridge [4] by.9x usig the same umber of cores as for L =. Although this higher performace is partially due to hardware differeces (with a higher frequecy ad larger umber of SMT threads, but without -bit vector istructios, our techique plays a critical role i exhibitig superior performace, as show i Figure 4. FastRabbit uses a aive formula to calculate the biliear iterpolatios like our baselie implemetatio. Although we implemeted our pre-computatio techique oly o POWER processors, our techique ca be applied for other processors. I RabbitCT bechmark, which uses biliear iterpolatio, GPUs achieved much higher throughput compared to geeral-purpose CPUs such as POWER sice GPUs support biliear iterpolatio by hardware [,-]. Because our pre-computatio is ot limited to biliear iterpolatio, our techique is also beeficial for GPUs if a higher order iterpolatio algorithm, which is ot supported by GPU hardware, is used to achieve higher image quality. Real-world CT systems ofte use higher order iterpolatio algorithms. V. CONCLUSION We developed a techique to accelerate iterpolatio of grid data (D projectio images at a o-grid poit. We argued that our pre-computatio techique sigificatly improves the throughput of a medical imagig workload without degradig the accuracy; it exhibited up to 7% ad 7% speed ups i the RabbitCT bechmark for biliear ad rd-degree Lagrage iterpolatio algorithms respectively. Sice the iterpolatio of grid data to obtai data at a o-grid poit is a commo operatio i may workloads, our ew techique ca cotribute to a wider rage of workloads ad algorithms tha just the D CT recostructio we described i this paper. REFERENCES [] W. C. Scarfe ad A. G. Farma, What is coe-beam CT ad how does it work?, Detal Cliics of North America, vol, pp (. [] C. Rohkohl, B. Keck, H. G. Hofma, ad J. Horegger, RabbitCT A Ope Platform for Bechmarkig -D Coe-beam Recostructio Algorithms, Medical Physics, vol., pp (9. [] L. Feldkamp, L. Davis, ad J. Kress, Practical Coe-Beam Algorithm, J.oural of the Optical Society of America, vol. A, o., pp. -9 (94. [4] J. Hofma, J. Treibig, G. Hager, ad G. Wellei, Comparig the performace of differet x SIMD istructio sets for a medical imagig applicatio o moder multi- ad maycore chips, I Workshop o Programmig models for SIMD/Vector processig (4. [] Q. Wag ad K. D. Squires, Large eddy simulatio of particle depositio i a vertical turbulet chael flow, Iteratioal Joural of Multiphase Flow, Vol., No. 4, pp. 7- (99. [] E. Papehause, Z. Zheg, K. Mueller, GPU-accelerated backprojectio revisited: squeezig performace by careful tuig, I Workshop o High Performace Image Recostructio (. [7] H. Ioue, Efficiet Tomographic Recostructio For Commodity Processors with Limited Memory Badwidth, I Proceedigs of IEEE Iteratioal Symposium o Biomedical Imagig (. [] M. Gschwid ad W. Schmidt, Bridgig edia-depedet SIMD vector represetatio with compiler optimizatio, I Workshop o Programmig models for SIMD/Vector processig (. [9] J. Treibig, G. Hager, H. G. Hofma, J. Horegger, ad G. Wellei, Pushig the limits for medical image recostructio o recet stadard multicore processors, It. J. High Perform. Comput. Appl. 7,, pp. -77 (. [] B. Heigl ad M. Kowarschik, High-Speed Recostructio for C-Arm Computed Tomograph I Proceedigs of Fully Three-Dimesioal Image Recostructio i Radiology ad Nuclear Medicie, pp. - (7. [] H. Scherla, M. Koerera, H. Hofmaa, W. Eckertb, M. Kowarschikc, ad J. Horeggera, Implemetatio of the FDK Algorithm for Coe- Beam CT o the Cell Broadbad Egie Architecture, I Proceedigs of SPIE Proceedigs Vol. Medical Imagig: Physics of Medical Imagig (7. [] E. Papehause ad K. Mueller, Rapid rabbit: Highly optimized GPU accelerated coe-beam CT recostructio, I Proceedigs of the Nuclear Sciece Symposium ad Medical Imagig Coferece, pp. - (. [] T. Zißer ad B. Keck, Systematic Performace Optimizatio of Coe- Beam Back-Projectio o the Kepler Architecture, I Proceedigs of Fully Three-Dimesioal Image Recostructio i Radiology ad Nuclear Medicie, pp. - (.

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

Cache-Optimal Methods for Bit-Reversals

Cache-Optimal Methods for Bit-Reversals Proceedigs of the ACM/IEEE Supercomputig Coferece, November 1999, Portlad, Orego, U.S.A. Cache-Optimal Methods for Bit-Reversals Zhao Zhag ad Xiaodog Zhag Departmet of Computer Sciece College of William

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

. Written in factored form it is easy to see that the roots are 2, 2, i,

. Written in factored form it is easy to see that the roots are 2, 2, i, CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS SIAM J. SCI. COMPUT. Vol. 22, No. 6, pp. 2113 2134 c 21 Society for Idustrial ad Applied Mathematics FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS ZHAO ZHANG AND XIAODONG ZHANG

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

Instruction and Data Streams

Instruction and Data Streams Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Data Parallelism 1 (vector & SIMD extesios) (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Istructio ad

More information

Fast Fourier Transform (FFT) Algorithms

Fast Fourier Transform (FFT) Algorithms Fast Fourier Trasform FFT Algorithms Relatio to the z-trasform elsewhere, ozero, z x z X x [ ] 2 ~ elsewhere,, ~ e j x X x x π j e z z X X π 2 ~ The DFS X represets evely spaced samples of the z- trasform

More information

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1 Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig

More information

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method

A New Morphological 3D Shape Decomposition: Grayscale Interframe Interpolation Method A ew Morphological 3D Shape Decompositio: Grayscale Iterframe Iterpolatio Method D.. Vizireau Politehica Uiversity Bucharest, Romaia ae@comm.pub.ro R. M. Udrea Politehica Uiversity Bucharest, Romaia mihea@comm.pub.ro

More information

How do we evaluate algorithms?

How do we evaluate algorithms? F2 Readig referece: chapter 2 + slides Algorithm complexity Big O ad big Ω To calculate ruig time Aalysis of recursive Algorithms Next time: Litterature: slides mostly The first Algorithm desig methods:

More information

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS

APPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful

More information

Accuracy Improvement in Camera Calibration

Accuracy Improvement in Camera Calibration Accuracy Improvemet i Camera Calibratio FaJie L Qi Zag ad Reihard Klette CITR, Computer Sciece Departmet The Uiversity of Aucklad Tamaki Campus, Aucklad, New Zealad fli006, qza001@ec.aucklad.ac.z r.klette@aucklad.ac.z

More information

GPUMP: a Multiple-Precision Integer Library for GPUs

GPUMP: a Multiple-Precision Integer Library for GPUs GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract

More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea

FPGA IMPLEMENTATION OF BASE-N LOGARITHM. Salvador E. Tropea FPGA IMPLEMENTATION OF BASE-N LOGARITHM Salvador E. Tropea Electróica e Iformática Istituto Nacioal de Tecología Idustrial Bueos Aires, Argetia email: salvador@iti.gov.ar ABSTRACT I this work, we preset

More information

Chapter 3 Classification of FFT Processor Algorithms

Chapter 3 Classification of FFT Processor Algorithms Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As

More information

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation

Improvement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation Improvemet of the Orthogoal Code Covolutio Capabilities Usig FPGA Implemetatio Naima Kaabouch, Member, IEEE, Apara Dhirde, Member, IEEE, Saleh Faruque, Member, IEEE Departmet of Electrical Egieerig, Uiversity

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must

More information

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions

A Generalized Set Theoretic Approach for Time and Space Complexity Analysis of Algorithms and Functions Proceedigs of the 10th WSEAS Iteratioal Coferece o APPLIED MATHEMATICS, Dallas, Texas, USA, November 1-3, 2006 316 A Geeralized Set Theoretic Approach for Time ad Space Complexity Aalysis of Algorithms

More information

Elementary Educational Computer

Elementary Educational Computer Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045 Oe Brookigs Drive St. Louis, Missouri 63130-4899, USA jaegerg@cse.wustl.edu

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpeCourseWare http://ocw.mit.edu 6.854J / 18.415J Advaced Algorithms Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.415/6.854 Advaced Algorithms

More information

UNIVERSITY OF MORATUWA

UNIVERSITY OF MORATUWA UNIVERSITY OF MORATUWA FACULTY OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING B.Sc. Egieerig 2014 Itake Semester 2 Examiatio CS2052 COMPUTER ARCHITECTURE Time allowed: 2 Hours Jauary 2016

More information

Improving Template Based Spike Detection

Improving Template Based Spike Detection Improvig Template Based Spike Detectio Kirk Smith, Member - IEEE Portlad State Uiversity petra@ee.pdx.edu Abstract Template matchig algorithms like SSE, Covolutio ad Maximum Likelihood are well kow for

More information

Analysis of Server Resource Consumption of Meteorological Satellite Application System Based on Contour Curve

Analysis of Server Resource Consumption of Meteorological Satellite Application System Based on Contour Curve Advaces i Computer, Sigals ad Systems (2018) 2: 19-25 Clausius Scietific Press, Caada Aalysis of Server Resource Cosumptio of Meteorological Satellite Applicatio System Based o Cotour Curve Xiagag Zhao

More information

Data diverse software fault tolerance techniques

Data diverse software fault tolerance techniques Data diverse software fault tolerace techiques Complemets desig diversity by compesatig for desig diversity s s limitatios Ivolves obtaiig a related set of poits i the program data space, executig the

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs What are we goig to lear? CSC316-003 Data Structures Aalysis of Algorithms Computer Sciece North Carolia State Uiversity Need to say that some algorithms are better tha others Criteria for evaluatio Structure

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

IMP: Superposer Integrated Morphometrics Package Superposition Tool

IMP: Superposer Integrated Morphometrics Package Superposition Tool IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

A Study on the Performance of Cholesky-Factorization using MPI

A Study on the Performance of Cholesky-Factorization using MPI A Study o the Performace of Cholesky-Factorizatio usig MPI Ha S. Kim Scott B. Bade Departmet of Computer Sciece ad Egieerig Uiversity of Califoria Sa Diego {hskim, bade}@cs.ucsd.edu Abstract Cholesky-factorizatio

More information

condition w i B i S maximum u i

condition w i B i S maximum u i ecture 10 Dyamic Programmig 10.1 Kapsack Problem November 1, 2004 ecturer: Kamal Jai Notes: Tobias Holgers We are give a set of items U = {a 1, a 2,..., a }. Each item has a weight w i Z + ad a utility

More information

Ones Assignment Method for Solving Traveling Salesman Problem

Ones Assignment Method for Solving Traveling Salesman Problem Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history:

More information

Data Structures and Algorithms. Analysis of Algorithms

Data Structures and Algorithms. Analysis of Algorithms Data Structures ad Algorithms Aalysis of Algorithms Outlie Ruig time Pseudo-code Big-oh otatio Big-theta otatio Big-omega otatio Asymptotic algorithm aalysis Aalysis of Algorithms Iput Algorithm Output

More information

Pattern Recognition Systems Lab 1 Least Mean Squares

Pattern Recognition Systems Lab 1 Least Mean Squares Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig

More information

( n+1 2 ) , position=(7+1)/2 =4,(median is observation #4) Median=10lb

( n+1 2 ) , position=(7+1)/2 =4,(median is observation #4) Median=10lb Chapter 3 Descriptive Measures Measures of Ceter (Cetral Tedecy) These measures will tell us where is the ceter of our data or where most typical value of a data set lies Mode the value that occurs most

More information

An Efficient Algorithm for Graph Bisection of Triangularizations

An Efficient Algorithm for Graph Bisection of Triangularizations Applied Mathematical Scieces, Vol. 1, 2007, o. 25, 1203-1215 A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045, Oe

More information

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved.

Chapter 11. Friends, Overloaded Operators, and Arrays in Classes. Copyright 2014 Pearson Addison-Wesley. All rights reserved. Chapter 11 Frieds, Overloaded Operators, ad Arrays i Classes Copyright 2014 Pearso Addiso-Wesley. All rights reserved. Overview 11.1 Fried Fuctios 11.2 Overloadig Operators 11.3 Arrays ad Classes 11.4

More information

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON Roberto Lopez ad Eugeio Oñate Iteratioal Ceter for Numerical Methods i Egieerig (CIMNE) Edificio C1, Gra Capitá s/, 08034 Barceloa, Spai ABSTRACT I this work

More information

Lecture 1: Introduction and Strassen s Algorithm

Lecture 1: Introduction and Strassen s Algorithm 5-750: Graduate Algorithms Jauary 7, 08 Lecture : Itroductio ad Strasse s Algorithm Lecturer: Gary Miller Scribe: Robert Parker Itroductio Machie models I this class, we will primarily use the Radom Access

More information

CS2410 Computer Architecture. Flynn s Taxonomy

CS2410 Computer Architecture. Flynn s Taxonomy CS2410 Computer Architecture Dept. of Computer Sciece Uiversity of Pittsburgh http://www.cs.pitt.edu/~melhem/courses/2410p/idex.html 1 Fly s Taxoomy SISD Sigle istructio stream Sigle data stream (SIMD)

More information

Algorithms for Disk Covering Problems with the Most Points

Algorithms for Disk Covering Problems with the Most Points Algorithms for Disk Coverig Problems with the Most Poits Bi Xiao Departmet of Computig Hog Kog Polytechic Uiversity Hug Hom, Kowloo, Hog Kog csbxiao@comp.polyu.edu.hk Qigfeg Zhuge, Yi He, Zili Shao, Edwi

More information

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *

Load balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers * Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software Structurig Redudacy for Fault Tolerace CSE 598D: Fault Tolerat Software What do we wat to achieve? Versios Damage Assessmet Versio 1 Error Detectio Iputs Versio 2 Voter Outputs State Restoratio Cotiued

More information

APPLICATION NOTE. Automated Gain Flattening. 1. Experimental Setup. Scope and Overview

APPLICATION NOTE. Automated Gain Flattening. 1. Experimental Setup. Scope and Overview APPLICATION NOTE Automated Gai Flatteig Scope ad Overview A flat optical power spectrum is essetial for optical telecommuicatio sigals. This stems from a eed to balace the chael powers across large distaces.

More information

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis Outlie ad Readig Aalysis of Algorithms Iput Algorithm Output Ruig time ( 3.) Pseudo-code ( 3.2) Coutig primitive operatios ( 3.3-3.) Asymptotic otatio ( 3.6) Asymptotic aalysis ( 3.7) Case study Aalysis

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware A Overview Graphics System Moitor Iput devices CPU/Memory GPU Raster Graphics System Raster: A array of picture elemets Based o raster-sca TV techology The scree (ad a picture)

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

Wavelet Transform. CSE 490 G Introduction to Data Compression Winter Wavelet Transformed Barbara (Enhanced) Wavelet Transformed Barbara (Actual)

Wavelet Transform. CSE 490 G Introduction to Data Compression Winter Wavelet Transformed Barbara (Enhanced) Wavelet Transformed Barbara (Actual) Wavelet Trasform CSE 49 G Itroductio to Data Compressio Witer 6 Wavelet Trasform Codig PACW Wavelet Trasform A family of atios that filters the data ito low resolutio data plus detail data high pass filter

More information

Analysis of Algorithms

Analysis of Algorithms Presetatio for use with the textbook, Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Aalysis of Algorithms Iput 2015 Goodrich ad Tamassia Algorithm Aalysis of Algorithms

More information

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems

More information

The golden search method: Question 1

The golden search method: Question 1 1. Golde Sectio Search for the Mode of a Fuctio The golde search method: Questio 1 Suppose the last pair of poits at which we have a fuctio evaluatio is x(), y(). The accordig to the method, If f(x())

More information

Lecture 6. Lecturer: Ronitt Rubinfeld Scribes: Chen Ziv, Eliav Buchnik, Ophir Arie, Jonathan Gradstein

Lecture 6. Lecturer: Ronitt Rubinfeld Scribes: Chen Ziv, Eliav Buchnik, Ophir Arie, Jonathan Gradstein 068.670 Subliear Time Algorithms November, 0 Lecture 6 Lecturer: Roitt Rubifeld Scribes: Che Ziv, Eliav Buchik, Ophir Arie, Joatha Gradstei Lesso overview. Usig the oracle reductio framework for approximatig

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO

DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO Sagwo Seo, Trevor Mudge Advaced Computer Architecture Laboratory Uiversity of Michiga at A Arbor {swseo, tm}@umich.edu Yumig Zhu, Chaitali

More information

1&1 Next Level Hosting

1&1 Next Level Hosting 1&1 Next Level Hostig Performace Level: Performace that grows with your requiremets Copyright 1&1 Iteret SE 2017 1ad1.com 2 1&1 NEXT LEVEL HOSTING 3 Fast page loadig ad short respose times play importat

More information

Bayesian approach to reliability modelling for a probability of failure on demand parameter

Bayesian approach to reliability modelling for a probability of failure on demand parameter Bayesia approach to reliability modellig for a probability of failure o demad parameter BÖRCSÖK J., SCHAEFER S. Departmet of Computer Architecture ad System Programmig Uiversity Kassel, Wilhelmshöher Allee

More information

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only

Bezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only Edited: Yeh-Liag Hsu (998--; recommeded: Yeh-Liag Hsu (--9; last updated: Yeh-Liag Hsu (9--7. Note: This is the course material for ME55 Geometric modelig ad computer graphics, Yua Ze Uiversity. art of

More information

A Note on Least-norm Solution of Global WireWarping

A Note on Least-norm Solution of Global WireWarping A Note o Least-orm Solutio of Global WireWarpig Charlie C. L. Wag Departmet of Mechaical ad Automatio Egieerig The Chiese Uiversity of Hog Kog Shati, N.T., Hog Kog E-mail: cwag@mae.cuhk.edu.hk Abstract

More information

One advantage that SONAR has over any other music-sequencing product I ve worked

One advantage that SONAR has over any other music-sequencing product I ve worked *gajedra* D:/Thomso_Learig_Projects/Garrigus_163132/z_productio/z_3B2_3D_files/Garrigus_163132_ch17.3d, 14/11/08/16:26:39, 16:26, page: 647 17 CAL 101 Oe advatage that SONAR has over ay other music-sequecig

More information

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA

Creating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA Creatig Exact Bezier Represetatios of CST Shapes David D. Marshall Califoria Polytechic State Uiversity, Sa Luis Obispo, CA 93407-035, USA The paper presets a method of expressig CST shapes pioeered by

More information

Evaluation scheme for Tracking in AMI

Evaluation scheme for Tracking in AMI A M I C o m m u i c a t i o A U G M E N T E D M U L T I - P A R T Y I N T E R A C T I O N http://www.amiproject.org/ Evaluatio scheme for Trackig i AMI S. Schreiber a D. Gatica-Perez b AMI WP4 Trackig:

More information

EE123 Digital Signal Processing

EE123 Digital Signal Processing Last Time EE Digital Sigal Processig Lecture 7 Block Covolutio, Overlap ad Add, FFT Discrete Fourier Trasform Properties of the Liear covolutio through circular Today Liear covolutio with Overlap ad add

More information

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 10: Caches Prof. Yajig Li Uiversity of Chicago Midterm Recap Overview ad fudametal cocepts ISA Uarch Datapath, cotrol Sigle cycle, multi cycle Pipeliig Basic idea,

More information

3D Model Retrieval Method Based on Sample Prediction

3D Model Retrieval Method Based on Sample Prediction 20 Iteratioal Coferece o Computer Commuicatio ad Maagemet Proc.of CSIT vol.5 (20) (20) IACSIT Press, Sigapore 3D Model Retrieval Method Based o Sample Predictio Qigche Zhag, Ya Tag* School of Computer

More information

Math Section 2.2 Polynomial Functions

Math Section 2.2 Polynomial Functions Math 1330 - Sectio. Polyomial Fuctios Our objectives i workig with polyomial fuctios will be, first, to gather iformatio about the graph of the fuctio ad, secod, to use that iformatio to geerate a reasoably

More information

Algorithm. Counting Sort Analysis of Algorithms

Algorithm. Counting Sort Analysis of Algorithms Algorithm Coutig Sort Aalysis of Algorithms Assumptios: records Coutig sort Each record cotais keys ad data All keys are i the rage of 1 to k Space The usorted list is stored i A, the sorted list will

More information

Operating System Concepts. Operating System Concepts

Operating System Concepts. Operating System Concepts Chapter 4: Mass-Storage Systems Logical Disk Structure Logical Disk Structure Disk Schedulig Disk Maagemet RAID Structure Disk drives are addressed as large -dimesioal arrays of logical blocks, where the

More information

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab2 due toight Exam I: covers lectures 1-9 Ope book, ope otes, close device

More information

Performance Plus Software Parameter Definitions

Performance Plus Software Parameter Definitions Performace Plus+ Software Parameter Defiitios/ Performace Plus Software Parameter Defiitios Chapma Techical Note-TG-5 paramete.doc ev-0-03 Performace Plus+ Software Parameter Defiitios/2 Backgroud ad Defiitios

More information

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 13 Control and Sequencing: Hardwired and Microprogrammed Control

EE 459/500 HDL Based Digital Design with Programmable Logic. Lecture 13 Control and Sequencing: Hardwired and Microprogrammed Control EE 459/500 HDL Based Digital Desig with Programmable Logic Lecture 13 Cotrol ad Sequecig: Hardwired ad Microprogrammed Cotrol Refereces: Chapter s 4,5 from textbook Chapter 7 of M.M. Mao ad C.R. Kime,

More information

The isoperimetric problem on the hypercube

The isoperimetric problem on the hypercube The isoperimetric problem o the hypercube Prepared by: Steve Butler November 2, 2005 1 The isoperimetric problem We will cosider the -dimesioal hypercube Q Recall that the hypercube Q is a graph whose

More information

Isn t It Time You Got Faster, Quicker?

Isn t It Time You Got Faster, Quicker? Is t It Time You Got Faster, Quicker? AltiVec Techology At-a-Glace OVERVIEW Motorola s advaced AltiVec techology is desiged to eable host processors compatible with the PowerPC istructio-set architecture

More information

Cubic Polynomial Curves with a Shape Parameter

Cubic Polynomial Curves with a Shape Parameter roceedigs of the th WSEAS Iteratioal Coferece o Robotics Cotrol ad Maufacturig Techology Hagzhou Chia April -8 00 (pp5-70) Cubic olyomial Curves with a Shape arameter MO GUOLIANG ZHAO YANAN Iformatio ad

More information

Chapter 3. Floating Point Arithmetic

Chapter 3. Floating Point Arithmetic COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 3 Floatig Poit Arithmetic Review - Multiplicatio 0 1 1 0 = 6 multiplicad 32-bit ALU shift product right multiplier add

More information

Computer Architecture ELEC3441

Computer Architecture ELEC3441 CPU-Memory Bottleeck Computer Architecture ELEC44 CPU Memory Lecture 8 Cache Dr. Hayde Kwok-Hay So Departmet of Electrical ad Electroic Egieerig Performace of high-speed computers is usually limited by

More information

Computer Systems - HS

Computer Systems - HS What have we leared so far? Computer Systems High Level ENGG1203 2d Semester, 2017-18 Applicatios Sigals Systems & Cotrol Systems Computer & Embedded Systems Digital Logic Combiatioal Logic Sequetial Logic

More information

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu UH-MEM: Utility-Based Hybrid Memory Maagemet Yag Li, Saugata Ghose, Jogmoo Choi, Ji Su, Hui Wag, Our Mutlu 1 Executive Summary DRAM faces sigificat techology scalig difficulties Emergig memory techologies

More information

Normals. In OpenGL the normal vector is part of the state Set by glnormal*()

Normals. In OpenGL the normal vector is part of the state Set by glnormal*() Ray Tracig 1 Normals OpeG the ormal vector is part of the state Set by glnormal*() -glnormal3f(x, y, z); -glnormal3fv(p); Usually we wat to set the ormal to have uit legth so cosie calculatios are correct

More information

End Semester Examination CSE, III Yr. (I Sem), 30002: Computer Organization

End Semester Examination CSE, III Yr. (I Sem), 30002: Computer Organization Ed Semester Examiatio 2013-14 CSE, III Yr. (I Sem), 30002: Computer Orgaizatio Istructios: GROUP -A 1. Write the questio paper group (A, B, C, D), o frot page top of aswer book, as per what is metioed

More information

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design College of Computer ad Iformatio Scieces Departmet of Computer Sciece CSC 220: Computer Orgaizatio Uit 11 Basic Computer Orgaizatio ad Desig 1 For the rest of the semester, we ll focus o computer architecture:

More information

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions

Lecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions U.C. Berkeley CS170 : Algorithms Midterm 1 Solutios Lecturers: Sajam Garg ad Prasad Raghavedra Feb 1, 017 Midterm 1 Solutios 1. (4 poits) For the directed graph below, fid all the strogly coected compoets

More information

Announcements. Reading. Project #4 is on the web. Homework #1. Midterm #2. Chapter 4 ( ) Note policy about project #3 missing components

Announcements. Reading. Project #4 is on the web. Homework #1. Midterm #2. Chapter 4 ( ) Note policy about project #3 missing components Aoucemets Readig Chapter 4 (4.1-4.2) Project #4 is o the web ote policy about project #3 missig compoets Homework #1 Due 11/6/01 Chapter 6: 4, 12, 24, 37 Midterm #2 11/8/01 i class 1 Project #4 otes IPv6Iit,

More information

Lecture 28: Data Link Layer

Lecture 28: Data Link Layer Automatic Repeat Request (ARQ) 2. Go ack N ARQ Although the Stop ad Wait ARQ is very simple, you ca easily show that it has very the low efficiecy. The low efficiecy comes from the fact that the trasmittig

More information

EE University of Minnesota. Midterm Exam #1. Prof. Matthew O'Keefe TA: Eric Seppanen. Department of Electrical and Computer Engineering

EE University of Minnesota. Midterm Exam #1. Prof. Matthew O'Keefe TA: Eric Seppanen. Department of Electrical and Computer Engineering EE 4363 1 Uiversity of Miesota Midterm Exam #1 Prof. Matthew O'Keefe TA: Eric Seppae Departmet of Electrical ad Computer Egieerig Uiversity of Miesota Twi Cities Campus EE 4363 Itroductio to Microprocessors

More information

Big-O Analysis. Asymptotics

Big-O Analysis. Asymptotics Big-O Aalysis 1 Defiitio: Suppose that f() ad g() are oegative fuctios of. The we say that f() is O(g()) provided that there are costats C > 0 ad N > 0 such that for all > N, f() Cg(). Big-O expresses

More information

A General Framework for Accurate Statistical Timing Analysis Considering Correlations

A General Framework for Accurate Statistical Timing Analysis Considering Correlations A Geeral Framework for Accurate Statistical Timig Aalysis Cosiderig Correlatios 7.4 Vishal Khadelwal Departmet of ECE Uiversity of Marylad-College Park vishalk@glue.umd.edu Akur Srivastava Departmet of

More information

THIN LAYER ORIENTED MAGNETOSTATIC CALCULATION MODULE FOR ELMER FEM, BASED ON THE METHOD OF THE MOMENTS. Roman Szewczyk

THIN LAYER ORIENTED MAGNETOSTATIC CALCULATION MODULE FOR ELMER FEM, BASED ON THE METHOD OF THE MOMENTS. Roman Szewczyk THIN LAYER ORIENTED MAGNETOSTATIC CALCULATION MODULE FOR ELMER FEM, BASED ON THE METHOD OF THE MOMENTS Roma Szewczyk Istitute of Metrology ad Biomedical Egieerig, Warsaw Uiversity of Techology E-mail:

More information

Mobile terminal 3D image reconstruction program development based on Android Lin Qinhua

Mobile terminal 3D image reconstruction program development based on Android Lin Qinhua Iteratioal Coferece o Automatio, Mechaical Cotrol ad Computatioal Egieerig (AMCCE 05) Mobile termial 3D image recostructio program developmet based o Adroid Li Qihua Sichua Iformatio Techology College

More information

Chapter 5. Functions for All Subtasks. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 5. Functions for All Subtasks. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 5 Fuctios for All Subtasks Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 5.1 void Fuctios 5.2 Call-By-Referece Parameters 5.3 Usig Procedural Abstractio 5.4 Testig ad Debuggig

More information