Fast Interpolation of Grid Data at a Non-Grid Point


Hiroshi Inoue
IBM Research - Tokyo, Tokyo, Japan
inouehrs@jp.ibm.com

Abstract

Defining data at a non-grid point by interpolating grid data is a common operation in many workloads, including scientific applications and imaging applications. This paper describes our technique to accelerate this interpolation operation and shows its performance benefit using 3D computed tomography (CT) reconstruction. 3D CT is one of the compute-intensive medical imaging applications that frequently interpolate grid data (2D images) at a non-grid point. To efficiently execute this operation with SIMD instructions, we create an in-memory pre-computed table from the input 2D image at runtime before projecting voxels onto each image. The table reduces the amount of computation and avoids non-contiguous memory accesses that attenuate the benefits of SIMD instructions. We implemented and evaluated our pre-computation technique using a bilinear interpolation and a 3rd-degree Lagrange interpolation on POWER8 processors; it yields up to 7% and 7% performance improvements in the RabbitCT benchmark for the two interpolation algorithms, respectively.

Keywords - grid-based interpolation, 3D CT reconstruction, back projection, SIMD

I. INTRODUCTION

Interpolation of grid data to obtain data at a non-grid point is a common operation in many workloads, including imaging applications and scientific stencil applications. The interpolation operation is often quite costly, especially for large grid data (such as high-resolution images). One reason is the large number of arithmetic instructions needed to calculate the interpolated value. Another reason is its scattered memory access pattern; reading values from the surrounding pixels requires non-contiguous memory accesses, which reduce the efficiency of SIMD instructions even with the hardware gather/scatter support of the latest processors. In this paper, we aim to accelerate the interpolation of grid data using a new algorithm.
To make the benefits and overhead of our new algorithm clear, we conduct thorough evaluations of our technique using a three-dimensional (3D) computed tomography (CT) workload [] with two interpolation algorithms. 3D CT is one of the medical imaging workloads that frequently interpolate grid data (2D images). It reconstructs a structure in a 3D volume by back-projecting multiple projection images from different angles into the 3D volume using grid-based interpolation. Figure 1 shows a schematic overview of a 3D cone-beam CT system. An X-ray point source and a flat-panel X-ray detector move around the 3D object to capture 2D projection images from different angles. RabbitCT [] is an open framework for benchmarking the back-projection algorithms in the FDK algorithm [], a widely used non-iterative CT reconstruction algorithm. For each projection image, the FDK algorithm projects each volume element (voxel) in the 3D volume onto the projection image based on the transformation matrix (projection matrix) and then obtains the value at the projected point by performing a bilinear interpolation of the neighboring four pixels. The interpolation is critically important for image quality but causes significant overheads in the computation time [4]. Due to the advances in imaging hardware, the size and resolution of input images are increasing; hence, the computation performance of the 3D reconstruction is also becoming more important for the usability of the overall CT system.

In this paper, we describe our new technique to accelerate the interpolation operation, and hence the entire CT workload, by significantly reducing the number of arithmetic instructions required and by avoiding non-contiguous memory accesses. For each input projection image, we create a pre-computed table in memory at runtime, and the reconstruction loop accesses the pre-computed table instead of the projection image. For realistic resolutions, the benefit of using the pre-computed table in the reconstruction phase is much larger than the overhead of the pre-computation.
We implemented and evaluated this new technique on a two-socket POWER8 system; our technique improved the performance of the RabbitCT benchmark by up to 7%. We also evaluated it using a 3rd-degree Lagrange interpolation by enhancing RabbitCT and observed up to 7% speedups.

The basic idea of our pre-computation technique can be applied to a much wider range of applications and hardware than those we evaluate in this paper. Although we evaluated our technique using a non-iterative CT reconstruction algorithm, it can be applied to other imaging applications. For example, some iterative CT reconstruction and image-registration algorithms are good targets for our technique because they also interpolate data frequently. The key metric for the performance benefit is the ratio between the number of grid points (pixels) in the data and the number of executed interpolation operations. Furthermore, certain non-imaging stencil applications are also potential targets. For instance, particle-in-cell simulations of two-phase flows [] frequently interpolate the surrounding grid (mesh) points to obtain the flow parameters at the locations of the particles. Hence, they are interesting targets for our pre-computation technique. Our technique does not use features specific to the processor; hence, it can be implemented on other CPUs or GPUs.

Figure 1. Overview of 3D cone-beam CT. A flat-panel detector and an X-ray point source move around the 3D object to be imaged to capture 2D projection images from different angles.

II. OUR PRE-COMPUTATION TECHNIQUE

In this section, we first describe the baseline back-projection algorithm and then detail how we enhance it to achieve higher performance by pre-computation.

A. Baseline Reconstruction Algorithm []

RabbitCT evaluates the performance and accuracy of the back-projection operation of the FDK algorithm [] for 3D volume reconstruction. RabbitCT provides 496 projection images, whose size S_x × S_y is 1248 × 960 pixels, with a projection matrix for each image. The intensity value of each pixel is represented by a single-precision floating-point value, and each element of the 3 × 4 projection matrix is represented by a double-precision floating-point value. We denote the n-th projection image as I_n. For each of the projection images, the reconstruction algorithm projects all voxels onto I_n and updates the density value of each voxel based on the intensity value at the projected point, which is obtained by interpolating the surrounding pixels. The number of voxels along each side of the volume, i.e., the problem size, is L = 128, 256, 512, or 1024. Among these four problem sizes, L = 512 is the most common, and L = 128 is a toy benchmark that is mostly used for debugging (hence it is not considered for ranking). Although different numbers of voxels are used, the same I_n are used as input for all problem sizes.

The voxel, whose 3D position is denoted as {x, y, z}, is projected onto point {u_n, v_n} in I_n as follows. Here, a_0 to a_11 are the elements of the projection matrix, which is determined by the system geometry and provided by the framework for each I_n.

\[
\begin{aligned}
u_n(x,y,z) &= \frac{a_0 x + a_3 y + a_6 z + a_9}{w_n(x,y,z)}, \\
v_n(x,y,z) &= \frac{a_1 x + a_4 y + a_7 z + a_{10}}{w_n(x,y,z)}, \\
w_n(x,y,z) &= a_2 x + a_5 y + a_8 z + a_{11}.
\end{aligned}
\tag{1}
\]
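The projection of one voxel can be illustrated with a short scalar sketch. It assumes the standard RabbitCT convention, in which the 3 × 4 projection matrix is stored column by column as twelve values a[0] to a[11] and the homogeneous coordinate w divides u and v; the function name `project` is hypothetical and is not part of RabbitCT.

```python
def project(a, x, y, z):
    """Project voxel position (x, y, z) onto a projection image.

    `a` holds the 12 elements of the 3x4 projection matrix in column-major
    order (an assumption of this sketch). Returns (u, v, w).
    """
    w = a[2] * x + a[5] * y + a[8] * z + a[11]          # homogeneous coordinate
    u = (a[0] * x + a[3] * y + a[6] * z + a[9]) / w     # horizontal position
    v = (a[1] * x + a[4] * y + a[7] * z + a[10]) / w    # vertical position
    return u, v, w


# Toy matrix: u = x / 2, v = y / 2, w = 2 for every voxel.
toy_matrix = [1, 0, 0,  0, 1, 0,  0, 0, 0,  0, 0, 2]
u, v, w = project(toy_matrix, 3.0, 4.0, 5.0)
```

The division by w is the operation that optimized implementations typically replace with a reciprocal estimate followed by a multiplication.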
I_n is defined only at grid points; hence, the intensity value at the projected point {u_n, v_n}, which we denote as p̂_n(u_n, v_n), is obtained by performing a bilinear interpolation from the intensity values p_n at the neighboring four pixels by

\[
\begin{aligned}
\hat{p}_n(u,v) ={}& (1-\alpha)(1-\beta)\,p_n(i,j) + \alpha(1-\beta)\,p_n(i+1,j) \\
&+ (1-\alpha)\beta\,p_n(i,j+1) + \alpha\beta\,p_n(i+1,j+1), \\
i = \lfloor u \rfloor,\quad & j = \lfloor v \rfloor,\quad \alpha = u - \lfloor u \rfloor,\quad \beta = v - \lfloor v \rfloor.
\end{aligned}
\tag{2}
\]

Here, p_n(i, j) equals I_n(i, j) if position {i, j} is within I_n; otherwise, p_n(i, j) is zero. Figure 2 shows an overview of the projected point and the neighboring pixels.

Figure 2. Overview of projected point {u, v} and neighboring four grid points.

The density value of the voxel f(x, y, z) is calculated as

\[
f(x,y,z) = \sum_{n=1}^{N} \frac{\hat{p}_n(u_n, v_n)}{w_n(x,y,z)^2}.
\tag{3}
\]

Since this reconstruction algorithm has inherently huge parallelism, we can accelerate its execution by exploiting thread-level parallelism with multiple cores and data parallelism with SIMD instructions. In this algorithm, the computation performance of the processor cores is the primary bottleneck. Since the accesses to the 2D images have spatial locality, i.e., neighboring voxels are typically mapped onto the same or a nearby pixel in the 2D image, they do not cause too frequent cache misses. The accesses to the voxel data, which are mostly sequential, may become a bottleneck depending on the memory system performance, but we can reduce the memory bandwidth required for the voxel data by using a technique similar to temporal blocking [, 7].

In addition to the execution time, RabbitCT reports the errors from the reference implementation to evaluate the accuracy of the reconstruction algorithm. The reference implementation is a straightforward implementation of the algorithm that uses double-precision floating-point numbers during the computation.

B. Our Pre-Computation Technique

The bilinear interpolation with Equation 2 is the most time-consuming part of the overall execution time of the CT reconstruction. However, the interpolation is critically important for the overall quality of the reconstructed image; hence, we cannot simply skip the interpolation.
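As a point of reference for the baseline, the bilinear interpolation with zero values outside the image can be written as a minimal scalar sketch. The image is assumed to be a list of rows indexed as img[j][i], and the helper names `pixel` and `bilinear` are hypothetical; the code does not reflect the vectorized implementation.

```python
import math


def pixel(img, i, j):
    """Return the pixel at column i, row j, or zero outside the image."""
    inside = 0 <= j < len(img) and 0 <= i < len(img[0])
    return img[j][i] if inside else 0.0


def bilinear(img, u, v):
    """Bilinear interpolation of img at the non-grid point (u, v)."""
    i, j = math.floor(u), math.floor(v)
    alpha, beta = u - i, v - j
    return ((1 - alpha) * (1 - beta) * pixel(img, i, j)
            + alpha * (1 - beta) * pixel(img, i + 1, j)
            + (1 - alpha) * beta * pixel(img, i, j + 1)
            + alpha * beta * pixel(img, i + 1, j + 1))


image = [[0.0, 10.0],
         [20.0, 30.0]]
center = bilinear(image, 0.5, 0.5)   # average of the four pixels
edge = bilinear(image, 1.5, 1.0)     # half of the pixel lies outside the image
```

Each call reads four pixels from two different rows, which is the non-contiguous access pattern that makes this step hard to vectorize.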
In addition to the large number of arithmetic instructions needed to calculate Equation 2, vectorizing this equation is another high hurdle for higher performance. It is known that vectorizing Equation 2 is inefficient with SIMD instructions, while it is almost straightforward to efficiently vectorize the other parts of the algorithm, including Equations 1 and 3 [4]. This inefficiency is due to the non-contiguous and unaligned memory accesses needed to read I_n from four pixels. A SIMD load instruction can load multiple elements only when they are contiguous in memory; hence, non-contiguous memory access incurs the overhead of executing multiple load instructions and mixing the loaded elements. Although some of the latest processors, such as Intel Haswell, support gather and scatter memory access instructions to load from or store to non-contiguous memory areas, they are still slower than contiguous memory accesses.

To efficiently calculate Equation 2 by reducing the redundant computation and non-contiguous memory accesses, we group terms by u and v, the factors that depend on the current voxel location; we modify the equation as follows:

\[
\begin{aligned}
\hat{p}_n(u,v) ={}& (1-\alpha)(1-\beta)\,p_n(i,j) + \alpha(1-\beta)\,p_n(i+1,j) + (1-\alpha)\beta\,p_n(i,j+1) + \alpha\beta\,p_n(i+1,j+1) \\
={}& u\,v\,C_0(i,j) + u\,C_1(i,j) + v\,C_2(i,j) + C_3(i,j).
\end{aligned}
\tag{4}
\]

Here, the coefficients C_0 to C_3 are calculated as

\[
\begin{aligned}
C_0(i,j) &= p_n(i,j) + p_n(i+1,j+1) - p_n(i+1,j) - p_n(i,j+1), \\
C_1(i,j) &= (j+1)\bigl(p_n(i+1,j) - p_n(i,j)\bigr) + j\,\bigl(p_n(i,j+1) - p_n(i+1,j+1)\bigr), \\
C_2(i,j) &= (i+1)\bigl(p_n(i,j+1) - p_n(i,j)\bigr) + i\,\bigl(p_n(i+1,j) - p_n(i+1,j+1)\bigr), \\
C_3(i,j) &= (i+1)(j+1)\,p_n(i,j) + i\,j\,p_n(i+1,j+1) - (i+1)\,j\,p_n(i,j+1) - i\,(j+1)\,p_n(i+1,j).
\end{aligned}
\tag{5}
\]

The key point is that these coefficients C_0 to C_3 for each pixel in the 2D projection image are independent of u and v (and hence independent of x, y, and z) by definition. Hence, we can pre-compute these coefficients before executing the reconstruction and store the results in an in-memory pre-computed table. Since the number of voxels (L^3) is much larger than the number of pixels in a projection image (S_x × S_y), multiple voxels are projected onto the same pixel position {i, j}. Hence, we have a huge opportunity to reuse the computed coefficients for interpolation. Once we pre-compute the coefficients for each pixel of the 2D input image, we can use lightweight interpolation with Equation 4. Equation 4 can be computed much more efficiently than Equation 2; it can be computed using only three multiply-and-add instructions as

\[
\hat{p}_n(u,v) = u\,\bigl(v\,C_0(i,j) + C_1(i,j)\bigr) + \bigl(v\,C_2(i,j) + C_3(i,j)\bigr),
\]

while Equation 2 requires instructions to be computed in our implementation even with the multiply-and-add instructions. Figure 3 shows the pseudocode of the entire reconstruction algorithm with and without our pre-computation.

    with our pre-computation

    for each projection image I_n (n = 0 to N - 1)
        // pre-computation phase
        for i = 0 to Sx - 1
            for j = 0 to Sy - 1
                calculate C0 to C3 by Equation 5 and
                    store into pre-computed table
            end
        end
        // reconstruction phase
        for Iz = 0 to L - 1
            for Iy = 0 to L - 1
                determine range of Ix to iterate
                for Ix = Ix_start to Ix_end
                    project voxel at (Ix, Iy, Iz) onto I_n
                    calculate p̂ by Equation 4
                    update density of voxel
                end
            end
        end
    end

    without our pre-computation

    for each projection image I_n (n = 0 to N - 1)
        // create zero-padded copy [4]
        for i = 0 to Sx - 1
            for j = 0 to Sy - 1
                copy I_n(i, j) into separate memory buffer (see Section III.A)
            end
        end
        // reconstruction phase
        for Iz = 0 to L - 1
            for Iy = 0 to L - 1
                determine range of Ix to iterate
                for Ix = Ix_start to Ix_end
                    project voxel at (Ix, Iy, Iz) onto I_n
                    calculate p̂ by Equation 2
                    update density of voxel
                end
            end
        end
    end

    R_L: resolution (size of a voxel), O_L: origin

Figure 3. Pseudocode of reconstruction with and without our pre-computation technique.

Our pre-computation technique uses additional memory space for the pre-computed table. The size of the table is four times as large as I_n because we hold four values, C_0(i, j) to C_3(i, j), per pixel for the bilinear interpolation. However, the overheads in memory space and in memory bandwidth are not significant compared to the space and bandwidth for other data structures, especially the density values of voxels. Access to the pre-computed table causes more cache misses than access to I_n because of its larger footprint. However, the benefit of the reduced computation was much more significant than the drawback of the increased number of cache misses.
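The equivalence between the grouped form and the original bilinear formula can be checked with a small scalar sketch. The dictionary-based table, the image layout img[j][i], and the names `precompute_table`, `interpolate`, and `bilinear_reference` are assumptions made for illustration; the implementation described here stores the four coefficients contiguously in memory and uses SIMD instructions instead.

```python
import math


def pixel(img, i, j):
    """Return the pixel at column i, row j, or zero outside the image."""
    inside = 0 <= j < len(img) and 0 <= i < len(img[0])
    return img[j][i] if inside else 0.0


def precompute_table(img):
    """Compute (C0, C1, C2, C3) for every cell that touches the image."""
    table = {}
    for j in range(-1, len(img)):
        for i in range(-1, len(img[0])):
            p00, p10 = pixel(img, i, j), pixel(img, i + 1, j)
            p01, p11 = pixel(img, i, j + 1), pixel(img, i + 1, j + 1)
            c0 = p00 + p11 - p10 - p01
            c1 = (j + 1) * (p10 - p00) + j * (p01 - p11)
            c2 = (i + 1) * (p01 - p00) + i * (p10 - p11)
            c3 = ((i + 1) * (j + 1) * p00 + i * j * p11
                  - (i + 1) * j * p01 - i * (j + 1) * p10)
            table[(i, j)] = (c0, c1, c2, c3)
    return table


def interpolate(table, u, v):
    """Evaluate the grouped form with three multiply-and-add steps."""
    cell = (math.floor(u), math.floor(v))
    c0, c1, c2, c3 = table.get(cell, (0.0, 0.0, 0.0, 0.0))
    return u * (v * c0 + c1) + (v * c2 + c3)


def bilinear_reference(img, u, v):
    """Direct bilinear interpolation, used only to check the table."""
    i, j = math.floor(u), math.floor(v)
    a, b = u - i, v - j
    return ((1 - a) * (1 - b) * pixel(img, i, j) + a * (1 - b) * pixel(img, i + 1, j)
            + (1 - a) * b * pixel(img, i, j + 1) + a * b * pixel(img, i + 1, j + 1))


image = [[0.0, 10.0, 4.0],
         [20.0, 30.0, 8.0],
         [5.0, 1.0, 2.0]]
table = precompute_table(image)
value = interpolate(table, 1.25, 0.75)
```

The table is built once per image, and every later lookup touches a single cell, so the four reads of the direct formula collapse into one contiguous read of four coefficients.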
In addition to the reduced computation, which is beneficial regardless of the processor type, another benefit of using the pre-computed table is SIMD friendliness. By storing C_0 to C_3 for each pixel contiguously in memory, aligned on 128-bit address boundaries, we can totally avoid non-contiguous or unaligned memory accesses for I_n because these four values are enough to compute p̂_n(u, v), as shown in Equation 4, and we do not need to access I_n after the pre-computation phase. To fully exploit the data parallelism of SIMD instructions, each iteration of the inner-most loop of our algorithm loads C_0 to C_3 for four points by using four vector load instructions and then transposes the values in vector registers using permutation instructions to pack the same coefficient of four points into one vector register. This transposition is also used by a previous implementation [4] and by our baseline implementation for efficient use of SIMD instructions.

Although C_0 to C_3 stored in the pre-computed table are single-precision numbers, we execute the pre-computation phase using double-precision numbers and convert the resulting numbers to single-precision numbers just before storing them in the pre-computed table. When we use single-precision numbers during the pre-computation, we observe degradation in accuracy. This is because rounding errors in the coefficients are magnified by multiplying by u and v in Equation 4. This degradation is not significant when we use double-precision numbers in the pre-computation phase.

C. Application to Higher Degree Interpolation Algorithms

We can apply the basic idea of our pre-computation to higher degree interpolation algorithms. In many scientific and imaging applications, bilinear interpolation cannot provide sufficient accuracy; hence, higher order interpolation algorithms are often used in real-world applications []. In this paper, we implemented and evaluated a 3rd-degree Lagrange polynomial interpolation, which uses 4 × 4 pixels (instead of the 2 × 2 pixels of the bilinear interpolation) to obtain the interpolated value. For the 3rd-degree Lagrange interpolation, we need 16 coefficients per pixel in the pre-computed table. In general, the number of coefficients is equivalent to the number of pixels used in the interpolation.
Hence, for a Kth-degree polynomial interpolation of a D-dimensional grid, the number of coefficients per pixel is (K+1)^D.

When we naively apply the transformation of Equation 4 to higher order algorithms, we experience accuracy problems. With the 3rd-degree Lagrange interpolation, the interpolated value is calculated as p̂_n(u, v) = u^3 v^3 C_0(i, j) + u^3 v^2 C_1(i, j) + ..., but products of u and v, such as u^3 v^3, can potentially become huge; hence, this formula suffers from huge floating-point errors. To avoid this problem, we modify the equation of the interpolation by grouping terms by α and β instead of u and v as follows:

\[
\begin{aligned}
\hat{p}_n(u,v) ={}& \alpha^3\beta^3 C_0 + \alpha^3\beta^2 C_1 + \alpha^3\beta C_2 + \alpha^3 C_3 \\
&+ \alpha^2\beta^3 C_4 + \alpha^2\beta^2 C_5 + \alpha^2\beta C_6 + \alpha^2 C_7 \\
&+ \alpha\beta^3 C_8 + \alpha\beta^2 C_9 + \alpha\beta C_{10} + \alpha C_{11} \\
&+ \beta^3 C_{12} + \beta^2 C_{13} + \beta C_{14} + C_{15}.
\end{aligned}
\tag{6}
\]

Here, α = u − ⌊u⌋ and β = v − ⌊v⌋. The coefficients C_0 to C_15 can be obtained by grouping terms in the equation of the Lagrange interpolation by the powers of α and β included in each term. For example, C_0 collects the α^3 β^3 terms of the 4 × 4 neighboring pixels:

\[
\begin{aligned}
C_0(i,j) &= -\tfrac{1}{6} R_{j-1} + \tfrac{1}{2} R_{j} - \tfrac{1}{2} R_{j+1} + \tfrac{1}{6} R_{j+2}, \\
R_{l} &= -\tfrac{1}{6}\,p_n(i-1,l) + \tfrac{1}{2}\,p_n(i,l) - \tfrac{1}{2}\,p_n(i+1,l) + \tfrac{1}{6}\,p_n(i+2,l).
\end{aligned}
\]

Since α and β are in the range between 0.0 and 1.0 by definition regardless of u and v, this equation does not suffer from huge floating-point errors. However, there is the drawback that this equation requires two additional vector arithmetic instructions to calculate α and β in the inner-most reconstruction loop. This additional cost of two instructions does not matter much for higher order interpolation algorithms because they use a much larger number of instructions than simple bilinear interpolation. When we applied this accurate version of the pre-computation to the bilinear interpolation, we observed about % performance reductions as a trade-off for better accuracy.

In summary, we introduced two different approaches to improve the accuracy for the bilinear and the 3rd-degree Lagrange interpolation algorithms by reducing the floating-point errors. For the bilinear interpolation, we use double-precision floating-point numbers in the pre-computation phase.
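The grouping by α and β can be illustrated with a scalar sketch that derives the 16 coefficients from the cubic Lagrange basis polynomials and compares the grouped evaluation with a direct evaluation. It assumes interpolation nodes at offsets −1, 0, 1, and 2 around the pixel {i, j} and an image stored as img[j][i]; the names `BASIS`, `lagrange_coefficients`, `interpolate_grouped`, and `interpolate_direct` are hypothetical.

```python
import math

# Coefficients of the cubic Lagrange basis polynomials for nodes -1, 0, 1, 2,
# listed for the powers t^3, t^2, t^1, t^0 (assumed node placement).
BASIS = [
    [-1 / 6, 1 / 2, -1 / 3, 0.0],   # node -1
    [1 / 2, -1.0, -1 / 2, 1.0],     # node  0
    [-1 / 2, 1 / 2, 1.0, 0.0],      # node +1
    [1 / 6, 0.0, -1 / 6, 0.0],      # node +2
]


def pixel(img, i, j):
    """Return the pixel at column i, row j, or zero outside the image."""
    inside = 0 <= j < len(img) and 0 <= i < len(img[0])
    return img[j][i] if inside else 0.0


def lagrange_coefficients(img, i, j):
    """Return C0..C15, where C[4*a + b] multiplies alpha^(3-a) * beta^(3-b)."""
    coeffs = []
    for a in range(4):
        for b in range(4):
            c = 0.0
            for k in range(4):
                for l in range(4):
                    c += BASIS[k][a] * BASIS[l][b] * pixel(img, i + k - 1, j + l - 1)
            coeffs.append(c)
    return coeffs


def interpolate_grouped(coeffs, alpha, beta):
    """Evaluate the grouped polynomial with Horner's rule in beta, then alpha."""
    result = 0.0
    for a in range(4):
        row = 0.0
        for b in range(4):
            row = row * beta + coeffs[4 * a + b]
        result = result * alpha + row
    return result


def interpolate_direct(img, u, v):
    """Direct tensor-product Lagrange interpolation, used only as a check."""
    i, j = math.floor(u), math.floor(v)
    alpha, beta = u - i, v - j
    nodes = (-1, 0, 1, 2)

    def basis(k, t):
        value = 1.0
        for m in nodes:
            if m != k:
                value *= (t - m) / (k - m)
        return value

    return sum(basis(k, alpha) * basis(l, beta) * pixel(img, i + k, j + l)
               for k in nodes for l in nodes)


# Test image sampled from a polynomial that cubic interpolation reproduces.
image = [[i ** 3 - 2 * i * j + j ** 2 for i in range(6)] for j in range(6)]
coeffs = lagrange_coefficients(image, 2, 1)
grouped = interpolate_grouped(coeffs, 2.3 - 2, 1.6 - 1)
```

Because α and β stay below 1, the powers inside the Horner evaluation never grow, whereas powers of u and v would scale with the image width and height.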
Using double precision increases the cost of pre-computation by up to twice, but it incurs no overhead in the reconstruction phase. Since the cost of the pre-computation is quite small compared to the reconstruction, the additional cost of using double precision in the pre-computation is less than % of the total execution time for large problem sizes. For the 3rd-degree Lagrange interpolation, we use an alternative equation (Equation 6) that groups terms by α and β instead of u and v. This approach yields better accuracy in exchange for a higher overhead in the reconstruction phase compared to the first approach of using double-precision numbers only in the pre-computation phase; the alternative equation requires two additional instructions in the reconstruction loop to calculate α and β. However, the additional overhead is about % and is much less significant than the benefits of the pre-computation.

D. Performance Modeling

As we empirically show later, the vector unit utilization is the primary bottleneck in this workload; hence, the number of executed vector instructions is the key to modeling the execution performance. We first show the numbers of vector instructions included in the inner-most loops of the reconstruction phase with and without pre-computation to estimate the benefit of our pre-computation technique. Then, we discuss the number of vector instructions for the pre-computation phase to evaluate the overhead of our pre-computation technique.

Each iteration of the inner-most loop of the reconstruction phase (see Figure 3) executes the following steps for four voxels at once using SIMD instructions:
i) project each voxel onto a 2D image (calculate u, v, w),
ii) calculate i, j to find indices in the pre-computed table,
iii) load data from the pre-computed table (or image I_n),
iv) transpose data for four voxels in vector registers,
v) execute interpolation, and
vi) update density values of the voxels.
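The six steps above can be sketched for a single voxel in scalar form. This is only an illustration under stated assumptions: the projection matrix is a column-major list of 12 values, the pre-computed table is a dictionary keyed by (i, j), and the function name `backproject_voxel` is hypothetical. The SIMD implementation processes four voxels per iteration, so step iv has no scalar counterpart.

```python
import math


def backproject_voxel(a, table, density, x, y, z):
    """Apply one projection image to one voxel and return its new density."""
    # Step i: project the voxel onto the 2D image.
    w = a[2] * x + a[5] * y + a[8] * z + a[11]
    u = (a[0] * x + a[3] * y + a[6] * z + a[9]) / w
    v = (a[1] * x + a[4] * y + a[7] * z + a[10]) / w
    # Step ii: find the indices into the pre-computed table.
    i, j = math.floor(u), math.floor(v)
    # Step iii: load the four coefficients (zero outside the padded table).
    c0, c1, c2, c3 = table.get((i, j), (0.0, 0.0, 0.0, 0.0))
    # Step iv: the transposition exists only in the SIMD version.
    # Step v: interpolate with the grouped bilinear form.
    p_hat = u * (v * c0 + c1) + (v * c2 + c3)
    # Step vi: update the density with the distance weight.
    return density + p_hat / (w * w)


# Toy inputs: a constant image of value 5 gives coefficients (0, 0, 0, 5).
toy_matrix = [1, 0, 0,  0, 1, 0,  0, 0, 0,  0, 0, 2]
toy_table = {(1, 1): (0.0, 0.0, 0.0, 5.0)}
updated = backproject_voxel(toy_matrix, toy_table, 0.0, 2.0, 2.0, 0.0)
```

In this toy case the voxel at (2, 2, 0) maps to u = v = 1 with w = 2, so the interpolated value 5 is divided by 4 before it is accumulated.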

Table 1. Number of instructions that consume the vector unit resource in each step of the inner-most loop. The table lists, for the bilinear interpolation and for the 3rd-degree Lagrange interpolation, each with and without pre-computation, the instruction counts for steps i) calculate u, v, w (Equation 1), ii) calculate i, j, iii) load from the pre-computed table or 2D image, iv) transpose in vector registers, v) execute interpolation, and vi) update density values, together with the totals. With pre-computation, step iii uses aligned load instructions; without pre-computation, it uses unaligned load instructions. Note that some additional vector instructions can be inserted by the compiler for register copying, etc.

These steps are not unique to our implementation and are quite similar to those of the previous implementations [4]. Table 1 summarizes the number of instructions that consume the vector unit resource in each step of our implementation for the two interpolation algorithms. From Table 1, we estimate that our pre-computation technique gives a .x (=4/) performance gain for the bilinear interpolation and a .77x (=/) gain for the 3rd-degree Lagrange interpolation with large problem sizes. On the little-endian PowerPC platform, an unaligned load instruction consumes the vector unit resource while an aligned load instruction does not []. If an unaligned load instruction does not consume the vector unit resource, the benefit of our pre-computation will be smaller. However, even in such a case, our pre-computation still provides a significant reduction in vector unit utilization; we estimate the performance gain will be .x (=4/) for the bilinear interpolation and .x (=9/) for the 3rd-degree Lagrange interpolation. As we discuss later, these estimates explain our experimental results well, especially for the bilinear interpolation.

Table 1 assumes a 128-bit SIMD instruction set such as VSX of POWER or SSE of x86. When the vector length becomes longer, such as with AVX of x86, the relative cost of the transposition increases.
For the bilinear interpolation with 256-bit SIMD instructions, for example, we estimate the transposition uses instructions while the numbers of instructions for the other steps are unchanged (assuming the processor provides sufficiently flexible permutation instructions). This increased cost of transposition may attenuate the benefit of our technique slightly, but our technique can still give a good reduction in the vector instructions in the inner-most loop.

In the current implementation of the pre-computation phase for the bilinear interpolation, we use both single-precision and double-precision floating-point vector arithmetic instructions to keep the accuracy, as already discussed. In total, we use vector instructions (including two unaligned load instructions) for each iteration of the inner-most loop, which processes four pixels. For the 3rd-degree Lagrange interpolation, we use 4 single-precision arithmetic instructions (including four unaligned load instructions) per four pixels in the pre-computation loop. From these numbers and the sizes of the input image and 3D volume, we estimate the overhead of pre-computation as follows. For the problem sizes of L = 512 and 1024, the ratios of the numbers of vector instructions consumed for the pre-computation are less than % and .%, respectively, with either interpolation algorithm. Hence, the overhead caused by the pre-computation is much smaller than the benefit of the reduced computation time in the reconstruction phase unless the number of voxels is unrealistically small compared to the size of the projection images. As we experimentally show later, the ratios of the execution time of the pre-computation phase reasonably match these estimates.

We expect that the overhead of the pre-computation will become even less significant on future CT systems that use higher resolution in both the 2D projection images and the 3D volume because the execution time of the pre-computation phase is proportional to the square of the 2D image resolution while the reconstruction phase follows the cube of the 3D volume resolution.

III. IMPLEMENTATION AND EVALUATIONS

In this section, we detail the implementation of our pre-computation technique for RabbitCT. Then, we show the experimental results on POWER8 processors to illustrate the benefit of our pre-computation technique. We implemented our technique on POWER8, but it does not use POWER-specific features; hence, it is applicable to other processors, such as Intel Xeon, or to GPUs.

A. Implementation

We implemented the reconstruction algorithms in C++ and vectorized them by hand using the SIMD intrinsics provided by the IBM XL C++ compiler. We used single-precision floating-point numbers in the implementation unless we explicitly denote another data type. Because the vector registers of the underlying processor are 128 bits in length, one SIMD instruction can execute four operations for different values at once.

Our baseline implementation follows FastRabbit [4, 7], a state-of-the-art implementation for Intel processors. The baseline optimizations include: (1) replacing divide instructions with a reciprocal estimate instruction, (2) skipping the voxels that cannot be projected onto I_n in the inner-most loop, and (3) eliminating the conditional branches that check for out-of-image-bound accesses by creating a copy of I_n with zero padding around the image. These three optimizations were enabled even when we did not enable our pre-computation technique. In optimization 2, we solve the inequalities to determine the necessary range of I_x before executing the inner-most loop (see Figure 3). In optimization 3, we prepare a memory buffer of 4 pixels. I_n is copied into this buffer before executing the reconstruction (the copy loop of the version without pre-computation in Figure 3). We use the same idea of padding zeros outside the image boundary to avoid conditional branches in the pre-computed table.

Figure 4. Performance (throughput in GUPS) with and without our pre-computation technique, and with interpolation disabled (upper bound for us), for bilinear interpolation on the cores of one NUMA node of POWER8 for each problem size L.

Table 2. Accuracy (root-mean-square error in HU) with and without our pre-computation technique and with interpolation disabled, for each problem size L.

Many recent multi-socket systems have a non-uniform memory architecture (NUMA); accesses to the directly attached (local) memory are faster than accesses to the remote memory attached to other sockets. Hence, it is important to reduce the remote memory accesses to achieve good performance scalability. To reduce remote memory accesses, we allocate the pre-computed table and the density values of voxels for each NUMA node, and each worker thread is bound to a NUMA node. Then, all the threads in each NUMA node process different projection images rather than multiple NUMA nodes cooperating on one image. Hence, during the reconstruction phase, we do not need to access data in the remote memory. In the pre-computation phase, we might need to read an input 2D image from remote memory since we do not control the location of the input images.
However, the size of an input image is much smaller than that of the density values of the voxels; hence, the accesses to the input images do not severely degrade the performance scalability. For better load balancing, each node picks a new input image from the global work queue after processing one input image rather than statically assigning images to each NUMA node. After processing all the projection images, we calculate the final density value for each voxel by summing up the partial results from all NUMA nodes. This final calculation is included in the execution time.

Within a NUMA node, all threads cooperate to process one projection image; each thread picks a small block of voxels one by one and updates these values. An atomic operation (atomic add) is used only when a thread picks the next block from the node-local work queue. Because only one projection image is processed per node at a time, we do not need to use atomic operations to update the values of voxels.

Figure 5. Performance (throughput in GUPS) with and without our pre-computation technique for Lagrange interpolation on the cores of one NUMA node of POWER8.

B. Evaluations

We evaluated our technique on POWER8 processors. The system has two .9-GHz POWER8 processors with GB of system memory running Ubuntu Linux 4. for Little Endian POWER as its OS. The system has processor cores ( cores per socket), and the total number of SMT threads (logical CPUs) is using 8-way SMT. Since one POWER8 processor is a dual chip module, one socket is divided into two NUMA nodes; hence, each NUMA node is equipped with cores (or 4 SMT threads). While we use all 8 SMT threads per core for experiments with the bilinear interpolation, we use only 4 SMT threads per core for the 3rd-degree Lagrange interpolation. This was because the SMT4 configuration outperformed SMT8 for the 3rd-degree Lagrange interpolation regardless of the use of the pre-computation. Each core is equipped with KB L cache and MB L cache memory. Also, the system has MB shared L4 cache memory in total.
We disabled the dynamic frequency scaling to obtain more consistent and reproducible results. We measured the performance times and report the average results.

Figure 4 and Table 2 illustrate the performance (throughput in GUPS, giga voxel updates per second) and accuracy (root-mean-square error from the reference result in HU, Hounsfield units) for bilinear interpolation with the problem sizes of L = 128, 256, 512, and 1024 with and without our pre-computation technique using one NUMA node. GUPS is calculated as the number of voxels in the 3D volume (L^3) × the number of projection images (N) / execution time / 1024^3. Figure 4 also illustrates the performance when we totally disabled the interpolation, i.e., we used p̂_n(u, v) = p_n(i, j) instead of Equation 2. This gives the upper-limit performance for our pre-computation technique, which reduces the overhead of the interpolation.

Table 3. Execution time breakdown on POWER8 cores. For each problem size L, the table lists the execution times of the pre-computation phase, the reconstruction phase, and the total with pre-computation, and the execution time without pre-computation, for the bilinear interpolation and for the 3rd-degree Lagrange interpolation. Numbers show execution time per projection image (in msec). Percentages shown in parentheses are the ratios of pre-computation to total execution time.

Our pre-computation technique exhibited up to 7% performance improvements (for L = 1024) over the baseline (without pre-computation) without significant reduction in accuracy. When the interpolation was disabled, unlike with our technique, the accuracy was significantly degraded as a trade-off for higher performance. As described in Section II.B, we used double-precision numbers in the pre-computation phase. If we had used single-precision numbers, the accuracy would become as poor as about .7 HU.

On the platform we used for evaluation, an unaligned load instruction consumes the vector unit resource. To estimate the benefit of our pre-computation technique on platforms without such an additional penalty for unaligned load instructions, we evaluated the performance without pre-computation by replacing all unaligned load instructions in the inner-most loop with aligned load instructions. This resulted in inaccurate results, but it allowed us to estimate the overhead due to the penalty for unaligned load instructions.
Compared to this version, our pre-computation still gave up to 4% performance gains. These performance gains match the estimation from the performance model explained in Section II.D.

Figure 5 evaluates our pre-computation technique using the 3rd-degree Lagrange interpolation. The gains due to our pre-computation were up to 7%. The gains were slightly lower than the estimation from the performance model. One possible reason for the difference between the measurements and the estimation is the overhead due to additional vector instructions for register copies inserted by the compiler, since the higher degree interpolation algorithm involves more values and incurs higher register pressure. For the 3rd-degree Lagrange interpolation, the changes in the accuracy caused by our pre-computation were negligible since we used the accurate version of the pre-computation with Equation 6.

The performance improvements with the pre-computation are more significant for a large problem size. When the problem size is tiny compared to the input size, e.g., L = 128 in Figures 4 and 5, the pre-computation degrades performance. Table 3 shows the execution time broken down into the pre-computation phase and the reconstruction phase for each problem size. Because the size of the projection images is the same for all problem sizes, the time for the pre-computation phase is almost constant regardless of the problem size, while the larger problem sizes increase the time for the reconstruction phase. Therefore, the execution time of pre-computation is negligible for large problem sizes, but it matters for the total performance when the problem size is very small. These performance overheads almost match the estimations from the performance model. The estimations from the model were slightly smaller than the measured overhead partially because of a performance optimization in the reconstruction phase.

Figure 6. The vector unit utilization with and without our pre-computation technique, and with interpolation disabled, for the bilinear interpolation on cores of POWER8 for each problem size L.
We skip the voxels that cannot be projected onto the projection image in the inner-most loop of the reconstruction phase, but we do not consider this (data-dependent) optimization in the performance model.

For more insight into the improvements with our pre-computation technique, we show the vector unit utilization and the number of cache misses measured by the performance counters of the processor. We select these two metrics because our technique reduces the number of vector instructions in a trade-off for increased cache misses. Figure 6 shows the vector unit utilization for the bilinear interpolation. The utilization becomes 100% when one core executes two vector instructions every cycle. As shown in the figure, the vector unit utilization is quite high: more than % for L = and 4 with or without pre-computation. This means that the vector unit is the primary bottleneck in this workload even with the increased cache misses with the pre-computation; hence, the performance gain due to the reduced vector instructions is much more significant than the overhead due to the increased cache misses.

Figure 7. Total L cache misses (including cache misses caused by the hardware prefetcher) and demand misses on cores for bilinear interpolation and 3rd-degree Lagrange interpolation per input image. The y-axis is in logarithmic scale; lower is better. (Panels: total L cache data misses and demand L cache data misses, each for both interpolation algorithms; series: with pre-computation, without pre-computation, interpolation disabled.)

Figure 7 compares the numbers of total L cache data misses and L cache demand data misses. We show the total misses, which include cache misses caused by the hardware memory prefetcher, to evaluate how our technique affects the total amount of data fetched into the processors. We estimated the total cache misses by using the demand cache misses measured when the hardware memory prefetcher was disabled; hence, these estimates may slightly underestimate the real values. Our technique increased the L demand cache misses regardless of the problem size as a trade-off for reducing the number of executed vector instructions.
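The trade-off described above (fewer arithmetic instructions per lookup in exchange for a larger table) can be illustrated with a minimal scalar sketch in Python. This is not the SIMD implementation evaluated in this paper: the function names and the coefficient layout (a, b, c, d) are assumptions chosen for illustration, showing one plausible way to store four pre-computed values per pixel so that a lookup reads one contiguous entry instead of four scattered pixels. Coordinates are assumed to be non-negative and inside the image.

```python
def bilinear_naive(img, x, y):
    """Reference bilinear interpolation: reads four neighbouring pixels."""
    ix, iy = int(x), int(y)          # integer cell index (x, y assumed >= 0)
    dx, dy = x - ix, y - iy          # fractional offsets inside the cell
    p00, p10 = img[iy][ix], img[iy][ix + 1]
    p01, p11 = img[iy + 1][ix], img[iy + 1][ix + 1]
    top = p00 * (1 - dx) + p10 * dx
    bottom = p01 * (1 - dx) + p11 * dx
    return top * (1 - dy) + bottom * dy


def precompute_table(img):
    """Hypothetical layout: one tuple of four coefficients per pixel cell."""
    h, w = len(img), len(img[0])
    table = []
    for iy in range(h - 1):
        row = []
        for ix in range(w - 1):
            p00, p10 = img[iy][ix], img[iy][ix + 1]
            p01, p11 = img[iy + 1][ix], img[iy + 1][ix + 1]
            # value(dx, dy) = a + b*dx + c*dy + d*dx*dy
            row.append((p00, p10 - p00, p01 - p00, p00 - p10 - p01 + p11))
        table.append(row)
    return table


def bilinear_from_table(table, x, y):
    """Lookup using one contiguous table entry instead of four pixel reads."""
    ix, iy = int(x), int(y)
    dx, dy = x - ix, y - iy
    a, b, c, d = table[iy][ix]
    return a + b * dx + c * dy + d * dx * dy


# Small usage example on a 3x3 image (values are arbitrary).
img = [[1.0, 2.0, 4.0], [3.0, 5.0, 8.0], [0.0, 7.0, 6.0]]
table = precompute_table(img)
value = bilinear_from_table(table, 0.25, 0.75)
```

In this sketch the table holds about four values per pixel, so the memory footprint grows roughly fourfold while each lookup touches a single entry, which is the kind of exchange between cache misses and instruction count discussed in this section.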
The pre-computed table is 4x or x as large as the original 2D image; hence, the larger memory footprint of the pre-computed table increased demand cache misses. However, the increases in the total amount of data transfer were much less significant. These characteristics were common to the two interpolation algorithms. The current algorithm reads and writes the density values of all the voxels for each projection image, and these accesses to the density values of voxels dominate the memory bandwidth of the system memory. Because the accesses to the voxel data are mostly sequential, the hardware memory prefetcher of the processor works well; the accesses to the voxel data do not cause frequent demand cache misses. Hence, the effect of our pre-computation on the amount of data transfer between processors and memory is much less significant than its effect on the demand cache misses. If the memory bandwidth becomes the major performance bottleneck, we can use a technique similar to temporal blocking to reduce memory bandwidth requirements for accessing the voxel data [, 7].

The POWER processor we used in the experiments supports SMT threads per core, while other general-purpose processors typically support a smaller number of SMT threads; e.g., the latest x86 processors of Intel or AMD support 2-way SMT by hyper-threading. By increasing the SMT level (threads per core), it is possible to hide the cache miss latency and improve the vector unit utilization. To confirm that our pre-computation is effective on other processors that only support a smaller number of SMT threads per core, we show the performance and the vector unit utilization for the bilinear interpolation using configurations with to SMT threads per core on POWER in Figures 8 and 9. Our pre-computation improved the performance regardless of the SMT level. The performance gain by the pre-computation was 7% with SMT1 (1 SMT thread per core) and 4% with SMT. The best performance without the pre-computation was achieved with SMT4, and using SMT8 slightly degraded the performance due to increased cache misses caused by a larger memory footprint. With the pre-computation, we obtained the best result with SMT, and SMT4 was a close second.

Figure 8. Performance (throughput in GUPS) with and without our pre-computation technique for linear interpolation with different SMT levels on cores of POWER for problem size of L = . SMT level means the number of threads per core. (Series: with pre-computation, without pre-computation, interpolation disabled as an upper bound.)

Figure 9. Vector unit utilization with and without our pre-computation technique for linear interpolation with different SMT levels on cores of POWER for problem size of L = .

The improvement in vector unit utilization with increasing SMT level was most significant when increasing the SMT level from 1 to 2, regardless of the use of the pre-computation. Since the numbers of executed vector instructions were almost constant when changing the SMT level, the changes in the vector unit utilization caused by the SMT level shown in Figure 9 were correlated with the changes in the execution time. Hence, the performance improvements were also most significant from SMT1 to SMT2. Improvements from using more SMT threads than SMT2 were relatively small compared to the changes between SMT1 and SMT2. For the 3rd-degree Lagrange interpolation, these trends were mostly similar. These results show that our pre-computation technique can improve the performance of this workload even on systems with fewer SMT threads per core.

Figure 10 illustrates the performance scalability with and without the optimization for the NUMA architecture described in Section III.A.
With the NUMA optimization, the performance of both interpolation algorithms scaled well, even with cores (2 sockets, 4 NUMA nodes). The increase in speed over the single-core execution was up to .9x and .7x by cores (for L = 4) for the bilinear and Lagrange interpolations, respectively. However, the scalability was much poorer without the NUMA optimization when we used multiple NUMA nodes due to the overhead of remote memory accesses. The performance improvements caused by the NUMA optimization were more significant for the bilinear interpolation than for the 3rd-degree Lagrange interpolation. The number of total memory accesses was not that different for the two algorithms, as shown in Figure 7, but the execution time was much longer for the 3rd-degree Lagrange interpolation due to its higher computation cost. Hence, the higher communication-to-computation ratio of the bilinear interpolation resulted in more significant gains from the NUMA optimization.

Figure 10. Performance scalability with and without optimization for NUMA on up to cores (4 NUMA nodes) of POWER. Since a POWER chip has two dies on one socket, a system with two sockets has four NUMA nodes. (Panels: linear interpolation and 3rd-degree Lagrange interpolation; x-axis: number of cores; y-axis: throughput in GUPS, higher is better; series: each problem size L with and without the NUMA optimization.)

IV. RELATED WORK

In many applications, pre-processing of input data is a common technique to improve the end-to-end performance of the workload; hence, a variety of pre-processing techniques have been studied. For example, pre-conditioning techniques to improve memory access locality and vector utilization have been widely studied for sparse matrix systems. However, there is no known pre-processing technique to make the grid-based interpolation more efficient. Also, the costs of the existing pre-processing techniques are often high; hence, the use of such pre-processing techniques may be justified only when the input data is repeatedly used. In the case of the pre-conditioning for sparse matrix systems, for example, a pre-conditioned matrix is often repeatedly used in an iterative method such as the conjugate gradient method; hence, the cost of pre-conditioning is amortized. Unlike such pre-processing techniques, our pre-computation technique incurs a much smaller overhead compared to the benefit we can obtain in one execution of the reconstruction phase. Hence, we can use the pre-computation technique even if each projection image is processed only once.

Due to its importance, the 3D cone-beam CT reconstruction has been studied for a long time. The FDK algorithm by Feldkamp et al. [3] is one of the most popular reconstruction algorithms. The FDK algorithm and its variants have been implemented on various hardware platforms including general-purpose processors [4, , 9], FPGAs [], Xeon Phi [4], Cell BE processors [11], and GPUs [, -]. However, it was difficult to conduct fair comparisons of such implementations on different hardware with various optimization techniques involved in terms of the execution performance and the quality of reconstructed images. Recently, RabbitCT [2], an open framework for benchmarking the 3D cone-beam CT reconstruction, has provided a way to fairly compare algorithms and implementations by providing a benchmark framework and also high-quality input images and reference results. Although our baseline implementation follows FastRabbit [4, 9], a state-of-the-art implementation for Intel's processors, our implementation outperformed FastRabbit running on a -socket machine (Intel IvyBridge [4]) by .9x using the same number of cores for L = .
Although this higher performance is partially due to hardware differences (with a higher frequency and larger number of SMT threads, but without 256-bit vector instructions), our technique plays a critical role in exhibiting superior performance, as shown in Figure 4. FastRabbit uses a naive formula to calculate the bilinear interpolations like our baseline implementation.

Although we implemented our pre-computation technique only on POWER processors, our technique can be applied to other processors. In the RabbitCT benchmark, which uses bilinear interpolation, GPUs achieved much higher throughput than general-purpose CPUs such as POWER since GPUs support bilinear interpolation in hardware [, -]. Because our pre-computation is not limited to bilinear interpolation, our technique is also beneficial for GPUs if a higher-order interpolation algorithm, which is not supported by GPU hardware, is used to achieve higher image quality. Real-world CT systems often use higher-order interpolation algorithms.

V. CONCLUSION

We developed a technique to accelerate interpolation of grid data (2D projection images) at a non-grid point. We argued that our pre-computation technique significantly improves the throughput of a medical imaging workload without degrading the accuracy; it exhibited up to 7% and 7% speed-ups in the RabbitCT benchmark for the bilinear and 3rd-degree Lagrange interpolation algorithms, respectively. Since the interpolation of grid data to obtain data at a non-grid point is a common operation in many workloads, our new technique can contribute to a wider range of workloads and algorithms than just the 3D CT reconstruction we described in this paper.

REFERENCES

[1] W. C. Scarfe and A. G. Farman, "What is cone-beam CT and how does it work?", Dental Clinics of North America.
[2] C. Rohkohl, B. Keck, H. G. Hofmann, and J. Hornegger, "RabbitCT: An Open Platform for Benchmarking 3D Cone-beam Reconstruction Algorithms", Medical Physics (2009).
[3] L. Feldkamp, L. Davis, and J. Kress, "Practical Cone-Beam Algorithm", Journal of the Optical Society of America, vol. A1, no. 6, pp. 612-619 (1984).
[4] J. Hofmann, J. Treibig, G. Hager, and G. Wellein, "Comparing the performance of different x86 SIMD instruction sets for a medical imaging application on modern multi- and manycore chips", In Workshop on Programming models for SIMD/Vector processing (2014).
[5] Q. Wang and K. D. Squires, "Large eddy simulation of particle deposition in a vertical turbulent channel flow", International Journal of Multiphase Flow, vol. 22, no. 4, pp. 667-683 (1996).
[6] E. Papenhausen, Z. Zheng, and K. Mueller, "GPU-accelerated back-projection revisited: squeezing performance by careful tuning", In Workshop on High Performance Image Reconstruction.
[7] H. Inoue, "Efficient Tomographic Reconstruction For Commodity Processors with Limited Memory Bandwidth", In Proceedings of IEEE International Symposium on Biomedical Imaging.
[8] M. Gschwind and W. Schmidt, "Bridging endian-dependent SIMD vector representation with compiler optimization", In Workshop on Programming models for SIMD/Vector processing.
[9] J. Treibig, G. Hager, H. G. Hofmann, J. Hornegger, and G. Wellein, "Pushing the limits for medical image reconstruction on recent standard multicore processors", Int. J. High Perform. Comput. Appl. 27, 2, pp. 162-177 (2013).
[10] B. Heigl and M. Kowarschik, "High-Speed Reconstruction for C-Arm Computed Tomography", In Proceedings of Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine (2007).
[11] H. Scherl, M. Koerner, H. Hofmann, W. Eckert, M. Kowarschik, and J. Hornegger, "Implementation of the FDK Algorithm for Cone-Beam CT on the Cell Broadband Engine Architecture", In Proceedings of SPIE, Medical Imaging: Physics of Medical Imaging (2007).
[12] E. Papenhausen and K. Mueller, "Rapid rabbit: Highly optimized GPU accelerated cone-beam CT reconstruction", In Proceedings of the Nuclear Science Symposium and Medical Imaging Conference.
[13] T. Zinßer and B. Keck, "Systematic Performance Optimization of Cone-Beam Back-Projection on the Kepler Architecture", In Proceedings of Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine.
