A Multi-Core Numerical Framework for Characterizing Flow in Oil Reservoirs

Size: px

Start display at page:

Download "A Multi-Core Numerical Framework for Characterizing Flow in Oil Reservoirs"

Beatrix Wilkerson
5 years ago
Views:

1 A Mult-Core Numercal Framework for Characterzng Flow n Ol Reservors The MIT Faculty has made ths artcle openly avalable. Please share how ths access benefts you. Your story matters. Ctaton As Publshed Publsher Leonard, Chrstopher R. et al. "A Mult-Core Numercal Framework for Characterzng Flow n Ol Reservors." n Papers of the 19th Hgh Performance Computng Symposum (HPC 2011) Boston, Massachusetts, USA Aprl 4 6, Socety for Modelng & Smulaton Internatonal Verson Author's fnal manuscrpt Accessed Wed Apr 24 21:51:12 EDT 2019 Ctable Lnk Terms of Use Creatve Commons Attrbuton-Noncommercal-Share Alke 3.0 Detaled Terms

2 A Mult-Core Numercal Framework for Characterzng Flow n Ol Reservors Chrstopher R. Leonard, Cvl and Envronmental Engneerng, Massachusetts Insttute of Technology, 77 Massachusetts Avenue, Cambrdge, MA chrsleo@mt.edu Davd W. Holmes, Department of Mechancal Engneerng, James Cook Unversty, Angus Smth Drve, Douglas, QLD 4811, Australa davd.holmes1@cu.edu.au John R. Wllams, Cvl and Envronmental Engneerng and Engneerng Systems, Massachusetts Insttute of Technology, 77 Massachusetts Avenue, Cambrdge, MA rw@mt.edu Peter G. Tlke, Department of Mathematcs and Modelng, Schlumberger-Doll Research Center, 1 Hampshre Street, Cambrdge, MA tlke@slb.com Keywords: Parallel computaton, mult-core, smoothed partcle hydrodynamcs, lattce Boltzmann method, enhanced ol recovery Abstract Ths paper presents a numercal framework that enables scalable, parallel executon of engneerng smulatons on mult-core, shared memory archtectures. Dstrbuton of the smulatons s done by selectve hash-tablng of the model doman whch spatally decomposes t nto a number of orthogonal computatonal tasks. These tasks, the sze of whch s crtcal to optmal cache blockng and consequently performance, are then dstrbuted for executon to multple threads usng the prevously presented task management algorthm, H-Dspatch. Two numercal methods, smoothed partcle hydrodynamcs (SPH) and the lattce Boltzmann method (LBM), are dscussed n the present work, although the framework s general enough to be used wth any explct tme ntegraton scheme. The mplementaton of both SPH and the LBM wthn the parallel framework s outlned, and the performance of each s presented n terms of speed-up and effcency. On the 24-core server used n ths research, near lnear scalablty was acheved for both numercal methods wth utlzaton effcences up to 95%. To close, the framework s employed to smulate flud flow n a porous rock specmen, whch s of broad geophyscal sgnfcance, partcularly n enhanced ol recovery. 1. INTRODUCTION The extenson of engneerng computatons from seral to parallel has had a profound effect on the scale and complexty of problems that can be modeled n contnuum and dscontnuum mechancs. Tradtonally, such parallel computng has almost exclusvely been undertaken wth dstrbuted memory parallel archtectures, such as clusters of sngle-processor machnes. A number of authors have reported on parallel partcle methods (of whch smoothed partcle hydrodynamcs, SPH, s an example) demonstratng scalablty on such archtectures, for example Walther and Sbalzarn [1] and Ferrar et al. [2]. The lattce Boltzmann method (LBM) has also been a popular canddate for dstrbuted computng whch s unsurprsng due to the naturally parallel characterstcs of ts tradtonally regular, orthogonal grd and local node operatons. For example, Vdal et al. [3] presented results ncorporatng fve bllon LBM nodes wth a speed-up effcency of 75% on 128 processors. Götz et al. [4] smulated dense partcle suspensons wth the LBM and a rgd body physcs engne on an SGI Altx system wth 8192 cores (based on dual-core processors). At 7800 processors an effcency of approxmately 60% s acheved n a smulaton featurng 15.6 bllon LBM nodes and 4.6 mllon suspended partcles. Bernasch et al. [5] utlzed a GPU mplementaton of a mult-component LBM and n 2D smulatons of 4.2 mllon nodes acheved a speed-up factor of 13 over ther benchmark CPU performance. Of partcular relevance to ths study s the work of Zeser et al. [6] n whch a parallel cache blockng strategy was used to optmally decompose space-tme of ther LBM smulatons. In addton, ths strategy was purported to be cacheoblvous so that the decomposed blocks were automatcally matched to the cache sze of the hardware used, mnmzng the latency of memory access durng the smulaton. Shared memory mult-core processors have emerged n the last fve years as a relatvely nexpensve commercaloff-the-shelf hardware opton for techncal computng. Ther development has been motvated by the current clockspeed lmtatons that are hnderng the advancement, at least n terms of pure performance, of sngle processors [7]. However, as a comparatvely young technology, there exsts lttle publshed work ([8] s one example) addressng the mplementaton of numercal codes on shared memory mult-core processors. Wth the expense and hgh demand for compute tme on cluster systems, mult-core represents an attractve and accessble HPC alternatve, but the known challenges of software development on such archtectures (.e. thread safety and memory bandwdth ssues [9, 10, 11, 12]) must be addressed. Mult-core technologes are mportantly begnnng to nfltrate all levels of computng, ncludng wthn each node of modern cross-machne

3 clusters. As such, the development of scalable programmng strateges for mult-core wll have wdespread beneft. A varety of approaches to programmng on mult-core have been proposed to date. Commonly, concurrency tools from tradtonal cluster computng lke MPI [13] and OpenMP [14] have been used to acheve fast take-up of the new technology. Unfortunately, fundamental dfferences n the cross-machne and mult-core archtectures mean that such approaches are rarely optmal for mult-core and result n poor scalablty for many applcatons. In response to ths, Chrysanthakopoulos and co-workers [15, 16], based on earler work by Stewart [17], have mplemented mult-core concurrency lbrares usng Port based abstractons. These mmc the functonalty of a message passng lbrary lke MPI, but use shared memory as the medum for data exchange, rather than exchangng seralzed packets over TCP/IP. Such an approach provdes flexblty n program structure, whle stll captalzng on the speed advantages of shared memory. Perhaps as a reflecton of the growng mportance of mult-core software development, a number of other concurrency lbrares have been developed such as Axum and Clk++. In addton, the latest.net Framework ncludes a Task Parallel Lbrary (TPL) whch provdes methods and types, wth varyng degrees of abstracton, whch can be used wth mnmal programmatc dffculty to dstrbute tasks on multple threads. In an earler paper [18] we have shown that a programmng model developed usng such port-based technques descrbed n [15, 16] provdes sgnfcant performance advantages over approaches lke MPI and OpenMP. Importantly, t was found that the H-Dspatch dstrbuton model facltated adustable cache-blockng whch allowed performance to be tuned va the computatonal task sze. In ths paper, we apply the proposed programmng model to the parallelzaton of both partcle based methods and fxed-grd numercal methods on mult-core. The unque challenges n parallel mplementaton of both methods wll be dscussed and the performance mprovements wll be presented. The layout of ths paper s as follows. In Secton 2 a bref descrpton of the mult-core dstrbuton model, H- Dspatch, s provded. Both the SPH and LBM numercal methods are outlned n Secton 3 and the relevant aspects of ther mplementaton n the mult-core framework, ncludng thread safety and cache memory effcency, are dscussed. Secton 4 presents performance test results from both the SPH and LBM smulators as run on a 24-core server and, fnally, an applcaton of the mult-core numercal framework to a porous meda flow problem relevant to enhanced ol recovery s presented n Secton MULTI-CORE DISTRIBUTION One of the hndrances to scalable, cross-machne dstrbuton of numercal methods s the communcaton of ghost regons. These regons correspond to neghborng sectons of the problem doman (resdent n memory on other cluster nodes) whch are requred on a cluster node for the processng of ts own sub-doman. In the LBM ths s typcally a 'layer' of grd ponts that encapsulates the local sub-doman, but n SPH the layer of neghborng partcles requred s equal to the radus of the compact support zone. In 3D t can be shown that, dependng on the sub-doman sze, the communcated fracton of the problem doman can easly exceed 50%. In ths stuaton Amdahl's Law [7], and the fact that tradtonal cross-machne parallelsm usng messagng packages s a seral process, dctates that ths type of dstrbuted memory approach wll scale poorly. If a problem s dvded nto spatal sub-domans for mult-core dstrbuton, ghost regons are no longer necessary because adacent data s readly avalable n shared memory. Further, the removal of relatvely slow network communcatons requred n cluster computng allows for an entrely new programmng paradgm. Subdomans can take on any smple shape or sze and threaded programmng means many small sub-domans can be processed on each core from an events queue rather than needng to approxmate a sngle large, computatonally balanced doman for each processor. Consequently, dynamc doman decomposton becomes unnecessary and a partcle's poston n a doman can be as smple as a spatal hashng, allowng advecton to proceed wth mnmal management. Such characterstcs mean that mult-core s perfectly suted to the parallel mplementaton of partcle methods, however, shared memory challenges such as thread safety and bandwdth lmtatons must be addressed. The decomposton of the spatal doman of a numercal method creates a number of computatonal tasks. Mult-core dstrbuton of these tasks requres the use of a coordnaton tool to manage them onto processng cores n a load balanced way. Whle such tasks could easly be dstrbuted usng a tradtonal approach lke scatter-gather, here the H- Dspatch programmng model of [18] has been used because of the demonstrated advantages for performance and memory effcency. A schematc llustratng the functonalty of the H- Dspatch programmng model s shown n Fgure 1. The fgure shows three endurng threads (correspondng to three processng cores) that reman actve through each tme step of the analyss. A smple problem space wth nne decomposed tasks s dstrbuted across these threads by H- Dspatch. The novel feature of H-Dspatch s the way n whch tasks are dstrbuted to threads. Rather than a scatter or push of tasks from the manager to threads, here threads request values when free. H-Dspatch manages requests and dstrbutes cells to the requestng threads accordngly. It s ths pull mechansm that enables the use of a sngle thread per core as threads only request a value when free, thus, there s never more than one task at a tme assocated wth a

gven endurng thread (and ts assocated local varable memory). Addtonally, when all tasks n the problem space have been dspatched and processed, H-Dspatch dentfes step completon (.e. synchronzaton) and the process can begn agan.

4 gven endurng thread (and ts assocated local varable memory). Addtonally, when all tasks n the problem space have been dspatched and processed, H-Dspatch dentfes step completon (.e. synchronzaton) and the process can begn agan. Fgure 1. Schematc representaton of the H-Dspatch programmng model [18] used to dstrbute tasks to cores. An endurng processng thread s avalable for each core, whch s three n ths smplfed representaton, and H- Dspatch coordnates tasks to threads n a load balanced way over a number of tme steps. The key beneft of such an approach from the perspectve of memory usage s n the ablty to mantan a sngle set of local varables for each endurng thread. The numercal task assocated wth analyss on a sub-doman wll nevtably requre local calculaton varables, often wth sgnfcant memory requrements (partcularly for the case of partcle methods). Overwrtng ths memory wth each allocated cell means the number of local varable sets wll match core count, rather than total cell count. Consderng that most problems wll be run wth core counts n the 10's or 100's, but cell counts n the 1,000's or 10,000's, ths can sgnfcantly mprove the memory effcency of a code. Addtonally, n managed codes lke C#.NET and Java, because thread local varable memory remans actve throughout the analyss, t s not repeatedly reclamed by the garbage collector, a process that holds all other threads untl completon and degrades performance markedly (see [18]). 3. NUMERICAL METHODS The mult-core numercal framework featured n ths paper has been desgned n a general fashon so as to accommodate any explct numercal method, such as SPH, LBM, the dscrete element method (DEM), the fnte element method (FEM) or fnte dfference (FD) technques. It s worth notng that t could be adapted to accommodate mplct, teratve schemes wth the correct data structures for thread safety but that s not the focus of ths work. Instead, ths study wll focus on SPH and LBM, however the performance of the mult-core framework wth an FD scheme has been prevously reported [18] Smoothed Partcle Hydrodynamcs SPH s a mesh-free Lagrangan partcle method whch was frst proposed for the study of astrophyscal problems by Lucy [19] and Gngold and Monaghan [20], but s now wdely appled to flud mechancs problems [21]. A key advantage of partcle methods such as SPH (see also dsspatve partcle dynamcs (DPD) [22]) s n ther ablty to advect mass wth each partcle, thus removng the need to explctly track phase nterfaces for problems nvolvng multple flud phases or free surface flows. However, the management of free partcles brngs wth t the assocated computatonal cost of performng spatal reasonng at every tme step. Ths requres a search algorthm to determne whch partcles fall wthn the compact support (.e. nteracton) zone of a partcle and then processng each nteractng par. Nevertheless, n many crcumstances ths expense can be ustfed by the versatlty wth whch a varety of mult-physcs phenomena can be ncluded. SPH theory has been detaled wdely n the lterature wth varous formulatons havng been proposed. The methodology of authors such as Tartakovsky and Meakn [23, 24] and Hu and Adams [25] has been shown to perform well for the case of mult-phase flud flows. Ther partcle number densty varant of the conventonal SPH formulaton removes erroneous artfcal surface tenson effects between phases and allows for phases of sgnfcantly dfferng denstes. Such a method has been used for the performance testng n ths work. The dscretzed partcle number densty SPH equatons for some feld quantty, A, s gven as, A A W r r, h, (1) n along wth ts gradent, A A W r r, h, (2) n where n m Wr r, h s the partcle number densty term, whle W s the smoothng functon (typcally a Gaussan or some form of splne), h s the smoothng length and r and r are poston vectors. These expressons are appled to the Naver-Stokes conservaton equatons to determne the SPH equatons of moton. Computng densty drectly from (1) gves, m W r r, h, (3) where ths expresson conserves mass exactly, much lke the summaton densty approach of conventonal SPH. An approprate term for partcle velocty rate has been provded by Morrs et al. [26], and used by Tartakovsky and Meakn [23], where,

5 dv dt n whch vscosty, 1 m N 1 1 m P P dw F 2 2 d r N r r v v nn r r 1 P s the partcle pressure, v s the partcle velocty and 2 dw dr (4) s the dynamc F s the body th force appled on the partcle. Surface tenson s ntroduced nto the method va the supermposton of par-wse nter-partcle forces followng Tartakovsky and Meakn [24], 3 r r s cos r r, r r h F 2h r r, (5) 0, r h r wheren s s the strength of force between partcles and, whle h s the nteracton dstance of a partcle. By defnng s as beng stronger between partcles of the same phase, than between partcles of a dfferent phase, surface tenson manfests naturally as a result of force mbalances at phase nterfaces. Smlarly, s can be defned to control the wettablty propertes of a sold. Sold boundares n the smulator are defned usng rows of vrtual partcles smlar to that used by Morrs et al [26], and no-slp boundary condtons are enforced for low Reynolds number flow smulatons usng an artfcally mposed boundary velocty method developed n [27] and shown to produce hgh accuracy results Mult-Core Implementaton of SPH Because partcle methods necesstate the recalculaton of nteractng partcle pars at regular ntervals, algorthms that reduce the number of canddate nteractng partcles to check are crtcal to numercal effcency. Ths s acheved by spatal hashng, whch assgns partcles to cells or 'bns' based on ther Cartesan coordnates. Wth a cell sde length greater than or equal to the nteracton depth of a partcle, all canddates for nteracton wth some target partcle wll be contaned n the target cell, or one of the mmedately neghborng cells. The storage of partcle cells s handled usng hash table abstractons such as the Dctonary<Tkey, Tvalue> class n C#.NET [28], and parallel dstrbuton s performed by allocaton of cell keys to processors from an events queue. In cases where data s requred from partcles n adacent cells, t s addressed drectly usng the key of the relevant cell. Wth the descrbed partcle cell decomposton of the doman care must be taken to avod common shared memory problems lke race condtons and thread contenton. To crcumvent the problems assocated wth usng locks (coarse graned lockng scales poorly, whle fne graned lockng s tedous to mplement and can ntroduce deadlockng condtons [29]) the SPH data can be structured to remove the possblty of thread contenton altogether. By storng the present and prevous values of the SPH feld varables, necessary gradent terms can be calculated as functons of values n prevous memory, whle updates are wrtten to the current value memory. Ths reduces the number of synchronzatons per tme step from two (f the gradent terms are calculated before synchronzng, followed by the update of the feld varables) to one, and a rollng memory algorthm swtches the ndex of prevous and current data wth successve tme steps. An mportant advantage of the use of spatal hashng to create partcle cells s the ease wth whch the cell sze can be used to optmze cache blockng. By adustng the cell sze, the assocated computatonal task can be re-szed to ft n cache close to the processor (e.g. L1 or L2 cache levels). It can be shown that cells fttng completely n cache demonstrate a sgnfcantly better performance (15 to 30%) than those that overflow cache causng an ncrease n cache msses, because cache msses requre that data then be retreved from RAM wth a greater memory latency The Lattce Boltzmann Method The lattce Boltzmann method (LBM) (see [30] for a revew) has been establshed n the last 20 years as a powerful numercal method for the smulaton of flud flows. It has found applcaton n a vast array of problems ncludng magnetohydrodynamcs, multphase and multcomponent flows, flows n porous meda, turbulent flows and partcle suspensons. The prmary varables n the LBM are the partcle dstrbuton functons, f x,t, whch exst at each of the lattce nodes that comprse the flud doman. These functons relate the probable amount of flud partcles movng wth a dscrete speed n a dscrete drecton at each lattce node at each tme ncrement. The partcle dstrbuton functons are evolved at each tme step va the two-stage, collde-stream process as defned n the lattce-bhatnagar- Gross-Krook [31] equaton (LBGK), t eq f x ct, t t f x, t f x, t f x, t,(6) AG c n whch x defnes the node coordnates, t s the explct tme step, c x / t defnes the lattce veloctes, s the relaxaton tme, f eq x t, are the nodal equlbrum functons, G s a body force (e.g. gravty) and A s a massconservng constant. The collson process, whch s descrbed by the frst two terms n the RHS of (6), monotoncally relaxes the partcle dstrbuton functons towards ther respectve equlbra. The redstrbuted

6 functons are then adusted by the body force term, after whch the streamng process propagates them to ther nearest neghbor nodes. Spatal dscretzaton n the LBM s typcally based on a perodc array of polyhedra, but ths s not mandatory [32]. A choce of lattces s avalable n two and three dmensons wth an ncreasng number of veloctes and therefore symmetry. However, the benefts of ncreased symmetry can be offset by the assocated computatonal cost, especally n 3D. In the present work the D3Q15 lattce s employed, whose velocty vectors are ncluded n (7) c c (7) The macroscopc flud varables, densty, f, and momentum flux, f c, are calculated at each lattce node as velocty moments of the partcle dstrbuton functons. The defntons of the flud pressure and vscosty are by-products of the Chapman-Enskog expanson (see [34] for detals), whch shows how the Naver-Stokes equatons are recovered n the near-ncompressble lmt wth sotropy, Gallean nvarance and a velocty ndependent pressure. An sothermal equaton of state, p c 2 s, n whch c s c / 3 s the lattce speed of sound, s used to calculate the pressure drectly from the densty, whle the knematc vscosty, x, (8) 3 2 t s evaluated from the relaxaton and dscretzaton parameters. The requrement of postve vscosty n (8) mandates that 1 2 and to ensure near-ncompressblty of the flow the computatonal Mach number s lmted, Ma u cs 1. The most straghtforward approach to handlng wall boundary condtons s to employ the bounce-back technque. Although t has been shown to be generally frstorder n accuracy [35], as opposed to the second order accuracy of the lattce Boltzmann equaton at nternal flud nodes [30], ts operatons are local and the orentaton of the boundary wth respect to the grd s rrelevant. A number of alternatve wall boundary technques [36, 37] that offer generalzed second-order convergence are avalable n the LBM, however these are at the expense of the localty and smplcty of the bounce-back condton. In the present work, the mmersed movng boundary (IMB) method of Noble and Torczynsk [38] s employed to handle the hydrodynamc couplng of the flud and structure. In ths method the LBE s modfed to nclude an addtonal collson term whch s dependent on the proporton of the nodal cell that s covered by sold, thus mprovng the boundary representaton and smoothng the hydrodynamc forces calculated at an obstacle's boundary nodes as t moves relatve to the grd. Consequently, t overcomes the momentum dscontnuty of bounce-back and lnk-bounceback-based [39] technques and provdes adequate representaton of non-conformng boundares at lower grd resolutons. It also retans two crtcal advantages of the LBM, namely the localty of the collson operator and the smple lnear streamng operator, and thus facltate solutons nvolvng large numbers of rregular-shaped, movng boundares. Further detals of the IMB method and the couplng of the LBM to the DEM, ncludng an assessment of mxed boundary condtons n varous flow geometres, can be found n Owen et al. [40] Mult-Core Implementaton of the LBM Two characterstc aspects of the LBM often result n t beng descrbed as a naturally parallel numercal method. The frst feature s the regular, orthogonal dscretzaton of space, whch s typcal of Euleran schemes, and can smplfy doman decomposton. The second feature s the use of only local data to perform nodal operatons, whch consequently results n partcle dstrbuton functons at a node beng updated usng only the prevous values. However, t should be noted that ncluson of addtonal features such as flux boundary condtons and non- Newtonan rheology, f not mplemented carefully, can negate the localty of operatons. The obvous choce for decomposton of the LBM doman s to use cubc nodal bundles, as shown schematcally n Fgure 2. The bundles are analogous to the partcle cells that were used n SPH, and smlarly H- Dspatch s used to dstrbute bundle keys to processors. Data storage s handled usng a Dctonary of bundles, whch are n turn Dctonares of nodes. Ths technque s used, as opposed to a master Dctonary of all nodes, to overcome problems that can occur wth Collecton lmts (approxmately 90 mllon on the 64-bt server used here). By defnton, the LBM nodal bundles can be used to perform cache blockng ust as the partcle cells were n SPH. Wth the correct bundle sze, the assocated computatonal task can be stored sequentally n processor cache and the latency assocated wth RAM access can be mnmzed. Smlar technques for the LBM have been reported n [41, 42] and extended to perform decomposton of space-tme [6] (as opposed to ust space) n a way that s ndependent of cache sze (a recursve algorthm s used to determne the optmal block sze). To ensure thread safety, two copes of the LBM partcle dstrbuton functons at each node are stored. Nodal processng s undertaken usng the current values, whch are

then overwrtten and propagated to the future data sets of approprate neghbor nodes.

and hardware employed [44, 42]. mcrotomographc mages of ol reservor rock (see Secton 5). Approxmately 1.

7 then overwrtten and propagated to the future data sets of approprate neghbor nodes. Technques such as SHIFT [43] have been presented whch employ specalzed data structures that remove the need for storng two copes of the partcle dstrbuton functons, however ths s at the expense of the flexblty of the code. Note that the collde-push sequence mplemented here can easly be reordered to a pull-collde sequence, wth each havng ther own subtle convenences and performance benefts dependng on the data structure and hardware employed [44, 42]. mcrotomographc mages of ol reservor rock (see Secton 5). Approxmately 1.4 mllon partcles were used n the smulaton and the executon duraton was defned as the tme n seconds taken to complete a tme step, averaged over 100 steps. For the double-search algorthm a speed-up of approxmately 22 was acheved wth 24 cores, whch corresponds to an effcency of approxmately 92%. Ths, n conuncton wth the fact that the processor scalng response s near lnear, s an excellent result. Fgure 3. Mult-core, parallel performance of the SPH solver contrastng the counterntutve scalablty of the sngle-search and double-search algorthms. Fgure 2. Schematc representaton of the decomposton of the LBM doman nto nodal bundles. Data storage s handled usng a Dctonary of bundles, whch are n turn Dctonares of nodes. 4. PARALLEL PERFORMANCE OF METHODS The parallel performance of SPH and the LBM n the mult-core numercal framework was tested on a 24-core Dell Server PER900 wth Intel Xeon CPU, 2.40 GHz, runnng 64-bt Wndows Server Enterprse Here, two metrcs are used to defne the scalablty of the smulaton framework, namely SpeedUp t / t and 1proc nproc Effcency SpeedUp / N. Obvously, dealzed maxmum performance corresponds to a speed-up rato equal to the number of cores at whch pont effcency would be 100%. Fgure 3 graphs the ncreasng speed-up of the SPH solver wth ncreasng cores. The test problem smulated flow through a porous geometry determned from The results n Fgure 3 also provde an nterestng nsght nto the comparatve benefts of mnmzng computaton or mnmzng memory commt when usng mult-core hardware. The sngle-search result s attaned wth a verson of the solver that performs the spatal reasonng once per tme step and stores the results n memory for use twce per tme step. Conversely, the doublesearch results are acheved when the code s modfed to perform the spatal search twce per tme step, as needed. Intutvely, the sngle-search approach requres a greater memory commt but the double-search approach requres more computaton. However, t s counterntutve to see that double-search sgnfcantly outperforms sngle-search, especally as the number of processors ncreases. Ths can be attrbuted to better cache blockng of the second approach and the smaller amount of data experencng latency when loaded from RAM to cache. The fact that such performance gans only manfest when more than 10 cores are used, suggests that for less than 10 cores, RAM ppelne bandwdth s suffcent to handle a global nteracton lst. As n the SPH testng, the LBM solver was assessed n terms of speed-up and effcency. Fgure 4 graphs speed-up aganst the number of cores for a 3D duct flow problem on

a 200 3 doman. Perodc boundares were employed on the n-flow and out-flow surfaces and the bounce-back wall boundary condton was used on the remanng surfaces.

8 a doman. Perodc boundares were employed on the n-flow and out-flow surfaces and the bounce-back wall boundary condton was used on the remanng surfaces. A constant body force was used to drve the flow. The mult-core framework, usng the SPH solver, was appled to numercally determne the porosty-permeablty relatonshp of a sample of Dolomte. The structural model geometry was generated from X-ray mcrotomographc mages of the sample, whch were taken from a 4.95mm dameter cylndrcal core sample, 5.43mm n length and wth an mage resoluton of 2.8m. Ths produced a voxelated mage set that s n sze. Current hardware lmtatons prevent the full sample from beng analyzed n one numercal experment, therefore sub-blocks (see the nsets n Fgure 5) of voxel dmensons were taken from the full mage set to carry out flow testng and map the porosty-permeablty relatonshp. Fgure 4. Mult-core, parallel performance of the LBM solver for varyng bundle and doman szes. The sde length, n nodes, of the bundles was vared between 10 and 50 and the dfference n performance for each sze can clearly be seen. Optmal performance s acheved at a sde length of 20, where the speed-up and effcency are approxmately 22 and 92%, respectvely, on 24 cores. Ths bundle sze represents the best cache blockng scenaro for the tested hardware. As the bundle sze s ncreased, performance degrades due to the computatonal tasks becomng too large for storage n cache whch results n slow communcaton wth RAM for data. The bundle sde length of 20 was then transferred to dentcal problems usng a and doman, and the results of these tests are also ncluded n Fgure 4. As n the smaller problem, near lnear scalablty s acheved and at 24 cores the speed ups are approxmately 23 and 22, and the effcences are approxmately 95% and 92% for and 400 3, respectvely. Ths s an mportant result, as t suggests that the optmum bundle sze for 3D LBM problems can be determned n an a pror fashon for a specfc archtecture. 5. APPLIED PERMEABILITY EXPERIMENT The permeablty of reservor rocks to sngle and multple flud phases s of mportance to many enhanced ol recovery procedures. Tradtonally, these data are determned expermentally from cored samples of rock. To be able to perform these experments numercally would present sgnfcant cost and tme savngs, and therefore t s the focus of the present study. Fgure 5. Results of the Dolomte permeablty tests undertaken usng SPH and the mult-core framework. Expermental data [45] are ncluded for comparson. The assemblage of SPH partcles was ntalzed wth a densty of approxmately one partcle per voxel. Flud partcles (.e. those located n the pore space) were assgned the propertes of water ( 0 = 10 3 kgm -3, = 10-6 m 2 s -1 ) and boundary partcles located further than 6h from a boundary surface were deleted for effcency. The four doman surfaces parallel to the drecton of flow were desgnated noflow boundares and the n-flow and out-flow surfaces were specfed as perodc. Due to the ncompatblty of the two rock surfaces at these boundares, perodcty could not be appled drectly. Instead, the expermental arrangement was replcated by addng a narrow volume of flud at the top and bottom of the doman. Fnally, all smulatons were drven from rest by a constant body force equvalent to an appled pressure dfferental across the sample. The results of the permeablty tests are graphed n Fgure 5 and expermental data [45] relevant to the gran sze of the sample are ncluded for comparson. It can be

9 seen that the numercal results le wthn the expermental band, suggestng that the presented numercal procedure s approprate. As expected, each sub-block exhbts a dfferent porosty and by analyzng a range of sub-blocks a porostypermeablty curve for the rock sample can be defned. 6. CONCLUDING REMARKS In ths paper a parallel, mult-core framework has been appled to the SPH and LBM numercal methods for the soluton of flud-structure nteracton problems n enhanced ol recovery. Important aspects of ther mplementaton, ncludng spatal decomposton, data structurng and the management of thread safety have been brefly dscussed. Near lnear speed-up over 24 cores was found n testng and peak effcences of 92% n SPH and 95% n LBM were attaned at 24 cores. The mportance of optmal cache blockng was demonstrated, n partcular n the LBM results, by varyng the dstrbuted computatonal task sze va the sze of the nodal bundles. Ths mnmzed the cache msses durng executon and the latency assocated wth accessng RAM. In addton, t was found that the optmal nodal bundle sze n the 3D LBM could be transferred to larger problem domans and acheve smlar performance, suggestng an a pror technque for determnng the best computatonal task sze for parallel dstrbuton. Fnally, the mult-core framework wth the SPH solver was appled n a numercal experment to determne the porosty-permeablty relatonshp of a sample of Dolomte (.e. a canddate reservor rock). Due to hardware lmtatons, a number of sub-blocks of the complete mcrotomographc mage of the rock sample were tested. Each sub-block was found to have a unque porosty and correspondng permeablty, and when these were supermposed on relevant expermental data the correlaton was excellent. Ths result provdes strong support for the numercal expermentaton technque presented. Future work wll extend testng of the mult-core framework to 64-core and 256-core server archtectures. However, the next maor numercal development les n the extenson of the flud capabltes n SPH and the LBM to multple flud phases. Ths wll allow the predcton of the relatve permeablty of rock samples whch s essental to dranage and mbbton processes n enhanced ol recovery. Acknowledgements The authors are grateful to the Schlumberger-Doll Research Center for ther support of ths research. References [1] J. H. Walther and I. F. Sbalzarn. Large-scale parallel dscrete element smulatons of granular flow. Engneerng Computatons, 26(6): , [2] A. Ferrar, M. Dumbser, E. F. Toro, and A. Armann. A new 3D parallel SPH scheme for free surface flows. Computers & Fluds, 38(6): , [3] D. Vdal, R. Roy, and F. Bertrand. A parallel workload balanced and memory effcent lattce-boltzmann algorthm wth sngle unt BGK relaxaton tme for lamnar Newtonan flows. Computers & Fluds, 39(8): , [4] J. Götz, K. Iglberger, C. Fechtnger, S. Donath, and U. Rüde. Couplng multbody dynamcs and computatonal flud dynamcs on 8192 processor cores. Parallel Computng, 36(2-3): , [5] M. Bernasch, L. Ross, R. Benz, M. Sbragagla, and S. Succ. Graphcs processng unt mplementaton of lattce Boltzmann models for flowng soft systems. Physcal Revew E, 80(6):066707, [6] T. Zeser, G. Wellen, A. Ntsure, K. Iglberger, U. Rude, and G. Hager. Introducng a parallel cache oblvous blockng approach for the lattce Boltzmann method. Progress n Computatonal Flud Dynamcs, 8(1-4): , [7] M. Herlhy and N. Shavt. The art of multprocessor programmng. Morgan Kaufman, [8] L. Valant. A brdgng model for mult-core computng. Lecture Notes n Computer Scence, 5193:13-28, [9] K. Poulsen. Software bug contrbuted to blackout. Securty Focus, [10] J. Dongarra, D. Gannon, G. Fox, and K. Kennedy. The mpact of multcore on computatonal scence software. CTWatch Quarterly, 3(1), [11] C. E. Leserson and I. B. Mrman. How to Survve the Multcore Software Revoluton (or at Least Survve the Hype). Clk Arts, Cambrdge, [12] N. Snger. More chp cores can mean slower supercomputng, Sanda smulaton shows. Sanda Natonal Laboratores News Release, Avalable: ore.html [13] W. Gropp, E. Lusk, and A. Skellum. Usng MPI: Portable Parallel Programmng Wth the Message-Passng Interface. MIT Press, Cambrdge, [14] M. Curts-Maury, X. Dng, C. D. Antonopoulos, and D. S. Nkolopoulos. An evaluaton of OpenMP on current and emergng multthreaded/multcore processors. In M. S. Mueller, B. M. Chapman, B. R. de Supnsk, A. D. Malony, and M. Voss, edtors, OpenMP Shared Memory Parallel Programmng, Lecture Notes n Computer Scence, 4315/2008: , Sprnger, Berln/Hedelberg, [15] G. Chrysanthakopoulos and S. Sngh. An asynchronous messagng lbrary for C#. In Proceedngs of the Workshop on Synchronzaton and Concurrency n Obect-Orented Languages, 89-97, San Dego, [16] X. Qu, G. Fox, G. Chrysanthakopoulos, and H. F. Nelsen. Hgh performance mult-paradgm messagng runtme on multcore systems. Techncal report, Indana

10 Unversty, Avalable: ptlupages/publcatons/ccraprl16open.pdf. [17] D. B. Stewart, R. A. Volpe, and P. K. Khosla. Desgn of dynamcally reconfgurable real-tme software usng portbased obects. IEEE Transactons on Software Engneerng, 23: , [18] D. W. Holmes, J. R. Wllams, and P. G. Tlke. An events based algorthm for dstrbutng concurrent tasks on mult-core archtectures. Computer Physcs Communcatons, 181(2): , [19] L. B. Lucy. A numercal approach to the testng of the fsson hypothess. Astronomcal Journal, 82: , [20] R. A. Gngold and J. J. Monaghan. Smoothed partcle hydrodynamcs: Theory and applcaton to non-sphercal stars. Monthly Notces of the Royal Astronomcal Socety, 181: , [21] G. R. Lu and M. B. Lu. Smoothed Partcle Hydrodynamcs: a meshfree partcle method. World Scentfc, Sngapore, [22] M. Lu, P. Meakn, and H. Huang. Dsspatve partcle dynamcs smulatons of multphase flud flow n mcrochannels and mcrochannel networks. Physcs of Fluds, 19:033302, [23] A. M. Tartakovsky and P. Meakn. A smoothed partcle hydrodynamcs model for mscble flow n threedmensonal fractures and the two-dmensonal Raylegh- Taylor nstablty. Journal of Computatonal Physcs, 207: , [24] A. M. Tartakovsky and P. Meakn. Pore scale modelng of mmscble and mscble flud flows usng smoothed partcle hydrodynamcs. Advances n Water Resources, 29: , [25] X. Y. Hu and N. A. Adams. A mult-phase SPH method for macroscopc and mesoscopc flows. Journal of Computatonal Physcs, 213: , [26] J. P. Morrs, P. J. Fox, and Y. Zhu. Modelng low Reynolds number ncompressble flows usng SPH. Journal of Computatonal Physcs, 136: , [27] D. W. Holmes, J. R. Wllams, and P. G. Tlke. Smooth partcle hydrodynamcs smulatons of low Reynolds number flows through porous meda. Internatonal Journal for Numercal and Analytcal Methods n Geomechancs, n/a. do: /nag.898, [28] J. Lberty and D. Xe. Programmng C# 3.0: 5 th Edton. O'Relly Meda, Sebastopol, [29] A. R. Adl-Tabataba, C. Kozyraks, and B. Saha. Unlockng concurrency. Queue, 4(10):24-33, [30] S. Chen and G. D. Doolen. Lattce Boltzmann method for flud flows. Annual Revew of Flud Mechancs, 30: , [31] H. Chen, S. Chen, and W. H. Matthaeus. Recovery of the Naver-Stokes equatons usng a lattce-gas Boltzmann method. Physcal Revew A, 45(8):R5339-R5342, [32] X. He and G. D. Doolen. Lattce Boltzmann method on a curvlnear coordnate system: Vortex sheddng behnd a crcular cylnder. Physcal Revew E, 56(1):434440, [33] U. Frsch, B. Hasslacher, and Y. Pomeau. Lattce-gas automata for the Naver-Stokes equaton. Physcal Revew Letters, 56(14): , [34] S. Hou, Q. Zou, S. Chen, G. Doolen, and A. C. Cogley. Smulaton of cavty flow by the lattce Boltzmann method. Journal of Computatonal Physcs, 118(2): , [35] R. Cornubert, D. d'humères, and D. Levermore. A Knudsen layer theory for lattce gases. Physca D, 47(1-2): , [36] T. Inamuro, M. Yoshno, and F. Ogno. A non-slp boundary condton for lattce Boltzmann smulatons. Physcs of Fluds, 7(12): , [37] D. R. Noble, S. Chen, J. G. Georgads, and R. O. Buckus. A consstent hydrodynamc boundary condton for the lattce Boltzmann method. Physcs of Fluds, 7(1): , [38] D. R. Noble and J. R. Torczynsk. A lattce-boltzmann method for partally saturated computatonal cells. Internatonal Journal of Modern Physcs C, 9(8): , [39] A. J. C. Ladd. Numercal smulatons of partculate suspensons va a dscretzed Boltzmann equaton. Part 1. Theoretcal foundaton. Journal of Flud Mechancs, 271: , [40] D. R. J. Owen, C. R. Leonard, and Y. T. Feng. An effcent framework for flud-structure nteracton usng the lattce Boltzmann method and mmersed movng boundares. Internatonal Journal for Numercal Methods n Engneerng, n/a. do: /nme.2985, [41] T. Pohl, F. Deserno, N. Thurey, U. Rude, P. Lammers, G. Wellen, and T. Zeser. Performance evaluaton of parallel large-scale Lattce Boltzmann applcatons on three supercomputng archtectures. In SC '04: Proceedngs of the 2004 ACM/IEEE conference on Supercomputng, 21-33, Washngton, [42] G. Wellen, T. Zeser, G. Hager, and S. Donath. On the sngle processor performance of smple lattce Boltzmann kernels. Computers & Fluds, 35(8-9): , [43] J. Ma, K. Wu, Z. Jang, and G. D. Couples. SHIFT: An mplementaton for lattce Boltzmann smulaton n lowporosty porous meda. Physcal Revew E, 81(5):056702, [44] T. Pohl, M. Kowarschk, J. Wlke, K. Iglberger, and U. Rüde. Optmzaton and proflng of the cache performance of parallel Lattce Boltzmann codes. Parallel Processng Letters, 13(4): , [45] R. M. Sneder and J. S. Sneder. New ol n old places. Search and Dscovery, 10007, Avalable: der/

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr