Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units

Invited Article
Computer Science & Technology
SPECIAL TOPICS
March 2012 Vol.57 No.7: doi: /s y

XIONG QinGang 1,2, LI Bo 1,2, XU Ji 1,2, FANG XiaoJian 1,2, WANG XiaoWei 1*, WANG LiMin 1*, HE XianFeng 1 & GE Wei 1

1 State Key Laboratory of Multiphase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences, Beijing , China;
2 Graduate University of Chinese Academy of Sciences, Beijing , China

Received May 23, 2011; accepted October 19, 2011

Many-core processors, such as graphic processing units (GPUs), are promising platforms for intrinsically parallel algorithms such as the lattice Boltzmann method (LBM). Although tremendous speedup has been obtained on a single GPU compared with mainstream CPUs, the performance of the LBM for multiple GPUs has not been studied extensively and systematically. In this article, we carry out LBM simulation on a GPU cluster with many nodes, each having multiple Fermi GPUs. Asynchronous execution with CUDA stream functions, OpenMP and non-blocking MPI communication are incorporated to improve efficiency. The algorithm is tested for two-dimensional Couette flow and the results are in good agreement with the analytical solution. For both the one- and two-dimensional decomposition of space, the algorithm performs well as most of the communication time is hidden. Direct numerical simulation of a two-dimensional gas-solid suspension containing more than one million solid particles and one billion gas lattice cells demonstrates the potential of this algorithm in large-scale engineering applications. The algorithm can be directly extended to the three-dimensional decomposition of space and other modeling methods including explicit grid-based methods.

asynchronous execution, compute unified device architecture, graphic processing unit, lattice Boltzmann method, non-blocking message passing interface, OpenMP

Citation: Xiong Q G, Li B, Xu J, et al. Efficient parallel implementation of the lattice Boltzmann method on large clusters of graphic processing units. Chin Sci Bull, 2012, 57: , doi: /s y

*Corresponding authors (email: xwwang@home.ipe.ac.cn; lmwang@home.ipe.ac.cn)

The Author(s). This article is published with open access at Springerlink.com csb.scichina.com

High-performance computing (HPC) on general-purpose graphical processing units (GPGPUs) has emerged as a competitive approach to demanding computations such as those of computational fluid dynamics (CFD) [1,2] and discrete particle simulations [3-5]. This is, on one hand, due to the computational capacity of graphical processing units (GPUs), which is almost one order of magnitude higher than that of mainstream central processing units (CPUs) in terms of both peak performance and memory bandwidth, and on the other hand, due to the introduction of effective and convenient programming interfaces such as Compute Unified Device Architecture (CUDA).

The lattice Boltzmann method (LBM) [6] is a numerical method suitable for GPGPUs owing to its explicit numerical scheme, localized communication mode and inherent additivity of its numerical operations. Hence, it is a powerful alternative to CFD methods such as finite difference and finite volume methods. Implementations of LBM on a single GPU have been reported [7-10] with speedup ratios ranging from tens to above 100 relative to a single CPU core. In the case of multi-GPU implementations, Li et al. [11] performed LBM simulation of lid-driven cavity flow on an HPC system comprising both Nvidia and AMD GPUs, using CUDA and Brook+, respectively, and combining them via the Message Passing Interface (MPI). Myre et al. [12] implemented single-phase, multi-phase and multi-component LBMs on GPU clusters using Open Multi-Processing (OpenMP). In these implementations, data communication between GPUs is trivial or the GPUs are installed at the same node, so the real performances of these implementations were almost unaffected by communication. However, this is not typical in engineering practice. In fact, the data in GPUs cannot be accessed by the network directly and have to be copied, from the GPU to CPU before sending and from the CPU to GPU after receiving, through a PCIe bus with bandwidth of about 10 GB/s currently (Gen 2), which is much lower than that of the GPU global memory. Therefore, communication between the CPU and GPU can be a bottleneck in some applications.

In this article, we integrate asynchronous computation and communication via the CUDA v3.1 framework [13], shared-memory parallelization using OpenMP and inter-node parallelization using non-blocking MPI to improve the performance of multi-GPU LBM simulations. Performances for both one- and two-dimensional decompositions are analyzed and it is found that our implementation is very efficient. The consistency of our implementation on HPC systems with multiple GPUs at one node is emphasized.

1 The lattice Boltzmann method

The lattice BGK model [14] is one of the most frequently used schemes for the LBM. Depending on the dimensionality (D) and number of discrete lattice velocities (Q), there are different variations, such as D2Q9, D3Q13, and D3Q19. The formulation of the lattice BGK model is

$$ f_i(\mathbf{x} + \mathbf{e}_i, t + 1) - f_i(\mathbf{x}, t) = \frac{1}{\tau}\left( f_i^{eq}(\mathbf{x}, t) - f_i(\mathbf{x}, t) \right), \quad (1) $$

where $f_i(\mathbf{x}, t)$ is the density function of the $i$th direction at position $\mathbf{x}$ and time $t$, and τ is the relaxation time related to the fluid molecular dynamic viscosity μ. The term $f_i^{eq}(\mathbf{x}, t)$ is approximated to second order as

$$ f_i^{eq}(\mathbf{x}, t) = \rho w_i \left( 1 + \frac{3\,\mathbf{e}_i \cdot \mathbf{u}}{c^2} + \frac{9\,(\mathbf{e}_i \cdot \mathbf{u})^2}{2c^4} - \frac{3\,\mathbf{u} \cdot \mathbf{u}}{2c^2} \right), \quad (2) $$

where

$$ \rho = \sum_i f_i, \qquad \rho\mathbf{u} = \sum_i \mathbf{e}_i f_i. \quad (3) $$

The D2Q9 scheme is illustrated in Figure 1 and further details were given by Qian et al. [14].

Figure 1 D2Q9 model with $w_i$ = 4/9 when $i$ = 0, $w_i$ = 1/9 when $i$ = 1, 2, 3 and 4, and $w_i$ = 1/36 when $i$ = 5, 6, 7 and 8.

To reduce the compressibility effect in the original lattice BGK model, He et al. [15] proposed revisions to the DdQq schemes and named them iDdQq. The evolution rules are the same but with a different equilibrium density function:

$$ f_i^{eq}(\mathbf{x}, t) = \lambda_i p + w_i \left( \frac{3\,\mathbf{e}_i \cdot \mathbf{u}}{c^2} + \frac{9\,(\mathbf{e}_i \cdot \mathbf{u})^2}{2c^4} - \frac{3\,\mathbf{u} \cdot \mathbf{u}}{2c^2} \right), \quad (4) $$

where $\lambda_0 = 3(w_0 - 1)/c^2$, $\lambda_i = 3 w_i / c^2$ for $i \neq 0$, ρ_0 is the reference fluid density for the initial state, and the pressure $p$ and velocity $\mathbf{u}$ are expressed as

$$ p = \frac{c^2}{3(1 - w_0)} \left( \sum_{i \neq 0} f_i - 1.5\, w_0 \frac{\mathbf{u} \cdot \mathbf{u}}{c^2} \right), \qquad \mathbf{u} = \sum_{i \neq 0} \mathbf{e}_i f_i. \quad (5) $$

The iDdQq schemes introduce no further computational cost, and for GPU implementation, the zeroth direction can be omitted, which makes the schemes faster than the corresponding DdQq schemes. However, for iDdQq schemes, the hiding of data communications is more important, since the size of the data to be transferred among GPUs is the same and the communication-to-computation ratio is therefore higher than for DdQq.

2 Multi-GPU implementation of the D2Q9 scheme

The implementation of the LBM for a single GPU has been discussed extensively in [7,16]. We emphasize one point here. As the GPU is suitable for data-independent computation-intensive tasks, the memory access mode is critical to the performance. For this reason, the storage of LBM grid data must be aligned and accessed in a coalesced manner to make full use of the memory bandwidth. As long as global memory access is optimized, the performance of different implementations on the same single GPU varies little. However, for multi-GPU implementation, GPU-CPU data transfer and CPU-CPU communication may require a large portion of the wall time, and they have to be optimized also.

In CUDA 3.1, the launch of a GPU kernel is asynchronous, which means that when a kernel is launched, control returns to the host before the kernel completes its computing. This feature enables the host CPU to perform other jobs while waiting for the GPU kernel to finish; e.g., copying data between a GPU and CPU and carrying out inter-CPU communication and arithmetic operations. For LBM simulations, this implies that collision and propagation of the density functions can run concurrently with the copying of boundary grid information to a CPU and its transfer to neighboring CPUs. As shown in Figure 2, this is realized using the stream function and portable pinned memory in CUDA 3.1, OpenMP and non-blocking communications provided by MPI.

Figure 2 Schematic map of the overlapping of GPU computation and data communications. Boundary cells and inner cells are marked with different symbols; together they make up the entire grid executed in stream[1].

The flowchart of the parallel implementation of the LBM on a GPU cluster is given in Figure 3. At the beginning of each iteration, the collision operation on boundary cells is launched asynchronously by the kernel Boundary_Collision in stream[0]. In this kernel, the boundary grids are only subject to collision and not to propagation, and post-collision boundary information is written to sending buffers in the GPU global memory. The collision and propagation on the entire grid are launched by the kernel Collision_Propagation in stream[1] as soon as the launch of Boundary_Collision returns. The host can return before these asynchronous kernels complete, but operations in the same stream are carried out in series. Therefore, we launch the copy between GPU and CPU, cudaMemcpyAsync, in stream[0] to ensure that the copy operation starts after the completion of Boundary_Collision. Although the operations in stream[0] are in series, these operations can be done while Collision_Propagation is in execution. To use the asynchronous cudaMemcpyAsync, the buffers in the host must be allocated as pinned memory. After the GPU-CPU copy operation, the communications between CPUs are ready to be carried out.
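The per-cell work of the boundary kernel described above (collide, do not propagate, write the outgoing populations to a send buffer) can be sketched on the CPU in plain Python. This is only an illustrative sketch, not the authors' CUDA kernel: the direction ordering in `E`, the helper names `moments`, `equilibrium`, `collide` and `pack_outgoing`, and the sample values are all assumptions, and lattice units with c = 1 are used.

```python
# D2Q9 lattice in lattice units (c = 1). The direction ordering is an
# assumption; the text only fixes the weight of each direction class.
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
     (1, 1), (-1, 1), (-1, -1), (1, -1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def moments(f):
    """Density and velocity of one cell, Eq. (3)."""
    rho = sum(f)
    ux = sum(fi * ex for fi, (ex, _) in zip(f, E)) / rho
    uy = sum(fi * ey for fi, (_, ey) in zip(f, E)) / rho
    return rho, ux, uy


def equilibrium(rho, ux, uy):
    """Second-order equilibrium distribution, Eq. (2), with c = 1."""
    usq = ux * ux + uy * uy
    feq = []
    for (ex, ey), w in zip(E, W):
        eu = ex * ux + ey * uy
        feq.append(w * rho * (1.0 + 3.0 * eu + 4.5 * eu * eu - 1.5 * usq))
    return feq


def collide(f, tau):
    """BGK relaxation of one cell towards equilibrium (collision part of Eq. (1))."""
    rho, ux, uy = moments(f)
    feq = equilibrium(rho, ux, uy)
    return [fi - (fi - fe) / tau for fi, fe in zip(f, feq)]


def pack_outgoing(f_post, normal_x):
    """Collect the populations that leave the subdomain through an x-face.

    normal_x is +1 for the right face and -1 for the left face.
    """
    return [fi for fi, (ex, _) in zip(f_post, E) if ex == normal_x]


# Hypothetical boundary cell: start from equilibrium, perturb one population.
cell = equilibrium(1.0, 0.05, 0.0)
cell[1] += 0.01
post = collide(cell, tau=0.8)          # collision only, no propagation
send_right = pack_outgoing(post, +1)   # the three populations with e_x = +1
```

Only three of the nine populations cross a given face of a D2Q9 subdomain, which is why the send buffer is small compared with the grid itself.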
To confirm that the GPU-CPU data copy has finished, cudaStreamSynchronize(stream[0]) is performed to ensure that all boundary information has been copied to the sending buffers in host memory. Non-blocking MPI_Isend and MPI_Irecv are then launched if the neighboring processors do not belong to the same node. These two MPI functions are non-blocking, so other CPU operations can proceed while data are being sent or received. MPI_Wait is needed to wait until data have been received. If neighboring processors are located on the same node, data can be transferred with the portable pinned memory in CUDA. This design reduces the amount of data handled by MPI and achieves a higher data transfer speed. Such an idea is realized using OpenMP for data communications within a node [17]. OpenMP threads control GPU devices and make portable pinned memory visible to all GPU devices at the same node. Furthermore, a new technology, GPUDirect [18] for Tesla or Fermi GPUs, is adopted to improve communication performance. The improvement is achieved by removing the step of copying data from GPU-dedicated host memory to host memory available to InfiniBand devices to execute the RDMA communications. After the data communications, received data are still copied to the GPU with cudaMemcpyAsync. Finally, the boundary information is updated by the data from receiving buffers in GPU global memory.

Figure 3 Flowchart of the hybrid implementation of the LBM on multi-GPUs [20].

3 Results and discussion

In the following, the algorithm is validated and its performance tested for our GPU cluster Mole-8.5 (cf. top500.org/list/2011/11/100), which consists of 362 nodes connected with Quad Data Rate InfiniBand. Most of the computing nodes are equipped with two quad-core CPUs and six Nvidia Tesla C2050 GPUs; therefore, the whole system is configured with more than 2000 GPUs, resulting in peak performance of 2 petaflops in single precision.

3.1 Validation

Numerical validation is important in GPU computing, although many authors [7,19] have declared that the results are insensitive to the use of single precision. We consider the analytical solution for the classical case of two-dimensional Couette flow to evaluate the accuracy of our GPU implementation. The domain size is and the Reynolds number Re is 400. The simulation is run in parallel on four GPUs. The simulation results and the analytical solution are illustrated in Figure 4. We find that the computational results of our GPU implementation agree very well with the analytical solution with a maximum error of about 1.5%.

Figure 4 Velocity profiles at steady state for a two-dimensional Couette flow simulation with grid size (Reynolds number Re = UH/ν = 400).

3.2 Performance

Five cases of Couette flow are simulated with the grid sizes for each GPU ranging from (A), to (B), (C), (D) and (E). The whole computation domain is partitioned in either one or two dimensions. All cases were run 10 times with iteration steps for each and the wall times were recorded after arithmetical averaging. In the following, unless otherwise specified, each node runs six GPUs concurrently.

Time costs of GPU computation, data transfer between the GPU and CPU and communication between neighboring CPUs in cases using 12 GPUs for one- and two-dimensional decomposition with synchronous execution and blocking MPI are plotted in Figures 5 and 6, respectively. We find that the time portions of GPU-CPU data transfer and communication between CPUs increase with reduction of the domain size for each GPU. In addition, as expected, the time percentage of GPU-CPU and CPU-CPU data transfer in two-dimensional decomposition is higher than that for one-dimensional decomposition and sometimes the time consumption even exceeds the time for GPU computing, which means there is more room to improve the efficiency by hiding data transfer between the GPU and CPU and communications between CPUs.

Figure 5 (a) Time component of each part of the algorithm with synchronous execution and blocking MPI but without OpenMP in one-dimensional decomposition; (b) time percentages of GPU-CPU data transfer and CPU-CPU communication.

Figure 6 (a) Time component of each part of the algorithm with synchronous execution and blocking MPI in two-dimensional decomposition; (b) time percentages of GPU-CPU data transfer and CPU-CPU communication.

Simulations deploying the proposed computation-communication overlapping algorithm in both one- and two-dimensional decomposition were carried out. The time costs for all cases are illustrated in Figures 7 and 8. The figures show that most of the time for data copy and communication is successfully hidden through overlapping with GPU computation, leading to an obvious reduction in the total time. In two-dimensional decomposition, the performance improvement is even greater than that in one-dimensional decomposition since more time for data transfer between a GPU and CPU and communication is hidden.

To describe the performance improvement clearly, we take case E in one-dimensional decomposition using 12 GPUs as an example to compare time components of five algorithms:
(a) synchronous execution and blocking MPI without OpenMP;
(b) synchronous execution and blocking MPI with OpenMP;
(c) asynchronous execution and blocking MPI with OpenMP;
(d) synchronous execution and non-blocking MPI with OpenMP;
(e) asynchronous execution and non-blocking MPI with OpenMP.

The time results are listed in Table 1. Because of the non-serial characteristic of asynchronous execution and non-blocking MPI, the time required for asynchronous GPU execution and non-blocking MPI is difficult to separate. Therefore, the GPU computation time was assumed to be the same for the asynchronous cases. Table 1 shows that the time required for data delivery between the GPU and CPU is reduced by about 60%-70% and the time required for inter-CPU communication is reduced by 70%-80%, which gives performance of 1192 million lattice updates per second for each GPU card in the multi-node, multiple-GPU implementation.

Table 1 Comparison of time components for five algorithms in case E
Algorithm    GPU computation (s)    GPU-CPU data transfer (s)    CPU-CPU communication (s)    Total (s)
(a) (b) (c) (d) (e)

Figure 7 (a) Time component for the algorithm with asynchronous execution, OpenMP and non-blocking MPI in one-dimensional decomposition; (b) time percentage of GPU-CPU copy and CPU-CPU communication.

Figure 8 (a) Time component for the algorithm with asynchronous execution, OpenMP and non-blocking MPI in two-dimensional decomposition; (b) time percentage of GPU-CPU copy and CPU-CPU communication.
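Why overlapping helps most when each GPU holds a small subdomain can be illustrated with an idealized timing model. The sketch below is an assumption-laden simplification, not a measurement: it treats stream[0] (boundary kernel, copy, communication) and stream[1] (full-grid kernel) as two independent queues that start together, and all timings are hypothetical values in milliseconds.

```python
def synchronous_step(t_bulk, t_boundary, t_copy, t_comm):
    """Wall time of one iteration when every phase runs one after another."""
    return t_boundary + t_bulk + t_copy + t_comm


def overlapped_step(t_bulk, t_boundary, t_copy, t_comm):
    """Wall time when the two queues run concurrently (idealized).

    Queue 0: boundary collision, then copy, then communication, in series.
    Queue 1: collision and propagation on the whole grid.
    The slower queue determines the iteration time.
    """
    queue0 = t_boundary + t_copy + t_comm
    queue1 = t_bulk
    return max(queue0, queue1)


def exposed_transfer(t_bulk, t_boundary, t_copy, t_comm):
    """Part of queue 0 that is not hidden behind the full-grid kernel."""
    return max(0.0, t_boundary + t_copy + t_comm - t_bulk)


# Hypothetical timings (ms): a large and a small subdomain per GPU.
large = dict(t_bulk=10.0, t_boundary=0.5, t_copy=2.0, t_comm=3.0)
small = dict(t_bulk=3.0, t_boundary=0.5, t_copy=2.0, t_comm=3.0)

for name, case in (("large", large), ("small", small)):
    print(name,
          synchronous_step(**case),
          overlapped_step(**case),
          exposed_transfer(**case))
```

In this model the transfer cost disappears from the wall time as long as the full-grid kernel takes longer than the boundary work plus the transfers; once the subdomain becomes small enough, the remainder shows up as exposed time.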

To investigate the scalability of the implementation further, we change the number of GPUs in case E, ranging from 12 to . The corresponding time costs for communication are shown in Figure 9. We see that the computation-communication overlapping algorithm still performs better than the original algorithms with blocking MPI as the number of GPUs increases. This shows that the optimization can be applied to hundreds or thousands of GPUs with good scalability.

Figure 9 Comparison of communication time between blocking and non-blocking MPI in large-scale LBM simulations.

3.3 Performance balance for multi-GPUs nodes

In addition to the above performance discussions, we also run our GPU implementation using 12 GPUs for case E but with a varying number (one, two, three, four or six) of GPUs at each node to test the balance of performance and economy for computing nodes integrating multiple GPUs. The bandwidth of the PCI-E bus is known to be a common bottleneck, because data transfer between the GPU and CPU is slow compared with GPU computing, so performance usually deteriorates when multiple GPUs at one node take part in a parallel computation and contend for PCI-E bandwidth. Owing to the use of CUDA portable pinned memory and OpenMP, the communication load of the processes within a node is theoretically equal, irrespective of how many GPUs are employed concurrently at a node. Therefore, we can ensure that there are negligible differences in the CPU-CPU communication time for the five configuration settings. The performance of our implementation is summarized in Table 2. We find that although the number of GPUs used at each node increases from one to six, the increase in the total computation time is almost negligible as most of the time for communication and data transfer is hidden owing to the asynchronous execution. The time difference is mainly due to the GPU-CPU data transfer as more data are transferred through the PCI-E bus in the case that more GPUs are running on the same node.
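The size of the effect reported in Table 2 can be checked with a few lines of arithmetic. The sketch below copies the Table 2 timings as printed in the source; the variable and function names are illustrative only.

```python
# Rows of Table 2: (GPUs per node, GPU computation, GPU-CPU data transfer,
# CPU-CPU communication, total), all times in seconds.
TABLE_2 = [
    (1, 33.90231, 0.4678, 0.6431, 35.01321),
    (2, 33.90231, 0.5307, 0.6431, 35.07611),
    (3, 33.90231, 0.5735, 0.6431, 35.11891),
    (4, 33.90231, 0.61142, 0.6431, 35.15683),
    (6, 33.90231, 0.63391, 0.6431, 35.17932),
]


def max_sum_mismatch(rows):
    """Largest difference between a listed total and the sum of its parts."""
    return max(abs(comp + xfer + comm - total)
               for _, comp, xfer, comm, total in rows)


def relative_growth(rows, column):
    """Relative increase of one column between the first and the last row."""
    first, last = rows[0][column], rows[-1][column]
    return (last - first) / first


total_growth = relative_growth(TABLE_2, 4)      # total wall time
transfer_growth = relative_growth(TABLE_2, 2)   # GPU-CPU data transfer

print(f"total time grows by {100 * total_growth:.2f}%")
print(f"GPU-CPU transfer time grows by {100 * transfer_growth:.1f}%")
```

Going from one to six GPUs per node, the GPU-CPU transfer time grows by roughly a third, yet the total time grows by less than 0.5%, because the transfer is a small share of each iteration.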
Therefore, we believe that nodes integrating more GPUs, like those of Mole-8.5, achieve a good balance between performance and economy for some applications with an efficient algorithm, considering the hardware cost and space occupation.

3.4 Application

Because of CUDA's interoperability with OpenGL, we couple the efficient GPU implementation of the LBM with a visualization framework developed by our group [20] to realize large-scale simulations. In this section, we conduct a direct numerical simulation of gas up-flowing through suspended solid particles under a two-dimensional doubly periodic boundary condition. The simulation domain is 11.5 cm × 46 cm, which is discretized by about one billion lattice cells. We simulate the gas-solid flow using 576 GPUs at 96 nodes by two-dimensional domain decomposition. In Figure 10, distinct regions of particle aggregation, which are called clusters in the chemical community, are reproduced. This large-scale simulation confirms that the efficient multi-GPU parallel LBM simulation with a powerful GPU cluster is a promising tool for scientific or industrial modeling.

4 Conclusions and prospects

A hybrid parallel GPU implementation for LBM simulation was proposed. Asynchronous GPU execution technology was applied to overlap GPU-CPU data transfer with GPU computation, so that a large portion of the time for GPU-CPU copy can be hidden. Data transfer between CPUs is realized with MPI. To hide this inter-CPU communication cost, non-blocking MPI was used to enable concurrent execution of GPU computing and MPI sending and receiving. A shared memory model such as OpenMP was applied to improve the performance of nodes integrated with multiple GPUs. In our test cases, the time required for GPU-CPU data transfer and inter-CPU communication was reduced by up to about 70% for one-dimensional decomposition and 80% for two-dimensional decomposition. These results show that the hybrid multi-GPU LBM implementation is a feasible way to improve efficiency.
Large-scale direct numerical simulation of an 11.5 cm × 46 cm two-dimensional doubly periodic gas-solid suspension was demonstrated by coupling the implementation with a visualization framework. The hybrid mode was easy to implement and can be extended to three-dimensional decomposition. Although our implementations were based on the LBM, other CFD methods such as the finite difference and finite volume methods can be incorporated into this hybrid mode and we believe that they will also perform well.

Table 2 Time costs for GPU-CPU data transfer and CPU-CPU communication with a varying number of GPUs at each node in case E

Number of GPUs in a node    GPU computation (s)    GPU-CPU data transfer (s)    CPU-CPU communication (s)    Total (s)
1                           33.90231               0.4678                       0.6431                       35.01321
2                           33.90231               0.5307                       0.6431                       35.07611
3                           33.90231               0.5735                       0.6431                       35.11891
4                           33.90231               0.61142                      0.6431                       35.15683
6                           33.90231               0.63391                      0.6431                       35.17932

Figure 10 Large-scale direct numerical simulation of a two-dimensional gas-solid suspension containing more than one million particles [20].

This work was supported by the National Natural Science Foundation of China ( and ). We are grateful to Prof. Aibing Yu of University of New South Wales for illuminative discussions. Two anonymous reviewers who gave valuable comments and suggestions that helped improve the quality of this article are gratefully acknowledged. Support from Nvidia through the CUDA Center of Excellence Program is also appreciated.

1 Kampolis I C, Trompoukis X S, Asouti V G, et al. CFD-based analysis and two-level aerodynamic optimization on graphics processing units. Comput Method Appl M, 2010, 199.
2 Wang J, Xu M, Ge W, et al. GPU accelerated direct numerical simulation with SIMPLE arithmetic for single-phase flow. Chin Sci Bull, 2010, 55.
3 Anderson J A, Lorenz C D, Travesset A. General purpose molecular dynamics simulations fully implemented on graphics processing unit. J Comput Phys, 2008, 227.
4 Chen F, Ge W, Li J. Molecular dynamics simulation of complex multiphase flow on a computer cluster with GPUs. Sci China Ser B: Chem, 2009, 52.
5 Xiong Q, Li B, Chen F, et al. Direct numerical simulation of sub-grid structures in gas-solid flow: GPU implementation of macro-scale pseudo-particle modeling. Chem Eng Sci, 2010, 65.
6 McNamara G R, Zanetti G. Use of the Boltzmann equation to simulate lattice-gas automata. Phys Rev Lett, 1988, 61.
7 Tolke J, Krafczyk M. TeraFLOP computing on a desktop PC with GPUs for 3D CFD. Int J Comput Fluid D, 2008, 22.
8 Ge W, Chen F, Meng F, et al. Multi-scale Discrete Simulation Parallel Computing Based on GPU (in Chinese). Beijing: Science Press.
9 Bernaschi M, Fatica M, Melchionna S, et al. A flexible high-performance lattice Boltzmann GPU code for the simulations of fluid flows in complex geometries. Concurr Comp-Pract E, 2010, 22.
10 Kuznik F, Obrecht C, Rusaouen G, et al. LBM based flow simulation using GPU computing processor. Comput Math Appl, 2010, 59.
11 Li B, Li X, Zhang Y, et al. Lattice Boltzmann simulation on Nvidia and AMD GPUs (in Chinese). Chin Sci Bull (Chin Ver), 2009, 54.
12 Myre J, Walsh S, Lilja D, et al. Performance analysis of single-phase, multiphase, and multicomponent lattice-Boltzmann fluid flow simulations on GPU clusters. Concurr Comp-Pract E, 2010, 23.
13 NVIDIA. NVIDIA CUDA Compute Unified Device Architecture Programming Guide Version 3.1.
14 Qian Y, d'Humières D, Lallemand P. Lattice BGK for Navier-Stokes equation. Europhys Lett, 1992, 17.
15 He N, Wang N, Shi B. A unified incompressible lattice BGK model and its application to three-dimensional lid-driven cavity flow. Chin Phys, 2004, 13.
16 Obrecht C, Kuznik F, Tourancheau B, et al. A new approach to the lattice Boltzmann method for graphics processing units. Comput Math Appl, 2011, 61.
17 Yang C, Huang C, Lin C. Hybrid CUDA, OpenMP, and MPI parallel programming on multicore GPU clusters. Comput Phys Commun, 2011, 182.
18 Mellanox. NVIDIA GPUDirect Technology: Accelerating GPU-based Systems.
19 Komatitsch D, Erlebacher G, Goddeke D, et al. High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster. J Comput Phys, 2010, 229.
20 Ge W, Wang W, Yang N, et al. Meso-scale oriented simulation towards virtual process engineering (VPE): The EMMS paradigm. Chem Eng Sci, 2011, 66.

Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
