An Efficient Parallel Algorithm of Modified Jacobi Approach for Sparse Linear System

Size: px

Start display at page:

Download "An Efficient Parallel Algorithm of Modified Jacobi Approach for Sparse Linear System"

Christiana Morris
6 years ago
Views:

1 991 An Effcent Parallel Algorthm of Modfed Jacob Approach for Sparse Lnear System Bkash Kant Sarkar 1, Shb Sankat Sana 2, G. Sahoo 3 1,3 Department of Informaton Technology, BIT, Mesra, Ranch, Inda E-mal: bk_sarkarbt@hotmal.com 2 Department of Mathematcs, Bhangar Mahavdyalaya, CU, West Bengal, Inda E-mal: shb_sankar@yahoo.com ABSTRACT Several parallel approaches have been developed to solve sparse lnear system based on well-known memory savng schemes. To solve such a lnear system, ths artcle proposes a very smple parallel verson of the modfed Jacob teratve method on Dstrbuted Memory Archtecture, usng the well-known Compressed Sparse Row(CSR) storage format and Recursve Graph Bsecton(RGB). The prme contrbuton of the present nvestgaton s that the ndvdual processors wll not update ts assgned varables any more, provded the prevous teraton acheves smaller than the prescrbed accuracy. Consequently, such processors wll stop computaton as well as communcaton wth other processors to reduce both the computaton and the communcaton tme to a great extent. In fact, the use of such local stoppng crtera ensures to acheve such overall system performance. The expected beneft of ths algorthm s explaned through the analytcal results. Keywords : Parallel, Effcent, Communcaton, Optmzaton, Dstrbuted Memory, Jacob Date of Submsson: February 04, 2011 Date of Acceptance: Aprl 27, INTRODUCTION Large number of physcal problems lke Ar flow over an arcraft wng, Blood crculaton n human body, Water crculaton n an ocean, Weather Forecastng, etc, are descrbed by Partal Dfferental Equatons(PDE). These equatons, when solved usng, Fnte Dfference Method(FDM), generate sets of lnear equatons. But the lnear systems of the most of the physcal problems yeld sparse structure after such transformaton. Several well-known memory savng schemes are developed to store such knd of system. But, soluton of these problems requres huge amount of computatons and becomes very dffcult by employng conventonal computers. So, to solve such problems effcently n parallel manner s stll an attractve ssue. Recently, hgh performance computng has emerged as a key technology nto dverse areas especally for the numercal soluton of large scale problems. Although, there exst several forms of parallelsm[5], but ntroducng data parallelsm usng clusterng wll be easer. A matrx s termed sparse, f majorty of ts entres are zero. As there s no reason to store and operate on huge number of zeros, t s often necessary to modfy the exstng algorthms to take the advantage of the sparse structure of the matrx. Such matrces can be easly compressed, yeldng sgnfcant reducton n memory usage. Several sparse matrx formats exst lke Compressed Sparse Row(CSR) Storage[1], Jagged Dagonal Format[2], Compressed Dagonal Storage Format[3] and Sparse Block Compressed Row Storage Format[4]. Each format takes advantage of a specfc property of the sparse matrx, and therefore acheves dfferent degree of space effcency. In ths work, the CSR storage format (dscussed n secton[2.1]) s used, as t s rather ntutve, straghtforward and more sutable for parallelzaton. The soluton of a lnear system of equatons can be accomplshed by ether of the two numercal methods: Drect or Iteratve. In Drect methods lke Gauss Elmnaton, Gauss Jordon (modfcaton of Gauss Elmnaton) and Matrx Inverson, the amount of computaton s fxed. However, Iteratve methods lke Jacob and Gauss Sedel yeld values whch are found teratvely startng from an approxmaton untl the requred accuracy s obtaned, and hence the amount of computaton depends on the accuracy requred. Further, the parallelzaton of teratve approaches becomes easer as compared to the drect approaches. But some teratve methods are sutable on Multple Instructon Stream and Multple Data stream(mimd) Dstrbuted Memory Machne.

2 For example, Jacob method n comparson to Gauss- Sedel method takes less communcaton tme because all the computatons for -th approxmaton must be ready before the computaton for (+1)-th approxmaton starts. In other words, Jacob teratve approach do not requre exchange of the most recent values of the varables, whereas, a subsequent teraton n Gauss-Sedel needs the values of some varables n that teraton too (.e., causes ntra-teraton dependences). Because of ths fact, Jacob approach s preferred for parallelzaton on Dstrbuted Memory computer as compared to Shared Memory computer. For parttonng data (nvolved wth lnear system) nto dfferent processors to mantan data localty aspect, there are several algorthms lke Mult-grd (Square Mesh Parttonng), Ellpack-Itpack(Row Partton Format), RGB(Recursve Graph Bsecton dscussed n secton[2.2]) etc; and all of whch are appled for parallel machnes. In ths nvestgaton, RGB n comparson to other technques s preferred as t nfluences to opt for less communcaton tme, achevng better statc load balancng of the sparse graph among the processors. Many researchers have concentrated to solve smultaneous system of lnear equatons sequentally and n parallel, usng Jacob and other approaches [5], [6],[7], [8],[9],[10], [11], [12], [13],[19] [20],[21][22]. Some knds of tllng technques[14] are developed for solvng set of lnear system. Tllng s a comple-tme transformaton whch subdvdes the teraton space for a regular computaton so that a new tle-based schedule(where each tle s executed atomcally) exhbts better data localty. So, tllng provdes a method of achevng nter-teraton localty. In[15], Communcaton optmzaton for rregular scentfc computatons on Dstrbuted Memory archtectures s focused. Although a number of technques has been developed tll today to solve set of lnear systems on Dstrbuted Memory machne tryng to reduce communcaton among processors, but only few of them such as [16][17][21] pay attenton to the amount of work done by ndvdual processors. 992 RGB approaches, s developed on Dstrbuted Memory Archtecture to stop unwanted computaton and communcaton among the processors (n order to reduce both the costs). We compare the analytcal results of the proposed work wth Tmng Models [17] and report that the proposed s a better choce. The present artcle s organzed as follows: secton-2 gves theoretcal background about CSR, graph parttonng technque, Jacob method, parallel computers. Secton-3 descrbes the modfed verson of Jacob approach. In secton-4, the proposed parallel algorthm and ts proof of correctness are descrbed. Secton-5 shows the analytcal results. Fnally, secton-6 exhbts the future scope of the work. 2. THEORETICAL BACKGROUND 2.1. COMPRESSED SPARSE ROW(CSR) FORMAT Maxmum storage schemes for sparse matrx employ the technque as follows. Compress all the non-zero elements of the sparse matrx (say, A) n a lnear array and then use some number of auxlary arrays to descrbe the locatons of the non-zeros of the orgnal matrx A. The CSR format uses three arrays to store an n n sparse matrx wth m nonzero entres. () An m 1 array, nonzero[ ], contans the nonzero elements of the lnear system. These are stored n the order of ther rows from 0 to (n 1). However, elements of the same row can be stored n any order. () An m 1 array, col_vector [ ], stores row-wse the respectve column number of each nonzero element. Indeed, each column number of a row represents also the varable wth non-zero co-effcent n that row. () An n 1 array row_vector[ ], and the content of row_vector[] ponts to the frst entry of the th row n nonzero[ ] and col_vector[ ]. One sparse matrx of the form AX=b and ths matrx mapped nto three arrays are shown n Fgs.1 and 2 respectvely. In partcular, n ths work, a very smple parallel verson based on the modfed Jacob teratve method[18] and combnng the capabltes of CSR and

3 x x x x x x x x x x x x x x x x Fgure 1: Sparse Matrx n Equaton form AX=b b 0 b1 b 2 b 3 b 4 b 5 b 6 b 7 = b 8 b 9 b10 b 11 b12 b 13 b14 b15 Fgure2: Sparse Matrx (shown n Fg.-1) Mapped nto three arrays

4 2.2. RECURSIVE GRAPH BISECTION(RGB) TECHNIQUE The RGB technque parttons the doman(graph) by recursvely subdvdng t nto two parts at each step. For p = 2 k processors, the doman yelds p parttons recursvely subdvdng k tmes. Ths bsecton nvolves three major steps. () Intally set the level(startng wth 0). () Then, fnd pseudoperpheral node. () Fnally, partton the graph recursvely. To determne pseudo-perpheral or perpheral node of a graph, dameter of the graph(here, graph s represented by matrx) s requred to be computed frst. The dameter of a graph s defned as follows. δ (G) = max { d(x,y) x ε V, y ε V }, where d (x,y) s the dstance (shortest path) between any two nodes n the graph(g) wth vertex set V. Ideally, one of the two nodes n par (x, y) that acheves the dameter can be used as startng node. These two nodes are called as perpheral nodes, and are very expensve to determne. A pseudo-perpheral node s often employed to partton the graph. For p =2 2 =4, applyng the above segment on the graph (represented by the matrx shown n Fg-1), the parttoned graph s shown n Fg To solve a lnear system, AX = b, through ths method, the soluton vector X must satsfy the equaton: X 1 = b a, j a, j X j () 1 In fact, to solve the system, one may start the process wth an ntal estmaton. However, the Jacob approach reles upon estmaton of every element of vector X to come up wth a new value of X. It uses values already computed for each varable X durng teraton (t+1): 1 X ( ) () t + 1 = b a, jx j t a, j After computng a new estmaton, the approach computes the new value of dff(dfference) based on the change n all elements of X(assume that the ntal value of dff s 0). Actually, the value of dff ensures to stop the approach. Now, dff s computed as: dff = max(abs(x 1 (t) - X 1 (t+1)), abs((x n (t) X n (t+1))...(2) Fgure- 3: Parttoned Graph usng RBG method(here, d 11, d 12. d 22, d 23 are the domans, as four processors are used) 2.3. JACOBI METHOD A set of lnear equatons s represented as AX=b where A s a matrx of sze n x n wth coeffcents a,j, X s an nx1 vector varable to be solved and b s an nx1 vector of rght sde values. Jacob method s an example of teratve method for solvng lnear system AX=b, typcally generated whle workng wth PDE PARALLEL COMPUTERS Parallel computers are those systems that emphasze parallel processng. Parallel computers are generally dvded nto three archtectural confguratons: Ppelne computers: whch belong to SISD(Sngle Instructon Stream and Sngle Data Stream) model computers and the parallelsm acheved through ths type of computers s called as temporal parallelsm. Array processors : whch belong to SIMD(Sngle Instructon Stream and Multple Data Stream) model computers and the parallelsm acheved through ths model s called spatal or synchronous or data parallelsm. The global CU dspatches the same nstructon to each PEs (whch are organzed by a partcular network ) and each executes the same nstructon on a dstnct data set. Multprocessor systems : whch belong to MIMD(Multple Instructon Stream and Multple Data Stream) model computers and the parallelsm acheved through ths type of computers s called as control or asynchronous parallelsm. Ths type of system s agan classfed nto two categores : (a) Shared Memory model computers(or Multprocessors) and (b) Dstrbuted Memory model

5 computers (Message Passng Parallel Computers or Mult-computers). Message Passng Model Computers are also called Loosely Coupled Computers as the degree of nteracton among the processors s not very hgh. A Message Passng Computer, on the other hand, s programmed usng Send-Receve prmtves. There are several send-receve used n practce. 3. MODIFIED JACOBI APPROACH From eqn(2) of secton-2.3, t s clear that n the standard Jacob teratve method, dff s computed as the maxmum among all the absolute dfferences of the values of respectve varables n the current teraton and the mmedate prevous teraton. As per the standard approach, n spte of achevng the desred accuracy by some of the varables n the current teraton, the same varables are updated agan n the next teraton to converge the remanng varables. Consequently, t causes unnecessary update of the converged varables. It s also true that the varables whch are converged to the desred soluton n the present teraton, are needed by the present not converged varables. Thus, the modfed verson stops the updatng of the converged varables n the next teraton to reduce executon tme but the non-converged varables use the values of the necessary converged varables by updatng ther current contents wth the dff value n the current teraton. For example, suppose varable X k s not converged at the present teraton but varable X m s converged. Then, X m s smply updated n the successve teratons as follows. X m = X m + dff.. (3) where dff represents the value of dff(computed from the rest non-converged varables followng eqn(2)) at the current teraton. In fact, X m s here not updated followng equaton-1 (mentoned n secton-2.3),.e., no multplcaton, dvson and more number of addtons are performed. However, X k s computed as per eq (1), usng the value of X m (calculated by eqn(3)). However, n the modfed approach, t s assumed that each row has some non-zero coeffcents excludng the dagonal one. Further, ths method requres that the dagonal elements are dagonally domnant, means that the dagonal element s greater than the sum of the absolute values of the other elements n the gven row. In ths artcle, a smple parallelzed verson of ths modfed approach, based on CSR storage format and RGB partton technque, s presented (n 995 secton-4) on the Multprocessors to solve a sparse lnear system. 4. PROPOSED PARALLEL ALGORITHM It s well-known that, n sequental teratve approaches, we concentrate on the approxmate values of the soluton vector X and these normally depend on certan degree of accuracy. In partcular, the varable dff s only used to contnue the specfed accuracy of the varables. However, nstead of global dff(whch s the maxmum among the computed dfferences of all the varables by followng eqn(2)), the local dff (whch s, ndeed, the maxmum among all the computed dfferences of the varables assgned to ndvdual processor, followng the same eqn(2)), can guarantee to acheve the same. Of greater nterest, the work[18] clams t. In ths secton, we present a smple parallel algorthm of the modfed Jacob method on Dstrbuted Memory Archtecture to solve sparse lnear system. Further, our algorthm s based on Compressed Sparse Row(CSR) storage format and Recursve Graph Bsecton(RGB). The goal of ths algorthm s to optmze the communcaton overhead among the processors, reducng computatonal cost too. However, n the desgned algorthm, the status varable for every processor fulflls such great role. Assumptons: ) All the dagonal elements of the matrx A must be non-zero values. ) The dagonal elements are dagonally domnant, means that the dagonal element s greater than the sum of the absolute values of the other elements n the gven row. ) Processors are represented by unque ds such as : 0, 1, 2, 3, etc. Bref descrpton of the used varables: (a) To represent soluton vector X, one array of structures(records) X[ ] s consdered. In fact, each element of X[ ] represents one varable, and conssts of two felds. In C lke language, such structure(record) can be declared as: struct varable { float value; nt source } X [ ]. Clearly, X[] represents here the th varable (lke X ). The mportance of each of the felds are dscussed below. ) value (ths feld stores the latest updated value of a partcular varable by a processor). ) source(feld manly stores the processor-d of the processor whch s updatng the partcular varable). Thus, t s clear that each varable keeps more nformaton except the value of varable, and each sub-scrpt value of X[ ] represents one varable.

6 Smultaneously, another array NewX[ ] s essental to store temporarly only the current contents of updated varables(.e., NewX[ndex] s used to store temporarly the updated content of X[ndex].value at the current teraton). (b) The 1-D array processor_status[ ] plays here the sgnfcant role to mantan status of the partcpatng processors. For example, processor_status[0] stores the status of 0 th processor and so on. However, ts content s ether 0(means ts work s not over) or 1 (means ts work s over). If p number of processors partcpate n the work, then ts sze wll be p. (c) Locaton[ ], a smple 1-D array, s used to store var_ndces(.e., varables) to be by a processor. So, If v number of varables are updated by a processor, then ts sze wll be v (.e., Locaton[v ]). (d) Three 1-D arrays : row_vector[ ], col_vector[ ] and nonzero[ ] are used to represent CSR storage format of sparse matrx (example shown through Fg-1 for sparse matrx and Fg-2 for ts equvalent CSR). Sub-scrpt of row_vector[ ] ndcates row number. (e) b[ ], 1-D array, s to represent the source vector,.e., each locaton of ths array stores the rght hand sde of a partcular equaton. (f) The varable dff (local dff) s responsble for checkng the desred accuracy of soluton of the assgned varables to each processor. Proposed Algorthm A bref sketch of the algorthm s outlned below. Step-1: Processor P 0 (root processor) ntalzes value 0(zero) to the value part of each element of the soluton vector (X) as well as the necessary values of the other felds(members) of X, and the value of the varable dff. It then broadcasts all these values to the rest processors partcpatng n ths work. Step-2: Assgn varables to be updated by each processor nto ts local varable Locaton[ ]. [Here, assgnng varables to processors s done, seeng the parttoned graph of the matrx A.] Step-3: for all the workng processors P, where 0 p-1 do the followng tasks. // P s the processor-d and p s the total number of workng processors. Step-3.1: for each varable X k assgned to P (K Locaton[ ]), perform the followng sub-steps to update the current retreved varable. Step-3.1.1: Frst retreve the necessary varables as well as ther respectve co-effcents smultaneously 996 accessng col_vector[ ] and nonzero[ ] arrays, followng eq-1. Next, collect the values of these necessary varables from the respectve processors(retreved through source feld of X[ ]) by passng message (f ther work s not over). However, f work of any one s over, then frst update the values of the necessary varables computed earler by that stopped processor, followng eq-3 (presented n secton-3), and use those. Step-3.1.2: Now, update X k (followng eq-1) and store the value of ths varable nto an ndex of the local array NewX[ ]. // For P, step-3.1 ends and step -3.2 starts Step-3.2: Update dff(local dff) followng the eqn (2) [mentoned n secton-2.3]. Step-3.3: Copy the updated values of the varables( stored n NewX[ ]) nto the value part of the respectve locatons of X[ ]. Step-3.4: If dff(local dff) reaches to the desred value (say ε: some value s set ntally), then assgn value 1 to ts processor_status[ ](.e., processor_status[1] = 1) and send ths value (to all other destnaton processors to stop further communcaton wth t) and the latest updated values of the assgned varables to the respectve processors as well as root processors else the processng goes back to step-3.1 for next teraton. Step-4 : If the algorthm termnates, then the root processor(p 0 ) gets the fnal solutons of the varables. 5. ANALYTICAL RESULTS Assume that the number of processors s p. Now, f k number of teratons are requred to acheve the desred accuracy n worst case and total number of nonzero elements n the nonzero array s m ( m << n), then maxmum number operatons lke multplcaton, addton etc, wll be mk n sequental approach whch can be expressed as O(mk). Clearly, the proposed parallel algorthm takes O(mk/p) computaton tme. Although, almost all other exstng parallel approaches also demand the same asymptotc runnng tme. But the status varable processor_status adopted n our algorthm ensures to stop processng of the varables assgned to the ndvdual processors when the desred accuracy computed from the respectve assgned varables to them s found. In other words, there s maxmum probablty to be converged the respectve assgned varables earler(whch s, n fact, less n number of teratons). Consequently, the processors whch

7 termnate ther respectve assgned tasks, can be employed to perform the task of another dstnct dfferent problem n mult-processor envronment. Thus, the present approach reduces executon tme, snce the processors(whose processng s over) stop computaton lke data access, multplcaton, addton, etc. For nstance, suppose processor P 0 acheves the desred accuracy over ts assgned varables after 8 teratons whereas processor P 1 after 12 teratons, P 2 after 14 teratons etc, then unnecessarly P 0, P 1 need not contnue executon up to maxmum teraton(14) over ther respectve assgned varables. Snce ths drawback s overcome here, so t ultmately saves sgnfcant amount of executon tme as compared to the exstng approaches. Also, the approach clams less communcaton cost between processors than any exstng parallel method, snce unwanted communcaton among processors stops. Now, we consder the Tmng Models [17], the total parallel processng tme(assumng same processng speed for each processor) can be expressed as T par = T master +T worker +T com, where T master defnes the master total computaton tme, T worker as the workers total computaton tme, and T com = T c_master + T c_worker. Here, T c_master ncludes broadcast of global geometry, dstrbuton of workng tuples and extracton of workng tuples, whereas T c_worker ncludes extracton of global nformaton, extracton of workng tuples, return of result tuples and ntermedate exchange of data wth the neghborng processors. Clearly, the proposed approach clams better system performance as compared to the mentoned approach, snce master(root) processor need not collect solutons from any processor to compute the global dff and to send the same to the other processors. Consequently, T c_worker does not nclude here extracton of global nformaton (dff ) from master processor and unnecessary extracton of workng tuples, return of result tuples and ntermedate exchange of data wth the neghborng processors whenever the work of the respectve processors s over. 6. CONCLUSION The artcle addresses parallelzaton of a varant of the Jacob method for lnear system soluton n dstrbuted memory computers. In ths varant, once a varable s detected to be converged t s not communcated afterwards. The status varable and graph parttonng technque used n the proposed algorthm balance the computatons and reduce the communcaton overhead. Ths novel dea s very much mportant for the cluster computng because the connecton between 997 processors n such envronment s often slower. The concept s verfed and valdated mathematcally. The proposed algorthm can be mplemented cluster of personal computers connected by hgh speed network. The mplementaton wll be usng the Message Passng Interface(MPI) lbrary as the parallel programmng platform. REFERENCES [1]. J.D.Z. Ba, J. Dongarra, A. Ruhe, H. van der Vorst, Templates for the soluton of algebrac egenvalue problems: A Practcal gude.siam [2]. Y. Saad, Krylov. Subspace methods on supercomputers, SIAM Journal on Scentfc and Statstcal Computng. (1989). [3]. J. Dongarra, In Templates for the Soluton of Algebrac Egenvalue Problems: A Practcal Gude, SIAM [4]. S. Vasslads, S. D. Cotofana, P. Staths, Block based compresson storage expected performance, In Proceedngs of 14th Int l Conf. on Hgh Performance Computng Systems and Applcatons (HPCS 2000) (June 2000). [5]. M. J. Qunn, Parallel Computng Theory and Practce, Oregon State Unversty, Second Edton [6]. Van der Vorst, H. A., Large sparse lnear systems on vector and parallel computers, Parallel Computng, 5, (1987). [7]. Bswa Nath. Datta, Jacob Iteratve Methods for Large Scale Matrx Problems n Control. Dept. Mathematcal Scences, pp.2-21, [8]. F. L. Alvarado, A. Pothen, R. Schreber, Hghly parallel sparse soluton, n Techncal Report, RIACS TR 92.11, NASA Ames Research Center, Moffet Feld,CA. May [9]. C.Aykanat, F. Ozguner, F. Ercal, P. Sadayappan, Iteratve algorthms for soluton of large sparse systems of lnear equatons, IEEE Transactons on Computers, 37(12): (1988). [10]. U. Meer, A parallel partton method for solvng sparse systems of lnear equatons, Parallel Computng, 2: (1985). [11]. M.T. Heath, E.G.Y. Ng, B.W. Peyton, Parallel algorthms for sparse lnear systems, SIAM Revew, 33 : (1991). [12]. Y. Saad, Iteratve Methods for Sparse Lnear Systems, Unversty of Mnnesota, Dept. of Computer Scence and Engneerng.

8 998 [13]. S.G. Petton, Massvely parallel sparse matrx computaton for teratve methods, n Techncal Report. YALEU/DCS/878, Yale Unversty, Department of Computer Scence, New Haven, CT, [14]. M.M. Strout, Larry Carter, Jeanne Ferrante,, Kreaseck Barbara, Sparse Tllng For Statonary Iteratve Methods, Internatonal Journal of Hgh Performance Computng Applcatons, vol-18, no-1, pp ( 2004). [15]. R. Das, M. Uysal, J. Saltz, Y.S.S. Hwang, Communcaton Optmzatons for rregular scentfc computatons on Dstrbuted Memory Archtectures, Journal of Parallel and Dstrbuted Computng, 22(3), pp (1994). [16]. X. Wang, M.M. Chen, X. Wu, and X. An, A Class of Parallel Algorthms for Solvng Large Sparse Lnear Systems on Multprocessors, n Proceedngs of Int. Conf. on Hgh Performance Computng(IEEE-CS), vol-2, ,(2000) [17]. Kostas Blathras, Danel B. Szyld1, and Yuan Sh, Tmng Models and Local Stoppng Crtera for Asynchronous Iteratve Algorthms, Journal of Parallel and Dstrbuted Computng, 58, (1999) [18]. B.K.Sarkar, S.S.Sana, and P.K. Mahant, A Modfed Verson of Jacob Approach, Internatonal Journal of Innovatve Computng and Applcatons, 2, 60-65, [19]. S. Kortas and P. Angot, A practcal and portable model for programmng for teratve solvers on dstrbuted memory machnes, Parallel Computng, 22 (1996), [20]. A.H. Sameh and D.J. Kuck, On Stable Parallel Lnear System Solvers, Journal of the Assocaton for Computng Machnery, 25(1), 81-91, [21]. R. Chandra and C. Sva Ram Murthy, A faster algorthm for solvng lnear algebrac equatons on the star graph, Journal of Parallel and Dstrbuted Computng, 63, , [22]. D. Heller, A survey of parallel algorthms n numercal lnear algebra, SIAM Rev. 20 (4) , 1978.

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr