Parallelism for Nested Loops with Non-uniform and Flow Dependences

Sam-Jin Jeong
Dept. of Information & Communication Engineering, Cheonan University, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

M. Bubak et al. (Eds.): ICCS 2004, LNCS 3038, pp. 46 52, 2004. Springer-Verlag Berlin Heidelberg 2004

Abstract. Many methods have been proposed to parallelize loops with non-uniform dependences, but most of them perform poorly because the dependence constraints are irregular and complex. This paper proposes an efficient method of tiling and transforming nested loops with non-uniform and flow dependences to maximize parallelism. Our approach is based on Convex Hull theory, which has adequate information to handle non-uniform dependences, and on minimum dependence distance tiling, unique set oriented partitioning, and three region partitioning. We first show how to find the incrementing minimum dependence distance. Next, we show how to tile the iteration space efficiently according to the incrementing minimum dependence distance. Finally, we show how to achieve more parallelism by loop interchanging and how to transform the loop into parallel loops. A comparison shows that the method exposes more parallelism than other existing methods.

1 Introduction

Parallel processing is recognized as an important vehicle for solving problems in many areas of computer applications. Most of the computing time in such applications is spent in loops. Existing parallelizing compilers can parallelize most loops with uniform dependences, but they do not satisfactorily handle loops with non-uniform dependences. Most of the time, the compiler leaves such loops running sequentially. Unfortunately, loops with non-uniform dependences are not uncommon in the real world. Several techniques have been developed for loops with non-uniform dependences. Each of the existing techniques does a good job for some particular types of loops but performs poorly on other types. Some techniques based on Convex Hull theory [7], which has been proven to have enough information to handle non-uniform dependences, are the minimum dependence distance tiling method [5], [6], the unique set oriented partitioning method [4], and the three region partitioning method [1], [3].

This paper focuses on parallelizing perfectly nested loops with non-uniform and flow dependences. The rest of this paper is organized as follows. Section 2 describes our loop model and reviews some fundamental concepts of non-uniform and flow dependence loops. Section 3 presents an improved tiling method for parallelizing nested loops with non-uniform and flow dependences. In that section, we show how to find the incrementing minimum dependence distances in the iteration space. Then, we discuss how to tile the iteration space efficiently according to the incrementing minimum dependence distance and how to achieve more parallelism by loop interchanging. Section 4 compares the method with related work. Finally, Section 5 concludes with directions for enhancing this work.

2 Data Dependence Analysis in Non-uniform and Flow Dependence Loop

The loop model considered in this paper is a doubly nested loop with linearly coupled subscripts, in which both the lower and upper bounds of the loop variables are known at compile time. The loop model has the form shown in Fig. 1.

    do i = l1, u1
      do j = l2, u2
        A(a11*i + b11*j + c11, a12*i + b12*j + c12) = ...
        ... = A(a21*i + b21*j + c21, a22*i + b22*j + c22)

Fig. 1. A doubly nested loop model

The dependence distance function d(i, j) in flow dependence loops gives the dependence distances d_i(i, j) and d_j(i, j) in dimensions i and j, respectively. For uniform dependence vector sets these distances are constant. For non-uniform dependence sets, however, these distances are linear functions of the loop indices. We can write these dependence distance functions in a general form as

    d(i, j) = (d_i(i, j), d_j(i, j))
    d_i(i, j) = p1*i + q1*j + r1
    d_j(i, j) = p2*i + q2*j + r2

where p1, q1, r1, p2, q2, and r2 are real values and i and j are integer variables of the iteration space.

The properties and theorems for tiling of nested loops with flow dependence can be described as follows.

Theorem 1. If there is only flow dependence in the loop, DCH1 contains flow dependence tails and DCH2 contains flow dependence heads.

Theorem 2. If there is only flow dependence in the loop, then d_i(x, y) = 0 or d_j(x, y) = 0 does not pass through any DCH.

If there exists only flow dependence in the loop, then d_i(x, y) = 0 or d_j(x, y) = 0 does not pass through any IDCH (Integer Dependence Convex Hull), because the IDCH is a subspace of the DCH (Dependence Convex Hull) [5].
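As a hedged illustration of how the coefficients of the distance functions follow from the subscript coefficients of the loop model, the Python sketch below equates the written and read subscripts, solves for the sink iteration by Cramer's rule, and subtracts the source iteration. The function name `distance_functions` and the tuple layout are assumptions made for this illustration and do not come from the paper.

```python
from fractions import Fraction


def distance_functions(write, read):
    """Return ((p1, q1, r1), (p2, q2, r2)) describing d_i and d_j.

    write = ((a11, b11, c11), (a12, b12, c12)): subscripts of the written reference.
    read  = ((a21, b21, c21), (a22, b22, c22)): subscripts of the read reference.
    Assumes a flow dependence from the write at (i, j) to a later read.
    """
    (a11, b11, c11), (a12, b12, c12) = write
    (a21, b21, c21), (a22, b22, c22) = read
    det = a21 * b22 - b21 * a22
    if det == 0:
        raise ValueError("read subscripts do not determine a unique sink")
    # Right-hand sides of  a21*i2 + b21*j2 = k1  and  a22*i2 + b22*j2 = k2,
    # each stored as coefficients of (i, j, 1) for the source iteration (i, j).
    k1 = (a11, b11, c11 - c21)
    k2 = (a12, b12, c12 - c22)
    # Cramer's rule, applied coefficient by coefficient.
    i2 = [Fraction(b22 * x - b21 * y, det) for x, y in zip(k1, k2)]
    j2 = [Fraction(a21 * y - a22 * x, det) for x, y in zip(k1, k2)]
    # Distance = sink - source.
    d_i = (i2[0] - 1, i2[1], i2[2])
    d_j = (j2[0], j2[1] - 1, j2[2])
    return d_i, d_j


# Loop that writes A(3*i+1, 4*i+2*j+1) and reads A(2*i-4, i+j-4).
d_i, d_j = distance_functions(((3, 0, 1), (4, 2, 1)), ((2, 0, -4), (1, 1, -4)))
print(d_i)  # coefficients of i, j, 1 in d_i
print(d_j)  # coefficients of i, j, 1 in d_j
```

For this loop the sketch gives d_i(i, j) = 1/2*i + 5/2 and d_j(i, j) = 5/2*i + j + 5/2. Exact rational arithmetic is used because the coefficients are in general not integers.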

Theorem 3. If there is only flow dependence in the loop, the minimum and maximum values of the dependence distance function d(x, y) appear on the extreme points.

Theorem 4. If there is only flow dependence in the loop, the minimum dependence distance value di_min is equal to or greater than zero.

From Theorem 4, we know that when there is only flow dependence in the loop and di_min is zero, dj_min is greater than zero. In this case, since d_j(x, y) = 0 does not pass through the IDCH, the minimum value of d_j(x, y), dj_min, occurs at one of the extreme points.

Theorem 5. If there is only flow dependence in the loop, the difference between the distance of a dependence and that of the next dependence, d_inc, is equal to or greater than zero.

3 Improved Tiling Method

Cho and Lee [2] present a more general and powerful loop splitting method to enhance all parallelism on a single loop. The method uses more information from the loop, such as increment factors and the difference between the distance of a dependence and that of the next dependence. Cho and Lee [3] derive an efficient method for nested loops with simple subscripts by enhancing [2].

The minimum dependence distance tiling method [6] presents an algorithm to convert the extreme points with real coordinates to extreme points with integer coordinates. The method obtains an IDCH from a DCH. It can compute di_min, the minimum value of the dependence distance function d_i(i, j), and dj_min, the minimum value of the dependence distance function d_j(i, j), from the extreme points of the IDCH. The first minimum dependence distances di_min and dj_min are used to determine the uniform tile size in the iteration space.

3.1 Tiling Method by the Incrementing Minimum Dependence Distance

From Theorem 5, when p1 > 0 and q1 ≥ 0, we know that the difference between the distance of a dependence and that of the next dependence in a loop with flow dependence, d_inc, is equal to or greater than zero. For each i, di_min is incremented as the value of i is incremented. So the second di_min is equal to or greater than the first one, the third one is greater than the second one, and so on.

The improved tiling method for doubly nested loops with non-uniform and flow dependence is described as Procedure Tiling_Method, the algorithm of tiling a loop by the incrementing minimum dependence distance shown in Fig. 2. This algorithm computes the incrementing minimum dependence distance, tiles the iteration space efficiently according to it, and transforms the loop into parallel loops.

    Procedure Tiling_Method(i, j, l1, l2, u1, u2, d(i, j))
      i, j: the i and j values for the source of the first minimum dependence in the loop, computed from the extreme points of the IDCH
      l1, l2, u1, u2: the lower and upper bounds of the outer loop and inner loop, respectively
      d(i, j): the dependence distance function of the IDCH
    begin
      Step 1: When the first source point, (i, j), is given, the first minimum dependence distance di_min and the first tile size are computed.
      Step 2: The next di_min is computed. If the next sink point is greater than the bound, go to Step 4.
      Step 3: The next tile size is computed; go to Step 2.
      Step 4: The original loop is transformed into n parallel tiles.
    end Tiling_Method

Fig. 2. Algorithm of tiling a loop by the incrementing minimum dependence distance

Example 1.

    do i = 1, 50
      do j = 1, 50
        A(3*i+1, 4*i+2*j+1) = ...
        ... = A(2*i-4, i+j-4)

Example 1 illustrates a loop with non-uniform and flow dependence. Fig. 3(a) shows the CDCH (Complete Dependence Convex Hull) of Example 1. Applying the improved tiling method proposed in this section to the example gives the following results.

Fig. 3. (a) CDCH, (b) Tiling by minimum dependence distance in Example 1.
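The steps of Fig. 2 can be approximated by a brute-force Python sketch, assuming Example 1 writes A(3*i+1, 4*i+2*j+1) and reads A(2*i-4, i+j-4). Instead of the IDCH extreme points that the procedure uses, the sketch enumerates every integer flow dependence inside the iteration space and repeatedly moves the tile boundary to the smallest sink i value reachable from the current tile. The names `flow_dependences` and `tile_starts` are hypothetical and chosen only for this illustration.

```python
def flow_dependences(n):
    """Yield (source, sink) iteration pairs of Example 1 in an n-by-n space."""
    for i1 in range(1, n + 1):
        # 3*i1 + 1 = 2*i2 - 4 has an integer solution only if 3*i1 + 5 is even.
        if (3 * i1 + 5) % 2:
            continue
        i2 = (3 * i1 + 5) // 2
        if i2 > n:
            break  # i2 grows with i1, so no later source has a sink in bounds
        for j1 in range(1, n + 1):
            # 4*i1 + 2*j1 + 1 = i2 + j2 - 4
            j2 = 4 * i1 + 2 * j1 + 5 - i2
            if j2 > n:
                break  # j2 grows with j1
            yield (i1, j1), (i2, j2)


def tile_starts(n):
    """Return the first i value of each tile along the outer loop."""
    deps = list(flow_dependences(n))
    starts, start = [1], 1
    while True:
        # Sinks of all dependences whose source lies at or after the tile start.
        sinks = [sink[0] for source, sink in deps if source[0] >= start]
        if not sinks:
            return starts  # no dependence left: the last tile runs to the bound
        start = min(sinks)  # the next tile begins at the nearest sink
        starts.append(start)


print(tile_starts(50))   # [1, 4, 10, 19]
print(tile_starts(100))  # [1, 4, 10, 19, 31, 49]
```

Every dependence that starts inside a tile ends at or beyond the next boundary, so the iterations inside one tile carry no dependence among themselves. The enumeration is quadratic in the loop bound and is meant only to check small cases, not to replace the compile-time analysis.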

From the algorithm to compute a two-dimensional IDCH in [5], we can obtain the extreme points (1, 1), (1, 22), and (18, 1), as shown in Fig. 3(a). The first minimum value of d_i(i, j) occurs at one of the extreme points. The i value for the source of the first dependence in the second tile is 4. The i value in the third tile is 10, and the next values are 19, 31, and 49. Then, we can divide the iteration space into four tiles as shown in Fig. 3(b).

3.2 Loop Tiling Method Using Loop Interchanging

When there is only flow dependence in the loop, we can tile the iteration space into tiles with width = di_min or width = dj_min. In case dj_min > di_min, we can tile the iteration space into tiles with width = dj_min. In Example 1, because d_j(i, j) (= 5/2*i + j + 5/2) is greater than d_i(i, j) (= 1/2*i + 5/2), we can use a changed form of the example in which the outer loop and the inner loop are interchanged, as shown in Fig. 4.

    do j = 1, 50
      do i = 1, 50
        A(3*i+1, 4*i+2*j+1) = ...
        ... = A(2*i-4, i+j-4)

Fig. 4. Another form of Example 1 by loop interchanging.

If the upper limits of loops i and j are 100 by 100 in the example given in Fig. 4, the number of tiles for the original loop is six as shown in Fig. 5(a), and the number for the interchanged loop is five as shown in Fig. 5(b). When dj_min > di_min in this loop, we can achieve greater parallelism by loop interchanging.

Fig. 5. (a) Tiling by the incrementing minimum dependence distance, (b) Tiling by loop interchanging in Example 1.
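To compare the two loop orders, the following Python sketch counts tiles along either axis by brute force, assuming Example 1 writes A(3*i+1, 4*i+2*j+1) and reads A(2*i-4, i+j-4). Tiling along axis 0 corresponds to the original loop (tiles in i), and tiling along axis 1 corresponds to the interchanged loop (tiles in j). The helper names `dependences` and `count_tiles` are assumptions for this illustration, and the enumeration stands in for the IDCH-based computation of the paper.

```python
def dependences(n):
    """List (source, sink) pairs of Example 1 inside an n-by-n iteration space."""
    deps = []
    for i1 in range(1, n + 1, 2):  # only odd i1 makes i2 = (3*i1 + 5)/2 an integer
        i2 = (3 * i1 + 5) // 2
        for j1 in range(1, n + 1):
            j2 = 4 * i1 + 2 * j1 + 5 - i2  # from 4*i1 + 2*j1 + 1 = i2 + j2 - 4
            if i2 <= n and j2 <= n:
                deps.append(((i1, j1), (i2, j2)))
    return deps


def count_tiles(n, axis):
    """Count tiles along axis 0 (loop i outermost) or axis 1 (loop j outermost)."""
    deps = dependences(n)
    tiles, start = 1, 1
    while True:
        # Sink coordinates of dependences whose source is at or after the tile start.
        sinks = [sink[axis] for source, sink in deps if source[axis] >= start]
        if not sinks:
            return tiles
        start = min(sinks)  # the next tile begins at the nearest sink
        tiles += 1


for n in (100, 1000):
    print(n, count_tiles(n, axis=0), count_tiles(n, axis=1))
# 100 6 5
# 1000 11 8
```

The counts agree with the six and five tiles reported for the 100 by 100 case. Because d_j grows faster than d_i in this loop, the tile boundaries along j spread out more quickly, which is why the interchanged order needs fewer tiles.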

4 Performance Analysis

This section analyzes the performance of our proposed methods through a theoretical comparison with related work. Theoretical speedup can be computed as follows. Ignoring the synchronization, scheduling, and variable renaming overheads, and assuming an unlimited number of processors, each partition can be executed in one time step. Hence, the total execution time is equal to the number of parallel regions, N_p, plus the number of sequential iterations, N_s. Generally, speedup is the ratio of the total sequential execution time to the execution time on a parallel computer system:

    Speedup = (N_i * N_j) / (N_p + N_s)

where N_i and N_j are the sizes of loops i and j, respectively.

We compare our proposed methods with the minimum dependence distance tiling method and the unique set oriented partitioning method, using the loop shown in Example 1 with the 100 by 100 bounds of Fig. 5. Fig. 3(a) shows the original partitioning of Example 1. In this example there is only flow dependence and DCH1 overlaps DCH2.

Applying the unique set oriented partitioning to this loop illustrates case 2 of [4]. This method divides the iteration space into four regions: three parallel regions, AREA1, AREA2, and AREA4, and one serial region, AREA3, as shown in Fig. 6. The speedup for this method is (100*100)/(3+44) = 212.8.

Fig. 6. Regions of the loop partitioned by the unique set oriented partitioning in Example 1.

Applying the minimum dependence distance tiling method to this loop illustrates case 1 of this technique [5], which is the case in which the line d_i(i, j) = 0 does not pass through the IDCH. The minimum value of d_i(i, j), di_min, occurs at the extreme point (1, 1), and di_min = 3. The space can be tiled with width = 3, so 34 tiles are obtained. The speedup for this method is (100*100)/34 = 294.

Let us apply our proposed method, the improved tiling method given in Section 3. This loop is tiled into six areas as shown in Fig. 5(a). The iterations within each area can be fully executed in parallel. So the speedup for this method is (100*100)/6 = 1666.
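The arithmetic behind these figures can be checked with a short Python sketch of the speedup formula. The function name `speedup` and its parameter names are assumptions for this illustration; the region and tile counts are the ones quoted in this section for the 100 by 100 loop.

```python
import math


def speedup(n_i, n_j, parallel_regions, sequential_iterations=0):
    """Theoretical speedup (N_i * N_j) / (N_p + N_s) with unlimited processors."""
    return (n_i * n_j) / (parallel_regions + sequential_iterations)


# Uniform tiling of a 100-wide outer loop with width di_min = 3.
uniform_tiles = math.ceil(100 / 3)

results = {
    "unique set oriented partitioning": speedup(100, 100, 3, 44),
    "minimum dependence distance tiling": speedup(100, 100, uniform_tiles),
    "incrementing minimum dependence distance": speedup(100, 100, 6),
}
for name, value in results.items():
    print(f"{name}: {value:.1f}")
```

The sketch prints 212.8, 294.1, and 1666.7; the text truncates the last two values to 294 and 1666.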

Applying the loop interchanging method to this example, the loop is tiled into five areas as shown in Fig. 5(b). So the speedup for this method is (100*100)/5 = 2000. If the upper bounds of loops i and j are 1000 by 1000, the speedup for the original loop is (1000*1000)/11 = 90909, and the speedup for the interchanged loop is (1000*1000)/8 = 125000. Because dj_min > di_min in this example, we can achieve more parallelism by loop interchanging.

5 Conclusions

In this paper, we have studied the problem of transforming nested loops with non-uniform and flow dependences to maximize parallelism. When there is only flow dependence in the loop, we propose the improved tiling method. The minimum dependence distance tiling method tiles the iteration space uniformly by the first minimum dependence distance. Our proposed method, however, tiles the iteration space by minimum dependence distance values that are incremented as the value of i is incremented. Furthermore, when dj_min > di_min in the given loop, loop parallelism can be improved by loop interchanging. In comparison with previous partitioning methods, the improved tiling method gives much better speedup than the minimum dependence distance tiling method and the unique set oriented partitioning method in the case where there is only flow dependence and DCH1 overlaps DCH2. Our future research work is to develop a method for improving parallelization of higher dimensional nested loops.

References

1. A. A. Zaafrani and M. R. Ito, "Parallel region execution of loops with irregular dependences," in Proceedings of the International Conference on Parallel Processing, vol. II, (1994) -9
2. C. K. Cho, J. C. Shim, and M. H. Lee, "A loop transformation for maximizing parallelism from single loops with non-uniform dependences," in Proceedings of High Performance Computing Asia '97, (1997) 696-699
3. C. K. Cho and M. H. Lee, "A loop parallelization method for nested loops with non-uniform dependences," in Proceedings of the International Conference on Parallel and Distributed Systems, (1997) 34-32
4. J. Ju and V. Chaudhary, "Unique sets oriented partitioning of nested loops with non-uniform dependences," in Proceedings of International Conference on Parallel Processing, vol. III, (1996) 45-52
5. S. Punyamurtula and V. Chaudhary, "Minimum dependence distance tiling of nested loops with non-uniform dependences," in Proceedings of Symposium on Parallel and Distributed Processing, (1994) 74-8
6. S. Punyamurtula, V. Chaudhary, J. Ju, and S. Roy, "Compile time partitioning of nested loop iteration spaces with non-uniform dependences," Journal of Parallel Algorithms and Applications, (1996)
7. T. Tzen and L. Ni, "Dependence uniformization: A loop parallelization technique," IEEE Transactions on Parallel and Distributed Systems, vol. 4, no. 5, (1993) 547-558