Chain Pattern Scheduling for nested loops


Florina Ciorba, Theodore Andronikos and George Papakonstantinou
Computing Systems Laboratory, Computer Science Division, Department of Electrical and Computer Engineering, National Technical University of Athens, Zographou Campus, 15773, Athens, Greece
Email: {cflorina,tedandro}@cslab.ntua.gr
(Partially supported by the Greek State Scholarships Foundation.)

Abstract

It is well known that most time-consuming applications consist of nested DO(FOR) loops. The iterations within a loop nest are either independent or precedence constrained. Furthermore, the precedence constraints can be uniform (constant) or non-uniform throughout the execution of the program. The index space of a uniform dependence loop, due to the existence of dependence vectors, is partitioned into subspaces of points that can be executed at a certain time instance. The geometric representations of these sets form polygonal shapes called wavefronts or patterns, with special attributes and characteristics. In this paper we tackle the uniform precedence constraints case, and we present a novel scheduling technique for perfectly nested loops. The presented method exploits the geometric properties of the index space in order to arrive at an efficient geometric decomposition.

General multiprocessor scheduling with precedence constraints was proved to be NP-complete [1], even when the number of available processors is unbounded [2]. However, in our case things are simpler, as the index space is characterized by regularity due to the presence of uniform dependence vectors. In many cases where certain conditions between the dependence vectors hold, one can predict efficiently the exact geometric outline of the wavefront at each moment [3][4]. One such case is when the initial wavefront is repeated k times at moment k (see Figure 2(b)).

Using wavefronts one can find the optimal solution regarding the total makespan, in contrast to the classic hyperplane approach [5], where the solution provided is nearly optimal, achieving a makespan within a constant delay of the earliest possible.

When parallelizing nested loops, the following tasks need to be performed: (1) detection of the inherent parallelism and application of program transformations that enable it; (2) computation scheduling (specifying when the different computations are performed); (3) computation mapping (specifying where the different computations are performed); (4) explicitly managing the memory and communication (an additional task for distributed memory systems); and (5) generating the code so that each processor will execute its apportioned computation and communication explicitly. In this paper we focus on the first three tasks: parallelism detection, computation scheduling and computation mapping, or in other words the time-scheduling and space-mapping. We achieve these via the use of a novel and efficient communication scheme we call chain pattern for data exchange between processors. We are currently working on the code generation based on this scheme for various high performance architectures.

This work is an extension of [3][4], in which a geometric method of pattern-outline prediction was presented for uniform dependence loops, but no space mapping method was given. In [3] and [4] no communication cost was taken into account, and the Unit-Execution-Time (UET) model was assumed. In contrast, in this paper the cost of communication is accounted for and a trade-off is given between different space-mapping schemes. Several wavefront scheduling methods are presented and compared in [6]. In this paper we combine the method in [3] with the static cyclic scheduling of [6], thus presenting the chain pattern scheduling, which is an improvement with respect to [3][4] in that it enhances the data locality utilizing methods from computational geometry, in order to take advantage of the regularity of the index space. In particular, we identify groups of iteration points that we call chains, which are points connected via a specific dependence vector, called the communication vector d_c [4]. Subsequently, we map such chains to the same processor so as to enhance the data locality. The contributions of this work are the following: (1) we present a simple method for scheduling nested loops while enhancing the data locality, without the complicated transformations of other approaches (e.g., tiling); (2) we introduce a new scheme for efficient space-time mapping that trades off execution vs. communication time in order to achieve the overall minimum completion time. It can also be noted that our method is general enough to be applied to various high performance computing architectures, such as distributed systems, SMP systems and heterogeneous systems.

Keywords: patterns, data locality, uniform dependence loop, nested loop scheduling.

1 Definitions and Notation

The index space J of an n-dimensional uniform dependence loop is an n-dimensional subspace of N^n. Due to the existence of dependence vectors, only a certain set of points can be executed at every moment [4]. The geometric border of this set forms a polygonal shape called a pattern. Our scheduling policy executes every index point at its earliest computation time (ECT), as forced by the existing dependence vectors. This policy guarantees the optimal execution time of the entire loop (under the assumption of an unbounded number of processors). The index space is partitioned into disjoint time subsets denoted Pat_k, k >= 0, such that Pat_k contains the set of points of J whose earliest computation time is k. We call Pat_k the pattern corresponding to moment k. By definition, Pat_0 denotes the boundary (pre-computed) points; the pre-computed points designate iteration points that do not belong to the first quadrant of the index space, but represent initial values for specific problems. The geometric shape of the subset of index points that can be computed initially is called the initial pattern (see Figure 1) and is denoted Pat_1. The pattern-outline is the upper boundary of each Pat_k and is denoted pat_k, k >= 1. In this work, the points of the pattern outlines are given and computed in the clockwise order that reflects the lexicographic ordering of the initial loop, unless noted. Therefore, pat_k is a list of points, not a simple set. The pattern points are those points that are necessary in order to define the polygonal shape of the pattern; the rest of the points just complete the polygon area. The pattern vectors are those dependence vectors d_i whose end-points are the pattern points of the initial pattern. Their number is usually denoted m. Figure 1 depicts these definitions.

Partitioning the index space into time patterns is beneficial for the time scheduling. However, it is not sufficient for an efficient space scheduling, because it does not address the data locality issue. Therefore, an additional decomposition of the index space is necessary in order to enhance the data locality [7][8]. Usually, there are certain sequences of points that are computed by the same processor. Each such sequence can be viewed as a chain of computations that is created by a certain dependence vector, as shown in Figure 1. This dependence vector is called the communication vector and will be referred to as d_c; it is usually chosen to be the dependence (pattern) vector that incurs the largest amount of communication (in most cases this is the vector with the smallest coordinates). Therefore, J is partitioned into such chains, denoted C_j = {j + k*d_c : k in N}, for j in Pat_0. C_j is called the chain with origin j.
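As a concrete illustration of these definitions, the following minimal Python sketch computes the ECT of every point, the resulting patterns Pat_k, and the chain decomposition along d_c for a small 2-dimensional index space. This is our own sketch, not code from the paper; the bounds are an assumption, and the dependence vectors are those of the example in Figure 1.

```python
from itertools import product

# Illustrative 2-D uniform dependence loop; the bound N is an
# assumption, the dependence vectors follow the Figure 1 example.
N = 10                                    # index space J = [0, N) x [0, N)
deps = [(1, 3), (2, 2), (4, 1), (4, 3)]   # dependence vectors d_1..d_4
d_c = (2, 2)                              # chosen communication vector (d_2)

# Earliest computation time (ECT): a point is ready one step after all
# of its predecessors; a predecessor outside J is a pre-computed
# boundary point (Pat_0) with ECT 0. Anti-diagonal order guarantees
# every predecessor is visited before the points that depend on it.
ect = {}
for j in sorted(product(range(N), repeat=2), key=lambda p: p[0] + p[1]):
    preds = [(j[0] - d[0], j[1] - d[1]) for d in deps]
    ect[j] = 1 + max(ect.get(p, 0) for p in preds)

# Pat_k is the set of points whose ECT equals k (Pat_1 is the initial pattern).
patterns = {}
for j, k in ect.items():
    patterns.setdefault(k, set()).add(j)

# Chain C_j: the points j, j + d_c, j + 2*d_c, ... inside J; the origins
# are exactly the points whose d_c-predecessor falls outside J.
def chain(origin):
    x, y = origin
    points = []
    while 0 <= x < N and 0 <= y < N:
        points.append((x, y))
        x, y = x + d_c[0], y + d_c[1]
    return points

origins = [j for j in ect
           if not (0 <= j[0] - d_c[0] < N and 0 <= j[1] - d_c[1] < N)]
chains = {o: chain(o) for o in origins}

print(len(patterns), "patterns;", len(chains), "chains")
```

The pattern width defined below can then be read off directly: it is the number of points of one chain that share the same ECT value.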

Figure 1: The index space of an example algorithm with dependence vectors d_1 = (1, 3), d_2 = (2, 2), d_3 = (4, 1), and d_4 = (4, 3). Each shaded polygon depicts the subset of J that can be executed at a certain moment; this is what we call a pattern, and the first 3 patterns are shown. The dashed lines mark the border of each pattern, called the pattern outline. Except for d_4, all vectors contribute to the formation of the pattern outline and are therefore called pattern vectors. On the axes lie the points pre-computed at moment t = 0 (the boundary points). The communication vector in this example is d_c = d_2. The chain with origin (0, 0) (C_(0,0)) is shown, as well as the chains it sends data to due to d_1, d_4 and d_3, i.e., C_(0,2), C_(1,0) and C_(3,0) respectively.

The points of any C_j communicate via d_i (i ranging over every dependence vector except d_c) with the points of chain C_(j_i). It is appropriate now to define the pattern width as the number of iteration points along chain C_j which belong to Pat_k, k >= 1. In Figure 1, for all points between the lines formed by d_1 and d_3 the pattern width is 2 (points) for all patterns, whereas for all other points it is 1 (point) for all patterns. Hence, it can be noted that the pattern width depends on the dependence vectors of the problem.

2 Chain Pattern Scheduling

The chain pattern scheduling is devised to enhance the data locality of programs containing uniform nested loops using the concept of chains, while taking advantage of the optimality of the scheduling method based on patterns. When enhancing data locality, there are three possible scenarios with respect to the volume of communication and overall performance of the parallel program: (1) little or no communication at all, yielding high performance; (2) little communication,

yielding mediocre performance; and (3) moderate communication, yielding good performance. By mapping all points of a chain C_j to a single processor, the communication incurred by d_c is completely eliminated. Furthermore, by mapping all points of a chain C_j, together with all other chains it communicates with, to the same processor, the communication incurred by the other dependence vectors is also eliminated, increasing the performance of the parallel program using this time-space mapping. A common feature of all three scenarios is that chains are mapped to processors starting with the chain C_(0,0) and proceeding with the chains to its left (or above) and right (or below), like a fan spreading out. It is straightforward that this way more points are released (from their dependencies) for execution than in any other way.

For the first scenario, consider the examples shown in Figure 2. This is the case where there are enough available processors so that each chain is mapped to a different processor and executed concurrently. Two similar examples are given: (a) an index space with three dependence vectors, d_1 = (1, 3), d_2 = d_c = (2, 2) and d_3 = (4, 1), where chains are formed along d_2; and (b) an index space with two dependence vectors, d_1 = (4, 1) and d_2 = d_c = (2, 2), where chains are also formed along d_2. One can see that in both cases a large number of chains would be needed. This usually is an unrealistic assumption, since the number of processors must depend on the problem size, which in most cases is pretty large. A sketch of this one-to-one mapping is given after the figure.

Figure 2: Scenario (1), little or no communication / high performance: every chain is mapped to a different processor (an unrealistic assumption with respect to the number of available processors).
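Scenario (1) amounts to a one-to-one map from chains to processors. A minimal sketch, reusing the `chains` dictionary from the earlier sketch (plain sorted order stands in for the fan-out from C_(0,0)):

```python
# Scenario (1): one processor per chain (assumes unboundedly many
# processors). `chains` is the {origin: points} map built earlier;
# sorted() is a simplification of the fan-out order from C_(0,0).
placement = {origin: rank for rank, origin in enumerate(sorted(chains))}
num_processors = len(placement)   # grows with the problem size
```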

For the second scenario, consider the examples given in Figure 3. In order to reduce the amount of communication, the chains above and below d_c are mapped to as many different processors as needed for this scenario, so as to minimize the number of required processors. In particular, in (a) chains below d_c are mapped to 3 different processors so as to eliminate the communication due to d_3, while above d_c chains are mapped to 2 different processors so as to eliminate the communication due to d_1. Similarly, in (b) chains below d_c are mapped to 3 different processors, thus eliminating the communication due to d_1. However, there is no other dependence vector creating communication above the chosen d_c; therefore, the mapping scheme applied below d_c is also applied to the chains above d_c, hence yielding no communication at all. Therefore, the case depicted in (b) can be considered a particular case of scenario (1), where there are sufficiently few dependence vectors that all communication is eliminated using a reasonable number of processors. This scenario is more realistic than (1); however, it does not yield very good performance, since the number of processors required is in most cases small, therefore limiting the parallelism to a certain extent. A sketch of this mapping is given after the figure.

Figure 3: Scenario (2), little communication / mediocre performance: chains above and below d_c are mapped to processors so as to minimize the communication due to the dependence vectors above and below d_c. The chains above designate chains to the left of chain C_(0,0); the chains below designate chains to the right of chain C_(0,0).
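A sketch of scenario (2) for case (a), under the same illustrative assumptions as before. With d_c = (2, 2), the quantity x - y is invariant along a chain, so it can serve as the chain index; mapping chains cyclically with a period equal to the chain offset of a dependence vector places communicating chains on the same processor. The helper names and the indexing trick are ours:

```python
# Scenario (2), case (a): few processors, little communication.
# With d_c = (2, 2), x - y is constant along a chain, so it indexes the
# chains: d_3 = (4, 1) jumps 3 chains (period 3 below C_(0,0)), while
# d_1 = (1, 3) jumps 2 chains (period 2 above C_(0,0)).
BELOW_PERIOD, ABOVE_PERIOD = 3, 2

placement = {}
for origin in chains:                       # from the earlier sketch
    index = origin[0] - origin[1]           # chain index along the fan
    if index >= 0:                          # C_(0,0) and chains below it
        placement[origin] = index % BELOW_PERIOD                    # procs 0..2
    else:                                   # chains above C_(0,0)
        placement[origin] = BELOW_PERIOD + (-index) % ABOVE_PERIOD  # procs 3..4
```

Within each side, a chain and the chain it feeds through d_3 (respectively d_1) now land on the same processor, which is exactly what eliminates that vector's communication.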

For the third scenario, consider the cases shown in Figure 4. In order to have moderate communication, but better performance than in scenario (2), an arbitrary number of processors was considered (herein 5 processors). In particular, in (a) the first three processors are mapped to chains below d_c, whereas the other two processors are mapped to chains above d_c. The difference from Figure 3(a) is that in this case, not all chains span the entire index space, but only a part of it (i.e., below or above d_c). The benefits of doing so are that a more realistic number of processors is assumed while still having acceptable communication, and performance is increased as such. In (b), due to the fact that there is no dependence vector above d_c, a cyclic mapping of chains is possible, starting with the chain C_(0,0) and moving above and below, so as to incrementally release points for execution. This is similar to Figure 3(b), with the difference that 5 processors were used instead of 3. This scenario has the advantage that it does not limit the degree of parallelism, because it uses all the available processors; hence it is the most realistic of all three presented scenarios.

Figure 4: Scenario (3), moderate communication / good performance: chains are mapped to the available processors so as to minimize communication (the most realistic of all scenarios regarding the number of processors).

For the sake of simplicity, in all the above scenarios, regardless of the pattern width, every chain was assigned to a single processor. One should note, however, that depending on the pattern width, each chain may be assigned to more than one processor in a way that no communication among them is required. A sketch of the cyclic mapping of case (b) follows.
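A sketch of scenario (3), case (b): chains are dealt out cyclically over a fixed pool of P processors, fanning out from C_(0,0). The ordering helper is our simplification of the fan described in the text:

```python
from itertools import zip_longest

# Scenario (3): P available processors (P = 5 in the text's example);
# chains are handed out cyclically, fanning out from C_(0,0) so that
# points are released for execution as early as possible.
P = 5

def fan_out(origins):
    """Interleave chains to the right (below d_c) and to the left
    (above d_c) of C_(0,0) -- a simplified stand-in for the fan
    described in the text."""
    below = sorted(o for o in origins if o[0] >= o[1])
    above = sorted(o for o in origins if o[0] < o[1])
    order = []
    for pair in zip_longest(below, above):
        order.extend(o for o in pair if o is not None)
    return order

placement = {o: rank % P for rank, o in enumerate(fan_out(chains))}
```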

This is best suited for the case where the target architecture is a distributed memory system, which usually contains single-processor nodes. On the other hand, if the width of the pattern is one, a chain may still be assigned to more than one processor; however, the processors will have to communicate during execution. This is best suited to the case where the target architecture is a symmetric multiprocessor system (SMP), where processors within a node can communicate through the local (intra-node) shared memory. Our approach is also suitable for heterogeneous networks, in which case a processor with higher computational power would be given either longer or more chains, whereas one with lesser computational power would be given either fewer or shorter chains of points. It is obvious that the ratio of communication time to processing time is critical for deciding which scenario and which architecture suits a specific application best. Theoretical formulas can be derived for obtaining the optimal makespan as a function of the number of processors and the ratio of communication time to processing time.

3 Conclusion

The chain pattern scheduling described above has some similarities with the static cyclic and static block scheduling methods described in [6]. The similarities are: the assignment of iterations to a processor is determined a priori and remains fixed, yielding reduced scheduling overhead; and they require explicit synchronization between dependent chains or tiles. However, the three scheduling methods are at the same time different, the main differences being: iterations within a chain are independent, while within a tile they are not (this promotes the adaptability of our scheduling to different architectures, as mentioned above); and our method significantly enhances the data locality, hence performing best when compared to the static cyclic and block methods. Our methodology can easily be augmented to automatically generate the appropriate SPMD parallel code, and we are currently working towards this goal. Preliminary experimental results show that the presented method outperforms other scheduling techniques and corroborate the efficiency of the adaptive scheduling.

References

[1] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.

[2] C. Papadimitriou and M. Yannakakis, Towards an architecture-independent analysis of parallel algorithms, SIAM Journal on Computing, vol. 19, no. 2, pp. 322-328, 1990. Extended abstract in Proceedings of STOC 1988.

[3] I. Drositis, T. Andronikos, A. Kokorogiannis, G. Papakonstantinou, and N. Koziris, Geometric pattern prediction and scheduling of uniform dependence loops, in 5th Hellenic European Conference on Computer Mathematics and its Applications - HERCMA 2001, (Athens), 2001.

[4] G. Papakonstantinou, T. Andronikos, and I. Drositis, On the parallelization of UET/UET-UCT loops, NPSC Journal on Computing, 2001.

[5] I. Drositis, T. Andronikos, M. Kalathas, G. Papakonstantinou, and N. Koziris, Optimal loop parallelization in n-dimensional index spaces, in Proceedings of the 2002 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'02), 2002.

[6] N. Manjikian and T. S. Abdelrahman, Exploiting wavefront parallelism on large-scale shared-memory multiprocessors, IEEE Transactions on Parallel and Distributed Systems, vol. 12, no. 3, pp. 259-271, 2001.

[7] B. L. Massingill, T. G. Mattson, and B. A. Sanders, Patterns for parallel application programs, in Proceedings of the 6th Pattern Languages of Programs (PLoP '99), 1999.

[8] T. G. Mattson, B. A. Sanders, and B. L. Massingill, Patterns for Parallel Programming. Addison-Wesley, 1st ed., 2004.