Parallelization of the N-queens problem. Load unbalance analysis.

Laura De Giusti (1), Pablo Novarini (2), Marcelo Naiouf (3), Armando De Giusti (4)
Research Institute on Computer Science LIDI (III-LIDI) (5)
Faculty of Computer Science - National University of La Plata

Abstract

The paper presents an analysis of three parallelization structures for the N-queens problem, each using N processors. The focus is on how well the architecture fits each type of algorithm, in order to study the load unbalance in each case; two metrics are defined for this purpose. The experimental results and the efficient implementation of the algorithms are discussed, together with the related current research lines.

Key words: Parallel Systems, Parallel Algorithms, Load Balance, Complexity.

Introduction

The increasing importance of and interest in parallel processing within Computer Science is clear for several reasons. Generally speaking, parallel machines make it possible to solve problems of increasing complexity and to obtain results faster, and in several cases they are the only viable choice, since sequential solutions take unacceptably long. Besides providing faster solutions, parallel applications can solve larger, more complex problems whose input data or intermediate results exceed the memory capacity of a single CPU; simulations can be run at finer resolution, and physical phenomena can be modeled more realistically [1] [2].

A parallel algorithm should not be considered in isolation but together with the computing model for which it was designed. Unlike sequential computation, where the Random Access Machine (RAM) is practically accepted as a standard, parallelism has no unifying theoretical model (since each model emphasizes certain aspects over others), and there is a broad diversity of platforms. Moreover, for each application there could be an optimal machine and various implementation alternatives, with homogeneous or heterogeneous hardware and with tight or loose coupling of its components. For this reason, a parallel system is defined as the combination of an algorithm and an architecture [3] [4].

(1) UNLP Scholar. Part-time Teaching Assistant, Faculty of Computer Science, UNLP. ldgiusti@lidi.info.unlp.edu.ar
(2) Part-time Teaching Assistant, Faculty of Computer Science, UNLP. pnovarini@info.unlp.edu.ar
(3) Chair Professor, Faculty of Computer Science, UNLP. mnaiouf@lidi.info.unlp.edu.ar
(4) Principal Researcher of CONICET. Full-time Chair Professor, Faculty of Computer Science, UNLP. degiusti@info.unlp.edu.ar
(5) III-LIDI, member of the Research Institute on Computer Science and Technology (IICyTI). Tel/Fax +(54)(221)422-7707. http://lidi.info.unlp.edu.ar

CACIC 2003 - RedUNCI 397

A parallel application defines a set of intercommunicating components that must be assigned to the physical resources of the target architecture. The last step in the development of a parallel algorithm is the mapping of processes onto processors. The aim is to optimize processor use and obtain the best response time for the application by distributing the work so that the computational load on the processors tends to remain equal (balanced) over time [5] [6] [7]. This is one of the core aspects of parallel processing, since it has a direct impact on the efficient use of resources (which imply costs) and on the achievable performance improvement.

Among problem types, a distinction can be made between those whose nature allows a parallelization with near-optimal load balance (generally, those with a regular execution pattern, such as some solutions to the matrix multiplication problem), and those whose execution pattern is irregular, dynamic or data-dependent (where load balance is harder to achieve) [8] [9] [10].

This paper analyzes the load balance obtainable when parallelizing a problem that is not inherently balanced. The N-queens problem belongs to this type, and it has thus been chosen for the analysis.

N-queens Problem. Sequential Solution

The N-queens problem is a generalization of the well-known 8-queens problem, which consists in arranging 8 queens on a chessboard so that none can take another. A queen attacks another if they are located on the same diagonal, row or column. In the N-queens case, N queens are placed on an NxN board. The number of solutions is known; for instance, there are 92 solutions for placing 8 queens on an 8x8 board [11] [12].

Other problems can be derived from this one. Among them, it is worth mentioning the problem which, given an NxN board, looks for the smallest number of queens that can be placed so that every square of the board is attacked by some queen [13].

An initial sequential solution to the N-queens problem consists in testing all possible placements of queens on the board and keeping the valid ones. A placement is valid if no queen on the board is attacked by another. This solution can be improved by discarding, during the search, the partial placements from which no solution can be reached [14].
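The pruned search described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `count_solutions` and the counter `squares_tested` are hypothetical, and counting one unit of work per examined square is an assumption about how the number of analyzed squares is measured.

```python
def count_solutions(n):
    """Return (number of solutions, squares examined) for the n-queens problem."""
    cols = [False] * n                 # cols[c]: column c already holds a queen
    diag_down = [False] * (2 * n - 1)  # indexed by row - col + n - 1
    diag_up = [False] * (2 * n - 1)    # indexed by row + col
    solutions = 0
    squares_tested = 0                 # assumed load measure: one unit per square examined

    def add_queen(row):
        nonlocal solutions, squares_tested
        for col in range(n):
            squares_tested += 1
            d1, d2 = row - col + n - 1, row + col
            if cols[col] or diag_down[d1] or diag_up[d2]:
                continue               # square is attacked: prune this branch
            if row == n - 1:
                solutions += 1         # last row filled: a complete solution
                continue
            cols[col] = diag_down[d1] = diag_up[d2] = True
            add_queen(row + 1)
            # unmark so that other combinations can be tested
            cols[col] = diag_down[d1] = diag_up[d2] = False

    add_queen(0)
    return solutions, squares_tested
```

For example, `count_solutions(8)` returns `(92, 15720)`: the 92 solutions quoted in the text, found after examining 15,720 squares.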

The pseudo-code of the sequential solution is as follows:

    begin
        array()        // mark all diagonals and columns as empty
        call addqueen
    end

    procedure addqueen()
        // place a queen on the following row
        row++
        for each column i (i: 1..n) do
            test if a queen can be placed on column i
            if true then
                mark the column and diagonals as filled
                if it is the last row then
                    new solution found
                else
                    call addqueen
                unmark the column and diagonals to test other combinations
        row--          // return to the previous row

The complexity of the solution above grows non-linearly as the size of the board increases.

Load Balance for the N-queens problem

This paper aims at analyzing the load balance in the parallelization of the N-queens problem. The load of a processor is measured as the total number of squares it analyzes when trying to place a queen (l_i). To measure the load unbalance, two metrics were used:

1- Unbalance between maximum and minimum load: this metric is the ratio between the maximum and the minimum work load of the processors.

$$M_1 = C_{max} / C_{min} \qquad (1)$$

For instance, if C_max = 100 and C_min = 50, then M_1 = 2. This means that the processor that works the most does twice as much work as the one that works the least.

2- Unbalance related to the average work done: this metric is the percentage deviation of the work done by the processors with respect to the average work done.

$$M_2 = (P \cdot 100) / T_{prom}, \quad \text{where} \quad P = \Big( \sum_{i=1}^{N} |T_i - T_{prom}| \Big) / N \qquad (2)$$

Here T_i is the work done by processor i and T_prom is the average work per processor. For instance, if M_2 = 0 the balance is optimal; if M_2 = 50, each processor deviates on average by 50% from the work it would carry out under an optimal balance.
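As an illustration, the two metrics can be computed from a list of per-processor loads. The sketch below assumes that the deviation in (2) is taken in absolute value, which is consistent with the M_2 values reported in the paper; the function names are hypothetical.

```python
def unbalance_m1(loads):
    """Metric (1): ratio between the maximum and the minimum load."""
    return max(loads) / min(loads)


def unbalance_m2(loads):
    """Metric (2): mean absolute deviation from the average load, in percent."""
    average = sum(loads) / len(loads)                              # T_prom
    p = sum(abs(load - average) for load in loads) / len(loads)    # P
    return p * 100 / average
```

With the example from the text, `unbalance_m1([100, 50])` gives `2.0`. Two processors with loads 50 and 150 give `unbalance_m2([50, 150]) == 50.0`, since each deviates by 50% from the average of 100.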

Parallelization of the N-queens problem

All the algorithms presented below perform N stages where, in each stage i, they try to place queens on all the valid positions of row i.

1- Solution with processor pipe:

Let P_1 ... P_N be processors and T_i the set of boards with queens placed on valid positions on the first i rows. Processor P_i is in charge of placing the queens on row i. The work begins with processor P_1, which places a queen on each possible position of row 1, thus obtaining a set of initial boards (T_1). Each of these boards is passed to processor P_2, which places a queen on every valid position of row 2 in each of them and passes the resulting boards (T_2) to the next processor. The process is repeated until processor P_N is reached. P_N receives from processor P_{N-1} the boards with queens correctly placed on the first N-1 rows (T_{N-1}), and its task is to place the queens on row N, in order to obtain the set of solutions to the problem (T_N).

[Figure: the pipe architecture. P_1 passes T_1 to P_2, P_2 passes T_2 to the next processor, and so on until P_N receives T_{N-1} and produces T_N, the set of valid solutions.]

The following graphic shows the quantity of work done by each processor for a board of size 16 (N=16), followed by a table with the work load of each processor.

[Figure: work load (in millions of analyzed squares) versus processor number, N=16.]

    Processor no.   Work load (l_i)
    1                          16
    2                         256
    3                       3,360
    4                      35,776
    5                     315,008
    6                   2,268,992
    7                  13,421,056
    8                  63,975,296
    9                 245,195,328
    10                741,742,016
    11              1,735,663,456
    12              3,102,285,760
    13              4,164,854,528
    14              4,062,362,112
    15              2,738,528,288
    16              1,152,033,408
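The per-stage loads of the pipe can be sketched with a breadth-first generation of the board sets T_i. The code assumes that processor P_i examines all N squares of row i on every board it receives, so its load is N times the number of boards in T_{i-1}; this assumption agrees with the first values in the table (16, 256, 3,360, 35,776). The function name and the `stages` parameter are hypothetical.

```python
def pipeline_loads(n, stages=None):
    """Load (squares examined) of each pipe stage for an n x n board.

    `stages` limits how many stages are computed (default: all n).
    A board is a tuple holding the queen's column for each filled row.
    """
    stages = n if stages is None else stages
    boards = [()]                      # T_0: the empty board
    loads = []
    for row in range(stages):
        # assumption: every received board costs n examined squares
        loads.append(len(boards) * n)
        next_boards = []
        for board in boards:
            for col in range(n):
                # valid if no earlier queen shares the column or a diagonal
                if all(c != col and abs(c - col) != row - r
                       for r, c in enumerate(board)):
                    next_boards.append(board + (col,))
        boards = next_boards           # T_{row+1}, passed to the next processor
    return loads
```

For N=8 this gives `[8, 64, 336, 1120, 2752, 4544, 4400, 2496]`, so the ratio between the largest and smallest load is 4544 / 8 = 568. Storing every board of a stage at once is only practical for small N; it is used here to mirror the stage-by-stage structure of the pipe.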

Next, the two metrics defined in (1) and (2) are analyzed for different numbers of processors:

    N     M_1            M_2
    8             568     80.55
    10          9,632     89.38
    12        222,720     99.38
    14      6,471,872    105.82
    16    260,303,408    113.15

The load unbalance is clearly unacceptable and grows with N; this follows from the nature of the problem (breadth-first search).

2- Solution with N loosely coupled processors:

Let P_1 ... P_N be processors and T_{i,j} the set of boards where i is the column in which the queen was placed on the first row, and j the row up to which the board has queens. For instance, T_{2,4} is the set of all the boards with queens correctly placed on the first four rows, with the queen of the first row placed on the second column. Processor P_i is in charge of finding the set of boards (T_{i,N}) with all the possible solutions that have the first-row queen on column i. In this case there is no communication between the processors during the search for solutions; each processor works independently.

[Figure: processors P_1, P_2, ..., P_N work independently. P_i generates T_{i,1}, T_{i,2}, ..., T_{i,N}; the sets T_{1,N}, T_{2,N}, ..., T_{N,N} together form the set of solutions.]

An equivalence with the previous notation is given by:

$$T_1 = \bigcup_{j=1}^{N} T_{j,1}$$
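The load of each independent processor in this second scheme can be sketched as a depth-first count over the subtree rooted at its first-row queen. The accounting is an assumption: one square is charged for the fixed first-row placement and N squares for every board extended by one row, chosen so that the total over all processors equals the total of the pipe solution. The function name is hypothetical.

```python
def subtree_loads(n):
    """Load of each of n independent processors; processor i fixes column i on row 1."""

    def squares_below(board):
        # squares examined while extending `board` (a tuple of columns) to all n rows
        row = len(board)
        if row == n:
            return 0                   # complete solution: nothing left to test
        total = n                      # assumption: all n squares of this row are examined
        for col in range(n):
            if all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(board)):
                total += squares_below(board + (col,))
        return total

    # assumption: the fixed first-row queen costs a single square
    return [1 + squares_below((col,)) for col in range(n)]
```

For a 4x4 board this yields `[17, 13, 13, 17]`. The list is symmetric because mirroring the board maps column i to column N+1-i, which matches the symmetry of the loads reported for this solution.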

The following graphic shows the quantity of work done by each processor for a board of size 16 (N=16), followed by a table with the work load of each processor.

[Figure: work load (in millions of analyzed squares) versus processor number, N=16.]

    Processor no.   Work load (l_i)
    1                 887,143,505
    2                 992,762,433
    3               1,089,991,889
    4               1,133,623,105
    5               1,182,782,369
    6               1,216,458,385
    7               1,249,758,561
    8               1,258,822,081
    9               1,258,822,081
    10              1,249,758,561
    11              1,216,458,385
    12              1,182,782,369
    13              1,133,623,105
    14              1,089,991,889
    15                992,762,433
    16                887,143,505

If the two unbalance metrics are computed again:

    N     M_1     M_2
    8     1.15    4.58
    10    1.16    4.03
    12    1.22    5.41
    14    1.31    7.01
    16    1.41    9.08

a remarkable decrease in the load unbalance can be seen. For instance, with 16 processors the average unbalance (M_2) is only about 9%, although by metric M_1 the most loaded processor does 41% more work than the least loaded one.

3- Parallel pipe variant:

In the first solution (processor pipe), the work done by each processor depends on the row on which it places the queens. The processors that place queens on the central rows of the board carry a greater work load, since they have more combinations to test. The algorithm presented below attempts to balance the work done by each of the processors used to solve the problem.

Let P_1 ... P_N be processors and T_{i,j} the set of boards defined as in the second solution. Each processor P_i begins a solution by placing a queen on column i of row 1, which gives T_{i,1}. In each step of the algorithm, processor P_i receives from processor P_{i-1} a set of boards T_{k,j} and places a queen on each valid position of the next row, thus obtaining the set of boards T_{k,j+1}, which is in turn passed to processor P_{i+1}.

For instance, processor P_1 starts its work by placing a queen on row 1, column 1, and sends its board (T_{1,1}) to processor P_2. It then places a queen on each valid position of row 2 for the solution started by P_N, obtaining T_{N,2}. Next, it receives from processor P_N the boards with queens placed on the first two rows (for the solution started by P_{N-1}), and places queens on the valid positions of row 3 in each of them (obtaining T_{N-1,3}), and so on. Thus, in each stage j every processor carries out work equivalent to that of processor j of the pipe solution, so each processor plays all the roles of the processors of the first solution. This improves the load balance by evening out the work of each processor: all the processors place queens on all the rows. Note that an important change in the load balance is achieved without changing the physical and logical architecture of the processor pipe solution, and without essentially changing the algorithm.

[Figure: boards generated by each processor.
    P_1: T_{1,1}, T_{N,2}, ..., T_{2,N}
    P_2: T_{2,1}, T_{1,2}, ..., T_{3,N}
    P_N: T_{N,1}, T_{N-1,2}, ..., T_{1,N}
The final sets T_{k,N} together form the set of solutions.]

At the end, each processor P_i holds the set of boards T_{i+1,N} (P_N holds T_{1,N}), which is the solution started by processor P_{i+1} when it placed the queen of the first row on column i+1.

The following graphic shows the quantity of work done by each processor for a board of size 16 (N=16), followed by a table with the work load of each processor.

[Figure: work load (in millions of analyzed squares) versus processor number, N=16.]

    Processor no.   Work load (l_i)
    1               1,061,848,177
    2               1,134,981,281
    3               1,194,115,313
    4               1,235,277,169
    5               1,264,375,793
    6               1,274,235,233
    7               1,273,619,297
    8               1,251,118,593
    9               1,216,519,633
    10              1,150,072,401
    11              1,076,401,825
    12                999,309,057
    13                949,764,977
    14                943,807,137
    15                974,697,313
    16              1,022,541,457
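The rotation of roles in this variant can be sketched by first counting, for every first-row column k and every depth d, the boards in T_{k,d}, and then charging each processor for the boards it receives at each stage. As a simplifying assumption, stage 1 costs one square and every received board costs N squares; processors are numbered from 0, so at stage j processor i extends the boards started by processor (i - (j - 1)) mod N. The function name is hypothetical.

```python
def ring_pipeline_loads(n):
    """Load of each processor in the parallel pipe variant (processors numbered from 0)."""
    # counts[k][d]: number of valid boards with d rows filled and first-row queen on column k
    counts = [[0] * (n + 1) for _ in range(n)]

    def explore(board, first_col):
        row = len(board)
        counts[first_col][row] += 1
        if row >= n - 1:
            return                     # boards with n-1 rows feed the last stage
        for col in range(n):
            if all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(board)):
                explore(board + (col,), first_col)

    for k in range(n):
        explore((k,), k)

    loads = []
    for i in range(n):
        load = 1                       # stage 1: the processor's own first-row queen
        for stage in range(2, n + 1):
            origin = (i - (stage - 1)) % n   # processor that started these boards
            load += n * counts[origin][stage - 1]
        loads.append(load)
    return loads
```

For a 4x4 board the result is `[13, 17, 17, 13]`. The total over all processors equals the total of the other two schemes, since only the assignment of boards to processors changes.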

    N     M_1     M_2
    8     1.23    8.14
    10    1.23    6.61
    12    1.24    6.94
    14    1.28    8.21
    16    1.35    9.50

Compared to the previous parallel solution, metric M_1 tends to be better in this solution for increasing N, while metric M_2 shows similar results (by definition, it averages the sum of unbalances).

Conclusions and future work lines

The study of load unbalance has been initiated for one type of parallel system, focusing on the adjustment of the algorithm to the supporting architecture. An important result is the development of a new parallel algorithm over a processor pipe for the N-queens problem, which tends to balance the load for increasing N.

Several research lines remain open in this respect:
- To analyze the measurements in depth for increasing N.
- To study the incidence of communications in each case.
- To experiment on distributed shared memory architectures.
- To experiment on multiprocessor structures with M processors (M < N).
- To analyze the effect (and the compensation potential) of processor heterogeneity.

Bibliography

[1] Andrews G., Concurrent Programming: Principles and Practice, The Benjamin/Cummings Publishing, Inc. Andrews G., Foundations on Multithread and Distributed Programming, Addison Wesley, 1999.
[2] Coffin M., Parallel programming - A new approach, Prentice Hall, Englewood Cliffs, 1992.
[3] Hwang K., Xu Z., Scalable Parallel Computing, McGraw-Hill, 1998.
[4] Keller J., Kebler C., Traff J., Practical PRAM Programming, A Wiley Interscience publication, 2001.
[5] Quinn M., Parallel Computing: Theory and Practice, McGraw Hill, 1994.
[6] Watts J., Taylor S., A Practical Approach to Dynamic Load Balancing, IEEE Transactions on Parallel and Distributed Systems, 9(3), March 1998, pp. 235-248.
[7] Kumar V., Grama A., Gupta A., Karypis G., Introduction to Parallel Computing: Design and Analysis of Algorithms, The Benjamin/Cummings Pub. Company, Inc., 1994.
[8] Bruen A., Dixon R., The n-queens Problem, Discrete Mathematics 12:393-395, 1997.

[9] Hedetniemi S., Hedetniemi T., Reynolds R., Combinatorial problems on chessboards: II, Chapter 6 in Domination in graphs: advanced topic, pp. 133-162, 1998.
[10] Bernhardsson B., Explicit Solution to the n-queens Problems for all n, ACM SIGART Bulletin, 2:7, 1991.
[11] http://www.wi.leidenuniv.nl/~kosters/nqueens.html
[12] http://www.rain.org/~mkummel/stumpers/8queens.txt
[13] http://www.jsomers.com/nqueen_demo/nqueens.html
[14] http://www.funducode.com/freec/recursion/recursion3.htm