Foundation of Parallel Computing - Term Project Report


Shobhit Dutia (snd7555@rit.edu), Shreyas Jayanna (sj7316@rit.edu), Anirudh S N (asn5467@rit.edu)

1. Overview:
A graph is a set of connections between nodes, or items, and represents some sort of relation between them. A graph can represent anything from distances between cities to a group of people in a given space. In practice, graphs are used for finding shortest routes between cities, for networks, manufacturing scheduling and electrical circuits. In this project, we have worked on a graph coloring problem.

2. Description of computational problem:
We will be looking at a particular graph problem involving coloring its vertices. Graph coloring is a form of graph labelling. It is an NP-hard problem and involves assigning a minimum number of colors to the vertices in such a way that no two adjacent vertices share the same color. A related problem is the graph k-coloring problem, which asks whether the vertices can be colored using at most k colors. Graph coloring is used in processor scheduling, student class scheduling and radio frequency assignment.

3. Research paper 1:
The paper by Boman et al. [1] presents a parallel graph coloring algorithm for distributed-memory computers. The algorithm is organized as a series of supersteps; the superstep size is the number of vertices each processor colors before sending and receiving color information from the other processors. The algorithm has two phases: the first is called the tentative coloring phase and the second the conflict detection phase. The tentative coloring phase requires communication between processors to send and receive the colors of boundary vertices, and is therefore organized into supersteps, which reduce the frequency of communication. The conflict detection phase, on the other hand, resolves conflicts by randomly selecting a processor for color re-assignment, and hence needs no communication in this phase; it is not organized into supersteps, as it can be done independently. In the tentative coloring phase the processors color their vertices concurrently, and the color information is exchanged with the other processors after the given number of vertices has been colored in the superstep. The messages are sent so as to reduce the number of conflicts in subsequent supersteps. The second phase, conflict detection, detects such conflicts and resolves them; the processor that recolors a given vertex is chosen at random. The paper further discusses variations of this algorithm:
a. Initial partitioning: If the vertices are partitioned across processors in such a way that the number of cross edges between processors is high, performance will be poor, as there may be a large number of conflicts.
b. Synchronous vs. asynchronous supersteps: In synchronous mode there is a barrier at the end of each superstep. This is advantageous because, in the conflict detection phase, the color of a boundary vertex needs to be checked only against the neighbors colored in the same superstep. It does, however, introduce a delay, as there is a barrier at the end of every superstep. In asynchronous mode there are no barriers at the end of supersteps, and vertices are colored using whatever information happens to be available at that instant; in the conflict detection phase, a boundary vertex must then be checked against all of its off-processor neighbors. The number of conflicts may therefore be higher in asynchronous mode. A simplified simulation of one superstep appears below.
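The two-phase structure can be illustrated with a small single-process simulation (our own sketch, not the authors' implementation; the graph, the ownership map and the deterministic tie-breaking are made up for illustration):

    import java.util.*;

    public class SuperstepColoringDemo {
        // Tiny 6-vertex graph; vertices 0-2 are owned by "processor" 0 and
        // vertices 3-5 by "processor" 1 (made-up data).
        static int[][] adj = { {1,3}, {0,2,4}, {1,5}, {0,4}, {1,3,5}, {2,4} };
        static int[] owner = { 0, 0, 0, 1, 1, 1 };

        public static void main(String[] args) {
            int[] color = new int[adj.length];
            Arrays.fill(color, -1);

            // Phase 1 (tentative coloring): each processor colors its own
            // vertices greedily, but for off-processor neighbors it sees only
            // the snapshot from the start of the superstep (nothing yet).
            int[] snapshot = color.clone();
            for (int p = 0; p <= 1; p++)
                for (int v = 0; v < adj.length; v++)
                    if (owner[v] == p)
                        color[v] = smallestFree(v, color, snapshot);

            // Phase 2 (conflict detection): boundary vertices only; here the
            // higher-numbered endpoint is marked for recoloring (the paper
            // breaks such ties randomly instead).
            List<Integer> recolor = new ArrayList<>();
            for (int v = 0; v < adj.length; v++)
                for (int u : adj[v])
                    if (owner[u] != owner[v] && color[u] == color[v] && v > u) {
                        recolor.add(v);
                        break;
                    }

            System.out.println("tentative colors: " + Arrays.toString(color));
            System.out.println("to recolor next superstep: " + recolor);
        }

        // Smallest color not used by any visible colored neighbor of v.
        static int smallestFree(int v, int[] current, int[] snapshot) {
            BitSet used = new BitSet();
            for (int u : adj[v]) {
                int c = (owner[u] == owner[v]) ? current[u] : snapshot[u];
                if (c >= 0) used.set(c);
            }
            return used.nextClearBit(0);
        }
    }

Run as is, all three cross edges come out conflicted, so vertices 3, 4 and 5 would be recolored in the next round; interior vertices can never conflict.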
4. Research paper 2:
The second research paper, by Gebremedhin et al. [2], focuses on speeding up parallel graph coloring algorithms. The authors review prior work on the problem and then present experimental results based on the ordering and

partitioning of graphs. The paper discusses the Gebremedhin-Manne parallel algorithm, which assumes that the number of processors p is less than the number of vertices n in the graph; each processor is assigned n/p vertices. The authors chose this algorithm because they believed it is the only one that can be expected to run even faster as more processors are added. They also discuss the issues faced in ordering the vertices of a graph and in partitioning a graph. According to the paper, a highly clustered partitioning with a minimal number of boundary vertices is preferable, as it increases cache hits and reduces random accesses and inter-processor communication. In the experiments conducted, two graphs are chosen and the vertices are visited in three ways: their natural order, a random order, and reverse Cuthill-McKee (RCM) order. They conclude that random ordering increases the running time by a factor of three, while RCM ordering reduces the running time by almost 31%. Based on these findings, the graphs were ordered using RCM and their vertex partitionings were compared using the Metis partitioning method. The paper concludes that the speed of a graph coloring computation is determined by just two things, the partitioning method and the ordering of the graph, and that choosing them well determines how fast the problem is solved.

5. Research paper 3:
Sarıyüce, Saule and Çatalyürek [3], in the paper "Improving graph coloring on distributed-memory parallel computers", propose several methods to improve the number of colors used to color a graph. The first two methods target the coloring phase and the other three target the recoloring phase. The two methods proposed for the coloring phase concentrate on vertex-visit orderings: Largest First (LF) and Smallest Last (SL). In the LF ordering scheme, the vertex with the largest degree is selected and removed from the graph before the next round of ordering; this is repeated until all the vertices are ordered (a sketch of LF ordering appears at the end of this section). The SL ordering is the exact opposite of this scheme. Once the vertices are ordered by either scheme, the coloring of the vertices follows that same order. This approach might result in a different coloring when the graph is colored on multiple processors than on a single processor: each processor follows the same ordering scheme but uses only local knowledge, i.e., each processor orders the vertices in its set based on the local knowledge it possesses about those vertices. The methods investigated for the recoloring phase are permutations of the color classes: Reverse Order (RO) of colors, Non-Increasing (NI) number of vertices, and Non-Decreasing (ND) number of vertices. In the recoloring phase, vertices that belong to the same color class are recolored consecutively. In the NI scheme the color classes are ordered by non-increasing number of vertices; the ND scheme is the exact opposite. The authors tested combinations of these approaches on different types of graphs. They found that coloring large, conflict-inducing graphs in a synchronous manner yielded better results, and that vertex-visit orderings that take the properties of the graph and of the partition into account together gave a number of colors closer to the optimum.
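A minimal sketch of the LF idea, using static degrees for brevity (the scheme described above recomputes degrees as vertices are removed); the graph and names are illustrative:

    import java.util.*;

    public class LargestFirstDemo {
        public static void main(String[] args) {
            // Made-up graph: adjacency lists for vertices 0..4.
            int[][] adj = { {1,2,3}, {0,2}, {0,1,3}, {0,2}, {} };

            // LF: visit vertices in non-increasing order of degree.
            Integer[] order = new Integer[adj.length];
            for (int v = 0; v < adj.length; v++) order[v] = v;
            Arrays.sort(order, (a, b) -> adj[b].length - adj[a].length);

            // Greedy coloring in LF order: smallest color no neighbor uses.
            int[] color = new int[adj.length];
            Arrays.fill(color, -1);
            for (int v : order) {
                BitSet used = new BitSet();
                for (int u : adj[v]) if (color[u] >= 0) used.set(color[u]);
                color[v] = used.nextClearBit(0);
            }
            System.out.println("LF order: " + Arrays.toString(order));
            System.out.println("colors:   " + Arrays.toString(color));
        }
    }

SL would instead repeatedly remove a vertex of currently smallest degree and color in reverse removal order.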
6. Sequential program-design:
The sequential program implements the greedy algorithm presented as pseudo-code below. It first reads the input and creates an adjacency list of vertices. For each vertex we collect the colors of its adjacent vertices; for a candidate color from the color set, we check whether the adjacent vertices already use it. If not, the current vertex is assigned the candidate color; otherwise the candidate color is incremented by one and the check is repeated. The data structure representing the graph is a TreeMap, since it provides natural ordering. For the sequential version a hash-based set would also work, but we kept the TreeMap because it is used in the parallel version, where we need to maintain the ordering: since the vertices are numbered from 1 to n, thread 1 gets the first block of vertices, thread 2 the next block, and so on. The reason for this partitioning is that it is an efficient graph partitioning approach in which the number of boundary vertices between threads is small, which would not be the case if each thread were given random vertices.
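As a concrete illustration of this design, here is a minimal runnable sketch (class and variable names are our own; the actual GraphColSeq.java may differ):

    import java.util.*;

    public class GreedyColoringSketch {
        public static void main(String[] args) {
            // Vertex -> adjacency list; a TreeMap iterates vertices in their
            // natural (ascending) order. Normally built from the input file.
            TreeMap<Integer, List<Integer>> graph = new TreeMap<>();
            graph.put(1, Arrays.asList(2, 3, 4));
            graph.put(2, Arrays.asList(1, 3));
            graph.put(3, Arrays.asList(1, 2));
            graph.put(4, Arrays.asList(1));

            Map<Integer, Integer> color = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : graph.entrySet()) {
                // Colors already taken by adjacent vertices.
                Set<Integer> adjColors = new HashSet<>();
                for (int u : e.getValue())
                    if (color.containsKey(u)) adjColors.add(color.get(u));
                // Try colors 0, 1, 2, ... until one is free.
                int c = 0;
                while (adjColors.contains(c)) c++;
                color.put(e.getKey(), c);
            }
            System.out.println(color);  // {1=0, 2=1, 3=2, 4=1}
        }
    }

The TreeMap iteration visits vertices in ascending key order, which is exactly the 1..n ordering that the block partitioning in the parallel version relies on.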

The complexity of the algorithm is O(V*E), where V is the number of vertices in the graph and E is the number of edges. The disadvantage of the sequential approach is the obvious time taken in solving the problem.

Algorithm:
    Read input
    Let colorset = set of possible colors
    For each vertex:
        Get the adjacency list of the vertex
        For all adjacent vertices:
            Get the colors of the adjacent vertices into adjcolorlist
        For possiblecolor in colorset:
            Check whether adjcolorlist contains possiblecolor
            If not, set the current vertex's color to possiblecolor and continue to the next vertex
            If yes, increment possiblecolor and repeat

7. Parallel program-design:
The parallel algorithm we implemented is a variation of the one in research paper 1 [1]. We modified the algorithm so that the entire graph in memory can be accessed by all the threads. This has the big advantage of eliminating the latency of sending and receiving messages between processors, since the required information is accessed directly. The approach is of course limited to a single node; using a cluster would require sharing messages through tuple space, which would reintroduce that latency. In our approach, only the thread to which a vertex is currently assigned can modify the vertex's color; another thread may read that information at the same time, but will never write it. Because we introduce a barrier after the tentative coloring phase, all threads update their respective information before proceeding to the conflict detection phase. Even if a thread reads information while another thread is updating it, the information is read again in the conflict detection phase, so this approach causes no problem. Moreover, we also modified the algorithm of research paper 1 [1] with respect to load balancing: if after the conflict detection phase there are, say, 5 conflicts in thread 1 and only 1 conflict in thread 2, these are reduced to a new vertex set of size 6, and the tentative coloring phase starts again with a balanced load of 3 and 3 vertices instead of the unbalanced 5 and 1. In our opinion this, too, is a big advantage. A simplified multithreaded sketch follows the algorithm below.

Algorithm: The terms used in the algorithm are:
a. Superstep: the number of vertices a processor colors before exchanging color information with the other processors.

Step 1: Let colorset = set of possible colors
Step 2: Read the input vertex set
Step 3: While the vertex set is not empty:
  Step 3a: Distribute the vertex set across all the processors (parallel for loop)
  Step 3b: For each processor:
      Partition its vertex set into l subsets of size s (the superstep size)
      For i = 1 to l:
          For each vertex:
              Assign the vertex a permissible color (as in the sequential algorithm)
          Send and receive the colors of boundary vertices
  Step 3c: For each processor:
      Partition its vertex set into l subsets of size s
      For i = 1 to l:
          For each vertex:
              If the vertex is a boundary vertex and shares a color with an adjacent vertex,
                  add it to a set S_thread of vertices to be recolored
  Step 3d: Reduce all the S_thread sets into a single set S
  Step 3e: Let vertex set = S
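The following minimal sketch captures this scheme with plain Java threads and a CyclicBarrier instead of the PJ2 constructs the actual program uses; the graph, the thread count and all names are illustrative:

    import java.util.*;
    import java.util.concurrent.*;

    public class ParallelColoringSketch {
        static int[][] adj = { {1,3}, {0,2,4}, {1,5}, {0,4}, {1,3,5}, {2,4} };
        static int[] color = new int[adj.length];

        public static void main(String[] args) throws Exception {
            Arrays.fill(color, -1);
            final int nThreads = 2;
            List<Integer> vertexSet = new ArrayList<>();
            for (int v = 0; v < adj.length; v++) vertexSet.add(v);

            while (!vertexSet.isEmpty()) {
                List<List<Integer>> part = partition(vertexSet, nThreads);
                CyclicBarrier barrier = new CyclicBarrier(nThreads);
                Set<Integer> conflicts = ConcurrentHashMap.newKeySet();
                Thread[] threads = new Thread[nThreads];
                for (int t = 0; t < nThreads; t++) {
                    final int rank = t;
                    threads[t] = new Thread(() -> {
                        // Tentative coloring; off-thread colors may be stale.
                        for (int v : part.get(rank)) color[v] = smallestFree(v);
                        try { barrier.await(); }
                        catch (Exception e) { throw new RuntimeException(e); }
                        // Conflict detection: higher-numbered endpoint yields
                        // (the paper breaks such ties randomly).
                        for (int v : part.get(rank))
                            for (int u : adj[v])
                                if (color[u] == color[v] && v > u) {
                                    conflicts.add(v);
                                    break;
                                }
                    });
                    threads[t].start();
                }
                for (Thread th : threads) th.join();
                vertexSet = new ArrayList<>(conflicts); // pooled across threads,
                for (int v : vertexSet) color[v] = -1;  // rebalanced next round
            }
            System.out.println(Arrays.toString(color));
        }

        // Smallest color not used by any currently colored neighbor of v.
        static int smallestFree(int v) {
            BitSet used = new BitSet();
            for (int u : adj[v]) { int c = color[u]; if (c >= 0) used.set(c); }
            return used.nextClearBit(0);
        }

        // Balanced block partition of the current vertex set.
        static List<List<Integer>> partition(List<Integer> vs, int n) {
            List<List<Integer>> parts = new ArrayList<>();
            int chunk = (vs.size() + n - 1) / n;
            for (int i = 0; i < n; i++)
                parts.add(vs.subList(Math.min(i * chunk, vs.size()),
                                     Math.min((i + 1) * chunk, vs.size())));
            return parts;
        }
    }

During tentative coloring a thread may read a stale color of an off-thread neighbor; exactly as in the algorithm above, any resulting conflict is caught after the barrier, and the conflicted vertices are pooled and re-partitioned, which is the load balancing described above.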

8. Developer manual:
GraphColSmp.java contains the parallel version of the graph coloring program; GraphColSeq.java is the sequential version. Compile either version together with Graph.java, Node.java and VertexSetVbl.java using javac *.java. To run the program, use java pj2 cores=<n> GraphColSeq <filename>, where the number of cores is optional (default: 1 core).

9. Creating input files:
The input to the code is a 22-dimensional hypercube graph. We re-used the code from the HypercubeGraph.java file. Input file 1 was generated as the output of HypercubeGraph.java, a single 22-dimensional hypercube. To create input file 2, we generated the same output with differently numbered vertices and concatenated it with input 1, resulting in two 22-dimensional hypercubes; we then manually connected some of the vertices of input 1 with input 2. Input files 3, 4 and 5 were created in the same manner, each adding one more 22-dimensional hypercube (4,194,304 vertices) to the previous input. A sketch of hypercube edge generation follows.
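The hypercube structure makes edge generation straightforward: vertex v is adjacent to v XOR 2^d for each dimension d. A small sketch in that spirit (our own reconstruction; the actual HypercubeGraph.java may differ):

    import java.io.*;

    public class HypercubeEdges {
        public static void main(String[] args) throws IOException {
            int dims = Integer.parseInt(args[0]);                  // e.g. 22
            int offset = args.length > 1 ? Integer.parseInt(args[1]) : 0;
            // Vertex v is adjacent to v ^ (1 << d) in each dimension d;
            // emitting each edge once (v < u) yields dims * 2^(dims-1) edges,
            // which for dims = 22 is 46,137,344, matching input file 1.
            try (PrintWriter out = new PrintWriter(new BufferedWriter(
                    new FileWriter("hypercube-" + dims + "-" + offset + ".txt")))) {
                for (int v = 0; v < (1 << dims); v++)
                    for (int d = 0; d < dims; d++) {
                        int u = v ^ (1 << d);
                        if (v < u) out.println((v + offset) + " " + (u + offset));
                    }
            }
        }
    }

Running it a second time with a vertex offset of 4,194,304 and concatenating the two files gives the second cube, to which the manual cross edges are then added.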

10. Strong scaling performance:
We tested strong scaling across 8 cores with five different problem sizes; we also increased the problem sizes since presentation 3 and obtained much better results. The five inputs were:

    #Vertices       #Edges
    4,194,304       46,137,344
    8,388,608       92,274,699
    12,582,912
    16,777,216
    20,971,520      257,529,856

[The two missing edge counts and the original table's running time T (msec), speedup and efficiency columns, for the sequential run and K = 1 to 8 cores, did not survive extraction.]

[Figure: running time (msec) vs. number of cores K, one curve per input size V = 4,194,304; 8,388,608; 12,582,912; 16,777,216; 20,971,520.]

[Figures: speedup vs. number of cores K and efficiency vs. K, one curve per input size V = 4,194,304; 8,388,608; 12,582,912; 16,777,216; 20,971,520.]

11. Comment on strong scaling performance:
As the charts above show, the speedup is influenced by the input size. For a sufficiently large input we get very good efficiency; for a small input graph, however, the maximum speedup on 8 cores is 6. For large input graphs the speedup is extremely good. Since all our inputs are hypercubes, all vertices have roughly the same number of edges (roughly, because some vertices were connected manually across the different input components). Suppose two processors are coloring a given input file: after the processors color their respective subsets, even if in the worst case all the boundary vertices are colored the same, the next phase of the algorithm involves only half of the vertices.

12. Weak scaling performance:
[Table: weak scaling with base problem size n(1) = 4,194,304 vertices and n scaled in proportion to the number of cores K: a sequential run at n = 4,194,304, then K = 1: n = 4,194,304; K = 2: 8,388,608; K = 3: 12,582,912; K = 4: 16,777,216; K = 5: 20,971,520; K = 6: 25,165,824; K = 7: 29,360,128; K = 8: 33,554,432. The running time T (msec), sizeup and sizeup efficiency columns did not survive extraction.]
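For reference, the scaling metrics in the tables above follow the usual definitions (our assumption; the exact conventions are those of the course's PJ2 framework):

    \mathrm{Speedup}(K) = \frac{T_{\mathrm{seq}}}{T(K)}, \qquad
    \mathrm{Eff}(K) = \frac{\mathrm{Speedup}(K)}{K}

    \mathrm{Sizeup}(K) = \frac{N(K)}{N(1)} \cdot \frac{T(1)}{T(K)}, \qquad
    \mathrm{SizeupEff}(K) = \frac{\mathrm{Sizeup}(K)}{K}

Under ideal weak scaling, N(K) = K * N(1) while T(K) = T(1), so Sizeup(K) = K and the sizeup efficiency is 1; the slight drop below 1 is what section 13 explains.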

[Figures: running time (msec), sizeup, and efficiency vs. number of cores K for the weak scaling runs with n(1) = 4,194,304.]

13. Why non-ideal weak scaling is occurring:
As we can see from the weak scaling graphs, the efficiency decreases slightly as we increase the input size and the number of cores. This is because, as the number of cores and the size of the graph grow, the number of boundary vertices also grows. This in turn increases the likelihood of conflicts and thus the number of vertices that need to be re-colored.

14. Future work:
We propose using the techniques presented in research paper 2 [2] to achieve a better speedup of the above algorithm. The main techniques are:
a. Sequential optimization: the second research paper proposes ordering the vertices in such a way that the neighbors of a vertex span as few cache lines as possible.
b. Assign vertices to processors with fewer cross edges: as is evident from the algorithm, the number of boundary vertices directly influences the likelihood of conflicts. Thus, if the graph is partitioned so that the number of cross edges is minimized, we can get a better speedup. Metis [5] is a library that can be used to create efficient partitionings of graphs.
c. Reduced conflict checking: in the current algorithm, the interior vertices are also iterated over in the conflict detection phase. There is no need to do so; only the boundary vertices need to be checked, since interior vertices can never conflict with vertices on another thread.

15. What we learned:
a. Speed of the algorithm: the speed of the algorithm is directly influenced by the way the graph is partitioned across the processors. If we partition the graph so that the number of cross edges between processors is low, better scaling is obtained.
b. We also learned that mid-sized graphs work best, as opposed to very sparse or very dense graphs.

16. Distribution of work:
Choosing the project topic, one of the most time-consuming phases, was done jointly by all team members. The team presentations likewise took equal effort, with each member contributing a different section: describing the topic area, presenting the sequential code, and presenting the parallel code. As there were three research papers, each team member analyzed one of them in depth. The majority of the code (sequential as well as parallel) was written by Shobhit Dutia and Shreyas Jayanna.

17. References:
[1] Erik G. Boman, Doruk Bozdağ, Umit V. Catalyurek, Assefaw H. Gebremedhin, and Fredrik Manne. A Scalable Parallel Graph Coloring Algorithm for Distributed Memory Computers. 11th International Euro-Par Conference, Lisbon, Portugal, Proceedings, LNCS 3648, 2005.
[2] Assefaw H. Gebremedhin, Fredrik Manne, and Tom Woods. Speeding up Parallel Graph Coloring. 7th International Workshop, PARA 2004, Lyngby, Denmark, June 20-23, 2004, Revised Selected Papers, LNCS 3732, 2006.
[3] Ahmet E. Sarıyüce, Erik Saule, and Ümit V. Çatalyürek. Improving Graph Coloring on Distributed-Memory Parallel Computers. 18th International Conference on High Performance Computing (HiPC), pp. 1-10, 2011.
