Parallel Ant System on Max Clique problem (using Shared Memory architecture)


In the previous Distributed Ants section, we approached the original Ant System algorithm using distributed computing, partitioning the graph proportionally to the number of 'slaves'. As shown, that method produces promising results; however, it depends on the type of graph and the configuration (i.e., on whether the work, the ant activity, is balanced across all graph regions). (Note to Dr. Bui: I assume that there is a section on Mike's distributed code prior to this section.) In this section, we propose another technique that makes use of readily available multi-processor architectures. This technique enhances the performance of the original version (referred to as the 'sequential version' hereafter) by taking advantage of the shared-memory multiprocessor architecture. We first introduce this parallel model, discuss the parallel techniques applied, analyze the differences between the two versions (parallel vs. sequential), and finally present the results.

Parallel computing with OpenMP

The symmetric multiprocessor (SMP) architecture we use belongs to the shared-memory (tightly coupled) class (Ref 1). This class allows a collection of processors to access the same shared memory. The main benefit of SMP (vs. distributed computing on a cluster) is performance: communication through shared memory is much faster than communication through the distributed memory of a cluster. [1]

[1] Reliability is another benefit, since it is common for nodes in a cluster to stop functioning but very rare for a processor in an SMP machine to fail. However, some argue that this actually favors the cluster: even if nodes go down, the cluster still functions, whereas if a processor in an SMP machine stops working, the whole machine must be taken down to replace the processor.

The standard view of parallelism in shared memory uses the fork/join model (see Figure B). The program starts with a single thread (the master); the master creates slave threads when parallelism is required, and they work together through the parallel region. When the parallel code is done, all threads except the master die (or are suspended), and serial execution continues with the master from that point.

OpenMP (MP stands for Multi Processing) has recently emerged as a standard shared-memory programming model backed by a host of large computer vendors and organizations (e.g., Compaq, HP, Intel, IBM, SGI, Sun, and the U.S. Department of Energy). The OpenMP API consists of a set of compiler directives and a library of support functions. The currently supported programming languages are C, C++, and Fortran. [2] (Ref 3) Nevertheless, this section is not about the technical details of OpenMP but rather about analyzing and designing appropriate parallelism for the existing sequential Ant System algorithm.

[2] All implementations in this section use the Intel C++ Compiler for Linux with OpenMP support.
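For concreteness, the following is a minimal, stand-alone sketch of the fork/join model as expressed in OpenMP (an editorial illustration, not part of the Ant System code): a compiler directive forks a team of threads for one parallel region, library routines identify the threads, and execution joins back to the master afterwards.

    #include <cstdio>
    #include <omp.h>                        // OpenMP library routines

    int main()
    {
        std::printf("master thread runs alone\n");

        #pragma omp parallel                // fork: the master creates a team of threads
        {
            // every thread in the team executes this block concurrently
            std::printf("hello from thread %d of %d\n",
                        omp_get_thread_num(), omp_get_num_threads());
        }                                   // join: slave threads die or are suspended

        std::printf("master thread continues serially\n");
        return 0;
    }

Compiled with OpenMP support enabled (e.g., g++ -fopenmp), the block between the braces runs once per thread, while everything outside it runs only on the master.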

Converting Existing Ant's Sequential Code to Parallel - March Ants in Parallel

    for each cycle {
        for (i = 0 : numants) {                     // for each ant
            ant(i).choosenextmove() {
                ant(i).nextvertex = getnextmoveheuristically();
                for (j = 0 : ant(i).adjedges.size()) {
                    if (edges[ant(i).adjedges[j]].lastupdate < currenttime) {
                        edges[ant(i).adjedges[j]].evaporatepheromone();
                        edges[ant(i).adjedges[j]].lastupdate = currenttime;
                    }
                }
            }
        }
        for (i = 0 : numants) {
            ant(i).moves() {
                deposit a small amount of pheromone on the edge from
                    ant(i).currentvertex to ant(i).nextvertex
                ant(i).currentvertex = ant(i).nextvertex;
                ant(i).nextvertex = null;
            }
        }
    }

    Figure C: sequential version

Figure C shows the ants' activities section of the sequential version, which is the part we parallelize. There are two reasons to apply parallelism here: 1) this region takes the most computing time (more than 50% [3]), and 2) its data dependencies can be dealt with safely, as shown below. [4]

[3] We used a profiler to study the computation, and it shows that this ant-activities loop takes more than 50% of the running time. Furthermore, this percentage increases as the graph becomes more complex (see the explanation of the O(V^2) complexity).

[4] As with any parallel technique, it is important to take care of data dependencies. It is usually data dependencies that prevent a sequential algorithm from being parallelized.

In each cycle, all ants perform two main operations:

1. Each ant decides where to move next via a heuristic calculation (i.e., favoring vertices that hold more ants and whose connecting edges carry more pheromone). This step also evaporates a small amount of pheromone on the edges adjacent to the ant's current vertex.

2. Once all the ants have made their decisions, they move (i.e., update their location configuration, deposit pheromone, etc.).

The first operation is the most time-consuming, since both the heuristic calculation and the pheromone evaporation can be expensive. An ant's decision is based on the pheromone of its adjacent edges, which is bounded by O(V) (the number of vertices) if the ant happens to sit on a vertex connected to all other vertices. The same bound applies to the evaporation operation, since all edges adjacent to the ant's current location must be scanned (to determine whether they should be evaporated). The number of ants is set to V * multiplier (multiplier >= 1), so this section has complexity O(V^2): each ant takes O(V) time and there are at least V ants. It is tempting to parallelize this ant-activities portion immediately by having each processor handle the work of numants/numprocs [5] ants.

[5] OpenMP automatically takes care of the case where numants/numprocs is not a whole number by giving some threads one extra iteration. For example, with numants = 8 and numprocs = 3, two processors handle 3 ants each and the third handles 2.
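As a concrete illustration of that straightforward division (a hedged sketch with a hypothetical choose_next_move() stand-in, not the project code), an OpenMP work-sharing loop over the ants would look like the following; with schedule(static), each thread receives one contiguous block of roughly numants/numthreads iterations, exactly the division described in footnote [5].

    #include <vector>

    int choose_next_move(int ant);   // hypothetical stand-in for ant(i).choosenextmove()

    // Hedged sketch: the naive way to split the ant loop across processors.
    void march_ants_naive(int numants, std::vector<int>& nextvertex)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < numants; i++) {
            // each thread handles a contiguous block of roughly numants/numthreads ants
            nextvertex[i] = choose_next_move(i);
        }
    }

The problem, as discussed next, is that choosing moves this way lets ants running on different processors see pheromone values that other ants are evaporating in the same cycle.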

However, the (heuristic) decision of each ant depends on the pheromone amount on each edge. That is, ant(i)'s decision at cycle c may be altered by ant(i-1)'s decision at cycle c (because of the evaporation action). We therefore redesigned the algorithm so that all ants make their decisions based on the current information and independently of one another: ant(i) at cycle c decides based on the configuration from cycle c-1, not on the other ants' decisions at cycle c. This redesign is quite a change from the sequential algorithm, but it has no effect on result quality, as shown in the Difference Analysis and Results sections below.

There is another dependency in the sequential code: each edge adjacent to an ant's current location may only be updated once per cycle (i.e., per currenttime). If this were done in parallel, ants currently at the same location might update the same information (evaporate pheromone) concurrently, causing a dirty read/write. Fortunately this is easy to deal with: the decision of which edges to update is made in parallel, but the actual updating is done separately, after the ant loop. [6] We created a shared array bool edgetoupdate[numedges] that is reset to all false in each cycle. When an edge e adjacent to ant(i)'s current location is to be updated (i.e., if its lastupdate < currenttime), edgetoupdate[e] is set to true. Multiple ants (on different processors) can simultaneously write true to the same array index, but that does not affect any decision, since an edge only needs to be updated once. After all the decisions have been made (i.e., the for-ant loop is done), edgetoupdate[] is scanned, and each edge whose index is marked true is updated and then reset to false (ready for the next cycle). To enhance performance further, we also parallelize this scanning loop when the number of edges is large enough (it can be quite large depending on the graph). The parallel version is presented in Figure D below.

[6] Note that we could use OpenMP's critical-region locking (i.e., lock the data being updated); however, experiments show that this approach costs too much overhead due to the continuous locking and releasing.

    bool edgetoupdate[numedges];  // shared variable, declared and initialized to all false somewhere above

    for each cycle {
        // parallel region, master forks threads and they work concurrently
        for (i = 0 : numants) {                     // for each ant
            ant(i).choosenextmove() {
                ant(i).nextvertex = getnextmoveheuristically();
                for (j = 0 : ant(i).adjedges.size()) {
                    if (edgetoupdate[ant(i).adjedges[j]] == false &&
                        edges[ant(i).adjedges[j]].lastupdate < currenttime) {
                        edgetoupdate[ant(i).adjedges[j]] = true;
                    }
                }
            }
        }
        // end parallel region, master thread continues, other threads suspended

        // parallel region, master forks threads and they work concurrently
        for (e = 0 : numedges) {
            if (edgetoupdate[e] == true) {
                edges[e].evaporatepheromone();
                edges[e].lastupdate = currenttime;
                edgetoupdate[e] = false;            // reset
            }
        }
        // end parallel region, master thread continues, other threads suspended

        for (i = 0 : numants) {
            ant(i).moves() {
                deposit a small amount of pheromone on the edge from
                    ant(i).currentvertex to ant(i).nextvertex
                ant(i).currentvertex = ant(i).nextvertex;
                ant(i).nextvertex = null;
            }
        }
    }

    Figure D: parallel version
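In actual OpenMP terms, the two parallel regions of Figure D might be written roughly as follows (a hedged sketch with simplified Ant and Edge structures and an assumed evaporation rate rho, not the exact project code). A plain char array is used for the shared flags so that concurrent writes of true by different threads are benign; every writer stores the same value, and each marked edge is then evaporated by exactly one iteration of the second loop.

    #include <vector>

    struct Edge { double pheromone; int lastupdate; };
    struct Ant  { int currentvertex; std::vector<int> adjedges; };

    // Hedged sketch of Figure D: mark the edges to evaporate in parallel,
    // then apply the evaporation in a second work-shared loop over the edges.
    void evaporation_step(std::vector<Ant>& ants, std::vector<Edge>& edges,
                          std::vector<char>& edgetoupdate,   // shared flags, all 0 at cycle start
                          int currenttime, double rho)       // rho: assumed evaporation rate
    {
        // parallel region 1: the ant loop only records which edges need updating
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < (int)ants.size(); i++) {
            for (int j = 0; j < (int)ants[i].adjedges.size(); j++) {
                int e = ants[i].adjedges[j];
                if (!edgetoupdate[e] && edges[e].lastupdate < currenttime)
                    edgetoupdate[e] = 1;                     // duplicate writes of 1 are harmless
            }
        }

        // parallel region 2: each marked edge is updated by exactly one iteration,
        // so no locking is required
        #pragma omp parallel for schedule(static)
        for (int e = 0; e < (int)edges.size(); e++) {
            if (edgetoupdate[e]) {
                edges[e].pheromone *= (1.0 - rho);           // stand-in for evaporatepheromone()
                edges[e].lastupdate = currenttime;
                edgetoupdate[e] = 0;                         // reset for the next cycle
            }
        }
    }

Whether the second loop is actually worth running in parallel depends on the number of edges, which is the judgment discussed in the next subsection.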

Other sections in parallel

One of the greatest benefits of shared-memory programming is incremental parallelization, which lets the fork/join operations be applied wherever they are needed. [7] This makes it very easy (and appealing) to apply parallelism everywhere in the sequential code. However, it is necessary to determine [8] whether the speedup gained exceeds the overhead (e.g., the overhead of the fork/join operations). The march-ants-in-parallel section above shows where parallelism is clearly appropriate and beneficial, because that expensive section appears in almost every Ant System based algorithm. Further optimization can be obtained by applying incremental parallelization to other parts, including the local optimization, large for loops (as shown with the for-all-edges loop), or simply any region whose iterations can be done independently.

[7] Technical details of the fork/join model and its benefits compared with other parallel techniques can be found in Ref 3.

[8] Use an analysis such as Amdahl's law to estimate whether the speedup is worthwhile before attempting to apply parallelism. Of course, the most accurate approach is to profile the program's actual runs and determine the exact CPU/time usage.

The following is an instance in our Max Clique Ant System's local optimization where parallelism gives a noticeable performance improvement. We applied parallelism to the set-score code in the findclique function of the local optimization (see Figure X; Dr. Bui, I assumed Rizzo's section is above this and that it contains the findclique function). This region has complexity O(V^2) because it calculates the score of each vertex, i.e., the pheromone amount on its adjacent edges.
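A hedged sketch of what such a parallel scoring loop could look like is given below (the data layout and names are assumptions, not the actual findclique code): each thread sums the pheromone on the edges adjacent to its share of the vertices, and no synchronization is needed because every thread writes a disjoint set of score entries.

    #include <vector>

    // Hedged sketch (assumed data layout): score[v] = total pheromone on edges adjacent to v.
    void set_scores(const std::vector<std::vector<int> >& adjedges,   // adjedges[v]: edge ids at v
                    const std::vector<double>& pheromone,             // pheromone[e]: pheromone on edge e
                    std::vector<double>& score)
    {
        #pragma omp parallel for schedule(static)
        for (int v = 0; v < (int)adjedges.size(); v++) {
            double s = 0.0;
            for (int j = 0; j < (int)adjedges[v].size(); j++)
                s += pheromone[adjedges[v][j]];
            score[v] = s;       // each vertex (and its score entry) belongs to exactly one thread
        }
    }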

Analysis on the difference of the two algorithms

TO BE DONE

Results

Table A shows the expected result quality: the parallel and sequential versions give similar results. (Question to Dr. Bui: do you want to mention that it performs better in many cases too? I don't think it's a good idea to claim this, since it would lead readers to think the parallel design is the main cause of the better quality; I think it is caused by other factors, e.g., code differences.)

Table A: Results. Columns: Name, Vertices, Edges, Opt, RizzoSol, ParSol, SeqSol, AvgParSol, AvgSeqSol, StdDP, StdDS; one row per benchmark graph (the c-fat, johnson, keller, hamming, san, sanr, brock, p_hat, and MANN_a families).

Graphs A and B below show the speedup [9] obtained with different numbers of processors. [10] The speedup starts at about 1.5 on 2 processors and rises steadily to around 3 on 16 processors for small and simple configurations (e.g., graphs with few vertices and edges that are not too densely connected). We even see speedups of less than 1 (i.e., the parallel version runs slower than the sequential one) on some graphs (e.g., c-fat*). This is as expected, since the overhead is the dominant factor in these cases, where the programs run extremely fast (in under one second).

[9] Speedup = time_sequential / time_parallel. For example, if a program runs in half the time in parallel compared to sequential, the speedup is 2, which on two processors is linear speedup, the best possible and probably never achievable due to parallel overhead.

[10] We use a computer with 16 Intel Itanium processors and 8 GB of RAM, running Red Hat Linux; all programs are compiled with the Intel C/C++ compiler.

However, the excitement comes with the large graphs (e.g., MANN_a45, hamming10-2), where the speedup rises steadily to around 9 on 16 processors and shows no sign of leveling off. With this we achieve the main goal of applying parallelism to the program: increased performance on complex computations.

Graphs A and B: Speedup versus number of processors (p2 through p16) for the benchmark graphs of Table A.

Question to Dr. Bui: do you want graphs? Or tabular data as below? Or is this fine?

Question to Dr. Bui: do you want a conclusion for this section, e.g., a summary recapping the main points and a discussion of further enhancements?

Conclusion: With OpenMP's support for incremental parallelization, we can apply as much parallelism as we like; for example, we can simply parallelize the local optimization part, any loop, or any section that can be parallelized. Of course, we have to analyze whether it is worth it, i.e., whether the gain from the parallel work exceeds the fork/join overhead. Further enhancement: a hybrid approach combining distributed memory with shared memory.
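As a rough illustration of that analysis (an editorial sketch, not a measurement from this project), Amdahl's law bounds the speedup by 1 / ((1 - P) + P/N), where P is the fraction of the run time that is parallelized and N is the number of processors. With the roughly 50% fraction profiled for small graphs, 16 processors can give at most about a 1.9x speedup, while a fraction around 95%, which is plausible for the large dense graphs where the ant loop dominates, allows a bound near 9x, consistent with the measured results.

    #include <cstdio>

    // Hedged illustration: Amdahl's law upper bound on speedup.
    static double amdahl_bound(double P, int N)    // P: parallel fraction, N: processors
    {
        return 1.0 / ((1.0 - P) + P / N);
    }

    int main()
    {
        std::printf("P = 0.50, N = 16: at most %.1fx\n", amdahl_bound(0.50, 16));  // ~1.9x
        std::printf("P = 0.95, N = 16: at most %.1fx\n", amdahl_bound(0.95, 16));  // ~9.1x
        return 0;
    }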

Reference

(Dr. Bui: you can get more information on the books, e.g., ISBNs and complete author names, by searching for the titles on Amazon.)

1) William Stallings, Operating Systems: Internals and Design Principles, Chapter 4: Threads, SMP, and Microkernels.

2) R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, and R. Menon, Parallel Programming in OpenMP, Preface.

3) Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, Chapter 17: Shared-Memory Programming.
