Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes

Size: px

Start display at page:

Download "Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes"

Loraine McCarthy
6 years ago
Views:

1 Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes Agnieszka Debudaj-Grabysz 1 and Rolf Rabenseifner 2 1 Silesian University of Technology, Gliwice, Poland; 2 High-Performance Computing Center (HLRS), Stuttgart, Germany EuroPar, Dresden, August 30,

2 OUTLINE 1. The algorithm of simulated annealing 2. Hybrid communication method (HC) nesting OpenMP in MPI The reference method The method with a single data exchange 4. Outer level load balancing 5. Inner level load balancing 6. Vehicle routing problem with time windows (VRPTW) an example of a bi-criterion optimization problem 7. Experimental results EuroPar, Dresden, August 30, /18

3 THE GOAL AND ALGORITHM OF SIMULATED ANNEALING Finding the state of minimal (maximal) value of the cost function Begin generate initial solution End Starting at Eagle accepted not accepted e.g., to go into the deepest crater on the moon No End conditions met? Yes Trials Really descending Select a new solution from neighborhood Next position on a suitable footpath Is the new solution better? Accept current solution EuroPar, Dresden, August 30, /18 Yes No deeper one independent dependent Take a random number sometimes climbing back on the crater edge to get a chance to find a The sequence of trials forms a chain Yes No Trial Ceiling moves down in line with decreasing parameter called temperature Is the number < ceiling? The factor sometimes is reduced over the execution time

4 THE HYBRID COMMUNICATION METHOD Nesting OpenMP in MPI Hybrid character of the hardware Parallelization uses two levels SMP nodes CPUs shared memory Node interconnect INNER PARALLELIZATION Only up to a few threads are used Communication overhead is minimal relative to that between nodes Uses OpenMP OUTER PARALLELIZATION Several SMP nodes Communication overhead is not negligible Uses MPI EuroPar, Dresden, August 30, /18

5 THE REFERENCE HYBRID COMMUNICATION METHOD Outer parallelization for communication between nodes Each Each chain chain is is divided divided into into subchainchains by by the the factor factor of of the the sub- number number of of nodes nodes The The process process of of generating generating every every consecutive consecutive chain chain is is performed performed without without communication communication between between nodes nodes At At the the end, end, the the best best solution solution is is picked picked up up as as the the final final one one chain 1 chain 2 chain 3 Node interconnect The The maximal maximal number number of of chain n engaged engaged nodes nodes is is limited limited by by reasonable reasonable shortening shortening of of the the chain A final solution chain length length EuroPar, Dresden, August 30, /18

6 Thread 1 Thread 2 Thread 3 Thread 4 Agnieszka Debudaj-Grabysz, Rolf Rabenseifner Load Balanced Parallel Simulated Annealing... THE REFERENCE HYBRID COMMUNICATION METHOD Inner parallelization for communication within nodes SMP-node Performing CPU-intensive trials in a parallel region Fast comparison of the costfunction in a serial OpenMP master section Select one solution common for all threads from all accepted ones for(l=0;l<nooftempred;l++) EuroPar, Dresden, August 30, /18 { #pragma omp parallel private(ii) { for(ii=0;ii<epochlengthpernode; ii=ii+setoftrialsize) { #pragma omp for schedule(static) for(i=0;i<setoftrialsize;i++) perform_trial(); #pragma omp barrier #pragma omp master {select one solution common for all threads... } #pragma omp barrier } /*end for (ii=0;...) */ } /*end of parallel region*/ {reduce the temperature... } } /*end for(l=0;...)*/

7 THE METHOD WITH A SINGLE DATA EXCHANGE The idea Incorporating one data exchange after elapsing a percent of the specified time limit (e.g. 50%, 70%) During the exchange of the data the best solution is selected and mandated for all processes The idea gives the possibility of: Heavy exploration of the search space during the first phase, i.e., a few (but only a few) paths can reach the area of the global minimum. Improvement of the best path during the second phase by all working processes Many small (instead groups of of only astronauts again a few) are the looking moon independently for the deepest crater and hopefully, at least one group (=SMP node) is finding it Improvement of the best path during the second phase by all working processes (instead of only a few) George A short W. time Bush before (Jan 14, returning 2004): Using to earth, the Crew all groups Exploration are concentrated Vehicle, we will undertake extended to human the deepest missions crater to the found moon up as to early now. as 2015,... The last minutes, they all try --> to Jan. find 2004 the --> deepest Jan. 14 location in that crater! EuroPar, Dresden, August 30, /18

8 time [s] Agnieszka Debudaj-Grabysz, Rolf Rabenseifner Load Balanced Parallel Simulated Annealing... OUTER LEVEL LOAD BALANCING The times for generating 8 sub-chains based on an example run idle times stopping and communicating after a fixed number of temperature levels number of temperature levels time [s] Optimization: stopping and communicating after a fixed amount of time number of temperature levels EuroPar, Dresden, August 30, /18

9 INNER LEVEL LOAD BALANCING The First Optimization Step Each single trial needs extremely different compute time Therefore with always 5 trials per thread: Better load balancing EuroPar, Dresden, August 30, /18

10 time [ms ] Agnieszka Debudaj-Grabysz, Rolf Rabenseifner Load Balanced Parallel Simulated Annealing... INNER LEVEL LOAD BALANCING The First Optimization Step the average fastest trial After After the the optimization: optimization: 5 trials trials per per a thread thread the average value 112% of the average s e parate trials time [ms ] ms *) Before Before the the optimization: optimization: 1 trial trial per per a thread thread 51% of the average the average value the average slowest trial clus te re d trials 624 ms *) *) Execution time for 4x5 trials on 4 threads EuroPar, Dresden, August 30, /18

11 INNER LEVEL LOAD BALANCING The Second Optimization Step Redefinition of a trial: Finding a new valid solution S in the neighborhood of S the most time consuming function Allowing a trial to abort this loop without result causes better load balancing Average trial time more dominated by minimal trial size Absolute maximal trial time significantly reduced EuroPar, Dresden, August 30, /18

12 INNER LEVEL LOAD BALANCING The Second Optimization Step time [ms ] Clustered Clustered trials trials the average value after the redefinition be fore the re de finition 112% of the average afte r the re de finition 1.0 time [ms ] ms *) 273 ms *) 51% of the average be fore the re de finition Separate Separate trials trials 66% of the average the average value after the redefinition 624 ms *) afte r the re de finition 212 ms *) 29% of the average EuroPar, Dresden, August 30, 2006 *) Execution time for 12/18 4x5 trials on 4 threads

13 INNER LEVEL LOAD BALANCING The Third Optimization Step The The easiest easiest solution: solution: for(l=0;l<nooftempred;l++) { for(ii=0;ii<epochlengthpernode; ii=ii+setoftrialsize) { #pragma omp parallel for schedule(static) for(i=0;i<setoftrialsize;i++) perform_trial(); #pragma omp barrier /*end of parallel region*/ {select one solution common for all threads... } } /*end for (ii=0;...) */ {reduce the temperature... } } /*end for(l=0;...)*/ 30-40ms * 0.16ms * ~ for(l=0;l<nooftempred;l++) EuroPar, Dresden, August 30, /18 ~20 Third Third optimization: Reducing the the fork-join overhead The The optimized optimized solution: solution: { #pragma omp parallel private(ii) { for(ii=0;ii<epochlengthpernode; ii=ii+setoftrialsize) { #pragma omp for schedule(static) for(i=0;i<setoftrialsize;i++) perform_trial(); #pragma omp barrier #pragma omp master {select one solution common for all threads... } #pragma omp barrier } /*end for (ii=0;...) */ } /*end of parallel region*/ {reduce the temperature... } } /*end for(l=0;...)*/ *System: NEC TX7 16x1.5Ghz Intel ItaniumII CPUs, 6MB L3 Cache

14 VEHICLE ROUTING PROBLEM WITH TIME WINDOWS Goal Goal Minimizing Minimizing number number of of route route legs legs Minimizing Minimizing of of the the total total travel travel distance distance Constraints Constraints Customer Customer and and warehouse warehouse time time windows windows Vehicle Vehicle capacity capacity Duration Duration of of servicing servicing a single single customer customer Customer s Customer s demand demand Customer Warehouse EuroPar, Dresden, August 30, /18

15 EXPERIMENTAL RESULTS* THE FIRST OPTIMIZATION GOAL** The number of final solutions with the minimal number of route legs (i.e., good solutions), generated within 100 runs number of "good" solutions per 100 runs SEQ HC4, i.e., OMP=4*** HC2, i.e., OMP= number of processors Presented values are the averages over the values obtained for 4 data files from Solomon s benchmark set: R108, R111, RC105, RC108. Constraints: Execution time x number of CPUs = constant * Experiments carried out on NEC Xeon EM64T Cluster ** The previous work *** Emulated usage of 4 OMP threads, based on results of tests of OMP parallelization carried out on NEC TX-7 system EuroPar, Dresden, August 30, /18

16 EXPERIMENTAL RESULTS THE SECOND OPTIMIZATION GOAL The relative distance from the value obtained by a sequential algorithm the distance from a sequential result better results than for the SEQUENTIAL method the sequential result number of processors HC4 HC4-0.9 HC4-0.7 HC4-0.5 Presented values are the averages over the values obtained for 4 data files from Solomon s benchmark set: R108, R111, RC105, RC108. Constraints: Execution time x number of CPUs = constant better results than for the reference method Outer level communication after 100%, 90%, 70% and 50% of the total execution time EuroPar, Dresden, August 30, /18

17 EXPERIMENTAL RESULTS THE SECOND OPTIMIZATION GOAL R108 R111 RC105 RC108 total distance total distance total distance total distance seq HC4 HC4-0.9 HC4-0.7 HC number of processors number of processors number of processors EuroPar, Dresden, August 30, /18 number of processors

18 Acknowledgements This project was supported by HPC-Europa. The research stay at HLRS Jan. Feb was granted by HPC-Europa. HPC-Europa Pan-European Research Infrastructure on High Performance Computing at HLRS, Stuttgart SARA, Amsterdam idris, Paris epcc, Edinburgh CEPBA, Barcelona CINECA, Bologna Research Grants for Computational Science Access to HPC Scientific Collaboration Technical Support All travel and living expense fully reimbursed! 3 13 weeks Poster outside Applications are welcome at any level, from postgraduate researchers to senior profs. Currently, high acceptance rate EuroPar, Dresden, August 30, /18

Acknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text

Acknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center