Distributed Computing with Hierarchical Master-worker Paradigm for Parallel Branch and Bound Algorithm


Kento Aida (Tokyo Institute of Technology / PRESTO, JST, aida@dis.titech.ac.jp), Yoshiaki Futakata (nihou@alab.dis.titech.ac.jp), Wataru Natsume (Tokyo Institute of Technology, natsume@alab.dis.titech.ac.jp)

Abstract

This paper discusses the impact of the hierarchical master-worker paradigm on the performance of an application program that solves an optimization problem by a parallel branch and bound algorithm on a distributed computing system. The application program addressed in this paper solves the BMI Eigenvalue Problem, an optimization problem that minimizes the greatest eigenvalue of a bilinear matrix function. This paper proposes a parallel branch and bound algorithm that solves the BMI Eigenvalue Problem with the hierarchical master-worker paradigm. The experimental results show that the conventional algorithm with the master-worker paradigm significantly degraded performance on a Grid test bed, where computing resources were distributed over a WAN behind a firewall, while the hierarchical master-worker paradigm sustained good performance.

1 Introduction

Progress in Grid computing technology significantly reduces the cost of high-performance computing and has the potential to broaden the user community of high-performance computing. Grid computing enables user communities that have large scale problems but no high-end supercomputers to perform computation with huge computational power. The research community around optimization problems is one such community: it has many NP-hard problems that require huge computational power even to obtain semi-optimal solutions, yet it must scale down the problems it solves for lack of computational power. Thus, Grid computing has the potential to scale up the size of the optimization problems solvable in this community.

Parallel applications that solve optimization problems on a PC cluster or on a Grid have been studied [1, 6, 10, 13]. These applications use the master-worker paradigm, where a single master process dispatches a subset of the computation, or a task, to multiple worker processes and gathers computed results from the worker processes. The master-worker paradigm is successfully used in many parallel applications on PC clusters and on Grids [1, 2, 6, 7, 10, 12] as a common framework for implementing parallel applications. However, the performance of an application with the master-worker paradigm is affected by many factors. For instance, communication overhead between a master process and worker processes can degrade performance, and the degradation can be particularly significant on a Grid, because communication overhead among computing resources connected by a WAN is high. Also, the master process can become a bottleneck for application performance if it controls too many worker processes, because a master process communicates frequently with all of its worker processes.

The hierarchical master-worker paradigm is one solution that avoids performance degradation in the master-worker paradigm. In the hierarchical master-worker paradigm, a single supervisor process controls multiple process sets, each of which is composed of a single master process and multiple worker processes. Distribution of tasks is performed in two phases: distribution from the supervisor process to master processes, and distribution from a master process to its worker processes.
Collection of computed results is performed in the reverse direction. The hierarchical master-worker paradigm has two advantages over the conventional master-worker paradigm. The first is reduced communication overhead: a set consisting of a master process and its worker processes, which communicate frequently with each other, can be placed on tightly coupled computing resources. The second is that it prevents a single master process from becoming a performance bottleneck by distributing work among multiple master processes.
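For illustration, the following minimal Python sketch models the two-phase task distribution just described; the dict-based process sets and round-robin dealing are simplifying assumptions for exposition, not the paper's implementation.

```python
# Two-phase task distribution in the hierarchical master-worker
# paradigm (illustrative sketch; process sets are modeled as plain
# dicts and lists rather than real distributed processes).
def distribute(tasks, masters):
    # Phase 1: the supervisor deals tasks out to master processes.
    for i, task in enumerate(tasks):
        masters[i % len(masters)]["queue"].append(task)
    # Phase 2: each master dispatches its share to its own workers.
    for m in masters:
        for j, task in enumerate(m["queue"]):
            m["workers"][j % len(m["workers"])].append(task)

masters = [{"queue": [], "workers": [[], []]} for _ in range(2)]
distribute(list(range(8)), masters)
print(masters)  # collection of results would flow back in reverse
```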

This paper discusses the impact of the hierarchical master-worker paradigm on the performance of an application program that solves an optimization problem by a parallel branch and bound algorithm on a distributed computing system. The application program addressed in this paper solves the BMI Eigenvalue Problem, an optimization problem that minimizes the greatest eigenvalue of a bilinear matrix function. This paper proposes a parallel branch and bound algorithm that solves the BMI Eigenvalue Problem with the hierarchical master-worker paradigm. The experimental results show that the conventional algorithm with the master-worker paradigm significantly degraded performance on a Grid test bed, where computing resources were distributed over a WAN behind a firewall, while the hierarchical master-worker paradigm sustained good performance.

The rest of the paper is organized as follows: Section 2 describes the BMI Eigenvalue Problem and a parallel branch and bound algorithm that solves the problem with the conventional master-worker paradigm. Section 3 describes the proposed parallel branch and bound algorithm with the hierarchical master-worker paradigm, and Section 4 presents its experimental results on a Grid test bed. Section 5 describes related work, and finally, Section 6 presents conclusions and future work.

2 BMI Eigenvalue Problem

This section gives an overview of the BMI Eigenvalue Problem and of a parallel branch and bound algorithm that solves this problem with the conventional master-worker paradigm.

2.1 BMI Eigenvalue Problem

The BMI Eigenvalue Problem is to find the solution, x and y, that minimizes the greatest eigenvalue of F(x, y), defined as follows. Let F : R^{n_x} × R^{n_y} → R^{m×m} be the biaffine function given by (1), where symmetric matrices F_ij = F_ij^T ∈ R^{m×m} (i = 0, ..., n_x; j = 0, ..., n_y) are given, x := (x_1, ..., x_{n_x})^T, and y := (y_1, ..., y_{n_y})^T:

    F(x, y) := F_{00} + \sum_{i=1}^{n_x} x_i F_{i0} + \sum_{j=1}^{n_y} y_j F_{0j} + \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} x_i y_j F_{ij}    (1)

The BMI Eigenvalue Problem is recognized as a general framework for the analysis and synthesis of control systems in a variety of industrial applications, such as position control of a helicopter and control of robot arms. Thus, speeding up the computation is expected in the control theory community to enable analysis and synthesis of large scale control systems. Also, in the operations research community, solving large scale instances that have never been solved is an academic grand challenge.
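To make the objective concrete, the following sketch evaluates F(x, y) from equation (1) with numpy and returns its greatest eigenvalue; the randomly generated symmetric matrices F_ij and the small problem sizes are assumptions chosen only for demonstration, not data from the paper.

```python
# Evaluate F(x, y) from equation (1) and return its greatest eigenvalue.
# Illustrative sketch: the random symmetric F_ij and the sizes n_x, n_y,
# m below are assumptions for demonstration.
import numpy as np

def make_symmetric(m, rng):
    a = rng.standard_normal((m, m))
    return (a + a.T) / 2.0

def greatest_eigenvalue(F, x, y):
    """F[i][j] holds F_ij for i = 0..n_x, j = 0..n_y as in equation (1)."""
    n_x, n_y = len(x), len(y)
    val = F[0][0].copy()
    for i in range(1, n_x + 1):
        val += x[i - 1] * F[i][0]
    for j in range(1, n_y + 1):
        val += y[j - 1] * F[0][j]
    for i in range(1, n_x + 1):
        for j in range(1, n_y + 1):
            val += x[i - 1] * y[j - 1] * F[i][j]
    return np.linalg.eigvalsh(val)[-1]  # symmetric matrix: real spectrum

rng = np.random.default_rng(0)
n_x, n_y, m = 2, 2, 4
F = [[make_symmetric(m, rng) for _ in range(n_y + 1)] for _ in range(n_x + 1)]
print(greatest_eigenvalue(F, np.zeros(n_x), np.zeros(n_y)))
```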
2.2 Parallel Branch and Bound Algorithm

The BMI Eigenvalue Problem is NP-hard; thus, practical branch and bound algorithms that compute an ε-optimal solution, i.e., a solution whose error in the optimal value is less than ε, have been proposed [4, 5]. However, these algorithms still require huge computation time to solve a large scale problem, such as a control problem from a real industrial application, and this restricts the solvable control problems to small sizes.

A parallel branch and bound algorithm that computes the ε-optimal solution of the BMI Eigenvalue Problem has been proposed [1]. The algorithm performs a parallel branch and bound method with the master-worker paradigm on a PC cluster or on a Grid. A single master process maintains a search tree. It dispatches subproblems, which correspond to leaf nodes of the search tree, to multiple worker processes and receives computed results from the worker processes. Here, the computed results contain the best upper bound of the objective function, the best solution of the objective function, and the subproblems that have been generated by branching and have not been pruned on a worker process. A worker process that receives a subproblem from the master process performs branching, that is, it decomposes the subproblem into multiple subproblems and generates a subset of the search tree. Next, it computes the lower/upper bound for each subproblem on the tree and performs bounding, that is, it prunes unnecessary subproblems whose lower bounds exceed the current best upper bound. Finally, the worker process returns the computed results to the master process. The master and worker processes repeat these procedures until the gap between the lowest lower bound and the best upper bound converges to within ε.
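The branch and bound cycle described above can be summarized, in sequential form, by the following sketch; the branch, lower_bound, and upper_bound callbacks are hypothetical problem-specific routines, and only the control flow (branching, bounding, pruning, and the ε convergence test) mirrors the algorithm described here.

```python
# Sequential branch-and-bound skeleton for an ε-optimal solution.
# `branch`, `lower_bound`, and `upper_bound` are hypothetical callbacks;
# `upper_bound` is assumed to return (value, solution).
import heapq

def branch_and_bound(root, branch, lower_bound, upper_bound, eps=1e-3):
    best_ub, best_sol = float("inf"), None
    heap = [(lower_bound(root), 0, root)]   # leaf nodes of the search tree
    seq = 1                                 # tiebreaker for the heap
    while heap:
        lb, _, sub = heapq.heappop(heap)
        if lb >= best_ub:                   # bounding: prune this subtree
            continue
        if best_ub - lb <= eps * abs(lb):   # gap (Z - L)/|L| within ε
            break
        for child in branch(sub):           # branching
            ub, sol = upper_bound(child)
            if ub < best_ub:
                best_ub, best_sol = ub, sol
            child_lb = lower_bound(child)
            if child_lb < best_ub:          # keep only unpruned children
                heapq.heappush(heap, (child_lb, seq, child))
                seq += 1
    return best_ub, best_sol
```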

3 Hierarchical Master-Worker Paradigm

This section describes the proposed parallel branch and bound algorithm that solves the BMI Eigenvalue Problem with the hierarchical master-worker paradigm.

3.1 Drawbacks in the Master-worker Paradigm

While the master-worker paradigm is successfully used in many parallel applications as a common framework for implementing parallel applications, it has drawbacks when it is used on a Grid, where a large number of computing resources are connected via a WAN.

3.1.1 Communication Overhead

Communication overhead between a master process and worker processes significantly affects the performance of an application. Communication occurs when a master process dispatches a task to a worker process and when a worker process returns computed results to the master process. Performance degrades when the communication overhead is large relative to the execution time of a single task. For the application addressed in this paper, the granularity of a single task tends to be small; e.g., the execution time of a single task is a few seconds or less in a real problem. Thus, the impact of communication overhead on performance can be significant. The impact can be even larger on a Grid, where computing resources are connected via a WAN with high latency and low throughput. For instance, suppose that a master process running on a local computer dispatches tasks to worker processes running on a remote PC cluster. If the local computer and the remote PC cluster are connected via a WAN with high latency and low throughput, the high communication overhead degrades application performance significantly. Furthermore, in many cases a remote PC cluster is installed on a private network behind a firewall. A local user then needs to run the application with a mechanism that establishes communication through the firewall, e.g., ssh tunneling, and this setting further increases communication overhead.

3.1.2 Bottleneck on a Single Master Process

The performance of a master process can become the bottleneck for application performance if the master process controls too many worker processes. A master process continuously communicates with all worker processes to find idle workers, to dispatch new tasks, and to receive computed results. The master process needs to perform these procedures very frequently, because the task granularity in the application addressed in this paper tends to be small. Thus, if a master process controls too many worker processes, the computation and I/O performed on the master process degrade performance.
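As a rough illustration of why fine task granularity magnifies these drawbacks, the sketch below computes the fraction of a dispatch cycle spent on useful work, t_task / (t_task + t_rpc); the 0.03 [sec] mean task time comes from Section 4.1, while the per-call overheads are invented placeholders, not the paper's measurements.

```python
# Back-of-the-envelope efficiency of one dispatch cycle: the fraction
# of time a worker spends computing rather than waiting on the RPC.
# The per-call overheads below are invented placeholders; only the
# 0.03 [sec] mean task time comes from the paper (Section 4.1).
def dispatch_efficiency(t_task, t_rpc):
    return t_task / (t_task + t_rpc)

for setting, t_rpc in [("LAN", 0.005), ("LAN+ssh", 0.05), ("WAN+ssh", 0.5)]:
    print(f"{setting}: {dispatch_efficiency(0.03, t_rpc):.1%}")
```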
3.2 Proposed Parallel B&B Algorithm

The proposed algorithm performs a parallel branch and bound method that solves the BMI Eigenvalue Problem with the hierarchical master-worker paradigm, where a supervisor process controls multiple process sets, each composed of a master process and worker processes. A set of a master and worker processes performs a parallel branch and bound method on a subset of the search tree; that is, the master process dispatches subproblems to multiple worker processes and receives computed results from them. The supervisor process performs load balancing among master processes by migrating subproblems among them. Also, the supervisor process and the master processes hierarchically gather the best upper bound of the objective function computed on each worker process and update the current best upper bound on all worker processes. Updating the current best upper bound is crucial for the performance of the application, because it accelerates bounding [1]. Figure 1 shows an overview of the proposed algorithm.

[Figure 1. The hierarchical master-worker paradigm: a supervisor process holding Z controls process sets, each consisting of a master process M_i holding Z_Mi and worker processes W_j holding Z_Wj; subproblems flow down the hierarchy and results flow back up.]

In the figure, Z_Wi, Z_Mj, and Z denote the current best upper bound of the objective function stored on a worker process W_i, a master process M_j, and the supervisor process, respectively. Here, i = 1, ..., the number of worker processes in a set of a master and worker processes, and j = 1, ..., the number of master processes. The rest of this section describes the detailed procedures of the worker, master, and supervisor processes in the proposed algorithm.

3.2.1 Worker Process

In a set of a master and worker processes, S_i, a worker process W_j performs the following steps whenever the master process M_i dispatches a subproblem to it:

(1) W_j receives a subproblem and the current best upper bound of the objective function stored on M_i, or Z_Mi, from M_i.

(2) W_j branches the subproblem and generates a tree of subproblems.

(3) W_j computes the lower/upper bound of the objective function for the subproblems on the tree. For each subproblem on the tree, if the computed upper bound is less than the current best upper bound stored on W_j, or Z_Wj, then W_j updates Z_Wj to the lower value.

(4) W_j prunes unnecessary subproblems, i.e., those whose lower bounds exceed Z_Wj.

(5) W_j returns the computed results, namely the subproblems that have not been pruned, Z_Wj, and the solution of the objective function, to M_i.
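A compact sketch of steps (1)-(5) follows, assuming hypothetical branching/bounding callbacks and a plain dict as the result message; it illustrates the control flow only and is not the Ninf-based implementation.

```python
# Worker cycle from Section 3.2.1 (illustrative; `branch`, `lower_bound`
# and `upper_bound` are hypothetical problem-specific callbacks, with
# `upper_bound` returning (value, solution)).
def worker_step(subproblem, z_master, branch, lower_bound, upper_bound):
    z_w = z_master                            # (1) adopt the master's Z_Mi
    children = branch(subproblem)             # (2) branching
    solution = None
    for child in children:                    # (3) bounding
        ub, sol = upper_bound(child)
        if ub < z_w:
            z_w, solution = ub, sol           #     update local best bound
    survivors = [c for c in children
                 if lower_bound(c) < z_w]     # (4) pruning
    return {"subproblems": survivors,         # (5) results back to M_i
            "best_ub": z_w,
            "solution": solution}
```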

3.2.2 Master Process

A master process has two roles: performing a parallel branch and bound method with its worker processes, and achieving load balancing in cooperation with the supervisor process. A master process M_i repeats the following procedure until it receives a request to terminate the computation from the supervisor process:

(1) M_i examines requests from the supervisor process. If M_i has received a request, it performs one of the following actions:

- If the request is a query about the current computed results stored on M_i, M_i sends the results to the supervisor process. The results contain the number of subproblems assigned to M_i, Z_Mi, the current lowest lower bound of the objective function, and the solution of the objective function.
- If the request is to steal subproblems from M_i, M_i sends subproblems to the supervisor process.
- If the request is to assign new subproblems to M_i, M_i receives the subproblems.
- If the request is to update the current best upper bound on M_i, M_i compares the best upper bound stored on the supervisor process, or Z, with the current Z_Mi. If Z is less than Z_Mi, M_i updates Z_Mi to the value of Z.

(2) M_i probes its worker processes to find an idle worker. If M_i finds an idle worker process W_j, it performs the following steps:

(a) M_i receives the computed results, which contain the subproblems generated on W_j, Z_Wj, and the solution of the objective function, from W_j. In the initial phase of the execution, a worker process has not yet been dispatched a subproblem; in this case, this step is skipped.

(b) M_i prunes unnecessary subproblems, i.e., those whose lower bounds exceed Z_Mi.

(c) M_i compares the current Z_Mi and Z_Wj. If Z_Wj is less than Z_Mi, M_i updates Z_Mi to the value of Z_Wj.

(d) M_i dispatches a new subproblem and sends Z_Mi to W_j.
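The following sketch condenses the master procedure above into a loop; the supervisor and worker proxy objects and their methods are hypothetical stand-ins for the paper's Ninf-based RPC machinery.

```python
# Master loop from Section 3.2.2 (illustrative sketch; `supervisor`,
# `workers`, and their methods are hypothetical stand-ins).
def master_loop(supervisor, workers, pool, z_m=float("inf")):
    while not supervisor.terminate_requested():
        # (1) serve supervisor requests: status query, steal/assign
        #     subproblems, or update of the global best upper bound Z.
        req = supervisor.poll()
        if req == "query":
            supervisor.send_status(len(pool), z_m)
        elif req == "steal":
            supervisor.send_subproblems(pool)
        elif req == "assign":
            pool.extend(supervisor.receive_subproblems())
        elif req == "update":
            z_m = min(z_m, supervisor.best_upper_bound())
        # (2) dispatch to idle workers and merge their results.
        for w in workers:
            if w.idle():
                res = w.collect()                            # (a)
                if res is not None:
                    z_m = min(z_m, res["best_ub"])           # (c)
                    pool.extend(s for s in res["subproblems"]
                                if s.lower_bound < z_m)      # (b) prune
                if pool:
                    w.dispatch(pool.pop(), z_m)              # (d)
    return z_m
```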
3.2.3 Supervisor Process

The supervisor process performs the following steps to achieve load balancing and to share the best upper bound of the objective function among all processes:

(1) The supervisor process queries a master process M_i about its computed results and receives the results.

(2) The supervisor process computes N_i^mg by the following formula, where N_i denotes the number of subproblems assigned to M_i (N_i includes the number of subproblems being computed on the worker processes in S_i) and m denotes the total number of master processes:

    N_i^{mg} = N_i - \frac{1}{m} \sum_{k=1}^{m} N_k    (2)

N_i^mg represents the number of subproblems that will migrate from/to M_i to achieve load balancing. The supervisor process migrates subproblems among master processes as follows:

- If N_i^mg > 0, the supervisor process requests M_i to send back N_i^mg subproblems and receives them.
- If N_i^mg < 0, the supervisor process requests M_i to receive |N_i^mg| subproblems and assigns them.

(3) The supervisor process compares Z_Mi with the current best upper bound stored on itself, or Z. If Z_Mi is less than Z, the supervisor process updates Z to the value of Z_Mi and requests all master processes to update their current best upper bounds with the updated Z.

(4) The supervisor process computes the lowest lower bound of the objective function, L, by comparing the lowest lower bounds computed on the master processes, and examines whether the condition (Z - L)/|L| < ε is satisfied. If it is satisfied, the supervisor process requests all master processes to terminate the computation.

The load balancing policy presented in this section distributes an equal number of subproblems to the master processes. The supervisor process could apply more aggressive strategies, e.g., estimating the cost of executing a task and assigning tasks to master processes in proportion to their computational power. However, the proposed algorithm applies the conservative strategy, because a characteristic of the target application makes it difficult to estimate the cost of a task: the subproblems computed on a worker process are generated and pruned dynamically by branching and bounding during the execution. We performed a preliminary experiment with the load balancing algorithm presented in this section, solving benchmark problems on a test bed where five master processes and 64 worker processes ran on PC clusters connected to a LAN. In the result, no idle time was observed on the master processes; that is, load was balanced well. Nevertheless, the load balancing policy remains a research issue for performance improvement, and further discussion of it is future work.
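A sketch of one supervisor cycle, combining equation (2) with steps (1)-(4), is given below; the master proxy objects and their methods are again hypothetical stand-ins rather than the paper's implementation.

```python
# Supervisor cycle from Section 3.2.3 (illustrative; `masters` is a
# hypothetical list of proxies exposing the queries described above).
def supervisor_step(masters, z, eps):
    # (1) query every master: (n_i, z_mi, lowest_lb) per master.
    status = [m.query() for m in masters]
    counts = [s[0] for s in status]
    avg = sum(counts) / len(masters)
    # (2) equation (2): migrate surpluses toward deficits.
    surplus, deficit = [], []
    for m, n_i in zip(masters, counts):
        n_mg = n_i - avg
        (surplus if n_mg > 0 else deficit).append((m, n_mg))
    moved = [p for m, n_mg in surplus for p in m.steal(int(n_mg))]
    for m, n_mg in deficit:
        m.assign([moved.pop() for _ in range(min(len(moved), int(-n_mg)))])
    # (3) share the global best upper bound Z with all masters.
    z_new = min([z] + [s[1] for s in status])
    if z_new < z:
        z = z_new
        for m in masters:
            m.update_best_ub(z)
    # (4) termination test (Z - L)/|L| < ε.
    lowest = min(s[2] for s in status)
    done = lowest != 0 and (z - lowest) / abs(lowest) < eps
    return z, done
```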

3.3 Implementation on a Grid Test Bed

We implemented the proposed algorithm on a Grid test bed using the Grid RPC middleware Ninf [9, 11]. Ninf provides remote procedure call facilities that are designed to offer a programming interface similar to conventional function calls and that enable users to build Grid-enabled applications. A client computer in Ninf, the Ninf client, can request a remote computer, the Ninf server, to execute computing routines, Ninf libraries, installed on the Ninf server through the Ninf client API, Ninf_call(). In our implementation of the hierarchical master-worker paradigm, a worker process is implemented as a Ninf library, and a master process invokes the Ninf library through Ninf_call() to dispatch a subproblem. A master process is implemented as a set of multiple programs: a Ninf client program that invokes worker processes to perform the parallel branch and bound method, and Ninf libraries that perform load balancing in cooperation with the supervisor process. The supervisor process is implemented as a Ninf client program that invokes the Ninf libraries in a master process through Ninf_call() to perform load balancing.
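Schematically, the dispatch pattern looks as follows; note that rpc() below is a made-up, Python-flavored stand-in for Ninf_call(), whose actual C-level signature and semantics differ.

```python
# Schematic of the Section 3.3 dispatch pattern. NOT the Ninf API:
# `rpc` is a made-up stand-in for Ninf_call(); the real call is a C
# function with a different signature.
def rpc(server, library, *args):
    """Pretend RPC: invoke `library`, installed on `server`, with args."""
    return server[library](*args)

def dispatch_to_worker(server, subproblem, z_m):
    # The worker is installed on the Ninf server as a callable library;
    # the master invokes it remotely and waits for the computed results.
    return rpc(server, "bmi_worker", subproblem, z_m)

# Toy usage: here a "server" is just a dict of installed libraries.
server = {"bmi_worker": lambda sub, z: {"subproblems": [], "best_ub": z}}
print(dispatch_to_worker(server, None, 10.0))
```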
4 Experimental Results

This section presents experimental results of the proposed parallel branch and bound algorithm solving the BMI Eigenvalue Problem with the hierarchical master-worker paradigm. The experiment was performed on a Grid test bed constructed from computing resources installed at the University of California, San Diego (UCSD) and the Tokyo Institute of Technology (TITECH). Table 1 shows the computing resources used on the test bed. The measured ping latency between a computer at UCSD and one at TITECH is 152.7 [ms].

Table 1. Computing resources on the test bed

name  processor, memory, NIC                                OS     location
PC1   Pentium II 400MHz, 256MB, 100Base-T                   Linux  UCSD
PC2   Pentium III 700MHz, 256MB, 100Base-T                  Linux  TITECH
PC3   Pentium 4 1.9GHz, 512MB, 100Base-T                    Linux  TITECH
PCC1  dual Pentium III 1.4GHz, 512MB x 18 nodes, 100Base-T  Linux  TITECH
PCC2  dual Pentium III 1.4GHz, 512MB x 12 nodes, 100Base-T  Linux  TITECH

The benchmark problems for the experiment are the helicopter control problem [8] and a synthetic problem [1]. The helicopter control problem is a real application problem that finds optimal parameters for a controller positioning a helicopter. Its problem size is n_x = 10, n_y = 2, and m = 8, where m denotes the matrix size of F_ij. The synthetic problem is created by generating the elements of the matrices F_ij randomly. Its problem size is larger than that of the helicopter control problem: n_x = 6, n_y = 6, and m = 24. Sequential execution time for the benchmark problems on PC3 is 1087 [sec] and [sec] for the helicopter control problem and the synthetic problem, respectively.

4.1 Comparison of Master-worker Paradigms

First, we compare the performance of the master-worker paradigm and the hierarchical master-worker paradigm on the Grid test bed. Figures 2 and 3 present the execution time to solve the benchmark problems with the master-worker paradigm (mw) and with the hierarchical master-worker paradigm (hmw). In the figures, LAN indicates execution time where the supervisor process, master process, and worker processes run on computers connected to a LAN at TITECH. Here, the master process runs on PC2 and 16 worker processes run on eight nodes (16 CPUs) of PCC1 for mw; the supervisor process runs on PC2, the master process runs on a single node of PCC1, and 16 worker processes run on eight nodes of PCC1 for hmw.

[Figure 2. Execution time of the helicopter control problem on a Grid test bed (mw vs. hmw in the LAN, LAN+ssh, and WAN+ssh settings).]

[Figure 3. Execution time of the synthetic problem on a Grid test bed (mw vs. hmw in the LAN, LAN+ssh, and WAN+ssh settings).]

Table 2. Overhead of a single Ninf_call() [sec], measured in the LAN, LAN+ssh, and WAN+ssh settings for the helicopter control problem and the synthetic problem.

Next, LAN+ssh indicates execution time where all processes run on computers at TITECH, but a master/supervisor process must communicate with a worker/master process through a firewall by ssh tunneling. This represents the situation where users use a PC cluster in their own department but the cluster sits behind its own firewall. The allocation of processes is the same as in LAN, except that PC2 communicates with the other computers through the firewall by ssh tunneling. Finally, WAN+ssh indicates execution time where the master/supervisor process runs on a computer at UCSD and the other processes run on computers at TITECH; all computers at TITECH are installed behind a firewall. This case represents the situation where a user uses computing resources at a remote site. Here, the master/supervisor process runs on PC1, the other processes are allocated as in LAN, and PC1 communicates with computers at TITECH by ssh tunneling.

The results in Figures 2 and 3 show that the master-worker paradigm degrades performance in LAN+ssh and WAN+ssh, while the hierarchical master-worker paradigm sustains almost the same performance in all cases. The performance degradation of mw in WAN+ssh is particularly significant: execution of the helicopter control problem and the synthetic problem did not finish within 10 minutes and within one hour, respectively. The reason for the degradation is high communication overhead, caused by high communication latency on the WAN and by the overhead of ssh tunneling. Table 2 presents the overhead of invoking a single Ninf_call() from a master process to a worker process; it shows high communication overhead in LAN+ssh and WAN+ssh. As described in Section 3.1.1, the execution time of a single task in this application is small: the measured mean execution time of a single task is 0.03 [sec] and 0.52 [sec] for the helicopter control problem and the synthetic problem, respectively. Thus, the communication overhead is large relative to the task execution time; in WAN+ssh, it is significantly higher than the task execution time. As a breakdown of the communication overhead, the high overhead caused by ssh tunneling is visible in the gap between LAN and LAN+ssh in Table 2. Also, the amount of data transferred between a master process and a worker process for the execution of a single task is small in this application. The size of the data transferred from a master process to a worker process is 3829 [Bytes] and [Bytes] for the helicopter control problem and the synthetic problem, respectively, and that from a worker process to a master process is 720 [Bytes] for both. (The amount of data transferred from a worker process to a master process depends on how many subproblems are pruned on the worker process; the value given is the worst case, in which no subproblems are pruned.) This suggests that ssh tunneling causes significant communication overhead even for an application with small transferred data.

4.2 Evaluation of Scalability

Next, this section presents results on the performance scalability of the hierarchical master-worker paradigm. Figures 4 and 5 present the performance scalability of the master-worker paradigm and the hierarchical master-worker paradigm.
In the figures, mw indicates the execution time of the benchmark problems where the master process runs on PC3 and worker processes run on nodes of PCC1 and PCC2.

[Figure 4. Performance scalability for the helicopter control problem: execution time vs. the number of workers for mw, hmw(m=1), hmw(m=2), and hmw(m=3).]

[Figure 5. Performance scalability for the synthetic problem: execution time vs. the number of workers for mw, hmw(m=1), hmw(m=2), and hmw(m=3).]

Also, hmw indicates execution time where the supervisor process runs on PC3 and master/worker processes run on nodes of PCC1 and PCC2. The value of m in parentheses indicates the number of master processes in the hierarchical master-worker paradigm, where worker processes are divided equally among the master processes. All computing resources are installed at TITECH, and the processes communicate without ssh tunneling.

The results in Figures 4 and 5 show that the master-worker paradigm degrades performance at large numbers of worker processes, while the hierarchical master-worker paradigm improves performance. The performance gap between mw and hmw(m=1) for the helicopter control problem is caused by communication overhead: the master process and worker processes running on nodes of PCC1 communicate through a single network switch in hmw(m=1), while in mw they communicate via multiple network switches. The measured ping latency between computing nodes is 0.1 [msec] in the former case and 0.2 [msec] in the latter. The performance of the master-worker paradigm for the helicopter control problem is significantly affected even by this small communication overhead, because the task granularity is small. Performance for the synthetic problem is not affected by the small communication overhead, because its task granularity is sufficiently large.

Regarding the number of master processes in the hierarchical master-worker paradigm, adding a master process improves performance in several cases, e.g., hmw(m=2) and hmw(m=3) for the helicopter control problem, and hmw(m=2) for the synthetic problem on 48 worker processes. This means that adding master processes effectively eliminates the performance bottleneck on a single master process. However, adding a master process does not improve performance at 32 worker processes; that is, adding a master process does not help when a single master process still has enough capacity to control all worker processes. Also, for the synthetic problem on 48 worker processes, hmw(m=3) performs worse than hmw(m=2). The reason is that two master processes are enough to control 48 worker processes for the synthetic problem, because its task granularity is larger than that of the helicopter control problem.

5 Related Work

The master-worker paradigm on a Grid has been discussed in much of the literature. The AppLeS project discusses how to place a master process and worker processes on computing resources [12]. The work presented in [7] discusses how to decide the number of worker processes allocated to a master-worker application and proposes a strategy that adjusts the number of allocated worker processes adaptively during execution. Parallel branch and bound algorithms with the master-worker paradigm are addressed in [6, 10, 13]. MW [6] and Javelin 3 [10] provide software frameworks for implementing applications with the master-worker paradigm, and parallel branch and bound applications have been implemented on these frameworks.
The work presented in [13] gives experimental results for a parallel branch and bound algorithm that solves the knapsack problem using the Grid RPC middleware Ninf. The hierarchical master-worker paradigm is supported in ATLAS [3], Satin [14], and AMWAT [2]. The work presented in [14] compares load balancing algorithms in a hierarchical master-worker setting.

AMWAT provides a software template for implementing a parallel application with the (hierarchical) master-worker paradigm. However, the detailed discussion of the impact of the hierarchical master-worker paradigm on performance presented in this paper has not been reported in these works.

6 Conclusions

This paper proposed a parallel branch and bound algorithm that solves an optimization problem, namely the BMI Eigenvalue Problem, with the hierarchical master-worker paradigm on a distributed computing system, and compared its performance with the conventional master-worker paradigm on a Grid test bed. The results show that computation with the conventional master-worker paradigm is not suitable for efficiently solving an optimization problem with fine-grain tasks in the WAN setting, because the communication overhead is too high compared with the cost of a task. Also, the performance bottleneck on a single master process degrades the performance of the master-worker paradigm even in the LAN setting. The hierarchical master-worker paradigm avoids the performance degradation caused by high communication overhead by keeping the frequent communication between a master process and its worker processes within tightly coupled computing resources. It also eliminates the performance bottleneck on a single master process and improves performance scalability by distributing work among multiple master processes. The hierarchical master-worker paradigm is necessary to achieve satisfactory performance for an application solving an optimization problem with fine-grain tasks, such as the BMI Eigenvalue Problem, in the WAN setting, where multiple PC clusters are connected via a WAN through firewalls. Even in the LAN setting, the hierarchical master-worker paradigm improves performance scalability, although appropriate parameters, e.g., the numbers of master and worker processes, must be chosen to achieve the best performance.

The application evaluated in this paper runs a parallel branch and bound algorithm on a problem with fine-grain tasks. Thus, we believe that the results in this paper apply to other parallel branch and bound applications with fine-grain tasks. For further performance improvement, we need more sophisticated algorithms to set some of the parameters in the proposed algorithm, e.g., an algorithm to choose the number of master/worker processes and a load balancing algorithm among masters. Development of these algorithms to achieve the best performance is our future work.

Acknowledgments

We would like to sincerely thank Dr. Henri Casanova and the Global Scientific Information and Computing Center at the Tokyo Institute of Technology for allowing us to use their computing resources for our experiments. We also thank Prof. Shinji Hara and the staff of the Ninf project for their insightful comments.

References

[1] K. Aida, Y. Futakata, and S. Hara. High-performance parallel and distributed computing for the BMI eigenvalue problem. In Proc. of the 16th IEEE International Parallel and Distributed Processing Symposium.

[2] AppLeS Master Worker Application Template (AMWAT).

[3] J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. ATLAS: An Infrastructure for Global Computing. In Proc. of the 1996 SIGOPS European Workshop.

[4] H. Fujioka and K. Hoshijima. Bounds for the BMI eigenvalue problem. Trans. of the Society of Instrument and Control Engineers, 33(7).

[5] M. Fukuda and M. Kojima. Branch-and-cut algorithms for the bilinear matrix inequality eigenvalue problem. Computational Optimization and Applications, 19(1):79-105.
[6] J. Goux, S. Kulkarni, J. Linderoth, and M. Yoder. An enabling framework for master-worker applications on the computational grid. In Proc. of the 9th IEEE Symposium on High Performance Distributed Computing (HPDC9).

[7] E. Heymann, M. A. Senar, E. Luque, and M. Livny. Adaptive scheduling for master-worker applications on the computational grid. In Proc. of the 1st IEEE/ACM International Workshop on Grid Computing (Grid2000).

[8] L. H. Keel, S. P. Bhattacharyya, and J. W. Howze. Robust control with structured perturbations. IEEE Trans. on Auto. Contr., 33(1):68-78.

[9] S. Matsuoka, H. Nakada, M. Sato, and S. Sekiguchi. Design issues of Network Enabled Server Systems for the Grid. In Grid Computing - GRID 2000, Lecture Notes in Computer Science 1971. Springer-Verlag.

[10] M. O. Neary and P. Cappello. Advanced Eager Scheduling for Java-Based Adaptively Parallel Computing. In Proc. of the 2002 Joint ACM-ISCOPE Conference on Java Grande.

[11] Ninf: A Global Computing Infrastructure.

[12] G. Shao, F. Berman, and R. Wolski. Master/slave computing on the grid. In Proc. of the Heterogeneous Computing Workshop.

[13] Y. Tanaka, M. Sato, M. Hirano, H. Nakada, and S. Sekiguchi. Performance evaluation of a firewall-compliant Globus-based wide-area cluster system. In Proc. of the 9th IEEE Symposium on High-Performance Distributed Computing.

[14] R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient Load Balancing for Wide-Area Divide-and-Conquer Applications. In Proc. of the 8th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 34-43.


More information

Energy Conservation In Computational Grids

Energy Conservation In Computational Grids Energy Conservation In Computational Grids Monika Yadav 1 and Sudheer Katta 2 and M. R. Bhujade 3 1 Department of Computer Science and Engineering, IIT Bombay monika@cse.iitb.ac.in 2 Department of Electrical

More information

An Efficient Load-Sharing and Fault-Tolerance Algorithm in Internet-Based Clustering Systems

An Efficient Load-Sharing and Fault-Tolerance Algorithm in Internet-Based Clustering Systems An Efficient Load-Sharing and Fault-Tolerance Algorithm in Internet-Based Clustering Systems In-Bok Choi and Jae-Dong Lee Division of Information and Computer Science, Dankook University, San #8, Hannam-dong,

More information

Technical Brief: Specifying a PC for Mascot

Technical Brief: Specifying a PC for Mascot Technical Brief: Specifying a PC for Mascot Matrix Science 8 Wyndham Place London W1H 1PP United Kingdom Tel: +44 (0)20 7723 2142 Fax: +44 (0)20 7725 9360 info@matrixscience.com http://www.matrixscience.com

More information

Parallelization of Graph Isomorphism using OpenMP

Parallelization of Graph Isomorphism using OpenMP Parallelization of Graph Isomorphism using OpenMP Vijaya Balpande Research Scholar GHRCE, Nagpur Priyadarshini J L College of Engineering, Nagpur ABSTRACT Advancement in computer architecture leads to

More information

PARLGRAN: Parallelism granularity selection for scheduling task chains on dynamically reconfigurable architectures *

PARLGRAN: Parallelism granularity selection for scheduling task chains on dynamically reconfigurable architectures * PARLGRAN: Parallelism granularity selection for scheduling task chains on dynamically reconfigurable architectures * Sudarshan Banerjee, Elaheh Bozorgzadeh, Nikil Dutt Center for Embedded Computer Systems

More information

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme Yue Zhang 1 and Yunxia Pei 2 1 Department of Math and Computer Science Center of Network, Henan Police College, Zhengzhou,

More information

On Cluster Resource Allocation for Multiple Parallel Task Graphs

On Cluster Resource Allocation for Multiple Parallel Task Graphs On Cluster Resource Allocation for Multiple Parallel Task Graphs Henri Casanova Frédéric Desprez Frédéric Suter University of Hawai i at Manoa INRIA - LIP - ENS Lyon IN2P3 Computing Center, CNRS / IN2P3

More information

A paralleled algorithm based on multimedia retrieval

A paralleled algorithm based on multimedia retrieval A paralleled algorithm based on multimedia retrieval Changhong Guo Teaching and Researching Department of Basic Course, Jilin Institute of Physical Education, Changchun 130022, Jilin, China Abstract With

More information

QoS-aware resource allocation and load-balancing in enterprise Grids using online simulation

QoS-aware resource allocation and load-balancing in enterprise Grids using online simulation QoS-aware resource allocation and load-balancing in enterprise Grids using online simulation * Universität Karlsruhe (TH) Technical University of Catalonia (UPC) Barcelona Supercomputing Center (BSC) Samuel

More information

Job Re-Packing for Enhancing the Performance of Gang Scheduling

Job Re-Packing for Enhancing the Performance of Gang Scheduling Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT

More information

Predicting the response time of a new task on a Beowulf cluster

Predicting the response time of a new task on a Beowulf cluster Predicting the response time of a new task on a Beowulf cluster Marta Beltrán and Jose L. Bosque ESCET, Universidad Rey Juan Carlos, 28933 Madrid, Spain, mbeltran@escet.urjc.es,jbosque@escet.urjc.es Abstract.

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Applying the Component Paradigm to AUTOSAR Basic Software

Applying the Component Paradigm to AUTOSAR Basic Software Applying the Component Paradigm to AUTOSAR Basic Software Dietmar Schreiner Vienna University of Technology Institute of Computer Languages, Compilers and Languages Group Argentinierstrasse 8/185-1, A-1040

More information

Efficiency Evaluation of the Input/Output System on Computer Clusters

Efficiency Evaluation of the Input/Output System on Computer Clusters Efficiency Evaluation of the Input/Output System on Computer Clusters Sandra Méndez, Dolores Rexachs and Emilio Luque Computer Architecture and Operating System Department (CAOS) Universitat Autònoma de

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme Yue Zhang, Yunxia Pei To cite this version: Yue Zhang, Yunxia Pei. A Resource Discovery Algorithm in Mobile Grid Computing

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Chapter 2 System Models

Chapter 2 System Models CSF661 Distributed Systems 分散式系統 Chapter 2 System Models 吳俊興國立高雄大學資訊工程學系 Chapter 2 System Models 2.1 Introduction 2.2 Physical models 2.3 Architectural models 2.4 Fundamental models 2.5 Summary 2 A physical

More information

Homogenization: A Mechanism for Distributed Processing across a Local Area Network

Homogenization: A Mechanism for Distributed Processing across a Local Area Network Homogenization: A Mechanism for Distributed Processing across a Local Area Network Mahmud Shahriar Hossain Department of Computer Science and Engineering, Shahjalal University of Science and Technology,

More information

2. Modeling AEA 2018/2019. Based on Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice - ch. 2

2. Modeling AEA 2018/2019. Based on Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice - ch. 2 2. Modeling AEA 2018/2019 Based on Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice - ch. 2 Content Introduction Modeling phases Modeling Frameworks Graph Based Models Mixed

More information

Computer-System Organization (cont.)

Computer-System Organization (cont.) Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,

More information

Advanced School in High Performance and GRID Computing November Introduction to Grid computing.

Advanced School in High Performance and GRID Computing November Introduction to Grid computing. 1967-14 Advanced School in High Performance and GRID Computing 3-14 November 2008 Introduction to Grid computing. TAFFONI Giuliano Osservatorio Astronomico di Trieste/INAF Via G.B. Tiepolo 11 34131 Trieste

More information

A PARALLEL ALGORITHM FOR THE DEFORMATION AND INTERACTION OF STRUCTURES MODELED WITH LAGRANGE MESHES IN AUTODYN-3D

A PARALLEL ALGORITHM FOR THE DEFORMATION AND INTERACTION OF STRUCTURES MODELED WITH LAGRANGE MESHES IN AUTODYN-3D 3 rd International Symposium on Impact Engineering 98, 7-9 December 1998, Singapore A PARALLEL ALGORITHM FOR THE DEFORMATION AND INTERACTION OF STRUCTURES MODELED WITH LAGRANGE MESHES IN AUTODYN-3D M.

More information

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Ali Al-Dhaher, Tricha Anjali Department of Electrical and Computer Engineering Illinois Institute of Technology Chicago, Illinois

More information

A Parallel Macro Partitioning Framework for Solving Mixed Integer Programs

A Parallel Macro Partitioning Framework for Solving Mixed Integer Programs This research is funded by NSF, CMMI and CIEG 0521953: Exploiting Cyberinfrastructure to Solve Real-time Integer Programs A Parallel Macro Partitioning Framework for Solving Mixed Integer Programs Mahdi

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction What is an Operating System? Mainframe Systems Desktop Systems Multiprocessor Systems Distributed Systems Clustered System Real -Time Systems Handheld Systems Computing Environments

More information

LECTURE 3:CPU SCHEDULING

LECTURE 3:CPU SCHEDULING LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives

More information

A Framework for Opportunistic Cluster Computing using JavaSpaces 1

A Framework for Opportunistic Cluster Computing using JavaSpaces 1 A Framework for Opportunistic Cluster Computing using JavaSpaces 1 Jyoti Batheja and Manish Parashar Electrical and Computer Engineering, Rutgers University 94 Brett Road, Piscataway, NJ 08854 {jbatheja,

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338

More information