Distributed Computing with Hierarchical Master-worker Paradigm for Parallel Branch and Bound Algorithm


Kento Aida (Tokyo Institute of Technology / PRESTO, JST, aida@dis.titech.ac.jp), Yoshiaki Futakata (nihou@alab.dis.titech.ac.jp), Wataru Natsume (Tokyo Institute of Technology, natsume@alab.dis.titech.ac.jp)

Abstract

This paper discusses the impact of the hierarchical master-worker paradigm on the performance of an application program that solves an optimization problem by a parallel branch and bound algorithm on a distributed computing system. The application program addressed in this paper solves the BMI Eigenvalue Problem, an optimization problem that minimizes the greatest eigenvalue of a bilinear matrix function. This paper proposes a parallel branch and bound algorithm that solves the BMI Eigenvalue Problem with the hierarchical master-worker paradigm. The experimental results show that the conventional algorithm with the master-worker paradigm significantly degraded performance on a Grid test bed, where computing resources were distributed over a WAN behind a firewall, while the hierarchical master-worker paradigm sustained good performance.

1 Introduction

Progress in Grid computing technology significantly reduces the cost of high-performance computing and has the potential to broaden the user community of high-performance computing. Grid computing enables user communities that have large scale problems but no high-end supercomputers to perform computation with huge computational power. The research community around optimization problems is one such community: it has many NP-hard problems that require huge computational power even to obtain semi-optimal solutions, yet it must scale down the problems it solves for lack of computational power. Thus, Grid computing has the potential to scale up the size of the optimization problems solvable in this community.

Parallel applications that solve optimization problems on a PC cluster or on a Grid have been studied [1, 6, 10, 13]. These applications use the master-worker paradigm, where a single master process dispatches a subset of the computation, or a task, to multiple worker processes and gathers computed results from the worker processes. The master-worker paradigm is successfully used in many parallel applications on PC clusters and on Grids [1, 2, 6, 7, 10, 12] as a common framework for implementing parallel applications. However, the performance of an application with the master-worker paradigm is affected by many factors. For instance, communication overhead between a master process and worker processes can degrade performance, and the degradation can be particularly significant on a Grid, because communication overhead among computing resources connected by a WAN is high. Also, the master process can become a bottleneck for application performance if it controls too many worker processes, because a master process communicates frequently with all of its worker processes.

The hierarchical master-worker paradigm is one solution that avoids performance degradation in the master-worker paradigm. In the hierarchical master-worker paradigm, a single supervisor process controls multiple process sets, each of which is composed of a single master process and multiple worker processes. Distribution of tasks is performed in two phases: distribution from the supervisor process to master processes, and distribution from a master process to its worker processes.
Collection of computed results is performed in the reverse direction. The hierarchical master-worker paradigm has two advantages over the conventional master-worker paradigm. The first is reduced communication overhead: a set consisting of a master process and its worker processes, which communicate frequently with each other, can be placed on tightly coupled computing resources. The second is that it prevents a single master process from becoming a performance bottleneck by distributing work among multiple master processes.
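For illustration, the following minimal Python sketch models the two-phase task distribution just described; the dict-based process sets and round-robin dealing are simplifying assumptions for exposition, not the paper's implementation.

```python
# Two-phase task distribution in the hierarchical master-worker
# paradigm (illustrative sketch; process sets are modeled as plain
# dicts and lists rather than real distributed processes).
def distribute(tasks, masters):
    # Phase 1: the supervisor deals tasks out to master processes.
    for i, task in enumerate(tasks):
        masters[i % len(masters)]["queue"].append(task)
    # Phase 2: each master dispatches its share to its own workers.
    for m in masters:
        for j, task in enumerate(m["queue"]):
            m["workers"][j % len(m["workers"])].append(task)

masters = [{"queue": [], "workers": [[], []]} for _ in range(2)]
distribute(list(range(8)), masters)
print(masters)  # collection of results would flow back in reverse
```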

This paper discusses the impact of the hierarchical master-worker paradigm on the performance of an application program that solves an optimization problem by a parallel branch and bound algorithm on a distributed computing system. The application program addressed in this paper solves the BMI Eigenvalue Problem, an optimization problem that minimizes the greatest eigenvalue of a bilinear matrix function. This paper proposes a parallel branch and bound algorithm that solves the BMI Eigenvalue Problem with the hierarchical master-worker paradigm. The experimental results show that the conventional algorithm with the master-worker paradigm significantly degraded performance on a Grid test bed, where computing resources were distributed over a WAN behind a firewall, while the hierarchical master-worker paradigm sustained good performance.

The rest of the paper is organized as follows: Section 2 describes the BMI Eigenvalue Problem and a parallel branch and bound algorithm that solves the problem with the conventional master-worker paradigm. Section 3 describes the proposed parallel branch and bound algorithm with the hierarchical master-worker paradigm, and Section 4 presents its experimental results on a Grid test bed. Section 5 describes related work, and finally, Section 6 presents conclusions and future work.

2 BMI Eigenvalue Problem

This section gives an overview of the BMI Eigenvalue Problem and of a parallel branch and bound algorithm that solves this problem with the conventional master-worker paradigm.

2.1 BMI Eigenvalue Problem

The BMI Eigenvalue Problem is to find the solution, x and y, that minimizes the greatest eigenvalue of F(x, y), defined as follows. Let F : R^{n_x} × R^{n_y} → R^{m×m} be the biaffine function given by (1), where symmetric matrices F_ij = F_ij^T ∈ R^{m×m} (i = 0, ..., n_x; j = 0, ..., n_y) are given, x := (x_1, ..., x_{n_x})^T, and y := (y_1, ..., y_{n_y})^T:

    F(x, y) := F_{00} + \sum_{i=1}^{n_x} x_i F_{i0} + \sum_{j=1}^{n_y} y_j F_{0j} + \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} x_i y_j F_{ij}    (1)

The BMI Eigenvalue Problem is recognized as a general framework for the analysis and synthesis of control systems in a variety of industrial applications, such as position control of a helicopter and control of robot arms. Thus, speeding up the computation is expected in the control theory community to enable analysis and synthesis of large scale control systems. Also, in the operations research community, solving large scale instances that have never been solved is an academic grand challenge.
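To make the objective concrete, the following sketch evaluates F(x, y) from equation (1) with numpy and returns its greatest eigenvalue; the randomly generated symmetric matrices F_ij and the small problem sizes are assumptions chosen only for demonstration, not data from the paper.

```python
# Evaluate F(x, y) from equation (1) and return its greatest eigenvalue.
# Illustrative sketch: the random symmetric F_ij and the sizes n_x, n_y,
# m below are assumptions for demonstration.
import numpy as np

def make_symmetric(m, rng):
    a = rng.standard_normal((m, m))
    return (a + a.T) / 2.0

def greatest_eigenvalue(F, x, y):
    """F[i][j] holds F_ij for i = 0..n_x, j = 0..n_y as in equation (1)."""
    n_x, n_y = len(x), len(y)
    val = F[0][0].copy()
    for i in range(1, n_x + 1):
        val += x[i - 1] * F[i][0]
    for j in range(1, n_y + 1):
        val += y[j - 1] * F[0][j]
    for i in range(1, n_x + 1):
        for j in range(1, n_y + 1):
            val += x[i - 1] * y[j - 1] * F[i][j]
    return np.linalg.eigvalsh(val)[-1]  # symmetric matrix: real spectrum

rng = np.random.default_rng(0)
n_x, n_y, m = 2, 2, 4
F = [[make_symmetric(m, rng) for _ in range(n_y + 1)] for _ in range(n_x + 1)]
print(greatest_eigenvalue(F, np.zeros(n_x), np.zeros(n_y)))
```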
2.2 Parallel Branch and Bound Algorithm

The BMI Eigenvalue Problem is NP-hard; thus, practical branch and bound algorithms that compute an ε-optimal solution, i.e., a solution whose error in the optimal value is less than ε, have been proposed [4, 5]. However, these algorithms still require huge computation time to solve a large scale problem, such as a control problem from a real industrial application, and this restricts the solvable control problems to small sizes.

A parallel branch and bound algorithm that computes the ε-optimal solution of the BMI Eigenvalue Problem has been proposed [1]. The algorithm performs a parallel branch and bound method with the master-worker paradigm on a PC cluster or on a Grid. A single master process maintains a search tree. It dispatches subproblems, which correspond to leaf nodes of the search tree, to multiple worker processes and receives computed results from the worker processes. Here, the computed results contain the best upper bound of the objective function, the best solution of the objective function, and the subproblems that have been generated by branching and have not been pruned on a worker process. A worker process that receives a subproblem from the master process performs branching, that is, it decomposes the subproblem into multiple subproblems and generates a subset of the search tree. Next, it computes the lower/upper bound for each subproblem on the tree and performs bounding, that is, it prunes unnecessary subproblems whose lower bounds exceed the current best upper bound. Finally, the worker process returns the computed results to the master process. The master and worker processes repeat these procedures until the gap between the lowest lower bound and the best upper bound converges to within ε.
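The branch and bound cycle described above can be summarized, in sequential form, by the following sketch; the branch, lower_bound, and upper_bound callbacks are hypothetical problem-specific routines, and only the control flow (branching, bounding, pruning, and the ε convergence test) mirrors the algorithm described here.

```python
# Sequential branch-and-bound skeleton for an ε-optimal solution.
# `branch`, `lower_bound`, and `upper_bound` are hypothetical callbacks;
# `upper_bound` is assumed to return (value, solution).
import heapq

def branch_and_bound(root, branch, lower_bound, upper_bound, eps=1e-3):
    best_ub, best_sol = float("inf"), None
    heap = [(lower_bound(root), 0, root)]   # leaf nodes of the search tree
    seq = 1                                 # tiebreaker for the heap
    while heap:
        lb, _, sub = heapq.heappop(heap)
        if lb >= best_ub:                   # bounding: prune this subtree
            continue
        if best_ub - lb <= eps * abs(lb):   # gap (Z - L)/|L| within ε
            break
        for child in branch(sub):           # branching
            ub, sol = upper_bound(child)
            if ub < best_ub:
                best_ub, best_sol = ub, sol
            child_lb = lower_bound(child)
            if child_lb < best_ub:          # keep only unpruned children
                heapq.heappush(heap, (child_lb, seq, child))
                seq += 1
    return best_ub, best_sol
```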

3 Hierarchical Master-Worker Paradigm

This section describes the proposed parallel branch and bound algorithm that solves the BMI Eigenvalue Problem with the hierarchical master-worker paradigm.

3.1 Drawbacks in the Master-worker Paradigm

While the master-worker paradigm is successfully used in many parallel applications as a common framework for implementing parallel applications, it has drawbacks when it is used on a Grid, where a large number of computing resources are connected via a WAN.

3.1.1 Communication Overhead

Communication overhead between a master process and worker processes significantly affects the performance of an application. Communication occurs when a master process dispatches a task to a worker process and when a worker process returns computed results to the master process. Performance degrades when the communication overhead is large relative to the execution time of a single task. For the application addressed in this paper, the granularity of a single task tends to be small; e.g., the execution time of a single task is a few seconds or less in a real problem. Thus, the impact of communication overhead on performance can be significant. The impact can be even larger on a Grid, where computing resources are connected via a WAN with high latency and low throughput. For instance, suppose that a master process running on a local computer dispatches tasks to worker processes running on a remote PC cluster. If the local computer and the remote PC cluster are connected via a WAN with high latency and low throughput, the high communication overhead degrades application performance significantly. Furthermore, in many cases a remote PC cluster is installed on a private network behind a firewall. A local user then needs to run the application with a mechanism that establishes communication through the firewall, e.g., ssh tunneling, and this setting further increases communication overhead.

3.1.2 Bottleneck on a Single Master Process

The performance of a master process can become the bottleneck for application performance if the master process controls too many worker processes. A master process continuously communicates with all worker processes to find idle workers, to dispatch new tasks, and to receive computed results. The master process needs to perform these procedures very frequently, because the task granularity in the application addressed in this paper tends to be small. Thus, if a master process controls too many worker processes, the computation and I/O performed on the master process degrade performance.
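As a rough illustration of why fine task granularity magnifies these drawbacks, the sketch below computes the fraction of a dispatch cycle spent on useful work, t_task / (t_task + t_rpc); the 0.03 [sec] mean task time comes from Section 4.1, while the per-call overheads are invented placeholders, not the paper's measurements.

```python
# Back-of-the-envelope efficiency of one dispatch cycle: the fraction
# of time a worker spends computing rather than waiting on the RPC.
# The per-call overheads below are invented placeholders; only the
# 0.03 [sec] mean task time comes from the paper (Section 4.1).
def dispatch_efficiency(t_task, t_rpc):
    return t_task / (t_task + t_rpc)

for setting, t_rpc in [("LAN", 0.005), ("LAN+ssh", 0.05), ("WAN+ssh", 0.5)]:
    print(f"{setting}: {dispatch_efficiency(0.03, t_rpc):.1%}")
```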
3.2 Proposed Parallel B&B Algorithm

The proposed algorithm performs a parallel branch and bound method that solves the BMI Eigenvalue Problem with the hierarchical master-worker paradigm, where a supervisor process controls multiple process sets, each composed of a master process and worker processes. A set of a master and worker processes performs a parallel branch and bound method on a subset of the search tree; that is, the master process dispatches subproblems to multiple worker processes and receives computed results from them. The supervisor process performs load balancing among master processes by migrating subproblems among them. Also, the supervisor process and the master processes hierarchically gather the best upper bound of the objective function computed on each worker process and update the current best upper bound on all worker processes. Updating the current best upper bound is crucial for the performance of the application, because it accelerates bounding [1]. Figure 1 shows an overview of the proposed algorithm.

[Figure 1. The hierarchical master-worker paradigm: a supervisor process holding Z controls process sets, each consisting of a master process M_i holding Z_Mi and worker processes W_j holding Z_Wj; subproblems flow down the hierarchy and results flow back up.]

In the figure, Z_Wi, Z_Mj, and Z denote the current best upper bound of the objective function stored on a worker process W_i, a master process M_j, and the supervisor process, respectively. Here, i = 1, ..., the number of worker processes in a set of a master and worker processes, and j = 1, ..., the number of master processes. The rest of this section describes the detailed procedures of the worker, master, and supervisor processes in the proposed algorithm.

3.2.1 Worker Process

In a set of a master and worker processes, S_i, a worker process W_j performs the following steps whenever the master process M_i dispatches a subproblem to it:

(1) W_j receives a subproblem and the current best upper bound of the objective function stored on M_i, or Z_Mi, from M_i.

(2) W_j branches the subproblem and generates a tree of subproblems.

(3) W_j computes the lower/upper bound of the objective function for the subproblems on the tree. For each subproblem on the tree, if the computed upper bound is less than the current best upper bound stored on W_j, or Z_Wj, then W_j updates Z_Wj to the lower value.

(4) W_j prunes unnecessary subproblems, i.e., those whose lower bounds exceed Z_Wj.

(5) W_j returns the computed results, namely the subproblems that have not been pruned, Z_Wj, and the solution of the objective function, to M_i.
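A compact sketch of steps (1)-(5) follows, assuming hypothetical branching/bounding callbacks and a plain dict as the result message; it illustrates the control flow only and is not the Ninf-based implementation.

```python
# Worker cycle from Section 3.2.1 (illustrative; `branch`, `lower_bound`
# and `upper_bound` are hypothetical problem-specific callbacks, with
# `upper_bound` returning (value, solution)).
def worker_step(subproblem, z_master, branch, lower_bound, upper_bound):
    z_w = z_master                            # (1) adopt the master's Z_Mi
    children = branch(subproblem)             # (2) branching
    solution = None
    for child in children:                    # (3) bounding
        ub, sol = upper_bound(child)
        if ub < z_w:
            z_w, solution = ub, sol           #     update local best bound
    survivors = [c for c in children
                 if lower_bound(c) < z_w]     # (4) pruning
    return {"subproblems": survivors,         # (5) results back to M_i
            "best_ub": z_w,
            "solution": solution}
```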

3.2.2 Master Process

A master process has two roles: performing a parallel branch and bound method with its worker processes, and achieving load balancing in cooperation with the supervisor process. A master process M_i repeats the following procedure until it receives a request to terminate the computation from the supervisor process:

(1) M_i examines requests from the supervisor process. If M_i has received a request, it performs one of the following actions:

- If the request is a query about the current computed results stored on M_i, M_i sends the results to the supervisor process. The results contain the number of subproblems assigned to M_i, Z_Mi, the current lowest lower bound of the objective function, and the solution of the objective function.
- If the request is to steal subproblems from M_i, M_i sends subproblems to the supervisor process.
- If the request is to assign new subproblems to M_i, M_i receives the subproblems.
- If the request is to update the current best upper bound on M_i, M_i compares the best upper bound stored on the supervisor process, or Z, with the current Z_Mi. If Z is less than Z_Mi, M_i updates Z_Mi to the value of Z.

(2) M_i probes its worker processes to find an idle worker. If M_i finds an idle worker process W_j, it performs the following steps:

(a) M_i receives the computed results, which contain the subproblems generated on W_j, Z_Wj, and the solution of the objective function, from W_j. In the initial phase of the execution, a worker process has not yet been dispatched a subproblem; in this case, this step is skipped.

(b) M_i prunes unnecessary subproblems, i.e., those whose lower bounds exceed Z_Mi.

(c) M_i compares the current Z_Mi and Z_Wj. If Z_Wj is less than Z_Mi, M_i updates Z_Mi to the value of Z_Wj.

(d) M_i dispatches a new subproblem and sends Z_Mi to W_j.
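The following sketch condenses the master procedure above into a loop; the supervisor and worker proxy objects and their methods are hypothetical stand-ins for the paper's Ninf-based RPC machinery.

```python
# Master loop from Section 3.2.2 (illustrative sketch; `supervisor`,
# `workers`, and their methods are hypothetical stand-ins).
def master_loop(supervisor, workers, pool, z_m=float("inf")):
    while not supervisor.terminate_requested():
        # (1) serve supervisor requests: status query, steal/assign
        #     subproblems, or update of the global best upper bound Z.
        req = supervisor.poll()
        if req == "query":
            supervisor.send_status(len(pool), z_m)
        elif req == "steal":
            supervisor.send_subproblems(pool)
        elif req == "assign":
            pool.extend(supervisor.receive_subproblems())
        elif req == "update":
            z_m = min(z_m, supervisor.best_upper_bound())
        # (2) dispatch to idle workers and merge their results.
        for w in workers:
            if w.idle():
                res = w.collect()                            # (a)
                if res is not None:
                    z_m = min(z_m, res["best_ub"])           # (c)
                    pool.extend(s for s in res["subproblems"]
                                if s.lower_bound < z_m)      # (b) prune
                if pool:
                    w.dispatch(pool.pop(), z_m)              # (d)
    return z_m
```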
3.2.3 Supervisor Process

The supervisor process performs the following steps to achieve load balancing and to share the best upper bound of the objective function among all processes:

(1) The supervisor process queries a master process M_i about its computed results and receives the results.

(2) The supervisor process computes N_i^mg by the following formula, where N_i denotes the number of subproblems assigned to M_i (N_i includes the number of subproblems being computed on the worker processes in S_i) and m denotes the total number of master processes:

    N_i^{mg} = N_i - \frac{1}{m} \sum_{k=1}^{m} N_k    (2)

N_i^mg represents the number of subproblems that will migrate from/to M_i to achieve load balancing. The supervisor process migrates subproblems among master processes as follows:

- If N_i^mg > 0, the supervisor process requests M_i to send back N_i^mg subproblems and receives them.
- If N_i^mg < 0, the supervisor process requests M_i to receive |N_i^mg| subproblems and assigns them.

(3) The supervisor process compares Z_Mi with the current best upper bound stored on itself, or Z. If Z_Mi is less than Z, the supervisor process updates Z to the value of Z_Mi and requests all master processes to update their current best upper bounds with the updated Z.

(4) The supervisor process computes the lowest lower bound of the objective function, L, by comparing the lowest lower bounds computed on the master processes, and examines whether the condition (Z - L)/|L| < ε is satisfied. If it is satisfied, the supervisor process requests all master processes to terminate the computation.

The load balancing policy presented in this section distributes an equal number of subproblems to the master processes. The supervisor process could apply more aggressive strategies, e.g., estimating the cost of executing a task and assigning tasks to master processes in proportion to their computational power. However, the proposed algorithm applies the conservative strategy, because a characteristic of the target application makes it difficult to estimate the cost of a task: the subproblems computed on a worker process are generated and pruned dynamically by branching and bounding during the execution. We performed a preliminary experiment with the load balancing algorithm presented in this section, solving benchmark problems on a test bed where five master processes and 64 worker processes ran on PC clusters connected to a LAN. In the result, no idle time was observed on the master processes; that is, load was balanced well. Nevertheless, the load balancing policy remains a research issue for performance improvement, and further discussion of it is future work.
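A sketch of one supervisor cycle, combining equation (2) with steps (1)-(4), is given below; the master proxy objects and their methods are again hypothetical stand-ins rather than the paper's implementation.

```python
# Supervisor cycle from Section 3.2.3 (illustrative; `masters` is a
# hypothetical list of proxies exposing the queries described above).
def supervisor_step(masters, z, eps):
    # (1) query every master: (n_i, z_mi, lowest_lb) per master.
    status = [m.query() for m in masters]
    counts = [s[0] for s in status]
    avg = sum(counts) / len(masters)
    # (2) equation (2): migrate surpluses toward deficits.
    surplus, deficit = [], []
    for m, n_i in zip(masters, counts):
        n_mg = n_i - avg
        (surplus if n_mg > 0 else deficit).append((m, n_mg))
    moved = [p for m, n_mg in surplus for p in m.steal(int(n_mg))]
    for m, n_mg in deficit:
        m.assign([moved.pop() for _ in range(min(len(moved), int(-n_mg)))])
    # (3) share the global best upper bound Z with all masters.
    z_new = min([z] + [s[1] for s in status])
    if z_new < z:
        z = z_new
        for m in masters:
            m.update_best_ub(z)
    # (4) termination test (Z - L)/|L| < ε.
    lowest = min(s[2] for s in status)
    done = lowest != 0 and (z - lowest) / abs(lowest) < eps
    return z, done
```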

3.3 Implementation on a Grid Test Bed

We implemented the proposed algorithm on a Grid test bed using the Grid RPC middleware Ninf [9, 11]. Ninf provides remote procedure call facilities that are designed to offer a programming interface similar to conventional function calls and that enable users to build Grid-enabled applications. A client computer in Ninf, the Ninf client, can request a remote computer, the Ninf server, to execute computing routines, Ninf libraries, installed on the Ninf server through the Ninf client API, Ninf_call(). In our implementation of the hierarchical master-worker paradigm, a worker process is implemented as a Ninf library, and a master process invokes the Ninf library through Ninf_call() to dispatch a subproblem. A master process is implemented as a set of multiple programs: a Ninf client program that invokes worker processes to perform the parallel branch and bound method, and Ninf libraries that perform load balancing in cooperation with the supervisor process. The supervisor process is implemented as a Ninf client program that invokes the Ninf libraries in a master process through Ninf_call() to perform load balancing.
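Schematically, the dispatch pattern looks as follows; note that rpc() below is a made-up, Python-flavored stand-in for Ninf_call(), whose actual C-level signature and semantics differ.

```python
# Schematic of the Section 3.3 dispatch pattern. NOT the Ninf API:
# `rpc` is a made-up stand-in for Ninf_call(); the real call is a C
# function with a different signature.
def rpc(server, library, *args):
    """Pretend RPC: invoke `library`, installed on `server`, with args."""
    return server[library](*args)

def dispatch_to_worker(server, subproblem, z_m):
    # The worker is installed on the Ninf server as a callable library;
    # the master invokes it remotely and waits for the computed results.
    return rpc(server, "bmi_worker", subproblem, z_m)

# Toy usage: here a "server" is just a dict of installed libraries.
server = {"bmi_worker": lambda sub, z: {"subproblems": [], "best_ub": z}}
print(dispatch_to_worker(server, None, 10.0))
```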
4 Experimental Results

This section presents experimental results of the proposed parallel branch and bound algorithm solving the BMI Eigenvalue Problem with the hierarchical master-worker paradigm. The experiment was performed on a Grid test bed constructed from computing resources installed at the University of California, San Diego (UCSD) and the Tokyo Institute of Technology (TITECH). Table 1 shows the computing resources used on the test bed. The measured ping latency between a computer at UCSD and one at TITECH is 152.7 [ms].

Table 1. Computing resources on the test bed

name  processor, memory, NIC                                OS     location
PC1   Pentium II 400MHz, 256MB, 100Base-T                   Linux  UCSD
PC2   Pentium III 700MHz, 256MB, 100Base-T                  Linux  TITECH
PC3   Pentium 4 1.9GHz, 512MB, 100Base-T                    Linux  TITECH
PCC1  dual Pentium III 1.4GHz, 512MB x 18 nodes, 100Base-T  Linux  TITECH
PCC2  dual Pentium III 1.4GHz, 512MB x 12 nodes, 100Base-T  Linux  TITECH

The benchmark problems for the experiment are the helicopter control problem [8] and a synthetic problem [1]. The helicopter control problem is a real application problem that finds optimal parameters for a controller positioning a helicopter. Its problem size is n_x = 10, n_y = 2, and m = 8, where m denotes the matrix size of F_ij. The synthetic problem is created by generating the elements of the matrices F_ij randomly. Its problem size is larger than that of the helicopter control problem: n_x = 6, n_y = 6, and m = 24. Sequential execution time for the benchmark problems on PC3 is 1087 [sec] and [sec] for the helicopter control problem and the synthetic problem, respectively.

4.1 Comparison of Master-worker Paradigms

First, we compare the performance of the master-worker paradigm and the hierarchical master-worker paradigm on the Grid test bed. Figures 2 and 3 present the execution time to solve the benchmark problems with the master-worker paradigm (mw) and with the hierarchical master-worker paradigm (hmw). In the figures, LAN indicates execution time where the supervisor process, master process, and worker processes run on computers connected to a LAN at TITECH. Here, the master process runs on PC2 and 16 worker processes run on eight nodes (16 CPUs) of PCC1 for mw; the supervisor process runs on PC2, the master process runs on a single node of PCC1, and 16 worker processes run on eight nodes of PCC1 for hmw.

[Figure 2. Execution time of the helicopter control problem on a Grid test bed (mw vs. hmw in the LAN, LAN+ssh, and WAN+ssh settings).]

[Figure 3. Execution time of the synthetic problem on a Grid test bed (mw vs. hmw in the LAN, LAN+ssh, and WAN+ssh settings).]

Table 2. Overhead of a single Ninf_call() [sec], measured in the LAN, LAN+ssh, and WAN+ssh settings for the helicopter control problem and the synthetic problem.

Next, LAN+ssh indicates execution time where all processes run on computers at TITECH, but a master/supervisor process must communicate with a worker/master process through a firewall by ssh tunneling. This represents the situation where users use a PC cluster in their own department but the cluster sits behind its own firewall. The allocation of processes is the same as in LAN, except that PC2 communicates with the other computers through the firewall by ssh tunneling. Finally, WAN+ssh indicates execution time where the master/supervisor process runs on a computer at UCSD and the other processes run on computers at TITECH; all computers at TITECH are installed behind a firewall. This case represents the situation where a user uses computing resources at a remote site. Here, the master/supervisor process runs on PC1, the other processes are allocated as in LAN, and PC1 communicates with computers at TITECH by ssh tunneling.

The results in Figures 2 and 3 show that the master-worker paradigm degrades performance in LAN+ssh and WAN+ssh, while the hierarchical master-worker paradigm sustains almost the same performance in all cases. The performance degradation of mw in WAN+ssh is particularly significant: execution of the helicopter control problem and the synthetic problem did not finish within 10 minutes and within one hour, respectively. The reason for the degradation is high communication overhead, caused by high communication latency on the WAN and by the overhead of ssh tunneling. Table 2 presents the overhead of invoking a single Ninf_call() from a master process to a worker process; it shows high communication overhead in LAN+ssh and WAN+ssh. As described in Section 3.1.1, the execution time of a single task in this application is small: the measured mean execution time of a single task is 0.03 [sec] and 0.52 [sec] for the helicopter control problem and the synthetic problem, respectively. Thus, the communication overhead is large relative to the task execution time; in WAN+ssh, it is significantly higher than the task execution time. As a breakdown of the communication overhead, the high overhead caused by ssh tunneling is visible in the gap between LAN and LAN+ssh in Table 2. Also, the amount of data transferred between a master process and a worker process for the execution of a single task is small in this application. The size of the data transferred from a master process to a worker process is 3829 [Bytes] and [Bytes] for the helicopter control problem and the synthetic problem, respectively, and that from a worker process to a master process is 720 [Bytes] for both. (The amount of data transferred from a worker process to a master process depends on how many subproblems are pruned on the worker process; the value given is the worst case, in which no subproblems are pruned.) This suggests that ssh tunneling causes significant communication overhead even for an application with small transferred data.

4.2 Evaluation of Scalability

Next, this section presents results on the performance scalability of the hierarchical master-worker paradigm. Figures 4 and 5 present the performance scalability of the master-worker paradigm and the hierarchical master-worker paradigm.
In the figures, mw indicates the execution time of the benchmark problems where the master process runs on PC3 and worker processes run on nodes of PCC1 and PCC2.

[Figure 4. Performance scalability for the helicopter control problem: execution time vs. the number of workers for mw, hmw(m=1), hmw(m=2), and hmw(m=3).]

[Figure 5. Performance scalability for the synthetic problem: execution time vs. the number of workers for mw, hmw(m=1), hmw(m=2), and hmw(m=3).]

Also, hmw indicates execution time where the supervisor process runs on PC3 and master/worker processes run on nodes of PCC1 and PCC2. The value of m in parentheses indicates the number of master processes in the hierarchical master-worker paradigm, where worker processes are divided equally among the master processes. All computing resources are installed at TITECH, and the processes communicate without ssh tunneling.

The results in Figures 4 and 5 show that the master-worker paradigm degrades performance at large numbers of worker processes, while the hierarchical master-worker paradigm improves performance. The performance gap between mw and hmw(m=1) for the helicopter control problem is caused by communication overhead: the master process and worker processes running on nodes of PCC1 communicate through a single network switch in hmw(m=1), while in mw they communicate via multiple network switches. The measured ping latency between computing nodes is 0.1 [msec] in the former case and 0.2 [msec] in the latter. The performance of the master-worker paradigm for the helicopter control problem is significantly affected even by this small communication overhead, because the task granularity is small. Performance for the synthetic problem is not affected by the small communication overhead, because its task granularity is sufficiently large.

Regarding the number of master processes in the hierarchical master-worker paradigm, adding a master process improves performance in several cases, e.g., hmw(m=2) and hmw(m=3) for the helicopter control problem, and hmw(m=2) for the synthetic problem on 48 worker processes. This means that adding master processes effectively eliminates the performance bottleneck on a single master process. However, adding a master process does not improve performance at 32 worker processes; that is, adding a master process does not help when a single master process still has enough capacity to control all worker processes. Also, for the synthetic problem on 48 worker processes, hmw(m=3) performs worse than hmw(m=2). The reason is that two master processes are enough to control 48 worker processes for the synthetic problem, because its task granularity is larger than that of the helicopter control problem.

5 Related Work

The master-worker paradigm on a Grid has been discussed in much of the literature. The AppLeS project discusses how to place a master process and worker processes on computing resources [12]. The work presented in [7] discusses how to decide the number of worker processes allocated to a master-worker application and proposes a strategy that adjusts the number of allocated worker processes adaptively during execution. Parallel branch and bound algorithms with the master-worker paradigm are addressed in [6, 10, 13]. MW [6] and Javelin 3 [10] provide software frameworks for implementing applications with the master-worker paradigm, and parallel branch and bound applications have been implemented on these frameworks.
The work presented in [13] gives experimental results for a parallel branch and bound algorithm that solves the knapsack problem using the Grid RPC middleware Ninf. The hierarchical master-worker paradigm is supported in ATLAS [3], Satin [14], and AMWAT [2]. The work presented in [14] compares load balancing algorithms in a hierarchical master-worker setting.

AMWAT provides a software template for implementing a parallel application with the (hierarchical) master-worker paradigm. However, the detailed discussion of the impact of the hierarchical master-worker paradigm on performance presented in this paper has not been reported in these works.

6 Conclusions

This paper proposed a parallel branch and bound algorithm that solves an optimization problem, namely the BMI Eigenvalue Problem, with the hierarchical master-worker paradigm on a distributed computing system, and compared its performance with the conventional master-worker paradigm on a Grid test bed. The results show that computation with the conventional master-worker paradigm is not suitable for efficiently solving an optimization problem with fine-grain tasks in the WAN setting, because the communication overhead is too high compared with the cost of a task. Also, the performance bottleneck on a single master process degrades the performance of the master-worker paradigm even in the LAN setting. The hierarchical master-worker paradigm avoids the performance degradation caused by high communication overhead by keeping the frequent communication between a master process and its worker processes within tightly coupled computing resources. It also eliminates the performance bottleneck on a single master process and improves performance scalability by distributing work among multiple master processes. The hierarchical master-worker paradigm is necessary to achieve satisfactory performance for an application solving an optimization problem with fine-grain tasks, such as the BMI Eigenvalue Problem, in the WAN setting, where multiple PC clusters are connected via a WAN through firewalls. Even in the LAN setting, the hierarchical master-worker paradigm improves performance scalability, although appropriate parameters, e.g., the numbers of master and worker processes, must be chosen to achieve the best performance.

The application evaluated in this paper runs a parallel branch and bound algorithm on a problem with fine-grain tasks. Thus, we believe that the results in this paper apply to other parallel branch and bound applications with fine-grain tasks. For further performance improvement, we need more sophisticated algorithms to set some of the parameters in the proposed algorithm, e.g., an algorithm to choose the number of master/worker processes and a load balancing algorithm among masters. Development of these algorithms to achieve the best performance is our future work.

Acknowledgments

We would like to sincerely thank Dr. Henri Casanova and the Global Scientific Information and Computing Center at the Tokyo Institute of Technology for allowing us to use their computing resources for our experiments. We also thank Prof. Shinji Hara and the staff of the Ninf project for their insightful comments.

References

[1] K. Aida, Y. Futakata, and S. Hara. High-performance parallel and distributed computing for the BMI eigenvalue problem. In Proc. of the 16th IEEE International Parallel and Distributed Processing Symposium.

[2] AppLeS Master Worker Application Template (AMWAT).

[3] J. E. Baldeschwieler, R. D. Blumofe, and E. A. Brewer. ATLAS: An Infrastructure for Global Computing. In Proc. of the 1996 SIGOPS European Workshop.

[4] H. Fujioka and K. Hoshijima. Bounds for the BMI eigenvalue problem. Trans. of the Society of Instrument and Control Engineers, 33(7).

[5] M. Fukuda and M. Kojima. Branch-and-cut algorithms for the bilinear matrix inequality eigenvalue problem. Computational Optimization and Applications, 19(1):79-105.
[6] J. Goux, S. Kulkarni, J. Linderoth, and M. Yoder. An enabling framework for master-worker applications on the computational grid. In Proc. of the 9th IEEE Symposium on High Performance Distributed Computing (HPDC9).

[7] E. Heymann, M. A. Senar, E. Luque, and M. Livny. Adaptive scheduling for master-worker applications on the computational grid. In Proc. of the 1st IEEE/ACM International Workshop on Grid Computing (Grid2000).

[8] L. H. Keel, S. P. Bhattacharyya, and J. W. Howze. Robust control with structured perturbations. IEEE Trans. on Auto. Contr., 33(1):68-78.

[9] S. Matsuoka, H. Nakada, M. Sato, and S. Sekiguchi. Design issues of Network Enabled Server Systems for the Grid. In Grid Computing - GRID 2000, Lecture Notes in Computer Science 1971. Springer-Verlag.

[10] M. O. Neary and P. Cappello. Advanced Eager Scheduling for Java-Based Adaptively Parallel Computing. In Proc. of the 2002 Joint ACM-ISCOPE Conference on Java Grande.

[11] Ninf: A Global Computing Infrastructure.

[12] G. Shao, F. Berman, and R. Wolski. Master/slave computing on the grid. In Proc. of the Heterogeneous Computing Workshop.

[13] Y. Tanaka, M. Sato, M. Hirano, H. Nakada, and S. Sekiguchi. Performance evaluation of a firewall-compliant Globus-based wide-area cluster system. In Proc. of the 9th IEEE Symposium on High-Performance Distributed Computing.

[14] R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal. Efficient Load Balancing for Wide-Area Divide-and-Conquer Applications. In Proc. of the 8th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 34-43.


More information

Energy Conservation In Computational Grids

Energy Conservation In Computational Grids Energy Conservation In Computational Grids Monika Yadav 1 and Sudheer Katta 2 and M. R. Bhujade 3 1 Department of Computer Science and Engineering, IIT Bombay monika@cse.iitb.ac.in 2 Department of Electrical

More information

An Efficient Load-Sharing and Fault-Tolerance Algorithm in Internet-Based Clustering Systems

An Efficient Load-Sharing and Fault-Tolerance Algorithm in Internet-Based Clustering Systems An Efficient Load-Sharing and Fault-Tolerance Algorithm in Internet-Based Clustering Systems In-Bok Choi and Jae-Dong Lee Division of Information and Computer Science, Dankook University, San #8, Hannam-dong,

More information

Technical Brief: Specifying a PC for Mascot

Technical Brief: Specifying a PC for Mascot Technical Brief: Specifying a PC for Mascot Matrix Science 8 Wyndham Place London W1H 1PP United Kingdom Tel: +44 (0)20 7723 2142 Fax: +44 (0)20 7725 9360 info@matrixscience.com http://www.matrixscience.com

More information

Parallelization of Graph Isomorphism using OpenMP

Parallelization of Graph Isomorphism using OpenMP Parallelization of Graph Isomorphism using OpenMP Vijaya Balpande Research Scholar GHRCE, Nagpur Priyadarshini J L College of Engineering, Nagpur ABSTRACT Advancement in computer architecture leads to

More information

PARLGRAN: Parallelism granularity selection for scheduling task chains on dynamically reconfigurable architectures *

PARLGRAN: Parallelism granularity selection for scheduling task chains on dynamically reconfigurable architectures * PARLGRAN: Parallelism granularity selection for scheduling task chains on dynamically reconfigurable architectures * Sudarshan Banerjee, Elaheh Bozorgzadeh, Nikil Dutt Center for Embedded Computer Systems

More information

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme

A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme A Resource Discovery Algorithm in Mobile Grid Computing Based on IP-Paging Scheme Yue Zhang 1 and Yunxia Pei 2 1 Department of Math and Computer Science Center of Network, Henan Police College, Zhengzhou,

More information

On Cluster Resource Allocation for Multiple Parallel Task Graphs

On Cluster Resource Allocation for Multiple Parallel Task Graphs On Cluster Resource Allocation for Multiple Parallel Task Graphs Henri Casanova Frédéric Desprez Frédéric Suter University of Hawai i at Manoa INRIA - LIP - ENS Lyon IN2P3 Computing Center, CNRS / IN2P3

More information

A paralleled algorithm based on multimedia retrieval

A paralleled algorithm based on multimedia retrieval A paralleled algorithm based on multimedia retrieval Changhong Guo Teaching and Researching Department of Basic Course, Jilin Institute of Physical Education, Changchun 130022, Jilin, China Abstract With

More information

QoS-aware resource allocation and load-balancing in enterprise Grids using online simulation

QoS-aware resource allocation and load-balancing in enterprise Grids using online simulation QoS-aware resource allocation and load-balancing in enterprise Grids using online simulation * Universität Karlsruhe (TH) Technical University of Catalonia (UPC) Barcelona Supercomputing Center (BSC) Samuel

More information

Job Re-Packing for Enhancing the Performance of Gang Scheduling

Job Re-Packing for Enhancing the Performance of Gang Scheduling Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT

More information

Predicting the response time of a new task on a Beowulf cluster

Predicting the response time of a new task on a Beowulf cluster Predicting the response time of a new task on a Beowulf cluster Marta Beltrán and Jose L. Bosque ESCET, Universidad Rey Juan Carlos, 28933 Madrid, Spain, mbeltran@escet.urjc.es,jbosque@escet.urjc.es Abstract.

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Applying the Component Paradigm to AUTOSAR Basic Software

Applying the Component Paradigm to AUTOSAR Basic Software Applying the Component Paradigm to AUTOSAR Basic Software Dietmar Schreiner Vienna University of Technology Institute of Computer Languages, Compilers and Languages Group Argentinierstrasse 8/185-1, A-1040

More information

Efficiency Evaluation of the Input/Output System on Computer Clusters

Efficiency Evaluation of the Input/Output System on Computer Clusters Efficiency Evaluation of the Input/Output System on Computer Clusters Sandra Méndez, Dolores Rexachs and Emilio Luque Computer Architecture and Operating System Department (CAOS) Universitat Autònoma de

More information

The Google File System

The Google File System October 13, 2010 Based on: S. Ghemawat, H. Gobioff, and S.-T. Leung: The Google file system, in Proceedings ACM SOSP 2003, Lake George, NY, USA, October 2003. 1 Assumptions Interface Architecture Single

More information

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme

A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme A Resource Discovery Algorithm in Mobile Grid Computing based on IP-paging Scheme Yue Zhang, Yunxia Pei To cite this version: Yue Zhang, Yunxia Pei. A Resource Discovery Algorithm in Mobile Grid Computing

More information

Network Design Considerations for Grid Computing

Network Design Considerations for Grid Computing Network Design Considerations for Grid Computing Engineering Systems How Bandwidth, Latency, and Packet Size Impact Grid Job Performance by Erik Burrows, Engineering Systems Analyst, Principal, Broadcom

More information

Chapter 2 System Models

Chapter 2 System Models CSF661 Distributed Systems 分散式系統 Chapter 2 System Models 吳俊興國立高雄大學資訊工程學系 Chapter 2 System Models 2.1 Introduction 2.2 Physical models 2.3 Architectural models 2.4 Fundamental models 2.5 Summary 2 A physical

More information

Homogenization: A Mechanism for Distributed Processing across a Local Area Network

Homogenization: A Mechanism for Distributed Processing across a Local Area Network Homogenization: A Mechanism for Distributed Processing across a Local Area Network Mahmud Shahriar Hossain Department of Computer Science and Engineering, Shahjalal University of Science and Technology,

More information

2. Modeling AEA 2018/2019. Based on Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice - ch. 2

2. Modeling AEA 2018/2019. Based on Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice - ch. 2 2. Modeling AEA 2018/2019 Based on Algorithm Engineering: Bridging the Gap Between Algorithm Theory and Practice - ch. 2 Content Introduction Modeling phases Modeling Frameworks Graph Based Models Mixed

More information

Computer-System Organization (cont.)

Computer-System Organization (cont.) Computer-System Organization (cont.) Interrupt time line for a single process doing output. Interrupts are an important part of a computer architecture. Each computer design has its own interrupt mechanism,

More information

Advanced School in High Performance and GRID Computing November Introduction to Grid computing.

Advanced School in High Performance and GRID Computing November Introduction to Grid computing. 1967-14 Advanced School in High Performance and GRID Computing 3-14 November 2008 Introduction to Grid computing. TAFFONI Giuliano Osservatorio Astronomico di Trieste/INAF Via G.B. Tiepolo 11 34131 Trieste

More information

A PARALLEL ALGORITHM FOR THE DEFORMATION AND INTERACTION OF STRUCTURES MODELED WITH LAGRANGE MESHES IN AUTODYN-3D

A PARALLEL ALGORITHM FOR THE DEFORMATION AND INTERACTION OF STRUCTURES MODELED WITH LAGRANGE MESHES IN AUTODYN-3D 3 rd International Symposium on Impact Engineering 98, 7-9 December 1998, Singapore A PARALLEL ALGORITHM FOR THE DEFORMATION AND INTERACTION OF STRUCTURES MODELED WITH LAGRANGE MESHES IN AUTODYN-3D M.

More information

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation

Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Achieving Distributed Buffering in Multi-path Routing using Fair Allocation Ali Al-Dhaher, Tricha Anjali Department of Electrical and Computer Engineering Illinois Institute of Technology Chicago, Illinois

More information

A Parallel Macro Partitioning Framework for Solving Mixed Integer Programs

A Parallel Macro Partitioning Framework for Solving Mixed Integer Programs This research is funded by NSF, CMMI and CIEG 0521953: Exploiting Cyberinfrastructure to Solve Real-time Integer Programs A Parallel Macro Partitioning Framework for Solving Mixed Integer Programs Mahdi

More information

Chapter 1: Introduction

Chapter 1: Introduction Chapter 1: Introduction What is an Operating System? Mainframe Systems Desktop Systems Multiprocessor Systems Distributed Systems Clustered System Real -Time Systems Handheld Systems Computing Environments

More information

LECTURE 3:CPU SCHEDULING

LECTURE 3:CPU SCHEDULING LECTURE 3:CPU SCHEDULING 1 Outline Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time CPU Scheduling Operating Systems Examples Algorithm Evaluation 2 Objectives

More information

A Framework for Opportunistic Cluster Computing using JavaSpaces 1

A Framework for Opportunistic Cluster Computing using JavaSpaces 1 A Framework for Opportunistic Cluster Computing using JavaSpaces 1 Jyoti Batheja and Manish Parashar Electrical and Computer Engineering, Rutgers University 94 Brett Road, Piscataway, NJ 08854 {jbatheja,

More information

Web page recommendation using a stochastic process model

Web page recommendation using a stochastic process model Data Mining VII: Data, Text and Web Mining and their Business Applications 233 Web page recommendation using a stochastic process model B. J. Park 1, W. Choi 1 & S. H. Noh 2 1 Computer Science Department,

More information

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING

JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING JULIA ENABLED COMPUTATION OF MOLECULAR LIBRARY COMPLEXITY IN DNA SEQUENCING Larson Hogstrom, Mukarram Tahir, Andres Hasfura Massachusetts Institute of Technology, Cambridge, Massachusetts, USA 18.337/6.338

More information