Hierarchical Scheduling in Parallel and Cluster Systems



SERIES IN COMPUTER SCIENCE

Series Editor: Rami G. Melhem
University of Pittsburgh
Pittsburgh, Pennsylvania

ENGINEERING ELECTRONIC NEGOTIATIONS
A Guide to Electronic Negotiation Technologies for the Design and Implementation of Next-Generation Electronic Markets: Future Silkroads of eCommerce
Michael Strobel

HIERARCHICAL SCHEDULING IN PARALLEL AND CLUSTER SYSTEMS
Sivarama Dandamudi

INTRODUCTION TO PARALLEL PROCESSING
Algorithms and Architectures
Behrooz Parhami

OBJECT-ORIENTED DISCRETE-EVENT SIMULATION WITH JAVA
A Practical Introduction
José M. Garrido

A PARALLEL ALGORITHM SYNTHESIS PROCEDURE FOR HIGH PERFORMANCE COMPUTER ARCHITECTURES
Ian N. Dunn and Gerard G. L. Meyer

PERFORMANCE MODELING OF OPERATING SYSTEMS USING OBJECT-ORIENTED SIMULATION
A Practical Introduction
José M. Garrido

POWER AWARE COMPUTING
Edited by Robert Graybill and Rami Melhem

THE STRUCTURAL THEORY OF PROBABILITY
New Ideas from Computer Science on the Ancient Problem of Probability Interpretation
Paolo Rocchi

Hierarchical Scheduling in Parallel and Cluster Systems

Sivarama Dandamudi
Carleton University
Ottawa, Ontario, Canada

Springer Science+Business Media, LLC

Library of Congress Cataloging-in-Publication Data

Dandamudi, Sivarama P.
Hierarchical scheduling in parallel and cluster systems / Sivarama Dandamudi.
p. cm. - (Series in computer science)
Includes bibliographical references and index.
ISBN    ISBN (ebook)    DOI
1. Parallel processing (Electronic computers). 2. Computer architecture. 3. Electronic data processing - Distributed processing. I. Title. II. Series in computer science (Springer Science+Business Media, LLC)
QA76.58.D '.35-dc

© Springer Science+Business Media New York
Originally published by Kluwer Academic / Plenum Publishers in 2003
Softcover reprint of the hardcover 1st edition

A C.I.P. record for this book is available from the Library of Congress.

All rights reserved

No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Permissions for books published in Europe: permissions@wkap.nl
Permissions for books published in the United States of America: permissions@wkap.com

To my parents, Subba Rao and Prameela Rani,
my wife, Sobha,
and my daughter, Veda

Preface

Multiple processor systems are an important class of parallel systems. Over the years, several architectures have been proposed to build such systems to satisfy the requirements of high-performance computing. These architectures span a wide variety of system types. At the low end of the spectrum, we can build a small, shared-memory parallel system with tens of processors. These systems typically use a bus to interconnect the processors and memory. Such systems, for example, are becoming commonplace in high-performance graphics workstations. They are called uniform memory access (UMA) multiprocessors because they provide all processors with uniform access to memory. These systems provide a single address space, which is preferred by programmers. This architecture, however, cannot be extended even to medium-sized systems with hundreds of processors because of bus bandwidth limitations.

To scale systems to the medium range, i.e., to hundreds of processors, non-bus interconnection networks have been proposed. These systems, for example, use a multistage dynamic interconnection network. Such systems also provide global, shared memory like the UMA systems. However, they introduce local and remote memories, which lead to the non-uniform memory access (NUMA) architecture.

The distributed-memory architecture is used for systems with thousands of processors. These systems differ from the shared-memory architectures in that there is no globally accessible shared memory. Instead, they use message passing to facilitate communication among the processors. As a result, they do not provide a single address space. The architecture of a distributed-memory system is remarkably close to that of a network of workstations, or a workstation cluster. There are some significant differences between the two in the kind of hardware used. For example, distributed-memory systems such as the Cray T3E use a high-bandwidth, low-latency interconnect. However, cluster systems offer a significant cost advantage. As a result, they are becoming increasingly popular for high-performance computing.
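Since message passing is central to distributed-memory and cluster systems, a minimal sketch may help fix the idea. The example below uses MPI, one of the communication libraries surveyed in Chapter 2; it is illustrative only and assumes an MPI installation running at least two processes.

/* Minimal message-passing sketch: process 0 sends an integer to process 1.
 * Compile with mpicc and run with, e.g., mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;   /* no shared memory: data must be sent explicitly */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Even this tiny example shows the essential contrast with the shared-memory model: data moves only through explicit send and receive operations, never through a common address space.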

In this book, we are interested in parallel systems as well as cluster systems. From the hardware point of view, it is relatively straightforward to build large parallel systems with thousands of processors, and such systems are becoming economically viable as well. However, managing system resources in such large systems is very complex. In this book, we look at the job scheduling problem in parallel and cluster systems. Parallel job scheduling has been studied extensively over the last two decades. Initial studies focused on small UMA architectures; more recent interest is in cluster systems. A job scheduling policy that works effectively for small UMA systems might not work for large distributed-memory systems with thousands of processors. Thus, scalability is an important characteristic of a scheduling policy if it is to be used in large distributed-memory systems. In this book we present a hierarchical scheduling policy that scales well with system size. This policy is based on the hierarchical task queue organization we introduced to organize the system run queue.

The book is divided into four parts. Part I consists of the first three chapters. This part gives an introduction to parallel and cluster systems and surveys the parallel job scheduling policies proposed in the literature. Part II, comprising Chapters 4 to 6, gives details about our hierarchical task queue organization and its performance. We demonstrate that this organization scales well, which makes it suitable for systems with hundreds to thousands of processors. In Part III we use this task queue organization as the basis to devise hierarchical scheduling policies for parallel and cluster systems. Chapter 7 gives details on the hierarchical policy for shared-memory systems. The next two chapters describe how the hierarchical policy can be adapted to distributed-memory systems and cluster systems. These three chapters show that the hierarchical policy provides substantial performance advantages over other policies proposed in the literature. Finally, Part IV concludes the book with a brief summary and concluding remarks.
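To give a flavor of the hierarchical task queue organization studied in Part II, the sketch below models it as a tree of task queues: tasks enter at the root, and an idle processor takes work from its leaf queue, which, when empty, refills itself by pulling a batch of tasks (the transfer factor Tr) from its parent. This is an illustrative toy, not the implementation evaluated in the book; task counts stand in for real task lists, each internal node would have B children (the branching factor), and the sketch follows a single root-to-leaf path.

/* Toy hierarchical task queue: a chain of queues from root to leaf.
 * An empty queue refills by transferring up to TR tasks from its parent. */
#include <stdio.h>

#define TR 2                        /* transfer factor: tasks moved per refill */

typedef struct Queue {
    int tasks;                      /* a count stands in for a real task list */
    struct Queue *parent;           /* NULL at the root */
} Queue;

/* Pull up to TR tasks from the parent; cascade the request toward the root. */
static void refill(Queue *q)
{
    if (q->parent == NULL)
        return;                     /* root: nothing above to pull from */
    if (q->parent->tasks == 0)
        refill(q->parent);
    while (q->parent->tasks > 0 && q->tasks < TR) {
        q->parent->tasks--;
        q->tasks++;
    }
}

/* Called by an idle processor on its leaf queue; returns 1 if a task was run. */
static int take_task(Queue *leaf)
{
    if (leaf->tasks == 0)
        refill(leaf);
    if (leaf->tasks == 0)
        return 0;                   /* all queues on this path are empty */
    leaf->tasks--;
    return 1;
}

int main(void)
{
    Queue root = { 8, NULL };       /* a job with 8 tasks arrives at the root */
    Queue mid  = { 0, &root };
    Queue leaf = { 0, &mid };

    while (take_task(&leaf))
        printf("processor runs one task\n");
    return 0;
}

Because tasks move down the tree in batches, most accesses hit a processor's own leaf queue, so contention on the shared root queue grows far more slowly with system size than in a centralized organization; this is the intuition behind the scalability results of Part II.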

Acknowledgments

First and foremost, I would like to thank my wife Sobha and my daughter Veda for enduring my preoccupation with this project during the evenings and weekends. This book draws upon the research we did as part of our parallel scheduling project. Over the past eight years, several students have worked on this project for their theses. I would like to thank the following students for their contributions to some of the results presented in this book: Jemal Abawajy, Terrence Au, Samir Ayachi, Philip Cheng, Thyagaraj Thanalapati, Hai Yu, and Zhengao Zhou.

Thanks are also due to Prof. Rami Melhem of the University of Pittsburgh for inviting me to write this monograph. I also thank Ana Bozicevic, Editor at Kluwer Academic Publishers, for following up the proposal with enthusiasm. My sincere appreciation goes to the School of Computer Science and Carleton University for supporting our parallel scheduling project. I gratefully acknowledge the financial support the project received from the Natural Sciences and Engineering Research Council of Canada.

SIVARAMA DANDAMUDI

Contents

List of Figures
List of Tables

PART I: Background

1. INTRODUCTION
   Why Parallel Processing?
   Parallel Architectures
      SIMD Systems
      MIMD Systems
   Job Scheduling
   Software Architectures
   Overview of the Monograph

2. PARALLEL AND CLUSTER SYSTEMS
   Introduction
   Parallel Architectures
      UMA Systems
      NUMA Systems
      Distributed-Memory Systems
      Distributed Shared Memory
   Example Parallel Systems
      IBM SP2 System
      Stanford DASH System
      ASCI Systems
   Interconnection Networks
      Dynamic Interconnection Networks
      Static Interconnection Networks
   Interprocess Communication
      PVM
      MPI
      TreadMarks
   Cluster Systems
      Beowulf
   Summary

3. PARALLEL JOB SCHEDULING
   Introduction
   Parallel Program Structures
      Fork-and-Join Programs
      Divide-and-Conquer Programs
      Matrix Factorization Programs
   Task Queue Organizations
      Basic Task Queue Organizations
      Improving Centralized Organization
      Improving Distributed Organization
   Scheduling Policies
      Space-Sharing Policies
         Static Policies
         Dynamic Policies
      An Example Space-Sharing Policy
         Adaptive Space-Sharing Policy
         A Modification
         An Improvement
         Performance Comparison
         Performance Comparison
         Handling Heterogeneity
      Time-Sharing Policies
      Hybrid Policies
   Example Policies
      IBM SP2
      ASCI Blue-Pacific
      Portable Batch System
   Summary

PART II: Hierarchical Task Queue Organization

4. HIERARCHICAL TASK QUEUE ORGANIZATION
   Motivation
   Hierarchical Organization
   Workload and System Models
   Performance Analysis
      Queue Access Overhead
      Utilization Analysis
         Centralized Organization
         Distributed Organization
         Hierarchical Organization
      Contention Analysis
         Centralized Organization
         Distributed Organization
         Hierarchical Organization
   Performance Comparison
      Impact of Access Contention
      Effect of Number of Tasks
      Sensitivity to Service Time Variance
      Impact of System Size
      Influence of Branching and Transfer Factors
   Performance of Dynamic Task Removal Policies
   Summary

5. PERFORMANCE OF SCHEDULING POLICIES
   Introduction
   Performance of Job Scheduling Policies
      Policies
      Results
         Performance Sensitivity to System Load
         Sensitivity to Task Service Time Variance
         Sensitivity to Variance in Task Distribution
   Performance of Task Scheduling Policies
      Task Scheduling Policies
      Results and Discussion
         Principal Comparison
         Impact of Variance in Task Service Time
         Impact of Variance in Task Distribution
         Effect of Window Size
         Sensitivity to Other Parameters
   Conclusions

6. PERFORMANCE WITH SYNCHRONIZATION WORKLOADS
   Introduction
   Related Work
   System and Workload Models
   Spinning and Blocking Policies
      Spinning Policy
      Blocking Policies
   Lock Accessing Workload Results
      Workload Model
      Simulation Results
         Principal Comparison
         Sensitivity to Service Time Variance
         Impact of Granularity
         Impact of Queue Access Time
   Barrier Synchronization Workload Results
      Workload Model
      Simulation Results
         Impact of System Load
         Sensitivity to Service Time Variance
         Impact of Granularity
         Impact of Queue Access Time
   Cache Effects
   Summary

PART III: Hierarchical Scheduling Policies

7. SCHEDULING IN SHARED-MEMORY MULTIPROCESSORS
   Introduction
   Space-Sharing and Time-Sharing Policies
      Equipartitioning
      Modified RRJob
   Hierarchical Scheduling Policy
   Performance Evaluation
      System and Workload Models
         System Model
         Workload Model
      Performance Analysis
         Effect of Scheduling Overhead
         Impact of Variance in Service Demand
         Effect of Task Granularity
         Effect of the ERF Factor
         Effect of Quantum Size
         Sensitivity to Other Parameters
   Performance with Lock Accessing Workload
      Lock Accessing Workload
      Results
   Conclusions

8. SCHEDULING IN DISTRIBUTED-MEMORY MULTICOMPUTERS
   Introduction
   Hierarchical Scheduling Policy
   Scheduling Policies for Performance Comparison
      Space Partitioning
      Time-Sharing Policy
   Workload Model
   Performance Comparison
      Performance with Ideal Workload
      Performance with Non-Uniform Workload
      Performance with distribution
      Sensitivity to Variance in Job Service Demand
      Performance under distribution
      Performance under distribution
   Discussion
   Conclusions

9. SCHEDULING IN CLUSTER SYSTEMS
   Introduction
   Hierarchical Scheduling Policy
      Job Placement Policy
      Dynamic Load Balancing Algorithm
   Space-Sharing and Time-Sharing Policies
      Space-Sharing Policy
      Time-Sharing Policy
   Performance Comparison
      Workload Model
      Ideal Workload Results
      Non-Uniform Workload Results
   Summary

PART IV: Epilog

10. CONCLUSIONS
   Summary
   Concluding Remarks

REFERENCES

INDEX

List of Figures

1.1 A SIMD system with N processing elements.
1.2 A shared-memory multiprocessor system with N processors and k memory modules.
1.3 A distributed-memory multicomputer system with N processors and N memory modules.
2.1 UMA shared-memory system architecture.
2.2 NUMA shared-memory system architecture.
2.3 Architecture of a distributed-memory system.
2.4 Distributed shared-memory system.
2.5 The SP2 switch board uses 4 x 4 crossbar switching elements.
2.6 The DASH system organization.
2.7 A high-level view of the ASCI Blue-Pacific system.
2.8 Crossbar network (the small squares represent switches).
2.9 Four possible settings of a 2 x 2 switching box.
2.10 The perfect shuffle.
2.11 A multistage shuffle-exchange network.
2.12 A multistage shuffle-exchange network.
2.13 A ring network.
2.14 A chordal ring network.
2.15 A complete connection network.
2.16 A binary tree network.
2.17 X-tree and hypertree networks.
2.18 Two-dimensional mesh and torus networks.
2.19 Hypercube networks: (a) 1-dimensional hypercube, (b) 2-dimensional hypercube, (c) 3-dimensional hypercube.
2.20 A two-level hierarchical network with four different types of networks.
3.1 The fork-and-join job structure.
3.2 The divide-and-conquer job structure.
3.3 The matrix factorization job structure.
3.4 Two basic task queue organizations: (a) centralized organization, (b) distributed organization.
3.5 Performance of the centralized organization as a function of system utilization.
3.6 Performance of the distributed organization as a function of system utilization.
3.7 Performance sensitivity of the distributed organization to variance in task service times.
3.8 Performance of the four placement strategies as a function of system utilization.
3.9 Impact of service time variance on the performance of the four placement strategies (utilization = 80%).
3.10 Performance sensitivity of the shortest queue and SRT queue policies to the number of probes (utilization = 70%).
3.11 The effect of task size estimation error on the performance of the SRT policy (utilization = 80%). The ESRT queue represents performance of the SRT policy when the task size estimation error is ±30%. For comparison, performance of the shortest and SRT policies is included.
3.12 Relative performance of the AP and MAP policies as a function of system utilization and job structure.
3.13 Performance comparison of the AP and MAP policies as a function of variance in interarrival times for the GE job structure.
3.14 Performance comparison of the AP and MAP policies as a function of variance in service times for the GE job structure.
3.15 Performance sensitivity of the MAP policy to parameter f.
3.16 Impact of the Eager Release policy on the performance of the MAP policy. The y-axis gives the response time improvement over the MAP policy. The Eager Release policy does not have any significant impact on the FJ application.
3.17 Performance sensitivity of the MAP and HAP policies to interarrival time variance.
3.18 Performance sensitivity of the MAP and HAP policies to service time variance.
3.19 Organization of the GangLL scheduler.
4.1 Hierarchical task queue organization for N = 8 processors with a branching factor B.
4.2 Task transfer process in the hierarchical organization for N = 64 processors with a branching factor B = 4 and transfer factor Tr = 1.
4.3 Task transfer process in the hierarchical organization for N = 64 processors with a branching factor B = 4 and transfer factor Tr = 2. Compare this figure with Figure 4.2 to see the impact of increasing the transfer factor from 1 to 2.
4.4 Performance of the three task queue organizations as a function of utilization: (a) centralized organization, (b) distributed and hierarchical organizations.
4.5 Performance of the three task queue organizations as a function of average number of tasks per job for the fixed task size workload: (a) centralized organization, (b) distributed and hierarchical organizations (f = 3%).
4.6 Performance of the distributed and hierarchical task queue organizations as a function of average number of tasks per job for the fixed job size workload.
4.7 Performance sensitivity to the task service time variance (N = 64, T = 64, μ = 1, B = 4, Tr = 1, λ = 0.75, and f = 0%). Note that the lines for the centralized and hierarchical organizations are very close together.
4.8 Performance sensitivity of the distributed and hierarchical organizations to the task service time variance (N = 64, T = 64, μ = 1, B = 4, Tr = 1, and f = 4%).
4.9 Performance sensitivity to the system size when the number of tasks per job is doubled (B = 4, Tr = 1, f = 4%, T = N, μ = 1).
4.10 Performance sensitivity to the system size when the task service time is doubled (B = 4, Tr = 1, f = 4%, T = 64, μ = 64/N).
4.11 Impact of branching factor on the performance of the hierarchical organization (N = 64, T = 64, μ = 1, Tr = 1).
4.12 Impact of transfer factor on the performance of the hierarchical organization (N = 64, T = 64, μ = 1, B = 4).
4.13 Task transfer behavior of Policy 1.
4.14 Task transfer behavior of Policy 2.
4.15 Performance of the two dynamic task transfer policies in the hierarchical organization (N = 64, T = 64, μ = 1, B = 4, f = 2%).
5.1 Performance of the three job scheduling policies as a function of system load.
5.2 Performance sensitivity to service time variance at system utilization of 85%.
5.3 Performance sensitivity to task distribution variance at system utilization of 85%.
5.4 Behavior of the RR1 policy (N = 64, B = 4, Tr = 1, W = 2).
5.5 Behavior of the RR2 policy (N = 64, B = 4, Tr = 1, W = 2).
5.6 Behavior of the RR3 policy (N = 64, B = 4, Tr = 1).
5.7 Performance of task scheduling policies as a function of system load (task service time CV = 1).
5.8 Performance of task scheduling policies as a function of system load (task service time CV = 7).
5.9 Sensitivity of task scheduling policies to the service time variance (system utilization = 85%).
5.10 Performance sensitivity to task distribution variance.
5.11 Performance sensitivity of the round robin policies to the window size.
5.12 Performance sensitivity of the round robin policies to the quantum size.
5.13 Performance sensitivity of the round robin policies to the context switch overhead.
6.1 Generic lock access workload task structure for task Ti.
6.2 Generic barrier synchronization workload task structure for task Ti.
6.3 Performance of the spinning and blocking policies as a function of useful utilization.
6.4 Impact of the lock holding ratio (useful utilization = 70% and Be + BI = 0.25).
6.5 Performance impact of service time variability in the lock accessing model.
6.6 Performance as a function of the number of iterations Maxi in the lock accessing model (useful utilization = 70%).
6.7 Performance sensitivity to queue access time f in the lock accessing model (useful utilization = 70%).
6.8 Performance of the spinning and blocking policies as a function of useful utilization under the barrier synchronization workload.
6.9 Performance impact of service time variability (useful utilization = 50%).
6.10 Performance sensitivity to the maximum number of iterations Maxi (useful utilization = 50%).
6.11 Performance sensitivity to queue access time f.
7.1 Hierarchical task queue organization for N = 8 processors with a branching factor B.
7.2 Example ERF curves as a function of average parallelism (Avg).
7.3 Response time versus utilization for low overhead.
7.4 Response time versus utilization for high overhead.
7.5 Response time versus utilization with service demand CV (CVd).
7.6 Response time versus utilization with service demand CV (CVd).
7.7 Response time versus service demand CV (CVd) at 72% utilization.
7.8 Performance sensitivity to average parallelism (Avg) at 50% utilization.
7.9 Performance sensitivity to average parallelism (Avg) at 75% utilization.
7.10 Sensitivity to the ERF factor at 50% utilization.
7.11 Sensitivity to the ERF factor at 75% utilization.
7.12 Sensitivity of hierarchical and RRJob policies to quantum size at 50% utilization.
7.13 Sensitivity of hierarchical and RRJob policies to quantum size at 75% utilization.
7.14 Response time versus utilization for low overhead.
7.15 Response time versus utilization for high overhead.
7.16 Response time versus utilization for service demand CV.
7.17 Response time versus CVd at 72% utilization.
8.1 Job and task transfer modes in the hierarchical policy (number of processors N = 64 and branching factor B = 4).
8.2 Job and task transfer modes in the hierarchical policy (number of processors N = 32 and branching factor B = 2).
8.3 Algorithm used by the space-sharing policy.
8.4 Performance of the three policies under the ideal workload.
8.5 Performance of the three policies under distribution (service CV = 10).
8.6 Performance of the three policies under distribution (service CV = 1).
8.7 Performance of the three policies under distribution (service CV = 15).
8.8 Performance of the three policies under distribution (service CV = 10).
8.9 Performance of the three policies under distribution (service CV = 10).
9.1 A cluster tree example (SS: system scheduler, CS: cluster scheduler, LS: local scheduler, Wi: workstation i).
9.2 An overview of the job placement policy.
9.3 An overview of the dynamic load balancing algorithm.
9.4 An illustration of the load balancing activity in the hierarchical policy.
9.5 Performance of the three scheduling policies for the ideal workload (dedicated-heterogeneous system).
9.6 Performance of the three scheduling policies for the ideal workload (shared-homogeneous system).
9.7 Performance of the three scheduling policies for the non-uniform workload (dedicated-heterogeneous configuration).
9.8 Performance of the three scheduling policies for the non-uniform workload (shared-homogeneous configuration).

List of Tables

4.1 Average number of queue accesses required to schedule a task in the hierarchical organization (from Eq. 4.1).
6.1 Default parameter values used in the lock accessing workload experiments.
6.2 Default parameter values used in the barrier synchronization workload experiments.
7.1 Default parameter values used in the simulation experiments.
7.2 Additional parameters for the lock accessing workload.
8.1 A summary of work distribution in the four workloads.
9.1 Node types used in the simulation and their ratings.
9.2 Default parameter values used in the experiments.
