Fire Phoenix Cluster Operating System Kernel and its Evaluation

Jianfeng Zhan, Ninghui Sun
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
{jfzhan, snh}@ncic.ac.cn

Abstract

The Fire Phoenix cluster operating system kernel (Phoenix kernel) is a minimum set of cluster core functions with scalability and fault-tolerance support. In this paper, we define the components of a cluster operating system kernel and introduce its internal mechanisms for scalability and fault-tolerance support. On top of the Phoenix kernel, user environments can easily be constructed according to users' needs. In addition, we evaluate the Phoenix kernel from four perspectives: fault-tolerance, scalability, performance impact on scientific computing, and ease of constructing user environments. Our design has been proved in practice on the Dawning 4000A super server, the biggest cluster system for scientific computing in China.

1: Introduction

Although cluster systems have been widely used as platforms for scientific and business computing, developing cluster system software remains a challenge. First, the application range of clusters keeps expanding and user needs are always varying, so cluster system software should provide a flexible component framework to adapt to this situation. Second, since more and more cluster systems are adopted as business computing platforms, such as Web hosting environments and digital libraries, cluster system software should provide high-availability support for business computing, which promises 7x24 service delivery [1][2]. Lastly, cluster system software should have a highly scalable architecture that easily extends to increasing system scale.

To deal with these difficulties and challenges, we need to take a global view and settle on a reasonable architecture that hides system complexity and reduces the risks of developing cluster system software. The layered architecture style [4] is the best practice for this need, as proved by the UNIX operating system [5]: the UNIX kernel shields the bottom-layer hardware, utilities communicate with the kernel through a set of documented interfaces, and applications are constructed upon lower-level utilities. In our opinion, these efforts point out a direction for the research and development of cluster system software. We can define and develop a cluster operating system kernel that provides a stable minimum set of core functions with scalability and fault-tolerance support. On the basis of this cluster operating system kernel, we can construct user environments that are easily adapted and extended according to users' needs.

In July 2002, the Institute of Computing Technology, Chinese Academy of Sciences started developing a cluster operating system named Fire Phoenix (Phoenix for short in the remainder of this paper). It provides facilities such as system monitoring, system administration and job management for the Dawning 4000A super server. The Dawning 4000A [6] super server, composed of 640 nodes, ranked No. 10 in the list of the top 500 supercomputers in 2004.

This paper is structured as follows. Section 2 outlines related work; section 3 gives an overview of the Phoenix cluster operating system; section 4 defines the architecture of the Phoenix cluster operating system kernel; section 5 evaluates the Phoenix operating system kernel from four perspectives, including fault-tolerance, performance impact on scientific computing, scalability, and ease of constructing user environments; finally, section 6 concludes the paper.

2: Related Work

In dealing with the difficulties and challenges mentioned in section 1, the research and development of cluster system software has followed different paths and lacked unified effort. Beowulf system software [7] takes full advantage of open-source cluster software that was developed without unified effort, so packaging different pieces of software is the main work, and it cannot achieve the targets of efficiency, common use, ease of use and interoperability. To improve this situation, many research institutes and open-source groups have begun packaging and integrating cluster system software, such as SCE [8] developed by Kasetsart University in Thailand, SCore [9] from Japan's Real World Computing project, and the OSCAR project [10]. Among them, the typical work is the OSCAR project, which focuses on "best cluster practices", taking the best of what is currently available and integrating it into one package.

Several projects have now begun fully integrated design of cluster system software. The SSS project [11] aims at addressing the lack of software for the effective management and utilization of terascale computational resources and at developing an integrated suite of machine-independent, scalable systems software components, with a focus on scientific computing support. The rationale behind the DOE CCA project [12] is to use component frameworks to deal with the complexity of developing interdisciplinary HPC applications by introducing higher-level abstractions and allowing code reuse. The Galaxy cluster management framework [2] focuses on servicing large-scale enterprise clusters through novel, highly scalable communication and management techniques; it is developed for Microsoft's Windows 2000 operating system and integrates tightly with the naming and directory services offered by the host operating system. The Oceano project [13] is a prototype of a scalable management infrastructure for a large server farm, motivated by large-scale Web hosting environments, which increasingly require support for peak loads that are orders of magnitude larger than the normal steady state. The Chiba City project [14] studies a medium-sized testbed cluster dedicated to computer science research.

In these projects, people develop different cluster system software from scratch according to different users' requirements. For example, the SSS project [11] aims at supporting scientific computing, while Galaxy [2] and Oceano [13] develop cluster system software for business computing. Since the application range of clusters is always expanding, if we do not change this situation, the code base of cluster software will keep growing. In several projects, researchers have resorted to component technology to solve this problem [11][12]. Though component technology supports code reuse and component substitution, how to adapt to varying user needs and reduce the risk of developing cluster system software is still a problem.

Phoenix takes a different approach to this challenge. First, we develop the Phoenix kernel, which maintains a stable minimum set of core functions with scalability and fault-tolerance support. Then, on the basis of the Phoenix kernel, we develop different user environments according to user needs.

3: Overview of the Phoenix Cluster OS

Figure 1 describes the layered architecture of the Phoenix OS.
The lowest layer is the heterogeneous resource layer, which shields heterogeneous hardware architectures, host operating systems and communication protocols with middleware. The second layer is the Phoenix cluster operating system kernel, which defines the minimum set of core functions with scalability and fault-tolerance support. The top layer is the user environment, through which users utilize cluster resources to fulfill their goals. In the Phoenix system, we define four user roles: system constructor, system administrator, scientific computing user, and business computing user. For these users, Phoenix provides different user environments:

* The system constructor configures, deploys and boots the cluster system with the system construction tool, which plays a role similar to the BIOS and the kernel booting module of a host operating system.

* System management and monitoring tools assist system administrators in daily system management, real-time system monitoring, performance analysis and fault analysis.

* The job management system is the user environment for scientific computing users; it manages cluster resources, and through it users submit their jobs and complete their computing tasks.

* The business application runtime environment is the core of the business application hosting environment. It manages multi-tier business applications and guarantees their high availability and load balancing.

Figure 1: Architecture of Phoenix.

4: Phoenix Cluster Operating System Kernel

4.1: Rationale behind the Phoenix Kernel

When developing the cluster operating system kernel, we follow three principles:

(1) Users interact with the user environment, and the Phoenix kernel is invisible to them, which decreases the cognitive overhead and management cost for daily users.

(2) By maintaining a stable minimum set of core functions at the cluster operating system kernel level, we can easily construct, adapt and extend user environments on the basis of the Phoenix kernel according to users' needs, which controls the complexity and reduces the risk of developing a cluster operating system.

(3) The Phoenix kernel provides scalable and fault-tolerant support, so the difficulty of developing user environments is decreased.

4.2: Component Framework of the Phoenix Kernel

The Phoenix kernel is the minimum set of core functions of the Phoenix cluster OS and the basic building block for user environments. It provides documented interfaces and parallel command calls for user environments in different forms with uniform semantics (such as Socket, RPC and ORB). The layered architecture of the Phoenix kernel is shown in Figure 2.

Figure 2: Phoenix kernel stack (parallel command and application programming interface; event, checkpoint, data bulletin and group services; detector services for physical resource, node, network and application state; parallel process management).

* Configuration service. It provides cluster-wide configuration information, including information on physical resources, the Phoenix kernel and user environments. The configuration service has a self-introspection mechanism to automatically find and diagnose cluster resources, and provides a documented interface for dynamic reconfiguration.

* Security service. It provides authorization, authentication and encryption functions for users.

* Parallel process management service. It performs efficient remote job loading, deleting and resource cleanup, and is a basic module of the Phoenix kernel.

* Detector services. These include the physical resource detector, application state detector, node state detector and network state detector. The physical resource detector monitors the usage of physical resources such as CPU, memory, swap, disk I/O and network I/O of each node, which is fundamental for the job management system's schedulers. The application state detector monitors application status, such as the physical resources consumed by a specific application, the application's liveness, and the application information related to service level agreements, which is fundamental for the business application runtime environment. The node state detector and network state detector monitor the status of nodes and network connectivity.

* Group service. The group service is the kernel component that addresses scalability and high availability at the same time. Its key functions are guaranteeing the high availability of its meta-group, providing interfaces for creating, joining and leaving upper-layer service groups, and guaranteeing the high availability of upper-layer service groups.

* Checkpoint service. Based on the group service, it provides interfaces for upper-layer services to save system data, which means that upper-layer services themselves are responsible for saving and deleting system state by calling the checkpoint service interfaces.

* Event service. Based on the group service, the event service plays the role of the communication channel of the Phoenix kernel. It provides interfaces for registering an event supplier together with the event types it produces, and for registering an event consumer together with the event types it is interested in; besides these interfaces, the event service also provides event filtering and real-time notification (a usage sketch follows this list).

* Data bulletin service. Based on the group service, the data bulletin service is an in-memory database that stores the cluster-wide physical resource state and application state; it provides interfaces for non-persistent data storage and data query.
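The event service interfaces listed above can be made concrete with a short sketch. This is a minimal, self-contained illustration rather than the actual Phoenix kernel API: the class name EventService, its method names and the event type strings are assumptions for illustration, and a real deployment would expose these calls remotely over Socket, RPC or ORB rather than in-process.

```python
"""Minimal sketch of the event-service interfaces described in Section 4.2.
All names (EventService, register_supplier, ...) are illustrative, not the
real Phoenix kernel API; a real deployment would expose these calls remotely
(Socket, RPC or ORB) rather than in-process."""

from collections import defaultdict


class EventService:
    def __init__(self):
        self._suppliers = {}                 # supplier name -> event types it produces
        self._consumers = defaultdict(list)  # event type -> callbacks interested in it

    def register_supplier(self, name, event_types):
        """Register an event supplier and the event types it produces."""
        self._suppliers[name] = set(event_types)

    def register_consumer(self, event_types, callback):
        """Register a consumer for the event types it is interested in."""
        for etype in event_types:
            self._consumers[etype].append(callback)

    def push(self, supplier, event_type, payload):
        """Filter the event by type and notify the interested consumers."""
        if event_type not in self._suppliers.get(supplier, ()):
            raise ValueError("supplier did not declare this event type")
        for callback in self._consumers[event_type]:
            callback(event_type, payload)


if __name__ == "__main__":
    es = EventService()
    # A GSD acts as a supplier of node failure and recovery events.
    es.register_supplier("gsd-partition-3", ["NODE_FAILURE", "NODE_RECOVERY"])
    # A job-management system (e.g. PWS) subscribes only to node failures.
    es.register_consumer(["NODE_FAILURE"],
                         lambda t, p: print("scheduler notified:", t, p))
    es.push("gsd-partition-3", "NODE_FAILURE", {"node": "cn042"})
```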

4.3: Management Framework of the Phoenix Kernel

Scalability and fault-tolerance issues emerge when large-scale cluster systems are used as production platforms. Designers have used several different structures to deal with high-availability issues, such as the master-slave structure [1][15] and the group structure [2][13]; the group structure comes in two forms, peer-to-peer and leader/member. The master-slave structure is only suitable for small cluster systems because of its scalability limit. In a group structure, group members need to monitor each other's state and maintain consistent state information among themselves. Several projects adopt a group membership protocol in their cluster management frameworks, for example the Galaxy [2] and GulfStream [13] projects. But when the scale of a cluster system reaches a thousand nodes, it is unacceptable for all nodes to join a group managed by a group membership protocol, so we improve the group structure.

In the Phoenix system, the whole cluster is divided into several cluster partitions, each composed of one server node, at least one backup server node, and other computing nodes. To achieve scalability, each partition chooses one server node as its representative to form a group, as shown in Figure 3. Across partitions, the principal part of group management is the GSD (Group Service Daemon); a GSD takes charge of one partition. The group service daemons form a meta-group managed by a membership protocol. The GSD meta-group takes a ring structure. In case of failure of the Leader, the other members of the meta-group select the Princess to take over; if the Princess fails, the member next to the Princess takes over.
More generally, if any member fails, the member next to it in the ring takes over. Within a partition, the daemons responsible for sending heartbeats are the watch daemons (WD), which reside on every node. A WD sends heartbeats to the GSD periodically through all network interfaces of its node. By receiving and analyzing the heartbeats from the WDs, the GSD can monitor the status of the nodes and networks in its partition. Acting as an event supplier, the GSD calls the event service interface to push node and network failure and recovery events to the interested event consumers.
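The heartbeat analysis just described can be sketched in a few lines. The sketch below is illustrative only, not the actual GSD implementation; the 30-second interval matches the setting used in Section 5.1, and the data structures and names are assumptions. The idea it demonstrates is that silence on every interface indicates a node failure, while silence on only some interfaces indicates a network failure.

```python
"""Illustrative sketch (not the actual GSD code) of heartbeat analysis:
a WD on every node sends heartbeats over all network interfaces, and the
GSD diagnoses a node failure when every interface goes silent, but only a
network failure when a single interface does."""

import time

HEARTBEAT_INTERVAL = 30.0  # seconds, a configurable system parameter


class PartitionMonitor:
    def __init__(self, nodes, interfaces):
        self.interfaces = interfaces
        # last heartbeat time per (node, interface)
        self.last_seen = {(n, i): time.time() for n in nodes for i in interfaces}

    def on_heartbeat(self, node, interface, when=None):
        self.last_seen[(node, interface)] = when if when is not None else time.time()

    def diagnose(self, node, now=None):
        """Return 'ok', a network-failure diagnosis or 'node failure' for one node."""
        now = now if now is not None else time.time()
        silent = [i for i in self.interfaces
                  if now - self.last_seen[(node, i)] > HEARTBEAT_INTERVAL]
        if not silent:
            return "ok"
        if len(silent) < len(self.interfaces):
            return "network failure on " + ", ".join(silent)
        return "node failure"


if __name__ == "__main__":
    mon = PartitionMonitor(["cn001"], ["eth0", "eth1", "myrinet0"])
    t0 = time.time()
    # Simulate one interface going silent while the others keep heartbeating.
    mon.on_heartbeat("cn001", "eth1", t0 + 31)
    mon.on_heartbeat("cn001", "myrinet0", t0 + 31)
    print(mon.diagnose("cn001", now=t0 + 35))   # -> network failure on eth0
```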

Figure 3: Meta-group structure with five members.

4.4: Service Federation

The checkpoint service, data bulletin service and event service call the group service interface to create their service groups and to register policies for how to deal with failures. Taking the event service as an example, the group service takes charge of monitoring the event service group, as shown in Figure 4. If one member of the event service group fails, the GSD on the same host notifies all members of the GSD group and then restarts the failed service; the recovered event service daemon retrieves its state data from the checkpoint service. If the node on which an event service daemon runs fails, the GSD member next to it in the ring structure selects a new node, migrates the GSD there and then recovers the event service, so the restarted event service daemon again retrieves its state data from the checkpoint service.

Figure 4: Event service group based on GSD.

A service federation is a group of service entities that provides a single service access point through mutual internal connection and coordination. In the Phoenix kernel, the data bulletin, event and checkpoint service groups each form their own federation. Figure 5 shows the structure of the data bulletin federation, which takes the form of a complete graph. There is one instance of the data bulletin (DB) service in each partition, and the detector services on each node export the physical resource state and application state to the data bulletin service of the same partition. A user can query any data bulletin service to obtain cluster-wide information, so from the user's point of view the data bulletin federation has a single access point. For each kernel service group there is only one access point from the user's point of view, which simplifies the development of user environments. Service federation also improves system reliability: if one data bulletin service fails, only the state of one partition cannot be obtained, and with the support of the GSD, the failed data bulletin service is restarted and returns to work in a short period of time.

Figure 5: Data bulletin service federation.

In the whole system, there is one instance of the configuration service and one instance of the security service; each partition runs one instance of each of the other kernel services on its server node; and only the detector services and the parallel process management service run on each computing node.
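A small sketch can illustrate how a complete-graph federation gives a single access point. It is an assumption-laden toy: the DataBulletin class, its method names and the in-process peer links stand in for what would really be per-partition daemons connected over the network, and none of this is the actual Phoenix code.

```python
"""Illustrative sketch of the data bulletin federation in Section 4.4: one
in-memory data bulletin (DB) instance per partition, fully connected to its
peers, so a query to any single instance yields cluster-wide information.
Class and method names are hypothetical."""


class DataBulletin:
    def __init__(self, partition):
        self.partition = partition
        self.peers = []        # the other DB instances in the federation
        self.store = {}        # node -> latest resource/application state

    def connect(self, others):
        self.peers = [db for db in others if db is not self]

    def export_state(self, node, state):
        """Called by the detector services of this partition."""
        self.store[node] = state

    def query(self):
        """Single access point: merge local state with every reachable peer.
        If one DB instance has failed, only its partition's state is missing."""
        merged = dict(self.store)
        for peer in self.peers:
            merged.update(peer.store)
        return merged


if __name__ == "__main__":
    dbs = [DataBulletin(p) for p in range(3)]
    for db in dbs:
        db.connect(dbs)
    dbs[0].export_state("cn001", {"cpu": 0.42, "mem": 0.31})
    dbs[2].export_state("cn097", {"cpu": 0.88, "mem": 0.67})
    # A user (e.g. GridView) can ask any one instance for cluster-wide data.
    print(dbs[1].query())
```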

5: Evaluation of the Phoenix Kernel

5.1: Evaluation of Fault-Tolerance

The main performance criteria for evaluating fault-tolerant system software are detection overhead and recovery overhead. The testbed is 136 nodes of Dawning 4000A, with 16 computing nodes and 1 server node per partition, so it is divided into 8 partitions. The heartbeat interval can be configured as a system parameter; 30 seconds is used for the tests.

As shown in Figure 3, the group service daemon (GSD) detects failures and recoveries of nodes and networks by receiving and analyzing the heartbeats sent by the watch daemons (WD) within the same partition. The group service daemons monitor each other to detect failures and recover their meta-group, and the event service (ES) acts as the communication channel for sending and receiving failure and recovery events. Since the WD, GSD and ES are the most important components for fault-tolerance, we measure the fault detecting time, fault diagnosing time and recovery time for the WD, GSD and ES in three unhealthy situations. For the WD, for example, these three situations are failure of the WD process, failure of the node on which the WD runs, and failure of one network interface. By means of fault injection, we obtained the data in Tables 1-3.

Table 1: Three unhealthy situations for the WD.
Fault reason      Detecting time   Diagnosing time   Recovery time   Total time
Process failure   30s              0.29s             0.1s            30.39s
Node failure      30s              2s                0s              32s
Network failure   30s              348us             0s              30s

Table 2: Three unhealthy situations for the GSD.
Fault reason      Detecting time   Diagnosing time   Recovery time   Total time
Process failure   30s              0.29s             2.03s           32.32s
Node failure      30s              0.3s              2.95s           33.25s
Network failure   30s              348us             0s              30s

Table 3: Three unhealthy situations for the ES.
Fault reason      Detecting time   Diagnosing time   Recovery time   Total time
Process failure   30s              12us              0.12s           30.12s
Node failure      30s              0.3s              2.95s           33.25s
Network failure   30s              12us              0s              30s

From the data in Tables 1-3, we can conclude that the sum of detecting time, diagnosing time and recovery time is almost equal to the heartbeat interval, and this interval is a configurable system parameter. This shows that the Phoenix kernel performs well in supporting fault-tolerance. In Tables 1-3 the recovery time for network failures is 0 because each node has three networks, so the failure of only one network is not fatal. For the WD, the recovery time in case of node failure is also 0, because each WD is the representative of its hosting node for sending heartbeats and migrating a WD would be meaningless, whereas a GSD or ES can be migrated to another node when its node fails.
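The observation that the total failover time is dominated by the configurable heartbeat interval can be checked directly against the measured rows; the short computation below simply reproduces that arithmetic using the GSD and ES rows from Tables 2 and 3 (times in seconds, nothing new is introduced).

```python
# Total failover time = detection + diagnosis + recovery; with a 30 s
# heartbeat interval, the detection term dominates.  Values are the GSD and
# ES rows of Tables 2 and 3, in seconds.
rows = {
    "GSD process failure": (30.0, 0.29, 2.03),
    "GSD node failure":    (30.0, 0.30, 2.95),
    "ES process failure":  (30.0, 12e-6, 0.12),
}
for name, (detect, diagnose, recover) in rows.items():
    total = detect + diagnose + recover
    print(f"{name}: total {total:.2f}s, of which {100 * detect / total:.0f}% "
          "is the heartbeat-interval-bound detection time")
```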

5.2: Performance Impact of the Phoenix Kernel on Scientific Computing

For a fault-tolerant software system, fault tolerance usually costs some performance. We measured the performance impact of the Phoenix kernel on scientific computing with the Linpack benchmark under 4, 16, 64 and 128 CPUs. The results are shown in Table 4; it is worth pointing out that the data without Phoenix running was obtained with system optimization by the High Performance Computing Lab of the Institute of Computing Technology, Chinese Academy of Sciences, while the data with Phoenix running was obtained by our own tests without that optimization work. From Table 4, it can be concluded that the Phoenix kernel has little impact on scientific computing.

Table 4: Phoenix's impact on Linpack benchmark performance (Linpack efficiency with and without Phoenix running, for 4, 16, 64 and 128 CPUs).

5.3: Evaluating Scalability

Taking the monitoring system of Dawning 4000A [6] as an example, this section demonstrates the high scalability of the Phoenix kernel. The monitoring system of Dawning 4000A involves five Phoenix kernel services (configuration service, detector services, group service, data bulletin service and event service) together with the GridView module [16], which is in charge of graphical display and data analysis. GridView interacts with the Phoenix kernel only through the interfaces of the data bulletin service, the event service and the configuration service. GridView registers the event types it is interested in with the event service, including node and network failures, and receives real-time notifications of these events. It collects cluster-wide performance data by calling the single interface of the data bulletin service federation, and visually displays cluster-wide resource usage at a configurable refresh rate. Figure 6 is a snapshot of Dawning 4000A's monitoring system under common load, displaying cluster-wide average memory, CPU and swap usage (the average swap usage was 0.72 percent). As shown in this figure, the system includes 640 nodes, which demonstrates the high scalability of the Phoenix kernel, since the GridView system is constructed on top of it.

Figure 6: System monitoring based on the Phoenix kernel (GridView view of cluster-wide CPU, memory and swap usage).
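To make the interaction described in Section 5.3 concrete, the sketch below shows how a GridView-style monitor could sit on exactly the two kernel paths mentioned above: event registration for real-time failure/recovery notification, and periodic pulls from the data bulletin federation at a refresh rate. It reuses the illustrative EventService and DataBulletin classes from the earlier sketches and is in no way the actual GridView implementation; it can be exercised as, for example, run_monitor(es, dbs[0]) with the objects built in those sketches.

```python
"""Sketch of a GridView-style monitor built only on the kernel interfaces:
event registration for real-time notification plus periodic pulls from the
data bulletin federation.  Uses the illustrative EventService/DataBulletin
classes from the earlier sketches; not the actual GridView code."""

import time


def run_monitor(event_service, data_bulletin, refresh_rate=5.0, cycles=3):
    # Real-time path: be notified of node/network failures and recoveries.
    event_service.register_consumer(
        ["NODE_FAILURE", "NODE_RECOVERY", "NETWORK_FAILURE"],
        lambda etype, payload: print("ALERT:", etype, payload))

    # Periodic path: pull cluster-wide resource usage through the single
    # access point of the data bulletin federation and display an average.
    for _ in range(cycles):
        snapshot = data_bulletin.query()
        if snapshot:
            avg_cpu = sum(s["cpu"] for s in snapshot.values()) / len(snapshot)
            print(f"{len(snapshot)} nodes, average CPU usage {avg_cpu:.0%}")
        time.sleep(refresh_rate)
```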

5.4: Constructing User Environments on the Phoenix Kernel

Following Figure 1, we build different user environments on the basis of the Phoenix kernel. In this section we discuss how to construct the Phoenix-PWS job management system user environment (Partitioned Workload Solution, PWS for short). PWS is a job management system based on the Phoenix kernel, improved on the basis of PBS (Portable Batch System) [17]. PWS supports multiple pools with customized scheduling policies for different pools and dynamic leasing among pools. As shown in Figure 7, the main modules of PBS include the user interface, scheduling, resource monitoring, configuration, and parallel process management. Figure 8 shows PWS built on the Phoenix kernel.

Figure 7: Main components of PBS.

Figure 8: Main components of PWS based on the Phoenix kernel (user interface and scheduling modules on top of the kernel's event, data bulletin, configuration, group, detector and parallel process management services).

Compared with PBS, PWS has several desirable properties:

(1) The Phoenix kernel provides most of the functions of PBS, so the development of the new PWS system focuses only on the user interface and scheduling modules.

(2) The scalability of PWS is improved on the basis of group management and service federation. The physical resource detectors export physical resource information to the data bulletin federation, from which PWS obtains cluster-wide resource information directly, so the system workload is reduced. By registering node, network and application events with the event service, PWS gets real-time notification of those events, while PBS needs to poll continually and consumes network bandwidth.

(3) The fault-tolerance of PWS is guaranteed on the basis of group management and service federation. If one data bulletin instance fails, only the state of one partition cannot be obtained; with the support of the GSD, the failed data bulletin is restarted and returns to work in a short period of time. The scheduling service group for the different pools is created on the basis of the group service with high availability guaranteed, while PBS does not guarantee this.

(4) PWS supports multiple pools and dynamic leasing among pools (sketched below).

Figure 9 shows a snapshot of the integrated Web GUI for the PWS job management system. Our practice proves the ease of constructing user environments on the basis of the Phoenix kernel.

Figure 9: Integrated Web GUI for Phoenix-PWS: start/shutdown nodes.
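How multiple pools and dynamic leasing (point (4) above) might fit together can be pictured with a small sketch. Everything here (the Pool class, the per-pool policy field and the rule of borrowing idle nodes from other pools) is an assumption for illustration only; the paper does not describe the actual PWS pool and leasing design at this level of detail.

```python
"""Purely illustrative sketch of multi-pool scheduling with dynamic leasing,
the PWS feature mentioned in point (4) above.  Pool layout, policy names and
the leasing rule are assumptions, not the real PWS design."""


class Pool:
    def __init__(self, name, nodes, policy="fifo"):
        self.name = name
        self.free_nodes = list(nodes)   # idle nodes owned by this pool
        self.policy = policy            # per-pool scheduling policy
        self.leased_out = []            # nodes temporarily lent to other pools

    def allocate(self, job_size, pools):
        """Take idle nodes from this pool first, then lease from other pools.
        (A real scheduler would roll back the allocation if it cannot be met.)"""
        grant = [self.free_nodes.pop()
                 for _ in range(min(job_size, len(self.free_nodes)))]
        for other in pools:
            if len(grant) == job_size:
                break
            if other is self:
                continue
            while other.free_nodes and len(grant) < job_size:
                node = other.free_nodes.pop()
                other.leased_out.append(node)
                grant.append(node)
        return grant if len(grant) == job_size else None


if __name__ == "__main__":
    pools = [Pool("chemistry", ["cn%03d" % i for i in range(4)]),
             Pool("physics", ["cn%03d" % i for i in range(4, 10)], policy="backfill")]
    job_nodes = pools[0].allocate(6, pools)   # needs 2 nodes leased from "physics"
    print(job_nodes)
```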

6: Conclusion

Maintaining a stable minimum set of core functions at the cluster operating system kernel level makes Phoenix different from other systems. Based on the Phoenix kernel, we have constructed more complete user environments with ordinary development effort than other projects, including the system construction tool, system management and monitoring tools, the job management system, and the business application runtime environment. Compared with Phoenix, the SSS project [11] focuses only on HPC system software support, Ganglia [18] builds a cluster monitoring system, Rocks [19] provides a cluster building tool, and the Galaxy [2] and Oceano [13] projects take into account the requirements of business applications. In addition, scalable and fault-tolerant support is embedded in the Phoenix kernel, which decreases the difficulty of developing user environments. Though many projects propose high-availability solutions [1][2][13][15], what makes Phoenix different is that it provides a unified scalability and fault-tolerance supporting framework composed of three mechanisms: improved group management, service federation and a single service access point. The development of GridView [16] and Phoenix-PWS proves the correctness of this design decision.

ACKNOWLEDGMENTS

We wish to thank all the members of the Phoenix team at the Institute of Computing Technology, Chinese Academy of Sciences.

REFERENCES

[1] Richard Rabbat, Tom McNeal, Tim Burke, A High-Availability Clustering Architecture with Data Integrity Guarantees, Proceedings of the 2001 IEEE International Conference on Cluster Computing, Newport Beach, CA.
[2] Werner Vogels, Dan Dumitriu, An Overview of the Galaxy Management Framework for Scalable Enterprise Cluster Computing, Proceedings of the 2000 IEEE International Conference on Cluster Computing.
[3] Monika Henzinger, Indexing the Web - A Challenge for Supercomputers, International Supercomputer Conference, Heidelberg, June 19, 2002.
[4] Mary Shaw, David Garlan, Software Architecture: Perspectives on an Emerging Discipline, Prentice Hall, 1996.
[5] Maurice J. Bach, The Design of the UNIX Operating System, Prentice-Hall, 1986.
[6] Dawning 4000A, sublist/System.php?id=7036, 2004.
[7] T. Sterling, D. J. Becker, D. Savarese, J. Dorband, U. A. Ranawake, and C. E. Packer, Beowulf: A Parallel Workstation for Scientific Computation, Proceedings of the International Conference on Parallel Processing, 1995.
[8] P. Uthayopas, S. Phatanapherom, T. Angskun, S. Sriprayoonsakul, SCE: A Fully Integrated Software Tool for Beowulf Cluster System, Proceedings of Linux Clusters: The HPC Revolution, National Center for Supercomputing Applications (NCSA), University of Illinois, Urbana, Illinois, June 25-27.
[9] Atsushi Hori, SCore: An Integrated Cluster System Software Package for High Performance Cluster Computing, Proceedings of the 2000 IEEE International Conference on Cluster Computing.
[10] Stephen L. Scott, OSCAR and the Beowulf Arms Race for the "Cluster Standard", Proceedings of the 2001 IEEE International Conference on Cluster Computing.
[11] Ralph Butler, Narayan Desai, Andrew Lusk, Ewing Lusk, The Process Management Component of a Scalable Systems Software Environment, Proceedings of the IEEE Cluster 2003 Conference, Hong Kong.
[12] Rob Armstrong, Dennis Gannon, Al Geist, Katarzyna Keahey, Scott Kohn, Lois McInnes, Steve Parker, and Brent Smolinski, Toward a Common Component Architecture for High-Performance Scientific Computing, http://www-unix.mcs.anl.gov/~curfman/web/cca_paper.html.
[13] Sameh A. Fakhouri, German Goldszmidt, Michael Kalantar, John A. Pershing, GulfStream - a System for Dynamic Topology Management in Multi-domain Server Farms, Proceedings of the 2001 IEEE International Conference on Cluster Computing, Newport Beach, CA.
[14] Remy Evard, Narayan Desai, John-Paul Navarro, and Daniel Nurmi, Clusters as Large-Scale Development Facilities, Proceedings of the 2002 IEEE International Conference on Cluster Computing.
[15] Chokchai Leangsuksun and Ibrahim Haddad, Building Highly Available HPC Clusters with HA-OSCAR, Proceedings of the 2004 IEEE International Conference on Cluster Computing.
[16] Ni Guangbao, Ma Jie, Li Bo, GridView: A Dynamic and Visual Grid Monitoring System, Proceedings of the 7th International Conference on High Performance Computing and Grid in Asia Pacific Region, Omiya Sonic City, Tokyo Area, Japan, July 20-22, 2004.
[17] Portable Batch System (PBS).
[18] Federico D. Sacerdoti, Mason J. Katz, Matthew L. Massie, David E. Culler, Wide Area Cluster Monitoring with Ganglia, Proceedings of the IEEE Cluster 2003 Conference, Hong Kong.
[19] P. Papadopoulos, M. Katz, and G. Bruno, NPACI Rocks: Tools and Techniques for Easily Deploying Manageable Linux Clusters, Proceedings of the 2001 IEEE International Conference on Cluster Computing, Newport Beach, CA.


Introduction to Virtualization. From NDG In partnership with VMware IT Academy Introduction to Virtualization From NDG In partnership with VMware IT Academy www.vmware.com/go/academy Why learn virtualization? Modern computing is more efficient due to virtualization Virtualization

More information

COMP6511A: Large-Scale Distributed Systems. Windows Azure. Lin Gu. Hong Kong University of Science and Technology Spring, 2014

COMP6511A: Large-Scale Distributed Systems. Windows Azure. Lin Gu. Hong Kong University of Science and Technology Spring, 2014 COMP6511A: Large-Scale Distributed Systems Windows Azure Lin Gu Hong Kong University of Science and Technology Spring, 2014 Cloud Systems Infrastructure as a (IaaS): basic compute and storage resources

More information

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT

ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT ADAPTIVE AND DYNAMIC LOAD BALANCING METHODOLOGIES FOR DISTRIBUTED ENVIRONMENT PhD Summary DOCTORATE OF PHILOSOPHY IN COMPUTER SCIENCE & ENGINEERING By Sandip Kumar Goyal (09-PhD-052) Under the Supervision

More information

Gustavo Alonso, ETH Zürich. Web services: Concepts, Architectures and Applications - Chapter 1 2

Gustavo Alonso, ETH Zürich. Web services: Concepts, Architectures and Applications - Chapter 1 2 Chapter 1: Distributed Information Systems Gustavo Alonso Computer Science Department Swiss Federal Institute of Technology (ETHZ) alonso@inf.ethz.ch http://www.iks.inf.ethz.ch/ Contents - Chapter 1 Design

More information

Large Scale Sky Computing Applications with Nimbus

Large Scale Sky Computing Applications with Nimbus Large Scale Sky Computing Applications with Nimbus Pierre Riteau Université de Rennes 1, IRISA INRIA Rennes Bretagne Atlantique Rennes, France Pierre.Riteau@irisa.fr INTRODUCTION TO SKY COMPUTING IaaS

More information

BUILDING A SCALABLE MOBILE GAME BACKEND IN ELIXIR. Petri Kero CTO / Ministry of Games

BUILDING A SCALABLE MOBILE GAME BACKEND IN ELIXIR. Petri Kero CTO / Ministry of Games BUILDING A SCALABLE MOBILE GAME BACKEND IN ELIXIR Petri Kero CTO / Ministry of Games MOBILE GAME BACKEND CHALLENGES Lots of concurrent users Complex interactions between players Persistent world with frequent

More information

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol

A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol A Chord-Based Novel Mobile Peer-to-Peer File Sharing Protocol Min Li 1, Enhong Chen 1, and Phillip C-y Sheu 2 1 Department of Computer Science and Technology, University of Science and Technology of China,

More information

A Distance Learning Tool for Teaching Parallel Computing 1

A Distance Learning Tool for Teaching Parallel Computing 1 A Distance Learning Tool for Teaching Parallel Computing 1 RAFAEL TIMÓTEO DE SOUSA JR., ALEXANDRE DE ARAÚJO MARTINS, GUSTAVO LUCHINE ISHIHARA, RICARDO STACIARINI PUTTINI, ROBSON DE OLIVEIRA ALBUQUERQUE

More information

Redundancy for Routers using Enhanced VRRP

Redundancy for Routers using Enhanced VRRP Redundancy for Routers using Enhanced VRRP 1 G.K.Venkatesh, 2 P.V. Rao 1 Asst. Prof, Electronics Engg, Jain University Banglaore, India 2 Prof., Department of Electronics Engg., Rajarajeshwari College

More information

VMware vsphere with ESX 6 and vcenter 6

VMware vsphere with ESX 6 and vcenter 6 VMware vsphere with ESX 6 and vcenter 6 Course VM-06 5 Days Instructor-led, Hands-on Course Description This class is a 5-day intense introduction to virtualization using VMware s immensely popular vsphere

More information

By the end of the class, attendees will have learned the skills, and best practices of virtualization. Attendees

By the end of the class, attendees will have learned the skills, and best practices of virtualization. Attendees Course Name Format Course Books 5-day instructor led training 735 pg Study Guide fully annotated with slide notes 244 pg Lab Guide with detailed steps for completing all labs vsphere Version Covers uses

More information

<Insert Picture Here> Enterprise Data Management using Grid Technology

<Insert Picture Here> Enterprise Data Management using Grid Technology Enterprise Data using Grid Technology Kriangsak Tiawsirisup Sales Consulting Manager Oracle Corporation (Thailand) 3 Related Data Centre Trends. Service Oriented Architecture Flexibility

More information

VMware vsphere with ESX 4 and vcenter

VMware vsphere with ESX 4 and vcenter VMware vsphere with ESX 4 and vcenter This class is a 5-day intense introduction to virtualization using VMware s immensely popular vsphere suite including VMware ESX 4 and vcenter. Assuming no prior virtualization

More information

THE IMPACT OF E-COMMERCE ON DEVELOPING A COURSE IN OPERATING SYSTEMS: AN INTERPRETIVE STUDY

THE IMPACT OF E-COMMERCE ON DEVELOPING A COURSE IN OPERATING SYSTEMS: AN INTERPRETIVE STUDY THE IMPACT OF E-COMMERCE ON DEVELOPING A COURSE IN OPERATING SYSTEMS: AN INTERPRETIVE STUDY Reggie Davidrajuh, Stavanger University College, Norway, reggie.davidrajuh@tn.his.no ABSTRACT This paper presents

More information