A GPFS Primer October 2005


Overview

This paper describes GPFS (General Parallel File System) Version 2, Release 3 for AIX 5L and Linux. It provides an overview of key concepts which should be understood by those considering the product. It is assumed that the reader has a basic knowledge of clustering and storage networks. The product documentation provides more detail on all aspects of the licensed product, as well as information on prior releases [2].

GPFS is a cluster file system providing normal application interfaces, and has been available on AIX operating system-based clusters since 1998 and Linux operating system-based clusters since 2001. GPFS distinguishes itself from other cluster file systems by providing concurrent, high-speed file access to applications executing on multiple nodes in an AIX 5L cluster, a Linux cluster or a heterogeneous cluster of AIX 5L and Linux machines. The processors supporting this cluster may be a mixture of IBM System p5, IBM eServer p5 and pSeries machines, IBM eServer BladeCenter or IBM eServer xSeries machines based on Intel or AMD processors. GPFS supports the current releases of AIX 5L and selected releases of the Red Hat and SUSE LINUX Enterprise Server distributions. See the GPFS FAQ [1] for a current list of tested machines and tested Linux distribution levels. It is possible to run GPFS on compatible machines from other hardware vendors, but you should contact your IBM sales representative for details.

Figure 1: GPFS in a Direct-attached SAN Environment (diagram: compute nodes with FC drivers on an IP network, an optional FC switch, and the disk collections)

GPFS for AIX 5L and GPFS for Linux are derived from the same programming source and differ principally in adapting to the different hardware and operating system environments. The functionality of the two products is identical. GPFS V2.3 allows AIX 5L and Linux nodes, including Linux nodes on different machine architectures, to exist in the same cluster with shared access to the same file system. A cluster is a managed collection of computers which are connected via a network and share access to storage. Storage may be shared directly using storage networking capabilities provided by a storage vendor or by using IBM-supplied capabilities which simulate a storage area network (SAN) over an IP network.

GPFS V2.3 is enhanced over previous releases of GPFS by introducing the capability to share data between clusters. This means that a cluster with proper authority can mount and directly access data owned by another cluster. It is possible to create clusters which own no data and are created for the sole purpose of accessing data owned by other clusters. The data transport uses either SAN simulation capabilities over a general network or SAN extension hardware. GPFS V2.3 also adds new facilities in support of disaster recovery, recoverability and scaling. See the product publications for details [2].

Introduction

GPFS is IBM's best performing cluster file system. The file system is built from a collection of disks which contain the file system data, a collection of computers which own and manage the file system, and a set of networking connections which bind together the computers and the storage. In its simplest environment, the storage is connected to all machines using a SAN, as shown in Figure 1. The illustration shows a Fibre Channel SAN, which is the most common form of SAN connection. The computers are connected to the storage via the SAN and to each other using a LAN. Data used by applications flows over the SAN, and control information flows among the GPFS instances on the cluster via the LAN.

Other configurations are also possible. In some environments, where every compute node in the cluster cannot have SAN access to disks, GPFS makes use of an IBM-provided network block device capability. On AIX 5L-only configurations which use the IBM pSeries High Performance Switch, this is the IBM Virtual Shared Disk facility, which is a component of AIX 5L. On Linux, on AIX 5L using other interconnects, or on mixed clusters of AIX 5L and Linux nodes, GPFS uses the Network Shared Disk (NSD) capability, which is a component of GPFS. Both virtual shared disk and NSD provide software simulation of a SAN across IP or IBM proprietary networks. GPFS uses NSD and virtual shared disk to provide high-speed access to data for applications running on compute nodes which do not have a SAN attachment. Data is served to those compute nodes from a virtual shared disk I/O server or an NSD I/O server. Multiple I/O servers for each disk are possible and recommended to avoid single points of failure for your data.

A Linux example of such a model is shown in Figure 2. In this example, both data and control flow across an unswitched LAN. This model is only appropriate for small clusters, but is used here as an illustration. Switched LANs and/or bonded LANs should be used for clusters requiring significant data transfer. Higher performance networks such as the IBM High Performance Switch or InfiniBand will provide even higher performance.

Figure 2: A Simple Configuration (diagram: compute nodes connected by Ethernet to I/O nodes with disk device interfaces, attached to the disk collections)

In the SAN simulation model, a subset of the total node population is defined as I/O server nodes. The disk drives are attached only to the I/O servers. The NSD or IBM Virtual Shared Disk subsystem is responsible for the abstraction of disk data blocks across an IP-based network. The fact that I/O is remote is transparent to the application issuing the file system I/O calls. Figure 2 shows an example configuration where a set of compute nodes are connected to a set of I/O servers via a high-speed interconnect or an IP-based network such as Ethernet. Each I/O server is attached to a portion of the disk collection. The disks should be multi-tailed to I/O servers for failover capability in the event of an I/O server failure. The choice of how many nodes should be configured as I/O servers is based on individual performance requirements and the capabilities of the storage subsystem. Note that this model of storage attachment does not require a complex SAN. Each storage box is only attached to a small number of I/O servers, often only two servers, eliminating the need for SAN switches.

The choice between a SAN attachment and a SAN simulation is a performance and economic one. In general, SANs will provide the highest performance, but the cost and management complexity of SANs for large clusters is often prohibitive. The SAN simulation capabilities of GPFS provide an answer to that problem.

GPFS provides file data access from all nodes in the cluster by providing a global name space for files. Applications can efficiently access files using standard UNIX file system interfaces, and GPFS supplies the data to any location in the cluster using the supplied path to the storage. GPFS allows all compute nodes with the file system mounted to have coherent and concurrent access to all storage, including write sharing with full X/Open semantics.
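To make the preceding point concrete, the following minimal C sketch shows the kind of unmodified POSIX code that can run on every node of the cluster at once: each instance writes its own region of a single shared file at a rank-specific offset. The mount point /gpfs/fs1, the file name and the NODE_RANK environment variable are illustrative assumptions for this sketch, not part of GPFS.

    /* Minimal sketch: each node writes its own fixed-size region of one
     * shared file in a GPFS file system using only standard POSIX calls.
     * The path /gpfs/fs1/shared.dat and the NODE_RANK environment variable
     * are assumptions made for this example. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define REGION_SIZE (4 * 1024 * 1024)   /* 4 MiB written by each node */

    int main(void)
    {
        const char *rank_str = getenv("NODE_RANK");   /* e.g. 0, 1, 2, ... */
        long rank = rank_str ? atol(rank_str) : 0;

        /* All nodes open the same file; the cluster file system keeps the
         * concurrent accesses coherent. */
        int fd = open("/gpfs/fs1/shared.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        char *buf = malloc(REGION_SIZE);
        if (!buf) {
            close(fd);
            return 1;
        }
        memset(buf, (int)('A' + (rank % 26)), REGION_SIZE);

        /* pwrite() at a rank-specific offset: no two nodes touch the same
         * bytes, so no application-level locking is needed for this pattern. */
        off_t offset = (off_t)rank * REGION_SIZE;
        ssize_t written = pwrite(fd, buf, REGION_SIZE, offset);
        if (written != REGION_SIZE)
            perror("pwrite");

        free(buf);
        close(fd);
        return written == REGION_SIZE ? 0 : 1;
    }

Run on several nodes at the same time, each instance writes a disjoint region of the same file; because the file system provides coherent shared access, the same code also runs unchanged against any other POSIX file system.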

GPFS beyond the cluster

The description above describes a simple cluster where all machines are co-located and share a common network. This is the basic configuration supported by all GPFS releases. GPFS V2.3 introduces a configuration called multi-clustering which allows a cluster to permit access to specific file systems from another cluster. This level of sharing is intended to allow clusters to share data at higher performance levels than distributed file systems. It is not intended to replace distributed file systems which are tuned for desktop access or for access across unreliable network links. GPFS data sharing requires a trusted kernel at both the owning cluster and the cluster remotely accessing the file system. This capability is useful for sharing data across multiple clusters within a location or across locations which have adequate network links among them.

Figure 3: Multi-clustering (diagram: Cluster A and Cluster B connected to the original cluster's I/O nodes and disk device interfaces)

Figure 3 shows a multi-clustering configuration. The cluster on the lower left side of the figure is the cluster from Figure 2. We have added two clusters, Cluster A and Cluster B, which need to access the data from the original cluster. The original cluster owns the storage and manages the file system. It may grant access to file systems which it manages to remote clusters such as Cluster A and Cluster B. Cluster A and Cluster B do not have any storage in this example, but that is not always true. They could own file systems which may or may not be accessible outside their cluster. In the case where they do not own storage, these nodes are grouped into clusters for ease of management. It is also possible to have a remote cluster consisting of a single node. When the remote clusters need access to the data, they mount the file system by contacting the owning cluster and passing the required security checks.

Cluster A accesses the data through an extension of the network. Cluster B accesses the data through a physical extension of the SAN. In both cases, control traffic is sent across an IP network.

Scaling

Scaling considerations for GPFS include:

- GPFS supports clusters of up to 512 nodes mounting a file system as a general availability statement; however, there are a number of larger configurations supported through special arrangements with IBM. These nodes may be any mixture of supported processor types and operating system levels. As described earlier, all nodes must have either direct connectivity or SAN attachment to the storage.
- GPFS supports file systems of 200 TB as a general availability statement; however, there are configurations in excess of 1 PB supported by special arrangement with IBM.
- GPFS has demonstrated tens of GB/sec aggregate throughput in configurations with sufficient hardware to drive that load.

The key and unique feature of the GPFS architecture and design which enables this is the use of distributed metadata servers. GPFS handles metadata in a distributed fashion on all nodes of the cluster. This distinguishes GPFS from other cluster file systems, which typically have a centralized metadata server handling fixed regions of the file namespace. The centralized metadata server becomes a performance bottleneck for metadata-intensive operations and can also be a single point of failure. GPFS solves this problem by handling metadata at the node which is using the file or, in the case of parallel access to the file, at a dynamically selected node which is using the file. The token manager is a central facility which controls which nodes manage the metadata for each active object. Note that the resources associated with the token manager are significantly lower than those required for a metadata server. See the GPFS: Concepts, Planning, and Installation Guide for further information [2].

GPFS and other Data Facilities

GPFS provides the standard interfaces which are exploited by other data facilities such as backup products and network file systems. Several IBM-supplied data facilities explicitly support GPFS. These include the Tivoli Storage Manager (TSM) and the Network File System (NFS) facilities on AIX 5L. Specifically, TSM provides both backup capabilities and space management capabilities where data can flow out of GPFS to tape and be recalled on demand. The combination of GPFS and TSM in this way provides a form of Information Lifecycle Management. There are other programs providing capabilities of this type which can be used with GPFS. Contact your data facility vendor for any support statements. In a similar way, GPFS can be used in conjunction with database products which require a cluster file system. Contact your database vendor to determine whether GPFS would be a supported environment for your clustered database.

GPFS file data may be exported to clients outside the cluster via NFS or other distributed file system programs, including the capability of exporting the same data from multiple nodes. This allows your cluster to provide aggregate NFS service of the same data in excess of the capability of one node. It also allows service from the cluster when one or more of the exporting nodes are inoperable.

Additionally, GPFS provides support for a Data Management API (DMAPI) interface, which is IBM's implementation of the X/Open data storage management API. This DMAPI interface allows vendors of storage management applications such as TSM to provide HSM (Hierarchical Storage Management) support for GPFS. A single-node cluster can be created to allow standalone servers to take advantage of HSM support with the large-block file performance offered by GPFS.

GPFS Support and File System Access

GPFS is designed for application sets which have needs for data access rates and compute capability beyond that which can be reliably satisfied by a single computer with directly attached storage or by a cluster serviced by a distributed file system. There are numerous examples of application sets which meet this characterization, but we will discuss two for the purposes of illustration.

The High Performance Computing environment serves many technical and complex analytic business applications. These applications are characterized by the fact that they require the computing power of multiple computers to reasonably solve their problems. Parallel programming systems require that the data associated with the application be delivered to each instance at high performance and with full consistency semantics. GPFS provides outstanding performance to parallel applications, including some extended programming interfaces intended specifically for this class of application. Parallel applications which require concurrent shared access to the same data from many nodes in the cluster (including concurrent updates to the same file) can accomplish this easily with GPFS. GPFS maintains the coherency and consistency of the file system via sophisticated locking and byte-level token management.

Increases in computing power and networking capability have made possible a whole class of applications which depend on the high-speed, reliable capture and processing of large volumes of data. Examples of this class of application are those that collect and process unstructured data such as video images. These applications can require high data rates as well as the ability to add and delete capacity upon demand. GPFS provides the capability to supply high data rates and to add members of the cluster online without disruption to ongoing operations. GPFS also provides fault tolerance capabilities which allow operations to continue in the event of many types of failures. There are other applications which can exploit the fault tolerance or high bandwidth capabilities of GPFS, including those which use clustered databases for fault tolerance and scalability.

Will your application work with GPFS?

GPFS installs into your operating system as a physical file system and interfaces with the operating system to handle requests for data within a GPFS file system. Most applications execute with no change. The product publications describe the extended application interfaces for parallel programs should you choose to use them. Those interfaces are unique to GPFS. GPFS recognizes typical access patterns like sequential, reverse sequential and random, and optimizes its prefetching mechanism for these patterns. The same application will normally be capable of accessing both GPFS-resident data and data residing in other file systems.
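As a small illustration of the "no change required" point above, the sketch below is ordinary sequential POSIX I/O that simply sizes its read requests from the preferred block size reported by fstat(); on a GPFS file system this reflects the administrator-configured block size, and the plain sequential pattern is one the prefetching logic described above can recognize. The input path is an assumed example, not a GPFS requirement.

    /* Minimal sketch: an unchanged sequential reader that sizes its I/O
     * requests from the file system's preferred block size (st_blksize).
     * The path /gpfs/fs1/input.dat is an assumption made for this example. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/gpfs/fs1/input.dat";   /* assumed example path */

        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        struct stat st;
        if (fstat(fd, &st) != 0) {
            perror("fstat");
            close(fd);
            return 1;
        }

        /* st_blksize reports the preferred I/O size; on GPFS this reflects
         * the (administrator-configurable) file system block size. */
        size_t bufsize = (size_t)st.st_blksize;
        char *buf = malloc(bufsize);
        if (!buf) {
            close(fd);
            return 1;
        }

        long long total = 0;
        ssize_t n;
        /* A plain sequential loop: an access pattern the file system can
         * recognize and prefetch for. */
        while ((n = read(fd, buf, bufsize)) > 0)
            total += n;

        printf("read %lld bytes in %zu-byte requests\n", total, bufsize);
        free(buf);
        close(fd);
        return n < 0 ? 1 : 0;
    }

The same loop runs unmodified against any POSIX file system; no GPFS-specific calls are needed unless the extended parallel interfaces mentioned above are explicitly wanted.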

Data Availability

GPFS is a fault-tolerant file system and can be configured for continued access to data even in the presence of failures of compute nodes, I/O server nodes or their disk attachments. The metadata is organized by GPFS in a fashion that lends itself to efficient parallel access and maintenance. Metadata can be configured with multiple copies to allow for continuous operation even if the paths to a disk, or the disk itself, are broken. GPFS can be used with RAID or other hardware redundancy capabilities to survive media failures. The disks can be multi-tailed to attach to multiple I/O servers so that the loss of an I/O server does not prevent access to the disks attached to it. The loss of connectivity to disks from one node does not affect the other nodes in the direct-attached SAN model. GPFS continuously monitors the health of the various file system components. When failures are detected, appropriate recovery action is taken automatically if alternate resources are available. GPFS also provides extensive logging and recovery capabilities which maintain metadata consistency across the failure of application nodes holding locks or performing services for other nodes.

Performance

GPFS provides unparalleled performance, especially for larger data objects, and excellent performance for large aggregates of smaller objects. GPFS achieves high-performance I/O by:

- Striping data across multiple disks attached to multiple nodes.
- Efficient client-side caching, including read-ahead and write-behind when application access patterns make this the right choice.
- Using a block size which is configurable by the administrator. This is especially important with some new disk technologies where very large block sizes are critical to storage performance.
- Built-in logic for read-ahead and write-behind file functions.
- Using block-level locking based on a very sophisticated token management system to provide data consistency while allowing multiple application nodes to have concurrent access to the files.

Administration

GPFS provides a very simple administration model that is consistent with standard AIX 5L and Linux file system administration while providing extensions for the clustering aspects of GPFS. GPFS provides functions that simplify multi-node administration. A single multi-node command can perform a file system function across the entire cluster, and the command can be issued from any node in the cluster. These commands are typically extensions to the usual AIX 5L and Linux file system commands. GPFS also has other standard file system administration functions such as quotas, snapshots, and extended access control lists.

Connectivity Choice

GPFS requires connectivity to storage from all nodes and control connectivity among the nodes sharing the data. As illustrated in Figure 2, these may be the same network. Planning the connectivity for GPFS requires the allocation of sufficient bandwidth for both tasks. For maximum performance, connection via IBM's pSeries High Performance Switch or a universally connected SAN should be considered. Connections using LANs and InfiniBand are also possible. The FAQ [1] lists the currently qualified networks.

GPFS does not require a dedicated network, but we do not recommend running GPFS over a network which is shared for many purposes. There must be sufficient bandwidth and sufficiently low latency available for GPFS usage in order to meet your performance expectations.

Supported Disk Storage Devices

Effective GPFS operation depends on sufficient disk bandwidth. IBM qualifies some disk types for use with GPFS, and these are documented in the FAQ [1]. This list is updated as new disks are tested. Other disks may be used if they provide the required functionality for multiple-node access.

Glossary of Terms

Node: A single AIX 5L or Linux operating system image.
Cluster: A collection of nodes which are managed as a single entity.
IBM Virtual Shared Disk: A kernel subsystem in AIX 5L providing a shared disk architecture via software and associated recovery capabilities.
FC: Fibre Channel.
NSD: Network Shared Disk subsystem. A component of GPFS on AIX 5L and Linux providing a shared disk architecture via software.
Metadata: Data describing user data.

References

1. GPFS FAQ: http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfs_faqs.html
2. GPFS documentation: http://publib.boulder.ibm.com/infocenter/clresctr/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfsbooks.html

IBM occasionally publishes Redbooks on various topics, including GPFS. See: http://www.redbooks.ibm.com/

The following conference paper on GPFS may also be of interest: http://www.almaden.ibm.com/storagesystems/file_systems//

© IBM Corporation 2005

IBM Corporation
Marketing Communications
Systems Group
Route 100
Somers, New York 10589

Produced in the United States of America
October 2005
All Rights Reserved

This document was developed for products and/or services offered in the United States. IBM may not offer the products, features, or services discussed in this document in other countries. The information may be subject to change without notice. Consult your local IBM business contact for information on the products, features and services available in your area.

All statements regarding IBM's future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only.

IBM, the IBM logo, the e-business logo, eServer, AIX, AIX 5L, BladeCenter, pSeries, System p5, Tivoli and xSeries are trademarks or registered trademarks of International Business Machines Corporation in the United States or other countries or both. A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

UNIX is a registered trademark of The Open Group in the United States, other countries or both. Intel is a trademark of Intel Corporation in the United States, other countries, or both. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Red Hat, the Red Hat "Shadow Man" logo, and all Red Hat-based trademarks and logos are trademarks or registered trademarks of Red Hat, Inc., in the United States and other countries. Other company, product, and service names may be trademarks or service marks of others.

IBM hardware products are manufactured from new parts, or new and used parts. Regardless, our warranty terms apply.

Information concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of the non-IBM products should be addressed with the suppliers.

The IBM home page on the Internet can be found at: http://www.ibm.com.

The System p5 and eServer p5 home page on the Internet can be found at: http://www.ibm.com/servers/eserver/pseries.