Dell's High Availability Cluster Product Strategy

Cluster Development Group
December 2003

This article outlines generic High Availability (HA) cluster requirements and supported configurations. It also explains the logic behind a number of the rules and requirements for Dell-supported HA solutions.

Overview

Ensuring access to information requires that applications and data meet stringent uptime requirements. As users demand that more services be available to them, the need to maximize application uptime has become commonplace. Developing HA solutions that support high levels of uptime while maintaining the simplicity of the Dell business model is challenging.

Dell's primary focus has been on composing solutions with Windows-based systems, but over the past year the need for HA cluster solutions has emerged in the Linux market. Likewise, application clustering now enables a distributed approach to computing in which multiple servers are logically grouped into a cluster and viewed as a single entity, as in Oracle Real Application Clusters (RAC). To facilitate the adoption of credible solutions for Linux, Dell is implementing a set of cluster rules that define a common set of configurations to be supported and sold around Microsoft Cluster Server (MSCS), Linux High Availability (Linux HA), and Oracle RAC cluster solutions.

Dell's High Availability Cluster Configuration rules are designed to ensure no single point of failure (SPOF) in the end-to-end cluster solution. This includes the standalone server, storage system, fabrics, paths, and applications. Dell HA cluster solutions are tested end-to-end for maximum availability and reliability. The matrixes below present a generic outline of the components that are supported in the Small Computer System Interface (SCSI) and Fibre solutions for Windows, Linux HA, and Oracle RAC based HA cluster configurations.
As our HA programs mature, each new solution should use similar approaches and components. The various permutations and configurations go through stringent testing, which includes fault injection, to ensure that the entire solution is exercised extensively in a highly stressful environment. All of the testing is conducted by the Dell HA cluster development groups. When issues are found, the teams work with the appropriate engineering teams and/or vendors to determine the root cause and develop a solution.

Certifications
Certifications are completed when required. However, not all solutions require certification by the ISV.

No Heterogeneous Storage Components
Because HA clustering is very dependent on the I/O subsystem, intermixing I/O components adds unacceptable risk to the configuration. Data integrity in cluster configurations must never be jeopardized.

Current and Previous (N and N-1) Server Configurations
Supporting the current and previous generations protects customer investment and provides migration paths to the latest OS and I/O subsystems.

High Availability SCSI Cluster Solution

SCSI-based HA cluster solutions are based on a cluster (server) failover configuration, versus the path and cluster failover supported in the Fibre configurations. There is also a private (heartbeat) network, a dedicated connection for communicating cluster status between the cluster nodes. In Dell's SCSI-based HA solution there is at least a single RAID controller in each cluster node (server). When a cluster node fails, its resources fail over to another cluster node.

The matrix shown later in the article outlines the standard components of each Dell-supported cluster. For example, under servers, N is the currently shipping server, such as the PE1750, and N-1 would be the PE1650. Under OSs, N is Windows Server 2003 Enterprise Edition and N-1 is Windows 2000 Advanced Server.

As of 12/18/2003

Diagram 1: Servers and external SCSI storage

High Availability Fibre Cluster Solution

Fibre-based HA cluster solutions are based on path failover and cluster failover. Requiring redundant Host Bus Adapters (HBAs), and therefore redundant paths to the storage, provides a higher level of availability than a SCSI-based cluster solution. Redundant HBAs are required in each cluster node (server). Redundant HBAs coupled with redundant switches support redundant paths and fabrics connected to the external storage array. When a path fails, there is a failover to the remaining path within the same cluster node. If both paths fail, the cluster can fail over to another node in the cluster.

The HBAs in a cluster must be identical. Mixed versions of HBAs are not supported in a single cluster configuration, regardless of whether you are implementing 2 or up to 8 nodes in Windows Server 2003, Enterprise Edition.

Diagram 2: Servers (nodes), Fibre switches, TBU, and external Fibre storage
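
The path-then-node failover order and the identical-HBA rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the function names, and the returned action labels are hypothetical and are not part of any Dell, Microsoft, or HBA vendor software. The HBA model names are taken from the component matrix in this article.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical model of one cluster node (server)."""
    name: str
    hba_models: List[str]   # model of each installed HBA, e.g. "QLA2340"
    paths_up: List[bool]    # health of each path to storage, one per HBA
    alive: bool = True


def hbas_are_homogeneous(nodes: List[Node]) -> bool:
    """Return True when every HBA in the cluster is the same model."""
    models = {model for node in nodes for model in node.hba_models}
    return len(models) <= 1


def has_redundant_paths(node: Node) -> bool:
    """Return True when the node has at least two HBAs (two paths)."""
    return len(node.hba_models) >= 2


def next_action(active: Node, standbys: List[Node]) -> Tuple[str, str]:
    """Choose the recovery step for the node that currently owns the resources.

    Path failover inside the same node is preferred. Failover to another
    node happens only when the node is down or all of its paths are lost.
    """
    healthy = [i for i, up in enumerate(active.paths_up) if up]
    if active.alive and healthy:
        if len(healthy) == len(active.paths_up):
            return ("none", active.name)          # nothing has failed
        # At least one path is lost but another remains: stay on this node.
        return ("path-failover", f"{active.name}:path{healthy[0]}")
    # Node is down or has no usable path: look for a usable standby node.
    for standby in standbys:
        if standby.alive and any(standby.paths_up):
            return ("node-failover", standby.name)
    return ("unavailable", "")
```

In this sketch, a two-node cluster in which one path on the active node is down yields a `path-failover` action on the same node, while losing both paths yields a `node-failover` action to the standby. A configuration check could call `hbas_are_homogeneous` and `has_redundant_paths` before a cluster is accepted.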

Oracle Real Application Clusters (RAC)

Application clustering enables additional functionality for specific purposes. While database virtualization technologies such as Real Application Clusters are not yet as widespread as generic HA clustering technologies, they can provide a unique value proposition for a given application or deployment scenario.

Oracle RAC is Oracle's database clustering technology, whereby multiple servers can be grouped in an active-active cluster with shared data. As of today, RAC is the only technology that allows databases to scale out in a shared data model. Any front-end application (such as OLTP applications, Oracle E-Business Suite, or SAP) can connect to the database cluster. RAC is therefore a platform for Oracle clustering at the database level.

Diagram 3

Matrix of Dell Supported Cluster Components

Solutions: Windows HA, Oracle RAC, Linux HA
Operating systems: Win NT EE, W2K AS, W2K3 EE, RH 2.1 AS, RHEL 3, RHEL 3

Product/Feature

Configuration Rules
  Certifications  X X X
  26x0 4600 64x0 66x0 8450 1750 26x0 4600 64x0 66x0  N and/or N-1 Servers  X X X
  Multiple Clusters on a SAN  X X X X X
  Multiple Clusters Direct Attached
  CX600 CX600  Mixed Storage (on a SAN)  X X X
  Mixed Storage (on a cluster)
  Single Path Configs  X X
  Dual Path Configs  X X X X X
  Homogeneous HBA I/O (no mixing of HBA cards)  X X X X X

Controllers
  Emulex Single
    LP9002L  X X X
    LP982  X X
  QLogic Single
    QLA2200  X X X
    QLA2340  X X X X
  Emulex Dual
    LP9802
  QLogic Dual
    QLA2342  X X

RAID Controllers
  PERC 3/DC  X X X X X
  PERC 4/DC  X X X

Driver Changes
  Requalification at a minimum. Certification done at next major release.

External Storage
  PowerVault™
    PV22xS  X X
    Array Manager  X X X
    SATA SCSI
    PV650  X X
    PV660  X X X
    FC4500  X X X
    FC4700-2  X X X X X
  CX Series
    CX600  X X X X
    CX400  X X X X
    CX200  X X X X
    CX200LC
    SATA

FC Switches
  Brocade
    8 Port  X X X X X
    16 Port  X X X X X
    32 Port
  McData
    8 Port
    16 Port
    32 Port
  Flex Switch  X X X

Platforms
  Blades
  SC
  1P Tower
  2P Tower  X X X X X
  4P Tower  X X X X X
  1P Rack
  2P Rack  X X X X X
  4P Rack  X X X X X
  64 Bit 2P Rack

Interconnect
  On-board LOM  X X X X X
  All add-in Ethernet NICs supported by platform  X X X X X
  Heterogeneous Interconnect  X X
  Homogeneous Interconnect  X X X X X
  NIC Teaming  Public Only, Public Only, Public Only, X, X

Win NT EE - Windows NT, Enterprise Edition
W2K AS - Windows 2000, Advanced Server
W2K3 EE - Windows Server 2003, Enterprise Edition
RHEL 2.1 - Red Hat Enterprise Linux 2.1
RHEL 3 AS - Red Hat Enterprise Linux 3 Advanced Server

Application Availability

Because the application is critical, Dell focuses on understanding and proposing applications that are cluster aware, such as Microsoft Exchange and Microsoft SQL Server. With cluster-aware applications, the clustering software can check whether an application is responding. When it is not, the clustering software assumes the application is hung and attempts to restart it on the same system. This is referred to as a local recovery. Local recoveries are quicker to perform than a failover to a backup server, so users are usually up and running sooner. If these steps fail, the clustering software fails resources, including any applications, over to another cluster node. With a node failover the application takes longer to come up, but service resumes once the backup node and application are up and running.

Planned downtime can also be managed more effectively. Hardware and software maintenance can be performed on one of the servers while the other servers continue to provide the needed functionality for the users. This important task no longer has to impact users or be performed outside working hours.

As Dell's high availability portfolio continues to expand, application monitoring and fault prevention remain a primary focus for improving application availability.

Figure legend: High Availability; Disaster Recovery

Conclusion

Dell is continuing to drive simplicity and standardization within the HA cluster market. Previously, High Availability clustering was considered difficult to plan, productize, implement, test, and sell.
Over the past several years, Dell has standardized the Windows HA clustering market, and it now plans to do the same for the Linux HA clustering market and Oracle RAC solutions.
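
The recovery sequence described under Application Availability (check whether the application responds, attempt a local recovery, then fail over to another node) can be sketched as follows. This is a minimal illustration, not the behavior of any specific clustering product: the function name, the two callbacks, and the `max_local_restarts` parameter are all assumptions introduced for the example.

```python
from typing import Callable, List, Optional


def recover(
    is_responding: Callable[[str], bool],
    restart_on: Callable[[str], bool],
    current_node: str,
    backup_nodes: List[str],
    max_local_restarts: int = 1,   # assumed policy knob, not from the article
) -> Optional[str]:
    """Return the node hosting the application after recovery, or None.

    is_responding(node) reports whether the application answers on that node.
    restart_on(node) tries to start the application there and reports success.
    Both callbacks are hypothetical hooks supplied by the caller.
    """
    # Healthy application: nothing to do.
    if is_responding(current_node):
        return current_node

    # Local recovery: restart on the same system first, because it is
    # quicker than moving resources to a backup server.
    for _ in range(max_local_restarts):
        if restart_on(current_node) and is_responding(current_node):
            return current_node

    # Node failover: local recovery failed, so try each backup node in turn.
    for node in backup_nodes:
        if restart_on(node) and is_responding(node):
            return node

    # No node could bring the application back.
    return None
```

A caller would pass in its own health check and restart hooks. The function returns the original node when local recovery succeeds and a backup node when a failover was needed, which mirrors the trade-off in the text: local recovery is tried first because it restores service sooner.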