
WHITE PAPER

VERITAS Volume Manager for Windows 2000
VERITAS Cluster Server for Windows 2000

VERITAS CAMPUS CLUSTER SOLUTION FOR WINDOWS 2000

TABLE OF CONTENTS

Overview
Dynamic Volumes Concepts
    Dynamic Volume Overview
    Dynamic Volumes Virtualize Storage
    Dynamic Volumes in Microsoft Windows 2000
    Dynamic Volumes in VERITAS Volume Manager for Windows
Dynamic Disk Groups
VERITAS Cluster Server (VCS)
    VCS Overview
    Failover Groups
    Parallel Groups
    Putting the pieces together
Cluster Communications (Heartbeat)
    Group Membership Services/Atomic Broadcast (GAB)
        Cluster membership
        Cluster State
    Low Latency Transport (LLT)
        Low Priority Link
Using Volume Manager in a VERITAS Cluster Environment
    Advantages of using Volume Manager with VERITAS Cluster Server
    Dynamic Volume Support
    Protection from Common Storage Management Errors
    Optional Automatic Failover
Campus Clusters
    How does a Campus Cluster respond to a failure?
    Failure Scenarios
    Reinstating Faulted Hardware
Summary

OVERVIEW

The Microsoft Windows 2000 operating system offers significant advances in performance, scalability, and manageability. One of the key features of this operating system is the Logical Disk Manager (LDM), which provides logical volume management and online disk administration capabilities. VERITAS Volume Manager for Windows 2000 (VM) extends these in-the-box capabilities to create a highly scalable, manageable platform for the most data-intensive or critical application environments. VERITAS Cluster Server for Windows 2000 (VCS) provides automated application recovery for business environments that demand uninterrupted application service and availability.

Many integration points have been developed between Volume Manager and Cluster Server that build on the existing capabilities of each product to provide cost-effective and powerful high-availability solutions. By combining VERITAS Volume Manager and Cluster Server, system administrators can create flexible storage configurations that allow clusters to be built using local storage devices on each system rather than a single shared storage array. This integrated solution gives customers both highly available application failover and highly configurable, manageable storage.

This paper provides a brief overview of the components involved in this solution and then discusses specifically how to create application-specific storage migration for VCS using VERITAS Volume Manager. It also discusses a number of advantages gained by using VERITAS Volume Manager in a VCS environment, including the ability to use dynamic volumes in a cluster, optional automatic failover in site-failure scenarios, and the integrated safety measures found only in VCS/VM configurations that prevent common storage management mishaps. These VERITAS solution advantages reduce overall planned and unplanned downtime in a clustering environment. More information on VERITAS solutions for Windows 2000 can be found on the VERITAS Web site at http://www.veritas.com.

DYNAMIC VOLUMES CONCEPTS

Dynamic Volume Overview

VERITAS worked with Microsoft to develop the logical volume management in the Windows 2000 software. Logical volume management through the use of dynamic volumes removes the physical limitations of storage, enabling administrators to build higher-performance, more available storage configurations from existing disk devices. This simplifies disk administration tasks and reduces cost of ownership.

Windows 2000 introduces a new Logical Disk Manager (LDM) facility that supports both basic disks and dynamic disks. Basic disks use standard disk partition tables to support basic volumes and have been supported on previous versions of Windows. Dynamic disks, which contain dynamic volumes, store disk and volume information on the disk itself.

A dynamic volume is an abstract online storage management unit instantiated by a system software component called a volume manager. To file systems, database management systems, and applications that do raw I/O, a dynamic volume appears to be located on a single disk, in the sense that:

- It has a fixed amount of non-volatile storage.
- Its storage capacity is organized as consecutively numbered 512-byte blocks.
- Sequences of consecutively numbered blocks can be read or written with a single request.
- Reading and writing can start at any block.
- The smallest unit of data that can be read or written is one 512-byte block.

Dynamic Volumes Virtualize Storage

Unlike basic volumes, a dynamic volume can aggregate the capacity of several disks into a single storage unit, so that there are fewer storage units to manage, or to accommodate files larger than the largest available disk. A dynamic volume can also aggregate the I/O performance of several disks. This allows large files to be transferred faster than would be possible with the fastest available disk and, in some circumstances, enables more I/O transactions per second than the fastest available disk could deliver (for example, by issuing concurrent I/Os).

A dynamic volume can improve data availability through mirroring or RAID techniques that tolerate disk failures. Failure-tolerant volumes can remain fully functional when one or more of the disks that comprise them fail. A dynamic volume created with VERITAS Volume Manager can be grown dynamically, even while applications hosted on the volume are still running. More complex volumes can be created to provide a combination of these benefits.

Dynamic Volumes in Microsoft Windows 2000

Dynamic volumes in Windows 2000 can host software-managed RAID volumes. Because the disk and volume information is stored on the disk itself instead of in system tables, moving or reallocating dynamic disk storage between systems is easier. Another major benefit is that administrators can perform disk and volume management tasks without restarting the system. Dynamic volumes in Windows 2000 may be simple, spanned, striped (RAID-0), mirrored (RAID-1), or RAID-5 (striping with distributed parity). The Windows 2000 Logical Disk Manager provides online management and configuration of local and remote disk storage and a domain-wide view of storage resources. Together, these features support highly configurable and manageable storage solutions.

Dynamic Volumes in VERITAS Volume Manager for Windows

VERITAS Volume Manager extends the capabilities of Windows 2000 dynamic volumes. Volume Manager dynamic volumes have all the capabilities of the native Windows 2000 dynamic volumes, plus:

- Striped and RAID-5 volumes using more than 32 physical disks (columns)
- Mirrored stripe volumes for a high-performance, highly available storage solution
- The ability to grow software RAID volumes dynamically without taking users or applications offline (no rebooting)
- N-way mirroring: administrators can create and detach third mirrors of mirrored volumes
- Preferred plex: designating a local mirror as the preferred read device for data with heavy request loads
- Hot spares, hot relocation, and unrelocation
- RAID-5 and dirty region logging to speed recovery after a RAID-5 or mirrored volume failure

Volume Manager also provides advanced online management capabilities. For example, administrators can expand mirrored, striped, and RAID-5 volumes while the data is online and available. Administrators can use the graphical interface to identify storage bottlenecks and move data to correct or prevent performance problems. Finally, VERITAS Volume Manager supports shared and partitioned shared storage configurations using the concept of multiple disk groups. This makes it easier for multiple Windows servers to share a disk farm or Storage Area Network by segmenting the available storage, with each server owning specific storage segments. The administrator can easily reconfigure or change the segmentation. This last feature is relevant for supporting storage migration with VCS.

DYNAMIC DISK GROUPS

VERITAS Volume Manager supports a concept called dynamic disk groups. A dynamic disk group is a collection of disks from one or more storage arrays combined in a layout defined by the user. The Windows 2000 Logical Disk Manager does not support dynamic disk groups. VERITAS Volume Manager's support for multiple dynamic disk groups is a key feature when used in a VCS environment.

A dynamic disk group is imported or deported as a single unit. When a disk group is imported, all the volumes contained in the disk group are brought online and made available by the volume manager. When a dynamic disk group is deported, all the volumes contained within the group are taken offline and made unavailable.

There are three types of VERITAS Volume Manager dynamic disk groups:

1. Primary disk group - contains the boot/system disk and optionally additional disks with an arbitrary volume layout.
2. Secondary disk group - contains one or more disks with an arbitrary volume layout.
3. Cluster disk group - contains one or more disks with an arbitrary volume layout.

A cluster disk group has a few additional properties:

- Cluster disk groups are intended to be used by clustering applications such as VERITAS Cluster Server (VCS) and Microsoft Cluster Server (MSCS).
- A cluster disk group is NOT automatically imported at boot time. If the disk group is not managed by a cluster, the user must perform a manual import through the GUI, command line, or Volume Manager API.
- A cluster disk group uses hardware locking mechanisms (e.g., SCSI-2 reserve/release) to guarantee that the disks within a cluster disk group are imported by only one node at a time.
- VCS imports and deports cluster disk groups through VMDG resource online and offline operations.

VERITAS CLUSTER SERVER (VCS)

VCS Overview

VERITAS Cluster Server provides value to businesses that require applications or services to be available constantly, with little or no downtime per year. These applications, services, and their supporting infrastructure are monitored for failure, with responsive actions taken in the event a failure occurs, such as moving the application and its dependent resources to a healthy server. VCS supports many advanced cluster capabilities that are not available in traditional two-node high availability solutions, such as role-based security, intelligent workload management, web-based administration, and N-to-1 clusters. More detail can be found in the product documentation and white papers for VERITAS Cluster Server.

A cluster is a group of independent computers working together as a single system to ensure that mission-critical applications and resources are as highly available as possible. The group is managed as a single system, shares a common namespace, and is specifically designed to tolerate component failures and to support the addition or removal of components in a way that is transparent to users.

VERITAS Cluster Server employs a shared disk architecture and supports both shared SCSI implementations and Storage Area Network (SAN) configurations. VCS supports up to thirty-two nodes in a single cluster, using any combination of Windows 2000 Server, Advanced Server, or Datacenter operating systems. Shared disk refers to the fact that storage resources are physically connected to all nodes in the cluster via a SCSI or Fibre Channel bus. By using SCSI-2 reserve and release commands, VCS ensures that no two servers in a cluster can access the same disk at the same time. VERITAS considers a shared-nothing architecture to be a configuration where the nodes in a cluster do not share a storage bus and the data on each host remains static.

VCS does not use a quorum disk architecture. Cluster configuration information is replicated over the heartbeat infrastructure and changes are made on all systems at the same time, so every system always has the latest configuration. Because heartbeat communication within the cluster is critical to its operation, VCS requires a minimum of two heartbeat links per node for redundancy.

The primary operating attributes of VCS are as follows:

- Each server participating in the cluster is referred to as a node.
- Anything managed by VCS is considered a resource. Resources may include storage devices, file shares, TCP/IP addresses, applications, and databases. Controlling a resource means bringing it online (starting) and taking it offline (stopping), as well as monitoring the health or status of the resource.
- A Service Group is a set of resources working together to provide application services to clients. For example, a web application Service Group might consist of:
  - disk groups on which the web pages to be served are stored,
  - a volume built using the NTFS file system in the disk group,
  - a network interface card (NIC),
  - one or more IP addresses associated with the network card(s), and
  - the application itself.

SERVICE GROUPS

VCS performs administrative operations on resources, including starting, stopping, restarting, and monitoring, at the Service Group level. Service Group operations initiate administrative operations for all resources within the group. For example, when a service group is brought online, all the resources within that group are brought online. When a failover occurs in VCS, resources never fail over individually; the entire service group that the resource is a member of fails over as a unit. If more than one group is defined on a server, one group may fail over without affecting the other group(s) on that server.

From a cluster standpoint, there are two significant aspects to this view of an application Service Group as a collection of resources:

- If a Service Group is to run on a particular server, all of the resources it requires must be available to that server.
- The resources comprising a Service Group have interdependencies; that is, some resources (e.g., a NIC) must be operational before other resources (e.g., an IP address) can be made operational.

One of the most important parts of a service group definition is the concept of resource dependencies. As mentioned above, resource dependencies determine the order in which specific resources within a Service Group are brought online or taken offline when the Service Group itself is brought online or offline. For example, a NIC resource must be online before the IP address can be brought online, and the IP address must be online before the network name can be brought online. In the same manner, databases must be stopped before volumes are stopped, and volumes must be stopped before disk groups are deported.

VCS service groups fall into two categories, depending on whether they can run on multiple servers simultaneously.

Failover Groups

A failover group runs on one system in the cluster at a time. Failover groups are used for most application services, such as most databases, messaging servers, and any other application not designed to maintain data consistency when multiple copies are started.

Parallel Groups

A parallel group can run concurrently on more than one system in the cluster at a time. A parallel service group is more complex than a failover group. It requires an application that can safely be started on more than one system at a time with no threat of data corruption, or whose data is local to each server.

Putting the pieces together

How do all these pieces tie together to form a cluster? Understanding how the pieces fit makes the rest of VCS fairly simple. Let's take a very common example: a two-node cluster serving a single SQL database to clients. The cluster consists of two nodes connected to a shared disk array, which allows both servers to access the data needed for the database. In this example, we are going to configure a single Service Group called SQL2000 that will be failed over between ServerA and ServerB as necessary.

The service group, configured as a Failover Group, consists of resources, each with a different resource type. The resources must be started in a specific order for everything to work. This is described with resource dependencies. The VCS engine, HAD, determines the order in which to bring up the resources based on the resource dependency statements in the configuration. When it is time to bring the service group online, VCS issues online commands to the proper resources in the proper order. The drawing to the right is a representation of a VCS service group, with the appropriate resources and dependencies for the SQL2000 Group. The method used to display the resource dependencies is identical to the VCS GUI.

The SQL2000 Group can be configured to start automatically on either node in the example. It can then move, or fail over, to the second node based on operator command, or automatically if the first node fails. VCS takes the resources offline starting at the top of the dependency graph and starts them on the second node starting at the bottom of the graph.

This cluster design brings both availability and manageability benefits. VCS tracks the state of the nodes in the cluster and, in the event of an application or server failure, either restarts the application or performs a failover to an available node. When the problem on the failed node is addressed, VCS can optionally switch the application back to the primary node. Failover is the process by which an application, including its services, network address, data volumes, and hostname, moves from a failed node to a healthy node in the cluster. Most stateless applications switch to the failover node transparently. Some applications that track the state of the nodes need to re-establish a connection to the cluster. Otherwise, the failover is transparent.
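For reference, such a group is described in the VCS configuration file, main.cf. The fragment below is a minimal, illustrative sketch of what the SQL2000 failover group might look like. The resource types and attributes shown (VMDg, MountV, NIC, IP, Lanman, GenericService and their attributes) follow common VCS for Windows conventions but are assumptions here, as are the system, disk group, volume, address, and service names; verify them against the bundled agent documentation for your release.

include "types.cf"

cluster CampusCluster ( )

system ServerA

system ServerB

// Failover group: runs on one node at a time, starts on ServerA by default.
group SQL2000 (
    SystemList = { ServerA = 0, ServerB = 1 }
    AutoStartList = { ServerA }
    )

    // Cluster disk group; VCS imports it when this resource is brought online
    // and deports it when the resource is taken offline.
    VMDg SQL_DG (
        DiskGroupName = SQL_DiskGroup
        )

    // Mounts a dynamic volume from the disk group at drive letter E:.
    MountV SQL_Mount (
        MountPath = "E:"
        VolumeName = SQL_Volume
        VMDGResName = SQL_DG
        )

    NIC SQL_NIC (
        MACAddress = "00-50-56-AB-CD-01"
        )

    IP SQL_IP (
        Address = "192.168.10.50"
        SubNetMask = "255.255.255.0"
        MACAddress = "00-50-56-AB-CD-01"
        )

    // Virtual network name that clients use to reach the database.
    Lanman SQL_Name (
        VirtualName = SQLVS
        IPResName = SQL_IP
        )

    // The SQL Server service itself; a dedicated SQL agent could be used instead.
    GenericService SQL_Service (
        ServiceName = MSSQLSERVER
        )

    // Dependencies (parent requires child): children are brought online first and
    // taken offline last, e.g. disk group before volume, NIC before IP address.
    SQL_Mount requires SQL_DG
    SQL_IP requires SQL_NIC
    SQL_Name requires SQL_IP
    SQL_Service requires SQL_Mount
    SQL_Service requires SQL_Name

Once configured, the group is operated on as a unit; for example, an operator-initiated switch to the other node uses the hagrp command (hagrp -switch SQL2000 -to ServerB).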

System administrators can also manually move resources from one node to another to perform load balancing and system maintenance tasks without incurring downtime on production applications.

Split brain refers to a state where all communication paths between nodes in a cluster have been lost. When this happens, the cluster nodes cannot determine whether a network failure occurred or whether the other nodes in the cluster have failed. If they assume node failure, there is a risk that applications will be brought online on more than one node in the cluster and will attempt to access the same shared disk, which could lead to data corruption. To prevent this from happening, VERITAS Cluster Server uses SCSI-2 disk reservations. When a disk or collection of disks is brought online by VCS, it is reserved by the system that brought it online and cannot be accessed by other servers in the cluster.

CLUSTER COMMUNICATIONS (HEARTBEAT)

VCS uses private network communications between cluster nodes for cluster maintenance. This communication takes two forms: nodes informing other nodes that they are alive, known as heartbeat, and nodes informing all other nodes of actions taking place and the status of all resources on a particular node, known as cluster status. This cluster communication takes place over a private, dedicated network between cluster nodes. VERITAS requires two completely independent, private networks between all cluster nodes to provide the necessary communication path redundancy and to allow VCS to discriminate between a network failure and a system failure.

VCS uses a purpose-built communication package comprised of the Low Latency Transport (LLT) and Group Membership Services/Atomic Broadcast (GAB). Together these components function as a replacement for the IP stack and provide a robust, high-speed communication link between systems without the latency induced by the normal network stack.

Group Membership Services/Atomic Broadcast (GAB)

The Group Membership Services/Atomic Broadcast protocol, abbreviated GAB, is responsible for the Cluster Membership and Cluster State communications described below.

Cluster membership

In order to maintain a complete picture of the exact status of all resources and groups on all nodes, VCS must be constantly aware of which nodes are currently participating in the cluster. While this may sound like an over-simplification, realize that at any time nodes can be rebooted, powered off, fault, or be added to the cluster. VCS uses its cluster membership capability to dynamically track the overall cluster topology.

Cluster membership is maintained via the use of heartbeats. Heartbeats are signals sent periodically over a network link from one system to another via the LLT protocol to verify that the systems are active. When a system no longer receives heartbeat messages from a peer for the interval set by the heartbeat timeout, the peer is marked DOWN and excluded from the cluster. Its applications are then migrated to the other systems.

Cluster State

Cluster State refers to tracking the status of all resources and groups in the cluster. This is the function of the Atomic Broadcast capability of GAB. Atomic Broadcast ensures that all systems within the cluster are immediately notified of changes in resource status, cluster membership, and configuration. Atomic means all systems receive updates, or all are rolled back to the previous state, much like a database atomic commit. If a failure occurs while transmitting status changes, GAB's atomicity ensures that, upon recovery, all systems have the same information regarding the status of any monitored resource in the cluster. The broadcast messaging service employs a two-phase commit protocol to deliver messages atomically to all surviving members of a group even in the presence of node failures.

VCS does not use a quorum disk architecture. Instead of storing cluster configuration information on a shared volume, VCS uses redundant network interconnects for heartbeats and cluster status. This provides a much more scalable and reliable cluster infrastructure.

Low Latency Transport (LLT)

LLT provides fast, kernel-to-kernel communications and monitors network connections. LLT functions as a replacement for the IP stack on the cluster systems. The use of LLT rather than IP removes the latency and overhead associated with the IP stack and ensures that events such as state changes are reflected more quickly.

Low Priority Link

LLT can be configured to use a low-priority network link as a backup to the normal heartbeat channels. Low-priority links are typically configured on the customer's public network or administrative network. The low-priority link is not used for cluster status traffic until it is the only remaining link. In normal operation, the low-priority link carries only heartbeat traffic for cluster membership and link state maintenance, and the frequency of those heartbeats is reduced to 50% of normal to reduce network overhead. When the low-priority link is the only remaining network link, LLT switches all cluster status traffic over to it as well. Upon repair of any configured private link, LLT switches cluster status traffic back to the high-priority link.

It is important to note that LLT is a non-routable protocol, which means all nodes participating in a cluster are restricted to the same subnet for heartbeat communication. In a campus environment, it may be necessary to use VLAN technology to span a single subnet over a distance. VCS requires that data latency on the heartbeat interconnects stay within 500 ms to sustain proper cluster communication.

Be sure to validate that your environment meets these requirements prior to testing application failover.

USING VOLUME MANAGER IN A VERITAS CLUSTER ENVIRONMENT

Advantages of using Volume Manager with VERITAS Cluster Server

VERITAS Volume Manager for Windows has a number of key advantages when used with VERITAS Cluster Server (VCS):

- Dynamic volume support within a cluster
- Protection from common storage management errors in a clustered environment
- Optional automatic failover in site-failure scenarios

Dynamic Volume Support

Dynamic volumes are essential in a cluster environment to ensure application data is always available, even in the event of a disk failure or during online storage maintenance. The key advantages of dynamic volume support in a cluster are described below.

Advantage #1: Data volumes can be extended while online

Clustering provides higher availability than non-clustered systems. Yet if a server's data grows and storage space must be added to existing volumes, there is no way to avoid downtime with native VCS. By using Volume Manager in conjunction with VCS, dynamic disks can be utilized, which allow you to grow volumes without interrupting data availability.

Advantage #2: Support for fault-tolerant data volumes

Another consideration in building a high availability solution is protecting against possible hardware failure. Because VCS does not natively allow the use of dynamic disks, data in a cluster cannot be made fault tolerant without hardware-proprietary solutions. If a disk that holds your data in a cluster fails, you must take the cluster offline, replace the faulty hardware, and then restore the data from a backup. Volume Manager's dynamic disks support RAID-1 (mirrored), RAID-5 (striped with parity), and RAID-1+0 (mirrored stripe) volumes to keep data online through hardware failures, avoiding the time-consuming backup and restore operations normally required.

Protection from Common Storage Management Errors

VERITAS Cluster Server and VERITAS Volume Manager work closely together to provide a tightly integrated solution for data and application availability. From an administration perspective, it should be understood that the underlying volume management layer is being monitored by the cluster software running on the server. Any changes to the data volumes should be made with the cluster configuration in mind. In an environment where the storage administrator may not be the cluster administrator, it is important that the software is intelligent enough to allow flexibility of control while preventing actions that would lead to failure. Listed below are some examples of how common administrative mistakes are prevented in basic environments.

Scenario #1: Cluster Manager must be used to deport online cluster disk groups under VCS control

Diskgroup A is being monitored by VCS and is online in the cluster. The administrator opens the VM console and attempts to deport Diskgroup A for testing. If this succeeded, VCS would see Diskgroup A go offline (fault) and would attempt to bring Diskgroup A online on another node in the cluster. Instead, the administrator is prompted to use the VCS console to take Diskgroup A offline prior to deporting the disk group.

Scenario #2: FlashSnap operations are only allowed if the VCS configuration is not affected

A system administrator is using the Volume Manager FlashSnap option for off-host backup purposes. Diskgroup A contains two disks, Disk 1 and Disk 2. Volume 1 is created on Disk 1, and a mirror of Volume 1 is created using Disk 2. FlashSnap is then used to break off the mirror and mount it on another host. During this operation, the administrator is given the option to change the name of the existing volume. Since the existing volume is being monitored by VCS, that operation is not allowed.

Scenario #3: Volume snapshots monitored by VCS are prevented from snapback

If a system administrator has mounted a snapshot and is monitoring that mirror within the VCS configuration, it cannot be used to snap back to the original volume until it is removed from the VCS configuration.

Scenario #4: Volumes and disk groups monitored by VCS cannot be deleted from the VM console

If a system administrator attempts to delete a volume or disk group that is being monitored by VCS, the operation will fail until the volume or disk group is removed from the VCS configuration.

Optional Automatic Failover

VERITAS Cluster Server provides an option for customers who prefer automatic recovery in the event of an entire site (building) failure. This is made possible through the ForceImport attribute of the MountV resource, a feature that does not exist in Microsoft Cluster Server. It is important to note that using this feature could lead to data loss. It is recommended that this option be left set to False (a value of zero) until the failure scenarios detailed in the sections below are clearly understood.

In a split-brain scenario, a cluster node cannot distinguish between the loss of a site and the loss of communications to that site. In theory, the communication paths between the sites could have been lost while the original site is still online. Because of this, the default behavior is to not bring the application online at the remaining site unless the majority of disks in the disk group are available. If the majority of disks are accessible, Volume Manager can ensure the other site is no longer accessing the storage. In this scenario, the administrator is notified of a site failure, and manual intervention is required to initiate application failover to the remote site.

After a site failure, only 50% of the disks in a mirrored volume remain, and the disk group must be forcefully imported. This can be automated by setting the ForceImport attribute in the VMDG resource to 1, although it opens the possibility of each site importing its 50% of the volume in a split-brain scenario. If that were to occur, the administrator would have to choose which half of the mirror to keep. Once a decision is made, one site is designated as the primary, resynchronization occurs, and the data from the secondary site is lost. Some possible failure scenarios are compared below for configurations with the ForceImport option set to True and set to False.

CAMPUS CLUSTERS

It is becoming commonplace for customers to protect their clusters by utilizing campus clusters to guard against natural disasters and site failure. This practice is also becoming more common as power blackouts become issues that customers must plan into their system recovery process.

A Campus Cluster is a configuration where the nodes of a single cluster are located in separate physical buildings, each with a local storage array and SAN interconnects between the buildings. The image below represents a standard configuration. Implementing a Campus Cluster provides a lightweight form of disaster recovery for environments where a traditional wide-area disaster recovery solution using replication is not suitable. This solution eliminates both the hardware array and the physical building as single points of failure in a cluster, and effectively provides application and data fault tolerance in nearly all failure scenarios, with the exception of campus-wide disasters.

While this configuration is encouraged, the recommended VERITAS disaster recovery implementation includes VERITAS Cluster Server and VERITAS Volume Manager for automated local system and application availability, VERITAS Volume Replicator for data replication to a remote cluster in a geographically distant location, and VERITAS Global Cluster Manager for automation of site-to-site recovery and cross-cluster management. A Campus Cluster configuration can easily be scaled up to this full disaster recovery solution as business requirements evolve.

How does a Campus Cluster respond to a failure?

Automated recovery of an application is handled differently in a Campus Cluster configuration. The table below outlines the most common failure scenarios and describes the behavior of the cluster after each failure. It also compares the behavior of a configuration where the ForceImport option is enabled versus disabled. An administrative alert is automatically triggered in all failure scenarios.

Failure Scenarios

1. An application fault occurs. This could mean the service(s) stopped for an application, the NIC faulted, or a database table went offline.
2. A server failure occurs. The power cord was unplugged, a system hang occurred, or another failure caused the system to stop responding.
3. A disk or array failure occurs. If a single disk fails, a mirrored volume prevents any interruption; this scenario refers to a complete array failure or the failure of all disks.
4. A site (building) failure occurs. All access to the servers and storage at that site is lost, and anything running at that site has failed.
5. Both heartbeat links are lost (split brain). This could be caused by street workers digging up wires, a power failure in the switches, lightning damage, and so on. All links must be lost simultaneously; if one link is disconnected and then another is disconnected later, it does not constitute a split-brain scenario. If the public network link is used as a low-priority heartbeat, it is assumed that link is also lost.
6. The storage interconnect is lost. For reasons similar to #5, the fibre interconnect is severed.
7. Both the heartbeats and the storage interconnect are lost (split brain). If a single conduit between buildings carries both the Ethernet and SAN links, this may occur.

Behavior after each failure, comparing ForceImport set to 0 (disabled) and 1 (enabled):

1. ForceImport = 0: The application automatically moves to another node.
   ForceImport = 1: Same behavior.
2. ForceImport = 0: The application automatically moves to another node.
   ForceImport = 1: Same behavior (100% of the disks are still available).
3. ForceImport = 0: No interruption of service (the remaining disks in the mirror are still accessible from the other site).
   ForceImport = 1: Same behavior.
4. ForceImport = 0: Manual intervention is required to move the application (the disk group cannot be imported with only 50% of the disks available).
   ForceImport = 1: The application automatically moves to the other site.
5. ForceImport = 0: No interruption of service (the disks cannot be imported because the original site still holds the reservation on them).
   ForceImport = 1: Same behavior.
6. ForceImport = 0: No interruption of service (the remaining disks in the mirror are still accessible from the other site).
   ForceImport = 1: Same behavior.
7. ForceImport = 0: No interruption of service (the disk group cannot be imported with only 50% of the disks available).
   ForceImport = 1: The disks are automatically imported at the secondary site; disks are now online in both locations, and data can be kept from only one site.

Reinstating Faulted Hardware

Once a failure occurs and an application is migrated to another node or site, it is important to know what will happen when the original hardware is reinstated. More importantly, a temporary failure may occur through something as simple as unplugging a device and plugging it back in, which could result in an undesired failure scenario.

For failure scenarios 3 through 7 above, the list below describes the behavior when each component (array, site, heartbeats, storage interconnect, and so on) is reinstated after a failure. Scenarios 1 and 2 have no effect when reinstated. Keep in mind that the cluster has already responded to the initial failure as indicated in the table above.

3a. ForceImport = 0: No interruption of service. Resync the mirror from the remote site.
    ForceImport = 1: Same behavior.
4a. ForceImport = 0: Cluster nodes join and see that the application is online at the remote site. Resync the mirror from the remote site.
    ForceImport = 1: Same behavior.
5a. ForceImport = 0: No interruption of service.
    ForceImport = 1: Same behavior.
6a. ForceImport = 0: No interruption of service. Resync the mirror from the original site.
    ForceImport = 1: Same behavior.
7a. ForceImport = 0: No interruption of service. Resync the mirror from the original site.
    ForceImport = 1: VCS alerts the administrator that volumes are online at both sites. Resync the mirror from the copy with the latest data.

While both configurations are similar, the ForceImport option provides automatic failover in the event of a site failure. This advantage comes at the cost of potential data loss if all storage and network communication paths between the sites are severed. Choose the option that is suitable given your cluster infrastructure, uptime requirements, and administrative capabilities.

Volume Manager 3.1 for Windows 2000 and Cluster Server 2.0 for Windows 2000 Service Pack 1 are required for this Campus Cluster implementation. Documentation can be found on the VERITAS technical support web site and in the VERITAS Cluster Server 2.0 Service Pack 1 Release Notes.
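To make the automatic site-failover choice concrete, the fragment below sketches how the ForceImport attribute might be enabled on the disk group resource from the earlier SQL2000 example. It is illustrative only: this paper mentions ForceImport on both the MountV and VMDG resources, so confirm which resource carries the attribute, and its exact name, in the documentation for your VCS release; the resource and disk group names are placeholders.

// Sketch: enabling automatic forced import for the cluster disk group resource.
// With ForceImport = 1, the disk group can be imported at the surviving site
// even when only half of the mirrored disks are visible, trading split-brain
// safety for automatic site failover.
VMDg SQL_DG (
    DiskGroupName = SQL_DiskGroup
    ForceImport = 1
    )

On a running cluster, the same change can typically be made with the hares command, for example: hares -modify SQL_DG ForceImport 1.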

SUMMARY

Ensuring the high availability of applications and data in today's rapidly growing Windows data center environments is a challenge. Many factors can cause downtime: planned downtime to perform system maintenance and necessary upgrades, as well as unexpected faults in software and hardware. VERITAS delivers a fully tested, integrated solution stack of products that provide every level of availability your business may require, from local tape backup to wide-area disaster recovery.

Volume Manager for Windows 2000 builds on the strong foundation of logical volume management and dynamic disks in Windows 2000. It provides advanced storage management capabilities for applications with critical performance or availability requirements and offers the highest level of online disk and volume management capabilities available. VERITAS Cluster Server, market leader in clustering and Network Magazine's Product of the Year 2001, provides enterprise-level high availability in a scalable, easy-to-manage, and integrated solution for medium to global-sized organizations.

Volume Manager for Windows 2000 combined with VERITAS Cluster Server for Windows 2000 creates a flexible clustering solution using commodity hardware that can scale from a 2-node to a 32-node campus cluster and extend to a wide-area disaster recovery plan. This solution creates a foundation for future growth that can be scaled to meet even the most demanding business requirements. To learn more about VERITAS products for Windows, visit http://www.veritas.com.