VMware HA: Overview & Technical Best Practices

Similar documents
vsphere Availability Guide

Virtualization with VMware ESX and VirtualCenter SMB to Enterprise

Virtualization with VMware ESX and VirtualCenter SMB to Enterprise

vsphere Availability Update 1 ESXi 5.0 vcenter Server 5.0 EN

Availability & Resource

VMware vsphere with ESX 4.1 and vcenter 4.1

Securing the Data Center against

VMware Overview VMware Infrastructure 3: Install and Configure Rev C Copyright 2007 VMware, Inc. All rights reserved.

New Features in VMware vsphere (ESX 4)

VMware vsphere Administration Training. Course Content

vsphere Availability 17 APR 2018 VMware vsphere 6.7 VMware ESXi 6.7 vcenter Server 6.7

"Charting the Course... VMware vsphere 6.7 Boot Camp. Course Summary

vsphere Availability Update 1 VMware vsphere 6.5 VMware ESXi 6.5 vcenter Server 6.5 EN

Setting Up Cisco Prime LMS for High Availability, Live Migration, and Storage VMotion Using VMware

VMware vsphere with ESX 4 and vcenter

VMware Join the Virtual Revolution! Brian McNeil VMware National Partner Business Manager

Microsoft E xchange 2010 on VMware

Business Continuity and Disaster Recovery Disaster-Proof Your Business

Potpuna virtualizacija od servera do desktopa. Saša Hederić Senior Systems Engineer VMware Inc.

Improving Blade Economics with Virtualization

By the end of the class, attendees will have learned the skills, and best practices of virtualization. Attendees

VMware vsphere with ESX 6 and vcenter 6

Exam : VMWare VCP-310

Network Design Considerations for VMware Deployments. Koo Juan Huat

Automated Disaster Recovery. Simon Caruso Senior Systems Engineer, VMware

VMware vsphere 6.5 Boot Camp

Directions in Data Centre Virtualization and Management

Exploring Options for Virtualized Disaster Recovery

Exam4Tests. Latest exam questions & answers help you to pass IT exam test easily

VMware vsphere 5.0 Migration vcenter Server 5.0 Pre- Upgrade Checklist (version 1)

VMware vsphere 5.5 Advanced Administration

Technical Field Enablement. Symantec Messaging Gateway 10.0 HIGH AVAILABILITY WHITEPAPER. George Maculley. Date published: 5 May 2013

VMware vsphere 4.0 The best platform for building cloud infrastructures

Better Security with Virtual Machines

VMware vsphere. Using vsphere VMware Inc. All rights reserved

Virtualization And High Availability. Howard Chow Microsoft MVP

Protecting Mission-Critical Workloads with VMware Fault Tolerance W H I T E P A P E R

predefined elements (CI)

The Realities of Virtualization

Exploring Options for Virtualized Disaster Recovery. Ranganath GK Solution Architect 6 th Nov 2008

VMware ESX Server 3i. December 2007

Virtualizing Microsoft Exchange Server 2010 with NetApp and VMware

VMware vcenter Site Recovery Manager 4.1 Evaluator s Guide EVALUATOR'S GUIDE

Logical Operations Certified Virtualization Professional (CVP) VMware vsphere 6.0 Level 1 Exam CVP1-110

Setup for Microsoft Cluster Service Update 1 Release for ESX Server 3.5, ESX Server 3i version 3.5, VirtualCenter 2.5

EXAM - VCP5-DCV. VMware Certified Professional 5 Data Center Virtualization (VCP5-DCV) Exam. Buy Full Product.

Virtualizing Business- Critical Applications with Confidence TECHNICAL WHITE PAPER

NICE Perform Virtualization Solution Overview

SRM Evaluation Guide First Published On: Last Updated On:

Technical White Paper

Vmware VCP550PSE. VMware Certified Professional on vsphere 5.

VMware vcenter Server Performance and Best Practices

VMware Exam VCP-511 VMware Certified Professional on vsphere 5 Version: 11.3 [ Total Questions: 288 ]

Getting Started with ESX Server 3i Installable Update 2 and later for ESX Server 3i version 3.5 Installable and VirtualCenter 2.5

Virtualization Overview

Getting Started with ESX Server 3i Embedded ESX Server 3i version 3.5 Embedded and VirtualCenter 2.5

Foundation for Cloud Computing with VMware vsphere 4

1V0-642.exam.30q.

Virtual Security Gateway Overview

Virtual Appliance User s Guide

E V A L U A T O R ' S G U I D E. VMware vsphere 4 Evaluator s Guide

Virtual Recovery for Real Disasters: Virtualization s Impact on DR Planning. Caddy Tan Regional Manager, Asia Pacific Operations Double-Take Software

Virtual Datacenter Automation

Oracle Real Application Clusters One Node

vsphere Networking Update 1 ESXi 5.1 vcenter Server 5.1 vsphere 5.1 EN

Evaluator Guide. Site Recovery Manager 1.0

Managing VMware vcenter Site Recovery Manager

Using EonStor DS Series iscsi-host storage systems with VMware vsphere 5.x

AB Drives. T4 - Process Control: Virtualization for Manufacturing. Insert Photo Here Anthony Baker. PlantPAx Characterization & Lab Manager

vsphere 5/6: Install, Configure, Manage Review Questions

Administering VMware vsphere and vcenter 5

VMware vsphere Metro Storage Cluster Recommended Practices November 08, 2017

Overview. Prerequisites. VMware vsphere 6.5 Optimize, Upgrade, Troubleshoot

The Future of Virtualization Desktop to the Datacentre. Raghu Raghuram Vice President Product and Solutions VMware

PAC485 Managing Datacenter Resources Using the VirtualCenter Distributed Resource Scheduler

ESX Server 3 Configuration Guide ESX Server 3.5 and VirtualCenter 2.5

vsphere 4 The Best Platform for Business-Critical Applications Gaetan Castelein Sr Product Marketing Manager VMware, Inc.

Migration. 22 AUG 2017 VMware Validated Design 4.1 VMware Validated Design for Software-Defined Data Center 4.1

Table of Contents VSSI VMware vcenter Infrastructure...1

Resource Management Guide ESX Server and VirtualCenter 2.0.1

Architecting DR Solutions with VMware Site Recovery Manager

VMware - VMware vsphere: Install, Configure, Manage [V6.7]

Sophos for Virtual Environments. startup guide -- Sophos Central edition

1V0-621.testking. 1V VMware Certified Associate 6 - Data Center Virtualization Fundamentals Exam

Setup for Failover Clustering and Microsoft Cluster Service

vcenter Operations Management Pack for NSX-vSphere

Disaster Recovery-to-the- Cloud Best Practices

The Future of Virtualization. Jeff Jennings Global Vice President Products & Solutions VMware

BC/DR Strategy with VMware

VMware Infrastructure 3

EMC CLARiiON CX3-40. Reference Architecture. Enterprise Solutions for Microsoft Exchange Enabled by MirrorView/S

Logical Operations Certified Virtualization Professional (CVP) VMware vsphere 6.0 Level 2 Exam CVP2-110

Table of Contents HOL SLN

Configuration Cheat Sheet for the New vsphere Web Client

Question No: 2 What three shares are available when configuring a Resource Pool? (Choose three.)

Resource Management Guide. ESX Server 3 and VirtualCenter 2

1 Quantum Corporation 1

Professional on vsphere

"Charting the Course... VMware vsphere 6.5 Optimize, Upgrade, Troubleshoot. Course Summary

Veritas Storage Foundation for Windows by Symantec

Transcription:

VMware HA: Overview & Technical Best Practices Updated 8/10/2007 What is Business Continuity? Business Continuity = Always-on uninterrupted availability of business systems and applications Business Continuity is about eliminating downtime (MTBF) and reducing time to recovery (MTTR) Availability = MTBF / (MTBF + MTTR) Type of Downtime Unplanned downtime Planned downtime Disasters Business Continuity Component High availability Disaster recovery 1

Business Continuity in a Virtualized World Virtualization can improve business continuity by: Lowering the cost and complexity of achieving business continuity Protecting every workload virtualized in an OS & application independent manner Business continuity related features & benefits of VMware Infrastructure: Encapsulation & HW independence: Simple disaster recovery ESX NIC teaming & HBA Multipathing: Shared hw redundancy VMotion: Eliminates & automates planned downtime VMware HA: Detects & responds to unplanned downtime DRS: Balances workloads & optimizes resource utilization Consolidated Backup: Protects data & integrates with other tools Operating System based Clustering vs. VMware HA In practice, VMware HA and the other features of VI3 allow many customers to achieve comparable or higher levels of availability vs. legacy solutions Application protection APPLICATION FAILURES OS FAILURES HARDWARE FAILURES UNPROTECTED Less than 10% of x86 workloads are clustered today due to costs & complexities VMware HA can provide hardware failure protection for 90%+ of remaining workloads with VMware HA 0% 10% 100% Application Coverage 2

High Availability with VMware HA What is VMware HA? Automatic restart of virtual machines in case of physical server failures X Resource Pool Customer Benefits: High availability with minimal capital & operational costs Reduced need for passive stand-by hardware & dedicated administrators Broadly applicable All x86 workloads virtualized are automatically protected without agents or scripting Manageability & Flexibility Very simple to deploy and maintain Application manageability unchanged N-to-N failovers, hosts and VMs easily moved in and out of clusters Detecting Individual Virtual Machine Failures VMware HA does not detect individual VM failures Alternatives can be used for this purpose: VMware Tools heartbeats & VirtualCenter alarms 3 rd party monitoring agents running inside guests 3

VMware HA Cluster Configuration: Step 1 VMware HA cluster configuration is composed of two steps: 1) Cluster-wide policies 2) Individual VM customizations Spare capacity can be reserved to tolerate multiple host failures Option #1: Allow HA to calculate & enforce spare capacity requirements (*see best practices for details) See appendix for details regarding advanced settings Option #2: Disable heuristics used by HA for capacity planning Host Failure Capacity Planning VMware HA enforces failover capacity by using memory & CPU reservations to estimate the number of VMs able to recover from multiple host failures Examples (assuming a 3 node HA cluster): Failover capacity of 1 limits number of running VMs to fit onto 2 hosts Failover capacity of 2 limits number of running VMs to fit onto 1 host 4

VMware HA Cluster Configuration: Step 2 Isolation Response Initiated when a host experiences network isolation from the rest of the cluster Power off is the default Assumes proper network redundancy makes isolation events very rare Restart Priority Use Low/Medium/High restart priorities to customize failover ordering Default = Medium, High priority VMs are restarted first Non-essential VMs should be set to Disabled (automated restart will skip them) Leave power on is intended for cases where: Lack of redundancy & environmental factors make outages likely VM networks are separate from service console (and more reliable) VMware HA Under the Covers HA agents are installed & configured on ESX Hosts via VirtualCenter Post configuration, HA agents maintain a heartbeat network Heartbeat NW VirtualCenter Ability to perform failovers is independent from VirtualCenter availability VMFS & shared storage allows hosts to access & power-on virtual machines Distributed locking prevents simultaneous access to protect data integrity During a failover, quick restart is the primary goal DRS algorithms balance workloads after HA has recovered virtual machines 5

Cluster Configuration & Communication Details Cluster nodes are designated as Primary or Secondary Primary nodes maintain a synchronized view of entire cluster Up to 5 primary nodes per cluster Secondary nodes are managed by the primary nodes Service console network(s) are used for heartbeats and state synchronization Minimal network activity in steady state (5 second heartbeat intervals) Additional light traffic during node configuration & VM power operations Incoming ports used: TCP/UDP 8042-8045 Outgoing ports used: TCP/UDP 2050-2250 Failure Detection & Failover Details ESX host failures are detected via missing heartbeats Failure detection is initiated after 15 seconds Follow-up network ping determines node failure vs. HA agent failure, and a failover is performed if no response is received Failover operations are coordinated as follows: 1. One of the primary hosts is selected to coordinate the failover, and one of the remaining hosts with spare capacity becomes failover target 2. Virtual machines affected are sorted by priority, and powered on until failover target runs out of spare capacity 3. If the host selected as coordinator fails, another primary continues the effort 4. If one of the hosts that fails was a primary node, one of the remaining secondary nodes is promoted to being a primary 6

Isolation Response Details Network failures can cause split-brain conditions In such cases, hosts are unable to determine if the rest of cluster has failed or has become unreachable Isolation response is used to prevent splitbrain conditions and is started when: A host has stopped receiving heartbeats from other cluster nodes AND the isolation address cannot be pinged The default isolation address is the ESX service console gateway, and the default isolation response time is 15 seconds (*can be changed, see appendix) Virtual Machine? Virtual Machine Virtual Machine Virtual Machine Virtual Machine Powering virtual machines off releases VMFS locks, enables other hosts to recover When Leave power on option is set, virtual machines may require manual power-off / migration in case of an actual network isolation VMware HA Best Practices Setup & Networking 1. Proper DNS & Network settings are needed for initial configuration After configuration DNS resolutions are cached to /etc/ft_hosts (minimizing the dependency on DNS server availability during an actual failover) DNS on each host is preferred (manual editing of /etc/hosts is error prone) 2. Redundancy to ESX Service Console is essential (several options) Choose the option that minimizes single points of failure Gateways should be IP addresses that will respond via ICMP (ping) Enable PortFast (or equivalent) on network switches to avoid spanning tree isolations 3. Network maintenance activities should take into account dependencies on the ESX Service Console network(s) VMware HA can be temporarily disabled through the Cluster->Edit Settings dialog 4. Valid VM network label names required for proper failover Virtual machines use them to re-establish network connectivity upon restart 7

Network Configuration Network redundancy between the ESX service consoles is essential for reliable detection of host failures & isolation conditions ESX Host A single service console network with underlying redundancy is usually sufficient: Use a team of 2 NICs connected to different physical switches to avoid a single point of failure Configure vnics in vswitch for Active/Standby configuration (rolling failover = yes, default load balancing = route based on originating port ID) X Consider extending timeout values & adding multiple isolation addresses (*see appendix) Timeouts of 30-60 seconds will slightly extend recovery times, but will also allow for intermittent network outages Network Configuration (continued) Beyond NIC teaming, a secondary service console network can be configured to provide redundant heartbeating & isolation detection HA will detect and use a secondary service console network Adding a secondary service console portgroup to an existing VMotion vswitch avoids having to dedicate an additional subnet & NIC for this purpose Also need to specify an additional isolation address for the cluster to account for the added redundancy (*see appendix) Continue using the primary service console network & IP address for management purposes Be careful with network maintenance that affects the primary service console network and the secondary / VMotion network 8

VMware HA Best Practices Resource Management 1. Larger groups of homogenous servers will allow higher levels of utilization across an HA/DRS enabled cluster (on average) More nodes per cluster (current maximum is 16) can tolerate multiple host failures while still guaranteeing failover capacities Admission control heuristics are conservatively weighted; must account for extra spare capacity so that large servers with many VMs can failover to small servers 2. To define the sizing estimates used for admission control, set reasonable reservations as the minimum resources needed Admission control will allow unlimited VMs to power on without any reservations set. When reservations are set, HA will use largest reservation specified as the slot size. At a minimum, set reservations for a few virtual machines considered average 3. Pay attention to cluster warnings and messages Clusters marked with a YELLOW icon indicate capacity is overcommitted according to DRS; while clusters marked with RED indicate HA may be unable to restart all virtual machines subject to the availability constraints specified Troubleshooting VMware HA 1. Verify the following: IP connectivity, DNS resolution, shared storage and networks are visible throughout the cluster, service consoles have valid & reachable gateways 2. Re-initialize HA cluster configuration Per host: Select ESX Host->Summary Tab->Reconfigure for HA Per cluster: Select Cluster->Edit Settings->Uncheck HA enabled, wait for reconfiguration task to complete, and then check to re-enable 3. Inspect ESX Server logs and HA agent configuration files: Check for ESX Service Console networking errors first, HA agents next HA agent logs: /opt/lgtoaam512/log/* & /opt/lgtoaam512/vmsupport/* /opt/lgtoaam512/config/vmware-sites contains a list of cluster nodes 9

Appendix: HA Customizations Setting preferred failover hosts for individual virtual machines Select Virtual Machine->Edit Settings->Advanced->Configuration Parameters Add the das.defaultfailoverhost = <hostname> option/value pair to the virtual machine s parameters where <hostname> represents the short name of the preferred host The host specified will be used as a first choice, but HA will automatically select an alternate in case of insufficient capacity Adjusting the default timeout used for failure & isolation detection New in VC 2.0.2: the response time can be configured to be different than 15 seconds (15000 ms). 60 seconds (60000 ms) is an alternative commonly used. Select a cluster->vmware HA->Advanced Options Add the das.failuredetectiontime = <value> option/value pair to the cluster s settings where <value> represents the desired timeout value in milliseconds VMware HA will not declare a host failure nor initiate an isolation detection response until the timeout value specified has been exceeded without heartbeats received Appendix: HA Customizations Changing the default isolation response address By default, HA uses the gateway specified in each hosts service console network configuration as the isolation address to query as the last step in case of isolation Select a cluster->vmware HA->Advanced Options, and add the das.isolationaddress = <value> option/value pair to the cluster s settings where <value> represents the IP address to use as an override to the gateway Use a reliable IP addresses reachable with fewest failure paths possible Setting more than one isolation response address New in VC 2.0.2: more than one isolation response address can be specified, and each service console network should have a different isolation response address defined (by default, only the gateway of the first console network is used) Select a cluster->vmware HA->Advanced Options, and add the das.isolationaddress2 = <value2> option/value pair to the cluster s settings where <value2> represents the secondary IP addresses to use The default timeout value should also be increased to 20 seconds (20000 ms) or greater when a secondary isolation address has been specified 10

Appendix: Applying HA Customizations To apply HA customizations: 1. Create an HA/DRS cluster in the VirtualCenter inventory 2. Set option/value pairs 3. Disable VMware HA, wait for reconfiguration task to complete, and then re-enable HA to have settings take effect Scenario #1: Single service console network with teamed NICs Some risk assumed by configuring hosts in cluster with only 1 service console network (subnet 10.20.XX.XX; 2 teamed NICs used to protect from NIC failure Default timeout increased to 60 seconds (das.failuredetectiontime = 60000) Scenario #2: Redundant service console networks Each host in cluster configured with 2 service console networks by leveraging an existing VMotion network (subnets 10.20.YY.YY and 192.168.ZZ.ZZ) Default gateway used for the first network, das.isolationaddress2 = 192.168.1.103 specified as additional isolation address for the second network Default timeout increased to 20 seconds (das.failuredetectiontime = 20000) VMware Product & Solution Areas Infrastructure Optimization Business Continuity Software Lifecycle Automation Virtual Clients and Desktops Deploy virtual machines that run safely and move transparently across shared hardware. > Consolidate servers > Reduce data center operating costs: real estate, power, cooling > Includes the industry s most widely deployed Virtualization Infrastructure suite Keep systems up and running through simple, reliable data protection and pervasive failover protection. >Reduce planned and unplanned downtime >Reduce cost and complexity of high availability >Simplify disaster recovery Leverage assets and speed software development by automating the setup, sharing and storage of multi-machine configurations. >Rapidly provision machines >Improve software quality Secure unmanaged PCs while retaining end-user autonomy by layering a security policy in software around desktop virtual machines. >For enterprises and end users >Improve security and mobility 11