VMware vSphere 6.0 on NetApp MetroCluster
September 2016 | SL10214 | Version 1.0.1
TABLE OF CONTENTS
1 Introduction
2 Lab Environment
3 Lab Activities
  3.1 Tour Environment
  3.2 Unplanned Switchover: High Availability for Business Continuity
  3.3 MetroCluster Healing and Switchback for Disaster Recovery
  3.4 Planned Switchover: High Availability for Non-Disruptive Operations
4 Lab Limitations
5 References
6 Version History
1 Introduction
The combination of VMware infrastructure and NetApp MetroCluster resolves a number of customer challenges from both the server and the storage perspectives. VMware infrastructure provides a robust server consolidation solution with high application availability. MetroCluster provides continuous access to your data within a data center, across a campus, or in a metro area.

Lab Objectives
This lab demonstrates various failure scenarios (host, network, and storage) and identifies NetApp best practices for vSphere on MetroCluster.
Figure 1-1: VMware vSphere Stretched Cluster on replicating NetApp MetroCluster
Figure 1-2: VMware vSphere Stretched Cluster on switched-over NetApp MetroCluster

Prerequisites
This lab assumes minimal prior experience with NetApp and VMware products.
2 Lab Environment
All of the servers and storage controllers presented in this lab are virtual devices, and the networks that interconnect them are exclusive to your lab session. The virtual storage controllers (vsims) offer nearly all of the same functionality as physical storage controllers, but at a reduced performance profile. Currently, the main difference between virtual and physical storage controllers is that vsims do not offer HA support.
Figure 2-1:

Table of Systems

Host Name   | Operating System            | Role/Function                          | IP Address
------------|-----------------------------|----------------------------------------|--------------
cluster1    | clustered Data ONTAP 8.3.1  | cluster 1 - Site A                     | 192.168.0.101
cluster1-01 | clustered Data ONTAP 8.3.1  | cluster 1 node 1                       | 192.168.0.111
cluster1-02 | clustered Data ONTAP 8.3.1  | cluster 1 node 2                       | 192.168.0.112
cluster2    | clustered Data ONTAP 8.3.1  | cluster 2 - Site B                     | 192.168.0.102
cluster2-01 | clustered Data ONTAP 8.3.1  | cluster 2 node 1                       | 192.168.0.121
cluster2-02 | clustered Data ONTAP 8.3.1  | cluster 2 node 2                       | 192.168.0.122
jumphost    | Windows Server 2012 R2      | primary desktop entry point for lab    | 192.168.0.5
dc1         | Windows Server 2008 R2      | Active Directory / DNS                 | 192.168.0.253
esx1        | VMware vSphere 6.0          | ESX Site A host 1                      | 192.168.0.51
esx2        | VMware vSphere 6.0          | ESX Site A host 2                      | 192.168.0.52
esx3        | VMware vSphere 6.0          | ESX Site B host 1                      | 192.168.0.53
esx4        | VMware vSphere 6.0          | ESX Site B host 2                      | 192.168.0.54
vc1         | Windows Server 2012 R2      | VMware vSphere 6.0 vCenter             | 192.168.0.31
vasa        | Linux Appliance             | NetApp VASA Provider                   | 192.168.0.34
ocum        | Linux Appliance             | NetApp OnCommand Unified Manager       | 192.168.0.71
tiebreak    | Red Hat Ent. Linux 6.5      | Linux client running NetApp TieBreaker | 192.168.0.66
sitea_vm1   | Debian Linux                | nested VM to demonstrate survivability | 192.168.0.41

User IDs and Passwords

Host Name | User ID                     | Password | Comments
----------|-----------------------------|----------|----------------------------------
jumphost  | DEMO\Administrator          | Netapp1! | Domain Administrator
cluster1  | admin                       | Netapp1! | Same for individual cluster nodes
svm...    | vsadmin                     | Netapp1! | SVM administrator
vasa      | vpserver                    | Netapp1! | Administrator
vc1       | Administrator@vsphere.local | Netapp1! | Local Admin for vCenter
ocum      | Administrator or admin      | Netapp1! | OCUM Administrator
tiebreak  | root                        | Netapp1! | Linux Administrator
sitea_vm1 | root                        | Netapp1! | Linux Administrator
3 Lab Activities
Key NetApp capabilities will be highlighted:
- Tour Environment
- Unplanned Switchover: High Availability for Business Continuity
- MetroCluster Healing and Switchback for Disaster Recovery
- Planned Switchover: High Availability for Non-Disruptive Operations

3.1 Tour Environment
This lab activity demonstrates a NetApp MetroCluster setup simulating two sites. It familiarizes you with the various management interfaces available to users of clustered Data ONTAP with MetroCluster enabled.
1. Immediately after the first connection to the lab, a PowerShell script runs to finish preparing the lab for initial use. Allow this window to finish running the script before you continue.
Figure 3-1:
2. When the preparatory script completes, the browser opens. Click Advanced to bypass the invalid certificate warning.
Figure 3-2:
3. Click Proceed to vc1.demo.netapp.com.
Figure 3-3:
4. Enter the credentials DEMO\Administrator with password Netapp1!.
5. Click Login.
Figure 3-4:
6. Click Hosts and Clusters.
Figure 3-5:
7. Right-click the VM sitea_vm1 and navigate to Power > Power On.
Figure 3-6:
8. Select the recommendation to start the VM on esx1.demo.netapp.com.
9. Click OK.
Note: If you are presented with a warning, click Answer Question and select I Moved It as the answer.
Figure 3-7:
10. If you see a stale warning in the Alarms panel, right-click it and select Reset to Green to clear it.
Figure 3-8:
While the VM starts, note that there are four ESX hosts in this lab: ESX1 and ESX2 provide the compute layer at Site A, while ESX3 and ESX4 are at Site B. All four hosts belong to the same VMware High Availability (HA) cluster; this is known as a stretched VMware cluster.
11. Right-click the MCC HA Cluster and select Settings.
Figure 3-9:
12. Select VM/Host Groups under Configuration.
13. Note that there are three groups defined: Host Group A, Host Group B, and a VM group for the sitea_vms.
Figure 3-10:
14. Next, click VM/Host Rules.
15. Select sitea_affinity and click Edit.
Figure 3-11:
16. Examine the rule, then click Cancel.
Figure 3-12:
17. Minimize the browser with vCenter open, then double-click the link for OnCommand System Manager (NetApp OCSM Cluster1).
Figure 3-13:
18. Log in as admin with password Netapp1!.
Figure 3-14:
19. Navigate to Storage Virtual Machines and click cluster1.
20. Examine the state of each SVM: svm1 is running with configuration state Unlocked, while svm2-mc is stopped and Locked.
Figure 3-15:
21. Minimize the browser, then double-click the link for OCSM Cluster2.
Figure 3-16:
22. Log in as admin with password Netapp1!.
Figure 3-17:
23. Navigate to Storage Virtual Machines and click cluster2.
24. Examine the state of each SVM: svm2 is running with configuration state Unlocked, while svm1-mc is stopped and Locked (the opposite of cluster1).
Figure 3-18:
25. Minimize the browser, then double-click the link Log into OnCommand Unified Manager.
Figure 3-19:
26. Click Advanced to bypass the invalid certificate warning.
Figure 3-20:
27. Click Proceed to ocum.demo.netapp.com.
Figure 3-21:
28. Sign in with user name Administrator (or admin) and password Netapp1!.
Figure 3-22:
29. In the Availability pane in the Quick Takes area, click the 2 under Clusters.
Figure 3-23:
30. Click cluster1.
Figure 3-24:
31. Click the Configuration tab and examine the MetroCluster details.
Figure 3-25:
32. Click the MetroCluster Connectivity tab.
33. Examine the map.
Note: This lab does not include bridges or physical switches; however, the switching layer would display here if it were available.
Figure 3-26:
34. Next, click the MetroCluster Replication tab and examine the map that identifies the replication.
35. Note that there is an error/warning on an aggregate, displayed in orange.
Figure 3-27:
36. Click Events, or hover over the issue and click View Details for more information.
37. Select the event, then click Acknowledge.
Figure 3-28:
38. Click Mark As Resolved.
Figure 3-29:
39. All events are now cleared. Minimize the browser.
Figure 3-30:
40. Double-click PuTTY on the desktop.
Figure 3-31:
41. Scroll down to the saved session for tiebreak.
42. Double-click the session, or click Load and then Open.
Figure 3-32:
43. Log in as root with password Netapp1!.
44. Issue the netapp-metrocluster-tiebreaker-software-cli command (hint: type netapp, then press Tab).
Note: If your prompt does not change after entering the command, type reboot -h now and wait a few minutes. Then log back into the TieBreaker server and try again.
netapp-metrocluster-tiebreaker-software-cli
Figure 3-33:
45. Issue the monitor show -status command.
monitor show -status
46. Verify that all clusters and nodes show true for Reachable, and that State is set to normal.
Figure 3-34:
47. For additional details, issue the monitor show -stats command.
monitor show -stats
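The reachability check in the last few steps lends itself to scripted monitoring. The sketch below is a minimal, hypothetical example: it assumes you have already captured the text of monitor show -status (for example over SSH to root@tiebreak) and simply scans it for a "Reachable ... false" pattern. The sample strings are illustrative only, not the TieBreaker CLI's exact output format.

```shell
#!/bin/sh
# Hypothetical monitoring sketch: flag any cluster or node that the TieBreaker
# reports as unreachable. Field names follow the lab text ("Reachable",
# true/false); the real CLI output layout may differ.

check_reachability() {
    # $1 = captured text of "monitor show -status"
    if printf '%s\n' "$1" | grep -qi 'reachable.*false'; then
        echo "ALERT: at least one monitored cluster or node is unreachable"
    else
        echo "OK: all monitored clusters and nodes are reachable"
    fi
}

# Illustrative sample only -- in the lab, this text would come from the
# TieBreaker session (log in to tiebreak, then run: monitor show -status).
sample="Cluster cluster1 Reachable: true
Cluster cluster2 Reachable: true"

check_reachability "$sample"
```

With the sample above, the script prints the OK line; feeding it output that contains a false Reachable value produces the ALERT line instead.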
Figure 3-35:
48. Leave PuTTY open to prepare for the next part of the lab.

3.2 Unplanned Switchover: High Availability for Business Continuity
In this activity you create a site failure at Site A to simulate an unplanned MetroCluster switchover event. You perform the switchover steps manually; in a production environment, this functionality could be triggered through scripting when specific events occur and conditions are met.
1. To create the site failure, temporarily move away from the lab and go to the tab that includes the lab guide and the Lab Control Panel.
2. Click the blue FAIL SITE A button.
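As noted above, in production this switchover could be script-triggered. The following dry-run sketch shows one possible shape for such automation; the ssh transport, the admin@cluster2 target, and the ping-based health check are assumptions for illustration only, and a real deployment would need far more careful fencing before forcing a switchover.

```shell
#!/bin/sh
# Dry-run planner for an unplanned (forced) switchover, mirroring the manual
# steps in this activity. It only PRINTS the commands; nothing is executed
# against the clusters. Hostnames and the ssh transport are assumptions.

plan_forced_switchover() {
    survivor=$1   # cluster that takes over, e.g. cluster2

    # 1. Confirm the partner really is down before forcing anything
    echo "ping -c 3 cluster1   # expect failure before proceeding"
    # 2. Preview, then force, the switchover from the surviving cluster
    echo "ssh admin@$survivor 'set -privilege advanced; metrocluster switchover -simulate'"
    echo "ssh admin@$survivor 'set -privilege advanced; metrocluster switchover -forced-on-disaster'"
}

plan_forced_switchover cluster2
```

The printed sequence matches the order used in the steps that follow: verify the site is unreachable, simulate, then force the switchover.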
Figure 3-36:
3. A dialog box appears confirming that the ESX1, ESX2, cluster1-01, and cluster1-02 nodes will be powered off to simulate a site failure. Click OK.
Figure 3-37:
4. Return to the Remote Desktop tab in the lab and double-click PuTTY.
5. Double-click cluster2 in Saved Sessions, or highlight cluster2 and click Load, then Open.
Figure 3-38:
6. Log in as admin with password Netapp1!.
7. Issue the metrocluster show command (hint: type me, then press Tab for autocompletion).
metrocluster show
8. Note that from cluster2, cluster1 is now reported as not-reachable.
Figure 3-39:
9. Keep the PuTTY session to cluster2 open, and return to the PuTTY session to TieBreaker.
10. Issue the monitor show -status command.
monitor show -status
11. Examine cluster1 and notice that its nodes are now listed as false for Reachable.
Figure 3-40:
12. Return to the VMware vSphere Web Client and note that ESX1 and ESX2 are not responding and that sitea_vm1 is disconnected.
Figure 3-41:
13. Return to the PuTTY session for cluster2.
14. Issue the set -privilege advanced command.
set -privilege advanced
15. Press y to continue.
16. Issue the metrocluster switchover -simulate command.
metrocluster switchover -simulate
Figure 3-42:
17. Issue the cluster peer show command.
cluster peer show
Figure 3-43:
18. Now perform the switchover by issuing the metrocluster switchover -forced-on-disaster command.
metrocluster switchover -forced-on-disaster
19. Press y to continue.
Figure 3-44:
20. Issue the metrocluster show command.
metrocluster show
Figure 3-45:
21. Return to the TieBreaker PuTTY session.
22. Issue the monitor show -status command.
monitor show -status
23. Verify that the state is listed as switchover completed.
Note: Press the Up arrow and Enter to rerun the command and refresh the monitor state.
Figure 3-46:
24. Maximize the browser and return to the tab for cluster2. Verify that both SVMs are now being served from cluster2.
Figure 3-47:
25. Return to the VMware vSphere Web Client and note that the VM is now running on esx3. If it powered off, do not attempt to power it on; once the surviving hypervisors see the storage online, they will complete the host affinity failure action.
26. Click the refresh button and wait for the VM failure response to return the VM to a powered-on state.
27. Note that the VM is running on compute at Site B and is being served from a replicated version of the Site A SVM, now running at Site B.
Figure 3-48:
28. Return to the OCUM tab and click Dashboard.
Figure 3-49:
29. OCUM has a fast notification feature that pulls in critical events associated with the MetroCluster immediately. However, other events are subject to the standard 15-minute polling interval. To address this in the demonstration, go to the Actions drop-down menu and select Rediscover from the Storage > Cluster inventory list.
30. Navigate to cluster2's MetroCluster Connectivity tab.
Figure 3-50:
31. Navigate to cluster2's MetroCluster Replication tab.
Figure 3-51:
32. Go to Events and examine the events being reported. Select all of them and click Acknowledge.
Figure 3-52:

3.3 MetroCluster Healing and Switchback for Disaster Recovery
Following an unplanned or planned switchover event, you can perform certain steps to restore the MetroCluster to full health. You must perform healing first on the data aggregates and then on the root aggregates. This process resynchronizes the data aggregates and prepares the disaster site for normal operation. In this activity you perform the healing and switchback tasks.
1. Open a PuTTY session to cluster2.
2. Issue the metrocluster show command.
metrocluster show
3. Verify that the Mode for cluster2 is switchover.
Figure 3-53:
4. If you have an asterisk in the prompt, skip this step; otherwise, issue the set -priv advanced command.
5. Issue the metrocluster heal -phase aggregates command.
6. After the data aggregates have been healed, you must heal the root aggregates in preparation for the switchback operation. Issue the metrocluster heal -phase root-aggregates command.
metrocluster heal -phase aggregates
metrocluster heal -phase root-aggregates
Figure 3-54:
7. Now that healing is complete, go back to the Lab Control Panel and restart the nodes.
8. Click the blue button labeled RESTART SITE A.
Figure 3-55:
9. Click OK to acknowledge that the nodes are restarting.
Figure 3-56:
10. Return to the lab, open the OCUM tab, select all the events, and click Mark As Resolved.
Figure 3-57:
11. Launch a PowerShell window.
Figure 3-58:
12. Issue ping -t cluster1 and let it run. Continue to the next step.
ping -t cluster1
Figure 3-59:
13. Go to the VMware vSphere Web Client tab and click refresh a few times over about 1-2 minutes; ESX1 and ESX2 will come online. Note that the storage is still online because the VMware datastores are being served from the replicated copy of the SVM at Site B.
14. Right-click sitea_vm1 and click Migrate.
Figure 3-60:
15. Choose Change compute resource only, and click Next.
Figure 3-61:
16. Select esx1 and click Next.
Figure 3-62:
17. Click Next.
Figure 3-63:
18. Click Next.
Figure 3-64:
19. Click Finish.
Figure 3-65:
20. The VM now relocates back to a compute host at Site A, while the SVM is still serving the datastore from Site B.
Figure 3-66:
21. By this point, the continuous ping in PowerShell may show that the management LIF for cluster1 is available. If not, wait for it to become reachable, then close the PowerShell window.
Figure 3-67:
22. Return to the TieBreaker PuTTY session and issue the monitor show -status command.
monitor show -status
23. Examine the cluster1 state.
Figure 3-68:
24. Return to the PuTTY session for cluster2 and issue the cluster peer show command.
cluster peer show
Figure 3-69:
25. Issue the metrocluster show command.
metrocluster show
Figure 3-70:
26. If you have an asterisk in the prompt, skip this step; otherwise, issue the set -priv advanced command.
set -priv advanced
27. Issue the metrocluster switchback -simulate command.
metrocluster switchback -simulate
Figure 3-71:
28. After confirming that the operation was successful, issue the metrocluster switchback command.
metrocluster switchback
29. Press y to continue.
Figure 3-72:
30. Issue the metrocluster show command.
metrocluster show
Figure 3-73:
31. Return to OCSM for cluster2, refresh the SVM state, and verify that svm1-mc is now stopped.
Figure 3-74:
32. Log back in to OCSM for cluster1, refresh the SVM state, and verify that svm1 is running.
Figure 3-75:
33. Return to TieBreaker and issue the monitor show -status command.
monitor show -status
34. Note that the state is now normal and Reachable is true.
Figure 3-76:
35. Return to OCUM and examine any events listed.
36. Select the events and click Acknowledge.
Figure 3-77:
37. Keep the events selected and click Mark As Resolved.
Figure 3-78:
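For reference, the healing-and-switchback commands used in this activity follow a strict order: data aggregates before root aggregates, and simulate before switchback. The sketch below is a dry-run planner that just prints that sequence; in the lab the real commands are run in an advanced-privilege session on cluster2, as shown in the steps above.

```shell
#!/bin/sh
# Ordered command plan for MetroCluster healing and switchback, as performed
# in this activity. Printing only -- nothing is executed here.

plan_heal_and_switchback() {
    cat <<'EOF'
metrocluster heal -phase aggregates
metrocluster heal -phase root-aggregates
metrocluster switchback -simulate
metrocluster switchback
EOF
}

plan_heal_and_switchback
```

Keeping the plan as a function makes the ordering explicit and easy to review before any command is actually issued against a cluster.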
3.4 Planned Switchover: High Availability for Non-Disruptive Operations
A planned, or negotiated, switchover can be used for non-disruptive operations (NDO). A negotiated switchover cleanly shuts down processes on the partner site and then switches over operations from that site. You can use it to perform maintenance on a MetroCluster site or to test the switchover functionality, and then heal and switch back the configuration. Unlike a forced switchover, which is used when one of the clusters is unreachable, this event performs the negotiated switchover before the controlled power-off begins. In this activity you migrate the VM to compute at Site B and have the NetApp MetroCluster serve storage from Site B in a controlled manner.
1. From the VMware vSphere Web Client, right-click sitea_vm1 and select Migrate.
Figure 3-79:
2. Select Change compute resource only, and click Next.
Figure 3-80:
3. Under the Hosts > Filter tab, select esx3.
Figure 3-81:
4. Click Next.
Figure 3-82:
5. Click Next.
Figure 3-83:
6. Click Finish.
Figure 3-84:
7. After the VM finishes migrating, right-click ESX1 and go to Power > Shut Down.
Figure 3-85:
8. Enter MCC Test as the reason and click OK.
Figure 3-86:
9. Right-click ESX2 and go to Power > Shut Down.
Figure 3-87:
10. Enter MCC Test as the reason and click OK.
Figure 3-88:
11. Return to the PuTTY session for cluster2.
12. If you have an asterisk in the prompt, skip this step; otherwise, issue the set -priv advanced command.
set -priv advanced
13. You can use the -simulate option to preview the results of a switchover. This verification check ensures that most of the preconditions for a successful run are met before you start the operation. Issue the metrocluster switchover -simulate command.
metrocluster switchover -simulate
Figure 3-89:
14. Issue the metrocluster switchover command.
metrocluster switchover
15. Press y to continue.
Figure 3-90:
16. Return to the PuTTY session for TieBreaker and issue the monitor show -status command.
monitor show -status
17. Examine the monitor state. It may state that it is unable to reach cluster1. If so, issue the monitor show -status command again.
Figure 3-91:
18. After running the monitor show -status command again, verify that the monitor state changes to MCC in switched over state.
Figure 3-92:
19. Return to the VMware vSphere Web Client tab and observe that the VM is running.
Figure 3-93:
20. Return to the OCSM tab for cluster2 and click Refresh to see that svm1-mc is running.
Figure 3-94:
21. Return to the OCUM tab to examine the reported events.
22. Note that fewer events are being reported, because this was a negotiated switchover.
Figure 3-95:
23. Click the cluster2 MetroCluster Connectivity tab to see that connectivity is impacted.
Figure 3-96:
24. Check the Last Refreshed Time for cluster1. Optionally, you can wait until it reads 15 Mins Ago to see what changes occurred.
Figure 3-97:
25. Navigate to the cluster2 MetroCluster Connectivity tab.
Figure 3-98:
26. Navigate to the cluster2 MetroCluster Replication tab.
Figure 3-99:
27. Following the switchover event, the cluster1 nodes have powered themselves down. You can confirm this by attempting to connect to cluster1 with PuTTY, pinging cluster1, or trying to connect to OCSM for cluster1.
28. Go to OCUM Events, select all the events, and click Acknowledge.
Figure 3-100:
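The negotiated switchover you just performed follows a strict order: evacuate compute first, then cleanly switch over storage. The sketch below condenses that order into a printed checklist. It is illustrative rather than an executable runbook; the host and cluster names come from this lab, and nothing here talks to vCenter or the clusters.

```shell
#!/bin/sh
# Checklist printer for a negotiated (planned) switchover, in the order used
# in this activity. Printing only -- no commands are executed.

plan_negotiated_switchover() {
    cat <<'EOF'
1. vMotion VMs off the Site A hosts (done in the vSphere Web Client)
2. Shut down esx1 and esx2
3. On cluster2: set -privilege advanced
4. metrocluster switchover -simulate
5. metrocluster switchover
EOF
}

plan_negotiated_switchover
```

Note how this differs from the unplanned case: the VMs are moved and the hosts shut down before the switchover, so no HA failover or forced switchover is ever needed.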
4 Lab Limitations
This lab has the following limitations:
- All of the servers and storage controllers presented in this lab are virtual devices. Consequently, any operations involving movement of large quantities of data will not exhibit performance representative of real systems.
5 References
The following references were used to create this lab guide:
- http://mysupport.netapp.com/documentation/docweb/index.html?productid=62093&language=en-us
- http://www.netapp.com/us/media/tr-4128.pdf
- https://library.netapp.com/ecm/ecm_download_file/ecmp12454947
- https://library.netapp.com/ecm/ecm_download_file/ecmp12458277
6 Version History

Version | Date     | Document Version History
--------|----------|------------------------------------------------------
1.0.0   | Sep 2015 | Initial Release
1.0.1   | Oct 2015 | Added more context
1.0.2   | Dec 2015 | Added note about nested VM and rebooting tiebreaker
Refer to the Interoperability Matrix Tool (IMT) on the NetApp Support site to validate that the exact product and feature versions described in this document are supported for your specific environment. The NetApp IMT defines the product components and versions that can be used to construct configurations that are supported by NetApp. Specific results depend on each customer's installation in accordance with published specifications.
NetApp provides no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS, and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.
Go further, faster
© 2016 NetApp, Inc. All rights reserved. No portions of this document may be reproduced without prior written consent of NetApp, Inc. Specifications are subject to change without notice. NetApp, the NetApp logo, Data ONTAP, ONTAP, OnCommand, SANtricity, FlexPod, SnapCenter, and SolidFire are trademarks or registered trademarks of NetApp, Inc. in the United States and/or other countries. All other brands or products are trademarks or registered trademarks of their respective holders and should be treated as such.