StorageGRID Webscale 10.2


StorageGRID Webscale 10.2 Maintenance Guide for OpenStack Deployments

NetApp, Inc.
495 East Java Drive
Sunnyvale, CA 94089 U.S.
Telephone: +1 (408) 822-6000
Fax: +1 (408) 822-4501
Support telephone: +1 (888) 463-8277
Web: www.
Feedback:
Part number: A0
December 2015

Contents

Maintain your StorageGRID Webscale system
    Checking the StorageGRID Webscale version
    Downloading the StorageGRID Webscale installation files
    Required materials for grid node recovery
Recovery procedures
    Recovering a Storage Node
        Gathering information about failed grid nodes
        Removing failed grid nodes
        Deploying the StorageGRID Webscale Installer in OpenStack Dashboard
        Generating the grid node recovery deployment template
        Deploying the recovery grid nodes in OpenStack Dashboard
        Associating the correct floating IP address
        Initializing storage volumes and setting the installation state
        Rebuilding the Cassandra database
        Determine if hotfixes or maintenance releases must be applied
        Restoring object data to a storage volume where the system drive also failed
        Reverting to the default boot mode
        Finishing the StorageGRID Webscale deployment
        Checking the storage state
    Recovering a StorageGRID Webscale appliance Storage Node
        Preparing the StorageGRID Webscale appliance Storage Node
        Deploying the StorageGRID Webscale Installer in OpenStack Dashboard
        Preparing and monitoring the deployment in the StorageGRID Webscale Installer
        Connecting to the appliance configuration web page
        Configuring the data network connections
        Setting the StorageGRID Webscale Installer IP address
        Installing StorageGRID Webscale software on the appliance
        Initializing storage volumes and setting the installation state
        Rebuilding the Cassandra database
        Restoring object data to a storage volume where the system drive also failed
        Reverting to the default boot mode
        Finishing the StorageGRID Webscale deployment
    Recovering from Admin Node failures
        Recovering from primary Admin Node failures
        Recovering from nonprimary Admin Node failures
    Recovering from API Gateway Node failures
        Gathering information about failed grid nodes
        Removing failed grid nodes
        Deploying the StorageGRID Webscale Installer in OpenStack Dashboard
        Generating the grid node recovery deployment template
        Deploying the recovery grid nodes in OpenStack Dashboard
        Associating the correct floating IP address
        Finishing the StorageGRID Webscale deployment
        Determine if hotfixes or maintenance releases must be applied
    Recovering from Archive Node failures
        Gathering information about failed grid nodes
        Removing failed grid nodes
        Deploying the StorageGRID Webscale Installer in OpenStack Dashboard
        Generating the grid node recovery deployment template
        Deploying the recovery grid nodes in OpenStack Dashboard
        Associating the correct floating IP address
        Finishing the StorageGRID Webscale deployment
        Determine if hotfixes or maintenance releases must be applied
        Resetting Archive Node connection to the cloud
Decommission process for grid nodes
    When to decommission a grid node
    About Storage Node decommissioning
    Storage Node consolidation
    System expansion and decommissioning nodes
    Decommission multiple Storage Nodes
    ILM policy and storage configuration review
    Impact of decommissioning
        Impact of decommissioning on data security
        Impact of decommissioning on ILM policy
        Impact of decommissioning on other grid tasks
        Impact of decommissioning on system operations
    Prerequisites and preparations for decommissioning
    Backing up the Grid Provisioning archive
    Decommissioning grid nodes
    Completing the decommission process for grid nodes
    Troubleshooting the decommissioning grid task
Glossary
Copyright information
Trademark information
How to send comments about documentation and receive update notifications
Index

Maintain your StorageGRID Webscale system

This guide contains procedures for recovering the various types of grid nodes that make up a StorageGRID Webscale system (Admin Nodes, Storage Nodes, API Gateway Nodes, and Archive Nodes). This guide also includes procedures that explain how to decommission Storage Nodes and API Gateway Nodes.

All maintenance activities require understanding of the StorageGRID Webscale system as a whole. You should review system specific documentation and the grid configuration web pages generated during provisioning (available in the /Doc directory of the SAID package) to ensure that you understand the StorageGRID Webscale system's topology. For more information about the workings of the StorageGRID Webscale system, see the Grid Primer and the Administrator Guide.

Ensure that you follow the instructions and warnings in this guide exactly. Maintenance procedures not detailed in this guide are not supported or require a services engagement to implement.

This guide includes procedures to recover a StorageGRID Webscale appliance Storage Node. For information about recovering a StorageGRID Webscale appliance, see the StorageGRID Webscale Appliance Installation and Maintenance Guide.

Related information
- StorageGRID Webscale 10.2 Appliance Installation and Maintenance Guide
- StorageGRID Webscale 10.2 Grid Primer
- StorageGRID Webscale 10.2 Administrator Guide

Checking the StorageGRID Webscale version

All services of the same type must be running the same version of the StorageGRID Webscale software. This includes applied hotfixes and maintenance releases.

Before you begin
Before performing any maintenance procedure, you must know which version of the StorageGRID Webscale software is running on the grid node. You use this number as a reference to determine if hotfixes or maintenance releases, or both, must be applied when performing maintenance procedures. If you cannot get this information from the failed grid node, use another grid node of the same type in the system, if one is available.

Step
1. In the NMS management interface (MI), go to Grid Topology > grid node > SSM > Services.
   Under Packages, the version listed for the storage grid release indicates the installed version.

Downloading the StorageGRID Webscale installation files

Before you can use StorageGRID Webscale features, you must download the software from the NetApp Support Site.

1. Access the NetApp Support Site at mysupport.netapp.com.

2. Download the StorageGRID Webscale Installer (SGI) archive file (SGWS-<version>-<SHA>.tgz).
3. Extract the files from the archive file and use the information in the README file to select the correct SGI deployment template file (.yaml) for your environment.

Required materials for grid node recovery

Before performing maintenance procedures, you must ensure you have the necessary materials to recover a failed grid node.

Item: StorageGRID Webscale OpenStack Heat Template
Notes: Select the appropriate OpenStack Heat template for your OpenStack deployment:
- SGI_Public_Routing_Template.yaml: StorageGRID Installer Heat template for deployments to externally routable tenant networks that do not require the use of floating IP addresses.
- SGI_Private_Routing_Template.yaml: StorageGRID Installer Heat template for deployments to private tenant networks. This template creates floating IP addresses and associations.
These files must be downloaded from the NetApp Support Site at mysupport.netapp.com as part of the StorageGRID Webscale Installer archive file (.tgz).

Item: StorageGRID Webscale Installer Virtual Machine Disk file
Notes: The Virtual Machine Disk (.vmdk) file used to deploy the StorageGRID Webscale Installer. This file must be downloaded from the NetApp Support Site at mysupport.netapp.com as part of the StorageGRID Webscale Installer archive file (.tgz).

Item: Hotfix and Maintenance Release files
Notes: Determine whether or not a hotfix or a maintenance release, or both, has been applied to the grid node. The recovered grid node must be updated to the same build as all other grid nodes of the same type. See the storage grid release number listed on the Grid Topology > grid node > SSM > Services > Main page. To acquire hotfixes and maintenance releases, contact technical support.

Items: Provisioning data; Provisioning Backup .zip file; Provisioning passphrase; Passwords.txt file
Notes: Obtain a copy of the most recent provisioning data .zip file. This provisioning data is updated each time the system is modified. The provisioning .zip file includes the SAID package. Use the latest revision of the SAID package.
Note: The provisioning folder must contain only one grid specification file at the root level or provisioning will fail. Ensure that it is the latest version, which includes all maintenance updates.
Included in the SAID package. The SAID package is included in the Grid Provisioning Backup, which you can download from the NMS MI (Grid Maintenance > Grid Provisioning Backup).

Item: OpenStack software and documentation
Notes: For the current supported versions of OpenStack software, see the Interoperability Matrix (NetApp Interoperability Matrix Tool).

Item: Service laptop
Notes: The service laptop must have the following:
- Microsoft Windows operating system
- Network port
- Supported browser. The following browsers have been tested with StorageGRID Webscale to verify compatibility:
    - Google Chrome 43
    - Microsoft Internet Explorer 11.0
    - Mozilla Firefox
- Telnet and SSH client (for example, PuTTY)
- SCP tool (for example, WinSCP) to transfer files to and from the primary Admin Node
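
The choice between the two Heat templates listed above depends on whether the tenant network needs floating IP addresses. The following is a minimal Python sketch of that decision. The function name and its boolean parameter are illustrative assumptions; only the two template file names come from this guide.

```python
# Hypothetical helper: only the two template file names come from the guide.
PUBLIC_TEMPLATE = "SGI_Public_Routing_Template.yaml"
PRIVATE_TEMPLATE = "SGI_Private_Routing_Template.yaml"


def select_sgi_template(uses_floating_ips: bool) -> str:
    """Return the SGI Heat template file name for a deployment.

    Private tenant networks rely on floating IP addresses, which the private
    routing template creates and associates. Externally routable tenant
    networks do not require them and use the public routing template.
    """
    return PRIVATE_TEMPLATE if uses_floating_ips else PUBLIC_TEMPLATE
```

The README file in the installer archive remains the authoritative source for template selection; this sketch only restates the rule from the table.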

Recovery procedures

When you recover a failed grid node, you must also replace the failed server hardware with new hardware, reinstall the software, and ensure all recoverable data is intact.

The grid node recovery procedures in this section describe how to recover a grid node of any type:
- Admin Nodes
- API Gateway Nodes
- Archive Nodes
- Storage Nodes, including those that are installed on StorageGRID Webscale appliances

You must always recover a failed grid node as soon as possible. A failed grid node may reduce the redundancy of data in the StorageGRID Webscale system, leaving you vulnerable to the risk of permanent data loss in the event of another failure. Operating with failed grid nodes can have an impact on the efficiency of day to day operations, can increase recovery time (when queues develop that need to be cleared before recovery is complete), and can reduce your ability to monitor system operations.

All of the following conditions are assumed when recovering grid nodes:
- The hardware has failed and the drives from the old server cannot be used in a new server to recover the grid node.
- The failed hardware has been replaced and configured.
- The server to be recovered is configured to match the firmware version, BIOS settings, and storage configurations of the original server.
- If you are recovering a grid node other than the primary Admin Node, there is connectivity between the grid node being recovered and the primary Admin Node.

Recovering a Storage Node

You must always recover a failed Storage Node as soon as possible because objects are at increased risk of loss if another failure occurs.

Before you begin
- If two or more Storage Nodes have failed at the same time, do not attempt this recovery procedure before contacting technical support.
- Because of the complexities involved when restoring objects to a failed Storage Node, if the data center site includes three or more Storage Nodes that have been offline for more than 15 days, you must contact technical support before attempting to recover failed Storage Nodes. Failure to do so may result in the unrecoverable loss of objects.
- If the Storage Node is in read-only maintenance mode to allow for the retrieval of objects for another Storage Node with failed storage volumes, refer to the instructions for operating with failed storage volumes before performing the recovery procedure for the Storage Node.

About this task
If your StorageGRID Webscale system is configured to use an information lifecycle management (ILM) rule with only one content placement instruction, only a single copy of object data is made. If the single copy is created on a Storage Node that fails, the result is unrecoverable loss of data. ILM rules with only one content placement instruction are not recommended for this reason.
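
The two conditions under which technical support must be contacted before a Storage Node recovery can be restated as a simple check. The sketch below is illustrative only: the function name and its parameters are assumptions, and the thresholds (two simultaneous failures, three nodes offline for more than 15 days) are taken from the prerequisites above.

```python
from typing import Iterable


def must_contact_support(simultaneous_failures: int,
                         days_offline_per_node: Iterable[float]) -> bool:
    """Return True when the prerequisites say to call technical support first.

    simultaneous_failures: number of Storage Nodes that failed at the same time.
    days_offline_per_node: days offline for each Storage Node at the site.
    """
    # Count the nodes at the site that have been offline for more than 15 days.
    long_offline = sum(1 for days in days_offline_per_node if days > 15)
    # Either condition alone is enough to require technical support.
    return simultaneous_failures >= 2 or long_offline >= 3
```

For example, `must_contact_support(1, [2, 0, 0])` describes a single recent failure and returns `False`, so the documented procedure can proceed.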

1. Gathering information about failed grid nodes
   You must gather the information required to clean up each failed grid node in OpenStack Dashboard.
2. Removing failed grid nodes
   You must clean up the deployment in OpenStack Dashboard for the failed grid node before you can recover the grid node.
3. Deploying the StorageGRID Webscale Installer in OpenStack Dashboard
   You must deploy the StorageGRID Webscale Installer (SGI) using the OpenStack Dashboard. The SGI is then accessed through a web browser and used to deploy grid nodes.
4. Generating the grid node recovery deployment template
   To recover grid nodes you need to create a Heat template that includes the deployment information for the grid nodes you are recovering, and then use it to deploy the stack in OpenStack Dashboard.
5. Deploying the recovery grid nodes in OpenStack Dashboard
   You can add an individual grid node to the OpenStack deployment by launching a new stack that contains the grid node.
6. Associating the correct floating IP address
   If your grid is using floating IP addresses for public access, you may need to reassign the IP address previously associated with the failed grid node to the recovered grid node.
7. Initializing storage volumes and setting the installation state
   You need to initialize the storage volumes (rangedbs) on each Storage Node to make the storage available to the StorageGRID Webscale system.
8. Rebuilding the Cassandra database
   You need to run the check-cassandra-rebuild script to determine if you need to rebuild the Cassandra database, and then rebuild it if required.
9. Determine if hotfixes or maintenance releases must be applied
   After you start the StorageGRID Webscale software, you need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system. If versions do not match, you must apply the required hotfix or maintenance release.
10. Restoring object data to a storage volume where the system drive also failed
    After recovering a storage volume on a Storage Node where the system drive also failed and was recovered, and after rebuilding the Cassandra database, you can restore object data to the recovered storage volumes from other Storage Nodes and Archive Nodes.
11. Reverting to the default boot mode
    After you have completed the procedure to recover a grid node in maintenance mode, you need to reset the grid node to boot the operating system normally.
12. Finishing the StorageGRID Webscale deployment
    After completing the deployment of the StorageGRID Webscale grid nodes, you must return to the StorageGRID Webscale Installer and finish the deployment.
13. Checking the storage state
    You need to verify that the desired state of the Storage Node is set to online and ensure that the state will be online by default whenever the Storage Node server is restarted.

Gathering information about failed grid nodes

You must gather the information required to clean up each failed grid node in OpenStack Dashboard.

1. Log in to the OpenStack Dashboard.
2. Select Project > Compute > Instances.
3. In the Instance Name column, click the link for the instance where the failed grid node is located.
4. Record the IP addresses listed in the IP Addresses section, and the volume names listed in the Volumes Attached section.

Removing failed grid nodes

You must clean up the deployment in OpenStack Dashboard for the failed grid node before you can recover the grid node.

1. Log in to the OpenStack Dashboard.
2. Remove the failed grid node instance:
   a. Select Project > Compute > Instances.
   b. Select the checkbox to the left of the failed grid node instance to terminate.
   c. In the Actions column, select Terminate Instance from the drop-down list.
   d. Click Terminate Instance in the confirmation dialog box.
3. Remove the failed grid node volumes:
   a. Select Project > Compute > Volumes.
   b. Select the checkboxes to the left of each of the volumes associated with the failed grid node.
   c. Click Delete Volumes above the volumes table.
   d. Click Delete Volumes in the confirmation dialog box.
4. Remove the port associated with the failed grid node:
   a. Select Admin > System > Networks.
   b. In the Network Name column, click the link for the network where the failed grid node is located.
   c. Select the checkbox to the left of the port associated with the failed grid node.
   d. In the Actions column, select Delete Port from the drop-down list.
   e. Click Delete Port in the confirmation dialog box.

Deploying the StorageGRID Webscale Installer in OpenStack Dashboard

You must deploy the StorageGRID Webscale Installer (SGI) using the OpenStack Dashboard. The SGI is then accessed through a web browser and used to deploy grid nodes.

Before you begin
- OpenStack software must be installed and correctly configured.
- You must have the SGI virtual machine disk file (.vmdk) and the correct Heat template file (.yaml) for your deployment. These files must be extracted from the StorageGRID Webscale installation file (.tgz) downloaded from the NetApp Support Site at mysupport.netapp.com.
- You must have network configuration information for the SGI (IP address, network mask, default gateway).
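
The SGI network settings can be sanity-checked before they are entered in the Launch Stack dialog box. The sketch below uses Python's standard `ipaddress` module. The function name, parameter names, and sample addresses are assumptions for illustration. The two checks reflect the requirement that the SGI use an address separate from those assigned to grid nodes, plus the general networking rule that a default gateway must sit on the same subnet as the address it serves.

```python
import ipaddress
from typing import Iterable, List


def check_sgi_network(sgi_ip: str, netmask: str, gateway: str,
                      grid_node_ips: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means no problem found."""
    problems = []
    address = ipaddress.ip_address(sgi_ip)
    # strict=False derives the subnet from a host address plus its netmask.
    network = ipaddress.ip_network(f"{sgi_ip}/{netmask}", strict=False)

    if ipaddress.ip_address(gateway) not in network:
        problems.append(f"gateway {gateway} is not in {network}")

    # The SGI needs its own address, separate from every grid node address.
    if address in {ipaddress.ip_address(ip) for ip in grid_node_ips}:
        problems.append(f"{sgi_ip} is already assigned to a grid node")

    return problems
```

With the hypothetical values `check_sgi_network("10.0.0.50", "255.255.255.0", "10.0.0.1", ["10.0.0.10"])`, the function returns an empty list.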
About this task
You must deploy the SGI on the same network as, or a network that is accessible to, the grid nodes being deployed for the StorageGRID Webscale system. An additional, unique IP address is required for the SGI, one that is separate from the IP addresses assigned to grid nodes in the grid specification file.

The best practice is to remove the SGI virtual machine after you have deployed all grid nodes and verified that they have started successfully and joined the grid. This helps to ensure that future changes to the grid topology are not completed with the incorrect version of the SGI. Each time you make a change to the grid topology, you must ensure that the SGI version matches the version of your StorageGRID Webscale system.

1. Log in to the OpenStack Dashboard.
2. Select Project > Orchestration > Stacks.
3. Click Launch Stack.
4. In the Select Template dialog box, enter the configuration information for the StorageGRID Webscale Installer stack:
   - Template Source: Select URL from the drop-down list.
   - Template URL: Enter, or copy and paste, the location of the SGI heat template, for example:
   - Leave Environment Source and Environment File at their default values.
5. Click Next.
6. In the Launch Stack dialog box, enter the SGI deployment information:
   - Stack Name: Enter a meaningful name for the stack, for example, NetApp-SGI.
   - Creation Timeout (minutes): The time, in minutes, to allow for the stack creation before it times out.
   - Rollback On Failure: Select this option to enable rollback to the starting state upon failure. This option should be selected with caution, because it will prevent error messages from being displayed in OpenStack Dashboard if the deployment fails.
   - Password for user "username": Enter the password for the specified OpenStack project user account.
   - StorageGRID Network: Select the network the grid nodes are being deployed in from the drop-down list.
   - StorageGRID Network Netmask: Enter the network mask for the SGI on the StorageGRID Network.
   - StorageGRID Network Gateway: Enter the default gateway for the SGI on the StorageGRID Network.
   - StorageGRID Installer IP Address: Enter the IP address for the SGI on the StorageGRID Network.
   - Public Network: Select the public network the SGI will use from the drop-down list. This is the network that floating IP addresses are allocated from. This field is only applicable, and is only displayed, if you are using the SGI template for private tenant networks (SGI_Private_Routing_Template.yaml).
   - StorageGRID Installer Image URL: Enter, or copy and paste, the URL for the SGI virtual machine disk (.vmdk) file, for example: SGI a323.vmdk

7. Click Launch.
   You must wait for the deployment to complete, and then you can access the SGI in your web browser at the IP address you specified in the StorageGRID Installer IP Address text box.

Generating the grid node recovery deployment template

To recover grid nodes you need to create a Heat template that includes the deployment information for the grid nodes you are recovering, and then use it to deploy the stack in OpenStack Dashboard.

Before you begin
You must have the latest version of the SAID package. You can download the latest version from the NMS MI (Grid Management > Grid Maintenance > Provisioning Backup), or acquire the latest backup version you have stored in a secure location.

1. In a supported web browser, navigate to the StorageGRID Webscale Installer using the IP address configured when deploying the StorageGRID Webscale Installer.
2. On the Welcome page, select Modify an existing StorageGRID Webscale System.
3. In the Upload the SAID package page, click Upload, locate and select the latest SAID package file (.zip) for your StorageGRID Webscale system, and click Open.
   The SAID file is named using the following format: GIDgridIDNumber_REVrevisionNumber_SAID.zip
   Ensure that you select the .zip file with the highest revision number.
4. Click Next.
5. In the Grid Configuration page, verify the values specified for virtual machine based Storage Nodes:
   - Number of RangeDBs: The number of storage volumes (RangeDBs) to attach to each Storage Node.
   - Size of RangeDBs (GBs): The size of each individual storage volume (RangeDB) in gigabytes. You must enter a value between 50 GB and 20000 GB (20 terabytes). The minimum value for production systems is 4000 GB (4 terabytes).
   These values are defined in the grid specification file, and should not be modified unless you verify that the specified values are incorrect. If you are only deploying StorageGRID Webscale appliance Storage Nodes, these values are not used because the full storage capacity on the appliance is always used.
6. Click Save.
7. Click Next.
8. In the Deploy Grid Nodes page, select the recovery grid nodes to deploy:
   a. Select the grid nodes you need to recover, and deselect all other grid nodes.
   b. Click Generate Configuration.
   c. Copy the generated URL from the Grid Nodes URL text box.
   You use the Grid Nodes URL value to launch the recovery stack in OpenStack Dashboard.
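
Two checks in the procedure above can be expressed in code: picking the SAID package with the highest revision number (step 3) and confirming that a RangeDB size falls within the documented limits (step 5). The sketch below is a hedged illustration. The function names are hypothetical, and it assumes that gridIDNumber and revisionNumber are plain decimal digits and that all candidate files belong to the same grid.

```python
import re
from typing import Iterable, Optional

# Assumed shape of GIDgridIDNumber_REVrevisionNumber_SAID.zip (digits only).
SAID_PATTERN = re.compile(r"^GID(\d+)_REV(\d+)_SAID\.zip$")


def latest_said(filenames: Iterable[str]) -> Optional[str]:
    """Return the SAID file name with the highest revision, or None."""
    best_name, best_rev = None, -1
    for name in filenames:
        match = SAID_PATTERN.match(name)
        if match is None:
            continue  # ignore files that do not follow the SAID naming format
        revision = int(match.group(2))  # compare as numbers, not as text
        if revision > best_rev:
            best_name, best_rev = name, revision
    return best_name


def rangedb_size_ok(size_gb: int, production: bool = True) -> bool:
    """Check a RangeDB size against the limits stated in the guide."""
    if not 50 <= size_gb <= 20000:
        return False  # outside the range the installer accepts
    if production and size_gb < 4000:
        return False  # below the stated production minimum
    return True
```

Comparing revisions as integers matters because a plain text sort would rank REV9 above REV10.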

Deploying the recovery grid nodes in OpenStack Dashboard

You can add an individual grid node to the OpenStack deployment by launching a new stack that contains the grid node.

1. Log in to the OpenStack Dashboard.
2. Click Project > Orchestration > Stacks.
3. Click Launch Stack.
4. In the Select Template dialog box, enter the grid deployment file information:
   - Template Source: Select URL from the drop-down list.
   - Template URL: Paste in, or enter, the Grid Nodes URL value from the Deploy the Grid page in the SGI.
   - Leave Environment Source and Environment File at their default values.
5. Click Next.
6. In the Launch Stack dialog box, enter the grid deployment information:
   - Stack Name: Enter a meaningful name for the stack, for example, DC1-S4-Recovery.
   - Creation Timeout (minutes): The time, in minutes, to allow for the stack creation before it times out.
   - Rollback On Failure: Select this option to enable rollback to the starting state upon failure. This option should be selected with caution, because it will prevent error messages from being displayed in OpenStack Dashboard if the deployment fails.
   - Password for user "username": Enter the password for the specified OpenStack project user account.
   - StorageGRID node security group: The Neutron security group to use for the grid node. The default value, StorageGRID Node Firewall, should be used in most cases. You can ensure that the correct value is listed by verifying that the release number is listed followed by the deployment date and a unique ID, for example, unique_id.
   - Maintenance Mode: Select True from the drop-down list.
   - StorageGRID node server flavor: The type of server to use for the grid node. You need to select the node server flavor that corresponds to the stack for your StorageGRID Webscale grid deployment. For example, if you named the StorageGRID Webscale stack NetApp_SGW, look for an entry named NetApp_SGW_node_flavor-uniqueID.
   - StorageGRID node root image: Select "StorageGRID root image" as the root disk for the grid node. You can ensure that you are selecting the correct entry by verifying that the release number is listed followed by the deployment date and a unique ID, for example, unique_id.
7. Click Launch.
8. Return to the SGI and monitor the progress of the grid node installation.
   Wait until the status bar for each recovered grid node is yellow and the status is Stopped in Maintenance Mode before continuing with the recovery procedure.

Associating the correct floating IP address

If your grid is using floating IP addresses for public access, you may need to reassign the IP address previously associated with the failed grid node to the recovered grid node.

1. Log in to the OpenStack Dashboard.
2. Select Project > Compute > Instances.
3. In the IP Address column, verify that the Floating IPs value for the recovered node matches the value used by the failed node.
   If the value is incorrect, you need to remove the incorrect value and assign the correct floating IP address:
   a. Select Disassociate Floating IP from the drop-down list in the Actions column for the recovered node instance.
   b. Click Disassociate Floating IP in the confirmation dialog box.
   c. Select Associate Floating IP from the drop-down list in the Actions column for the recovered node instance.
   d. Select the correct floating IP address to use from the IP Address drop-down list and click Associate.
      Do not change the value of the Port to be associated drop-down list.

Initializing storage volumes and setting the installation state

You need to initialize the storage volumes (rangedbs) on each Storage Node to make the storage available to the StorageGRID Webscale system.

Before you begin
You must have access to the Passwords.txt file.

1. From the service laptop, log in to the recovered Storage Node as root using the password listed in the Passwords.txt file.
2. Initialize the storage volume (rangedb) hard disks:

       ruby /tmp/ldrinit.rb

   a. When unallocated drives are detected and you are asked whether to use them as LDR rangedbs (Accept proposal [y/n]), enter y.
   b. For each rangedb drive on the Storage Node, when you are asked to Reformat the rangedb drive <name>? [y/n]?, enter y.

   In the following example, the rangedbs associated with the DC1-S4 Storage Node are initialized:

       DC1-S4:~ $ ruby /tmp/ldrinit.rb
       Unallocated drives detected. Is the following proposed action correct?
       Use /dev/sdb (300.0GiB) as an LDR rangedb
       Use /dev/sdc (300.0GiB) as an LDR rangedb
       Use /dev/sdd (300.0GiB) as an LDR rangedb
       Accept proposal [Y/n]? y
       Determining drive partitioning
       WARNING: drive /dev/sdb exists
       Reformat the rangedb drive /dev/sdb? [Y/n]? y
       WARNING: drive /dev/sdc exists
       Reformat the rangedb drive /dev/sdc? [Y/n]? y
       WARNING: drive /dev/sdd exists
       Reformat the rangedb drive /dev/sdd? [Y/n]? y
       Formatting the following: /dev/sdb /dev/sdc /dev/sdd
       Formatting Storage Drives
       Finalizing device tuning
       Disabling rpcbind service
       Creating Object Stores for LDR
       Generating Grid Interface Configuration file
       LDR initialization complete
       DC1-S4:~ $

3. Set the installation state:

       echo ACTIVATE > /var/local/run/install.state

Rebuilding the Cassandra database

You need to run the check-cassandra-rebuild script to determine if you need to rebuild the Cassandra database, and then rebuild it if required.

Before you begin
- The system drives on the server must already have been restored.
- The cause of the storage volume failure has been identified and the defective storage hardware has been replaced.
- All the replaced storage drives have been formatted as rangedbs.
- The total size of the replacement storage must be the same as the original.
- You must have access to the Passwords.txt file.

1. From the service laptop, log in to the Storage Node as root using the password listed in the Passwords.txt file.
2. Determine if the Cassandra database must be rebuilt, and then rebuild it if the answer is yes:
   a. Check the database state:

          check-cassandra-rebuild

   b. If asked to Stop storage services [y/n]?, enter y.
   c. If you are prompted to rebuild the Cassandra database, enter y.

   Attention: You should not enter n unless directed by technical support. Rebuilding the Cassandra database means that the database is deleted from the Storage Node and rebuilt from other available Storage Nodes. This procedure should never be performed on multiple Storage Nodes concurrently as it may result in data loss.

In the following example, the Cassandra database has been down for more than 15 days and must be rebuilt:


       Cassandra was down for more than 15 days.
       Running: /usr/local/sbin/rebuild-cassandra-data
       Enter 'y' to rebuild the Cassandra database for this Storage Node.
       Rebuilding the Cassandra database may take as long or longer than 12 hours.
       Once started, do not stop or pause this rebuild operation.
       If the rebuild process is stopped or paused, you must rerun the operation.
       [y/n]? y
       Removing Cassandra commit logs
       Removing Cassandra SSTables
       Updating timestamps of the Cassandra data directories.
       starting service cassandra
       Running nodetool rebuild.
       Done. Cassandra database successfully rebuilt.

   If you are not prompted to rebuild the Cassandra database, continue to the next recovery task.

Determine if hotfixes or maintenance releases must be applied

After you start the StorageGRID Webscale software, you need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system. If versions do not match, you must apply the required hotfix or maintenance release.

Before you begin
StorageGRID Webscale software must be started on the recovered grid node.

1. Sign in to the NMS MI.
2. Determine the current version of the StorageGRID Webscale software:
   a. Select Grid Topology > grid node of same type > SSM > Services > Main.
   b. Under Packages, note the storage-grid-release number.
3. Determine the version of the StorageGRID Webscale software of the recovered grid node:
   a. Select Grid Topology > recovered grid node > SSM > Services > Main.
   b. Under Packages, note the storage-grid-release number.
4. Compare the two versions, and if they differ install the required hotfixes or maintenance releases to update the recovered grid node to the correct software version.

For more information about available hotfixes and maintenance releases, contact technical support.

Related tasks
Checking the StorageGRID Webscale version
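
The comparison in step 4 amounts to checking recorded storage-grid-release values against a reference value from a healthy grid node of the same type. The sketch below shows one way to track this when several nodes are recovered. The function name is hypothetical, and the release strings in the usage note are made-up placeholders, not real StorageGRID Webscale build numbers.

```python
from typing import Dict, List


def nodes_needing_update(reference_release: str,
                         releases_by_node: Dict[str, str]) -> List[str]:
    """Return the names of nodes whose release differs from the reference.

    reference_release: storage-grid-release value noted on a healthy grid
        node of the same type.
    releases_by_node: storage-grid-release value noted on each recovered node.
    """
    expected = reference_release.strip()
    return sorted(
        node for node, release in releases_by_node.items()
        if release.strip() != expected
    )
```

For example, with a placeholder reference of `"10.2.0-A"` and recorded values `{"DC1-S4": "10.2.0-A", "DC1-S5": "10.2.0"}`, only `DC1-S5` is returned as needing a hotfix or maintenance release.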

Restoring object data to a storage volume where the system drive also failed

After recovering a storage volume on a Storage Node where the system drive also failed and was recovered, and after rebuilding the Cassandra database, you can restore object data to the recovered storage volumes from other Storage Nodes and Archive Nodes.

Before you begin
- You must have acquired the Node ID of the Storage Node where restored storage volumes reside. In the NMS MI, go to Grid Topology > Storage Node > LDR > Overview > Main.
- You must have confirmed the condition of the Storage Node. The Storage Node must be displayed in the Grid Topology tree with a color of green and all services must have a state of Online.
- If you want to recover erasure coded object data, the storage pool to which the recovered Storage Node is a member must include enough "green" and online Storage Nodes to support the Erasure Coding scheme used to create the erasure-coded object data being recovered. For example, if you are recovering erasure coded object data that was created using a scheme of 6+3, at least six Storage Nodes that are members of the Erasure Coding profile's storage pool must be green and online.

About this task
The procedure to restore object data to storage volumes notifies the StorageGRID Webscale system that object data stored on the lost storage volumes is no longer available, which prompts an ILM reevaluation to determine the correct placement of restored object data. If the only remaining copy of object data is located on an Archive Node, object data is retrieved from the Archive Node. Due to the latency associated with retrievals from external archival storage systems, restoring object data to a Storage Node from an Archive Node takes longer than restoring copies from other Storage Nodes.
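
The erasure coding prerequisite above can be checked with simple arithmetic: for a scheme written as data+parity, the number of green and online Storage Nodes in the storage pool must be at least the data count. The sketch below is illustrative only. The function name and the "6+3" string format for the scheme are assumptions.

```python
def can_restore_erasure_coded(scheme: str, online_nodes: int) -> bool:
    """Check whether enough Storage Nodes are online for an EC scheme.

    scheme: erasure coding scheme written as "data+parity", for example "6+3".
    online_nodes: green and online Storage Nodes in the profile's storage pool.
    """
    data_part, parity_part = scheme.split("+")
    data_fragments = int(data_part)
    int(parity_part)  # validated as a number, but not needed for this check
    return online_nodes >= data_fragments
```

For the 6+3 example in the prerequisites, `can_restore_erasure_coded("6+3", 6)` returns `True` and `can_restore_erasure_coded("6+3", 5)` returns `False`.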
If the StorageGRID Webscale system's ILM policy is configured to use an ILM rule with only one active content placement instruction, copies of an object are not made; only a single instance of a replicated object is stored at any one time. If there is a failure, all such objects are lost and cannot be recovered; however, you must still perform the following procedure to purge lost object information from the database. For more information about ILM rules, see the Administrator Guide.

1. From the service laptop, log in to the failed Storage Node as root using the password listed in the Passwords.txt file.

   Attention: You should use the ADE console with caution. If the console is used improperly, it is possible to interrupt system operations and corrupt data. Enter commands carefully, and only use the commands documented in this procedure.

2. Access the ADE console of the LDR service:
   telnet localhost
3. Access the CMSI module:
   cd /proc/cmsi
4. Begin restoring object data:
   Volume_Lost vol_id vol_id

   vol_id is the volume ID of the reformatted volume, or a range of volumes in hex representation. There can be up to 16 volumes, numbered from 0000 to 000F, such as Volume_Lost F

   Note: The second vol_id is optional. For a StorageGRID Webscale appliance Storage Node, you must reformat all storage volumes (0000 to 000F).

   As object data is restored, if the StorageGRID Webscale system cannot locate replicated object data, the LOST (Lost Objects) alarm is triggered. Alarms may be triggered on Storage Nodes throughout the system. Determine the cause of the loss and whether recovery is possible. For more information, see the Troubleshooting Guide.

5. To determine the current status of the Volume_Lost recovery operation, do one or more of the following:

   To determine the status of objects queued for retrieval from an Archive Node: In the NMS MI, go to the Archive Node > ARC > Retrieve > Overview > Main page, and view the Objects Queued attribute.

   To determine the status of the ILM Evaluation (Volume Lost) grid task triggered by the Volume_Lost command: In the NMS MI, go to primary Admin Node > CMN > Grid Tasks > Overview > Main and view the percentage complete under Active. Wait for the grid task to move into the Historical table with a Status of Successful, which indicates a successful Storage Node recovery.

   Unavailable Storage Nodes may affect the progress of ILM Evaluation (Volume Lost) grid tasks depending on where the Storage Node is located.

6. When the Volume_Lost recovery procedure finishes, including the completion of the ILM Evaluation (Volume Lost) grid task, exit the ADE console of the LDR service:
   exit
7. Access the ADE console of the DDS service:
   telnet localhost
8. Access the ECGQ module:
   cd /proc/ecgq
9. Complete the restoration of object data:
   node_repair node_id

   node_id is the node ID for the recovered Storage Node's LDR service.
   The StorageGRID Webscale system completes the processes of recovering object data, ensuring that ILM rules are met. A unique repair ID number is returned to identify this node_repair operation. This repair ID number can be used to track the progress and results of the repair. No other feedback is returned.

   Note: You cannot run multiple node_repair operations at the same time.

10. Determine the current status or result of the node_repair recovery operation:
   repair_status repair_id

   repair_id is the identifier returned when the node_repair command is run.

   To determine the repair_id of a repair, you can list all previously and currently running repairs:
   repair_status all

   In the following example, all object data is successfully recovered:

   ade : /proc/ecgq > node_repair
   ade : /proc/ecgq > Repair of node started. Repair ID
   ade : /proc/ecgq > repair_status
   Repair ID :
   Type : Storage Node Repair
   Node ID :
   Start time : T23:28:
   End time : T23:28:
   State : Success
   Estimated bytes affected :
   Bytes repaired :
   Retry repair : No

   If Retry repair is Yes, check the condition of the StorageGRID Webscale system, and confirm that all grid nodes are "green" in the NMS MI's Grid Topology tree with a state of Online. For erasure-coded object data, confirm that the storage pool of which the recovered Storage Node is a member contains the minimum number of green and online Storage Nodes, so that the erasure coding scheme in use is supported. Resolve any issues with the system, including connectivity, and retry the repair by entering the following command:
   repair_retry <repair_id>

   repair_id is the identifier returned when the node_repair command is run.

   Unrecoverable erasure-coded object data triggers the LOST (Lost Objects) and ECOR (Copies Lost) alarms.

   If State is Failure and Retry repair is No, erasure-coded object data is permanently lost.

11. Exit the ADE console:
   exit

Related information

StorageGRID Webscale 10.2 Administrator Guide
StorageGRID Webscale 10.2 Troubleshooting Guide

Reverting to the default boot mode

After you have completed the procedure to recover a grid node in maintenance mode, you need to reset the grid node to boot the operating system normally.

1. From the service laptop, log in to the recovered grid node as root using the password listed in the Passwords.txt file.
2. Run the script to override the maintenance mode boot option:
   override_maintenance_mode.sh
3. Log out of the command shell:
   exit
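Returning to the repair_status output from the object data restoration procedure: if you capture that output as text, the State and Retry repair fields can be interpreted with a short helper. This is a sketch, not a NetApp tool. The sample text uses hypothetical values, and the helper assumes each field appears on its own line as "name : value", as in the example output.

```python
def parse_repair_status(text: str) -> dict:
    """Parse 'name : value' lines from captured repair_status output."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so time values keep their colons.
        name, separator, value = line.partition(":")
        if separator:
            fields[name.strip()] = value.strip()
    return fields


def next_action(fields: dict) -> str:
    """Map the State and Retry repair fields to the documented outcome."""
    state = fields.get("State")
    retry = fields.get("Retry repair")
    if retry == "Yes":
        return "resolve system issues, then run repair_retry"
    if state == "Success":
        return "repair complete"
    if state == "Failure" and retry == "No":
        return "erasure-coded object data is permanently lost"
    return "repair still running or state not recognized"


# Hypothetical captured output; the IDs and times are illustrative only.
sample = """Repair ID : 12345
Type : Storage Node Repair
Node ID : 67890
Start time : 2015-12-01T23:28:10
End time : 2015-12-01T23:28:40
State : Success
Retry repair : No"""

print(next_action(parse_repair_status(sample)))
```

The helper only reads text; it does not run any ADE console command, so it cannot interrupt system operations.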

Finishing the StorageGRID Webscale deployment

After completing the deployment of the StorageGRID Webscale grid nodes, you must return to the StorageGRID Webscale Installer and finish the deployment.

1. When the installation of all of the selected grid nodes is complete, and you have verified that they have successfully joined the grid in the NMS MI, return to the StorageGRID Webscale Installer.
2. Click Cancel.
3. When asked to confirm the cancellation, click Yes.
   You are returned to the Welcome page.
4. Close the browser hosting the StorageGRID Webscale Installer.

Checking the storage state

You need to verify that the desired state of the Storage Node is set to online and ensure that the state will be online by default whenever the Storage Node server is restarted.

Before you begin

The Storage Node has been recovered, and data recovery is complete.

1. In the NMS MI, check the value of Grid Topology > Recovered Storage Node > LDR > Storage > Storage State Desired and Storage State Current.
   The value of both attributes should be Online.
2. If the Storage State Desired is set to Read-only, complete the following steps:
   a. Click the Configuration tab.
   b. From the Storage State Desired drop-down list, select Online.
   c. Click Apply Changes.
   d. Click the Overview tab and confirm that the values of Storage State Desired and Storage State Current are updated to Online.

Recovering a StorageGRID Webscale appliance Storage Node

Recovering a StorageGRID Webscale appliance Storage Node involves deploying the appliance Storage Nodes, rebuilding the Cassandra database, enabling services with the Grid Deployment Utility, and restoring object data to the Storage Node.

About this task

Attention: You must not attempt this recovery procedure if two or more Storage Nodes have failed at the same time. Contact technical support.

Each StorageGRID Webscale appliance is represented as one Storage Node in the Network Management System (NMS) Management Interface (MI).

1. Preparing the StorageGRID Webscale appliance Storage Node
   When recovering a StorageGRID Webscale appliance Storage Node, you must first prepare the grid node before reinstalling StorageGRID Webscale software.
2. Deploying the StorageGRID Webscale Installer in OpenStack Dashboard
   You must deploy the StorageGRID Webscale Installer (SGI) using the OpenStack Dashboard. The SGI is then accessed through a web browser and used to deploy grid nodes.
3. Preparing and monitoring the deployment in the StorageGRID Webscale Installer
   To recover StorageGRID Webscale Storage Nodes, you need to upload the SAID package containing deployment information to the StorageGRID Webscale Installer (SGI). You can monitor the progress of the deployment in the SGI.
4. Connecting to the appliance configuration web page
   To start the appliance software installation, you connect to the appliance configuration web page. Using this page enables you to configure the management network, configure the data network connection, enter the StorageGRID Webscale Installer IP address, and monitor the installation progress.
5. Configuring the data network connections
   Using the StorageGRID Webscale Appliance Installer web page, you enter the IP address of the data network. Additionally, you enter the subnet mask for the network and at least a default gateway. Entering the IP address, subnet mask, and gateway enables you to configure the data network connection.
6. Setting the StorageGRID Webscale Installer IP address
   You can use the StorageGRID Webscale Appliance Installer web page to set the IP address of the StorageGRID Webscale software installer. Setting this enables installer connectivity.
7. Installing StorageGRID Webscale software on the appliance
   You install the StorageGRID Webscale software and the operating system by using the appliance installation web page. You can also use the web page to monitor the installation. Installing this software enables you to monitor the appliance information in the StorageGRID Webscale system.
8. Initializing storage volumes and setting the installation state
   You need to initialize the storage volumes (rangedbs) on each Storage Node to make the storage available to the StorageGRID Webscale system.
9. Rebuilding the Cassandra database
   You need to run the check-cassandra-rebuild script to determine if you need to rebuild the Cassandra database, and then rebuild it if required.
10. Restoring object data to a storage volume where the system drive also failed
   After recovering a storage volume on a Storage Node where the system drive also failed and was recovered, and after rebuilding the Cassandra database, you can restore object data to the recovered storage volumes from other Storage Nodes and Archive Nodes.
11. Reverting to the default boot mode
   After you have completed the procedure to recover a grid node in maintenance mode, you need to reset the grid node to boot the operating system normally.
12. Finishing the StorageGRID Webscale deployment
   After completing the deployment of the StorageGRID Webscale grid nodes, you must return to the StorageGRID Webscale Installer and finish the deployment.

Preparing the StorageGRID Webscale appliance Storage Node

When recovering a StorageGRID Webscale appliance Storage Node, you must first prepare the grid node before reinstalling StorageGRID Webscale software.

1. From the service laptop, log in to the failed Storage Node as root using the password listed in the Passwords.txt file.
2. Prepare the StorageGRID Webscale appliance Storage Node for the installation of StorageGRID Webscale software:
   sgareinstall
3. When asked to continue [y/n]?, enter: y

   The appliance reboots and your SSH session ends. It takes approximately 20 minutes for the StorageGRID Webscale appliance page to become available.

The StorageGRID Webscale appliance Storage Node is reset and data on the Storage Node is no longer accessible. IP addresses configured during the original installation process should remain intact; however, it is recommended that you confirm this when the procedure completes.

Deploying the StorageGRID Webscale Installer in OpenStack Dashboard

You must deploy the StorageGRID Webscale Installer (SGI) using the OpenStack Dashboard. The SGI is then accessed through a web browser and used to deploy grid nodes.

Before you begin

OpenStack software must be installed and correctly configured.

You must have the SGI virtual machine disk file (.vmdk) and the correct Heat template file (.yaml) for your deployment. These files must be extracted from the StorageGRID Webscale installation file (.tgz) downloaded from the NetApp Support Site at mysupport.netapp.com.

You must have network configuration information for the SGI (IP address, network mask, default gateway).

About this task

You must deploy the SGI on the same network as, or a network that is accessible to, the grid nodes being deployed for the StorageGRID Webscale system. An additional, unique IP address is required for the SGI, one that is separate from the IP addresses assigned to grid nodes in the grid specification file.
The best practice is to remove the SGI virtual machine after you have deployed all grid nodes and verified that they have started successfully and joined the grid. This helps to ensure that future changes to the grid topology are not completed with the incorrect version of the SGI. Each time you make a change to the grid topology, you must ensure that the SGI version matches the version of your StorageGRID Webscale system.

1. Log in to the OpenStack Dashboard.
2. Select Project > Orchestration > Stacks.
3. Click Launch Stack.

4. In the Select Template dialog box, enter the configuration information for the StorageGRID Webscale Installer stack:

   Template Source: Select URL from the drop-down list.
   Template URL: Enter, or copy and paste, the location of the SGI heat template, for example:
   Leave Environment Source and Environment File at their default values.

5. Click Next.
6. In the Launch Stack dialog box, enter the SGI deployment information:

   Stack Name: Enter a meaningful name for the stack, for example, NetApp-SGI.
   Creation Timeout (minutes): The time, in minutes, to allow for the stack creation before a timeout.
   Rollback On Failure: Select this option to enable rollback to the starting state upon failure. This option should be selected with caution, because it will prevent error messages from being displayed in OpenStack Dashboard if the deployment fails.
   Password for user "username": Enter the password for the specified OpenStack project user account.
   StorageGRID Network: Select the network the grid nodes are being deployed in from the drop-down list.
   StorageGRID Network Netmask: Enter the network mask for the SGI on the StorageGRID Network.
   StorageGRID Network Gateway: Enter the default gateway for the SGI on the StorageGRID Network.
   StorageGRID Installer IP Address: Enter the IP address for the SGI on the StorageGRID Network.
   Public Network: Select the public network the SGI will use from the drop-down list. This is the network that Floating IP addresses are allocated from. This field is only applicable, and is only displayed, if you are using the SGI template for private tenant networks (SGI_Private_Routing_Template.yaml).
   StorageGRID Installer Image URL: Enter, or copy and paste, the URL for the SGI virtual machine disk (.vmdk) file, for example: SGI a323.vmdk

7. Click Launch.

You must wait for the deployment to complete, and then you can access the SGI in your web browser at the IP address you specified in the StorageGRID Installer IP Address text box.
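Before entering the SGI network values in the Launch Stack dialog box, it can help to sanity-check them. The sketch below uses Python's standard ipaddress module. The uniqueness check reflects the requirement that the SGI address be separate from the grid node addresses; the check that the gateway lies in the SGI subnet is a general networking assumption, not something this guide states. All addresses and the function name are hypothetical.

```python
import ipaddress


def check_sgi_address(sgi_ip, netmask, gateway, grid_node_ips):
    """Return a list of problems found with the planned SGI network values."""
    problems = []
    # strict=False lets a host address plus netmask define the subnet.
    subnet = ipaddress.ip_network(f"{sgi_ip}/{netmask}", strict=False)
    sgi = ipaddress.ip_address(sgi_ip)

    # Assumption: the default gateway is expected to sit in the SGI subnet.
    if ipaddress.ip_address(gateway) not in subnet:
        problems.append("gateway is outside the SGI subnet")

    # From the guide: the SGI needs its own address, separate from the
    # addresses assigned to grid nodes in the grid specification file.
    if sgi in {ipaddress.ip_address(ip) for ip in grid_node_ips}:
        problems.append("SGI address is already assigned to a grid node")

    return problems


# Hypothetical values for illustration.
issues = check_sgi_address(
    sgi_ip="10.96.100.50",
    netmask="255.255.255.0",
    gateway="10.96.100.1",
    grid_node_ips=["10.96.100.11", "10.96.100.12"],
)
print(issues or "No problems found")
```

An empty result does not prove that the SGI can reach the grid nodes; it only rules out the two mistakes checked here.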

Preparing and monitoring the deployment in the StorageGRID Webscale Installer

To recover StorageGRID Webscale Storage Nodes, you need to upload the SAID package containing deployment information to the StorageGRID Webscale Installer (SGI). You can monitor the progress of the deployment in the SGI.

1. In a supported web browser, navigate to the StorageGRID Webscale Installer using the IP address configured when deploying the StorageGRID Webscale Installer.
2. On the Welcome page, select Modify an existing StorageGRID Webscale System.
3. In the Upload the SAID package page, click Upload, locate and select the latest SAID package file (.zip) for your StorageGRID Webscale system, and click Open.

   The SAID file is named using the following format: GIDgridIDNumber_REVrevisionNumber_SAID.zip

   Ensure that you select the .zip file with the highest revision number.

4. Click Next.
5. In the Grid Configuration page, click Save to accept the default values specified for virtual machine based Storage Nodes.

   If you are only deploying StorageGRID Webscale appliance Storage Nodes, these values are not used because the full storage capacity on the appliance is always used.

6. Click Next.
7. In the Deploy Grid Nodes page, deselect the entries for the StorageGRID Webscale appliance Storage Nodes you are recovering.
8. Switch to the StorageGRID Webscale Appliance Installer to complete the deployment of the Storage Node.

Result

You can return to the SGI at any point in the appliance installation to monitor the progress of the installation.

Connecting to the appliance configuration web page

To start the appliance software installation, you connect to the appliance configuration web page. Using this page enables you to configure the management network, configure the data network connection, enter the StorageGRID Webscale Installer IP address, and monitor the installation progress.

Step

1. Open a browser and enter the E5600SG controller Management Port 1 IP address that was provisioned during the physical installation:

   The StorageGRID Webscale Appliance Installer web page appears.

   When you are first installing the appliance, the status for each of the procedures on the web page indicates that the procedure is not complete.

Configuring the data network connections

Using the StorageGRID Webscale Appliance Installer web page, you enter the IP address of the data network. Additionally, you enter the subnet mask for the network and at least a default gateway. Entering the IP address, subnet mask, and gateway enables you to configure the data network connection.

Before you begin

You must already have the IP address of the data network.

1. On the StorageGRID Webscale Appliance Installer web page, click Configure StorageGRID data network connection.
2. To edit the data network IP address, in the StorageGRID data network connection section, click Update IP/netmask.

   The button name changes to Save Changes and a pop-up appears.
3. Enter the IP address of the data network and click Save Changes.
   Route information based on the specified IP displays.
4. In the pop-up, click OK.
5. If needed, edit the route and click Save route.

Setting the StorageGRID Webscale Installer IP address

You can use the StorageGRID Webscale Appliance Installer web page to set the IP address of the StorageGRID Webscale software installer. Setting this enables installer connectivity.

Before you begin

You must know the IP address of the StorageGRID Webscale Installer.

1. Click Home to navigate to the main StorageGRID Webscale Appliance Installer web page.
2. In the Set StorageGRID Webscale Installer IP text box, enter the IP address of the StorageGRID Webscale Installer and click Update.
3. Click OK in the confirmation dialog box.

Installing StorageGRID Webscale software on the appliance

You install the StorageGRID Webscale software and the operating system by using the appliance installation web page. You can also use the web page to monitor the installation. Installing this software enables you to monitor the appliance information in the StorageGRID Webscale system.

Before you begin

You must have already configured the management and data networks and entered the StorageGRID Webscale Installer IP address.

You must have access to the Passwords.txt file.

You must have an SSH client, such as PuTTY, to use to connect to the StorageGRID Webscale appliance.

About this task

Attention: You must monitor the progress of the installation and put the grid node in maintenance mode at the appropriate time. After the operating system is installed, you are prompted to run a script to put the grid node in maintenance mode. You must run this script within 5 minutes.

When you install the software, the web interface initiates the following operations:

Establishes a connection to the storage array.
Checks the operational status of all drives.
Creates a primary disk pool.
Calculates volume sizes.
Creates volumes.
Creates the LUN mappings.
Renames the array.
Creates a configuration file.
Rescans SCSI ports and reloads devices.

1. From the StorageGRID Webscale Appliance Installer web page, next to the Set StorageGRID Webscale Installer IP option, click Begin StorageGRID node_name node install.

   A list of install operations appears. You can review the installation progress as each operation status changes from Not started to Completed. The status refreshes every five seconds.

2. To monitor progress, return to the StorageGRID Webscale installation web page.

   The Deploy Grid Nodes section shows the installation progress for the appliance Storage Node.

3. Review the appliance web page.

   The following occurs:

   When the operating system installation is in progress, a thumbnail image of the installation appears next to the list of operations.

   The web page displays the last 10 lines of the installation log, which updates every five seconds.

   During the installation of StorageGRID Webscale software, you are prompted to start the grid node in maintenance mode. Since this is a maintenance procedure, follow the instructions on screen to put the grid node into maintenance mode.

   The StorageGRID Webscale Installer Deploy Grid Nodes web page status bar changes to blue, indicating a job in progress, and then to yellow, indicating that the grid node is in maintenance mode.

4. Open the StorageGRID Webscale Installer in your web browser and monitor the progress of the grid node installation.

   Wait until the status bar for each expansion grid node you are adding is yellow and the status is "Stopped in Maintenance Mode" before continuing with the expansion procedure.

   If you are installing additional StorageGRID Webscale appliance Storage Nodes, complete the software installation of the Storage Nodes before running the "Grid Expansion: Initial" grid task. Otherwise, run the "Grid Expansion: Initial" grid task next.

Initializing storage volumes and setting the installation state

You need to initialize the storage volumes (rangedbs) on each Storage Node to make the storage available to the StorageGRID Webscale system.

Before you begin

You must have access to the Passwords.txt file.

1. From the service laptop, log in to the recovered Storage Node as root using the password listed in the Passwords.txt file.
2. Initialize the storage volume (rangedb) hard disks:
   ruby /tmp/ldrinit.rb
   a. When unallocated drives are detected and you are asked whether to use the volume as an LDR rangedb (Accept proposal [Y/n]?), enter y.

   b. For each rangedb drive on the Storage Node, when you are asked to Reformat the rangedb drive <name>? [y/n]?, enter y.

   In the following example, the rangedbs associated with the DC1-S4 Storage Node are initialized:

   DC1-S4:~ $ ruby /tmp/ldrinit.rb
   Unallocated drives detected. Is the following proposed action correct?
   Use /dev/sdb (300.0GiB) as an LDR rangedb
   Use /dev/sdc (300.0GiB) as an LDR rangedb
   Use /dev/sdd (300.0GiB) as an LDR rangedb
   Accept proposal [Y/n]? y
   Determining drive partitioning
   WARNING: drive /dev/sdb exists
   Reformat the rangedb drive /dev/sdb? [Y/n]? y
   WARNING: drive /dev/sdc exists
   Reformat the rangedb drive /dev/sdc? [Y/n]? y
   WARNING: drive /dev/sdd exists
   Reformat the rangedb drive /dev/sdd? [Y/n]? y
   Formatting the following: /dev/sdb /dev/sdc /dev/sdd
   Formatting Storage Drives
   Finalizing device tuning
   Disabling rpcbind service
   Creating Object Stores for LDR
   Generating Grid Interface Configuration file
   LDR initialization complete
   DC1-S4:~ $

3. Set the installation state:
   echo ACTIVATE > /var/local/run/install.state

Rebuilding the Cassandra database

You need to run the check-cassandra-rebuild script to determine if you need to rebuild the Cassandra database, and then rebuild it if required.

Before you begin

The system drives on the server must already have been restored.

The cause of the storage volume failure has been identified and the defective storage hardware has been replaced.

All the replaced storage drives have been formatted as rangedbs.

The total size of the replacement storage must be the same as the original.

You must have access to the Passwords.txt file.

1. From the service laptop, log in to the Storage Node as root using the password listed in the Passwords.txt file.
2. Create a backup of the log directory:
   mv /var/local/log /var/local/log.keep
   mkdir /var/local/log
3. Determine if the Cassandra database must be rebuilt, and then rebuild it if the answer is yes:
   a. Check the database state:
      check-cassandra-rebuild
   b. If asked to Stop storage services [y/n]?, enter y.
   c. If you are prompted to rebuild the Cassandra database, enter y.

   Attention: You should not enter n unless directed by technical support.

   Rebuilding the Cassandra database means that the database is deleted from the Storage Node and rebuilt from other available Storage Nodes. This procedure should never be performed on multiple Storage Nodes concurrently as it may result in data loss.

   In the following example, the Cassandra database has been down for more than 15 days and must be rebuilt:

   Cassandra was down for more than 15 days.
   Running: /usr/local/sbin/rebuild-cassandra-data
   Enter 'y' to rebuild the Cassandra database for this Storage Node.
   Rebuilding the Cassandra database may take as long or longer than 12 hours.
   Once started, do not stop or pause this rebuild operation.
   If the rebuild process is stopped or paused, you must rerun the operation.
   [y/n]? y
   Removing Cassandra commit logs
   Removing Cassandra SSTables
   Updating timestamps of the Cassandra data directories.
   starting service cassandra
   Running nodetool rebuild.
   Done. Cassandra database successfully rebuilt.

If you are not prompted to rebuild the Cassandra database, continue to the next recovery task.

Restoring object data to a storage volume where the system drive also failed

After recovering a storage volume on a Storage Node where the system drive also failed and was recovered, and after rebuilding the Cassandra database, you can restore object data to the recovered storage volumes from other Storage Nodes and Archive Nodes.

Before you begin

You must have acquired the Node ID of the Storage Node where restored storage volumes reside. In the NMS MI, go to Grid Topology > Storage Node > LDR > Overview > Main.

You must have confirmed the condition of the Storage Node. The Storage Node must be displayed in the Grid Topology tree with a color of green and all services must have a state of Online.
If you want to recover erasure-coded object data, the storage pool of which the recovered Storage Node is a member must include enough "green" and online Storage Nodes to support the Erasure Coding scheme used to create the erasure-coded object data being recovered. For example, if you are recovering erasure-coded object data that was created using a scheme of 6+3, at least six Storage Nodes that are members of the Erasure Coding profile's storage pool must be green and online.

About this task

The procedure to restore object data to storage volumes notifies the StorageGRID Webscale system that object data stored on the lost storage volumes is no longer available, which prompts an ILM reevaluation to determine the correct placement of restored object data.

If the only remaining copy of object data is located on an Archive Node, object data is retrieved from the Archive Node. Due to the latency associated with retrievals from external archival storage systems, restoring object data to a Storage Node from an Archive Node takes longer than restoring copies from other Storage Nodes.

If the StorageGRID Webscale system's ILM policy is configured to use an ILM rule with only one active content placement instruction, copies of an object are not made; only a single instance of a replicated object is stored at any one time. If there is a failure, all such objects are lost and cannot be recovered; however, you must still perform the following procedure to purge lost object information from the database. For more information about ILM rules, see the Administrator Guide.

1. From the service laptop, log in to the failed Storage Node as root using the password listed in the Passwords.txt file.

   Attention: You should use the ADE console with caution. If the console is used improperly, it is possible to interrupt system operations and corrupt data. Enter commands carefully, and only use the commands documented in this procedure.

2. Access the ADE console of the LDR service:
   telnet localhost
3. Access the CMSI module:
   cd /proc/cmsi
4. Begin restoring object data:
   Volume_Lost vol_id vol_id

   vol_id is the volume ID of the reformatted volume, or a range of volumes in hex representation. There can be up to 16 volumes, numbered from 0000 to 000F, such as Volume_Lost F

   Note: The second vol_id is optional. For a StorageGRID Webscale appliance Storage Node, you must reformat all storage volumes (0000 to 000F).

   As object data is restored, if the StorageGRID Webscale system cannot locate replicated object data, the LOST (Lost Objects) alarm is triggered. Alarms may be triggered on Storage Nodes throughout the system. Determine the cause of the loss and whether recovery is possible. For more information, see the Troubleshooting Guide.

5. To determine the current status of the Volume_Lost recovery operation, do one or more of the following:

   To determine the status of objects queued for retrieval from an Archive Node: In the NMS MI, go to the Archive Node > ARC > Retrieve > Overview > Main page, and view the Objects Queued attribute.

   To determine the status of the ILM Evaluation (Volume Lost) grid task triggered by the Volume_Lost command: In the NMS MI, go to primary Admin Node > CMN > Grid Tasks > Overview > Main and view the percentage complete under Active. Wait for the grid task to move into the Historical table with a Status of Successful, which indicates a successful Storage Node recovery.

   Unavailable Storage Nodes may affect the progress of ILM Evaluation (Volume Lost) grid tasks depending on where the Storage Node is located.

6. When the Volume_Lost recovery procedure finishes, including the completion of the ILM Evaluation (Volume Lost) grid task, exit the ADE console of the LDR service:
   exit
7. Access the ADE console of the DDS service:
   telnet localhost
8. Access the ECGQ module:
   cd /proc/ecgq
9. Complete the restoration of object data:
   node_repair node_id

   node_id is the node ID for the recovered Storage Node's LDR service.

   The StorageGRID Webscale system completes the processes of recovering object data, ensuring that ILM rules are met. A unique repair ID number is returned to identify this node_repair operation. This repair ID number can be used to track the progress and results of the repair. No other feedback is returned.

   Note: You cannot run multiple node_repair operations at the same time.

10. Determine the current status or result of the node_repair recovery operation:
   repair_status repair_id

   repair_id is the identifier returned when the node_repair command is run.

   To determine the repair_id of a repair, you can list all previously and currently running repairs:
   repair_status all

   In the following example, all object data is successfully recovered:

   ade : /proc/ecgq > node_repair
   ade : /proc/ecgq > Repair of node started. Repair ID
   ade : /proc/ecgq > repair_status
   Repair ID :
   Type : Storage Node Repair
   Node ID :
   Start time : T23:28:
   End time : T23:28:
   State : Success
   Estimated bytes affected :
   Bytes repaired :
   Retry repair : No

   If Retry repair is Yes, check the condition of the StorageGRID Webscale system, and confirm that all grid nodes are "green" in the NMS MI's Grid Topology tree with a state of Online. For erasure-coded object data, confirm that the storage pool of which the recovered Storage Node is a member contains the minimum number of green and online Storage Nodes, so that the erasure coding scheme in use is supported. Resolve any issues with the system, including connectivity, and retry the repair by entering the following command:
   repair_retry <repair_id>

   repair_id is the identifier returned when the node_repair command is run.
Unrecoverable erasure-coded object data triggers the LOST (Lost Objects) and ECOR (Copies Lost) alarms.
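The State and Retry repair fields in the repair_status output determine what to do next. The following Python sketch is not part of StorageGRID Webscale; it only illustrates one way to read a captured status report and map it to the follow-up actions this procedure describes. The function names, the sample values, and the handling of any state other than Success or Failure are assumptions made for illustration.

```python
def parse_repair_status(text):
    """Parse 'Field : value' lines into a dict; missing values become ''."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps keep their own colons.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


def next_action(fields):
    """Map the State and Retry repair fields to the documented follow-up."""
    state = fields.get("State", "")
    retry = fields.get("Retry repair", "")
    if retry == "Yes":
        return "fix system issues, then run repair_retry"
    if state == "Success":
        return "repair complete"
    if state == "Failure" and retry == "No":
        return "erasure-coded object data is permanently lost"
    # Assumption: any other combination means the repair has not finished yet.
    return "check repair_status again later"


# Hypothetical sample: the IDs and byte counts are invented for illustration.
SAMPLE = """\
Repair ID : 1234
Type : Storage Node Repair
Node ID : 5678
State : Success
Estimated bytes affected : 100
Bytes repaired : 100
Retry repair : No
"""

print(next_action(parse_repair_status(SAMPLE)))
```

Checking Retry repair before State mirrors the order in which the procedure presents the cases: a retryable repair is retried regardless of the reported state.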

   If State is Failure and Retry repair is No, erasure-coded object data is permanently lost.

11. Exit the ADE console:

    exit

Related information
StorageGRID Webscale 10.2 Administrator Guide
StorageGRID Webscale 10.2 Troubleshooting Guide

Reverting to the default boot mode

After you have completed the procedure to recover a grid node in maintenance mode, you need to reset the grid node to boot the operating system normally.

1. From the service laptop, log in to the recovered grid node as root using the password listed in the Passwords.txt file.

2. Restore the backed up log directory:

    mv /var/local/log /var/local/log.rebuild
    mv /var/local/log.keep /var/local/log

3. Run the script to override the maintenance mode boot option:

    override_maintenance_mode.sh

   The StorageGRID Webscale appliance will reboot soon after the script runs to complete the installation and start running in bare metal mode. Your SSH session will terminate when the appliance reboots.

4. Log out of the command shell:

    exit

Finishing the StorageGRID Webscale deployment

After completing the deployment of the StorageGRID Webscale grid nodes, you must return to the StorageGRID Webscale Installer and finish the deployment.

1. When the installation of all of the selected grid nodes is complete, and you have verified that they have successfully joined the grid in the NMS MI, return to the StorageGRID Webscale Installer.

2. Click Cancel.

3. When asked to confirm the cancellation, click Yes.
   You are returned to the Welcome page.

4. Close the browser hosting the StorageGRID Webscale Installer.

Recovering from Admin Node failures

The recovery process for an Admin Node depends on whether it is the primary Admin Node or a nonprimary Admin Node at a secondary data center site.

Choices

Recovering from primary Admin Node failures
   You need to complete a series of tasks in specific order to recover from a primary Admin Node failure.

Recovering from nonprimary Admin Node failures
   You need to complete a specific set of tasks to recover from an Admin Node failure at any data center site other than the site where the primary Admin Node is located.

Recovering from primary Admin Node failures

You need to complete a series of tasks in specific order to recover from a primary Admin Node failure.

About this task

You must repair or replace a failed primary Admin Node promptly to avoid affecting the StorageGRID Webscale system's ability to ingest objects. The primary Admin Node hosts the Configuration Management Node (CMN) service, which is responsible for issuing blocks of object identifiers to CMS services. As each object is ingested, a unique identifier from this block is assigned to the object. Each CMS service has a minimum of 16 million object identifiers available for use if the CMN service becomes unavailable and stops issuing object identifiers to CMS services. When all CMS services exhaust their supply of object identifiers, objects can no longer be ingested.

1. Copying audit logs from the failed Admin Node
   Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the recovered Admin Node. You want to preserve the audit log files if they are recoverable.

2. Gathering information about failed grid nodes
   You must gather the information required to clean up each failed grid node in OpenStack Dashboard.

3. Removing failed grid nodes
   You must clean up the deployment in OpenStack Dashboard for the failed grid node before you can recover the grid node.

4. Deploying the StorageGRID Webscale Installer in OpenStack Dashboard
   You must deploy the StorageGRID Webscale Installer (SGI) using the OpenStack Dashboard. The SGI is then accessed through a web browser and used to deploy grid nodes.

5. Generating the grid node recovery deployment template
   To recover grid nodes you need to create a Heat template that includes the deployment information for the grid nodes you are recovering, and then use it to deploy the stack in OpenStack Dashboard.

6. Deploying the recovery grid nodes OpenStack Dashboard
   You can add an individual grid node to the OpenStack deployment by launching a new stack that contains the grid node.

7. Associating the correct floating IP address
   If your grid is using floating IP addresses for public access, you may need to reassign the IP address previously associated with the failed grid node to the recovered grid node.

8. Restoring the GPT Repository and activating services
   You need to restore the Grid Provisioning Tool (GPT) repository from the most recent backup so that you can recreate the provisioning media and SAID package required to recover the primary Admin Node, and then you need to start the primary Admin Node services.

9. Starting StorageGRID Webscale software
   You need to start the StorageGRID Webscale software on grid nodes by enabling services.

10. Disabling the Load Configuration option
   When you are recovering the primary Admin Node, you should disable the Load Configuration option to prevent it from being selected later in error.

11. Starting the GDU server
   You need to start the Grid Deployment Utility server on the primary Admin Node.

12. Determine if hotfixes or maintenance releases must be applied
   After you start the StorageGRID Webscale software, you need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system. If versions do not match, you must apply the required hotfix or maintenance release.

13. Reverting to the default boot mode
   After you have completed the procedure to recover a grid node in maintenance mode, you need to reset the grid node to boot the operating system normally.

14. Finishing the StorageGRID Webscale deployment
   After completing the deployment of the StorageGRID Webscale grid nodes, you must return to the StorageGRID Webscale Installer and finish the deployment.

15. Importing the updated grid specification file
   You must import the bundle that includes the updated grid specification file into the StorageGRID Webscale system after you have started the recovered primary Admin Node.

16. Restoring the audit log
   If you were able to preserve the audit log from the failed Admin Node, so that historical audit log information is retained, you can copy it to the Admin Node you are recovering.

17. Resetting the preferred sender
   If the Admin Node you are recovering is currently set as the preferred sender of notifications and AutoSupport messages, you must reconfigure this setting in the NMS MI.

18. Restoring the NMS database
   If you want to retain the historical information about attribute values and alarms on a failed Admin Node, you need to restore the Network Management System (NMS) database. This database can only be restored if your StorageGRID Webscale system includes more than one Admin Node.

Copying audit logs from the failed Admin Node

Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the recovered Admin Node. You want to preserve the audit log files if they are recoverable.

About this task

This procedure copies the audit log files from the failed Admin Node to a temporary location. These preserved audit logs can then be copied to the replacement Admin Node. Audit logs are not automatically copied to the new Admin Node.

Depending on the failure of the Admin Node, it might not be possible to copy audit logs from the failed Admin Node. In a deployment with more than one Admin Node, audit logs are replicated to all Admin Nodes. Thus, in a multi Admin Node deployment, if audit logs cannot be copied off of the failed Admin Node, audit logs can be recovered from another Admin Node in the system. In a deployment with only one Admin Node, where the audit log cannot be copied off of the failed Admin Node, the recovered Admin Node starts recording events to the audit log in a new empty file and previously recorded data is lost.
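The choice of where to copy audit logs from, and the YYYY-MM-DD.txt.1 name that this procedure uses for the preserved audit.log file, can be sketched in a few lines of Python. This is an illustration only, not a StorageGRID Webscale tool: the function names are hypothetical, and which date to use in the file name is left to the caller because the guide specifies only the format.

```python
from datetime import date


def audit_log_source(failed_node_reachable, other_admin_nodes):
    """Pick where to copy audit logs from, following the rules described above."""
    if failed_node_reachable:
        return "failed Admin Node"
    if other_admin_nodes:
        # Audit logs are replicated to all Admin Nodes, so any other one will do.
        return other_admin_nodes[0]
    # Single Admin Node that cannot be reached: earlier audit data is lost.
    return None


def preserved_name(day, sequence=1):
    """Build a unique numbered name in the YYYY-MM-DD.txt.N format."""
    return f"{day.isoformat()}.txt.{sequence}"


print(audit_log_source(False, ["10.0.0.3"]))   # hypothetical Admin Node address
print(preserved_name(date(2015, 12, 1)))
```

Returning None for the unrecoverable case makes it explicit to the caller that no copy step should be attempted.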

1. From the service laptop, log in to an Admin Node as root using the password listed in the Passwords.txt file.

   If possible, log in to the failed Admin Node. If the failure is such that you cannot log in to the failed Admin Node, log in to the primary Admin Node or another Admin Node, if available.

2. If it is running, stop the AMS service to prevent it from creating a new log file:

    /etc/init.d/ams stop

3. Rename the audit.log file so that it does not overwrite the file on the recovered Admin Node when you copy it to the recovered Admin Node.

   Rename audit.log to a unique numbered file name such as YYYY-MM-DD.txt.1:

    cd /var/local/audit/export
    ls -l
    mv audit.log YYYY-MM-DD.txt.1

4. Restart the AMS service:

    /etc/init.d/ams start

5. Copy all audit log files to a temporary location on any other server:

    scp -p * IP_address:/var/local/tmp

   When prompted, enter the password of the remote server listed in the Passwords.txt file.

6. Log out from the Admin Node you are copying the audit log file from:

    exit

Gathering information about failed grid nodes

You must gather the information required to clean up each failed grid node in OpenStack Dashboard.

1. Log in to the OpenStack Dashboard.

2. Select Project > Compute > Instances.

3. In the Instance Name column, click the link for the instance where the failed grid node is located.

4. Record the IP addresses listed in the IP Addresses section, and the volume names listed in the Volumes Attached section.

Removing failed grid nodes

You must clean up the deployment in OpenStack Dashboard for the failed grid node before you can recover the grid node.

1. Log in to the OpenStack Dashboard.

2. Remove the failed grid node instance:
   a. Select Project > Compute > Instances.
   b. Select the checkbox to the left of the failed grid node instance to terminate.
   c. In the Actions column, select Terminate Instance from the drop-down list.
   d. Click Terminate Instance in the confirmation dialog box.

3. Remove the failed grid node volumes:
   a. Select Project > Compute > Volumes.
   b. Select the checkboxes to the left of each of the volumes associated with the failed grid node.
   c. Click Delete Volumes above the volumes table.
   d. Click Delete Volumes in the confirmation dialog box.

4. Remove the port associated with the failed grid node:
   a. Select Admin > System > Networks.
   b. In the Network Name column, click the link for the network where the failed grid node is located.
   c. Select the checkbox to the left of the port associated with the failed grid node.
   d. In the Actions column, select Delete Port from the drop-down list.
   e. Click Delete Port in the confirmation dialog box.

Deploying the StorageGRID Webscale Installer in OpenStack Dashboard

You must deploy the StorageGRID Webscale Installer (SGI) using the OpenStack Dashboard. The SGI is then accessed through a web browser and used to deploy grid nodes.

Before you begin

OpenStack software must be installed and correctly configured.
You must have the SGI virtual machine disk file (.vmdk) and the correct Heat template file (.yaml) for your deployment. These files must be extracted from the StorageGRID Webscale installation file (.tgz) downloaded from the NetApp Support Site at mysupport.netapp.com.
You must have network configuration information for the SGI (IP address, network mask, default gateway).

About this task

You must deploy the SGI on the same network as, or a network that is accessible to, the grid nodes being deployed for the StorageGRID Webscale system. An additional, unique IP address is required for the SGI, one that is separate from the IP addresses assigned to grid nodes in the grid specification file.

The best practice is to remove the SGI virtual machine after you have deployed all grid nodes and verified that they have started successfully and joined the grid. This helps to ensure that future changes to the grid topology are not completed with the incorrect version of the SGI.
Each time you make a change to the grid topology, you must ensure that the SGI version matches the version of your StorageGRID Webscale system.

1. Log in to the OpenStack Dashboard.

2. Select Project > Orchestration > Stacks.

3. Click Launch Stack.

4. In the Select Template dialog box enter the configuration information for the StorageGRID Webscale Installer stack:
   Template Source: Select URL from the drop-down list.
   Template URL: Enter, or copy and paste, the location of the SGI Heat template, for example:
   Leave Environment Source and Environment File at their default values.

5. Click Next.

6. In the Launch Stack dialog box enter the SGI deployment information:
   Stack Name: Enter a meaningful name for the stack, for example, NetApp-SGI.
   Creation Timeout (minutes): The time to allow for the stack creation before a timeout in minutes.
   Rollback On Failure: Select this option to enable rollback to the starting state upon failure. This option should be selected with caution, because it will prevent error messages from being displayed in OpenStack Dashboard if the deployment fails.
   Password for user "username": Enter the password for the specified OpenStack project user account.
   StorageGRID Network: Select the network the grid nodes are being deployed in from the drop-down list.
   StorageGRID Network Netmask: Enter the network mask for the SGI on the StorageGRID Network.
   StorageGRID Network Gateway: Enter the default gateway for the SGI on the StorageGRID Network.
   StorageGRID Installer IP Address: Enter the IP address for the SGI on the StorageGRID Network.
   Public Network: Select the public network the SGI will use from the drop-down list. This is the network that Floating IP addresses are allocated from. This field is only applicable, and is only displayed, if you are using the SGI template for private tenant networks (SGI_Private_Routing_Template.yaml).
   StorageGRID Installer Image URL: Enter, or copy and paste, the URL for the SGI virtual machine disk (.vmdk) file, for example: SGI a323.vmdk

7. Click Launch.
   You must wait for the deployment to complete, and then you can access the SGI in your web browser at the IP address you specified in the StorageGRID Installer IP Address text box.
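Before entering the SGI network values in the Launch Stack dialog box, it can help to sanity-check them. The Python sketch below uses only the standard ipaddress module and is not part of the StorageGRID Webscale software. It checks the requirement stated above that the SGI address is separate from the grid node addresses; the additional check that the gateway lies on the SGI subnet is a common networking assumption rather than something this guide states. The function name and the addresses are hypothetical.

```python
import ipaddress


def check_sgi_address(sgi_ip, netmask, gateway, grid_node_ips):
    """Return a list of problems; an empty list means the values look consistent."""
    problems = []
    sgi = ipaddress.ip_address(sgi_ip)
    gw = ipaddress.ip_address(gateway)
    # Derive the subnet from the SGI address and mask; strict=False allows host bits.
    network = ipaddress.ip_network(f"{sgi_ip}/{netmask}", strict=False)
    if gw not in network:
        # Assumption: the default gateway is expected to sit on the SGI subnet.
        problems.append("gateway is not on the SGI network")
    if sgi in {ipaddress.ip_address(ip) for ip in grid_node_ips}:
        # From the guide: the SGI needs an address separate from all grid nodes.
        problems.append("SGI address is already assigned to a grid node")
    return problems


# Hypothetical values for illustration only.
print(check_sgi_address("10.0.0.5", "255.255.255.0", "10.0.0.1", ["10.0.0.10"]))
```

The grid node addresses would come from the grid specification file; how they are extracted from it is outside the scope of this sketch.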
Generating the grid node recovery deployment template

To recover grid nodes you need to create a Heat template that includes the deployment information for the grid nodes you are recovering, and then use it to deploy the stack in OpenStack Dashboard.

Before you begin

You must have the latest version of the SAID package. You can download the latest version from the NMS MI (Grid Management > Grid Maintenance > Provisioning Backup), or acquire the latest backup version you have stored in a secure location.

1. In a supported web browser, navigate to the StorageGRID Webscale Installer using the IP address configured when deploying the StorageGRID Webscale Installer.

2. On the Welcome page, select Modify an existing StorageGRID Webscale System.

3. In the Upload the SAID package page, click Upload, locate and select the latest SAID package file (.zip) for your StorageGRID Webscale system, and click Open.

   The SAID file is named using the following format: GIDgridIDNumber_REVrevisionNumber_SAID.zip

   Ensure that you select the .zip file with the highest revision number.

4. Click Next.

5. In the Grid Configuration page, verify the values specified for virtual machine based Storage Nodes:
   Number of RangeDBs: The number of storage volumes (RangeDBs) to attach to each Storage Node.
   Size of RangeDBs (GBs): The size of each individual storage volume (RangeDB) in gigabytes. You must enter a value between 50 GB and 20000 GB (20 terabytes). The minimum value for production systems is 4000 GB (4 terabytes).

   These values are defined in the grid specification file, and should not be modified unless you verify that the specified values are incorrect. If you are only deploying StorageGRID Webscale appliance Storage Nodes, these values are not used because the full storage capacity on the appliance is always used.

6. Click Save.

7. Click Next.

8. In the Deploy Grid Nodes page, select the recovery grid nodes to deploy:
   a. Select the grid nodes you need to recover, and deselect all other grid nodes.
   b. Click Generate Configuration.
   c. Copy the generated URL from the Grid Nodes URL text box.

   You use the Grid Nodes URL value to launch the recovery stack in OpenStack Dashboard.

Deploying the recovery grid nodes OpenStack Dashboard

You can add an individual grid node to the OpenStack deployment by launching a new stack that contains the grid node.

1. Log in to the OpenStack Dashboard.

2. Click Project > Orchestration > Stacks.

3. Click Launch Stack.

4. In the Select Template dialog box enter the grid deployment file information:
   Template Source: Select URL from the drop-down list.
   Template URL: Paste in, or enter, the Grid Nodes URL value from the Deploy Grid Nodes page in the SGI.
   Leave Environment Source and Environment File at their default values.

5. Click Next.

6. In the Launch Stack dialog box enter the grid deployment information:
   Stack Name: Enter a meaningful name for the stack, for example, DC1-S4-Recovery.
   Creation Timeout (minutes): The time to allow for the stack creation before a timeout in minutes.
   Rollback On Failure: Select this option to enable rollback to the starting state upon failure. This option should be selected with caution, because it will prevent error messages from being displayed in OpenStack Dashboard if the deployment fails.
   Password for user "username": Enter the password for the specified OpenStack project user account.
   StorageGRID node security group: The Neutron security group to use for the grid node. The default value, StorageGRID Node Firewall, should be used in most cases. You can ensure that the correct value is listed by verifying that the release number is listed followed by the deployment date and a unique ID, for example, unique_id.
   Maintenance Mode: Select True from the drop-down list.
   StorageGRID node server flavor: The type of server to use for the grid node. You need to select the node server flavor that corresponds to the stack for your StorageGRID Webscale grid deployment. For example, if you named the StorageGRID Webscale stack NetApp_SGW, look for an entry named NetApp_SGW_node_flavor-uniqueID.
   StorageGRID node root image: Select "StorageGRID root image" as the root disk for the grid node. You can ensure that you are selecting the correct entry by verifying that the release number is listed followed by the deployment date and a unique ID, for example, unique_id.

7. Click Launch.

8. Return to the SGI and monitor the progress of the grid node installation.
   Wait until the status bar for each recovered grid node is yellow and the status is Stopped in Maintenance Mode before continuing with the recovery procedure.
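When many flavors are listed in the StorageGRID node server flavor drop-down list, the naming rule from step 6 (stack name followed by _node_flavor- and a unique ID) can be applied mechanically. The following Python sketch illustrates that rule on a list of flavor names that you have copied out of the dashboard; the function name and the sample flavor names are hypothetical.

```python
def matching_flavors(stack_name, flavor_names):
    """Return the flavor names that follow the <stack>_node_flavor-<id> pattern."""
    prefix = f"{stack_name}_node_flavor-"
    return [name for name in flavor_names if name.startswith(prefix)]


# Hypothetical flavor names for illustration only.
flavors = ["NetApp_SGW_node_flavor-ab12", "Other_node_flavor-cd34", "m1.small"]
print(matching_flavors("NetApp_SGW", flavors))
```

If the result contains more than one entry, the stack name alone does not identify the flavor, and the unique ID has to be checked by hand.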
Associating the correct floating IP address

If your grid is using floating IP addresses for public access, you may need to reassign the IP address previously associated with the failed grid node to the recovered grid node.

1. Log in to the OpenStack Dashboard.

2. Select Project > Compute > Instances.

3. In the IP Address column, verify that the Floating IPs value for the recovered node matches the value used by the failed node. If the value is incorrect, you need to remove the incorrect value and assign the correct floating IP address:
   a. Select Disassociate Floating IP from the drop-down list in the Actions column for the recovered node instance.
   b. Click Disassociate Floating IP in the confirmation dialog box.
   c. Select Associate Floating IP from the drop-down list in the Actions column for the recovered node instance.
   d. Select the correct floating IP address to use from the IP Address drop-down list and click Associate.

      Do not change the value of the Port to be associated drop-down list.

Restoring the GPT Repository and activating services

You need to restore the Grid Provisioning Tool (GPT) repository from the most recent backup so that you can recreate the provisioning media and SAID package required to recover the primary Admin Node, and then you need to start the primary Admin Node services.

Before you begin

You are recovering the primary Admin Node.
You have the most recent version of the provisioning backup archive file (.zip).

About this task

The GPT Repository is a tool used to create provisioning media and SAID packages. The primary Admin Node typically hosts the GPT Repository. To recover the primary Admin Node, you must restore the repository from the backup. Restore the GPT repository on the Admin Node using the provisioning media created the last time the StorageGRID Webscale system was provisioned.

1. Log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Use an SCP application such as WinSCP to copy the provisioning backup archive file (gptrepository.zip) from your service laptop to the primary Admin Node.

3. Unzip the provisioning backup archive file in the /var/local/tmp folder on the primary Admin Node.

4. Restore the grid provisioning tool repository:

    restore-repository /var/local/tmp

5. You are prompted for the repository passphrase. Enter the passphrase.

6. Log out from the primary Admin Node server.

7. Log back in to the primary Admin Node as root, using the password listed in the Passwords.txt file.

8. Activate the primary Admin Node services:

    activate-admin-services.sh

Starting StorageGRID Webscale software

You need to start the StorageGRID Webscale software on grid nodes by enabling services.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Add your private key identity to the SSH authentication agent:

    ssh-add

3. Enter the SSH Access Password for the primary Admin Node listed in the Passwords.txt file.

4. Start the GDU:

    gdu-console

5. Type the provisioning passphrase, press the Tab key to select OK, and press Enter.

6. Start the StorageGRID Webscale software on the server:
   a. Ensure that the grid node you are configuring is selected and that the current state is Available.
   b. Press the Tab key to move through the options to the Tasks list.
   c. Use the down-arrow key to highlight Enable Services, and press the Spacebar to select it.
   d. Press the Tab key to move through the options and select the Start Task action.
   e. Press Enter to run the task.
      Wait for the message Finished Postinstall start task to appear in the Log Messages panel.
      Note: If you are completing this procedure on a primary Admin Node, do not select the Load Configuration option.
   f. When the task completes, press the right-arrow key to move to the Quit action and press Enter.

7. End the SSH session:

    ssh-add -D

Disabling the Load Configuration option

When you are recovering the primary Admin Node, you should disable the Load Configuration option to prevent it from being selected later in error.

About this task

The system automatically restores the latest version of the configuration bundles to the new Admin Node after it is started. If you mistakenly select the Load Configuration option, the system overwrites all configuration changes made through the NMS MI since the StorageGRID Webscale system was first installed.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Disable the Load Configuration option for GDU:

    echo LOAD > /var/local/run/install.state

   The Load Configuration option will no longer appear for the primary Admin Node in GDU.

Starting the GDU server

You need to start the Grid Deployment Utility server on the primary Admin Node.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Start the GDU server service:

    service gdu-server start

Determine if hotfixes or maintenance releases must be applied

After you start the StorageGRID Webscale software, you need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system. If versions do not match, you must apply the required hotfix or maintenance release.

Before you begin

StorageGRID Webscale software must be started on the recovered grid node.

1. Sign in to the NMS MI.

2. Determine the current version of the StorageGRID Webscale software:
   a. Select Grid Topology > grid node of same type > SSM > Services > Main.
   b. Under Packages, note the storage-grid-release number.

3. Determine the version of the StorageGRID Webscale software of the recovered grid node:
   a. Select Grid Topology > recovered grid node > SSM > Services > Main.
   b. Under Packages, note the storage-grid-release number.

4. Compare the two versions, and if they differ install the required hotfixes or maintenance releases to update the recovered grid node to the correct software version.
   For more information about available hotfixes and maintenance releases, contact technical support.

Related tasks
Checking the StorageGRID Webscale version

Reverting to the default boot mode

After you have completed the procedure to recover a grid node in maintenance mode, you need to reset the grid node to boot the operating system normally.

1. From the service laptop, log in to the recovered grid node as root using the password listed in the Passwords.txt file.

2. Run the script to override the maintenance mode boot option:

    override_maintenance_mode.sh

3. Log out of the command shell:

    exit

Finishing the StorageGRID Webscale deployment

After completing the deployment of the StorageGRID Webscale grid nodes, you must return to the StorageGRID Webscale Installer and finish the deployment.

1. When the installation of all of the selected grid nodes is complete, and you have verified that they have successfully joined the grid in the NMS MI, return to the StorageGRID Webscale Installer.

2. Click Cancel.

3. When asked to confirm the cancellation, click Yes.
   You are returned to the Welcome page.

4. Close the browser hosting the StorageGRID Webscale Installer.

Importing the updated grid specification file

You must import the bundle that includes the updated grid specification file into the StorageGRID Webscale system after you have started the recovered primary Admin Node.

About this task

Normally the system is provisioned when the primary Admin Node is running, and the updated specification file is imported automatically. In this case, because you are recovering the primary Admin Node, it was not installed and operational when you provisioned the StorageGRID Webscale system during the recovery procedure.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Display the status of the StorageGRID Webscale services, and ensure that the status of the CMN service is listed as Running:

    storagegrid-status

3. Enter Ctrl+C to return to the command prompt.

4. Import the grid provisioning tool bundle, including the updated grid specification file:

    import-gptb-bundle

5. When prompted, enter the repository passphrase.
   The required passphrase is the one you recorded as part of the original installation of the StorageGRID Webscale system.

   The updated grid specification file is imported into the StorageGRID Webscale system.

You can view the grid specification file through the NMS MI at Grid Management > Grid Configuration > Configuration. This version of the grid specification file is only a copy and not the actual grid specification file. This copied version is only used for troubleshooting purposes. It cannot be used to provision the StorageGRID Webscale system.

Restoring the audit log

If you were able to preserve the audit log from the failed Admin Node, so that historical audit log information is retained, you can copy it to the Admin Node you are recovering.

Before you begin

The Admin Node must be installed and running.
You must have copied the audit logs to another location after the original Admin Node failed.

About this task

If an Admin Node fails, audit logs saved to that Admin Node are potentially lost. It might be possible to preserve data from loss by copying audit logs from the failed Admin Node and then restoring these audit logs to the recovered Admin Node.

Depending on the failure of the Admin Node, it might not be possible to copy audit logs from the failed Admin Node. In a deployment with more than one Admin Node, audit logs are replicated to all Admin Nodes. Thus, in a multi Admin Node deployment, if audit logs cannot be copied from the failed Admin Node, audit logs can be recovered from another Admin Node in the system. In a deployment with only one Admin Node, where the audit log cannot be copied from the failed Admin Node, the recovered Admin Node starts recording events to the audit log as if the installation is new.

You must recover an Admin Node as soon as possible to restore logging functionality.

1. From the service laptop, log in to the grid node where you made a temporary copy of the audit log files as root using the password listed in the Passwords.txt file.

2. Check which audit files have been preserved:

    cd /var/local/tmp
    ls -l

   The following audit log files might be present in the temporary directory:
   The renamed current log file (audit.log) from the failed grid node: YYYY-MM-DD.txt.1
   Rotated audit log file from the day before the failure: YYYY-MM-DD.txt (or YYYY-MM-DD.txt.n if more than one is created in a day).
   Older compressed and rotated audit log files: YYYY-MM-DD.txt.gz, preserving the original archive date in their name.

3. Copy the preserved audit log files to the recovered Admin Node:

    scp -p YYYY* recovered_admin_ip:/var/local/audit/export

   Where recovered_admin_ip is the IP address of the recovered Admin Node.

4. When prompted, enter the password of the recovered grid node, as listed in the Passwords.txt file.

5. Remove the preserved audit log files from their temporary location:

    rm YYYY*

6. When prompted, confirm that you want to remove the file:

    y

7. Log out of the server that contained the temporary location of the audit logs:

    exit

48 48 StorageGRID Webscale 10.2 Maintenance Guide for OpenStack Deployments 8. Log in to the recovered grid node as root. 9. Update the user and group settings of the preserved audit log files: cd /var/local/audit/export chown ams-user:bycast * 10. Log out from the recovered grid node: exit After you finish You will also need to restore any preexisting client access to the audit share. For more information, see the Administrator Guide. Related tasks Copying audit logs from the failed Admin Node on page 36 Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the recovered Admin Node. You want to preserve the audit log files if they are recoverable. Resetting the preferred sender If the Admin Node you are recovering is currently set as the preferred sender of notifications and AutoSupport messages, you must reconfigure this setting in the NMS MI. Before you begin The Admin Node must be installed and running. 1. Sign in to the NMS MI of the recovered Admin Node using the Vendor account. 2. Go to Grid Management > NMS Management > General > Main. 3. Select the recovered Admin Node from the Preferred Sender drop-down list. 4. Click Apply Changes. Restoring the NMS database If you want to retain the historical information about attribute values and alarms on a failed Admin Node, you need to restore the Network Management System (NMS) database. This database can only be restored if your StorageGRID Webscale system includes more than one Admin Node. Before you begin The Admin Node must be installed and running.

The StorageGRID Webscale system must include one or more data center sites with an Admin Node.

About this task
If an Admin Node fails, the historical information about attribute values and alarms that is stored in the NMS database for that Admin Node is lost. In a StorageGRID Webscale system with more than one Admin Node, the NMS database is recovered by copying the NMS database from another Admin Node. In a system with only one Admin Node, the NMS database cannot be restored.

When you are recovering an Admin Node, the software installation process creates a new database for the NMS service on the recovered Admin Node. After it is started, the recovered Admin Node records attribute and audit information for all services as if you were performing a new installation of the StorageGRID Webscale system. In a StorageGRID Webscale system with more than one Admin Node, you can copy the NMS database from another Admin Node to the recovered Admin Node and restore historical information.

Note: To copy the NMS database, the StorageGRID Webscale system must be configured with multiple Admin Nodes.

1. Stop the MI service on both Admin Nodes:
a. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.
b. Stop the MI service:
/etc/init.d/mi stop
c. Repeat the same steps on the second Admin Node.
2. Complete the following steps on the recovered Admin Node:
a. Copy the database:
/usr/local/mi/bin/mi-clone-db.sh Source_Admin_Node_IP
Source_Admin_Node_IP is the IP address of the source from which the Admin Node copies the NMS database.
b. When prompted, enter the password for the Admin Node found in the Passwords.txt file.
c. When prompted, confirm that you want to overwrite the MI database on the recovered Admin Node.
d. When prompted, enter the password for the Admin Node found in the Passwords.txt file.
The NMS database and its historical data are copied to the recovered Admin Node. When it is done, the script starts the recovered Admin Node.
Note: Copying the NMS database may take several hours.
3. Restart the MI service on the source Admin Node:
/etc/init.d/mi start
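
The order of operations in this procedure matters: the MI service is stopped on both Admin Nodes before the copy, and restarted on the source only afterward. The following is a minimal sketch that prints that sequence as a checklist without executing anything. The function name nms_restore_plan and the host labels are hypothetical and not part of StorageGRID Webscale; only the three commands themselves come from the steps above.

```shell
#!/bin/sh
# nms_restore_plan is a hypothetical helper, not a StorageGRID Webscale command.
# It only prints the ordered commands from the procedure; it runs none of them.
nms_restore_plan() (
  source_ip=$1
  if [ -z "$source_ip" ]; then
    echo "usage: nms_restore_plan Source_Admin_Node_IP" >&2
    exit 1
  fi
  # Step 1: stop the MI service on the source and the recovered Admin Node.
  echo "both Admin Nodes: /etc/init.d/mi stop"
  # Step 2: run the clone script on the recovered Admin Node.
  echo "recovered Admin Node: /usr/local/mi/bin/mi-clone-db.sh $source_ip"
  # Step 3: restart the MI service on the source Admin Node.
  echo "source Admin Node: /etc/init.d/mi start"
)
```

Calling nms_restore_plan with the IP address of the source Admin Node prints one line per step, which can be pasted into a change record before the maintenance window.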

Recovering from nonprimary Admin Node failures
You need to complete a specific set of tasks to recover from an Admin Node failure at any data center site other than the site where the primary Admin Node is located.

1. Copying audit logs from the failed Admin Node on page 51
Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the recovered Admin Node. You want to preserve the audit log files if they are recoverable.
2. Gathering information about failed grid nodes on page 52
You must gather the information required to clean up each failed grid node in OpenStack Dashboard.
3. Removing failed grid nodes on page 53
You must clean up the deployment in OpenStack Dashboard for the failed grid node before you can recover the grid node.
4. Deploying the StorageGRID Webscale Installer in OpenStack Dashboard on page 53
You must deploy the StorageGRID Webscale Installer (SGI) using the OpenStack Dashboard. The SGI is then accessed through a web browser and used to deploy grid nodes.
5. Generating the grid node recovery deployment template on page 55
To recover grid nodes, you need to create a Heat template that includes the deployment information for the grid nodes you are recovering, and then use it to deploy the stack in OpenStack Dashboard.
6. Deploying the recovery grid nodes OpenStack Dashboard on page 56
You can add an individual grid node to the OpenStack deployment by launching a new stack that contains the grid node.
7. Associating the correct floating IP address on page 57
If your grid is using floating IP addresses for public access, you may need to reassign the IP address previously associated with the failed grid node to the recovered grid node.
8. Activating the nonprimary Admin Node services on page 57
You need to activate the services on the nonprimary Admin Node you are recovering.
9. Starting StorageGRID Webscale software on page 57
You need to start the StorageGRID Webscale software on grid nodes by enabling services.
10. Determine if hotfixes or maintenance releases must be applied on page 58
After you start the StorageGRID Webscale software, you need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system. If versions do not match, you must apply the required hotfix or maintenance release.
11. Reverting to the default boot mode on page 59
After you have completed the procedure to recover a grid node in maintenance mode, you need to reset the grid node to boot the operating system normally.
12. Finishing the StorageGRID Webscale deployment on page 59
After completing the deployment of the StorageGRID Webscale grid nodes, you must return to the StorageGRID Webscale Installer and finish the deployment.
13. Restoring the audit log on page 59
If you were able to preserve the audit log from the failed Admin Node, so that historical audit log information is retained, you can copy it to the Admin Node you are recovering.
14. Resetting the preferred sender on page 61
If the Admin Node you are recovering is currently set as the preferred sender of notifications and AutoSupport messages, you must reconfigure this setting in the NMS MI.
15. Restoring the NMS database on page 61
If you want to retain the historical information about attribute values and alarms on a failed Admin Node, you need to restore the Network Management System (NMS) database. This database can only be restored if your StorageGRID Webscale system includes more than one Admin Node.

Copying audit logs from the failed Admin Node
Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the recovered Admin Node. You want to preserve the audit log files if they are recoverable.

About this task
This procedure copies the audit log files from the failed Admin Node to a temporary location. These preserved audit logs can then be copied to the replacement Admin Node. Audit logs are not automatically copied to the new Admin Node.

Depending on the failure of the Admin Node, it might not be possible to copy audit logs from the failed Admin Node. In a deployment with more than one Admin Node, audit logs are replicated to all Admin Nodes. Thus, if audit logs cannot be copied from the failed Admin Node, they can be recovered from another Admin Node in the system. In a deployment with only one Admin Node, where the audit log cannot be copied from the failed Admin Node, the recovered Admin Node starts recording events to the audit log in a new empty file and previously recorded data is lost.

1. From the service laptop, log in to an Admin Node as root using the password listed in the Passwords.txt file.
If possible, log in to the failed Admin Node. If the failure is such that you cannot log in to the failed Admin Node, log in to the primary Admin Node or another Admin Node, if available.
2. If it is running, stop the AMS service to prevent it from creating a new log file:
/etc/init.d/ams stop
3. Rename the audit.log file so that it does not overwrite the file on the recovered Admin Node when you copy it to the recovered Admin Node.
Rename audit.log to a unique numbered file name such as YYYY-MM-DD.txt.1, where YYYY-MM-DD is the date:
cd /var/local/audit/export
ls -l
mv audit.log YYYY-MM-DD.txt.1
4. Restart the AMS service:
/etc/init.d/ams start
5. Copy all audit log files to a temporary location on any other server:
scp -p * IP_address:/var/local/tmp
When prompted, enter the password of the remote server listed in the Passwords.txt file.
6. Log out from the Admin Node you are copying the audit log file from:
exit
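
The rename in step 3 only protects existing files if the new name is not already taken. The following is a minimal sketch of one way to pick an unused name of the form YYYY-MM-DD.txt.N before running mv. The function name next_audit_name is hypothetical and is not a StorageGRID Webscale command; the directory and date are supplied by the operator.

```shell
#!/bin/sh
# next_audit_name is a hypothetical helper, not a StorageGRID Webscale command.
# It prints the first unused name of the form YYYY-MM-DD.txt.N in a directory.
next_audit_name() (
  dir=$1   # directory that holds the audit logs
  day=$2   # date string in YYYY-MM-DD form
  n=1
  # Increase the numeric suffix until no file with that name exists.
  while [ -e "$dir/$day.txt.$n" ]; do
    n=$((n + 1))
  done
  printf '%s.txt.%s\n' "$day" "$n"
)

# Example use on the Admin Node, after stopping the AMS service:
#   cd /var/local/audit/export
#   mv audit.log "$(next_audit_name . "$(date +%Y-%m-%d)")"
```

The helper only inspects file names; it does not move or modify any audit log itself.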

Gathering information about failed grid nodes
You must gather the information required to clean up each failed grid node in OpenStack Dashboard.

1. Log in to the OpenStack Dashboard.
2. Select Project > Compute > Instances.
3. In the Instance Name column, click the link for the instance where the failed grid node is located.
4. Record the IP addresses listed in the IP Addresses section, and the volume names listed in the Volumes Attached section.

Removing failed grid nodes
You must clean up the deployment in OpenStack Dashboard for the failed grid node before you can recover the grid node.

1. Log in to the OpenStack Dashboard.
2. Remove the failed grid node instance:
a. Select Project > Compute > Instances.
b. Select the checkbox to the left of the failed grid node instance to terminate.
c. In the Actions column, select Terminate Instance from the drop-down list.
d. Click Terminate Instance in the confirmation dialog box.
3. Remove the failed grid node volumes:
a. Select Project > Compute > Volumes.
b. Select the checkboxes to the left of each of the volumes associated with the failed grid node.
c. Click Delete Volumes above the volumes table.
d. Click Delete Volumes in the confirmation dialog box.
4. Remove the port associated with the failed grid node:
a. Select Admin > System > Networks.
b. In the Network Name column, click the link for the network where the failed grid node is located.
c. Select the checkbox to the left of the port associated with the failed grid node.
d. In the Actions column, select Delete Port from the drop-down list.
e. Click Delete Port in the confirmation dialog box.

Deploying the StorageGRID Webscale Installer in OpenStack Dashboard
You must deploy the StorageGRID Webscale Installer (SGI) using the OpenStack Dashboard. The SGI is then accessed through a web browser and used to deploy grid nodes.

Before you begin
OpenStack software must be installed and correctly configured.
You must have the SGI virtual machine disk file (.vmdk) and the correct Heat template file (.yaml) for your deployment. These files must be extracted from the StorageGRID Webscale installation file (.tgz) downloaded from the NetApp Support Site at mysupport.netapp.com.
You must have network configuration information for the SGI (IP address, network mask, default gateway).

About this task
You must deploy the SGI on the same network as, or a network that is accessible to, the grid nodes being deployed for the StorageGRID Webscale system. An additional, unique IP address is required for the SGI, one that is separate from the IP addresses assigned to grid nodes in the grid specification file.

The best practice is to remove the SGI virtual machine after you have deployed all grid nodes and verified that they have started successfully and joined the grid. This helps to ensure that future changes to the grid topology are not completed with the incorrect version of the SGI. Each time you make a change to the grid topology, you must ensure that the SGI version matches the version of your StorageGRID Webscale system.

1. Log in to the OpenStack Dashboard.
2. Select Project > Orchestration > Stacks.
3. Click Launch Stack.
4. In the Select Template dialog box, enter the configuration information for the StorageGRID Webscale Installer stack:
Template Source: Select URL from the drop-down list.
Template URL: Enter, or copy and paste, the location of the SGI Heat template, for example:
Leave Environment Source and Environment File at their default values.
5. Click Next.
6. In the Launch Stack dialog box, enter the SGI deployment information:
Stack Name: Enter a meaningful name for the stack, for example, NetApp-SGI.
Creation Timeout (minutes): The time, in minutes, to allow for the stack creation before a timeout.
Rollback On Failure: Select this option to enable rollback to the starting state upon failure. This option should be selected with caution, because it will prevent error messages from being displayed in OpenStack Dashboard if the deployment fails.
Password for user "username": Enter the password for the specified OpenStack project user account.
StorageGRID Network: Select the network the grid nodes are being deployed in from the drop-down list.
StorageGRID Network Netmask: Enter the network mask for the SGI on the StorageGRID Network.
StorageGRID Network Gateway: Enter the default gateway for the SGI on the StorageGRID Network.
StorageGRID Installer IP Address: Enter the IP address for the SGI on the StorageGRID Network.
Public Network: Select the public network the SGI will use from the drop-down list. This is the network that floating IP addresses are allocated from. This field is only applicable, and is only displayed, if you are using the SGI template for private tenant networks (SGI_Private_Routing_Template.yaml).
StorageGRID Installer Image URL: Enter, or copy and paste, the URL for the SGI virtual machine disk (.vmdk) file, for example: SGI a323.vmdk

7. Click Launch.
You must wait for the deployment to complete, and then you can access the SGI in your web browser at the IP address you specified in the StorageGRID Installer IP Address text box.

Generating the grid node recovery deployment template
To recover grid nodes, you need to create a Heat template that includes the deployment information for the grid nodes you are recovering, and then use it to deploy the stack in OpenStack Dashboard.

Before you begin
You must have the latest version of the SAID package. You can download the latest version from the NMS MI (Grid Management > Grid Maintenance > Provisioning Backup), or acquire the latest backup version you have stored in a secure location.

1. In a supported web browser, navigate to the StorageGRID Webscale Installer using the IP address configured when deploying the StorageGRID Webscale Installer.
2. On the Welcome page, select Modify an existing StorageGRID Webscale System.
3. In the Upload the SAID package page, click Upload, locate and select the latest SAID package file (.zip) for your StorageGRID Webscale system, and click Open.
The SAID file is named using the following format: GIDgridIDNumber_REVrevisionNumber_SAID.zip
Ensure that you select the .zip file with the highest revision number.
4. Click Next.
5. In the Grid Configuration page, verify the values specified for virtual machine based Storage Nodes:
Number of RangeDBs: The number of storage volumes (RangeDBs) to attach to each Storage Node.
Size of RangeDBs (GBs): The size of each individual storage volume (RangeDB) in gigabytes. You must enter a value between 50 GB and 20000 GB (20 terabytes). The minimum value for production systems is 4000 GB (4 terabytes).
These values are defined in the grid specification file, and should not be modified unless you verify that the specified values are incorrect. If you are only deploying StorageGRID Webscale appliance Storage Nodes, these values are not used because the full storage capacity on the appliance is always used.
6. Click Save.
7. Click Next.
8. In the Deploy Grid Nodes page, select the recovery grid nodes to deploy:
a. Select the grid nodes you need to recover, and deselect all other grid nodes.
b. Click Generate Configuration.
c. Copy the generated URL from the Grid Nodes URL text box.
You use the Grid Nodes URL value to launch the recovery stack in OpenStack Dashboard.
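
Step 3 asks for the SAID package with the highest revision number, which is easy to get wrong when several backups sit in one directory (REV10 sorts before REV9 alphabetically). The following is a minimal sketch that picks the highest revision by numeric comparison. The function name latest_said is hypothetical, and the sketch assumes that both gridIDNumber and revisionNumber consist only of digits; names that do not match the documented format are ignored.

```shell
#!/bin/sh
# latest_said is a hypothetical helper, not a StorageGRID Webscale command.
# Assumption: gridIDNumber and revisionNumber in the file name are digits only.
latest_said() {
  for f in "$@"; do
    name=${f##*/}   # strip any leading directory
    # Extract the revision number if the name matches the SAID format.
    rev=$(printf '%s\n' "$name" |
      sed -n 's/^GID[0-9]\{1,\}_REV\([0-9]\{1,\}\)_SAID\.zip$/\1/p')
    if [ -n "$rev" ]; then
      printf '%s %s\n' "$rev" "$f"
    fi
  done | sort -n -k1,1 | tail -n 1 | cut -d' ' -f2-
}

# Example use in a directory of stored backups:
#   latest_said ./*_SAID.zip
```

The function prints nothing when no argument matches the format, so an empty result is a signal to check the backup location rather than to upload an arbitrary file.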

Deploying the recovery grid nodes OpenStack Dashboard
You can add an individual grid node to the OpenStack deployment by launching a new stack that contains the grid node.

1. Log in to the OpenStack Dashboard.
2. Click Project > Orchestration > Stacks.
3. Click Launch Stack.
4. In the Select Template dialog box, enter the grid deployment file information:
Template Source: Select URL from the drop-down list.
Template URL: Paste in, or enter, the Grid Nodes URL value from the Deploy the Grid page in the SGI.
Leave Environment Source and Environment File at their default values.
5. Click Next.
6. In the Launch Stack dialog box, enter the grid deployment information:
Stack Name: Enter a meaningful name for the stack, for example, DC1-S4-Recovery.
Creation Timeout (minutes): The time, in minutes, to allow for the stack creation before a timeout.
Rollback On Failure: Select this option to enable rollback to the starting state upon failure. This option should be selected with caution, because it will prevent error messages from being displayed in OpenStack Dashboard if the deployment fails.
Password for user "username": Enter the password for the specified OpenStack project user account.
StorageGRID node security group: The Neutron security group to use for the grid node. The default value, StorageGRID Node Firewall, should be used in most cases. You can ensure that the correct value is listed by verifying that the release number is listed followed by the deployment date and a unique ID, for example, unique_id.
Maintenance Mode: Select True from the drop-down list.
StorageGRID node server flavor: The type of server to use for the grid node. You need to select the node server flavor that corresponds to the stack for your StorageGRID Webscale grid deployment. For example, if you named the StorageGRID Webscale stack NetApp_SGW, look for an entry named NetApp_SGW_node_flavor-uniqueID.
StorageGRID node root image: Select "StorageGRID root image" as the root disk for the grid node. You can ensure that you are selecting the correct entry by verifying that the release number is listed followed by the deployment date and a unique ID, for example, unique_id.
7. Click Launch.
8. Return to the SGI and monitor the progress of the grid node installation.
Wait until the status bar for each recovered grid node is yellow and the status is Stopped in Maintenance Mode before continuing with the recovery procedure.

Associating the correct floating IP address
If your grid is using floating IP addresses for public access, you may need to reassign the IP address previously associated with the failed grid node to the recovered grid node.

1. Log in to the OpenStack Dashboard.
2. Select Project > Compute > Instances.
3. In the IP Address column, verify that the Floating IPs value for the recovered node matches the value used by the failed node.
If the value is incorrect, you need to remove the incorrect value and assign the correct floating IP address:
a. Select Disassociate Floating IP from the drop-down list in the Actions column for the recovered node instance.
b. Click Disassociate Floating IP in the confirmation dialog box.
c. Select Associate Floating IP from the drop-down list in the Actions column for the recovered node instance.
d. Select the correct floating IP address to use from the IP Address drop-down list and click Associate.
Do not change the value of the Port to be associated drop-down list.

Activating the nonprimary Admin Node services
You need to activate the services on the nonprimary Admin Node you are recovering.

1. Log in to the primary Admin Node as root using the password listed in the Passwords.txt file.
2. Activate the primary Admin Node services:
activate-admin-services.sh

Starting StorageGRID Webscale software
You need to start the StorageGRID Webscale software on grid nodes by enabling services.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.
2. Add your private key identity to the SSH authentication agent:
ssh-add
3. Enter the SSH Access Password for the primary Admin Node listed in the Passwords.txt file.
4. Start the GDU:
gdu-console
5. Type the provisioning passphrase, press the Tab key to select OK, and press Enter.
6. Start the StorageGRID Webscale software on the server:
a. Ensure that the grid node you are configuring is selected and that the current state is Available.
b. Press the Tab key to move through the options to the Tasks list.
c. Use the down-arrow key to highlight Enable Services, and press the Spacebar to select it.
d. Press the Tab key to move through the options and select the Start Task action.
e. Press Enter to run the task.
Wait for the message Finished Postinstall start task to appear in the Log Messages panel.
Note: If you are completing this procedure on a primary Admin Node, do not select the Load Configuration option.
f. When the task completes, press the right-arrow key to move to the Quit action and press Enter.
7. End the SSH session:
ssh-add -D

Determine if hotfixes or maintenance releases must be applied
After you start the StorageGRID Webscale software, you need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system. If versions do not match, you must apply the required hotfix or maintenance release.

Before you begin
StorageGRID Webscale software must be started on the recovered grid node.

1. Sign in to the NMS MI.
2. Determine the current version of the StorageGRID Webscale software:
a. Select Grid Topology > grid node of same type > SSM > Services > Main.
b. Under Packages, note the storage-grid-release number.
3. Determine the version of the StorageGRID Webscale software of the recovered grid node:
a. Select Grid Topology > recovered grid node > SSM > Services > Main.
b. Under Packages, note the storage-grid-release number.
4. Compare the two versions, and if they differ, install the required hotfixes or maintenance releases to update the recovered grid node to the correct software version.
For more information about available hotfixes and maintenance releases, contact technical support.

Related tasks
Checking the StorageGRID Webscale version on page 5
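
The comparison in step 4 is an exact match of two storage-grid-release strings noted from the NMS MI. The following is a minimal sketch of recording that comparison so the result can be kept with the recovery notes. The function name check_release_match is hypothetical, and the release strings in the usage comment are placeholders, not real StorageGRID Webscale release numbers.

```shell
#!/bin/sh
# check_release_match is a hypothetical helper, not a StorageGRID Webscale command.
# Both arguments are storage-grid-release strings copied by hand from the NMS MI.
check_release_match() (
  reference=$1   # value noted from a grid node of the same type
  recovered=$2   # value noted from the recovered grid node
  if [ "$reference" = "$recovered" ]; then
    echo "match: $recovered"
  else
    echo "mismatch: recovered node reports $recovered, expected $reference"
    exit 1
  fi
)

# Example use with placeholder values:
#   check_release_match "reference-release" "recovered-release"
```

A nonzero exit status indicates that a hotfix or maintenance release is still needed on the recovered grid node.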

Reverting to the default boot mode
After you have completed the procedure to recover a grid node in maintenance mode, you need to reset the grid node to boot the operating system normally.

1. From the service laptop, log in to the recovered grid node as root using the password listed in the Passwords.txt file.
2. Run the script to override the maintenance mode boot option:
override_maintenance_mode.sh
3. Log out of the command shell:
exit

Finishing the StorageGRID Webscale deployment
After completing the deployment of the StorageGRID Webscale grid nodes, you must return to the StorageGRID Webscale Installer and finish the deployment.

1. When the installation of all of the selected grid nodes is complete, and you have verified that they have successfully joined the grid in the NMS MI, return to the StorageGRID Webscale Installer.
2. Click Cancel.
3. When asked to confirm the cancellation, click Yes.
You are returned to the Welcome page.
4. Close the browser hosting the StorageGRID Webscale Installer.

Restoring the audit log
If you were able to preserve the audit log from the failed Admin Node, so that historical audit log information is retained, you can copy it to the Admin Node you are recovering.

Before you begin
The Admin Node must be installed and running.
You must have copied the audit logs to another location after the original Admin Node failed.

About this task
If an Admin Node fails, audit logs saved to that Admin Node are potentially lost. It might be possible to preserve data from loss by copying audit logs from the failed Admin Node and then restoring these audit logs to the recovered Admin Node. Depending on the failure of the Admin Node, it might not be possible to copy audit logs from the failed Admin Node.

In a deployment with more than one Admin Node, audit logs are replicated to all Admin Nodes. Thus, if audit logs cannot be copied from the failed Admin Node, they can be recovered from another Admin Node in the system. In a deployment with only one Admin Node, where the audit log cannot be copied from the failed Admin Node, the recovered Admin Node starts recording events to the audit log as if the installation is new.

You must recover an Admin Node as soon as possible to restore logging functionality.

1. From the service laptop, log in as root to the grid node where you made a temporary copy of the audit log files, using the password listed in the Passwords.txt file.
2. Check which audit files have been preserved:
cd /var/local/tmp
ls -l
The following audit log files might be present in the temporary directory:
The renamed current log file (audit.log) from the failed grid node: YYYY-MM-DD.txt.1
Rotated audit log file from the day before the failure: YYYY-MM-DD.txt (or YYYY-MM-DD.txt.n if more than one is created in a day).
Older compressed and rotated audit log files: YYYY-MM-DD.txt.gz, preserving the original archive date in their name.
3. Copy the preserved audit log files to the recovered Admin Node:
scp -p YYYY* recovered_admin_ip:/var/local/audit/export
Where recovered_admin_ip is the IP address of the recovered Admin Node.
4. When prompted, enter the password of the recovered grid node, as listed in the Passwords.txt file.
5. Remove the preserved audit log files from their temporary location:
rm YYYY*
6. When prompted, confirm that you want to remove the file:
y
7. Log out of the server that contained the temporary location of the audit logs:
exit
8. Log in to the recovered grid node as root.
9. Update the user and group settings of the preserved audit log files:
cd /var/local/audit/export
chown ams-user:bycast *
10. Log out from the recovered grid node:
exit

After you finish
You will also need to restore any preexisting client access to the audit share. For more information, see the Administrator Guide.

Related tasks
Copying audit logs from the failed Admin Node on page 36
Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the recovered Admin Node. You want to preserve the audit log files if they are recoverable.
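
Before running the scp in step 3, it can help to confirm that the files in the temporary directory follow the three naming patterns listed in step 2. The following is a minimal sketch that labels a file name by pattern. The function name classify_audit_file and the labels it prints are hypothetical. Because the renamed current log and a same-day rotated log share the YYYY-MM-DD.txt.n form, the sketch reports both as "numbered" and does not try to tell them apart.

```shell
#!/bin/sh
# classify_audit_file is a hypothetical helper, not a StorageGRID Webscale command.
# It labels a file name according to the naming patterns listed in step 2.
classify_audit_file() {
  case $1 in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt.gz)
      echo compressed ;;   # older compressed and rotated log
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt)
      echo rotated ;;      # rotated log for one day
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].txt.[0-9]*)
      echo numbered ;;     # renamed audit.log or additional same-day rotation
    *)
      echo unknown ;;      # does not match any documented pattern
  esac
}

# Example use in the temporary directory:
#   cd /var/local/tmp
#   for f in *; do printf '%s\t%s\n' "$(classify_audit_file "$f")" "$f"; done
```

Any file reported as unknown would not be picked up by the YYYY* glob as a dated audit log and is worth inspecting by hand before the temporary copies are removed.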

Resetting the preferred sender
If the Admin Node you are recovering is currently set as the preferred sender of notifications and AutoSupport messages, you must reconfigure this setting in the NMS MI.

Before you begin
The Admin Node must be installed and running.

1. Sign in to the NMS MI of the recovered Admin Node using the Vendor account.
2. Go to Grid Management > NMS Management > General > Main.
3. Select the recovered Admin Node from the Preferred Sender drop-down list.
4. Click Apply Changes.

Restoring the NMS database
If you want to retain the historical information about attribute values and alarms on a failed Admin Node, you need to restore the Network Management System (NMS) database. This database can only be restored if your StorageGRID Webscale system includes more than one Admin Node.

Before you begin
The Admin Node must be installed and running.
The StorageGRID Webscale system must include one or more data center sites with an Admin Node.

About this task
If an Admin Node fails, the historical information about attribute values and alarms that is stored in the NMS database for that Admin Node is lost. In a StorageGRID Webscale system with more than one Admin Node, the NMS database is recovered by copying the NMS database from another Admin Node. In a system with only one Admin Node, the NMS database cannot be restored.

When you are recovering an Admin Node, the software installation process creates a new database for the NMS service on the recovered Admin Node. After it is started, the recovered Admin Node records attribute and audit information for all services as if you were performing a new installation of the StorageGRID Webscale system. In a StorageGRID Webscale system with more than one Admin Node, you can copy the NMS database from another Admin Node to the recovered Admin Node and restore historical information.

Note: To copy the NMS database, the StorageGRID Webscale system must be configured with multiple Admin Nodes.
