StorageGRID Webscale 10.1


StorageGRID Webscale 10.1 Maintenance Guide

NetApp, Inc.
495 East Java Drive
Sunnyvale, CA U.S.
Telephone: +1 (408)
Fax: +1 (408)
Support telephone: +1 (888)
Web:
Feedback:
Part number: _B0
April 2015


Table of Contents

Maintain your StorageGRID Webscale system
  Checking the StorageGRID Webscale version
  Required materials for server recovery
Recovery procedures
  Recovering from Storage Node failures
    Recovering from loss of storage volumes where the system drive is intact
    Recovering from loss of the system drive and possible loss of storage volumes
    Recovering a StorageGRID Webscale appliance Storage Node
  Recovering from Admin Node failures
    Recovering from primary Admin Node failures
    Recovering from nonprimary Admin Node failures
  Recovering from API Gateway Node failures
    Generating server activation media
    Installing SLES on virtual machines
    Installing VMware Tools
    Installing StorageGRID Webscale software
    Starting StorageGRID Webscale software
    Applying hotfixes and maintenance releases
  Recovering from Archive Node failures
    Creating and uploading the Tivoli Storage Manager ISO image file
    Generating server activation media
    Installing SLES on virtual machines
    Installing VMware Tools
    Installing the StorageGRID Webscale software
    Starting StorageGRID Webscale software
    Applying hotfixes and maintenance releases
    Setting the S3 middleware state to online
Decommission process for grid nodes
  When to decommission a grid node
  About Storage Node decommissioning
    Storage Node consolidation
    System expansion and decommissioning nodes
    Decommission multiple Storage Nodes
    ILM policy and storage configuration review
  Impact of decommissioning
    Impact of decommissioning on data security
    Impact of decommissioning on ILM policy
    Impact of decommissioning on other grid tasks
    Impact of decommissioning on system operations

  Prerequisites and preparations for decommissioning
  Decommissioning grid nodes
    Copying the provisioned grid specification file
    Updating the grid specification file
    Provisioning the StorageGRID Webscale system
    Preserving copies of the Grid Provisioning Tool repository
    Running the decommissioning grid task
    Troubleshooting the decommissioning grid task
Maintaining Archive Nodes that use a TSM server
  Fault with archival storage devices
  Taking the Archive Node middleware offline
  Tivoli Storage Manager administrative tools
  Object permanently unavailable
    Determining if objects are permanently unavailable
Troubleshooting
  Corrupt ISO message when using load_cds.py script
    Checking for a wrong USB device name
  Virtual machine is not configured for automatic restart
    Resetting the virtual machine
  Server with a failed volume fails to reboot
Glossary
Copyright information
Trademark information
How to send comments about documentation and receive update notifications
Index

Maintain your StorageGRID Webscale system

This guide contains procedures for recovering the various types of grid nodes that make up a StorageGRID Webscale system (Admin Nodes, Storage Nodes, API Gateway Nodes, and Archive Nodes). This guide also includes procedures that explain how to decommission Storage Nodes and API Gateway Nodes.

All maintenance activities require an understanding of the StorageGRID Webscale system as a whole. You should review system-specific documentation and the grid configuration web pages generated during provisioning (available in the /Doc directory of the SAID package) to ensure that you understand the StorageGRID Webscale system's topology. For more information about the workings of the StorageGRID Webscale system, see the Grid Primer and the Administrator Guide.

Ensure that you follow the instructions and warnings in this guide exactly. Maintenance procedures not detailed in this guide are not supported or require a services engagement to implement.

This guide includes procedures to recover a StorageGRID Webscale appliance Storage Node. For information about recovering a StorageGRID Webscale appliance, see the StorageGRID Webscale Appliance Maintenance Guide.

Related information
StorageGRID Webscale 10.1 Appliance Maintenance Guide
StorageGRID Webscale 10.1 Grid Primer
StorageGRID Webscale 10.1 Administrator Guide

Checking the StorageGRID Webscale version

Various maintenance procedures require the reinstallation of software. All services of the same type must be running the same version of the StorageGRID Webscale software, including any applied hotfixes and maintenance releases.

Before you begin
Before performing any maintenance procedure, you must know which version of the StorageGRID Webscale software is running on the grid node. You use this number as a reference to determine whether hotfixes or maintenance releases, or both, must be applied when performing maintenance procedures.
If you cannot get this information from the failed grid node, use another grid node of the same type in the system.

Step
1. In the NMS management interface (MI), go to Grid Topology > <grid node> > SSM > Services. Under Packages, the version listed for the storage grid release indicates the installed version.
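Since all services of the same type must run the same release, the versions you record from the NMS MI can be cross-checked mechanically. The sketch below is an illustration only: the node names and release numbers are hypothetical, not taken from a real system.

```shell
# Hypothetical "node version" pairs, as read from SSM > Services > Packages
# in the NMS MI for each node of one type. Values are illustrative only.
report='storage-01 10.1.0.1
storage-02 10.1.0.1
storage-03 10.1.0.0'

# Count distinct versions; more than one means some node needs a hotfix
# or maintenance release applied before it matches its peers.
distinct=$(printf '%s\n' "$report" | awk '{print $2}' | sort -u | wc -l)
if [ "$distinct" -gt 1 ]; then
  echo "version mismatch across nodes of this type"
else
  echo "all nodes at the same release"
fi
```

In this illustration the third node is one build behind, so the check reports a mismatch.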

Required materials for server recovery

Before performing maintenance procedures, you must ensure you have the necessary materials for server recovery.

StorageGRID Webscale Software ISO image
  Required if recovering the primary Admin Node. Used to customize the Linux operating system and install StorageGRID Webscale software for all grid node types. See the NetApp Support Site to acquire the latest version of StorageGRID Webscale software.

SUSE Linux Enterprise Server (SLES) ISO image
  Use only the supported version of SLES for the StorageGRID Webscale system. For supported versions, see the NetApp Interoperability Matrix Tool.
  Note: Use of any version of SLES other than the correct version will result in an installation failure.

Tivoli Storage Manager (TSM) Client packages
  Required if the deployment includes an Archive Node that uses Tivoli Storage Manager (TSM) middleware to write to archival media. For supported versions, see the NetApp Interoperability Matrix Tool.

Hotfix and Maintenance Release ISO images
  Required if recovering the Admin Node. Determine whether a hotfix or a maintenance release, or both, has been applied to the grid node. The recovered grid node must be updated to the same build as all other grid nodes of the same type. See the storage grid release number listed on the Grid Topology > grid node > SSM > Services > Main page. To acquire hotfixes and maintenance releases, contact technical support.

Provisioning Media
  Obtain a copy of the most recent provisioning media. This provisioning media is updated each time the system is modified. The provisioning media includes the SAID package. There may be more than one SAID package stored on the provisioning media; use the latest revision of the SAID package.
  Note: Provisioning media must contain only one grid specification file at the root level or provisioning will fail.

Provisioning Backup Media
  Ensure that it is the latest version, which includes all maintenance updates.

Provisioning passphrase

Passwords.txt file
  Included in the SAID package.

The Server Activation Media containing the grid node's activation file
  You can create new Server Activation Media using the latest version of the SAID package. The Server Activation Media includes the virtual machine template (.ovf) and boot image file (.flp) for each grid node. If you are recovering the primary Admin Node, ensure that the Server Activation Media contains the provisioning-autoinst-dn.xml file for the primary Admin Node (where dn is the device name of the server's system drive). You can find a copy in the /provisioning directory of the StorageGRID Webscale software ISO image.

VMware software and documentation
  For the current supported versions of VMware software, see the NetApp Interoperability Matrix Tool.

Service laptop
  The service laptop must have the following:
  - Microsoft Windows operating system
  - Network port
  - Supported browser. The following browsers have been tested with StorageGRID Webscale to verify compatibility:
    - Google Chrome 40
    - Microsoft Internet Explorer 11.0
    - Mozilla Firefox
  - Telnet and SSH client (for example, PuTTY)
  - SCP tool (for example, WinSCP) to transfer files to and from the primary Admin Node

Recovery procedures

When you recover a failed grid node, you must replace the failed server hardware with new hardware, reinstall the software, and ensure all recoverable data is intact. The grid node recovery procedures in this section describe how to recover a grid node of any type:

- Admin Nodes
- API Gateway Nodes
- Archive Nodes
- Storage Nodes, including one installed as a StorageGRID Webscale appliance

You must always recover a failed grid node as soon as possible. A failed grid node may reduce the redundancy of data in the StorageGRID Webscale system, leaving you vulnerable to the risk of permanent data loss in the event of another failure. Operating with failed grid nodes can have an impact on the efficiency of day-to-day operations, can increase recovery time (when queues develop that need to be cleared before recovery is complete), and can reduce your ability to monitor system operations.

All of the following conditions are assumed when recovering grid nodes:

- The hardware has failed and the drives from the old server cannot be used in a new server to recover the grid node.
- The failed hardware has been replaced and configured.
- The server to be recovered is configured to match the firmware version, BIOS settings, and storage configuration of the original server.
- If you are recovering a grid node other than the primary Admin Node, there is connectivity between the grid node being recovered and the primary Admin Node.

Recovering from Storage Node failures

You must always recover a failed Storage Node as soon as possible because objects are at increased risk of loss if another failure occurs.

Before you begin
If two or more Storage Nodes have failed at the same time, do not attempt this recovery procedure before contacting Support.
Because of the complexities involved when restoring objects to a failed Storage Node, if the data center site includes three or more Storage Nodes that have been offline for more than 15 days, you must contact technical support before attempting to recover failed Storage Nodes. Failure to do so may result in the unrecoverable loss of objects.

If the Storage Node is in read-only maintenance mode to allow for the retrieval of objects for another Storage Node with failed storage volumes, refer to the instructions for operating with failed storage volumes before performing the recovery procedure for the Storage Node.

About this task
If your StorageGRID Webscale system is configured to use an information lifecycle management (ILM) rule with only one content placement instruction, only a single copy of object data is made. If the single copy is created on a Storage Node that fails, the result is unrecoverable loss of data. ILM rules with only one content placement instruction are not recommended for this reason.

If you encounter a situation where the SUSE Linux Enterprise Server (SLES) operating system detects a missing disk during startup and fails to reboot, you must manually remove the entry for the missing disk from the /etc/fstab file and restart the server.

Choices

Recovering from loss of storage volumes where the system drive is intact on page 9
You need to complete a series of tasks in a specific order to recover a Storage Node where one or more storage volumes on the Storage Node are lost, but the system drive is intact. If only storage volumes have been lost, the Storage Node is still available in the NMS MI.

Recovering from loss of the system drive and possible loss of storage volumes on page 17
You need to complete a series of tasks in a specific order to recover from the loss of the system drive on a Storage Node. Once the system drive is recovered, you need to determine the status of the storage volumes and complete the recovery of the Storage Node. If the system drive is lost, the Storage Node is not available in the NMS MI.

Recovering a StorageGRID Webscale appliance Storage Node on page 29
Recovering a StorageGRID Webscale appliance Storage Node involves deploying the appliance Storage Node, rebuilding the Cassandra database, enabling services with the Grid Deployment Utility, and restoring object data to the Storage Node.

Related tasks
Server with a failed volume fails to reboot on page 97

Recovering from loss of storage volumes where the system drive is intact

You need to complete a series of tasks in a specific order to recover a Storage Node where one or more storage volumes on the Storage Node are lost, but the system drive is intact. If only storage volumes have been lost, the Storage Node is still available in the NMS MI.

Before you begin
You should not attempt this recovery procedure if two or more Storage Nodes have failed at the same time. If this occurs, contact technical support.

1.
Identifying and unmounting failed storage volumes on page 10
When recovering a Storage Node with failed storage volumes, you need to identify and unmount failed storage volumes to prepare the Storage Node for recovery.

2. Recovering failed storage volumes on page 11
You need to reformat and remount storage on failed storage volumes.

3. Restoring object data to a storage volume where the system drive is intact on page 14
After recovering a storage volume on a Storage Node where the system drive is intact, you can restore object data to the recovered storage volume from other grid nodes (Storage Nodes and Archive Nodes) in the system.

4. Checking the storage state on page 17
You need to verify that the desired state of the Storage Node is set to Online and ensure that the state will be online by default whenever the Storage Node server is restarted.
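The /etc/fstab repair mentioned at the start of this section (removing the missing disk's entry so that SLES can boot) can be sketched as follows. The device name and file contents below are illustrative only, and the edit is made to a throwaway copy rather than a real node's /etc/fstab:

```shell
# Sketch only: remove a missing disk's entry from a COPY of /etc/fstab.
# The device name (/dev/sdc) and the file contents are illustrative;
# on a real Storage Node you would edit /etc/fstab itself, then reboot.
fstab=$(mktemp)
cat > "$fstab" <<'EOF'
/dev/sda1 / ext3 defaults 1 1
/dev/disk/by-uuid/822b0547-3b2b-472e-ad5e-e1cf1809faba /var/local/rangedb/0 ext3 defaults 0 0
/dev/sdc1 /var/local/rangedb/1 ext3 defaults 0 0
EOF

# Delete every line whose device field starts with the missing disk:
sed -i '\|^/dev/sdc|d' "$fstab"
cat "$fstab"
```

After the edit, only the entries for devices the server can still see remain, so the boot-time mount no longer fails.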

Identifying and unmounting failed storage volumes

When recovering a Storage Node with failed storage volumes, you need to identify and unmount the failed storage volumes to prepare the Storage Node for recovery.

About this task
You should recover the failed storage volumes as soon as possible. You must identify failed storage volumes on the Storage Node to verify that only the failed storage volumes are reformatted as part of the recovery procedure, and identify the device names that correspond to the failed storage volumes.

Note: Identify failed storage volumes carefully. You will use this information to verify which volumes must be reformatted. Once a volume has been reformatted, data on the volume cannot be recovered.

At installation, each storage device is assigned a file system universally unique identifier (UUID) and is mounted to a rangedb directory on the Storage Node using that assigned file system UUID. The file system UUID and the rangedb directory are listed in the /etc/fstab file. The device name, rangedb directory, and the size of the mounted volume are displayed in the NMS MI at failed Storage Node > SSM > Resources > Overview > Main.

In the following example, device /dev/sdb has a volume size of 830 GB and is mounted to /var/local/rangedb/0, using the device name /dev/disk/by-uuid/822b0547-3b2b-472e-ad5e-e1cf1809faba in the /etc/fstab file.

Steps
1. Sign in to the NMS MI and complete the following steps to record the failed storage volumes and their device names:
a. In the NMS MI, go to Grid Topology > <failed Storage Node> > LDR > Storage > Overview > Main and look for object stores with alarms.
b. Go to Grid Topology > <failed Storage Node> > SSM > Resources > Overview > Main, look for alarms, and determine the mount point and volume size of each failed storage volume identified in substep a.

Each item in the Object Stores table found on the LDR > Storage > Overview > Main page corresponds to a mount point listed in the Volumes table on the SSM > Resources page. Object stores and mount points are numbered in hex notation, from 0000 to 000F. For example, the object store with an ID of 0000 corresponds to /var/local/rangedb/0 with device name sdb and a size of 830 GB.

Note: For the tenth object store and beyond, the ID changes to letters. For example, the tenth object store would have an ID of 000A and would correspond to /var/local/rangedb/a.

If you cannot determine the volume number and device name of failed storage volumes from the NMS MI, log in to an equivalent Storage Node and determine the mapping of volumes to device names on that server. Storage Nodes are usually added in pairs, with identical hardware and storage configurations. Examine the /etc/fstab file on the equivalent Storage Node to identify the device names that correspond to each storage volume. Identify and record the device name for each failed storage volume.

2. Unmount the failed storage volumes by completing the following steps on each volume:
a. From the service laptop, log in to the failed Storage Node as root using the password in the Passwords.txt file.
b. Ensure that the failed storage volume is unmounted:

umount /var/local/rangedb/object_store_id

The object_store_id is the ID of the failed storage volume. For example, enter 0 for the object store with ID 0000.

If you are recovering the first storage volume (disk 2), and you encounter a "device is busy" error, enter the following commands to stop the Cassandra database before unmounting the storage volume:

service cassandra stop
umount /var/local/rangedb/object_store_id

Recovering failed storage volumes

You need to reformat and remount storage on failed storage volumes.

Before you begin
- The system drives on the server must be intact or already restored.
- The cause of the failure must have been identified and, if necessary, the defective storage hardware must have been replaced.
- The total size of the replacement storage must be the same as the original.
- You must have access to the Passwords.txt file.

Note: If the server is rebooted with a failed volume, it might fail to reboot.

See Server with a failed volume fails to reboot on page 97.

About this task
If reformatting is required, you must indicate the rangedb directories that must be reformatted and indicate whether the proposed mounting points for the storage volumes to be reformatted are correct.

Steps
1. From the service laptop, log in to the failed Storage Node as root using the password in the Passwords.txt file.
2. Use a text editor (vi or vim) to delete failed volumes from the /etc/fstab file and then save the file.
Note: Commenting out a failed volume in the /etc/fstab file is insufficient. The volume must be deleted from fstab because the recovery process verifies that all lines in the fstab file match the mounted file systems.
3. Complete the following steps to reformat any failed storage volumes:
a. Enter the following command:
reformat_storage_block_devices.rb
b. When you are asked to Stop storage services [y/n]?, enter: y
c. When unallocated drives are detected and you are asked whether to use the volume as an LDR rangedb (Accept proposal [y/n]?), enter: y
d. For each rangedb drive on the Storage Node, when you are asked Reformat the rangedb drive <name>? [y/n]?, enter one of the following responses:
   y to reformat a drive with errors. This reformats the storage volume and adds the reformatted storage volume to the /etc/fstab file.
   n if the drive contains no errors.
e. Examine the mapping (mount point) between each device and the /rangedb directory. When asked Is the assignment of devices to rangedbs correct? [y/n]?, enter one of the following responses:
   y to confirm the mount point assignment
   n to change the mount point assignment

In the following example, the drive /dev/sdb must be reformatted:

servername:~ # reformat_storage_block_devices.rb
Storage services must be stopped before running this script.
Stop storage services [y/n]? y
Shutting down storage services.
Storage services stopped.
Unmounted /dev/sdb
Unmounted /dev/sdc
Unmounted /dev/sdd
Number of rangedb disks being replaced is 1
new rangedb disk=/dev/sdb
Unallocated drives detected. Is the following proposed action correct?
Use /dev/sdb (60.0GiB) as an LDR rangedb

Accept proposal [y/n]? y
Restoring any volume groups...
Determining drive partitioning...
WARNING: drive /dev/sdc exists
Reformat the rangedb drive /dev/sdb? [y/n]? y
Reformat the rangedb drive /dev/sdc? [y/n]? n
Reformat the rangedb drive /dev/sdd? [y/n]? n
done
Device: /dev/sdb Size: 64.4G Used: New/Existing: New UUID: rangedb: /var/local/rangedb/0 Reformat?: Yes
Device: /dev/sdc Size: 64.4G Used: 1% New/Existing: Existing UUID: 1aa52292-d a8dc-fef934d3caeb rangedb: /var/local/rangedb/1 Reformat?: No
Device: /dev/sdd Size: 64.4G Used: 1% New/Existing: Existing UUID: 1aa52295-d a8dc-fee934d3ceeb rangedb: /var/local/rangedb/2 Reformat?: No
Is the assignment of devices to rangedbs correct? [y/n]? y
Formatting the following: /dev/sdb
Mounted /dev/sdb
rangedb disks in /etc/fstab have been sorted out
Running: /usr/local/ldr/setup_rangedb.sh
Starting storage services.
Storage services started.
Reformatting done. Now do manual steps to restore copies of data.

f. Enter y if you are prompted to rebuild the Cassandra database.
Attention: Do not enter n unless directed by technical support. Rebuilding the Cassandra database means that the database is deleted from the Storage Node and rebuilt from other available Storage Nodes. This procedure should never be performed on multiple Storage Nodes concurrently as it may result in data loss.

In the following example, the Cassandra database has been down for more than 15 days and must be rebuilt:

Cassandra was down for more than 15 days.
Running: /usr/local/sbin/rebuild-cassandra-data
Enter 'y' to rebuild the Cassandra database for this Storage Node.
Rebuilding the Cassandra database may take as long or longer than 12 hours.
Once started, do not stop or pause this rebuild operation. If the rebuild
process is stopped or paused, you must rerun the operation.
[y/n]? y
Removing Cassandra commit logs
Removing Cassandra SSTables

Updating timestamps of the Cassandra data directories.
starting service cassandra
Running nodetool rebuild. Done.
Cassandra database successfully rebuilt.

Restoring object data to a storage volume where the system drive is intact

After recovering a storage volume on a Storage Node where the system drive is intact, you can restore object data to the recovered storage volume from other grid nodes (Storage Nodes and Archive Nodes) in the system.

Before you begin
- You must have acquired the Node ID of the Storage Node where the restored storage volumes reside. In the NMS MI, go to Storage Node > LDR > Overview > Main.
- You must have acquired the Volume ID of each restored storage volume. In the NMS MI, go to Storage Node > LDR > Storage > Main.
- You must have confirmed the condition of the Storage Node. The Storage Node must be displayed in the Grid Topology tree with a color of green and all services must have a state of Online.
- To recover erasure-coded object data, the storage pool of which the recovered Storage Node is a member must include enough green and online Storage Nodes to support the erasure coding scheme used to create the erasure-coded object data being recovered. For example, to recover erasure-coded object data that was created using a scheme of 6+3, at least six Storage Nodes that are members of the Erasure Coding profile's storage pool must be green and online.

About this task
The procedure to restore object data to a storage volume notifies the StorageGRID Webscale system that object data stored on the lost storage volume is no longer available, which prompts an ILM reevaluation to determine the correct placement of restored object data. If the only remaining copy of replicated object data is located on an Archive Node, object data is retrieved from the Archive Node.
Due to the latency associated with archival media such as tape, or the cloud-tiering service, restoring replicated object data to a Storage Node from an Archive Node takes longer than restoring copies from other Storage Nodes. Note that only replicated object data is archived to Archive Nodes; erasure-coded object data is not archived.

If the StorageGRID Webscale system's ILM policy is configured to use an ILM rule with only one active content placement instruction for replicated object data, copies are not made. That is, only a single instance of replicated object data is stored at any one time. If there is a failure to a storage volume that stores that one copy, the object data is lost and cannot be recovered; however, you must still perform the following procedure to purge lost object information from the database. For more information about ILM rules, see the Administrator Guide.

Steps
1. From the service laptop, log in to the failed Storage Node as root using the password in the Passwords.txt file.
Attention: You should use the ADE Console with caution; if the ADE Console is used improperly, it is possible to interrupt system operations and corrupt data. Enter commands carefully, and only use commands documented in this procedure.
2. Access the ADE console of the LDR service:
telnet localhost
3. Access the CMSI module:

cd /proc/cmsi

4. Begin restoring object data:
Volume_Lost <vol_id> <vol_id>
where vol_id is the volume ID of the reformatted volume, or a range of volumes in hex representation. There can be up to 16 volumes, numbered from 0000 to 000F; for example: Volume_Lost 0000 000F
Note: The second <vol_id> is optional. Unrecoverable replicated object data triggers the LOST (Lost Objects) alarm.

5. Determine the current status of the Volume_Lost recovery operation by doing one or more of the following:
- To determine the status of objects queued for retrieval from an Archive Node: in the NMS MI, go to the Archive Node > ARC > Retrieve > Overview > Main page and view the Objects Queued attribute.
- To determine the status of the ILM Evaluation (Volume Lost) grid task triggered by the Volume_Lost command: in the NMS MI, go to primary Admin Node > CMN > Grid Tasks > Overview > Main and view the percentage complete under Active. Wait for the grid task to move into the Historical table with a Status of Successful, which indicates a successful Storage Node recovery. Unavailable Storage Nodes may affect the progress of ILM Evaluation (Volume Lost) grid tasks depending on where the Storage Node is located.

6. When the Volume_Lost recovery operation completes, including the completion of the ILM Evaluation (Volume Lost) grid task, exit the ADE console of the LDR service:
exit

7. Access the ADE console of the DDS service:
telnet localhost

8. Access the ECGM module:
cd /proc/ecgm

9. Complete the restoration of object data:
volume_repair <node_id> <vol_id_lower> <vol_id_higher> [--inplace]
where:
- node_id is the node ID for the Storage Node's LDR service where the storage volumes are located.
- vol_id_lower is the volume ID for the recovered storage volume that is lowest in the range of recovered storage volumes.
- vol_id_higher is the volume ID for the recovered storage volume that is highest in the range of recovered storage volumes.
- --inplace is an optional parameter to place recovered erasure-coded copies back on the same recovered storage volume. If this parameter is not used, recovered erasure-coded copies are placed on other equivalent Storage Nodes that are members of the same storage pool.

A unique repair ID number is returned to identify this volume_repair operation. This repair ID number can be used to track the progress and results of the repair. No other feedback is returned.
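Because each volume_repair invocation covers a single contiguous range of volume IDs, recovering non-contiguous volumes takes one invocation per contiguous run. The following sketch only plans and prints the invocations; the node ID and volume numbers are hypothetical, and nothing is sent to the ADE console:

```shell
# Sketch: group recovered volume numbers (given in decimal) into
# contiguous runs and print one volume_repair command per run, using
# the uppercase hex volume IDs shown in this guide. The node ID
# (12345678) and the volume list are hypothetical examples.
plan_volume_repairs() {
  local node_id=$1; shift
  local start="" prev=""
  for v in "$@"; do
    if [ -z "$start" ]; then
      start=$v; prev=$v
    elif [ "$v" -eq "$((prev + 1))" ]; then
      prev=$v                      # still inside the current run
    else
      printf 'volume_repair %s %X %X\n' "$node_id" "$start" "$prev"
      start=$v; prev=$v            # begin a new run
    fi
  done
  printf 'volume_repair %s %X %X\n' "$node_id" "$start" "$prev"
}

plan_volume_repairs 12345678 3 8 15
# volume_repair 12345678 3 3
# volume_repair 12345678 8 8
# volume_repair 12345678 F F
```

For a contiguous set such as volumes 0 through 2, the sketch collapses them into a single invocation (volume_repair 12345678 0 2), matching the lower/upper range form of the command.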

If you are recovering a single storage volume, use its volume ID for both vol_id_lower and vol_id_higher. Volume IDs must be contiguous. If you are recovering object data on non-contiguous volumes (for example, volumes 3, 8, and 15), you must run volume_repair for each recovered volume. The StorageGRID Webscale system completes the processes of recovering object data, ensuring that ILM rules are met.

Note: You cannot run multiple volume_repair operations at the same time, nor can you cancel a volume_repair operation once it is started. Wait for the volume_repair operation to complete before continuing.

10. To determine the current status or result of the volume_repair recovery operation, enter:
repair_status <repair_id>
where repair_id is the identifier returned when the volume_repair command is run. To determine the repair_id of a volume repair, you can list all previously and currently running repairs. Enter:
repair_status all

In the following example, all object data is successfully recovered:

ade : /proc/ecgm > volume_repair E
ade : /proc/ecgm > Repair of Volume(s) 0 to E in node started. Repair ID:
ade : /proc/ecgm > repair_status
Repair ID :
Type : Storage Volume Repair
Node ID :
Lower volume ID : 0
Upper volume ID : E
Start time : T23:19:
End time : T23:19:
State : Success
Estimated bytes affected :
Bytes repaired :
Retry repair : No

If Retry repair is Yes, check the condition of the StorageGRID Webscale system, and confirm that all grid nodes are green in the NMS MI's Grid Topology tree with a state of Online. For erasure-coded object data, confirm that there are the minimum number of green and online Storage Nodes in the storage pool of which the recovered Storage Node is a member so that the erasure coding scheme in use is supported.
Resolve any issues with the system, including connectivity, and retry the repair:
repair_retry <repair_id>
where repair_id is the identifier returned when the volume_repair command is run.

Unrecoverable erasure-coded object data triggers the LOST (Lost Objects) and ECOR (Copies Lost) alarms. If State is Failure and Retry repair is No, erasure-coded object data is permanently lost.

11. Exit the ADE Console:
exit

12. Log out of the command shell:

exit

Checking the storage state

You need to verify that the desired state of the Storage Node is set to Online and ensure that the state will be online by default whenever the Storage Node server is restarted.

Before you begin
The Storage Node has been recovered, and data recovery is complete.

Steps
1. In the NMS MI, check the value of Grid Topology > recovered Storage Node > LDR > Storage > Storage State Desired and Storage State Current. The value of both attributes should be Online.
2. If the Storage State Desired is set to Read-only, complete the following steps:
a. Click the Configuration tab.
b. From the Storage State Desired drop-down list, select Online.
c. Click Apply Changes.
d. Click the Overview tab and confirm that the values of Storage State Desired and Storage State Current are updated to Online.

Recovering from loss of the system drive and possible loss of storage volumes

You need to complete a series of tasks in a specific order to recover from the loss of the system drive on a Storage Node. Once the system drive is recovered, you need to determine the status of the storage volumes and complete the recovery of the Storage Node. If the system drive is lost, the Storage Node is not available in the NMS MI.

Before you begin
You should not attempt this recovery procedure if two or more Storage Nodes have failed at the same time. If this occurs, contact technical support.

1. Generating server activation media on page 18
You need to generate the server activation media to create the necessary files to reinstall software on each failed virtual machine in the VMware vSphere Client.

2. Installing SLES on the failed virtual machine that will serve as host on page 19
You need to install the SUSE Linux Enterprise Server (SLES) operating system on the failed virtual machine that will host a StorageGRID Webscale grid node.

3.
Installing VMware Tools on page 20 You must install VMware Tools on each virtual machine that will host a grid node for enhanced performance and improved management of the virtual machine. 4. Remounting and identifying failed storage volumes on page 21 You need to use the Grid Deployment Utility (GDU) to check the attached storage. The GDU looks for storage volumes attached to the server, attempts to mount them, and then checks to see if they are structured like StorageGRID Webscale object stores. Any storage volume that cannot be mounted or does not pass this check is assumed to be failed. 5. Installing StorageGRID Webscale software and recovering failed storage volumes on page 23

   You need to use the Grid Deployment Utility (GDU) to install StorageGRID Webscale software on each grid node.
6. Rebuilding the Cassandra database on page 24
   You need to run the check-cassandra-rebuild script to determine if you need to rebuild the Cassandra database, and then rebuild it if required.
7. Applying hotfixes and maintenance releases on page 25
   You need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system.
8. Restoring object data to a storage volume where the system drive also failed on page 26
   After recovering a storage volume on a Storage Node where the system drive also failed and was recovered, and after rebuilding the Cassandra database, you can restore object data to the recovered storage volumes from other grid nodes (Storage Nodes and Archive Nodes) in the system.
9. Checking the storage state on page 28
   You need to verify that the desired state of the Storage Node is set to Online and ensure that the state will be Online by default whenever the Storage Node server is restarted.

Generating server activation media

You need to generate the server activation media to create the necessary files to reinstall software on each failed virtual machine in the VMware vSphere Client.

Before you begin

A service laptop must be available.
An SCP tool, such as WinSCP, must be installed on the service laptop; you can use it to transfer files to and from the primary Admin Node.

About this task

You need to regenerate the server activation media to create the necessary files, including the virtual machine template (.ovf) and boot image file (.flp), for each grid node. To restore a grid node, you only need to use the boot image file (.flp) to reinstall SUSE Linux Enterprise Server on the grid node virtual machine.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.
2. Create a directory and output the generated files to that directory:

   mkdir /var/local/deploy-files
   generate-grid-deployment-files.rb -o /var/local/deploy-files

3. Enter the provisioning passphrase at the prompt.
4. Transfer the contents of the /var/local/deploy-files directory to the service laptop using WinSCP, or a similar tool.
   The generated files include a virtual machine template file (.ovf) and a boot image file (.flp) for each grid node listed in your grid specification file. Only the .flp file is required to recover and configure failed grid nodes. For example, if you are recovering a Storage Node named DC1-S1, the generated file you need to use to install the operating system and recover the grid node is named DC1-S1.flp.
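The split between the two generated file types can be sketched with a short shell filter. DC1-S1 is the node name from the example above; DC1-ADM1 is a hypothetical second node name used only for illustration:

```shell
# Separate the boot images (.flp), which are needed for recovery, from the
# .ovf templates, which are not needed when recovering an existing grid node.
# DC1-ADM1 is a hypothetical node name; DC1-S1 comes from the example above.
generated_files="DC1-S1.ovf
DC1-S1.flp
DC1-ADM1.ovf
DC1-ADM1.flp"
printf '%s\n' "$generated_files" | grep '\.flp$'
```

After the transfer, only the files this filter selects are used in the SLES reinstallation steps that follow.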

Installing SLES on the failed virtual machine that will serve as host

You need to install the SUSE Linux Enterprise Server (SLES) operating system on the failed virtual machine that will host a StorageGRID Webscale grid node.

Before you begin

The cause of the system drive failure must have been identified.
The hard disk for the failed system drive must already have been replaced or repaired, and the size of the replacement hard drive must be the same as the original.
You must have verified that the virtual machine for the failed grid node is powered on.
The SLES ISO image for the StorageGRID Webscale release must be in an accessible VMware datastore. For supported versions, see the NetApp Interoperability Matrix Tool.
Locate the boot image file (.flp) for each grid node you need to complete the installation for. The boot image files (.flp) for the additional grid nodes are stored in the /deploy-files subdirectory to which you transferred the provisioning data on the service laptop using an SCP tool such as WinSCP.

About this task

The installation process completely erases the server drives and installs the SLES operating system, applications, and support files customized for StorageGRID Webscale.

1. Open the VMware vSphere Client and log in.
2. In the navigation tree, select the virtual machine.
3. To connect the SLES ISO image from the VMware datastore to the virtual machine, click the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and then select CD/DVD Drive n > Connect to ISO image on a datastore.
4. In the Browse Datastores dialog box, locate and select the SLES ISO image, and click OK.
5. Click the Connect/Disconnect the floppy devices of the virtual machine icon, and then select Floppy Drive 1 > Connect to Floppy Image on local disk.
6. In the Open dialog box, locate and select the boot image file (.flp) that contains the activation file for the server you are recovering, and click Open.
7. Click the Console tab.
8. Click anywhere inside the Console pane.
   Your cursor disappears and you are locked into the Console pane.
9. Press Ctrl-Alt-Insert to restart the virtual machine.
   The server performs the following steps while rebooting: the BIOS runs a hardware verification. By default, the system boots from the ISO image connected to the CD/DVD drive, and loads the SLES boot screen in the VMware vSphere Client Console pane. If the system does not

boot from the CD/DVD drive by default, press F4 to change the boot order, and then reboot the virtual machine.
10. When the SLES boot screen is displayed, press the down arrow key to select the Installation option (do not press Enter).
   Note: You must move the cursor to the Installation option within eight seconds. If you do not, SLES automatically attempts to install from the hard drive and the installation process fails. If this happens, you must restart the virtual machine and select the Installation option within the required time.
11. Press Tab, and at the Boot Options prompt, enter the following command:

   autoyast=device://fd0/autoinst.xml

   Note: You must always specify the autoyast information at the Boot Options prompt. If you do not enter this information, AutoYaST does not complete the required custom installation for the server. If you enter an incorrect value and are prompted to reenter the device name and path, verify the floppy device name. The device name is fd zero (fd0).
   When the SLES installation is complete, the server completes its configuration and starts the operating system. Installation is complete when the login prompt appears.
12. Disconnect the SLES ISO image by clicking the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and selecting CD/DVD Drive n > Disconnect from datastore image.

Installing VMware Tools

You must install VMware Tools on each virtual machine that will host a grid node for enhanced performance and improved management of the virtual machine.

About this task

The required version is made available to virtual machines through the VMware vSphere Client.

1. Open the VMware vSphere Client and log in.
2. In the navigation tree, select the virtual machine on which you want to complete the installation, and then power on the virtual machine if it is not already started.
3. Click the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and select CD/DVD Drive n > Connect to ISO image on a datastore.
4. In the Browse Datastores dialog box, navigate to the /vmimages/tools-isoimages subdirectory, select linux.iso, and click Open.
5. Click the Console tab.
6. Click anywhere in the Console pane and log in as root using the password listed in the Passwords.txt file.
7. Mount the VMware Tools installer:

   mount /cdrom

8. Copy the gzip installation package to a temporary directory on the virtual machine and unpack it:

   mkdir /tmp/vmtools
   cd /tmp/vmtools
   tar -zxvf /cdrom/VMwareTools-*.tar.gz

9. Install VMware Tools with the default installation options:

   cd /tmp/vmtools/vmware-tools-distrib
   ./vmware-install.pl --default

10. When the installation is complete, verify that VMware Tools is running:

   /etc/init.d/vmware-tools status

   If the installation was successful, you see the following status message: vmtoolsd is running.
11. Remove the installation files from the virtual machine:

   cd /tmp
   rm -rf vmtools

12. Reboot the system to ensure that the changes take effect:

   reboot

Remounting and identifying failed storage volumes

You need to use the Grid Deployment Utility (GDU) to check the attached storage. The GDU looks for storage volumes attached to the server, attempts to mount them, and then checks whether they are structured like StorageGRID Webscale object stores. Any storage volume that cannot be mounted or does not pass this check is assumed to have failed.

Before you begin

Hardware replacement steps must have been completed for identified failed storage volumes. There might be additional failed storage volumes that cannot be identified at this stage.
The StorageGRID Webscale software must not yet be installed.

About this task

Note: Any storage volume that cannot be mounted or does not pass this check is assumed to be corrupt, and is reformatted when StorageGRID Webscale software is reinstalled. All data on these storage volumes is lost during software installation.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.
2. Add your private key identity to the SSH authentication agent:

   ssh-add

3. Enter the SSH Access Password listed in the Passwords.txt file.
4. Start the GDU:

   gdu-console

5. Type the provisioning passphrase, press Tab to select OK, and press Enter.

6. In the Servers panel, select the Storage Node you are recovering and confirm that its status is Available.
7. In the Tasks panel, select Remount Storage, and then in the Actions panel, select Start Task and press Enter.
   The GDU checks for preserved storage volumes and remounts them.

   Log Messages
   Starting server task-handler
   Starting next task "Server info"
   Starting "Server info" task
   Refreshing server state, please wait
   Finished "Server info" task
   Finished task "Server info"
   Starting next task "Remount storage volumes"
   Starting "Remount storage volumes" task
   Attempting to remount /dev/sdd
   Found node id in volid file
   Device /dev/sdd remounted successfully
   Finished "Remount storage volumes" task
   Finished task "Remount storage volumes"

8. Review the list of devices in the Log Messages panel that have been remounted successfully.
   Use the information you gathered when you identified the failed storage volumes to identify the device names that correspond to each storage volume. You can also monitor the contents of the GDU log file /var/local/log/gdu-console.log using the tail -f command. Devices that are found and mounted by the GDU are preserved when StorageGRID Webscale software is installed.
9. If you determine that all of the storage volumes have failed, complete the following steps to delete any existing data from the drives:
   a. In the Servers panel, select the Storage Node you are recovering and confirm that its status is Available.
   b. In the Tasks panel, select Install Software.
   c. In the Actions panel, select Start Task and press Enter.
10. Review the list of devices that the GDU could not mount or that were not formatted like rangedbs (object stores).
   Note: This is the list of devices that you need to repair or replace before you continue with the recovery. These devices will be reformatted by the GDU when you install StorageGRID Webscale software, and all data on these devices will be lost. Record the volume ID (/var/local/rangedb/number) and device name of each failed storage volume that you have identified.
   If volumes that you believe to be good are not mounted by the GDU, quit the GDU and investigate. After you correct the issue that prevented the GDU from mounting the devices, restart the GDU and select Remount Storage again.
   In the following example, /dev/sdc was not mounted successfully by the GDU.

   Log Messages
   Starting server task-handler
   Starting next task "Server info" on DC2-S1
   Starting "Server info" task
   Refreshing DC2-S1 server state, delay 0 seconds, please wait
   Finished "Server info" task
   Finished task "Server info" on DC2-S1

   Starting next task "Remount storage volumes" on DC2-S1
   Starting "Remount storage volumes" task
   Attempting to remount /dev/sdb
   Found node id in volid file
   Device /dev/sdb remounted successfully
   Failed to mount device /dev/sdc
   Attempting to remount /dev/sdd
   Found node id in volid file
   Device /dev/sdd remounted successfully
   Finished "Remount storage volumes" task
   Finished task "Remount storage volumes" on DC2-S1

11. Confirm that the mounted devices (LUNs) are associated with this Storage Node.
   a. Find the node ID of the LDR service, either through the NMS MI on the LDR > Overview page, or in the index.html file found in the /Doc directory of the SAID package.
   b. A message "Found node id node_id in volid file" is displayed for each mounted device. Check that the node_id for each device is the same as the node ID of the LDR service that you are restoring.
12. If the node ID for any of the volumes is different from the node ID of the LDR service, contact technical support.
13. If the GDU finds and mounts devices that you believe to be bad, proceed with the installation of StorageGRID Webscale software. After software installation is complete, you must manually reformat and recover these bad volumes.

Installing StorageGRID Webscale software and recovering failed storage volumes

You need to use the Grid Deployment Utility (GDU) to install StorageGRID Webscale software on each grid node.

About this task

The GDU is a console application used to install software, enable services, and execute other tasks on individual grid nodes. The GDU is accessed through PuTTY, or an equivalent terminal application.

Note: Do not start StorageGRID Webscale software by selecting Enable Services in the GDU after installing software. Starting StorageGRID Webscale software before all required recovery steps are completed can lead to unrecoverable data loss and undefined system behavior.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.
2. Add your private key identity to the SSH authentication agent:

   ssh-add

3. Enter the SSH Access Password listed in the Passwords.txt file.
4. Start the GDU:

   gdu-console

5. Type the provisioning passphrase, press the Tab key to select OK, and then press Enter.
6. Select the grid node you are installing software on and ensure that its status is Available.

7. Press the Tab key, and then the down arrow key, to highlight Install Software, and press the Spacebar to select it.
8. Press the Tab key to move through the options and select Start Task.
9. Press Enter to run the task.
   The status of the selected grid node changes to Busy.
10. When the task completes, press the Tab key to move through the options to the Tasks list.
   The status of the selected grid node returns to Available. Software installation is complete when the message "StorageGRID activation completed" appears in the Log Messages panel. Installation times vary depending on the grid node. For Storage Nodes, if the provisioning hardware profile has not specified object store names, installation detects any unallocated drives and formats the disks.
   Note: Do not start StorageGRID Webscale software by selecting Enable Services. Starting system software immediately after installation can lead to unrecoverable data loss in some situations.

Rebuilding the Cassandra database

You need to run the check-cassandra-rebuild script to determine if you need to rebuild the Cassandra database, and then rebuild it if required.

Before you begin

The system drives on the server must already have been restored.
The cause of the storage volume failure must have been identified, and the defective storage hardware must have been replaced.
All the replaced storage drives must have been formatted as rangedbs by the GDU. For more information, see Installing StorageGRID Webscale software and recovering failed storage volumes on page 23.
The total size of the replacement storage must be the same as the original.
You must have the Passwords.txt file.

1. From the service laptop, log in to the Storage Node as root using the password listed in the Passwords.txt file.
2. Determine if the Cassandra database must be rebuilt, and then rebuild it if required:
   a. Check the database state:

      check-cassandra-rebuild

   b. If asked "Stop storage services [y/n]?", enter: y
   c. If you are prompted to rebuild the Cassandra database, enter: y
      Attention: Do not enter n unless directed by technical support. Rebuilding the Cassandra database means that the database is deleted from the Storage Node and rebuilt from other available Storage Nodes. This procedure should never be performed on multiple Storage Nodes concurrently, because it can result in data loss.

   In the following example, the Cassandra database has been down for more than 15 days and must be rebuilt:

   Cassandra was down for more than 15 days.
   Running: /usr/local/sbin/rebuild-cassandra-data
   Enter 'y' to rebuild the Cassandra database for this Storage Node.
   Rebuilding the Cassandra database may take 12 hours or longer. Once started,
   do not stop or pause this rebuild operation. If the rebuild process is
   stopped or paused, you must rerun the operation.
   [y/n]? y
   Removing Cassandra commit logs
   Removing Cassandra SSTables
   Updating timestamps of the Cassandra data directories.
   starting service cassandra
   Running nodetool rebuild.
   Done.
   Cassandra database successfully rebuilt.

   If you are not prompted to rebuild the Cassandra database, continue to the next recovery task.

Applying hotfixes and maintenance releases

You need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system.

Before you begin

StorageGRID Webscale software must be started on the recovered grid node.

About this task

After StorageGRID Webscale software is started, confirm that the recovered grid node is running the same version of the StorageGRID Webscale software as other grid nodes of the same type. If it is not, apply any hotfixes or maintenance releases needed to update the recovered grid node to the same version as the rest of the grid nodes of the same type.

1. Sign in to the NMS MI.
2. Determine the current version of the StorageGRID Webscale software:
   a. Go to grid node of same type > SSM > Services > Main.
   b. Under Packages, note the storage-grid-release number, or refer to the documented storage-grid-release number for the grid node.
3. Determine the version of the StorageGRID Webscale software on the recovered grid node:
   a. Go to recovered grid node > SSM > Services > Main.
   b. Under Packages, note the storage-grid-release number.
4. Compare the two versions and, if they differ, apply hotfixes or maintenance releases as necessary to update the recovered grid node to the correct software version.
   For more information about available hotfixes and maintenance releases, contact technical support.

Related tasks
Checking the StorageGRID Webscale version on page 5
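The comparison in step 4 can be scripted once the two storage-grid-release strings have been read from the NMS MI. The release strings below are hypothetical placeholders, not actual release numbers:

```shell
# Hypothetical release strings copied from SSM > Services > Main for each node.
reference_version="10.1.0-20150401.1234.abcdef0"   # healthy node of the same type (placeholder)
recovered_version="10.1.0-20150301.5678.1234567"   # recovered node (placeholder)

if [ "$reference_version" = "$recovered_version" ]; then
  echo "Versions match; no hotfix required."
else
  echo "Version mismatch: update the recovered node to $reference_version."
fi
```

An exact string comparison is sufficient here because all grid nodes of the same type must run an identical storage-grid-release build, not merely a compatible one.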

Restoring object data to a storage volume where the system drive also failed

After recovering a storage volume on a Storage Node where the system drive also failed and was recovered, and after rebuilding the Cassandra database, you can restore object data to the recovered storage volumes from other grid nodes (Storage Nodes and Archive Nodes) in the system.

Before you begin

You must have already acquired the node ID of the Storage Node where the restored storage volumes reside. In the NMS MI, go to Grid Topology > Storage Node > LDR > Overview > Main.
You must have confirmed the condition of the Storage Node. The Storage Node must be displayed in the Grid Topology tree with a color of green, and all services must have a state of Online.
If you want to recover erasure-coded object data, the storage pool of which the recovered Storage Node is a member must include enough "green" and online Storage Nodes to support the erasure coding scheme used to create the erasure-coded object data being recovered. For example, if you are recovering erasure-coded object data that was created using a scheme of 6+3, at least six Storage Nodes that are members of the Erasure Coding profile's storage pool must be green and online.

About this task

The procedure to restore object data to storage volumes notifies the StorageGRID Webscale system that object data stored on the lost storage volumes is no longer available, which prompts an ILM reevaluation to determine the correct placement of restored object data. If the only remaining copy of replicated object data is located on an Archive Node, object data is retrieved from the Archive Node. Due to the latency associated with archival media such as tape, or with the cloud-tiering service, restoring replicated object data to a Storage Node from an Archive Node takes longer than restoring copies from other Storage Nodes. Note that only replicated object data is archived to Archive Nodes; erasure-coded object data is not archived.

If the StorageGRID Webscale system's ILM policy is configured to use an ILM rule with only one active content placement instruction, copies of an object are not made; only a single instance of a replicated object is stored at any one time. If there is a failure, all such objects are lost and cannot be recovered; however, you must still perform the following procedure to purge lost object information from the database. For more information about ILM rules, see the Administrator Guide.

1. From the service laptop, log in to the failed Storage Node as root using the password listed in the Passwords.txt file.
   Attention: Use the ADE console with caution. If the console is used improperly, it is possible to interrupt system operations and corrupt data. Enter commands carefully, and use only the commands documented in this procedure.
2. Access the ADE console of the LDR service:

   telnet localhost

3. Access the CMSI module:

   cd /proc/cmsi

4. Begin restoring object data:

   Volume_Lost <vol_id> <vol_id>

   vol_id is the volume ID of the reformatted volume, or a range of volumes in hex representation. There can be up to 16 volumes, numbered from 0000 to 000F, such as Volume_Lost 0000 000F.
   Note: The second <vol_id> is optional. For a StorageGRID Webscale appliance Storage Node, you must reformat all storage volumes (0000 to 000F).
   As object data is restored, if the StorageGRID Webscale system cannot locate replicated object data, the LOST (Lost Objects) alarm is triggered. Alarms may be triggered on Storage Nodes throughout the system. Action should be taken to determine the cause of the loss and whether recovery is possible. For more information, see the Troubleshooting Guide.
5. To determine the current status of the Volume_Lost recovery operation, do one or more of the following:
   To determine the status of objects queued for retrieval from an Archive Node: in the NMS MI, go to the Archive Node > ARC > Retrieve > Overview > Main page, and view the Objects Queued attribute.
   To determine the status of the ILM Evaluation (Volume Lost) grid task triggered by the Volume_Lost command: in the NMS MI, go to primary Admin Node > CMN > Grid Tasks > Overview > Main and view the percentage complete under Active. Wait for the grid task to move into the Historical table with a Status of Successful, which indicates a successful Storage Node recovery. Unavailable Storage Nodes may affect the progress of ILM Evaluation (Volume Lost) grid tasks, depending on where the Storage Node is located.
6. When the Volume_Lost recovery procedure finishes, including the completion of the ILM Evaluation (Volume Lost) grid task, exit the ADE console of the LDR service:

   exit

7. Access the ADE console of the DDS service:

   telnet localhost

8. Access the ECGM module:

   cd /proc/ecgm

9. Complete the restoration of object data:

   node_repair <node_id> [--inplace]

   node_id is the node ID for the recovered Storage Node's LDR service.
   --inplace is an optional parameter to place recovered erasure-coded copies back on the same recovered storage volume. If this parameter is not used, recovered erasure-coded copies are placed on other equivalent Storage Nodes that are members of the same storage pool.
   The StorageGRID Webscale system completes the process of recovering object data, ensuring that ILM rules are met. A unique <repair ID> number is returned to identify this node_repair operation. This <repair ID> number can be used to track the progress and results of the repair. No other feedback is returned.
   Note: You cannot run multiple node_repair operations at the same time.
10. Determine the current status or result of the node_repair recovery operation:

   repair_status <repair_id>

   where repair_id is the identifier returned when the node_repair command is run.
   To determine the repair_id of a repair, you can list all previously and currently running repairs:

   repair_status all

   In the following example, all object data is successfully recovered:

   ade : /proc/ecgm > node_repair
   ade : /proc/ecgm > Repair of node started. Repair ID
   ade : /proc/ecgm > repair_status
   Repair ID :
   Type : Storage Node Repair
   Node ID :
   Start time : T23:28:
   End time : T23:28:
   State : Success
   Estimated bytes affected :
   Bytes repaired :
   Retry repair : No

   If Retry repair is Yes, check the condition of the StorageGRID Webscale system, and confirm that all grid nodes are "green" in the NMS MI's Grid Topology tree with a state of Online. For erasure-coded object data, confirm that the storage pool of which the recovered Storage Node is a member contains the minimum number of "green" and online Storage Nodes needed to support the erasure coding scheme in use. Resolve any issues with the system, including connectivity, and retry the repair by entering the following command:

   repair_retry <repair_id>

   repair_id is the identifier returned when the node_repair command is run. Unrecoverable erasure-coded object data triggers the LOST (Lost Objects) and ECOR (Copies Lost) alarms. If State is Failure and Retry repair is No, erasure-coded object data is permanently lost.
11. Exit the ADE console:

   exit

12. Log out of the command shell:

   exit

Related information
StorageGRID Webscale 10.1 Administrator Guide
StorageGRID Webscale 10.1 Troubleshooting Guide

Checking the storage state

You need to verify that the desired state of the Storage Node is set to Online and ensure that the state will be Online by default whenever the Storage Node server is restarted.

Before you begin

The Storage Node has been recovered, and data recovery is complete.

1. In the NMS MI, check the values of Grid Topology > recovered Storage Node > LDR > Storage > Storage State Desired and Storage State Current.
   The value of both attributes should be Online.
2. If Storage State Desired is set to Read-only, complete the following steps:
   a. Click the Configuration tab.
   b. From the Storage State Desired drop-down list, select Online.
   c. Click Apply Changes.
   d. Click the Overview tab and confirm that the values of Storage State Desired and Storage State Current are updated to Online.

Recovering a StorageGRID Webscale appliance Storage Node

Recovering a StorageGRID Webscale appliance Storage Node involves deploying the appliance Storage Node, rebuilding the Cassandra database, enabling services with the Grid Deployment Utility, and restoring object data to the Storage Node.

About this task

Attention: You must not attempt this recovery procedure if two or more Storage Nodes have failed at the same time. Contact technical support.

Each StorageGRID Webscale appliance is represented as one Storage Node in the Network Management System (NMS) Management Interface (MI).

1. Preparing the StorageGRID Webscale appliance Storage Node on page 30
   When recovering a StorageGRID Webscale appliance Storage Node, you must first prepare the grid node before reinstalling StorageGRID Webscale software.
2. Deploying the StorageGRID Webscale Installer on page 30
   You must deploy the StorageGRID Webscale Installer on a virtual machine in VMware vSphere.
3. Finishing the initial setup on page 32
   When recovering a StorageGRID Webscale appliance Storage Node, to prepare for the reinstallation of software, you must start the StorageGRID Webscale Installer and then upload software.
4. Connecting to the appliance configuration web page on page 32
   To start the appliance software installation, you connect to the appliance configuration web page. Using this page enables you to configure the management network, configure the data network connection, enter the StorageGRID Webscale Installer IP address, and monitor the installation progress.
5. Configuring the data network connections on page 33
   Using the appliance installer, you enter the IP address of the data network. Additionally, you enter the subnet mask for the network and at least a default gateway. Entering the IP address, subnet mask, and gateway enables you to configure the data network connection.
6. Setting the StorageGRID Webscale Installer IP address on page 34
   You can use the appliance installer web page to set the IP address of the StorageGRID Webscale software installer. Setting this enables installer connectivity.
7. Installing StorageGRID Webscale software on the appliance on page 35

   You install the StorageGRID Webscale software and SLES by using the appliance installation web page. You can also use the web page to monitor the installation. Installing this software enables you to monitor the appliance information in the StorageGRID Webscale system.
8. Finishing the StorageGRID Webscale appliance deployment on page 38
   After completing the installation of the StorageGRID Webscale appliance Storage Node, you must return to the StorageGRID Webscale Installer and finish the deployment.
9. Rebuilding the Cassandra database on page 38
   You need to run the check-cassandra-rebuild script to determine if you need to rebuild the Cassandra database, and then rebuild it if required.
10. Restoring object data to a storage volume where the system drive also failed on page 39
   After recovering a storage volume on a Storage Node where the system drive also failed and was recovered, and after rebuilding the Cassandra database, you can restore object data to the recovered storage volumes from other grid nodes (Storage Nodes and Archive Nodes) in the system.

Preparing the StorageGRID Webscale appliance Storage Node

When recovering a StorageGRID Webscale appliance Storage Node, you must first prepare the grid node before reinstalling StorageGRID Webscale software.

1. From the service laptop, log in to the failed Storage Node as root using the password listed in the Passwords.txt file.
2. Prepare the StorageGRID Webscale appliance Storage Node for the installation of StorageGRID Webscale software:

   sgareinstall

3. When asked to continue [y/n]?, enter: y
   The StorageGRID Webscale appliance Storage Node is reset, and data on the Storage Node is no longer accessible. IP addresses configured during the original installation process should remain intact; however, it is recommended that you confirm this when the procedure completes.
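The recommendation in step 3 to confirm the IP addresses reduces to a simple comparison. The addresses below are placeholders, and how you obtain the node's current address (for example, by inspecting the interface configuration on the node after sgareinstall completes) depends on your environment:

```shell
# Placeholders: the address recorded before the failure vs. the address seen now.
recorded_ip="192.168.170.12"   # from your pre-failure records (assumption)
current_ip="192.168.170.12"    # read from the node after sgareinstall completes

if [ "$recorded_ip" = "$current_ip" ]; then
  echo "IP address unchanged"
else
  echo "IP address changed - correct the network configuration before continuing"
fi
```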
Deploying the StorageGRID Webscale Installer You must deploy the StorageGRID Webscale Installer on a virtual machine in VMware vSphere. Before you begin VMware software must be installed and correctly configured. You must have the StorageGRID Webscale Installer OVA file. You must have network configuration information for the StorageGRID Webscale Installer (IP address, network mask, and default gateway). The StorageGRID Webscale Installer must be on the same network as the StorageGRID Webscale system's grid nodes as defined in the grid specification file created in Grid Designer. About this task You must deploy the StorageGRID Webscale Installer on the same network as, or a network that is accessible to, the grid nodes being deployed for the StorageGRID Webscale system. An additional, unique IP address is required for the installer, one that is separate from the IP addresses assigned to grid nodes in the grid specification file.

Recovery procedures After it is deployed on a virtual machine, the StorageGRID Webscale Installer is accessed through a web browser and is used to deploy all the grid nodes for the StorageGRID Webscale system. The IP address is configured during this deployment procedure. Before starting this procedure, confirm that you have the network configuration information (IP address, network mask, and default gateway) for the StorageGRID Webscale Installer. 1. Open the VMware vSphere Client and log in. 2. Select File > Deploy OVF Template. 3. In the Deploy OVF Template wizard, follow the on-screen prompts and update the default settings, as required: a. On the Source page, click Browse and select the StorageGRID Webscale Installer OVA file. b. On the Network Mapping page, click the drop-down list in the Destination Networks column and select the network configured for StorageGRID Webscale. c. On the Properties page, enter the network configuration information to use for the StorageGRID Webscale Installer. You must specify the IP address, network mask, and default gateway. These settings are used to enable access to the installer user interface through a web browser. d. Click Finish. 4. If the virtual machine to which the StorageGRID Webscale Installer is deployed did not power on automatically, right-click it and select Power > Power On. The StorageGRID Webscale Installer must be powered on after it is successfully deployed.
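For repeat recoveries, the same OVA deployment can be scripted with VMware's ovftool CLI instead of the vSphere Client wizard. This is a hedged sketch, not part of the StorageGRID documentation: the host, datastore, network, and file names below are all placeholder assumptions, and any OVF properties (such as the installer's IP settings) must match the property names actually defined in the OVA.

```shell
# Hypothetical ovftool invocation for deploying the installer OVA.
# Every value here is a placeholder; adjust it to your environment.
OVA="StorageGRID-Webscale-Installer.ova"                       # assumed file name
TARGET="vi://administrator@vcenter.example.com/DC1/host/esx01" # assumed vCenter path

CMD="ovftool --acceptAllEulas --powerOn --name=sgws-installer \
  --datastore=datastore1 --network='Grid Network' $OVA $TARGET"

echo "$CMD"
# Execute only where ovftool is actually installed:
if command -v ovftool >/dev/null 2>&1; then
    eval "$CMD"
fi
```

The wizard steps above remain the documented path; a scripted deployment is only worthwhile when the same installer must be redeployed repeatedly.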

The StorageGRID Webscale Installer takes a minute or two to start up, and then it can be accessed through your web browser using the IP address you specified. Finishing the initial setup When recovering a StorageGRID Webscale appliance Storage Node, to prepare for the reinstallation of software, you must start the StorageGRID Webscale Installer and then upload software. 1. In a supported web browser, navigate to the StorageGRID Webscale Installer using the IP address configured when deploying the StorageGRID Webscale Installer. 2. Click Modify an existing StorageGRID Webscale System. 3. Click Upload for NetApp StorageGRID Webscale Software, locate and select the .iso image file, and click Open. 4. Click Upload for the Grid Specification file, locate and select the grid specification file for your installation, and click Open. 5. Switch to VMware vSphere Client and select the StorageGRID Webscale Installer virtual machine. The default name is NetApp SGI. 6. Connect the VMware Tools ISO image from the VMware datastore to the virtual machine by clicking the Connect/Disconnect the CD/DVD devices of the virtual machine icon and selecting CD/DVD Drive n > Connect to ISO image on a datastore. 7. In the Browse Datastores dialog box, navigate to the /vmimages/tools-isoimages subdirectory, select linux.iso, and click OK. 8. Connect the SUSE Linux Enterprise Server ISO image from the VMware datastore to the virtual machine by clicking the Connect/Disconnect the CD/DVD devices of the virtual machine icon and selecting CD/DVD Drive n > Connect to ISO image on a datastore. 9. In the Browse Datastores dialog box, locate and select the SUSE Linux Enterprise Server .iso image file, and click OK. 10. Return to the StorageGRID Webscale Installer and click Verify Availability for Novell SuSE Linux Enterprise Server and then for VMware Tools to verify that these resources are available.
It may take several minutes to validate the SUSE Linux Enterprise Server installation ISO image. 11. When all files are successfully loaded and the availability of resources confirmed, click Next. The Deploy Grid Nodes page appears. Do not click Next again. 12. Leave the Deploy Grid Nodes page and continue the recovery process by connecting to the appliance configuration page. After you finish After finishing the installation of StorageGRID Webscale software, return to the StorageGRID Webscale Installer, click Cancel and close the browser window. Finishing the StorageGRID Webscale appliance deployment on page 38. Connecting to the appliance configuration web page To start the appliance software installation, you connect to the appliance configuration web page. Using this page enables you to configure the management network, configure the data network

connection, enter the StorageGRID Webscale installer IP address, and monitor the installation progress. Step 1. Open a browser and enter the E5600SG controller Management Port 1 IP address that was provisioned during the physical installation: The StorageGRID Webscale Appliance Installer web page appears: When you are first installing the appliance, the status for each of the procedures on the web page indicates that the procedure is not complete. Configuring the data network connections Using the appliance installer, you enter the IP address of the data network. Additionally, you enter the subnet mask for the network and at least a default gateway. Entering the IP address, subnet mask, and gateway enables you to configure the data network connection. Before you begin You must already have the IP address of the data network. 1. From the StorageGRID Webscale Appliance Installer web page, click Configure StorageGRID data network connection:

2. To edit the data network IP address, in the StorageGRID data network connection section, click Update IP/netmask. The button name changes to Save Changes and a pop-up appears. 3. Enter the IP address of the data network and click Save Changes. Route information based on the specified IP displays. 4. In the pop-up, click OK. 5. If needed, edit the route and click Save route. 6. Click Home to return to the main Appliance Installer web page. Setting the StorageGRID Webscale Installer IP address You can use the appliance installer web page to set the IP address of the StorageGRID Webscale software installer. Setting this enables installer connectivity. Before you begin You must know the IP address of the StorageGRID Webscale installer. 1. From the StorageGRID Webscale Appliance Installer web page, next to the Set StorageGRID Webscale Installer IP option, click Update:

2. On the next page, enter the IP address of the StorageGRID Webscale Installer. 3. Click Save changes. The Appliance Installer Home page indicates that the installer IP connection is set. Installing StorageGRID Webscale software on the appliance You install the StorageGRID Webscale software and SLES by using the appliance installation web page. You can also use the web page to monitor the installation. Installing this software enables you to monitor the appliance information in the StorageGRID Webscale system. Before you begin You must have already configured the management and data networks and entered the StorageGRID Webscale Installer IP address. About this task When you install the software, the web interface initiates the following operations: Establishes a connection to the storage array. Checks the operational status of all drives. Creates a primary disk pool. Calculates volume sizes. Creates volumes. Creates the LUN mappings. Renames the array. Creates a configuration file. Rescans SCSI ports and reloads devices. 1. From the StorageGRID Webscale Appliance Installer web page, next to the Set StorageGRID Webscale Installer IP option, click Begin StorageGRID node install <node_name>:

A list of install operations appears. You can review the installation progress as each operation status changes from "Not started" to "Completed." The status refreshes every five seconds. 2. To monitor progress, return to the StorageGRID Webscale installation web page. The Deploy Grid Nodes section shows the installation progress for the appliance Storage Node. 3. Review the appliance web page. The following occurs: When the appliance list of operations shows "Install SLES" as in progress, a thumbnail image of the SLES installation appears next to the list of operations:

The StorageGRID Webscale Installer Deploy Grid Nodes web page status bar changes to blue, indicating a job in progress, and then to green, indicating successful completion. The web page displays the last 10 lines of the installation log, which updates every five seconds. After the last operation, the last line, "Boot into StorageGRID," indicates a status of initiating the reboot. The appliance installation has completed:

If you encounter any installation issues, see information about troubleshooting the installation. 4. If you are installing the appliance for the first time, repeat this process for any additional appliances. If you are performing maintenance procedures, skip this step. After you finish Continue by completing the grid node deployment. Finishing the StorageGRID Webscale appliance deployment After completing the installation of the StorageGRID Webscale appliance Storage Node, you must return to the StorageGRID Webscale Installer and finish the deployment. 1. When the installation of the StorageGRID Webscale appliance Storage Node is finished, return to the StorageGRID Webscale Installer. 2. Click Cancel. 3. When asked to confirm the cancellation, click Yes. You are returned to the Welcome page. 4. Close the browser hosting the StorageGRID Webscale Installer. Rebuilding the Cassandra database You need to run the check-cassandra-rebuild script to determine if you need to rebuild the Cassandra database, and then rebuild it if required. Before you begin The system drives on the server must already have been restored. The cause of the storage volume failure has been identified and the defective storage hardware has been replaced. All the replaced storage drives have been formatted as rangedbs by the GDU. For more information, see Installing StorageGRID Webscale software and recovering failed storage volumes on page 23. The total size of the replacement storage must be the same as the original. You must have the Passwords.txt file. 1. From the service laptop, log in to the Storage Node as root using the password listed in the Passwords.txt file. 2. Determine if the Cassandra database must be rebuilt, and then rebuild it if the answer is yes: a. Check the database state: check-cassandra-rebuild b. If asked to Stop storage services [y/n]?, enter y. c. 
If you are prompted to rebuild the Cassandra database, enter: y Attention: You should not enter n unless directed by technical support.
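The rebuild decision is driven by how long Cassandra has been down, with 15 days as the threshold. A rough, hypothetical sketch of such an age check follows; the dates are illustrative, and this is not the actual logic of check-cassandra-rebuild:

```shell
# Illustrative only: compute how many whole days Cassandra has been down
# and compare against the 15-day rebuild threshold. The timestamps are
# hypothetical; check-cassandra-rebuild performs its own internal checks.
days_down() {
    local last_up=$1 now=$2          # both as epoch seconds
    echo $(( (now - last_up) / 86400 ))
}

LIMIT=15
last_up=$(date -u -d '2015-03-01' +%s)   # assumed: when Cassandra last ran
now=$(date -u -d '2015-04-01' +%s)

if [ "$(days_down "$last_up" "$now")" -gt "$LIMIT" ]; then
    echo "Cassandra was down for more than $LIMIT days."
fi
```

In practice you never run this yourself; check-cassandra-rebuild makes the determination and prompts you.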

Rebuilding the Cassandra database means that the database is deleted from the Storage Node and rebuilt from other available Storage Nodes. This procedure should never be performed on multiple Storage Nodes concurrently as it may result in data loss. In the following example, the Cassandra database has been down for more than 15 days and must be rebuilt:

Cassandra was down for more than 15 days.
Running: /usr/local/sbin/rebuild-cassandra-data
Enter 'y' to rebuild the Cassandra database for this Storage Node.
Rebuilding the Cassandra database may take as long or longer than 12 hours.
Once started, do not stop or pause this rebuild operation. If the rebuild
process is stopped or paused, you must rerun the operation.
[y/n]? y
Removing Cassandra commit logs
Removing Cassandra SSTables
Updating timestamps of the Cassandra data directories.
starting service cassandra
Running nodetool rebuild.
Done. Cassandra database successfully rebuilt.

If you are not prompted to rebuild the Cassandra database, continue to the next recovery task. Restoring object data to a storage volume where the system drive also failed After recovering a storage volume on a Storage Node where the system drive also failed and was recovered, and after rebuilding the Cassandra database, you can restore object data to the recovered storage volumes from other grid nodes (Storage Nodes and Archive Nodes) in the system. Before you begin You must have already acquired the Node ID of the Storage Node where restored storage volumes reside. In the NMS MI, go to Grid Topology > Storage Node > LDR > Overview > Main. You must have confirmed the condition of the Storage Node. The Storage Node must be displayed in the Grid Topology tree with a color of green and all services must have a state of Online. 
If you want to recover erasure coded object data, the storage pool of which the recovered Storage Node is a member must include enough "green" and online Storage Nodes to support the Erasure Coding scheme used to create the erasure-coded object data being recovered. For example, if you are recovering erasure coded object data that was created using a scheme of 6+3, at least six Storage Nodes that are members of the Erasure Coding profile's storage pool must be green and online. About this task The procedure to restore object data to storage volumes notifies the StorageGRID Webscale system that object data stored on the lost storage volumes is no longer available, which prompts an ILM reevaluation to determine the correct placement of restored object data. If the only remaining copy of replicated object data is located on an Archive Node, object data is retrieved from the Archive Node. Due to the latency associated with archival media such as tape, or the cloud-tiering service, restoring replicated object data to a Storage Node from an Archive Node takes longer than restoring copies from other Storage Nodes. Note that only replicated object data is archived to Archive Nodes; erasure coded object data is not archived. If the StorageGRID Webscale system's ILM policy is configured to use an ILM rule with only one active content placement instruction, copies of an object are not made; only a single instance of a replicated object is stored at any one time. If there is a failure, all such objects are lost and cannot be

recovered; however, you must still perform the following procedure to purge lost object information from the database. For more information about ILM rules, see the Administrator Guide. 1. From the service laptop, log in to the failed Storage Node as root using the password listed in the Passwords.txt file. Attention: You should use the ADE Console with caution. If the console is used improperly, it is possible to interrupt system operations and corrupt data. Enter commands carefully, and only use the commands documented in this procedure. 2. Access the ADE console of the LDR service: telnet localhost 1402 3. Access the CMSI module: cd /proc/cmsi 4. Begin restoring object data: Volume_Lost <vol_id> <vol_id> vol_id is the volume ID of the reformatted volume, or a range of volumes in hex representation. There can be up to 16 volumes, numbered from 0000 to 000F, such as Volume_Lost 0000 000F Note: The second <vol_id> is optional. For a StorageGRID Webscale appliance Storage Node, you must reformat all storage volumes (0000 to 000F). As object data is restored, if the StorageGRID Webscale system cannot locate replicated object data, the LOST (Lost Objects) alarm is triggered. Alarms may be triggered on Storage Nodes throughout the system. Action should be taken to determine the cause of the loss and whether recovery is possible. For more information, see the Troubleshooting Guide. 5. To determine the current status of the Volume_Lost recovery operation, do one or more of the following: To determine the status of objects queued for retrieval from an Archive Node, in the NMS MI, go to the Archive Node > ARC > Retrieve > Overview > Main page, and view the Objects Queued attribute. To determine the status of the ILM Evaluation (Volume Lost) grid task triggered by the Volume_Lost command, in the NMS MI, go to primary Admin Node > CMN > Grid Tasks > Overview > Main and view the percentage complete under Active. 
Wait for the grid task to move into the Historical table with a Status of Successful, which indicates a successful Storage Node recovery. Unavailable Storage Nodes may affect the progress of ILM Evaluation (Volume Lost) grid tasks depending on where the Storage Node is located. 6. When the Volume_Lost recovery procedure finishes, including the completion of the ILM Evaluation (Volume Lost) grid task, exit the ADE console of the LDR service: exit 7. Access the ADE console of the DDS service: telnet localhost 1411

8. Access the ECGM module: cd /proc/ecgm 9. Complete the restoration of object data: node_repair <node_id> [--inplace] node_id is the node ID for the recovered Storage Node's LDR service. --inplace is an optional parameter to place recovered erasure coded copies back on the same recovered storage volume. If this parameter is not used, recovered erasure coded copies are placed on other equivalent Storage Nodes that are members of the same storage pool. The StorageGRID Webscale system completes the processes of recovering object data, ensuring that ILM rules are met. A unique <repair ID> number is returned to identify this node_repair operation. This <repair ID> number can be used to track the progress and results of the repair. No other feedback is returned. Note: You cannot run multiple node_repair operations at the same time. 10. Determine the current status or result of the node_repair recovery operation: repair_status <repair_id> where repair_id is the identifier returned when the node_repair command is run. To determine the repair_id of a repair, you can list all previously and currently running repairs: repair_status all In the following example, all object data is successfully recovered:

ade : /proc/ecgm > node_repair <node_id>
Repair of node <node_id> started. Repair ID <repair_id>
ade : /proc/ecgm > repair_status <repair_id>
Repair ID : <repair_id>
Type : Storage Node Repair
Node ID : <node_id>
Start time : <start_timestamp>
End time : <end_timestamp>
State : Success
Estimated bytes affected : <bytes>
Bytes repaired : <bytes>
Retry repair : No

If Retry repair is Yes, check the condition of the StorageGRID Webscale system, and confirm that all grid nodes are "green" in the NMS MI's Grid Topology tree with a state of Online. For erasure-coded object data, confirm that there are the minimum number of "green" and online Storage Nodes in the storage pool of which the recovered Storage Node is a member, so that the erasure coding scheme in use is supported. 
Resolve any issues with the system, including connectivity, and retry the repair by entering the following command: repair_retry <repair_id> repair_id is the identifier returned when the node_repair command is run. Unrecoverable erasure-coded object data triggers the LOST (Lost Objects) and ECOR (Copies Lost) alarms.

If State is Failure and Retry repair is No, erasure coded object data is permanently lost. 11. Exit the ADE Console: exit 12. Log out of the command shell: exit Related information StorageGRID Webscale 10.1 Administrator Guide StorageGRID Webscale 10.1 Troubleshooting Guide Recovering from Admin Node failures The recovery process for an Admin Node depends on whether it is the primary Admin Node or a nonprimary Admin Node at a secondary data center site. Choices Recovering from primary Admin Node failures on page 42 You need to complete a series of tasks in a specific order to recover from a primary Admin Node failure. Recovering from nonprimary Admin Node failures on page 58 You need to complete a specific set of tasks to recover from an Admin Node failure at any data center site other than the site where the primary Admin Node is located. Recovering from primary Admin Node failures You need to complete a series of tasks in a specific order to recover from a primary Admin Node failure. About this task You must repair or replace a failed primary Admin Node promptly to avoid affecting the StorageGRID Webscale system's ability to ingest objects. The primary Admin Node hosts the Configuration Management Node (CMN) service, which is responsible for issuing blocks of object identifiers to CMS services. As each object is ingested, a unique identifier from this block is assigned to the object. Each CMS service has a minimum of 16 million object identifiers available for use if the CMN service becomes unavailable and stops issuing object identifiers to CMS services. When all CMS services exhaust their supply of object identifiers, objects can no longer be ingested. 1. Copying audit logs from the failed Admin Node on page 43 Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the replacement Admin Node. 
You want to preserve the audit log files if they are recoverable. 2. Deploying the primary Admin Node virtual machine on page 44 If your primary Admin Node fails, you must redeploy the primary Admin Node virtual machine template in VMware vsphere. 3. Installing SLES on the primary Admin Node on page 45 You need to install SUSE Linux Enterprise Server (SLES) on the virtual machine that will host the primary Admin Node.

4. Installing VMware Tools on page 46 You need to install VMware Tools on each virtual machine that will host a grid node for enhanced performance and improved management of the virtual machine. 5. Loading the software distribution ISO images on page 47 You need to load the StorageGRID Webscale software distribution onto the primary Admin Node. 6. Installing provisioning software on page 49 You must install the provisioning software that is used to generate the collection of files for installing and configuring the customized StorageGRID Webscale system you defined in your grid specification file. 7. Restoring the GPT Repository on page 50 You need to restore the Grid Provisioning Tool (GPT) repository from the most recent backup so that you can recreate the provisioning media and SAID package required to recover the primary Admin Node. 8. Updating the primary Admin Node configuration files on page 51 You must run a script that copies the required configuration files to the correct locations after you reprovision your StorageGRID Webscale system. 9. Installing StorageGRID Webscale software on page 52 You must install the StorageGRID Webscale software on each virtual machine to create the type of grid node you are recovering. 10. Starting StorageGRID Webscale software on page 52 You need to start the StorageGRID Webscale software on grid nodes by enabling services. 11. Disabling the Load Configuration option on page 53 When you are recovering the primary Admin Node, you should disable the Load Configuration option to prevent it from being selected later in error. 12. Applying hotfixes and maintenance releases on page 54 You need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system. 13. 
Importing the updated grid specification file on page 54 You must import the bundle that includes the updated grid specification file into the StorageGRID Webscale system after you have started the recovered primary Admin Node. 14. Restoring the audit log on page 55 If you were able to preserve the audit log from the failed Admin Node, you can copy it to the Admin Node you are recovering. 15. Resetting the preferred sender on page 56 If the Admin Node you are recovering is currently set as the preferred sender of notifications and AutoSupport messages, you must reconfigure this setting in the NMS MI. 16. Restoring the NMS database on page 57 If you want to retain the historical information about attribute values and alarms on a failed Admin Node, you need to restore the Network Management System (NMS) database. This database can only be restored if your StorageGRID Webscale system includes more than one Admin Node. Copying audit logs from the failed Admin Node Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the replacement Admin Node. You want to preserve the audit log files if they are recoverable. About this task This procedure copies the audit log files from the failed Admin Node to a temporary location. These preserved audit logs can then be copied to the replacement Admin Node. Audit logs are not automatically copied to the new Admin Node.
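The audit-log preservation flow in this task can be sketched as a short script. The audit directory matches the path used in this guide, but the remote destination and the helper function are assumptions for illustration:

```shell
# Sketch of preserving audit logs from a failed Admin Node (illustrative).
# AUDIT_DIR matches the path in this guide; REMOTE is a placeholder host.

audit_archive_name() {
    # Build a unique dated name, e.g. 2015-04-28.txt.1, from a YYYY-MM-DD date.
    echo "$1.txt.1"
}

AUDIT_DIR=/var/local/audit/export          # audit log location on the Admin Node
REMOTE="admin@192.0.2.10:/var/local/tmp"   # hypothetical temporary destination

new_name=$(audit_archive_name "$(date +%F)")
echo "Would rename audit.log to $new_name and copy $AUDIT_DIR/* to $REMOTE"
# On the failed Admin Node itself you would then run (not executed here):
#   /etc/init.d/ams stop
#   cd /var/local/audit/export && mv audit.log "$new_name"
#   scp -p /var/local/audit/export/* "$REMOTE"
```

The scp -p flag preserves the original file timestamps, which keeps the audit record chronology intact on the temporary server.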

Depending on the failure of the Admin Node, it might not be possible to copy audit logs from the failed Admin Node. In a deployment with more than one Admin Node, audit logs are replicated to all Admin Nodes. Thus, in a multi Admin Node deployment, if audit logs cannot be copied off of the failed Admin Node, audit logs can be recovered from another Admin Node in the system. In a deployment with only one Admin Node, where the audit log cannot be copied off of the failed Admin Node, the recovered Admin Node starts recording events to the audit log as if the installation is new. 1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file. 2. If it is running, stop the AMS service from creating a new log file: /etc/init.d/ams stop 3. Rename the audit.log file so that it does not overwrite the existing file when you copy it to the replacement Admin Node. Rename audit.log to a unique numbered file name such as YYYY-MM-DD.txt.1: cd /var/local/audit/export ls -l mv audit.log YYYY-MM-DD.txt.1 4. Copy all audit log files to a temporary location on any other server: scp -p * IP_address:/var/local/tmp When prompted, enter the password of the remote server listed in the Passwords.txt file. 5. Log out of the command shell: exit Deploying the primary Admin Node virtual machine If your primary Admin Node fails, you must redeploy the primary Admin Node virtual machine template in VMware vSphere. Before you begin The service laptop must be configured with VMware vSphere Client. The PRIMARY-ADMIN-TEMPLATE.ovf file must be available on the service laptop. 1. Open VMware vSphere Client on the service laptop and log in. 2. Select File > Deploy OVF Template. 3. On the Source page, click Browse. 4. In the Open dialog box, locate and select the PRIMARY-ADMIN-TEMPLATE.ovf file, and click Open. 5. 
Click Next. 6. On the OVF Template Details page, click Next. 7. On the End User License Agreement page, read the StorageGRID Webscale License Agreement, click Accept, and then click Next.

8. On the Name and Location page, enter the name for the primary Admin Node virtual machine, select the Inventory Location, and click Next. 9. On the Host/Cluster page, select the host or cluster on which you want to run the deployed template, and click Next. 10. On the Resource Pool page, select the resource pool to deploy the template into, and click Next. 11. On the Storage page, select the storage hardware where the virtual machine files will be stored, and click Next. 12. On the Disk Format page, click Next. 13. On the Network Mapping page, verify the network chosen for the virtual machine and then click Next when the correct network is selected. You can select a different network by clicking the entry in the Destination Networks column, and selecting the correct network from the drop-down list. 14. On the Ready to Complete page, review the deployment settings, select the Power on after deployment check box (if available), and click Finish. When the virtual machine is created, a Deployment Completed Successfully dialog box is displayed. 15. Click Close. Installing SLES on the primary Admin Node You need to install SUSE Linux Enterprise Server (SLES) on the virtual machine that will host the primary Admin Node. Before you begin The virtual machine for the primary Admin Node must be powered on. The boot image file for the primary Admin Node (sg_boot.flp) must be available on the service laptop. This file must be copied from the /provisioning subdirectory on the StorageGRID Webscale ISO image. The SLES installation ISO image must be available in a VMware datastore that can be accessed while installing your StorageGRID Webscale system. About this task The installation process completely erases the virtual machine drives and installs the SLES operating system (OS), applications, and support files customized for StorageGRID Webscale. 1. Open VMware vSphere Client and log in. 2. 
In the navigation tree, select the virtual machine and power it on if it is not started. 3. Connect the SLES ISO image from the VMware datastore to the virtual machine by clicking the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and then selecting CD/DVD Drive n > Connect to ISO image on a datastore. 4. In the Browse Datastores dialog box, locate and select the SLES.iso file, and click OK.

5. Click the Connect/Disconnect the floppy devices of the virtual machine icon and then select Floppy Drive 1 > Connect to Floppy Image on local disk. 6. In the Open dialog box, locate and select the boot image file (.flp) that contains the activation file for this server and click Open. 7. Click the Console tab. 8. Click anywhere inside the Console pane. Your cursor disappears and you are locked into the Console pane. 9. Press Ctrl-Alt-Insert to reboot the virtual machine. The server performs the following steps while rebooting: The BIOS runs a hardware verification. By default, the system boots from the ISO image connected to the CD/DVD drive, and loads the SLES boot screen in the VMware vSphere Client Console pane. If the system does not boot from the CD/DVD drive by default, press F4 and change the boot order and reboot the virtual machine. 10. When the SLES Boot Screen is displayed, press the down arrow key to select the Installation option (do not press Enter). Note: You must move the cursor to the Installation option within eight seconds. If you do not, SLES will automatically attempt to install from the hard drive and the installation process will fail. If this happens, you must restart the virtual machine and select the Installation option within the required time. 11. Press Tab to move to the Boot Options text box and enter the following: autoyast=device://fd0/autoinst.xml Note: You must specify the AutoYaST information correctly in the Boot Options text box. If you do not enter this information, AutoYaST does not complete the required custom installation for the server. If you enter an incorrect value and are prompted to reenter the device name and path, verify the floppy device name. The device name is fd zero (fd0). 12. Press Enter. After the SLES installation is complete, the server completes its configuration and starts the operating system. Installation is complete when the login prompt appears. 
You can log in to the server as the root user by pressing Enter at the password prompt to specify a blank password. Installing VMware Tools You need to install VMware Tools on each virtual machine that will host a grid node for enhanced performance and improved management of the virtual machine. About this task The required version must be made available to virtual machines through the VMware vSphere Client. 1. Open and log in to the VMware vSphere Client.

Recovery procedures

2. In the navigation tree, select the virtual machine on which you want to complete the installation, and then power on the virtual machine if it is not started.

3. Click the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and select CD/DVD Drive n > Connect to ISO image on a datastore.

4. In the Browse Datastores dialog box, navigate to the /vmimages/tools-isoimages subdirectory, select linux.iso, and click Open.

5. Click the Console tab.

6. Click anywhere in the Console pane and log in as the root user.

7. When prompted for the password, press Enter to specify a blank password.

8. Mount the VMware Tools installer:
mount /dev/sr0 /cdrom

9. Copy the gzip installation package to a temporary directory on the virtual machine and unpack it:
mkdir /tmp/vmtools
cd /tmp/vmtools
tar -zxvf /cdrom/VMwareTools-*.tar.gz

10. Install VMware Tools with the default installation options:
cd /tmp/vmtools/vmware-tools-distrib/
./vmware-install.pl --default

11. When the installation is complete, verify that VMware Tools is running:
/etc/init.d/vmware-tools status
If the installation is successful, you see the following status message: vmtoolsd is running.

12. Remove the installation files from the virtual machine:
cd /tmp
rm -rf vmtools

13. Reboot the system to ensure that the changes take effect:
reboot

Loading the software distribution ISO images

You need to load the StorageGRID Webscale software distribution onto the primary Admin Node.

Before you begin
- The primary Admin Node must have SUSE Linux Enterprise Server and VMware Tools installed.
- The service laptop must be configured with the VMware vSphere Client.
- The StorageGRID Webscale ISO image must be available on the service laptop.
- If your grid specification includes an Archive Node, you must have downloaded the Tivoli Storage Manager Backup-Archive Client.

You can use the downloaded .tar file to create an ISO image, and copy the ISO to the service laptop. For supported versions, see the NetApp Interoperability Matrix Tool.

About this task
If an Archive Node is included in your grid specification, you also need to load the Tivoli Storage Manager Backup-Archive Client software distribution onto the primary Admin Node.

1. Open and log in to the VMware vSphere Client.

2. Select the Admin Node virtual machine in the navigation tree and power on the virtual machine if it is not currently running.

3. Verify that no ISO images are connected to the CD/DVD drives on the virtual machine, and disconnect any that you find, by completing the following steps:
a. Click the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and select CD/DVD Drive 1. If a Disconnect from ISO image path/filename menu item is displayed, select it to disconnect the ISO image.
b. Click the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and select CD/DVD Drive 2. If a Disconnect from ISO image path/filename menu item is displayed, select it to disconnect the ISO image.

4. Click the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and select CD/DVD Drive 1 > Connect to ISO image on local disk.

5. In the Open dialog box, locate and select the StorageGRID Webscale ISO image and click Open.

6. Click the Console tab.

7. Click anywhere in the Console pane.

8. Log in as the root user and press Enter at the password prompt to specify a blank password.

9. Mount the ISO image and install the load_cds.py script:
mount /dev/sr0 /cdrom
/cdrom/install-load-cds

10. Copy the StorageGRID Webscale ISO onto the Admin Node virtual machine:
load_cds.py
Wait while the ISO image is written to the correct directory.

11. When the ISO image has been copied, the following prompt is displayed: Would you like to read another CD? [y/n]

If you are not configuring an Archive Node, enter n.

If you are configuring an Archive Node, do the following:
a. Press Ctrl+Alt to exit the Console pane.
b. Click the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and select CD/DVD Drive 1 > Disconnect from ISO_path_and_name. Click Yes to disconnect the device, and then click OK in the Confirmation dialog box.
c. Select CD/DVD Drive 1 > Connect to ISO image on local disk.
d. In the Open dialog box, locate and select the TSM Backup-Archive Client ISO image and click Open.
e. Click anywhere in the Console pane.
f. Enter y. Wait while the ISO image is written to the correct directory.

12. To exit, enter n when prompted.

13. Press Ctrl+Alt to exit the Console pane.

Installing provisioning software

You must install the provisioning software that is used to generate the collection of files for installing and configuring the customized StorageGRID Webscale system you defined in your grid specification file.

Before you begin
The StorageGRID Webscale software distribution must be loaded onto the primary Admin Node.

1. From the service laptop, log in to the primary Admin Node as root. When prompted for a password, press Enter to specify a blank password.

2. Mount the StorageGRID Webscale software ISO image:
mount -o loop,ro /var/local/install/<StorageGRID_Webscale_iso> /cdrom
<StorageGRID_Webscale_iso> is the name of the StorageGRID Webscale software .iso file.

3. Load the provisioning software:
/cdrom/load-provisioning-software

4. When prompted, read and accept the StorageGRID Webscale Licensing Agreement by entering:
I agree

5. When prompted, enter y to confirm that the current time is within 10 minutes of the time displayed.
If the time is not accurate, enter n and, when prompted, enter the correct system date and time in UTC using the format YYYY-MM-DD-hh-mm.
Note: Ensure that you verify the time displayed. Provisioning might fail if the displayed time is not within 10 minutes of the current time.
To determine the difference between local time and UTC time, you can use a UTC time conversion tool available online, such as The Time Now. The official current UTC time is available from the International Bureau of Weights and Measures (BIPM).
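If you are already logged in to the server, you can also print the current UTC time directly, in the same format the provisioning prompt expects. This is a generic check using the standard date utility, not a StorageGRID-specific command:

```shell
# Print the current UTC time as YYYY-MM-DD-hh-mm, the format the
# provisioning prompt expects when you correct the system time.
date -u +%Y-%m-%d-%H-%M
```

Compare this value with the time shown at the prompt; they should agree to within 10 minutes.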

Restoring the GPT Repository

You need to restore the Grid Provisioning Tool (GPT) repository from the most recent backup so that you can recreate the provisioning media and SAID package required to recover the primary Admin Node.

Before you begin
- You must be recovering the primary Admin Node.
- You must have loaded the provisioning software onto the server.
- The provisioning media must be copied to a USB flash drive.

About this task
The GPT repository is used to create provisioning media and SAID packages. The primary Admin Node typically hosts the GPT repository. To recover the primary Admin Node, you must restore the repository from the backup. Restore the GPT repository on the Admin Node using the provisioning media created the last time the StorageGRID Webscale system was provisioned.
Note: At this point in the reinstallation of the Admin Node, the virtual machine does not have networking configured, so it is not possible to use WinSCP to copy the repository to the Admin Node. You must use a copy of the repository on a USB flash drive.

1. Complete the following steps to copy the contents of the USB flash drive to the /var/local/usb directory. Ensure that you copy the gpt-backup directory and all of its contents.
a. Insert the USB flash drive containing the provisioning media for your grid into the service laptop USB port.
b. Open the VMware vSphere Client and log in.
c. Select the Admin Node virtual machine in the navigation tree and power on the virtual machine if it is not currently running.
d. Click the Connect/Disconnect the USB devices to the virtual machine icon, and select Connect to USB device 1 > <name_of_usb_drive>.
e. Click OK in the Connect USB device dialog box.
The device driver for the USB flash drive is automatically installed and configured on the virtual machine.
f. Click the Console tab.
g. Click anywhere in the Console pane. Your mouse pointer disappears.
Tip: Press Ctrl+Alt to release the mouse pointer from the console.
h. Log in as the root user. Press Enter at the Password prompt to specify a blank password.
i. Mount the USB flash drive:
mount_usb_flash_drive.py

j. Copy the gpt-backup directory and all of its contents from the USB flash drive to the /var/local/usb directory on the primary Admin Node:
mkdir /var/local/usb
cp -r /mnt/usb/<path_to_gpt_backup> /var/local/usb
<path_to_gpt_backup> is the location of the gpt-backup directory.

2. Restore the Grid Provisioning Tool repository:
restore-repository /var/local/usb

3. When you are prompted for the repository passphrase, enter the passphrase.

4. Log out from the primary Admin Node server.

5. Log back in to the primary Admin Node as root, using the password from the Passwords.txt file.

6. Unmount the USB flash drive:
umount /mnt/usb

Related tasks
Installing provisioning software on page 49
You must install the provisioning software that is used to generate the collection of files for installing and configuring the customized StorageGRID Webscale system you defined in your grid specification file.

Updating the primary Admin Node configuration files

You must run a script that copies the required configuration files to the correct locations after you reprovision your StorageGRID Webscale system.

Before you begin
- You must have restored the GPT repository to the Admin Node server.
- You must have reprovisioned the StorageGRID Webscale system with the updated grid specification file.

About this task
This task assumes you are recovering a primary Admin Node and that the hardware is different from the original server. When you restore the GPT repository, it restores various configuration files that are specific to the original hardware configuration of the server. These files are recreated for the updated hardware when you reprovision the StorageGRID Webscale system, but they must be copied to the correct places on the replacement server. The configuration files govern such items as the networking configuration, disk configuration, and security keys.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Change directories:
cd /home/provision

3. Run the script that copies the updated configuration files:
autoyast-emulator.rb admin-autoinst.xml.processed

Installing StorageGRID Webscale software

You must install the StorageGRID Webscale software on each virtual machine to create the type of grid node you are recovering.

Before you begin
A terminal application such as PuTTY must be available on the service laptop.

About this task
The Grid Deployment Utility (GDU) is a console application used to install StorageGRID Webscale software, enable services, and execute other tasks on individual grid nodes. The GDU is accessed through a terminal application such as PuTTY.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Add your private key identity to the SSH authentication agent:
ssh-add

3. Enter the SSH Access Password for the primary Admin Node listed in the Passwords.txt file.

4. Start the Grid Deployment Utility (GDU):
gdu-console

5. Type the provisioning passphrase, press Tab to select OK, and press Enter.

6. Complete the following steps for each additional grid node you are configuring:
a. Ensure that the grid node you are recovering is selected and that the current state is Available.
b. Press Tab to move through the options to the Tasks list.
c. Press the down arrow to highlight Install Software, and press the Spacebar to select it.
d. Press Tab to move through the options and select the Start Task action.
e. Press Enter to run the task.
The current state of the grid node returns to Available when the task completes.

7. When the task completes for the last grid node you are recovering, press the right arrow to move to the Quit action and press Enter.

Starting StorageGRID Webscale software

You need to start the StorageGRID Webscale software on grid nodes by enabling services.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Add your private key identity to the SSH authentication agent:
ssh-add

3. Enter the SSH Access Password for the primary Admin Node listed in the Passwords.txt file.

4. Start the GDU:
gdu-console

5. Type the provisioning passphrase, press the Tab key to select OK, and press Enter.

6. Start the StorageGRID Webscale software on the server:
a. Ensure that the grid node you are configuring is selected and that the current state is Available.
b. Press the Tab key to move through the options to the Tasks list.
c. Use the down-arrow key to highlight Enable Services, and press the Spacebar to select it.
d. Press the Tab key to move through the options and select the Start Task action.
e. Press Enter to run the task.
Wait for the message Finished Postinstall start task to appear in the Log Messages panel.
Note: If you are completing this procedure on a primary Admin Node, do not select the Load Configuration option.
f. When the task completes, press the right-arrow key to move to the Quit action and press Enter.

7. End the SSH session:
ssh-add -D

8. Verify that the selected grid node is configured correctly:
a. Open the VMware vSphere Client and log in.
b. Select the virtual machine in the navigation tree and power on the virtual machine if it is not currently running.
c. Click the Console tab.
d. Verify that the StorageGRID Webscale Server Console is displayed, that the status of all components is Verified, and that the status of all services is Running.

Disabling the Load Configuration option

When you are recovering the primary Admin Node, you should disable the Load Configuration option to prevent it from being selected later in error.

About this task
The system automatically restores the latest version of the configuration bundles to the new Admin Node after it is started. If you mistakenly select the Load Configuration option, the system overwrites all configuration changes made through the NMS MI since the StorageGRID Webscale system was first installed.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Disable the Load Configuration option for GDU:
echo LOAD > /var/local/run/install.state

3. Log out from the primary Admin Node:
exit
The Load Configuration option no longer appears for the primary Admin Node in GDU.

Applying hotfixes and maintenance releases

You need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system.

Before you begin
StorageGRID Webscale software must be started on the recovered grid node.

About this task
After the StorageGRID Webscale software is started, confirm that the recovered grid node is running the same version of the StorageGRID Webscale software as other grid nodes of the same type. If it is not, apply any hotfixes or maintenance releases needed to update the recovered grid node to the same version as the rest of the grid nodes of the same type.

1. Sign in to the NMS MI.

2. Determine the current version of the StorageGRID Webscale software:
a. Go to grid node of same type > SSM > Services > Main.
b. Under Packages, note the storage-grid-release number, or refer to the documented storage-grid-release number for the grid node.

3. Determine the version of the StorageGRID Webscale software on the recovered grid node:
a. Go to recovered grid node > SSM > Services > Main.
b. Under Packages, note the storage-grid-release number.

4. Compare the two versions and, if they differ, apply hotfixes or maintenance releases as necessary to update the recovered grid node to the correct software version.
For more information about available hotfixes and maintenance releases, contact technical support.

Related tasks
Checking the StorageGRID Webscale version on page 5

Importing the updated grid specification file

You must import the bundle that includes the updated grid specification file into the StorageGRID Webscale system after you have started the recovered primary Admin Node.

About this task
Normally the system is provisioned when the primary Admin Node is running, and the updated grid specification file is imported automatically. In this case, because you are recovering the primary Admin Node, it was not installed and operational when you provisioned the StorageGRID Webscale system during the recovery procedure.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Ensure that the CMN service has started on the reinstalled primary Admin Node:
a. Press Alt-F7 to access the Server Manager interface on the Admin Node, and ensure that the CMN service has a status of Running.
b. For more information about Server Manager, see the Administrator Guide.
c. Return to the command line.

3. At the command line of the Admin Node, enter the following command:
import-gptb-bundle

4. When prompted, enter the repository passphrase.
The required passphrase is the one you recorded as part of the original installation of the StorageGRID Webscale system. The updated grid specification file is imported into the StorageGRID Webscale system. You can view the grid specification file through the NMS MI at Grid Management > Grid Configuration > Configuration. This version of the grid specification file is only a copy, not the actual grid specification file; it is used only for troubleshooting purposes and cannot be used to provision the StorageGRID Webscale system.

5. Log out of the command shell:
exit

Restoring the audit log

If you were able to preserve the audit log from the failed Admin Node, you can copy it to the Admin Node you are recovering.

Before you begin
- The Admin Node must be installed and running.
- You must have copied the audit logs to another location after the original Admin Node failed.

About this task
If an Admin Node fails, audit logs saved to that Admin Node are potentially lost. It might be possible to prevent this loss by copying audit logs from the failed Admin Node and then restoring these audit logs to the recovered Admin Node. Depending on the failure, it might not be possible to copy audit logs from the failed Admin Node. In a deployment with more than one Admin Node, audit logs are replicated to all Admin Nodes; therefore, if audit logs cannot be copied from the failed Admin Node, they can be recovered from another Admin Node in the system. In a deployment with only one Admin Node, where the audit log cannot be copied from the failed Admin Node, the recovered Admin Node starts recording events to the audit log as if the installation were new. You must recover an Admin Node as soon as possible to restore logging functionality.

1. From the service laptop, log in to the grid node where you made a temporary copy of the audit log files as root, using the password listed in the Passwords.txt file.

2. Check which audit files have been preserved:
cd /var/local/tmp
ls -l
The following audit log files might be present in the temporary directory:
- The renamed current log file (audit.log) from the failed grid node: YYYY-MM-DD.txt.1
- The rotated audit log file from the day before the failure: YYYY-MM-DD.txt (or YYYY-MM-DD.txt.n if more than one is created in a day)
- Older compressed and rotated audit log files: YYYY-MM-DD.txt.gz, preserving the original archive date in their name

3. Copy the preserved audit log files to the recovered Admin Node:
scp -p YYYY* recovered_admin_ip:/var/local/audit/export
recovered_admin_ip is the IP address of the recovered Admin Node.

4. When prompted, enter the password of the recovered grid node, as listed in the Passwords.txt file.

5. Remove the preserved audit log files from their temporary location:
rm YYYY*

6. Log out of the server that contained the temporary location of the audit logs:
exit

7. Log in to the recovered grid node as root.

8. Update the user and group settings of the preserved audit log files:
cd /var/local/audit/export
chown ams-user:bycast *

After you finish
You also need to restore any preexisting client access to the audit share. For more information, see "Chapter 11: Configuring audit client access" in the Administrator Guide.

Related tasks
Copying audit logs from the failed Admin Node on page 43
Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the replacement Admin Node. You want to preserve the audit log files if they are recoverable.

Resetting the preferred sender

If the Admin Node you are recovering is currently set as the preferred sender of notifications and AutoSupport messages, you must reconfigure this setting in the NMS MI.

Before you begin
The Admin Node must be installed and running.

1. Sign in to the NMS MI of the recovered Admin Node using the Vendor account.

2. Go to Grid Management > NMS Management > General > Main.

3. Select Preferred Sender > recovered Admin Node.

4. Click Apply Changes.

Restoring the NMS database

If you want to retain the historical information about attribute values and alarms on a failed Admin Node, you need to restore the Network Management System (NMS) database. This database can be restored only if your StorageGRID Webscale system includes more than one Admin Node.

Before you begin
- The Admin Node must be installed and running.
- The StorageGRID Webscale system must include one or more data center sites with an Admin Node.

About this task
If an Admin Node fails, the historical information about attribute values and alarms stored in the NMS database for that Admin Node is lost. In a StorageGRID Webscale system with more than one Admin Node, the NMS database is recovered by copying the NMS database from another Admin Node. In a system with only one Admin Node, the NMS database cannot be restored. When you are recovering an Admin Node, the software installation process creates a new database for the NMS service on the recovered Admin Node. After it is started, the recovered Admin Node records attribute and audit information for all services as if your system were a new installation of the StorageGRID Webscale system. In a StorageGRID Webscale system with more than one Admin Node, you can copy the NMS database from another Admin Node to the recovered Admin Node to restore historical information.
Note: To copy the NMS database, the StorageGRID Webscale system must be configured with multiple Admin Nodes.

1. Stop the MI service on both Admin Nodes:
a. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.
b. Stop the MI service:
/etc/init.d/mi stop
c. Repeat the same steps on the second Admin Node.

2. Complete the following steps on the recovered Admin Node:
a. Copy the database:
/usr/local/mi/bin/mi-clone-db.sh Source_Admin_Node_IP
Source_Admin_Node_IP is the IP address of the source Admin Node from which the NMS database is copied.
b. When prompted, enter the password for the Admin Node found in the Passwords.txt file.
c. When prompted, confirm that you want to overwrite the MI database on the recovered Admin Node.
d. When prompted, enter the password for the Admin Node found in the Passwords.txt file.
The NMS database and its historical data are copied to the recovered Admin Node. When it is done, the script starts the recovered Admin Node.
Note: Copying the NMS database might take several hours.

3. Restart the MI service on the source Admin Node:
/etc/init.d/mi start

Recovering from nonprimary Admin Node failures

You need to complete a specific set of tasks to recover from an Admin Node failure at any data center site other than the site where the primary Admin Node is located.

1. Generating server activation media on page 59
You need to generate the server activation media to create the necessary files to reinstall software on each failed virtual machine in the VMware vSphere Client.
2. Copying audit logs from the failed Admin Node on page 59
Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the replacement Admin Node. You want to preserve the audit log files if they are recoverable.
3. Installing SLES on virtual machines on page 60
You need to install the SUSE Linux Enterprise Server (SLES) operating system on each virtual machine that will host a StorageGRID Webscale grid node.
4. Installing VMware Tools on page 62
You must install VMware Tools on each virtual machine that will host a grid node for enhanced performance and improved management of the virtual machine.
5. Installing StorageGRID Webscale software on page 63
You must install the StorageGRID Webscale software on each virtual machine to create the type of grid node you are recovering.
6. Starting StorageGRID Webscale software on page 63
You need to start the StorageGRID Webscale software on grid nodes by enabling services.
7. Applying hotfixes and maintenance releases on page 64
You need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system.
8. Restoring the audit log on page 65
If you were able to preserve the audit log from the failed Admin Node, you can copy it to the Admin Node you are recovering.
9. Resetting the preferred sender on page 66

If the Admin Node you are recovering is currently set as the preferred sender of notifications and AutoSupport messages, you must reconfigure this setting in the NMS MI.
10. Restoring the NMS database on page 67
If you want to retain the historical information about attribute values and alarms on a failed Admin Node, you need to restore the Network Management System (NMS) database. This database can be restored only if your StorageGRID Webscale system includes more than one Admin Node.

Generating server activation media

You need to generate the server activation media to create the necessary files to reinstall software on each failed virtual machine in the VMware vSphere Client.

Before you begin
- A service laptop must be available.
- An SCP tool, such as WinSCP, must be installed on the service laptop, which you can use to transfer files to and from the primary Admin Node.

About this task
You need to regenerate the server activation media to create the necessary files, including the virtual machine template (.ovf) and boot image file (.flp), for each grid node. To restore a grid node, you only need to use the boot image file (.flp) to reinstall SUSE Linux Enterprise Server on the grid node virtual machine.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Create a directory and output the generated files to that directory:
mkdir /var/local/deploy-files
generate-grid-deployment-files.rb -o /var/local/deploy-files

3. Enter the provisioning passphrase at the prompt.

4. Transfer the contents of the /var/local/deploy-files directory to the service laptop using WinSCP, or a similar tool.
The generated files include a virtual machine template file (.ovf) and a boot image file (.flp) for each grid node listed in your grid specification file. Only the .flp file is required to recover and configure failed grid nodes.
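If a command-line SCP client is available on the service laptop, the transfer in step 4 can also be done without WinSCP. This is a sketch: the Admin Node address below is a placeholder, and the root password is the one listed in the Passwords.txt file:

```shell
# Copy the generated deployment files from the primary Admin Node to the
# current directory on the service laptop. Replace primary_admin_IP with
# the IP address of your primary Admin Node (placeholder).
scp -r root@primary_admin_IP:/var/local/deploy-files .
```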
For example, if you are recovering a Storage Node named DC1-S1, the generated file you use to install the operating system and recover the grid node is named DC1-S1.flp.

Copying audit logs from the failed Admin Node

Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the replacement Admin Node. You want to preserve the audit log files if they are recoverable.

About this task
This procedure copies the audit log files from the failed Admin Node to a temporary location. These preserved audit logs can then be copied to the replacement Admin Node. Audit logs are not automatically copied to the new Admin Node.

Depending on the failure, it might not be possible to copy audit logs from the failed Admin Node. In a deployment with more than one Admin Node, audit logs are replicated to all Admin Nodes; therefore, if audit logs cannot be copied off of the failed Admin Node, they can be recovered from another Admin Node in the system. In a deployment with only one Admin Node, where the audit log cannot be copied off of the failed Admin Node, the recovered Admin Node starts recording events to the audit log as if the installation were new.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. If it is running, stop the AMS service from creating a new log file:
/etc/init.d/ams stop

3. Rename the audit.log file so that it does not overwrite the file on the replacement Admin Node when you copy it there. Rename audit.log to a unique numbered file name that includes the date, such as YYYY-MM-DD.txt.1:
cd /var/local/audit/export
ls -l
mv audit.log YYYY-MM-DD.txt.1

4. Copy all audit log files to a temporary location on any other server:
scp -p * IP_address:/var/local/tmp
When prompted, enter the password of the remote server listed in the Passwords.txt file.

5. Log out of the command shell:
exit

Installing SLES on virtual machines

You need to install the SUSE Linux Enterprise Server (SLES) operating system on each virtual machine that will host a StorageGRID Webscale grid node.

Before you begin
- You must have verified that the virtual machine for the grid node is powered on.
- The SLES ISO image for the StorageGRID Webscale release must be in an accessible VMware datastore. For supported versions, see the NetApp Interoperability Matrix Tool.
- You must have located the boot image file (.flp) for each grid node you need to complete the installation for. The boot image files for the additional grid nodes are stored in the deploy-files subdirectory where you transferred the provisioning data to the service laptop using an SCP tool, such as WinSCP.

About this task
The installation process completely erases the server drives and installs the SLES operating system, applications, and support files customized for StorageGRID Webscale.

61 Recovery procedures Open VMware vsphere Client and log in. 2. In the navigation tree, select the virtual machine. 3. Connect the SLES ISO image from the VMware datastore to the virtual machine. Click the Connect/Disconnect the CD/DVD devices of the virtual machine CD/DVD Drive n > Connect to ISO image on a datastore. icon, and select 4. In the Browse Datastores dialog box, locate and select the SLES.iso file, and click OK. 5. Click the Connect/Disconnect the floppy devices of the virtual machine icon and then select Floppy Drive 1 > Connect to Floppy Image on local disk. 6. In the Open dialog box, locate and select the boot image file (.flp) that contains the activation file for this server and click Open. 7. Click the Console tab. 8. Click anywhere inside the Console pane. Your cursor disappears and you are locked into the Console pane. 9. Press Ctrl-Alt-Insert to restart the virtual machine. The server performs the following steps while rebooting: The BIOS runs a hardware verification. By default, the system boots from the ISO image connected on the CD/DVD drive, and loads the SLES boot screen in the VMware vsphere Client Console pane. 10. When the SLES Boot Screen is displayed, press the down arrow key to select the Installation option (do not press Enter). Note: You must move the cursor to the Installation option within eight seconds. If you do not, SLES will automatically attempt to install from the hard drive and the installation process will fail. If this happens, you must restart the virtual machine and select the Installation option within the required time. 11. Press Tab, and at the Boot Options prompt, enter the following command: autoyast=device://fd0/autoinst.xml Note: You must always specify the autoyast information at the Boot Options prompt. If you do not enter this information, AutoYaST does not complete the required custom installation for the server. 
    If you enter an incorrect value and are prompted to reenter the device name and path, verify the floppy device name. The device name is fd zero (fd0).

12. Press Enter. When the SLES installation is complete, the server completes its configuration and starts the operating system. Installation is complete when the login prompt appears.

13. Disconnect the SLES ISO image by clicking the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and then selecting CD/DVD Drive n > Disconnect from datastore image.

Installing VMware Tools

You must install VMware Tools on each virtual machine that will host a grid node, for enhanced performance and improved management of the virtual machine.

About this task

The required version is made available to virtual machines through the VMware vSphere Client.

1. Open VMware vSphere Client and log in.

2. In the navigation tree, select the virtual machine on which you want to complete the installation, and then power on the virtual machine if it is not started.

3. Click the Connect/Disconnect the CD/DVD devices of the virtual machine icon, and select CD/DVD Drive n > Connect to ISO image on a datastore.

4. In the Browse Datastores dialog box, navigate to the /vmimages/tools-isoimages subdirectory, select linux.iso, and click Open.

5. Click the Console tab.

6. Click anywhere in the Console pane and log in as root using the password listed in the Passwords.txt file.

7. Mount the VMware Tools installer:
   mount /cdrom

8. Copy the gzip installation package to a temporary directory on the virtual machine and unpack it:
   mkdir /tmp/vmtools
   cd /tmp/vmtools
   tar -zxvf /cdrom/VMwareTools-*.tar.gz

9. Install VMware Tools with the default installation options:
   cd /tmp/vmtools/vmware-tools-distrib/
   ./vmware-install.pl --default

10. When the installation is complete, verify that VMware Tools is running:
    /etc/init.d/vmware-tools status
    If the installation was successful, you see the following status message:
    vmtoolsd is running.

11. Remove the installation files from the virtual machine:
    cd /tmp
    rm -rf vmtools

12. Reboot the system to ensure that the changes take effect:
    reboot
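The unpack-and-install pattern in steps 8 and 9 can be exercised as one sequence. The sketch below substitutes a stand-in archive and installer script so the commands can run outside a VM; the real archive is VMwareTools-*.tar.gz on the mounted CD, and the real installer is ./vmware-install.pl --default.

```shell
# Sketch only: the unpack pattern from steps 8-9, using a hypothetical
# stand-in for the VMwareTools-*.tar.gz archive from the mounted CD.
work=$(mktemp -d)
mkdir -p "$work/cdrom/vmware-tools-distrib"
printf '#!/bin/sh\necho vmtools installer ran\n' \
    > "$work/cdrom/vmware-tools-distrib/vmware-install.pl"
( cd "$work/cdrom" && tar -czf VMwareTools-0.0.0.tar.gz vmware-tools-distrib )

mkdir "$work/vmtools"
cd "$work/vmtools"
tar -zxf "$work"/cdrom/VMwareTools-*.tar.gz     # step 8: unpack the package
sh vmware-tools-distrib/vmware-install.pl       # step 9 (real run: ./vmware-install.pl --default)
```

The glob in the tar command matches whatever version of the archive is present, which is why the documented steps use VMwareTools-*.tar.gz rather than a fixed file name.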

Installing StorageGRID Webscale software

You must install the StorageGRID Webscale software on each virtual machine to create the type of grid node you are recovering.

Before you begin

A Telnet application such as PuTTY must be available on the service laptop.

About this task

The Grid Deployment Utility (GDU) is a console application used to install StorageGRID Webscale software, enable services, and execute other tasks on individual grid nodes. You access GDU through a Telnet application such as PuTTY.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Add your private key identity to the SSH authentication agent:
   ssh-add

3. Enter the SSH Access Password for the primary Admin Node listed in the Passwords.txt file.

4. Start the Grid Deployment Utility (GDU):
   gdu-console

5. Type the provisioning passphrase, press Tab to select OK, and press Enter.

6. Complete the following steps for each grid node you are recovering:
   a. Ensure that the grid node is selected and that its current state is Available.
   b. Press Tab to move through the options to the Tasks list.
   c. Press the down arrow to highlight Install Software, and press the Spacebar to select it.
   d. Press Tab to move through the options and select the Start Task action.
   e. Press Enter to run the task.
   The current state of the grid node returns to Available when the task completes.

7. When the task completes for the last grid node you are recovering, press the right arrow to move to the Quit action and press Enter.

Starting StorageGRID Webscale software

You need to start the StorageGRID Webscale software on grid nodes by enabling services.

1. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.

2. Add your private key identity to the SSH authentication agent:
   ssh-add

3. Enter the SSH Access Password for the primary Admin Node listed in the Passwords.txt file.

4. Start the GDU:
   gdu-console

5. Type the provisioning passphrase, press the Tab key to select OK, and press Enter.

6. Start the StorageGRID Webscale software on the server:
   a. Ensure that the grid node you are configuring is selected and that its current state is Available.
   b. Press the Tab key to move through the options to the Tasks list.
   c. Use the down-arrow key to highlight Enable Services, and press the Spacebar to select it.
   d. Press the Tab key to move through the options and select the Start Task action.
   e. Press Enter to run the task. Wait for the message "Finished Postinstall start task" to appear in the Log Messages panel.
   Note: If you are completing this procedure on a primary Admin Node, do not select the Load Configuration option.
   f. When the task completes, press the right-arrow key to move to the Quit action and press Enter.

7. End the SSH session:
   ssh-add -D

8. Verify that the selected grid node is configured correctly:
   a. Open VMware vSphere Client and log in.
   b. Select the virtual machine in the navigation tree, and power on the virtual machine if it is not currently running.
   c. Click the Console tab.
   d. Verify that the StorageGRID Webscale Server Console is displayed, that the status of all components is Verified, and that the status of all services is Running.

Applying hotfixes and maintenance releases

You need to verify the version of the recovered grid node and ensure that it matches the version of all of the other grid nodes in your StorageGRID Webscale system.

Before you begin

StorageGRID Webscale software must be started on the recovered grid node.

About this task

After the StorageGRID Webscale software is started, confirm that the recovered grid node is running the same version of the StorageGRID Webscale software as other grid nodes of the same type. If it is not, apply any hotfixes or maintenance releases needed to update the recovered grid node to the same version.

1. Sign in to the NMS MI.

2. Determine the current version of the StorageGRID Webscale software:

   a. Go to grid node of same type > SSM > Services > Main.
   b. Under Packages, note the storage-grid-release number, or refer to the documented storage-grid-release number for the grid node.

3. Determine the version of the StorageGRID Webscale software on the recovered grid node:
   a. Go to recovered grid node > SSM > Services > Main.
   b. Under Packages, note the storage-grid-release number.

4. Compare the two versions. If they differ, apply hotfixes or maintenance releases as necessary to update the recovered grid node to the correct software version. For more information about available hotfixes and maintenance releases, contact technical support.

Related tasks
Checking the StorageGRID Webscale version on page 5

Restoring the audit log

If you were able to preserve the audit log from the failed Admin Node, you can copy it to the Admin Node you are recovering.

Before you begin

- The Admin Node must be installed and running.
- You must have copied the audit logs to another location after the original Admin Node failed.

About this task

If an Admin Node fails, audit logs saved to that Admin Node are potentially lost. It might be possible to prevent this loss by copying the audit logs from the failed Admin Node and then restoring them to the recovered Admin Node. Depending on the nature of the failure, it might not be possible to copy audit logs from the failed Admin Node.

In a deployment with more than one Admin Node, audit logs are replicated to all Admin Nodes. Thus, in a multiple Admin Node deployment, if audit logs cannot be copied from the failed Admin Node, they can be recovered from another Admin Node in the system. In a deployment with only one Admin Node, where the audit log cannot be copied from the failed Admin Node, the recovered Admin Node starts recording events to the audit log as if the installation were new. You must recover an Admin Node as soon as possible to restore logging functionality.

1. From the service laptop, log in to the grid node where you made a temporary copy of the audit log files as root, using the password listed in the Passwords.txt file.

2. Check which audit files have been preserved:
   cd /var/local/tmp
   ls -l
   The following audit log files might be present in the temporary directory:
   - The renamed current log file (audit.log) from the failed grid node: YYYY-MM-DD.txt.1
   - The rotated audit log file from the day before the failure: YYYY-MM-DD.txt (or YYYY-MM-DD.txt.n if more than one is created in a day)

   - Older compressed and rotated audit log files: YYYY-MM-DD.txt.gz, preserving the original archive date in their names

3. Copy the preserved audit log files to the recovered Admin Node:
   scp -p YYYY* recovered_admin_ip:/var/local/audit/export
   where recovered_admin_ip is the IP address of the recovered Admin Node.

4. When prompted, enter the password of the recovered grid node, as listed in the Passwords.txt file.

5. Remove the preserved audit log files from their temporary location:
   rm YYYY*

6. Log out of the server that contained the temporary copy of the audit logs:
   exit

7. Log in to the recovered grid node as root.

8. Update the user and group settings of the preserved audit log files:
   cd /var/local/audit/export
   chown ams-user:bycast *

After you finish

You also need to restore any preexisting client access to the audit share. For more information, see "Chapter 11: Configuring audit client access" in the Administrator Guide.

Related tasks
Copying audit logs from the failed Admin Node on page 43
Depending on the type of Admin Node failure, you might be able to recover the audit logs from the failed server and later restore them on the replacement Admin Node. You want to preserve the audit log files if they are recoverable.

Resetting the preferred sender

If the Admin Node you are recovering is currently set as the preferred sender of notifications and AutoSupport messages, you must reconfigure this setting in the NMS MI.

Before you begin

The Admin Node must be installed and running.

1. Sign in to the NMS MI of the recovered Admin Node using the Vendor account.

2. Go to Grid Management > NMS Management > General > Main.

3. Select Preferred Sender > recovered Admin Node.

4. Click Apply Changes.

Restoring the NMS database

If you want to retain the historical information about attribute values and alarms on a failed Admin Node, you need to restore the Network Management System (NMS) database. This database can be restored only if your StorageGRID Webscale system includes more than one Admin Node.

Before you begin

- The Admin Node must be installed and running.
- The StorageGRID Webscale system must include one or more data center sites with an Admin Node.

About this task

If an Admin Node fails, the historical information about attribute values and alarms stored in the NMS database for that Admin Node is lost. In a StorageGRID Webscale system with more than one Admin Node, the NMS database is recovered by copying the NMS database from another Admin Node. In a system with only one Admin Node, the NMS database cannot be restored.

When you are recovering an Admin Node, the software installation process creates a new database for the NMS service on the recovered Admin Node. After it is started, the recovered Admin Node records attribute and audit information for all services as if your system were a new installation. In a StorageGRID Webscale system with more than one Admin Node, you can copy the NMS database from another Admin Node to the recovered Admin Node to restore the historical information.

Note: To copy the NMS database, the StorageGRID Webscale system must be configured with multiple Admin Nodes.

1. Stop the MI service on both Admin Nodes:
   a. From the service laptop, log in to the primary Admin Node as root using the password listed in the Passwords.txt file.
   b. Stop the MI service:
      /etc/init.d/mi stop
   c. Repeat the same steps on the second Admin Node.

2. Complete the following steps on the recovered Admin Node:
   a. Copy the database:


Virtual Infrastructure Web Access Administrator s Guide Update 2 and later for ESX Server 3.5 and VirtualCenter 2.5 Virtual Infrastructure Web Access Administrator s Guide Update 2 and later for ESX Server 3.5 and VirtualCenter 2.5 Virtual Infrastructure Web Access Administrator s Guide Virtual Infrastructure Web Access

More information

Install ISE on a VMware Virtual Machine

Install ISE on a VMware Virtual Machine ISE Features Not Supported in a Virtual Machine, page 1 Supported VMware Versions, page 1 Support for VMware vmotion, page 2 Support for Open Virtualization Format, page 2 Virtual Machine Requirements,

More information

Installation. Power on and initial setup. Before You Begin. Procedure

Installation. Power on and initial setup. Before You Begin. Procedure Power on and initial setup, page 1 Customize ESXi host for remote access, page 4 Access and configure ESXi host, page 6 Deploy virtual machines, page 13 Install applications on virtual machines, page 14

More information

Cisco Business Edition 7000 Installation Guide, Release 10.6

Cisco Business Edition 7000 Installation Guide, Release 10.6 First Published: July 08, 2015 Americas Headquarters Cisco Systems, Inc. 170 West Tasman Drive San Jose, CA 95134-1706 USA http://www.cisco.com Tel: 408 526-4000 800 553-NETS (6387) Fax: 408 527-0883 Text

More information

SonicWall SMA 8200v. Getting Started Guide

SonicWall SMA 8200v. Getting Started Guide SonicWall SMA 8200v Getting Started Guide Copyright 2017 SonicWall Inc. All rights reserved. SonicWall is a trademark or registered trademark of SonicWall Inc. and/or its affiliates in the U.S.A. and/or

More information

Guideline for the installation of C-MOR Video Surveillance Virtual Machine on VMware ESX Server

Guideline for the installation of C-MOR Video Surveillance Virtual Machine on VMware ESX Server This guideline illustrates the installation of the C-MOR Video Surveillance Virtual Machine on VMware ESX Server. This manual applies to C-MOR version 4 with 64 bit operating system. First download the

More information

Zenoss Resource Manager Upgrade Guide

Zenoss Resource Manager Upgrade Guide Zenoss Resource Manager Upgrade Guide Release 6.2.1 Zenoss, Inc. www.zenoss.com Zenoss Resource Manager Upgrade Guide Copyright 2018 Zenoss, Inc. All rights reserved. Zenoss, Own IT, and the Zenoss logo

More information

Offline Array Recovery Procedures SuperTrak SX6000 and UltraTrak

Offline Array Recovery Procedures SuperTrak SX6000 and UltraTrak Version 5b Offline Array Recovery Procedures SuperTrak SX6000 and UltraTrak This document describes the procedures for protecting data and restoring array status to arrays that have gone OFFLINE. Promise

More information

UDP Director Virtual Edition Installation and Configuration Guide (for Stealthwatch System v6.9.0)

UDP Director Virtual Edition Installation and Configuration Guide (for Stealthwatch System v6.9.0) UDP Director Virtual Edition Installation and Configuration Guide (for Stealthwatch System v6.9.0) Installation and Configuration Guide: UDP Director VE v6.9.0 2016 Cisco Systems, Inc. All rights reserved.

More information

Upgrading from TrafficShield 3.2.X to Application Security Module 9.2.3

Upgrading from TrafficShield 3.2.X to Application Security Module 9.2.3 Upgrading from TrafficShield 3.2.X to Application Security Module 9.2.3 Introduction Preparing the 3.2.X system for the upgrade Installing the BIG-IP version 9.2.3 software Licensing the software using

More information

Clearswift SECURE Gateway Installation & Getting Started Guide. Version Document Revision 1.0

Clearswift SECURE  Gateway Installation & Getting Started Guide. Version Document Revision 1.0 Clearswift SECURE Email Gateway Installation & Getting Started Guide Version 4.6.0 Document Revision 1.0 Copyright Revision 1.0, April, 2017 Published by Clearswift Ltd. 1995 2017 Clearswift Ltd. All rights

More information

Dell EMC Avamar Virtual Edition for VMware

Dell EMC Avamar Virtual Edition for VMware Dell EMC Avamar Virtual Edition for VMware Version 7.5.1 Installation and Upgrade Guide 302-004-301 REV 01 Copyright 2001-2018 Dell Inc. or its subsidiaries. All rights reserved. Published February 2018

More information

Archiware Pure Quick Start Guide

Archiware Pure Quick Start Guide Archiware Pure Quick Start Guide Content 1 System Requirements... 3 1.1 Hardware Requirements... 3 1.2 Supported Hypervisors... 3 1.3 Deployment Requirements... 3 2 Deploying the Virtual Appliance... 4

More information

UPGRADING STRM TO R1 PATCH

UPGRADING STRM TO R1 PATCH UPGRADING STRM TO 2012.1.R1 PATCH RELEASE 2012.1 MARCH 2013 This Upgrade Guide provides information on the following: Before You Upgrade Clearing the Cache After You Upgrade Before You Upgrade Upgrade

More information

Getting Started with ESX Server 3i Embedded ESX Server 3i version 3.5 Embedded and VirtualCenter 2.5

Getting Started with ESX Server 3i Embedded ESX Server 3i version 3.5 Embedded and VirtualCenter 2.5 Getting Started with ESX Server 3i Embedded ESX Server 3i version 3.5 Embedded and VirtualCenter 2.5 Title: Getting Started with ESX Server 3i Embedded Revision: 20071022 Item: VMW-ENG-Q407-430 You can

More information

Cisco VVB Installation

Cisco VVB Installation System Requirements, on page 1 Create VM for Cisco VVB, on page 2 Create a Virtual Machine from the OVA, on page 2 Configure DNS Server, on page 3 Mount ISO Files, on page 3 Install Cisco VVB, on page

More information

Installing the Operating System or Hypervisor

Installing the Operating System or Hypervisor If you purchased E-Series Server or NCE Option 1 (E-Series Server or NCE without a preinstalled operating system or hypervisor), you must install an operating system or hypervisor. This chapter includes

More information

OnCommand Unified Manager Installation and Setup Guide for Use with Core Package 5.2 and Host Package 1.3

OnCommand Unified Manager Installation and Setup Guide for Use with Core Package 5.2 and Host Package 1.3 IBM System Storage N series OnCommand Unified Manager Installation and Setup Guide for Use with Core Package 5.2 and Host Package 1.3 GA32-1020-03 Table of Contents 3 Contents Preface... 10 Supported

More information

About Updating a System, page 1 Connecting to an ISO Image from the CD/DVD Drive, page 4 Updating Data Centers, page 4

About Updating a System, page 1 Connecting to an ISO Image from the CD/DVD Drive, page 4 Updating Data Centers, page 4 About Updating a System, page 1 Connecting to an ISO Image from the CD/DVD Drive, page 4 Updating Data Centers, page 4 About Updating a System In a Single-data Center (SDC) system, the data center must

More information

Installing and Configuring vcloud Connector

Installing and Configuring vcloud Connector Installing and Configuring vcloud Connector vcloud Connector 2.5.0 This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new

More information

StorageGRID Webscale Installation Guide. For Red Hat Enterprise Linux or CentOS Deployments. October _A0

StorageGRID Webscale Installation Guide. For Red Hat Enterprise Linux or CentOS Deployments. October _A0 StorageGRID Webscale 11.0 Installation Guide For Red Hat Enterprise Linux or CentOS Deployments October 2017 215-12397_A0 doccomments@netapp.com Table of Contents 3 Contents Installation overview... 5

More information

VMware vcloud Air User's Guide

VMware vcloud Air User's Guide vcloud Air This document supports the version of each product listed and supports all subsequent versions until the document is replaced by a new edition. To check for more recent editions of this document,

More information

How to Deploy a Barracuda NG Vx using Barracuda NG Install on a VMware Hypervisor

How to Deploy a Barracuda NG Vx using Barracuda NG Install on a VMware Hypervisor How to Deploy a Barracuda NG Vx using Barracuda NG Install on a VMware Hypervisor The OVA package uses a default configuration that may not be suitable for your deployment. If you want to use multiple

More information

Virtual Infrastructure Web Access Administrator s Guide ESX Server 3.0 and VirtualCenter 2.0

Virtual Infrastructure Web Access Administrator s Guide ESX Server 3.0 and VirtualCenter 2.0 Virtual Infrastructure Web Access Administrator s Guide ESX Server 3.0 and VirtualCenter 2.0 Virtual Infrastructure Web Access Administrator s Guide Revision: 20060615 Item: VI-ENG-Q206-217 You can find

More information

Upgrade Guide for Cisco Digital Media System Release 5.0

Upgrade Guide for Cisco Digital Media System Release 5.0 Upgrade Guide for Cisco Digital Media System Release 5.0 Revised: December 12, 2008, This guide provides information about and instructions for upgrading your Cisco DMS appliances and Cisco Digital Media

More information

CA Agile Central Administrator Guide. CA Agile Central On-Premises

CA Agile Central Administrator Guide. CA Agile Central On-Premises CA Agile Central Administrator Guide CA Agile Central On-Premises 2018.1 Table of Contents Overview... 3 Server Requirements...3 Browser Requirements...3 Access Help and WSAPI...4 Time Zone...5 Architectural

More information

Configuring the SMA 500v Virtual Appliance

Configuring the SMA 500v Virtual Appliance Using the SMA 500v Virtual Appliance Configuring the SMA 500v Virtual Appliance Registering Your Appliance Using the 30-day Trial Version Upgrading Your Appliance Configuring the SMA 500v Virtual Appliance

More information

StorageGRID Webscale 10.3 Administrator Guide

StorageGRID Webscale 10.3 Administrator Guide StorageGRID Webscale 10.3 Administrator Guide September 2016 215-10810_A0 doccomments@netapp.com Table of Contents 3 Contents Understanding the StorageGRID Webscale system... 8 What the StorageGRID Webscale

More information

VMware vsphere Storage Appliance Installation and Configuration

VMware vsphere Storage Appliance Installation and Configuration VMware vsphere Storage Appliance Installation and Configuration vsphere Storage Appliance 1.0 vsphere 5.0 This document supports the version of each product listed and supports all subsequent versions

More information

AltaVault Cloud Integrated Storage Installation and Service Guide for Virtual Appliances

AltaVault Cloud Integrated Storage Installation and Service Guide for Virtual Appliances AltaVault Cloud Integrated Storage 4.4.1 Installation and Service Guide for Virtual Appliances April 2018 215-130007_B0 doccomments@netapp.com Table of Contents 3 Contents System requirements and supported

More information

Creating a Multi-data Center (MDC) System

Creating a Multi-data Center (MDC) System , page 1 About Multi-data Centers The Multi-data Center (MDC) licensed feature is available in version 2.5 and higher. It allows two CWMS systems to be joined into a single MDC system. One license must

More information

Virtual Storage Console, VASA Provider, and Storage Replication Adapter for VMware vsphere

Virtual Storage Console, VASA Provider, and Storage Replication Adapter for VMware vsphere Virtual Storage Console, VASA Provider, and Storage Replication Adapter for VMware vsphere Administration Guide for 7.2 release June 2018 215-13169_A0 doccomments@netapp.com Table of Contents 3 Contents

More information

VMware AirWatch Content Gateway for Linux. VMware Workspace ONE UEM 1811 Unified Access Gateway

VMware AirWatch Content Gateway for Linux. VMware Workspace ONE UEM 1811 Unified Access Gateway VMware AirWatch Content Gateway for Linux VMware Workspace ONE UEM 1811 Unified Access Gateway You can find the most up-to-date technical documentation on the VMware website at: https://docs.vmware.com/

More information

Plexxi HCN Plexxi Connect Installation, Upgrade and Administration Guide Release 3.0.0

Plexxi HCN Plexxi Connect Installation, Upgrade and Administration Guide Release 3.0.0 Plexxi HCN Plexxi Connect Installation, Upgrade and Administration Guide Release 3.0.0 May 3, 2018 100 Innovative Way - Suite 3322 Nashua, NH 03062 Tel. +1.888.630.PLEX (7539) www.plexxi.com Legal Notices

More information