ProphetStor DiskProphet Ensures SLA for VMware vSAN

2017 ProphetStor Data Services, Inc.
Table of Contents

Summary
The Challenges
SLA Compliance
Redundancy & Performance Degradation
The Solution
Performance Prediction
Impact Analysis
Replacement Time (Future)
Proactive Alerting (Future)
Key Benefits

2018 ProphetStor Data Services, Inc. All Rights Reserved
Summary

VMware vSAN is enterprise-class storage for hyper-converged infrastructure (HCI) with native integration with VMware vSphere. It uses commodity x86 server components to deliver enterprise-class storage services for many use cases, including business-critical applications, disaster recovery sites, remote office and branch office (ROBO) implementations, and virtual desktop infrastructure (VDI). Global enterprises expect the vSphere with vSAN infrastructure to provide reliable data services with performance in accordance with the SLA. Together with VMware vSAN, ProphetStor DiskProphet delivers a higher level of service and lowers operating expenses and risk through accurate predictions.

The Challenges

While most HCI designs include solution redundancy, hardware failures can threaten the performance and stability of the environment regardless of the scale of the implementation. Increasing the number of failures to tolerate can certainly address the risk. However, this option might not be economical for the many companies attempting to reduce operating expenses. In addition, having more hardware does not necessarily guarantee that the performance SLA is met during a hardware failure.

Figure 1: Host Requirements Based on Failure Tolerance Method
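As a rough illustration of how the failure tolerance method drives both host count and raw capacity, the sketch below encodes the commonly documented vSAN rules for RAID-1 mirroring and RAID-5/6 erasure coding. The names `PolicyRequirement` and `vsan_policy_requirement` are illustrative only, and the numbers come from general vSAN documentation rather than from Figure 1, so they should be checked against the vSAN release in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRequirement:
    min_hosts: int              # minimum number of hosts (fault domains)
    capacity_multiplier: float  # raw capacity consumed per unit of usable data


def vsan_policy_requirement(ftt: int, method: str) -> PolicyRequirement:
    """Return the host count and capacity overhead for a tolerance policy.

    ftt is the number of failures to tolerate; method is "RAID-1" or
    "RAID-5/6". This is a sketch of commonly documented rules, not an
    official sizing tool.
    """
    if method == "RAID-1":
        if ftt not in (1, 2, 3):
            raise ValueError("RAID-1 mirroring supports FTT of 1, 2, or 3")
        # Mirroring keeps ftt + 1 full copies plus witness components,
        # which needs 2 * ftt + 1 hosts.
        return PolicyRequirement(2 * ftt + 1, float(ftt + 1))
    if method == "RAID-5/6":
        if ftt == 1:
            # RAID-5: 3 data + 1 parity components across 4 hosts.
            return PolicyRequirement(4, 4 / 3)
        if ftt == 2:
            # RAID-6: 4 data + 2 parity components across 6 hosts.
            return PolicyRequirement(6, 1.5)
        raise ValueError("RAID-5/6 erasure coding supports FTT of 1 or 2")
    raise ValueError(f"unknown failure tolerance method: {method!r}")


if __name__ == "__main__":
    for ftt, method in [(1, "RAID-1"), (2, "RAID-1"), (1, "RAID-5/6"), (2, "RAID-5/6")]:
        req = vsan_policy_requirement(ftt, method)
        print(f"FTT={ftt} {method}: {req.min_hosts} hosts, "
              f"{req.capacity_multiplier:.2f}x raw capacity")
```

The comparison makes the cost trade-off concrete: tolerating one more failure with mirroring adds two hosts and a full extra copy of the data, which is why raising the tolerance level is not always the economical answer.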
For a VMware vSphere solution with vSAN, the SLA on performance and reliability is critical to the design of the solution. The main challenges include:
• Balancing cost, redundancy, and performance
• Maximizing system reliability
• Reducing operating expense
• Maintaining consistent performance throughout the life cycles of hardware and software

In VMware vSAN, random disk failures on host boot disks, cache disks, or capacity disks can trigger data evacuation at any time, putting the SLA at risk.

SLA Compliance

Redundancy & Performance Degradation

The following factors affect the SLA during a failure event:
• Wait time for hard drive arrival and replacement
• Local SSD for cache or swap files
• Impact during data evacuation with or without Storage vMotion
• Read operation drops due to cache re-warming after a cache disk fails

Typical reactions in the environment to a disk group failure are shown below.

Figure 2: vSAN Performance of One Disk Group Failure
An error during data evacuation is a true disaster:
• The target disk might not have enough capacity to hold the data being evacuated.
• The data might not be readable from the source disk, which is critical if that disk holds the only copy.
• A second hard drive failure may occur during data evacuation.

The Solution

Powered by the ProphetStor AI platform, ProphetStor DiskProphet monitors and analyzes hardware activity in the environment and provides actionable insights, enabling IT administrators to manage hardware component errors and optimize resource utilization. Right out of the box, DiskProphet is ready, with machine learning models for each data type, to collect metrics data from VMware vCenter, ESXi hosts, and hardware components. DiskProphet can then provide predictive analytics and visual charts of trends in the environment.

Figure 3: DiskProphet for vSAN Solution Architecture
Disk Failure Prediction

VMware vSAN implements a logic called Degraded (or Dying) Disk Handling (DDH) to take proactive remediation on disks that are expected to fail. In addition to using SMART data to determine the status of drives, DDH also monitors the latency of drives against an imposed threshold. If the latency threshold is exceeded, the drive is marked Degraded and data evacuation starts, even before the disk actually fails. When the drive is marked Degraded, the performance of the disk group can be degraded for a period of time, as shown in Figure 2. The status of data evacuation can be:
• Disk dying, evacuation complete
  o This is the final status when the disk has been evacuated. At this point, the drive can be decommissioned using any mode, and further diagnostics or replacement can proceed.
• Disk dying, preventative evacuation incomplete due to lack of resources
• Disk dying, preventative evacuation incomplete due to inaccessible objects

DiskProphet applies machine learning and patented prediction technologies to disk SMART data and disk metrics, together with host and VM metrics, and predicts the remaining lifetime of disks (HDDs, SSDs, and NVMe drives) as early as 45 days in advance with higher than 95% accuracy. This early diagnosis gives operators enough leeway to ensure that replacement disks are ready (to prevent incomplete data evacuation) and to plan maintenance during off-peak periods (to minimize performance impact).

Figure 4: Comparison of vSAN DDH with DiskProphet disk failure predictions
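The difference between a threshold-based reaction and a lifetime prediction can be sketched in a few lines. The example below is a simplified illustration, not the actual DDH logic or the DiskProphet model: the function names, the latency threshold, the number of consecutive samples, and the delivery and evacuation times are all hypothetical values chosen for the example.

```python
from typing import Sequence


def is_degraded(latencies_ms: Sequence[float],
                threshold_ms: float,
                min_consecutive: int) -> bool:
    """Return True if latency stays above threshold_ms for at least
    min_consecutive samples in a row (a simplified, reactive check)."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    run = 0
    for latency in latencies_ms:
        # Extend the current run of slow samples, or reset it.
        run = run + 1 if latency > threshold_ms else 0
        if run >= min_consecutive:
            return True
    return False


def replacement_slack_days(predicted_days_remaining: float,
                           delivery_days: float,
                           evacuation_days: float) -> float:
    """Return the margin, in days, between the predicted failure and the
    point where a replacement has arrived and data has been evacuated.
    A negative value means the plan is already too late."""
    return predicted_days_remaining - (delivery_days + evacuation_days)


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    samples = [12.0, 15.0, 80.0, 95.0, 110.0, 20.0]
    print("degraded:", is_degraded(samples, threshold_ms=50.0, min_consecutive=3))

    # Hypothetical plan: 45 days predicted, 5 days delivery, 2 days evacuation.
    print("slack days:", replacement_slack_days(45, 5, 2))
```

With a reactive check, the first signal arrives only once the drive is already slow, so evacuation competes with production I/O. With a lifetime estimate, the slack value shows how much room remains to order a disk and schedule the evacuation in an off-peak window.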
Performance Prediction

Beyond historical performance statistics, predictions of CPU, memory, and I/O at different levels of the stack offer guidance for future resource and operations planning.

Figure 5: VMware resources optimization
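To show how a prediction can feed resource planning, the sketch below fits a straight line to recent utilization samples and estimates how many sampling intervals remain before a capacity threshold is crossed. A least-squares linear trend is used here only because it is simple to follow; it is not the prediction method used by DiskProphet, and the function names and sample values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x, where x is the
    sample index 0, 1, 2, ... Returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def samples_until_threshold(values: Sequence[float],
                            threshold: float) -> Optional[float]:
    """Estimate how many further sampling intervals pass before the fitted
    trend reaches threshold. Returns 0.0 if the trend is already at or
    above it, and None if the trend is flat or falling."""
    slope, intercept = fit_linear_trend(values)
    current = intercept + slope * (len(values) - 1)
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


if __name__ == "__main__":
    # Hypothetical daily CPU utilization in percent.
    cpu_percent = [50.0, 52.0, 54.0, 56.0, 58.0]
    print("days until 80% CPU:", samples_until_threshold(cpu_percent, 80.0))
```

The same calculation applies to memory or I/O series. In practice, workloads with daily or weekly cycles need a model that accounts for seasonality, which a single straight line does not.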
Impact Analysis

The impact of a potential risk can be visualized by correlating the affected components, as shown in the DiskProphet diagram below.

Figure 6: VMware resource map with impact analysis

Replacement Time (Future)

Aggregating the predictions on multiple factors can help develop a replacement plan.

Proactive Alerting (Future)

Notifications are delivered to administrators by SMS or email when a hardware error or resource conflict is predicted to occur.

Key Benefits

The key benefits provided by DiskProphet can be summarized as follows:
• Protects SLA performance against hardware failures: With hardware error predictions, IT operators can ensure data availability and plan maintenance during off-peak hours to minimize production impact.
• Optimizes VMware vSphere utilization: With both historical resource analytics and future predictions, IT operators can develop precise plans to allocate shares of the infrastructure with confidence.
• Improves employee productivity: With proactive indications of disk failures, DiskProphet turns individual random disk failure incidents into the planned execution of multiple tasks.