SER1815BU
DRS Advancements: What's New and What Is Being Cooked Up in Resource Management Land
VMworld 2017
Thomas Bryant, VMware, Inc. (@kix1979)
Maarten Wiggers, VMware, Inc.
Content: Not for publication
#VMworld #SER1815BU
Disclaimer
This presentation may contain product features that are currently under development. This overview of new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined.
Agenda
  What is DRS?
  How DRS works
  Proven Best Practices
  New 6.5 Features
  VMware Labs
  Industry Trends
  Q & A
What is DRS?
Initial placement & ongoing balancing: Minimize risk of contention and satisfy business policy
Maintenance mode: No downtime for infrastructure updates
Power Management & Consolidation: Efficient use of infrastructure or distribution
What is in the DRS family
  Distributed Resource Scheduler (DRS)
  Resource Pools
  Leverages shares
  NIOC
  Storage DRS (SDRS)
  SIOC
  Distributed Power Management (DPM)
What's new in vSphere 6.5
  Proactive HA
  Predictive Workload Balancing (pDRS)
  Policy-based SIOC configuration
  Network-Aware DRS
DRS by the numbers
81% Fully Automated
15% Partially Automated
4% Manual
89% Affinity/Anti-affinity rules
100% Maintenance Mode
48% Resource Pool
How DRS works
Distributed Resource Scheduler (DRS)
  Performance
    DRS keeps VMs happy
    Resource Pools
  Operational
    DRS affinity rules: Control the placement of VMs on hosts within a cluster
    Maintenance Mode
  Works in conjunction with
    HA
    Proactive HA
    Fault Tolerance
    vSphere Update Manager (VUM)
    Auto-Deploy & others
Metrics used for Initial Placement and Load Balancing
  Innumerable host-level and VM-level stats and metrics are considered during initial placement (IP) and load balancing (LB)
  A few important VM metrics:
    CPU active, run, and peak
    Memory overhead, growth rate
    Active, Consumed, and Idle memory
    Network saturation
Constraints are essential
  HA admission control policies (slot-based, reserved % for failover, etc.)
  Affinity and anti-affinity rules
  Number of concurrent vMotions
  Datastore connectivity
  vCPU-to-pCPU ratio
  Reservation, limit, and share settings
  Special VMs (e.g. SMP-FT, latency-sensitive VMs, etc.)
  Placement on hosts that have all required physical devices
De-Mystifying Resource Pools
  Resource Pool: Powerful abstraction to segregate resources in a cluster
    Set business requirements based on workload importance and characteristics
    Provides isolation between resource pools
    It is the fundamental building block for vCAN partners (Cloud Service Providers)
  Resource controls:
    1. Reservation (MHz or MB)
       Minimum MHz or MB guaranteed
       By default, R = 0 (means no dedicated resource)
    2. Limit (MHz or MB)
       Maximum MHz or MB allowed
       By default, L = 0 (means unlimited)
    3. Shares (no unit)
       Relative priority between siblings
       How to proportionally divvy up resources when there is contention
Resource Pool Example
  Root RP: Total Cluster Capacity = 100 GHz
    RP1 (Production): R=80, S=400, VMs VM-P1 ... VM-P10
    RP2 (Analytics): R=0, S=100, VMs VM-A1 ... VM-A20
  Total Shares = 400 + 100 = 500
  Contention for = 100 - 80 = 20 GHz
  RP1 quota = (400 / 500) x 20 = 16 GHz
  RP2 quota = (100 / 500) x 20 = 4 GHz
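The arithmetic in the resource pool example above can be sketched in Python. This is an illustrative calculation only, not DRS code: the function name `contended_quota_ghz` and the dictionary layout are assumptions made for this sketch, and it covers just the share-based split of the unreserved capacity shown on the slide.

```python
def contended_quota_ghz(capacity_ghz, pools):
    """Split unreserved capacity among sibling pools in proportion to shares.

    `pools` maps a pool name to (reservation_ghz, shares).
    Hypothetical helper for illustration; not part of any VMware API.
    """
    reserved = sum(reservation for reservation, _ in pools.values())
    contended = capacity_ghz - reserved  # 100 - 80 = 20 GHz in the example
    total_shares = sum(shares for _, shares in pools.values())
    return {
        name: contended * shares / total_shares
        for name, (_, shares) in pools.items()
    }


# Values taken from the slide: RP1 (R=80, S=400), RP2 (R=0, S=100).
pools = {"RP1": (80, 400), "RP2": (0, 100)}
quota = contended_quota_ghz(100, pools)
print(quota)  # {'RP1': 16.0, 'RP2': 4.0}
```

The reservation is subtracted first, so shares only decide how the remaining 20 GHz is divided under contention.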
Cost Benefit and MinGoodness
  Cost-Benefit Analysis: Cost of VM migration is evaluated against the potential benefits with respect to VM demands and host load
  Cost considerations:
    Per vMotion, a reservation of 30% of a CPU core for 1GbE and 100% of a CPU core for 10GbE
    Memory consumption of the shadow VM at the destination host
    Negative performance impact to VMs at the destination host
    Potential memory reclamation implication at the destination host
  Benefit considerations:
    Positive performance benefits to VMs at the source host
    Positive performance gains for the migrated VM at the destination host
    VMs on the source host and the moved VM have more headroom for utilization spikes
  MinGoodness: vMotions need to improve cluster balance beyond this threshold (configured through the DRS migration threshold)
Cost Benefit and MinGoodness
  VM happiness is the most important metric!
  If a VM's demand and entitlement for resources are always met, then the VM is happy!
  During initial placement, DRS ensures minimum performance impact on already running VMs
  During load balancing, DRS ensures VMs are happy with a minimum number of vMotions
Memory Metrics in ESXi
  (Diagram labels: Configured VM Size, Consumed Memory, Active Memory, Idle Memory, Shared Memory)
  Consumed: All touched memory pages minus page sharing
  Active: Estimated based on recently touched memory pages
Memory Metrics and DRS
  What DRS uses by default to balance memory: Active Memory + 25% of Idle Memory
  What is displayed in the cluster summary screen: Consumed Memory (sum of Consumed memory of all VMs on the host)
  (Diagram label: Configured VM Size)
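As a rough sketch of the default memory metric described above, the following Python computes active memory plus a percentage of idle memory. It assumes that idle memory is the consumed memory that is not active, which the slide implies but does not state; the function name `drs_memory_demand_mb` and the sample sizes are made up for illustration.

```python
def drs_memory_demand_mb(active_mb, consumed_mb, percent_idle=25):
    """Memory figure used for balancing: active + percent_idle% of idle.

    Assumption for this sketch: idle = consumed - active (never negative).
    The default of 25 matches the "25% Idle Memory" on the slide.
    """
    idle_mb = max(consumed_mb - active_mb, 0)
    return active_mb + idle_mb * percent_idle / 100


# Hypothetical VM: 2 GB active, 8 GB consumed.
default_demand = drs_memory_demand_mb(active_mb=2048, consumed_mb=8192)
consumed_based = drs_memory_demand_mb(active_mb=2048, consumed_mb=8192, percent_idle=100)
print(default_demand)   # 3584.0
print(consumed_based)   # 8192.0
```

Raising the idle percentage to 100 makes the figure equal to consumed memory, which is the effect the PercentIdleMBInMemDemand option later in the deck is described as enabling.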
DRS Settings: Automation Level
  Manual: vCenter Server will suggest migration recommendations for VMs
  Partially Automated: Automatic placement, migration recommendations
  Fully Automated (recommended): Automatic placement, and migration recommendations are applied automatically
Migration Threshold
  Priority 1: Only mandatory moves (maintenance mode or affinity/anti-affinity rules)
  Priority 2: Very conservative. Only recommends moves where a severe imbalance is detected.
  Priority 3: Conservative yet balanced approach. (Default)
  Priority 4: Semi-aggressive. (Recommended if balanced clusters are desired)
  Priority 5: Very aggressive. Will balance even if very little performance benefit results.
  (Figure: Hosts in DRS Cluster, shown for Priority 2 through Priority 5.)
How SDRS works
SIOC: IO control with a single datastore
  (Diagram labels: Storage IO Control; ESX IO Scheduler; Storage IO Control Capabilities; Control: IO Reservations)
Storage IO Control
  Control congestion in a shared datastore
  Detect congestion
    SIOC monitors average IO latency for a datastore
    Latency above a threshold indicates congestion
  SIOC throttles IOs once congestion is detected
    Control IOs issued per host
    Based on VMs' shares, reservations, and limits on each host
    Configurable via Storage Policies (SPBM)
  Throttling adjusted dynamically based on workload
    Idleness
    Bursty behavior
SDRS: IO control with multiple datastores
  (Diagram labels: Storage DRS; Storage IO Control)
Storage DRS
  Ease of Storage Management
    Initial Placement
    Out of Space Avoidance
    IO Load Balancing
    Virtual Disk Affinity (Anti-Affinity)
    Datastore Maintenance Mode
    Add Datastore
  (Diagram labels: Datastore Cluster; Storage vMotion)
Key Takeaways
  Initial placement and load balancing are greatly influenced by:
    Real-time stats from ESX hosts and VMs (e.g. CPU Demand, Memory Active, Memory Consumed, etc.)
    Constraints (e.g. HA policies, affinity rules, etc.)
    Cost-Benefit Analysis
  VM happiness is the #1 influencer for both initial placement and load balancing decisions
  A small imbalance in the DRS/SDRS cluster should not be a concern
  SDRS/SIOC helps to solve IO contention
  Start at the default, Priority 3. Adjust up if you require a more aggressive balance profile
    This can cause additional vMotions
Proven Best Practices
Best Practices - Tip #1: Use the Latency Sensitivity flag
  For latency-sensitive VMs, set the latency sensitivity flag
  The ESX CPU scheduler gives prioritized scheduling to this VM
  DRS ensures this VM is *not* disturbed during periodic load balancing
Best Practices - Tip #2: CPU Ready time?
  Check that BIOS power management is set to OS control mode
  Ensure the ESX power management Active Policy is set to Performance
Best Practices - Tip #3: Full Storage Connectivity
  All the hosts have access to all the datastores
  Results in efficient initial placement, load balancing, and workload consolidation
  VM availability is improved significantly
New 6.5 Features
Key Themes for 6.5 enhancements
  Enable higher churn environments like containers & DevOps
    Improved algorithm
    Scalability enhancements
  Business critical
    pDRS
    Proactive HA
    Network-Aware DRS
  Advanced Options
    UI enhancements
DRS Algorithm Enhancements
  Improved initial placement algorithm
    Even VM distribution
    Saves on vMotions during subsequent load balancing!
  More aggressive
    Detects and corrects outlier situations
    Recommends/balances until no two hosts differ by more than a defined value (maximum and minimum host entitlement)
  And more!
Resource Utilization Optimization (of vCenter)
  Throughput: > 2.5x increase
  VM power-on latency: > 3x improvement
  DRS cluster compatibility check: > 21x improvement
  70% resource reduction at scale
  Less than 2% CPU utilization
  > 850 MB reduction
  http://www.vmware.com/techpapers/2017/drs-cluster-mgmt-perf.html
Predictive DRS
  Tight integration with vRealize Operations Manager (vROps)
  Resource utilization trends are observed
  Predicted demand of workloads is incorporated in initial placement and load balancing
  Current VM demands are honored before future demands are satisfied
  vRealize Operations
    Computes and forecasts utilization based on metric history (CPU, Memory)
    Dynamic thresholds created and data passed to DRS
  vSphere DRS
    Ingests forecasted metrics
    Balances cluster based on forecasted utilization
Predictive DRS
  Some workloads have predictable resource utilization trends
  Having a high level of confidence allows DRS to proactively prepare for increased demand before it occurs
  Potentially faster balancing and better performance for VM resource demand
  (Figure: demand over time. Observed: "Observed spike: react!", then "Remediation complete". Predicted: "Predicted spike: prepare", then "Proactive remediation complete".)
Proactive High Availability
  Proactive evacuation of VMs from degraded hosts based on hardware health metrics
  Increases the availability of VMs even more than current technology provides
  Tight integration, qualification, and certification with hardware vendors
What would this look like?
  1. Servers running in Datacenter
  2. Hardware is monitored via OEM software or vSphere DRS distribution
  3. Health alerts/updates pushed to vCenter
  4. DRS and health state are invoked. Workloads are moved according to severity
Customized Proactive HA automation settings
Degradation events generated in vCenter
  (Screenshot labels: Provider, Health, Host, Failure Condition, Remediation)
Network-Aware DRS
  Network utilization has not been a first-class citizen alongside CPU and Memory
  Network-Aware DRS is based on host pNIC saturation
  Advanced option for Network Utilization %
    NetworkAwareDrsSaturationThresholdPercent
    Default is 80%
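To illustrate the saturation threshold described above, here is a minimal Python sketch that filters hosts by pNIC utilization. It is not how DRS is implemented: the function name, the host names, and the choice to treat a host exactly at the threshold as saturated are all assumptions; only the 80% default comes from the slide.

```python
# Default taken from the slide (NetworkAwareDrsSaturationThresholdPercent).
DEFAULT_SATURATION_THRESHOLD_PERCENT = 80


def unsaturated_hosts(pnic_utilization_percent,
                      threshold=DEFAULT_SATURATION_THRESHOLD_PERCENT):
    """Return hosts whose pNIC utilization is below the threshold.

    Hypothetical helper: a host at or above the threshold is treated
    as network-saturated (an assumption made for this sketch).
    """
    return sorted(
        host
        for host, utilization in pnic_utilization_percent.items()
        if utilization < threshold
    )


# Made-up utilization figures for three hosts.
hosts = {"esx-01": 35.0, "esx-02": 92.5, "esx-03": 80.0}
candidates = unsaturated_hosts(hosts)
print(candidates)  # ['esx-01']
```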
Advanced Options in the UI
  Do not need to know the property name
  Easier to consume
  Commonly used options
Advanced Options in the UI: Even distribution of virtual machines
  TryBalanceVmsPerHost
  Best-effort attempt, for purposes of availability
  Each host is given a maxVMs limit (average VMs per host)
  Only applied to the load balancing algorithm (initial placement can violate this)
  Will try to balance VMs (by count), but if there is an imbalance of resources, DRS will violate the VM balance
  Attempts to move small VMs to correct maxVMs limit violations
  May introduce more vMotions
Advanced Options in the UI: CPU Over-commitment
  Used heavily by VDI
  Applies to certain application requirements (Exchange and others may require a specific ratio)
  MaxVcpusPerCore: Set the maximum CPU overcommitment per host for the cluster
  MaxVcpusPerClusterPct: Set the maximum CPU overcommitment for the cluster
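The two overcommitment limits above can be pictured as simple ratio checks. The sketch below is an interpretation, not documented behavior: it assumes MaxVcpusPerCore is a vCPU-to-physical-core ratio evaluated per host and that MaxVcpusPerClusterPct is the cluster-wide vCPU count expressed as a percentage of physical cores. The function names and sample numbers are hypothetical.

```python
def host_within_vcpu_limit(vcpus_on_host, physical_cores, max_vcpus_per_core):
    """True if the host's vCPU:core ratio does not exceed the limit.

    Assumed reading of MaxVcpusPerCore (per-host ratio).
    """
    return vcpus_on_host <= max_vcpus_per_core * physical_cores


def cluster_within_vcpu_limit(total_vcpus, total_cores, max_vcpus_per_cluster_pct):
    """True if cluster-wide vCPUs stay within the percentage of cores.

    Assumed reading of MaxVcpusPerClusterPct (400 would mean 4 vCPUs per core).
    """
    return total_vcpus * 100 <= max_vcpus_per_cluster_pct * total_cores


# Made-up example: a 16-core host with a 4:1 limit can hold up to 64 vCPUs.
print(host_within_vcpu_limit(64, 16, 4))        # True
print(host_within_vcpu_limit(65, 16, 4))        # False
print(cluster_within_vcpu_limit(300, 100, 400)) # True
```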
Advanced Options in the UI: Consumed Memory vs Active Memory
  PercentIdleMBInMemDemand
  Allows DRS to balance on Consumed Memory
  Specifically for environments that are under-committed in memory
DRS Labs
DRS Lens
DRS Dump Insight
Industry Trends
What's Next: Industry trends we are considering
  Application requirements are changing
    Customers are moving beyond traditional apps to containers, DevOps, and business-critical apps
  Application requirements span all aspects of infrastructure
    More integrated management (e.g. HCI)
    Touches Compute, Storage & Network
  IT is increasingly important in the business
    Increasing visibility for compliance, auditing, & legal reasons
Q&A