SER1534BU
vSphere Performance Troubleshooting and Root Cause Analysis
Brett Guarino, VMware, Inc.
Steve Baca, VMware, Inc.
VMworld 2017 Content: Not for publication #VMworld #SER1534BU
Disclaimer
This presentation may contain product features that are currently under development. This overview of new technology represents no commitment from VMware to deliver these features in any generally available product. Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind. Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined.
Agenda
1 ESXTOP Overview
2 CPU Key Performance Indicators
3 Memory Key Performance Indicators
4 Network Key Performance Indicators
5 Storage Key Performance Indicators
ESXTOP Overview
ESXTOP
Esxtop is the primary real-time performance monitoring tool for vSphere.
It can be run from an ESXi host local command line as esxtop.
It can be run remotely from vCLI as resxtop.
It is designed to work like the top performance utility in Linux.
The key performance indicators are viewed on individual resource screens by entering the appropriate keys. Commands are case sensitive.
  c    CPU screen (default)
  m    Memory screen
  d    Disk (adapter) screen
  u    Disk (device) screen
  v    Virtual disk view (lowercase v)
  n    Network screen
  f/F  Add or remove statistic columns
  V    Virtual machine view (uppercase V)
  h    Help
  q    Quit
CPU Key Performance Indicators
Host CPU Co-Scheduler World
CPU Key Performance Indicators
CPU key performance indicators for ESXi hosts:
  Ready Time
  Utilization
  Load Average
CPU key performance indicators for virtual machines:
  Ready Time (%RDY)
  Co-Stop (%CSTP)
  Swap Wait (%SWPWT)
  Max Limited (%MLMTD)
Example: Identifying CPU Constraint
High values in the %USED and %RDY columns of the resxtop command output indicate CPU overcommitment.
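A minimal sketch of how %USED and %RDY readings might be screened for CPU constraint. The sample values, the per-vCPU normalization, and the 10% ready-time cutoff are illustrative assumptions, not figures from this session; the class and function names are hypothetical.

```python
from dataclasses import dataclass

# Assumed rule of thumb (not a value from the slides): investigate a VM
# when its ready time per vCPU exceeds this percentage.
READY_PCT_PER_VCPU_LIMIT = 10.0


@dataclass
class VmCpuSample:
    name: str
    vcpus: int
    used_pct: float   # %USED for the VM, kept for context only
    ready_pct: float  # %RDY for the VM, assumed summed over its vCPUs


def ready_per_vcpu(sample: VmCpuSample) -> float:
    """Normalize the VM-level ready time by the number of vCPUs."""
    return sample.ready_pct / sample.vcpus


def cpu_constrained(samples, limit=READY_PCT_PER_VCPU_LIMIT):
    """Return the names of VMs whose per-vCPU ready time exceeds the limit."""
    return [s.name for s in samples if ready_per_vcpu(s) > limit]


# Made-up readings for two VMs.
samples = [
    VmCpuSample("vm-a", vcpus=2, used_pct=180.0, ready_pct=44.0),
    VmCpuSample("vm-b", vcpus=4, used_pct=90.0, ready_pct=8.0),
]
print(cpu_constrained(samples))
```

The cutoff is a parameter so it can be tuned to the latency tolerance of the workload.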
When to Right Size Virtual Machine vCPUs
What is Co-Stop? The state in which a VM with multiple vCPUs must stop processing on one or more of its vCPUs.
Why does Co-Stop occur? The fastest sibling vCPU stops itself when its slowest sibling vCPU on the VM violates a threshold. This is due to skew between sibling vCPUs. The vCPUs co-start when the slowest sibling begins to make progress. It progresses because scheduling opportunities become available once the fastest vCPUs are co-stopped.
How do I resolve CPU Co-Stop issues? Right size your VM's vCPUs. When in doubt, mimic the physical host CPU topology to take advantage of physical/virtual NUMA. Consider wide and flat vCPU allocations: for VMs with more than 8 vCPUs, allocate X virtual sockets and a single virtual core.
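The wide and flat recommendation can be sketched as a small helper. This reads "X virtual sockets and a single virtual core" as one core per socket, which is an interpretation of the slide; the type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VcpuTopology:
    sockets: int
    cores_per_socket: int

    @property
    def total_vcpus(self) -> int:
        return self.sockets * self.cores_per_socket


def wide_and_flat(vcpus: int) -> VcpuTopology:
    """Lay out vCPUs as one core per virtual socket (assumed reading)."""
    if vcpus < 1:
        raise ValueError("a VM needs at least one vCPU")
    return VcpuTopology(sockets=vcpus, cores_per_socket=1)


# A 12-vCPU VM becomes 12 sockets with 1 core each.
print(wide_and_flat(12))
```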
Memory Key Performance Indicators
Using esxtop to Monitor Memory Usage
esxtop offers several options for monitoring memory usage. Enter m to display the memory screen.
  PMEM: installed MB
  VMKMEM: managed MB
  MINFREE: calculated MB
  MB of free RAM
  Possible states: High, Clear, Soft, Hard, Low
ESXTOP Memory State Thresholds
What is minFree?
  Memory State   Threshold          Actions Performed
  High           300% of minFree    Break large pages (wait for next TPS run)
  Clear          100% of minFree    Break large pages and actively call TPS to collapse pages
  Soft           64% of minFree     TPS + Balloon
  Hard           32% of minFree     TPS + Compress + Swap
  Low            16% of minFree     Compress + Swap + Block
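As a worked example, the percentages on this slide can be turned into absolute values for a given minFree. The 1000 MB figure is made up for illustration, and the function name is hypothetical; the sketch only does the arithmetic and does not model how the host moves between states.

```python
# Thresholds as percentages of minFree, taken from the slide.
STATE_THRESHOLD_PCT = {
    "High": 300,
    "Clear": 100,
    "Soft": 64,
    "Hard": 32,
    "Low": 16,
}


def thresholds_mb(minfree_mb: float) -> dict:
    """Convert each state's percentage of minFree into megabytes."""
    return {
        state: minfree_mb * pct / 100
        for state, pct in STATE_THRESHOLD_PCT.items()
    }


# Example with an assumed minFree of 1000 MB.
for state, mb in thresholds_mb(1000).items():
    print(f"{state:<5} {mb:>7.0f} MB")
```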
Host Memory Shortage KPI: Ballooning Activity in esxtop
In the esxtop memory screen, add the J field for virtual machine ballooning activity.
  Memory balloon statistics for the host
  Configured memory per VM
  Active memory per VM
  Balloon driver installed?
  Physical memory held for other VMs
  Max balloon per VM
  Physical memory target to reclaim
Host Memory Shortage KPI: Compression Activity in esxtop
In the esxtop memory screen, add the Q field for virtual machine compression activity.
  Memory compression statistics for the host
  Calculated compression cache size
  Actively compressing memory per VM
  Memory compressed in MB per VM
  Accessing compressed memory per VM
Host Memory Shortage KPI: Swapping Activity in esxtop: Memory Screen
In the esxtop memory screen, add the K field for virtual machine swapping activity.
  Total memory swapped for all virtual machines on host
  Total memory swap rate for all virtual machines on host
  Swap reads per second
  Swap writes per second
  Swap space currently used
  Swap space target
Host Memory Shortage KPI: Swapping Activity in esxtop: CPU Screen
In esxtop, the CPU screen can indicate that memory swapping is occurring.
  Percentage of time the virtual machine has waited for swap activity (%SWPWT)
Networking Key Performance Indicators
Key Points to Monitor for Performance in the Network I/O Stack
Components shown: virtual NICs, virtual switch, physical NICs.
Virtual NICs:
  Measure which uplinks each virtual NIC is using
  Measure network bandwidth per virtual NIC
  Measure packet count and average packet size per virtual NIC
  Measure dropped packets per virtual NIC
Physical NICs:
  Measure network bandwidth per physical NIC
  Measure packet count and average packet size per physical NIC
  Measure dropped packets per physical NIC
Network Usage Using CPU View in esxtop
Linux01 is the client and Linux02 (expanded) is the server, both on the same ESXi host. Enter c to display the CPU screen.
  Tx thread usage: higher because the VMs are on the same ESXi host
  %SYS: vmx is processing interrupts and other system activities to receive packets by Netpoll threads
  CPU usage
Key Performance Indicators in esxtop
Enter n to display the Network screen.
  Packets transmitted per virtual NIC
  Network bandwidth transmitted per virtual NIC
  Average packet size transmitted per virtual NIC
Key Performance Indicators in esxtop
Enter n to display the Network screen.
  Packets received per virtual NIC
  Network bandwidth received per virtual NIC
  Average packet size received per virtual NIC
  Dropped packets per virtual NIC
Key Performance Indicators in esxtop
Enter n to display the Network screen.
  Packets transmitted per physical NIC
  Network bandwidth transmitted per physical NIC
  Average packet size transmitted per physical NIC
Key Performance Indicators in esxtop
Enter n to display the Network screen.
  Packets received per physical NIC
  Network bandwidth received per physical NIC
  Average packet size received per physical NIC
  Dropped packets per physical NIC
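The packet, bandwidth, and drop figures on the network screen relate to each other by simple arithmetic. The sketch below assumes bandwidth is reported in megabits per second (10^6 bits) and uses made-up readings; the function names are hypothetical and not part of esxtop.

```python
def average_packet_size_bytes(megabits_per_s: float, packets_per_s: float) -> float:
    """Average packet size implied by a bandwidth and a packet rate."""
    if packets_per_s == 0:
        return 0.0
    bytes_per_s = megabits_per_s * 1_000_000 / 8
    return bytes_per_s / packets_per_s


def drop_percentage(dropped: int, delivered: int) -> float:
    """Share of packets dropped out of all packets seen, in percent."""
    total = dropped + delivered
    if total == 0:
        return 0.0
    return 100 * dropped / total


# Made-up readings: 100 Mb/s at 10,000 packets/s, 5 drops per 1,000 packets.
print(average_packet_size_bytes(100, 10_000))
print(drop_percentage(5, 995))
```

A small average packet size at a high packet rate points to per-packet overhead rather than raw bandwidth as the likely constraint.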
Dropped Network Packets
Packet data might have to be buffered before being passed to the next step in the delivery process. Network packets are buffered in queues in the following cases:
  The destination is not ready to receive the packets.
  The network is too busy to send the packets.
The queues are finite in size:
  Virtual NIC devices buffer packets when they cannot be handled immediately.
  If the queue in the virtual NIC fills, packets are buffered by the virtual switch port.
  When these queues fill up, no more packets can be received, and additional arriving packets are dropped.
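The effect of a finite queue can be illustrated with a toy model. This is a sketch of the general idea only, not how ESXi implements its virtual NIC or switch port queues; the class name and the capacity of 3 are arbitrary assumptions.

```python
from collections import deque


class BoundedPacketQueue:
    """A fixed-capacity FIFO that drops arrivals once it is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.packets = deque()
        self.dropped = 0

    def receive(self, packet) -> bool:
        """Buffer a packet, or count it as dropped if there is no room."""
        if len(self.packets) >= self.capacity:
            self.dropped += 1
            return False
        self.packets.append(packet)
        return True

    def deliver(self):
        """Hand the oldest buffered packet to the next step, if any."""
        return self.packets.popleft() if self.packets else None


# Five packets arrive while the consumer is not draining the queue.
queue = BoundedPacketQueue(capacity=3)
accepted = [queue.receive(n) for n in range(5)]
print(accepted, "dropped:", queue.dropped)
```

Draining the queue with deliver() frees room for new arrivals, which is why drops tend to appear when the receiver cannot keep up.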
Storage Key Performance Indicators
Where Can Storage Problems Exist?
A disk network is only as fast as the slowest device in the network. Problems may arise at:
  Virtual machine VMDK
  ESXi host HBA/NIC
  Storage network switches
  SAN processors
  SAN network interfaces
Given the areas of interest, how do I quickly identify my point(s) of constraint? With ESXTOP and an understanding of storage KPIs (key performance indicators).
Common Storage KPIs (Key Performance Indicators)
What are IOPS? Disk input/output operations (reads and writes) per second.
What is a SCSI command? Any disk command; includes reads and writes, but also additional disk needs such as SCSI reservations.
What is a SCSI reservation? A temporary lock on a LUN for metadata protection. Mitigated by VAAI hardware acceleration.
What is latency? The time a SCSI command spends in transit from source to destination and back. Measured in milliseconds.
What is throughput? The amount of data transferred per second, measured in MBps.
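IOPS and throughput are linked through the I/O size, which the slide does not state explicitly. A minimal sketch of that relationship, assuming a uniform I/O size and 1 MB = 1024 KB; the function name and the sample workload are hypothetical.

```python
def throughput_mbps(iops: float, io_size_kb: float) -> float:
    """Throughput in MB per second for a given IOPS rate and I/O size."""
    return iops * io_size_kb / 1024


# Made-up workload: 5,000 IOPS at 8 KB per I/O.
print(throughput_mbps(5000, 8))
```

The same array can therefore show high IOPS with modest throughput (small I/Os) or the reverse (large I/Os), so the two KPIs are best read together.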
Identifying Storage Options from the Default CPU Screen
  Lowercase v: per virtual machine, per VMDK disk view
  Lowercase u: per LUN disk view
  Lowercase d: per HBA/RAID card disk view
Storage KPI Monitoring Thresholds
  KPI                Threshold
  DAVG               15-20 ms*
  KAVG               2-3 ms*
  IOPS               We want ALL THE IOPS!
  SCSI CMD ABORTS    0 at all times
*Varies with application tolerance for latency
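A minimal sketch of checking a sample against these thresholds. Using the upper end of each range as the cutoff is an assumption, since the slide notes that acceptable latency varies by application; the names and sample values are hypothetical.

```python
from dataclasses import dataclass

# Assumed cutoffs: the upper end of each range on the slide.
DAVG_LIMIT_MS = 20.0
KAVG_LIMIT_MS = 3.0


@dataclass
class StorageSample:
    davg_ms: float  # device latency per command
    kavg_ms: float  # kernel latency per command
    aborts: int     # SCSI command aborts observed


def storage_warnings(sample: StorageSample) -> list:
    """List every threshold from the slide that the sample violates."""
    warnings = []
    if sample.davg_ms > DAVG_LIMIT_MS:
        warnings.append("DAVG above threshold")
    if sample.kavg_ms > KAVG_LIMIT_MS:
        warnings.append("KAVG above threshold")
    if sample.aborts > 0:
        warnings.append("SCSI command aborts present")
    return warnings


print(storage_warnings(StorageSample(davg_ms=35.0, kavg_ms=0.4, aborts=1)))
```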
ESXTOP Storage Views
  Adapter view: Enter d.
  Device view: Enter u.
  Virtual machine view: Enter v.
ESXTOP KPI and Storage Construct Mapping
  Lowercase v is the VM view. KPIs: IOPS, latency, throughput
  Lowercase d is the device (HBA) view. KPIs: IOPS, latency, throughput
  Lowercase u is the LUN view. KPIs: IOPS, latency, throughput
ESXTOP Device (HBA) View
  Screen capture or live demo of the HBA view
  Show vCenter Server GUI information to identify storage constructs
  Physical diagram showing an HBA device and a software-initiated iSCSI device
Press h to see help for the Device View screen and reveal Sort options:
GUI-based storage adapters view:
ESXTOP LUN (Datastore Backing) View
  Show LUNs and their metrics in the ESXTOP LUN view
  Show conceptual diagram with a LUN mapped as a datastore
Press h to see help for the Device View screen and reveal Sort options:
GUI-based LUN view:
ESXTOP VM View
Press h to see help for the Device View screen and reveal Sort options:
Press e to see individual metrics per VMDK for each VM:
Latency Explored
  GAVG: sum of all latency; the latency the guest OS experiences
  KAVG: time spent in the VMkernel
  QAVG: time spent in queue
  DAVG: time for a SCSI command to exit the VMkernel, hit the physical storage device, and return
Correlation Between the ESXTOP Monitored Devices
What does it mean when metrics on device/LUN/VM are low/low, high/high, low/high, or high/low?
  High DAVG and low KAVG: overworked array.
  High KAVG and low DAVG: overworked host.
  Both high: the problem could be both, or the SAN is queueing so hard that the host backs up as well.
  Both low: ideal situation, if I/O demands are met.
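The four combinations above can be written as a small decision function. The cutoffs that separate high from low are assumptions (the upper ends of the ranges given on the thresholds slide), and the function name is hypothetical.

```python
def diagnose(davg_ms: float, kavg_ms: float,
             davg_limit_ms: float = 20.0, kavg_limit_ms: float = 3.0) -> str:
    """Map a DAVG/KAVG pair to the interpretation given on the slide."""
    davg_high = davg_ms > davg_limit_ms
    kavg_high = kavg_ms > kavg_limit_ms
    if davg_high and kavg_high:
        return "both, or SAN queueing backing up the host"
    if davg_high:
        return "overworked array"
    if kavg_high:
        return "overworked host"
    return "ideal, if I/O demands are met"


# Made-up readings in milliseconds.
print(diagnose(davg_ms=40.0, kavg_ms=0.5))
print(diagnose(davg_ms=4.0, kavg_ms=6.0))
```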
Additional Education Resources
At VMworld 2017 Europe:
  Education & Certification Lounge: VM Village
  Certification Exam Center: Jasmine EFG, Level 3
  Save 50% off VCP & VCAP exams at VMworld 2017
Online vSphere Training: www.vmware.com/go/vsphere65training
VMware Training: www.vmware.com/education
VMware Certification: www.vmware.com/certification
Global Support Services
Learn more about how VMware is radically transforming Customer Support through VMware Skyline technology.
  Visit the demo booth within the Solutions Exchange
  Sign up for a Meet the Experts session in the Content Catalogue
  Visit www.vmware.com/support/service/skyline