John Paul Managed Services, R&D. Copyright 2012 Siemens Medical Solutions USA, Inc. All rights reserved.


CMG: VMware vSphere Performance Boot Camp John Paul, Managed Services, R&D February 29, 2012

Acknowledgments and Presentation Goal The material in this presentation was pulled from a variety of sources, some of which were graciously provided by VMware and Intel staff members. I acknowledge and thank the VMware and Intel staff for their permission to use their material in this presentation. This presentation is intended to review the basics of performance analysis for the virtual infrastructure, with detailed information on the tools and counters used. It presents a series of examples of how the performance counters report different types of resource consumption, with a focus on key counters to observe. The performance counters do change, and we are not going to go over all of the counters. ESXTOP and RESXTOP will be used interchangeably since both tools effectively provide the same counters. The screenshots in this presentation have the colors inverted for readability purposes. The presentation shows screenshots from vSphere 4 and 5 since both are actively in use. Page 2

Introductory Comments Page 3

Introductory Comments Trends to Consider - Hardware Intel Strategies Intel necessarily had to move to a horizontal, multi-core strategy due to the physical restrictions of what could be done on the current technology base. This resulted in: Increasing number of cores per socket Stabilization of processor speed (i.e., processor speeds are no longer increasing according to Moore's Law and in fact are slower for newer models) Focus on new architectures that allow for more efficient movement of data between memory and the processors, external sources and the processors, and larger and faster caches associated with different components OEM Strategies As Intel moved down this path, system OEMs have assembled the Intel (and AMD) components in different ways (such as multi-socket) to differentiate their offerings. It is important to understand the timing of the Intel architectural releases, the OEM implementation of those releases, and the operating system vendors' use of those features in their code. They aren't always aligned. Page 4

Introductory Comments Trends to Consider Hardware-Assisted Virtualization Intel VT-x and AMD-V - These provide two forms of CPU operation (root and non-root), allowing the virtualization hypervisor to be less intrusive during workloads. Hardware virtualization with a Virtual Machine Monitor (VMM) versus binary translation resulted in a substantial performance improvement. Intel EPT and AMD RVI - Memory management unit (MMU) virtualization supports extended page tables (EPT), which eliminates the need for ESX to maintain shadow page tables. Intel VT-d and AMD-Vi - I/O virtualization assist allows the virtual machines to have direct access to hardware I/O devices, such as network cards and storage controllers (HBAs). Page 5

Introductory Comments Trends to Consider Software VMware/Hypervisor Strategies Horizontal scalability at the hardware layer requires comparable scalability for the hypervisor NUMA support required hypervisor scheduler changes (Wide NUMA) Larger CPU and RAM virtual machines Efficiency while running larger, more complex workloads Abstraction model versus consolidation model for some workloads Performance guarantees for the Core Four Federation of management tools and resource pools Page 6

Introductory Comments Scheduler vCPU The vCPU is an aggregation of the time it allocates to the workload. It time-slices each core based upon the type of configuration. It is constantly changing which core the VM is on, unless affinity is used. SMP Lazy Scheduling The scheduler continues to evolve, using lazy scheduling to launch individual vCPUs and then having others catch up if CPU skewing occurs. Note that this has improved across releases. SMP Note that SMP effectiveness is NOT linear, depending upon workloads. It is very important to load test your workloads on your hardware to measure the efficiency of SMP. We have found that the higher the number of SMP vCPUs, the lower the efficiency. Page 7

Introductory Comments Resource Pools These are really a way to group the amount of resources allocated to a specific workload, or grouping of workloads, across VMs and hosts. Single Unit of Work It is easy to miss the fact that resource sharing really does not affect the single unit of work. The Distributed Resource Scheduler (DRS) moves VMs across ESX hosts where there may be more resources. Perfmon Counters The inclusion of ESX counters (via VMware Tools) into the Windows Perfmon counters is helpful for overall analysis and higher-level performance analysis. Many counters are not yet exposed in Perfmon. Page 8

Introductory Comments Hyper-threading The pre-Nehalem Intel architecture had some problems with hyper-threading efficiency, causing many people to turn off hyper-threading. The Nehalem architecture seems to have corrected those problems, and vSphere is now hyper-threading aware. You need to understand how hyper-threading works so you know how to interpret the VMware tools (such as ESXTOP, ESXPlot). The different cores will be shown as equals while they don't have equal capacity. Microsoft Hyper-V While we are not going to be diving into Hyper-V (or Xen), the basic principles of the hypervisors are the same, though the implementation is quite different. There are good reasons why VMware leads the market in enterprise virtualization implementations. Page 9

Introductory Comments Decision Points WHAT should we change? VMware continues to expose more performance-changing settings. One of the key questions that needs to be answered is whether you should take the default settings for operational simplicity or fine-tune for the best possible performance. NUMA Awareness and Control Should the NUMA control be turned over to the guest operating system? ESXTOP versus RESXTOP Though both work, the use of ESXTOP on the actual host requires SSH to be enabled, which may violate security guidelines. Page 10

vSphere Architecture VMware ESX Architecture [Diagram: guest, monitor (BT, HW, PV), and VMkernel with scheduler, memory allocator, virtual NIC/virtual switch/NIC drivers, and virtual SCSI/file system/I/O drivers on top of the physical hardware] CPU is controlled by the scheduler and virtualized by the monitor. The monitor supports: BT (Binary Translation), HW (Hardware assist), PV (Paravirtualization). Memory is allocated by the VMkernel and virtualized by the monitor. Network and I/O devices are emulated and proxied through native device drivers. Page 11

Performance Analysis Basics Key Reference Documents vSphere Resource Management (EN-000591-01) Performance Best Practices for VMware vSphere 5.0 (EN-000005-04) Page 12

Performance Analysis Basics Types of Resources The Core Four (Plus One) Though the Core Four resources exist at both the ESX host and virtual machine levels, they are not the same in how they are instantiated and reported against. CPU - processor cycles (vertical), multi-processing (horizontal) Memory - allocation and sharing Disk (a.k.a. storage) - throughput, size, latencies, queuing Network - throughput, latencies, queuing Though all resources are limited, ESX handles the resources differently. CPU is more strictly scheduled; memory is adjusted and reclaimed (more fluid) if based on shares; disk and network are fixed-bandwidth (except for queue depths) resources. The fifth Core Four resource is virtualization overhead! Page 13

Performance Analysis Basics vSphere Components in a Context You Are Used To World The smallest schedulable component for vSphere Similar to a process in Windows or a thread in other operating systems Groups A collection of ESX worlds, often associated with a virtual server or a common set of functions, such as Idle, System, Helper, Drivers, VMotion, Console Page 14

Performance Analysis Basics The Five Contexts of Virtualization and Which Tools to Use for Each [Diagram: five contexts shown side by side - Physical Machine, Virtual Machine, ESX Host Machine, ESX Host Farm/Cluster, and ESX Host Complex - each as a stack of application, operating system, vCPU/vMemory/vNIC/vDisk or pCPU/pMemory/pNIC/pDisk, and Intel hardware. The tools shown are PerfMon within the physical or virtual machine, ESXTOP at the ESX host, and Virtual Center at the farm/cluster and complex levels.] Remember the virtual context Page 15

Performance Analysis Basics Types of Performance Counters (v4) Static - Counters that don't change during runtime, for example MEMSZ (memsize), adapter queue depth, VM name. The static counters are informational and may not be essential during performance problem analysis. Dynamic - Counters that are computed dynamically, for example CPU load average, memory over-commitment load average. Calculated - Some are calculated from the delta between two successive snapshots. The refresh interval (-d) determines the time between successive snapshots. For example, %CPU used = (CPU used time at snapshot 2 - CPU used time at snapshot 1) / time elapsed between snapshots Page 16
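To make the delta calculation concrete, here is a minimal sketch (not from the original slides; the field names and numbers are illustrative) of how a calculated counter such as %CPU used is derived from two successive samples:

# Illustrative only: derive a calculated counter (%CPU used) from two
# successive snapshots, the way delta-based esxtop statistics are built.
def pct_cpu_used(used_ms_t1, used_ms_t2, interval_s):
    # CPU used time accumulates in milliseconds; the refresh interval (-d)
    # is the wall-clock time elapsed between the two snapshots.
    delta_used_s = (used_ms_t2 - used_ms_t1) / 1000.0
    return 100.0 * delta_used_s / interval_s

# Example: 4,200 ms of CPU time consumed over a 5-second sampling interval.
print(pct_cpu_used(used_ms_t1=120_000, used_ms_t2=124_200, interval_s=5))  # 84.0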

Performance Analysis Basics A Review of the Basic Performance Analysis Approach Identify the virtual context of the reported performance problem Where is the problem being seen? ("When I do this here, I get that") How is the problem being quantified? ("My function is 25% slower") Apply a reasonability check ("Has something changed from the status quo?") Monitor the performance from within that virtual context View the performance counters in the same context as the problem Look at the ESX cluster-level performance counters Look for atypical behavior ("Is the amount of resources consumed characteristic of this particular application or task for the server processing tier?") Look for repeat offenders! This happens often. Expand the performance monitoring to each virtual context as needed Are other workloads influencing the virtual context of this particular application and causing a shortage of a particular resource? Consider how a shortage is instantiated for each of the Core Four resources Page 17

Performance Analysis Basics Resource Control Revisited CPU Example Reservation (Guarantees) Minimum service-level guarantee (in MHz) When the system is overcommitted it is still the target Needs to pass admission control for start-up Shares (Share the Resources) CPU entitlement is directly proportional to the VM's shares and depends on the total number of shares issued Abstract number, only the ratio matters Limit Absolute upper bound on CPU entitlement (in MHz) Applies even when the system is not overcommitted [Diagram: scale from 0 MHz to total MHz, with shares operating between the reservation floor and the limit ceiling] Page 18
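As a rough illustration of how these three controls interact, here is a simplified sketch of proportional-share entitlement with a reservation floor and a limit ceiling. This is not VMware's actual scheduler algorithm, and the MHz and share values are made up:

# Simplified sketch of the resource-control concepts above; NOT the actual
# ESX scheduler algorithm, just the proportional-share idea with clamping.
def entitlement_mhz(host_mhz, vms):
    # vms: list of dicts with 'shares', 'reservation', and 'limit' (MHz).
    total_shares = sum(vm["shares"] for vm in vms)
    entitlements = []
    for vm in vms:
        # Shares are abstract; only the ratio to the total matters.
        share_based = host_mhz * vm["shares"] / total_shares
        # Reservation is a floor, limit is a ceiling on the entitlement.
        entitlements.append(min(max(share_based, vm["reservation"]), vm["limit"]))
    return entitlements

vms = [
    {"shares": 2000, "reservation": 1000, "limit": 8000},
    {"shares": 1000, "reservation": 500,  "limit": 2000},
]
print(entitlement_mhz(9000, vms))  # [6000.0, 2000.0]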

Tools Key Reference Documents vSphere Monitoring and Performance (EN-000620-01), Chapter 7, Performance Monitoring Utilities: resxtop and esxtop Esxtop for Advanced Users (VSP1999, VMworld 2011) Page 19

Tools Load Generators, Data Gatherers, Data Analyzers Load Generators IOMeter - www.iometer.org Consume (Windows SDK) SQLIOSim - http://support.microsoft.com/?id=231619 Data Gatherers ESXTOP Virtual Center vscsiStats Data Analyzers ESXTOP (interactive or batch mode) Windows Perfmon/System Monitor ESXPLOT Page 20

Tools A Comparison of ESXTOP and the vSphere Client vC gives a graphical view of both real-time and trend consumption vC combines real-time reporting with short-term (1 hour) trending vC can report on the virtual machine, ESX host, or ESX cluster vC has performance overview charts in vSphere 4 and 5 vC is limited to 2 unit types at a time for certain views ESXTOP allows more concurrent performance counters to be shown ESXTOP has a higher system overhead to run ESXTOP can sample down to a 2-second sampling period ESXTOP gives a detailed view of each of the Core Four Recommendation Use vC to get a general view of the system performance but use ESXTOP for detailed problem analysis. Page 21

Tools An Introduction to ESXTOP/RESXTOP Launched through the vSphere Management Assistant (vMA) or CLI, or via an SSH session (ESXTOP) with the ESX host Screens (version 5) c: cpu (default) d: disk adapter h: help i: interrupts m: memory n: network p: power management u: disk device v: disk VM Can be piped to a file and then imported into System Monitor/ESXPLOT Horizontal and vertical screen resolution limits the number of fields and entities that can be viewed, so choose your fields wisely Some of the rollups and counters may be confusing to the casual user Page 22

Tools ESXTOP: New Counters in vSphere 5.0 World, VM count, vCPU count (CPU screen) %VMWait (%Wait - %Idle, CPU screen) CPU clock frequency in different P-states (Power Management screen) Failed Disk IOs (Disk adapter screen) FCMDs/s failed commands per second FReads/s failed reads per second FMBRD/s failed megabyte reads per second FMBWR/s failed megabyte writes per second FRESV/s failed reservations per second VAAI: Block Deletion Operations (Disk adapter screen) Same counters as Failed Disk IOs above Low-Latency Swap (Host Cache Disk screen) LLSWR/s swap-in rate from host cache LLSWW/s swap-out rate to host cache Page 23

Tools ESXTOP: Help Screen (v5) Page 24

Tools ESXTOP: CPU screen (v5) [Screenshot callouts: Time, Uptime; Worlds = worlds, VMs, vCPU totals (new counter); ID = ID; GID = world group identifier; NWLD = number of worlds; some fields hidden from the view] Page 25

Tools ESXTOP: CPU screen (v4) expanding groups - press the 'e' key In the rolled-up view some stats are cumulative for all the worlds in the group Expanded view gives a breakdown per world A VM group consists of mks (mouse, keyboard, screen), vCPU, and vmx worlds. SMP VMs have additional vCPU and vmm worlds vmm0, vmm1 = virtual machine monitors for vCPU0 and vCPU1 respectively Page 26

Tools ESXTOP CPU Screen (v5): Many New Worlds New Processes Using Little/No CPU resource Page 27

Tools ESXTOP CPU Screen (v5): Virtual Machines Only (using V command) Value >= 1 means overload Page 28

Tools ESXTOP CPU Screen (v5): Virtual Machines Only, Expanded Page 29

Tools ESXTOP: CPU screen (v4) PCPU = Physical CPU/core CCPU = Console CPU (CPU 0) Press f key to choose fields Page 30

Tools ESXTOP: CPU screen (v5) Core Usage Now Shown PCPU = Physical CPU CORE = Core CPU Changed Field New Field Page 31

Tools Idle State on Test Bed (CPU View v4) ESXTOP Virtual Machine View Page 32

Tools Idle State on Test Bed GID 32 Expanded (v4) [Screenshot callouts: expanded GID versus rolled-up GID; cumulative totals for five worlds; total Idle % and Wait %; Wait includes idle] Page 33

Tools ESXTOP memory screen (v4) Possible states: high, soft, hard, and low [Screenshot: physical memory (PMEM) layout showing COS, VMKMEM, and the PCI hole] VMKMEM - memory managed by the VMkernel COSMEM - memory used by the Service Console Page 34

Tools ESXTOP: memory screen (v5) [Screenshot callouts: NUMA stats; changed field; new fields] Page 35

Tools ESXTOP: memory screen (4.0) Swapping activity in the Service Console VMkernel swapping activity SZTGT: determined by reservation, limit, and memory shares SWCUR = 0: no swapping in the past SWTGT = 0: no swapping pressure SWR/s, SWW/s = 0: no swapping activity currently SZTGT = size target SWTGT = swap target SWCUR = currently swapped MEMCTL = balloon driver SWR/s = swap reads/sec SWW/s = swap writes/sec Page 36

Tools ESXTOP: disk adapter screen (v4) Host bus adapters (HBAs) - includes SCSI, iSCSI, RAID, and FC-HBA adapters Latency stats from the device, kernel, and the guest DAVG/cmd - average latency (ms) from the device (LUN) KAVG/cmd - average latency (ms) in the VMkernel GAVG/cmd - average latency (ms) in the guest (DAVG + KAVG) Page 37

Tools ESXTOP: disk device screen (v4) LUNs in C:T:L format (Controller: Target: LUN) Page 38

Tools ESXTOP: disk VM screen (v4) Running VMs Page 39

Tools ESXTOP: network screen (v4) Service console NIC, virtual NICs, physical NIC PKTTX/s - packets transmitted/sec PKTRX/s - packets received/sec MbTX/s - transmit throughput in Mbits/sec MbRX/s - receive throughput in Mbits/sec Port ID: every entity is attached to a port on the virtual switch DNAME - the virtual switch the port belongs to Page 40

Tools A Brief Introduction to the vSphere Client Screens: CPU, Disk, Management Agent, Memory, Network, System vCenter collects performance metrics from the hosts that it manages and aggregates the data using a consolidation algorithm. The algorithm is optimized to keep the database size constant over time. vCenter does not display many counters for trend/history screens ESXTOP defaults to a 5-second sampling rate while vCenter defaults to a 20-second rate. Default statistics collection periods, samples, and how long they are stored:
Interval | Interval Period | Number of Samples | Interval Length
Per hour (real-time) | 20 seconds | 180 | 1 hour
Per day | 5 minutes | 288 | 1 day
Per week | 30 minutes | 336 | 1 week
Per month | 2 hours | 360 | 1 month
Per year | 1 day | 365 | 1 year
Page 41
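A quick arithmetic check of the rollup table (illustrative only): the sample period times the number of samples reproduces each retention length.

# Tiny check of the vCenter rollup table above: period * samples = retained length.
INTERVALS = {                      # sample period in minutes, number of samples
    "day":   (5, 288),
    "week":  (30, 336),
    "month": (120, 360),
    "year":  (1440, 365),
}
for name, (period_min, samples) in INTERVALS.items():
    print(name, period_min * samples / (60 * 24), "days retained")
# day 1.0, week 7.0, month 30.0, year 365.0 days retained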

Tools vSphere Client CPU Screen (v4) To Change Settings To Change Screens Page 42

Tools vSphere Client Disk Screen (v4) Page 43

Tools vSphere Client - Performance Overview Chart (v4) Performance overview charts help to quickly identify bottlenecks and isolate root causes of issues. Page 44

Tools Analyzing Performance from Inside a VM VM Performance Counters Integration into Perfmon Access key host statistics from inside the guest OS View accurate CPU utilization alongside observed CPU utilization Third parties can instrument their agents to access these counters using WMI Integrated with VMware Tools Page 45

Tools Summarized Performance Charts (v4) Quickly identify bottlenecks and isolate root causes Side-by-side performance charts in a single view Correlation and drill-down capabilities Richer set of performance metrics Key Metrics Displayed Aggregated Usage Page 46

Tools A Brief Introduction to ESXPlot Launched on a Windows workstation Imports data from a .csv file Allows an in-depth analysis of an ESXTOP batch file session Capture data using ESXTOP batch mode from root using an SSH utility: esxtop -a -b > exampleout.csv (for verbose capture of all counters) Transfer the file to the Windows workstation using WinSCP Page 47

Tools ESXPlot Page 48

Tools ESXPlot Field Expansion: CPU Page 49

Tools ESXPlot Field Expansion: Physical Disk Page 50

Tools Top Performance Counters to Use for Initial Problem Determination
Physical/Virtual Machine:
CPU (queuing) - Average physical CPU utilization, Peak physical CPU utilization, CPU Time, Processor Queue Length
Memory (swapping) - Average Memory Usage, Peak Memory Usage, Page Faults, Page Fault Delta*
Disk (latency) - Split IO/Sec, Disk Read Queue Length, Disk Write Queue Length, Average Disk Sector Transfer Time
Network (queuing/errors) - Total Packets/second, Bytes Received/second, Bytes Sent/second, Output Queue Length
ESX Host:
CPU (queuing) - PCPU%, %SYS, %RDY, Average physical CPU utilization, Peak physical CPU utilization, Physical CPU load average
Memory (swapping) - State (memory state), SWTGT (swap target), SWCUR (swap current), SWR/s (swap read/sec), SWW/s (swap write/sec), Consumed, Active (working set), Swapused (instantaneous swap), Swapin (cumulative swap in), Swapout (cumulative swap out), VMmemctl (balloon memory)
Disk (latency, queuing) - DiskReadLatency, DiskWriteLatency, CMDS/s (commands/sec), Bytes transferred/received/sec, Disk bus resets, ABRTS/s (aborts/sec), SPLTCMD/s (I/O split cmds/sec)
Network (queuing/errors) - %DRPTX (packets dropped - TX), %DRPRX (packets dropped - RX), MbTX/s (Mb transferred/sec - TX), MbRX/s (Mb transferred/sec - RX)
Page 51

CPU Page 52

Performance Counters in Action CPU Understanding PCPU versus VCPU It is important to separate the physical CPU (PCPU) resources of the ESX host from the virtual CPU (VCPU) resources that are presented by ESX to the virtual machine. PCPU - The ESX host's processor resources are exposed only to ESX. The virtual machines are not aware of and cannot report on those physical resources. VCPU - ESX effectively assembles a virtual CPU(s) for each virtual machine from the physical machine's processors/cores, based upon the type of resource allocation (e.g., shares, guarantees, minimums). Scheduling - The virtual machine is scheduled to run inside the VCPU(s), with the virtual machine's reporting mechanism (such as W2K's System Monitor) reporting on the virtual machine's allocated VCPU(s) and remaining Core Four resources. Page 53

Performance Counters in Action CPU Key Question and Considerations Is there a lack of CPU resources for the VCPU(s) of the virtual machine or for the PCPU(s) of the ESX host? Allocation - The CPU allocation for a specific workload can be constrained due to the resource settings or the number of CPUs, amount of shares, or limits. The key field at the virtual machine level is CPU queuing, and at the ESX level it is Ready to Run (%RDY in ESXTOP). Capacity - The virtual machine's CPU can be constrained due to a lack of sufficient capacity at the ESX host level, as evidenced by the PCPU/LCPU utilization. Contention - The specific workload may be constrained by the consumption of workloads operating outside of their typical patterns. SMP CPU Skewing - The movement towards lazy scheduling of SMP CPUs can cause delays if one CPU gets too far ahead of the other. Look for higher %CSTP (co-schedule pending) Page 54

Performance Counters in Action CPU State Times and Accounting Accounting: USED = RUN + SYS - OVRLP Page 55

Performance Counters in Action High CPU within one virtual machine caused by affinity (ESXTOP v4) Physical CPU Fully Used One Virtual CPU is Fully Used Page 56

Performance Counters in Action High CPU within one virtual machine (affinity) (vcenter v4) View of the ESX Host View of the VM Page 57

Performance Counters in Action SMP Implementation WITHOUT CPU Constraints ESXTOP V4 4 Physical CPUs Fully Used One - 2 CPU SMP VCPU 4 Virtual CPUs Fully Used Ready to Run Acceptable Page 58

Performance Counters in Action SMP Implementation WITHOUT CPU Constraints vc V4 Page 59

Performance Counters in Action SMP Implementation with Mild CPU Constraints V4 4 Physical CPUs Fully Used One - 2 CPU SMP VCPUs (7 NWLD) 4 Virtual CPUs Heavily Used Ready to Run Indicates Problems Page 60

Performance Counters in Action SMP Implementation with Severe CPU Constraints V4 4 Physical CPUs Fully Used Two - 2 CPU SMP VCPUs (7 NWLD) 4 Virtual CPUs Fully Used Ready to Run Indicates Severe Problems Page 61

Performance Counters in Action SMP Implementation with Severe CPU Constraints V4 Page 62

Performance Counters in Action CPU Usage Without Core Sharing ESX scheduler tries to avoid sharing the same core Page 63

Performance Counters in Action CPU Usage With Core Sharing Page 64

Introduction to the Intel QuickPath Interconnect The Intel QuickPath Interconnect is: A cache-coherent, high-speed, packet-based, point-to-point interconnect used in Intel's next-generation microprocessors (starting in 2H'08) A narrow physical link containing 20 lanes; two unidirectional links complete a QuickPath Interconnect port [Diagram: four-socket platform] It provides high-bandwidth, low-latency connections between processors and between processors and the chipset Maximum data rate of 6.4 GT/s; 2 bytes/transfer, 2 directions, yields 25.6 GB/s per port Interconnect performance for Intel's next-generation microarchitectures Page 65

Intel Topologies [Diagram: QuickPath link counts per platform - Lynnfield: no links; Intel Core i7 processor: 1 full-width link; Nehalem-EP: 2 full-width links; Nehalem-EX: 4 full-width links; Intel Itanium processor (Tukwila): 4 full-width links and 2 half-width links. Example topologies: a 2-socket Nehalem-EP system and a 4-socket Nehalem-EX system with IOHs.] Different number of links for different platforms Page 66

Intel: QPI Performance Considerations There is not always a direct correlation between processor performance and interconnect latency/bandwidth What is important is that the interconnect should perform sufficiently to not limit processor performance Max theoretical bandwidth A maximum of 16 bits (2 bytes) of real data is sent across a full-width link during one clock edge Double-pumped bus with a max initial frequency of 3.2 GHz 2 bytes/transfer * 2 transfers/cycle * 3.2 GHz = 12.8 GB/s With Intel QuickPath Interconnect at 6.4 GT/s this translates to 25.6 GB/s across two simultaneous unidirectional links Max bandwidth with packet overhead A typical data transaction is a 64-byte cache line A typical packet has a header flit which requires 4 phits to transmit across the link The data payload takes 32 phits to transfer (64 bytes at 2 bytes/phit) With CRC sent inline with the data, a data packet requires 4 phits for the header + 32 phits of payload With Intel QuickPath Interconnect at 6.4 GT/s, a 64B cache line transfers in 5.6 ns, which translates to 22.8 GB/s across two simultaneous unidirectional links Page 67
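The slide's two bandwidth figures can be reproduced with a little arithmetic; the sketch below (illustrative only, using the slide's 2 bytes per transfer and 6.4 GT/s) checks both the raw rate and the rate with packet overhead.

# Illustrative arithmetic only: reproduce the QPI bandwidth figures above.
GT_PER_S = 6.4e9        # 6.4 GT/s link transfer rate
BYTES_PER_PHIT = 2      # the slide counts 2 bytes per transfer (phit)

# Raw bandwidth: 2 bytes/transfer * 6.4 GT/s, times two unidirectional links.
raw_per_link = BYTES_PER_PHIT * GT_PER_S              # 12.8 GB/s per direction
print(raw_per_link * 2 / 1e9)                         # 25.6 GB/s per port

# With packet overhead: a 64-byte cache line needs a 4-phit header plus
# 32 phits of payload, i.e., 36 phits on the wire.
phits = 4 + 64 // BYTES_PER_PHIT                      # 36 phits
transfer_time_ns = phits / GT_PER_S * 1e9             # ~5.6 ns per cache line
effective = 64 / (transfer_time_ns * 1e-9)            # bytes/s in one direction
print(round(transfer_time_ns, 2), round(effective * 2 / 1e9, 1))  # 5.62 22.8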

Memory Page 68

Memory Separating machine and guest memory It is important to note that some statistics refer to guest physical memory while others refer to machine memory. "Guest physical memory" is the virtual-hardware physical memory presented to the VM. "Machine memory" is the actual physical RAM in the ESX host. In the original figure, two VMs are running on an ESX host, where each block represents 4 KB of memory and each color represents a different set of data on a block. Inside each VM, the guest OS maps the virtual memory to its physical memory. The ESX kernel maps the guest physical memory to machine memory. Due to ESX page sharing technology, guest physical pages with the same content can be mapped to the same machine page. Page 69

Memory A Brief Look at Ballooning The W2K balloon driver is located in VMware Tools ESX sets a balloon target for each workload at start-up and as workloads are introduced/removed The balloon driver expands memory consumption, requiring the virtual machine operating system to reclaim memory based on its own algorithms Ballooning routinely takes 10-20 minutes to reach the target The returned memory is then available for ESX to use Key ballooning fields: SZTGT: determined by reservation, limit, and memory shares SWCUR = 0: no swapping in the past SWTGT = 0: no swapping pressure SWR/s, SWW/s = 0: no swapping activity currently Page 70

Memory Interleaving Basics What is it? A process where memory is stored in a non-contiguous form to optimize access performance and efficiency. Interleaving is usually done at cache-line granularity. [Diagram: memory channels Ch 0/Ch 1/Ch 2 per socket, Tylersburg-DP IOH, CSI, system memory map] Why do it? Increase bandwidth by allowing multiple memory accesses at once Reduce hot spots since memory is spread out over a wider area Support NUMA (Non-Uniform Memory Access) based OS/applications - a memory organization where there are different access times for different sections of memory, due to memory located in different locations; concentrate the data for the application on the memory of the same socket Page 71

Non-NUMA (UMA) Uniform Memory Access (UMA) - addresses are interleaved across memory nodes by cache line. Accesses may or may not have to cross the QPI link. [Diagram: socket 0 memory and socket 1 memory, DDR3 channels, Tylersburg-DP, system memory map] Uniform Memory Access lacks tuning for optimal performance Page 72

NUMA Non-Uniform Memory Access (NUMA) - addresses are not interleaved across memory nodes by cache line. Each CPU has direct access to a contiguous block of memory. [Diagram: socket 0 memory and socket 1 memory, DDR3 channels, Tylersburg-EP, system memory map] Thread affinity benefits from memory attached locally Page 73

Memory ESX Memory Sharing - The Water Bed Effect ESX handles memory shares on an ESX host and across an ESX cluster with a result similar to a single water bed, or a room full of water beds, depending upon the action and the memory allocation type: Initial ESX boot (i.e., "lying down on the water bed") - ESX sets a target working size for each virtual machine, based upon the memory allocations or shares, and uses ballooning to pare back the initial allocations until those targets are reached (if possible). Steady State (i.e., "minor position changes") - The host gets into a steady state with small adjustments made to memory allocation targets. Memory ripples occur during steady state, with the amplitude dependent upon the workload characteristics and consumption by the virtual machines. New Event (i.e., "second person on the bed") - The host receives additional workload via a newly started virtual machine, or VMotion moves a virtual machine to the host through a manual step, maintenance mode, or DRS. ESX pares back the target working size of that virtual machine while the other virtual machines lose CPU cycles that are directed to the new workload. Large Event (i.e., "jumping across water beds") - The cluster has a major event that causes a substantial movement of workloads to or between multiple hosts. Each of the hosts has to reach a steady state, or to have DRS determine that the workload is not a current candidate for the existing host, moving it to another host that has reached a steady state with available capacity. Maintenance mode is another major event. Page 74

Memory Key Question and Considerations Is the memory allocation for each workload optimum to prevent swapping at the virtual machine level, yet low enough not to constrain other workloads or the ESX host? HA/DRS/Maintenance Mode Regularity - How often do the workloads in the cluster get moved between hosts? Each movement causes an impact on the receiving (negative) and sending (positive) hosts, with maintenance mode causing a rolling wave of impact across the cluster, depending upon the timing. Allocation Type - Each of the allocation types has its drawbacks, so tread carefully when choosing the allocation type. One size seldom is right for all needs. Capacity/Swapping - The virtual machine's memory can be constrained due to a lack of sufficient capacity at the ESX host level. Look for regular swapping at the ESX host level as an indicator of a memory capacity issue, but be sure to notice memory leaks that artificially force a memory shortage situation. Page 75

Memory Idle State on Test Bed Memory View V4 Page 76

Memory Memory View at Steady State of 3 Virtual Machines Memory Shares V4 Most memory is not reserved Virtual Machine Just Powered On These VMs are at memory steady state No VM Swapping or Targets Page 77

Memory Ballooning and Swapping in Progress Memory View V4 Possible states: high, soft, hard, and low Different size targets due to different amounts of uptime Ballooning in effect Mild swapping Page 78

Memory Memory Reservations' Effect on New Loads V4 Three VMs, each with 2 GB reserved memory What size virtual machine with reserved memory can be started? 6 GB of free physical memory due to memory sharing over 20 minutes 666 MB of unreserved memory Can't start a fourth virtual machine with >512 MB of reserved memory A fourth virtual machine with 512 MB of reserved memory started Page 79
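A simplified sketch of the admission-control idea this example demonstrates: a VM's reservation (plus some per-VM overhead; the overhead figure below is illustrative, not an ESX value) must fit within the host's unreserved memory, regardless of how much memory appears free.

# Simplified sketch of memory admission control for powering on a VM.
def can_power_on(unreserved_mb, vm_reservation_mb, overhead_mb=64):
    # overhead_mb is an illustrative per-VM overhead, not a real ESX figure.
    return vm_reservation_mb + overhead_mb <= unreserved_mb

# Host from the slide: 666 MB unreserved even though ~6 GB appears free.
print(can_power_on(666, 1024))   # False - a 1 GB reservation is refused
print(can_power_on(666, 512))    # True  - the 512 MB reservation VM starts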

Memory Memory Shares' Effect on New Loads V4 Three VMs with a 2 GB allocation A fourth virtual machine with a 2 GB memory allocation started successfully 5.9 GB of free physical memory 6 GB of unreserved memory Page 80

Memory Virtual Machine with Memory Greater Than a Single NUMA Node V5 Remote NUMA access, local NUMA access, % local NUMA access Page 81

Memory Wide-NUMA Support in V5 1 vcpu 1 NUMA Node Page 82

Memory Wide-NUMA Support in V5 8 vcpu 2 NUMA Nodes Page 83

Power Management Page 84

Power Management Power Management Screen V5 Page 85

Power Management Impact of Power States on CPU Page 86

Power Management Power Management Impact on CPU V5 Page 87

Storage Page 88

Storage Considerations Storage Key Question and Considerations Is the bandwidth and configuration of the storage subsystem sufficient to meet the desired latency (a.k.a. response time) for the target workloads? If the latency target is not being met, then further analysis may be very time-consuming. Storage frame specifications refer to the aggregate bandwidth of the frame or components, not the single-path capacity of those components. Queuing - Queuing can happen at any point along the storage path, but it is not necessarily a bad thing if the latency meets requirements. Storage Path Configuration and Capacity - It is critical to know the configuration of the storage path and the capacity of each component along that path. The number of active vmkernel commands must be less than or equal to the maximum queue depth of any of the storage path components while processing the target storage workload. Page 89

Storage Considerations Storage Aggregate versus Single Paths Storage frame specifications refer to the aggregate bandwidth of the frame or components, not the single-path capacity of those components* DMX Message Bandwidth: 4-6.4 GB/s DMX Data Bandwidth: 32-128 GB/s Global Memory: 32-512 GB Concurrent Memory Transfers: 16-32 (4 per Global Memory Director) Performance measurement for storage is all about individual paths and the performance of the components contained in that path (*Source: EMC Symmetrix DMX-4 Specification Sheet, c1166-dmx4-ss.pdf) Page 90

Storage Considerations Storage More Questions Virtual Machines per LUN - The number of outstanding active vmkernel commands per virtual machine times the number of virtual machines on a specific LUN must be less than the queue depth of that adapter How fast can the individual disk drive process a request? Based upon the block size and type of I/O (sequential read, sequential write, random read, random write), what type of configuration (RAID, number of physical spindles, cache) is required to match the I/O characteristics and workload demands for average and peak throughput? Does the network storage (SAN frame) handle the I/O rate down each path and aggregated across the internal bus, frame adaptors, and front-end processors? In order to answer these questions we need to better understand the underlying design, considerations, and basics of the storage subsystem Page 91

Storage Considerations Back-end Storage Design Considerations Capacity - What is the storage capacity needed for this workload/cluster? Disk drive size (e.g., 144 GB, 300 GB) Number of disk drives needed within a single logical unit (e.g., LUN) IOPS Rate - How many I/Os per second are required at the needed latency? Number of physical spindles per LUN Impact of sharing physical disk drives between LUNs Configuration (e.g., cache) and speed of the disk drive Availability - How many disk drives or storage components can fail at one time? Type of RAID chosen, number of parity drives per grouping Amount of redundancy built into the storage solution Cost - Delivered cost per byte at the required speed and availability Many options are available for each design consideration Final decisions on the choice for each component The cumulative amount of capacity, IOPS rate, and availability often dictate the overall solution Page 92

Storage Considerations Storage from the Ground Up Basic Definitions: Mechanical Drives Disk Latency - The average time it takes for the requested sector to rotate under the read/write head after a completed seek 5400 (5.5 ms), 7200 (4.2 ms), 10,000 (3 ms), 15,000 (2 ms) RPM Average disk latency = 1/2 * rotation time Throughput (MB/sec) = (outstanding I/Os / latency (ms)) * block size (KB) Seek Time - The time it takes for the read/write head to find the physical location of the requested data on the disk Average seek time: 8-10 ms Access Time - The total time it takes to locate the data on the drive(s). This includes seek time, latency, settle time, and command processing overhead time. Host Transfer Rate - The speed at which the host can transfer the data across the disk interface. Page 93
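A small sketch applying the two formulas above; the drive speeds are the slide's examples, while the queue depth and block size are made-up numbers for illustration.

# Illustrative use of the formulas on this slide (example numbers only).
def rotational_latency_ms(rpm):
    # Average rotational latency = half of one full rotation.
    return 0.5 * 60_000.0 / rpm   # 60,000 ms per minute / rotations per minute

def throughput_mb_s(outstanding_ios, latency_ms, block_kb):
    # Slide formula: (outstanding I/Os / latency in ms) * block size in KB.
    return outstanding_ios / latency_ms * block_kb

print(round(rotational_latency_ms(15_000), 1))   # 2.0 ms, matching the slide
print(throughput_mb_s(32, 4.0, 8))               # 64.0 MB/s with 32 I/Os, 4 ms, 8 KB blocks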

Storage Considerations Network Storage Components That Can Affect Performance/Availability Size and use of cache (i.e., % dedicated to reads versus writes) Number of independent internal data paths and buses Number of front-end interfaces and processors Types of interfaces supported (e.g., Fibre Channel and iSCSI) Number and type of physical disk drives available MetaLUN Expansion MetaLUNs allow for the aggregation of LUNs The system typically re-stripes data when a MetaLUN is changed Some performance degradation occurs during re-striping Storage Virtualization Aggregation of storage arrays behind a presented mount point/LUN Movements between disk drives and tiers are controlled by storage management Changes of physical drives and configuration may be transient and severe Page 94

Case Studies - Storage Test Bed Idle State Device Adapter View V4 Average Device Latency, Per Command Storage Adapter Maximum Queue Length World Maximum Queue Length LUN Maximum Queue Length Page 95

Case Studies - Storage Moderate load on two virtual machines V4 Commands are queued BUT... acceptable latency from the disk subsystem Page 96

Case Studies - Storage Heavier load on two virtual machines Commands are queued and are exceeding maximum queue lengths BUT... virtual machine latency is consistently above 20 ms, so performance could start to be an issue Page 97

Case Studies - Storage Heavy load on four virtual machines Commands are queued and are exceeding maximum queue lengths AND... virtual machine latency is consistently above 60 ms for some VMs, so performance will be an issue Page 98

Case Studies - Storage Artificial Constraints on Storage Problem with the disk subsystem Good throughput, low device latency Bad throughput, device latency is high - cache disabled Page 99

Understanding the Disk Counters and Latencies Page 100

Understanding Disk I/O Queuing Page 101

Storage Considerations SAN Storage Infrastructure Areas to Watch/Consider [Diagram: HBA, ISL, FC switch/director, SAN] HBA speed Fiber bandwidth FA CPU speed Disk response RAID configuration Block size Number of spindles in the LUN/array Disk speeds HBA world queue length LUN queue length Storage adapter queue length Cache size/type Page 102

Storage Considerations Storage Queuing - The Key Throttle Points [Diagram: ESX host with VMs 1-4, each with a world queue length (WQLEN), feeding the LUN queue length (LQLEN), then the execution throttle at each HBA, out to the storage area network] Page 103

Storage Considerations Storage I/O - The Key Throttle Point Definitions Storage Adapter Queue Length (AQLEN) - The number of outstanding vmkernel active commands that the adapter is configured to support. This is not settable; it is a parameter passed from the adapter to the kernel. LUN Queue Length (LQLEN) - The maximum number of permitted outstanding vmkernel active commands to a LUN. (This would be the HBA queue depth setting for an HBA.) This is set in the storage adapter configuration via the command line. World Queue Length (WQLEN) - VMware recommends not to change this!!!! The maximum number of permitted outstanding vmkernel active requests to a LUN from any single virtual machine (min: 1, max: 256, default: 32) Configuration -> Advanced Settings -> Disk -> Disk.SchedNumReqOutstanding Execution Throttle (this is not a displayed counter) - The maximum number of permitted outstanding vmkernel active commands that can be executed on any one HBA port (min: 1, max: 256, default: ~16, depending on the vendor) This is set in the HBA driver configuration. Page 104
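To see how these throttle points stack up for a single LUN, here is a simplified sketch; the queue values are the slide's illustrative defaults (the adapter queue length is assumed), and the logic is a rough model rather than the actual VMkernel scheduler.

# Simplified sketch of how the throttle points above interact for one LUN.
def lun_bottleneck(active_per_vm, vm_count, wqlen=32, lqlen=32, throttle=16, aqlen=1024):
    # Return the effective per-LUN command limit and whether demand exceeds it.
    per_vm = min(active_per_vm, wqlen)     # each VM is capped by WQLEN
    demand = per_vm * vm_count             # total outstanding commands for the LUN
    limit = min(lqlen, throttle, aqlen)    # tightest limit along the path
    return limit, demand > limit

# Four VMs, each driving 8 outstanding commands against one LUN:
print(lun_bottleneck(active_per_vm=8, vm_count=4))   # (16, True) - queuing above the HBA throttle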

Storage Considerations Queue Length Rules of Thumb For a lightly loaded system, the average queue length should be less than 1 per spindle, with occasional spikes up to 10. If the workload is write-heavy, the average queue length above a mirrored controller should be less than 0.6 per spindle and less than 0.3 per spindle above a RAID-5 controller. For a heavily loaded system that isn't saturated, the average queue length should be less than 2.5 per spindle, with infrequent spikes up to 20. If the workload is write-heavy, the average queue length above a mirrored controller should be less than 1.5 per spindle and less than 1 above a RAID-5 controller.
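The same rules of thumb, encoded as a quick lookup for sanity-checking observed queue depths (a direct transcription of the thresholds above, nothing more):

# The per-spindle average queue-length thresholds from the slide above.
def max_avg_queue_per_spindle(heavily_loaded, write_heavy, raid5):
    if not heavily_loaded:
        if not write_heavy:
            return 1.0             # light load; occasional spikes to ~10 are fine
        return 0.3 if raid5 else 0.6
    if not write_heavy:
        return 2.5                 # heavy but unsaturated; spikes to ~20 are fine
    return 1.0 if raid5 else 1.5

print(max_avg_queue_per_spindle(heavily_loaded=True, write_heavy=True, raid5=False))  # 1.5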

Closing Thoughts Know the key counters to look at for each type of resource Be careful about which resource allocation technique you use for CPU and RAM; one size may NOT fit all Consider the impact of events such as maintenance on the performance of a cluster Set up a simple test bed where you can create simple loads to become familiar with the various performance counters and tools Compare your test bed analysis and performance counters with the development and production clusters Know your storage subsystem components and configuration due to the large impact this can have on overall performance Take the time to learn how the various components of the virtual infrastructure work together Page 106

John Paul johnathan.paul@siemens.com Page 107