Lies, Damn Lies and Performance Metrics
Barry Cooks, Virtual Instruments
Goal for This Talk
Take away a sense of how to move from improving your mean time to innocence to improving your infrastructure performance.
What We'll Cover
- A case of performance metrics gone bad
- Some history
- What performance monitoring needs
- The lies
- The damn lies
- The performance metrics
- How you can use them
Data Center Management - Actual
"The application is down again."
"You see this???"
"The array tools say it's okay."
"How can I help?"
Meanwhile, at the storage vendor: "Have you tried updating your drivers and firmware?"
And the switch vendor: "Can you clear the counters and run another log collection?"
Some history
IBM: A Point of Reference
Mainframes collected and correlated lots of data about the workload and infrastructure.
Closed vs. Open Systems
The move to open systems introduced:
- Numerous competing vendors
- Interconnected specialized devices
- Inconsistency in monitoring methods and metrics
Correlating data from multiple vendors is a serious challenge. Vendors' focus has been on core innovation; monitoring became a secondary priority.
What does performance monitoring need?
What's Required for Success
- Understanding what data is relevant
- A method to gather that data, ideally without impacting the systems under monitoring
- An end-to-end view of the data
- Historical data retention
- Comparable data across the vendor ecosystem
- Actionable insights from that data
The lies
Performance Monitoring Today
Performance metrics are often not really performance metrics:
- Utilization
- Error counters
- Samples taken on a polling interval (every minute? hour? six hours?)
- Rollup averages over a window of time
At 16G, a single 2KB frame takes 1.25μs to transmit. That's 48 million 2K reads per minute. A fifteen-minute average? That's roughly the population of Europe.
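The arithmetic behind that claim can be checked directly; a minimal sketch, taking the slide's stated ~1.25μs per 2KB frame as given:

```python
# Frames per minute at 16G Fibre Channel, assuming ~1.25 us per 2 KB frame
# (the figure stated on the slide).
frame_time_us = 1.25
frames_per_minute = 60 * 1_000_000 / frame_time_us   # 48,000,000

# A 15-minute rollup average therefore summarizes ~720 million samples,
# roughly the population of Europe, into a single number.
samples_per_15min = frames_per_minute * 15

print(f"{frames_per_minute:,.0f} frames/min, {samples_per_15min:,.0f} per 15-min window")
```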
The Outlier
[Chart: Traditional Performance Management. Values of $350,000, $700,000 and $1,000,000 stand out; averages shown at $295K and $67K.]
The Hidden Issue
[Chart: response time and I/Os per second over a 60-second window. Baseline 1ms response time at 10,000 I/Os per second; between 20s and 25s, a burst of 32 I/Os at 5,000ms while throughput collapses.]
Total commands: 10,000 I/Os/s × 55s + 32 I/Os = 550,032
Total I/O time: 1ms × 10,000 I/Os/s × 55s + 32 I/Os × 5,000ms = 710,000ms
Average response time = 710,000ms / 550,032 = 1.29ms
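The arithmetic above can be reproduced in a few lines, showing how a window-wide average erases a five-second stall:

```python
# Workload from the slide: 55 s of 10,000 I/Os/s at 1 ms each, plus a 5 s
# stall in which only 32 I/Os complete, each taking 5,000 ms.
fast_ios = 10_000 * 55            # I/Os completed at baseline latency
slow_ios = 32                     # I/Os completed during the stall
total_ios = fast_ios + slow_ios   # 550,032 commands in the 60 s window

total_time_ms = fast_ios * 1 + slow_ios * 5_000   # 710,000 ms of I/O time
avg_ms = total_time_ms / total_ios

print(f"average response time: {avg_ms:.2f} ms")  # ~1.29 ms
```

An application that stalled for five full seconds reports an average response time barely above its baseline.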
A Question of Balance
Is the traffic between these two ports on the same server balanced?
Port A mean traffic: 4.41 Mb/s
Port B mean traffic: 4.40 Mb/s
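The means say yes; the underlying behavior can say otherwise. A sketch with hypothetical per-second samples: two ports with near-identical means, one steady and one bursty:

```python
# Hypothetical per-second traffic samples (Mb/s) over one minute.
port_a = [4.4] * 60                   # steady ~4.4 Mb/s every second
port_b = [0.0] * 55 + [52.8] * 5      # idle, then a 5-second burst

mean_a = sum(port_a) / len(port_a)
mean_b = sum(port_b) / len(port_b)    # also 4.4: the means "balance"
peak_a, peak_b = max(port_a), max(port_b)

print(f"mean A={mean_a:.2f}, mean B={mean_b:.2f} Mb/s")
print(f"peak A={peak_a}, peak B={peak_b} Mb/s")   # 4.4 vs 52.8
```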
Workload Profiling
Vendor Response Time Metrics
Utilization = 100% × busy time in period / (idle + busy) time in period
Throughput = total number of visitors in period / period length in seconds
Average Busy Queue Length (ABQL) = sum of queue length upon arrival of each visitor / total number of visitors
Queue length = ABQL × utilization / 100%
Response time = queue length / throughput (Little's Law)
Expanded:
Response time = ((sum of queue upon arrival of each visitor / total number of visitors) × (100% × busy time in period / (idle + busy) time in period) / 100%) / (number of visitors in period / length of period)
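A minimal sketch of that formula chain on made-up sample data ("visitors" here are I/Os arriving during a ten-second measurement period; all numbers are illustrative):

```python
period_s = 10.0
busy_s, idle_s = 8.0, 2.0                     # device busy/idle time in the period
queue_on_arrival = [2, 3, 1, 4, 2, 3, 2, 3]   # queue depth seen by each arrival

utilization = 100.0 * busy_s / (busy_s + idle_s)       # 80%
throughput = len(queue_on_arrival) / period_s          # 0.8 visitors/s
abql = sum(queue_on_arrival) / len(queue_on_arrival)   # average busy queue length: 2.5
queue_length = abql * utilization / 100.0              # 2.0
response_time_s = queue_length / throughput            # Little's Law: 2.5 s

print(f"response time = {response_time_s:.2f} s")
```

Note everything here is derived from counters and averages, not from timing any individual I/O, which is exactly why the caveats on the next slide exist.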
Vendor Response Time Metrics: The Fine Print (Necessary Caveats)
- For low LUN throughput (<32 IOPS), response time might be inaccurate.
- Lazy writes skew the LUN busy counter.
- Dual SP ownership of a disk can also impact response time. Each SP only knows its own ABQL, throughput and utilization for the disk. At poll time they exchange views: utilization is max(SPA, SPB), ABQL is computed from the sum of the sums, and SP throughput is the sum of SPA and SPB throughput.
- Be wary of confusing SP response time in Analyzer with the average response time of all LUNs on that SP.
- A LUN is busy (not resting) as long as something is queued to it. An SP is busy as long as it is not in the OS idle loop. While a disk is servicing a LUN request, the LUN is still busy, but the SP might be idle. The SP response time is therefore generally smaller than the average response time of all the LUNs on that SP.
- Host response time is approximated by LUN response time.
Data Time Skew
R² at a one-minute delay is 0.91, while at zero delay it is 0.41.
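The effect is easy to reproduce. A sketch with synthetic data: two metrics from different collectors, where one simply lags the other by one sample; correlating them without accounting for the skew badly understates how related they are:

```python
def pearson_r(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

load = [1, 5, 2, 8, 3, 9, 2, 7, 4, 6]
latency = [0] + load[:-1]           # the same signal, delayed by one sample

r_zero_lag = pearson_r(load, latency)           # misaligned: weak (here negative)
r_one_lag = pearson_r(load[:-1], latency[1:])   # aligned: r = 1.0

print(f"r at zero lag: {r_zero_lag:.2f}, r at one-sample lag: {r_one_lag:.2f}")
```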
Gathering the Data
A challenge for external software-based monitoring: perturbing the system under investigation by adding load and changing its behavior.
Data Collection
Data Collection
[Diagram: hosts (AIX, VMware, HP-UX, Solaris, HyperV), switches (Cisco, Brocade) and arrays (EMC, HDS, IBM) in the data path.]
The damn lies
Decisions Based on Thresholds
[Flowchart parody of threshold tuning: input a value (refer to documentation, or ask somebody), check whether you get just the right number of alarms on the first try ("Yeah, right" / "Go buy a lottery ticket, immediately"), pick a lower threshold, repeat until you give up and create an email filter. Done, yet?]
Where should alarm thresholds be placed?
Data Granularity Challenge
[Charts: the same traffic plotted against the same threshold at one-minute, one-second, and one-millisecond granularity. Threshold crossings that are obvious at millisecond granularity vanish entirely in the coarser rollups.]
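A sketch with synthetic data showing the same effect numerically: a short burst that clearly crosses a threshold at millisecond granularity disappears once samples are averaged into one-second windows:

```python
threshold = 50.0
samples_ms = [1.0] * 1000                 # one second of 1 ms samples
samples_ms[500:510] = [100.0] * 10        # a 10 ms burst, well over threshold

def rollup(samples, window):
    # Average consecutive samples into coarser-granularity buckets.
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples), window)]

per_second = rollup(samples_ms, 1000)     # one value for the whole second

crossings_ms = sum(s > threshold for s in samples_ms)   # 10
crossings_s = sum(s > threshold for s in per_second)    # 0

print(f"crossings: {crossings_ms} at 1 ms granularity, {crossings_s} at 1 s")
```

The one-second rollup averages to 1.99, nowhere near the threshold, so no alarm ever fires.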
Performance metrics
The Outlier - Revisited
[Chart: Traditional Performance Management, same data as before. Values of $350,000, $700,000 and $1,000,000; averages shown at $295K and $67K.]
What Does Average Response Time Mean?
Q: When you hear your average response time is 20ms, what is the first thing that pops into your mind?
A. My response distribution must look like this: [distribution chart]
B. My response distribution must look like this: [a different distribution chart]
C. My response distribution must look like this: [a different distribution chart]
D. My response distribution must look like this: [a different distribution chart]
E. I don't know what my response distribution looks like, because taking an average of all the response times is not a helpful thing to do.
F. When's lunch?
What Are Histograms?
A histogram is a graphical representation of the distribution of data. Scalar quantization, typically denoted y = Q(x), is the process of using a quantization function Q(·) to map a scalar (one-dimensional) input value x to a scalar output value y.
Histogram Bins
Timing bins, Reads (ms): >0-0.05, 0.05-0.2, 0.2-0.5, 0.5-1, 1-2, 2-4, 4-6, 6-8, 8-10, 10-15, 15-20, 20-30, 30-50, 50-75, 75-100, 100-150, 150-250, 250-500, 500-1000, 1000-4500, >4500
Timing bins, Writes (ms): >0-0.05, 0.05-0.1, 0.1-0.2, 0.2-0.3, 0.3-0.5, 0.5-0.7, 0.7-1, 1-1.5, 1.5-2, 2-3, 3-4, 4-6, 6-10, 10-20, 20-30, 30-50, 50-75, 75-100, 100-150, 150-250, 250-1000, 1000-4500, >4500
Size bins, Read & Write (KiB): >0-0.5, 0.5-1, 1-2, 2-3, 3-4, 4-8, 8-12, 12-16, 16-24, 24-32, 32-48, 48-60, 60-64, 64-96, 96-128, 128-192, 192-256, 256-512, 512-1024, >1024
The bins were selected on three criteria:
1. Sampling from live datacenter systems
2. Common SLA language (service level agreements commonly use 10, 15, 20, 30, 50ms boundaries)
3. Expected disk seek/access latencies:
   a. Cache hit range: 0-0.5ms
   b. EFD / SSD range: 0.5-2ms
   c. 15k FC/SAS range: 2-6ms
   d. 10k FC/SAS range: 6-10ms
   e. SATA/NL-SAS range: 10-15ms
2015 Data Storage Innovation Conference. Virtual Instruments. All Rights Reserved.
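Binning like this is cheap to do online. A minimal sketch, using the read timing-bin edges above and binary search so each sample is bucketed in O(log n) without storing the raw data:

```python
import bisect

# Upper edges of the read timing bins, in ms; the last bin (>4500 ms) is open.
read_edges_ms = [0.05, 0.2, 0.5, 1, 2, 4, 6, 8, 10, 15, 20, 30,
                 50, 75, 100, 150, 250, 500, 1000, 4500]

counts = [0] * (len(read_edges_ms) + 1)   # one extra slot for the open bin

def record(latency_ms):
    # Bins are "> lower <= upper", so bisect_left finds the first edge
    # >= latency, which is exactly the right bucket.
    counts[bisect.bisect_left(read_edges_ms, latency_ms)] += 1

for sample_ms in [0.3, 0.9, 1.4, 7.2, 12.0, 480.0, 6000.0]:
    record(sample_ms)

print(counts)
```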
Write Cache Misses
[Histogram: a bimodal write latency distribution, with one peak for cache hits and a second, slower peak for cache misses.]
Impacts of Auto-Tiering
[Histogram: latency modes corresponding to cache hits, SSD, FC and SATA tiers, with auto-tiering left unattended.]
IO Size Skew
Average I/O size = 80KiB. The average does not do a very good job of describing the distribution.
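A sketch with a hypothetical bimodal workload makes the point: the mean comes out to exactly 80KiB even though not a single I/O is anywhere near 80KiB:

```python
# Hypothetical mix: small random reads plus large sequential transfers.
sizes_kib = [8] * 2000 + [224] * 1000

mean = sum(sizes_kib) / len(sizes_kib)           # 80.0 KiB
near_mean = sum(64 <= s <= 96 for s in sizes_kib)

print(f"mean = {mean} KiB; I/Os within 64-96 KiB: {near_mean}")  # 80.0; 0
```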
Histogram Capabilities
Answers, not data
How to Analyze HBA Queue Depth
High-quality raw data.
Approach #1: Threshold trigger
  if (queue_size > 128) throw_red_flag
Approach #2: Average metric
  Average Queue Depth = 15
How to Analyze HBA Queue Depth
Approach #3: Combining multiple metrics with machine-learning analytics
[Charts: response-time percentiles (50th, 75th, 95th) plotted against queue size. One chart shows an execution throttle set properly; the other shows it set too high. Both of these scenarios would trigger red flags in Approach #2.]
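The core of Approach #3 is pairing metrics rather than alarming on either alone. A sketch with synthetic samples: response time grouped by queue-size bucket, summarized as percentiles (the percentile helper and the data are illustrative, not any vendor's implementation):

```python
import math
from collections import defaultdict

def percentile(values, p):
    # Nearest-rank percentile on sorted data.
    vs = sorted(values)
    idx = max(0, math.ceil(p / 100 * len(vs)) - 1)
    return vs[idx]

# (queue_size, response_time_ms) samples: latency grows with queue depth,
# with occasional slow outliers at every depth.
samples = [(q, 0.5 * q + extra)
           for q in (8, 16, 32, 64)
           for extra in (0.1, 0.3, 0.5, 2.0, 9.0)]

by_queue = defaultdict(list)
for q, rt in samples:
    by_queue[q].append(rt)

for q in sorted(by_queue):
    p50 = percentile(by_queue[q], 50)
    p95 = percentile(by_queue[q], 95)
    print(f"queue={q:3d}  p50={p50:5.1f} ms  p95={p95:5.1f} ms")
```

Plotting those per-bucket percentiles against queue size is what distinguishes a throttle set properly from one set too high, which a single average or threshold cannot do.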
Repositioning VMs in a Cluster
High-quality raw data: VM#1 CPU, memory, disk and network usage.
Approach #1: Average metrics
Approach #2: Threshold trigger
  if (vm_cpu_usage > 85%) move_vm_process
Repositioning VMs in a Cluster
Approach #3: Predict future usage and reorganize to fix bottlenecks BEFORE they happen.
Reorganize VMs so that the busy times of one VM correspond with the free times of the rest of the server (considering both dynamic CPU and memory utilization).
[Charts: server CPU utilization over time. One server bottlenecked today (VM#46, VM#35, VM#12) vs. predicted steady future usage after reorganizing (VM#25, VM#17, VM#16).]
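The placement idea can be sketched with toy data: co-locating VMs whose busy periods coincide overloads the server, while pairing anti-correlated VMs keeps the combined peak low (the VM names and hourly profiles are made up for illustration):

```python
# Hourly CPU demand (%) per VM; a server overloads above 100%.
vms = {
    "vm_day":   [80, 80, 80, 10, 10, 10],   # busy mornings
    "vm_night": [10, 10, 10, 80, 80, 80],   # busy evenings
    "vm_day2":  [70, 70, 70, 15, 15, 15],   # also busy mornings
}

def peak_combined(a, b):
    # Worst-case combined demand if both VMs share one server.
    return max(x + y for x, y in zip(vms[a], vms[b]))

bad_pairing = peak_combined("vm_day", "vm_day2")     # 150%: bottlenecked
good_pairing = peak_combined("vm_day", "vm_night")   # 90%: fits comfortably

print(f"correlated pairing peaks at {bad_pairing}%")
print(f"anti-correlated pairing peaks at {good_pairing}%")
```

Both pairings have identical average demand, which is why Approaches #1 and #2 cannot tell them apart.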
Where We Landed
Using high-quality, low-impact data, we can drive better decision-making across the infrastructure. Analytics will enable a change in the way answers are derived from the data.