Lies, Damn Lies and Performance Metrics Barry Cooks, Virtual Instruments

Goal for This Talk: take away a sense of how to make the move from improving your "mean time to innocence" to improving your infrastructure performance 2

What We'll Cover: a case of performance metrics gone bad; some history; what performance monitoring needs; the lies; the damn lies; the performance metrics; how you can use them 3

Application is down again. 4

Data Center Management - Actual You see this??? 5

Array tools say it's okay 6

Data Center Management - Actual How can I help? 7

Data Center Management - Actual Meanwhile, at the storage vendor: Have you tried updating your drivers and firmware? 8

And the switch vendor: Can you clear the counters and run another log collection? 9

Some history 10

IBM, A Point of Reference: mainframes collected and correlated lots of data about the workload and infrastructure 11

Closed vs. Open Systems: the move to open systems introduced numerous competing vendors, interconnected specialized devices, and inconsistency in monitoring methods and metrics. Correlating data from multiple vendors is a serious challenge. Vendors' focus has been on core innovation; monitoring became a secondary priority. 12

What does performance monitoring need? 13

What's Required for Success: understanding what data is relevant; a method to gather that data, ideally without impacting the systems under monitoring; an end-to-end view of the data; historical data retention; comparable data across the vendor ecosystem; actionable insights from that data 14

The lies 15

Performance Monitoring Today: performance metrics are often not really performance metrics (utilization, error counters), are samples taken on a polling interval (every minute, hour, 6 hours?), and are rollup averages over a window of time. At 16Gb/s Fibre Channel a single 2KB frame takes about 1.25μs to transmit; that's 48 million 2K reads per minute. A fifteen-minute average? That's the population of Europe. 16
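
To make the scale concrete, here is a minimal back-of-the-envelope sketch in Python; the 2KB frame size comes from the slide, while the ~1600 MB/s effective data rate commonly quoted for 16GFC is an assumption used to reproduce the 1.25μs figure:

    # Rough check of the slide's arithmetic. Assumes the commonly quoted
    # ~1600 MB/s effective data rate for 16GFC and a 2 KiB frame, per the slide.
    FRAME_BYTES = 2 * 1024
    EFFECTIVE_BYTES_PER_SEC = 1.6e9           # ~1600 MB/s

    frame_time_s = FRAME_BYTES / EFFECTIVE_BYTES_PER_SEC
    frames_per_minute = 60 / frame_time_s
    frames_per_15min = frames_per_minute * 15

    print(f"one 2 KiB frame : {frame_time_s * 1e6:.2f} us")       # ~1.3 us
    print(f"frames / minute : {frames_per_minute / 1e6:.0f} M")   # ~47 M
    print(f"frames / 15 min : {frames_per_15min / 1e6:.0f} M")    # ~700 M, roughly Europe's population

A fifteen-minute rollup is therefore averaging over hundreds of millions of individual I/O events.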

The Outlier (Traditional Performance Management): a chart of values ranging from $0 to $1,000,000, with one summary line at a $295K average and another at a $67K average 17

The Hidden Issue: over a 60-second window the workload runs 10,000 I/Os per second at 1ms response time for the first 20 seconds and the final 35 seconds, with only 32 I/Os at 5,000ms each during the 5-second stall from 20s to 25s. Total commands: 550,032. Total I/O time: 1ms * 10,000 I/Os/s * 55s + 32 I/Os * 5,000ms = 710,000ms. Average response time: 710,000ms / 550,032 = 1.29ms. The charts of response time and I/Os per second over the 60 seconds make the stall obvious; the average hides it. 18
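
The same window can be reproduced numerically. A minimal sketch (synthetic data built only from the counts on this slide; the nearest-rank percentile helper is mine) showing how the stall vanishes from the mean yet is unmissable in the tail:

    from statistics import mean

    # Synthetic reconstruction of the 60-second window on the slide:
    # 55 seconds at 10,000 I/Os per second completing in 1 ms each,
    # plus 32 I/Os at 5,000 ms during the 5-second stall.
    latencies_ms = [1.0] * (10_000 * 55) + [5_000.0] * 32
    latencies_ms.sort()

    def pct(p):
        """Nearest-rank percentile over the sorted latency list."""
        idx = min(len(latencies_ms) - 1, int(p / 100 * len(latencies_ms)))
        return latencies_ms[idx]

    print(f"total commands : {len(latencies_ms):,}")         # 550,032
    print(f"average        : {mean(latencies_ms):.2f} ms")   # ~1.29 ms, looks healthy
    print(f"99.99th pct    : {pct(99.99):.0f} ms")           # still 1 ms: only 0.006% of I/Os stalled
    print(f"max            : {latencies_ms[-1]:.0f} ms")     # 5,000 ms, the hidden issue

Even a high percentile can miss a stall this rare; the full distribution (or the max) is what exposes it.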

A Question of Balance: is the traffic between these ports on the same server balanced? Port A mean traffic: 4.41Mb/s; Port B mean traffic: 4.40Mb/s 19

Workload Profiling 20

Vendor Response Time Metrics
Utilization = 100% * Busy Time in Period / (Idle + Busy Time in Period)
Throughput = Total Number of Visitors in Period / Period Length in Seconds
Average Busy Queue Length (ABQL) = Sum of Queue Upon Arrival of Each Visitor / Total Number of Visitors
Queue Length = ABQL * Utilization / 100%
Response Time = Queue Length / Throughput (Little's Law)
Put together: Response Time = ((Sum of Queue Upon Arrival of Each Visitor / Total Number of Visitors) * (100% * Busy Time in Period / (Idle + Busy Time in Period)) / 100%) / (Number of Visitors in Period / Length of Period) 21
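
Spelled out as a calculation, the chain looks like this. A minimal sketch with made-up counter values; only the formula chain itself comes from the slide:

    # Hypothetical counters for one polling interval (all values are made up).
    busy_time_s = 45.0              # time the LUN was busy during the interval
    idle_time_s = 15.0
    period_s = busy_time_s + idle_time_s
    visitors = 12_000               # I/Os completed in the interval
    queue_on_arrival_sum = 36_000   # sum of queue depth seen by each arriving I/O

    utilization = 100.0 * busy_time_s / (busy_time_s + idle_time_s)   # percent
    throughput = visitors / period_s                                  # IOPS
    abql = queue_on_arrival_sum / visitors                            # average busy queue length
    queue_length = abql * utilization / 100.0
    response_time_s = queue_length / throughput                       # Little's Law

    print(f"utilization   : {utilization:.1f} %")            # 75.0 %
    print(f"throughput    : {throughput:.1f} IOPS")          # 200.0 IOPS
    print(f"ABQL          : {abql:.2f}")                     # 3.00
    print(f"queue length  : {queue_length:.2f}")             # 2.25
    print(f"response time : {response_time_s * 1e3:.2f} ms") # 11.25 ms

Note that every input here is itself an interval average collected per storage processor, which is why the caveats on the next slide run so long.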

Vendor Response Time Metrics, The Fine Text (Necessary Caveats): For low LUN throughput (<32 IOPS), response time might be inaccurate. Lazy writes skew the LUN busy counter. Dual SP ownership of a disk can also impact response time: each SP only knows about its own ABQL, throughput and utilization for the disk; at poll time they exchange views; the utilization is max(SPA, SPB), ABQL is computed from the sum of the sums, and SP throughput is the sum of SPA and SPB throughput. Be wary of confusing SP response time in Analyzer with the average response time of all LUNs on that SP. A LUN is busy (not resting) as long as something is queued to it; an SP is busy (not resting) as long as it is not in the OS idle loop. While a disk is busy servicing a LUN request the LUN is still busy, but the SP might be idle, so the SP response time is generally smaller than the average response time of all the LUNs on that SP. Host response time is approximated by LUN response time. 22

Data Time Skew: R² at a one-minute delay is 0.91, while at zero delay it is 0.41 23
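
A rough sketch of how such a skew can be detected: compute R² between the two sources at several candidate offsets and look for the peak. The white-noise series and the exact 60-second shift below are invented for illustration (so the contrast is starker than the real 0.41 vs. 0.91 quoted above):

    import math
    import random
    from statistics import mean

    def r_squared(xs, ys):
        """Plain Pearson R^2 between two equal-length series."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
        return (cov / den) ** 2

    # Two per-second metric streams describing the same workload, except that
    # source B's collector stamps everything roughly 60 seconds late.
    random.seed(1)
    workload = [random.gauss(100, 25) for _ in range(600)]
    source_a = workload[60:]     # correctly timestamped samples
    source_b = workload[:-60]    # same samples, shifted 60 s by the collector

    for lag in (0, 30, 60):
        paired = list(zip(source_a, source_b[lag:]))
        r2 = r_squared([a for a, _ in paired], [b for _, b in paired])
        print(f"offset {lag:>2} s -> R^2 = {r2:.2f}")   # peaks at the 60 s offset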

Gathering the Data: a challenge for external software-based monitoring is perturbing the system under investigation, by adding load and changing behavior 24

Data Collection 25

Data Collection AIX VMware Cisco Cisco EMC HPux HDS Solaris Brocade Brocade IBM HyperV 26

Data Collection AIX VMware Cisco Cisco EMC HPux HDS Solaris Brocade Brocade IBM HyperV 27

The damn lies 28

Decisions Based on Thresholds: a tongue-in-cheek flowchart of threshold setting: refer to documentation, input a value, ask somebody, check whether you get just the right number of alarms, pick a lower threshold, create an email filter, and loop until you can claim to be done (or go buy a lottery ticket if it works on the first try) 29

Where should alarm thresholds be placed? 30

Data Granularity Challenge (Traditional Performance Management): the same traffic plotted against a fixed threshold at one-minute granularity 31

Data Granularity Challenge: the same data at one-second granularity 32

Data Granularity Challenge: the same data at one-millisecond granularity 33
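
A minimal sketch of the granularity effect using synthetic data (the threshold value, burst shape and durations are all invented): the same millisecond series is rolled up into coarser averages, and threshold violations are counted at each granularity:

    import random
    from statistics import mean

    random.seed(7)
    THRESHOLD = 800          # hypothetical MB/s alarm threshold

    # Ten minutes of millisecond samples: a quiet baseline with one short,
    # violent microburst per minute (purely synthetic numbers).
    samples = []
    for _minute in range(10):
        base = [random.gauss(300, 30) for _ in range(60_000)]
        for i in range(300):                       # 300 ms burst each minute
            base[1_000 + i] = random.gauss(2_000, 100)
        samples.extend(base)

    def rollup(series, bucket_ms):
        """Average consecutive samples into buckets of bucket_ms milliseconds."""
        return [mean(series[i:i + bucket_ms]) for i in range(0, len(series), bucket_ms)]

    for label, bucket_ms in (("1 ms", 1), ("1 s", 1_000), ("1 min", 60_000)):
        series = samples if bucket_ms == 1 else rollup(samples, bucket_ms)
        violations = sum(v > THRESHOLD for v in series)
        # Finer granularity reveals the bursts the rollups average away.
        print(f"{label:>6} granularity: {violations} samples above threshold")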

Performance metrics 34

The Outlier, Revisited (Traditional Performance Management): the same chart of values from $0 to $1,000,000, with a $295K average against a $67K average 35

What Does Average Response Time Mean? Q: When you hear your average response time is 20 ms, what is the first thing that pops into your mind? A. My response distribution must look like this: B. My response distribution must look like this: C. My response distribution must look like this: D. My response distribution must look like this: E. I don't know what my response distribution looks like because taking an average of all the response times is not a helpful thing to do. F. When's lunch? 36

What Are Histograms? A histogram is a graphical representation of the distribution of data. Scalar quantization, typically denoted y = Q(x), is the process of using a quantization function Q() to map a scalar (one-dimensional) input value x to a scalar output value y. 37

Histogram Bins
Timing bins, reads (ms): > 0 <= 0.05, > 0.05 <= 0.2, > 0.2 <= 0.5, > 0.5 <= 1, > 1 <= 2, > 2 <= 4, > 4 <= 6, > 6 <= 8, > 8 <= 10, > 10 <= 15, > 15 <= 20, > 20 <= 30, > 30 <= 50, > 50 <= 75, > 75 <= 100, > 100 <= 150, > 150 <= 250, > 250 <= 500, > 500 <= 1000, > 1000 <= 4500, > 4500
Timing bins, writes (ms): > 0 <= 0.05, > 0.05 <= 0.1, > 0.1 <= 0.2, > 0.2 <= 0.3, > 0.3 <= 0.5, > 0.5 <= 0.7, > 0.7 <= 1, > 1 <= 1.5, > 1.5 <= 2, > 2 <= 3, > 3 <= 4, > 4 <= 6, > 6 <= 10, > 10 <= 20, > 20 <= 30, > 30 <= 50, > 50 <= 75, > 75 <= 100, > 100 <= 150, > 150 <= 250, > 250 <= 1000, > 1000 <= 4500, > 4500
Size bins, reads and writes (KiB): > 0 <= 0.5, > 0.5 <= 1, > 1 <= 2, > 2 <= 3, > 3 <= 4, > 4 <= 8, > 8 <= 12, > 12 <= 16, > 16 <= 24, > 24 <= 32, > 32 <= 48, > 48 <= 60, > 60 <= 64, > 64 <= 96, > 96 <= 128, > 128 <= 192, > 192 <= 256, > 256 <= 512, > 512 <= 1024, > 1024
The bins were selected on three criteria: 1. Sampling from live datacenter systems. 2. Common SLA language: service level agreements commonly use 10, 15, 20, 30 and 50ms boundaries. 3. Expected disk seek/access latencies: cache hit range 0-0.5ms; EFD/SSD range 0.5-2ms; 15k FC/SAS range 2-6ms; 10k FC/SAS range 6-10ms; SATA/NL-SAS range 10-15ms. 38
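
As a sketch of how raw latencies land in such bins, the following uses the read timing bin edges listed above; the binning helper and the synthetic latency mix are assumptions for illustration, not the product's implementation:

    import bisect
    import random

    # Upper edges (ms) of the read timing bins above; the final bin is "> 4500 ms".
    READ_BIN_EDGES_MS = [0.05, 0.2, 0.5, 1, 2, 4, 6, 8, 10, 15, 20, 30,
                         50, 75, 100, 150, 250, 500, 1000, 4500]

    def bin_label(i):
        if i == 0:
            return f"> 0 <= {READ_BIN_EDGES_MS[0]} ms"
        if i == len(READ_BIN_EDGES_MS):
            return f"> {READ_BIN_EDGES_MS[-1]} ms"
        return f"> {READ_BIN_EDGES_MS[i - 1]} <= {READ_BIN_EDGES_MS[i]} ms"

    def histogram(latencies_ms):
        """Count each latency into its '> lower <= upper' bin."""
        counts = [0] * (len(READ_BIN_EDGES_MS) + 1)
        for latency in latencies_ms:
            counts[bisect.bisect_left(READ_BIN_EDGES_MS, latency)] += 1
        return counts

    # Synthetic read mix: mostly cache hits, some SSD-range reads, a few SATA-range reads.
    random.seed(3)
    reads = ([random.uniform(0.1, 0.4) for _ in range(8_000)] +
             [random.uniform(0.6, 2.0) for _ in range(1_800)] +
             [random.uniform(12.0, 18.0) for _ in range(200)])

    for i, count in enumerate(histogram(reads)):
        if count:
            print(f"{bin_label(i):>22}: {count}")

The multi-modal shape (cache hits, SSD, spinning disk) survives in the histogram; a single average would flatten it into one meaningless number.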

Write Cache Misses: a write latency histogram with two distinct populations, cache hits and cache misses 39

Impacts of Auto-Tiering: a latency histogram with separate populations for cache hits, SSD, FC and SATA; the result of auto-tiering left unattended 40

IO Size Skew: an average I/O size of 80KiB does not do a very good job of describing the distribution 41

Histogram Capabilities 42

Answers, not data 43

How to Analyze HBA Queue Depth (starting from high-quality raw data). Approach #1, threshold trigger: if (queue_size > 128) throw_red_flag. Approach #2, average metric: Average Queue Depth = 15. 44

How to Analyze HBA Queue Depth. Approach #3, combining multiple metrics with machine learning analytics: plots of response time (ms) against queue size at the 50th, 75th and 95th percentiles distinguish an execution throttle set properly from an execution throttle set too high. Both these scenarios would trigger red flags in Approach #2. 45
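
A minimal sketch of the summarization behind Approach #3 (no actual machine learning here, just the percentile-by-queue-size view the plots show; the record generator and field names are invented):

    import random
    from collections import defaultdict

    def percentile(sorted_vals, p):
        """Nearest-rank percentile over an already sorted list."""
        return sorted_vals[min(len(sorted_vals) - 1, int(p / 100 * len(sorted_vals)))]

    # Synthetic I/O records: (queue size observed when the I/O was issued,
    # response time in ms). In this "throttle set too high" case, response
    # time grows steadily with queue depth.
    random.seed(11)
    records = []
    for _ in range(50_000):
        queue_size = random.randint(1, 64)
        records.append((queue_size, random.gauss(1.0 + 0.15 * queue_size, 0.3)))

    # Bucket response times by queue size and summarize each bucket by percentile,
    # rather than collapsing everything into one average queue depth.
    by_queue = defaultdict(list)
    for queue_size, rt_ms in records:
        by_queue[queue_size].append(rt_ms)

    print(" queue     p50     p75     p95")
    for queue_size in sorted(by_queue)[::8]:
        vals = sorted(by_queue[queue_size])
        print(f"{queue_size:>6}  {percentile(vals, 50):6.2f}  "
              f"{percentile(vals, 75):6.2f}  {percentile(vals, 95):6.2f}")

If response time stays flat as queue size grows, deep queues are harmless; if the percentile curves climb, the throttle is set too high. A single average queue depth cannot tell those two cases apart.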

Repositioning VMs in a Cluster (high-quality raw data: VM#1 CPU usage, VM#1 MEM usage, VM#1 Disk usage, VM#1 NET usage). Approach #1, average metrics. Approach #2, threshold trigger: if (vm_cpu_usage > 85%) move_vm_process. 46

Repositioning VMs in a Cluster. Approach #3: predict future usage and reorganize to fix bottlenecks BEFORE they happen. Reorganize VMs such that the busy times of one VM correspond with the free times of the rest of the server. (Charts of server CPU utilization % over time for VM#46, VM#35, VM#12 and VM#25, VM#17, VM#16: one server shows a bottleneck today, the other predicted future steady usage; both dynamic CPU and memory utilization are included.) 47
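
One simple way to express "busy times of one VM correspond with the free times of the rest" is to rank candidate pairings by how anti-correlated their usage profiles are. A minimal sketch with invented 24-hour CPU profiles; it is not the predictive analytics the slide describes, only the pairing idea:

    import math
    from statistics import mean

    def correlation(a, b):
        """Pearson correlation between two equal-length usage profiles."""
        ma, mb = mean(a), mean(b)
        cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
        return cov / den

    # Hypothetical 24-hour CPU profiles (% utilization per hour) for four VMs.
    profiles = {
        "vm_batch":   [10] * 8 + [85] * 8 + [10] * 8,    # busy mid-day
        "vm_backup":  [80] * 8 + [10] * 8 + [80] * 8,    # busy overnight
        "vm_web":     [20] * 6 + [70] * 12 + [20] * 6,   # business hours
        "vm_reports": [65] * 6 + [15] * 12 + [65] * 6,   # off hours
    }

    # Rank VM pairs so the most anti-correlated (busy-when-the-other-is-free)
    # pairs surface first as co-placement candidates on the same host.
    names = list(profiles)
    pairs = sorted(
        ((correlation(profiles[a], profiles[b]), a, b)
         for i, a in enumerate(names) for b in names[i + 1:]),
        key=lambda t: t[0])

    for corr, a, b in pairs:
        print(f"{a:12s} + {b:12s} correlation = {corr:+.2f}")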

Where We Landed: using high-quality, low-impact data, we can drive better decision-making across the infrastructure. Analytics will enable a change in the way answers are derived from the data. 48
