Port Tapping Session 3: How to Survive the SAN Infrastructure Storm
Tap Module
- Same profile as Pretium EDGE Module
- Red adapter indicates TAP port
Corning Fibre Channel and Ethernet Taps
- 72 ports per 1U
- 288 ports per 4U
Virtual Instruments Overview
March 2013
Alex D'Anna, Director of Solutions Consulting, EMEA
Virtual Instruments
- Proven leader in Infrastructure Performance Management (IPM)
- Headquarters: San Jose, California, USA
- Revenue growth 100% yr.
- EMC Select Partnership launched November 2012
- VIP Partner program on fire!
- Key eco-system relationships
Virtual Instruments Confidential
The Customers
- Finance & Insurance
- Healthcare/Pharma
- Services
- Retail & E-commerce
- Government
- Manufacturing
Data Centre Solution Portfolio
[Diagram: Virtual Machines and Guests (APP/OS stacks), SAN Availability Probe, SAN Performance Probe]
Tapping is a SAN Best Practice
Traffic Access Points (TAPs):
- Have been widely deployed in IP networks (LANs, WANs) for 20+ years
- Provide direct access to all levels of fibre traffic to derive data on SAN/storage performance, utilisation, and transmission errors
- Are used by all system and storage vendors to diagnose device-specific problems
- Enable IT personnel to:
  - Ensure high application availability
  - Maximise application performance
  - Proactively find problems before users do
  - Systematically align CAPEX with business requirements
- Unlock the unrivaled power of VirtualWisdom through integrated SAN Performance Probes and Protocol Analysers
SAN Availability Probe
Software to optimise utilisation and availability
- Reports on link failure, link errors, dropped frames
- Immediate root cause detection
- Speed problem resolution
- Prevent outages and performance problems
- Historical trending analysis
- Find multi-path failures
- Identifies over-provisioned links
Virtual Server Probe
Software to optimise VMware performance and consolidation ratios
- Increase use of virtual servers for tier 1 applications
- Collects over 100 vCenter metrics
- Reduces the risk of implementing mission-critical applications
- Increases consolidation ratios
- Offers VM to LUN correlation
SAN Performance Probe
SAN Availability Probe software, TAPs, and SAN Performance Probe hardware to optimise performance and availability
- Latency and response time reports
- Reduce number of array ports
- Improve performance
- Queue depth performance impact
- Identify degraded device metrics
- Optimise storage tiering
- Quickly identify root cause
- Reduce power, space requirements
- "What if" modeling for consolidation, reconfigurations
The Software
- Recording & Playback: change interval in dashboards for easy drill-down and faster troubleshooting
- Ships with suggested tabs and dashboards, but infinite customization possible
- Correlation across all installed probe types, from VM to HBA, through switch to LUN
- Real-time monitoring of latency metrics like Exchange Completion Time (ECT)
- Alerts & Alarms based on user-definable thresholds, e.g. critical alert when ECT > 40 ms
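The threshold-based alerting mentioned on this slide can be illustrated with a minimal sketch. This is not VirtualWisdom's API: the `Threshold` class, the `classify_ect` function, the sample labels, and the 20 ms warning level are all assumptions made for illustration. Only the 40 ms critical threshold comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """User-definable alert levels for Exchange Completion Time (hypothetical)."""
    warning_ms: float
    critical_ms: float


def classify_ect(ect_ms: float, threshold: Threshold) -> str:
    """Return the alert level for one ECT sample, given in milliseconds."""
    if ect_ms > threshold.critical_ms:
        return "critical"
    if ect_ms > threshold.warning_ms:
        return "warning"
    return "ok"


# Critical level from the slide (ECT > 40 ms); the warning level is assumed.
threshold = Threshold(warning_ms=20.0, critical_ms=40.0)

# Hypothetical samples: (initiator-target-LUN label, ECT in milliseconds).
samples = [
    ("esx1->array1:lun0", 4.2),
    ("esx2->array1:lun3", 46.0),
    ("esx3->array2:lun1", 22.5),
]

alerts = [(name, classify_ect(ect, threshold)) for name, ect in samples]
for name, level in alerts:
    print(f"{name}: {level}")
```

The comparison is strict, so a sample of exactly 40 ms stays at the warning level, matching the "ECT > 40 ms" wording.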
Drill down to isolate physical layer problems
- Series of CRC errors in the Storage Physical Events widget
- Drilling down, you can find the affected servers and array ports
SAN Performance Probe Performance Trend
- ProbeFC8 collects data from the fibre channel links
- Exchange Completion Time widget
- HBA Queue Depth widget
Correcting oversubscription ratios
- VW provides insight into blade and ASIC utilisation on each SAN director.
- It can be seen that there is significant throughput imbalance.
- By allocating new servers based on ASIC utilisation and balancing the existing throughput, the customer can potentially increase the number of servers without introducing risk or performance issues.
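Allocating new servers based on ASIC utilisation can be sketched as a simple greedy placement: each new server goes to the ASIC that currently carries the least throughput. The function name, ASIC labels, and throughput figures below are hypothetical. A real placement would also have to respect zoning, port availability, and ISL layout.

```python
import heapq


def place_servers(asic_load_mbps, new_server_mbps):
    """Greedily place each new server on the currently least-loaded ASIC.

    asic_load_mbps:  {asic_name: current throughput in MB/s}
    new_server_mbps: {server_name: expected throughput in MB/s}
    Returns (placement, final_load).
    """
    heap = [(load, asic) for asic, load in asic_load_mbps.items()]
    heapq.heapify(heap)
    placement = {}
    # Place the heaviest servers first so the final loads end up more even.
    for server, demand in sorted(new_server_mbps.items(), key=lambda kv: -kv[1]):
        load, asic = heapq.heappop(heap)
        placement[server] = asic
        heapq.heappush(heap, (load + demand, asic))
    final_load = {asic: load for load, asic in heap}
    return placement, final_load


# Hypothetical imbalance: one ASIC carries far more traffic than the others.
asic_load = {"blade1/asic0": 620.0, "blade1/asic1": 140.0, "blade2/asic0": 310.0}
new_servers = {"srv-a": 200.0, "srv-b": 120.0, "srv-c": 80.0}

placement, final_load = place_servers(asic_load, new_servers)
print(placement)
print(final_load)
```

In this example the busiest ASIC receives no new servers, while the other two end up within 10 MB/s of each other.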
Inventory - March 15th
Inventory - April 7th, 2010
Multipath Verification
- Multipath (MP) view after removing nicknames that include the word TAPE.
- The single HBAs should be investigated.
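The check described here (ignore tape-related nicknames, then flag hosts left with only one HBA) can be sketched as follows. The record layout of `(host, hba_nickname)` pairs and all names are assumptions for illustration, not an export format of any product.

```python
from collections import defaultdict


def single_path_hosts(hba_records, exclude_word="TAPE"):
    """Return hosts that have exactly one HBA after dropping excluded nicknames.

    hba_records: iterable of (host, hba_nickname) pairs (hypothetical layout).
    """
    hbas_by_host = defaultdict(set)
    for host, nickname in hba_records:
        # Skip tape-related HBAs, which are often single-pathed by design.
        if exclude_word.lower() in nickname.lower():
            continue
        hbas_by_host[host].add(nickname)
    return sorted(host for host, hbas in hbas_by_host.items() if len(hbas) == 1)


# Hypothetical inventory.
records = [
    ("db01", "db01_hba0"), ("db01", "db01_hba1"),
    ("web02", "web02_hba0"),
    ("bkp01", "bkp01_TAPE_hba0"),
    ("app03", "app03_hba0"), ("app03", "app03_hba1"),
]

suspects = single_path_hosts(records)
print(suspects)
```

Here only `web02` is reported: `bkp01` drops out because its sole HBA carries a TAPE nickname.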
Events at Glance
The environment is experiencing significant events that are most likely affecting performance and potentially building up to an outage.
Events at Glance
Events at Glance
Supporting widget-based dashboard view.
Reducing Trouble Tickets
SAN Performance Probe
- SAN monitoring solution with flexible thresholds and alerts
- Gathers switched fabric performance statistics
- Vendor-agnostic view with no impact on switch performance
Customer-averaged trouble ticket incident volume
- 900+ trouble tickets per month
- Number of tickets dropped by more than 65% within 3 months of VirtualWisdom installation
[Chart: trouble tickets per month (axis 0 to 1200) for Urgent+High, Medium, Low, and Total, with the VirtualWisdom installation marked]
Proving Tier II Storage is Not to Blame
- High read and write times (500 ms) with avg Command to First Data indicate problems caused by the Target Full on the SVC nodes
- There is a potential for Tier II SATA drive read and write times to be significantly improved once these issues are resolved.
- This will enable the user to migrate many of their applications to these lower cost SATA drives while still meeting their SLAs
Identify Slow Draining Devices
- With granular historical reporting and time-based correlation, VirtualWisdom uncovers the relationship between buffer-to-buffer credits and performance problems caused by link resets
- Buffer-to-buffer credits degrade, causing link resets
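The time-based correlation described above can be illustrated with a small sketch that flags ports where a credit-starved interval coincides with link resets. The field names (`zero_credit_pct`, `link_resets`), the 10% limit, and the sample data are assumptions made for illustration. They are not VirtualWisdom metric names.

```python
def slow_drain_suspects(port_samples, zero_credit_pct_limit=10.0):
    """Flag ports whose credit-starved intervals coincide with link resets.

    port_samples: {port: [ {"interval": str,
                            "zero_credit_pct": float,  # % of interval at zero BB credits
                            "link_resets": int}, ... ]}
    Returns {port: [intervals where both conditions hold]}.
    """
    suspects = {}
    for port, samples in port_samples.items():
        hits = [
            s["interval"]
            for s in samples
            if s["zero_credit_pct"] > zero_credit_pct_limit and s["link_resets"] > 0
        ]
        if hits:
            suspects[port] = hits
    return suspects


# Hypothetical per-interval history for two array ports.
port_samples = {
    "array1:p1": [
        {"interval": "10:00", "zero_credit_pct": 1.0, "link_resets": 0},
        {"interval": "10:05", "zero_credit_pct": 35.0, "link_resets": 2},
    ],
    "array1:p2": [
        {"interval": "10:00", "zero_credit_pct": 0.5, "link_resets": 0},
        {"interval": "10:05", "zero_credit_pct": 12.0, "link_resets": 0},
    ],
}

suspects = slow_drain_suspects(port_samples)
print(suspects)
```

Requiring both conditions in the same interval keeps ports that are merely busy (credit pressure without resets) out of the list.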
Measure latency to find cause of slowdown
- Measuring performance by MB/sec provides no clue that the applications are slowing
- How many exchanges are left open at the end of every 1 second interval, showing a slowdown
- Measure latency in milliseconds to see useful data
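A minimal sketch of why MB/s alone can hide a slowdown: two one-second intervals carry the same throughput, but the second shows a much higher completion time and exchanges still open at the end of the interval. The record layout `(issue_s, complete_s, size_bytes)` and all figures are hypothetical.

```python
def summarise_interval(exchanges, start_s, end_s):
    """Summarise one interval from (issue_s, complete_s, size_bytes) records.

    complete_s is None while an exchange is still open.
    """
    # Exchanges that completed inside this interval.
    done = [(i, c, n) for i, c, n in exchanges if c is not None and start_s < c <= end_s]
    # Exchanges issued by the end of the interval but not yet completed.
    open_at_end = sum(
        1 for i, c, _ in exchanges if i <= end_s and (c is None or c > end_s)
    )
    mb_per_s = sum(n for _, _, n in done) / 1e6 / (end_s - start_s)
    mean_ect_ms = sum(c - i for i, c, _ in done) / len(done) * 1000.0 if done else 0.0
    return {"mb_per_s": mb_per_s, "mean_ect_ms": mean_ect_ms, "open_at_end": open_at_end}


MB = 1_000_000
exchanges = [
    # Healthy second: four 1 MB exchanges, about 2 ms each.
    (0.10, 0.102, MB), (0.30, 0.302, MB), (0.50, 0.502, MB), (0.70, 0.702, MB),
    # Slow second: same data volume, about 200 ms each, with a backlog building.
    (1.05, 1.25, MB), (1.25, 1.45, MB), (1.45, 1.65, MB), (1.65, 1.85, MB),
    (1.70, None, MB), (1.80, None, MB), (1.90, 2.30, MB),
]

first = summarise_interval(exchanges, 0.0, 1.0)
second = summarise_interval(exchanges, 1.0, 2.0)
print(first)
print(second)
```

Both intervals report 4 MB/s, yet the mean completion time rises from roughly 2 ms to roughly 200 ms and three exchanges remain open, which is the signal the slide refers to.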
Storage admin view
Prevent problems: finding excessive SCSI Reservation Conflicts
- Servers 6017 & 6018 show no performance issues on the HBA level.
- Array 1 Port 1 LUNs also show no problems at the LUN level.
- See end to end view and check for SCSI Reservation Conflicts. It reveals that both Servers 6017 & 6018 have cancelled transactions to Array 1 Port 1 partly due to SCSI Reservation Conflicts.
Reduce risk of storage consolidation
- VirtualWisdom User Defined Contexts
- Model effect of consolidation using actual production metrics
- Actual production metrics from baseline
- "What if" modeling shows effect
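The "what if" idea can be sketched by summing baseline throughput from the ports to be consolidated, interval by interval, and comparing the combined peak against the capacity of the target port. The port names, sample values, and the 800 MB/s capacity figure (roughly the nominal payload rate of an 8 Gb/s Fibre Channel link) are assumptions for illustration.

```python
def consolidated_utilisation(port_series_mbps, target_capacity_mbps):
    """Sum per-interval throughput across ports and report peak utilisation.

    port_series_mbps: {port: [MB/s per interval]}, all series time-aligned.
    Returns (combined_series, peak_utilisation as a fraction of capacity).
    """
    combined = [sum(values) for values in zip(*port_series_mbps.values())]
    peak = max(combined)
    return combined, peak / target_capacity_mbps


# Hypothetical baseline captured from production, three aligned intervals.
baseline = {
    "array1:p1": [120, 300, 180],
    "array1:p2": [90, 150, 260],
}

combined, peak_util = consolidated_utilisation(baseline, target_capacity_mbps=800)
print(combined)
print(f"peak utilisation: {peak_util:.1%}")
```

Summing time-aligned samples matters: adding the two individual peaks (300 and 260) would overstate the combined load, because the ports do not peak in the same interval.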
Virtual Server Probe summary dashboard
- Overall status of the server infrastructure
- Peak in server disk latency
Virtual Server Probe Summary
- Which server is causing the problem?
- Check the Virtual Server Probe (ProbeVM) Disk Trend dashboard
Virtual Server Probe Disk Trend
- Storage from the VMware admin point of view
- Latency, MB/s, IOPS, physical layer issues
- Identifies problem server
- More detail, actual figures
Virtual Server Probe Disk Trend
- Storage is slow: very high read latency.
- Server esx2 is having the problems
SAN Availability Probe Summary
- Metrics for the fibre channel fabric: HBA metrics, ISL metrics, storage port metrics
- So, there appear to be no physical layer issues
SAN Performance Trend
- ProbeFC8 collects data from the fibre channel links
- Exchange Completion Time widget
- HBA Queue Depth widget
SAN Performance Trend
- Server ESX2 Write ECT too high
- Queue Depth setting for ESX2: possible source of problems
Queue Depth tests
- Queue Depth = 4: ITL ECT < 2 ms; Transfer = 78 MB/s (low)
- Queue Depth = 8: ITL ECT = 3-5 ms (still ok); Transfer = 100 MB/s (great)
- Queue Depth = 32: Unstable ECT, not a good choice; not much transfer increase (100 MB/s vs. 104 MB/s)
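The trade-off in these tests can be expressed as a simple selection rule: among settings with stable ECT within an acceptable limit, take the one with the highest transfer rate. The sketch below reads the slide as pairing queue depth 4 with 78 MB/s and ECT under 2 ms, 8 with 100 MB/s and 3-5 ms, and 32 with 104 MB/s and unstable ECT. That pairing, the 5 ms limit, and the function name are assumptions.

```python
def pick_queue_depth(results, ect_limit_ms=5.0):
    """Pick the stable setting with the highest transfer rate within the ECT limit.

    Returns the chosen queue depth, or None if no setting qualifies.
    """
    candidates = [
        r for r in results
        if r["ect_stable"] and r["ect_max_ms"] <= ect_limit_ms
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["transfer_mb_s"])["queue_depth"]


# Test results as read from the slide (upper ECT bound used for each setting).
results = [
    {"queue_depth": 4, "transfer_mb_s": 78, "ect_max_ms": 2.0, "ect_stable": True},
    {"queue_depth": 8, "transfer_mb_s": 100, "ect_max_ms": 5.0, "ect_stable": True},
    {"queue_depth": 32, "transfer_mb_s": 104, "ect_max_ms": None, "ect_stable": False},
]

best = pick_queue_depth(results)
extra_gain = (104 - 100) / 100  # throughput gained by going from 8 to 32
print(best, f"{extra_gain:.0%}")
```

Going from 8 to 32 would add only about 4% throughput while making ECT unstable, so the rule settles on 8.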
Virtual Server Probe Summary dashboard
Confirmation that properly set HBA Queue Depth solves latency problem
Queue Depths Perf. - Summary
- VirtualWisdom was able to initially detect the problem and generate an alarm
- The server admin reviewed the ProbeVM dashboards and found a latency issue
- The storage admin verified that the fibre channel fabric was fine
- The storage admin correlated the latency problem with a high Queue Depth setting
- The storage admin then performed tests and determined that the optimum value was 8 for this link
Contact information
Corning
Anthony Robinson RCDD CDCDP, Data Centre Marketing Manager, EMEA
robinsonam@corning.com
+44 7785 518263
Virtual Instruments
Alex D'Anna, Director of Solutions Consulting, EMEA
Alex.danna@virtualinstruments.com
+44 7850 057756
Next webinar
Thursday 18th April, 13:00 GMT / 14:00 CET
http://cablesystems.corning.com/3-porttap.html