SAN Audit Report For ABC Company

24 February 2010

Table of Contents

SAN Audit
    Introduction
    Purpose and Objectives
SAN Capacity
    Capacity Planning
    Link Utilisation, Congestion and Bottlenecks
    Capacity Metrics for this Audit
SAN Performance
    Managing Latency
    Latency Metrics for this Audit
Storage Responsiveness
    Response Metrics for this Audit
Disk Performance
    Disk Metrics for this Audit
SAN Configuration
    Load Balancing
    Managing Queue Depths
    Pending Exchange Metrics for this Audit
Connectivity Issues and Problem Events
    Physical Layer Errors
    Failed Communication and Transactions
    Issues Monitored during the Audit
    Switch Conditions Reported during the Audit
Key Findings and Recommendations
    Key Findings
    Recommendations
Appendix
    Condition Descriptions
    SCSI Status Check Conditions
    Code Violations
    Loss of Sync
    Loss of Signal
    Link Credit Reset
    Bad Status
    Abort Sequence
    CRC and Frame Errors
    Class 3 Discards

SAN Audit Report

Introduction

This document details the results of a GCH SAN Audit carried out at ABC Company in the UK. The instrumentation was deployed and the sampling period ran for two weeks in 2010. The main focus was to monitor two storage systems at each site, for one week at a time, and to monitor the inter-site links. The first week was spent monitoring the two storage systems in the main data centre and the two inter-site links; the second week was used to monitor the two storage systems at the second data centre. The main purpose of the two weeks of monitoring was to carry out a general health check, as detailed in this report, and to review any issues that were raised.

This report contains several sections: this introduction, SAN Capacity, SAN Performance, Disk Performance, SAN Configuration, Connectivity Issues and Problem Events, Key Findings and Recommendations, and an Appendix describing some of the terms used in the report. Most of the graphs show the top ten to twenty events in each category rather than all events, as showing every event is not necessary. In some cases more detailed reports have been used to zoom in on events.

Purpose and Objectives of this Report

The main purpose of this SAN Audit was to monitor multiple storage ports on the storage systems in order to verify the performance and stability of the SAN implementation and to highlight any areas of concern. The data collected provides a baseline of server and storage infrastructure operation which may be used for capacity planning, performance adjustments, and assessing the current state of possible error conditions. This report also provides a starting point for troubleshooting any discovered issues. Towards the end of the report is a summary of the key findings as well as recommendations for maintaining the ongoing operational health of the SAN.

SAN Capacity

Capacity Planning

Proper network capacity planning can help maintain networks in optimal working order, reduce the risk of outages due to resource limitations, and justify future networking needs. This information can be used to make both short-term and long-term decisions.

It is important to look for patterns that occur at various times of day. There are often the equivalent of rush-hour periods where traffic slows due to significantly increased demand. It is a good idea to examine these periods for workloads that can be moved to other, less busy, times of day. Look for activities such as batch processing or backups that could be rescheduled. It is also very common for regular activities to remain scheduled that are no longer necessary, for example a report that is no longer used but is still configured to run periodically. Simply turning off that report may yield performance improvements for all of the applications that share the same resources.

There are also medium-term capacity planning activities, such as refining traffic routing to eliminate bottlenecks and take advantage of under-utilised links or storage ports. Longer-term planning should verify which portions of the network designs and configurations have worked best for the specific environment and demands. For this activity it is important to compare capacity, performance and configurations.

Link Utilisation, Congestion and Bottlenecks

Link utilisation can significantly impact the overall performance of applications dependent on the SAN. Links congested at as low a level as 70%, sustained for a minute or less, can cause transaction times that are 1000 times longer than average transaction times on less busy links. Although an occasional spike may occur, prolonged periods of high bandwidth usage will adversely affect the performance of your applications. While congestion on server or storage ports usually impacts only a few applications, Inter-Switch Links (ISLs) can become bottlenecks that affect the performance of multiple applications simply because one of the applications utilising the links is busy. When small transactions that typically complete in 4 ms take over 4 seconds to complete, applications can appear to hang temporarily, causing end-user frustration and dissatisfaction. In worst-case scenarios, if response times are allowed to build up for too long, application or network timeouts can occur, causing application-level errors.

Often, busy networks with congested links also have links that are under-utilised. This discrepancy in demand across the links causes networks to perform slowly and over time can lead to application-level performance dissatisfaction or, in some cases, even network outages.
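As a rough illustration of the 70% guideline above, the short Python sketch below (not part of the audit tooling; the link names, link speed and readings are hypothetical) converts sampled throughput into a utilisation figure and flags links that stay above the threshold for a sustained run of samples:

    # Minimal sketch: flag links whose utilisation stays above a threshold
    # for a sustained run of samples. Link speeds and sample data are
    # hypothetical; a 2 Gb/s FC link carries roughly 200 MB/s of data.
    LINK_CAPACITY_MBPS = {"ISL1": 200.0, "ISL2": 200.0}   # assumed 2Gb/s links
    THRESHOLD = 0.70          # 70% utilisation
    SUSTAINED_SAMPLES = 6     # e.g. six 10-second samples, roughly one minute

    def congested_periods(samples, capacity_mbps):
        """samples: list of throughput readings in MB/s, one per interval."""
        run, flagged = 0, []
        for i, mbps in enumerate(samples):
            if mbps / capacity_mbps >= THRESHOLD:
                run += 1
                if run >= SUSTAINED_SAMPLES:
                    flagged.append(i)   # index of the sample ending a congested run
            else:
                run = 0
        return flagged

    # Example usage with made-up readings for ISL1:
    readings = [120, 150, 160, 170, 175, 180, 185, 90]
    print(congested_periods(readings, LINK_CAPACITY_MBPS["ISL1"]))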

Capacity Metrics for this Audit

The graph below shows the SAN usage over the monitored period. The conclusion from the graph is that the majority of the traffic was monitored on ISL1, followed by traffic between the two storage systems. It can be seen that there is very little traffic on ISL2 compared to ISL1.

The graph below shows the SAN capacity in MB/sec over the same period. This shows that the high loads appear on ISL1, which was only monitored during the first week, but also on one of the storage systems at the main data centre. The levels of traffic are lower for the second week of monitoring.

SAN Performance

The performance of a SAN is often far more important than many SAN administrators realise. While reliability of the data and prevention of outages are rightfully top concerns, performance is often what impacts the application users the most, and in worst-case scenarios it can cause applications to be unusable or even lead to network outages. Although network capacity issues such as congestion and bottlenecks often cause poor network performance, there are other, less well known, causes. These other performance issues often occur on networks that are under-utilised, so it is important to consider performance as a key factor in its own right.

Managing Latency

Latency of the SAN is measured using Exchange Completion Times (ECT). ECT is the measurement from the time a command is sent to the time it is fulfilled. There are many factors that can affect latency across the SAN: HBAs, servers, number of hops, switching, disk speeds, interfaces, configuration issues (both by design and by error) and device incompatibility, as well as a host of other possible issues. Because of this multitude of factors, each SAN tends to have its own latency range. The general consensus for optimally performing networks is that maximum latencies should be less than 1000 ms (one second) and that a reasonable average would be less than 40 ms. Knowing what your latency is can help with finding the problem areas on your SAN. If a link spikes in latency every once in a while, that can be a normal occurrence, but if you see prolonged periods of high latency then there is an issue that needs to be addressed, whether through reconfiguration, re-routing or replacing the equipment that is causing the problem.
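The following Python sketch (illustrative only; the link names and ECT samples are made up) summarises exchange completion times per link against the guideline figures quoted above, an average below 40 ms and a maximum below 1000 ms:

    # Minimal sketch: summarise ECT samples per link and compare against the
    # guideline figures (average < 40 ms, maximum < 1000 ms).
    from collections import defaultdict

    GUIDE_AVG_MS, GUIDE_MAX_MS = 40.0, 1000.0

    def ect_report(samples):
        """samples: iterable of (link_name, ect_ms) tuples."""
        by_link = defaultdict(list)
        for link, ect_ms in samples:
            by_link[link].append(ect_ms)
        for link, values in by_link.items():
            avg, worst = sum(values) / len(values), max(values)
            status = "OK"
            if worst > GUIDE_MAX_MS or avg > GUIDE_AVG_MS:
                status = "INVESTIGATE"
            print(f"{link}: avg={avg:.1f} ms max={worst:.1f} ms -> {status}")

    # Example usage with hypothetical readings:
    ect_report([("ISL1", 3.2), ("ISL1", 1450.0), ("ISL2", 6.5), ("ISL2", 12.0)])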

Latency Metrics for this Audit

The average exchange completion times were below 4 ms for the two weeks of monitoring, which is very good, with the longer responses shown on ISL2. The average times give an overall indication of latency and response performance, but any issues would be indicated by the maximum times.

While the average ECTs were within the expected range, there were some exchange completion times that exceeded the recommended 1 second, as shown in the graphs below. The largest peaks for both Reads and Writes occurred around 12:00 pm on the 8th of February, although there are multiple peaks throughout the week for both Reads and Writes that exceed the recommended time.

Storage Responsiveness

Storage system responsiveness is measured by looking at the Command to First Data metric. Command to First Data is the measurement from the time a request to read data is sent to the time the first data frame is received. This metric gives a clear indication of the responsiveness of the target end devices. Tracking it over time allows performance to be compared against earlier data, showing any degradation. Slow responsiveness can indicate failing devices, poor LUN configurations, maintenance activities, or heavy random access of data versus sequential access. Monitoring responsiveness can help uncover these issues before they become serious.

Response Metrics for this Audit

Whereas the graphs above (Maximum Read/Write Exchange Completion Times) relate to exchange completion times, the graph below details the response time of the various storage systems. The maximum Command to First Data graph shows that the peaks tie in with the responses shown in the maximum exchange completion time graphs, indicating that the delays were due to the responsiveness of the storage.
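As a small illustration of how these two metrics relate, the Python sketch below (hypothetical timestamps, not audit data) derives both Command to First Data and Exchange Completion Time for a single read exchange:

    # Minimal sketch: derive Command to First Data (CTFD) and Exchange
    # Completion Time (ECT) for a read exchange from three timestamps
    # (all hypothetical, in seconds): when the SCSI Read command was seen,
    # when the first data frame was seen, and when the status frame was seen.
    def exchange_metrics(cmd_ts, first_data_ts, status_ts):
        ctfd_ms = (first_data_ts - cmd_ts) * 1000.0
        ect_ms = (status_ts - cmd_ts) * 1000.0
        return ctfd_ms, ect_ms

    ctfd, ect = exchange_metrics(10.000, 10.018, 10.025)
    print(f"CTFD = {ctfd:.1f} ms, ECT = {ect:.1f} ms")   # CTFD = 18.0 ms, ECT = 25.0 ms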

Disk Performance

Utilisation of storage volumes is another key factor in overall application performance. Over-utilised storage ports have several effects on the applications that access them. One impact is simple network congestion, which can affect performance just as it can on any other network link. Another is the introduction of non-sequential reads and writes of data. This can greatly impact performance, as enterprise-class storage systems all have algorithms for caching and reading that work best when data is accessed sequentially. Most storage systems can actually optimise simultaneous sequential reads from multiple servers, although there is a limit to how many simultaneous sequential reads can be handled before the reads are treated as non-sequential. Finally, over-utilised storage ports can suffer from issues related to command queuing or physical queue depth limits. This is looked at further in a later section.

A general rule of thumb is to distribute the load evenly across storage controller ports and to avoid placing peak-load storage access (such as backups) on the same storage controller port.

Disk Metrics for this Audit

The graph below shows the top ten storage LUNs accessed during the two-week period. The highest values are either on ISL1 or on LEVA01 to LEVA02 transactions, indicating that the replication between the EVAs is the biggest load on the LUNs.

SAN Configuration

Two important areas of configuration that this report focuses on are load balancing and queue depths. These are two of the most important configurations in any SAN, because improper configuration can lead to data corruption or to network outages that would normally be prevented by network redundancy. They are also important for ensuring optimum network performance.

Load Balancing

Load balancing is a way of proportioning the load across redundant paths. For example, when one path is running at 80% capacity but another is running at 3%, you run the risk of congestion and performance slow-downs for your applications. More importantly, in the event of a hardware failure there may be no redundant path for the communication, and this can result in an outage. There are two main types of load balancing: multi-path HBA load balancing and storage link level load balancing. If all the servers are running a dual-port HBA or two HBAs, the goal should be to have both HBAs carrying I/O loads within a certain range of equality. For the storage ports it is good practice to have the servers sharing the ports rather than all the servers targeting the same storage port and leaving the others relatively under-utilised. Both of these then allow for proper planning when it comes to adding more servers to the SAN. If any dual-pathed server, ISL, or storage port is not properly configured then it becomes a single point of failure, as any failure of the only active port would cause a complete outage to the applications supported by these devices.

Managing Storage Queue Depth

Queue Depth is the physical limit of exchanges that can be open on a storage port at any one time. The Queue Depth setting on an HBA specifies how many exchanges can be sent to a LUN at one time. In order to prevent the storage port from being over-run with data it is important to consider both the number of servers connecting to a storage port and the number of LUNs available on that port. By knowing the number of exchanges that are pending at any time it is possible to manage the storage queue depths.

In order to properly manage storage queue depths, one must consider both the configuration settings at the host bus adapter (HBA) in a server and the physical limits on the storage arrays. It is important to determine what the queue depth limits are for each storage array; all of the HBAs that access a storage port must be configured with this limit in mind. Most HBA vendors' default allocation is set to 8. If you set the queue depths too low on the HBA it can significantly impair the HBA's performance and lead to under-utilisation of the storage port (i.e. under-utilising storage resources). This occurs both because the network will be under-utilised and because the storage system will not be able to take advantage of the caching and serialisation algorithms that greatly improve its performance. At the opposite end of the spectrum, setting the queue depths too high can overrun the maximum queue depth capacity of the storage port, possibly introducing corruption and/or data loss. Queue depth settings on HBAs can also be used to throttle servers so that the most critical servers are allowed greater access to the necessary storage. A worked example of this sizing arithmetic is sketched below.
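The Python sketch below is that worked example (all figures are hypothetical; use the per-port queue depth limit documented by your storage vendor). It sizes a per-LUN HBA queue depth so that the combined outstanding commands from all servers cannot exceed the storage port's limit:

    # Minimal sketch: size the per-LUN HBA queue depth so that the sum of all
    # possible outstanding commands stays within the storage port's limit.
    # All numbers are hypothetical examples, not values from this audit.
    def max_hba_queue_depth(port_queue_depth, servers, luns_per_server):
        """Largest per-LUN queue depth that cannot oversubscribe the port."""
        return max(1, port_queue_depth // (servers * luns_per_server))

    port_limit = 2048        # assumed storage port queue depth
    servers = 32             # servers zoned to this storage port
    luns_per_server = 8      # LUNs each server accesses on this port

    depth = max_hba_queue_depth(port_limit, servers, luns_per_server)
    print(f"Configure each HBA with a per-LUN queue depth of at most {depth}")
    # 2048 / (32 * 8) = 8, which happens to match the common vendor default of 8.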

Pending Exchange Metrics for this Audit

The graph below shows the pending exchanges for the period of the SAN audit. It shows that the pending exchanges peaked at about 190 over the whole period. Large numbers of pending exchanges can cause severe delays in response and exchange times, especially when the load is high. It can be seen from the graph that the large numbers of pending exchanges occurred during the first week of monitoring, on the EVAs at the main data centre. These values are very high and are probably the cause of the slow responses detailed in the previous section of this report.

Connectivity Issues and Problem Events

Connectivity is the ability of devices to communicate clearly with other devices on a network. Most SANs will have periods of connectivity events that occur because of occasional maintenance, when cables are pulled and equipment is shut down or rebooted. These events are fine as long as they can be attributed to an action that was taken. It is important to keep a detailed record of any changes made to the SAN so that connectivity issues can be attributed to those changes; if they cannot, the issue will most likely need further investigation. Important disruptions to connectivity can be identified by the following events: CRC Errors, Aborts, Fabric Logins and Logouts, Basic Link Services, and SCSI Bad Status or Check Conditions.

Physical Layer Errors

Physical layer errors are errors that exist at the FC1 layer. For this section we report on the following metrics: Code Violations (CV), Loss of Sync (LoS), Loss of Signal, and Frame Errors. These are very basic primitive metrics that can let you know when something as simple as a cable was pulled or a transceiver has lost its light. In fact, something as simple as a pulled FC cable will create millions to billions of CV errors. These events and errors will occur from time to time when equipment is moved or reconfigured, but there should always be a correlating change control entry; otherwise there could be a real problem.

Failed Communication and Transactions

To detect errors at the Initiator, Switch or Target level, it is usual to look for the following metrics: basic errors and alarms, Aborts, rejects, busy signals, and SCSI Bad Status or Check Conditions. Any of these errors on your SAN can be a sign of a bigger issue, so they need to be investigated. For example, you should not see aborted frames (Aborts) or bad SCSI status on your fabric, and you will want to investigate the reason for these types of errors. SCSI bad status or check conditions are reported at the SCSI level within the device, whereas aborts (ABTS) are reported at the Fibre Channel level. Some other types of error message are warnings that a condition has changed with a SCSI device. The quantity and severity of the SCSI check condition messages should give you an idea of where to investigate. SCSI Status Check Conditions can be caused by changes implemented within the SAN, but if there is no record of changes that tie up with the Check Condition then there is probably an issue at the device level, and this should be investigated.

Issues Monitored by the FC Probes during the Audit

The following conditions were reported by the Fibre Channel probes over the monitored period. The events "Loss of Sync Events", "Loss of Signal Events", "Extended Link Services Frames", "Fibre Channel Service Frames", "Fabric (SOFf) Frames", "Basic Link Services Frames", "#Check Condition Status Frames", "#Other Bad Status Frames", "Task Management Frames / Sec", "Logins", "Logouts", "Abort Sequence Frames" and "Accepts" were detected on one or more of the following devices (Link Name: Probe Name): ISL1: ISL1, LEVA01_CN2_FD2: LEVA02_CN2_FD2, LEVA01_CN1_FD2: LEVA02_CN1_FD2, LEVA01_CN1_FD1: LEVA02_CN1_FD1, LEVA01_CN2_FD1: LEVA02_CN2_FD1 and ISL2: ISL2. Some of these conditions may be normal for this SAN environment and others may require further investigation. Each of the issues listed above is explained in more detail in the following graphs.

The graph below shows Loss of Sync events detected by the probes for the two-week monitoring period. As can be seen from the graph, there was a single event, which was most likely caused by the ISL switch port being reset.

The graph below shows Loss of Signal events detected by the probes for the two-week monitoring period. As can be seen from the graph, there was one event, which coincides with the Loss of Sync event shown in the graph above.

The graph below shows Extended Link Service events for the two-week monitoring period. As can be seen from the graph, the level of these events is consistent; it is due to the communication between the storage systems and is normal behaviour for this type of storage system.

The graph below shows Fibre Channel Service frames for the week. As can be seen from the graph, the level of FCS frames is consistent throughout the week for all the devices monitored and is expected, as the storage systems communicate with each other on a regular basis using FCS frames.

The graph below shows Start of Frame (Fabric) frames for the week. As can be seen from the graph, the level of these frames is consistent throughout the first week, which ties in with the monitoring of the ISLs, with no events during the second week, as expected.

The graph below shows Status Check Condition events for the two-week monitoring period. There were many of these events over the ISLs during the first week of monitoring, with no events during the second week.

The graph below shows other Bad Status events for the two-week monitoring period. As can be seen from the graph, the level of these events is consistent with the Status Check Conditions shown above for the ISLs.

The graphs below show Logins and Logouts for the two-week monitoring period. Devices should not log in to or log out from each other in a stable SAN unless a problem has occurred or a device cannot communicate with all the devices it needs to in order to operate correctly.

The graph below shows Abort Sequence events detected by the probes for the two-week monitoring period. As can be seen from the graph, these events occurred during both weeks of the monitoring period. Abort Sequences indicate that exchanges have not completed and have to be re-sent; they should not occur in a stable SAN without bottlenecks.

Issues Detected by the FC Switches during the Audit

Switch statistics are an indication of conditions or issues that occur on the SAN, although they are not always 100% accurate. Whereas the probes were used to monitor the storage ports and ISLs, the switches were monitored to see whether any issues or conditions were reported for each of the switch ports. Any conditions reported by the probes should tie in with conditions reported on the switch ports connected to the storage ports and the ISLs.

The following conditions were reported by the Fibre Channel switches over the monitored period. The events "Loss of Sync Events", "Loss of Signal Events", "Link Resets", "Link Failures" and "Class 3 Discards" were detected on one or more of the following devices (PortNumber: PortModuleNumber: Link Name: Probe Name): 12: N/A: LFCSW7::12: LFCSW7, 18: N/A: LFCSW12::18: LFCSW12, 18: N/A: LFCSW11::18: LFCSW11, 2: N/A: LFCSW10::2: LFCSW10, 2: N/A: LFCSW9::2: LFCSW9, 11: N/A: LFCSW7::11: LFCSW7, 2: N/A: LFCSW12::2: LFCSW12, 8: N/A: LFCSW7::8: LFCSW7, 2: N/A: LFCSW11::2: LFCSW11 and 8: N/A: LFCSW8::8: LFCSW8.

The graph below shows Loss of Sync events as detected by the switches for the two-week monitoring period. As can be seen from the graph, these events occurred during both weeks of the monitoring period on three different ports.

The graph below shows Loss of Signal events as detected by the switches for the two-week monitoring period. As can be seen from the graph, these events occurred during both weeks of the monitoring period on three different ports and appear to tie in with the Loss of Sync events in the previous graph.

The graph below shows Link Reset events as detected by the switches for the two-week monitoring period. As can be seen from the graph, these events occurred during both weeks of the monitoring period on different ports, and some events appear to tie up with the LOS/LOSIG events. Link resets will also occur when the credit between two ports needs resetting.

The graph below shows Link Failure events as detected by the switches for the two-week monitoring period. As can be seen from the graph, there are two events on the same port at the same time.

The graph below shows Discard events as detected by the switches for the two-week monitoring period. As can be seen from the graph, there were two discard events reported on the same port at the same time.

Key Findings and Recommendations

Key Findings

From the statistics collected by the monitoring probes over the period of the GCH SAN Audit it is clear that the SAN has some issues which need to be addressed, as well as issues that may require further investigation.

The first issue is the loading of the ISLs between the data centres. It is clear from the report that most of the data is transmitted over ISL1 and that the load on this ISL was very high at various times during the audit. Another issue that should be investigated is the loading between the storage systems. The graph of SAN capacity shows that the largest load is between sites over ISL1, followed by exchanges between storage port 1 on site 1 and storage port 1 on site 2. Load balancing should be investigated on the storage systems, or application data could be moved from one to another to balance out the load.

In addition to the loading issues above, the queue depth settings should be reviewed at the main data centre, as the pending exchanges peak around 190 and appear to average around 130. From the graph it appears that the minimum number of pending exchanges at any time during the first week of monitoring is around 40. The maximum exchange times may be caused by the large number of open exchanges on the first storage system at the main data centre causing some exchanges to time out. The Abort Sequences and Status Check Conditions may be caused by these exchanges being unable to complete due to the high load on the first storage system. Until the loading issue is resolved it is difficult to determine what is causing the ABTS and Status Check Conditions without connecting an analyser.

There were loss of sync and loss of signal issues detected by the probes and reported by the switches. These should be checked to see whether they were caused by human intervention or are issues that require further investigation.

The probes reported high Bad Status counts during the first week of monitoring, as well as steady counts on one interface during the second week. By reviewing the SAN Summary Report generated by the probe it can be seen that the bad status counts were caused by thousands of reservation conflicts, which should be investigated further.

Recommendations

The first recommendation is to purchase four 4Gb single-mode (long wave) SFPs for the inter-site ISL connections, as the current switches will support the 4Gb interface. This will reduce the risk of running out of bandwidth between the sites; in addition, the loading of the ISLs should be investigated to see if the load can be balanced between the two.

The loading of the storage systems should be reviewed to see if they can automatically balance the load. An alternative would be to manually move the locations where the applications store their data, and to move some of the applications, in order to spread the load.

Looking at the reports generated, it is clear that there are exchanges that do not complete and therefore have to be re-sent (Status Check Conditions and Abort Sequences). These are most likely due to the number of open exchanges at any one time, or to queue depth. Some consideration should be given to limiting the queue depth in order to speed up the delivery of data.

The indicated loss of sync, loss of signal and link reset events should be examined to see whether they coincide with manual changes to the SAN, such as resetting switch ports or changing cables. If not, then further investigation should take place to ascertain the cause of these events.

The reason for the thousands of reservation conflicts needs to be investigated, to see whether these connections were blocked due to existing reservations or by other means. The only way to discover the reason for the reservation conflicts would be to connect an analyser to the links.

It would be useful to carry out another SAN Audit once changes have been implemented, to see how the performance and SAN metrics have changed. It would also be useful to connect an analyser to examine the cause of the bad status, abort sequences and status check conditions.

Appendix

Condition Descriptions

The following is a brief list describing some of the events reported by the monitoring probes and Fibre Channel switches.

SCSI Status Check Condition

This set of errors/warnings occurs when the indicated Target/LUN has returned a SCSI Status frame indicating a Check Condition (status value = 0x02) to the Initiator. Check Conditions are used for a variety of purposes in SCSI; some Check Condition responses are expected in response to certain exchanges (i.e. on the first command following a Bus Reset / PLOGI from a device). Other Check Condition status frames are used to indicate problems in framing or signalling, or ULP errors.

In Fibre Channel, devices include the SCSI Sense Data with the Check Condition status. The Sense Data gives useful information about why the Check Condition occurred. There are three very important pieces of information delivered in the Sense Data contained within the status frame: the Sense Key (SK), Additional Sense Code (ASC), and Additional Sense Code Qualifier (ASCQ). The Sense Key value indicates the state of the exchange (Aborted, Recovered Error, Hardware Error, etc.) and the ASC/ASCQ values are coupled together to give a valid reason supporting the Sense Key value (such as Parity Error, Power On Reset Detected, etc.).

Some common problematic reasons for Check Conditions in Fibre Channel are Parity Errors or Data Phase Errors. According to PLDA rules, FC-AL Targets are not allowed to transmit P_RJT frames in response to bad data from an Initiator. In these cases, the Target will usually wait for its next turn to send data (i.e. wait for TSI in the F_CTL field) and respond with a Check Condition with the Sense Key set to 0x0B (Aborted Command) and the ASC/ASCQ set to 0x47/0x00 (Parity Error). The Status - Check Condition with Bad Sense Key error occurs when an undefined or reserved Sense Key value is used in the SCSI Sense Data.

Code Violations

These errors occur when the monitoring or analysis device has detected and flagged a code violation on the physical transmission layer of the network. A Code Violation (CV) is generally measured in an 8b/10b system as a word that has a valid KChar (i.e. K28.5) but ends in a disparity error, or whose bytes following the KChar do not represent a valid ordered set or primitive sequence. Code Violations only occur (by design) on the local link segment, since ordered sets and primitives are not forwarded throughout the SAN.

It is also important to note that many components can be involved when an error occurs on the physical layer. Generally, between two devices connected together in a point-to-point fashion, there are six potential points at which errors can occur (ten if you add an analyser in-line). These are:

1. From the Fibre Channel ASIC to the SERDES on either device.
2. From the SERDES to the physical transmitter (generally a GBIC, SFP, XFP or fixed media transmitter) on either device.
3. On either transmit wire between the devices.

Loss of Sync

This error occurs when a loss of synchronisation condition is detected on the physical transmission layer of the network. A Loss of Sync (LOS) is generally measured in an 8b/10b system as a run of three continuous words of bad KChars, incorrect disparity, code violations, or missing KChar values (unframed data). LOS events only occur (by design) on the local link segment, since ordered sets and primitives are not forwarded throughout the SAN.

It is also important to note that many components can be involved when an error occurs on the physical layer. Generally, between two devices connected together in a point-to-point fashion, there are six potential points at which errors can occur. These are:

1. From the Fibre Channel ASIC to the SERDES on either device.
2. From the SERDES to the physical transmitter (generally a GBIC, SFP, SFP+, XFP or fixed media transmitter) on either device (at one end or the other).
3. On either transmit wire between the devices.

Loss of Signal

Whereas Loss of Sync occurs when light (signal) is being received but the receiver is unable to synchronise with the received bit stream, Loss of Signal occurs when there is no light (signal) detected at all.

Link Credit Reset

This warning occurs when a stream of three or more continuous Link Reset (LR) primitives appears on the link. It is important to think of the LR as a Link Credit Reset, not just a Link Reset. The primary function of Link Credit Reset is to reset the outstanding credit balance between two fabric ports. If coupled with NOS or OLS, these are generally part of the link reset process. However, when an LR is used without NOS or OLS, it generally indicates that an out-of-credit situation has occurred, followed by an R_A_TOV timeout. When an N_Port or F_Port (device port or switch port) cannot transmit frames due to a lack of credits received from the destination port, it can use LR to reset the credit balance. These errors also output, in the Value field, the Credit Offset of the other channel (that is, how many frames this channel had transmitted without receiving credits back from the other channel) at the time of the reset. This can be very useful for debugging out-of-credit situations. If an LR is received with a Credit Offset value other than zero on the other channel, the reset is generally due to lost credits or to a frame transmission timeout.

SCSI Bad Status

This error occurs when the indicated Target/LUN has returned a SCSI Status to the Initiator that is either undefined or invalid. The valid SCSI Status codes are:

    Value   Status
    00h     Good
    02h     Check Condition
    04h     Condition Met
    08h     Busy
    10h     Intermediate
    14h     Intermediate-Condition Met
    18h     Reservation Conflict
    22h     Command Terminated (Obsolete)
    28h     Queue Full (a.k.a. Task Set Full)
    30h     ACA Active
    40h     Task Aborted

All other codes are reserved.

Abort Sequence (ABTS) for Pending Exchange

This error occurs when any frame is seen with an R_CTL value of 0x81 (ABTS) for an exchange that is currently open (pending). In Fibre Channel, ABTS frames are most commonly used to abort exchanges that have either timed out or encountered some other error. Since the exchange encountered an error, ABTS is used to terminate it, and the Initiator should then take appropriate steps to either retry or abandon the exchange. In some cases, an Initiator will send ABTS for each outstanding exchange when only one exchange encounters a problem; in these cases it is best to examine each of the exchanges that have been terminated with ABTS. In most cases, however, the Initiator (or Target) will send ABTS only for the errant exchange or the exchanges that have timed out.

ABTS frames usually occur when there are signalling and framing errors (i.e. CRC errors) present in the fabric. Depending upon the devices, recovery from an ABTS can take anywhere from milliseconds to a minute or more. ABTS frames can also occur where timeout conditions arise. If a device has an inactive open sequence for more than a Sequence Timeout Value (SEQ_TOV), it will generally transmit an ABTS to terminate the sequence. However, in many situations the timeout comes after an Upper Layer Protocol Timeout Value (ULP_TOV). This occurs when single-frame sequences (such as SCSI Command, Transfer Ready, or Status frames) are lost or delivered with incorrect CRC values. It can also occur when the last frame in a sequence is lost or has a bad CRC. The only recourse is to wait for a ULP_TOV and then transmit ABTS.

In any situation where an ABTS occurs, there is some sort of error that needs to be checked. Watch for CRC errors, out-of-order delivery, missing frames, inability to transmit frames, etc. Use the S_ID, D_ID and OXID values to examine the exchange, frame by frame.
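To make the status codes and sense-data fields described above more concrete, the following Python sketch (illustrative only, not part of the audit tooling) maps a SCSI status byte to its name and, for a Check Condition, extracts the Sense Key, ASC and ASCQ from fixed-format sense data:

    # Minimal sketch: decode a SCSI status byte and, for Check Condition,
    # the Sense Key / ASC / ASCQ from fixed-format sense data (response
    # codes 0x70/0x71). The sample bytes below are made up for illustration.
    SCSI_STATUS = {
        0x00: "Good", 0x02: "Check Condition", 0x04: "Condition Met",
        0x08: "Busy", 0x10: "Intermediate", 0x14: "Intermediate-Condition Met",
        0x18: "Reservation Conflict", 0x22: "Command Terminated (Obsolete)",
        0x28: "Queue Full (Task Set Full)", 0x30: "ACA Active", 0x40: "Task Aborted",
    }

    def decode(status, sense=b""):
        name = SCSI_STATUS.get(status, "Reserved/invalid (Bad Status)")
        if status == 0x02 and len(sense) >= 14 and (sense[0] & 0x7F) in (0x70, 0x71):
            sk, asc, ascq = sense[2] & 0x0F, sense[12], sense[13]
            return f"{name}: SK=0x{sk:02X} ASC/ASCQ=0x{asc:02X}/0x{ascq:02X}"
        return name

    # Example: Check Condition with SK 0x0B (Aborted Command), ASC/ASCQ 0x47/0x00
    sense = bytes([0x70, 0, 0x0B, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0x47, 0x00])
    print(decode(0x02, sense))
    print(decode(0x18))   # Reservation Conflict, the status seen in this audit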

CRC and Frame Errors

These are framing errors that can occur on any link with media or transmission problems. The framing errors checked for include: bad or missing CRC, bad or missing SOF/EOF values, improperly truncated frames (i.e. jabber or runt frames), and EOFa, EOFni and EOFdti frames.

The Improperly Truncated/Bad Frame error indicates that the frame did not have enough bytes to fill the SOF (1 word) and Fibre Channel header (6 words) and so is considered invalid. The source and/or destination address in these frames may not be valid, depending upon the amount of data that is present in the frame. The Bad or Missing SOF/EOF Delimiter errors indicate frames that have SOF or EOF delimiters that are not recognised as valid values. In the case of bad or missing SOF/EOF delimiters or improperly truncated/bad frames, the error usually occurred on the local link. These frames can be common during Link Reset events, where frame transmission is interrupted by recovery.

The EOFa, EOFdti and EOFni frames are special-case CRC errors: bad frames that have passed through another device (or devices) before reaching the final destination. These frames have generally had the bad CRC fixed, but the new EOF indicates that the frame should be discarded by the final destination. In the case of an EOFa, the frame error occurred while the frame was in transmission. This means that one port received the frame fine and the next port down the line (still not the final destination) received the frame with a CRC error. That port then knows that there is a port between the source of the frame and itself, so the error must have occurred during transmission between the two ports. The second port then fixes the CRC, places an EOFa delimiter on the frame and transmits it towards its final destination. These are most common in switched fabric environments between two switch ports (whether internal or external). Some devices will transmit frames to themselves with an EOFa to remedy an internal error condition or to clear frames from their transmit buffers.

In the case of an EOFni or EOFdti, the frame error occurred between the source port and the receiving port; that is, the switch received the frame from the source port with a CRC error. The switch port then fixes the CRC, modifies the EOF to either an EOFni (Class 2 or 3) or EOFdti (Class 1) and transmits it towards its final destination. When seeing CRC errors of this nature, it is best to relocate the analyser to the port that the source is connected to.

It is important to note that many components can be involved when a CRC error or other bad frame transmission occurs. Generally, between two devices connected together in a point-to-point fashion, there are six potential points at which errors can occur (ten if you add an analyser in-line). These are:

1. From the Fibre Channel ASIC to the SERDES on either device.
2. From the SERDES to the physical transmitter on either device.
3. On either transmit wire between the devices.

If you add the analyser in-line, you add many more degrees of complexity when debugging these issues. The additional components required to analyse in-line are:

1. Two GBICs/SFPs.
2. One more cable, in which either transmitting wire can fail.

Keeping this in mind can be of great assistance when debugging physical wire issues.

Class 3 Discards (Switch Dropping Frames)

A discard occurs within a switch when a frame is received into an internal buffer but the switch is unable to route the frame out within a predetermined time. In this instance the switch will throw away (discard) the frame to avoid becoming completely congested and blocking data. If the switch discards one or more frames, it is usual for one of the devices involved in the transfer of the frame (source or destination) to issue an Abort Sequence (ABTS) or a Status Check Condition, after which the frame and its associated data should be re-sent.