Dell EMC Unity: Performance Analysis Deep Dive
Keith Snell, Performance Engineering, Midrange & Entry Solutions Group
Agenda
- Introduction
- Sample Period
- Unisphere Performance Dashboard
- Unisphere uemcli command line
- Performance Archives
- Summary
Introduction
Introduction
Three uses for performance data:
1. Health check: performance metrics provide the ability to determine how efficiently the system is servicing user requests. Independent of block or file activity, the storage processors and disks are common contributors to performance and give us a first look at system health.
2. Capacity planning: checking current resource utilization. Can we incrementally add workload to existing resources? Can we add hardware and workload to the system?
3. Troubleshooting: object-specific performance metrics provide the capability to isolate and identify areas of concern.
Performance Data Sample Period
Sample Period
Performance data can be presented with different sample periods. So what?
- The larger the sample period, the more averaged the data is, which reduces the chance of seeing bursty activity; the duration of a burst dictates the accuracy of the displayed data.
- The performance dashboard might look different depending on the time period viewed: the dashboard uses a minimum of 60-second samples but can go up to 4 hours per sample, and variation in performance is averaged to the sample frequency being displayed.
- For the most accurate and customisable performance analysis, post-processing performance archives is recommended (custom options later in the presentation).
Sample Period (illustrative timeline figure)
- With a coarse sampling period a, a short burst above the baseline x is averaged across the whole sample, so the reported peak is only y.
- With a finer sampling period b, the same burst is reported at its true height, x+y.
- Be aware of what data you are looking at.
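The averaging effect described above can be demonstrated with a few lines of Python. The numbers here are synthetic, chosen only to illustrate how a coarse sample period hides a burst that a finer sample period reports at full height.

```python
# Illustration of the sampling-period effect: the same burst reported
# at two different sample periods. All numbers are synthetic.

def resample_peak(samples, period):
    """Average raw per-second samples into buckets of `period` seconds
    and return the highest bucket average (the 'reported peak')."""
    buckets = [samples[i:i + period] for i in range(0, len(samples), period)]
    return max(sum(b) / len(b) for b in buckets)

# 300 seconds at a baseline of 1000 IOPS, with a 30-second burst to 7000 IOPS
raw = [1000] * 300
for t in range(120, 150):
    raw[t] = 7000

coarse = resample_peak(raw, 300)   # one 5-minute sample
fine = resample_peak(raw, 30)      # 30-second samples

print(f"peak at 300s samples: {coarse:.0f} IOPS")  # burst averaged away (1600)
print(f"peak at  30s samples: {fine:.0f} IOPS")    # burst fully visible (7000)
```

The same workload thus "peaks" at 1600 IOPS or 7000 IOPS depending purely on the sample period, which is exactly why archive data with finer samples is preferred for burst analysis.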
Performance Data: where do I find it?
Performance Metrics And Where To Find Them

Object            | Unisphere Dashboard/uemcli       | Archive
Storage Processor | Utilization (average)            | Utilization (average and per core)
LUN               | Response time, IOPS, MB/s, queue | Utilization, response time, IOPS, MB/s, queue
Disk              | IOPS, MB/s, service time, queue  | Utilization, IOPS, MB/s, queue
Ports             | IOPS, requests, MB/s             | IOPS, requests, MB/s
File Systems      | IOPS, MB/s, I/O size             | IOPS, MB/s, I/O size
FAST Cache        | Dirty ratio                      | None (future)

- Utilization, response time and MB/s are key quality-of-service indicators
- Utilization at the LUN and disk layer is available from archive data
Performance Data Performance Dashboard And Historical Database
Scenario-1 (Performance Dashboard)
Dell EMC Unity 400 Hybrid Array
Data
- Hybrid pool: 18 * 800GB SAS FLASH 3 and 20 * 1.2TB SAS
- 8 LUNs set to highest available tier, pinning them to the FLASH tier
- 8 LUNs set to lowest available tier, pinning them to the SAS tier
- Metadata for all LUNs would be resident in the highest tier available
Workload
- Variable workload, duration of 1 hour
- Read-to-write ratio mostly 80:20
- I/O size mixture of 4KB, 8KB, 16KB, 32KB and 64KB
Analysis Method
- Unisphere Performance Dashboard
Performance Dashboard
Unisphere Performance Dashboard
The performance dashboard is primarily used for viewing performance data from the historical database, and can be used to determine the health of the system.
Time selection: sample period vs. time available
- 60-second samples = up to 3 days of data
- 300-second samples = up to 14 days of data
- 3600-second samples = up to 28 days of data
- 14400-second samples = up to 90 days of data
Storage Processor Utilization
Observations:
1) SP-A is saturated at times, causing an imbalance between the storage processors
2) During lower periods of activity, utilization is well within the good range and reasonably balanced
Questions:
1) How is this saturation affecting workloads?
2) What activity is contributing to this saturation?
3) What are our options to reduce the effect of this workload?
System Level Statistics
Port IOPS And MB/s
Observations:
1) I/O is distributed across the available Fibre Channel ports
2) 16Gb FC ports are capable of >40K IOPS and ~1500MB/s bandwidth
3) Additional protocol statistics are available
FLASH LUN Statistics
SAS LUN Statistics
LUN I/O Size And MB/s
Scenario-1 Review
Summary
- SP utilization becomes imbalanced and highly utilized at different periods
- Response times are within acceptable limits when we consider the utilization of the system and the queue depth to the active LUNs
- The high utilization of the SP is likely to lead to issues if load increases, or if we enable features such as snapshots, compression or replication
Options
- Isolate workloads and consider migration to another system
- Use Host I/O Limits to cap the performance of targeted objects (LUNs)
Host I/O Limits
SP Utilization [Before / After]
Observations with limits active:
1) Utilization maintained within the good range
2) Host I/O Limits applied to targeted objects only
3) Host I/O Limits can be dynamically adjusted
LUN IOPS And Response Time
Scenario-1 Summary
- Storage processor utilization peaks were identified and correlated to specific workloads using the performance dashboard
- Other metrics were checked to verify that no other issues were present
- Host I/O Limits were deployed to cap the targeted LUN activity, reducing impact and maintaining the required levels of utilization
Performance Data: someone is reporting a problem
Scenario-2 (uemcli)
Dell EMC Unity 400 Hybrid Array
- Hybrid pool: 18 * 800GB SAS FLASH 3 and 20 * 1.2TB SAS
- 8 LUNs set to highest available tier, pinning them to the FLASH tier
- 8 LUNs set to lowest available tier, pinning them to the SAS tier
Workload
- Metadata for all LUNs would be resident in the highest tier available
- Varied workload with some scaling
- Variable I/O sizes
Analysis Method
- uemcli historical stats
- Focus on SP utilization, SAS LUN and disk IOPS, and SAS LUN response time
uemcli Options For Historical Data
List the available metrics for historical viewing (~77 in total):
uemcli -d <IP> -u <user> -p <pwd> /metrics/metric -availability historical show
Sample period vs. time available:
- 60-second samples = up to 3 days of data
- 300-second samples = up to 14 days of data
- 3600-second samples = up to 28 days of data
- 14400-second samples = up to 90 days of data
Example queries:
uemcli -d <IP> -u <user> -p <pwd> /metrics/value/hist -path sp.*.storage.lun.*.totalcallsrate show -from "2017-05-10 14:25:00" -count 360 -interval 60 -output csv
uemcli -d <IP> -u <user> -p <pwd> /metrics/value/hist -path sp.*.storage.lun.*.responsetime show -from "2017-05-10 14:25:00" -count 360 -interval 60 -output csv
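Once the `-output csv` data is captured, it can be post-processed with a few lines of standard-library Python. The header and rows below are a hypothetical stand-in for the exported file, since the exact column layout depends on the metric path queried; the parsing logic is what matters.

```python
# Minimal post-processing sketch for CSV captured from uemcli with
# `-output csv`. The sample below is hypothetical, not verbatim uemcli
# output; adjust the column names to match your export's actual header.
import csv
import io

sample_csv = """Timestamp,sp.a.storage.lun.sv_1.totalcallsrate,sp.b.storage.lun.sv_1.totalcallsrate
2017-05-10 14:25:00,1520,1480
2017-05-10 14:26:00,4100,3900
2017-05-10 14:27:00,1610,1550
"""

def peak_per_column(text):
    """Return the maximum observed value for each metric column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    metrics = [c for c in rows[0] if c != "Timestamp"]
    return {m: max(float(r[m]) for r in rows) for m in metrics}

for path, peak in peak_per_column(sample_csv).items():
    print(f"{path}: peak {peak:.0f} IOPS")
```

In practice you would read the file saved from the uemcli session instead of the inline string; the same per-column scan then identifies which SP or LUN carried the peak load.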
uemcli Options For Real Time
List the available metrics for real-time viewing (~580 in total):
uemcli -d <IP> -u <user> -p <pwd> /metrics/metric -availability real-time show
uemcli syntax for real-time commands:
/metrics/value/rt -path <value> show -interval <value> [ { -period <value> | -to <value> | -count <value> } [ -summary ] ] [ -flat ] [ -output { nvp | csv | table [ -wrap ] } ] [ { -brief | -detail } ]
Example:
uemcli -d <IP> -u <user> -p <pwd> /metrics/value/rt -path sp.*.storage.lun.*.readsrate,sp.*.storage.lun.*.writesrate show -interval 30
We pick a longer interval than the 5-second minimum, as it can be challenging to compute and display data for multiple LUNs in real time.
SAS LUN IOPS And Response Time
[SAS LUN IOPS] Observations:
1) Scaling the workload hits a plateau
[SAS LUN Response Time]
2) Response time appears to be impacted when we reach an aggregate of around 4000 IOPS
Consider the disk IOPS: at an 80:20 read:write ratio to RAID 5 (write penalty of 4), 4000 front-end IOPS equates to 3200 reads + (800 * 4) writes = 6400 back-end IOPS, spread across 20 SAS disks = 320 per disk
SAS Disk IOPS
[SAS Disk IOPS] Observation: scaling the workload pushes IOPS above the recommended levels referenced in the Best Practices Guide
Consider dynamic pool expansion to distribute the load
Pool Expansion
After Expansion: SAS LUN IOPS
[SAS LUN IOPS - Before] Observations:
1) With 20 SAS disks in the pool, the workload appeared to hit a plateau
[After]
2) With 40 SAS disks in the pool, we now achieve ~50% more IOPS
3) Lower contention and utilization, and potentially lower response time
After Expansion: SAS LUN Response Time
[SAS LUN Response Time - Before] Observations:
1) With 20 SAS disks in the pool, response time soon exceeded 20ms, rising to ~50ms
[After]
2) With 40 SAS disks in the pool, response time is dramatically improved
3) Queue distribution results in lower contention, utilization and response time
After Expansion: SAS Disk IOPS
[SAS Disk IOPS - Before] Observations:
1) With 20 SAS disks in the pool, the disks were saturated
[After]
2) With 40 SAS disks in the pool, I/O distribution is much better, resulting in lower disk utilization and contributing to a reduction in response time for host I/O
Scenario-2 Summary
- uemcli statistics match the capability of the performance dashboard, and allow collection and post-processing of performance data in a customized way
- Using this method we identified a resource utilization issue, specifically with the 10K rpm SAS disks
- Dynamic pool expansion optimized the handling of the workload, leading to lower disk utilization and an improvement in IOPS and response time
What about performance archives?
Performance Archives: What and How?
Performance Archives
- Archives contain 1 hour of data in an SQLite database format
- Each archive is aligned to the top of the hour, e.g. covering 3pm to 4pm, then 4pm to 5pm
- The filename is date- and time-referenced to the start time of the archive (UTC)
- Partial archives are readable, self-contained SQLite database files
- The repository contains a minimum of 48 archives (covering 2 days of high-definition performance data)
- As of Dell EMC Unity OE 4.2, archives can be retrieved in the UI; retrieving archives is also possible via WinSCP
- You can look at the structure of an archive with DB Browser for SQLite: https://www.sqlite.org/download.html
- Export requires data manipulation: timestamps are stored as an offset from epoch time, and metrics like I/O, MB and calls need to be converted to per-second samples
- Object names have to be mapped to user objects where possible using the embedded tables
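Because the archives are SQLite files, they can also be explored programmatically with Python's built-in sqlite3 module rather than DB Browser. The table and column names below are hypothetical stand-ins (inspect the real schema first with `SELECT name FROM sqlite_master WHERE type='table';`), and the in-memory database mimics the two manipulations the bullet list mentions: converting epoch-based timestamps and turning counters into per-second rates.

```python
# Sketch of exploring an archive with sqlite3. The schema here is a
# hypothetical stand-in for a real archive table; query sqlite_master
# on the actual file to find the real table and column names.
import sqlite3
from datetime import datetime, timezone

# Stand-in archive: a table keyed by an epoch-based timestamp.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE storage_lun (ts INTEGER, object_id TEXT, total_calls INTEGER)")
db.executemany("INSERT INTO storage_lun VALUES (?,?,?)",
               [(1490007600, "sv_1", 120000), (1490007601, "sv_1", 121500)])

rows = db.execute("SELECT ts, total_calls FROM storage_lun ORDER BY ts").fetchall()

# Convert the epoch timestamp to a readable UTC time, and derive a
# per-second rate from the delta between consecutive counter samples.
for (t0, c0), (t1, c1) in zip(rows, rows[1:]):
    when = datetime.fromtimestamp(t1, tz=timezone.utc)
    print(f"{when:%Y-%m-%d %H:%M:%S} UTC: {(c1 - c0) / (t1 - t0):.0f} IOPS")
```

For a real archive you would connect to the downloaded `.db` file path instead of `:memory:` and join against the embedded name-mapping tables to label each object.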
Dell EMC Unity Performance Archive Dump
Options:
- 1 to multiple archives
- Output to CSV format
- 2 variants of formatting
- Timestamps
- Equated per-second metrics
- Aligned with the Unisphere archive retrieve capability
- Ongoing development; early access availability via upad@dell.com
Sample output:
cpu_core 20170320_110000.csv
fibrechannel_feport 20170320_110000.csv
iscsi_feport 20170320_110000.csv
net_device 20170320_110000.csv
physical_disk 20170320_110000.csv
storage_filesystem 20170320_110000.csv
storage_flu 20170320_110000.csv
storage_lun 20170320_110000.csv
storage_pool 20170320_110000.csv
What do I do with dumped CSV data?
Excel
- If your timestamp doesn't show seconds, select column A and change the format, adding :ss to show seconds for each sample
- Select the entire sheet by clicking in the top corner, then select Insert Pivot Chart, which will default to the whole table
Pivot Chart: The Easy Guide To Charting Data
- In the pivot fields, drag timestamp to the Axis category, user_lun to Legend, and the metric you want to plot into Values
- Ideally, to verify single-object selection, check the value count; it should be 1
- Here we see something with a count of 4: these are 4 system-related LUNs that have no user IDs, so click the user_lun drop-down in the chart and deselect the first entry
- Now that we only see 1 entry per sample, we can change the value field to show the data: as there is only 1 sample per user LUN at each time point, we can select min, max or sum, as any of them will show the single value present
Pivot Chart
- The chart type defaults to bar, though most times it's better to change the type to line
- You can now easily filter using the drop-downs for specific LUNs or time periods
- You can also change the chart type to stacked to show aggregate values when all LUNs are selected (this is also very useful)
Pivot Disk IOPS
Stacked chart: Disk Total IOPS ~185K
- Using stacked charts, we can determine a disk summary here of ~185K IOPS
- Disk stats represent block LUN and file system activity, internal operations and snaps
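The pivot-and-stack operation from the Excel walkthrough can be reproduced in Python on the dumped CSV: sum the metric across all objects at each timestamp to get the aggregate (the "stacked" view). The column names and values below are a hypothetical fragment of a dump file; match them to the actual CSV header of your export.

```python
# Python equivalent of the stacked pivot chart: aggregate a per-object
# metric across all objects at each timestamp. The CSV fragment below
# is hypothetical; adjust column names to the real dump's header.
import csv
import io
from collections import defaultdict

dump = """timestamp,user_lun,total_iops
2017-03-20 11:00:01,LUN_0,2200
2017-03-20 11:00:01,LUN_1,1800
2017-03-20 11:00:02,LUN_0,2300
2017-03-20 11:00:02,LUN_1,1700
"""

# Sum the metric over every object sharing a timestamp.
stacked = defaultdict(float)
for row in csv.DictReader(io.StringIO(dump)):
    stacked[row["timestamp"]] += float(row["total_iops"])

for ts, total in sorted(stacked.items()):
    print(ts, f"{total:.0f}")  # aggregate IOPS across all LUNs
```

This is the same grouping the pivot chart performs with timestamp on the axis and the metric summed in Values; filtering out the unnamed system LUNs is a one-line condition on `row["user_lun"]` before the accumulation.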
Summary
- Multiple performance data options for viewing, collection and analysis
- Dell EMC Unity best practices for performance referenced for health status
- Sample period considerations with the different methods of looking at the data
- Issues isolated and possible solutions considered: engaging Host I/O Limits, and rebalancing load using dynamic pool expansion