A Generic Methodology of Analyzing Performance Bottlenecks of HPC Storage Systems
Zhiqi Tao, Sr. System Engineer
Lugano, March 15, 2013
Outline

Introduction
o Anatomy of a storage system
o Performance

Methodology
o The top-down approach vs. the bottom-up approach
o The pipeline approach

Case Study
o Building up benchmarking profiles

Conclusion
Introduction - Anatomy of a Storage System

A storage system consists of a mix of hardware and software. Many aspects must be taken into account when architecting an enterprise storage system: capacity, performance, reliability, scalability, manageability, and often the most important factor, COST.
Introduction - Questions

Two questions are often associated with high-performance storage systems:
o How do I improve the performance of my storage system?
o What is the bottleneck of my storage system?

I'm here to share a generic methodology for analyzing performance bottlenecks and examining efficiency.
o By no means the definitive best practice
o Simply sharing my personal experience
o Hopefully useful for anyone interested in the same topic
o Comments and suggestions are appreciated
Introduction - Performance

Review these questions:
o How do I improve the performance of my storage system?
o What is the bottleneck of my storage system?

I often turn the question around: how efficient is your storage system?
o We need to set realistic expectations.
o "Too good to be true" often comes with consequences.
o Did I mention a well-designed Lustre storage system can achieve 90% of the underlying hardware bandwidth? Catch up with me after the talk.
Performance Bottleneck

Performance is the most noticeable measurement.
o The IO subsystem is commonly the slowest component of a computing system, compared with CPU, RAM, etc.
o CPU cycles are wasted when there is a lot of iowait (a quick check follows below).
o High performance is an important factor in being cost-effective.
o Achieving high efficiency is even harder. That's what I'm here for. Contact the Intel High Performance Data Division at hpdd-info@intel.com.

It generally requires years of experience to architect a well-balanced high-performance system.
o It is generally easier to use proven open technologies or something you have experience with.
o There is a scientific methodology we can follow.
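A quick way to spot an IO-bound system is to watch iowait and per-device utilization with the standard sysstat tools (a minimal sketch; intervals are illustrative):

    # per-CPU breakdown; persistently high %iowait means CPUs are
    # stalled waiting on storage
    mpstat 2

    # extended per-device statistics; %util near 100 with rising await
    # points at a saturated device
    iostat -x 2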
Technical White Paper

"Architecting a High Performance Storage System" by Zhiqi Tao, Andreas Dilger, Eric Barton, Bryon Neitzel

Designing a large-scale, high-performance storage system presents significant challenges. This paper describes a step-by-step approach to designing a storage system and presents a design methodology based on an iterative approach that applies at both the component level and the overall system level. The paper includes a detailed case study in which a Lustre storage system is designed using the approach and methodology presented.

http://www.whamcloud.com/resources/architecting-a-high-performance-storage-system/
Performance Bottleneck

It is not uncommon to see a seemingly well-designed storage system that does not deliver the expected performance.
o There might be factors we did not consider in the design.
o Some hardware might not be as good as it claims to be.
o It might be bad luck: we happened to receive a faulty batch.
o It might be a limitation in the software, for example metadata performance before Lustre 2.3.
o Some tunables might be required.

Troubleshooting performance bottlenecks is what I do.
o Contact the Intel High Performance Data Division at hpdd-info@intel.com
o My methodology follows.
Methodology - Top-Down vs. Bottom-Up

(Diagram: the Top-Down and Bottom-Up approaches traversing the storage stack)
Methodology - Top-Down

Top-Down: trace down to the bottleneck, like peeling an onion or finding a rabbit in the forest.
o Often requires special tools
o Requires in-depth knowledge of the entire stack
o Hard to generalize the results
o Time-consuming: reacting to an issue after it has occurred
o Finger pointing
Methodology - Bottom-Up

Bottom-Up: equally difficult; it also requires special tools and in-depth knowledge, and is time-consuming.
Methodology - My Motivation

Proactive instead of reactive:
o Take steady steps instead of rushing to the end and debugging from there.
o There is little we can do after the system is built, or at least it would just take more time.

I like generalizable results and encourage collaboration:
o An easy-to-follow methodology
o Use generally available tools
o Narrow down the bottleneck to a small scope without requiring in-depth knowledge of the whole stack
o Then engage with a subject-matter expert (SME)

First, let's look at the storage system through a different analogy.
Methodology - Pipeline

The components of a storage system are aligned like a pipeline. For each IO operation, data flows through the pipeline. Obviously, the faster and more reliable the flow, the better the pipeline is. The narrowest point in the pipeline determines the throughput of the whole pipeline.

(Diagram: storage layers in the pipeline - disks and enclosures, storage controller, SAN, HBA/NIC, NAS servers, NIC, cluster network, clients)
Methodology - Pipeline

It is important to understand the specification of each component and the overhead each layer introduces. The end-to-end throughput can never exceed the slowest stage, as the sketch below illustrates.

(Diagram: storage layers in the pipeline, as on the previous slide)

Let us try the methodology on something I'm familiar with.
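As a back-of-the-envelope sketch in shell (all bandwidth numbers are hypothetical, not measurements from this case study), the deliverable throughput is simply the minimum across the stages:

    # illustrative per-stage bandwidth estimates in MB/s
    disks=6000        # e.g. 60 disks at ~100 MB/s sustained each
    controller=4000   # controller spec-sheet limit
    network=3200      # e.g. QDR InfiniBand, ~32 Gb/s effective
    # the pipeline can deliver no more than its narrowest stage
    min=$(( disks < controller ? disks : controller ))
    min=$(( min < network ? min : network ))
    echo "pipeline limit: ${min} MB/s"    # prints 3200 MB/s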
Case Study - a Lustre Storage System

(Diagram: the case-study Lustre storage system; clients attach over 10GbE)
Case Study - Backend Storage

(Diagram: two object storage controllers, each with two 60-disk enclosures and sub-controller modules A and B, joined by a cache mirroring link; controller 1 serves Lustre OSS 1 and controller 2 serves Lustre OSS 2)
Case Study - Backend Storage

o SAS connection to the storage controller
o 60x 3TB 7200rpm disks per enclosure
o Every 10 disks form one RAID6 group (usable capacity worked out below)
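Worked out: each enclosure then holds six RAID6 groups, and since RAID6 dedicates two disks per group to parity, the usable capacity is 6 groups x 8 data disks x 3 TB = 144 TB per enclosure (before file system overhead).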
Case Study - Backend Storage

Lustre can work with any block device, but Lustre has no knowledge of or control over the backend storage.
o Hidden caches must be protected. Lustre has built-in mechanisms to protect caches visible to Lustre.
o Use cache mirroring and battery-backed cache on the storage controllers.
o Turn off the HBA cache.
o Turning off caching entirely (disable cache and cache mirroring) sometimes gives better performance.
Case Study - Backend Storage

Tool sets to analyze backend storage:
o Vdbench - a Swiss army knife: http://sourceforge.net/projects/vdbench/
o sgpdd_survey (shipped with the lustre-iokit rpm)
  o thrlo=1, thrhi=256
  o crglo=1, crghi=256
  o size = twice the system memory
  o Section 24.2, Lustre Operations Manual: http://wiki.whamcloud.com/display/pub/documentation
o dd would not be a good choice: we want to study how the storage responds to multiple IO threads (see the sketch below).
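A minimal sgpdd_survey sketch along these lines (the device names and the size value are placeholders; set size, in MB, to roughly twice your system RAM):

    # CAUTION: sgpdd_survey writes directly to the raw devices and
    # destroys any data on them; only run it before formatting.
    # It sweeps thread counts (thr) and concurrent regions (crg)
    # to show how the raw storage scales with parallel IO.
    size=262144 \
    crglo=1 crghi=256 \
    thrlo=1 thrhi=256 \
    scsidevs="/dev/sdb /dev/sdc" \
    sgpdd_survey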
Case Study - Storage Servers

Effectively, the two halves of the system (each controller plus OSS pair) should give us the same performance. How many OSTs would be the best fit for an OSS server?
Case Study - Storage Servers

Tool sets:
o obdfilter-survey - simulates Lustre workloads
  o thrlo: low count of threads
  o thrhi: high count of threads
  o nobjlo: low count of objects to read/write
  o nobjhi: high count of objects to read/write
  o size: total IO size in MB
  o targets: names of the obdfilter instances
  o Section 24.3, Lustre Operations Manual: http://wiki.whamcloud.com/display/pub/documentation

A sample invocation is sketched below.
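A minimal obdfilter-survey sketch (the OST target names are placeholders; run on the OSS itself to measure the local disk path):

    # exercises the Lustre OST stack directly, bypassing clients and
    # the network, sweeping thread and object counts
    size=65536 \
    nobjlo=1 nobjhi=32 \
    thrlo=1 thrhi=64 \
    targets="lustre-OST0000 lustre-OST0001" \
    obdfilter-survey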
Case Study - Network Consideration

An unoptimized network architecture can potentially be a limiting factor. For example, a 36-port IB switch with 24 ports going to nodes but only 12 inter-switch uplinks is 2:1 oversubscribed, while an 18/18 split keeps the fabric non-oversubscribed.

(Diagram: a 2:1 oversubscribed InfiniBand fabric - two 36-port IB switches, each with 24 node ports and 12 uplink ports - vs. a non-oversubscribed fabric with 18 node ports and 18 uplink ports per switch)
Case Study - Network Consideration

Tool sets:
o ib_write_bw, ib_read_bw - shipped in the perftest rpm
o LNET Selftest - measures network throughput and RPC operations in a Lustre environment
  o 1:1, 1:many, many:many node groups
  o IO sizes (size=)
  o The number of requests active at one time (concurrency=)
  o Bulk data transfer (brw read/write)
  o Small request messages (ping)
  o With and without data checksums (check=)
  o Chapter 23, Lustre Operations Manual: http://wiki.whamcloud.com/display/pub/documentation

A sample session is sketched below.
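A minimal LNET selftest session (the NIDs are placeholders; the lnet_selftest module must be loaded on every participating node):

    modprobe lnet_selftest
    export LST_SESSION=$$
    lst new_session read_write
    lst add_group clients 192.168.1.10@o2ib
    lst add_group servers 192.168.1.20@o2ib
    lst add_batch bulk_rw
    # bulk write test: 1 MB IOs, 8 concurrent RPCs, simple checksums
    lst add_test --batch bulk_rw --concurrency 8 \
        --from clients --to servers brw write size=1M check=simple
    lst run bulk_rw
    lst stat clients servers    # Ctrl-C to stop the live report
    lst end_session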
Case Study - Client Application

(Diagram: the same Lustre system topology, highlighting the clients on the 10GbE network)

It is the performance measured on the clients that matters.
Case Study - Client Application

Tool sets:
o IOZone
  o read, write, re-read, re-write, read backwards, strided read, fread, fwrite, random read, pread, mmap, aio_read, aio_write
  o http://www.iozone.org/
o IOR
  o Supports the POSIX, MPI-IO, HDF5, or NCMPI APIs for IO
  o http://sourceforge.net/projects/ior-sio/
o Client applications themselves

A sample IOR run is sketched below.
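A minimal IOR sketch for an aggregate client throughput test (the process count, file path, and sizes are placeholders):

    # 32 MPI processes, file per process (-F), 1 MB transfers; writing
    # 4 GB per process keeps client RAM caching from inflating results
    mpirun -np 32 ior -a POSIX -w -r -b 4g -t 1m -F \
        -o /mnt/lustre/ior_testfile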
Conclusion

Be proactive: take steady steps to build performance profiles before performance issues occur.
o Understand what each component is capable of and what overhead it adds.

Use generally available tool sets.

Look out for system utilization, saturation, and errors.
o iostat, top, mpstat, sar, etc. (a baseline-recording example follows)
o Intel Manager for Lustre
  o All-in-one dashboard: CPU, RAM, file system IO (both metadata and read/write workloads), etc.
  o Aggregated logs
  o Syntax-highlighted alerts
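For the proactive profile, sar can record a baseline that is replayed later for comparison (the file path and intervals are illustrative):

    # record CPU, per-device IO, and network statistics every 5 s for
    # 10 minutes into a binary file
    sar -u -d -n DEV -o /var/tmp/baseline.sar 5 120

    # replay the saved device statistics during a later investigation
    sar -d -f /var/tmp/baseline.sar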
Thank You

zhiqi.tao@intel.com