Automated Verification of I/O Performance
F. Delalondre, M. Baertschi
Requirements
- Support scientists' creativity
- Minimize development time
- Maximize application performance
Performance Analysis
- System performance
- Application performance
- Application-on-system performance (real time)
Goal: regression testing & fast input to developers/system engineers
Scientific Use Cases
- Interactive supercomputing
- Traditional application use case using the GPFS file system
Interactive Supercomputing
- Machine utilization does not matter
- Time to scientific delivery matters
- Steering & monitoring
Interactive Supercomputing Data Path
[Diagram: 4096 Blue Gene/Q compute nodes -> 64 Blue Gene/Q I/O nodes (64 x 2 x 2 GB/s = 256 GB/s) -> InfiniBand switch (64 x 40 Gb/s = 320 GB/s) -> 40-node iDataPlex visualization cluster (40 x 56 Gb/s = 280 GB/s); hops labeled 1-4 mark the tested segments]
Regular Use Case Data Path
[Diagram: 4096 Blue Gene/Q compute nodes (64 x 2 x 2 GB/s = 256 GB/s) -> InfiniBand switch; 40-node iDataPlex -> switch (40 x 56 Gb/s = 280 GB/s); switch -> 10 GSS servers (10 x 2 x 56 Gb/s = 135 GB/s) -> GSS disk enclosures (10 x 12 x 6 Gb/s = 72 GB/s); 177 SAS disks per server at 50 MB/s per disk => 88 GB/s aggregate; hops labeled 5-9 mark the tested segments]
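To make the diagram's arithmetic explicit, here is a minimal sketch (not part of the original deck) that finds the slowest hop along this path; the figures are the aggregate capacities printed on the diagram.

```python
# Back-of-the-envelope check of the regular-use-case data path: the
# sustainable end-to-end bandwidth is bounded by the slowest hop.
# Values are the aggregate link capacities from the diagram (GB/s).
hops = {
    "compute -> I/O nodes (64 x 2 x 2 GB/s)": 256,
    "I/O nodes -> IB switch (64 x 40 Gb/s)": 320,
    "IB switch -> GSS servers (10 x 2 x 56 Gb/s)": 135,
    "GSS servers -> disk enclosures (10 x 12 x 6 Gb/s)": 72,
    "disks (10 x 177 SAS disks x 50 MB/s)": 88.5,
}

bottleneck = min(hops, key=hops.get)
print(f"Bottleneck: {bottleneck} at {hops[bottleneck]} GB/s")
# -> the GSS server-to-enclosure SAS links (72 GB/s) cap the path;
#    any measured saturation must sit at or below this figure.
```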
Regression Testing & Performance Benchmark
System I/O regression testing:
- Regression of the system after maintenance?
- Is the system delivering maximum performance?
Input to developers & system engineers:
- System performance (bandwidth, latency, ...)
- Scaling numbers: I/O fabric saturation point
- Best I/O configuration (block size, ...)
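A minimal sketch of what such a regression check could look like, assuming a JSON file of baseline bandwidths; the tolerance, test name, and file layout are illustrative, not the project's actual setup.

```python
# Minimal sketch of the regression check: compare a fresh benchmark
# result against a stored baseline and flag regressions, e.g. after
# system maintenance. Threshold and schema are assumptions.
import json

TOLERANCE = 0.10  # flag anything more than 10% below baseline (assumed)

def check_regression(baseline_file: str, measured_gbps: float, test: str) -> bool:
    with open(baseline_file) as f:
        baseline = json.load(f)[test]  # e.g. {"ior_write": 41.0}
    ok = measured_gbps >= baseline * (1.0 - TOLERANCE)
    status = "OK" if ok else "REGRESSION"
    print(f"{test}: {measured_gbps:.1f} GB/s vs baseline {baseline:.1f} GB/s -> {status}")
    return ok
```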
Testing Framework
- For each path, test performance, scaling & I/O parameters
- All tests must be fully scripted, with no manual intervention (sketched below)
- Tests include IOR, NsdPerf, Qperf, gpfsperf, ib_read_*, ib_write_*
- Tests executed using the Jenkins continuous-integration framework
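For illustration, a sketch of one fully scripted test step as Jenkins might invoke it; the runjob arguments and IOR flags are assumptions chosen for the Blue Gene/Q context, not the project's actual scripts.

```python
# Sketch of a scripted test step for CI: launch IOR non-interactively,
# parse the aggregate bandwidth from its summary, and exit non-zero so
# the Jenkins job turns red on failure. runjob is the Blue Gene/Q
# launcher; flags below are illustrative.
import re
import subprocess
import sys

cmd = ["runjob", "--np", "512", ":", "ior", "-a", "MPIIO", "-w", "-r",
       "-t", "4m", "-b", "1g", "-o", "/gpfs/scratch/ior.dat"]
result = subprocess.run(cmd, capture_output=True, text=True)

# IOR prints a summary line of the form "Max Write: 1234.56 MiB/sec".
match = re.search(r"Max Write:\s+([\d.]+)\s+MiB/sec", result.stdout)
if result.returncode != 0 or match is None:
    sys.exit("IOR run failed or produced no bandwidth figure")
print(f"Measured write bandwidth: {float(match.group(1)) / 1024:.1f} GiB/s")
```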
IOR to I/O Node Memory
[Diagram: 4096 Blue Gene/Q compute nodes -> 64 Blue Gene/Q I/O nodes (64 x 2 x 2 GB/s = 256 GB/s; tested hop 1); I/O nodes -> InfiniBand switch (64 x 40 Gb/s = 320 GB/s) -> 40-node iDataPlex (40 x 56 Gb/s = 280 GB/s)]
IOR to I/O Node Memory
[Plot: IBM Blue Gene/Q I/O scaling, compute node kernel (CNK) to I/O node memory; memory bandwidth (MB/s) vs. number of nodes (0-4096), curves for POSIX write/read and MPI-IO write/read; 217 GB/s marked on the plot]
- Write performance loses scaling beyond 2 racks (~74% of peak [1]); almost linear scaling every 5-10 runs (~94% of peak)
- Read operation is twice as slow but scales linearly (~56% of peak)
- To be tested at larger scale
- Why is it important? Interactive supercomputing (ISC)

[1] D. Chen, N.A. Eisley, P. Heidelberger, R.M. Senger, Y. Sugawara, S. Kumar, V. Salapura, D.L. Satterfield, B. Steinmacher-Burow, J.J. Parker, "The IBM Blue Gene/Q Interconnection Network and Message Unit", Proceedings of SC11: International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.
IB Test - I/O Nodes to Viz Nodes
[Diagram: 64 Blue Gene/Q I/O nodes -> InfiniBand switch (64 x 40 Gb/s = 320 GB/s; tested hop 2) -> 40-node iDataPlex (40 x 56 Gb/s = 280 GB/s; tested hop 3)]
IB Test - I/O Nodes to Viz Nodes
Test setup:
- Pair each I/O node with a cluster node
- Increase the number of node pairs up to 40
Observed:
- Per-node performance & outliers (see the sketch below)
- Detection of misconfigured/faulty cards
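A sketch of how the outlier detection could work, assuming per-pair bandwidths have already been collected with qperf or ib_read_bw; the node names and threshold are hypothetical.

```python
# Sketch of the pairwise IB test analysis: flag node pairs whose
# bandwidth falls well below the median, which is how a misconfigured
# or faulty HCA typically shows up.
import statistics

def find_outliers(bw_by_pair: dict, threshold: float = 0.8) -> list:
    """Return pairs whose bandwidth is below threshold * median."""
    median = statistics.median(bw_by_pair.values())
    return [pair for pair, bw in bw_by_pair.items()
            if bw < threshold * median]

# Example: 40 pairs at ~5 GB/s each, one degraded link (hypothetical).
bw = {f"ion{i:02d}<->viz{i:02d}": 5.0 for i in range(40)}
bw["ion07<->viz07"] = 1.2  # hypothetical faulty card
print(find_outliers(bw))   # -> ['ion07<->viz07']
```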
IOR to Disk
[Diagram: 64 Blue Gene/Q I/O nodes -> InfiniBand switch (64 x 40 Gb/s = 320 GB/s); 40-node iDataPlex (40 x 56 Gb/s = 280 GB/s); switch -> 10 GSS servers (10 x 2 x 56 Gb/s = 135 GB/s; tested hops 5 and 6) -> GSS disk enclosures (10 x 12 x 6 Gb/s = 72 GB/s); 177 SAS disks per server at 50 MB/s per disk => 88 GB/s aggregate]
IOR to Disk
Test setup:
- Read/write, MPI-IO/POSIX
- Various transfer sizes & access patterns
Observed:
- Saturation at 41 GB/s in the optimal configuration
- Crashing of the research system when performing IOPS tests (4k transfers) with a large GPFS block size
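A sketch of the kind of parameter sweep this slide implies, assuming IOR's standard flags; the transfer-size list and file path are illustrative.

```python
# Sketch of the IOR parameter sweep: iterate over API (POSIX/MPI-IO),
# direction, and transfer size, and record the bandwidth of each
# combination. In the real framework each command would be launched,
# parsed, and stored as in the CI sketch earlier.
from itertools import product

apis = ["POSIX", "MPIIO"]
modes = ["-w", "-r"]                      # write, then read of the same file
transfer_sizes = ["4k", "64k", "1m", "4m", "16m"]

for api, mode, tsize in product(apis, modes, transfer_sizes):
    cmd = ["ior", "-a", api, mode, "-t", tsize, "-b", "1g",
           "-o", "/gpfs/scratch/ior.dat"]
    print(" ".join(cmd))
# NB: per the slide above, small-transfer (4k) IOPS runs combined with a
# large GPFS block size crashed the research system, so such sweeps
# warrant care on production machines.
```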
I/O Nodes to GSS Servers
[Diagram: 64 Blue Gene/Q I/O nodes -> InfiniBand switch (64 x 40 Gb/s = 320 GB/s; hop 8: NSDperf, Qperf/ib_*); 40-node iDataPlex -> switch (40 x 56 Gb/s = 280 GB/s; hop 7: NSDperf, Qperf/ib_*); switch -> 10 GSS servers (10 x 2 x 56 Gb/s = 135 GB/s) -> GSS disk enclosures (10 x 12 x 6 Gb/s = 72 GB/s); 177 SAS disks per server at 50 MB/s per disk => 88 GB/s; hop 9: what is the best test?]
Open question: which services can we run/install on the GSS servers?
Performance Analysis
- System performance
- Application performance
- Application-on-system performance (real time)
Can we go one step further?
- Reduce the HPC development cycle through fast troubleshooting
- Monitor the HPC/simulation platform in real time & provide input to the BBP portal
Building an HPC Development Tool
[Diagram: building/simulation jobs run on the whole infrastructure (BG/Q & cluster in Lugano, cluster at EPFL); hardware monitoring reports to the console: hardware OK/not OK; software monitoring, checked against a DB via the software/hardware mapping, reports: software OK/not OK]
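A sketch of the "OK/not OK" software check against a DB of reference numbers; the component name, schema, and slack factor are invented for illustration.

```python
# Sketch of the software OK/not-OK check: compare the latest monitored
# performance number against what the database records as normal for
# that software/hardware mapping. Schema and names are illustrative.
def software_ok(db: dict, component: str, measured: float,
                slack: float = 0.15) -> bool:
    """OK if measured is within `slack` of the recorded reference."""
    reference = db[component]  # normalized reference performance
    return measured >= reference * (1.0 - slack)

db = {"neuron_sim": 1.00}  # hypothetical component and reference value
print("OK" if software_ok(db, "neuron_sim", 0.93) else "not OK")  # OK
print("OK" if software_ok(db, "neuron_sim", 0.70) else "not OK")  # not OK
```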
Building an HPC Development Tool
[Diagram: Git patch set with responsible author -> graphical interface & console; profiling produces performance numbers that feed the software monitoring and the performance DB & graph: software OK/not OK]
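A sketch of attaching performance numbers to Git metadata so a regression points back to a patch set and its responsible author; the in-memory dict stands in for the performance DB.

```python
# Sketch of tying performance numbers to Git history: store each
# measurement under the commit hash together with the author, so a
# regression can be traced back to a patch set.
import subprocess

def current_commit() -> tuple:
    sha = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()
    author = subprocess.check_output(
        ["git", "log", "-1", "--format=%an"], text=True).strip()
    return sha, author

perf_db = {}  # stand-in for the performance DB
sha, author = current_commit()
perf_db[sha] = {"author": author, "write_gbps": 41.0}  # illustrative number
```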
Building an HPC Development Tool
[Diagram: profiling with VTune on the Intel (x86) cluster and HPM, Extrae, Scalasca on BG/Q; recordings cover the whole infrastructure (BG/Q & cluster in Lugano, cluster at EPFL) via the software/hardware mapping; software monitoring: software OK/not OK]
Building an HPC Development Tool
[Diagram: debugger interface with responsible author & patch set -> console; debugging hooks into the software monitoring and software/hardware mapping across the whole infrastructure (BG/Q & cluster in Lugano, cluster at EPFL): software OK/not OK]
Thank you