CSCS HPC Storage
Hussein N. Harake
Points to Cover
- XE6 External Storage (DDN SFA10K, SRP, QDR)
- PCI-E SSD Technology
- RamSan-620 Technology
XE6 External Storage
- Installed Q4 2010, in production Q1 2011
- 5 enclosures
- 300 x 2 TB hard drives
- Two UPS units for each singlet
- Lustre 1.8.4 filesystem
- IB QDR network
- 4 IO servers
XE6 External Storage
[Diagram: Cray XE6 attached through a Voltaire 4036 IB switch to four IO servers, each connected to five quad SAS disk controllers]
- 4 OSSs & 1 MDS server
- 2 TB SATA drives
- SRP protocol
- IB QDR network
- 4 LNET routers (see the configuration sketch after this list)
- Heartbeat & multipath implementation
- Lustre 1.8.4
- 8.4 GB/s raw using ddn-ost-survey
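As a rough illustration of how LNET routers bridge clients to the IB-attached servers, a minimal Lustre 1.8-era lnet module configuration could look like the following; the network names and router addresses are hypothetical, not taken from the CSCS setup:

    # /etc/modprobe.d/lustre.conf on a client (hypothetical addresses)
    # reach the o2ib0 servers through the four LNET routers 10.10.0.1-4
    options lnet networks="tcp0(eth0)" routes="o2ib0 10.10.0.[1-4]@tcp0"

    # on each router: attach to both networks and enable forwarding
    options lnet networks="tcp0(eth0),o2ib0(ib0)" forwarding="enabled"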
XE6 External Storage
IOR throughput, write cache and read cache enabled, 28 LUNs:

  Block / Stripe Size   # of Clients   Write        Read
  1M                    1              7110 MB/s    6154 MB/s
  1M                    4              7013 MB/s    6274 MB/s
  1M                    8              7753 MB/s    6053 MB/s
  1M                    14             7508 MB/s    5388 MB/s
  1M                    28             6465 MB/s    5270 MB/s
  4M                    1              6472 MB/s    6444 MB/s
  4M                    4              6968 MB/s    6722 MB/s
  4M                    8              7597 MB/s    6020 MB/s
  4M                    14             7533 MB/s    6080 MB/s
  4M                    28             8162 MB/s    5969 MB/s

IOR was used with MPIIO (HDF5 didn't show any improvement).
PCI-E SSD Technology
- PCI-E Virident, based on SLC SSDs: 1 x N400 400 GB
- PCI-E Fusion-io, based on SLC SSDs: 1 x ioDrive Duo 320 GB
- PCI-E WarpDrive, based on SLC SSDs: 1 x SLP-300 300 GB
PCI-E SSD Technology
Two benchmark tools were used: FIO and IOR
- MPIIO, HDF5 and POSIX interfaces
- XFS and GPFS filesystems
- Two Supermicro servers with two sockets each
- PCI-E Gen 2 16x
- AMD Opteron Magny-Cours, 8 cores
- 16 GB DDR3 memory
PCI-E SSD Technology
FIO parameters: iodepth=256, iodepth_batch_complete=8, iodepth_batch_submit=8, ioengine=libaio, direct=1, rw=randwrite or rw=randread, numjobs=4 (a sample invocation follows this list)
- iodepth: number of outstanding requests for async IO
- iodepth_batch_complete and iodepth_batch_submit: iodepth batching control
- direct=1: bypass the page cache (direct IO)
- libaio: Linux native asynchronous IO
- randwrite = random write
- randread = random read
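Combined into a single invocation, a run with these parameters might look as follows; the 4K block size, runtime, and device path are illustrative assumptions rather than values from the slides:

    # random-write IOPs test (swap rw=randread for the read test)
    fio --name=randwrite-test --ioengine=libaio --direct=1 \
        --iodepth=256 --iodepth_batch_submit=8 --iodepth_batch_complete=8 \
        --rw=randwrite --bs=4k --numjobs=4 \
        --runtime=60 --time_based --filename=/dev/fioa   # hypothetical device path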
PCI-E SSD Technology
IOR parameters: -a POSIX -B -b 1G -t 4K -e -s 1 -i 1 -F -C (a sample launch follows this list)
- -a POSIX: IO engine
- -B: bypass I/O buffers (O_DIRECT)
- -b: file size to be written per task
- -t: transfer size
- -e: fsync after write
- -s: number of segments
- -i: number of repetitions of the test
- -F: file-per-process
- -C: changes task ordering so each task reads data written by task n+1
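Assembled into a run, the invocation would look like the following; the MPI task count and output path are assumptions for illustration:

    # 28 MPI tasks, one file per process (task count and path are hypothetical)
    mpirun -np 28 ior -a POSIX -B -b 1g -t 4k -e -s 1 -i 1 -F -C -o /mnt/test/ior.dat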
PCI-E SSD Technology
Comparing file-systems:
- Using one card: XFS delivered ~100% more IOPs than GPFS; GPFS and XFS showed the same throughput
- Using two cards in the same server: IOPs increased by ~90% on XFS and 30% on GPFS; throughput increased by ~90% on both XFS and GPFS
- Using two cards on two different servers over IB: GPFS showed 40% lower performance than on the local server
Although GPFS did not sustain the hardware's raw performance, a single card is capable of delivering the same IOPs on GPFS as a DS5300.
PCI-E SSD Technology
Things we learned:
- Marketing numbers are based on raw benchmarks; expect a 30 to 50% performance penalty when using RAID and a filesystem
- Bypass any kind of IO buffers if you want the real performance capability of your hardware
- Don't change any configuration parameters on the hardware controller if it already holds data
- Each card requires at least one core
- Two to 4 GB of memory is required for each card
- Check the kind of RAID supported by your hardware
- Check whether the hardware includes a battery for flushing data in case of power loss
- Tools and utilities are required to report errors, health checks, logs, etc.
- Replacing a defective SSD makes more sense than replacing the entire card
Benchmarking RamSan-620 on GPFS
Hussein N. Harake, CSCS - TI
Infrastructure
- Seven dual-socket clients / servers: two GPFS IO servers, five GPFS clients
- PCI-E 16x
- 32 GB of memory
- IB QDR network
- SuSE SLES 10 / Red Hat OS
- GPFS 3.4.x
- 4 x dual-port FC 4 Gb/s adapters
RamSan-620: Fast Random Access File System
[Diagram: TMS RamSan-620 connected through FC Switch1 and FC Switch2 to Store1 and Store2; each store is a dual-CPU GPFS IO server with two FC ports (4 Gb/s each) and IB ports (40 Gb/s) into the IB network serving the GPFS clients. Link legend: FC 8 Gb/s, Ethernet 1 Gb/s, IB QDR 40 Gb/s]
- Store1: GPFS IO server
- Store2: GPFS IO server
- SLC SSDs
Benchmark: FIO from the IO servers
- FIO run from the GPFS IO servers
- IB network was not used
- GPFS clients were not involved
- Randread IOPs, 4K block size
- Randwrite IOPs, 4K block size
- Throughput randwrite, 1M bs
- Throughput randread, 1M bs
- 8 devices, each exported through an FC port
- GPFS NSDs were used to handle data and metadata (a configuration sketch follows this list)
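A minimal sketch of how such NSDs could be defined in GPFS 3.4; the device names, server names, filesystem name, and the 1M filesystem block size are assumptions for illustration, not the actual CSCS descriptors:

    # disk.desc - one line per RamSan LUN (hypothetical names)
    # DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName
    /dev/sdb:store1:store2:dataAndMetadata:1:nsd1
    /dev/sdc:store2:store1:dataAndMetadata:1:nsd2

    mmcrnsd -F disk.desc                      # register the LUNs as NSDs
    mmcrfs /gpfs ramsanfs -F disk.desc -B 1M  # create the filesystem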
Results from IO servers
- Random read using 4K block size: each interface delivers ~95K IOPs, 380K IOPs in total
Results from IO servers
- Random write using 4K block size: each interface delivers ~83K IOPs, 332K IOPs in total
Benchmark: FIO from the GPFS clients
- IB network was used
- Using RDMA on GPFS showed some improvements (see the sketch after this list)
- FIO ran only on the clients, not on the GPFS servers
- Randread IOPs, 4K bs
- Randwrite IOPs, 4K bs
- Throughput randwrite, 1M bs
- Throughput randread, 1M bs
(bs = block size)
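The RDMA improvement mentioned above comes from GPFS's InfiniBand verbs support, enabled through cluster configuration. A minimal sketch, assuming a single Mellanox HCA whose port name is hypothetical:

    mmchconfig verbsRdma=enable         # turn on RDMA data transfers
    mmchconfig verbsPorts="mlx4_0/1"    # HCA/port name is an assumption
    mmshutdown -a && mmstartup -a       # restart GPFS to apply the change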
Results from GPFS clients
- Random read using 4K block size: each interface delivers ~43K IOPs, 190K IOPs in total
Results from GPFS clients
- Random write using 4K block size: each interface delivers ~18K IOPs, 72K IOPs in total
Results: Throughput
- 2.1 GB/s write throughput using 1 MB block size
- 1.5 GB/s read throughput using 1 MB block size
- Client results could be improved by adding more clients
- GPFS proved to be a scalable solution on the RamSan
- Not all components are hot-swappable