Technology evaluation at CSCS including BeeGFS parallel filesystem Hussein N. Harake CSCS-ETHZ
Agenda CSCS About the Systems Integration (SI) Unit Technology Overview DDN IME DDN WOS OpenStack BeeGFS Case Study What is BeeGFS? Test System Layout Tuning Monitoring Benchmark tools Results Next Steps Monitoring and Profiling Q&A CSCS 2017 2
CSCS (Swiss National Supercomputing Centre)
- Founded in 1991
- Enables world-class research with a scientific user lab
- Available to domestic and international researchers through a transparent, peer-reviewed allocation process
- Open to academia as well as to users from industry and the business sector
- Operated by ETH Zurich and located in Lugano
24 years of supercomputers at CSCS
- 1991 NEC SX3, 5.5 GF (Adula)
- 1996 NEC SX4, 10 GF (Gottardo)
- 1999 NEC SX5, 64 GF (Prometeo)
- 2002 IBM SP4, 1.3 TF (Venus)
- 2005 Cray XT3, 5.8 TF (Palu)
- 2006 IBM P5, 4.5 TF (Blanc)
- 2009-12 Cray XE6, 402 TF (Monte Rosa)
- 2012-13 Cray XC30, 7.7 PF (Piz Daint)
- 2014 XC30, 1.25 PF (Piz Daint extension)
Data Centre
- 2000 sq.m machine room
- 20 MW of power and cooling capacity
- Lake water cooling, 700 litres/s
Overview of Systems Integration (SI) Unit Unit missions: - Managing projects - Relations with Vendors - Evaluating Technologies - Software deployments
Greina Cluster
Technology Overview DDN IME Image courtesy of DDN
Technology Overview DDN WOS (1) Image courtesy of DDN
Technology Overview DDN WOS (2)
Technology Overview DDN WOS (3)
Technology Overview - OpenStack Image source: https://www.openstack.org/software/
Eidos Layout
BeeGFS Case Study
What is BeeGFS?
- Parallel filesystem, HPC oriented
- Formerly called FhGFS
- Alternative to Lustre and GPFS
- Developed by Fraunhofer, open source
- Support delivered by ThinkParQ
Image courtesy of BeeGFS
Basic Features of BeeGFS
- Supports failover for data and metadata using tools such as Pacemaker and Heartbeat
- Replication-based failover mechanism
- Supports multiple data and metadata servers and targets
- Supports quotas
- Can use RobinHood to scan the entire filesystem
- BeeGFS On Demand filesystems (BeeOND)
- Easy to deploy and manage
- Supports x86 and OpenPOWER platforms
Easy to deploy
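To illustrate how little is involved, a minimal one-host-per-role setup can be sketched with the beegfs-setup helper scripts shipped with the BeeGFS packages. The hostname mgmt01, the service/target IDs, and all paths below are placeholders, not the configuration used at CSCS:

```shell
# Management service (assumed host: mgmt01)
/opt/beegfs/sbin/beegfs-setup-mgmtd -p /data/beegfs/mgmtd

# Metadata service (-s = numeric service ID, -m = management host)
/opt/beegfs/sbin/beegfs-setup-meta -p /data/beegfs/meta -s 1 -m mgmt01

# Storage service (-i = numeric storage target ID)
/opt/beegfs/sbin/beegfs-setup-storage -p /data/beegfs/storage -s 1 -i 101 -m mgmt01

# Client: point it at the management host, then start the services
/opt/beegfs/sbin/beegfs-setup-client -m mgmt01
systemctl start beegfs-mgmtd beegfs-meta beegfs-storage beegfs-client
```

Each setup script only writes the corresponding configuration file, so the whole filesystem can be brought up with a handful of commands.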
BeeOND
- Creates a filesystem on demand
- Uses the hard drives / SSDs of every compute node
- The filesystem is created by submitting a job to the scheduler (we are working on confirming SLURM support)
- Memory (tmpfs) can be used instead of SSDs
- We used 20 SSDs on 20 nodes for our tests
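The on-demand workflow above can be sketched with the beeond wrapper; the node file and paths are placeholders for whatever the batch system provides:

```shell
# Node file: one compute-node hostname per line (e.g. from the batch system)
NODEFILE=/tmp/nodefile

# Start a BeeOND instance across the nodes:
#   -d = local data path on each node (the SSD), -c = client mountpoint
beeond start -n "$NODEFILE" -d /local/ssd/beeond -c /mnt/beeond

# ... run the job against /mnt/beeond and stage results out ...

# Tear the instance down again at the end of the job
beeond stop -n "$NODEFILE" -L -d
```

The instance only lives for the duration of the job, which is what keeps it off the site-wide parallel filesystem.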
Benefits of BeeOND
- Benefits from otherwise unused space
- No impact on the main parallel filesystem
- Real utilization of the high-speed network
- Filesystem scales with the number of compute nodes
- Open point: what is the overhead on the compute nodes?
Test System Layout
- DDN 7700: one couplet (two controllers), 4 × FDR links
- One enclosure with 60 drives
- 6 SSDs in one RAID volume; 6 × 9-drive RAID 5 volumes
- Two x86 servers: dual-socket Sandy Bridge, 128 GB memory, 2 × FDR links
- Fabric: 1 × FDR link
Tuning the servers

echo 5 > /proc/sys/vm/dirty_background_ratio
echo 20 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/defrag

for dev in dm-0 dm-1 dm-2 dm-3 dm-4 dm-5 dm-6
do
    echo deadline > /sys/block/$dev/queue/scheduler
    echo 4096 > /sys/block/$dev/queue/nr_requests
    echo 32768 > /sys/block/$dev/queue/read_ahead_kb
    echo 32767 > /sys/block/$dev/queue/max_sectors_kb
done

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 1 > /proc/sys/vm/zone_reclaim_mode

Documentation for the tuned parameters:
https://www.kernel.org/doc/documentation/sysctl/vm.txt
https://access.redhat.com/solutions/46111
http://www.slideshare.net/rampalliraj/linux-kernel-io-schedulers?from_action=save
Monitoring client activities (1)
Monitoring server activities (2)
Benchmark tools
- mdtest (metadata performance): https://sourceforge.net/projects/mdtest/
- IOzone (read and write throughput): http://www.iozone.org
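For reference, typical invocations of the two tools might look like this; the process counts, sizes, and paths are illustrative, not the exact parameters used in these tests:

```shell
# mdtest via MPI: -n = files/dirs per process, -i = iterations,
# -d = working directory on the filesystem under test
mpirun -np 64 mdtest -n 1000 -i 3 -d /mnt/beeond/mdtest

# IOzone distributed throughput mode: -+m = client/machine list,
# -t = number of processes, -s = file size per process, -r = record size,
# -i 0 / -i 1 = write+rewrite / read+reread tests,
# -c / -e = include close() and flush times in the measurement
iozone -+m clients.txt -t 64 -s 4g -r 1m -i 0 -i 1 -c -e
```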
IOzone results on /beegfs

64 initial writers:
  Children see throughput = 5032700.90 kb/sec
  Min throughput per process = 63754.09 kb/sec
  Max throughput per process = 103798.58 kb/sec
  Avg throughput per process = 78635.95 kb/sec
  Min xfer = 12880896.00 kb

64 rewriters:
  Children see throughput = 4996297.63 kb/sec
  Min throughput per process = 68781.82 kb/sec
  Max throughput per process = 90666.23 kb/sec
  Avg throughput per process = 78067.15 kb/sec
  Min xfer = 16473088.00 kb

64 readers:
  Children see throughput = 4225632.91 kb/sec
  Min throughput per process = 40047.24 kb/sec
  Max throughput per process = 77678.61 kb/sec
  Avg throughput per process = 66025.51 kb/sec
  Min xfer = 10813440.00 kb

64 re-readers:
  Children see throughput = 4253662.00 kb/sec
  Min throughput per process = 56998.73 kb/sec
  Max throughput per process = 76042.87 kb/sec
  Avg throughput per process = 66463.47 kb/sec
  Min xfer = 15729664.00 kb
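As a quick plausibility check on these numbers, the aggregate ("children see") throughput should be close to the per-process average multiplied by the 64 processes. A short Python sketch, with the values copied from the output above:

```python
# Cross-check the IOzone summary: aggregate throughput vs. per-process
# average times the number of processes (small rounding drift expected).
NPROCS = 64

# test name -> (aggregate kb/sec, avg per-process kb/sec), from the slide
results = {
    "initial writers": (5032700.90, 78635.95),
    "rewriters":       (4996297.63, 78067.15),
    "readers":         (4225632.91, 66025.51),
    "re-readers":      (4253662.00, 66463.47),
}

for test, (aggregate, avg_per_proc) in results.items():
    reconstructed = avg_per_proc * NPROCS
    # tolerate rounding of the reported per-process average
    assert abs(reconstructed - aggregate) < NPROCS, test
    print(f"{test}: {aggregate / 1024 / 1024:.2f} GB/s aggregate")
```

The writers thus peak at roughly 4.8 GB/s aggregate and the readers at roughly 4.0 GB/s on this test system.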
Mdtest results on BeeOND (charts): directory creation, directory stat, and directory removal rates (directories per second), each plotted against the number of MDSs (1, 2, 4, 8, 16, 20).
Mdtest results on BeeOND (charts): file creation, file stat, and file removal rates (files per second), each plotted against the number of MDSs.
Next steps
- Scaling on a bigger cluster
- Verifying the failover procedures
- Verifying the BeeOND overhead on compute nodes
- Using NVMe instead of SSDs
- Using tmpfs
- Creating BeeOND filesystems through SLURM jobs
- Using RobinHood to scan millions of files
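One of the open items, creating BeeOND instances from SLURM jobs, could be prototyped along these lines; the node count, application name, and all paths are assumptions for illustration:

```shell
#!/bin/bash
#SBATCH --nodes=20

# Build a node file from the job's allocation
NODEFILE=$(mktemp)
scontrol show hostnames "$SLURM_JOB_NODELIST" > "$NODEFILE"

# Spin up BeeOND on the allocated nodes for the lifetime of the job
beeond start -n "$NODEFILE" -d /local/ssd/beeond -c /mnt/beeond

# Run the I/O-heavy application against the on-demand filesystem
srun ./my_io_heavy_app /mnt/beeond

# Tear down the instance and clean up
beeond stop -n "$NODEFILE" -L -d
rm -f "$NODEFILE"
```

In production this logic would more likely live in a SLURM prolog/epilog pair so that teardown also happens when a job fails.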
Check-MK Monitoring and Profiling
CPU Utilization
Q&A hussein@cscs.ch