Automated Configuration and Administration of a Storage-class Memory System to Support Supercomputer-based Scientific Workflows
1 Automated Configuration and Administration of a Storage-class Memory System to Support Supercomputer-based Scientific Workflows
J. Bernard (1), P. Morjan (2), B. Hagley (3), F. Delalondre (1), F. Schürmann (1), B. Fitch (4), A. Curioni (5)
(1) Blue Brain Project (BBP), Geneva, Switzerland
(2) IBM, Böblingen, Germany
(3) Swiss National Computing Center (CSCS), Lugano, Switzerland
(4) IBM, Yorktown Heights, NY, USA
(5) IBM, Zurich, Switzerland
2 Outline
- Why do we need a storage-class memory system?
- Blue Brain Project hardware system design
- Why do we need system management automation?
- First implementation supporting application user-defined system configuration
3 Example complex workflow
(diagram) The Visualization Cluster runs a Volume Renderer (which uses Field Voxelization) and a Report Reader; both read events from a Key-Value Store and from GPFS hosted on the BGAS nodes, while the HPC simulation running on the Compute nodes writes into them.
4 Why use storage-class memory?
- Multi-step, complex workflows require a lot of effort from scientific application users and developers: building the brain tissue model, simulating its electrical evolution, analysis of simulation results, visualization
- Brain modeling requires a large memory footprint: a rat brain is about 100 TB, and the estimate for a human brain is 100 PB
- DRAM alone is not cost effective, so a memory hierarchy is required
5 Outline
- Why do we need a storage-class memory system?
- Blue Brain Project hardware system design
- Why do we need system management automation?
- First implementation supporting application user-defined system configuration
6 BBP resources at CSCS
System overview:
- Blue Gene/Q: 4 racks of compute nodes (8 midplanes, 4096 nodes)
- 8 BG/Q production I/O drawers (64 nodes) and 8 BGAS I/O drawers (64 nodes)
- GSS storage cluster
- x86 compute cluster and Viz x86 compute cluster
Distributed management:
- CSCS storage team, CSCS BG team, BBP HPC & infrastructure team
7 BGAS I/O nodes compared to standard IONs
8 BGAS I/O nodes: hardware
- PCIe 2.0 x8
- InfiniBand replaced by 10 GbE optical cables between drawers
- <2,2,2> torus extended to <4,4,4>, potentially expandable to <8,8,8>
- 2 TiB SLC flash
9 BGAS I/O nodes: NVM user interfaces
- Direct storage access (DSA): OFED RDMA verbs provider; applications need to be modified
- Block devices based on DSA: ext4 block device; some overhead, but POSIX
- GPFS: NSDs communicate over iWARP; more overhead, but no data-locality worries, and POSIX
10 HS4 flash card partitioning
- 2 TiB raw capacity: 0.6 TiB reserved for wear-leveling, 1.4 TiB usable
- The usable capacity is divided into a GPFS flash partition (block device), an ext4 flash partition, and a DSA partition
- Native DSA interface; a verbs block device (VBD) on top of DSA provides block access
- The GPFS, ext4, and DSA partitions can each be sized from 0 to 100% of the usable capacity (a sizing sketch follows below)
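To make the sizing concrete, the shell sketch below turns a chosen percentage split into per-card byte counts; the 1.4 TiB usable figure comes from the slide, while the 50/25/25 split and the variable names are this sketch's own assumptions:

    # hypothetical sizing helper for one HS4 card
    USABLE=$((14 * 1024**4 / 10))            # ~1.4 TiB usable (2 TiB raw minus 0.6 TiB wear-leveling reserve)
    GPFS_PCT=50; EXT4_PCT=25; DSA_PCT=25     # each interface may take 0..100% of the usable capacity
    gpfs_bytes=$((USABLE * GPFS_PCT / 100))
    ext4_bytes=$((USABLE * EXT4_PCT / 100))
    dsa_bytes=$((USABLE * DSA_PCT / 100))
    echo "GPFS=${gpfs_bytes} ext4=${ext4_bytes} DSA=${dsa_bytes}"   # byte counts per node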
11 Outline
- Why do we need a storage-class memory system?
- Blue Brain Project hardware system design
- Why do we need system management automation?
- First implementation supporting application user-defined system configuration
12-15 Why do we need automation? (progressive build of one slide)
- The system is highly configurable, supporting different memory interfaces (GPFS, SKV)
- Automated partitioning based on user requirements enables fast application prototyping
- Each workflow stage (circuit building, simulation, analysis/visualization) gets its own SLURM queue, backed by the appropriate combination of SKV, GPFS, and ext4
16 What do we want to automate?
- System software management: new major releases and release updates
- Two-level partitioning: cluster partitioning and on-node flash memory partitioning
- Integration with the rest of the eco-system (Blue Gene/Q, x86 cluster, GSS storage)
17 System maintenance & update
BGAS sandbox creation workflow (major release):
- Build the sandbox and create the ramdisk
- Add RPMs
- Compile the GPFS kernel module
- Install Soft-iWARP
- Integrate with other services by copying config files into the sandbox: ssh, Kerberos, SLURM, environment modules (a shell sketch of these steps follows)
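As an illustration, the major-release workflow above could be scripted roughly as follows; every path, package name, and template location here is an assumption of this sketch rather than the actual BBP tooling (the GPFS portability-layer targets Autoconfig/World/InstallImages are the standard ones, the rest is invented):

    # build_bgas_sandbox.sh -- hypothetical sketch of a BGAS sandbox build
    set -e
    SANDBOX=/bgsys/sandboxes/bgas-$(date +%Y%m%d)        # assumed sandbox location
    mkdir -p "$SANDBOX"
    # 1. build the sandbox root and create the ramdisk image (template path is invented)
    cp -a /bgsys/drivers/ppcfloor/ramdisk-template/. "$SANDBOX"/
    # 2. add the RPMs the BGAS nodes need (package list is illustrative)
    rpm --root "$SANDBOX" -ivh gpfs.base-*.rpm slurm-*.rpm krb5-workstation-*.rpm
    # 3. compile the GPFS kernel (portability) module inside the sandbox tree
    ( cd "$SANDBOX"/usr/lpp/mmfs/src && make Autoconfig && make World && make InstallImages )
    # 4. install Soft-iWARP and integrate with other services by copying config files
    rpm --root "$SANDBOX" -ivh softiwarp-*.rpm
    cp /etc/krb5.conf /etc/krb5.keytab "$SANDBOX"/etc/
    cp /etc/ssh/sshd_config            "$SANDBOX"/etc/ssh/
    cp /etc/slurm/slurm.conf           "$SANDBOX"/etc/slurm/
    cp -r /etc/modulefiles             "$SANDBOX"/etc/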
18 Shell access to BGAS I/O nodes
- SSH with Kerberos from any other BBP user node: add /etc/krb5.conf and /etc/krb5.keytab to the sandbox
- DNS and /etc/hosts: sshd must return an FQDN that is consistent across all user-accessible nodes (Viz cluster, BGAS nodes, BG/Q and BGAS front-end nodes, BBP desktops), as illustrated below
- Access will be limited to users with running jobs once fully productionized
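A hedged illustration of what that means in practice; the node name bbpbgas001 and the sandbox path are invented for this example, and the real sshd settings may differ:

    # Kerberos client configuration goes into the sandbox image (see the build sketch above)
    cp /etc/krb5.conf /etc/krb5.keytab "$SANDBOX"/etc/
    # sshd on the BGAS nodes should accept GSSAPI (Kerberos) authentication
    grep GSSAPIAuthentication "$SANDBOX"/etc/ssh/sshd_config   # expect: GSSAPIAuthentication yes
    # every user-accessible node must resolve the same FQDN for a BGAS node,
    # and that FQDN must match the node's Kerberos host principal
    getent hosts bbpbgas001                                    # hypothetical node name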
19 User-defined configuration parameters
Basic configuration parameters at partitioning time (an illustrative parameter file follows this list):
- How many clusters: 1 to 8 BGAS clusters
- How many nodes per cluster: 8 to 64 nodes
- How much flash allocated to DSA: 0 to 100%
- How much flash allocated to local ext4: 0 to 100%
- How much flash allocated to GPFS: 0 to 100%
Advanced configuration (GPFS only):
- GPFS page pool: 1 to 8 GB
- GPFS block size: 64 KB to 4 MB
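As an illustration, a user's request could be captured in a small parameter file like the one below; the file format and key names are invented for this sketch and are not the project's actual interface:

    # bgas_partition.conf -- hypothetical partitioning request
    clusters=2                 # 1..8 BGAS clusters
    nodes_per_cluster=32       # 8..64 nodes per cluster
    flash_dsa_pct=25           # share of the usable flash given to raw DSA (0..100%)
    flash_ext4_pct=25          # share given to the node-local ext4 block device (0..100%)
    flash_gpfs_pct=50          # share given to the per-cluster GPFS file system (0..100%)
    # advanced parameters, GPFS only
    gpfs_pagepool=4G           # 1..8 GB
    gpfs_blocksize=1M          # 64 KB .. 4 MB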
20 Overview of partitioning workflow
- On the service node: free the block, boot the block
- On all I/O nodes in the block: partition the flash; partition and set up ext4
- On the first node of each block: set up GPFS
- Integration: grant remote access to the Viz cluster
- Integration: set up remote mounts from the GSS cluster (the whole sequence is sketched below)
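Wired together from the service node, the workflow might look like the sketch below; free_io_block, boot_io_block, nodes_in_block, partition_flash.sh, setup_gpfs.sh, grant_viz_access.sh, and mount_gss_remote.sh are placeholders for the real tools, not documented Blue Gene or BBP commands:

    BLOCK=I0-32                               # I/O block requested by the user
    free_io_block "$BLOCK"                    # service node: free the block
    boot_io_block "$BLOCK"                    # service node: boot it with the BGAS image
    NODES=$(nodes_in_block "$BLOCK")          # expand the block into its I/O node names
    for n in $NODES; do                       # all I/O nodes: carve the flash, set up ext4
        ssh "$n" partition_flash.sh --dsa 25 --ext4 25 --gpfs 50
    done
    FIRST=$(set -- $NODES; echo "$1")         # first node of the block
    ssh "$FIRST" setup_gpfs.sh "$BLOCK"       # create NSDs and the file system, mount it
    ssh "$FIRST" grant_viz_access.sh "$BLOCK" # integration: allow the Viz cluster to mount
    ssh "$FIRST" mount_gss_remote.sh          # integration: remote-mount the GSS file systems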
21 Partitioning BGAS nodes into I/O blocks
- Block names such as I0-64, I0-32, I4-32, I6-16, I0-48, etc.
- 15 possible I/O blocks, connected via the <4,4,4> 3D torus:
- 8-drawer / 64-node block: drawers 0-7
- 4-drawer / 32-node blocks: drawers 0-3, 4-7
- 2-drawer / 16-node blocks: drawers 0-1, 2-3, 4-5, 6-7
- 1-drawer / 8-node blocks: drawers 0, 1, 2, 3, 4, 5, 6, 7
22 Compute node partitions
(diagram: compute-node partitions mapped to the I/O blocks I0-64, I0-48, I0-32, I4-32, and I6-16)
23 BGAS GPFS cluster creation
Clusters are identified and authenticated using:
- the cluster name
- an automatically generated cluster ID
- an automatically generated SSL certificate that authenticates the name and ID
24 BGAS GPFS remote cluster access (integrating with the rest of the system)
Remote access requires:
- key generation
- certificate exchange
- mmauth on the server cluster
- mmremote* commands on the client cluster
But all of these require root, and avoiding a fresh mmauth update on every repartition depends on the cluster ID not changing. The corresponding GPFS commands are sketched below.
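On the GPFS side, the exchange the slide describes corresponds roughly to the commands below (the commands themselves are standard GPFS administration commands, but the cluster names, contact node, key path, and file system names are illustrative); every one of them has to run as root, which is exactly the constraint that motivates the automation:

    # on the BGAS (server) cluster: generate a key pair and hand the public part to the admins
    mmauth genkey new
    scp /var/mmfs/ssl/id_rsa.pub gss-admin:/tmp/bgas-I0-32.pub
    # on the GSS cluster: authorize the BGAS cluster and grant it access to a file system
    mmauth add bgas-I0-32 -k /tmp/bgas-I0-32.pub
    mmauth grant bgas-I0-32 -f gss_fs0
    # on the Viz (client) cluster: register the remote BGAS cluster and its file system
    mmremotecluster add bgas-I0-32 -n bbpbgas001 -k /tmp/bgas-I0-32.pub
    mmremotefs add bgas_fs -f bgas_fs -C bgas-I0-32 -T /gpfs/bgas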
25 BGAS GPFS cluster creation, step 1 (integrating with the rest of the system)
Set up all 15 clusters once, one for each possible BGAS I/O block:
- extract and save the cluster names, IDs, and certificates
- exchange certificates with the GSS and Viz cluster admins
- the GSS cluster authorizes each BGAS cluster
- each BGAS cluster authorizes mounts by the Viz cluster
- the Viz cluster adds each cluster and its file system
- delete the clusters (they are re-created with the saved identity at partitioning time, see step 2)
26 BGAS GPFS cluster creation, step 2 (integrating with the rest of the system)
At partitioning time, re-create the cluster and restore its saved identity:
    mmcrcluster
    mmauth genkey new
    <TOTALLY UNSUPPORTED>
      cp -af $CERT_DIR/* /var/mmfs/ssl
      new_cluster_id=$(mmlsconfig clusterid)
      sed "s/${old_cluster_id}/${new_cluster_id}/" mmfs.cfg
      mmauth genkey propagate
    </TOTALLY UNSUPPORTED>
27 Sharing scripts and public certificates (integrating with the rest of the system)
- git repository hosted at EPFL
- commit access for BBP and CSCS
- automated checkout by Puppet on the Viz cluster
- checkout by a non-root user with read-only access
28 Mounting BGAS GPFS on the Viz cluster (integrating with the rest of the system)
- BGAS GPFS file systems come and go, and Viz nodes get rebooted
- When a BGAS file system is created, touch a status file on a GSS file system; every N minutes, check the status file
- If its mtime is less than N minutes old, or the node uptime is less than N minutes: mmlsfs && mmmount, otherwise mmumount
- Run as a non-root admin user via sudo (a cron-style sketch follows)
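Expressed as a small cron-driven script on the Viz nodes, it could look like the sketch below; the status-file path, the 5-minute window standing in for "N minutes", the file system name, and the sudo setup are assumptions of this sketch:

    # run every few minutes from cron as the non-root GPFS admin user (sudo rules assumed)
    N=300                                      # "N minutes" from the slide, here 5 minutes in seconds
    STATUS=/gss/bgas/status/bgas_fs.alive      # touched on a GSS file system when a BGAS fs is created
    now=$(date +%s)
    mtime=$(stat -c %Y "$STATUS" 2>/dev/null || echo 0)
    uptime_s=$(cut -d. -f1 /proc/uptime)
    if [ $((now - mtime)) -lt "$N" ] || [ "$uptime_s" -lt "$N" ]; then
        # a BGAS file system appeared recently, or this Viz node just rebooted:
        # mount it if it exists, otherwise make sure any stale mount is gone
        sudo /usr/lpp/mmfs/bin/mmlsfs bgas_fs >/dev/null 2>&1 \
            && sudo /usr/lpp/mmfs/bin/mmmount bgas_fs \
            || sudo /usr/lpp/mmfs/bin/mmumount bgas_fs
    fi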
29 Repartitioning performance
(chart: boot time and partition time, in seconds)
30 Outline
- Why do we need a storage-class memory system?
- Blue Brain Project hardware system design
- Why do we need system management automation?
- First implementation supporting application user-defined system configuration
31 User experience: expected workflow
Configuring BGAS:
- Configuring BGAS according to multiple teams' needs (multi-tenancy)
- Configuring a BGAS cluster according to one team's needs
Using BGAS for fast scientific development:
- from IBM Blue Gene/Q
- from the Viz cluster
- from BGAS itself, as a regular cluster
32 Expected user development cycle (configuring BGAS)
- A super-user (manager, PI) decides how the cluster should be partitioned based on several teams' needs (time scale: a few weeks)
- Team developers decide how they want the flash of their cluster partitioned at job submission time (time scale: a few days)
33 Using BGAS from Blue Gene/Q: switching I/O links automatically
- The BGAS queue is seen as a regular queue
- Jobs run on CNK compute nodes
- I/O is routed automatically to BGAS nodes instead of the production I/O nodes
$ sinfo
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE MIDPLANELIST
debug* down :00:00 8K 512: idle bgq1011
test up :00:00 8K 512: allocated bgq1001
prod up 512-2K 7-00:00:00 8K 512: drained bgq1000
prod up 512-2K 7-00:00:00 8K 512: K allocated bgq[0000x0011,1001]
prod-large up 1-4K 2-12:00:00 8K 512: drained bgq1000
prod-large up 1-4K 2-12:00:00 8K 512: K allocated bgq[0000x0011,1001]
prod-large up 1-4K 2-12:00:00 8K 512: idle bgq[1010x1011]
bgas up 1-4K 1-00:00:00 8K 512: drained bgq1000
bgas up 1-4K 1-00:00:00 8K 512: K allocated bgq[0000x0001,1001]
bgas up 1-4K 1-00:00:00 8K 512:16 2K idle bgq[0010x1011]
34 Using BGAS from Blue Gene/Q: switching I/O links automatically
Switching between I/O nodes:
- Compute nodes are cabled to both sets of IONs, but only one link can be active
- Compute nodes need to be rebooted to switch
SLURM prolog:
- Check whether the partition is bgas
- Are the requested BGAS IONs already linked? If not, deallocate the compute nodes and switch the links
- Restart the job and boot the compute nodes
- Leave the links in place afterwards to minimize rebooting (see the prolog sketch below)
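In pseudo-shell, the prolog decision reduces to the sketch below; bgas_links_active, deallocate_block, switch_io_links, and boot_block stand in for the site-specific tooling that actually reroutes the links, and the SLURM variables shown are only available to the prolog depending on SLURM version and configuration:

    # SLURM prolog sketch -- placeholders throughout
    if [ "$SLURM_JOB_PARTITION" = "bgas" ]; then
        if ! bgas_links_active "$SLURM_JOB_NODELIST"; then   # requested BGAS IONs already linked?
            deallocate_block "$SLURM_JOB_NODELIST"           # free the compute nodes
            switch_io_links  "$SLURM_JOB_NODELIST"           # route I/O to the BGAS drawers
            boot_block       "$SLURM_JOB_NODELIST"           # reboot compute nodes, restart the job
        fi
        # links are deliberately left in place afterwards to minimize future reboots
    fi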
35 Using BGAS as an independent cluster
- Log in to a BGAS front-end node
- Use the queue of your BGAS cluster: all created queues are visible, but only the queues of created clusters are up
$ sinfo
PARTITION AVAIL JOB_SIZE TIMELIMIT CPUS S:C:T NODES STATE NODELIST
bgas down :00:00 1 1:12:1 8 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 8 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 8 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 8 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 8 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 8 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 8 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 8 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 16 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 16 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 16 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 16 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 32 idle bbpbgas[ ]
bgas down :00:00 1 1:12:1 32 idle bbpbgas[ ]
bgas up :00:00 1 1:12:1 64 idle bbpbgas[ ]
36 GPFS IOR performance, reads, MiB/node
(chart: read bandwidth for 1, 2, and 4 processes per node; RDMA over the roq (iWARP) interface, 1 MB blocks)
37 GPFS IOR performance, writes, MiB/node
(chart: write bandwidth for 1, 2, and 4 processes per node; RDMA over the roq (iWARP) interface, 1 MB blocks)
38 Next steps
- Further integration to increase automation: integration of BGAS cluster partitioning with SLURM, integration of flash partitioning with SLURM, complete integration of all services
- Getting user feedback & experience
- Enhancement & addition of new services: performance benchmarking of data store interfaces (SKV, ...), automated data transfer/copy between BGAS & GSS, multicluster allocation via co-scheduling to support complex workflow execution