Mahidhar Tatineni, Director of User Services, SDSC. HPC Advisory Council China Conference October 18, 2017, Hefei, China

1 Experiences with HPC and Big Data Applications on the SDSC Comet Cluster: Using Virtualization, Singularity Containers, and RDMA-enabled Data Analytics Tools. Mahidhar Tatineni, Director of User Services, SDSC. HPC Advisory Council China Conference, October 18, 2017, Hefei, China. Acknowledgements: Trevor Cooper, Dmitry Mishin, Christopher Irving, Gregor von Laszewski (IU), Fugang Wang (IU), Rick Wagner (Globus group, U. Chicago), Phil Papadopoulos.

2 This work is supported by the National Science Foundation, award ACI

3 Overview. Comet hardware: compute and GPU nodes, network, filesystems. MPI implementations, including MVAPICH2-GDR results. Data analytics frameworks and tools on Comet: RDMA-Hadoop, RDMA-Spark, OSU-Caffe. Virtual Cluster (VC): design layout, software. VC benchmarks: MPI, applications. NSF Award# , Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science. PI: Michael Norman. Co-PIs: Shawn Strande, Philip Papadopoulos, Robert Sinkovits, Nancy Wilkins-Diehr. SDSC project in collaboration with Indiana University (led by Geoffrey Fox).

4 Comet: System Characteristics
Total peak flops ~2.1 PF; Dell primary integrator.
Intel Haswell processors w/ AVX2; Mellanox FDR InfiniBand.
1,944 standard compute nodes (46,656 cores): dual CPUs, each 12-core, 2.5 GHz; 128 GB DDR MHz DRAM; 2 x 160 GB SSDs (local disk).
72 GPU nodes: 36 nodes with two NVIDIA K80 cards, each with dual Kepler3 GPUs, same CPU as main partition; 36 nodes with 4 P100 GPUs each, 2 Intel Broadwell processors (14 cores each).
4 large-memory nodes: 1.5 TB DDR MHz DRAM; four Haswell processors/node; 64 cores/node.
Hybrid fat-tree topology: FDR (56 Gbps) InfiniBand; rack-level (72 nodes, 1,728 cores) full bisection bandwidth; 4:1 oversubscription cross-rack.
Performance Storage (Aeon): 7.6 PB, 200 GB/s; Lustre; scratch & persistent storage segments.
Durable Storage (Aeon): 6 PB, 100 GB/s; Lustre; automatic backups of critical data.
Home directory storage; gateway hosting nodes; virtual image repository.
100 Gbps external connectivity to Internet2 & ESNet.
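The core counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the node, socket, core, and clock figures quoted above. The value of 16 double-precision flops per core per cycle is an assumption that is not on the slide (two AVX2 FMA units operating on 4-wide vectors), so the resulting CPU-only peak is an estimate; the remainder of the quoted ~2.1 PF would presumably come from the GPU and large-memory nodes.

```python
# Figures quoted on the slide.
STANDARD_NODES = 1944
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 12
CLOCK_GHZ = 2.5
NODES_PER_RACK = 72

# Assumption (not on the slide): 16 double-precision flops per core per cycle,
# i.e. two AVX2 FMA units on 4-wide vectors, at the nominal clock.
DP_FLOPS_PER_CYCLE = 16

cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
total_cores = STANDARD_NODES * cores_per_node
rack_cores = NODES_PER_RACK * cores_per_node

# cores * GHz * flops/cycle gives GFLOPS; divide by 1e6 for PFLOPS.
cpu_peak_pflops = total_cores * CLOCK_GHZ * DP_FLOPS_PER_CYCLE / 1e6

print(f"cores per node:       {cores_per_node}")
print(f"standard-node cores:  {total_cores}")
print(f"cores per rack:       {rack_cores}")
print(f"estimated CPU peak:   {cpu_peak_pflops:.2f} PF (standard nodes only)")
```

The computed totals match the 46,656 cores and 1,728 cores per rack stated on the slide.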

5 Comet Network Architecture

6 Comet Lustre Filesystems. Comet features two Lustre filesystems: scratch and projects storage. Projects storage is mounted on multiple systems, including non-IB-connected clusters. Backend storage servers are connected via a 40 Gbit Ethernet fabric. The Comet network design handles this by using bridge switches (Mellanox). The design provides the flexibility to mount a filesystem on multiple machines and keeps the aggregate performance high. Each filesystem (scratch and projects) achieved 100 GB/s on an IOR bandwidth test at scale.
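To put the 100 GB/s figure next to the 40 Gbit Ethernet backend, the following back-of-the-envelope sketch computes the minimum number of 40 Gbit links that would have to be driven in parallel to reach that aggregate rate. It ignores protocol overhead and assumes perfectly balanced traffic; the actual number of storage servers and links is not given on the slide.

```python
import math

LINK_GBIT_PER_S = 40        # backend Ethernet link speed (from the slide)
TARGET_GBYTE_PER_S = 100    # IOR aggregate bandwidth (from the slide)

# 8 bits per byte: a 40 Gbit/s link carries at most 5 GB/s.
link_gbyte_per_s = LINK_GBIT_PER_S / 8

# Lower bound only: assumes zero protocol overhead and perfect load balance.
min_links = math.ceil(TARGET_GBYTE_PER_S / link_gbyte_per_s)

print(f"per-link ceiling: {link_gbyte_per_s:.1f} GB/s")
print(f"minimum links for {TARGET_GBYTE_PER_S} GB/s: {min_links}")
```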

7 Comet: MPI options, RDMA enabled software. MVAPICH2 (v2.1) is the default MPI on Comet. Intel MPI and OpenMPI are also available. MVAPICH2-X v2.2a provides a unified high-performance runtime supporting both MPI and PGAS programming models. MVAPICH2-GDR (v2.2) is available on the GPU nodes featuring NVIDIA K80s and P100s; benchmark and application performance results are in this talk. RDMA-Hadoop (2x-1.1.0) and RDMA-Spark (0.9.4) (from Dr. Panda's HiBD lab) are also available.

8 Comet K80 node architecture. 4 GPUs per node. GPUs (0,1) and (2,3) can do P2P communication. Mellanox InfiniBand adapter associated with second socket (GPUs 2, 3).

9 OSU Latency (osu_latency) Benchmark: Intra-node, K80 nodes. Latency between GPU 2, GPU 3: 2.82 µs. Latency between GPU 1, GPU 2: 3.18 µs.

10 OSU Latency (osu_latency) Benchmark: Inter-node, K80 nodes. Latency between GPU 2, process bound to CPU 1 on both nodes: 2.27 µs. Latency between GPU 2, process bound to CPU 0 on both nodes: 2.47 µs. Latency between GPU 0, process bound to CPU 0 on both nodes: 2.43 µs.

11 Comet P100 node architecture. 4 GPUs per node. GPUs (0,1) and (2,3) can do P2P communication. Mellanox InfiniBand adapter associated with first socket (GPUs 0, 1).
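The two node layouts described in the architecture slides can be captured in a small lookup, which is handy when deciding which GPU to use for inter-node traffic. This is only an illustrative sketch: the dictionary and function names are hypothetical, and it assumes that GPUs 0 and 1 sit on the first socket and GPUs 2 and 3 on the second on both node types, as the slides imply.

```python
# GPU index -> CPU socket (assumed identical on K80 and P100 nodes).
GPU_SOCKET = {0: 0, 1: 0, 2: 1, 3: 1}

# Socket that the InfiniBand adapter is attached to, per node type (from the slides).
HCA_SOCKET = {"K80": 1, "P100": 0}


def can_p2p(gpu_a, gpu_b):
    """True if the two GPUs form one of the P2P-capable pairs (0,1) or (2,3)."""
    return gpu_a != gpu_b and GPU_SOCKET[gpu_a] == GPU_SOCKET[gpu_b]


def hca_local_gpus(node_type):
    """GPUs that share a socket with the InfiniBand adapter on this node type."""
    return [gpu for gpu, socket in GPU_SOCKET.items() if socket == HCA_SOCKET[node_type]]


for node_type in ("K80", "P100"):
    print(node_type, "GPUs next to the IB adapter:", hca_local_gpus(node_type))
print("GPU 0 <-> GPU 1 P2P:", can_p2p(0, 1))
print("GPU 1 <-> GPU 2 P2P:", can_p2p(1, 2))
```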

12 OSU Latency (osu_latency) Benchmark: Intra-node, P100 nodes. Latency between GPU 0, GPU 1: 2.73 µs. Latency between GPU 2, GPU 3: 2.95 µs. Latency between GPU 1, GPU 2: 3.13 µs.
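The intra-node P100 numbers are easier to compare as relative differences. The sketch below takes the three latencies from this slide and expresses each as a percentage above the lowest one; it shows the GPU 2-3 pair roughly 8% above the GPU 0-1 pair, and the cross-pair GPU 1-2 path roughly 15% above it. The helper name is illustrative only.

```python
# Intra-node osu_latency results on the P100 nodes, in microseconds (from the slide).
p100_latency_us = {(0, 1): 2.73, (2, 3): 2.95, (1, 2): 3.13}

best = min(p100_latency_us.values())


def percent_above_best(latency_us):
    """Relative latency increase over the best measured pair, in percent."""
    return (latency_us - best) / best * 100.0


penalties = {pair: percent_above_best(t) for pair, t in p100_latency_us.items()}

for pair, pct in sorted(penalties.items()):
    print(f"GPU {pair[0]} <-> GPU {pair[1]}: "
          f"{p100_latency_us[pair]:.2f} us (+{pct:.1f}% vs best)")
```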

13 OSU Latency (osu_latency) Benchmark: Inter-node, P100 nodes. Latency between GPU 0, process bound to CPU 0 on both nodes: 2.17 µs. Latency between GPU 2, process bound to CPU 1 on both nodes: 2.35 µs.

14 MVAPICH2-GDR Application Example: HOOMD-blue. HOOMD-blue is a general-purpose particle simulation toolkit. Results for the Hexagon benchmark are presented. References: HOOMD-blue web page; HOOMD-blue Benchmarks page: glotzerlab.engin.umich.edu/hoomd-blue/benchmarks.html. J. A. Anderson, C. D. Lorenz, and A. Travesset. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics 227(10): , May /j.jcp. J. Glaser, T. D. Nguyen, J. A. Anderson, P. Liu, F. Spiga, J. A. Millan, D. C. Morse, S. C. Glotzer. Strong scaling of general-purpose molecular dynamics simulations on GPUs. Computer Physics Communications 192: , July /j.cpc.

15 HOOMD-Blue: Hexagon Benchmark. N=1,048,576. Hard particle Monte Carlo. Vertices: [[0.5,0],[0.25, ],[0.25, ],[-0.5,0],[-0.25, ],[0.25, ]]. d= a= nselect=4. Log file period: time steps. SDF analysis: xmax=0.02, δx=10^-4, period: 50 time steps, navg=2000. DCD dump period:

16 HOOMD-Blue: Hexagon Benchmark Strong scaling on K80 nodes

17 RDMA-Hadoop and RDMA-Spark. Network-Based Computing Lab, Ohio State University. NSF funded project in collaboration with Dr. DK Panda*. HDFS, MapReduce, and RPC over native InfiniBand and RDMA over Converged Ethernet (RoCE). Based on Apache distributions of Hadoop and Spark. Version RDMA-Apache-Hadoop-2.x (based on Apache Hadoop 2.6.0) is available on Comet. Version RDMA-Spark (based on Apache Spark 2.1.0) is available on Comet. More details on the RDMA-Hadoop and RDMA-Spark projects at:
*NSF BIGDATA F: DKM: Collaborative Research: Scalable Middleware for Managing and Processing Big Data on Next Generation HPC Systems, Award #s (Ohio State), and (SDSC).

18 RDMA-Hadoop, Spark Exploit performance on modern clusters with RDMA-enabled interconnects for Big Data applications. Hybrid design with in-memory and heterogeneous storage (HDD, SSDs, Lustre). Keep compliance with standard distributions from Apache.

20 OSU-Caffe, CIFAR10 Quick on K80 nodes. Results with K80 nodes. Current runs with data in Lustre filesystem (/oasis/scratch/comet). All Comet GPU nodes have 280 GB of SSD based local scratch space. Future tests with larger test cases planned to evaluate performance advantages of using the SSDs.

21 OSU-Caffe, CIFAR10 Quick on K80 nodes

22 Virtualization on Comet. Comet Virtual Clusters: KVM based, SR-IOV enabled full virtualization. Singularity based containerization: user space only, with namespaces and minimal SetUID.

23 Comet VC Use Cases. Root access to nodes for custom OS and software stack. Example: CAIDA group used it for a workshop, allowing attendees to modify the network stack for research. Allows for isolation of tests. Simplified install for groups with existing management infrastructure. Example: Open Science Grid (OSG) used their existing installation procedures to enable multiple research groups to run on Comet (including LIGO).

24 Singularity Use Cases. Applications with newer library/OS requirements than available on Comet, e.g. TensorFlow, Torch, Caffe. Commercial application binaries with specific OS requirements. Importing Docker images to enable use in a shared HPC environment.

25 Overview of Virtual Clusters on Comet. Projects have a persistent VM for cluster management (modest: single core, 1-2 GB of RAM). Standard compute nodes will be scheduled as containers via the batch system, with one virtual compute node per container. Virtual disk images are stored as ZFS datasets and migrated to and from containers at job start and end. VM use is allocated and tracked like regular computing.

26 User Perspective [Diagram. Labels: persistent virtual front end; Nucleus (API: request nodes, console & power; scheduling, storage management, coordinating network changes, VM launch & shutdown); active virtual compute nodes with disk images attached and synchronized; idle disk images.]

27 Enabling Technologies. KVM: lets us run virtual machines (all processor features). SR-IOV: makes MPI go fast on VMs. Rocks: systems management. ZFS: disk image management. VLANs: isolate virtual cluster management network. pkeys: isolate virtual cluster IB network. Nucleus: coordination engine (scheduling, provisioning, status, etc.). Client: Cloudmesh.

28 User-Customized HPC [Diagram. Labels: public network; physical hosting frontend with private network to physical compute nodes; virtual frontends, each with a virtual private network to its virtual compute nodes; disk image vault.]

29 High Performance Virtual Cluster Characteristics. Comet: Providing Virtualized HPC for XSEDE. All nodes have private Ethernet, InfiniBand, and local disk storage. Virtual compute nodes can network boot (PXE) from their virtual frontend. All disks retain state: user configuration is kept between boots. InfiniBand virtualization: 8% latency overhead, nominal bandwidth overhead. [Diagram: virtual frontend and virtual compute nodes connected by private Ethernet and InfiniBand.]

30 Data Storage/Filesystems. Local SSD storage on each compute node. Limited number of large SSD nodes (1.4 TB) for large VM images. Local (SDSC) network access same as compute nodes. Modest (TB) storage available via NFS now. Future: secure access to Lustre?

31 Cloudmesh. Developed by IU collaborators. The Cloudmesh client enables access to multiple cloud environments from a command shell and command line. We leverage this easy-to-use CLI for virtual cluster management, allowing the use of Comet as infrastructure for virtual clusters. Cloudmesh has more functionality, with the ability to access hybrid clouds (OpenStack, EC2, AWS, Azure); it is possible to extend it to other systems like Jetstream, Bridges, etc. Plans for customizable launchers, available through command line or browser, can target specific application user communities. Reference:

32 Comet Cloudmesh Client (selected commands)
Show the cluster details: cm comet cluster ID
Power 3 nodes on for 6 hours: cm comet power on ID vm-id-[0-3] --walltime=6h
Attach an image: cm comet image attach image.iso ID vm-id-0
Boot node 0: cm comet boot ID vm-id-0
Console: cm comet console vc4
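When these commands are driven from a script, it can help to build the command lines programmatically. The sketch below only assembles strings for the subcommands shown on this slide and does not run anything; the helper names are hypothetical and are not part of Cloudmesh, and the cluster and node names (vc4, vm-vc4-0) are taken from the console example elsewhere in the deck. Note that shlex.join quotes the bracketed node range so a shell will not try to expand it.

```python
import shlex


def comet_cmd(*args):
    """Return a `cm comet ...` command line as one shell-safe string."""
    return shlex.join(["cm", "comet", *args])


def power_on(cluster, nodes, walltime):
    """Command to power on a node or node range for the given walltime."""
    return comet_cmd("power", "on", cluster, nodes, f"--walltime={walltime}")


def attach_image(image, cluster, node):
    """Command to attach an image to a node."""
    return comet_cmd("image", "attach", image, cluster, node)


def console(cluster, node=None):
    """Command to open a console on the frontend, or on a node if one is given."""
    extra = [node] if node else []
    return comet_cmd("console", cluster, *extra)


print(comet_cmd("cluster", "vc4"))
print(power_on("vc4", "vm-vc4-[0-3]", "6h"))
print(attach_image("image.iso", "vc4", "vm-vc4-0"))
print(comet_cmd("boot", "vc4", "vm-vc4-0"))
print(console("vc4", "vm-vc4-0"))
```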

33 Comet Cloudmesh Usage Examples

34 Comet Cloudmesh Client : Console access (COMET)host:client$ cm comet console vc4 vm-vc4-0

35 MPI bandwidth slowdown from SR-IOV is at most 1.21 for medium-sized messages & negligible for small & large ones

36 MPI latency slowdown from SR-IOV is at most 1.32 for small messages & negligible for large ones
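The slowdown factors quoted on the bandwidth and latency slides can be converted into more familiar percentages. The sketch below assumes that "slowdown" means the ratio of native to SR-IOV bandwidth for the bandwidth result, and of SR-IOV to native latency for the latency result; the slides do not spell out the definition, so treat that as an assumption. The function names are illustrative.

```python
def overhead_percent(slowdown):
    """Extra time (or latency) relative to native, in percent."""
    return (slowdown - 1.0) * 100.0


def retained_fraction(slowdown):
    """Fraction of native throughput that remains after the slowdown."""
    return 1.0 / slowdown


LATENCY_SLOWDOWN = 1.32    # worst case, small messages (from the slide)
BANDWIDTH_SLOWDOWN = 1.21  # worst case, medium-sized messages (from the slide)

print(f"worst-case latency overhead: {overhead_percent(LATENCY_SLOWDOWN):.0f}%")
print(f"worst-case bandwidth retained: {retained_fraction(BANDWIDTH_SLOWDOWN):.1%} of native")
```

Under that reading, the worst-case latency is about a third higher than native, while the worst-case bandwidth is still above 80% of native.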

37 WRF Weather Modeling. 96-core (4-node) calculation. Nearest-neighbor communication. Test Case: 3hr Forecast, 2.5km resolution of Continental US (CONUS). Scalable algorithms. 2% slower w/ SR-IOV vs native IB.

38 PSDNS, 1024x1024x1024: Strong scaling case. 32 (2-node), 64 (4-node), and 128 (8-node) core tests. Computational core based on FFTs. Communication intensive, mainly alltoallv; bisection bandwidth limited. Cores (Nodes) Time/Step 32 (2) (4) (8) 33.99
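For a strong-scaling test like this one, speedup and parallel efficiency are usually computed relative to the smallest run. The sketch below shows that calculation. The timings in it are hypothetical placeholders chosen only to exercise the function; they are not the PSDNS measurements, most of which did not survive in this transcript.

```python
def strong_scaling(cores, times):
    """Return (cores, speedup, efficiency) tuples relative to the first entry.

    speedup    = baseline time / time
    efficiency = speedup / (cores / baseline cores)
    """
    base_cores, base_time = cores[0], times[0]
    rows = []
    for c, t in zip(cores, times):
        speedup = base_time / t
        efficiency = speedup / (c / base_cores)
        rows.append((c, speedup, efficiency))
    return rows


# Core counts from the slide; the times are HYPOTHETICAL placeholders,
# not measured PSDNS results.
cores = [32, 64, 128]
placeholder_times = [100.0, 55.0, 30.0]

rows = strong_scaling(cores, placeholder_times)
for c, speedup, efficiency in rows:
    print(f"{c:4d} cores: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```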

39 Quantum ESPRESSO. 48-core (3 node) calculation. CG matrix inversion: irregular communication. 3D FFT matrix transposes (all-to-all communication). Test Case: DEISA AUSURF 112 benchmark. 8% slower w/ SR-IOV vs native IB.

40 RAxML: Code for Maximum Likelihood-based inference of large phylogenetic trees. Widely used, including by CIPRES gateway. 48-core (2 node) calculation. Hybrid MPI/Pthreads code. 12 MPI tasks, 4 threads per task. Compilers: gcc + mvapich2 v2.2, AVX options. Test Case: Comprehensive analysis, 218 taxa, 2,294 characters, 1,846 patterns, 100 bootstraps specified. 19% slower w/ SR-IOV vs native IB.

41 MrBayes: Software for Bayesian inference of phylogeny. Widely used, including by CIPRES gateway. 32-core (2 node) calculation. Hybrid MPI/OpenMP code. 8 MPI tasks, 4 OpenMP threads per task. Compilers: gcc + mvapich2 v2.2, AVX options. Test Case: 218 taxa, 10,000 generations. 3% slower with SR-IOV vs native IB.
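The hybrid layouts quoted for RAxML and MrBayes can be checked against the 24 cores available on a standard Comet node (two 12-core CPUs). The sketch below assumes MPI tasks are spread evenly across nodes, which the slides do not state; the function name is illustrative.

```python
CORES_PER_NODE = 24  # two 12-core CPUs per standard compute node


def hybrid_layout(mpi_tasks, threads_per_task, nodes):
    """Return (total cores used, cores used per node) for an even task spread."""
    tasks_per_node, leftover = divmod(mpi_tasks, nodes)
    if leftover:
        raise ValueError("tasks do not divide evenly across nodes")
    used_per_node = tasks_per_node * threads_per_task
    if used_per_node > CORES_PER_NODE:
        raise ValueError("layout oversubscribes the node")
    return mpi_tasks * threads_per_task, used_per_node


raxml = hybrid_layout(mpi_tasks=12, threads_per_task=4, nodes=2)
mrbayes = hybrid_layout(mpi_tasks=8, threads_per_task=4, nodes=2)

print(f"RAxML:   {raxml[0]} cores total, {raxml[1]} of {CORES_PER_NODE} per node")
print(f"MrBayes: {mrbayes[0]} cores total, {mrbayes[1]} of {CORES_PER_NODE} per node")
```

Under that assumption the RAxML run fills both nodes (48 cores), while the MrBayes run uses 16 of the 24 cores on each node, consistent with the 48-core and 32-core labels on the slides.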

42 Summary. Comet uses the flexibility and performance of the IB network plus bridging to: provide multi-machine mounted parallel filesystems; enhance GPU applications using GPUDirect RDMA; enhance performance of data analytics tools with RDMA-enabled frameworks; and enable virtualized HPC clusters at scale. OSU benchmarks and HOOMD-Blue applications show good performance using MVAPICH2-GDR. Results for OSU-Caffe with the CIFAR10 benchmark show good scaling. Future tests with larger test cases are planned to evaluate the performance advantages of using the SSDs. Application benchmarks show good performance on the virtualized cluster. The PSDNS example, which is communication intensive, shows good strong scaling for a test example.
