HPC Innovation Lab Update. Dell EMC HPC Community Meeting 3/28/2017


Dell EMC HPC Innovation Lab charter
Design, develop and integrate HPC systems
- Flexible reference architectures
- Systems tuned for research computing, manufacturing, life sciences, oil and gas, etc.
Act as the focal point for joint R&D activities
- Technology collaboration with partners for joint innovation
- Research coordination with DSC, COEs and customers
- New investment: more SMEs, a huge innovation eco-system
HPC Innovation Lab
- Technical briefings, tours, remote access
- Conduct application performance studies and develop best practices: white papers, blogs, presentations (www.hpcatdell.com)
- Prototype and evaluate advanced technologies: HPC+Cloud, HPC+Big Data; processors, accelerators, file systems, software, etc.

Focus areas
- HPC software stack: Bright Cluster Manager, OpenHPC; integration of all software components
- Compute performance and tuning: application focus (BIOS, memory, interconnect); accelerators and co-processors
- Interconnect performance and tuning
- Storage solutions: NSS, IEEL
- Vertical solutions: genomics research, CFD/manufacturing
- Proof-of-concept studies: OpenStack for HPC, Hadoop on Lustre, etc.
Collateral at: https://esg.one.dell.com/sites/solutions/esc/hpc/whiteblogs/sitepages/home.aspx

World-class infrastructure
13K sq. ft. facility with 1300+ servers and ~10PB of storage dedicated to HPC research, development and innovation in collaboration with the Dell HPC community.
Zenith
- Top500 system based on the Intel Scalable Systems Framework (OPA, KNL, Xeon, OpenHPC)
- 384 nodes with dual E5-2697 v4 processors, a non-blocking OPA fabric and 451 TFlops sustained performance (#372 on the Top500)
- Will grow to 512 nodes
Rattler
- Research/development system in collaboration with Mellanox and NVIDIA
- 84 nodes with EDR InfiniBand and E5-2697 v4 processors
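As a rough sanity check on the Zenith figures, the sustained HPL number can be compared against an estimated peak. The sketch below is a back-of-the-envelope Python calculation; the 2.3 GHz clock is an assumption (the base frequency of the E5-2697 v4; the sustained AVX frequency, which is lower, is not given on the slide).

```python
# Back-of-the-envelope Rpeak estimate for Zenith (dual E5-2697 v4 nodes).
CORES_PER_SOCKET = 18     # E5-2697 v4
SOCKETS_PER_NODE = 2
CLOCK_GHZ = 2.3           # assumed base clock; AVX clocks run lower
FLOPS_PER_CYCLE = 16      # AVX2: 2 FMA units x 4 doubles x 2 flops each
NODES = 384

rpeak_node = CORES_PER_SOCKET * SOCKETS_PER_NODE * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000
rpeak_system = rpeak_node * NODES
print(f"Rpeak per node:   {rpeak_node:.2f} TFLOPS")         # ~1.32
print(f"Rpeak, 384 nodes: {rpeak_system:.0f} TFLOPS")       # ~509
print(f"Implied HPL efficiency: {451 / rpeak_system:.0%}")  # ~89%
```

Under these assumptions the 451 TFlops sustained figure implies roughly 89% HPL efficiency, plausible for a non-blocking fabric.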

C6320P

PowerEdge C6320p: delivering balanced high performance computing
- Intel Xeon Phi processor: up to 72 out-of-order cores, energy efficient
- Embedded Omni-Path and InfiniBand fabric options: a choice of low-latency I/O for applications with the most demanding I/O requirements
- 6 DIMMs of memory (384GB max.): local memory enables easy scaling across scale-out computing infrastructures
- 6 internal drives (12TB max.): local storage capacity permits faster access to data for better performance and faster results
- 1 PCIe Gen3 x16 (low profile) / 1 x4 mezzanine: permits a flexible range of usage

WRF: BDW vs. KNL in different memory modes
[Chart: average time step (lower is better) for the Intel conus12k and public conus12k datasets on dual-socket Broadwell (2697 v4, 2x18c) and on KNL (60C2T) in Quad and All2All cluster modes with MCDRAM in flat and cache modes, with relative performance per configuration]
- KNL is 54% better than 2S BDW for the new dataset, and 34% better on the current conus 12km benchmark.
- BDW to KNL: 3.3x more cores in use; without hyper-threading, 67% more cores in use.
- Quad cluster mode with MCDRAM in flat mode is the best performer; Quad with cache mode is within 1%, since the dataset fits in MCDRAM.
- All2All is within 4% of Quad.
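The core-count claims follow from simple arithmetic; a minimal check in Python, using the 60-core, 2-threads-per-core KNL configuration and the 36 Broadwell cores per node stated on the slide:

```python
# Verify the BDW-to-KNL core-ratio claims from the WRF slide.
bdw_cores = 2 * 18            # dual E5-2697 v4, 18 cores per socket
knl_cores = 60                # KNL run in 60C2T mode
knl_threads = knl_cores * 2   # 2 hardware threads per core

print(f"With HT:    {knl_threads / bdw_cores:.1f}x more cores in use")   # ~3.3x
print(f"Without HT: {knl_cores / bdw_cores - 1:.0%} more cores in use")  # ~67%
```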

Storage with IEEL

Dell Storage for HPC with Intel EE for Lustre solution
Turn-key solution designed for high-speed scratch storage.
Solution benefits and Dell differentiation:
- Parallel scalable file system based on Intel EE for Lustre software, with a single file-system namespace scalable to high capacities and performance
- Best practices developed by Dell HPC Engineering provide optimal performance on Dell hardware; tests yield peaks of roughly 15GB/s write and 17GB/s read per building block
- Lustre Distributed Namespace (DNE) allows Lustre sub-directories to be distributed across multiple MDTs, increasing metadata capacity and performance
- Solution design for Big Data workloads using the Intel Hadoop Adapter for Lustre (HAL)
- Data sharing with other file systems through an optional NFS/CIFS gateway
- Dell Networking 10/40GbE, InfiniBand or Omni-Path
Reference hardware: Intel Manager for Lustre on a PowerEdge R630; an active/passive MDS pair of PowerEdge R730s attached to a PowerVault MD3420 (with an optional second MD3420 for DNE); an active/active OSS pair of PowerEdge R730s attached to a PowerVault MD3460; all with 12Gbps SAS failover connections.
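With DNE phase 1, metadata scales by placing different sub-directories on different MDTs, which on a Lustre client is done with `lfs mkdir -i <mdt_index>`. Below is a minimal sketch of spreading scratch directories round-robin across MDTs; the paths, user names and MDT count are hypothetical.

```python
# Distribute top-level scratch directories across Lustre MDTs (DNE phase 1).
# Assumes a Lustre client with the 'lfs' utility; paths and the MDT count
# are illustrative, not taken from the slides.
import subprocess

MDT_COUNT = 2                 # e.g. one MDT per MD3420 enclosure
users = ["alice", "bob", "carol", "dave"]

for i, user in enumerate(users):
    mdt = i % MDT_COUNT       # round-robin placement across MDTs
    path = f"/lustre/scratch/{user}"
    # 'lfs mkdir -i N DIR' creates DIR with its metadata on MDT index N.
    subprocess.run(["lfs", "mkdir", "-i", str(mdt), path], check=True)
    print(f"{path} -> MDT{mdt}")
```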

IEEL 3.0 + OPA

ML/DL

HPL performance on P100-PCIe
[Chart: HPL performance scaling on P100-PCIe, TFLOPS and efficiency (%) per configuration]

Configuration          TFLOPS   Efficiency
CPU (2x E5-2690 v4)      1.1       93%
1 P100                   3.9       82%
2 P100                   7.9       86%
4 P100                  15.5       84%
8 P100                  29.4       81%
12 P100                 41.8       81%
16 P100                 57.8       85%

- HPL runs in double precision.
- 1 P100 node = 14.1 CPU nodes (2x E5-2690 v4).
- Scales very well across nodes: 16 P100s across 4 nodes = 14.9x a single P100 card.
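The headline ratios on this slide follow from the measured TFLOPS; a small Python check (the 4-GPU-per-node grouping is taken from the slide, and small differences from the quoted 14.9x and 14.1 figures come from rounding in the charted values):

```python
# Derive the headline ratios from the measured HPL numbers on this slide.
tflops = {"cpu_node": 1.1, "p100_x1": 3.9, "p100_x16": 57.8}

# 16 P100s spread across 4 nodes vs. a single P100 card:
print(f"16 P100 vs 1 P100: {tflops['p100_x16'] / tflops['p100_x1']:.1f}x")  # ~14.8x

# One 4-GPU P100 node vs. CPU-only nodes (assumes near-linear in-node scaling):
p100_node = 4 * tflops["p100_x1"]
print(f"1 P100 node ~= {p100_node / tflops['cpu_node']:.1f} CPU nodes")     # ~14.2
```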

NV-Caffe training on a single P100-PCIe node
[Chart: training speed of GoogLeNet in NV-Caffe, images/sec (higher is better), with speedup relative to 1 P100]

Configuration        Images/sec   Speedup vs. 1 P100
2x E5-2690 v4 (CPU)      89             --
1 P100                  476            1.0x
2 P100                  905            1.9x
4 P100                 1782            3.7x

- Dataset: ImageNet 2012 (ILSVRC2012) with 1.2M training images, 50K validation images and 1000 categories; tested with the GoogLeNet model.
- 1 P100 node = 20 CPU nodes.
- 4 P100 = 3.75x 1 P100; scales very well with multiple GPUs.
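Both headline claims follow directly from the images/sec figures; a minimal check:

```python
# Derive the NV-Caffe speedup claims from the measured images/sec.
imgs_per_sec = {"cpu_node": 89, "p100_x1": 476, "p100_x4": 1782}

print(f"4 P100 vs 1 P100: "
      f"{imgs_per_sec['p100_x4'] / imgs_per_sec['p100_x1']:.2f}x")    # ~3.74x
print(f"4-GPU P100 node vs CPU node: "
      f"{imgs_per_sec['p100_x4'] / imgs_per_sec['cpu_node']:.1f}x")   # ~20x
```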

HPC System for Manufacturing

Dell EMC HPC System for Manufacturing
- ISV applications
- Dell ProSupport, ProSupport Plus and deployment services
- Bright Cluster Manager
Key takeaways:
- A comprehensive offering that includes compute, storage, networking, unified management, monitoring and services
- Choice and flexibility at every level: HPC storage offerings, HPC networking offerings
- Building blocks for explicit solvers, implicit solvers, remote visualization and management

CD-adapco STAR-CCM+: Explicit BB scaling (1/2)
[Two charts: performance relative to 32 cores (1 node) at 32 (1), 64 (2), 128 (4), 192 (6) and 256 (8) cores (nodes), for the Civil_Trim_20M, HlMach10Sou, KcsWithPhysics, LeMans_Poly_17M, EglinStoreSeparation, LeMans_100M, Reactor_9M, TurboCharger and VtmUhoodFanHeatx68m datasets]
- Scaling for all datasets is as expected; scaling for most datasets is very good, with linear scaling up to 8 nodes.
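Relative-performance curves like these are typically computed as the ratio of single-node elapsed time to N-node elapsed time; a short sketch with hypothetical timings showing how the relative performance and parallel efficiency for one dataset would be derived:

```python
# Relative performance and parallel efficiency from solver elapsed times.
# The timings below are hypothetical, for illustration only.
node_times = {1: 1000.0, 2: 510.0, 4: 262.0, 6: 180.0, 8: 140.0}  # seconds

base = node_times[1]
for nodes, t in sorted(node_times.items()):
    rel = base / t          # performance relative to 32 cores (1 node)
    eff = rel / nodes       # efficiency vs. ideal linear scaling
    print(f"{nodes} node(s) ({32 * nodes:3d} cores): {rel:.2f}x, efficiency {eff:.0%}")
```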

HPC System for Life Sciences

Turn-key solutions designed for genomic computing

Cryo-EM ROME SML: Xeon vs. KNL over OPA
[Three charts: compute time (lower is better) vs. number of servers (1, 2, 4, 8, 10, 12, 16) for the DATA8, DATA6 and RING11_ALL datasets, comparing BDW and KNL over OPA; KNL-over-BDW performance ratios are ~2.9-3.1 for DATA8, ~2.6-2.9 for DATA6 and ~3.3-3.4 for RING11_ALL]
- Xeon: 2697 v4, 18-core CPU (36 cores per server).
- KNL 7230 is ~3x better than Xeon for all three datasets.
- Both architectures scale well, but KNL starts off better and stays better.

Dell EMC Isilon
Isilon X410: the results came from a 3-node configuration (clusters scale from 3 to 144 nodes, up to 20.7 PB of capacity).
SmartConnect: maximizes performance by keeping client connections balanced across the entire storage cluster.

Tying it together - Access to the lab - White papers and Blogs

How to engage the HPC Innovation Lab
1) Work with your Dell account team.
2) Submit a request using the tool below, including as much detail as possible: https://esg.one.dell.com/sites/solutions/esc/hpc/request/_layouts/15/start.aspx
   - Complete the Dell HPC Innovation Lab Evaluation Program Agreement.
   - Indicate whether the customer/SC will complete the benchmarking remotely or an HPC Engineering team member is being requested to assist.
3) Expect a response within 2 days on availability and scheduling.
4) The HPC Innovation Lab is located on the Dell Parmer Campus in Austin, Texas.
Resources:
https://esg.one.dell.com/sites/solutions/esc/hpc/whiteblogs/sitepages/home.aspx
http://www.dell.com/hpc
http://www.hpcatdell.com

Team publications
Blogs: www.hpcatdell.com
White papers: www.dell.com