
1 HPC Systems Overview. SDSC Summer Institute, August 6-10, 2012, San Diego, CA. Shawn Strande, Gordon Project Manager, San Diego Supercomputer Center.

2 Trestles: High Productivity System
- Targeted at modest-scale jobs and science gateways
- Appro-integrated 324-node AMD Magny-Cours cluster
- 120 GB of Intel flash (SSD) in each node, used for the OS and available as user scratch space
- QDR InfiniBand fat-tree interconnect
- In production operation since December 2011
- Funded by the NSF and available through the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program

3 Trestles Design Objectives
- Target modest-scale users: limit job size to 1,024 cores (32 nodes)
- Serve a large number of users: cap allocations at 1.5M core-hours per project per year (~2.5% of the annual total); gateways are exempt from this cap because they represent a large number of users
- Maintain fast turnaround time: the primary system metric is expansion factor rather than traditional system utilization (see the sketch below); allocate ~70% of the theoretically available core-hours (revised as needed); reserve nodes for interactive and short jobs
- Respond to users' requirements: robust software suite; unique capabilities such as pre-emptive on-demand access and user-settable reservations; long-running job queues (48 hours standard, up to 2 weeks allowed)
- Bring in new users and communities: welcome small jobs and allocations, start-up requests up to 50,000 SUs, gateway-friendly
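
The expansion factor referenced above measures turnaround rather than utilization. A minimal sketch of the per-job calculation, assuming the common definition (queue wait plus run time, divided by run time); the exact aggregation SDSC uses is not given on the slide:

```python
# Expansion factor for a single job: 1.0 means the job started immediately,
# larger values mean proportionally slower turnaround.
def expansion_factor(wait_hours: float, run_hours: float) -> float:
    return (wait_hours + run_hours) / run_hours

# Example: a 4-hour job that waited 1 hour in the queue.
print(expansion_factor(wait_hours=1.0, run_hours=4.0))  # 1.25
```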

4 Trestles Architecture (diagram). Data movers (4x) connect to the XSEDE & R&E networks and the SDSC network. A redundant front end provides round-robin login nodes (2x), mirrored NFS servers (4x), and management nodes (2x). The 324 compute nodes attach to a QDR InfiniBand switch; 12x IB/Ethernet bridge switches and 2x Arista GbE switches (MLAG) carry management and public traffic. Storage is the Data Oasis Lustre PFS (4 PB), shared with the Gordon cluster. Link types shown: GbE, 10GbE, and QDR 40 Gb/s.

5 Trestles compute node (photo): AMD Magny-Cours processors, 120 GB flash (SSD), QDR InfiniBand.

6 Trestles Speeds and Feeds
AMD Magny-Cours compute node:
- Sockets: 4
- Cores: 32
- Clock speed: 2.4 GHz
- Flop speed: 307 Gflop/s
- Memory capacity: 64 GB
- Memory bandwidth: 171 GB/s
- STREAM Triad bandwidth: 100 GB/s
- Flash memory (SSD): 120 GB
Full system:
- Total compute nodes: 324
- Total compute cores: 10,368
- Peak performance: 100 Tflop/s
- Total memory: 20.7 TB
- Total memory bandwidth: 55.4 TB/s
- Total flash memory: 39 TB
QDR InfiniBand interconnect:
- Topology: fat tree
- Link bandwidth: 8 GB/s (bidirectional)
- Peak bisection bandwidth: 5.2 TB/s (bidirectional)
- MPI latency: 1.3 µs
Disk I/O subsystems:
- File systems: NFS, Lustre
- Storage capacity (usable): 400 TB scratch, 400 TB project
- I/O bandwidth: 12 GB/s
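
Most of the system-level numbers follow from the per-node figures. A quick consistency check (a sketch; the 4 flops/cycle/core figure for Magny-Cours is an assumption, since the slide gives only the 307 Gflop/s result):

```python
# Back-of-the-envelope check of the Trestles speeds-and-feeds table.
cores_per_node = 32
clock_ghz = 2.4
flops_per_cycle = 4            # assumed for AMD Magny-Cours
nodes = 324
mem_per_node_gb = 64
flash_per_node_gb = 120

node_gflops = cores_per_node * clock_ghz * flops_per_cycle    # ~307 Gflop/s
system_tflops = node_gflops * nodes / 1000                    # ~100 Tflop/s
total_mem_tb = nodes * mem_per_node_gb / 1000                 # ~20.7 TB
total_flash_tb = nodes * flash_per_node_gb / 1000             # ~39 TB

print(f"{node_gflops:.0f} Gflop/s per node, {system_tflops:.0f} Tflop/s system, "
      f"{total_mem_tb:.1f} TB DRAM, {total_flash_tb:.0f} TB flash")
```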

7 Gordon: An Innovative Data-Intensive Supercomputer
- Designed to accelerate access to massive amounts of data in genomics, earth science, engineering, medicine, and other areas
- Emphasizes memory and I/O over FLOPS
- Appro-integrated 1,024-node Sandy Bridge cluster
- 300 TB of high-performance Intel flash
- Large-memory supernodes via vSMP Foundation from ScaleMP
- 3D torus interconnect from Mellanox
- In production operation since February 2012
- Funded by the NSF and available through the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program

8 The Memory Hierarchy of a Typical Supercomputer (diagram): shared-memory programming within a single node, message-passing programming across nodes, then a latency gap before disk I/O, which is where big data sits.

9 The Memory Hierarchy of Gordon (diagram): shared-memory programming extended across nodes with vSMP, and flash filling the latency gap between memory and the disk I/O that holds big data.

10 Gordon Network Architecture (diagram). Data movers (4x) connect to the XSEDE & R&E networks and the SDSC network. A redundant front end provides round-robin login nodes (4x), mirrored NFS servers (4x), and management nodes (2x), with management and public edge/core Ethernet. The 1,024 compute nodes and 64 I/O nodes sit on a dual-rail QDR InfiniBand 3D torus (rail 1 and rail 2); the I/O nodes uplink over dual 10GbE through Arista 10 GbE switches to the Data Oasis Lustre PFS (4 PB). Link types shown: GbE, 2x10GbE, 10GbE, and QDR 40 Gb/s.

11 Gordon Intel Sandy Bridge compute node (photo).

12 Gordon Design Highlights
- 1,024 dual-socket Xeon E5 (Sandy Bridge) compute nodes: 16 cores and 64 GB per node, Intel Jefferson Pass motherboard, PCIe Gen3
- 3D torus interconnect, dual-rail QDR
- 300 GB Intel 710 eMLC SSDs, 300 TB aggregate
- 64 dual-socket Westmere I/O nodes: 12 cores and 48 GB per node, 4 LSI controllers, 16 SSDs, dual 10GbE, SuperMicro motherboard, PCIe Gen2
- Large-memory vSMP supernodes: 2 TB DRAM, 10 TB flash
- Data Oasis Lustre PFS: 100 GB/s, 4 PB
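
A quick check that the aggregate numbers above are consistent with the per-node figures (a sketch; the unit conventions are assumptions, with DRAM treated as binary and SSD capacity as decimal):

```python
# Aggregate checks for the Gordon design highlights.
compute_nodes, cores_per_node, dram_per_node_gb = 1024, 16, 64
io_nodes, ssds_per_io_node, ssd_gb = 64, 16, 300

total_cores = compute_nodes * cores_per_node                    # 16,384
total_dram_tb = compute_nodes * dram_per_node_gb / 1024         # 64 TB (binary units)
total_flash_tb = io_nodes * ssds_per_io_node * ssd_gb / 1000    # ~307 TB, quoted as "300 TB aggregate"
flash_per_io_node_tb = ssds_per_io_node * ssd_gb / 1000         # 4.8 TB per I/O node

print(total_cores, total_dram_tb, round(total_flash_tb), flash_per_io_node_tb)
```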

13 Gordon 32-way Supernode (diagram): 32 compute nodes aggregated with vSMP software, plus two I/O nodes, each a dual-socket Westmere I/O processor with 4.8 TB of flash.

14 Supernode Network Design Detail (diagram). The I/O nodes serve the flash and are the gateway to Data Oasis (the Lustre file system).

15 Gordon 3D Torus Interconnect Fabric (diagram): a 4x4x4 mesh whose ends are folded in all three dimensions to form a 3D torus. The network is dual-rail for increased bandwidth and redundancy, with a single connection from each node to each rail. Each torus vertex is a pair of 36-port fabric switches serving 16 compute nodes and 2 I/O nodes, with 18 x 4X IB network connections per switch.

16 Why a 3D Torus Interconnect?
- Lower cost: 40% as many switches and 25% to 50% fewer cables compared to a fat tree
- Works well for localized communication (a small hop-count sketch follows below)
- Linearly expandable
- Simple wiring pattern; short cables, so fiber-optic cables are generally not required
- Fault tolerant within the mesh with 2QoS alternate routing
- Fault tolerant with dual rails for all routing algorithms
- Based on the OFED IB stack
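
To make the locality point concrete, here is a minimal sketch (not SDSC code) of shortest-path hop counts between switch coordinates on a 4x4x4 torus, where the wraparound links keep even the worst case small:

```python
# Hop distance between two switch coordinates on a 3D torus with wraparound.
def torus_hops(a, b, dims=(4, 4, 4)):
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

print(torus_hops((0, 0, 0), (1, 0, 0)))  # 1 hop: nearest neighbor
print(torus_hops((0, 0, 0), (3, 0, 0)))  # 1 hop: wraparound link
print(torus_hops((0, 0, 0), (2, 2, 2)))  # 6 hops: worst case on a 4x4x4 torus
```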

17 Gordon System Layout (same basic design for Trestles)
- 16 racks, 1 service rack, and 4 I/O racks (all racks 48U)
- Hot-aisle containment
- 500 kW power
- Earthquake isobases

18 Gordon Speeds and Feeds (system specification)
Intel Sandy Bridge compute node:
- Sockets / cores: 2 / 16
- Clock speed: 2.6 GHz
- DRAM capacity: 64 GB
- Flash memory (SSD): 80 GB
- Peak performance: 333 Gflop/s
- Memory bandwidth: 85 GB/s
Intel flash I/O node:
- SSD capacity per drive / per node / total: 300 GB / 4.8 TB / 300 TB
- Flash bandwidth per drive (read/write): 270 MB/s / 210 MB/s
- IOPS (read/write): 38,000 / 2,300
vSMP supernode:
- Compute nodes: 16 or 32
- I/O nodes: 1 or 2
- Addressable DRAM: 1 or 2 TB
- Flash: 4.8 or 9.6 TB
Full system:
- Nodes / cores: 1,024 / 16,384
- Peak performance: 341 Tflop/s
- Aggregate memory: 64 TB
InfiniBand interconnect:
- Type: dual-rail QDR InfiniBand
- Aggregate torus bandwidth: 9.2 TB/s
- Link bandwidth: 8 GB/s (bidirectional)
- Latency (min-max): 1.25 µs - 2.5 µs
Data Oasis Lustre file system:
- Total storage: 4.5 PB (raw)
- I/O bandwidth: 100 GB/s

19 Software Stack: Consistent Across Systems for Ease of Use and Support
- Cluster management: Rocks
- Operating system: CentOS 5.6, modified for AVX
- InfiniBand: OFED, Mellanox subnet manager
- MPI: MVAPICH2 (native), MPICH2 (vSMP)
- Shared memory (Gordon): vSMP Foundation v4
- Flash (Gordon): iSCSI over RDMA (iSER), target daemon (tgtd), XFS, OCFS, et al.
- Parallel file system: Lustre
- Job scheduling: Torque (PBS), Catalina; local enhancements for topology-aware scheduling (Gordon)
- User environment: Modules; Rocks Rolls

20 Gordon Measured Performance
System:
- LINPACK: 286 Tflop/s (84% of peak)
- Node LINPACK: 290 Gflop/s (85% of peak)
- Memory bandwidth: 67 GB/s (79% of peak)
I/O node:
- 1 SSD, sequential performance when mounted on a compute node using XFS: 270/210 MB/s read/write (100% of spec)
- 16 SSDs, sequential remote (read/write) with iSER and no file system (using 2 rails): 4.5/4.4 GB/s (exceeds spec)
- 16 SSDs, IOPS local (read/write): 600K/120K (exceeds spec)
- 16 SSDs, IOPS remote (read/write): 160K/32K
Interconnect:
- Link bandwidth: rail 0: 3.9 GB/s; rail 1: 3.4 GB/s
- Latency (traversing the maximum number of switch hops): rail 0: 1.44 µs; rail 1: 2.14 µs
Data Oasis:
- From a single LNET router (I/O node): 1.6 GB/s write, 5 GB/s read
- From 32 LNET routers to 32 Lustre OSSs (half of the total system): 50 GB/s
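
The percentages of peak quoted above are simply the measured values divided by the corresponding peaks from the specification table, for example:

```python
# Efficiency figures: measured / peak, using values quoted on the slides.
system_linpack_tf, system_peak_tf = 286, 341
stream_gb_s, peak_mem_bw_gb_s = 67, 85

print(f"System LINPACK: {system_linpack_tf / system_peak_tf:.0%} of peak")  # ~84%
print(f"Memory bandwidth: {stream_gb_s / peak_mem_bw_gb_s:.0%} of peak")    # ~79%
```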

21 SSDs Are a Good Fit for Data-Intensive Computing
Flash SSD vs. typical HDD:
- Latency: < 0.1 ms vs. 10 ms
- Bandwidth (read/write): 270/210 MB/s for the SSD
- IOPS (read/write): 38,500 read for the SSD
- Power consumption during reads/writes: 2-5 W vs. 6-10 W
- Price/GB: $3/GB vs. $0.50/GB
- Endurance: 2-10 PB vs. N/A
- Total cost of ownership: the jury is still out.

22 Flash: Option 1 - One SSD per Compute Node
- One 300 GB flash drive is exported to each compute node and appears as a local file system; the Lustre parallel file system is mounted identically on all nodes.
- Data is purged at the end of the run.
- Use cases: applications that need local, temporary scratch (Gaussian, Abaqus).
- Logical view: the file system appears as /scratch/$user/$pbs_jobid.
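
A minimal sketch of how a job might use this node-local flash scratch space (hypothetical; it assumes the scheduler exports USER and PBS_JOBID and that the directory layout matches the slide):

```python
# Write temporary job data to the flash-backed scratch directory instead of Lustre.
import os

user = os.environ.get("USER", "unknown")
jobid = os.environ.get("PBS_JOBID", "interactive")
scratch = f"/scratch/{user}/{jobid}"

os.makedirs(scratch, exist_ok=True)     # create it if it does not already exist
with open(os.path.join(scratch, "work.tmp"), "w") as f:
    f.write("intermediate results live on node-local flash\n")
print("scratch directory:", scratch)
```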

23 Flash: Option 2 - 16 SSDs for One Compute Node
- 16 SSDs in a RAID 0 appear as a single 4.8 TB file system to the compute node.
- Flash I/O and Lustre traffic use rail 1 of the torus.
- Use cases: databases, data mining, Gaussian, Abaqus.
- Logical view: the file system appears as /scratch/$user/$pbs_jobid.

24 Flash: Option 3 - 16 SSDs within a vSMP Supernode
- 4.8 TB of flash presented as a single XFS file system to a 16-node virtual image (1 TB of memory); it appears as /scratch1/$user/$pbs_jobid (/scratch2 is available if using a 32-node supernode).
- Flash I/O uses both rail 0 and rail 1.
- Data is purged at the end of a run.
- Use cases: serial and threaded applications that need large memory and local disk, e.g. Abaqus and genomics codes (Velvet, ALLPATHS, etc.).

25 Flash: Option 4 - Dedicated I/O Node (a new allocations model for data-intensive computing)
- A 4.8 TB I/O node is allocated for up to one year, and users run applications directly on the I/O node.
- Users may request dedicated compute nodes or access them through the scheduler.
- Data is persistent.
- Use cases: databases, data mining and analytics.

26 Flash: Option 5 - 16 SSDs Shared by 16 Compute Nodes (we are testing various options; please talk with us)
- 16 SSDs in a RAID 0 appear as a single 4.8 TB file system to all compute nodes.
- Use cases: MPI applications (a minimal MPI sketch follows below).
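
A hypothetical sketch of an MPI job using such a shared flash file system, with each rank writing its own file into the same scratch directory (assumes mpi4py is available and that the path layout matches the slides):

```python
# Each MPI rank writes its own output file to the shared flash-backed scratch space.
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

scratch = f"/scratch/{os.environ.get('USER', 'unknown')}/{os.environ.get('PBS_JOBID', 'test')}"
os.makedirs(scratch, exist_ok=True)

with open(os.path.join(scratch, f"rank{rank:04d}.out"), "w") as f:
    f.write(f"rank {rank} of {comm.Get_size()} wrote to shared flash\n")

comm.Barrier()
if rank == 0:
    print("all ranks finished writing to", scratch)
```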

27 Which I/O System Is Right for My Application?
Performance:
- Flash-based I/O nodes: SSDs support low-latency I/O, high IOPS, and high bandwidth; one SSD can deliver 37K IOPS. Flash resources are dedicated to the user, so performance is largely independent of what other users are doing on the system.
- Lustre (Data Oasis): Lustre is ubiquitous in HPC. It does well for sequential I/O and for workloads that do I/O to a few files from many cores simultaneously; random I/O is a Lustre killer. Lustre is a shared resource, so performance will vary depending on what other users are doing. (A small benchmark sketch contrasting the two access patterns follows below.)
Infrastructure:
- Flash: SSDs are deployed in I/O nodes using iSER, an RDMA protocol, and are accessed over the InfiniBand network.
- Lustre: 64 OSSs with distinct file systems and metadata servers, built on HDDs and accessed over a 10GbE network via the I/O nodes.
Persistence:
- Flash: data is generally removed at the end of a run so the resource can be made available to the next job.
- Lustre: most is deployed as scratch and is purgeable by policy (not necessarily at the end of the job); some is deployed as a persistent project storage resource.
Capacity:
- Flash: up to 4.8 TB per user, depending on configuration.
- Lustre: no specific limits or quotas imposed on scratch; the file system is ~2 PB.
Use cases:
- Flash: local application scratch (Abaqus, Gaussian); a data-mining platform (e.g., Hadoop); graph problems.
- Lustre: traditional HPC I/O associated with MPI applications; pre-staging of data that will be pulled into flash.
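
A rough micro-benchmark sketch (not SDSC code) that contrasts the two access patterns discussed above; point it at a test file on flash scratch or on Lustre to compare. The block sizes are arbitrary choices, and results will include page-cache effects unless the file is much larger than memory:

```python
# Compare streaming reads (Lustre-friendly) with small random reads (flash-friendly).
import os, random, time

def sequential_read_mb_s(path, block=1 << 20):
    start = time.time()
    with open(path, "rb") as f:
        while f.read(block):
            pass
    return os.path.getsize(path) / (time.time() - start) / 1e6

def random_read_ops_s(path, block=4096, count=10000):
    size = os.path.getsize(path)
    start = time.time()
    with open(path, "rb") as f:
        for _ in range(count):
            f.seek(random.randrange(0, max(1, size - block)))
            f.read(block)
    return count / (time.time() - start)

if __name__ == "__main__":
    target = "/scratch/testfile.bin"   # hypothetical path: supply your own large test file
    print("sequential:", round(sequential_read_mb_s(target)), "MB/s")
    print("random 4K reads:", round(random_read_ops_s(target)), "ops/s")
```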

28 Gordon/Trestles Comparison
Trestles:
- Compute nodes: 324-node AMD Magny-Cours cluster, integrated by Appro; 10,368 cores; ~20 TB DRAM; 100 Tflop/s
- Node: quad socket, 32 cores (8 per processor), 64 GB/node (2 GB/core), 2.4 GHz
- Flash memory: ~60 GB/node, MLC
- Interconnect: fat tree, single-rail QDR
- Target users: small and modest-size jobs; gateways; rapid turnaround; allocations < 1.5M SUs
Gordon:
- Compute nodes: 1,024-node Intel Xeon E5 cluster, integrated by Appro; 16,384 cores; 64 TB DRAM; 341 Tflop/s
- Node: dual socket, 16 cores (8 per processor), 64 GB/node (4 GB/core), 2.6 GHz
- Flash memory: ~300 GB/node, eMLC (4.8 TB aggregated per I/O node)
- Interconnect: 3D torus, dual-rail QDR
- Target users: memory- and I/O-intensive applications; modest-sized jobs; gateways; no specific limit on allocations

29 Data Oasis
- High-performance, high-capacity Lustre-based parallel file system
- A centerpiece of SDSC's data-intensive infrastructure
- 10GbE I/O backbone for all of SDSC's HPC systems, supporting multiple architectures
- Scalable, open-platform design, integrated by Aeon Computing
- 4 PB capacity and growing
- 100 GB/s demonstrated performance
- Configured as individual scratch and project (non-purged) file systems for each system

30 Data Oasis Heterogeneous Architecture (diagram). Three distinct network architectures feed the file system: the Trestles IB cluster through a Mellanox 5020 bridge (12 GB/s), the Gordon IB cluster through 64 Lustre LNET routers (100 GB/s), and the Triton Myrinet cluster through a Myrinet 10G switch (25 GB/s). Redundant Arista 10GbE core switches provide reliability and performance. Metadata servers serve Trestles scratch, Gordon scratch, the Gordon & Trestles project file system, and Triton scratch. 64 OSSs (object storage servers, 72 TB each) provide 100 GB/s of performance and more than 4 PB of raw capacity; JBODs (just a bunch of disks, 90 TB each) provide capacity scale-out to an additional 5.8 PB.

31 Data Oasis Performance Measured from Gordon

32 Applications

33 Breadth-First Search Comparison Using SSD and HDD
Graphs are mathematical and computational representations of relationships among objects in a network. Such networks occur in many natural and man-made settings, including communication, biological, and social contexts, and understanding their structure is important for uncovering relationships among the members.
- Implementation of the breadth-first search (BFS) graph algorithm developed by Munagala and Ranade
- 134 million nodes
- Flash drives reduced I/O time by a factor of 6.5
- The problem was converted from I/O bound to compute bound
Source: Sandeep Gupta, San Diego Supercomputer Center. Used by permission.
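
For reference, a generic in-memory BFS looks like the sketch below; the Munagala-Ranade algorithm used in this work restructures the same level-by-level traversal so that the frontier and edge lists stream from external storage (here, flash) rather than RAM:

```python
# Level-ordered breadth-first search over an adjacency-list graph.
from collections import deque

def bfs_levels(adj, source):
    """Return {vertex: BFS level} for every vertex reachable from source."""
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj.get(u, ()):
            if v not in level:
                level[v] = level[u] + 1
                frontier.append(v)
    return level

graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs_levels(graph, 0))   # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```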

34 Foxglove Calculation Using Gaussian 09 with vSMP: MP2 Energy Gradient Calculation
The foxglove plant (Digitalis) is studied for its medicinal uses. Digoxin, an extract of the foxglove, is used to treat a variety of conditions including diseases of the heart, and some recent research suggests it may also be a beneficial cancer treatment.
- Processor footprint: 4 nodes, 64 threads (1 node = 16 cores, 64 GB)
- Time to solution: 43,000 s
- Memory footprint: 10 nodes, 700 GB
Source: Jerry Greenberg, San Diego Supercomputer Center. January 2012.

35 Daphnia Genome Assembly Using Velvet and vSMP
Daphnia (the water flea) is a model species used for understanding mechanisms of inheritance and evolution, and as a surrogate species for studying human health responses to environmental changes. Photo: Dr. Jan Michels, Christian-Albrechts-University, Kiel.
- De novo assembly of short DNA reads using the de Bruijn graph algorithm
- Code parallelized using OpenMP directives
- Benchmark problem: Daphnia genome assembly from 44-bp and 75-bp reads using 35-mers
Source: Wayne Pfeiffer, San Diego Supercomputer Center. Used by permission.
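
To illustrate the data structure involved, here is a toy sketch (not Velvet code) of de Bruijn graph construction from reads: each read is split into overlapping k-mers, and edges connect k-mers that overlap by k-1 bases. At production scale (35-mers over hundreds of millions of reads) this graph is what consumes the large shared memory that vSMP provides:

```python
# Build a tiny de Bruijn graph: (k-1)-mer prefixes map to the suffixes that follow them.
from collections import defaultdict

def de_bruijn_edges(reads, k):
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

reads = ["ACGTAC", "CGTACG"]
for prefix, suffixes in sorted(de_bruijn_edges(reads, k=4).items()):
    print(prefix, "->", suffixes)
```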

36 Axial Compression of a Caudal Rat Vertebra Using Abaqus and vSMP
The goal of the simulations is to analyze how small variances in boundary conditions affect high-strain regions in the model. The research goal is to understand the response of trabecular bone to mechanical stimuli, which has relevance for paleontologists inferring the habitual locomotion of ancient people and animals, and for treatment strategies for populations with fragile bones such as the elderly.
- 5 million quadratic, 8-noded elements
- Model created with a custom Matlab application that converts 25 3 micro-CT images into voxel-based finite element models
Source: Matthew Goff, Chris Hernandez, Cornell University. Used by permission, 2012.

37 MrBayes Running on Trestles and Gordon through the CIPRES Gateway
- MrBayes is used extensively via the CIPRES Science Gateway to infer phylogenetic trees; the hybrid parallel version running at SDSC uses both MPI and OpenMP.
- CIPRES has allowed over 4,000 biologists worldwide to run parallel tree-inference codes via a simple-to-use web interface, and applications can be targeted to appropriate architectures.
- Gordon provides a significant speedup for unpartitioned data sets over the SDSC Trestles system.
- A model for future data-intensive projects.
Source: Wayne Pfeiffer, San Diego Supercomputer Center.

38 Conclusions
- Trestles meets the needs of a majority of users: small and modest job sizes, rapid turnaround.
- Gordon is a data-intensive system for users requiring large memory, or I/O characterized by many small transactions or random access patterns.
- Both Trestles and Gordon are gateway friendly.
- Data Oasis is a high-performance, high-capacity parallel file system that is available on both Trestles and Gordon. Project storage is available as an allocable resource, is cross-mounted on Trestles and Gordon, and an expansion of an additional 2 PB is planned.
- High-speed networks and data movers to SDSC systems facilitate large file transfers to and from home institutions.
- A wide range of applications are running on both systems and make use of their innovative features.

39 Thank you
