The NCAR Yellowstone Data Centric Computing Environment. Rory Kelly ScicomP Workshop May 2013


1 The NCAR Yellowstone Data Centric Computing Environment Rory Kelly ScicomP Workshop May 2013

2 Computers to Data Center: EVERYTHING IS NEW

3 NWSC Procurement
New facility: the NWSC (NCAR Wyoming Supercomputing Center)
New data center in Cheyenne, Wyoming
Complex procurement involving multiple resources
  Supercomputer
  Data analysis and visualization clusters
  GLADE shared file system

4 Computing Resources at the NCAR Wyoming Supercomputing Center
High-Performance Computing
  Yellowstone: IBM iDataPlex cluster, 1.5 PFLOPs
Data Analysis and Visualization
  Geyser: large-memory and visualization nodes
  Caldera: GPU computation and data analysis
  Pronghorn: Intel Xeon Phi compute cluster (May 2013)
Centralized Filesystems and Data Storage
  GLADE: GPFS file system, 10.9 PB initially, 16.4 PB in 1Q2014
  HPSS data archive: 2 StorageTek SL8500 tape libraries, 20k cartridge slots, >100 PB capacity with 5 TB cartridges (uncompressed)


5 Yellowstone: NWSC High-Performance Computing Resource
Batch Computation
  4,518 IBM dx360 M4 nodes; 16 cores, 32 GB memory per node
  Intel Sandy Bridge EP (2.6 GHz)
  72,288 cores total, 1.50 PFLOPs peak
  TB total DDR memory
  30x increase vs. previous system
High-Performance Interconnect
  Mellanox FDR InfiniBand fat-tree
  13.6 GB/s bidirectional bandwidth per node
  2.5 µs latency (worst case)
  31.7 TB/s bisection bandwidth
Login/Interactive
  6 IBM x3650 M4 nodes, Intel Sandy Bridge EP (2.6 GHz)
  16 cores and 128 GB memory per node
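The headline figures on this slide can be cross-checked with simple arithmetic. The sketch below is a back-of-the-envelope check, not a statement of how NCAR derived the numbers. It assumes 8 double-precision floating-point operations per core per cycle (the usual figure for AVX on Sandy Bridge) and the standard FDR InfiniBand 4x link parameters (14.0625 Gb/s per lane, 64b/66b encoding); neither assumption is stated on the slide.

```python
# Back-of-the-envelope check of the Yellowstone figures quoted on the slide.
NODES = 4518
CORES_PER_NODE = 16
CLOCK_GHZ = 2.6
FLOPS_PER_CYCLE = 8          # assumption: AVX double precision on Sandy Bridge

FDR_LANE_GBPS = 14.0625      # assumption: FDR signalling rate per lane
FDR_LANES = 4                # assumption: 4x link
FDR_ENCODING = 64 / 66       # assumption: 64b/66b line encoding


def total_cores():
    """Cores across all batch nodes."""
    return NODES * CORES_PER_NODE


def peak_pflops():
    """Theoretical peak in PFLOPs: cores x GHz x FLOPs/cycle gives GFLOPs."""
    gflops = total_cores() * CLOCK_GHZ * FLOPS_PER_CYCLE
    return gflops / 1e6


def fdr_bidirectional_gbytes_per_s():
    """Usable FDR 4x bandwidth in GB/s, counting both directions."""
    one_way_gbits = FDR_LANE_GBPS * FDR_LANES * FDR_ENCODING
    return 2 * one_way_gbits / 8


if __name__ == "__main__":
    print(f"cores: {total_cores():,}")
    print(f"peak: {peak_pflops():.2f} PFLOPs")
    print(f"FDR per node: {fdr_bidirectional_gbytes_per_s():.1f} GB/s bidirectional")
```

Under these assumptions the results agree with the slide's 72,288 cores, 1.50 PFLOPs peak, and 13.6 GB/s bidirectional bandwidth per node.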

6 NCAR HPC Profile 20x - 30x previous system performance


7 Geyser and Caldera: NWSC Data Analysis & Visualization Resources
Geyser: large-memory system
  16 IBM x3850 nodes, Intel Westmere-EX processors
  40 cores, 1 TB memory, 1 NVIDIA GPU per node
  Mellanox FDR IB interconnect
Caldera: GPU compute / analysis
  16 IBM dx360 M4 nodes, Intel Sandy Bridge EP
  16 cores, 64 GB memory per node
  2 NVIDIA GPUs per node
  Mellanox FDR IB interconnect


8 Pronghorn: Intel Xeon Phi cluster
16 IBM dx360 M4 nodes
  2 Sandy Bridge EP, 2.6 GHz
  64 GB DDR3-1600 memory
2 Xeon Phi 5110P coprocessors
  60 cores, GHz
  8 GB GDDR5 memory
  PCIe x16 connected
  Passively cooled
Mellanox FDR IB interconnect
Expect installation in May 2013

9 GLADE PB usable capacity PB (2014)
Estimated initial file system sizes
  collections: 2 PB (RDA, CMIP5 data)
  scratch: 5 PB (shared, temporary space)
  projects: 3 PB (long-term, allocated space)
  users: 1 PB (medium-term work space)
Disk Storage Subsystem
  76 IBM DCS3700 controllers & expansion drawers
  90 2-TB NL-SAS drives/controller
  add 30 3-TB NL-SAS drives/controller (1Q2014)
GPFS NSD Servers
  91.8 GB/s aggregate I/O bandwidth; 19 IBM x3650 M4 nodes
I/O Aggregator Servers (GPFS, HPSS connectivity)
  10-GbE & FDR interfaces; 4 IBM x3650 M4 nodes
High-performance I/O interconnect to HPC & DAV
  Mellanox FDR InfiniBand fat-tree
  13.6 GB/s bidirectional bandwidth/node
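The disk counts on this slide can be related to the usable GLADE capacities quoted on slide 4 (10.9 PB initially, 16.4 PB in 1Q2014). The sketch below is illustrative arithmetic only: it uses decimal units (1 PB = 1000 TB), and the slides do not state the RAID or file system layout, so the usable-to-raw ratio is an observation rather than a specification. The per-server bandwidth is a plain average and assumes load is spread evenly across the NSD servers.

```python
# Illustrative capacity arithmetic for the GLADE disk subsystem.
CONTROLLERS = 76
INITIAL_DRIVES = (90, 2)     # (drives per controller, TB per drive)
EXPANSION_DRIVES = (30, 3)   # added in 1Q2014


def raw_capacity_pb(*drive_groups):
    """Raw capacity in PB for the given (count, size_tb) groups per controller."""
    total_tb = sum(CONTROLLERS * count * size_tb for count, size_tb in drive_groups)
    return total_tb / 1000.0


def usable_fraction(usable_pb, raw_pb):
    """Share of raw capacity that ends up usable."""
    return usable_pb / raw_pb


def per_server_bandwidth(aggregate_gb_s=91.8, servers=19):
    """Average I/O bandwidth per NSD server, assuming an even spread."""
    return aggregate_gb_s / servers


if __name__ == "__main__":
    initial = raw_capacity_pb(INITIAL_DRIVES)
    expanded = raw_capacity_pb(INITIAL_DRIVES, EXPANSION_DRIVES)
    print(f"raw: {initial:.2f} PB initially, {expanded:.2f} PB after expansion")
    print(f"usable/raw: {usable_fraction(10.9, initial):.2f} "
          f"and {usable_fraction(16.4, expanded):.2f}")
    print(f"per NSD server: {per_server_bandwidth():.2f} GB/s")
```

In both configurations the usable capacity comes out at roughly 80% of raw, and each NSD server would carry a little under 5 GB/s on average.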

10 Yellowstone Environment
Yellowstone: HPC resource, 1.5 PFLOPS peak
GLADE: central disk resource, 11 PB (2012), 16.4 PB (2014)
Geyser, Caldera and Pronghorn: DAV and accelerated computing
High-bandwidth, low-latency HPC and I/O networks: FDR InfiniBand and 10Gb Ethernet
NCAR HPSS Archive: 100 PB capacity, ~15 PB/yr growth
1Gb/10Gb Ethernet (40Gb+ future) to: science gateways, data transfer services, RDA, ESG, remote vis, partner sites, XSEDE sites


11 Yellowstone Software
System Software
  LSF-HPC batch subsystem / resource manager
  Red Hat Enterprise Linux (RHEL) Version 6
  IBM General Parallel File System (GPFS)
  Mellanox Universal Fabric Manager
  IBM xCAT cluster administration toolkit
Compilers, Libraries, Debugger & Performance Tools
  IBM Parallel Environment (POE), including IBM HPC Toolkit
  Intel Cluster Studio (Fortran, C++, performance & MPI libraries, trace collector & analyzer): 50 concurrent users
  Intel VTune Amplifier XE performance optimizer: 2 concurrent users
  PGI CDK (Fortran, C, C++, pgdbg debugger, pgprof): 50 concurrent users
  PGI CDK GPU Version (Fortran, C, C++, pgdbg debugger, pgprof), for DAV systems only: 2 concurrent users
  PathScale EKOPath (Fortran, C, C++, PathDB debugger): 20 concurrent users
  Rogue Wave TotalView debugger: 8,192 floating tokens

12 Managing Complexity in the USER ENVIRONMENT

13 Comparison of Previous and Current Supercomputers (table comparing Bluefire and Yellowstone; rows: Cores, Compute Architectures, Compilers, MPI Libraries, Levels of Switch, 3rd Party Software, Combinations)

14 Integrated Compute Platforms
Trying to make all the NWSC resources feel more like one machine instead of a collection of machines
Common file systems, schedulers, user environment and software versions
One set of commands that users must learn to compile, run, and navigate the systems

15 Data Centered Design
The GLADE shared file system is designed to increase the efficiency of typical workflows by reducing data movement for users and decreasing the number of spaces they have to manage

16 Data Centered Workflow
One location for all data, accessible from all of our machines: input, model output, post-processed data
Storage spaces and quotas are uniform between machines
Don't need to back up, archive, or manage multiple spaces, or monitor multiple temporary spaces for scrubbing

17 LSF Scheduler
From the user's point of view there is one place to submit jobs, regardless of resource
Different queues depending on job type (e.g. regular, bigmem, gpgpu)
Allows multistage jobs to run on multiple resources
  Large model run on Yellowstone
  Dependent data-analysis run on Geyser
Sharing between projects managed transparently
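A multistage job of the kind described here can be expressed with LSF's job dependency option. The sketch below only builds the `bsub` command lines and prints them; it does not submit anything. The queue names `regular` and `bigmem` come from the slide, while the job names, script names, and task counts are hypothetical placeholders. It uses the standard `bsub` options `-J` (job name), `-q` (queue), `-n` (number of tasks), and `-w` (dependency expression).

```python
import shlex


def bsub_command(job_name, queue, tasks, script, after=None):
    """Build a bsub argument list; 'after' names a job that must finish first."""
    cmd = ["bsub", "-J", job_name, "-q", queue, "-n", str(tasks)]
    if after is not None:
        # done(<name>) holds this job until the named job completes successfully.
        cmd += ["-w", f"done({after})"]
    cmd.append(script)
    return cmd


def two_stage_pipeline():
    """Model run first, then a dependent analysis job in a different queue."""
    # Hypothetical job names, scripts, and task counts, for illustration only.
    model = bsub_command("model_run", "regular", 1024, "./run_model.sh")
    analysis = bsub_command("post_process", "bigmem", 1, "./analyze.sh",
                            after="model_run")
    return [model, analysis]


if __name__ == "__main__":
    for cmd in two_stage_pipeline():
        # shlex.quote keeps the parentheses in done(...) safe for the shell.
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Because both jobs go to the same scheduler, the dependency is resolved by LSF itself and the user does not have to watch for the first job to finish.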

18 LSF Scheduler (diagram: user jobs go to the master batch daemon (MBD) and master LIM, which dispatch to SBD and slave LIM daemons on each resource; HPSS archive jobs on the I/O aggregator nodes; big memory and data analysis jobs on Geyser; visualization and GPGPU jobs on Caldera; Xeon Phi accelerated compute jobs on Pronghorn; large parallel compute jobs on Yellowstone)

19 User Software Environment
Complexity of the user environment greatly increased with Yellowstone
  Multiple compilers and libraries, not always mutually compatible
  Architectural differences between machines
  Amount of available software
Don't want to push this complexity on to our users

20 Modules on Bluefire (partial listing)

21 User Software Environment
Uses Lmod, an environment modules implementation from TACC
Structured for dynamic updating of software dependencies to maintain a consistent user environment
Dependencies between modules are expressed in a structured module tree
Modules are automatically unloaded and reloaded when something they depend on changes in the environment

22 Modulefiles Tree (diagram)
modulefiles/
  idep/       gnuplot/ (.lua modulefiles)
  compilers/  intel/ (.lua modulefiles)
  cdep/       intel/ netcdf/ (.lua modulefiles)

23 User Software Environment
  -bash-4.1$ module list
  Currently Loaded Modules:
    1) ncarenv/1.0   2) ncarbinlibs/1.0   3) intel/   4) ncarcompilers/1.0
    5) netcdf/4.2
  -bash-4.1$ module swap intel pgi
  Due to MODULEPATH changes the following modules have been reloaded:
    1) netcdf   2) ncarcompilers
  -bash-4.1$ module unload pgi
  Inactive Modules:
    1) ncarcompilers   2) netcdf
  -bash-4.1$ module list
  Currently Loaded Modules:
    1) ncarenv/1.0   2) ncarbinlibs/1.0
  Inactive Modules:
    1) ncarcompilers/1.0   2) netcdf/4.2
  -bash-4.1$ module load pathscale
  Activating Modules:
    1) ncarcompilers/1.0   2) netcdf/4.2
  Due to MODULEPATH changes the following modules have been reloaded:
    1) netcdf   2) ncarcompilers
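The behaviour in this session (compiler-dependent modules going inactive when the compiler is unloaded, then coming back when another compiler is loaded) can be mimicked with a small toy model. This is a simplified sketch of the idea only; it is not how Lmod is implemented, and the class and method names are invented for illustration.

```python
class ToyModuleEnv:
    """Toy model of a hierarchical module environment (not Lmod itself)."""

    def __init__(self, compilers, compiler_dependent):
        self.compilers = set(compilers)
        self.compiler_dependent = set(compiler_dependent)
        self.compiler = None   # currently loaded compiler, if any
        self.active = []       # modules usable right now
        self.inactive = []     # modules waiting for a compiler

    def load(self, name):
        """Load a module; return dependents that were (re)activated."""
        if name in self.compilers:
            return self._load_compiler(name)
        if name in self.compiler_dependent and self.compiler is None:
            self.inactive.append(name)
        else:
            self.active.append(name)
        return []

    def unload(self, name):
        """Unload an active module; return dependents that became inactive."""
        self.active.remove(name)
        if name != self.compiler:
            return []
        self.compiler = None
        orphaned = [m for m in self.active if m in self.compiler_dependent]
        self.active = [m for m in self.active if m not in orphaned]
        self.inactive.extend(orphaned)
        return orphaned

    def _load_compiler(self, name):
        # Loading a compiler while another is loaded acts like a swap.
        if self.compiler is not None:
            self.unload(self.compiler)
        self.compiler = name
        self.active.append(name)
        revived, self.inactive = self.inactive, []
        self.active.extend(revived)
        return revived


if __name__ == "__main__":
    env = ToyModuleEnv({"intel", "pgi", "pathscale"}, {"ncarcompilers", "netcdf"})
    for module in ["ncarenv", "intel", "ncarcompilers", "netcdf"]:
        env.load(module)
    print("reloaded after swap to pgi:", env.load("pgi"))
    print("inactive after unloading pgi:", env.unload("pgi"))
    print("activated by pathscale:", env.load("pathscale"))
```

In the real system each reload would pick up the build of the library that matches the new compiler; the toy model only tracks which modules are active.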

24 User Software Environment
Compiler modules export their version as an environment variable, used behind the scenes when software must be rebuilt for different versions of the same compiler
Sets of modules can be saved as defaults or as named environment sets, and shared with other users
Only software consistent with the current environment is listed by the avail command

25 User Software Environment
  -bash-4.1$ module av
  /glade/apps/opt/modulefiles/compilers
    cuda/5.0 gnu/4.7.1 gnu/4.8.0 intel/ (default) pathscale/ pgi/12.5 (default) gnu/4.4.6 gnu/4.7.2 (default) intel/ intel/ pathscale/5.0.0 pgi/13.3 gnu/4.6.4 gnu/4.7.3 intel/ pathscale/ (default) pgi/
  /glade/apps/opt/modulefiles/idep
    METv4.0/4.0 debug/0.0 jython/2.5.3 (default) ncl/6.1.0 python/2.7.3-deprecated Panoply/3.1.7 ferret/6.84 jython/2.7-beta1 ncl/6.1.2 (default) python/2.7.3-gs (default) R/ fftw/3.3.2 lapack/3.2.1 ncl/6.2.0 totalview/ R/ opt (default) ftools/1.0 mathematica/9.0 nco/4.2.0 vapor/2.2.0 (default) R/ rmpi gnuplot/4.6.1 matlab/r2012b nco/4.2.3 (default) vapor/2.2.0.rc1 antlr/2.7.7 grads/2.0.2 mpitrace/1.0 ncview/2.1.1 (default) visit/2.6.2 cdo/1.5.5 gsl/1.15 ncarbinlibs/0.0 paraview/ serial (default) cdo/ (default) hwloc/1.5 ncarbinlibs/1.0 (default) paraview/ serial workshop/1.0 cmake/ idl/8.2.1 ncarenv/1.0 pypy/1.9 ddd/ job_mem/1.0 ncl/6.0.0 pypy/2.0-beta1 (default)
  /glade/apps/opt/modulefiles/cdep/pathscale
    hdf5-mpi/1.8.9 hdf5/1.8.9 ncarcompilers/1.0 (default) netcdf-mpi/4.2 netcdf/4.2 (default) netcdf/4.3.0-rc4 pnetcdf/

26 User Software Environment
Modules can be searched for by string
Compiler wrapper scripts make linking and loading of libraries easier by exporting the correct link flags and bundling rpath info into executables
Autolinking can be disabled, but the correct link paths are still available for inspection in the environment:
  mpiifort -show
  -L/glade/apps/opt/netcdf/4.2/intel/default/lib -lnetcdf_c++4 -lnetcdff
  -lnetcdf -Wl,-rpath,/glade/apps/opt/netcdf/4.2/intel/default/lib
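The link line shown on this slide follows a regular pattern: a `-L` search path, one `-l` flag per library, and an `-rpath` entry so the executable can find the shared libraries at run time. The sketch below shows how a wrapper might assemble such flags from a library directory. It is an illustration of the pattern only, not the NCAR wrapper scripts themselves, and the function name is hypothetical.

```python
def link_flags(lib_dir, libraries):
    """Compose linker flags: search path, libraries, and a matching rpath."""
    flags = [f"-L{lib_dir}"]
    flags += [f"-l{lib}" for lib in libraries]
    # Embedding the rpath lets the executable locate the shared libraries at
    # run time without the user setting LD_LIBRARY_PATH.
    flags.append(f"-Wl,-rpath,{lib_dir}")
    return flags


if __name__ == "__main__":
    # Path and library names as they appear on the slide.
    netcdf_lib = "/glade/apps/opt/netcdf/4.2/intel/default/lib"
    print(" ".join(link_flags(netcdf_lib, ["netcdf_c++4", "netcdff", "netcdf"])))
```

With the slide's netCDF path and library names, this reproduces the flags shown in the wrapper output above.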

27 Looking toward the next machine: BEYOND YELLOWSTONE

28 Historical Power and Efficiency of NCAR Systems
Columns: Name; Model; Peak GFLOPs; Sus GFLOPs; Power (kW); Sus MFLOP/Watt; Est'd Power Cost/yr
  Chipeta: CRI Cray J90se/ $5,625
  Ute: SGI Origin2000/ $38,325
  Blackforest: IBM SP/1308 (318) WH2/NH2 1, $105,000
  Bluesky: IBM p690/32 (50) Regatta-H/Colony 8, $311,250
  Lightning: IBM e325/2 (128) Opteron Linux 1, $36,000
  Bluevista: IBM p575/8 (78) POWER5/HPS 4, $157,950
  Blueice: IBM p575/16 (112) POWER5+/HPS 13,312 1, $244,050
  Bluefire (2008): IBM Power 575 POWER6 DDR-IB 77,005 2, $403,654
  Frost (2009): IBM BlueGene/L (4096/2) 22, $62,325
  Lynx: Cray XT5m (912/76) 8, $26,250
  Yellowstone (2012): IBM iDataPlex/FDR-IB 1,503,590 80,950 1, $1,700,000

29 Life after Yellowstone
Increasing performance within our power budget is expected to require accelerators in the next system (GPU, Xeon Phi, etc.)
We have a large existing codebase (Fortran, C, C++, MPI, OpenMP)
We need a way to port models to accelerators with minimal rewriting, and the ability to maintain a single version of the source code

30 Life after Yellowstone
Bringing in test systems to make GPU and MIC computing resources available to modelers
Working with flagship users' codes to get them running on accelerators and to understand performance characteristics
Some work with GPUs over the last few years (WRF, CESM), and recent promising results on Xeon Phi
Major push ahead of our next procurement

31 Questions?
