CISL Update: Operations and Services
CISL HPC Advisory Panel (CHAP) Meeting
Anke Kamrath (anke@ucar.edu)
Operations and Services Division
Computational and Information Systems Laboratory
A lot has happened in the last 6 months
Updates:
- Staffing
- NWSC
- Archive
- RDA
- USS
- New supercomputer: Yellowstone
OSD Staff Comings and Goings
New staff at ML:
- Dan Nagle, HPC/CSG Consultant
  - Current chair of the US Fortran standards committee (PL22 and PL22.3)
  - NCAR rejoined the InterNational Committee for Information Technology Standards (INCITS)
- Wes Jeanette, Student Assistant with the VAPOR team
- Kevin Marnross, EaSM DSS hire, from the National Severe Storms Laboratory
- Matthew Pasiewiecz, Web Software Engineer, GIS background
Departures:
- Rob Perry (moved to GLOBE)
- Alyson Ellerin (retirement)
- Rocky Madden
- Lewis Johnson
Openings:
- 1 Mechanical Technician (at NWSC)
- SSG position (at NWSC)
- Temp admin support (at NWSC)
NWSC Staff
Relocations:
- Rebecca Bouchard (CPG to CASG)
- Raisa Leifer (CPG to CASG)
- Jirina Kokes (CPG to CASG)
- John Ellis (SSG)
- Jim VanDyke (NETS)
New hires:
- Jenny Brennan, NWSC Admin; executive assistant at a Toyota subsidiary, Detroit, Michigan
- Matthew Herring, NWSC Electrician; Encore Electric, subcontractor for NWSC
- Lee Myers, NWSC CASG; Computer Science instructor, Eastern Wyoming College
- Jonathan Roberts, NWSC CASG; IBM Boulder
- Jonathan Frazier, NWSC CASG; DOD subcontractor; served in the United States Air Force (Cheyenne)
- David Read, NWSC CASG; data center systems administration at the LOWES distribution center (Cheyenne)
And here they are!
NWSC Update
So much has happened it is difficult to summarize:
- Tape archive acceptance
- Network complete to NWSC: LAN, WAN, and wireless
- Transition of NOC to TOC
- NWSC facilities management
  - First winter season
  - Fitup and transition of building
- HPSS equipment install and preparation for production cut-over
- Yellowstone planning
- Electrical & mechanical install for Yellowstone
Tape Archive Install
NWSC Oversight Committee becomes Transition Oversight Committee
- Transition Oversight Committee (TOC) held December 15th
  - Smooth handoff from NOC (December 14th)
- Three new members added:
  - Thomas Hauser (CU)
  - Francesca Verdier (NERSC)
  - Henning Weber (DWD)
- Next meeting: summer of 2012
- Exit briefing (next slide)
TOC Exit Briefing
General comments:
- Nothing critical to say. Presentations were great.
- Impressed with technical depth of staff. "Can't imagine how they thought of so much."
General recommendations:
- Write whitepaper on NWSC design innovations to share with community (Aaron/Gary)
- Develop NWSC space strategy for potential partners
  - Under development already (Anke, Aaron, Rich)
  - Separate NWSC fiber hut agreement in development
- Consider having users involved during acceptance testing
  - 21-day acceptance period; Bluefire/Yellowstone overlap is short (~6 weeks)
Next meeting (August):
- High-level update on all topics
- Overview of (increased) operational costs for NWSC over Mesa Lab
- EOT update (visitor center)
- Grand opening planning update
Lessons Learned (the expected)
Seasonal issues that affect the building: wind, weather, cold
- Air sampling sensor froze, causing a refrigerant alarm
- Air dryers for the pneumatic valves were not installed correctly; condensation froze in the line
- Boilers went out on high winds
Lessons Learned (the unexpected)
- A ground fault within the utility (CLF&P) caused a power bump at NWSC
  - Pumps in the business park surged and over-pressurized the fire line
  - Flow sensor identified water was flowing to the computer room, which triggered the EPO
  - Should do this to protect electronics if water is flowing
  - Some additional fail-safes required, or pressure reduction
- Identified that not all new staff had all the right information:
  - Cell phone numbers
  - Emergency procedures
  - Where devices are located & how to respond
Yellowstone Fitup - Electrical
Yellowstone Fitup - Electrical
Yellowstone Fitup - Mechanical
Yellowstone Fitup - Mechanical
Key Resource Transition Efforts
LAN/WAN Network Build-out:
- NEW install at NWSC
- KEEP current install at ML
Infrastructure Services (e.g., DNS, Nagios, DHCP):
- NEW install at NWSC
- KEEP current install at ML
- MOVE data collection servers
Archival Services:
- NEW install at NWSC
- MOVE primary HPSS site from ML to NWSC
- MOVE data from ML to NWSC
- KEEP HPSS disaster recovery capability at ML
- DECOMMISSION Type B tape drives/tapes at ML
Resource Transition Efforts (cont.)
Central Data Services:
- NEW GLADE install at NWSC
- MOVE (some) data (and possibly storage) from ML to NWSC
- DECOMMISSION GLADE at ML
Supercomputing Services:
- NEW install (Yellowstone) at NWSC
- DECOMMISSION Bluefire at ML
- KEEP Lynx at ML
General Preparation Schedule (Sep 2011 - Sep 2012)
HPC fit-up:
- Design and pricing: complete
- Package at NSF: complete
- Equipment ordering: complete
- Installation
Staffing:
- Onsite management team (Gary New, Ingmar T., Jim Van Dyke): complete
- Additional elec/mech: complete
- Admin / 2 sys admins: complete
- 2 software engineers: 1 existing, 1 new
- Additional system admins: complete
Networking:
- LAN, wireless LAN, WAN north activation, WAN south activation, data center switch install: complete
Enterprise services:
- Infrastructure installation & Nagios development: complete; SAs trained, now monitoring Boulder from Cheyenne
- Accounting system development and deployment: mirroring Bluefire accounting now
Archival services:
- Tape library delivery/install: complete
- Tape archive acceptance test: completed Nov 30
- HPSS servers/data movers/cache: complete
- HPSS primary at NWSC: 1-day outage for cutover May 23
- Data migration: complete in 2014
Procurement & Installation Schedule (Sep 2011 - Sep 2012)
NWSC-1 Procurement:
- Selection: complete
- Package at NSF: complete
- Signed agreement: complete
- Delivery schedule finalized: complete
Central Filesystem Services:
- Storage delivery/install
- Test equipment delivery/install
- Data mover delivery/install
- Integration and checkout
- Acceptance testing
- Migrate data collections to NWSC
- Production
Supercomputing Services:
- Test equipment delivery/install
- HPC delivery
- Integration and checkout
- Acceptance testing
- Production
User Preparations & Allocations Schedule (Sep 2011 - Sep 2012)
Allocations:
- ASD submission, review, user prep: due Mar 5, 2012
- CSL submission, review, user prep: CSLAP meets Mar 21-23
- University submission, review: CHAP meets May 3
- Wyoming-NCAR submission, review: WRAP meets May 3-4
- NCAR submission, review, user prep
User preparation:
- Presentations (NCAR, SC, AGU, AMS, Wyoming)
- User survey, environment, documentation
- Training
- ASD and early use
- Production use
- Weeklong training workshops until mid-2016
HPSS and AMSTAR Update
- AMSTAR extended 2 years, through December 2014
  - 5 TB per cartridge technology
  - 30 PB capacity increase over 2 years
- New tape libraries, drives, and media installed at NWSC in November 2011 for primary and second copies
- New tape drives and media installed at ML in November 2011 for disaster recovery copies of selected data
- Will release an archive RFP in January 2013, after we have experience with NWSC-1 data storage needs
- Additional HPSS server capacity has been delivered to NWSC and is being configured for May deployment
- Production data and metadata will migrate to NWSC after CASG 24/7 support is available at NWSC in May
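As a rough cross-check of what these figures imply, a sketch of the cartridge count behind the capacity increase (decimal units and uncompressed capacity are assumptions, not stated on the slide):

```python
# Cartridges implied by the 30 PB AMSTAR capacity increase at
# 5 TB per cartridge (decimal units, uncompressed -- assumptions).
capacity_increase_tb = 30 * 1000      # 30 PB expressed in TB
tb_per_cartridge = 5
cartridges = capacity_increase_tb // tb_per_cartridge
print(cartridges)  # 6000 cartridges over the two-year extension
```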
RDA Update
- EaSM data support portal is now operational
  - 6-hourly 20th Century Run & four future projections
- RDA transition-to-NWSC plans:
  - No user access downtime; increase amount of GLADE online data
  - Use NWSC DAV to offer more data subsetting services (current status on next slide)
Monthly Impact of RDA Data Subsetting
[Chart: monthly request input volume (TB, up to ~500) vs. request output volume (TB, up to ~100) by month. Subsetting using DAV reduces data transfer volumes by 95%.]
USS Update
- New Daily Bulletin rolled out last October (using Drupal)
- Winter HPC workshop in February, with 13 onsite and 13 online attendees
- CSG staff organized the first UCAR SEA (Software Engineering Assembly) conference in February
- Globus Online service deployed in production; now the recommended off-site data transfer method
- CSG collaborated with NCO and netCDF developers to resolve a severe performance problem
- Numerous documentation improvements
NCAR's Data-Centric Supercomputing Environment: Yellowstone
Yellowstone Environment
GLADE: GLobally Accessible Data Environment
NWSC Centralized Filesystems & Data Storage Resource
GPFS NSD servers:
- 20 IBM x3650 M4 nodes; Intel Xeon E5-2670 processors w/AVX
- 16 cores, 64 GB memory per node; 2.6 GHz clock
- 91.8 GB/sec aggregate I/O bandwidth (4.8+ GB/s/server)
I/O aggregator servers (export GPFS, CFDS-HPSS connectivity):
- 4 IBM x3650 M4 nodes; Intel Xeon E5-2670 processors w/AVX
- 16 cores, 64 GB memory per node; 2.6 GHz clock
- 10 Gigabit Ethernet & FDR fabric interfaces
High-performance I/O interconnect to HPC & DAV resources:
- Mellanox FDR InfiniBand full fat-tree
- 13.6 GB/sec bidirectional bandwidth/node
Disk storage subsystem:
- 76 IBM DCS3700 controllers & expansion drawers; 90 2-TB NL-SAS drives/controller initially; add 30 3-TB NL-SAS drives/controller 1Q2014
- 10.94 PB usable capacity (initially); 16.42 PB usable capacity (1Q2014)
(The Xeon E5-2670 is code-named Sandy Bridge EP.)
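The usable-capacity figures can be reconciled with the drive counts. A sketch of the arithmetic, assuming 8+2 RAID-6 parity (80% efficiency) and decimal TB/PB; the parity scheme is an assumption, not stated on the slide:

```python
# Sanity check of GLADE usable capacity from the drive counts,
# assuming 8+2 RAID-6 (80% efficiency -- an assumption) and
# decimal units.
controllers = 76
raw_initial_pb = controllers * 90 * 2 / 1000   # 90 x 2 TB drives per controller
raw_added_pb   = controllers * 30 * 3 / 1000   # 30 x 3 TB drives added 1Q2014
efficiency = 0.8                               # 8 data + 2 parity drives
print(round(raw_initial_pb * efficiency, 2))                   # 10.94 PB initially
print(round((raw_initial_pb + raw_added_pb) * efficiency, 2))  # 16.42 PB at 1Q2014
```

The close match to the slide's 10.94 PB and 16.42 PB figures suggests the 80% assumption is consistent with the configuration.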
Yellowstone
NWSC High-Performance Computing Resource
Batch computation nodes:
- 4,518 IBM dx360 M4 nodes; Intel Xeon E5-2670 processors with AVX
- 16 cores, 32 GB memory, 2.6 GHz clock, 333 GFLOPs per node
- 4,518 nodes, 72,288 cores total
- 1.504 PFLOPs peak
- 144.6 TB total memory
- 28.9 Bluefire equivalents
High-performance interconnect:
- Mellanox FDR InfiniBand full fat-tree
- 13.6 GB/sec bidirectional bandwidth/node
- <2.5 usec latency (worst case)
- 31.7 TB/sec bisection bandwidth
Login/interactive nodes:
- 6 IBM x3650 M4 nodes; Intel Xeon E5-2670 processors with AVX
- 16 cores & 128 GB memory per node
Service nodes (LSF, license servers):
- 6 IBM x3650 M4 nodes; Intel Xeon E5-2670 processors with AVX
- 16 cores & 32 GB memory per node
(The Xeon E5-2670 is code-named Sandy Bridge EP.)
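The headline numbers follow directly from the per-node specs. A quick check, assuming 8 double-precision FLOPs per cycle per core for AVX (an assumption consistent with the 333 GFLOPs/node figure):

```python
# Reconstruct Yellowstone's aggregate figures from per-node specs:
# 16 cores x 2.6 GHz x 8 FLOPs/cycle (AVX assumption) per node.
nodes = 4518
cores_per_node = 16
ghz = 2.6
flops_per_cycle = 8
cores = nodes * cores_per_node
node_gflops = cores_per_node * ghz * flops_per_cycle
peak_pflops = nodes * node_gflops / 1e6
print(cores)                  # 72288 cores total
print(round(node_gflops, 1))  # 332.8 GFLOPs/node (slide rounds to 333)
print(round(peak_pflops, 3))  # 1.504 PFLOPs peak
```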
Geyser and Caldera
NWSC Data Analysis & Visualization Resource
Geyser: large-memory system
- 16 IBM x3850 X5 nodes; Intel Westmere-EX processors
- 40 2.4-GHz cores, 1 TB memory per node
- 1 NVIDIA Quadro 6000 GPU per node
- Mellanox FDR full fat-tree interconnect
Caldera: GPU-computation/visualization system
- 16 IBM dx360 M4 nodes; Intel Xeon E5-2670 processors with AVX
- 16 2.6-GHz cores, 64 GB memory per node
- 2 NVIDIA Tesla M2070Q GPUs per node
- Mellanox FDR full fat-tree interconnect
Knights Corner system (November 2012 delivery):
- 16 IBM dx360 M4 nodes; Intel Xeon E5-2670 processors with AVX
- 16 Sandy Bridge cores, 64 GB memory, 2 Knights Corner adapters per node
- Each Knights Corner adapter contains 62 x86 processors
- Mellanox FDR full fat-tree interconnect
(The Quadro 6000 and M2070Q are identical ASICs. The Xeon E5-2670 is code-named Sandy Bridge EP.)
Yellowstone Software
Compilers, libraries, debuggers & performance tools:
- Intel Cluster Studio (Fortran, C++, performance & MPI libraries, trace collector & analyzer): 50 concurrent users
- Intel VTune Amplifier XE performance optimizer: 2 concurrent users
- PGI CDK (Fortran, C, C++, pgdbg debugger, pgprof): 50 concurrent users
- PGI CDK GPU version (Fortran, C, C++, pgdbg debugger, pgprof): for DAV systems only, 2 concurrent users
- PathScale EKOPath (Fortran, C, C++, PathDB debugger): 20 concurrent users
- Rogue Wave TotalView debugger: 8,192 floating tokens
- Rogue Wave ThreadSpotter: 10 seats
- IBM Parallel Environment (POE), including IBM HPC Toolkit
System software:
- LSF-HPC batch subsystem / resource manager (IBM has acquired Platform Computing, Inc., developers of LSF-HPC)
- Red Hat Enterprise Linux (RHEL) Version 6
- IBM General Parallel File System (GPFS)
- Mellanox Universal Fabric Manager
- IBM xCAT cluster administration toolkit
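With LSF-HPC as the batch subsystem, users will submit jobs via `bsub` directive scripts. A minimal sketch of an MPI job script, assuming the IBM PE `mpirun.lsf` launcher; the job name, queue name, project code, and executable are hypothetical placeholders, not Yellowstone specifics:

```shell
#!/bin/bash
#BSUB -J wrf_test           # job name (hypothetical)
#BSUB -P P00000000          # project/account code (placeholder)
#BSUB -n 64                 # total number of MPI tasks
#BSUB -W 0:30               # wall-clock limit (hh:mm)
#BSUB -q regular            # queue name (assumed placeholder)
#BSUB -o wrf_test.%J.out    # stdout file; %J expands to the job ID
#BSUB -e wrf_test.%J.err    # stderr file

# Launch the MPI executable under LSF; binary name is a placeholder.
mpirun.lsf ./wrf.exe
```

Submission would then be `bsub < wrf_test.lsf`, with `bjobs` and `bkill` for monitoring and control.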
Erebus
Antarctic Mesoscale Prediction System (AMPS)
IBM iDataPlex compute cluster:
- 84 IBM dx360 M4 nodes; Intel Xeon E5-2670 processors
- 16 cores, 32 GB memory per node; 2.6 GHz clock
- 1,344 cores total; 28 TFLOPs peak
- Mellanox FDR-10 InfiniBand full fat-tree interconnect
- 0.54 Bluefire equivalents
Login nodes:
- 2 IBM x3650 M4 nodes; Intel Xeon E5-2670 processors
- 16 cores & 128 GB memory per node
Dedicated GPFS filesystem:
- 2 IBM x3650 M4 GPFS NSD servers
- 57.6 TB usable disk storage
- 9.6 GB/sec aggregate I/O bandwidth
Erebus, on Ross Island, is Antarctica's most famous volcanic peak and one of the largest volcanoes in the world, within the top 20 in total size and reaching a height of 12,450 feet. Small eruptions have been occurring 6-10 times daily since the mid-1980s.
(The Xeon E5-2670 is code-named Sandy Bridge EP.)
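The Erebus figures can be cross-checked the same way as Yellowstone's, including the "Bluefire equivalents" rating. A sketch, assuming 8 FLOPs/cycle per core (as for Yellowstone) and taking the TFLOPs-per-Bluefire-equivalent ratio implied by Yellowstone's 28.9 equivalents at 1.504 PFLOPs:

```python
# Cross-check Erebus: same per-core arithmetic as Yellowstone,
# plus the bluefire-equivalent ratio implied by the Yellowstone
# slide (28.9 equivalents at 1,504 TFLOPs peak).
cores = 84 * 16
peak_tflops = cores * 2.6 * 8 / 1000   # 8 FLOPs/cycle assumed
tflops_per_bfe = 1504 / 28.9
print(cores)                           # 1344 cores
print(round(peak_tflops))              # 28 TFLOPs peak
print(round(peak_tflops / tflops_per_bfe, 2))  # 0.54 bluefire equivalents
```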
Yellowstone Environment Test Systems
Yellowstone test system: Jellystone (installed @ NWSC)
- 32 IBM dx360 M4 nodes; Intel Xeon E5-2670 processors
- 1 login node, 2 management/service nodes
- 2 36-port Mellanox FDR IB switches, full fat-tree
- 1 partially-filled iDataPlex rack & part of DAV 19" rack
GLADE test system: Picanicbasket (installed @ Mesa Lab; moves to NWSC during 2H2012 for integration)
- 2 IBM x3650 M4 nodes; Intel Xeon E5-2670 processors
- 1 DCS3700 dual controller & expansion drawer, ~150 TB storage
- 1 IBM x3650 management/service node
- 1 36-port Mellanox FDR IB switch, full fat-tree
- 2 partially-filled 19" racks
Geyser and Caldera test system: Yogi & Booboo (installed @ NWSC)
- 1 IBM x3850 large-memory node, 2 IBM dx360 M4 GPU-computation nodes
- 1 IBM x3650 management/service node
- 1 36-port Mellanox FDR IB switch, full fat-tree
- 1 partially-filled 19" rack
Yellowstone Physical Infrastructure
Racks by resource:
- HPC: 63 iDataPlex racks (72 nodes per rack); 10 19" racks (9 Mellanox FDR core switches, 1 Ethernet switch); 1 19" rack (login, service, management nodes)
- CFDS: 19 NSD server, controller, and storage racks; 1 19" rack (I/O aggregator nodes, management, IB & Ethernet switches)
- DAV: 1 iDataPlex rack (GPU-comp & Knights Corner); 2 19" racks (large memory, management, IB switch)
- AMPS: 1 iDataPlex rack; 1 19" rack (login, IB, NSD, disk & management nodes)
Total power required: ~2.13 MW
- HPC: ~1.9 MW
- CFDS: 0.134 MW
- DAV: 0.056 MW
- AMPS: 0.087 MW
NWSC Floorplan
[Floorplan, north at top: CFDS storage, Erebus, Geyser, test systems, DAV IB & management, viewing area, Yellowstone, Caldera & Knights Corner.]
Yellowstone Systems (north half)
[Diagram, north at top: CFDS storage, AMPS/Erebus, test systems, Geyser, DAV IB & management.]
Yellowstone Systems (south half)
[Diagram, north at top: InfiniBand switches, 26 HPC cabinets, Caldera & Knights Corner.]
Yellowstone Schedule
Current schedule (late October 2011 through mid-October 2012), in sequence:
- Storage & InfiniBand delivery & installation
- Test systems delivery & installation
- Production systems delivery & installation
- Integration & checkout
- Infrastructure preparation
- Acceptance testing
- ASD & early users
- Production science
Equipment Delivery Schedule
Tentative schedule provided May 2 (IBM still refining); all dates are tentative, to be confirmed ~May 9. Equipment is delivered to NWSC unless otherwise indicated.
- Tue 15 May 2012: Mellanox InfiniBand racks
- Thu 24 May 2012: CFDS test rack (deliver to Mesa Lab)
- Thu 24 May 2012: HPC & DAV test racks
- Tue 29 May 2012: AMPS system
- Mon 11 Jun 2012: CFDS storage racks
- Tue 19 Jun 2012: HPC management rack (H00) and 10 RDHXs
- Fri 29 Jun 2012: Production HPC groups 1-3, delivered over a ~10-day period finishing on this date
- Mon 09 Jul 2012: Remainder of RDHXs
UCAR Confidential
NWSC-1 Resource Allocation by Cost
- HPC: 67%
- CFDS: 21%
- DAV: 8%
- Maintenance: 4%
Questions?