Fast computers, big/fast storage, fast networks

Marla Meehl, Manager, Network Engineering and Telecommunications, NCAR/UCAR, and Manager of the Front Range GigaPoP
Computational & Information Systems Laboratory, National Center for Atmospheric Research

History of Supercomputing at NCAR
[Timeline chart, 27 Oct '09: 35 systems over 35 years (1960-2015), spanning the Marine St., Mesa Lab I, Mesa Lab II, and NWSC facilities. The legend distinguishes production systems, experimental/test systems that became production systems, and experimental/test systems; red text marks systems currently in operation within the NCAR Computational and Information Systems Laboratory's computing facility.]
Systems shown include: new system @ NWSC; IBM p6 p575/4096 (bluefire); IBM p6 p575/192 (firefly); IBM p5+ p575/1744 (blueice); IBM p5 p575/624 (bluevista); Aspen Nocona-IB/40 (coral); IBM BlueGene-L/8192 (frost); IBM BlueGene-L/2048 (frost); IBM e1350/140 (pegasus); IBM e1350/264 (lightning); IBM p4 p690-f/64 (thunder); IBM p4 p690-c/1600 (bluesky); IBM p4 p690-c/1216 (bluesky); SGI Origin 3800/128 (tempest); Compaq ES40/36 (prospect); IBM p3 WH2/1308 (blackforest); IBM p3 WH2/604 (blackforest); IBM p3 WH1/296 (blackforest); Linux Networx Pentium-II/16 (tevye); SGI Origin2000/128 (ute); Cray J90se/24 (chipeta); HP SPP-2000/64 (sioux); CDC 6600; CDC 3600; CDC 7600; Cray 1-A S/N 3 (C1); Cray C90/16 (antero); Cray J90se/24 (ouray); Cray J90/20 (aztec); Cray J90/16 (paiute); Cray T3D/128 (T3); Cray T3D/64 (T3); Cray Y-MP/8I (antero); CCC Cray 3/4 (graywolf); IBM SP1/8 (eaglesnest); TMC CM5/32 (littlebear); IBM RS/6000 Cluster (CL); Cray Y-MP/2 (castle); Cray Y-MP/8 (shavano); TMC CM2/8192 (capitol); Cray X-MP/4 (CX); Cray 1-A S/N 14 (CA).

Current HPC Systems at NCAR

System          Vendor/System       Frames  Processors  Processor Type  GHz  Peak TFLOPs
Production systems:
bluefire (IB)   IBM p6 Power 575    11      4096        POWER6          4.7  77.000
Research, divisional & test systems:
firefly (IB)    IBM p6 Power 575    1       192         POWER6          4.7  3.610
frost (BG/L)    IBM BlueGene/L      4       8192        PowerPC-440     0.7  22.935
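
The peak figures in the table follow directly from processor count, clock rate, and flops per cycle. A minimal sketch, assuming 4 double-precision floating-point operations per cycle per processor (two FPUs with fused multiply-add on POWER6, and the double-pipeline FPU on BlueGene/L's PowerPC 440), reproduces the table's numbers:

```python
# Peak TFLOPs = processors x clock (GHz) x flops per cycle / 1000
def peak_tflops(processors: int, ghz: float, flops_per_cycle: int = 4) -> float:
    return processors * ghz * flops_per_cycle / 1000.0

for name, procs, ghz in [("bluefire", 4096, 4.7),
                         ("firefly", 192, 4.7),
                         ("frost", 8192, 0.7)]:
    print(f"{name:8s} {peak_tflops(procs, ghz):7.3f} TFLOPs")
# bluefire  77.005 TFLOPs
# firefly    3.610 TFLOPs
# frost     22.938 TFLOPs
```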

High Utilization, Low Queue Wait Times
[Chart: Bluefire system utilization (daily average), Nov-08 through Nov-09, with dips marked for the spring ML powerdown and the May 3 firmware & software upgrade. Companion chart: number of jobs by queue-wait time (<2m through >2d) for the Premium, Regular, Economy, and Standby queues.]
- Bluefire system utilization is routinely >90%
- Average queue-wait time < 1 hr

Average queue-wait times for user jobs at the NCAR/CISL Computing Facility (all jobs, hh:mm):

Queue      bluefire         lightning        blueice          bluevista
           (since Jun'08,   (since Dec'04,   (Jan'07-Jun'08,  (Jan'06-Sep'08,
           77.0 TFLOPs)     1.1 TFLOPs)      12.2 TFLOPs)     4.4 TFLOPs)
Premium    00:05            00:11            00:11            00:12
Regular    00:36            00:16            00:37            01:36
Economy    01:52            01:10            03:37            03:51
Stand-by   01:18            00:55            02:41            02:14

Bluefire Usage by Discipline
[Pie chart: Climate 49.3%, Accelerated Scientific Discovery 18.0%, Weather 10%, Astrophysics 4.4%, Oceanography 4.0%, Atmospheric Chemistry 3.5%, Basic Fluid Dynamics 2.6%, Cloud Physics 1.3%, Upper Atmosphere 0.3%, Miscellaneous 0.3%.]

Current NCAR Computing: >100 TFLOPs
[Chart: Peak TFLOPs at NCAR (all systems), Jan-00 through Jan-10, with the ARCS Phase 1-4 and ICESS Phase 1-2 procurements marked. Systems shown: IBM POWER6/Power575/IB (bluefire), IBM POWER6/Power575/IB (firefly), IBM POWER5+/p575/HPS (blueice), IBM POWER5/p575/HPS (bluevista), IBM BlueGene/L (frost), IBM Opteron/Linux (pegasus), IBM Opteron/Linux (lightning), IBM POWER4/Federation (thunder), IBM POWER4/Colony (bluesky), IBM POWER4 (bluedawn), SGI Origin3800/128, IBM POWER3 (blackforest), IBM POWER3 (babyblue).]
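
As a quick check on the >100 TFLOPs figure, a sketch using only the three systems from the table two slides back (the chart also counts older systems still in service) already clears the mark:

```python
# Peak TFLOPs of the currently listed NCAR systems (values from the table above)
peaks = {"bluefire": 77.000, "firefly": 3.610, "frost": 22.935}
print(sum(peaks.values()))  # 103.545 TFLOPs -- already above 100 TFLOPs
```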

CISL Computing Resource Users: 1,347 users in FY09
- Graduate and undergraduate students: 343 (25%)
- University research associates: 280 (21%)
- NCAR associate scientists & support staff: 252 (19%)
- University faculty: 191 (14%)
- NCAR scientists & project scientists: 160 (12%)
- Government: 108 (8%)
- Other: 13 (1%)

Universities with the Largest Number of Users in FY09
[Bar chart: number of users per university (y-axis 0-120). Users are from 114 U.S. universities.]

Disk Storage Systems
Current disk storage for HPC and data collections:
- 215 TB for bluefire (GPFS)
- 335 TB for Data Analysis and Visualization (DAV) (GPFS)
- 110 TB for frost (local)
- 152 TB for data collections (RDA, ESG) (SAN)
The storage environment is undergoing extensive redesign and expansion: ~2 PB will be added to the existing GPFS system to create a large central (cross-mounted) filesystem.

Data Services Redesign
Near-term goals:
- Creation of a unified and consistent data environment for NCAR HPC
- High-performance availability of the central filesystem from many projects/systems (RDA, ESG, CAVS, Bluefire, TeraGrid)
Longer-term goals:
- Filesystem cross-mounting
- Global WAN filesystems
Schedule:
- Phase 1: Add 300 TB (now)
- Phase 2: Add 1 PB (March 2010)
- Phase 3: Add 0.75 PB (TBD)
Together the three phases provide the ~2 PB expansion noted above.

Mass Storage Systems
NCAR MSS (Mass Storage System):
- In production since the 1970s
- Transitioning from the NCAR Mass Storage System to HPSS prior to NCAR-Wyoming Supercomputer Center (NWSC) commissioning in 2012
- Cutover January 2011; transition without data migration
Tape drives/silos:
- Sun/StorageTek T10000B drives (1 TB media) with SL8500 libraries
- Phasing out tape silos in Q2 2010
- Oozing 4 PB (unique), 6 PB with duplicates: an online process with optimized data transfer, averaging 20 TB/day with 10 streams
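
At those rates, a back-of-the-envelope estimate (a rough illustration, not a figure from the slides) puts the ooze at several months to a year of continuous transfer:

```python
# Rough ooze-duration estimate at the quoted average rate of ~20 TB/day
# aggregated over 10 streams (1 PB taken as 1000 TB).
rate_tb_per_day = 20
for label, petabytes in [("unique data", 4), ("with duplicates", 6)]:
    days = petabytes * 1000 / rate_tb_per_day
    print(f"{label}: ~{days:.0f} days (~{days / 30:.0f} months)")
# unique data: ~200 days (~7 months)
# with duplicates: ~300 days (~10 months)
```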

Estimated Growth
[Chart: total petabytes stored (with DSS stewardship & IPCC AR5), actual and projected, Oct '07 through Oct '12; y-axis 0-35 PB.]

Facility Challenge Beyond 2011: NCAR Data Center
- The NCAR data center's limits on power, cooling, and space have been reached
- Any future computing augmentation will exceed existing center capacity
- Planning is in progress; the following slides cover the partnership, the focus on sustainability and efficiency, and the project timeline

NCAR-Wyoming Supercomputer Center (NWSC) Partners
- NCAR & UCAR
- University of Wyoming
- State of Wyoming
- Cheyenne LEADS
- Wyoming Business Council
- Cheyenne Light, Fuel & Power Company
- National Science Foundation (NSF)
http://cisl.ucar.edu/nwsc/

Focus on Sustainability
- Maximum energy efficiency
- LEED certification
- Achievement of the smallest possible carbon footprint
- Adaptable to the ever-changing landscape of high-performance computing
- Modular and expandable space that can be adapted as program needs or technology demands dictate

Focus on Efficiency
- Don't fight Mother Nature
- The cleanest electron is an electron never used
- The computer systems drive all of the overhead loads; NCAR will evaluate sustained performance per kW in procurement
- Focus on the biggest losses: compressor-based cooling, UPS losses, transformer losses
[Pie charts comparing load breakdowns (lights, cooling, fans, electrical losses, IT load): in a typical modern data center the IT load is 65.9% of the total (lights 0.1%, cooling 15.0%, fans 6.0%, electrical losses 13.0%), versus 91.9% IT load in the NWSC design.]

Design Progress
NWSC will be one of the most efficient data centers:
- Cheyenne's climate eliminates refrigerant-based cooling for ~98% of the year
- Efficient water use
- Minimal electrical transformation steps
- Guaranteed 10% renewable power, with an option for 100% (wind power)
Power Usage Effectiveness (PUE):
- PUE = total load / computer load
- NWSC design: 1.08-1.10 for Wyoming
- Good data centers: ~1.5
- Typical commercial data centers: ~2.0
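
Reading this slide together with the load breakdown on the previous one, PUE is simply the reciprocal of the IT share of total load; a minimal sketch (the 91.9% and 65.9% figures come from that slide, everything else is illustrative):

```python
# PUE = total facility load / IT (computer) load
#     = 1 / (IT load as a fraction of total load)
def pue(it_fraction: float) -> float:
    return 1.0 / it_fraction

print(round(pue(0.919), 2))  # ~1.09 -- NWSC design (91.9% of load is IT)
print(round(pue(0.659), 2))  # ~1.52 -- typical modern data center (65.9% IT)
```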

Expandable and Modular
Long term:
- 24-acre site
- Dual substation electrical feeds
- Utility commitment up to 36 MW
- Substantial fiber-optic capability
Medium term:
- Phase I / Phase II can be doubled
- Enough structure for three or four generations
- Future container solutions or other advances
Short term:
- Module A and Module B, each 12,000 sq. ft.
- Install 4.5 MW initially; can be doubled in Module A
- Upgrade cooling and electrical systems in 4.5 MW increments

Project Timeline
- 65% design review complete (October 2009)
- 95% design complete
- 100% design complete (February 2010)
- Anticipated groundbreaking (June 2010)
- Certification of occupancy (October 2011)
- Petascale system in production (January 2012)

NCAR-Wyoming Supercomputer Center

HPC Services at NWSC
- PetaFLOP computing: 1 PFLOP peak minimum
- Hundreds-of-petabytes archive (+20 PB yearly growth): HPSS; tape archive hardware acquisition TBD
- Petabyte shared file system(s) (5-15 PB): GPFS/HPSS HSM integration (or Lustre)
- Data analysis and visualization
- Cross-mounting to NCAR research labs and others
- Research Data Archive data servers and data portals
- High-speed networking and some enterprise services

NWSC HPC Projection
[Chart: Peak PFLOPs at NCAR, Jan-04 through Jan-14 (y-axis 0-2.0 PFLOPs), showing the projected NWSC HPC system with an uncertainty band alongside existing systems: IBM POWER6/Power575 (bluefire), IBM POWER5+/p575/HPS (blueice), IBM POWER5/p575 (bluevista), IBM BlueGene/L (frost), IBM Opteron/Linux (pegasus), IBM Opteron/Linux (lightning), IBM POWER4/Colony (bluesky); ARCS Phase 4 and ICESS Phases 1-2 marked.]

NCAR Data Archive Projection
[Chart: total data in the NCAR archive (actual and projected), total and unique petabytes, Jan-99 through Jan-15; y-axis 0-120 PB.]

Tentative Procurement Timeline
[Gantt chart, Jul-09 through Jan-12: NDAs, RFI, RFI evaluation, RFP draft, RFP refinement, benchmark refinement, RFP release, proposal evaluation, negotiations, NSF approval, award, installation & ATP, NWSC production.]

Front Range GigaPoP (FRGP)
- 10 years of operation by UCAR
- 15 members
- NLR, Internet2, and ESnet peering at 10 Gbps
- Commodity Internet via Qwest and Level3
- CPS and TransitRail peering
- Intra-FRGP peering
- Akamai
www.frgp.net

Fiber, fiber, fiber
- NCAR/UCAR intra-campus fiber: 10 Gbps backbone
- Boulder Research and Administration Network (BRAN)
- Bi-State Optical Network (BiSON): upgrading to 10/40/100 Gbps-capable
  - Enables the 10 Gbps NCAR/UCAR TeraGrid connection
  - Enables a 10 Gbps NOAA/DC link via an NLR lifetime lambda
- DREAM: 10 Gbps
- SCONE: 1 Gbps

Western Regional Network (WRN)
- A multi-state partnership to ensure that robust, advanced, high-speed networking is available for research, education, and related uses
- Increased aggregate bandwidth
- Decreased costs

Questions