The Architecture and the Application Performance of the Earth Simulator

Ken'ichi Itakura (JAMSTEC), http://www.jamstec.go.jp
15 Dec. 2011, ICTS-TIFR Discussion Meeting 2011

Location of Earth Simulator Facilities (map): Tokyo, Yokohama, Earth Simulator Site, HQ.

Earth Simulator Building

Cross-sectional view of the Earth Simulator building (figure labels): lightning conductor, air return duct, Earth Simulator system, double floor for cables, air-conditioning system, power supply system, seismic isolation system.

Earth Simulator
- Earth Simulator: March 2002 to March 2009 (full configuration until Sep. 2008, half system thereafter). Peak performance: 40 TFLOPS; main memory: 10 TB.
- Earth Simulator (II): March 2009 onward. Peak performance: 131 TFLOPS; main memory: 20 TB.

Development of the Earth Simulator (ES)
Development of ES started in 1997 with the aim of gaining a comprehensive understanding of global environmental changes such as global warming. Construction was completed at the end of February 2002, and operation started on March 1, 2002 at the Earth Simulator Center.
TOP500 List: the Earth Simulator took the top position at ISC02 (June 2002) and kept it for two and a half years. The new Earth Simulator system (ES2) was installed in late 2008 and started operation in March 2009.

Earth Simulator (ES2)

ES2 System Outline
- Processor nodes (PNs): SX-9/E, 160 processor nodes (including 2 interactive nodes)
- Total peak performance: 131 TFLOPS
- Total main memory: 20 TB
- Data storage system: 500 TB
- Inter-node network (IXS): 64 GB/s (bidirectional) per node
- User network: 10 GbE (partially 40 GbE link aggregation) connecting labs and user terminals via the login server
- Storage: 4 Gbps Fibre Channel SAN; storage server with 1.5 PB usable capacity (RAID6 HDD)
- Operation and maintenance networks, FC switches, and operation servers
- NQS2 batch job system on the PNs: agent request support, statistical and resource information management, automatic power-saving management
- Maximum power consumption: 3,000 kVA

New System Layout
The original Earth Simulator (now stopped) occupies a 65 m building; the new Earth Simulator (ES2) occupies a 50 m building. The original Earth Simulator began operation in March 2002; the new system started operation in March 2009.

Calculation Nodes
Nodes are clustered to control the system (transparent to users). A cluster consists of 32 nodes, and 156 nodes are used for batch jobs (batch clusters). Node breakdown: 156 L batch nodes, 2 S batch nodes, and 2 interactive nodes, for 160 nodes in total. Each cluster has a WORK area; the HOME/DATA areas can be referenced via the login server.

Calculation Nodes: TSS Cluster
Four special nodes are provided for TSS and small batch jobs. Configuration of the TSS cluster: TSS nodes [2 nodes, reduced to 1 node in 2010] and nodes for single-node batch jobs [2 nodes, increased to 3 nodes].

Calculation Nodes: Batch Clusters
Configuration of the batch clusters: nodes for multi-node batch jobs, with system disks for user file staging.

Calculation Nodes: File Staging
User files for batch jobs are stored on a mass storage system, with automated file recall (stage-in) and migration (stage-out). All clusters are connected to the mass storage system through IOCS servers (Linux workstations).

Hardware Specifications

                        ES              ES2 (SX-9/E)          Ratio
CPU    Clock cycle      1 GHz           3.2 GHz               3.2x
       Performance      8 GFLOPS        102.4 GFLOPS          12.8x
Node   #CPUs            8               8                     1x
       Performance      64 GFLOPS       819.2 GFLOPS          12.8x
       Memory           16 GB           128 GB                8x
       Network          12.3 GB/s x2    8 GB/s x8 x2          5.2x
System #Nodes           640             160                   1/4x
       Performance      40 TFLOPS       131 TFLOPS            3.2x
       Memory           10 TB           20 TB                 2x
       Network          Full crossbar   2-level fat tree      -
                                        (full bisection bandwidth)
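The peak figures in the table above follow directly from the per-CPU numbers. The short C program below is an illustrative cross-check, not part of the original presentation; it only reuses the values quoted in the table.

```c
/* Rough cross-check of the peak-performance figures in the table above.
   Illustrative only; the inputs are the slide's numbers, not measurements. */
#include <stdio.h>

int main(void) {
    /* ES:  8 CPUs/node at 8 GFLOPS each, 640 nodes */
    double es_node  = 8 * 8.0;            /* 64 GFLOPS per node */
    double es_sys   = es_node * 640;      /* ~40 TFLOPS system peak */

    /* ES2: 8 CPUs/node at 102.4 GFLOPS each, 160 nodes */
    double es2_node = 8 * 102.4;          /* 819.2 GFLOPS per node */
    double es2_sys  = es2_node * 160;     /* ~131 TFLOPS system peak */

    printf("ES  node %.1f GF, system %.1f TF\n", es_node,  es_sys  / 1000.0);
    printf("ES2 node %.1f GF, system %.1f TF\n", es2_node, es2_sys / 1000.0);
    printf("per-node speedup %.1fx, system speedup %.1fx\n",
           es2_node / es_node, es2_sys / es_sys);   /* 12.8x and 3.2x */
    return 0;
}
```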

ES2 Software
- OS: SUPER-UX
- Environment: Fortran90, C/C++, MPI-1/2, HPF, MathKeisan (BLAS, LAPACK, etc.), ASL
- Cross compilers on the login server (Linux)

ES2 Operation

ES2 Operation
- How many users? About 800.
- How many jobs? About 10,000 jobs per month.
- Power (including cooling)? About 3,000 kVA, roughly 70% of the original ES.
- Average load from running jobs? 70-80%; most of the remainder is used for pre/post-processing.

FY2011 ES2 Projects
- Proposed research projects: 29 (Earth Science 18, Innovation 11)
- Contract research projects: KAKUSHIN 5, Strategic Industrial Use 13, CREST 1
- JAMSTEC resource allocation: JAMSTEC research projects 14, JAMSTEC collaboration research, and industrial fee-based usage (new projects are accepted at any time)
- Users: 565; organizations: 125 (university 57, government 15, company 34, international 19)

Computing Resource Distribution by Job Size (FY2010)
1-4 nodes: 28.7%; 5-8 nodes: 17.6%; 9-16 nodes: 12.4%; 17-32 nodes: 21.8%; 33-64 nodes: 16.7%; 65 nodes and over: 2.9%

ES2 Application Fields (FY2010)
Global Warming (IPCC): 41%; Atmospheric and Oceanographic Science: 28%; Solid Earth Science: 16%; Epoch-Making Simulation: 11%; Industrial Use: 4%

ES2 Node Utilization (FY2010): operation was stopped on 14 March.
ES2 Node Utilization (FY2011): reduced-capacity operation is carried out for power saving.

Monthly Power Consumption (kWh)

Month    ES2, 2009   ES, 2008             Reduction rate
April    3,065       3,987                76.9%
May      3,105       4,013                77.4%
June     2,944       3,978                74.0%
July     3,084       4,015                76.8%
August   2,973       4,138                71.9%
Sep.     3,042       3,986                76.3%
Oct.     3,091       1,752 (half system)  (176.5%)

ES2's power consumption is about 75% of that of ES. Combined with the 3.2x higher peak performance, the ratio of peak performance to power consumption is 4.34 times better than for ES.

Application Performance

ES2 Application 1: AFES, OFES, CFES

ES2 Application 2

ES2 Application 3

ES2 Application 4

Performance Evaluation Results with ES Real Applications

Code Name   Elapsed time on ES [sec]   #CPUs on ES   Elapsed time on ES2 [sec] (speedup)   #CPUs on ES2
PHASE       135.3                      4096          62.2 (2.18)                           1024
NICAM-K*    214.7                      2560          109.3 (1.97)                          640
MSSG        173.9                      4096          86.5 (2.01)                           1024
SpecFEM3D   96.3                       4056          45.5 (2.12)                           1014
Seism3D     48.8                       4096          15.6 (3.13)                           1024

The harmonic mean of the speedup ratios is 2.22, i.e. ES2 is 2.22 times faster (a short sketch of this calculation follows the WRF overview below).

WRF
WRF (Weather Research and Forecasting Model) is a mesoscale meteorological simulation code developed in collaboration among US institutions, including NCAR (National Center for Atmospheric Research) and NCEP (National Centers for Environmental Prediction). JAMSTEC optimized WRFV2 on the Earth Simulator (ES2), renewed in 2009, and measured its computational performance. As a result, we demonstrated that WRFV2 runs on ES2 with outstanding sustained performance.
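The 2.22x figure quoted above is the harmonic mean of the five per-code speedups. The C snippet below is an illustrative sketch of that calculation, not part of the original presentation; it reuses the speedup ratios from the table.

```c
/* Harmonic mean of the per-code ES -> ES2 speedups from the table above.
   Illustrative only; the values are the published speedup ratios. */
#include <stdio.h>

int main(void) {
    /* PHASE, NICAM-K, MSSG, SpecFEM3D, Seism3D */
    const double speedup[] = { 2.18, 1.97, 2.01, 2.12, 3.13 };
    const int n = sizeof speedup / sizeof speedup[0];

    double inv_sum = 0.0;
    for (int i = 0; i < n; i++)
        inv_sum += 1.0 / speedup[i];

    printf("harmonic mean speedup: %.2f\n", n / inv_sum);  /* prints ~2.22 */
    return 0;
}
```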

WRF Performance on ES

TOP500 (Linpack) rank history, June 2002 to November 2011: the original ES held Rank 1 from June 2002 through June 2004, then ranked 3, 4, 7, 10, 14, 20, 30, 49 and 73 through November 2008; ES2 ranked 22, 31, 37, 54, 68 and 94 on the June 2009 through November 2011 lists (Rank 94 in Nov. 2011).

HPC Challenge Awards
The competition focuses on four of the most challenging benchmarks in the suite:
- Global HPL: the Linpack TPP benchmark, which measures the floating-point rate of execution for solving a linear system of equations.
- DGEMM: measures the floating-point rate of execution of double-precision real matrix-matrix multiplication.
- Global RandomAccess: measures the rate of integer random updates of memory (GUPS).
- EP STREAM (Triad), per system: a simple synthetic benchmark that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for a simple vector kernel (a minimal sketch of this kernel appears after the award tables below).
- Global FFT: measures the floating-point rate of execution of a double-precision complex one-dimensional Discrete Fourier Transform (DFT).

The 2009 HPC Challenge Class 1 Awards:

G-HPL            Achieved       System     Affiliation
1st place        1533 Tflop/s   Cray XT5   ORNL
1st runner up    736 Tflop/s    Cray XT5   UTK
2nd runner up    368 Tflop/s    IBM BG/P   LLNL

G-RandomAccess   Achieved       System     Affiliation
1st place        117 GUPS       IBM BG/P   LLNL
1st runner up    103 GUPS       IBM BG/P   ANL
2nd runner up    38 GUPS        Cray XT5   ORNL

G-FFT            Achieved       System     Affiliation
1st place        11 Tflop/s     Cray XT5   ORNL
1st runner up    8 Tflop/s      Cray XT5   UTK
2nd runner up    7 Tflop/s      NEC SX-9   JAMSTEC

EP-STREAM-Triad (system)   Achieved   System     Affiliation
1st place        398 TB/s       Cray XT5   ORNL
1st runner up    267 TB/s       IBM BG/P   LLNL
2nd runner up    173 TB/s       NEC SX-9   JAMSTEC

The XT5 at ORNL has a peak of 2.3 PF versus 131 TF for ES2, about 17 times more; however, its G-FFT performance is only about 1.5 times higher.
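Of the benchmarks above, EP-STREAM Triad is the simplest to state: each process repeatedly executes the vector update a(i) = b(i) + q*c(i) and reports the resulting memory bandwidth. The single-process C sketch below illustrates the kernel only; it is not the official HPCC/STREAM harness, and the array size, repetition count, and timing are simplified assumptions.

```c
/* Minimal single-process illustration of the STREAM Triad kernel,
   a[i] = b[i] + q*c[i].  Not the official HPCC/STREAM harness: array
   size, timing, and the single repetition are simplifications. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L   /* elements per array; chosen to exceed typical caches */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    const double q = 3.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];               /* the Triad kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * N * sizeof(double);  /* 2 loads + 1 store per element */
    printf("Triad bandwidth: %.2f GB/s (check value %.1f)\n",
           bytes / sec / 1e9, a[N / 2]);      /* print a[] so the loop is not optimized away */

    free(a); free(b); free(c);
    return 0;
}
```

The system-level EP-STREAM figures in the table above aggregate this per-process bandwidth across all processes in the machine.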

http://www.hpcchallenge.org/

HPC Challenge Awards 2010: the new Earth Simulator (ES2, SX-9/E) took No. 1 in the world for G-FFT and No. 3 for EP-STREAM Triad.

Efficiency of G-FFT: ES2 achieved 11.88 TF out of a 131.07 TF peak (9.1%), while the XT5 achieved 10.70 TF out of a 2,044.7 TF peak (0.52%), making ES2 about 17.5 times more efficient.

HPC Challenge Awards 2010 (chart): peak performance vs. G-FFT performance. Jaguar (ORNL): 2,045 TF peak (about 15.6 times that of ES2), 10.7 TF G-FFT; Earth Simulator 2: 131 TF peak, 11.9 TF G-FFT. "We are BIG!!"
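The efficiency figures above are simply sustained G-FFT performance divided by peak performance. The C snippet below is an illustrative recomputation, not from the original slides, using only the numbers quoted above.

```c
/* Sustained-vs-peak efficiency for G-FFT, recomputed from the figures
   quoted above.  Illustrative arithmetic only. */
#include <stdio.h>

int main(void) {
    const double es2_fft = 11.88, es2_peak = 131.07;   /* TFLOPS */
    const double xt5_fft = 10.70, xt5_peak = 2044.7;   /* TFLOPS */

    double es2_eff = es2_fft / es2_peak;               /* ~9.1%  */
    double xt5_eff = xt5_fft / xt5_peak;               /* ~0.52% */

    printf("ES2 G-FFT efficiency: %.1f%%\n", es2_eff * 100.0);
    printf("XT5 G-FFT efficiency: %.2f%%\n", xt5_eff * 100.0);
    /* ~17x; the slide's 17.5x comes from dividing the rounded 9.1% by 0.52% */
    printf("ES2 is %.1f times more efficient\n", es2_eff / xt5_eff);
    return 0;
}
```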

The 2011 HPC Challenge Class 1 Awards:

TOP500 (Linpack)
  JAMSTEC ES2 (SX-9):      Rank 94, 122 TF, 1,280 cores
  RIKEN AICS (K Computer): Rank 1, 10,510 TF, 705,024 cores
  Comparison: K is 86 times faster than ES2.

G-FFT (with effective performance ratio)
  JAMSTEC ES2 (SX-9):      Rank 2, 11.9 TF, 9.1%, 1,280 cores
  RIKEN AICS (K Computer): Rank 1, 34.7 TF, 1.47%, 147,456 cores
  Comparison: K is only 2.9 times faster than ES2; ES2 is 6.2 times more efficient than K.

Thank you for your kind attention!
Japan Agency for Marine-Earth Science and Technology (JAMSTEC)