The Earth Simulator System


NEC Res. & Develop., January 2003 / Special Issue on High Performance Computing / Architecture and Hardware for HPC

By Shinichi HABATA,* Mitsuo YOKOKAWA** and Shigemune KITAWAKI***

ABSTRACT  The Earth Simulator, developed under the Japanese government's initiative "Earth Simulator Project," is a highly parallel vector supercomputer system that consists of 640 processor nodes and an interconnection network. Each processor node is a shared-memory parallel vector supercomputer in which 8 vector processors, each delivering 8GFLOPS, are tightly connected to a shared memory, giving a node peak performance of 64GFLOPS. The interconnection network is a huge non-blocking crossbar switch linking the 640 processor nodes and supporting global addressing and synchronization. The aggregate peak vector performance of the Earth Simulator is 40TFLOPS, and the interconnection bandwidth between every two processor nodes is 12.3GB/s in each direction. The aggregate switching capacity of the interconnection network is 7.87TB/s. To realize a high-performance and high-efficiency computer system, three architectural features are applied in the Earth Simulator: vector processors, shared memory, and a high-bandwidth non-blocking crossbar interconnection network. The Earth Simulator achieved 35.86TFLOPS, or 87.5% of the peak performance of the system, in the LINPACK benchmark, and has thereby been proven to be the most powerful supercomputer in the world. It also achieved 26.58TFLOPS, or 64.9% of the peak performance of the system, for a global atmospheric circulation model with the spectral method. This record-breaking sustained performance makes this innovative system a very effective scientific tool for providing solutions for the sustainable development of humankind and its symbiosis with the planet Earth.

KEYWORDS  Supercomputer, HPC (High Performance Computing), Parallel processing, Shared memory, Distributed memory, Vector processor, Crossbar network

*Computers Division  **National Institute of Advanced Industrial Science and Technology (AIST)  ***Earth Simulator Center, Japan Marine Science and Technology Center

1. INTRODUCTION

The Japanese government's initiative "Earth Simulator Project" started in 1997. Its target was to promote research for global change prediction by using computer simulation. One of the most important project activities was to develop a supercomputer at least 1,000 times as powerful as the fastest one available at the beginning of the Earth Simulator project.

After five years of research and development, the Earth Simulator was completed and came into operation at the end of February 2002. Within only two months from the beginning of its operational run, the Earth Simulator was proven to be the most powerful supercomputer in the world by achieving 35.86TFLOPS, or 87.5% of the peak performance of the system, in the LINPACK benchmark. It also achieved 26.58TFLOPS, or 64.9% of the peak performance of the system, for a global atmospheric circulation model with the spectral method. This simulation model was also developed in the Earth Simulator project.

2. SYSTEM OVERVIEW

The Earth Simulator is a highly parallel vector supercomputer system consisting of 640 processor nodes (PNs) and an interconnection network (IN) (Fig. 1). Each processor node is a shared-memory parallel vector supercomputer in which 8 arithmetic processors (APs) are tightly connected to a main memory system (MS). Each AP is a vector processor that can deliver 8GFLOPS, and the MS is a shared memory of 16GB. The total system thus comprises 5,120 APs and 640 MSs, and the aggregate peak vector performance and memory capacity of the Earth Simulator are 40TFLOPS and 10TB, respectively. The interconnection network is a huge 640 × 640 non-blocking crossbar switch linking the 640 PNs. The interconnection bandwidth between every two PNs is 12.3GB/s in each direction, and the aggregate switching capacity of the interconnection network is 7.87TB/s.
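As a quick consistency check, the headline figures above follow directly from the per-node numbers quoted in the text (the text rounds 40.96TFLOPS and 10.24TB to 40TFLOPS and 10TB):

\begin{align*}
640 \times 8 &= 5{,}120 \ \text{arithmetic processors},\\
5{,}120 \times 8\ \text{GFLOPS} &= 40.96\ \text{TFLOPS peak vector performance},\\
640 \times 16\ \text{GB} &= 10.24\ \text{TB of main memory},\\
640 \times 12.3\ \text{GB/s} &= 7.872\ \text{TB/s aggregate switching capacity}.
\end{align*}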
Fig. 1 General configuration of the Earth Simulator.

The operating system manages the Earth Simulator as a two-level cluster system called the Super-Cluster System. In the Super-Cluster System, the 640 nodes are grouped into 40 clusters (Fig. 2). Each cluster consists of 16 processor nodes, a Cluster Control Station (CCS), an I/O Control Station (IOCS) and system disks. Each CCS controls 16 processor nodes and an IOCS. The IOCS is in charge of data transfer between the system disks and the Mass Storage System: it moves user file data from the Mass Storage System to a system disk, and processing results from a system disk back to the Mass Storage System. The supervisor, called the Super-Cluster Control Station (SCCS), manages all clusters and provides a Single System Image (SSI) operational environment.

Fig. 2 System configuration of the Earth Simulator.

For efficient resource management and job control, the clusters are classified as one S-cluster and 39 L-clusters. In the S-cluster, two nodes are used for interactive work and the others for small-size batch jobs. User disks are connected only to S-cluster nodes and are used for storing user files. The system disks and user disks together provide several hundred terabytes of capacity, and the Mass Storage System is a cartridge tape library system with a capacity of more than a petabyte.

To realize a high-performance and high-efficiency computer system, three architectural features are applied in the Earth Simulator:

- Vector processors
- Shared memory
- A high-bandwidth, non-blocking crossbar interconnection network

From the standpoint of parallel programming, three levels of parallelism are provided to gain high sustained performance, as sketched in the example below:

- Vector processing on a processor
- Parallel processing with shared memory within a node
- Parallel processing among distributed nodes via the interconnection network
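These three levels correspond to the familiar hybrid programming style on such machines: message passing between nodes, shared-memory threading across the 8 processors of a node, and compiler vectorization of stride-1 inner loops. The following is a minimal, generic sketch of that structure only; it is not code from the Earth Simulator's programming environment, and the kernel, array size, and use of OpenMP/MPI here are purely illustrative.

/* Sketch of the three parallelization levels described above:
 *   level 1: a stride-1 inner loop the compiler can vectorize,
 *   level 2: threads sharing a node's memory (OpenMP here),
 *   level 3: message passing between distributed nodes (MPI).
 * Illustrative only; N and the triad-like kernel are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)                      /* elements per node (arbitrary) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nodes;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nodes);

    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* Level 2: the loop is split across the threads of one node;      */
    /* Level 1: each thread's stride-1 chunk is vectorized by the compiler. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 1.5 * c[i];

    /* Level 3: combine partial results across nodes over the network. */
    double local = a[N - 1], global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum over %d nodes: %g\n", nodes, global);

    free(a); free(b); free(c);
    MPI_Finalize();
    return 0;
}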

3. PROCESSOR NODE

The processor node is a shared-memory parallel vector supercomputer in which 8 arithmetic processors, each delivering 8GFLOPS, are tightly connected to a main memory system with a peak performance of 64GFLOPS (Fig. 3). It consists of the 8 arithmetic processors, a main memory system, a Remote-access Control Unit (RCU), and an I/O processor (IOP).

Fig. 3 Processor node configuration.

The most advanced hardware technologies were adopted in the Earth Simulator development. One of the main features is a one-chip vector processor with a peak performance of 8GFLOPS. This highly integrated LSI is fabricated using 0.15µm CMOS technology with copper interconnection. The vector pipeline units on the chip operate at 1GHz, while other parts, including the external interface circuits, operate at 500MHz. The Earth Simulator uses ordinary air cooling, even though the processor chip dissipates a considerable amount of power, because a high-efficiency heat sink using heat pipes is adopted.

A high-speed main memory device was also developed to reduce the memory access latency and access cycle time. In addition, 500MHz source-synchronous transmission is used for data transfer between the processor and main memory to increase the memory bandwidth. The data transfer rate between each processor and the main memory is 32GB/s, and the aggregate memory bandwidth of a node is 256GB/s.

Two levels of parallel programming paradigms are provided within a processor node:

- Vector processing on a processor
- Parallel processing with shared memory

4. INTERCONNECTION NETWORK

The interconnection network is a huge 640 × 640 non-blocking crossbar switch, supporting global addressing and synchronization. To realize such a huge crossbar switch, a byte-slicing technique is applied: the 640 × 640 non-blocking crossbar switch is divided into a control unit and 128 data switch units, as shown in Fig. 4. Each data switch unit is a one-byte-wide 640 × 640 non-blocking crossbar switch (the slicing itself is illustrated in the short sketch below).

Fig. 4 Interconnection network configuration.

Physically, the Earth Simulator comprises 320 processor node cabinets and 65 interconnection network cabinets. Each PN cabinet contains two processor nodes, and the 65 IN cabinets contain the interconnection network. These PN and IN cabinets are installed in a building 65m long and 50m wide (Fig. 5). The interconnection network is positioned in the center of the computer room; it occupies an area of approximately 180m², while the Earth Simulator as a whole occupies approximately 1,600m².

Fig. 5 Bird's-eye view of the Earth Simulator building.

No supervisor exists in the interconnection network. The control unit and the 128 data switch units operate asynchronously, so a processor node controls the whole sequence of an inter-node communication.
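The byte slicing referred to above means that each data switch unit carries exactly one byte position of a transfer, so every transfer travels through all 128 one-byte-wide switches in parallel and is reassembled at the receiving node. The sketch below is only a software illustration of that slicing and reassembly; the plane count is taken from the text, while the buffer sizes and function names are made up for the example.

/* Software illustration of byte slicing across NPLANES one-byte-wide
 * data switch units: byte k of every NPLANES-byte chunk travels
 * through plane k and is put back in place at the receiver. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPLANES 128                      /* one-byte-wide data switch units */

/* Scatter a buffer across the planes (len must be a multiple of NPLANES). */
static void slice(const uint8_t *buf, size_t len, uint8_t *plane[NPLANES])
{
    for (size_t i = 0; i < len; i++)
        plane[i % NPLANES][i / NPLANES] = buf[i];
}

/* Reassemble the original byte order at the receiving node. */
static void unslice(uint8_t *plane[NPLANES], size_t len, uint8_t *buf)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = plane[i % NPLANES][i / NPLANES];
}

int main(void)
{
    enum { LEN = 4 * NPLANES };
    static uint8_t in[LEN], out[LEN], lanes[NPLANES][LEN / NPLANES];
    uint8_t *plane[NPLANES];
    for (int k = 0; k < NPLANES; k++) plane[k] = lanes[k];
    for (int i = 0; i < LEN; i++) in[i] = (uint8_t)i;

    slice(in, LEN, plane);               /* sender side */
    unslice(plane, LEN, out);            /* receiver side */
    assert(memcmp(in, out, LEN) == 0);   /* the round trip preserves the data */
    return 0;
}

One convenient consequence, exploited by the ECC scheme described below, is that a fault in a single data switch unit corrupts at most one byte position of a transfer, which the receiver can detect and correct byte by byte.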

For example, the sequence of a data transfer from node A to node B is shown in Fig. 6:

1) Node A requests the control unit to reserve a data path from node A to node B; the control unit reserves the data path and then replies to node A.
2) Node A begins the data transfer to node B.
3) Node B receives all the data, then sends a data-transfer completion code to node A.

Fig. 6 Inter-node communication mechanism.

In the Earth Simulator, tens of thousands of pairs of high-speed serial links over copper cable are used to realize the aggregate switching capacity of 7.87TB/s, with many pairs connecting each processor node to the interconnection network. The error occurrence rate therefore cannot be ignored if stable inter-node communication is to be achieved. To resolve this problem, ECC codes are added to the transferred data, as shown in Fig. 7. A receiver node detects an intermittent inter-node communication failure by checking the ECC codes, and the erroneous byte data can almost always be corrected by the RCU within the receiver node. ECC codes are also used to recover from a continuous inter-node communication failure caused by a data switch unit malfunction; in this case the erroneous byte data are continuously corrected by the RCU within every receiver node until the broken data switch unit is repaired.

Fig. 7 Inter-node interface with ECC codes.

To realize high-speed synchronization of parallel processing among nodes, a special feature is introduced: counters within the interconnection network's control unit, called Global Barrier Counters (GBC), and flag registers within each processor node, called Global Barrier Flags (GBF). This barrier synchronization mechanism, shown in Fig. 8, works as follows:

1) The master node sets the number of nodes used by the parallel program into the GBC within the IN's control unit.
2) The control unit resets the GBFs of all the nodes used by the program.
3) Each node, when its task completes, decrements the GBC within the control unit and repeatedly checks its GBF until the GBF is asserted.
4) When the GBC reaches 0, the control unit asserts the GBFs of all the nodes used by the program.
5) All the nodes begin to process their next tasks.

Fig. 8 Barrier synchronization mechanism using GBC.

With this mechanism the barrier synchronization time stays almost constant at a few microseconds as the number of nodes grows, as shown in Fig. 9.

Fig. 9 Scalability of MPI_Barrier.
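Purely as an illustration of these semantics, and not of the hardware or of how MPI_Barrier is actually implemented on this machine, the same counter-and-flag protocol can be written in software, with threads standing in for processor nodes, an atomic counter for the GBC in the control unit, and per-node flags for the GBFs:

/* Software analogy of the GBC/GBF barrier described above.  Threads stand
 * in for processor nodes; the real mechanism lives in the crossbar's
 * control unit and in node registers.  The node count is illustrative. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NODES 8                          /* illustrative number of nodes */

static atomic_int  gbc;                  /* Global Barrier Counter (control unit) */
static atomic_bool gbf[NODES];           /* Global Barrier Flag, one per node */

static void barrier_setup(int nodes)     /* steps 1) and 2): master + control unit */
{
    for (int i = 0; i < nodes; i++)
        atomic_store(&gbf[i], false);
    atomic_store(&gbc, nodes);
}

static void barrier_wait(int node)
{
    /* Step 3): the node reports completion by decrementing the GBC ... */
    if (atomic_fetch_sub(&gbc, 1) == 1) {
        /* Step 4): GBC has reached 0; here the last arrival plays the
         * control unit's role and asserts every participating GBF. */
        for (int i = 0; i < NODES; i++)
            atomic_store(&gbf[i], true);
    }
    /* ... and polls its own GBF until it is asserted (then step 5 follows). */
    while (!atomic_load(&gbf[node]))
        ;
}

static void *node_task(void *arg)
{
    int id = (int)(long)arg;
    /* ... the work of the current task would go here ... */
    barrier_wait(id);
    printf("node %d past the barrier\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NODES];
    barrier_setup(NODES);
    for (long i = 0; i < NODES; i++)
        pthread_create(&t[i], NULL, node_task, (void *)i);
    for (int i = 0; i < NODES; i++)
        pthread_join(t[i], NULL);
    return 0;
}

The division of labor matters: each node only polls a register of its own while a single counter in the switch collects the arrivals, which is consistent with the flat synchronization time reported above.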

5. PERFORMANCE

Several performance measurements of the interconnection network, together with the LINPACK benchmark, have been carried out.

First, the interconnection bandwidth between every two nodes is shown in Fig. 10. The horizontal axis shows the transfer data size, and the vertical axis shows the bandwidth. Because the interconnection network is a single-stage crossbar switch, this performance is always obtained for communication between any two nodes.

Fig. 10 Inter-node communication bandwidth.

Second, the effectiveness of the Global Barrier Counter is summarized in Fig. 11. The horizontal axis shows the number of processor nodes, and the vertical axis shows the MPI_Barrier synchronization time. The line labeled "with GBC" shows the barrier synchronization time using the GBC, and "without GBC" shows software barrier synchronization. With the GBC feature, the MPI_Barrier synchronization time stays essentially constant at a few microseconds, whereas the software barrier synchronization time grows in proportion to the number of nodes.

Fig. 11 Scalability of MPI_Barrier.

Finally, the LINPACK benchmark was measured to verify that the Earth Simulator is a high-performance and high-efficiency computer system. The result is shown in Fig. 12, with the number of processors varying up to 5,120. The ratio to peak performance is more than 85%, and the LINPACK performance is proportional to the number of processors.

Fig. 12 LINPACK performance and peak ratio.
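Quantities like those plotted in Figs. 10 and 11 are routinely obtained with small MPI kernels. The sketch below is a generic version of such a measurement, not the authors' benchmark code, and the message size and repetition counts are arbitrary; it times a ping-pong between two ranks for point-to-point bandwidth and a repeated MPI_Barrier for synchronization time.

/* Generic MPI micro-benchmarks for the two quantities discussed above:
 * point-to-point bandwidth between two nodes (ping-pong) and the cost
 * of MPI_Barrier.  Sizes and repetition counts are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }

    /* Ping-pong bandwidth between ranks 0 and 1. */
    const int bytes = 1 << 22;           /* 4MB message (arbitrary) */
    const int reps  = 100;
    char *buf = malloc(bytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)                       /* one-way bandwidth from round trips */
        printf("bandwidth: %.2f GB/s\n", 2.0 * bytes * reps / (t1 - t0) / 1e9);

    /* Barrier synchronization time, averaged over many repetitions. */
    const int breps = 10000;
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < breps; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("MPI_Barrier: %.2f usec over %d ranks\n",
               (t1 - t0) / breps * 1e6, size);

    free(buf);
    MPI_Finalize();
    return 0;
}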

6. CONCLUSION

An overview of the Earth Simulator system has been given in this paper, focusing on its three architectural features and their hardware realization, especially the details of the interconnection network. The three architectural features (vector processors, shared memory and a high-bandwidth non-blocking crossbar interconnection network) support three levels of parallel programming: vector processing on a processor, parallel processing with shared memory within a node, and parallel processing among distributed nodes via the interconnection network.

The Earth Simulator was thus proven to be the most powerful supercomputer, achieving 35.86TFLOPS, or 87.5% of the peak performance of the system, in the LINPACK benchmark, and 26.58TFLOPS, or 64.9% of the peak performance of the system, for a global atmospheric circulation model with the spectral method[3]. This record-breaking sustained performance makes this innovative system a very effective scientific tool for providing solutions for the sustainable development of humankind and its symbiosis with the planet Earth.

ACKNOWLEDGMENTS

The authors would like to offer their sincere condolences for the late Dr. Hajime Miyoshi, who initiated the Earth Simulator project with outstanding leadership. The authors would also like to thank all the members who were engaged in the Earth Simulator project, and especially H. Uehara and J. Li for the performance measurements.

REFERENCES

[1] M. Yokokawa, S. Shingu, et al., "Performance Estimation of the Earth Simulator," Towards Teracomputing, Proc. of the 8th ECMWF Workshop, World Scientific, 1998.
[2] K. Yoshida and S. Shingu, "Research and Development of the Earth Simulator," Proc. of the 9th ECMWF Workshop, World Scientific, 2001.
[3] M. Yokokawa, S. Shingu, et al., "A 26.58 TFLOPS Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator," Proc. of SC2002, 2002.

Received November 2002

* * * * * * * * * * * * * * *

Shinichi HABATA received his B.E. degree in electrical engineering from Hokkaido University in 1980 and his M.E. degree in information engineering from the same university in the early 1980s. He then joined NEC Corporation, where he is now a manager in an engineering department of the Computers Division, engaged in the development of supercomputer hardware. Mr. Habata is a member of the Information Processing Society of Japan.

Mitsuo YOKOKAWA received his B.S. and M.S. degrees in mathematics from Tsukuba University in the early 1980s, and a D.E. degree in computer engineering from Tsukuba University in the early 1990s. He joined the Japan Atomic Energy Research Institute, where he was engaged in high-performance computation in nuclear engineering, and he was engaged in the development of the Earth Simulator from 1996 to 2002. He has since joined the National Institute of Advanced Industrial Science and Technology (AIST), where he is Deputy Director of the Grid Technology Research Center. He was a visiting researcher at the Theory Center of Cornell University in the US in the early 1990s. Dr. Yokokawa is a member of the Information Processing Society of Japan and the Japan Society for Industrial and Applied Mathematics.

Shigemune KITAWAKI received his B.S. and M.S. degrees in mathematical engineering from Kyoto University in 1966 and 1968, respectively. He joined NEC Corporation in 1968 and was then involved in developing compilers for NEC's computers. He joined the Earth Simulator project in 1998, and is now Group Leader of the User Support Group of the Earth Simulator Center, Japan Marine Science and Technology Center. Mr. Kitawaki belongs to the Information Processing Society of Japan and the Association for Computing Machinery.

* * * * * * * * * * * * * * *