Experiences of the Development of the Supercomputers

1 Experiences of the Development of the Supercomputers - Earth Simulator and K Computer YOKOKAWA, Mitsuo Kobe University/RIKEN AICS

2 Application Oriented Systems Developed in Japan
[Figure: No.1 systems in the TOP500 list by fiscal year; labels include Titan and Tianhe-2]

3 Talk Outline
Earth Simulator (ES)
  Motivation and Target Performance of ES
  Achievements in Science & Engineering by ES
K Computer
  Objectives and Design Targets
  Overview
  Storage System
Summary

4 Motivation
In 1996, we faced important and serious problems in global natural phenomena that affect human activities. Understanding and predicting these phenomena is essential.
  Climate change (global warming)
  Weather forecasting (heavy rain, droughts)
  Earthquakes and their mechanisms
  Air pollution (acid rain, ozone hole)
Simulation models and computer resources in 1997 were insufficient in resolution to solve these large-scale, complicated phenomena. We started the development of the Earth Simulator.

5 Target Performance of ES
Resolution in atmospheric circulation models:
  Model                 1997    Target
  Regional model        km      ~1 km
  Global model (AGCM)   km      5-10 km
AGCM simulation with a 10 km mesh (courtesy of JMA):
  Horizontal meshes: 4000 x 2000
  Vertical layers: several tens
  Time step: around 1/10
At least 1000 times the performance is required (about 5 Tflop/s sustained).
(Cf. the sustained performance of an AGCM simulation on a typical supercomputer is 4-6 Gflop/s.)
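The target figures above can be checked with simple arithmetic. The following is a minimal sketch, not part of the original talk. It assumes a baseline of 5 Gflop/s (the midpoint of the 4-6 Gflop/s range on the slide), an equatorial circumference of about 40,075 km, and a purely illustrative layer count, since the slide only says "several tens".

```python
# Back-of-envelope check of the ES performance target.
# Assumption: baseline is the midpoint of the 4-6 Gflop/s range on the slide.
baseline_gflops = 5.0
required_speedup = 1000

# Convert Gflop/s to Tflop/s after applying the speedup.
target_tflops = baseline_gflops * required_speedup / 1000.0

# Horizontal mesh from the slide.
nx, ny = 4000, 2000

# Assumption: Earth's equatorial circumference, used to estimate mesh width.
equator_km = 40075.0
mesh_km = equator_km / nx

# Hypothetical layer count; the slide only says "several tens".
layers = 50
grid_points = nx * ny * layers

print(f"target sustained performance: {target_tflops:.1f} Tflop/s")
print(f"east-west mesh width at the equator: {mesh_km:.2f} km")
print(f"grid points with {layers} layers: {grid_points:,}")
```

With these assumptions the mesh width comes out close to the 10 km quoted for the AGCM run, and the 1000-fold speedup lands on the 5 Tflop/s sustained target.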

6 Development of ES
Design Concepts
  Distributed-memory parallel system with vector processors as compute nodes
  Full crossbar network
  More than 32 TFLOPS peak performance and 8 TB main memory capacity
Development Schedule (5-year project)
  Late project leader: Miyoshi-sensei
  Operation started on March 11

7 Storage System of ES
Hierarchical storage system in ES:
  User volume: 228 TB (/home 43.4 TB, /data 184.5 TB)
  Work volume: 460 TB
  MDPS (Mass Data Processing System): disks + tape archive
    HDD: 250 TB
    Tape archive: 1.92 PB (STK tape drive, 60 GB x 32,000 cartridges)
  Total capacity: 2.86 PB

8 Statistics of the ES Node Usage
[Figure: node usage over time from 2002/07 to 2008/10, as a percentage of nodes. Legend: Running (S+L), Pre-run / post-run processing, Idle, Eco, Stopped. Right axis: number of jobs. Annotations: Development of MDPS, IPCC.]
Computing resource distribution based on job size:
  More than 256 nodes: 10.2%
  128-255 nodes: 15.9%
  64-127 nodes: 23.7%
  32-63 nodes: 21.1%
  16-31 nodes: 15.1%
  Less than 15 nodes: 13.9%
Courtesy of JAMSTEC

9 Contribution to the IPCC by ES
Global warming projections by the climate modeling team contributed to the Fourth Assessment Report of the IPCC (Intergovernmental Panel on Climate Change) (2007). The Nobel Peace Prize 2007.
Major outcomes:
  Highest resolution coupled model: very likely attribution for global warming (stronger confirmation)
  Super-high resolution global atmospheric model: projection of increased strength of typhoons and hurricanes (new finding)
  Earth system model: carbon cycle feedback causing additional warming (new finding)
Working groups:
  Working Group I (Physical Science Basis)
  Working Group II (IAV: Impact, Adaptation and Vulnerability)
  Working Group III (Mitigation)
Bali Roadmap (the Bali Climate Change Conference in 2007)

10 Seamless Simulation from Globe to City by MSSG
Multi-scale Simulator for the Geo-environment (MSSG): a coupled model with a non-hydrostatic atmospheric general circulation model, a land model, a non-hydrostatic/hydrostatic ocean model, and an ocean wave model.
  Global scale: Yin-Yang grids, grid size 11 km
  City scale: grid size 5 m; a 15-minute simulation required 4-5 hours on ES
Courtesy of Dr. Takahashi@JAMSTEC

11 What did ES bring us?
ES had a big impact on the HPC community, summed up in the phrase "Computenik shock", and triggered the building of more powerful supercomputers around the world.
  Simulation models in geoscience fields became far more sophisticated.
  First-principle simulations in geoscience became more realistic.
  The possibility of predicting various phenomena by computer simulation was demonstrated.
  The necessity of more powerful supercomputers was shown.
This led to a development project for a next-generation supercomputer system: the K computer.

12 Supercomputer K computer
The K computer (京 [keɪ]) is a distributed-memory parallel supercomputer developed by RIKEN in cooperation with Fujitsu Ltd.
The nickname comes from the target performance of the system, 10 petaFLOPS (10 quadrillion floating point operations per second). 京 is a Japanese numeral that stands for 10^16, or 10 peta.
The K took first place twice in the TOP500 supercomputer ranking list last year.
Project Leader: Dr. Tadashi Watanabe

13 Objectives and Design Targets (in 2006)
Objectives
  To develop the world's most advanced and high-performance supercomputer
  To develop and deploy its usage technologies, including application software
  as one of Japan's Key Technologies of National Importance.
Design Targets
  10 petaFLOPS in the LINPACK benchmark
  PetaFLOPS sustained performance in various real applications
  Low power consumption system
  Highly reliable system
The system should be a general-purpose machine and a production system for computational science.

14 Schedule of the Project and Its Location
Start of installation at RIKEN AICS, Kobe
End of installation
No.1 in the 38th TOP500 list (10.51 PFLOPS)
HPCC Awards
Gordon Bell Prize (Peak Performance)
Completion of system development
Open to academic and industry users
September 28, 2012: Operation started.

15 System Overview

16 Configuration of the K Computer
Compute nodes (CPUs): 82,944 (I/O nodes: 5,184)
Peak performance: 10.6 PFLOPS
Memory: 1.27 PB (16 GB/node)
Network: Tofu, a 6-dimensional mesh/torus network
  10 connections to adjacent compute nodes
  24 x 18 x (16+1) x 2 x 3 x 2
  Peak bandwidth: 5 GB/s x 2 for each connection
  Logical 3-dimensional torus network
Compute node
  CPU: SPARC64™ VIIIfx, 128 GFLOPS (8 cores)
  Each core: SIMD (4 FMA), 16 GFLOPS
  L2 cache: 6 MB
  Memory: 16 GB, connected at 64 GB/s
Courtesy of FUJITSU Ltd.
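The headline figures on this slide are consistent with one another, which a few lines of arithmetic can confirm. The sketch below is an illustration rather than material from the talk. It assumes that the Tofu dimensions cover compute and I/O nodes together, and that the quoted memory size uses binary units (GiB per node, PiB in total).

```python
import math

# Figures taken from the slide.
compute_nodes = 82_944
io_nodes = 5_184
gflops_per_node = 128
mem_gib_per_node = 16  # assumption: "16GB" is meant in binary units
tofu_dims = (24, 18, 16 + 1, 2, 3, 2)

# Assumption: the six Tofu dimensions span compute and I/O nodes together.
tofu_nodes = math.prod(tofu_dims)

# Peak performance of the compute nodes, in PFLOPS (1 PFLOPS = 1e6 GFLOPS).
peak_pflops = compute_nodes * gflops_per_node / 1e6

# Total memory of the compute nodes, in PiB (1 PiB = 1024**2 GiB).
memory_pib = compute_nodes * mem_gib_per_node / 1024**2

print(f"nodes addressed by Tofu: {tofu_nodes:,}")
print(f"compute + I/O nodes:     {compute_nodes + io_nodes:,}")
print(f"peak performance:        {peak_pflops:.2f} PFLOPS")
print(f"total memory:            {memory_pib:.2f} PiB")
```

Under these assumptions the product of the Tofu dimensions equals the sum of compute and I/O nodes, and the peak and memory totals round to the 10.6 PFLOPS and 1.27 PB shown on the slide.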

17 Fujitsu SPARC64™ VIIIfx Chip Overview
8 cores
  4 FMA (multiply and add) operation circuits in each core
  A SIMD instruction can execute 2 FMAs concurrently
  16 GFLOPS/core from 2 SIMD instructions x 2 FMAs at 2 GHz
  256 FP registers (double precision)
Hardware barrier for fast inter-core synchronization
Pre-fetch instruction for latency hiding
Shared 6 MB L2 cache (12-way)
  Software-controllable cache (sectored cache)
Performance: 128 GFLOPS/chip
Reference: SPARC64™ VIIIfx Extensions
Chip: 45 nm CMOS process, 2 GHz, 22.7 mm x 22.6 mm, 760 M transistors, 58 W (at 30°C with water cooling)
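The per-core and per-chip peak rates follow directly from the numbers on this slide. The sketch below spells out the multiplication; it assumes each FMA counts as two floating point operations (one multiply and one add), which is the usual convention but is not stated on the slide.

```python
# Peak-rate arithmetic for one SPARC64 VIIIfx core and chip, using slide figures.
clock_ghz = 2.0
simd_per_cycle = 2   # "2 SIMDs" on the slide
fmas_per_simd = 2    # a SIMD instruction executes 2 FMAs concurrently
flops_per_fma = 2    # assumption: one multiply plus one add
cores = 8
chip_power_w = 58.0

# GFLOPS per core = GHz x SIMD instructions x FMAs x flops per FMA.
core_gflops = clock_ghz * simd_per_cycle * fmas_per_simd * flops_per_fma
chip_gflops = core_gflops * cores

# Peak power efficiency of a single chip.
gflops_per_watt = chip_gflops / chip_power_w

print(f"per core: {core_gflops:.0f} GFLOPS")
print(f"per chip: {chip_gflops:.0f} GFLOPS")
print(f"efficiency: {gflops_per_watt:.2f} GFLOPS/W")
```

The result reproduces the 16 GFLOPS per core and 128 GFLOPS per chip quoted above; the efficiency figure is a derived value for the chip alone and does not include memory, interconnect, or cooling.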

18 Tofu (Torus Fusion) Interconnect
High communication performance and fault-tolerant network
Six-dimensional mesh/torus network
  Each node has 10 links (5 GB/s x 2 bandwidth):
    6 links for the XYZ link (X, Z: torus / Y: mesh)
    4 links for the node link within a basic unit (2x3x2 mesh/torus)
  Multi-path routing, combining the XYZ link and the 2x3x2 mesh/torus network, makes it possible to route around faulty nodes.
  Neighboring basic units are connected by 12 links.
[Figure: a basic unit (2x3x2 mesh/torus) with its X+/X-, Y+/Y-, and Z+/Z- connections]

19 System Environments
Linux-based operating system (OS) on compute nodes
Batch job-oriented system
  Interactive environments are available for debugging.
Two file systems, a local file system and a global file system, both provided by FEFS (Fujitsu Exabyte File System), which is based on the Lustre file system
  Users' permanent files are on the global file system.
Staging function
  Files on the global file system that are used in a job are staged into the local file system.
  Data generated during job execution are moved back to the global file system after the job finishes.

20 Batch Job Processing and Staging Function
1. Job submission: Users submit their jobs, with directions to the scheduler written in a job script (number of nodes, expected elapsed time, staging, etc.). The scheduler assigns each job to appropriate compute nodes according to the job description in the script.
2. File staging: Input files used in the job are copied from the global file system to the local file system.
3. Job execution: The job is executed on the assigned compute nodes.
4. File de-staging: Output files are moved back to the global file system.
Users can check the job status from job submission until job termination.
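The stage-in, execute, stage-out cycle described above can be imitated on a single machine with ordinary directories. The following is a toy sketch only: it is not the K computer scheduler or FEFS, and the names `run_staged_job`, `word_count_job`, and `demo` are hypothetical helpers invented for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged_job(global_dir, local_dir, inputs, job):
    """Copy inputs to local storage, run the job there, move outputs back."""
    local_dir.mkdir(parents=True, exist_ok=True)

    # File staging: copy input files from the "global" to the "local" directory.
    for name in inputs:
        shutil.copy2(global_dir / name, local_dir / name)

    # Job execution: the job only touches the local directory and
    # returns the names of the output files it produced.
    outputs = job(local_dir)

    # File de-staging: move output files back to the "global" directory.
    for name in outputs:
        shutil.move(str(local_dir / name), str(global_dir / name))
    return outputs


def word_count_job(workdir):
    """Hypothetical job: count the words in input.txt."""
    text = (workdir / "input.txt").read_text()
    (workdir / "output.txt").write_text(str(len(text.split())))
    return ["output.txt"]


def demo():
    """Run the toy job in temporary directories and report the result."""
    with tempfile.TemporaryDirectory() as tmp:
        global_dir = Path(tmp) / "global"
        local_dir = Path(tmp) / "local"
        global_dir.mkdir()
        (global_dir / "input.txt").write_text("earth simulator and k computer")
        run_staged_job(global_dir, local_dir, ["input.txt"], word_count_job)
        result = (global_dir / "output.txt").read_text()
        left_on_local = (local_dir / "output.txt").exists()
        return result, left_on_local


if __name__ == "__main__":
    print(demo())
```

The point of the design is that the job itself never reads or writes the global directory; only the staging steps do, which mirrors the separation between the local and global file systems on the slide.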

21 From Chip to Full System
Node (CPU, ICC, 16 GB memory): 128 GFLOPS, 16 GiB
System board (SB), 4 nodes: 512 GFLOPS, 64 GiB
Compute rack (24 SBs plus IOSBs): 12.3 (13.1) TFLOPS, 1.50 (1.59) TiB
4 compute racks + 1 disk rack: 49.2 (52.4) TFLOPS, 6.00 (6.38) TiB
Full system (864 compute racks plus disk racks): (11.3) PFLOPS, 1.27 (1.34) PiB
Water cooling module
Logical 3-D torus: 200,000 copper cables, about 1,000 km (620 mi.)
Courtesy of FUJITSU Ltd.

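22 I/O Configuration
[Figure: I/O paths of the K computer. Labels: compute node racks; local disk racks attached via Fibre Channel; InfiniBand 4X QDR and an InfiniBand switch to the global file system; OSS nodes (90); 30 PB~. Bandwidth labels: ~1.7 GB/s, ~720 GB/s, ~3.2 TB/s.]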

23 System Disks
Each compute node rack has storage for system files such as the Linux kernel and log files.
  Compute node racks in total: 864
  RAID units per rack: RAID5 (4D+1P) x 2 sets + 2 HS = 12 disks, 2 TB SATA
  Total number of disks: 10,368

24 Local File System (FEFS) Configuration
  MDS (Meta Data Server): 2
  MDT (Meta Data Target): 4
    RAID6 (6D+2P) x 4 sets + 2 HS = 34 disks, 2 TB SATA; 136 disks
  OSS (Object Storage Server): 648
  OST (Object Storage Target): 2,592
    RAID5 (4D+1P) x 4 sets + 2 HS = 22 disks, 300 GB SAS; 57,024 disks
  Total number of disks: 57,160

25 Global File System (FEFS) Configuration
  MDS (Meta Data Server): 2
  MDT (Meta Data Target): 24
    RAID1+0 (4D+4M) x 14 sets + 7 HS = 119 disks, 3.5 in. 2 TB SATA; 2,856 disks
  OSS (Object Storage Server): 90
  OST (Object Storage Target): 720
    RAID6 (6D+2P) x 4 sets + 2 HS = 34 disks, 2 TB SATA; 24,480 disks
  Total number of disks: 27,336
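The disk totals on the three storage slides (system disks, local file system, global file system) can be reproduced from the RAID layouts. The sketch below is a consistency check under one assumption: that each total is the per-unit disk count multiplied by the number of racks, MDTs, or OSTs listed on the slide. The helper `disks_per_unit` is a hypothetical name introduced here.

```python
def disks_per_unit(data, redundancy, sets, hot_spares):
    """Disks in one RAID unit: (data + parity or mirror) x sets + hot spares."""
    return (data + redundancy) * sets + hot_spares


# System disks: RAID5 (4D+1P) x 2 sets + 2 HS per rack, 864 racks.
system_total = disks_per_unit(4, 1, 2, 2) * 864

# Local file system: 4 MDTs on RAID6 (6D+2P), 2,592 OSTs on RAID5 (4D+1P).
local_mdt = disks_per_unit(6, 2, 4, 2) * 4
local_ost = disks_per_unit(4, 1, 4, 2) * 2592
local_total = local_mdt + local_ost

# Global file system: 24 MDTs on RAID1+0 (4D+4M), 720 OSTs on RAID6 (6D+2P).
global_mdt = disks_per_unit(4, 4, 14, 7) * 24
global_ost = disks_per_unit(6, 2, 4, 2) * 720
global_total = global_mdt + global_ost

print(f"system disks:       {system_total:,}")
print(f"local file system:  {local_mdt:,} + {local_ost:,} = {local_total:,}")
print(f"global file system: {global_mdt:,} + {global_ost:,} = {global_total:,}")
```

Under this reading, every total matches the corresponding number on the slides, which supports interpreting the counts as per-rack or per-target multipliers.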

26 Tsunami simulation of the Japan Earthquake on March 11, 2011
Computed area: 1192 x 768 x 500 km^3
Number of lattices: 2304 x 1536 x 2000
Maeda et al., Bull. Seism. Soc. Am., in press
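The scale of this run is easier to grasp from the lattice spacing and the total number of lattice points. The sketch below derives both from the slide figures; it assumes a uniform lattice, so that spacing is simply extent divided by the number of lattices along each axis, which the slide does not state.

```python
# Domain extent in km and number of lattices per axis, from the slide.
extent_km = (1192.0, 768.0, 500.0)
lattices = (2304, 1536, 2000)

# Assumption: uniform spacing along each axis.
spacing_km = tuple(e / n for e, n in zip(extent_km, lattices))

# Total number of lattice points in the computed volume.
total_points = lattices[0] * lattices[1] * lattices[2]

for axis, d in zip("xyz", spacing_km):
    print(f"spacing along {axis}: {d * 1000:.0f} m")
print(f"total lattice points: {total_points:,}")
```

With uniform spacing, the horizontal resolution is roughly 500 m and the vertical resolution 250 m, for about seven billion lattice points.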

27 Summary
HPC and Big Data are essential for sustainable human life in the future, and we have to promote HPC activities more and more.
Storage technology is one of the keys to realizing large-scale storage equipment.
Efforts to increase the reliability of storage devices are critical for future applications.

