Fujitsu's Approach to Application Centric Petascale Computing

Fujitsu's Approach to Application Centric Petascale Computing. 2nd Nov. 2010. Motoi Okuda, Fujitsu Ltd.

Agenda
- Japanese Next-Generation Supercomputer, K Computer
  - Project Overview
  - Design Targets
  - System Overview
  - Development Status
- Technologies for Application Centric Petascale Computing
  - CPU
  - VISIMPACT
  - Tofu Interconnect
- Conclusion

Japanese Next-Generation Supercomputer, K Computer: Project Overview / Design Targets / System Overview / Development Status

Project Schedule
Facility construction was completed in May 2010. System installation started in Oct. 2010. A partial system will begin test operation in April 2011, full system installation will be completed in the middle of 2012, and official operation will start by the end of 2012. (Courtesy of RIKEN)

K Computer
The target performance of the Next-Generation Supercomputer is 10 PFlops = 10^16 Flops = 1 京 (Kei) Flops; 京 also means the Gate. Hence 京 (Kei) computer = K computer. (Figure: full system installation, CG image.)

Applications of K computer (Courtesy of RIKEN)

TOP500 Performance Efficiency
Performance efficiency is defined as Rmax (LINPACK performance) / Rpeak (peak performance). (Chart: performance efficiency, 0% to 100%, plotted against TOP500 rank 1 to 500 for the November 2010 list, with GPGPU-based systems and Fujitsu's user sites highlighted.)

Design Targets - Toward Application Centric Petascale Computing -
- High performance
  - High peak performance
  - High efficiency / high sustained performance
  - High scalability
- Environmental efficiency
  - Low power consumption
  - Small footprint
- High productivity
  - Less burden on application implementation
  - High reliability and availability
  - Flexible and easy operation

K computer Specifications
- CPU (SPARC64 VIIIfx)
  - Cores/performance: 8 cores @ 2 GHz, 128 GFlops
  - Architecture: SPARC V9 + HPC extension
  - Cache: L1 (I/D) 32 KB / 32 KB, L2 6 MB (shared)
  - Power: 58 W (typ., 30°C)
  - Memory bandwidth: 64 GB/s
- Node
  - Configuration: 1 CPU / node
  - Performance: 128 GFlops
  - Memory: 16 GB (2 GB/core), 64 GB/s memory bandwidth
- System board (SB)
  - 4 nodes / SB, 512 GFlops, 64 GB memory
- Rack
  - 24 SBs / rack, 12.3 TFlops, 1.5 TB memory
- Interconnect (ICC: Interconnect Chip)
  - Topology: 6D Mesh/Torus
  - Performance: 5 GB/s for each link, 10 links / node
  - Additional features: H/W barrier, reduction
  - Architecture: routing-chip structure (no outside switch box)
- Cooling
  - CPU and ICC: direct water cooling
  - Other parts: air cooling
- System
  - Nodes: > 80,000 (800 racks, 80,000 CPUs, 640,000 cores)
  - Performance: LINPACK 10 PFlops
  - Memory: over 1 PB
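As a quick cross-check of these figures (a back-of-envelope calculation added here, not on the slide, and assuming 8 floating-point operations per core per cycle from the two-wide SIMD fused multiply-add units):

\[
2\ \text{GHz} \times 8\ \text{cores} \times 8\ \text{Flops/cycle} = 128\ \text{GFlops per CPU},
\]
\[
128\ \text{GFlops} \times 80{,}000\ \text{nodes} \approx 10.24\ \text{PFlops}, \qquad
16\ \text{GB} \times 80{,}000\ \text{nodes} \approx 1.28\ \text{PB of memory.}
\]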

Kobe Facilities
(Figure: Kobe site ground plan and aerial photo showing the research building, the computer building (~65 m x ~65 m), the power supply and cooling unit building, and the electric power supply. Courtesy of RIKEN)

Kobe Facilities (cont.)
(Figure: exterior of the buildings - power supply and cooling unit building, computer building, research building. Courtesy of RIKEN)

Kobe Facilities (cont.)
(Figure: seismic isolation structure. Courtesy of RIKEN)

Kobe Facilities (cont.)
(Figure: air handling units on the 2nd floor of the computer building; cooling towers and centrifugal chillers in the power supply and cooling unit building. Courtesy of RIKEN)

Kobe Facilities (cont.)
On Oct. 1st, the first 8 racks were installed at RIKEN's Kobe site. (Courtesy of RIKEN)

Technologies for Application Centric Petascale Computing: CPU / VISIMPACT / Interconnect

Technologies for Application Centric Petascale Computing
- Single CPU per node configuration: high memory bandwidth and a simple memory hierarchy
- HPC-ACE: SPARC V9 architecture enhancement for HPC
  - SIMD, register enhancements, software-controllable cache, compiler optimization
- VISIMPACT: highly efficient threading between cores for hybrid program execution (MPI + threading)
  - Hardware barrier between cores, shared L2$, compiler optimization
- New interconnect "Tofu": high-speed, highly scalable, highly operable and highly available interconnect for a system of over 100,000 nodes
  - 6-dimensional Mesh/Torus topology, functional interconnect, compiler, MPI library and OS optimization
- Open Petascale Libraries Network: math libraries optimized for hybrid program execution
- Water cooling technologies: high reliability, low power consumption and compact packaging
(Diagram: CPU (SPARC64 VIIIfx) plus ICC form a node (1 CPU + ICC); nodes form a system board (SB); SBs form a rack; racks form the system.)

SPARC64™ VIIIfx Processor
- Extended SPARC64™ VII architecture for HPC
  - HPC extension: HPC-ACE
  - 8 cores with 6 MB shared L2 cache
  - SIMD extension
  - 256 floating-point registers per core
  - Application access to cache management
  - Inter-core hardware synchronization (barrier) for highly efficient threading between cores
- High performance per watt
  - 2 GHz clock, 128 GFlops
  - 58 W peak as the design target
  - Water cooling: low current leakage of the CPU, hence low power consumption and a low failure rate; direct water cooling of the system board
- Highly reliable design
  - SPARC64™ VIIIfx integrates specific logic circuits to detect and correct errors
(Chart: history of peak performance and power (W) across the SPARC64 family V, V+, VI, VII, VIIIfx, normalized to SPARC64 V = 1.0; for VIIIfx, FP performance is roughly x3 and power roughly x0.5.)
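From the design-target figures above, the implied power efficiency of one CPU is roughly (an illustrative calculation, not stated on the slide):

\[
\frac{128\ \text{GFlops}}{58\ \text{W}} \approx 2.2\ \text{GFlops/W}.
\]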

SIMD Extension (1)
Performance improvement on the Fujitsu test code set*. We expect further performance improvement from compiler optimization. (Chart: speed-up, on a scale of 0.0 to 2.0, from the SIMD extension on one core of SPARC64™ VIIIfx for each test code.)
*: Fujitsu internal benchmark set consisting of 138 real application kernels.

SIMD Extension (2)
Performance improvement on NPB (class C) and HIMENO-BMT*. We expect further NPB performance improvement from compiler optimization. (Chart: performance with and without SIMD on one core of SPARC64™ VIIIfx for the NPB kernels bt, cg, ep, ft, is, lu, mg, sp and for HIMENO-BMT.)
*: HIMENO-BMT is a benchmark that measures the speed of the major loops of a Poisson-equation solver using the Jacobi iteration method; grid size M was used in this measurement.
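For orientation, the kind of loop that benefits from the two-wide HPC-ACE SIMD extension is an independent, unit-stride floating-point kernel such as the sketch below (a generic, hypothetical example; the actual 138 benchmark kernels are Fujitsu-internal and not shown in the talk):

subroutine triad(n, a, x, y, z)
  ! z(i) = x(i) + a*y(i): iterations are independent and unit-stride,
  ! so the compiler can emit paired (2-wide) SIMD multiply-add
  ! instructions and software-pipeline the loop.
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: a, x(n), y(n)
  real(8), intent(out) :: z(n)
  integer :: i
  do i = 1, n
    z(i) = x(i) + a * y(i)
  end do
end subroutine triad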

Floating Point Registers Extension (1)
Performance improvement on the Fujitsu test code set*. The number of floating-point registers grows from 32 to 256 per core. (Chart: speed-up, on a scale of 0.0 to 3.0, from the register extension (32 to 256) on one core of SPARC64™ VIIIfx for each test code.)
*: Fujitsu internal benchmark set consisting of 138 real application kernels.

Floating Point Registers Extension (2)
Performance improvement on NPB (class C) and HIMENO-BMT*. We expect further NPB performance improvement from compiler optimization. (Chart: performance with 32 vs. 256 FP registers on one core of SPARC64™ VIIIfx for the NPB kernels bt, cg, ep, ft, is, lu, mg, sp and for HIMENO-BMT.)
*: HIMENO-BMT is a benchmark that measures the speed of the major loops of a Poisson-equation solver using the Jacobi iteration method; grid size M was used in this measurement.

Application Access to Cache Management
The 6 MB L2 cache can be divided into two sectors: sector 0 acts as a normal cache, while sector 1 acts as a pseudo local memory. The programmer can mark reusable data through compiler directives:

c---------------------------
!ocl cache_sector_size (3, 9)
      do iter = 1, itmax
        call sub(a, b, c, s, n, m)
      enddo
c---------------------------
      subroutine sub(a, b, c, s, n, m)
      real*8 a(n), b(m), s
      integer*4 c(n)
!ocl cache_subsector_assign (b)
      do i = 1, n
        a(i) = a(i) + s * b(c(i))
      enddo
      end

(The compiler listing reports that the inner loop is parallelized, SIMD-ized and software-pipelined. Chart: execution time, dominated by FP load memory wait, on one chip of SPARC64™ VIIIfx, comparing the normal cache with the user-accessed cache.)

VISIMPACT (Integrated Multi-core Parallel ArChiTecture)
- Concept
  - Hybrid execution model (MPI + threading between cores)
  - Can improve parallel efficiency and reduce memory impact
  - Can reduce the burden of program implementation on multi- and many-core CPUs
- Technologies
  - Hardware barriers between cores, shared L2$ and an automatic parallelizing compiler
  - Highly efficient threading: VISIMPACT
- Programming model
  - Flat MPI, or a hybrid execution model on a multi-/many-core CPU system (MPI between nodes/cores + threading between cores, e.g. OpenMP or automatic parallelization)
(Chart: comparison of OpenMP micro-benchmark performance, in μs, for PARALLEL DO, DO, Barrier and Reduction between K (SPARC64™ VIIIfx, 2.0 GHz, 8 threads) and BX900 (Nehalem, 2.93 GHz, 4 threads and 8 threads).)
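To make the hybrid model concrete, the sketch below shows the generic MPI-between-processes plus OpenMP-threading-within-a-process pattern that VISIMPACT targets (a hypothetical, self-contained example using only standard MPI and OpenMP; it is not taken from the talk and uses no Fujitsu-specific features):

program hybrid_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1000000
  integer :: ierr, rank, nprocs, i
  real(8) :: local_sum, global_sum
  real(8), allocatable :: x(:)

  call MPI_Init(ierr)                                ! one MPI process per node (or per CPU)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  allocate(x(n))
  x = 1.0d0 / real(nprocs, 8)

  local_sum = 0.0d0
!$omp parallel do reduction(+:local_sum)
  do i = 1, n                                        ! threading across the cores of one CPU
    local_sum = local_sum + x(i)
  end do
!$omp end parallel do

  call MPI_Allreduce(local_sum, global_sum, 1, MPI_REAL8, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)  ! communication between MPI processes
  if (rank == 0) print *, 'global sum =', global_sum
  deallocate(x)
  call MPI_Finalize(ierr)
end program hybrid_sketch

Compiled with an MPI wrapper and OpenMP enabled, each process runs the loop with its threads sharing memory, and MPI_Allreduce combines the per-process results.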

Interconnect for Petascale Computing
- Crossbar: performance Best; operability and availability Best; cost and power consumption Weak; topology uniformity Best; scalability: hundreds of nodes (Weak); example: vector parallel systems
- Fat-Tree / multi-stage: performance Good; operability and availability Good; cost and power consumption Average; topology uniformity Average; scalability: thousands of nodes (Average-Good); example: x86 clusters
- Mesh / Torus: performance Average; operability and availability Weak; cost and power consumption Good; topology uniformity Good; scalability: >10,000 nodes (Best); example: scalar massively parallel systems
Which type of topology can scale up to over 100,000 nodes? Improving the performance, operability and availability of the mesh/torus topology is our challenge.

New Interconnect (1): Tofu Interconnect
- Design targets
  - Scalability toward 100K nodes
  - High operability and usability
  - High performance
- Topology
  - User/application view: logical 3D torus (X, Y, Z)
  - Physical topology: 6D Torus/Mesh, addressed by (x, y, z, a, b, c)
  - 10 links per node: 6 links for the 3D torus and 4 redundant links (3: torus)
(Figure: a node group unit of 12 nodes (2 x 3 x 2) and its 3D connections to neighbouring groups in the x+/x-, y+/y-, z+/z- directions.)
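Because applications see a logical 3D torus, a typical way to exploit it from MPI is a periodic 3D Cartesian communicator, as in the generic sketch below (standard MPI only; the mapping from the logical torus onto the physical 6D Tofu coordinates is handled by the system software and is not shown):

program torus_view
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, cart_comm
  integer :: dims(3), coords(3), xminus, xplus
  logical :: periods(3), reorder

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  dims = 0                                      ! let MPI factorize nprocs over 3 dimensions
  call MPI_Dims_create(nprocs, 3, dims, ierr)
  periods = .true.                              ! periodic in X, Y and Z: a logical 3D torus
  reorder = .true.                              ! allow rank reordering to match the machine
  call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, reorder, cart_comm, ierr)

  call MPI_Comm_rank(cart_comm, rank, ierr)
  call MPI_Cart_coords(cart_comm, rank, 3, coords, ierr)    ! this rank's (X, Y, Z)
  call MPI_Cart_shift(cart_comm, 0, 1, xminus, xplus, ierr) ! neighbours along X

  if (rank == 0) print *, 'torus dimensions =', dims
  call MPI_Finalize(ierr)
end program torus_view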

New Interconnect (2): Tofu Interconnect
- Technology
  - Fast node-to-node communication: 5 GB/s x 2 (bi-directional) per link, 100 GB/s throughput per node
  - Integrated MPI support for collective operations and a global hardware barrier
  - Switch-less implementation
(Figure: conceptual model of a node with 10 links, each 5 GB/s x 2, giving 100 GB/s aggregate throughput per node. See also IEEE Computer, Nov. 2009.)

Why 6 Dimensions? High Performance and Operability
- Low hop count (the average hop count is about half that of a conventional 3D torus)
- The 3D torus/mesh view is always provided to an application, even when the machine is divided into partitions of arbitrary size
- No interference between jobs
- Fault tolerance
  - 12 possible alternate paths are used to bypass faulty nodes
  - A redundant node can be assigned while preserving the torus topology
(Figure: in a conventional torus, a node failure forces a high hop count and a mesh, so the torus topology cannot be maintained for the affected job; with the Tofu interconnect, jobs keep a low hop count and a torus topology, with job isolation, even around a failed node.)
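The claim of roughly half the hop count can be made plausible with a rough estimate (my own back-of-envelope, not from the talk, assuming the (a, b, c) dimensions form the fixed 2 x 3 x 2 node group): for the same total node count N, the X, Y, Z extents shrink by a factor of 12^(1/3) ≈ 2.3, while the small a, b, c rings add only about 1.7 hops on average:

\[
\bar h_{\mathrm{3D}} \approx \frac{3k}{4}, \qquad
\bar h_{\mathrm{Tofu}} \approx \frac{3}{4}\cdot\frac{k}{\sqrt[3]{12}} + \left(\tfrac{1}{2}+\tfrac{2}{3}+\tfrac{1}{2}\right)
\approx 0.44\,\bar h_{\mathrm{3D}} + 1.7,
\]

where k = N^{1/3} is the per-dimension extent of a conventional 3D torus holding the same N nodes.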

Open Petascale Libraries Network
How do we reduce the burden of application implementation on multi-/many-core systems, i.e. the burden of two-stage parallelization?
- A collaborative R&D project for mathematical libraries has just started
- Target system: multi-core-CPU-based MPP-type system with a hybrid execution model (MPI + threading by OpenMP/automatic parallelization)
- Cooperation and collaboration with the computer science, application and computational engineering communities on a global basis, coordinated by FLE*
- Open-source implementation; sharing of information and software; the results of this activity will be open to the HPC community
(Diagram: the Open Petascale Libraries Network connects Fujitsu Laboratories Europe, Fujitsu Japan, national labs, universities, ISVs and international research projects around libraries for the hybrid programming model (MPI + threading) on multi-core CPUs and the interconnect.)
*: Fujitsu Laboratories Europe, located in London.

Conclusion

Toward Application Centric Petascale Computing
- Installation of RIKEN's K computer has started, and the system targets high performance, environmental efficiency and high productivity
- Leading-edge technologies are applied to the K computer: a new CPU, an innovative interconnect, advanced packaging and the Open Petascale Libraries
- These technologies will be enhanced and applied to Fujitsu's future commercial supercomputers
