China's HPC development: a brief review and perspectives
1 China's HPC development: a brief review and perspectives Depei Qian Beihang University/Sun Yat-sen University International Symposium on Impact of extreme scale computing Tokyo, Japan Nov. 2, 2017
2 Outline
- A brief review
- The new key HPC project in China
- Issues in exascale system development
3 A Brief review
4 Three 863 key projects on HPC
- High Performance Computer and Core Software: research on resource sharing and collaborative work; grid-enabled applications in multiple areas; TFlops computers and the China National Grid (CNGrid) testbed
- High Productivity Computer and Grid Service Environment: high productivity (application performance, efficiency in program development, portability of programs, robustness of the system); emphasizing service features of the HPC environment; developing peta-scale computers
- High Productivity Computer and Application Service Environment: developing 100PF computers; developing large-scale HPC applications; upgrading of CNGrid
5 High performance computers
- 2013: Tianhe-2. CPU+MIC heterogeneous accelerated architecture; 54.9 PF peak, 33.9 PF Linpack; No. 1 in Top500 six times from 2013 to 2015; installed at the National Supercomputing Center in Guangzhou; will be upgraded to 100PF this year
- 2016: Sunway TaihuLight. Implemented with home-grown Shenwei many-core processors, 10 million cores in total; 125 PF peak, 93 PF Linpack; No. 1 in Top500 in June and Nov. of 2016; installed at the National Supercomputing Center in Wuxi
[Photos: Tianhe-2, Sunway Bluelight]
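As a back-of-envelope check, the Linpack efficiencies of the two systems follow directly from the peak and Linpack figures quoted on this slide (a minimal sketch, using only those numbers):

```python
# Linpack efficiency = sustained Linpack performance / peak performance,
# using the Pflops figures quoted on the slide.
tianhe2 = {"peak_pf": 54.9, "linpack_pf": 33.9}
taihulight = {"peak_pf": 125.0, "linpack_pf": 93.0}

def linpack_efficiency(system):
    return system["linpack_pf"] / system["peak_pf"]

print(f"Tianhe-2:   {linpack_efficiency(tianhe2):.1%}")    # ~61.7%
print(f"TaihuLight: {linpack_efficiency(taihulight):.1%}") # 74.4%
```

The gap illustrates why the talk later sets an explicit >60% Linpack-efficiency target for the exascale machine.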
6 Tianhe-2 upgrade (Tianhe-2 → Tianhe-2A)
- Nodes & performance: nodes with Intel CPU + KNC, 54.9 Pflops → nodes with Intel CPU + Matrix-2000, Pflops
- Interconnection: 10 Gbps, 1.57 us → 14 Gbps, 1 us
- Memory: 1.4 PB → 3.4 PB
- Storage: 12.4 PB, 512 GB/s → 19 PB, 1 TB/s
- Energy efficiency: 17.8 MW, 1.9 Gflops/W → about 18 MW, >5 Gflops/W
- Heterogeneous software: MPSS for Intel KNC → OpenMP/OpenCL for Matrix-2000
7 Matrix-2000 accelerator
Chip specification:
- 4 super-nodes (SN), 8 clusters per SN, 4 cores per cluster
- Self-defined 256-bit vector ISA, 16 DP flops/cycle per core
- Peak performance: 4 SNs x 8 clusters x 4 cores x 16 flops x 1.2 GHz = 2.4576 Tflops
- Peak power dissipation: ~240 W
- Interface: 8 DDR4 channels, x16 PCIe 3.0 EP port
[Figure: on-chip interconnection of the four super-nodes, each containing 8 clusters of 4 cores, with PCIe and four DDR4 interfaces]
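The peak-performance formula on this slide can be verified directly (a sketch using only the numbers given here):

```python
# Matrix-2000 peak: 4 super-nodes x 8 clusters x 4 cores,
# 16 DP flops/cycle per core, at 1.2 GHz.
supernodes, clusters_per_sn, cores_per_cluster = 4, 8, 4
flops_per_cycle, clock_ghz = 16, 1.2

total_cores = supernodes * clusters_per_sn * cores_per_cluster  # 128 cores
peak_gflops = total_cores * flops_per_cycle * clock_ghz         # 2457.6 Gflops
print(f"{total_cores} cores, peak = {peak_gflops / 1000:.4f} Tflops")  # 2.4576 Tflops
```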
8 Compute nodes
Heterogeneous compute node:
- Intel Xeon CPU x2 + Matrix-2000 x2
- Memory: 192 GB
- Interconnection: 14G proprietary network
- Peak performance: 5.34 Tflops
[Figure: node block diagram with CPUs, MT-2000 accelerators, DDR4 memory, 16x PCIe links, NIC, and GbE LAN/management ports]
9 HPC environment 2016: China National Grid, composed of 17 national supercomputing centers and HPC centers, providing world-leading computing resources
10 HPC applications 2016
- HPC applications in many domains; 10-million-core parallelism reached, Gordon Bell Prize in 2016
- Developed a number of application software packages adopted in production: aircraft design, high-speed train design, oil & gas exploration, new drug discovery, ensemble weather forecasting, bio-informatics, car development, design optimization of large fluid machinery, electromagnetic computation
11 Problems identified
- Lack of a long-term national program for high performance computing
- Weak in core HPC technologies: processors/accelerators; novel devices (new memory, storage, and network); implementation of large-scale parallel algorithms and programs
- Application software is the bottleneck: applications rely on imported commercial software, which is expensive, offers small-scale parallelism, and is restricted by export regulation
- Shortage of cross-disciplinary talent: not enough people with both domain and IT knowledge; lack of multi-disciplinary collaboration
12 The new key HPC project in China
13 Reform of research system in China
The national research and development system is undergoing a reform: 100+ different national R&D programs/initiatives are merged into 5 tracks of national programs:
- Basic research program (NSFC)
- Mega-science and technology programs
- Key R&D program (former 863, 973, enabling programs)
- Enterprise innovation program
- Facility/talent program
14 A new key project on HPC
- High performance computing has been identified as a priority subject under the key R&D program (track 3)
- Strategic studies and planning have been conducted since 2013
- A proposal on HPC in the 13th five-year plan was submitted in early 2015
- The key R&D project was approved in Oct. by a multi-government agency committee led by the MOST
15 Motivations
The key value of exascale computers identified:
- Addressing the grand challenge problems: energy shortage, pollution, climate change
- Enabling industry transformation: supporting development of important products (high-speed train, commercial aircraft, automobile); promoting economic transformation
- For social development and people's benefit: new drug discovery, precision medicine, digital media
- Enabling scientific discovery: high energy physics, computational chemistry, new materials, astrophysics
- Promoting the computer industry through technology transfer
- Developing HPC systems with self-controllable technologies: a lesson learnt from the recent embargo regulation
16 Major tasks
- Exa-scale computer development: R&D on novel architectures and key technologies of the exa-scale computer; developing the exa-scale computer based on home-grown processors; technology transfer to promote development of high-end servers
- HPC applications development: basic research on exa-scale modeling methods and parallel algorithms; developing high performance application software; establishing the HPC application eco-system
- HPC environment development: developing software and platforms for the national HPC environment; upgrading the national HPC environment CNGrid; developing service systems on the national HPC environment
Each task will cover basic research, key technology development, and application demonstration
17 Task 1: Exa-scale computer development (basic research)
- Novel high performance interconnect: theoretical work on the novel interconnect based on the enabling technologies of 3D chips, silicon photonics and on-chip networks
- Programming & execution models for exa-scale systems: new programming models for heterogeneous systems; improving programming efficiency
18 Task 1: Exa-scale computer development (key technology)
- Prototype systems for verifying the exa-scale system technologies; 3 typical applications to verify the design
- Exa-scale computer technologies: architecture optimized for multiple objectives; highly efficient compute nodes; high performance processor/accelerator design; exa-scale system software; scalable interconnect; parallel I/O; exa-scale infrastructure; energy efficiency; exa-scale system reliability
19 Task 1: Exa-scale computer development
Exa-scale computer targets:
- exaflops-level peak performance, Linpack efficiency >60%
- 10 PB memory, EB-level storage
- 30 GF/W energy efficiency
- interconnect >500 Gbps
- large-scale system management and resource scheduling
- easy-to-use parallel programming environment
- system monitoring and fault tolerance
- support for large-scale applications
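The targets above can be cross-checked against each other: at 30 Gflops/W, a machine with an exaflops peak draws roughly 33 MW (a sketch; the 1-exaflops peak is the conventional definition of exascale rather than a figure stated on this slide):

```python
peak_flops = 1e18        # 1 exaflops peak (assumed definition of exascale)
gflops_per_watt = 30e9   # 30 GF/W energy-efficiency target from the slide
linpack_eff = 0.60       # >60% Linpack efficiency target from the slide

power_mw = peak_flops / gflops_per_watt / 1e6
sustained_eflops = peak_flops * linpack_eff / 1e18
print(f"Power at peak: {power_mw:.1f} MW")              # ~33.3 MW
print(f"Sustained Linpack floor: {sustained_eflops:.2f} Eflops")
```

This is in line with the 20-40 MW power envelopes commonly discussed for first-generation exascale systems.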
20 Task 2: HPC application development
- Basic research: computable modeling and computational methods for exa-scale systems; scalable, highly efficient parallel algorithms and parallel libraries for exa-scale systems
- Key technology: programming framework for exa-scale software development
21 Task 2: HPC application development (application software)
- Numerical devices: numerical nuclear reactor, numerical aircraft, numerical earth system, numerical engine
- High performance domain application software for complex engineering projects and critical equipment: numerical simulation of the ocean, design of energy-efficient large fluid machinery, drug discovery, electromagnetic environment simulation, ship design, oil exploration, digital media rendering
- High performance application software for research: material science, high energy physics, astrophysics, life science
22 Task 2: HPC application development
HPC application software development:
- establishing a national-level R&D center for HPC application software
- building a platform for HPC software development and optimization, with tools for performance/energy efficiency and pre-/post-processing
- building a software resource repository
- developing typical domain application software
- a joint effort involving national supercomputing centers, universities, and institutes
23 Task 3: HPC environment development
- Basic research: models and architecture for computational services; virtual data space
- Key technology: mechanisms and platform for the national HPC environment, providing technical support for service-mode operation; upgrading the national HPC environment (CNGrid)
24 Task 3: HPC environment development (services)
- Integrated business platform, e.g. complex product design, HPC-enabled EDA platform
- Application villages: innovation and optimization of industrial products, drug discovery, SME computing and simulation platform
- Platform for HPC education: providing computing resources and services to undergraduate and graduate students
25 Projects supported
- The first call for proposals was issued in Feb., projects supported
- The second call was issued in Oct. 2016; 18 projects supported, mainly application software
- The third round of calls was issued in Oct. 2017; the review process will begin soon.
26 Sugon exa-prototype: specification (prototype vs. exascale, with scaling ratio)
- Computing: node peak (TF); number of nodes; number of silicon-units; system peak (PF)
- Storage: memory (PB); storage (PB)
- Network: silicon-switch; global net dimensions 2*1*3 vs. 8*8*6 (ratio 4*8*2); local net dimensions 2*3*2 vs. 2*3*2 (ratio 1)
- Power: power consumption; energy efficiency (GF/W)
- Size W*D*H (m): 6*6*6 vs. 24*24*6 (ratio 16); total cabinets
27 Sugon exa-prototype: general design
- Computing sub-system: home-grown x86 processor + DCU accelerator in 2019; CPU > 1 TF, DCU > 15 TF
- Network sub-system: 400 Gbps 6D-torus, 384 routers
- Storage sub-system: distributed storage architecture, extensible to EB
- Infrastructure sub-system: immersive phase-change cooling; high voltage DC power supply; hierarchical 3D assembly
- Software sub-system: mature and complete libraries and programming tools; light-weight virtualization and software-defined architecture
28 Sugon exa-prototype: hierarchical 3D structure
Levels (with nodes per unit, units in the prototype, and units in the exascale machine): node pair, super node, silicon block, silicon cubic
29 Sugon exa-prototype: computing node
- Node: 2 CPUs and 2 DCUs; CPU and DCU interconnected by the GOP high-speed bus
- Memory: 128 GB DDR4-2667
- Interconnect: 200 Gbps fast fabric (2x200G NIC)
[Figure: board block diagram with CPUs and DCUs linked by 16x GOP lanes, DDR4 DIMMs, 16x PCIe, SATA/M.2 storage, and a midplane connection]
30 Tianhe exa-prototype: flexible architecture
- Reconfigurable flexible architecture to meet the requirements of different applications
- Virtualized OS, providing a configurable computing environment
- Software-defined interconnect, guaranteeing bandwidth and fault isolation
- Hierarchical storage QoS guarantee technology, providing stable and independent storage bandwidth
- Dynamic optimization, providing architecture-aware optimization across application, compiler, runtime and OS, spanning the computing nodes, the computing sub-system and the IO/storage sub-system
31 Tianhe exa-prototype: technical route
Trading off performance, energy efficiency and ease of use across the spectrum from special purpose accelerators to customized many-core: general purpose many-core is adopted by the prototype
32 Tianhe exa-prototype: technical features
- Flexible architecture to meet the requirements of different applications
- New generation many-core processor, pursuing balanced computing and memory access
- Optoelectronic integrated high speed interconnect, greatly improving performance and energy efficiency
- Fault tolerance based on new storage media
- Accurate heat dissipation, trading off manufacturing cost against operational cost
33 Tianhe exa-prototype: interconnect
- High-radix router for low power consumption, low cost and high density
- Exascale communication need: single node > 400 Gbps
- Chip power budget <200 W, at most 12 ports of 400 Gbps
- Co-design of ultra-short-distance SerDes PHY, PHY coding, and link layer
- Optoelectronic integration for the interconnect
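The port limit on this slide corresponds to an energy-per-bit budget: with under 200 W per chip and 12 ports of 400 Gbps, the router can spend at most about 42 pJ per transmitted bit (a sketch derived from the slide's numbers):

```python
power_w = 200.0        # router chip power budget from the slide
ports, gbps = 12, 400  # at most 12 ports of 400 Gbps

aggregate_bps = ports * gbps * 1e9           # 4.8 Tb/s in one direction
pj_per_bit = power_w / aggregate_bps * 1e12  # energy budget per bit
print(f"Aggregate: {aggregate_bps / 1e12:.1f} Tb/s, budget: {pj_per_bit:.1f} pJ/bit")
```

Tightening this pJ/bit figure is exactly what the co-designed SerDes PHY and optoelectronic integration mentioned above aim at.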
34 Sunway exa-prototype: hardware system
- System composed of computing, interconnect, storage, power supply and cooling
- New generation many-core based system, 512 nodes, performance >4 PFlops
- Self-developed network chip, fat-tree interconnect, point-to-point bandwidth > 200 Gbps
- Storage subsystem based on Shenwei storage servers
- Self-developed high voltage (300 V) DC power supply
- Highly efficient water cooling with enhanced heat-transfer copper cold plates
[Figures: two-level fat-tree interconnect, DC power supply system, new-generation many-core processor, cold-plate assembled nodes, compute cabinet, water cooling unit]
35 Sunway exa-prototype: computing node
- Peak performance >8 TFlops, memory >64 GB
- Connection to the interconnect: 2 x 25 Gbps x 4; point-to-point one-way bandwidth 200 Gbps
[Figure: node block diagram with four core groups, DDR3/DDR4 memory channels, network interfaces to the high-speed compute network and the Ethernet management network, BMC, clock/processor/power management, and node monitoring]
36 Sunway exa-prototype: software system
Basic software for the home-grown many-core processor: parallel OS, high performance storage management system, parallel compiler, parallel program development environment
- Highly efficient compiler for heterogeneous many-core, with SIMD auto-vectorization
- High performance basic math libraries
- Integrated multi-domain OS for heterogeneous many-core
- Dynamic storage management
- Support for MPI-1/2/3 and OpenMP 3.0, compatible with OpenACC 2.0
- Debugger for heterogeneous many-core
37 Sunway exa-prototype: demo applications
Applications are being ported to TaihuLight and performance optimization is being conducted: floating platform design, seismic processing, aircraft design, ocean modeling
38 Sunway exa-prototype: applications
10-million-core applications on TaihuLight
2016:
- Fully Implicit Solver for Atmospheric Dynamics
- Surface Wave Modeling
- Phase Field Simulations of Coarsening Dynamics
- Atomistic Simulation of Silicon Nanowires
- Runaway Electron Trajectory Simulation
- Genome Functional Annotation and Homeotic Gene Building
- Spacecraft CFD Numerical Simulation
2017:
- Extreme-scale Graph Processing Framework
- Simulation of Planetary Rings
- Simulations of Quantum Spin Liquid States via PEPS++
- Molecular Dynamics Simulation of Condensed Covalent Materials
- Cryo-EM Macromolecule Structure Determination
- Redesigning CAM-SE
- Nonlinear Earthquake Simulation
39 Issues in exascale system development
40 Major challenges for exa-scale systems
- Power consumption
- Performance obtained by applications
- Programmability
- Resilience
How to make tradeoffs between performance, power consumption, and programmability? How to achieve continuous non-stop operation? How to adapt to a wide range of applications with reasonable efficiency?
41 Architecture
- Novel architectures beyond the current heterogeneous accelerated/many-core-based ones are expected
- Co-processor or partitioned heterogeneous architecture? Low utilization of the co-processor in some applications (falling back to CPU only); bottleneck in moving data between CPU and co-processor
- Application-aware architecture: on-chip integration of special purpose units (an idea from Prof. Andrew Chien), using the right tool to do the right things; dynamically reconfigurable? how to program it?
42 Memory system
Pursuing large capacity, low latency and high bandwidth:
- Increase capacity and lower power consumption by using DRAM and NVM together; data placement becomes an issue
- Improve bandwidth and latency by using 3D stacking technology
- Reduce data movement by placing data closer to processing: HBM/HMC near the processor, on-chip DRAM, simple functions in memory
- Reduce data copy cost by using a unified memory space in heterogeneous architectures
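The data-placement issue can be illustrated with a simple bandwidth model: streaming the same working set from different tiers differs by orders of magnitude in time (a sketch; the per-tier bandwidths are illustrative assumptions, not figures from this talk):

```python
# Illustrative stream bandwidths per memory tier (GB/s, assumed values).
tiers = {"HBM near processor": 900, "DDR4 DRAM": 100, "NVM": 10}
working_set_gb = 64  # stream a 64 GB working set once

for name, bw_gbs in tiers.items():
    time_ms = working_set_gb / bw_gbs * 1000
    print(f"{name:20s}: {time_ms:7.1f} ms")
```

The spread is what makes placement policy (and hardware like 3D-stacked HBM) decisive for application performance, not just capacity.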
43 Interconnect
Pursuing low latency, high bandwidth and low energy consumption:
- Adopt new technologies: silicon photonics communication between components; optical interconnect/communication; miniature optical devices
- High scalability, meeting exascale interconnect requirements: connecting 10,000+ nodes; low-hop, low-latency topology; reliable and intelligent routing
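Connecting 10,000+ nodes with few hops is largely a question of switch radix; for example, a full three-level fat tree built from radix-r switches reaches r³/4 endpoints (a standard result, not a figure from the talk):

```python
def fat_tree_nodes(radix):
    # A full 3-level fat tree of radix-r switches supports r**3 / 4 end nodes.
    return radix ** 3 // 4

for r in (24, 36, 40, 48):
    print(f"radix {r:2d}: {fat_tree_nodes(r):6d} nodes")
```

Radix 40 already yields 16,000 endpoints, which is one reason high-radix routers (as in the Tianhe prototype above) dominate exascale interconnect designs.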
44 Programming heterogeneous systems
- Addressing the issues in programming heterogeneous parallel systems: efficient expression of parallelism, dependence, data sharing and execution semantics; problem decomposition appropriate for heterogeneous systems
- Improving programming by means of a holistic approach: new programming models; programming language extensions and compilers; parallel debugging; runtime support and optimization; architectural support
45 Computational models and algorithms
Full-chain innovation: mathematical methods, computer algorithms, algorithm implementation and optimization
- A good mathematical method is often more effective than hardware improvement and algorithm optimization
- Architecture-aware algorithm implementation and optimization is necessary for heterogeneous systems
- Domain-specific libraries for improving software productivity and performance
46 Resilience
Resilience is one of the key issues for exa-scale systems:
- Large scale of the system: 50K to 100K nodes, a huge number of components, very short MTBF
- Long non-stop operation required for solving large-scale problems
- Reliability measures required at different levels, including device, node, and system levels
- Software/hardware coordination is necessary: fast context saving and recovery for checkpointing in case of short MTBF; fault tolerance at the algorithm and application software level
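For the checkpointing mentioned above, the classic Young/Daly approximation gives the interval that balances checkpoint overhead against lost work: τ ≈ √(2·δ·M), where δ is the checkpoint cost and M the system MTBF (a standard result, shown here with illustrative numbers, not figures from the talk):

```python
import math

def optimal_checkpoint_interval(checkpoint_s, mtbf_s):
    # Young's approximation: interval = sqrt(2 * checkpoint_cost * MTBF)
    return math.sqrt(2 * checkpoint_s * mtbf_s)

delta = 120   # 2-minute checkpoint cost (illustrative)
mtbf = 3600   # 1-hour system MTBF, plausible at exascale (illustrative)
tau = optimal_checkpoint_interval(delta, mtbf)
print(f"Checkpoint every {tau / 60:.1f} minutes")  # ~15.5 minutes
```

The formula makes the slide's point quantitative: as MTBF shrinks, the optimal interval shrinks with its square root, so fast context saving and algorithm-level fault tolerance become essential.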
47 Importance of tools
- Development and optimization of large-scale parallel software require scalable tools
- Particularly important for systems built with home-grown processors, which current commercial and research tools do not support
- Three kinds of tools are required by default: a parallel debugger for correctness, a performance tuner for performance, and an energy optimizer for energy efficiency
48 Urgent need for an eco-system
- The eco-system for exa-scale systems based on home-grown processors is urgently needed: languages, compilers, OS, runtime, tools; application development support; application software
- Need to attract hardware manufacturers and third-party software developers: a product family instead of a single machine
- Collaboration between industry, academia and end-users is required
49 Thank you!
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past, computers
More informationPreparing GPU-Accelerated Applications for the Summit Supercomputer
Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership
More informationThe Red Storm System: Architecture, System Update and Performance Analysis
The Red Storm System: Architecture, System Update and Performance Analysis Douglas Doerfler, Jim Tomkins Sandia National Laboratories Center for Computation, Computers, Information and Mathematics LACSI
More informationInterconnect Your Future
Interconnect Your Future Gilad Shainer 2nd Annual MVAPICH User Group (MUG) Meeting, August 2014 Complete High-Performance Scalable Interconnect Infrastructure Comprehensive End-to-End Software Accelerators
More informationChina's supercomputer surprises U.S. experts
China's supercomputer surprises U.S. experts John Markoff Reproduced from THE HINDU, October 31, 2011 Fast forward: A journalist shoots video footage of the data storage system of the Sunway Bluelight
More informationChallenges in High Performance Computing. William Gropp
Challenges in High Performance Computing William Gropp www.cs.illinois.edu/~wgropp 2 What is HPC? High Performance Computing is the use of computing to solve challenging problems that require significant
More information创新释放高性能计算潜力 林俊 : 华为服务器领域首席架构师
创新释放高性能计算潜力 林俊 : 华为服务器领域首席架构师 Market Trends 2 2 Requirement for Compute Security Big Data Cloud Mobility Internet of Things Industry 4.0 Intelligent City 2020 Millions of MIPS Opportunity for Innovation
More informationAdvances of parallel computing. Kirill Bogachev May 2016
Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being
More informationIntroduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29
Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions
More informationThe Mont-Blanc approach towards Exascale
http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are
More informationRace to Exascale: Opportunities and Challenges. Avinash Sodani, Ph.D. Chief Architect MIC Processor Intel Corporation
Race to Exascale: Opportunities and Challenges Avinash Sodani, Ph.D. Chief Architect MIC Processor Intel Corporation Exascale Goal: 1-ExaFlops (10 18 ) within 20 MW by 2018 1 ZFlops 100 EFlops 10 EFlops
More informationSearch for Optimal Network Topologies for Supercomputers 寻找超级计算机优化的网络拓扑结构
Search for Optimal Network Topologies for Supercomputers 寻找超级计算机优化的网络拓扑结构 GUO, Meng 郭猛 guomeng@sdas.org Shandong Computer Science Center (National Supercomputer Center in Jinan) 山东省计算中心 ( 国家超级计算济南中心 )
More informationThe Future of High Performance Interconnects
The Future of High Performance Interconnects Ashrut Ambastha HPC Advisory Council Perth, Australia :: August 2017 When Algorithms Go Rogue 2017 Mellanox Technologies 2 When Algorithms Go Rogue 2017 Mellanox
More informationStockholm Brain Institute Blue Gene/L
Stockholm Brain Institute Blue Gene/L 1 Stockholm Brain Institute Blue Gene/L 2 IBM Systems & Technology Group and IBM Research IBM Blue Gene /P - An Overview of a Petaflop Capable System Carl G. Tengwall
More informationDynamical Exascale Entry Platform
DEEP Dynamical Exascale Entry Platform 2 nd IS-ENES Workshop on High performance computing for climate models 30.01.2013, Toulouse, France Estela Suarez The research leading to these results has received
More informationThe Road to ExaScale. Advances in High-Performance Interconnect Infrastructure. September 2011
The Road to ExaScale Advances in High-Performance Interconnect Infrastructure September 2011 diego@mellanox.com ExaScale Computing Ambitious Challenges Foster Progress Demand Research Institutes, Universities
More informationHigh performance Computing and O&G Challenges
High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating
More informationPerformance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA
Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to
More informationMapping MPI+X Applications to Multi-GPU Architectures
Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under
More informationTSUBAME-KFC : Ultra Green Supercomputing Testbed
TSUBAME-KFC : Ultra Green Supercomputing Testbed Toshio Endo,Akira Nukada, Satoshi Matsuoka TSUBAME-KFC is developed by GSIC, Tokyo Institute of Technology NEC, NVIDIA, Green Revolution Cooling, SUPERMICRO,
More informationTowards Exascale Computing with the Atmospheric Model NUMA
Towards Exascale Computing with the Atmospheric Model NUMA Andreas Müller, Daniel S. Abdi, Michal Kopera, Lucas Wilcox, Francis X. Giraldo Department of Applied Mathematics Naval Postgraduate School, Monterey
More informationAccelerating High Performance Computing.
Accelerating High Performance Computing http://www.nvidia.com/tesla Computing The 3 rd Pillar of Science Drug Design Molecular Dynamics Seismic Imaging Reverse Time Migration Automotive Design Computational
More informationIntel Many Integrated Core (MIC) Architecture
Intel Many Integrated Core (MIC) Architecture Karl Solchenbach Director European Exascale Labs BMW2011, November 3, 2011 1 Notice and Disclaimers Notice: This document contains information on products
More informationThe Stampede is Coming: A New Petascale Resource for the Open Science Community
The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation
More informationGame-changing Extreme GPU computing with The Dell PowerEdge C4130
Game-changing Extreme GPU computing with The Dell PowerEdge C4130 A Dell Technical White Paper This white paper describes the system architecture and performance characterization of the PowerEdge C4130.
More informationThe Earth Simulator Current Status
The Earth Simulator Current Status SC13. 2013 Ken ichi Itakura (Earth Simulator Center, JAMSTEC) http://www.jamstec.go.jp 2013 SC13 NEC BOOTH PRESENTATION 1 JAMSTEC Organization Japan Agency for Marine-Earth
More informationManaging HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory
Managing HPC Active Archive Storage with HPSS RAIT at Oak Ridge National Laboratory Quinn Mitchell HPC UNIX/LINUX Storage Systems ORNL is managed by UT-Battelle for the US Department of Energy U.S. Department
More informationHPC Algorithms and Applications
HPC Algorithms and Applications Intro Michael Bader Winter 2015/2016 Intro, Winter 2015/2016 1 Part I Scientific Computing and Numerical Simulation Intro, Winter 2015/2016 2 The Simulation Pipeline phenomenon,
More informationIt s a Multicore World. John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist
It s a Multicore World John Urbanic Pittsburgh Supercomputing Center Parallel Computing Scientist Waiting for Moore s Law to save your serial code started getting bleak in 2004 Source: published SPECInt
More informationFujitsu HPC Roadmap Beyond Petascale Computing. Toshiyuki Shimizu Fujitsu Limited
Fujitsu HPC Roadmap Beyond Petascale Computing Toshiyuki Shimizu Fujitsu Limited Outline Mission and HPC product portfolio K computer*, Fujitsu PRIMEHPC, and the future K computer and PRIMEHPC FX10 Post-FX10,
More informationIntroduction of Oakforest-PACS
Introduction of Oakforest-PACS Hiroshi Nakamura Director of Information Technology Center The Univ. of Tokyo (Director of JCAHPC) Outline Supercomputer deployment plan in Japan What is JCAHPC? Oakforest-PACS
More informationDistributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca
Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent
More informationPedraforca: a First ARM + GPU Cluster for HPC
www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez We ve hit the power wall ALL computers are limited by power consumption Energy-efficient approaches Multi-core Fujitsu
More informationHPC Innovation Lab Update. Dell EMC HPC Community Meeting 3/28/2017
HPC Innovation Lab Update Dell EMC HPC Community Meeting 3/28/2017 Dell EMC HPC Innovation Lab charter Design, develop and integrate Heading HPC systems Lorem ipsum Flexible reference dolor sit amet, architectures
More informationTimothy Lanfear, NVIDIA HPC
GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision
More informationAtos announces the Bull sequana X1000 the first exascale-class supercomputer. Jakub Venc
Atos announces the Bull sequana X1000 the first exascale-class supercomputer Jakub Venc The world is changing The world is changing Digital simulation will be the key contributor to overcome 21 st century
More informationIn-Network Computing. Paving the Road to Exascale. 5th Annual MVAPICH User Group (MUG) Meeting, August 2017
In-Network Computing Paving the Road to Exascale 5th Annual MVAPICH User Group (MUG) Meeting, August 2017 Exponential Data Growth The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past,
More informationC-DAC HPC Trends & Activities in India. Abhishek Das Scientist & Team Leader HPC Solutions Group C-DAC Ministry of Communications & IT Govt of India
C-DAC HPC Trends & Activities in India Abhishek Das Scientist & Team Leader HPC Solutions Group C-DAC Ministry of Communications & IT Govt of India Presentation Outline A brief profile of C-DAC, India
More informationCray XD1 Supercomputer Release 1.3 CRAY XD1 DATASHEET
CRAY XD1 DATASHEET Cray XD1 Supercomputer Release 1.3 Purpose-built for HPC delivers exceptional application performance Affordable power designed for a broad range of HPC workloads and budgets Linux,
More informationHigh Performance Computing with Fujitsu
High Performance Computing with Fujitsu Ivo Doležel 0 2017 FUJITSU FUJITSU Software HPC Cluster Suite A complete HPC software stack solution HPC cluster general characteristics HPC clusters consist primarily
More informationHigh Performance Computing with Accelerators
High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing
More informationINSPUR and HPC Innovation. Dong Qi (Forrest) Oversea PM
INSPUR and HPC Innovation Dong Qi (Forrest) Oversea PM dongqi@inspur.com Contents 1 2 3 4 5 Inspur introduction HPC Challenge and Inspur HPC strategy HPC cases Inspur contribution to HPC community Inspur
More informationPractical Scientific Computing
Practical Scientific Computing Performance-optimized Programming Preliminary discussion: July 11, 2008 Dr. Ralf-Peter Mundani, mundani@tum.de Dipl.-Ing. Ioan Lucian Muntean, muntean@in.tum.de MSc. Csaba
More informationCOMP 633 Parallel Computing.
COMP 633 Parallel Computing http://www.cs.unc.edu/~prins/classes/633/ Parallel computing What is it? multiple processors cooperating to solve a single problem hopefully faster than a single processor!
More informationPORTING CP2K TO THE INTEL XEON PHI. ARCHER Technical Forum, Wed 30 th July Iain Bethune
PORTING CP2K TO THE INTEL XEON PHI ARCHER Technical Forum, Wed 30 th July Iain Bethune (ibethune@epcc.ed.ac.uk) Outline Xeon Phi Overview Porting CP2K to Xeon Phi Performance Results Lessons Learned Further
More information