Sunway TaihuLight: The system and applications. Zhao Liu Director of Application Support Department National Supercomputing Center in Wuxi

Size: px

Start display at page:

Download "Sunway TaihuLight: The system and applications. Zhao Liu Director of Application Support Department National Supercomputing Center in Wuxi"

Sydney Warren
5 years ago
Views:

1 Sunway TaihuLight: The system and applications Zhao Liu Director of Application Support Department National Supercomputing Center in Wuxi

2 Outline Sunway Machine Applications and Programming Challenges Long Term Plans

3 Sunway TaihuLight 神威太湖之光

4 Sunway TaihuLight City Rank in Top50 Shanghai 1 Suzhou 7 Wuxi 14 Nantong 24 Changzhou 34 Jiaxing 50

5 Sunway-I: - CMA service, commercial chip Tflops - 48 th of TOP500 Sunway BlueLight: - NSCC-Jinan, core processor - 1 Pflops - 14 th of TOP500 Sunway TaihuLight: - NSCC-Wuxi, core processor Pflops - 1 st of TOP500 The Sunway Machine Family

6 Overall Architecture of SW26010 MPE: management processing element 管理 MPE 核心管理 MPE 核心运算核 CPE 心簇 Clusters 运算核 CPE 心簇 Clusters CPE: computing processing element IMPE: intelligent memory processing element SI: system interface 智能存储核心主存智能存储核心主存片上网络 Network On Chip 智能存储核心 IMPE IMPE IMPE Main Memory Main Memory 主存 Main Memory 系统 SI 接口

7 communication row and column buses credit-based wormhole switch cluster controller The Architecture of the CPE Cluster streaming engine: processing of DMA instructions synchronization control: fast hardware synchronization for CPEs consistency processing unit: assuring cache coherency with the MPEs 运算核 CPE Clusters 心簇 Cluster 簇控制器 Controller... 运算运算运算运算 CPE CPE CPE CPE 核心核心核心核心... 运算运算运算运算 CPE CPE CPE CPE 核心核心核心核心... 运算运算运算运算 CPE CPE CPE CPE 核心核心核心核心... 运算运算运算运算 CPE CPE CPE CPE 核心核心核心核心 Cluster Communication 簇通信网络 Network Cluster Communication 簇通信网络接口 Network Interface CTLB L2 ICache 流传输引 Streaming Engine 擎 Network 片上网络接口 On Chip Interface Interruption 中断控制 Control Synchroniza 同步控制 tion Control 一致性处理 Consistency processing 部件 Unit Network On Chip 片上网络

8 The MPE microarchitecture MPE: L1 Instruction L1 指令 Cache Cache complete L1 and L2 caches PC Instruction 指令部件 Unit 3 pipes full support for management, service, and Pipe0 Execution 执行部件 Unit Pipe1 Pipe2 GPR computing functions mainly designed for running OS, runtime, and programs with less parallelism Memory 访存部件 Accessing Unit L1 Data L1 数据 Cache L2 Cache

9 The CPE microarchitecture CPE: L1 Instruction L1 指令 Cache Cache L1 cache for instruction only Scratch Pad Memory (i.e. Local Data Memory) 2 pipes On- 片上网络 Chip 路由器 NR Instruction PC 取指部件 Unit Execution 执行部件 Unit Pipe0 Pipe1 GPR mainly designed for highly parallel computation Memory 访存部件 Accessing Unit SPM/L1 数据 Data Cache

10 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

11 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

12 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

13 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

14 High-Density Integration of the Computing System A Five-Level Integration Hierarchy computing node computing board super node cabinet entire computing system

15 A System with Over 10 Million Cores

16 Outline Sunway Machine Applications and Programming Challenges Long Term Plans

17 Applications overview Over 200 applications in different HPC domains 15 full-scale applications More than 50 applications that can be scaled to half system More than 100 applications that can be scaled to one million cores Collaborate with more than 300 companies, research institutes and universities Porting open source applications to Sunway, including WRF, VASP, LAMMPS, etc. Develop highly efficient applications on Sunway

18 An (Incomplete) List of Full-Scale Applications 2016 Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

19 An (Incomplete) List of Full-Scale Applications 2016 Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires A graph processing application designed Run-away Electron for Trajectory local Internet Simulation companies in China Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

20 An (Incomplete) List of Full-Scale Applications 2016 Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Work of the Particle Simulator Genome Functional Annotation and Homeotic Research Team Gene Building of RIKEN AICS Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

21 An (Incomplete) List of Full-Scale Applications 2016 Gordon Bell Finalists Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

22 An (Incomplete) List of Full-Scale Applications 2016 Gordon Bell Prize Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

23 An (Incomplete) List of Full-Scale Applications 2016 Gordon Bell Prize Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Gordon Bell Finalists Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

24 An (Incomplete) List of Full-Scale Applications 2016 Gordon Bell Prize Fully Implicit Solver for Atmospheric Dynamics Surface Wave Modeling Phase Field Simulations of Coarsening Dynamics Atomistic Simulation of Silicon Nanowires Run-away Electron Trajectory Simulation Genome Functional Annotation and Homeotic Gene Building Spacecraft CFD Numerical Simulation 2017 Gordon Bell Prize Extreme-scale Graph Processing Framework Simulation of Planetary Rings Simulations of Quantum Spin Liquid States via PEPS++ Molecular Dynamics Simulation of Condensed Covalent Materials cryo-em Macromolecule Structure Determination Redesigning CAM-SE Nonlinear Earthquake Simulation

25 Two Major Challenges The Memory Barrier The Application Barrier

26 Machine Capability Comparison TaihuLight Tianhe-2 Titan K Computer communication bandwidth Peak Performance Memory Size hpgmg TaihuLight Titan Linpack Tianhe-2 K Computer Graph memory bandwidth Gflops/Watt HPCG Tflops/m^3

27 Again, a Multi-Objective Optimization Problem with a Budget Constraint Peak /.1Pflops Mem / TB Price / Million USD Sunway TaihuLight Tianhe-2 Piz Daint Titan Sequoia K

28 Again, a Multi-Objective Optimization Problem with a Budget Constraint a budget machine with aggressive compute performance Peak /.1Pflops Mem / TB Price / Million USD Sunway TaihuLight Tianhe-2 Piz Daint Titan Sequoia K

29 Major Features to Consider Sunway TaihuLight 125 Pflops 10 million cores user-controlled 64 KB LDM 32 GB and 136GB/s per node 22 flops/byte MPE + CPE register communication among CPEs

30 Major Features to Consider Sunway TaihuLight 125 Pflops 10 million cores user-controlled 64 KB LDM 32 GB and 136GB/s per node 22 flops/byte MPE + CPE register communication among CPEs Intel KNL 7250 of Cori: NVIDIA P100 of Piz Daint: 6.5 flops/byte 7.2 flops/byte

31 Major Challenge : Memory Wall Sunway TaihuLight 125 Pflops 10 million cores user-controlled 64 KB LDM 32 GB and 136GB/s per node 22 flops/byte MPE + CPE register communication among CPEs Refactoring and Redesigning

MPI One MPI process runs on one management core (MPE) Sunway OpenACC Sunway OpenACC conducts data transfer between main memory and local data memory (LDM), and distributes the kernel workload to the

32 MPI One MPI process runs on one management core (MPE) Sunway OpenACC Sunway OpenACC conducts data transfer between main memory and local data memory (LDM), and distributes the kernel workload to the computing cores (CPEs) Athread Programming Model on TaihuLight MPI + X X: (Sunway OpenACC / Athread) Athread is the threading library to manage threads on computing core (CPE), which is used in the Sunway OpenACC implementation

33 Two Major Challenges The Memory Barrier The Application Barrier

Example (I): Nonlinear Earthquake Simulation on Sunway TaihuLight Dynamic rupture source generator (originated from CG-FDM) Seismic wave propagation (originated from AWP-ODC) Other utilities: source

34 Example (I): Nonlinear Earthquake Simulation on Sunway TaihuLight Dynamic rupture source generator (originated from CG-FDM) Seismic wave propagation (originated from AWP-ODC) Other utilities: source partitioner 3D Model Interpolator Restart controller Fault Stress Init Dynamic Rupture Source Generator (Based on CG-FDM) Source Partitioner Seismic Wave Propagation (Based on AWP-ODC) Velocity Stress Update Update Next Timestep Stress Adjustment For Plasticity Friction Law Ctrl Source Injection Wave Eqn Solver 3D Vel/Den Model 3D Model Interpolator Snapshot/Sesimo Recorder Restart Controller LZ4 Compression, Group I/O, Balanced I/O Forwarding

35 Multi-Level Domain Decomposition

36 A Balanced Memory Scheme v 1 v 2 v n before v 1 v 2 v n after u 1 u 2 u w n 1 w 2 xx1 xx 2 yy1 yy 2 zz1 zz 2 w n xxn... yyn zzn u 1 u 2 u w n 1 w 2 u1 v1 w1 u2 v2 w2 fuse arrays w n xx1 xx 2 yy1 yy 2 zz1 zz 2 yz1 yz2 xxn... yyn zzn yzn (1) array fusion, (2) halo exchange through register communication, dvelcx yz1 yz2 yzn dvelcx xx1 yy1 zz1 xy1 xz1 yz1 xx 2yy 2 zz 2 xy 2 xz2 yz2 (3) and optimized blocking configuration guided by an analytical model v 1 v 2 v n u 1 u 2 un w 1 w 2 w n u1 v1 w1 u2 v2 w2 dstrqc dstrqc ID: 00 ID: 01 ID: 02 ID: 63 xx1 xx 2 xxn yz1 yz2 yy1 yy 2 zz1 zz 2... yzn zzn yyn xx1 yy1 zz1 xy1 xz1 yz1 xx 2yy 2 zz 2 xy 2 xz2 yz2 Left boundary Right boundary Inner part DMA transfer Register communication Register communication

37 Weak Scaling 18 Ideal (Linear) 14 Ideal (Non-linear) Ideal (Linear+Compress) 9 Ideal (Non-linear+Compress) 6 PFLOPS Linear (Peak: 10.7PFLops, Para. eff. 97.9%) Non-linear (Peak: 15.2PFlops, Para. eff. 80.1%) Linear+Compress (Peak: 14.2PFlops, Para. eff. 96.5%) Non-linear+Compress (Peak: 18.9PFlops, Para. eff. 79.5%) 8K 12K 16K 24K 32K 40K 48K 64K 80K 96K 120K 160K Number of processes

38 Weak Scaling PFLOPS 18 Ideal (Linear) 14 Ideal (Non-linear) Ideal (Linear+Compress) 9 Ideal (Non-linear+Compress) Peak Mem BW BYTE/flop Best Efficiency Linear (Peak: 10.7PFLops, Para. eff. 97.9%) 1 Titan 27 Pflops 5,475 TB/s Non-linear 0.2 half (Peak: machine: 15.2PFlops, 1.6 Pflops, Para. eff. 11.9% 80.1%) Linear+Compress (Peak: 14.2PFlops, Para. eff. 96.5%) 0.6 Non-linear+Compress (Peak: 18.9PFlops, Para. eff. 79.5%) Sunway without compression: 15.2 Pflops, 12.2% 8K K Pflops 16K4,473 TB/s24K 0.04 TaihuLight 32K 40K 48Kwith compression: 64K 80K K Pflops, 120K15.1% 160K Number of processes

39 Simulation Results

40 Two Major Challenges The Memory Barrier The Application Barrier

41 More and more component models marine biology ocean model oceanatmosphere boundary atmosphere model space weather ocean-ice boundary coupler land-atmosphere boundary atmospheric chemistry dynamic ice ice model ice-land boundary land model land biology solid earth hydrological process

42 The application barrier a high complexity in application, and a heavy legacy in the code base (millions lines of code) an extremely complicated MPMD program with no hotspots (or hundreds of hotspots) misfit between the in-place design philosophy and the new architecture lack of people with interdisciplinary knowledge and experience

CAM5.0 POP2.0 CPL7 CLM4.0 CICE4.0 CESM1.

Students Four component models, millions

TaihuLight 24,000 MPI processes Over one

43 Application (II): Porting CESM and Redesigning CAM-SE for Sunway TaihuLight CAM5.0 POP2.0 CPL7 CLM4.0 CICE4.0 CESM1.2.0 Tsinghua + BNU 30+ Professors and Students Four component models, millions lines of code Large-scale run on Sunway TaihuLight 24,000 MPI processes Over one million cores 10-20x speedup for kernels 2-3x speedup for the entire model Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer, in Proceedings of SC

44 OpenACC-based Refactoring of CAM Pass tracers (u, v) to dynamics CAM initial Dyn_run Phy_run1 Phy_run2 Pass state variables and tracers Pass state variables manual transformation of loops manual OpenACC parallelization and optimization on code and data structures Euler_step: do ie = nets, nete compute Q min/max values for lim8 compute Biharmonic mixing term f do ie = nets, nete 2D advection step data packing Bonundary exchange Data extracting 1 do ie = nets, nete do ie = nets, nete do k = 1, nlev do k = 1, nlev do q = 1, qsize do q = 1, qsize qmin(k,q,ie) = Qtens(k,q,ie) = qmax(k,q,ie) = Qtens(k,q,ie) = Data packing 4 do ie = nets, nete do ie = nets, nete do q = 1, qsize do q = 1, qsize do k = 1, nlev do k = 1, nlev qmin(k,q,ie) = Qtens(k,q,ie) = qmax(k,q,ie) = Qtens(k,q,ie) = Data packing 5!$ACC PARALLEL LOOP!$ACC PARALLEL LOOP do ie_q = 1, do ie_q = 1, qsize*(nete-nets) qsize*(nete-nets) do k = 1, nlev do k = 1, nlev q = func(ie_q) q = func(ie_q) ie = func(ie_q) ie = func(ie_q) qmin(k,q,ie) = Qtens(k,q,ie) = qmax(k,q,ie) = Qtens(k,q,ie) =!$ACC PARALLEL LOOP Data packing 6 do ie = nets, nete do ie = nets, nete do k = 1, nlev do q = 1, qsize dp(k) = func_1() do k = 1, nlev do q = 1, qsize. Qtens(k,q,ie) = func_2(dp(k)) do ie = nets, nete do ie = nets, nete do k = 1, nlev do k = 1, nlev dp0 = func_3() do q = 1, qsize dpdiss = func_4() qmin(k,q,ie) = do q = 1, qsize qmax(k,q,ie) = Qtens(k,q,ie) = func_5(dp0, dpdiss) do ie = nets, nete do k = 1, nlev do k = 1, nlev dp_star(k) = func_8(dp(k)) dp(k) = func_5() Vstar(k) = func_6() do k = 1, nlev Qtens(k,q,ie) = do q = 1, qsize func_9(dp_star(k)) do k = 1, nlev Qtens(k,q,ie) = func_7(dp(k), Vstar(k)) Data packing 2 do ie = nets, nete do ie = nets, nete do k = 1, nlev do q = 1, qsize do q = 1, qsize do k = 1, nlev Qtens(k,q,ie) = func_2(func_1()) optimized:. do ie = nets, nete do ie = nets, nete do k = 1, nlev do k = 1, nlev do q = 1, qsize do q = 1, qsize qmin(k,q,ie) = Qtens(k,q,ie) = qmax(k,q,ie) = func_5(func_3(),func_4()) do ie = nets, nete do k = 1, nlev do q = 1, qsize Qtens(k,q,ie) = do k = 1, nlev func_9(func_8(func_5())) Qtens(k,q,ie) = func_7(func_5(),func_6()) Data packing 3 do begin_chunk to end_chunk tphysbc() tphysbc() tphysbc() { { { do begin_chunk to end_chunk do begin_chunk to end_chunk convect_deep_tend(6.47%) convect_deep_tend(6.47%) convect_deep_tend(6.47%) convect_shallow_tend(15.57%) convect_shallow_tend(15.57%) enddo macrop_driver_tend(8.38%) macrop_driver_tend(8.38%) microp_aero_run(4.29%) microp_aero_run(4.29%) do begin_chunk to end_chunk microp_driver_tend(7.13%) microp_driver_tend(7.13%) microp_driver_tend(7.13%) aerosol_wet_intr(4.29%) aerosol_wet_intr(4.29%) enddo convect_deep_tend_2(0.51%) convect_deep_tend_2(0.51%) radiation_tend(54.07%) radiation_tend(54.07%) do begin_chunk to end_chunk } enddo radiation_tend(54.07%) enddo } enddo } convect_deep_tend(6.47%) { zm_conv_tend(6.47%) do begin_chunk to end_chunk { convect_deep_tend(6.47%) do begin_chunk to end_chunk { zm_convr(2.03%) zm_conv_tend(6.47%) enddo { do begin_chunk to end_chunk zm_convr(2.03%) zm_conv_evap() zm_conv_evap() enddo montran() do begin_chunk to end_chunk convtranc(0.06%) montran() } enddo } do begin_chunk to end_chunk enddo convtranc(0.06%) enddo } } tool based transformation of loops

Athread-based Fine-grained Redesign Step 1: rewrite of Fortran OpenACC code to Athread C code finer memory control through a specific DMA scheme more efficient vectorization elek C0,0 C7,0 C0,1...... C7,1.

45 Athread-based Fine-grained Redesign Step 1: rewrite of Fortran OpenACC code to Athread C code finer memory control through a specific DMA scheme more efficient vectorization elek C0,0 C7,0 C0, C7, elek+7 C0,7... C7,7 Step 2: register-communication based redesign remove data dependency expose more parallelism Stage 1 Ci, j a16*i a16*i+a16*i+1... a16*i+...+a16*i+15 Stage 2 C0,0 a0... a0+...+a15 C1,0 a16 a8+a9... a a31 Ck, 0 ak a0+a1 ak+ak+1... ak+...+ak p15 + p15 = p111 = p31 p127 Stage 3 Ci, j a16*i a16*i+a16*i+1... a16*i+...+a16*i+15 p16*i

46 1 Sunway CG (64 CPEs) could be equivalent to 0.1x Intel Core or 1.8x Intel Core or 7.2x Intel Core or in certain cases 43.1x Intel Core Performance Improvement through Redesign

47 Scaling the Dynamic Core to Millions of Cores PFlops elements in each process 192 elements in each process 768 elements in each process 650 elelments in each process 3.3 PFlops Para.eff 98.55% 2.4 PFlops Para.eff 92.2% 2.72 PFlops Para.eff 92.9% 1.76 PFlops Para.eff 88.3% Number of Processes

48 Simulation of Hurricane Katrina

49 Simulation of Hurricane Katrina

50 Outline Sunway Machine Applications and Programming Challenges Long Term Plans

51 Long Term Plan Traditional HPC Applications (Science -> Service) 52

52 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications Sunway Micro tanshan.gif 15-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of Realistic 10 Hz Scenarios, Gordon Bell Prize Finalist, SC

53 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications 54

54 Average time per iteration Long Term Plan Training AlexNet with swcaffe Traditional HPC Applications (Science -> Service) x 1.0x Deep Learning Related Applications total convolution fully connected 2.2x 2.7x? 3.5x 4.5x 1.0x 7.1x 8.5x Intel(24 cores) swblas(1cg) swdnn(1cg) 55

55 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications Sunway Micro 56

56 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications Sunway Micro 57

57 Long Term Plan Traditional HPC Applications (Science -> Service) Deep Learning Related Applications Sunway Micro Sunway Exa-Scale System 58

58 Sunway TaihuLight: - NSCC-Wuxi, core processor Pflops - 1 st of TOP500 Sunway Exa-Pilot System: ~ 10 Tflops per node - 10 ~ 20 Gflops/W Sunway Exa-Scale System ~ Pflops - 30 Gflops/W The Sunway Exa-Scale Plan

HPC and Big Data: Updates about China. Haohuan FU August 29 th, 2017

HPC and Big Data: Updates about China Haohuan FU August 29 th, 2017 1 Outline HPC and Big Data Projects in China Recent Efforts on Tianhe-2 Recent Efforts on Sunway TaihuLight 2 MOST HPC Projects 2016