CHAO YANG. Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer



CHAO YANG
Dr. Chao Yang is a full professor at the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy of Sciences. His research interests include numerical analysis and modeling, large-scale scientific computing, and parallel numerical software. Prof. Yang is a member of IEEE, ACM, and SIAM.

Early Experience on Optimizations of Application Codes on the Sunway TaihuLight Supercomputer
Chao Yang
Institute of Software, Chinese Academy of Sciences
6/21/2016

Acknowledgements
Funding
  - Major Project, NSF of China
  - 863 and 973 Programs, MOST of China
Collaborators
  - Prof. Wei Xue (Tsinghua U)
  - Prof. Lanning Wang (Beijing Normal U)
  - Prof. Haohuan Fu (Tsinghua U/NSCC-Wuxi)
  - ...


Processor architecture of Sunway TaihuLight
- Each processor has 4 core groups (CGs), each containing 1 management processing element (MPE) and 64 computing processing elements (CPEs).
- The MPE and CPEs share the same memory space through a memory controller (MC).
- The CPEs in each CG are organized as an 8 × 8 mesh (CPE cluster), with a mesh controller handling interrupts and synchronization.
- Each CPE has a 64 KB Scratch Pad Memory (SPM) that serves as either a fast user-programmable buffer or a software-emulated cache.
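The counts on this slide imply some simple per-processor totals. The short sketch below only does that arithmetic; the row-major CPE numbering used in `mesh_coords` is an assumption made for illustration and is not stated in the slides.

```python
# Figures taken from the slide: 4 CGs per processor, 1 MPE + 64 CPEs per CG,
# 64 KB of SPM per CPE, CPEs arranged as an 8 x 8 mesh.
CORE_GROUPS_PER_PROCESSOR = 4
MPES_PER_CG = 1
CPES_PER_CG = 64
SPM_KB_PER_CPE = 64
MESH_DIM = 8


def cores_per_processor():
    """Total cores (MPEs plus CPEs) on one processor."""
    return CORE_GROUPS_PER_PROCESSOR * (MPES_PER_CG + CPES_PER_CG)


def spm_kb_per_cg():
    """Aggregate scratch pad memory of one CPE cluster, in KB."""
    return CPES_PER_CG * SPM_KB_PER_CPE


def mesh_coords(cpe_id):
    """(row, column) of a CPE in the mesh, ASSUMING row-major numbering."""
    if not 0 <= cpe_id < CPES_PER_CG:
        raise ValueError("cpe_id out of range")
    return divmod(cpe_id, MESH_DIM)


if __name__ == "__main__":
    print("cores per processor:", cores_per_processor())
    print("SPM per CG (KB):", spm_kb_per_cg())
    print("CPE 10 sits at", mesh_coords(10))
```

The aggregate SPM per core group is small compared with main memory, which is why the later slides emphasize data-movement reduction.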

Programming model for the CPE cluster
- Programming model: MPI+X
  - MPI is available for the MPE.
  - OpenACC and Athread are available for the CPE cluster.
  - The master thread on the MPE may spawn threads running on the CPE cluster.
- Three advanced architectural features:
  - The 64 KB SPM is controlled by the user and treated as private local device memory (LDM).
  - The direct memory access (DMA) channel moves data quickly between main memory and the LDM.
  - A low-latency register communication mechanism links the CPEs within the cluster.
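The LDM/DMA pattern described above amounts to tiling: fetch a tile small enough to fit in the 64 KB LDM, compute on it locally, and write it back. The sketch below simulates that pattern in plain Python. It does not use the Athread API; the function names and the `ldm_fraction` parameter (reserving part of the LDM, for example for double buffering) are hypothetical.

```python
LDM_BYTES = 64 * 1024   # LDM capacity per CPE, from the slides
DOUBLE_BYTES = 8


def tile_length(num_arrays, ldm_fraction=0.5):
    """Largest tile (in doubles) such that num_arrays tiles fit in the
    chosen share of the LDM. ldm_fraction is an illustrative assumption."""
    budget = int(LDM_BYTES * ldm_fraction)
    return budget // (num_arrays * DOUBLE_BYTES)


def axpy_tiled(a, x, y, tile):
    """Compute a*x + y tile by tile, mimicking DMA get / compute / DMA put."""
    out = list(y)
    for start in range(0, len(x), tile):
        # "DMA get": copy one tile of each operand into local buffers
        local_x = x[start:start + tile]
        local_y = out[start:start + tile]
        # compute using only the local buffers
        local_y = [a * xi + yi for xi, yi in zip(local_x, local_y)]
        # "DMA put": write the finished tile back to main memory
        out[start:start + tile] = local_y
    return out


if __name__ == "__main__":
    print("tile length for two arrays:", tile_length(2))
    print(axpy_tiled(2.0, [1, 2, 3, 4, 5], [1, 1, 1, 1, 1], tile=2))
```

On real hardware the get, compute, and put phases of consecutive tiles would typically be overlapped; the simulation runs them in sequence for clarity.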

1. xmath library


Overview of xmath
xmath is a high-performance extended math library.
- Compatible with commercial libraries
  - MKL, ACML, etc.
- Four major modules
  - BLAS
  - LAPACK
  - FFT
  - Sparse Iterative Solver
- Tailored for the Sunway CPU
  - Assembly-level optimization
  - Instruction-level optimization
  - Data-movement reduction by LDM/DMA
  - Pthreads parallelism for MPEs
  - Athread parallelism for CPEs
- Version history
  - V0.9a released on 01/01/2016
  - V1.0b released on 06/20/2016

Different optimization strategies for different kernels
Recall the roofline model.
[Figure: performance (flop/s) versus operational intensity (flop/byte), with intensity classes low ~O(1), medium ~O(log(n)), and high ~O(n).]
- Data-movement reduction and task parallelization for bandwidth-limited kernels.
- Assembly- and/or instruction-level optimizations for compute-intensive kernels.
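The roofline model bounds attainable performance by the smaller of the peak compute rate and the product of memory bandwidth and operational intensity. A minimal sketch follows; the peak and bandwidth values in the usage lines are hypothetical placeholders, not Sunway measurements.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound: min(peak, bandwidth * operational intensity).

    peak_flops     peak compute rate in flop/s
    mem_bandwidth  memory bandwidth in byte/s
    intensity      operational intensity in flop/byte
    """
    return min(peak_flops, mem_bandwidth * intensity)


def is_bandwidth_bound(peak_flops, mem_bandwidth, intensity):
    """True when the kernel sits left of the ridge point peak / bandwidth."""
    return mem_bandwidth * intensity < peak_flops


if __name__ == "__main__":
    PEAK = 100.0       # hypothetical peak, flop/s
    BANDWIDTH = 10.0   # hypothetical bandwidth, byte/s
    for name, oi in [("low intensity", 0.25), ("high intensity", 50.0)]:
        bound = attainable_flops(PEAK, BANDWIDTH, oi)
        print(name, bound, is_bandwidth_bound(PEAK, BANDWIDTH, oi))
```

Kernels below the ridge point benefit mainly from moving less data; kernels above it benefit from instruction-level tuning, which matches the two strategies listed on the slide.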

Optimizations for FFT
- Use the DFTI interface of MKL, with a descriptor holding the FFT information.
- Two-layer decomposition based on the iterative Stockham framework.
- Batched Cooley-Tukey FFT kernels on the 8 × 8 CPE cluster.
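For readers unfamiliar with the Stockham formulation, here is a minimal radix-2 iterative Stockham FFT in pure Python. It is a textbook sketch, not the two-layer decomposition used in xmath, and the function names are illustrative. The Stockham scheme ping-pongs between two buffers and needs no bit-reversal pass, which suits processing data in contiguous batches.

```python
import cmath


def fft_stockham(x):
    """Radix-2 iterative Stockham (autosort) FFT of a power-of-two length."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src = [complex(v) for v in x]
    dst = [0j] * n
    half = n // 2    # half the size of the current sub-transform
    stride = 1       # number of interleaved sub-transforms
    while half >= 1:
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / (2 * half))  # twiddle factor
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src   # ping-pong buffers; output is already in order
        half //= 2
        stride *= 2
    return src


def naive_dft(x):
    """O(n^2) reference DFT used only to check the result."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 0, 0, 0, 0]
    fast, slow = fft_stockham(data), naive_dft(data)
    print(max(abs(a - b) for a, b in zip(fast, slow)))
```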

Optimizations for BLAS-3 kernels


Optimizations based on task scheduling
SWAN: a lightweight task scheduling engine for Sunway CPUs
- Why design it?
  - The CPEs do not support Pthreads.
  - Modern task scheduling tools (QUARK, StarPU, OpenMP 4, ...) will not work.
- A tailored task scheduling engine is needed on Sunway to support
  - Static task scheduling for MIMD
  - Dynamic task scheduling based on a DAG
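Dynamic DAG-based scheduling can be illustrated with a ready queue: a task becomes runnable once all of its prerequisites have finished. The sketch below is a generic, single-threaded illustration of that idea and is not SWAN's interface; `run_dag` and the task names are hypothetical.

```python
from collections import deque


def run_dag(tasks, deps):
    """Run tasks in dependency order and return the order used.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> list of prerequisite task names
    """
    remaining = {name: len(deps.get(name, [])) for name in tasks}
    children = {name: [] for name in tasks}
    for name, prerequisites in deps.items():
        for pre in prerequisites:
            children[pre].append(name)

    ready = deque(name for name in tasks if remaining[name] == 0)
    order = []
    while ready:
        name = ready.popleft()      # a real engine would hand this to a free core
        tasks[name]()
        order.append(name)
        for child in children[name]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)  # all prerequisites done: task is runnable
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


if __name__ == "__main__":
    log = []
    tasks = {
        "update": lambda: log.append("update"),
        "solve": lambda: log.append("solve"),
        "factor": lambda: log.append("factor"),
    }
    deps = {"solve": ["factor"], "update": ["solve"]}
    print(run_dag(tasks, deps))
```

A static scheduler would instead fix the task-to-core mapping ahead of time; the dynamic variant shown here decides at run time from the DAG.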

Performance results of xmath 1.0b
[Figure panels: BLAS-1/2, FFT, BLAS-3, LAPACK]
Note: The results are still preliminary.

2. HPCG benchmark


HPCG: High-Performance Conjugate Gradients
- Represents applications governed by PDEs and solved by iterative methods.
  - PCG solver and MG preconditioner.
- Covers the major communication and computational patterns.
  - Neighbor/global communication.
  - SpMV, BLAS-1, etc.
- Proposed by J. Dongarra et al.
- List released twice per year since

Algorithms in HPCG
- BLAS-1
  - WAXPBY (3)
    - w = a*x + b*y
  - DotProduct (3)
    - MPI_AllReduce communication
- Sparse BLAS-2
  - SpMV
    - MPI neighboring communication
- V-cycle MG preconditioner
  - SymGS
    - SpMV-like
    - MPI neighboring communication
  - Restriction
  - Prolongation
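The kernels listed above fit together in a preconditioned CG loop. The sketch below is a serial, pure-Python illustration on a small 1D Laplacian, with one symmetric Gauss-Seidel sweep standing in for the full V-cycle multigrid preconditioner. It is not the HPCG reference code; the sparse format (a list of `(column, value)` pairs per row) and all function names are chosen here for brevity.

```python
def waxpby(a, x, b, y):
    """w = a*x + b*y"""
    return [a * xi + b * yi for xi, yi in zip(x, y)]


def dot(x, y):
    """Dot product (a global MPI_Allreduce in the parallel benchmark)."""
    return sum(xi * yi for xi, yi in zip(x, y))


def spmv(rows, x):
    """Sparse matrix-vector product; rows[i] is a list of (col, value)."""
    return [sum(v * x[c] for c, v in row) for row in rows]


def symgs(rows, r):
    """One forward plus one backward Gauss-Seidel sweep from a zero guess."""
    n = len(rows)
    z = [0.0] * n
    for sweep in (range(n), reversed(range(n))):
        for i in sweep:
            diag, s = 0.0, r[i]
            for c, v in rows[i]:
                if c == i:
                    diag = v
                else:
                    s -= v * z[c]
            z[i] = s / diag
    return z


def pcg(rows, b, tol=1e-10, max_iter=100):
    """Preconditioned CG with SymGS as the preconditioner."""
    x = [0.0] * len(b)
    r = list(b)
    z = symgs(rows, r)
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        ap = spmv(rows, p)
        alpha = rz / dot(p, ap)
        x = waxpby(1.0, x, alpha, p)
        r = waxpby(1.0, r, -alpha, ap)
        if dot(r, r) ** 0.5 < tol:
            break
        z = symgs(rows, r)
        rz_new = dot(r, z)
        p = waxpby(1.0, z, rz_new / rz, p)
        rz = rz_new
    return x


def laplacian_1d(n):
    """Tridiagonal [-1, 2, -1] matrix in the row format used above."""
    return [[(j, 2.0 if j == i else -1.0)
             for j in (i - 1, i, i + 1) if 0 <= j < n] for i in range(n)]


if __name__ == "__main__":
    A = laplacian_1d(8)
    b = spmv(A, [1.0] * 8)   # right-hand side whose exact solution is all ones
    print(pcg(A, b))
```

Note the data dependence inside `symgs`: each row update reads values written earlier in the same sweep, which is why the benchmark's SymGS is harder to parallelize than SpMV.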

Strategy of optimizing HPCG on Sunway TaihuLight
- Reordering and recoloring techniques
  - Reduce the synchronization overhead.
  - Improve the data locality.
  - Achieve a good balance between parallelism and convergence.
- Demand-based data sharing
  - Keep the effective working set on the CPE cluster.
  - Exchange data among the CPEs.
- Fine-grained task scheduling
  - Reduce the data transfer cost between the MPE and the CPEs.
  - Exploit more thread-level parallelism.
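Coloring is what exposes parallelism in Gauss-Seidel-type smoothers: points of the same color have no mutual dependence and can be updated concurrently. The slides name a block red-black scheme but give no details, so the sketch below shows a simpler, related idea instead: point-wise 8-coloring by coordinate parity, which is sufficient for a 27-point stencil. It is not the authors' reordering, and all names are illustrative.

```python
from itertools import product


def color_of(i, j, k):
    """Color in 0..7 from the parity of each grid coordinate."""
    return (i % 2) + 2 * (j % 2) + 4 * (k % 2)


def neighbors27(i, j, k, n):
    """Neighbors of (i, j, k) under a 27-point stencil on an n^3 grid."""
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        if (di, dj, dk) == (0, 0, 0):
            continue
        ni, nj, nk = i + di, j + dj, k + dk
        if 0 <= ni < n and 0 <= nj < n and 0 <= nk < n:
            yield ni, nj, nk


def color_classes(n):
    """Group grid points by color; each group can be swept in parallel."""
    classes = {c: [] for c in range(8)}
    for point in product(range(n), repeat=3):
        classes[color_of(*point)].append(point)
    return classes


def is_valid_coloring(n):
    """Check that no two stencil neighbors share a color."""
    return all(color_of(*point) != color_of(*nb)
               for point in product(range(n), repeat=3)
               for nb in neighbors27(*point, n))


if __name__ == "__main__":
    print(is_valid_coloring(4))
    print({c: len(pts) for c, pts in color_classes(4).items()})
```

More colors mean more parallelism per color but a different update order than the natural one, which can slow convergence; this is the parallelism-versus-convergence balance the slide refers to.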

Opt#1: Block red-black reordering

Opt#2: Demand-based data sharing

Opt#3: Fine-grained task scheduling

Scaling results
- Problem size on each CG: 128 × 128 × 128.
- Number of processes: (10.65M cores).
- Performance: 371 Tflop/s.
- Parallel efficiency: 81.4%.
- Convergence overhead: 15%.

3. Atmosphere Simulations


Atmospheric modeling
- Importance of Earth System Modeling
  - Helps understand the past, the present, and the future.
- Importance of Atmospheric Modeling
  - A key component to be coupled with other sub-models.
- Importance of Nonhydrostatic Atmospheric Modeling
  - Valid at the O(100 m) to O(10 km) scale.
- Advantage of Fully Implicit Methods
  - Enable long-term simulations with time steps free of stability constraints.
[Figure: coupled components: Atmosphere Model, Ocean Model, Land Model, Sea-Ice Model]
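The stability advantage of implicit time stepping can be seen on the scalar test problem y' = -λy. The sketch below compares forward (explicit) Euler with backward (implicit) Euler, used here as the simplest implicit scheme; it is not the ESDIRK method of the actual solver, and the values of λ and the step size are arbitrary illustrations.

```python
def forward_euler(lam, dt, steps, y0=1.0):
    """Explicit update y_{n+1} = y_n + dt * (-lam * y_n)."""
    y = y0
    for _ in range(steps):
        y = y + dt * (-lam * y)
    return y


def backward_euler(lam, dt, steps, y0=1.0):
    """Implicit update y_{n+1} = y_n + dt * (-lam * y_{n+1}),
    which for this linear problem solves to y_{n+1} = y_n / (1 + lam * dt)."""
    y = y0
    for _ in range(steps):
        y = y / (1.0 + lam * dt)
    return y


if __name__ == "__main__":
    LAM, DT, STEPS = 1000.0, 0.01, 10   # dt is far above the explicit limit 2/lam
    print("forward Euler :", forward_euler(LAM, DT, STEPS))
    print("backward Euler:", backward_euler(LAM, DT, STEPS))
```

With this step size the explicit iterate grows in magnitude at every step while the implicit one decays like the true solution. The price is an equation to solve per step, which for a PDE becomes a large sparse system.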

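The target problem
Governing equations
\[
\begin{aligned}
&\frac{\partial \rho'}{\partial t} + \nabla\cdot(\rho\mathbf{v}) = 0, && \text{(continuity equation)}\\
&\frac{\partial (\rho\mathbf{v})}{\partial t} + \nabla\cdot(\rho\mathbf{v}\otimes\mathbf{v}) + \nabla_H \bar{p} + \nabla p' + \rho' g\,\hat{\mathbf{z}} + 2\rho\,\boldsymbol{\Omega}\times\mathbf{v} = 0, && \text{(momentum equation)}\\
&\frac{\partial (\rho e_T)'}{\partial t} + \nabla\cdot\big((\rho e_T + p)\mathbf{v}\big) = 0, && \text{(energy equation)}\\
&\frac{\partial (\rho q)'}{\partial t} + \nabla\cdot(\rho q\mathbf{v}) = 0, && \text{(moisture equation)}
\end{aligned}
\]
where $\rho' = \rho - \bar{\rho}$, $p' = p - \bar{p}$, $(\rho e_T)' = \rho e_T - \bar{\rho}\bar{e}_T$, $(\rho q)' = \rho q - \bar{\rho}\bar{q}$, the reference state satisfies $\partial\bar{p}/\partial z = -\bar{\rho} g$, and $p$ is determined by the equation of state from $\rho$, $\mathbf{v}$, and $e_T$.
Discretizations
- Height-based Gal-Chen terrain-following vertical coordinates.
- Finite volume scheme based on the AUSM+-up Riemann solver.
- Supports both explicit R-K integration and the fully implicit ESDIRK method.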

Algorithms and implementations (1)
The stencil and matrix generation kernels.

Algorithms and implementations (2)
Physics-based geometric multi-block ILU subdomain solver

Testing results (baroclinic instability test)

Summary
We show some of our early experience from three aspects:
- At the library level: we developed xmath.
- At the benchmark level: we optimized HPCG.
- At the application level: we designed a fully implicit atmosphere solver.
High performance can be achieved by efficiently exploiting
- the 64 KB LDMs as user-programmable buffers,
- the fast DMA channels to reduce the data access cost,
- the low-latency register communication mechanism for quick data exchange.
You are welcome to tomorrow's HPC Asia session!
