Unit Name. Date These work are contributed by all our staffs

Size: px

Start display at page:

Download "Unit Name. Date These work are contributed by all our staffs"

Melina Payne
5 years ago
Views:

1 Supercomputing Applications Title in CAS Xue-bin Chi Supercomputing Center, CNIC, CAS Unit Name April 13-14, 14, 2011, Helsinki, Finland Date These work are contributed by all our staffs

2 Outline Overview the HPC History of China Brief Introduction of SCCAS Grid environment overview Applications in science and engineering Future Plan on high performance environment

3 Overview the HPC History of China 1.Started in the late of 1980 s 2.Developing in the 1990 s

4 Parallel computers ( ) 1994) BJ 01 ( ) 4 processors, with global and local memmory, made by ICT, CAS Transputers ( ) Severalparallelsystems systems, our group had a 17 transputer system KJ 8950 ( ) 16 processors, with global and local memmory, made by ICT, CAS st parallel computer Dawning 1000 occurred Peak performance 2.5GFLOPS in single precision, HPL 50% Similar to Paragon machine, NX parallel implementation environment Intel i860

5 National High Performance computing Centers ( ) National High Performance computing Center (Beijing) In Institute of computing technology, CAS National High Performance computing Center (Wuhan) Huazhong University of Science and Technology National High Performance computing Center (Chengdu) Southwest Jiaotong University National High Performance computing Center (Hefei) University of Science and Technology of China Supercomputing Center in Chinese Academy of Sciences (SCCAS) 1996 Shanghai Supercomputing Center (SSC) 2001

6 Clusters ( ) 2005) Lenovo DeepComp 6800 (SCCAS) 2003, top500 rank 14th, peak performance 5.2TFLOPS Itanium II, Linux OS, MPI, QsNet Budget 60M RMB MOST 15M, Lenovo 15M, CAS 30M Dawning 4000 (SSC) 2004, top500 rank 10th, peak performance 11 TFLOPS AMD Opteron, Linux Os, MPI, Merinet 2000 Budget 60M RMB MOST 30M, Shanghai Government 30M Many other clusters installed during that period

7 100TFLOPS HPCs ( ) 2010) Lenovo DeepComp TFLOPS, our center, Dec., 2008 Budget 180M RMB Dawning 5000 MOST 80M, Lenovo 25M, CAS 75M Peak performance 200TFLOPS, AMD processors Aug., 2009, in Shanghai Supercomputing Center Top500 rank 10th on June of 2009 Budget 200M RMB MOST 100M, Shanghai Government 100M Software and environment 100M

8 Peta FlopsHPCs Dawning Nebulae 600TFLOPS CPU + 3PFLOPS GPU Top500 Rank 2nd Budget 600M RMB MOST 200M, Shenzhen Government 400M Upgrade 1PFLOPS CPU, 200 TFLOPS Godson NUDT Tianhe 200TFLOPS CPU + 1.2PFLOPS GPU Top500 rank 7 th Budget 600M RMB MOST 200M, Tianjin Government 400M Upgrade 1PFLOPS CPU + 4PFLOPS GPU HPL 2.5PFLOPS

9 Brief Introduction of SCCAS 1.Providing computing power and technical support for scientists 2.Training in HPC algorithm and programming

10 SCCAS Computing Resources During (9th 5 year plan) In 1996, SGI Power Challenge XL 6.4Gflops 16 CPUs In1998: Hitachi SR GFlops 32CPUs In 2000, Dawning 2000 II 111.7Gflops 164 CPUs

11 SCCAS Computing Computing Resources During (10th 5 year plan) In 2003, Lenovo DeepComp6800 5Tflops, 1024 CPUs TOP500:No.14;China TOP100:No. 1 During 2006 now (11th 5 year plan) In 2008, Lenovo DeepComp Tflops, 13,000 cores TOP500:No.19;China TOP100:No.2 3 kinds of nodes, Altix 4700, IBM 3950, IBM Blades In , 300TFLOPS GPU

12 SCCAS Services: Services: Users Users in SCCAS has massively increased since 1996 (from 20 to more than 310)

13 SCCAS R&D of Algorithmsandsoftware software Parallel AMR (Adaptive Mesh Refinement) method Parallel Eigenvalue Problem Parallel Fast Multipole Method Parallel Computing Model Gridmol ScGrid middleware PSEPS FMM radar Transplant many open source software

14 Grid environment overview 1.China National Grid (CNGrid) 2.China Scientific Computing Grid (ScGrid)

15 China National Grid

16 China National Grid CNGrid operation center Supercomputing center of CNIC of CAS Northern major node Supercomputing center of CNIC of CAS Southern major node Shanghai supercomputing center Normal nodes 9 sites, they are Tsinghua University, Beijing Institute of Applied Physics and Computational Mathematics, Shandong University, University of Science and Technology of China, Huazhong University of Science and Technology, Shenzhen Institute of Advanced Technology of CAS, Hong Kong University, Xi an Jiaotong University, Gansu province Supercomputing Center

17 China Scientific Computing Grid (ScGrid) 3 Level Grid Property Complementarity Autonomic Ability Scalability Availability Realizability Usability Manageability Stability CAS ARUMS

18 GPU Clusters within CAS Site Vendor R peak /Tflops Institute. of Electrical Engnineering Lenovo 112 Shenzhen Institutes of Advanced Technology Lenovo 200 USTC Lenovo 183 CNIC Lenovo, Dawning 300 National Astronomical Observatories Lenovo 158 Institute of Geology and Geophysics Lenovo, Dawning 200 Institute of Modern Physics Lenovo Institute of High Energy Physics Dawning Institute of Metals Research Dawning 183 Purple Mountain Observatory Dawning 180 SUM Pflops

19 Windows / Linux Clients Users Administrator Web Portal SCE Middleware HPC, Cluster, Workstation, Storage

20 Applications in Science and Engineering 1.Scientific ifi Research 2.Industrial Computing

21 Prediction of Sandstorm Real time prediction system of Sandstorms for China Meteorological Administration DeepComp CPUs from 15hours down to 8mins

22 Climate and Atmosphere Model CUACE(CMA Unified Atmospheric 3000 Chemistry 1000 Environment) 0 TI ME 54km km km km % 100% 80% 60% 40% 20% 0% Ef f i ci ency km 100% 90% % % % % 108km 100% % % % % % 54km 108km 100 RIEMS(Regional 80 Integrated 70 Environment 60 Modeling System) ) MPI MPI+OpenMP

23 The surface temperature eof the Joined Area of Asia&India Pacific sa da ac c Ocean

24 Computational Fluid Dynamics Speed d-up CPU number

25 Gl Galactic Wind Simulation 8192 Computational cores, Internationally advanced both in computational scale Internationally advanced both in computational scale and speed

26 Galactic Wind Simulation on Tianhe 1A Speedup Li near Efficiency % % %

27 Simulation the Universe Evolution by 2048 Cores 13 Days

28 Spaceweather prediction research Shock wave magnetic cloud interactive Using parallel AMR method to do MHD Computing (512 cores)

29 Drug screening for Avian Influenza DeepComp7000 p 2400 CPU cores Time consumed decreased from 2 months to 8hours The compounds screened out by this computation is forwarded to related department tfor further biological test

Simulation: Side plates Formation in Ti Alloys Phase-field model Time-dependent Ginzburg-Landau(TDGL) equation Cahn-Hillard equation Anisotropic interfacial energy Numerical Method time domain finite

30 Simulation: Side plates Formation in Ti Alloys Phase-field model Time-dependent Ginzburg-Landau(TDGL) equation Cahn-Hillard equation Anisotropic interfacial energy Numerical Method time domain finite difference operator splitting Grid size 1024*1024*1024 Number of DOFs 4.295*10 9 Run on 4,096 cores Parallel efficiency 94%. 3 simulations 462,828 core hours. Collaborators Ms. Mei Yang Dr. Hao Wang Dr. Gang Wang Prof. Dongsheng Xu Institute of Metal Research, CAS, Shenyang Side-plates Formation Visualization(512 3 )

31 Side plates formation in Ti Alloys test (Tianhe 1A) Speedup Li near Efficiency % % %

Parallel Software HPSEPS Eigenvalue problem parallel solvers for

Parallel Solver Eigensolver for dense matrices with 8192 cores

Num of Cores 512 1024 2048 4096 8192 Time(sec) 1058 723 471 341

32 Parallel Software HPSEPS Eigenvalue problem parallel solvers for sparse and dense symmetric matrics SVD Parallel Solver LSQR Parallel Solver Eigensolver for dense matrices with 8192 cores on Tianhe-1A(N=40000) 16 H i E i i H ( X) X X Num of Cores Time(sec) Speed up Efficency 100% 73% 56% 39% 25%

33 Hypre on Tianhe 1A Convection Reaction Diffusion div ( K grad u + B u) + C u = F SMG preconditioned GMRES solver N=1024x1024, Np=kxk Assume efficency on 1024 core is100%,we can get efficency in 8100 cores is 51%

34 P3DFFT on Tianhe 1 A Parallel 3D FFT Based on FFTW3.2.1, but MPI_AlltoAll is avoid 1024x1024x x1024x2048 #Procs T(P3DFFT) Efficiency

35 SC PEtot: First Large scale PlanewaveGPUCode Real space & G-space calculation lation Both feature are implemented in GPU Speedup can be 5~6x for CG Data structure Transpose data from (16, 1) to (1,16) in CPU for more efficient GPU implementation Sphere-to-Cube FFT Use sphere to reduce CPU-GPU data transfer size Mapping to cube in GPU to use CUFFT MPI communication and GPU overlap Non-block AlltoAll to overlap CPU communication Communication & computation pipeline in FFT G FFT CG GPU 1.8s 118s CPU 20.5s 645s R Hpsi CG GPU s CPU s

SC PEtot:5xfaster thanvasp Testing systems on SC-PEtot: 512 atom GaAs bulk system 933 atom

represent two different situations: one with atoms uniformly occupying the space, another

5x to 9 times speedup compared with CPU PEtot About 5 times speedup compared with VASP

VASP (CPU) 478 251 186 161 PEtot (CPU) 842 450 255 152 104 495 PEtot (GPU) 87 49 28 24 18

36 SC PEtot:5xfaster thanvasp Testing systems on SC-PEtot: 512 atom GaAs bulk system 933 atom CdSe quantum dot (QD) passivated by hydrogen atoms placed in a large box These two systems represent two different situations: one with atoms uniformly occupying the space, another with a large empty space. Results 4.5x to 9 times speedup compared with CPU PEtot About 5 times speedup compared with VASP Nodes systems 512 GaAs 512 GaAs 512 GaAs 512 GaAs 512 GaAs 933 CdSe VASP (CPU) PEtot (CPU) PEtot (GPU) Speed up(with CPU) 9.7x 9.2x 9.1x 6.3x 5.8x 8.5x Speed up(with VASP) 5.49x 5.1x 6.6x 6.7x SC-Petot testing results compared with CPU Petot and VASP

P_InsPecT/cuda InsPecT Software Software Introduction Both are optimized InsPecT software P_InsPecT is open source and can be

PTMs(post translational modifications) Software Characteristics P_InsPecT via MPI run on CPU cluster or CPU nodes of HPC Software

Environment:DeepComp7000 One Core (estimate) 2048 Cores Time 1177.7 h 0.

37 P_InsPecT/cuda InsPecT Software Software Introduction Both are optimized InsPecT software P_InsPecT is open source and can be downloaded from SCBG Cuda InsPecT will be open source Software Function InsPecT is an unrestricted identification software of PTMs(post translational modifications) Software Characteristics P_InsPecT via MPI run on CPU cluster or CPU nodes of HPC Software Performance Software:P_InsPecT(one ( modification) ) Database:36547 mass spectrometric; protein sequences Environment:DeepComp7000 One Core (estimate) 2048 Cores Time h 0.4 h cuda-inspect via MPI+cuda C run on GPU cluster Software:cuda-InsPecT(two ( modifications) ) Database:62346 mass spectrometric; protein sequencese Environment:Dawn 6000A One Core (estimate) t 677 Fermi C2050 Time 6 years h

38 VAT4M ( A Visual Analysis tool for CryoEM Density Maps) Volume rendering with an intuitive interface Automatic ti segmentation tti of density maps Automatic structure fitting for multi molecular models

CAD/CAE Platform in Automation Industry Who we provide our service to CAD/CAE

parallel l solver High performance graphic workstations for pre/post processing A high

workstations Grid computing based resources Web portalintegrated with CAEsoftware A

39 CAD/CAE Platform in Automation Industry Who we provide our service to CAD/CAE engineers What we deliver to them A high performance service DeepComp 7000 for parallel l solver High performance graphic workstations for pre/post processing A high usability cloud computing solution via internet Remote graphical access to workstations Grid computing based resources Web portalintegrated with CAEsoftware A high safety environment Secure access through VPN and firewall Isolate unauthorized access from cluster by web portal

40 Aerodynamic Noise Simulation inhigh speed Train 13 million mesh elements 128 cores simulation 11 day for flow field 15 days for noise

41 Fingerprint Recognition Fingerprint database :19,600,000 fingerprints HPC:1024 cores ( 16 nodes, 64 cores/node ) Speed:4000 fingerprints/second/core Run time:3 months (linear Speedup)

42 Oil Reservoir simulation on Daqing Oilfield Grid size: 199*87*67 Dawning2000II 64 CPUs The different Oil and water percentage on the same layer at different years: Nov. 1966(left) and Jan. 2100(right)

43 Future Plan on High Performance Environment

44 Petascale Supercomputers in CAS 1 PetaFlops computer in 2012 Budget 250M RMB For scientific computing 10 PetaFlops computer in 2015 Collaboration with Beijing local government Budget 700M RMB For scientific computing and industry computing

45 100PFlops Supercomputers in China Many peta scale HPCs At least one PFLOPS HPC Budget 4B RMB MOST 2.4B, Local Government 1.6B 1 10ExaFLOPS HPC Budget Can t estimate

46 Question & Suggestion Thank you very much!

INSPUR and HPC Innovation

INSPUR and HPC Innovation Dong Qi (Forrest) Product manager Inspur dongqi@inspur.com Contents 1 2 3 4 5 Inspur introduction HPC Challenge and Inspur HPC strategy HPC cases Inspur contribution to HPC community