Analysis of Characteristic Features of HPC Applications. Lian Jin HPC Engineer, Inspur


1 Analysis of Characteristic Features of HPC Applications Lian Jin HPC Engineer, Inspur

2 HPC applications. How can these applications be made much more efficient? Application optimization, system optimization, and understanding application features.


3 Inspur's Strategy. Professional research team for HPC applications. High-end talents: Tao Yu Ph.D, Wenjing Lv Ph.D, Yu Liu Ph.D, Bowen Chen Ph.D, Lian Jin Ph.D, Miao Wang Ph.D. Fields: Computational Mathematics, Petroleum, Material/Physics, CAE/CFD, Life Science, Weather/Climate. More support teams: > 5 institutes from CAS, Tsinghua University, etc.

4 Inspur's Strategy. The way to achieve excellent performance: based on platform, guided by application, pointed by theory. Modelling innovation: input parameters optimization, physical modelling innovation. Technology innovation: codes optimization, algorithm innovation; GPU, MIC, special machines. Platform optimization: application features analyzing, hardware optimization, software optimization.


5 System Platform Optimization: preparation. Application tuning (platform application workload): analyzing application features; tuning parameters; tuning the parallelization mode; tuning workloads. Features collection: performance data for CPU, memory, I/O, network and so on. Platform tuning: analyzing platform features; tuning hardware configuration and parameters; tuning middleware. Purpose: provide real-time features and performance analysis; design the most optimized system solution; optimize both hardware and software based on these features.

6 System Platform Optimization. The different requirements of HPC applications on CPU, memory, storage and network: computing-intensive, memory-constrained, I/O-intensive, network-intensive. Balance a variety of computing requirements and build a highly scalable, highly efficient HPC system to maximize the performance of existing applications.


7 System Platform Optimization. Methods of optimizing computing-intensive applications. Application optimization: parameters (precision, convergence condition, boundary condition); code optimization (vectorization, loop optimization, function optimization); model (pseudopotential, DFT, QMC, CI). Software optimization: OS (Linux, reduced kernel, vectorization support); compiler (ifort/icpc, O2/O3, SSE, AVX); library (MKL, IMSL, Goto). Hardware optimization: CPU (boost frequency); HT: off.


8 System Platform Optimization. Methods of optimizing network-intensive applications. Hardware optimization: devices (FDR IB, TH-Express, RDMA); HT: on. Software optimization: MPI implementations (Intel MPI, MPICH2, Open MPI, Platform MPI); parameters (I_MPI_ADJUST_BCAST, I_MPI_ADJUST_GATHER, I_MPI_ADJUST_GATHERV); MPI environment (mpitune). Application optimization: input parameters, e.g. mode (spin polarization, particles, bands, base number) and granularity.
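The collective-tuning parameters on this slide are environment variables, so one way to explore them is a small sweep that runs the same job under different settings. The following is a minimal sketch, not Inspur's procedure: it assumes integer algorithm ids 1 to 4 are accepted by the installed Intel MPI version (the valid range should be checked in its reference manual), and `build_sweep` is a hypothetical helper. It only builds the environments and launches nothing.

```python
import itertools
import os


def build_sweep(variables, algorithm_ids, base_env=None):
    """Return one environment dict per combination of algorithm ids.

    variables      -- names of the tuning variables to vary
    algorithm_ids  -- candidate ids (ASSUMPTION: valid for the installed MPI)
    base_env       -- starting environment; defaults to a copy of os.environ
    """
    base = dict(os.environ if base_env is None else base_env)
    sweep = []
    for combo in itertools.product(algorithm_ids, repeat=len(variables)):
        env = dict(base)
        for name, algorithm in zip(variables, combo):
            env[name] = str(algorithm)  # environment values must be strings
        sweep.append(env)
    return sweep


variables = ["I_MPI_ADJUST_BCAST", "I_MPI_ADJUST_GATHER"]
# An empty base environment keeps the example deterministic.
envs = build_sweep(variables, algorithm_ids=[1, 2, 3, 4], base_env={})
```

Each dictionary could then be passed as `env=` to `subprocess.run` together with the site's own launch command, timing every run. The `mpitune` utility named on the slide automates a search of this kind.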


9 Teye Application Features Analyzer: the HPC application expert's eye. Goals: analyse HPC application features (CPU microarchitecture and bandwidth features), diagnose application bottlenecks, help to improve application performance, reduce the optimization cost. High performance: large-scale cluster and database support (>496 cores), asynchronous monitoring, low hardware resource requirement (Mem/CPU usage <). PCI-E bandwidth monitoring support: PCIe_total_bw_GB, PCIe_read_bw_GB, PCIe_write_bw_GB, refreshed every second. CT scanning: monitoring system-level resource usage, Xeon CPU microarchitecture indexes and PCI-E device bandwidth; analysis support for monitored data. Teye system diagram: User, teyeserver, MySQL, teyemon. In-node monitoring diagram: Xeon CPU, PCI-Express, accelerators, IB to other nodes.
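Per-second bandwidth metrics such as PCIe_read_bw_GB are typically derived from cumulative byte counters sampled at a fixed interval. The slides do not describe how Teye reads its counters, so the sketch below is only an illustration under stated assumptions: the `Sample` type is hypothetical, total bandwidth is taken as read plus write, and 1 GB is taken as 10^9 bytes.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of hypothetical cumulative PCIe byte counters."""
    t: float           # timestamp in seconds
    read_bytes: int    # bytes read since the counter started
    write_bytes: int   # bytes written since the counter started


def bandwidth_gb(prev, cur):
    """Convert two counter samples into GB/s rates over the interval."""
    dt = cur.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    read = (cur.read_bytes - prev.read_bytes) / dt / 1e9
    write = (cur.write_bytes - prev.write_bytes) / dt / 1e9
    return {
        "PCIe_read_bw_GB": read,
        "PCIe_write_bw_GB": write,
        "PCIe_total_bw_GB": read + write,  # ASSUMPTION: total = read + write
    }


# Two samples one second apart, matching the one-second refresh on the slide.
rates = bandwidth_gb(Sample(0.0, 0, 0), Sample(1.0, 2_000_000_000, 500_000_000))
```

Working from counter differences rather than instantaneous readings keeps the monitor cheap, which fits the low resource requirement the slide emphasizes.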


10 Experiment scientists and theoretical scientists (clients) run applications on the supercomputer and get quick results. Applications profiling by Teye; metrics shown: total floating-point speed, x87 unit speed, SSE vectorized speed, SSE scalar speed, total memory bandwidth, memory read bandwidth. Application analysis. Inspur application features database. Highly optimized jobs. Guidance. Diagnose and optimize applications.


11 Application Diagnosing. Optimize Gromacs and understand its features on HPC systems based on Intel Xeon processors. Software environment: Gromacs 4.6., ACP receptor in water, 8435 atoms; Intel composer xe 23. Hardware environment: management node, I/O nodes, 9 blade calculation boxes, Intel Xeon E5-267 processors, Infiniband.

12 Application Diagnosing. Micro structure features. Before optimization and after optimization, each with two panels. Floating point processing (GFlops): Double Precision, Single Precision, SSE Packed SP, SSE Scalar SP, AVX Packed SP. Vectorization rate and CPI: SSE_SP_VEC, AVX_SP_VEC, CPI. Analysis: These figures show that Gromacs uses single-precision floating point processing and is a floating-point intensive application. Besides, Gromacs can be optimized for the AVX instructions.
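The two derived indexes on this slide can be computed from raw hardware counter totals. The slides do not give Teye's exact formulas, so the following is a sketch under assumptions: CPI is the usual cycles divided by retired instructions, and the vectorization rate is assumed to be the share of packed floating-point operations among all packed and scalar floating-point operations, in percent. The counter values are made up for illustration.

```python
def cpi(cycles, instructions):
    """Clocks per instruction: core cycles divided by retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def vectorization_rate(packed_ops, scalar_ops):
    """Percentage of FP operations issued as packed (vector) operations.

    ASSUMPTION: rate = packed / (packed + scalar); the slides do not
    state the formula used for SSE_SP_VEC or AVX_SP_VEC.
    """
    total = packed_ops + scalar_ops
    if total == 0:
        return 0.0  # no floating-point work observed in the interval
    return 100.0 * packed_ops / total


# Illustrative counter totals for one sampling interval (not measured data).
example_cpi = cpi(cycles=2_000_000, instructions=1_000_000)
example_rate = vectorization_rate(packed_ops=750, scalar_ops=250)
```

A low vectorization rate combined with a high CPI would point at compiler flags or loop structure as the first thing to examine, which matches the direction of the AVX remark in the analysis.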

13 Application Diagnosing. Memory and network features. Before optimization and after optimization, each with two panels. Memory bandwidth (GB/s): mem_total_bw_gb, mem_read_bw_gb, mem_write_bw_gb. IB bandwidth (MB/s): Send, Receive. Analysis: These figures show that Gromacs uses very little memory bandwidth but requires large network bandwidth. So, Gromacs is a network-bandwidth intensive application.


14 Application Diagnosing. I/O features. Before optimization: Disk IO (MB/s): disk_read_mb, disk_wsize_mb; NFS IO (MB/s): nfs_read_mb, nfs_write_mb. After optimization: Disk IO (MB/s): disk_read_mb, disk_write_mb; NFS IO (MB/s): nfs_read_mb, nfs_write_mb. Analysis: These figures show that Gromacs is not an I/O-intensive application and its output model is not parallelized.

15 Application Diagnosing. Gromacs features radar chart. Axes: memory capacity, memory bandwidth, data for storage, disk IO%, CPU average, CPU time%, real time, communication, arranged in the 1st to 4th quadrants. Computationally intensive: the higher the CPU frequency, the higher the performance. Memory: performance is nearly independent of memory bandwidth. Storage IO: low requirement. Network: a high-speed, low-latency network gives good parallel performance and high efficiency.

16 Application Diagnosing. Optimize WRF and understand its features on HPC systems based on Intel Xeon processors. Software environment: WRF3.4., Intel composer xe 23 km. Hardware environment: management node, I/O nodes, 2 blade calculation boxes, Intel Xeon E5-267 processors, Infiniband.

17 Application Diagnosing. Micro structure features of WRF. Panels: Total SP GFlops (Total_GFlops); clocks per instruction (CPI); SSE vectorization rate (SSE_VEC); AVX vectorization rate (AVX_VEC). Analysis: These figures show that WRF uses single-precision floating point processing. Besides, WRF is not highly optimized for the AVX instructions.

18 Application Diagnosing. Memory and network features. Panels: total bandwidth (GB/s) mem_total_bw_gb; memory read (GB/s) mem_read_bw_gb; Infiniband send (MB/s) ib_xmitdata_mb; Infiniband receive (MB/s) ib_rcvdata_mb. Analysis: These figures show that WRF is a memory-bandwidth intensive and network-bandwidth intensive application.

19 Application Diagnosing. I/O features. Panels for the server node and a computing node: disk read (MB/s) disk_read_mb; disk write (MB/s) disk_write_mb; NFS read (MB/s) nfs_read_mb; NFS write (MB/s) nfs_write_mb. Analysis: These figures show that WRF is to some extent an I/O-intensive application, but its input and output mode is not parallelized.

20 Application Diagnosing. High memory bandwidth. Compiler options: 3%~5%. Intel MKL DFT: 3%. w Quilt IO: ~%. Parallel file system: ~%. Hybrid DM+SM: ~5%. nproc_x/nproc_y parameters optimization: 3%~%. Total: %~2%.
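Tuning nproc_x and nproc_y means choosing how the WRF domain is split among MPI tasks in the x and y directions. A minimal sketch for listing candidate splits to benchmark is given below, assuming the product nproc_x * nproc_y must equal the number of compute tasks. The helper name and the ordering (most square split first) are illustrative choices, not taken from the slides.

```python
def decompositions(ntasks):
    """List (nproc_x, nproc_y) pairs whose product equals ntasks.

    Pairs are ordered from the most square split to the most elongated
    one; this ordering is only a convenient starting point for benchmarks.
    """
    if ntasks < 1:
        raise ValueError("ntasks must be at least 1")
    pairs = []
    for x in range(1, ntasks + 1):
        if ntasks % x == 0:
            pairs.append((x, ntasks // x))
    # Sort by how far the pair is from square, then by nproc_x.
    return sorted(pairs, key=lambda p: (abs(p[0] - p[1]), p[0]))


candidates = decompositions(16)
```

Each candidate pair would be written into the WRF namelist and timed on the real workload, since the best split depends on the domain shape and the interconnect.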


21 Application Diagnosing. WRF features radar chart. Axes: memory capacity, memory bandwidth, data for storage, disk IO%, CPU average, CPU time%, real time, communication, arranged in the 1st to 4th quadrants. Computationally intensive: the higher the CPU frequency, the higher the performance. Memory bandwidth sensitive: the memory capacity requirement is set by the size of the workload, and the more compute resources are used, the lower the memory bandwidth requirement. Storage IO: high requirement for large workloads. Network: a high-speed, low-latency Infiniband network gives good parallel performance and high efficiency.

22 Application Diagnosing. VASP cases. Profile VASP features and understand its behavior while running on a supercomputer. Cases (Case 1, Case 2): ions 9, 96; base sets .4 million, .37 million; bands; k-points 5; algorithm CG, DIIS. Platform (device, notes): CPU E5-2692v2, 422.4 GFlops; memory DDR3 6, 2.4GF/s; network FDR, 56 Gbit/s; IO Raid, ~8MB/s.

23 Application Diagnosing. Floating point speed (GFlops), Case 1 and Case 2: Total, X87, SSE vectorized, SSE scalar, AVX vectorized. Notes: The intensity of floating-point work is determined by the number of bands. The continuity is determined by the algorithm.

24 Application Diagnosing. Vectorization rate, Case 1 and Case 2: SSE vectorization rate, AVX vectorization rate. Notes: The vectorization rate is determined by the number of base sets, and its maximum value is determined by the number of bands.

25 Application Diagnosing. Memory bandwidth (GB/s), Case 1 and Case 2: Total MemBW, Read MemBW, Write MemBW. Notes: The intensity of memory bandwidth and its peak value are determined by the number of base sets and the algorithm.

26 Application Diagnosing. Infiniband bandwidth (MB/s), Case 1 and Case 2: send rate, receive rate. Notes: Same as for memory bandwidth. Note that the continuity of data exchange has a crucial influence on the application's scalability.

27 Application Diagnosing. IO and NFS bandwidth (MB/s), Case 1 and Case 2: IO read, IO write, NFS read, NFS write. Notes: VASP is not an IO-intensive application and its IO is not parallelized.


28 Application Diagnosing. VASP features radar chart. Axes: memory capacity, memory bandwidth, data for storage, disk IO%, CPU average, CPU time%, real time, communication, arranged in the 1st to 4th quadrants. Computationally intensive: the higher the CPU frequency, the higher the performance. Memory bandwidth sensitive: the higher the memory bandwidth, the higher the performance. Storage IO: low requirement. Network: a high-speed, low-latency network gives good parallel performance and high efficiency.

29 Application Diagnosing. Summary table. Columns: App field, App, CPU, Mem_used, Mem bandwidth, IO, Network, Scalability. Rows: MD, GROMACS; Weather, WRF; MS, VASP. Note: a full signal means a high requirement.
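A requirement table of this kind can be approximated by comparing measured resource usage with the platform peaks and flagging the resources that run close to their limit. The sketch below illustrates the idea only: the metric names, the measured and peak values, and the 0.5 threshold are all hypothetical and are not taken from the Teye measurements in these slides.

```python
def utilisation(measured, peak):
    """Return measured / peak for every resource present in `measured`."""
    return {name: measured[name] / peak[name] for name in measured}


def high_requirements(ratios, threshold=0.5):
    """Names of resources whose utilisation reaches the threshold.

    ASSUMPTION: a single fixed threshold separates 'high requirement'
    from the rest; a real classification would likely use several levels.
    """
    return sorted(name for name, ratio in ratios.items() if ratio >= threshold)


# Hypothetical profile of one job and the peaks of one node.
measured = {"cpu_gflops": 300.0, "mem_bw_gb": 10.0, "net_mb": 4000.0, "io_mb": 5.0}
peak = {"cpu_gflops": 400.0, "mem_bw_gb": 100.0, "net_mb": 7000.0, "io_mb": 100.0}

ratios = utilisation(measured, peak)
flags = high_requirements(ratios)
```

With these made-up numbers the job is flagged as demanding on CPU and network only, the same shape of conclusion the slides draw for a compute-bound and network-bound code.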
