John Levesque, Nov 16, 2011



2 "We see that the GPU is the best device available for us today to be able to get to the performance we want and meet our users' requirements for a very high performance node with very high memory bandwidth." Buddy Bland, ORNL Project Director, OLCF-3 (HPCwire interview, October 14, 2011)

3 XK6 Compute Node Characteristics
   - Host processor: AMD Opteron 6200 Series (Interlagos)
   - Host memory: 16, 32, or 64 GB 1600 MHz DDR3
   - Tesla X2090 performance: 665 Gflops
   - Tesla X2090 memory: 6 GB GDDR5, 170 GB/sec
   - Gemini high-speed interconnect
   - Upgradeable to Kepler many-core processor

4 Accelerator Tools
   - Optimized Libraries
   - Analysis and Scoping Tools
   - Compiler Directives

5 Accelerator Tools
   - Statistics gathering to identify potential accelerator kernels
   - Statistics gathering for code running on the accelerator
   Optimized Libraries
   - Use of an autotuning framework for generating optimized accelerator libraries
   Compiler Directives
   - Whole-program analysis for performance scoping with OpenMP and OpenACC directives

6 OpenACC
   - Open standard for addressing the acceleration of Fortran, C, and C++ applications
   - Originally designed by Cray, PGI, and Nvidia
   - Directives can be ignored on systems without an accelerator
   - Can be used to target accelerators from Nvidia, AMD, and Intel
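As a flavor of the directive model, here is a minimal Fortran sketch (not from the talk): a single annotated loop that offloads when built with an OpenACC compiler, and on a system without accelerator support the directive is treated as a comment and the loop runs unchanged on the host.

   program saxpy_acc
     implicit none
     integer, parameter :: n = 1000000
     integer :: i
     real :: a, x(n), y(n)

     a = 2.0
     x = 1.0
     y = 0.0

     ! With no OpenACC compiler or accelerator present, this directive
     ! is ignored and the loop simply runs on the host CPU.
   !$acc parallel loop copyin(x) copy(y)
     do i = 1, n
        y(i) = y(i) + a*x(i)
     end do

     print *, 'y(1) =', y(1)
   end program saxpy_acc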


8 Titan Final Configuration
   Name: Titan
   Architecture: XK6
   Processor: 16-core AMD
   Cabinets: (value not captured)
   Nodes: 18,688
   Cores/node: 16
   Total cores: 299,008
   Memory/node: 32 GB
   Memory/core: 2 GB
   Interconnect: Gemini
   GPUs: TBD

9 Early Science Applications
   - CAM-SE
   - Denovo
   - LAMMPS
   - PFLOTRAN
   - S3D
   - WL-LSMS

10 CAM-SE: Key code kernels have been ported, and their performance projects a 4X speedup on the XK6 over Jaguar.

11 Major REMAP kernel (all times in milliseconds)
   - Original REMAP
   - Rewrite for porting to the accelerator
   - OpenMP PARALLEL DO, 24 threads (Magny-Cours)
   - Hand-coded CUDA: 10.2
   - OpenACC directives
   (The remaining timing values were not captured in transcription.)

12 WL-LSMS: The kernel responsible for 95% of the compute time on the CPU has been ported and shows a 2.5X speedup over the replaced CPU.

13 gwl-lsms3: First-Principles Statistical Mechanics of Magnetic Materials
   - Identified kernel for initial GPU work: zblock_lu (95% of wall time on the CPU)
   - Kernel performance is determined by BLAS and LAPACK routines: ZGEMM, ZGETRS, ZGETRF
   - Preliminary performance of zblock_lu for 12 atoms/node of Jaguarpf or 12 atoms/GPU
   - For the Fermi C2050, times include host-GPU PCIe transfers
   - Currently the GPU node does not utilize the AMD Magny-Cours host for compute
   - Configurations compared (times in seconds; values not captured): Jaguarpf node (12-core AMD Istanbul), Fermi C2050 using CUBLAS, Fermi C2050 using Cray Libsci
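Because zblock_lu's time is dominated by those BLAS/LAPACK calls, moving it to the GPU is largely a matter of servicing the same calls with an accelerated library. A minimal sketch of the dominant call follows; the matrix sizes are illustrative, and the program links against any BLAS (host BLAS, CUBLAS, or Cray Libsci).

   program zgemm_sketch
     implicit none
     integer, parameter :: n = 512
     complex(kind(0.0d0)) :: alpha, beta
     complex(kind(0.0d0)), allocatable :: a(:,:), b(:,:), c(:,:)

     allocate (a(n,n), b(n,n), c(n,n))
     a = (1.0d0, 0.0d0)
     b = (0.5d0, 0.0d0)
     c = (0.0d0, 0.0d0)
     alpha = (1.0d0, 0.0d0)
     beta  = (0.0d0, 0.0d0)

     ! C := alpha*A*B + beta*C; on a GPU node the identical call can be
     ! serviced by CUBLAS or Cray Libsci instead of the host BLAS.
     call zgemm('N', 'N', n, n, n, alpha, a, n, b, n, beta, c, n)

     print *, 'c(1,1) =', c(1,1)
   end program zgemm_sketch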

14 Denovo: The 3-D sweep kernel, which accounts for 90% of the runtime, runs 40X faster on Fermi than on an Opteron core. The new GPU-aware sweeper also runs 2X faster on CPUs than the previous CPU-based sweeper, thanks to performance optimizations.

15 Single Major Kernel - SWEEP: The sweep code is written in C++ using MPI and CUDA runtime calls. CUDA constructs are employed to enable generation of both CPU and GPU object code from a single source. C++ template metaprogramming is used to generate highly optimized code at compile time, using techniques such as function inlining and constant propagation to optimize for specific use cases.

16 Denovo performance (seconds vs. nodes), comparing:
   - Jaguar, Denovo, old sweeper
   - Jaguar, Denovo, new sweeper
   - Jaguar, standalone new sweeper
   - Fermi, standalone new sweeper (extrapolated)
   - Fermi + Gemini, standalone sweeper (estimated)

17 LAMMPS: Currently seeing a 2X-5X speedup over the replaced CPU.

18 Host-Device Load Balancing (loop time in seconds vs. nodes)
   - Split work further by spatial domain to improve data locality on the GPU
   - Further split work not ported to the GPU across more CPU cores
   - Concurrent calculation of routines not ported to the GPU with the GPU force calculation
   - Concurrent calculation of force on the CPU and GPU
   Configurations: CPU (12 ppn), GPU (2 ppn), GPU LB (12 ppn), GPU-N (2 ppn), GPU-N LB (12 ppn). The accompanying charts break loop time down by GPU-Comm, Other, Comm, Neigh, and Pair (combined as Pair+Neigh+GPU-Comm in the second chart) as a function of node count.
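The LAMMPS GPU package itself is C++/CUDA; purely to illustrate the concurrency idea above, here is a Fortran/OpenACC sketch with hypothetical force arrays, overlapping an asynchronous accelerator kernel with host work.

   subroutine overlap_forces(f_gpu, f_cpu, n)
     implicit none
     integer, intent(in) :: n
     real, intent(inout) :: f_gpu(n), f_cpu(n)
     integer :: i

     ! Launch the ported force kernel asynchronously on the accelerator...
   !$acc parallel loop async(1) copy(f_gpu)
     do i = 1, n
        f_gpu(i) = f_gpu(i) + 1.0   ! stand-in for the ported force term
     end do

     ! ...while the host concurrently computes the unported routines.
     do i = 1, n
        f_cpu(i) = f_cpu(i) + 2.0   ! stand-in for the CPU-only force term
     end do

     ! Synchronize, then combine the two contributions.
   !$acc wait(1)
     f_cpu = f_cpu + f_gpu
   end subroutine overlap_forces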

19 S3D: Full application running using the new OpenACC directives. Target performance: 4X JaguarPF.

20 Refactor all-MPI code to hybrid MPI/OpenMP
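Schematically, the refactoring keeps MPI for the inter-node decomposition while OpenMP threads take over the intra-node work that previously belonged to many single-threaded ranks. A minimal sketch (the array and its size are illustrative):

   program hybrid_sketch
     use mpi
     use omp_lib
     implicit none
     integer, parameter :: n = 1000000
     integer :: ierr, rank, nranks, i
     real, allocatable :: work(:)

     ! MPI still provides the inter-node domain decomposition...
     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

     allocate (work(n))
     work = real(rank)

     ! ...while OpenMP threads share the loop work inside each rank,
     ! replacing what used to be many single-threaded ranks per node.
   !$omp parallel do private(i)
     do i = 1, n
        work(i) = work(i) + 1.0
     end do
   !$omp end parallel do

     if (rank == 0) print *, 'ranks:', nranks, ' threads/rank:', omp_get_max_threads()
     call MPI_Finalize(ierr)
   end program hybrid_sketch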

21 Convert OpenMP Regions to OpenACC (all times in seconds): OpenMP PARALLEL DO with 16 threads (Interlagos) compared against the OpenACC parallel construct for Getrates, Diffusive Flux, Pointwise Compute, Total Run/cycle, and the entire application. (Timing values were not captured in transcription.)
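Distilled, the conversion keeps the same loop body under both sets of directives, selected at build time; the full listings appear on slides 24 and 25. A sketch with an illustrative update loop:

   subroutine update_q(q, rhs, dt, n)
     implicit none
     integer, intent(in) :: n
     real, intent(in) :: rhs(n), dt
     real, intent(inout) :: q(n)
     integer :: i

   #ifdef GPU
     ! GPU build: the OpenMP region becomes an OpenACC parallel construct.
   !$acc parallel loop present(q, rhs)
   #else
     ! CPU build: the original OpenMP threading is kept.
   !$omp parallel do private(i)
   #endif
     do i = 1, n
        q(i) = q(i) + dt*rhs(i)
     end do
   #ifdef GPU
   !$acc end parallel loop
   #else
   !$omp end parallel do
   #endif
   end subroutine update_q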

22 Structure of integrate/rhsf on the accelerator. Every step below executes inside the RK loop; the steps alternate among transfer from host to accelerator, computation on the accelerator, transfer from accelerator to host, communication on the host, and computation on the host:
   - !$acc data in integrate (major arrays on the accelerator)
   - !$acc initialization in rhsf (1,2)
   - !$acc update host(u, yspecies, temp)
   - !$acc parallel loop in rhsf (3)
   - MPI halo update for u, yspecies, temp
   - !$acc update device(grad_u, grad_ys, grad_t)
   - !$acc parallel loop in rhsf (4-5)
   - !$acc update host(mixmw)
   - MPI halo update for mixmw
   - !$acc update device(grad_mixmw)
   - !$acc parallel loop in rhsf (6,7,8,9)
   - MPI halo update for temp
   - Fill RHS array on the host
   - !$acc update device(diffflux)
   - !$acc parallel loop in rhsf (10)
   - !$acc update host(diffflux)
   - MPI halo update for diffflux
   - !$acc update device(diffflux, rhs)
   - !$acc parallel loop in rhsf (11,12)
   - !$acc update host(rhs)
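The key to that structure is the outer !$acc data region: the major arrays stay resident on the accelerator for the whole RK loop, and the host is touched only around the MPI halo exchanges. A self-contained sketch of the pattern, assuming a single array and a stand-in halo routine:

   module halo_mod
   contains
     ! Stand-in for the MPI halo exchange performed on the host.
     subroutine halo_update(u)
       real, intent(inout) :: u(:)
       u(1) = u(size(u))            ! pretend periodic halo
     end subroutine halo_update
   end module halo_mod

   program rk_data_region
     use halo_mod
     implicit none
     integer, parameter :: n = 1024
     integer :: stage, i
     real :: u(n)

     u = 1.0

     ! Keep the major array resident on the accelerator across the RK loop.
   !$acc data copy(u)
     do stage = 1, 6                ! RK loop
   !$acc parallel loop present(u)
        do i = 1, n                 ! computation on the accelerator
           u(i) = 0.99*u(i)
        end do

   !$acc update host(u)             ! transfer from accelerator to host
        call halo_update(u)         ! communication on the host
   !$acc update device(u)           ! transfer from host to accelerator
     end do
   !$acc end data

     print *, 'u(1) =', u(1)
   end program rk_data_region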

23 Accelerator performance table (columns: Host Time%, Host Time, Acc Time, Acc Copy In (MBytes), Acc Copy Out (MBytes), Calls, Function; PE=HIDE). Only the 100.0% Total row survived transcription; the per-function times and percentages were not captured.

24 Getrates loop in rhsf, one source compiled for both CPU (OpenMP) and GPU (OpenACC):

   #ifdef GPU
   !$acc parallel loop private(i, ml, mu) present(temp, pressure, yspecies, rb, rf, cgetrates)
   #else
   !$omp parallel private(i, ml, mu)
   !$omp do
   #endif
     do i = 1, nx*ny*nz, ms
        ml = i
        mu = min(i+ms-1, nx*ny*nz)
        call reaction_rate_vec_1(temp, pressure, yspecies, ml, mu, rb, rf, cgetrates)
     end do
   #ifdef GPU
   !$acc end parallel loop
   #else
   !$omp end do
   !$omp end parallel
   #endif

25 Coefficient and stress-tensor loop, following the same single-source pattern:

   #ifdef GPU
   !$acc update device(grad_u, mixmw)
   !$acc parallel private(i, ml, mu)
   !$acc loop
   #else
   !$omp parallel private(i, ml, mu)
   !$omp do
   #endif
     do i = 1, nx*ny*nz, ms
        ml = i
        mu = min(i+ms-1, nx*ny*nz)
        if (jstage .eq. 1) then
           call computecoefficients_r(pressure, temp, yspecies, q(:,:,:,4), ds_mxvg, vscsty, mixmw, ml, mu)
        end if
        call computestresstensor_r(grad_u, vscsty, ml, mu)
     end do
   #ifdef GPU
   !$acc end parallel
   #else
   !$omp end do
   !$omp end parallel
   #endif
