Porting Scientific Applications to OpenPOWER

1 Porting Scientific Applications to OpenPOWER
Dirk Pleiter, Forschungszentrum Jülich / JSC
Join the conversation at #OpenPOWERSummit

2 JSC's HPC Strategy
- General-Purpose Cluster
  - IBM Power 6 JUMP, 9 TFlop/s
  - Intel Nehalem JUROPA, 300 TFlop/s
  - JURECA, ~2 PFlop/s, + Booster, ~10 PFlop/s
- Highly Scalable System
  - IBM Blue Gene/L JUBL, 45 TFlop/s
  - IBM Blue Gene/P JUGENE, 1 PFlop/s
  - IBM Blue Gene/Q JUQUEEN, 5.9 PFlop/s
- File Server: Lustre, GPFS

3 Achieving Scalability
- Need for
  - Research on architectures and technologies
  - Research on applications and algorithms
  - Ingredients for HPC co-design
- Provide incentives to users: High-Q Club
  - Showcase for codes that can utilize a 28-rack Blue Gene/Q at JSC
  - Selected Club members
    - dynqcd: simulation of particle theories
    - KKRnano: DFT-based condensed matter
    - PEPC: tree-based N-body code
    - ...

4 Why OpenPOWER?
Answer from a customer point of view
- An increasing share of Top500 systems is based on CPUs from a single vendor
  - Pure market observation, no statement about technology
- Lack of competition in processor technologies
  - Usually higher prices
  - Less incentive for innovation
- Need for promoting alternative technologies: OpenPOWER

5 Why OpenPOWER?
Answer from an architectural point of view
- Tight integration of high-performance processor and low-clocked, highly parallel compute devices
  - Enable drastic improvement of power efficiency
  - Preserve usability at tremendously increased level of parallelism
- Opportunity to improve overall balance of system
- Integration of non-volatile memory into fat compute nodes
- Increased reliability through reduced number of components and support of resilience
- Addresses exascale challenges

6 POWER Acceleration and Design Center
- PADC is a collaboration between
  - IBM R&D Labs in Böblingen and Zürich
  - Forschungszentrum Jülich
  - NVIDIA Europe
- Mission statement
  - Support scientists and engineers to target the grand challenges facing society using OpenPOWER technologies
- Grand challenges
  - Energy and environment, e.g. plasma physics
  - Information, e.g. condensed matter physics
  - Healthcare, e.g. brain research

7 Applications
- PADC takes an application-driven approach
- Builds on previous work in the NVIDIA Application Lab
- Previously targeted applications
  - Regional Flood Model
  - B-CALM
  - PANDA
  - ...
- Ongoing and future applications
  - KKRnano
  - BigBrain
  - ...

8 Performance Analysis
- Performance analysis: POWER8 memory hierarchy
- Performance analysis: GPU-GPU data transport

9 Performance Characterization
- Characterization of applications on given hardware
- Methodology
  - Identification of performance-critical kernels
  - Optimization of kernel at best effort with given constraints
  - Performance characterization: measurement of extensive performance metrics
  - Architectural analysis
- Questions addressed in architectural analysis
  - How does performance change with clock speed?
  - How does it depend on memory hierarchy?

10 Performance Characterization
Example: Regional Flood Model
- Key kernel: solver for Saint-Venant equations
  - Compute particle flow in 2 dimensions
- Selected performance metrics (on K20x)
  - Arithmetic intensity: AI_acc(T) = 0.5
  - Memory read/write bandwidth: 80/86 GByte/s
  - Warp execution efficiency: ε_warp = 80%
- Example analysis for changing boost clock

11 Performance Modelling
Semi-empirical performance modelling methodology
- Methodology
  - On the basis of prior knowledge, formulate scaling formulae describing the execution time t(W) as a function of the workload W
  - Measure t(W) for different W and fit the scaling formulae to the results
  - Check fitted parameters for plausibility
- Considered example: B-CALM
  - One-dimensionally parallelized Finite Difference Time Domain approach for electromagnetic simulations
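
The three methodology steps (formulate, measure and fit, check) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used for B-CALM: it assumes the simplest possible scaling formula, t(W) = a + b W, and the function names `fit_linear` and `is_plausible` as well as the sample measurements are made up for the example.

```python
def fit_linear(workloads, times):
    """Least-squares fit of the assumed scaling formula t(W) = a + b * W."""
    n = len(workloads)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (W, t) pairs of equal length")
    mean_w = sum(workloads) / n
    mean_t = sum(times) / n
    var_w = sum((w - mean_w) ** 2 for w in workloads)
    if var_w == 0:
        raise ValueError("workloads must not all be identical")
    cov = sum((w - mean_w) * (t - mean_t) for w, t in zip(workloads, times))
    b = cov / var_w            # cost per unit of work
    a = mean_t - b * mean_w    # workload-independent overhead
    return a, b


def is_plausible(a, b):
    """Plausibility check: overhead is non-negative, cost per unit of work is positive."""
    return a >= 0.0 and b > 0.0


# Hypothetical measurements (illustrative only, not data from the talk)
workloads = [1.0e6, 2.0e6, 4.0e6, 8.0e6]   # e.g. number of lattice sites
times = [0.012, 0.022, 0.042, 0.082]       # seconds
a, b = fit_linear(workloads, times)
print(f"a = {a:.3e} s, b = {b:.3e} s per unit of work, plausible: {is_plausible(a, b)}")
```

A negative overhead or a non-positive cost per unit of work would indicate that the chosen scaling formula does not describe the measurements, which is the purpose of the plausibility check.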

12 Performance Modelling
Semi-empirical performance modelling for B-CALM
- Model ansatz
  - Calculation of boundary sites: t_bnd ~ N_x N_y
  - Calculation of bulk sites: t_bulk ~ N_x N_y (N_z / P)
  - Communication of boundary: t_net ~ N_x N_y
  - Overlapping calculations and communications: t = t_bnd + max(t_bulk, t_net)
- Weak scaling measured for fixed N_x N_y using P=2 GPUs attached to a single processor
  - Non-optimized MPI
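
The model ansatz can be written down as a small Python function. In this sketch the proportionality constants `c_bnd`, `c_bulk` and `c_net` are hypothetical fit parameters (the slide states only the scaling behaviour, not their values), and the function name `predicted_time` is likewise an assumption made for illustration.

```python
def predicted_time(nx, ny, nz, p, c_bnd, c_bulk, c_net):
    """Evaluate t = t_bnd + max(t_bulk, t_net) for the scaling ansatz.

    c_bnd, c_bulk and c_net are hypothetical proportionality constants
    that would be obtained by fitting measured execution times.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    area = nx * ny
    t_bnd = c_bnd * area                # boundary sites: ~ N_x N_y
    t_bulk = c_bulk * area * (nz / p)   # bulk sites: ~ N_x N_y (N_z / P)
    t_net = c_net * area                # boundary exchange: ~ N_x N_y
    # Bulk computation and communication overlap, so only the slower one counts.
    return t_bnd + max(t_bulk, t_net)


# Made-up constants, chosen only to show the two regimes of the max() term
print(predicted_time(10, 10, 8, 2, c_bnd=1.0, c_bulk=1.0, c_net=2.0))  # bulk-dominated
print(predicted_time(10, 10, 2, 2, c_bnd=1.0, c_bulk=1.0, c_net=2.0))  # network-dominated
```

Within this model, communication is fully hidden whenever c_bulk (N_z / P) >= c_net; below that threshold the network term determines the execution time.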

13 Future opportunities
- Challenging applications with large memory capacity and high bandwidth requirements
  - High-bandwidth, smaller-capacity memory attached to GPU
  - Large-capacity, smaller-bandwidth memory attached to CPU
- Example: BigBrain project at FZ Jülich
  - Goal: 3D brain model reconstructed from 2D slices
  - Computational challenge: image registration
    - Compute-intensive computation of mutual information metric
    - Large capacity required for storing high-resolution images
  - Significant slow-down found using host memory on today's architectures [A. Adinets et al., HeteroPar 2013]
- Large benefit expected from NVLink

14 Conclusions
- OpenPOWER opens important opportunities for HPC infrastructure providers
  - Exascale challenges are addressed
- No problems porting GPU-enabled applications to OpenPOWER
  - Room for optimizations
- Support for porting more applications required
  - Based on performance characterization and modelling
  - Optimization and code restructuring within PADC
