Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework


1 Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework
  Erik Schnetter, Perimeter Institute for Theoretical Physics
  XSCALE 2013, Boulder, CO

3 Gamma-Ray Bursts
  Intense, narrowly beamed flashes of high-energy photons
  Most energetic events in today's universe
  Mechanisms still a riddle (grand challenge in astrophysics)
  Gravitational waves likely to be detected by LIGO in coming years
  At the intersection of many different fields of physics
  [Figure, not drawn to scale. Labels: ~10^7 km; He, C, O/Ne/Mg, Si, Fe-group nuclei; Iron Core Collapse; Protoneutron Star; Accretion; Collapse to a Black Hole; Accretion with Accretion Disk; Jet Formation and Sustainment; Jet Propagation / Breakout; Disruption of Star; Afterglow Emission]

4 Estimated Requirements to Model a Long GRB ab initio
  Stage: Core collapse, Supernova
    Physics: EOS, neutrino transport, MHD, GR
    Length scales: 25 m to 10,000 km
    Time scales: 1 ms to 2 s
    Computation: AMR with 11 levels, 640M cells, 5M steps, 18 TByte, 270,000 PFlop (3 days)
  Stage: Accretion, Jet formation
    Physics: Neutrino transport, MHD, GR
    Length scales: 25 m to 1,000 km
    Time scales: 1 ms to 200 s
    Computation: AMR with 10 levels, 80M cells, 600M steps, 3 TByte, 6,000,000 PFlop (70 days)
  Stage: Break-out
    Physics: Photon transport, MHD, coupled to accretion
    Length scales: 1,000 km to 1,000,000 km
    Time scales: 1 ms to 200 s
    Computation: AMR with 15 levels, 100M cells, 6M steps, 25 TByte, 300,000 PFlop (3 days)
  Stage: Afterglow
    Physics: Photon transport/absorption, nuclear decay
    Length scales: Even larger
    Time scales: months
    Computation: Monte Carlo????
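The run times quoted in parentheses on this slide are consistent with a sustained rate of roughly 1 PFlop/s. That rate is not stated on the slide; it is an assumption made here only to check the arithmetic. A minimal sketch (the function name is hypothetical):

```cpp
#include <cassert>

// Wall-clock days needed for `total_pflop` PFlop of work at a sustained
// rate of `sustained_pflops` PFlop/s. The rate is an assumption; the
// slide only lists total work and days.
double days_at_sustained_rate(double total_pflop, double sustained_pflops) {
    assert(sustained_pflops > 0.0);
    const double seconds_per_day = 86400.0;
    return total_pflop / sustained_pflops / seconds_per_day;
}
```

At 1 PFlop/s this gives about 3.1 days for 270,000 PFlop, about 69 days for 6,000,000 PFlop, and about 3.5 days for 300,000 PFlop, which matches the slide's rounded figures.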

6 $30^3 = 27{,}000$ versus $(30+3+3)^3 = 36^3 = 46{,}656$ (30 points per dimension plus 3 on each side): a factor of about $1.7$

7 Current Scalability Limitations
  Adaptive Mesh Refinement (AMR) serializes time evolution of different levels
  Higher-order methods require many ghost zones
  Communication becomes bandwidth limited
  Prefer large memory per node, many cores per node
  Happy about recent architecture development

8 Current Performance Limitations
  Equations are complex, need many Flop per cell (e.g. 10 kFlop)
  Need large caches per core (both instruction and data)
  Okay, not so happy after all
  Question: Task-based multi-threading?

10 A Hard Problem
  [Diagram labels: Physics Model (EDL); Kranc Code Generator; CaKernel Programming Abstractions; Cactus-Carpet Computational Infrastructure; Parallel Programming Environment; Results; Publication]


14 AMR Bounding Box Algebra
  Need to describe shapes of refined regions
  bbox: rectangular region of grid points
  bboxset: set of bboxes, arbitrary shape
  How to implement set operations efficiently? (intersection, union, ...)
  How to turn into list of bboxes?
  [http:// science/article/pii/s ]

15 Derivative contains only key points of original shape
  [Figure panels: x derivative; xy derivative]

16 Implementation
  To store a bboxset: calculate its derivative and store the result in a tree structure
  To apply operations (union, intersection, ...), use a sweeping algorithm:
    Restore the full bboxset on the fly
    Apply the operation
    Calculate the derivative again
  Complexity: O(N log N), where N is the number of regions (not the number of points!)
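The idea can be illustrated in one dimension, where the "derivative" of a set of grid points is simply the sorted list of positions at which membership toggles. The sketch below is not the Cactus/Carpet implementation: it is a 1D simplification without the tree structure, and all names (`Derivative`, `derivative_of`, `sweep`, `bbox_union`, `bbox_intersection`, `to_boxes`) are hypothetical. It assumes half-open integer intervals `[lo, hi)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Interval = std::pair<int, int>;  // half-open [lo, hi), an assumption
using Derivative = std::vector<int>;   // sorted positions where membership toggles

// Build the derivative of a union of (possibly overlapping) intervals.
Derivative derivative_of(const std::vector<Interval>& boxes) {
    std::vector<std::pair<int, int>> events;  // (position, +1 open / -1 close)
    for (const Interval& b : boxes) {
        if (b.first < b.second) {
            events.push_back({b.first, +1});
            events.push_back({b.second, -1});
        }
    }
    std::sort(events.begin(), events.end());  // O(N log N) in the number of boxes
    Derivative d;
    int depth = 0;
    std::size_t i = 0;
    while (i < events.size()) {
        const int pos = events[i].first;
        const bool was_inside = depth > 0;
        while (i < events.size() && events[i].first == pos) depth += events[i++].second;
        if ((depth > 0) != was_inside) d.push_back(pos);
    }
    return d;
}

// Sweep over both derivatives, restoring membership on the fly, applying
// `op`, and emitting the derivative of the result. Requires
// op(false, false) == false so that the result stays finite.
template <typename Op>
Derivative sweep(const Derivative& a, const Derivative& b, Op op) {
    Derivative out;
    std::size_t ia = 0, ib = 0;
    bool in_a = false, in_b = false, in_out = false;
    while (ia < a.size() || ib < b.size()) {
        const int pos = (ib == b.size() || (ia < a.size() && a[ia] <= b[ib])) ? a[ia] : b[ib];
        if (ia < a.size() && a[ia] == pos) { in_a = !in_a; ++ia; }
        if (ib < b.size() && b[ib] == pos) { in_b = !in_b; ++ib; }
        const bool now = op(in_a, in_b);
        if (now != in_out) { out.push_back(pos); in_out = now; }
    }
    return out;
}

Derivative bbox_union(const Derivative& a, const Derivative& b) {
    return sweep(a, b, [](bool x, bool y) { return x || y; });
}

Derivative bbox_intersection(const Derivative& a, const Derivative& b) {
    return sweep(a, b, [](bool x, bool y) { return x && y; });
}

// Turn a derivative back into a list of disjoint boxes.
std::vector<Interval> to_boxes(const Derivative& d) {
    assert(d.size() % 2 == 0);
    std::vector<Interval> boxes;
    for (std::size_t i = 0; i + 1 < d.size(); i += 2) boxes.push_back({d[i], d[i + 1]});
    return boxes;
}
```

The cost depends only on the number of toggle points, not on the number of grid points covered, which mirrors the complexity remark on the slide.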

17 Weak scaling benchmarks (GR + hydro)
  [Plot: Einstein Toolkit benchmark, TOV (unigrid). Axes: time per grid point RHS [µs] vs. cores. Systems: Blue Waters, Hopper, Stampede, Vesta (BG/Q)]
  [Plot: Einstein Toolkit benchmark, TOV (9 levels). Axes: time per grid point RHS [µs] vs. cores. Systems: Blue Waters, Stampede]

18 Dynamic Loop Optimization
  [Compare Kokkos presentation]
  Cactus contains a module LoopControl that provides an abstraction for traversing loops:
    CCTK_LOOP3(loopName, i,j,k, imin,jmin,kmin, imax,jmax,kmax)
      // loop body
    CCTK_ENDLOOP3(loopName)
  Equivalent to a 3D nest of for loops

19 Dynamic Loop Optimization
  LoopControl exposes loop-level parallelism
  Hierarchy of loop traversal strategies:
    1. Coarse-grained multithreading (different caches)
    2. Iterating over tiles
    3. Iterating within tiles (each fits into cache)
    4. Fine-grained multithreading (aka SMT) (shared cache)
    5. (SIMD vectorization)
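Levels 2 and 3 of this hierarchy (iterating over tiles, then within each tile) can be sketched as a plain C++ function. This is an illustration, not LoopControl's implementation: `TileSize` and `tiled_loop3` are hypothetical names, the bounds are assumed to be half-open `[lo, hi)`, and the threading and SIMD levels are left out.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

struct TileSize { int ti, tj, tk; };  // hypothetical tile extents per direction

// Visit every point of [lo, hi) exactly once, tile by tile.
// `body(i, j, k)` plays the role of the loop body.
template <typename Body>
void tiled_loop3(std::array<int, 3> lo, std::array<int, 3> hi, TileSize t, Body body) {
    assert(t.ti > 0 && t.tj > 0 && t.tk > 0);
    // Outer three loops: iterate over tiles.
    for (int k0 = lo[2]; k0 < hi[2]; k0 += t.tk)
    for (int j0 = lo[1]; j0 < hi[1]; j0 += t.tj)
    for (int i0 = lo[0]; i0 < hi[0]; i0 += t.ti) {
        // Clip the tile at the upper loop bounds.
        const int k1 = std::min(k0 + t.tk, hi[2]);
        const int j1 = std::min(j0 + t.tj, hi[1]);
        const int i1 = std::min(i0 + t.ti, hi[0]);
        // Inner three loops: iterate within one tile.
        for (int k = k0; k < k1; ++k)
        for (int j = j0; j < j1; ++j)
        for (int i = i0; i < i1; ++i)
            body(i, j, k);
    }
}
```

The tile extents are ordinary run-time values here, which is what makes it possible to adjust them between executions of the same loop.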

20 Dynamic Loop Optimization
  LoopControl employs dynamic optimizations
  Loop traversal depends on run-time parameters (e.g. thread decomposition, tile sizes)
  Loop execution is profiled
  Parameter settings are automatically improved at run time: Random-Restart Hill Climbing
  Profiling and optimization transparent to user code
  Only minor source modification (replace for by CCTK_LOOP3)
  Implemented via macros

21 Random-Restart Hill Climbing
  For each loop setup, remember current best parameters
  Hill Climbing: examine neighbouring parameter choices
  Random Restart: from time to time, randomly choose new parameters, and backtrack if necessary
  Goal is to reduce run time, not to find the best parameter choice!
  Bad parameter choices can be 10x slower
  Better to find and use mediocre parameters many times than a bad choice just a few times
  Typical improvements (for our code): 10% to 20%
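A minimal sketch of the strategy, assuming a two-parameter search space of tile sizes in the range 1 to `max_tile` and a caller that reports the measured time after each loop execution. The class `HillClimber`, the struct `Params`, the unit step size, and the restart probability are all hypothetical choices for illustration, not LoopControl's actual interface or settings.

```cpp
#include <algorithm>
#include <cassert>
#include <random>

struct Params { int tile_i; int tile_j; };  // hypothetical tunable parameters

class HillClimber {
public:
    HillClimber(Params start, int max_tile, unsigned seed, double restart_prob)
        : best_(start), candidate_(start), max_tile_(max_tile),
          restart_prob_(restart_prob), rng_(seed) {
        assert(max_tile >= 1);
        assert(start.tile_i >= 1 && start.tile_i <= max_tile);
        assert(start.tile_j >= 1 && start.tile_j <= max_tile);
    }

    // Parameters to use for the next execution of the loop.
    Params next() {
        if (!measured_) { candidate_ = best_; return candidate_; }
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (coin(rng_) < restart_prob_) {
            // Random restart: jump to an arbitrary point in parameter space.
            std::uniform_int_distribution<int> any(1, max_tile_);
            const int ti = any(rng_);
            const int tj = any(rng_);
            candidate_ = Params{ti, tj};
        } else {
            // Hill climbing: try one neighbour of the current best.
            std::uniform_int_distribution<int> dir(0, 3);
            candidate_ = best_;
            switch (dir(rng_)) {
                case 0: candidate_.tile_i = clamp(best_.tile_i + 1); break;
                case 1: candidate_.tile_i = clamp(best_.tile_i - 1); break;
                case 2: candidate_.tile_j = clamp(best_.tile_j + 1); break;
                default: candidate_.tile_j = clamp(best_.tile_j - 1); break;
            }
        }
        return candidate_;
    }

    // Report the measured run time for the parameters returned by next().
    void report(double seconds) {
        if (!measured_ || seconds < best_time_) {
            best_ = candidate_;  // accept the improvement
            best_time_ = seconds;
            measured_ = true;
        }
        // Otherwise backtrack: the remembered best stays in place.
    }

    Params best() const { return best_; }
    double best_time() const { return best_time_; }

private:
    int clamp(int v) const { return std::max(1, std::min(v, max_tile_)); }

    Params best_;
    Params candidate_;
    int max_tile_;
    double restart_prob_;
    std::mt19937 rng_;
    bool measured_ = false;
    double best_time_ = 0.0;
};
```

In use, each execution of a loop would call `next()`, run with those parameters, time itself, and call `report()`. The remembered best never gets worse, so a poor candidate costs one slow execution and is then discarded.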

22 Dynamic Optimization vs. Auto-Tuning
  Auto-Tuning: Determine best loop parameters ahead of time
  Advantage: Actually finds best parameters
  Disadvantage: Specific to
    machine
    compiler version
    code version
    grid size

23 SIMD Vectorization
  SIMD vectorization is important
  Compilers often fail
  To make them succeed, one needs to add annotations
  To produce better code, one needs to add more annotations
  However, some code properties currently cannot be described via annotations
  Need portable code: avoid or abstract architecture-dependent or compiler-dependent mechanisms

24 SIMD Intrinsics
  Original loop:
    for (int i=0; i<n; ++i) {
      a[i] = b[i] * c[i] + d[i];
    }
  SSE:
    #include <emmintrin.h>
    // Two doubles per iteration; assumes n is even and the arrays are 16-byte aligned.
    for (int i=0; i<n; i+=2) {
      __m128d ai, bi, ci, di;
      bi = _mm_load_pd(&b[i]);
      ci = _mm_load_pd(&c[i]);
      di = _mm_load_pd(&d[i]);
      ai = _mm_add_pd(_mm_mul_pd(bi, ci), di);  // b*c + d
      _mm_store_pd(&a[i], ai);
    }
  QPX:
    #include <builtins.h>
    // Four doubles per iteration; assumes n is a multiple of 4 and the arrays are aligned.
    for (int i=0; i<n; i+=4) {
      vector4double ai, bi, ci, di;
      bi = vec_lda(0, &b[i]);
      ci = vec_lda(0, &c[i]);
      di = vec_lda(0, &d[i]);
      ai = vec_madd(bi, ci, di);  // fused b*c + d
      vec_sta(ai, 0, &a[i]);
    }


25 SIMD Vectorization API (for Stencil-Based Codes)
  Implemented in Cactus in module Vectors
    1. Data types for vectors of double, int, bool
    2. Arithmetic operations (+ - * / ?: etc.)
    3. Math functions (sqrt sin cos exp)
    4. Memory load/store operations
         Aligned/unaligned
         Partial access (masks)
         Cache bypass
    5. Helper functions for iterating over arrays/stencil operations
         Array index calculations
         Mask generation
  OpenCL, OpenMP

26 Finite Difference Stencil Example
  Original loop:
    for (int i=1; i<n-1; ++i) {
      a[i] = 0.5 * (b[i+1] - b[i-1]);
    }
  Vectorized:
    #include <vectors.h>
    VEC_ITERATE(i, 1, n-1) {
      CCTK_REAL_VEC ai, bim, bip;
      bim = vec_loadu_off(-1, &b[i-1]);
      bip = vec_loadu_off(+1, &b[i+1]);
      ai = vec_mul(vec_set1(0.5), vec_sub(bip, bim));
      vec_store_nta_partial(&a[i], ai);
    }
  Code transformation is straightforward
  Can provide additional information to improve performance
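To show what partial (masked) access does at the ends of the loop range, here is a portable emulation of the same stencil with a 4-wide vector type built from plain doubles. It is a sketch under stated assumptions, not the Cactus Vectors API: `Vec`, `kWidth`, `set1`, `sub`, `mul`, `load_partial`, `store_partial`, and `centered_difference` are hypothetical names, and a lane range `[first, last)` stands in for a hardware mask.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::ptrdiff_t kWidth = 4;  // emulated vector width (assumption)
struct Vec { double lane[kWidth]; };

Vec set1(double x) {
    Vec v;
    for (std::ptrdiff_t l = 0; l < kWidth; ++l) v.lane[l] = x;
    return v;
}

Vec sub(const Vec& x, const Vec& y) {
    Vec r;
    for (std::ptrdiff_t l = 0; l < kWidth; ++l) r.lane[l] = x.lane[l] - y.lane[l];
    return r;
}

Vec mul(const Vec& x, const Vec& y) {
    Vec r;
    for (std::ptrdiff_t l = 0; l < kWidth; ++l) r.lane[l] = x.lane[l] * y.lane[l];
    return r;
}

// Load lanes [first, last) from base[start + lane]; masked-off lanes are zero.
Vec load_partial(const double* base, std::ptrdiff_t start,
                 std::ptrdiff_t first, std::ptrdiff_t last) {
    Vec v{};
    for (std::ptrdiff_t l = first; l < last; ++l) v.lane[l] = base[start + l];
    return v;
}

// Store only lanes [first, last) to base[start + lane].
void store_partial(double* base, std::ptrdiff_t start, const Vec& v,
                   std::ptrdiff_t first, std::ptrdiff_t last) {
    for (std::ptrdiff_t l = first; l < last; ++l) base[start + l] = v.lane[l];
}

// a[i] = 0.5 * (b[i+1] - b[i-1]) for 1 <= i < n-1, in blocks of kWidth.
void centered_difference(const std::vector<double>& b, std::vector<double>& a) {
    assert(a.size() == b.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(b.size());
    for (std::ptrdiff_t i0 = 0; i0 < n; i0 += kWidth) {
        // Lane range of this block that lies inside the loop range [1, n-1).
        const std::ptrdiff_t first = std::max<std::ptrdiff_t>(1 - i0, 0);
        const std::ptrdiff_t last = std::min<std::ptrdiff_t>(n - 1 - i0, kWidth);
        if (first >= last) continue;
        const Vec bim = load_partial(b.data(), i0 - 1, first, last);
        const Vec bip = load_partial(b.data(), i0 + 1, first, last);
        const Vec ai = mul(set1(0.5), sub(bip, bim));
        store_partial(a.data(), i0, ai, first, last);
    }
}
```

Blocks always start at multiples of the vector width, so only the first and last block need a non-trivial lane range; interior blocks use all lanes.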

27 Conclusion
  Automated code generation: creating application modules from equations/stencil descriptions
  Cactus software framework for portability
  Three performance/optimization abstractions:
    Efficient bounding box set algebra
    Dynamic loop optimizations
    Explicit vectorization API
