Crossing the Architectural Barrier: Evaluating Representative Regions of Parallel HPC Applications

1 Crossing the Architectural Barrier: Evaluating Representative Regions of Parallel HPC Applications
Alexandra Ferrerón (University of Zaragoza), Radhika Jagtap, Sascha Bischoff, Roxana Rușitoru (ARM)
Senior Research Engineer, Software & Large Scale Systems Research
ISPASS, April 24, 2017. ARM 2017

2 Premise
Design space exploration and sensitivity studies (exascale, perf/W, etc.).
Simulation is expensive: accuracy versus speed.
Find and run only representative parts of parallel applications (ideally as little as possible!).

3 What it's for (example)
Premise: hardware-software co-design. Vary system parameters whilst keeping the software constant.
Diagram labels: Applications; BarrierPoint methodology; Real hardware; Representative parallel sections; Node simulator (e.g. a single gem5 instance); Our contributions.
Currently, this methodology is intended for node-level studies.

4 The methodology (prior art)
Barrier-synchronized (e.g., OpenMP) applications are an important subset of HPC parallel workloads.
Barriers offer a natural synchronization point where all threads align!
BarrierPoint: Sampled simulation of multi-threaded applications [Carlson et al., ISPASS 2014]. Simulate a select number of representative inter-barrier regions (barrier points) and predict the total application performance from those barrier points.
We have implemented and validated the methodology for x86_64 and ARMv8 using HPC benchmarks, on real hardware.
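The prediction step can be sketched as follows. This is a minimal sketch, assuming each selected barrier point carries a multiplier (a weight saying how much of the full run it stands for) and that the total is the weighted sum of the per-barrier-point measurements. The barrier_point struct, its field names and estimate_total are illustrative names, not taken from the BarrierPoint implementation.

```c
#include <assert.h>
#include <stddef.h>

/* One selected barrier point (hypothetical layout): the metric measured
 * for that region (e.g. cycles) and the weight of the regions it represents. */
typedef struct {
    double metric;      /* measured value for this inter-barrier region */
    double multiplier;  /* assumed weight of the regions it stands for */
} barrier_point;

/* Estimate the whole-application metric as the weighted sum of the
 * metrics of the selected barrier points. */
double estimate_total(const barrier_point *bps, size_t n)
{
    assert(bps != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += bps[i].multiplier * bps[i].metric;
    return total;
}
```

Under these assumptions only the selected regions need to be measured or simulated; the multipliers carry the rest of the run.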

5 What is a barrier point (BP)?
    /* A, B, and C are initialized with pseudo-random values. */
    BEGIN_OF_ROI();
    for (itr = 0; itr < 3; itr++) {
        printf("I am part of a BP as well!");
        // C <- C + A x B
        #pragma omp parallel for private(j,k) shared(A,B,C)
        for (i = 0; i < size; ++i) {
            for (j = 0; j < size; ++j) {
                for (k = 0; k < size; ++k) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }
    END_OF_ROI();
Diagram labels: BEGIN OF ROI; master thread; BP.0, BP.1, BP.2; OMP_PARALLEL_START (three occurrences); Sequential code (master thread); Threads align!; END OF ROI.

6 Main contributions
Provide an independent evaluation of the BarrierPoint methodology, tested on HPC proxy applications on x86_64 and ARMv8, on real hardware.
Show that abstract characteristics can be used to identify the representative parts of HPC applications across both ISAs.
Look into architectural features (vector capabilities) and evaluate their impact on the representativeness of the selected phases. Errors are similar to the non-vectorised cases.
Results sneak peek: performance estimation error <2.3% (x86_64 and ARMv8) for cycles and instructions; total instruction count reduction of up to 178x.

7 Cross-architectural evaluation methodology
(1) Obtain barrier points (x86_64). The source code instrumentation is architecture agnostic.
(2) Gather performance counter statistics (x86_64 and AArch64).
(3) Program behaviour reconstruction (x86_64 and AArch64).
(4) Barrier point set validation (x86_64 and AArch64).

8 Barrier point instrumentation and validation
Adding instrumentation at the beginning of each parallel region.
No warm-up issues: complete application run.
Diagram labels (two panels, each showing the master thread from BEGIN OF ROI to END OF ROI): Baseline; With instrumentation per barrier point.
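One way to picture the per-region instrumentation is a hook that snapshots a counter at the start of every parallel region, so that each barrier point is charged the difference between two consecutive snapshots. The sketch below assumes the caller supplies the current value of a monotonically increasing counter; how the counter is read (which performance counter interface, which events) is not stated in the slides. The names bp_log, bp_mark, bp_delta and MAX_REGIONS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REGIONS 1024  /* illustrative upper bound on barrier points */

/* Counter snapshots taken at the start of each parallel region. */
typedef struct {
    unsigned long long snapshot[MAX_REGIONS + 1];
    size_t count;  /* number of snapshots recorded so far */
} bp_log;

/* Call at the start of every parallel region, and once more at the end
 * of the region of interest, passing the current counter value. */
void bp_mark(bp_log *log, unsigned long long counter_now)
{
    assert(log->count <= MAX_REGIONS);
    log->snapshot[log->count++] = counter_now;
}

/* Counter delta attributed to barrier point i (between marks i and i+1). */
unsigned long long bp_delta(const bp_log *log, size_t i)
{
    assert(i + 1 < log->count);
    return log->snapshot[i + 1] - log->snapshot[i];
}
```

The baseline run would skip the per-region marks and record only the two ends of the region of interest, which is one way to expose the collection overhead of the instrumented run.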

9 Experimental configuration #1
Applications: AMGMk, CoMD, Graph500, HPCG, HPGMG-FV, LULESH, MCB, miniFE, Pathfinder, RSBench and XSBench.
Hardware, x86_64: Intel Core 3.4 GHz (4 cores x ); 32 KB L1D + 32 KB L1I and 256 KB L2 per core; 8 MB shared L3.
Hardware, ARM: ARMv8 AppliedMicro 2.4 GHz (4 clusters x 2 cores); 32 KB L1D + 32 KB L1I per core; 256 KB L2 per cluster; 8 MB shared L3 (64-bit).

10 Experimental configuration #2
Configurations: 1, 2, 4, Vectorised/Non-vectorised, x86_64/ARMv8.
Metrics: cycles, instructions, L1-D misses, L2-D misses.
Representativeness: estimation error for performance counter metrics within acceptable bounds (5-10%).
Obtain barrier points (10 sets per configuration, x86_64 only).
Gather performance counter statistics (20x + warm-up per configuration).
Program behaviour reconstruction (10x per configuration).
Barrier point set validation (10x per configuration).
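The representativeness criterion can be illustrated with a small helper. This is a sketch only, assuming the estimation error is the absolute difference between the reconstructed and the measured value of a metric, expressed as a percentage of the measured value, and that a barrier point set is accepted when that error stays within a chosen bound. The function names and the bound parameter are assumptions, not the authors' code.

```c
#include <assert.h>

/* Absolute estimation error, in percent, of a reconstructed metric
 * against the value measured on the full run. */
double abs_error_percent(double estimated, double measured)
{
    assert(measured != 0.0);
    double diff = estimated - measured;
    if (diff < 0.0)
        diff = -diff;
    double ref = measured < 0.0 ? -measured : measured;
    return 100.0 * diff / ref;
}

/* Returns 1 if the error is within bound_percent (e.g. 5.0 or 10.0). */
int is_representative(double estimated, double measured, double bound_percent)
{
    return abs_error_percent(estimated, measured) <= bound_percent;
}
```

For example, an estimate of 102 against a measurement of 100 gives a 2% error, which passes a 5% bound.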

11 Limitations
Applications with a single large parallel region: there is no speedup to be gained. Apps: XSBench, RSBench and Pathfinder.
Statistics collection overhead: an issue when the size of the barrier point is too small. Apps: HPGMG-FV and LULESH.
Statistics collection variability: the number of <stat> is too low, thus the variability impact is higher. Apps: CoMD (L1D$ misses on AArch64).
Instruction count variability due to floating-point operation accuracy. Apps: HPGMG-FV.

12 Results (when it works)
[Figure: average absolute error (%) for cycles, instructions, L1D misses and L2D misses. Series: x86_64, x86_64 vect, ARMv8, ARMv8 vect. Panels: HPCG (instructions selected: 2.76/1.14%) and AMGMk (instructions selected: 3.82/2.52%). Instructions selected are given as non-vectorised/vectorised.]

13 Results #2 (when it works)
[Figure: average absolute error (%) for cycles, instructions, L1D misses and L2D misses. Series: x86_64, x86_64 vect, ARMv8, ARMv8 vect. Panels: CoMD (instructions selected: 2.07/1.42%) and miniFE (instructions selected: 0.56/0.59%). Instructions selected are given as non-vectorised/vectorised.]

14 Results (when it doesn't work)
[Figure: average absolute error (%) for cycles, instructions, L1D misses and L2D misses. Panels: HPGMG-FV (series: x86_64, ARMv8) and LULESH (series: x86_64, x86_64 vect, ARMv8, ARMv8 vect).]

15 Conclusions
Independent evaluation of the BarrierPoint methodology: HPC proxy applications, x86_64 and ARMv8, AVX and NEON.
Results show that we can identify representative regions on x86_64 and validate them on ARMv8, with an error within 3% for all statistics (exception: CoMD L1D misses on ARM).
Instruction count reduction from 2x to 178x.
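The instruction count reduction can be related to the "instructions selected" percentages on the results slides. The sketch below assumes the reduction factor is simply the reciprocal of the fraction of instructions selected; this is an interpretation, not something the slides state. Under that assumption, selecting 0.56% of the instructions (the non-vectorised miniFE figure) corresponds to roughly 178x, and selecting 50% corresponds to 2x. The function name is hypothetical.

```c
#include <assert.h>

/* Reduction in instruction count when only selected_percent of the
 * application's instructions are run (assumed to be 100 / percent). */
double reduction_factor(double selected_percent)
{
    assert(selected_percent > 0.0 && selected_percent <= 100.0);
    return 100.0 / selected_percent;
}
```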

16 Future work
Evaluate the applicability of the methodology across different core types, such as in-order versus out-of-order.
Validate the representative sections against a more comprehensive set of performance counters.
Adjust the size of barrier points so that more applications benefit from the BarrierPoint methodology, such as RSBench, XSBench, and LULESH.
Generalise the implementation to work on non-OpenMP applications.
Quantify cross-architectural ISA differences, and explore the methodology's cross-architectural applicability limits.
Ferrerón was supported in part by grants gaz: T48 research group (Aragón Gov. and European ESF), TIN C2-1-P, TIN C2-1-R, Consolider NoE TIN REDC (Spanish Gov.) and HiPEAC-3 NoE (European FET FP7/ICT ). Rușitoru has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement N

17 Thank you! Questions?
