The AutoTune Project


1 The AutoTune Project
Siegfried Benkner (on behalf of the AutoTune consortium), Research Group Scientific Computing, University of Vienna
AutoTune: International Workshop on Code Auto-Tuning, CGO 2015, San Francisco, Feb 7, 2015

2 Outline
- Project Objectives
- Periscope Tuning Framework
- Tuning Model & API
- Selected Tuning Plugins
- Conclusion

3 AutoTune Project
European FP7 ICT Project: Automatic Online Tuning, Oct - Apr
Partners
- Technische Universität München (Coordinator)
- University of Vienna
- Universitat Autonoma de Barcelona
- National University of Ireland - ICHEC
- Leibniz Supercomputing Centre LRZ
- CAPS Entreprise (until 06/2014)
- IBM Germany (associated partner)

4 Project Objectives
Combine performance analysis and tuning into a single framework: the Periscope Tuning Framework (PTF).
Extend Periscope with automatic tuning plugins for performance and energy-efficiency tuning.


5 Project Objectives
Facilitate online tuning: performance analysis and tuning at runtime.
Use expert knowledge to guide the search for performance properties and tuned versions.
Diagram: at design time, coding is followed by performance analysis and tuning; at runtime, performance analysis and tuning are repeated.

6 Project Objectives
Parallel architectures
- Multicore servers
- Large-scale HPC systems (e.g. SuperMUC)
- Accelerated systems (GPU, Xeon Phi)
Support different programming paradigms/tools
- MPI, MPI/OpenMP
- HMPP/OpenACC/OpenCL; Patterns (PEPPHER)
Autotuning at different layers of the software stack
- High-level language (directives/patterns)
- Compilers / transformation systems
- Runtime systems and libraries


7 Periscope Tuning Framework
PTF reflects Periscope's design
- Iterative online performance analysis
- Distributed, scalable architecture
- Search for performance bottlenecks
- Phase regions (SIR)
Tuning plugins
- Tuning parameters & regions & objective
- Use/extend Periscope's performance properties and search strategies
- Execute tuning scenarios (in parallel)
Architecture diagram: PTF User Interface; PTF Frontend (Search Algorithms, Tuning Plugins, Static Program Info, Experiment Execution); Agent Network for Performance Analysis; Monitors attached to the Application.

8 Tuning Plugins
1. MPI Parameter Tuning
2. Pipeline Patterns for Accelerated Architectures
3. DVFS Plugin
4. Compiler Flags Selection
5. MPI Master/Worker Tuning
6. OpenCL Worksize Tuning
7. Parallelism Capping Plugin
8. MPI-IO Plugin
9. Hybrid Manycore HMPP Codelets
10. Combined Tuning Plugins

9 PTF Tuning Model
- Analysis strategy: guides the search for performance properties.
- Plugin strategy: guides the search for tuned versions based on found performance properties, expert knowledge, ...
- Result: a tuning recommendation that can be integrated into production code.


10 Tuning Step
- Pre-Analysis (optional): determine performance properties.
- Create Search Space: create the space of all tuning variants to consider.
- Create Scenario(s): select and build the scenarios to be tested.
- Prepare Scenario(s): do whatever is needed to apply the scenario (e.g., recompilation, ...).
- Run/Analyse Scenario(s): choose and run the scenario(s) to be executed in one step.
Flow chart: a tuning step starts with Pre-Analysis and Create Search Space. Each search step then runs Create Scenario(s), Prepare Scenario(s) and Run/Analyse Scenario(s). "Search finished?" (no: next search step) is followed by "Tuning finished?" (no: next tuning step).

11 PTF Tuning API
Frontend components shown: Finite State Machine, Scenario Execution Engine, Application restart.

Tuning Strategy Plugin
  void initplugin();
  void createscenarios();
  void preparescenarios();
  void compile();
  void getrestartinfo();
  void defineexperiment();
  bool issearchfinished();
  bool istuningdone();
  void getadvice();

Search Engine
  void addsearchspace(VariantSpace*);
  void setobjective(int objective);
  void createscenarios();
  bool issearchfinished();
  int getoptimum();
  map<int, MetaProperty*> getsearchpath();
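The interplay between the tuning step flow and the plugin methods listed above can be illustrated with a small, self-contained sketch. It is not PTF code: the abstract class TuningPlugin, the ToyPlugin, the runTuning driver and the toy cost function are all hypothetical, and only a subset of the method names from the slide is used, with assumed signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical interface: method names follow the slide, signatures are assumed.
struct TuningPlugin {
    virtual ~TuningPlugin() = default;
    virtual void initplugin() = 0;
    virtual void createscenarios() = 0;
    virtual void preparescenarios() = 0;
    virtual void defineexperiment() = 0;
    virtual bool issearchfinished() = 0;
    virtual bool istuningdone() = 0;
    virtual void getadvice() = 0;
};

// Toy plugin: tries a fixed list of values and remembers the cheapest one.
struct ToyPlugin : TuningPlugin {
    std::vector<int> candidates{1, 2, 4, 8};
    std::size_t next = 0;
    int current = 0;
    int best = -1;
    double bestCost = 1e300;
    int experiments = 0;

    void initplugin() override { next = 0; }
    void createscenarios() override { current = candidates[next++]; }
    void preparescenarios() override {}  // e.g. recompilation would go here
    void defineexperiment() override {
        // Stand-in for a real measurement: the cost is minimal at value 4.
        double cost = (current - 4) * (current - 4) + 1.0;
        ++experiments;
        if (cost < bestCost) { bestCost = cost; best = current; }
    }
    bool issearchfinished() override { return next >= candidates.size(); }
    bool istuningdone() override { return true; }  // a single tuning step
    void getadvice() override {}  // a real plugin would report `best` here
};

// Driver loop mirroring the flow chart: search steps nested in tuning steps.
inline void runTuning(TuningPlugin& p) {
    p.initplugin();
    do {
        do {
            p.createscenarios();
            p.preparescenarios();
            p.defineexperiment();
        } while (!p.issearchfinished());
    } while (!p.istuningdone());
    p.getadvice();
}
```

Calling runTuning on a ToyPlugin evaluates each candidate once and leaves the cheapest value in its best member.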

12 PTF Search Strategies
- Exhaustive search
- Probabilistic random search: samples the search space according to a probability model.
- Individual search: tuning parameters are considered individually according to their importance.
- GDE3 multi-objective genetic search: Generalized Differential Evolution 3 genetic algorithm.
- Active Harmony's Nelder-Mead Simplex algorithm: interfaces the Harmony server to obtain new scenarios in each search step.
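To make the difference between exhaustive and individual search concrete, here is a minimal sketch under stated assumptions. The types Config and Objective and both functions are hypothetical, not PTF's search engine. The individual search processes parameters in the order given, which stands in for the importance ordering mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Config = std::vector<int>;                         // one value per tuning parameter
using Objective = std::function<double(const Config&)>;  // lower is better
using Space = std::vector<std::vector<int>>;             // allowed values per parameter

// Exhaustive search: evaluate every combination of parameter values.
inline Config exhaustiveSearch(const Space& space, const Objective& f, int& evaluations) {
    Config current(space.size()), best;
    double bestCost = 0.0;
    bool found = false;
    std::function<void(std::size_t)> visit = [&](std::size_t dim) {
        if (dim == space.size()) {
            double c = f(current);
            ++evaluations;
            if (!found || c < bestCost) { best = current; bestCost = c; found = true; }
            return;
        }
        for (int v : space[dim]) { current[dim] = v; visit(dim + 1); }
    };
    visit(0);
    return best;
}

// Individual search: tune one parameter at a time, keeping the others fixed
// at their best values so far (initially the first value of each parameter).
// Assumes every parameter has at least one allowed value.
inline Config individualSearch(const Space& space, const Objective& f, int& evaluations) {
    Config best;
    for (const auto& values : space) best.push_back(values.front());
    for (std::size_t dim = 0; dim < space.size(); ++dim) {
        int bestValue = best[dim];
        double bestCost = 0.0;
        bool first = true;
        for (int v : space[dim]) {
            Config trial = best;
            trial[dim] = v;
            double c = f(trial);
            ++evaluations;
            if (first || c < bestCost) { bestValue = v; bestCost = c; first = false; }
        }
        best[dim] = bestValue;
    }
    return best;
}
```

Exhaustive search evaluates the product of the per-parameter value counts, while individual search evaluates only their sum. The price is that individual search can miss the optimum when parameters interact.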

13 MPI Parameter Tuning Plugin
Tuning of selected parameters of MPI implementations: MP_EAGER_LIMIT, MP_BUFFER_MEM, ...
- e.g., tradeoff between performance and memory usage
Tuning Configuration File
- Parameters to tune and value ranges
- Search strategy to be used

14 Eager Limit
New Performance Metrics
- PSC_MPI_MSG_P2P_THR: total number of bytes transferred near the eager limit (1KB - 64KB).
- PSC_MPI_MSG_P2P_TOT: total number of bytes transferred using point-to-point communication.
- PSC_MPI_MSG_P2P <2K;16K;32K;64K>: total number of messages (count) in certain size ranges.
New Performance Property: EagerLimitDependency
- Fraction of the total MPI point-to-point traffic near a valid eager limit.
- Determines whether the generated point-to-point traffic is sensitive to alterations of the eager limit.
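Read as a formula, the property above is the ratio of the two byte counters. The following sketch assumes exactly that reading. The function names and the 10% decision threshold are hypothetical and are not taken from PTF.

```cpp
#include <cassert>
#include <cstdint>

// Fraction of point-to-point bytes transferred near the eager limit.
// The inputs correspond to PSC_MPI_MSG_P2P_THR and PSC_MPI_MSG_P2P_TOT.
inline double eagerLimitDependency(std::uint64_t bytesNearLimit, std::uint64_t bytesTotal) {
    if (bytesTotal == 0) return 0.0;  // no point-to-point traffic at all
    return static_cast<double>(bytesNearLimit) / static_cast<double>(bytesTotal);
}

// Hypothetical decision rule: the default threshold is an assumption.
inline bool worthTuningEagerLimit(double dependency, double threshold = 0.1) {
    return dependency >= threshold;
}
```

A low fraction suggests that changing the eager limit would affect little of the traffic, so the plugin could skip that parameter.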

15 EAGER_LIMIT & BUFFER_MEM
FSSIM code on SuperMUC (4 nodes, 64 cores)
- Default (IBM MPI): EAGER_LIMIT 32 KB, BUFFER_MEM 64 MB
- Best (FSSIM code): EAGER_LIMIT 1 KB, BUFFER_MEM 128 MB
- Improvement: ~30%

16 High-Level Pattern Tuning
PEPPHER HL Programming Framework (www.peppher.eu)
Component-based Approach
- Components with multiple implementation variants (different programming models)
- High-level coordination mechanisms (task parallelism); patterns
Intelligent Runtime System (StarPU)
- Asynchronous task-based execution model
- Dynamic variant selection & scheduling to all processing units


17 PEPPHER Methodology
Cholesky Factorization

  FOR k = 0..TILES-1
    POTRF(A[k][k])
    FOR m = k+1..TILES-1
      TRSM(A[k][k], A[m][k])
    FOR n = k+1..TILES-1
      SYRK(A[n][k], A[n][n])
      FOR m = n+1..TILES-1
        #pragma pph call
        GEMM(A[m][k], A[n][k], A[m][n])

Diagram: the component calls form a dynamic task graph with data dependencies (P, T, S and G nodes for POTRF, TRSM, SYRK and GEMM). The «interface» GEMM has the variants «variant» GEMM_CPU and «variant» GEMM_GPU (from the PLASMA/MAGMA libraries). The runtime scheduler maps tasks to the CPU and the GPUs.
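As a rough illustration of how many tasks the loop nest above hands to the runtime, the following sketch walks the same loops and counts the component calls per kernel. The struct TaskCounts and the function countCholeskyTasks are hypothetical helpers. No linear algebra is performed.

```cpp
#include <cassert>

// Number of component calls per kernel for a tiled Cholesky factorization.
struct TaskCounts {
    int potrf = 0;
    int trsm = 0;
    int syrk = 0;
    int gemm = 0;
};

// Walks the same loop nest as the pseudocode and counts the generated tasks.
inline TaskCounts countCholeskyTasks(int tiles) {
    TaskCounts c;
    for (int k = 0; k < tiles; ++k) {
        ++c.potrf;                                     // POTRF(A[k][k])
        for (int m = k + 1; m < tiles; ++m) ++c.trsm;  // TRSM(A[k][k], A[m][k])
        for (int n = k + 1; n < tiles; ++n) {
            ++c.syrk;                                  // SYRK(A[n][k], A[n][n])
            for (int m = n + 1; m < tiles; ++m) ++c.gemm;  // GEMM(...)
        }
    }
    return c;
}
```

GEMM calls grow cubically with the number of tiles, which is why the choice between a CPU and a GPU variant for that component matters most.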


18 Pipeline Pattern Plugin
- Pipeline stages correspond to component calls
- Support for stage replication and stage merging
- High-level buffer management (size, order-type)

  #pragma pph pipeline
  while (...) {
    a(x)
    #pragma stage replicate(?)
    b(x,y)
    c(y)
  }

Programmer-controlled autotuning: replicate(?), replicate(2:10:2)
Diagram: the <<interface>> B has the variants B_CPU and B_GPU; the pipeline A, B, C is shown with stage B replicated.

19 Pipeline Pattern Tuning
Performance Analysis
- Stage execution times
- Buffer wait times
- Overall pipeline execution time
- Result: determine the slowest pipeline stage (limiter stage)
Tuning Parameters
- Stage replication factor, buffer sizes
- Number of CPUs/GPUs to use
- Runtime scheduling strategy (HEFT, EAGER, ...)
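The limiter-stage analysis above can be sketched in a few lines. The function limiterStage simply picks the stage with the largest measured time. The function suggestedReplication is a hypothetical heuristic, not the plugin's strategy: it replicates the limiter until its effective time per item is no larger than that of the next-slowest stage. Both functions assume a non-empty list of stage times.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the slowest stage, given measured per-item stage execution times.
// Precondition: stageTimes is not empty.
inline std::size_t limiterStage(const std::vector<double>& stageTimes) {
    return static_cast<std::size_t>(
        std::max_element(stageTimes.begin(), stageTimes.end()) - stageTimes.begin());
}

// Hypothetical heuristic: replicate the limiter so that its effective time per
// item drops to the level of the next-slowest stage, capped at maxReplicas.
inline int suggestedReplication(const std::vector<double>& stageTimes, int maxReplicas) {
    std::size_t lim = limiterStage(stageTimes);
    double second = 0.0;
    for (std::size_t i = 0; i < stageTimes.size(); ++i)
        if (i != lim) second = std::max(second, stageTimes[i]);
    if (second <= 0.0) return maxReplicas;  // single stage: replicate as far as allowed
    int r = static_cast<int>(std::ceil(stageTimes[lim] / second));
    return std::max(1, std::min(r, maxReplicas));
}
```

Such an estimate could serve as a starting point for the search. The plugin described here explores the replication factor together with buffer sizes and resource counts.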

20 PEPPHER Framework

  #pragma pipeline buffer(?)
  while (b != 0) {
    readblock(file, b);
    #pragma stage replicate(?)
    compress(b);
    writecomprblock(file, b);
  }

Workflow diagram, by phase:
- Annotation: C/C++ with PEPPHER annotations
- Compilation: the Vienna Transformation System produces adaptive target code plus Periscope instrumentation calls, tuning parameters and a SIR file with the code regions
- Instrumentation: target compiler(s) build the executable
- Execution: the executable runs with modified tuning parameter(s) under the MRI monitor
- Measurement: metrics and performance data are passed to the Periscope Analysis Agent
- Online Tuning: the PTF Frontend, with the Pipeline Tuning Plugin and the Search Engine, selects the best tuning scenario (example: R = 4, B = 8, NGPU = 2, NCPU = 4)


21 Face Detection Application
- CPU & GPU variants for the middle stages re-engineered from the OpenCV library
- Can transparently utilize all resources in a heterogeneous system

  ...
  #pragma pph pipeline
  while (inputstream >> file) {
    readimage(file, image);
    #pragma pph stage replicate(?)
    {
      resizeandcolorconvert(image);
      detectface(image, outimage);
    }
    writefacedetectedimage(file, outimage);
  }

22 Results: Face Detection
Architecture: 2 Intel Xeon X5550 + NVIDIA C2050 + C1060
360 scenarios, whole machine. Best: 8.2 s, worst: 19.6 s.

  Parameter     Values        Best
  Repl. Factor  1,2,4,8       8
  CPU cores     1,2,4,6,8     6
  GPUs          0,1,2         2
  Scheduling    eager, heft   heft
  Buffer Size   8, 16, 32     32
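The figure of 360 scenarios is consistent with a full Cartesian product over the listed parameter values, assuming two scheduling strategies (eager and heft). A minimal check, with scenarioCount as a hypothetical helper:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Size of the full search space: the product of the value counts per parameter.
inline std::size_t scenarioCount(const std::vector<std::size_t>& valuesPerParameter) {
    std::size_t n = 1;
    for (std::size_t v : valuesPerParameter) n *= v;
    return n;
}
```

For the table above, the counts per parameter are 4 replication factors, 5 CPU core settings, 3 GPU settings, 2 scheduling strategies and 3 buffer sizes, so scenarioCount({4, 5, 3, 2, 3}) gives 4 x 5 x 3 x 2 x 3 = 360.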

23 Results: Face Detection
Architecture: 2 Intel Xeon E5-2650 + 4 NVIDIA K20
1260 scenarios, whole machine. Best: 4.6 s, worst: 15.3 s.

  Parameter     Values          Best
  Repl. Factor  1,2,4,8,16,32   16
  CPU cores     1,2,4,8,16,32   16
  GPUs          0,1,2,3,4       4
  Scheduling    eager, heft     heft
  Buffer Size   32,64,128       64

24 DVFS Plugin (SandyBridge)
Changes the clock frequency for certain code regions to optimize energy to solution for MPI/OpenMP applications.
enopt library: utilizes cpufreq (userspace governor) and PAPI-RAPL features.
Performance Analysis
- SuitedForEnergyConfiguration property: identifies suitable code regions
- Measurements for the Energy Prediction Model (GIPS, CPI, L2/L3 cache misses), performed at some frequency f_0
Tuning Strategy
- The Energy Prediction Model predicts the best frequency f_n for each region
- Evaluate three frequencies [f_{n-1}, f_n, f_{n+1}] per region
- Exhaustive search strategy
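The tuning strategy above narrows the search to the predicted frequency and its two neighbours. A minimal sketch of that step follows. It assumes a sorted list of available frequencies and treats the predicted index n and the energy measurement callback as inputs. The prediction model itself is not reproduced, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Candidates [f_{n-1}, f_n, f_{n+1}] around the predicted index n,
// clipped to the range of available frequencies.
inline std::vector<double> candidateFrequencies(const std::vector<double>& available,
                                                std::size_t n) {
    std::vector<double> out;
    if (available.empty()) return out;
    if (n >= available.size()) n = available.size() - 1;
    if (n > 0) out.push_back(available[n - 1]);
    out.push_back(available[n]);
    if (n + 1 < available.size()) out.push_back(available[n + 1]);
    return out;
}

// Exhaustive search over the candidates: keep the frequency with the lowest
// measured energy. Precondition: candidates is not empty.
inline double bestFrequency(const std::vector<double>& candidates,
                            const std::function<double(double)>& measureEnergy) {
    double best = candidates.front();
    double bestEnergy = measureEnergy(best);
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        double e = measureEnergy(candidates[i]);
        if (e < bestEnergy) { bestEnergy = e; best = candidates[i]; }
    }
    return best;
}
```

With three candidates per region, the number of experiments stays small even though the search itself is exhaustive.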

25 DVFS Plugin
Energy Prediction Model (based on work by Auweter et al.)

26 DVFS Plugin
Energy Prediction Model (based on work by Auweter et al.)

27 Results: DVFS Plugin
SeisSol code on SuperMUC, 3 suitable regions.
Plot: energy (J) and time (s).

28 MPI Master/Worker Plugin
Tuning of master/worker MPI applications
- Optimizing execution time through improved load balancing
Performance Analysis: execution times, communication times, imbalance, ...
Tuning Parameters and Strategy
- Partition factor (decreasing-size factoring)
- Number of workers
- Analytical models to reduce the search space
- 2-step tuning strategy (f, then n)
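One common way to read "decreasing-size factoring" is that each batch hands out a fixed fraction f of the remaining tasks, split evenly among the n workers, so chunks shrink as the work runs out. The sketch below implements that reading only. It is an assumption about the scheme, not the plugin's implementation, and factoringChunks is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Chunk size per worker for each batch. Every batch distributes the fraction f
// of the remaining tasks among `workers` workers, with at least one task each.
// Preconditions: 0 < f <= 1 and workers >= 1.
inline std::vector<long> factoringChunks(long totalTasks, double f, int workers) {
    std::vector<long> chunks;
    long remaining = totalTasks;
    while (remaining > 0) {
        long batch = static_cast<long>(remaining * f);
        long chunk = std::max(1L, batch / workers);
        chunks.push_back(chunk);
        // The last batch may be only partly filled.
        remaining -= std::min(remaining, chunk * workers);
    }
    return chunks;
}
```

Large early chunks keep communication with the master low, while small late chunks even out the finishing times of the workers. A larger f shifts the balance towards fewer, bigger messages.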

29 Tuning Plugins...
MPI-IO Plugin
- Tunes the number of aggregators and the collective buffer size for two-phase IO, and the data sieving buffer size for non-collective operations.
Parallelism Capping Plugin
- Finds the number of OpenMP threads for individual parallel regions that minimizes the Energy Delay Product (EDP) objective function.
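The Energy Delay Product is the product of consumed energy and execution time, so the capping decision reduces to picking the thread count with the smallest product. A minimal sketch, with the Measurement struct and both function names as hypothetical helpers:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One measured run of a parallel region at a given thread count.
struct Measurement {
    int threads;
    double energyJoules;
    double timeSeconds;
};

// Energy Delay Product: energy multiplied by execution time (lower is better).
inline double energyDelayProduct(const Measurement& m) {
    return m.energyJoules * m.timeSeconds;
}

// Thread count with the smallest EDP. Precondition: runs is not empty.
inline int bestThreadCount(const std::vector<Measurement>& runs) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < runs.size(); ++i)
        if (energyDelayProduct(runs[i]) < energyDelayProduct(runs[best])) best = i;
    return runs[best].threads;
}
```

Because EDP weights time as well as energy, the selected thread count is not necessarily the fastest or the most frugal configuration, but the best compromise between the two.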

30 Tuning Plugins...
Compiler Optimization Tuning Plugin
- Reduces the execution time of the application's phase region(s) by selecting the best combination of compiler flags for the application.
Combined Tuning Plugin
- Meta-plugin combining compiler flag selection, energy, and MPI parameter tuning
- Fixed vs. adaptive sequence

31 Conclusion
AutoTune Project
- Combining performance analysis and tuning
- Integration of expert knowledge
PTF: flexible, plugin-based architecture
- Support for different programming models/tools/architectures
- Tuning at different layers of the software stack
- Pluggable search strategies
- Combined tuning plugins
PTF release with selected tuning plugins available soon
