ELASTIC: Dynamic Tuning for Large-Scale Parallel Applications

Size: px

Start display at page:

Download "ELASTIC: Dynamic Tuning for Large-Scale Parallel Applications"

Clinton Pierce
5 years ago
Views:

Workshop on Extreme-Scale Programming Tools 18th November 2013 Supercomputing 2013 ELASTIC: Dynamic Tuning for Large-Scale Parallel Applications Toni

1 Workshop on Extreme-Scale Programming Tools 18th November 2013 Supercomputing 2013 ELASTIC: Dynamic Tuning for Large-Scale Parallel Applications Toni Espinosa Andrea Martínez, Anna Sikora, Eduardo César and Joan Sorribes Universitat Autónoma de Barcelona Computer Architecture and Operating Systems Departament

2 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.

Motivation Centralised Architecture of Tuning Tools Analysis Analysis and and and tuning tuning module module Analysis and tuning module Elimination of a single

3 Motivation Centralised Architecture of Tuning Tools Analysis Analysis and and and tuning tuning module module Analysis and tuning module Elimination of a single centralised control point. BE daemon BE daemon... BE daemon... BE daemon Task 0 Task 01 Task n-1 1 Task n-1 Distribution of the analysis and tuning process, remaining effective. 3/22

4 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.

5 Hierarchical Tuning Network Decompose. A base level of analysis and tuning modules (ATM) that controls disjoint domains of application tasks. ATM user Source Code Local performance improvements are achieved. Executable Application Memory Execution time tool Monitoring Analysis Tuning and how to obtain global Monitoring performance orders. improvements? Events. Tuning Analysis orders. and tuning domain 5/22

6 Hierarchical Tuning Network Abstract. The abstraction mechanism is carried out by the ATMs. representing the tasks of the virtual parallel application 6/22

7 Hierarchical Tuning Network Decompose. Abstract. ATM Abstractor The Theactuation virtual of parallel each application ATM of theis network decomposed gives a hierarchical distribution of the analysis and tuning process and then analysed and tuned by ATMs located at the high level. Source Code user Executable Application Memory Execution time tool Monitoring Analysis Tuning 7/22

8 Abstraction Mechanism Parallel Application Task ATM Level i+1 Level i+1 Instrumentation Order Translator Abstractor Event Creator INTERNAL API Event Manager Performance Evaluator Instrumentation Order Sender Event to the parent level Instrumentation orders from the parent level Parallel Application Task ATM Level i Level i Instrumentation Order Translator Abstractor Event Creator INTERNAL API Event Manager Performance Evaluator Instrumentation Order Sender Events from the immediate children Instrumentation orders to the immediate children 8/22

9 Knowledge in the Tuning Network Performance Performance Model Model Monitoring Monitoring Points Points Performance Performance Expressions Expressions Tuning Points, Tuning Actions Points, and Actions Synchronisation and Synchronisation Method Method Abstraction Abstraction Model Model How to translate How to a translate monitoring a monitoring order order How to translate How to a translate tuning order a tuning order How to create How a to new create event a new event How to decompose How to decompose the real or the virtual real parallel or virtual application parallel application 9/22

10 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.

11 ELASTIC Prototype implementation in C++. For MPI parallel applications. Target systems: UNIX based supercomputers. user Source Code Executable Application Memory Execution time Dynamic ELASTIC instrumentation Event tracing Monitoring Analysis Tuning Performance and abstraction models Dynamic instrumentation 11/22

......... ELASTIC: Architecture ELASTIC Front-End Abstractor-ATM ELASTIC Front-End Abstractor-ATM pair

configured to accommodate the size of the parallel application and the complexity of the tuning strategy

12 ELASTIC: Architecture ELASTIC Front-End Abstractor-ATM ELASTIC Front-End Abstractor-ATM pair ELASTIC Back-Ends TMLib Abstractor-ATM Abstractor-ATM Abstractor-ATM The tuning network topology can be configured to accommodate the size of the parallel application and the complexity of the tuning strategy being employed. Abstractor-ATM Abstractor-ATM... Abstractor-ATM... Abstractor-ATM ELASTIC Back-End... ELASTIC Back-End ELASTIC Back-End... ELASTIC Back-End ELASTIC Back-End... ELASTIC Back-End TMLib Application Task TMLib Application Task TMLib Application Task TMLib Application Task TMLib Application Task TMLib Application Task 12/22

PerformanceEvaluator::InitialMonitoringOrders() bool PerformanceEvaluator::NewEvent(Event *e) vector<order> PerformanceEvaluator::EvaluatePerformance() vector<monitoring Order>

13 ELASTIC Package Set of code and configurations that implements the performance and abstraction model. Performance Model Monitoring Points Performance Expressions Tuning Points, Actions and Synchronisation Method Abstraction Model How to translate a monitoring order vector<monitoring Order> PerformanceEvaluator::InitialMonitoringOrders() bool PerformanceEvaluator::NewEvent(Event *e) vector<order> PerformanceEvaluator::EvaluatePerformance() vector<monitoring Order> InstrumentationOrderTranslator:: TranslateMonitoringOrder(MonitoringOrder *mo) vector<tuning Order> InstrumentationOrderTranslator:: TranslateTuningOrder(TuningOrder *to) How to translate a tuning order How to create a new event How to decompose the real or virtual parallel application bool EventCreator::NewEvent(Event *e) vector<event> EventCreator::CreateEvent() 13/22

14 ELASTIC Package Codification of the ELASTIC Package based on subclassing Abstractor-ATM components. This plugin architecture converts ELASTIC into a general purpose tuning tool and gives it the flexibility to tackle a wide range of performance problems 14/22

15 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.

Experimental Evaluation The evaluation consists of Executing a parallel application which presents a specific performance problem and using ELASTIC to dynamically detect and resolve the problem.

16 Experimental Evaluation The evaluation consists of Executing a parallel application which presents a specific performance problem and using ELASTIC to dynamically detect and resolve the problem. Synthetic Parallel Application Real Agent-based Parallel Application SPMD Load balancing problem Execution Environment: Supercomputer SuperMUC at LRZ compute nodes ( cores). Each node has 2 8-core 2.7 GHz Intel Xeon processors and 32GB main memory. SuSe Linux. 16/22

17 Experimental Evaluation Results Execution time (s) % Original application Application tuned by ELASTIC 50% 46.8% 40.6% 29.4% 25.8% Number of tasks 17/22

18 Experimental Evaluation Results 300 Original application Application tuned by ELASTIC % Execution time (s) % 28.3% 30.3% Number of tasks 18/22

19 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.

20 Conclusions The distribution of the dynamic tuning process through a hierarchical tuning network of analysis modules has been defined. ELASTIC, a tool that implements the proposed design, has been developed. It offers dynamic tuning through dynamic monitoring, automatic performance analysis and dynamic modifications. It presents an adaptable topology and a plugin architecture. The encouraging results obtained from the experimental evaluation using ELASTIC show that our approach is effective for large-scale dynamic tuning. 20/22

21 Future Work Creation of general ELASTIC Packages which solve a given performance problem. It would be required a small adaptation to applied them to specific parallel applications. Combine our approach with the one implemented under the AutoTune project. 21/22

22 Workshop on Extreme-Scale Programming Tools Supercomputing 2013 ELASTIC: Dynamic Tuning for Large-Scale Parallel Applications Thank you Toni Espinosa, Andrea Martinez, Anna Sikora, Eduardo César and Joan Sorribes Universitat Autónoma de Barcelona Computer Architecture and Operating Systems Departament

23 Tuning Network Topology Parallel Application of N tasks The structure of the topology will depend on the number of levels in the hierarchy and the number of Abstractor-ATM pairs in each level The use of tuning networks composed of the minimum number of non-saturated Abstractor-ATM pairs. The maximum domain size that an Abstractor-ATM pair can manage without becoming saturated. 23/66

24 Modelling an analysis and tuning process N *T m * E a E T a a N ( ) T t * f rp E c *T c N *T m N *T m N *T m Event Batch 1 N 1 N 1 N Event Batch N *T m N *T m N *T m T a N 1 N 1 N 1 N N *T m N *T m N *T m T a N ( ) ( ) Event Batch 1 N 1 N 1 N N *T m N *T m N *T m T a N ( ) Event Batch 1 N 1 N 1 N 24/66

25 Calculating the number of tasks that an analysis module can manage without becoming saturated f analysis = f e E a Period analysis Period analysis N *T m * E a Parallel Application of 256 tasks + E T a a ( N) *T c T t * f rp E c 1 T f analisis a f T m E a a f T c E e c f rp T 64 t /66

26 Experimental Evaluation Application and performance problem Logical layout: 2D grid. Iteration pattern: o Computation. o Communication. work units Computation Communication fixed size Load imbalance: hotspots of additional workload were introduced into the application at runtime. Scenario 1 10% Scenario 2 21% 1024 Tasks (32x32 grid) - After 1st Injection Tasks (32x32 grid) - After 2nd Injection Tasks (32x32 grid) - After 3rd Injection Work Units Work Units /66 0

Per task Experimental Evaluation ELASTIC Package to balance the work units Performance Model Abstraction Model Performance Expressions Measurement Points Constraint: the work units only can be moved

27 Per task Experimental Evaluation ELASTIC Package to balance the work units Performance Model Abstraction Model Performance Expressions Measurement Points Constraint: the work units only can be moved between neighbouring tasks Input: work units. work units iteration id task id Place work() Output: work units to send. Tuning Points, Action and Synchronisation Points: [send_north, send_south, send_east, send_west] Action: set the value of these variables. Synchronisation: at the beginning of the migration phase. Migration function 27/66

28 Experimental Evaluation ELASTIC Package to balance the work units Performance Model Abstraction Model Decomposition Scheme Real/Virtual Application Constraint: communication pattern between tasks Monitoring Order Translation & Event Creation Tuning Network Monitoring Order Tuning Order Translation Tuning Network Events [send_south, 60] [send_south, 20] [send_south, 20] [send_south, 20] Tuning Order work units 28/66

29 Experimental Evaluation Experimentation Plan For the two scenarios Maximum domain size = iterations. 20 work units per task. The additional load is proportional to the size of the parallel application. 29/66

30 Experimental Evaluation Results Load State 4096 tasks parallel application (64x64 grid) Original application Virtual application 30/66

31 Experimental Evaluation Results Execution time (s) Original application Application tuned by ELASTIC 65.1% 63.1% 54.6% 50.1% 33.1% 22.9% Number of tasks 31/66

32 Experimental Evaluation Results Iteration Time Iteration time (ms) Iteration time (ms) 256 Tasks Synthetic Application 1024 Tasks Synthetic Application 2304 Tasks Synthetic Application 256 Task Synthetic Application Original Application Original Application 2000 Application tuned by ELASTIC 2000 Application tuned by ELASTIC Ideal balanced time Ideal balanced time Iteration Iteration Iteration time (ms) Iteration time (ms) 1024 Task Synthetic Application Original Application 2000 Application tuned by ELASTIC Ideal balanced time Iteration 4096 Tasks Synthetic Application 9216 Tasks Synthetic Application Tasks Synthetic Application Task Synthetic Application Original Application 2000 Application tuned by ELASTIC Ideal balanced time Iteration time (ms) 256 Tasks Synthetic Application 256 Task Synthetic Application 9216 Task Synthetic Application Original Application 2000 Application tuned by ELASTIC Ideal balanced time Original Application Iteration time (ms) Application tuned by ELASTIC Ideal balanced time Iteration Task Synthetic Application Iteration Iteration Iteration Iteration time (ms) Task Synthetic Application Original Application 2000 Application tuned by ELASTIC Ideal balanced time /66

33 Experimental Evaluation Results Execution time (s) % Original application Application tuned by ELASTIC 50% 46.8% 40.6% 29.4% 25.8% Number of tasks 33/66

34 Experimental Evaluation Synthetic Parallel Application Real Agent-based Parallel Application Application and performance problem. ELASTIC Package developed. Experimentation plan. Results. 34/66

35 Task 2 Task n-1 Task 0 Task 1 Experimental Evaluation Application and performance problem Large-scale agent-based simulation. Simulates an epidemic model. A AA A A A A A Communication A A A A Communication pattern: any-to-any A A A A A A A Load imbalance problem due to the dynamic behaviour of the agents: o Births and death. o Time required to process an agent is not uniform. 35/66

36 Per task Experimental Evaluation ELASTIC Package to balance the computation time Performance Model Abstraction Model Performance Expressions Measurement Points Marquez et al Migrations can be between any two tasks in the analysis and tuning domain time # agent iteration id Place phase_4() Input: #agents and computation time. Output: #agents to send to each task. task id Tuning Points, Action and Synchronisation Points: [intradomain_migrate[], interdomain_migrate[]] Action: set the value of these variables. Synchronisation: at the beginning of the migration phase. Migration functions. 36/66

37 Experimental Evaluation ELASTIC Package to balance the computation time Performance Model Abstraction Model Decomposition Scheme Real/Virtual Application Constraint: domains with the same size Monitoring Order Translation & Event Creation Tuning Network Monitoring Order Tuning Order Translation Tuning Network [intradomain, 80] [interdomain, 20] [interdomain, 20] Tuning Order #agents Events [interdomain, 20] [interdomain, 20] time 37/66

38 Experimental Evaluation Experimentation Plan Scale the number of agents. Scale the simulation space. Domain size = /66

39 256 tasks Experimental Evaluation Computation times agents on 256 tasks Agent-based application alone Agent-based application with ELASTIC Ideal balanced time Results Computation times agents on 512 tasks 4500 Agent-based application alone Agent-based application with ELASTIC 4000 Ideal balanced time 512 tasks Computation time (ms) Computation time (ms) Iteration Iteration 4500 Agent-based application alone 4500 Agent-based application alone 1024 tasks Agent-based application with ELASTIC Agent-based application with ELASTIC 2048 tasks 4000 Ideal balanced time 4000 Ideal balanced time Computation time (ms) Computation times agents on 1024 tasks Computation times agents on 2048 tasks Computation time (ms) Iteration Iteration 39/66

40 Experimental Evaluation Results 300 Original application Application tuned by ELASTIC % Execution time (s) % 28.3% 30.3% Number of tasks 40/66

AutoTune Workshop. Michael Gerndt Technische Universität München

AutoTune Workshop. Michael Gerndt Technische Universität München AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy