PICS - a Performance-analysis-based Introspective Control System to Steer Parallel Applications


1 PICS - a Performance-analysis-based Introspective Control System to Steer Parallel Applications
Yanhua Sun, Laxmikant V. Kalé
University of Illinois at Urbana-Champaign
sun51@illinois.edu
November 26, 2014

2 Parallel Programming Laboratory (PPL)
Led by Professor Kale since 1985 (30 years)
A group of research staff, post-docs, graduate students, and undergraduates (20+)
Charm++ programming model and runtime system, plus real-world applications (open source)
12 Charm++ workshops

3 Charm++
[Figure: software stack. Applications (NAMD, ChaNGa, OpenAtom, EpiSimdemics, Cloth Simulation, PDES, and more) run on Charm++ (over-decomposition, asynchrony, migration) together with PICS, fault tolerance, load balancing, and power/energy saving; below sit the Converse runtime (scheduling, threading) and the LRTS machine layers (uGNI, PAMI, Verbs, Net).]

5 Goal: Productivity + Performance
Asynchronous, message-driven, over-decomposed programming model
More control: mapping, load balancing, memory management, communication optimization
Observability and controllability
Most important feature: load balancing
Why not a general scheme to enhance adaptivity?
PICS: a control-point-centered introspective control system to steer applications and the runtime system

6 Observation
Configurations of tunable parameters in the runtime system and applications significantly affect performance.
Example: the time to send a fixed amount of data varies with the number of messages used to send it.
Figure: Data transfer without computation (time in us vs. number of messages)
Figure: Data transfer with computation (time in us vs. number of messages)
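As a concrete illustration of this knob, the following is a minimal, hypothetical C++ sketch (not the Charm++ or PICS API): the message count k is the tunable parameter, and the send callback stands in for whatever transport the runtime actually uses.

// Hypothetical sketch: the number of messages k used to move a fixed amount of
// data is the tunable knob; `send` is a stand-in for the real transport.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

// Split `data` into k (>= 1) roughly equal chunks and hand each to `send`.
void sendInChunks(const std::vector<char>& data, int k,
                  const std::function<void(const char*, std::size_t)>& send) {
  std::size_t chunk = (data.size() + k - 1) / k;   // ceiling division
  for (std::size_t off = 0; off < data.size(); off += chunk) {
    std::size_t len = std::min(chunk, data.size() - off);
    send(data.data() + off, len);                  // one message per chunk
  }
}

int main() {
  std::vector<char> payload(1 << 20);              // 1 MB of data
  const int counts[] = {1, 16, 512};               // candidate values of the knob
  for (int k : counts) {
    int messages = 0;
    sendInChunks(payload, k, [&](const char*, std::size_t) { ++messages; });
    std::printf("k = %3d -> %d messages\n", k, messages);
  }
}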

7 Principle of Persistence
Things rarely change suddenly.

10 Overview of the PICS Framework
Applications (mini-apps and real-world applications): expose application control points and support application reconfiguration
PICS controller: performance instrumentation, automatic performance analysis on the collected performance data, and expert knowledge rules
Adaptive runtime system: exposes runtime control points and supports runtime reconfiguration

11 Control Points
Control points are tunable parameters through which the application and the runtime interact with the control system. First proposed in Dooley's research.
1 Name; values: default, min, max
2 Movement unit: +1, 2
3 Associated function, object, or array
4 Effects and directions
Effects include degree of parallelism, grain size, priority, memory usage, GPU load, message size, number of messages, and other effects.

12 Application and Control Points
Application:
1 Application-specific control points are provided by users
2 Applications should be able to reconfigure to use new values
Runtime:
1 Registered by the runtime itself
2 Requires no change to applications
3 Affects all applications

Control point | Effects | Use cases
sub-block size | parallelism, grain size | Jacobi, Wave, stencil codes
parallel threshold | parallelism, overhead, grain size | state space search
stages in pipeline | number of messages, message size | pipelined collectives
algorithm selection | degree of parallelism, grain size | 3D FFT decomposition (slab or pencil)
software cache size | memory usage, amount of communication | ChaNGa
ratio of GPU to CPU load | computation, load balance | NAMD, ChaNGa
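To make the first table row concrete, here is a small, hypothetical C++ sketch (not taken from the actual Jacobi3D code) showing why the sub-block edge length acts as a grain-size and parallelism knob for a stencil code: smaller sub-blocks mean more, finer-grained objects.

// Hypothetical: counts how many sub-blocks (migratable objects) an n x n x n
// grid yields for a given sub-block edge length, the candidate control point.
#include <cstdio>

long long numBlocks(long long n, long long blockEdge) {
  long long perDim = (n + blockEdge - 1) / blockEdge;   // ceiling division per dimension
  return perDim * perDim * perDim;
}

int main() {
  const long long n = 1024;                    // 1024^3 problem, as in the Jacobi3D example later
  const long long edges[] = {256, 128, 64};    // candidate values of the sub-block control point
  for (long long edge : edges) {
    std::printf("sub-block edge %4lld -> %lld objects\n", edge, numBlocks(n, edge));
  }
}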

13 Observe Program Behaviors
Record all events:
Events: begin idle, end idle
Functions: name, begin execution, end execution
Communication: message creation, size, source/destination
Hardware counters
Linked in as a module, with no source code modification
Produces performance summary data
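As a rough illustration of what such instrumentation records, the sketch below shows one possible per-event record in C++; the field names and event kinds are illustrative assumptions, not the actual PICS trace format.

// Hypothetical per-event trace record; kinds and fields are illustrative only.
#include <cstddef>
#include <vector>

enum class EventKind { BeginIdle, EndIdle, BeginExecution, EndExecution, MessageCreation };

struct Event {
  EventKind   kind;
  double      timestamp;    // seconds since program start
  int         functionId;   // meaningful for Begin/EndExecution
  int         srcPe, dstPe; // meaningful for MessageCreation
  std::size_t msgBytes;     // message size for communication events
};

// Per-PE event log, summarized into utilization/overhead/idle at phase boundaries.
std::vector<Event> eventLog;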

15 Automatically Analyze the Performance
Many control points are registered. How can the search space be reduced?
Performance analysis identifies program problems to narrow down the relevant control points, combining an expert knowledge rules database with the performance summary data.

16 Decision-Tree-Based Performance Analysis
[Figure: decision tree over the performance summary. Branches test conditions such as CPU utilization > 90%, overhead > 10%, idle time > 10%, and cache misses > 10%, and trace them to causes: sequential-performance issues, small bytes per message, small entry methods, decomposition/mapping/scheduling problems, long reductions or broadcasts, overly long entry methods, large single objects, long critical paths, too few objects per PE, large communication on one object or one PE, load imbalance, communication time much larger than the model time, or delayed critical tasks. Each leaf maps to a remedy: increase or decrease the grain size or aggregation threshold, invoke a load balancer, use collectives, replicate objects, apply topology-aware mapping for big messages, prioritize critical tasks, compress messages, or remap.]
The rules are encoded in a plain text file, constructed at the beginning of the run, and new rules can be learned dynamically.

17 Correlate Performance with Control Points
Traverse the tree using the performance summary results and the known effects of the control points.
Example: idle time > 10% and max load / avg load > 1.2 steers toward increasing the load-balancing frequency; idle time > 10% and number of tasks < number of cores steers toward decreasing the grain size and increasing parallelism.
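One way to picture this traversal is as a small tree of condition/remedy rules walked with the performance summary. The C++ sketch below is hypothetical (the node layout, thresholds, and suggestion strings are placeholders mirroring the example above), not the actual PICS rule engine.

// Hypothetical rule-tree traversal producing steering suggestions from a summary.
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct PerfSummary { double idlePct; double maxLoadOverAvg; int numTasks; int numCores; };

struct RuleNode {
  std::function<bool(const PerfSummary&)> condition;  // test on the summary data
  std::string suggestion;                             // steering action if it fires
  std::vector<RuleNode> children;                     // refinements of this problem
};

void traverse(const RuleNode& node, const PerfSummary& s, std::vector<std::string>& out) {
  if (!node.condition(s)) return;
  if (!node.suggestion.empty()) out.push_back(node.suggestion);
  for (const auto& child : node.children) traverse(child, s, out);
}

int main() {
  RuleNode idle{
      [](const PerfSummary& s) { return s.idlePct > 10; }, "",
      {{[](const PerfSummary& s) { return s.maxLoadOverAvg > 1.2; },
        "increase load-balancing frequency", {}},
       {[](const PerfSummary& s) { return s.numTasks < s.numCores; },
        "decrease grain size, increase parallelism", {}}}};

  PerfSummary summary{15.0, 1.5, 32, 64};              // example measurements
  std::vector<std::string> suggestions;
  traverse(idle, summary, suggestions);
  for (const auto& sug : suggestions) std::cout << "steering: " << sug << "\n";
}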

18 Control System APIs

typedef struct ControlPoint_t {
    char   name[30];
    enum TP_DATATYPE datatype;
    double defaultValue;
    double currentValue;
    double minValue;
    double maxValue;
    double bestValue;
    double moveUnit;
    int    moveOp;
    int    effect;
    int    effectDirection;
    int    strategy;
    int    entryEP;
    int    objectID;
} ControlPoint;
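As a usage illustration, the fragment below shows how an application might fill this struct for a Jacobi3D sub-block-size knob; it assumes the ControlPoint definition above, and the enum/flag fields (datatype, moveOp, effect, strategy) are left at zero because the slides do not define their constants.

// Hypothetical example only; values and the parameter name are illustrative.
#include <cstring>

ControlPoint makeBlockSizeControlPoint() {
  ControlPoint cp;
  std::memset(&cp, 0, sizeof(cp));
  std::strncpy(cp.name, "jacobi_subblock_size", sizeof(cp.name) - 1);
  cp.defaultValue = 128;            // starting sub-block edge length
  cp.currentValue = cp.defaultValue;
  cp.minValue     = 32;
  cp.maxValue     = 512;
  cp.moveUnit     = 2;              // adjust the edge by a factor of 2 per move
  return cp;
}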

19 APIs for Applications

void registerControlPoint(ControlPoint *tp);
void startStep();
void endStep();
void startPhase(int phaseId);
void endPhase();
double getTunedParameter(const char *name, bool *valid);
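Putting the pieces together, the following hypothetical sketch uses the API above inside a Jacobi-style timestep loop; the loop structure, the parameter name, and the two application-side helpers (declared only as placeholders) are illustrative rather than taken from the real Jacobi3D code.

// Hypothetical usage of the application API; compiles against the declarations above.
void computeOneJacobiIteration();        // application work (placeholder declaration)
void reconfigureSubBlocks(double edge);  // apply a new sub-block size (placeholder declaration)

void runTimesteps(int numSteps) {
  ControlPoint cp = makeBlockSizeControlPoint();   // from the sketch above
  registerControlPoint(&cp);

  for (int step = 0; step < numSteps; ++step) {
    startStep();                                   // PICS instruments this step
    computeOneJacobiIteration();
    endStep();

    bool valid = false;
    double edge = getTunedParameter("jacobi_subblock_size", &valid);
    if (valid) reconfigureSubBlocks(edge);         // application reconfiguration
  }
}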

20 Jacobi3D Performance Steering
Control points: sub-block size in each dimension (three control points)
A high cache miss rate and high idle time suggest decreasing the sub-block size.
Figure: Jacobi3D performance steering on 64 cores for a 1024 x 1024 x 1024 problem (timestep in ms/step vs. step, with curves for total time, idle time, CPU time, and runtime overhead)

21 Communication Bottleneck in ChaNGa
Control point: number of mirrors
Metric: ratio of maximum communication per object to the average
Figure: Time per step (s/step vs. step) for calculating gravity with mirrors tuned by PICS versus no mirrors, on 16k cores of Blue Gene/Q

22 Conclusion
Application developers can provide hints to help optimize applications.
Automatic performance analysis helps guide performance steering.
Steering both the runtime system and applications is important.
Mailing list:
