POP CoE: Understanding applications and how to prepare for exascale


1 POP CoE: Understanding applications and how to prepare for exascale
Jesus Labarta (BSC)
EU H2020 Center of Excellence (CoE)
Lecce, May 17th. ENES HPC workshop

2 POP objective
Promote methodologies and best practices in
  Performance analysis
  Parallel programming practices
By means of services
  Performance assessments
  Proof of concept

3 Activities (Dec 2017)
195 Services
  Completed/reporting: 113
  Codes being analyzed: 29
  Waiting user / New: 36
  Cancelled: 17
Reports: 5-15 pages
By type
  Audits: 137
  Plan: 22
  Proof of concept:
training workshops

4 Methodologies and best practices
Understanding application behaviour
  Hierarchical performance model
  Performance analytics & details
  Timelines
  What if
  Clustering, tracking, folding, ...
Towards productive programming of large scale systems
  MPI - OpenMP interoperability
  Task based
  Overlap communication and computation
  Exploiting malleability
  Dynamic load balance

5 Hierarchical Performance Model
Efficiencies: ~ (0,1]
Multiplicative model
Global Efficiency
  Parallel Efficiency
    Load Balance
    Communication Efficiency
      Serialization Efficiency
      Transfer Efficiency
  Computation Efficiency
    IPC scaling Efficiency
    Instruction scaling Efficiency
    Frequency Efficiency
Contributing factors: Cache, Memory BW, Sharing effects, Dependencies, Instruction mix, NUMAness, SM, Synchronization, Code replication, OS noise
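The multiplicative model above can be sketched in a few lines. This is a minimal illustration, assuming the usual POP definitions (load balance as average over maximum useful computation time, communication efficiency as maximum useful time over runtime); the function names and the sample timings are hypothetical and not taken from the slides.

```python
from statistics import mean


def parallel_efficiencies(useful_times, runtime):
    """Return (load_balance, communication_efficiency, parallel_efficiency).

    useful_times: per-process time spent in useful computation (seconds).
    runtime: elapsed time of the analysed region (seconds).
    """
    if not useful_times or runtime <= 0:
        raise ValueError("need at least one process and a positive runtime")
    max_useful = max(useful_times)
    load_balance = mean(useful_times) / max_useful
    communication_eff = max_useful / runtime
    # Multiplicative model: parallel efficiency is the product of its children.
    return load_balance, communication_eff, load_balance * communication_eff


def computation_efficiency(ipc_scaling, instruction_scaling, frequency_scaling):
    """Product of the three computation scaling factors."""
    return ipc_scaling * instruction_scaling * frequency_scaling


def global_efficiency(parallel_eff, computation_eff):
    """Top of the hierarchy: parallel efficiency times computation efficiency."""
    return parallel_eff * computation_eff


# Hypothetical timings for four processes in a 10 s region.
useful = [8.0, 6.0, 4.0, 6.0]
lb, comm, par = parallel_efficiencies(useful, runtime=10.0)
print(f"LB={lb:.2f} Comm={comm:.2f} Par={par:.2f}")  # LB=0.75 Comm=0.80 Par=0.60
```

Because every factor lies in (0,1], the smallest factor in a branch directly identifies where most efficiency is lost.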

6 Hierarchical Performance Model
[Efficiency tables with coloring; values not recoverable. Rows: Parallel Efficiency (Load Balance, Serialization efficiency, Transfer Efficiency), Computation Efficiency (IPC scalability, Instruction scalability, Frequency scalability), Global efficiency.]

7 ... and detail
What if MPI had no overhead and transfer was instantaneous?
Detailed communication pattern?
Fundamental underlying causes?
How to counteract?
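The "what if transfer was instantaneous" question is what separates the two communication factors of the model. A minimal sketch, assuming the runtime on an ideal network has already been obtained elsewhere (for instance by replaying a trace in a simulator); the function name and the numbers are hypothetical.

```python
def split_communication_efficiency(max_useful, runtime, ideal_runtime):
    """Split communication efficiency into (serialization, transfer).

    max_useful: largest per-process useful computation time.
    runtime: measured elapsed time on the real machine.
    ideal_runtime: elapsed time predicted for an ideal network
                   (zero latency, infinite bandwidth).
    """
    if not 0 < max_useful <= ideal_runtime <= runtime:
        raise ValueError("expected 0 < max_useful <= ideal_runtime <= runtime")
    # What is still lost on an ideal network is due to dependencies between processes.
    serialization = max_useful / ideal_runtime
    # What the real network adds on top is due to data transfer.
    transfer = ideal_runtime / runtime
    return serialization, transfer


# Hypothetical values: 8 s of useful work, 10 s measured, 9 s on an ideal network.
ser, trf = split_communication_efficiency(8.0, 10.0, 9.0)
print(f"Serialization={ser:.2f} Transfer={trf:.2f}")  # Serialization=0.89 Transfer=0.90
```

The product of the two factors equals the communication efficiency (here 0.8), so the split tells whether to work on the dependency structure or on the amount and size of messages.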

8 Tracking MPI+OMP strong scaling
Configurations: 48x1, 48x2, 48x4, 48x8, 48x9, 48x18

9 Tracking MPI+OMP strong scaling


10 MPI - OpenMP interoperability
Hybrid Amdahl's law
  A fairly bad message for programmers
Significant non-parallelized parts
  pack/unpack
Often too fine grain
Significant variability
MPI calls
  Too serial
  Communicator context
  MPI order semantics instead of tags
Hardwired schedules
  for () pack; irecv; isend; wait all sends; for () test, unpack
Codes shown: MAXW-DGTD, NMMB
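The "hybrid Amdahl's law" point can be made concrete with the classic formula. This is a sketch under the assumption that, inside each MPI process, pack/unpack and MPI calls run on a single thread and only the remaining fraction is spread over OpenMP threads; the 80% figure is a hypothetical example, not a measurement from the slides.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Speedup of one process when only parallel_fraction of its time is threaded."""
    if not 0.0 <= parallel_fraction <= 1.0 or threads < 1:
        raise ValueError("parallel_fraction must be in [0, 1] and threads >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Hypothetical process: 80% of its time is in OpenMP-parallel loops,
# 20% is serial pack/unpack and MPI calls.
for threads in (1, 2, 4, 8, 16):
    print(threads, round(amdahl_speedup(0.8, threads), 2))
```

With 20% left serial, the speedup per process can never reach 5 however many threads are added, which is why the non-parallelized pack/unpack and serial MPI phases matter so much.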


11 MPI - OpenMP interoperability
Taskifying MPI calls
  Virtualize communication resource
Opportunities
  Overlap / out-of-order execution
    Computation - communication
    Communication - communication
    Phases / iterations
  Provide laxity for communications
    Tolerate poorer communication
  Migrate/aggregate load balance issues
    Flexibility for DLB
IFS weather code kernel, ECMWF (figure labels: physics, ns)
V. Marjanovic et al., "Overlapping Communication and Computation by using a Hybrid MPI/SMPSs Approach", ICS 2010
K. Sala et al., "Improving the Interoperability between MPI and Task-Based Programming Models", submitted
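The idea of taskifying communication can be illustrated without MPI at all. The sketch below is only an analogy: Python threads stand in for tasks, `fake_receive` is a hypothetical stand-in for a receive that completes after an unpredictable delay, and none of the names come from the cited papers. It shows interior computation overlapping with pending receives, and unpacking done in completion order instead of a hardwired order.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_receive(neighbor, payload):
    """Stand-in for a taskified receive: returns after a random delay."""
    time.sleep(random.uniform(0.0, 0.02))
    return neighbor, payload


def unpack(buffer):
    """Stand-in for unpacking a halo buffer into local data."""
    return sum(buffer)


def interior_compute(n):
    """Work that does not depend on any incoming message."""
    return sum(i * i for i in range(n))


def exchange_and_compute(messages, interior_size):
    """Post all receives as tasks, compute the interior, unpack as messages land."""
    halo = {}
    with ThreadPoolExecutor(max_workers=max(1, len(messages))) as pool:
        pending = [pool.submit(fake_receive, nb, buf) for nb, buf in messages.items()]
        interior = interior_compute(interior_size)  # overlaps with the receives
        for future in as_completed(pending):  # completion order, not posting order
            neighbor, buffer = future.result()
            halo[neighbor] = unpack(buffer)
    return interior, halo


interior, halo = exchange_and_compute({"north": [1, 2, 3], "south": [4, 5]}, 10)
print(interior, halo["north"], halo["south"])  # 285 6 9
```

The result is the same whichever message arrives first, which is the laxity the slide refers to: a slow message delays only the task that needs it.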

12 Exploiting malleability
Dynamic Load Balance & Resource management
  Intra/inter process/application
Library (DLB)
  Runtime interception (PMPI, OMPT, ...)
  API to hint resource demands
  Core reallocation policy
Opportunity to fight Amdahl's law
Productive / Easy!!!
  Nx1
  Hybridize imbalanced regions
ECHAM
M. Garcia et al., "LeWI: A Runtime Balancing Algorithm for Nested Parallelism", ICPP09
"Hints to improve automatic load balancing with LeWI for hybrid applications", JPDC 2014


13 Exploiting malleability
Dynamic Load Balance & Resource management
  Intra/inter process/application
Library (DLB)
  Runtime interception (PMPI, OMPT, ...)
  API to hint resource demands
  Core reallocation policy
Opportunity to fight Amdahl's law
Productive / Easy!!!
  Nx1
  Hybridize imbalanced regions
Relational Discovery
M. Garcia et al., "LeWI: A Runtime Balancing Algorithm for Nested Parallelism", ICPP09
"Hints to improve automatic load balancing with LeWI for hybrid applications", JPDC 2014

14 Exploiting malleability
Dynamic Load Balance & Resource management
  Intra/inter process/application
Library (DLB)
  Runtime interception (PMPI, OMPT, ...)
  API to hint resource demands
  Core reallocation policy
Opportunity to fight Amdahl's law
Productive / Easy!!!
  Nx1
  Hybridize imbalanced regions
Relational Discovery
M. Garcia et al., "LeWI: A Runtime Balancing Algorithm for Nested Parallelism", ICPP09
"Hints to improve automatic load balancing with LeWI for hybrid applications", JPDC 2014
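The benefit of lending cores from idle processes can be estimated with a toy model. The sketch below is not the DLB API and not the LeWI algorithm as published; it is an idealized simulation that assumes one core per rank, perfectly malleable work, fractional cores, and that all ranks share one node. The function names and workloads are hypothetical.

```python
def makespan_without_lending(work):
    """Each rank keeps its single core: the slowest rank sets the time."""
    return max(work)


def makespan_with_lending(work):
    """Idealized lend-when-idle: cores of finished ranks are shared
    equally among the ranks that still have work."""
    total_cores = len(work)
    remaining = [float(w) for w in work if w > 0]
    elapsed = 0.0
    tolerance = 1e-12  # treat tiny float residues as finished work
    while remaining:
        cores_per_rank = total_cores / len(remaining)
        # Advance time until the rank with the least remaining work finishes.
        step = min(remaining) / cores_per_rank
        elapsed += step
        remaining = [r - step * cores_per_rank for r in remaining]
        remaining = [r for r in remaining if r > tolerance]
    return elapsed


# Hypothetical imbalanced region: core-seconds of work per rank.
work = [4, 2, 1, 1]
print(makespan_without_lending(work), makespan_with_lending(work))  # 4 2.0
```

In this idealized setting no core is ever idle, so the time with lending equals total work divided by the number of cores. Real runs fall between the two numbers, since cores are whole units and can only be lent to processes on the same node.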

15 Coupled codes
Multiple physics, domains
Compute & I/O
2.5 s
EC-EARTH, 1600 cores: Atmosphere, Ocean
26.7 MB trace
Eff: 0.43; LB: 0.52; Comm:

16 Exploiting Coupled codes
Dynamic load balance
How to allocate resources? Configure the runs
Important to maximize performance without needing to care about detailed configuration
Figure labels: Fluid dominated, Particle dominated; Fluid, Particle
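Why a fixed configuration is fragile can be seen with a small calculation. This is a sketch under strong assumptions that are not in the slides: the fluid and particle codes run concurrently on disjoint cores, both scale perfectly, and the work figures are hypothetical.

```python
def coupled_time(fluid_work, particle_work, fluid_cores, total_cores):
    """Time of one coupled step when the cores are split statically."""
    particle_cores = total_cores - fluid_cores
    return max(fluid_work / fluid_cores, particle_work / particle_cores)


def best_static_split(fluid_work, particle_work, total_cores):
    """Try every split and return (fluid_cores, time) for the fastest one."""
    if total_cores < 2:
        raise ValueError("need at least one core for each code")
    candidates = (
        (cores, coupled_time(fluid_work, particle_work, cores, total_cores))
        for cores in range(1, total_cores)
    )
    return min(candidates, key=lambda item: item[1])


# Hypothetical work (core-seconds) for the two cases, on 10 cores.
print(best_static_split(90, 10, 10))  # fluid dominated    -> (9, 10.0)
print(best_static_split(10, 90, 10))  # particle dominated -> (1, 10.0)
print(coupled_time(10, 90, fluid_cores=9, total_cores=10))  # wrong split -> 90.0
```

The split that is best for the fluid-dominated case is nine times slower in the particle-dominated case, which is the situation where moving cores at run time avoids having to tune the configuration by hand.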

17 Closing remarks
The real parallel programming revolution is in the mindset of programmers
  From latency to throughput oriented!!!
  Think global, specify local
... and can be achieved productively
  Incrementally
  On a standard programming model (MPI+OpenMP)
Age before beauty
  Behavior (insight/models) before syntax
  Detailed performance analytics before aggregated profiles
  Work instantiation and order before overhead
  Malleability before fitted rigid structure
  Possibilities before how-tos
  Elegance before one day shine

18 POP
Past
  Huge effort, high appreciation
  Provided useful insight to a large set of users
  Using simple techniques
Plan
  Continue with basic service
  Ease of use of tools
  Extend use of more advanced techniques (clustering, tracking, folding, ...)
  Emphasis on programming best practices
  Towards larger scales

19 Performance Optimisation and Productivity
A Centre of Excellence in Computing Applications
Contact: https://www.popcoe.eu
mailto:pop@bsc.es
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No
11/23/2016

