PGAS languages: the facts, the myths and the requirements. Dr Michèle Weiland, 1 October 2012

PGAS languages: the facts, the myths and the requirements. Dr Michèle Weiland, m.weiland@epcc.ed.ac.uk

What is PGAS? It is a model, not a language, based on the principle of a partitioned global address space. Many different implementations exist: new languages, language extensions and libraries. The world of PGAS is rather complex and murky...

Some implementations: Unified Parallel C (UPC) and Coarray Fortran; Chapel, X10, Titanium and Fortress; Global Arrays and OpenSHMEM.

Important point to keep in mind: unfortunately, there isn't really such a thing as a typical PGAS language... there are many programming languages that implement the PGAS model in very different, even opposing, ways.

PGAS example: the UPC model. [Diagram: four threads (thread 0 to thread 3), one per CPU, each with its own private memory region plus its partition of the global partitioned address space, which is shared.]
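
A minimal UPC sketch of that picture (illustrative only; the array and variable names are invented, and the compile/run commands depend on the toolchain, e.g. upcc/upcrun with Berkeley UPC or the Cray compilers on a Cray): each thread owns one element of the shared array, while 'mine' is an ordinary private variable with one copy per thread.

    /* Illustrative sketch, not taken from the slides. */
    #include <upc.h>
    #include <stdio.h>

    /* One element per thread (default cyclic distribution); each element
       has affinity to, i.e. is physically stored with, one thread. */
    shared int counts[THREADS];

    int main(void) {
        int mine = 10 * MYTHREAD;   /* private: lives in this thread's local space */
        counts[MYTHREAD] = mine;    /* write to the locally owned shared element   */
        upc_barrier;                /* make all writes visible to all threads      */
        if (MYTHREAD == 0)
            for (int i = 0; i < THREADS; i++)
                printf("thread %d contributed %d\n", i, counts[i]);
        return 0;
    }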

SPMD and global view - two different approaches. UPC and CAF take the classic fragmented SPMD approach, where all processes execute the same program. Chapel and X10 take a global view: they are able to dynamically spawn processes as and when required, with the advantage that (in principle) there is no redundant serial computation.

A model for the future...? Single-sided communication and built-in parallelism are attractive concepts: remote memory can be manipulated directly, complex communication patterns are easy(ish) to implement, and parallelism is explicitly supported. Most implementations can be used standalone or alongside other models, and the learning curve is low compared to MPI.
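
To illustrate the single-sided style in UPC (a sketch under assumed names; the 'inbox' array and neighbour pattern are invented for this example): a thread deposits data directly into the partition owned by its neighbour, with no matching receive on the target side.

    /* Illustrative sketch, not taken from the slides. */
    #include <upc.h>

    #define NWORDS 4
    /* Block size NWORDS: each thread owns NWORDS consecutive elements. */
    shared [NWORDS] int inbox[NWORDS * THREADS];

    int main(void) {
        int msg[NWORDS] = { MYTHREAD, MYTHREAD, MYTHREAD, MYTHREAD };
        int right = (MYTHREAD + 1) % THREADS;

        /* One-sided put: copy the private buffer straight into the
           right-hand neighbour's block of the shared array. */
        upc_memput(&inbox[right * NWORDS], msg, NWORDS * sizeof(int));

        upc_barrier;   /* after this, every inbox block has been filled */
        return 0;
    }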

User adoption. PGAS will only play a role in Exascale computing if user adoption improves; there is a lot of scepticism in the user community. This can only happen if performance is able to match that of established models, i.e. MPI, and if there is support in the form of benchmark suites, libraries and debugging/performance tools.

Performance depends on the quality of the runtime and compiler. That is not a problem for CAF, UPC or Chapel... if you own a Cray! Other vendors are now starting to catch up. Performance also depends on the quality of the implementation (of course), and this is where the HPC user community and tools providers come into play.

Possible performance gains - IFS. [Chart: T2047L137 model performance on HECToR (Cray XE6), RAPS12 IFS (CY37R3), cce=7.4.4, April 2012. Forecast days per day (0-450) against number of cores (0-70,000) for Ideal, LCOARRAYS=T, LCOARRAYS=F and ORIGINAL, with the operational performance requirement marked. LCOARRAYS=F includes MPI optimisations to the wave model plus other optimisations; LCOARRAYS=T additionally includes the Legendre transform coarray optimisation. Image courtesy of George Mozdzynski (ECMWF) and the CRESTA project.]

Distributed hash table. [Chart: time in seconds (log scale, 1-100) against number of cores (32 to 16,384) for MPI, UPC, SHMEM, CAF and XMP implementations. Image courtesy of the HPCGAP project.]

The truth about PGAS: although it is in general easy to learn, and simple codes can be parallelised very quickly, it is difficult to use on real codes! A lot of functionality is hidden from the user, often including implicit communication and parallelism, and this hidden functionality may be the root cause of poor performance.
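
A sketch of how that hidden functionality can bite (array name and sizes are arbitrary): the loop below is legal UPC and looks like plain C, but with the default cyclic distribution most of the reads made by thread 0 are fine-grained remote gets.

    /* Illustrative sketch, not taken from the slides. */
    #include <upc.h>

    #define PER_THREAD 1000
    /* Default cyclic distribution: consecutive elements live on different threads. */
    shared double a[PER_THREAD * THREADS];

    int main(void) {
        double sum = 0.0;
        if (MYTHREAD == 0)
            /* Roughly (THREADS-1)/THREADS of these accesses touch memory
               owned by other threads; each one may hide a network transfer. */
            for (int i = 0; i < PER_THREAD * THREADS; i++)
                sum += a[i];
        upc_barrier;
        return 0;
    }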

Common bottleneck: data access. Shared data objects can be accessed directly, but the cost of an access depends on where the data resides: is it in shared cache, or on a memory bank attached to a processor in another cabinet? It is a deceptively simple operation, but the implications for performance are huge.
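
One way UPC exposes this cost to the programmer (again only a sketch with made-up names): blocking the shared array and using the affinity expression of upc_forall keeps every access in the loop on the thread that owns the element.

    /* Illustrative sketch, not taken from the slides. */
    #include <upc.h>

    #define PER_THREAD 1000
    /* Block the array so each thread owns one contiguous chunk. */
    shared [PER_THREAD] double x[PER_THREAD * THREADS];

    int main(void) {
        int i;
        /* The affinity expression &x[i] assigns each iteration to the
           thread that owns x[i], so all of these writes are local. */
        upc_forall (i = 0; i < PER_THREAD * THREADS; i++; &x[i])
            x[i] = 2.0 * i;
        upc_barrier;
        return 0;
    }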

Also: synchronisation. It is important for memory consistency and the avoidance of data races, but the implicit nature of communication makes it surprisingly difficult to get right, especially in large codes. The common approach is: if in doubt, synchronise! The result is correct but badly performing code that spends most of its time waiting for things to happen.
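
A small UPC illustration of the trade-off (the shared counter is hypothetical, not from the slides): without the lock the update of 'hits' is a data race; with it the result is correct, but every thread serialises on the same lock, which is exactly the "correct but slow" pattern described above.

    /* Illustrative sketch, not taken from the slides. */
    #include <upc.h>
    #include <stdio.h>

    shared long hits;    /* shared scalar with affinity to thread 0, starts at 0 */
    upc_lock_t *lock;

    int main(void) {
        lock = upc_all_lock_alloc();   /* collective: all threads share one lock */

        upc_lock(lock);
        hits += 1;                     /* read-modify-write made race-free */
        upc_unlock(lock);

        upc_barrier;                   /* all increments visible before printing */
        if (MYTHREAD == 0)
            printf("hits = %ld (expected %d)\n", hits, THREADS);
        return 0;
    }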

What needs to happen? An ideal world: the common misplaced belief that PGAS is easy needs to be addressed; it is not a quick fix for performance and scaling problems! Stories of success and failure need to be told: what works? What doesn't? And finally, programmers need help; writing code without the support of tools is like shooting in the dark...

Debugging. The main focus needs to be on the principal feature of PGAS and the unwanted side effects of RDMA: memory consistency is the key. Tools should detect data races (is a memory location safe to use?) and help with the resolution of data races, e.g. through atomic operations, synchronisation, critical sections, ...

Debugging (2). Ensure synchronisations are correct: too few, and the code will break; too many, and the code will perform badly... A debugging tool could, for example, visually match synchronisation points and give advice based on data race detection.

(Micro-)benchmarks. Performance characteristics need to be quantifiable: runtime overheads, communication costs, parallel constructs. This allows programmers to model and analyse the performance of their code and make intelligent decisions regarding the implementation.
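
A sketch of what such a micro-benchmark might look like in UPC (the partner choice, iteration count and timer are arbitrary assumptions): time a stream of stores into another thread's partition, and compare the per-access cost against the same loop writing to target[MYTHREAD].

    /* Illustrative sketch, not taken from the slides. */
    #include <upc.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define NITER 100000
    shared double target[THREADS];    /* one element per thread */

    static double wtime(void) {       /* wall-clock time in seconds */
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1.0e-6 * tv.tv_usec;
    }

    int main(void) {
        int partner = (MYTHREAD + 1) % THREADS;   /* remote element if THREADS > 1 */
        upc_barrier;

        double t0 = wtime();
        for (int i = 0; i < NITER; i++)
            target[partner] = (double)i;          /* remote store */
        upc_barrier;                              /* wait until the stores have completed */
        double t1 = wtime();

        if (MYTHREAD == 0)
            printf("average remote store: %g us\n", 1.0e6 * (t1 - t0) / NITER);
        return 0;
    }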

Interoperability tools focus on language interoperability. The aim is to enable code written in one language to be called directly from code written in another, encouraging and enabling code reuse and libraries. A notable effort here is Babel (out of LLNL), which supports C, C++, Fortran 77-2008, Python and Java, and now also Chapel, UPC and X10, though the latter three are still experimental.

Performance profiling. Get information on hotspots and a breakdown of timings: how much time is spent waiting for data to arrive at the processing core? How much time is spent on memory management? Also lower-level information such as cache reuse, memory bandwidth, cycles per instruction, etc.

Visualising data locality. Accessing memory is not uniformly expensive; it is important to keep data on memory infrastructure close to the processing core that will operate on it. A tool should highlight poor data locality, based on memory access patterns.

Visualising communications. This is related to the data locality issue on the previous slide: communication is implicit in most (though not all) PGAS languages, via remote direct memory access, so it is difficult to gain a clear understanding of the communication patterns. Optimising these patterns is important for performance.

What is the reality? Some of this functionality would be extremely beneficial, but does not even exist for shared-memory programming (e.g. a data locality visualiser for multi-core processors), so what chance is there for PGAS tools? They would need to support a myriad of different programming, memory and execution models...

Will PGAS play a role in Exascale? Not all of the PGAS languages will survive; they will suffer the same fate as HPF. The ones (or even the one?) that remain won't necessarily be the best implementations of PGAS, but those that got the most support and managed to pick up momentum.

Conclusions. PGAS is in principle an attractive model, but there are too many disparate implementations; this makes community support difficult and may even be the downfall of the PGAS implementations. Only time will tell!

Questions?