PGAS languages: the facts, the myths and the requirements
Dr Michèle Weiland, m.weiland@epcc.ed.ac.uk
What is PGAS?
- a model, not a language!
- based on the principle of a partitioned global address space
- many different implementations exist: new languages, language extensions, libraries
- the world of PGAS is rather complex and murky...
Some implementations
- language extensions: Unified Parallel C, Coarray Fortran
- new languages: Chapel, X10, Titanium, Fortress
- libraries: Global Arrays, OpenSHMEM
Important point to keep in mind
- unfortunately, there isn't really such a thing as a "typical" PGAS language...
- ...there are many programming languages that implement the PGAS model in very different, even opposing, ways.
PGAS example: the UPC model
[Diagram: four threads (0-3), each with its own CPU and private address space; above them, a shared region forms the global partitioned address space, with each thread having affinity to one partition. A minimal code sketch follows.]
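As an illustration (not part of the original slides), here is a minimal UPC sketch of the model above, using the standard UPC constants THREADS and MYTHREAD; the array name a is arbitrary:

```c
#include <upc.h>
#include <stdio.h>

shared int a[THREADS];    /* shared array: one element per thread partition */

int main(void) {
    int mine = MYTHREAD;  /* private variable: every thread has its own copy */
    a[MYTHREAD] = mine;   /* write to the element with affinity to this thread */

    upc_barrier;          /* make all writes visible before reading */

    if (MYTHREAD == 0) {  /* classic SPMD guard for serial work */
        for (int i = 0; i < THREADS; i++)
            printf("a[%d] = %d\n", i, a[i]);  /* reads of a[i] may be remote */
    }
    return 0;
}
```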
SPMD and global - two different views
- UPC and CAF take the classic fragmented SPMD approach: all processes execute the same program
- Chapel and X10 take a global view: they are able to dynamically spawn processes as and when required
  - advantage: (in principle) no redundant serial computation
A model for the future...?
- single-sided communication and built-in parallelism are attractive concepts
  - manipulate remote memory directly: complex communication patterns are easy(ish) to implement (see the sketch below)
  - parallelism is explicitly supported
- most implementations can be used standalone or alongside other models
- the learning curve is low compared to MPI
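To make "manipulate remote memory directly" concrete, a hedged UPC sketch (not from the original slides) using the standard bulk-copy routine upc_memput; the buffer name and sizes are illustrative:

```c
#include <upc.h>
#include <string.h>

#define N 1024
shared [N] double buf[THREADS * N];  /* blocked layout: N elements per thread */

int main(void) {
    double local[N];
    memset(local, 0, sizeof local);

    /* One-sided put: copy our local buffer straight into the partition of
       the next thread (wrapping around); no matching receive is required. */
    int target = (MYTHREAD + 1) % THREADS;
    upc_memput(&buf[target * N], local, N * sizeof(double));

    upc_barrier;  /* ensure all puts have completed before anyone reads */
    return 0;
}
```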
User adoption
- PGAS will only play a role in Exascale computing if user adoption is improved: there is a lot of scepticism in the user community
- this can only happen if:
  - performance is able to match that of established models, i.e. MPI
  - there is support in the form of benchmark suites, libraries and debugging/performance tools
Performance
- depends on the quality of the runtime and compiler
  - not a problem for CAF, UPC or Chapel... if you own a Cray!
  - other vendors are now starting to catch up
- also depends on the quality of the implementation (of course)
  - this is where the HPC user community and tools providers come into play
Possible performance gains - IFS
[Figure: T2047L137 model performance on HECToR (Cray XE6), RAPS12 IFS (CY37R3), cce=7.4.4, April 2012. Forecast days per day (0-450) versus number of cores (0-70000), comparing Ideal, LCOARRAYS=T and LCOARRAYS=F against the original code and the operational performance requirement. F includes MPI optimisations to the wave model plus other optimisations; T additionally includes the Legendre transform coarray optimisation. Image courtesy of George Mozdzynski (ECMWF) and the CRESTA project.]
Distributed hash table
[Figure: time in seconds (log scale, 1-100) versus number of cores (32-16384) for MPI, UPC, SHMEM, CAF and XMP implementations. Image courtesy of the HPCGAP project.]
The truth about PGAS
- although in general easy to learn (simple codes can be parallelised very quickly), it is difficult to use on real codes!
- a lot of functionality is hidden from the user; communication and parallelism are often implicit
- this hidden functionality may be the root cause of poor performance (see the sketch below)
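A hedged UPC illustration of that hidden functionality (not from the original slides): the two shared-array accesses below are syntactically identical, but only one stays on-node:

```c
#include <upc.h>

shared int a[THREADS];   /* one element per thread */

int main(void) {
    int left = (MYTHREAD + THREADS - 1) % THREADS;

    a[MYTHREAD] = MYTHREAD;  /* local write: stays on this node */
    upc_barrier;             /* wait until all writes are visible */

    int n = a[left];         /* identical syntax, but this read fetches data
                                from another thread's partition: implicit
                                communication, possibly over the network */
    (void)n;
    return 0;
}
```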
Common bottleneck: data access
- shared data objects can be accessed directly, but the cost of access depends on where the data resides
  - is it in shared cache? on a memory bank attached to a processor in another cabinet?
- a deceptively simple operation, but the implications for performance are huge (see the sketch below)
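One hedged mitigation sketch (not from the original slides): UPC exposes affinity, so upc_forall can restrict each iteration to the thread that owns the data, and shared data with local affinity can legally be cast to an ordinary C pointer; the block size B is illustrative:

```c
#include <upc.h>

#define B 16
shared [B] double v[B * THREADS];  /* block size B: each thread owns one block */

int main(void) {
    int i;
    /* upc_forall runs each iteration on the thread with affinity to &v[i],
       so every access in the loop body is guaranteed to be local. */
    upc_forall (i = 0; i < B * THREADS; i++; &v[i]) {
        /* The shared-to-private cast is legal because v[i] has affinity to
           this thread; subsequent accesses are plain loads and stores with
           no shared-pointer overhead. */
        double *p = (double *)&v[i];
        *p = 1.0;
    }
    upc_barrier;
    return 0;
}
```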
Also: synchronisation
- important for memory consistency and the avoidance of data races
- the implicit nature of communication makes this surprisingly difficult to get right, especially in large codes
- common approach: if in doubt, synchronise!
  - the result is correct but badly performing code that spends most of its time waiting for things to happen (see the sketch below)
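A hedged UPC sketch of this trade-off (not from the original slides): the single barrier below is genuinely required for correctness, whereas the "if in doubt" approach would sprinkle barriers after every shared access:

```c
#include <upc.h>
#include <stdio.h>

shared int stage[THREADS];

int main(void) {
    stage[MYTHREAD] = MYTHREAD * MYTHREAD;  /* phase 1: every thread writes */

    /* Required: without this barrier, the neighbour read below races with
       the writes above (relaxed consistency gives no ordering guarantee). */
    upc_barrier;

    int right = stage[(MYTHREAD + 1) % THREADS];  /* phase 2: read neighbour */
    printf("thread %d sees %d\n", MYTHREAD, right);

    /* The "if in doubt, synchronise!" anti-pattern would add a barrier after
       every shared access: still correct, but the threads would then spend
       most of their time waiting at barriers. */
    return 0;
}
```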
What needs to happen? An ideal world
- the common misplaced belief that PGAS is easy needs to be addressed: it is not a quick fix for performance and scaling problems!
- stories of success and failure need to be told: what works? what doesn't?
- and finally: programmers need help; writing code without the support of tools is like shooting in the dark...
Debugging
- the main focus needs to be on the principal feature of PGAS and the unwanted side-effects of RDMA
- memory consistency is the key
  - detect data races: is a memory location safe to use?
  - help with the resolution of data races, e.g. atomic operations, synchronisation, critical sections, ... (see the sketch below)
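As a hedged illustration of one such resolution (not from the original slides), a shared counter update made safe with a UPC lock-protected critical section:

```c
#include <upc.h>
#include <stdio.h>

shared int counter;   /* shared scalar with affinity to thread 0 */

int main(void) {
    /* Collective allocation: every thread receives a pointer to one lock. */
    upc_lock_t *lock = upc_all_lock_alloc();

    /* "counter = counter + 1" alone is a data race: a non-atomic remote
       read-modify-write. The lock turns it into a critical section. */
    upc_lock(lock);
    counter = counter + 1;
    upc_unlock(lock);

    upc_barrier;
    if (MYTHREAD == 0) {
        printf("counter = %d (expect %d)\n", counter, THREADS);
        upc_lock_free(lock);
    }
    return 0;
}
```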
Debugging (2)
- ensure synchronisations are correct: too few, and the code will break; too many, and the code will perform badly...
- a debugging tool could, for example, visually match synchronisation points and give advice based on data race detection
(Micro-)benchmarks
- performance characteristics need to be quantifiable: runtime overheads, communication costs, parallel constructs
- this allows programmers to model and analyse the performance of their code, and to make intelligent decisions regarding implementation (a minimal sketch follows)
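A minimal, hedged sketch of such a microbenchmark (not from the original slides), comparing local and remote read latency with POSIX timing; a real suite would also defeat compiler optimisation of the loops and control for caching, warm-up and topology. REPS and the helper seconds() are illustrative:

```c
#include <upc.h>
#include <stdio.h>
#include <time.h>

#define REPS 100000
shared int a[THREADS];

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    volatile int sink = 0;               /* discourage dead-code elimination */
    int remote = (MYTHREAD + 1) % THREADS;
    upc_barrier;

    double t0 = seconds();
    for (int i = 0; i < REPS; i++) sink += a[MYTHREAD];  /* local reads */
    double t_local = seconds() - t0;

    t0 = seconds();
    for (int i = 0; i < REPS; i++) sink += a[remote];    /* remote reads */
    double t_remote = seconds() - t0;

    printf("thread %d: local %.3g us/read, remote %.3g us/read\n",
           MYTHREAD, 1e6 * t_local / REPS, 1e6 * t_remote / REPS);
    return 0;
}
```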
Interoperability tools
- focus on language interoperability: the aim is to enable code written in one language to be called directly from code written in another
- encourage and enable code reuse (libraries)
- a notable effort here is Babel (out of LLNL)
  - C, C++, Fortran 77-2008, Python, Java, and now also Chapel, UPC and X10, though the latter three are still experimental
Performance profiling
- get information on hotspots and a breakdown of timings
  - how much time is spent waiting for data to arrive at the processing core?
  - how much time is spent on memory management?
- lower-level information such as cache reuse, memory bandwidth, cycles per instruction, etc.
Visualising data locality
- accessing memory is not uniformly expensive: it is important to keep data on memory infrastructure close to the processing core that will operate on it
- a tool should highlight poor data locality, based on memory access patterns
Visualising communications
- related to the data locality issue on the previous slide
- communication is implicit in most (though not all) PGAS languages: with remote direct memory access it is difficult to gain a clear understanding of the communication patterns
- optimising these patterns is important for performance
What is the reality?
- some of this functionality would be extremely beneficial, but does not even exist for shared-memory programming
  - e.g. a data locality visualiser for multi-core processors?
- what chance is there for PGAS tools? they would need to support a myriad of different programming, memory and execution models...
Will PGAS play a role in Exascale?
- not all of the PGAS languages will survive; those that don't will suffer the same fate as HPF
- the ones (or even the one?) that remain won't necessarily be the best implementations of PGAS, but those that got the most support and managed to pick up momentum
Conclusions
- PGAS is in principle an attractive model, but there are too many disparate implementations
- this makes community support difficult, and may even be the downfall of the PGAS implementations
- only time will tell!
Questions?