The Mother of All Chapel Talks

The Mother of All Chapel Talks
Brad Chamberlain, Cray Inc.
CSEP 524, May 20, 2010

Lecture Structure
1. Programming Models Landscape
2. Chapel Motivating Themes
3. Chapel Language Features
4. Project Status
5. Sample Codes

Chapel: the Programming Models Landscape

Disclaimers
- This lecture's contents should be considered my personal opinions (or at least one facet of them), and not necessarily those of Cray Inc. or my funding sources.
- I work in high-performance scientific computing, so my talk may reflect my biases in that regard (as compared to, say, mainstream multicore programming). That said, there are probably more similarities than differences between the two worlds (especially as time goes on).

Terminology: Programming Models
Programming models:
1. abstract models that permit users to reason about how their programs will execute with respect to parallelism, memory, communication, performance, etc. (e.g., what should/can I be thinking about when writing my programs?)
2. concrete notations used to write programs, i.e., the union of programming languages, libraries, annotations, ...

HPC Programming Model Taxonomy (2010)
- Communication libraries: MPI, PVM, SHMEM, ARMCI, GASNet, ...
- Shared memory programming models: OpenMP, pthreads, ...
- Hybrid models: MPI+OpenMP, MPI+CUDA, MPI+OpenCL, ...
- Traditional PGAS languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), Titanium
- HPCS languages: Chapel, X10, Fortress
- GPU programming models: CUDA, OpenCL, PGI annotations, CAPS, ...
- Others (for which I don't have a neat unifying category): Global Arrays, Charm++, ParalleX, Cilk, TBB, PPL, parallel Matlabs, Star-P, PLINQ, Map-Reduce, DPJ, Yada, ...

Distributed Memory Programming
Characteristics:
- execute multiple binaries simultaneously & cooperatively
- each binary has its own local namespace
- binaries transfer data via communication calls
Examples: MPI, PVM, SHMEM, ...
(diagram: four executables, each with its own local memory)

MPI (Message Passing Interface) Evaluation
MPI strengths:
+ users can get real work done with it
+ it is extremely general
+ it runs on most parallel platforms
+ it is relatively easy to implement (or, that's the conventional wisdom)
+ for many architectures, it can result in near-optimal performance
+ it can serve as a strong foundation for higher-level technologies
MPI weaknesses:
- encodes too much about how data should be transferred rather than simply what data (and possibly when), which can mismatch architectures with different data transfer capabilities
- only supports parallelism at the cooperating-executable level, while applications and architectures contain parallelism at many levels
- doesn't reflect how one abstractly thinks about a parallel algorithm
- no abstractions for distributed data structures, which places a significant bookkeeping burden on the programmer

Panel Question: What problems are poorly served by MPI?
My reaction: What problems are well-served by MPI? ("well-served" = MPI is a natural/productive way of expressing them)
- embarrassingly parallel: arguably
- data parallel: not particularly, due to cooperating-executable issues
  - bookkeeping details related to manual data decomposition
  - data replication, communication, synchronization
  - local vs. global indexing issues
- task parallel: even less so
  - e.g., writing a divide-and-conquer algorithm in MPI: without MPI-2 dynamic process creation it is yucky; with it, your unit of parallelism is the executable, which is weighty (see the Chapel sketch below for contrast)
Its base languages have issues as well:
- Fortran: age leads to baggage + failure to track modern concepts
- C/C++: impoverished support for arrays, pointer aliasing issues
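
For contrast with the divide-and-conquer point above, here is a minimal sketch (my own illustration, not from the original talk) of how dynamic task parallelism expresses such an algorithm in Chapel; sumRange and its serial cutoff are hypothetical choices:

  // Recursive divide-and-conquer sum over a range, spawning the two halves
  // as concurrent tasks ('sumRange' and the cutoff are hypothetical).
  proc sumRange(lo: int, hi: int): int {
    if hi - lo < 1000 then                  // small subproblem: compute serially
      return + reduce (lo..hi);
    const mid = (lo + hi) / 2;
    var left, right: int;
    cobegin with (ref left, ref right) {    // two child tasks, joined here
      left  = sumRange(lo, mid);
      right = sumRange(mid + 1, hi);
    }
    return left + right;
  }

  writeln(sumRange(1, 1000000));            // prints 500000500000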

(Traditional) PGAS Programming Models
Characteristics:
- execute an SPMD program (Single Program, Multiple Data)
- all binaries share a namespace
- the namespace is partitioned, permitting reasoning about locality
- binaries also have a local, private namespace
- the compiler introduces communication to satisfy remote references
Examples: UPC, Co-Array Fortran, Titanium
(diagram: the program images share a partitioned namespace while each also holds private variables such as y and z)

PGAS: What's in a Name?
(comparing memory model, programming model, execution model, data structures, and communication)
- MPI: distributed-memory memory model; cooperating executables (often SPMD in practice) as the programming/execution model; manually fragmented data structures; communication via APIs
- OpenMP: shared-memory memory model; global-view parallelism; shared-memory multithreaded execution; shared-memory arrays; communication N/A
- CAF, UPC, Titanium: PGAS memory model; Single Program, Multiple Data (SPMD) programming/execution model; data structures: co-arrays (CAF), 1D block-cyclic arrays / distributed pointers (UPC), class-based arrays / distributed pointers (Titanium); communication: co-array refs (CAF), implicit (UPC), method-based (Titanium)
- Chapel: PGAS memory model; global-view parallelism; distributed-memory multithreaded execution; global-view distributed arrays; implicit communication

PGAS Evaluation
PGAS strengths:
+ implicit expression of communication through variable names; decouples data transfer from synchronization
+ the ability to reason about locality/affinity supports scalable performance
Traditional PGAS language strengths:
+ elegant, reasonably minimalist extensions to established languages
+ raise the level of abstraction over MPI
+ good support for distributed pointer-based data structures
+ some support for distributed arrays
Traditional PGAS language weaknesses:
- all: impose an SPMD programming + execution model on the user
- CAF: problems that don't divide evenly impose bookkeeping details
- UPC: like C, its 1D arrays seem impoverished for many HPC codes
- Titanium: perhaps too pure an OO language for HPC (e.g., arrays should have value rather than reference semantics)

Post-SPMD / Asynchronous PGAS (APGAS)
Characteristics:
- uses the PGAS memory model
- distinct concepts for locality vs. parallelism
- programming/execution models are richer than SPMD
  - each unit of locality can execute multiple tasks/threads
  - nodes can create work for one another (see the sketch below)
Examples: Chapel, X10, Fortress
(diagram: a task queue/pool associated with each memory partition)
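
As a concrete illustration of these characteristics, here is a minimal Chapel sketch of my own (not from the original slides): locality and parallelism are distinct, with "on" controlling where code runs and "coforall"/"begin" creating tasks, so any locale can create work for any other:

  // One task per locale; each task runs on its locale and can spawn more work.
  coforall loc in Locales do
    on loc {
      writeln("hello from locale ", loc.id, " of ", numLocales);
      begin writeln("an extra asynchronous task on locale ", here.id);
    }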

ZPL
Main concepts:
- abstract machine model: the CTA
- data-parallel programming via global-view abstractions
- regions: first-class index sets
- a WYSIWYG performance model

ZPL Concepts: Regions
regions: distributed index sets
  region R      = [1..m, 1..n];
         InnerR = [2..m-1, 2..n-1];
used to declare distributed arrays:
  var A, B: [R] real;
and to express computation over distributed arrays:
  [InnerR] A = B;
(figures: the index sets R and InnerR; the arrays A and B over R; the assignment over InnerR)

ZPL Concepts: Array Operators
array operators: describe nontrivial array indexing
- translation via the at operator (@):           [InnerR] A = B@[0,1];
- replication via the flood operator (>>):       [R] A = >>[1, 1..n] B;
- reduction via the reduce operator (op<<):      [R] sumb = +<< B;
- parallel prefix via the scan operator (op||):  [R] A = +|| B;
- arbitrary indexing via the remap operator (#): [R] A = B#[X,Y];
(figures: with B = 1 1 1 1, the scan yields 1 2 3 4 and the reduction computes sumb; the remap gathers A[i,j] = B[X(i,j), Y(i,j)])

ZPL Concepts: Syntactic Performance Model
The array operators appearing in a statement reveal its communication requirements directly in the syntax:
- no array operators:      [InnerR] A = B;          -> no communication
- at operator (@):         [InnerR] A = B@[0,1];    -> point-to-point communication
- flood operator (>>):     [R] A = >>[1, 1..n] B;   -> broadcast (log-tree) communication
- reduce operator (op<<):  [R] sumb = +<< B;        -> reduction (log-tree) communication
- scan operator (op||):    [R] A = +|| B;           -> parallel-prefix (log-tree) communication
- remap operator (#):      [R] A = B#[X,Y];         -> arbitrary (all-to-all) communication
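
Several of these operators have direct analogs in Chapel, the subject of this talk; the following sketch is my own (current Chapel syntax, not taken from the slides) and shows the reduction, scan, and shifted-index cases:

  config const n = 8;
  var A, B, C: [1..n] real;
  B = 1.0;
  const sumB = + reduce B;       // cf. ZPL's  [R] sumb = +<< B;
  const P    = + scan B;         // cf. ZPL's  [R] A = +|| B;   (yields 1, 2, ..., n)
  A[2..n-1] = C[3..n];           // shifted-index assignment, cf. ZPL's at operator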

Why Aren't We Done? (ZPL's Limitations)
- Only supports a single level of data parallelism
  - imposed by its execution model: single-threaded SPMD
  - not well-suited for task parallelism or dynamic parallelism
  - no support for nested parallelism (see the sketch below)
- Distinct types & operators for distributed and local arrays
  - supports ZPL's WYSIWYG syntactic model
  - impedes code reuse (and has the potential for bad cross-products)
  - annoying
- Only supports a small set of built-in distributions for arrays (e.g., Block, Cut (irregular block), ...); if you need something else, you're stuck
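
To make the nested-parallelism point concrete, here is a small sketch of my own (not from the talk) of the kind of composition ZPL's single-level model cannot express: data-parallel loops nested inside concurrent tasks, written in Chapel:

  config const n = 1000;
  var A, B: [1..n] real;
  cobegin with (ref A, ref B) {          // two concurrent tasks...
    forall i in 1..n do A[i] = i;        // ...each containing a data-parallel loop
    forall i in 1..n do B[i] = n - i;
  }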

ZPL's Successes
- A first-class concept for representing index sets
  - makes the clouds of scalars in array declarations and loops concrete
  - supports a global view of data and control; improved productivity
  - a useful abstraction for user and compiler
  [Bradford L. Chamberlain. The Design and Implementation of a Region-Based Parallel Language. PhD thesis, University of Washington, November 2001.]
- Semantics constraining the alignment of interacting arrays
  - communication requirements are visible to user and compiler in the syntax
  [Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin, Lawrence Snyder, and W. Derrick Weathersby. ZPL's WYSIWYG performance model. In Proceedings of the IEEE Workshop on High-Level Parallel Programming Models and Supportive Environments, 1998.]
- Implementation-neutral expression of communication
  - supports implementation on each architecture using the best paradigm for it
  [Bradford L. Chamberlain, Sung-Eun Choi, and Lawrence Snyder. A compiler abstraction for machine independent parallel communication generation. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, 1997.]
- A good start on supporting distributions and task parallelism
  [Steven J. Deitz. High-Level Programming Language Abstractions for Advanced and Dynamic Parallel Computations. PhD thesis, University of Washington, February 2005.]

Chapel and ZPL
Base Chapel's data-parallel features on ZPL's successes:
- carry first-class index sets forward (see the sketch below)
  - unify them with local arrays, for consistency and sanity (no syntactic performance model)
  - generalize them to support richer data aggregates: sets, graphs, maps
  - remove the alignment requirement on arrays, for programmability (no syntactic performance model); yet, preserve the user's and compiler's ability to reason about aligned arrays
  - preserve implementation-neutral expression of communication
  - support user-defined distributions for arrays
...while expanding to several areas beyond ZPL's scope: task parallelism, concurrency, synchronization, nested parallelism, OOP, generic programming, modern syntax, type inference, ...
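
The following is a small Chapel sketch of my own (current syntax, not from the slides) showing the ZPL ideas carried forward: a first-class index set (a domain), arrays declared over it, and global-view computation over a sub-domain. Making R distributed would change only the domain declaration (e.g., via a standard distribution such as Block), not the array code:

  config const m = 8, n = 8;
  const R      = {1..m, 1..n},     // a domain: a first-class index set
        InnerR = R.expand(-1);     // its interior, cf. ZPL's InnerR
  var A, B: [R] real;              // arrays declared over the domain
  B = 1.0;
  A[InnerR] = B[InnerR];           // global-view computation over a sub-domain
  writeln(+ reduce A);             // prints 36.0 for the default 8 x 8 sizes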

A Design Principle HPC Should Revisit
"Support the general case, optimize for the common case"
Claim: a lot of suffering in HPC is due to programming models that focus too much on common cases, e.g.:
- only supporting a single mode of parallelism
- exposing too much about the target architecture and implementation
Impacts:
- hybrid models are needed to target all modes of parallelism (hardware & software)
- challenges arise when architectures change (e.g., multicore, GPUs)
- presents challenges to adoption ("linguistic dead ends")
That said, this approach is also pragmatic, particularly given the community's size and (relatively) limited resources; and frankly, we've achieved a lot of great science when things fit. But we shouldn't stop striving for more general approaches.