The Cascade High Productivity Programming Language

The Cascade High Productivity Programming Language
Hans P. Zima, University of Vienna, Austria, and JPL, California Institute of Technology, Pasadena, CA
ECMWF Workshop on the Use of High Performance Computing in Meteorology, Reading, UK, October 25, 2004

Contents 1 Introduction 2 Programming Models for High Productivity Computing Systems 3 Cascade and the Chapel Language Design 4 Programming Environments 5 Conclusion

Abstraction in Programming
Programming models and languages bridge the gap between reality and hardware at different levels of abstraction, e.g.:
- assembly languages
- general-purpose procedural languages
- functional languages
- very high-level domain-specific languages
Abstraction implies loss of information: a gain in simplicity, clarity, verifiability, and portability versus potential performance degradation.

The Emergence of High-Level Sequential Languages
The designers of the very first high-level programming languages were aware that their success depended on acceptable performance of the generated target programs. John Backus (1957): "It was our belief that if FORTRAN were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger."
High-level algorithmic languages became generally accepted standards for sequential programming, since their advantages outweighed any performance drawbacks. For parallel programming, no similar development took place.

The Crisis of High Performance Computing
Current HPC hardware:
- large clusters at reasonable cost (commodity clusters or custom MPPs)
- off-the-shelf processors and memory components, built for the mass market
- latency and bandwidth problems
Current HPC software:
- application efficiency sometimes in single digits
- low-productivity, local-view programming models dominate: explicit processors with local views of data, program state associated with memory regions, explicit communication intertwined with the algorithm
- wide gap between the domain of the scientist and the programming language
- inadequate programming environments and tools
- higher-level approaches (e.g., HPF) did not succeed, for a variety of reasons

State-of-the-Art
Current parallel programming language, compiler, and tool technologies are unable to support high productivity computing. New programming models, languages, compiler, and tool technologies are necessary to address the productivity demands of future systems.

Goals
- Make scientists and engineers more productive: provide a higher level of abstraction
- Support "abstraction without guilt" [Ken Kennedy]: increase programming language usability without sacrificing performance

Contents 1 Introduction 2 Programming Models for High Productivity Computing Systems 3 Cascade and the Chapel Language Design 4 Programming Environments 5 Conclusion

Productivity Challenges of Peta-Scale Systems
- Large-scale architectural parallelism: hundreds of thousands of processors; component failures may occur in relatively short intervals
- Extreme non-uniformity in data access
- Applications are becoming larger and more complex: multi-disciplinary, multi-language, multi-paradigm; dynamic, irregular, and adaptive
- Legacy codes pose a problem: long-lived applications, surviving many generations of hardware, from F77 to F90, C/C++, MPI, Coarray Fortran, etc.; automatic rewrite under the constraint of performance portability is difficult
- Overall requirements: performance, user productivity, robustness, portability

Programming Models for High Productivity Computing — [diagram: a programming model is a conceptual view of data and control, characterized by its semantics and a productivity model; its realizations include a programming language, a programming language plus directives, a library, or a command-line interface, each mapped onto an execution model (abstract machine)]

Programming Model Issues
Programming models for high productivity computing and their realizations can be characterized along (at least) three dimensions: semantics, user productivity (time to solution), and performance.
Semantics: a mapping from programs to functions specifying the input/output behavior of the program: S: P → F, where each f in F is a function f: I → O.
User productivity (programmability): a mapping from programs to a characterization of structural complexity: U: P → N.
Performance: a mapping from programs to functions specifying the complexity of the program in terms of its execution on a real or abstract target machine: C: P → G, where each g in G is a function g: I → N*.

Contents 1 Introduction 2 Programming Models for High Productivity Computing Systems 3 Cascade and the Chapel Language Design 4 Programming Environments 5 Conclusion

High Productivity Computing Systems
Goals: provide a new generation of economically viable high productivity computing systems for the national security and industrial user community (2007–2010)
Impact:
- Performance (efficiency): improve critical national security applications by a factor of 10X to 40X
- Productivity (time-to-solution)
- Portability (transparency): insulate research and operational application software from the system
- Robustness (reliability): apply all known techniques to protect against outside attacks, hardware faults, and programming errors
HPCS Program Focus Areas — Applications: intelligence/surveillance, reconnaissance, cryptanalysis, airborne contaminant modeling, and biotechnology
Fill the critical technology and capability gap from today's HPC (late-80s technology) to the future (quantum/bio computing)

The Cascade Project
- One-year Concept Study, July 2002 – June 2003
- Three-year Prototyping Phase, July 2003 – June 2006
- Led by Cray Inc. (Burton Smith)
- Partners: Caltech/JPL, University of Notre Dame, Stanford University
- Collaborators in the Programming Environment Area: David Callahan, Brad Chamberlain, Mark James, John Plevyak

Key Elements of the Cascade Architecture
- High-performance networks and multithreading contribute to tolerating memory latency and improving memory bandwidth
- Hardware support for locality-aware programming and program-controlled selection of UMA/NUMA data access avoids serious performance problems present in current architectures
- A shared address space without global cache coherence eliminates a major source of bottlenecks
- A hierarchical two-level processing structure exploits temporal as well as spatial locality
- Lightweight processors in smart memory provide a computational fabric as well as an introspection infrastructure

A Simplified Global View of the Cascade Architecture — [figure: locales, each containing a heavyweight processor (HWP) with software-controlled cache and smart memory, connected by a global interconnection network]

A Cascade Locale — [figure: several lightweight processor chips, each combining DRAM, multithreaded lightweight processors (LWPs), and cache, plus a heavyweight processor (vector, multithreaded, streaming, with compiler-assisted cache), connected via the locale interconnect to a network router leading to other locales]. Source: David Callahan, Cray Inc.

Lightweight Processors and Threads
- Lightweight processors co-located with memory; focus on availability; full exploitation is not a primary system goal
- Lightweight threads: minimal state, high-rate context switch, spawned by sending a parcel to memory
- Exploiting spatial locality: fine-grain (reductions, prefix operations, search), coarse-grain (data distribution and alignment)
- Saving bandwidth by migrating threads to data

Key Issues in High Productivity Languages
Critical Functionality:
- high-level features for explicit concurrency
- high-level features for locality control
- high-level support for distributed collections
- high-level support for programming-in-the-large
Orthogonal Language Issues:
- global address space
- object orientation
- generic programming
- extensibility
- safety
- performance transparency

Locality Awareness: Distribution, Alignment, Affinity — [figure: data distributed and aligned across the processor/memory nodes of an HPCS architecture, with data/thread affinity to the work]
Issues:
- model partitioning of physical domains
- support dynamic reshaping
- express data/thread affinity
- support user-defined distributions

Design Criteria of the Chapel Programming Language
- Global name space, even in the context of a NUMA model; avoid the local-view programming model of MPI, Coarray Fortran, and UPC
- Multiple models of parallelism
- Provide support for: explicit parallel programming, locality-aware programming, interoperability with legacy codes (MPI, Coarray Fortran, UPC, etc.), generic programming
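
To make the global name space and locality-aware programming goals concrete, here is a minimal sketch in the syntax of later released Chapel (not the 2004 draft described in this talk); the built-ins Locales, here, and numLocales belong to that later language and are assumed here for illustration only:

// one task per locale, each executing on that locale
coforall loc in Locales do
  on loc do
    writeln("hello from locale ", here.id, " of ", numLocales);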

Chapel Basics
A modern base language:
- strongly typed
- Fortran-like array features
- object-oriented
- module structure for name space management
- optional automatic storage management
High performance features:
- abstractions for parallelism: data parallelism (domains, forall), task parallelism (cobegin)
- locality management via data distributions and affinity
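
A minimal sketch of these two forms of parallelism, again in later released Chapel syntax rather than the 2004 draft; the domain D and array A are purely illustrative:

const D = {1..8};                 // a 1D domain (index set)
var A: [D] real;                  // an array declared over the domain

// data parallelism: a forall loop over the domain
forall i in D do
  A[i] = 2.0 * i;

// task parallelism: cobegin runs each statement as a separate task
cobegin {
  writeln("sum = ", + reduce A);
  writeln("max = ", max reduce A);
}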

Type Tree
- primitive
- domain
- collection: homogeneous collection (array, sequence, set), heterogeneous collection (class, record)
- function: iterator, other function

Language Design Highlights
The Concrete Language (enhances HPF and ZPL):
- domains as first-class objects: index space, distribution, and associated set of arrays
- generalized arrays and HPF-type data distributions
- support for automatic partitioning of dynamic graph-based data structures
- high-level control for communication (halos, ...)
- abstraction of iteration: generalization of the CLU iterator
The Abstract Language (supports generic programming):
- abstraction of types: type inference from context
- data structure inference: system-selected implementation for programmer-specified object categories
- specialization using cloning
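
To illustrate the "abstraction of iteration" point, the following user-defined iterator is written in later released Chapel syntax; the name fibonacci and its body are illustrative only:

// a CLU-style iterator: execution resumes after each yield
iter fibonacci(n: int) {
  var (a, b) = (0, 1);
  for 1..n {
    yield a;
    (a, b) = (b, a + b);
  }
}

for f in fibonacci(8) do
  write(f, " ");                  // prints: 0 1 1 2 3 5 8 13
writeln();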

Domains
- index sets: Cartesian products, sparse, opaque
- locale view: a logical view for a set of locales
- distribution: a mapping of an index set to a locale view
- array: a map from an index set to a collection of variables
Source: Brad Chamberlain, Cray Inc.

Separation of Concerns: Definition
Example: Matrix Vector Multiplication V1
var Mat: domain(2) = [1..m, 1..n];
var MatCol: domain(1) = Mat(2);
var MatRow: domain(1) = Mat(1);
var A: array [Mat] of float;
var v: array [MatCol] of float;
var s: array [MatRow] of float;
s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);
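
For comparison, a sketch of the same computation in the syntax of later released Chapel, where the per-row sum is written as a reduce expression; the names mirror the example above, but this is not the 2004 draft syntax:

config const m = 4, n = 4;
const Mat = {1..m, 1..n};
var A: [Mat] real;
var v: [1..n] real;
var s: [1..m] real;

// reduce over the second dimension, row by row
forall i in 1..m do
  s[i] = + reduce [j in 1..n] (A[i, j] * v[j]);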

Example: Matrix Vector Multiplication V2
var L: array[1..p1, 1..p2] of locale;
var Mat: domain(2) dist(block, block) to L = [1..m, 1..n];
var MatCol: domain(1) align(*, Mat(2)) = Mat(2);
var MatRow: domain(1) align(Mat(1), *) = Mat(1);
var A: array [Mat] of float;
var v: array [MatCol] of float;
var s: array [MatRow] of float;
s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);
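
A hedged sketch of a distributed variant in Chapel 1.x release syntax, where standard distributions come from the BlockDist module and dmapped plays the role of the dist clause above; the explicit align clauses shown in V2 did not carry over into the released language, so affinity follows from the distribution:

use BlockDist;

config const m = 4, n = 4;
const Space = {1..m, 1..n};
const Mat = Space dmapped Block(boundingBox=Space);   // block-distributed index set
var A: [Mat] real;                                     // distributed array
var v: [1..n] real;
var s: [1..m] real;

forall i in 1..m do
  s[i] = + reduce [j in 1..n] (A[i, j] * v[j]);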

Sparse Matrix Distribution — [figure: a sparse matrix block-distributed across locales, each storing its local nonzeroes in compressed form with data (D), column-index (C), and row-pointer (R) arrays]

Example: Matrix Vector Multiplication V3
var L: array[1..p1, 1..p2] of locale;
var Mat: domain(sparse2) dist(myspdist) to L layout(mysplay) = [1..m, 1..n] where enumeratenonzeroes();
var MatCol: domain(1) align(*, Mat(2)) = Mat(2);
var MatRow: domain(1) align(Mat(1), *) = Mat(1);
var A: array [Mat] of float;
var v: array [MatCol] of float;
var s: array [MatRow] of float;
s = sum(dim=2) [i,j:Mat] A(i,j)*v(j);
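
For the sparse case, later released Chapel provides sparse subdomains rather than the domain(sparse2), dist, and layout clauses above; a minimal sketch with illustrative index values:

const Dense = {1..8, 1..10};
var Sp: sparse subdomain(Dense);   // starts empty; holds only explicitly added indices
Sp += (1, 2);                      // register nonzero positions
Sp += (3, 6);
Sp += (5, 1);
var A: [Sp] real;                  // storage allocated only for indices in Sp

forall (i, j) in Sp do
  A[i, j] = i + j;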

Language Summary
- Global name space
- High-level control features supporting explicit parallelism
- High-level locality management
- High-level support for collections
- Static typing
- Support for generic programming

Contents 1 Introduction 2 Programming Models for High Productivity Computing Systems 3 Cascade and the Chapel Language Design 4 Programming Environments 5 Conclusion

Issues in Programming Environments and Tools
Reliability Challenges:
- massive parallelism poses new problems
- fault prognostics, detection, recovery
- data distribution may cause vital data to be spread across all nodes
(Semi-)Automatic Tuning:
- closed-loop adaptive control: measurement, decision-making, actuation
- information exposure: users, compilers, runtime systems
- learning from experience: databases, data mining, reasoning systems
Introspection:
- a technology to support validation, fault detection, and performance tuning

Example: Offline Performance Tuning — [diagram: the source program passes through a transformation system and parallelizing compiler into a target program; execution on the target machine feeds a program/execution knowledge base consulted by an expert advisor, closing the tuning cycle]

Legacy Code Migration
Rewriting legacy codes:
- preservation of intellectual content
- opportunity for exploiting new hardware, including new algorithms
- code size may preclude the practicality of a rewrite
Language, compiler, tool, and runtime support; (semi-)automatic tools for migrating code:
- incremental porting
- transition of performance-critical sections requires highly sophisticated software for automatic adaptation: high-level analysis, pattern matching and concept comprehension, optimization and specialization

Potential Uses of Lightweight Threads
- Fine-grain application parallelism
- Implementation of a service layer
- Components of agent systems that asynchronously monitor the computation, performing introspection and dealing with: dynamic program validation, fault tolerance, intrusion prevention and detection, performance analysis and tuning, support of feedback-oriented compilation
Introspection can be defined as a system's ability to explore its own properties, reason about its internal state, and make decisions about appropriate state changes where necessary.

Example: A Society of Agents for Performance Analysis and Feedback-Oriented Tuning — [diagram: the compiler and instrumenter produce an instrumented application program; its execution drives data collection, then data reduction and filtering, into a program/performance database consulted by simplification, analysis, and invariant-checking agents and a performance exception handler]

Conclusion
- Today's programming languages, models, and tools cannot deal with 2010 architectures and application requirements
- Peta-scale architectures will pose new challenges but may provide enhanced support for high-level languages and compilers
- The Cascade programming language Chapel targets the creation of a viable language system, together with a programming environment, for economically feasible and robust high productivity computing of the future
D. Callahan, B. Chamberlain, H. P. Zima: The Cascade High Productivity Language. Proceedings of the HIPS 2004 Workshop, Santa Fe, New Mexico, April 2004