Distributed Shared Memory for High-Performance Computing


Distributed Shared Memory for High-Performance Computing
Stéphane Zuckerman, Haitao Wei, Guang R. Gao
Computer Architecture & Parallel Systems Laboratory
Electrical & Computer Engineering Dept., University of Delaware
140 Evans Hall, Newark, DE 19716, USA
{szuckerm, hwei, ggao}@udel.edu
October 17, 2014

Outline
1. Introduction
2. Hardware Distributed Shared Memory: NUMA Systems
   - Uniform Memory Access Systems
   - Non-Uniform Memory Access Systems
3. Software Distributed Memory Systems
   - Introduction to Partitioned Global Address Space Systems
4. Where to Learn More

Introduction I
Why Distributed Shared Memory?
To ease the programmer's task: productivity... and that is mostly it, really.
Why Is Productivity Important?
Let's ask Fred Brooks (Brooks, The Mythical Man-Month (Anniversary Ed.), Chap. 8):
"IBM OS/360 experience, while not available in the detail of Harr's data, confirms it. Productivities in the range of 600-800 debugged instructions per man-year were experienced by control program groups. Productivities in the 2000-3000 debugged instructions per man-year range were achieved by language translator groups. These include planning done by the group, coding component test, system test, and some support activities (...) Both Harr's data and OS/360 data are for assembly language programming. Little data seem to have been published on system programming productivity using higher-level languages. Corbató of MIT's Project MAC reports, however, a mean productivity of 1200 lines of debugged PL/I statements per man-year on the MULTICS system (between 1 and 2 million words). (...)"

Introduction II
Why Is Productivity Important? (Cont'd)
Brooks, The Mythical Man-Month (Anniversary Ed.), Chap. 8:
"This number is very exciting. Like the other projects, MULTICS includes control programs and language translators. Like the others, it is producing a system programming product, tested and documented. The data seem to be comparable in terms of kind of effort included. And the productivity number is a good average between the control program and translator productivities of other projects. But Corbató's number is lines per man-year, not words! Each statement in his system corresponds to about three to five words of handwritten code! This suggests two important conclusions:
- Productivity seems constant in terms of elementary statements, a conclusion that is reasonable in terms of the thought a statement requires and the errors it may include.
- Programming productivity may be increased as much as five times when a suitable high-level language is used."

Introduction III
Why Is Productivity Important? (Cont'd)
Brooks, The Mythical Man-Month (Anniversary Ed.), Chap. 12:
"The chief reasons for using a high-level language are productivity and debugging speed (...) There is not a lot of numerical evidence [in the 1960s...], but what there is suggests improvement by integral factors, not just incremental percentages. (...) For me, these productivity and debugging reasons are overwhelming. I cannot easily conceive of a programming system I would build in assembly language."
So really, productivity solves two major problems: time-to-solution (i.e., software is produced faster) and how to produce bug-free code.

Introduction IV
Different Ways of Implementing DSMs
- Hardware
- Software
- Hardware-software hybrids

Uniform Memory Access Systems I
Uniform Memory Access
- SMP systems used to provide uniform access to all memory banks.
- Example: on x86, a single front-side bus (FSB) to access DRAM.
Advantages:
- For the hardware designer: easier to design and implement.
- For the programmer: guarantees on memory latency.
Drawbacks:
- To guarantee uniform access, overall throughput is reduced.
- In general, UMA architectures do not scale beyond a single compute node.
- Even on a single compute node, UMA systems saturate easily.

Non-Uniform Memory Access Systems I
Non-Uniform Memory Access Systems
- Hardware can be designed so that memory banks are directly attached to a given socket (or set of sockets).
- To maintain a single address space, an interconnection system must be implemented.
- In theory, NUMA systems need not be coherent; in practice, all NUMA systems currently available are really cache-coherent NUMA (ccNUMA).
Examples:
- x86-based multi-processor compute nodes provide an interconnection network: AMD Opteron-based systems use HyperTransport; Intel Xeon-based systems use QuickPath Interconnect (QPI).
- SGI proposed the Altix multiprocessor NUMA system (based on Intel Itanium 2 processors), where an unmodified Linux OS could access up to 1024 processors (so up to 2048 cores).

Non-Uniform Memory Access Systems II
Limits of (cc)NUMA
Even for large-scale NUMA machines such as the Altix, scalability remains an issue:
- At the hardware level: producing hardware for large-scale ccNUMA requires it to be tightly coupled with the processors.
- At the software level: ensuring data locality becomes a bigger problem.
- At the operating-system level: a choice must be made (by the user):
  - let the OS follow a first-touch page-allocation policy (best when the software can easily be optimized for locality), or
  - require the OS to allocate pages in a random or round-robin way (when data accesses are truly random).
- For very large-scale computations, an additional software layer must be implemented to help access the fast network devices (e.g., InfiniBand, Quadrics, etc.).

Introduction to Partitioned Global Address Space Systems
Basic Concepts
- Maintain a programmer-centric global address space.
- Automatically partition arrays and other shared data structures across compute nodes.
- Provide means to handle locality: if an object is supposed to be available locally, there should be a way to inform the system.
- When shared data structures are accessed, the runtime automatically knows where to issue the request.
How to Implement PGAS
- Using a library (e.g., GASNet)
- Using a programming language (e.g., X10, Chapel, Titanium, ...)
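To make the partitioned-global-address-space idea concrete, here is a minimal Chapel sketch (our own illustration, not from the slides; it assumes Chapel's standard BlockDist module, and the exact dmapped syntax has varied slightly across Chapel releases). The array A is indexed as one global array, yet its blocks physically live on different locales, and ownership remains queryable:

    // A block-distributed array: one global name, physically partitioned.
    use BlockDist;

    config const n = 1000000;
    const Space = {1..n};
    const D = Space dmapped Block(boundingBox=Space);  // split 1..n over all locales
    var A: [D] real;

    // Each iteration runs on the locale that owns A[i]; no explicit messages.
    forall i in D do
      A[i] = 2.0 * i;

    // Locality is still visible when needed.
    writeln("A[", n, "] is stored on locale ", A[n].locale.id);

The point is the model: the address space is global (a single A), but it is partitioned, and each partition has an owner that the runtime uses to decide where an access or a loop iteration should execute.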

PGAS Languages
A Very Brief History
DARPA's High Productivity Computing Systems (HPCS) program was launched in 2002 with five teams, each led by a hardware vendor: Cray Inc., Hewlett-Packard, IBM, SGI, and Sun Microsystems.
Examples of PGAS Languages
Several languages follow a PGAS approach: IBM's X10, Cray's Chapel, HP and Berkeley's Unified Parallel C (UPC), etc. They all propose constructs to express parallelism in a more or less implicit way. Most of these languages are either developed as an open-source package or offer an open implementation. (This is important: languages become popular thanks to their availability, or because they are the only ones in their market segment!)

Chapel I: Overview
- Official web site: http://chapel.cray.com
- Open Source (https://github.com/chapel-lang)
- Targets general parallelism (i.e., any algorithm should be expressible as a Chapel program).
- Separates parallelism and locality: concurrent regions of code vs. data placement.
- Multi-resolution parallelism: either rely on implicit parallelism or, as a parallel-programming expert, use explicit parallel constructs to drive parallel execution.
- Targets productivity (type inference, iterator functions, OOP, various array types).
- Data-centric.
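As a small illustration of the productivity claim (our own snippet, not from the slides): a complete Chapel program needs no class or main() scaffolding, and a config const automatically becomes a command-line option.

    // hello.chpl -- a complete Chapel program.
    // Run as: ./hello --audience=PGAS
    config const audience = "world";
    writeln("Hello, ", audience, "!");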

Chapel II: Overview
Task Parallelism Constructs
- begin { ... }: creates an anonymous task running the code between braces.
- cobegin { ... }: fine-grain task creation; creates one task per statement in the block.
Data Parallelism Constructs
- forall elem in Range do ...: creates a coarse-grain parallel loop, akin to OpenMP's #pragma omp for.
- coforall elem in Collection do ...: creates a fine-grain parallel loop; each iteration is a task.
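A compact sketch of these four constructs (hypothetical code consistent with the descriptions above, not taken from the slides):

    // Task parallelism.
    cobegin {                        // one task per statement, implicit join at the '}'
      writeln("cobegin task A");
      writeln("cobegin task B");
    }
    begin writeln("detached task");  // fire-and-forget; no join at this point

    // Data parallelism.
    forall i in 1..8 do              // coarse grain: a bounded number of tasks share the range
      writeln("forall iteration ", i);

    coforall i in 1..4 do            // fine grain: one task per iteration
      writeln("coforall task ", i);

Note that a Chapel program only terminates once all outstanding tasks (including the detached begin above) have completed.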

Chapel III: Overview
Synchronization
- sync { statement; }: creates a synchronization point that waits for all tasks created within the block; akin to a barrier (coarse-grain synchronization).
- var variableName: sync type;: creates a synchronization variable (or set of them) that acts as a full/empty-bit location.
- var variableName: atomic type;: creates a variable (or set of them) that is accessed atomically (accesses are sequentially consistent).
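A short sketch of the three mechanisms (hypothetical; the variable names are ours):

    // sync statement: wait for every task spawned inside the block.
    sync {
      begin writeln("task inside the sync block");
    }
    writeln("runs only after the task above has finished");

    // sync variable: carries a full/empty bit; writers fill it, readers drain it.
    var handshake$: sync bool;
    begin handshake$.writeEF(true);   // producer: empty -> full
    const ok = handshake$.readFE();   // consumer: blocks until full, then empties

    // atomic variable: operations are atomic and sequentially consistent.
    var counter: atomic int;
    coforall i in 1..4 do counter.add(1);
    writeln("counter = ", counter.read());   // prints 4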

Chapel IV: Overview
Locality Constructs
- The locale type: used to confine portions of the computation and data to a specific part of the machine (typically a compute node).
- The on clause: makes a statement execute on a specific locale.
- Locality and parallelism constructs can be combined, e.g., begin on Locale.left { ... }
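For instance (a hypothetical sketch; Locales, here, and the on clause are standard Chapel, the printed text is ours):

    // One task per locale; each task migrates to its locale with 'on',
    // so the body executes where that locale's memory lives.
    coforall loc in Locales do
      on loc do
        writeln("hello from locale ", here.id, " (", here.name, ")");

Combined with the data-parallel constructs shown earlier, this is how a Chapel program spreads both tasks and data across a cluster while keeping a single global view.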

X10 I
- Official web site: http://x10-lang.org
- Open Source (http://sourceforge.net/projects/x10/files/x10dt/2.5.0/x10dt-2.5.0-linux.gtk.x86.zip/download)
- Built on top of the Java VM
- Partially inspired by Scala (a mostly functional, but multi-paradigm, language based on the JVM)
- Provides back ends for both Java and C++ code (source-to-source translation)
- Also distinguishes between parallelism and locality

X10 II
Parallel Constructs
- async { ... }: creates an anonymous task running the code between braces.
- finish { statement; }: creates a synchronization point that waits for all async tasks created within the statement.
- These can be combined: finish async { ... }
Locality Constructs: Accessing Places
- at: place-shifting operation
- when: concurrency control within a place
- atomic: concurrency control within a place
- GlobalRef[T]: distributed heap management
- PlaceLocalHandle[T]: distributed heap management

X10 III
Combining Locality and Parallelism Constructs
- at (p) function(...): remote evaluation
- at (p) async function(...): active message
- finish for (p in Place.places()) { at (p) async runEverywhere(...); }: SPMD execution
- at (ref) async atomic ref() += v;: atomic remote update

X10 Examples: Hello World

    import x10.io.Console;

    class HelloWorld {
        public static def main(Rail[String]) {
            Console.OUT.println("Hello World!");
        }
    }

X10 Examples: Hello Whole World

    import x10.io.Console;

    class HelloWholeWorld {
        public static def main(args:Rail[String]):void {
            if (args.size < 1) {
                Console.OUT.println("Usage: HelloWholeWorld message");
                return;
            }
            finish for (p in Place.places()) {
                at (p) async Console.OUT.println(here + " says hello and " + args(0));
            }
            Console.OUT.println("Goodbye");
        }
    }

X10 Examples: Fibonacci

    import x10.io.Console;

    public class Fibonacci {
        public static def fib(n:Long):Long {
            if (n < 2) return n;
            val f1:Long;
            val f2:Long;
            finish {
                async { f1 = fib(n-1); }
                async { f2 = fib(n-2); }
            }
            return f1 + f2;
        }

        public static def main(args:Rail[String]) {
            val n = (args.size > 0) ? Long.parse(args(0)) : 10;
            Console.OUT.println("Computing fib(" + n + ")");
            val f = fib(n);
            Console.OUT.println("fib(" + n + ") = " + f);
        }
    }

Learning More About PGAS
Internet Resources
- General PGAS web site: http://pgas.org
- Chapel: http://chapel.cray.com
- X10: http://x10-lang.org
- UPC: http://upc-lang.org, http://upc.lbl.gov/ and http://upc.gwu.edu
- The GASNet library (used in Berkeley's UPC): http://gasnet.lbl.gov/
Tutorials Used for this Class
- Bradford L. Chamberlain's overview of Chapel: http://chapel.cray.com/papers/briefoverviewchapel.pdf
- Chamberlain's slides presenting Chapel: http://chapel.cray.com/presentations/chapelforeth-distributeme.pdf
- X10 tutorial slides: http://x10.sourceforge.net/tutorials/x10-2.4/APGASProgrammingInX10/APGASprogrammingInX10-slides-V7.pdf
- UPC tutorial slides: http://upc.gwu.edu/tutorials/tutorials_sc2003.pdf
Food for Thought
- Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software (available at http://www.gotw.ca/publications/concurrency-ddj.htm)
- Lee, The Problem with Threads (available at http://www.eecs.berkeley.edu/pubs/techrpts/2006/eecs-2006-1.pdf)

References I
- Boehm, Hans-J. "Threads Cannot Be Implemented As a Library." In: SIGPLAN Notices 40.6 (June 2005), pp. 261-268. ISSN: 0362-1340. DOI: 10.1145/1064978.1065042. URL: http://doi.acm.org/10.1145/1064978.1065042.
- Brooks Jr., Frederick P. The Mythical Man-Month (Anniversary Ed.). Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1995. ISBN: 0-201-83595-9.
- Lee, Edward A. "The Problem with Threads." In: Computer 39.5 (May 2006), pp. 33-42. ISSN: 0018-9162. DOI: 10.1109/MC.2006.180. URL: http://dx.doi.org/10.1109/mc.2006.180.
- Sutter, Herb. "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software." In: Dr. Dobb's Journal 30.3 (2005).