Scalable Shared Memory Programming

Scalable Shared Memory Programming
Marc Snir
www.parallel.illinois.edu

What is (my definition of) Shared Memory?
- Global name space (global references)
- Implicit data movement
- Caching: the user gets good memory performance by taking care of locality
  - Temporal locality
  - Spatial locality
  - Processor locality
- Locality is a property of memory accesses, not of memory layout! (A good location is derived from a good access pattern.)
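
To make the locality terms above concrete, here is a small C sketch (mine, not from the talk): summing a matrix row by row walks consecutive addresses, so each fetched cache line is fully used (spatial locality) and the accumulator is reused on every iteration (temporal locality); summing column by column touches a different line on almost every access.

    #include <stdio.h>

    #define N 1024
    static double a[N][N];   /* row-major: a[i][j] and a[i][j+1] are adjacent in memory */

    /* Good spatial locality: the inner loop walks consecutive addresses,
       so each cache line fetched is fully used before it is evicted.     */
    double sum_row_major(void) {
        double s = 0.0;                 /* s stays in a register: temporal locality */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Poor spatial locality: consecutive accesses are N*sizeof(double)
       bytes apart, so each access may touch a different cache line.      */
    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_row_major(), sum_col_major());
        return 0;
    }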

Why Do We Like Shared Memory?
- HPC applications are designed in terms of global (partitioned) data structures; one would like to avoid (up-front) manual translation from the global name space to local name spaces
- One would like users to manage locality, rather than manage communication
  - A good (load-balanced, communication-efficient) computation partition is derived from a good data partition only for very simplistic codes
  - Need control parallelism, not data parallelism
- Determine data location from the place where it is accessed, not vice versa

What's Bad About (Hardware-Supported) Shared Memory?
- Coherence protocols do not scale
- Need more prefetching to cope with latency; prefetch predictors are not sufficient
  - True even at small scale
- Need to move data in bigger chunks (than a cache line) to cope with communication overhead; but large, fixed chunks result in false sharing (e.g., DSVM)
- Data races cause insidious bugs
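
A minimal pthreads illustration of the false-sharing point (my example, not the speaker's): the two counters below are logically independent and the code is race-free, yet because they sit on the same cache line the line ping-pongs between the two cores. The 64-byte line size in the commented-out padding is an assumption.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 50000000L

    /* Two logically independent counters that happen to share a cache line.
       Each thread updates only its own counter, so there is no data race,
       yet the line ping-pongs between the two caches: false sharing.
       Uncommenting the padding (64-byte line assumed) separates them.      */
    struct counters {
        long a;                                  /* written only by thread 0 */
        /* char pad[64 - sizeof(long)]; */
        long b;                                  /* written only by thread 1 */
    } c;

    static void *bump_a(void *arg) { for (long i = 0; i < ITERS; i++) c.a++; return arg; }
    static void *bump_b(void *arg) { for (long i = 0; i < ITERS; i++) c.b++; return arg; }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, bump_a, NULL);
        pthread_create(&t1, NULL, bump_b, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("a=%ld b=%ld\n", c.a, c.b);        /* compile with -pthread */
        return 0;
    }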

Does PGAS Provide Shared Memory?
- Global name space
- Implicit data movement (but copying is needed for performance)
- Caching:
  - Good temporal locality does not necessarily translate into reduced communication
  - Limited ability to aggregate communication
  - Limited ability to prefetch
  - Both depend on compiler analysis

PGAS Productivity Issues (1)
- UPC and CAF partitions are unlikely to match application needs; an additional level of mapping is needed (e.g., mapping mesh patches onto UPC partitions)
  - Need not be part of the core language
  - Need generic programming facilities to handle this well: UPC++? Fortran 2008?

PGAS Productivity Issues (2)
- Good performance in UPC/CAF/... requires explicit coding of communication (aggregating, prefetching, and caching), i.e., requires coding in a distributed-memory style
- PGAS languages do not provide mechanisms for users to help with aggregation, prefetching, and caching, except via explicit messaging
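
The C sketch below illustrates the aggregation point. pgas_get is an invented stand-in for a one-sided remote read (not a UPC or CAF call); here it simply copies from a local array so the sketch runs. The contrast is between issuing one small transfer per element and issuing one aggregated transfer for the whole block.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    #define N 1024
    static double remote_image[N];            /* stands in for remote memory */

    /* Hypothetical one-sided get: copy len bytes starting at remote_off from
       the memory owned by `owner` into dst.  Stubbed with a local memcpy.   */
    static void pgas_get(void *dst, int owner, size_t remote_off, size_t len) {
        (void)owner;                          /* stub: owner is ignored      */
        memcpy(dst, (char *)remote_image + remote_off, len);
    }

    int main(void) {
        double buf[N];

        /* Fine-grained: one small transfer per element, the pattern a naive
           PGAS program produces when remote data is read element by element. */
        for (size_t i = 0; i < N; i++)
            pgas_get(&buf[i], 1, i * sizeof(double), sizeof(double));

        /* Aggregated: one bulk transfer, the distributed-memory style the
           slide says is needed for performance.                             */
        pgas_get(buf, 1, 0, N * sizeof(double));

        printf("buf[0] = %f\n", buf[0]);
        return 0;
    }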

Is PGAS Useful?
- Not as a productivity enhancer, but as a performance enhancer: compiled communications
  - The compiler and run-time can optimize communications (aggregation and prefetch)
  - Communications can be mapped more directly to the iron (vendors may be willing to expose to a run-time a layer they do not want to expose to end users)
- The second opportunity may be the most significant, but it is not leveraged enough

Beyond PGAS? Goals:
- Support shared memory (global name space, implicit communication, caching) at scale
- Enable the user to provide information about communication, doing so in an orthogonal manner (preferably impacting performance, not semantics)
  - Support for prefetch & aggregation
- Avoid races

What Gives? Assumptions:
- Code is bulk-synchronous (or nested bulk-synchronous)
- Ownership of data does not change too frequently
- Communication can be coarse-grained
- Changes of ownership are synchronized (no races)
- Synchronization is collective
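
A minimal sketch of the bulk-synchronous assumption in plain pthreads terms (an analogy, not the proposed runtime): within a phase each thread touches only data it owns, and phases are separated by a collective barrier, so any change of ownership or conflicting access is ordered by that collective.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NPHASES  3

    static pthread_barrier_t phase_barrier;
    static double data[NTHREADS];             /* slot t is owned by thread t */

    static void *worker(void *arg) {
        long t = (long)arg;
        for (int phase = 0; phase < NPHASES; phase++) {
            data[t] += t + phase;                 /* local work on owned data   */
            pthread_barrier_wait(&phase_barrier); /* collective synchronization */
        }
        return NULL;
    }

    int main(void) {
        pthread_t th[NTHREADS];
        pthread_barrier_init(&phase_barrier, NULL, NTHREADS);
        for (long t = 0; t < NTHREADS; t++) pthread_create(&th[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++) pthread_join(th[t], NULL);
        printf("data[3] = %f\n", data[3]);
        return 0;
    }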

Let's Think of Hardware-Supported Shared Memory
- Cache line: the unit of transfer
- Home: the default location of a line
- Directory: the mechanism for tracking the locations and state of a line
- Coherence protocol: the protocol ensuring that the states of a line in different caches and the information in the directory stay consistent
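
For reference, here is one conventional shape such a directory entry might take, assuming a MESI-style protocol and at most 64 caches; the field names are illustrative, not taken from any particular machine.

    #include <stdint.h>

    enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };   /* MESI-like */

    struct directory_entry {
        enum line_state state;   /* global state of the cache line           */
        uint64_t sharers;        /* bit i set => cache i holds a copy        */
        int owner;               /* cache with the exclusive/modified copy,
                                    or -1 if home memory is up to date       */
    };

    /* Does cache c currently hold a copy of the line described by e? */
    static int holds_copy(const struct directory_entry *e, int c) {
        return (e->sharers >> c) & 1;
    }

    int main(void) {
        struct directory_entry e = { SHARED, 0x5, -1 };  /* caches 0 and 2 share */
        return holds_copy(&e, 2) ? 0 : 1;
    }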

Local Memory Copy of a Cache Line
- Data object
  - Dynamically allocated + reference swizzling
  - Or an additional level of indirection
- Partitions of an array
  - Dynamically allocated array chunks; UPC-style global-to-local address translation
- Halo
  - Extend the address mapping of the surrounded cell
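
The UPC-style global-to-local address translation mentioned above is plain index arithmetic. The sketch below assumes a block-cyclic layout with block size B over T threads, i.e., the distribution of a UPC declaration such as shared [B] double a[N]; B and T are example values and the helper names are mine.

    #include <stdio.h>

    #define B 4L     /* block size (elements per block) */
    #define T 8L     /* number of threads (THREADS)     */

    /* Which thread owns global element i, and where it sits locally. */
    static long owner_of(long i)        { return (i / B) % T; }
    static long local_offset_of(long i) { return (i / (B * T)) * B + i % B; }

    int main(void) {
        for (long i = 0; i < 20; i++)
            printf("a[%2ld] -> thread %ld, local slot %ld\n",
                   i, owner_of(i), local_offset_of(i));
        return 0;
    }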

Home
- The current focus of PGAS languages
- Static or dynamic? E.g., particles in cells: if a particle migrates, does it change its name (memcopy) or does it change its home?
- Virtualization (for load balancing and fault tolerance) requires support for home relocation
  - A global and (hopefully) infrequent operation

Coherence Protocol (1)
- If the code is race-free, then all threads that need to participate in a coherence transaction execute a (collective) synchronization
- Conflicting accesses to shared variables (read on write, write on write) must be ordered by an explicit synchronization; we assume it is a collective operation

Coherence Protocol (2)
- Problem: identifying which cache lines are involved in the synchronization (doing better than a cache flush/invalidate)
- Possible solution: explicit transfer of ownership (a collective state change for a cache line or a set of cache lines)
  - Less painful than message passing
  - Early prototypes (PPL1, MSA) show good expressiveness, but not yet good performance
- If the code is race-free, then no directory is needed!
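
One way to picture the "explicit transfer of ownership" idea is as a collective call that changes the state of a range of lines. The interface below is invented purely for illustration, with a stub body so it compiles; it is not the API of PPL1, MSA, or any existing runtime.

    #include <stddef.h>
    #include <stdio.h>

    enum line_state { INVALID, SHARED, EXCLUSIVE };

    /* Called collectively by all threads that share the lines; on return,
       lines [first_line, first_line + nlines) are held by new_owner in
       state new_state.                                                     */
    static void collective_change_state(size_t first_line, size_t nlines,
                                        int new_owner, enum line_state new_state) {
        (void)first_line; (void)nlines; (void)new_owner; (void)new_state;
        /* stub: a real runtime would update line tags and synchronize here */
    }

    int main(void) {
        /* end of a phase: thread 3 takes exclusive ownership of lines 128..255 */
        collective_change_state(128, 128, 3, EXCLUSIVE);
        printf("ownership of 128 lines transferred\n");
        return 0;
    }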

Performance Enhancers
- Aggregation: large enough cache lines
- Prefetch: memget(cache line), memput(cache line, locale, state)
  - Changes the caching state, does not change the name
  - Needs to be properly synchronized (like any other access) to avoid races
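
The slide names memget and memput as the prefetch primitives; the C prototypes and stub bodies below are my guesses at how such calls might look, purely to make the shape concrete. memget pulls a line's contents into a local copy; memput pushes a line to another locale and sets its caching state; neither renames the data.

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    enum line_state { INVALID, SHARED, EXCLUSIVE };

    /* Hypothetical prefetch primitives with stub bodies. */
    static void memget(void *local_copy, const void *line, size_t len) {
        memcpy(local_copy, line, len);            /* stub: fetch the contents  */
    }

    static void memput(const void *line, int locale, enum line_state state) {
        (void)line; (void)locale; (void)state;    /* stub: push line + state   */
    }

    int main(void) {
        double line[8] = {0}, copy[8];
        memget(copy, line, sizeof line);          /* prefetch before a phase   */
        memput(line, 2, SHARED);                  /* hand the line to locale 2 */
        printf("copy[0] = %f\n", copy[0]);
        return 0;
    }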

Extensions
- Reductions (histograms) are handled as added states in the coherence protocol: an add-reduce line can be accessed concurrently by multiple threads; the only access allowed is increment
  - Still need collectives
- Control was not discussed; it is an orthogonal issue
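
A plain-pthreads analogy for the add-reduce line (my sketch, not the proposed coherence extension): during the phase each thread may only increment its private replica of the histogram, and the replicas are combined once, after the collective barrier that ends the phase.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define NBINS    8

    static long replica[NTHREADS][NBINS];    /* one private copy per thread */
    static long histogram[NBINS];            /* combined result             */
    static pthread_barrier_t bar;

    static void *worker(void *arg) {
        long t = (long)arg;
        for (int i = 0; i < 1000; i++)
            replica[t][(i + t) % NBINS]++;        /* increments only         */
        pthread_barrier_wait(&bar);               /* collective end of phase */
        if (t == 0)                               /* one thread combines     */
            for (int n = 0; n < NTHREADS; n++)
                for (int b = 0; b < NBINS; b++)
                    histogram[b] += replica[n][b];
        return NULL;
    }

    int main(void) {
        pthread_t th[NTHREADS];
        pthread_barrier_init(&bar, NULL, NTHREADS);
        for (long t = 0; t < NTHREADS; t++) pthread_create(&th[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++) pthread_join(th[t], NULL);
        printf("bin 0 holds %ld\n", histogram[0]);
        return 0;
    }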

Can We Ensure That Code Has No Concurrency Bugs?
- Run-time checks:
  - Add a global directory, and a local state tag for each line
  - Make sure that a thread accesses only cache lines it is allowed to access, and performs only allowable accesses
    - Out-of-line-bounds checks and guards on loads and stores
  - Make sure that the coherence protocol is correct
    - Check that each directory change is valid
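
A sketch of what the per-line run-time checks could look like in C: each line of shared data carries a local state tag, and every store is routed through a guard that performs the bound check and verifies write permission. The tag layout, access_mode values, and guarded_store helper are all invented for this illustration.

    #include <stdio.h>
    #include <stdlib.h>

    #define NLINES     4
    #define LINE_WORDS 8

    enum access_mode { NO_ACCESS, READ_ONLY, READ_WRITE };

    static double lines[NLINES][LINE_WORDS];
    static enum access_mode tag[NLINES] = { READ_WRITE, READ_ONLY, NO_ACCESS, READ_WRITE };

    static void guarded_store(int line, int word, double v) {
        if (line < 0 || line >= NLINES || word < 0 || word >= LINE_WORDS) {
            fprintf(stderr, "out-of-bounds access to line %d\n", line);
            abort();
        }
        if (tag[line] != READ_WRITE) {             /* guard on the store */
            fprintf(stderr, "illegal write to line %d\n", line);
            abort();
        }
        lines[line][word] = v;
    }

    int main(void) {
        guarded_store(0, 3, 1.5);       /* allowed: line 0 is READ_WRITE    */
        /* guarded_store(1, 0, 2.5); */ /* would abort: line 1 is READ_ONLY */
        return 0;
    }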

Can We Do Checks at Compile Time? (Concurrency Safety)
- Deterministic Parallel Java (Vikram Adve, Rob Bocchino)
- Uses type and effect notation
  - The heap is partitioned into regions, each with a different type
  - Methods are annotated with effect notation that specifies which regions are read and which are written
- Compile-time checks ensure that concurrent accesses are consistent
- Types can be recursive or parameterized

Examples: Regions and Effects (Partitioning the Heap)

    class C {
        region r1, r2;
        int f1 in r1;
        int f2 in r2;
        void m1(int x) writes r1 { f1 = x; }
        void m2(int y) writes r2 { f2 = y; }
        void m3(int x, int y) {
            cobegin {
                m1(x);
                m2(y);
            }
        }
    }

[Heap diagram: an object of class C with field f1 (value 0) in region C.r1 and field f2 (value 5) in region C.r2]

Examples: Regions and Effects (Summarizing Method Effects)

(Same class C and heap diagram as the previous slide; the writes r1 and writes r2 clauses on m1 and m2 are the method effect summaries.)

Examples: Regions and Effects (Expressing Parallelism)

With the effects summarized, the cobegin in m3 can be checked at compile time: its two branches write disjoint regions, so they may run in parallel and no run-time checks are needed!

    void m3(int x, int y) {
        cobegin {
            m1(x);   // Effect = writes r1
            m2(y);   // Effect = writes r2
        }
    }

More Powerful Type Expressions
- Recursive types (recursively nested regions)
  - Handle tree computations, divide and conquer, etc.
- Parameterized types
  - Handle parallel access to arrays
- No run-time checks are needed in either case

Recursive Types

    class TreeNode<region P> {
        region Links, L, R, M, F;
        double mass in P:M;
        double force in P:F;
        TreeNode<L> left in Links;
        TreeNode<R> right in Links;
        TreeNode<*> link in Links;
        void compforce() reads Links, *:M writes P:F {
            cobegin {
                this.force = this.mass * link.mass;
                if (left != null) left.compforce();
                if (right != null) right.compforce();
            }
        }
    }

compforce reads mass fields from nodes reachable from Links and writes the local force field; the two recursive calls can execute in parallel.

Parameterized Types

    class Body<region P> {
        region Link, M, F;
        double mass in P:M;
        double force in P:F;
        Body<*> link in Link;
        void compforce() reads Link, *:M writes P:F {
            force = mass * link.mass;
        }
    }

    Body<*>[]<*> bodies = new Body<*>[N]<*>;
    foreach (int i in 0, N) {
        bodies[i] = new Body<[i]>();
    }
    foreach (int i in 0, N) {
        bodies[i].compforce();
    }

compforce reads mass fields from the bodies reachable from Link and writes the local force field; all invocations can occur in parallel.

Expressiveness & Performance
- Can express nontrivial parallel algorithms
- Can achieve reasonable scalability
- Base performance is very close to raw Java performance

DPJizer: Porting Java to DPJ
- An interactive Eclipse plug-in
- Input: Java program + region declarations + foreach/cobegin
- Output: the same program + effect summaries on methods
- Features:
  - Fully automatic effect inference
  - Supports all major features of DPJ: region parameters; regions for recursive data; array regions
  - Interactive GUI using Eclipse mechanisms
- Next step: infer regions automatically as well

Summary (1)
- PGAS languages essentially support the same programming model as MPI
  - Fixed number of threads
  - Data location is static
  - Data movement is via explicit copying (by necessity with MPI, for performance with UPC)
- Not necessarily bad: MPI code can be converted with limited effort in order to achieve better communication performance

Summary (2)
- A true scalable shared-memory programming model, while not extant, seems possible
- In order to achieve performance, need coarser granularity (larger cache lines, less frequent state changes); doable with many scientific codes
- Scalable, fine-grain shared memory requires HW support that has not proven viable in the last 30 years