Expressing Parallel Computation


Laboratory for Computer Science, M.I.T.
Lecture 1

Mainstream Parallel Computing

Most server-class machines these days are symmetric multiprocessors (SMPs).

PC-class SMPs
- 2 to 4 processors
- cheap
- run Windows & Linux
- track technology

Deluxe SMPs
- 8 to 64 processors
- expensive (a 16-way SMP costs >> 4 x 4-way SMPs)

Applications
- databases, OLTP, Web servers, Internet commerce, ...
- potential applications: EDA tools, technical computing, ...

Large Scale Parallel Computing

Most parallel machines with hundreds of processors are built as clusters of SMPs.
- usually custom built with government funding
- expensive: $10M to $100M
- treated as a national or international resource
- total sales are a tiny fraction of server sales, so these machines have a hard time tracking technology

Applications
- weather and climate modeling
- drug design
- code breaking (NSA, CIA, ...)
- basic scientific research
- weapons development

Few independent software developers; programmed by very smart people.

Paucity of Parallel Applications

Applications follow cost-effective hardware, which has become available only since 1996.

Important applications are hand crafted (usually from their sequential versions) to run on parallel machines:
- explicit, coarse-grain multithreading on SMPs: most business applications
- explicit, coarse-grain message passing on large clusters: most technical/scientific applications

Technical reasons:
- automatic parallelization is difficult
- parallel programming is difficult
- parallel programs run poorly on sequential machines

Can the entry cost of parallel programming be lowered?

Why not use sequential languages?

    Algorithm with parallelism
        |  encode
        v
    Program with sequential semantics
        |  detect parallelism
        v
    Parallel code

Matrix Multiply

C = A x B, where C_{i,j} = \sum_k A_{i,k} B_{k,j}

All C_{i,j}'s can be computed in parallel. In fact, all multiplications can be done in parallel!

Fortran:

          do 30 i = 1, n
             do 20 j = 1, n
                s = 0
                do 10 k = 1, n
                   s = s + A(i,k)*B(k,j)
    10          continue
                C(i,j) = s
    20       continue
    30    continue

Parallelism?
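For comparison, here is the same triply nested loop in C, roughly what a parallelizing compiler would be handed (a sketch added for illustration, not from the slides; the function name matmul is arbitrary). Every element of C is an independent inner product, yet nothing in the sequential text says so; the compiler must discover it.

    /* Sequential matrix multiply (C99).  All n*n inner products are
       independent, but the code expresses a single total order. */
    void matmul(int n, double A[n][n], double B[n][n], double C[n][n])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += A[i][k] * B[k][j];
                C[i][j] = s;
            }
    }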

Parallelizing Compilers

After 30 years of intensive research, only limited success in parallelism detection and program transformations:
- instruction-level parallelism at the basic-block level can be detected
- parallelism in nested for-loops containing arrays with simple index expressions can be analyzed
- analysis techniques (data dependence analysis, pointer analysis, flow-sensitive analysis, abstract interpretation, ...), when applied across procedure boundaries, often take far too long and tend to be fragile, i.e., they can break down after small changes in the program

Instead of training compilers to recognize parallelism, people have been trained to write programs that parallelize.
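To make the middle bullet concrete, here are three small C loops (illustrative examples added here, not from the lecture) showing what dependence analysis can and cannot easily prove:

    /* Simple affine index expressions: the iterations are provably
       independent, so the loop may be run in parallel. */
    void vadd(int n, double a[n], double b[n], double c[n])
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

    /* A loop-carried dependence: iteration i reads what iteration
       i-1 wrote, so the iterations must execute in order. */
    void prefix(int n, double a[n], double b[n])
    {
        for (int i = 1; i < n; i++)
            a[i] = a[i-1] + b[i];
    }

    /* Aliasing: without pointer analysis the compiler cannot tell
       whether p and q overlap, so it must assume they might. */
    void add_in_place(int n, double *p, const double *q)
    {
        for (int i = 0; i < n; i++)
            p[i] += q[i];
    }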

Parallel Programming Models

If parallelism can't be detected in sequential programs automatically, then design new parallel programming models...

High-level
- Data parallel: Fortran 90, HPF, ...
- Multithreaded: Cid, Cilk, ...
- Implicit: Id, pH, Sisal, ...

Low-level
- Message passing: PVM, MPI, ...
- Threads & synchronization: futures, forks, semaphores, ...
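As a taste of the low-level message-passing style, here is a minimal MPI program in C (a sketch added here; MPI is only listed, not shown, in the lecture). Parallelism, data placement, and communication are all explicit:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                        /* process 0 sends ... */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {                /* ... process 1 receives */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }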

Properties of Models

- Determinacy: is the behavior of a program repeatable?
- Compositionality: can independently created subprograms be combined in a meaningful way?
- Expressiveness: can all sorts of parallelism be expressed?
- Implementability: can a compiler generate efficient code for a variety of architectures?
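Determinacy is easiest to appreciate by its absence. In this pthreads sketch (an example added here, not the lecture's), two threads increment a shared counter without synchronization, so the printed total can differ from run to run:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;

    static void *bump(void *arg)
    {
        for (int i = 0; i < 1000000; i++)
            counter++;              /* unsynchronized read-modify-write */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* varies from run to run */
        return 0;
    }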

Safer Ways of Expressing Parallelism

Data parallel: Fortran 90, HPF, ...
- sources of parallelism: vector operators, forall
- communication is specified by shift operations

Implicit parallel programming: Id, pH, Sisal, ...
- functional and logic languages specify only a partial order on operations
- determinacy of programs is guaranteed => easier debugging!
- programming is independent of machine configuration

Data Parallel Programming Model

- All data structures are assigned to a grid of virtual processors.
- Generally, the owner processor computes the data elements assigned to it.
- Global communication primitives allow processors to exchange data.
- There is an implicit global barrier after each communication.
- All processors execute the same program:

      compute (local)
      communicate (global)
      compute (local)
      ...

Data Parallel Matrix Multiply

A Connection Machine Fortran program; each element is on a virtual processor of a 2D grid:

          Real Array(N,N) :: A, B, C
          Layout A(:NEWS, :NEWS), B(:NEWS, :NEWS), C(:NEWS, :NEWS)
          ... set up the initial distribution of data elements ...
          C = A * B                        ! first product after the initial skew
          Do i = 1, N-1                    ! shift rows left and columns up
             A = CShift(A, Dim=2, Shift=1) ! communication
             B = CShift(B, Dim=1, Shift=1) ! communication
             C = C + A * B                 ! data parallel operations
          End Do
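In message-passing terms, one CShift step amounts to a neighborwise exchange. The MPI sketch below (an illustration of the idea added here, not how CM Fortran was actually implemented) circularly shifts each process's block one position to the left; no process can proceed until its own exchange completes, which gives the synchronizing behavior described above:

    #include <mpi.h>

    /* Circularly shift each process's block one rank to the left. */
    void cshift_left(double *block, int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int dest = (rank - 1 + size) % size;   /* left neighbor  */
        int src  = (rank + 1) % size;          /* right neighbor */
        MPI_Sendrecv_replace(block, count, MPI_DOUBLE,
                             dest, 0, src, 0, comm, MPI_STATUS_IGNORE);
    }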

Data Parallel Model

+ Good implementations available
- Difficult to write programs
+ Easy to debug programs because of a single thread
+ Implicit synchronization and communication
- Limited compositionality! Is the composition of two compute-communicate-compute programs still a compute-communicate-compute program?

For general-purpose programming, which has more unstructured parallelism, we need more flexibility in scheduling.

Fully Parallel, Multithreaded Model

[Figure: a tree of activation frames (f, g, h), each with multiple active threads, alongside a global heap of shared objects; loops and calls spawn threads, and execution is asynchronous at all levels.]

Explicit vs Implicit Multithreading

Explicit: C + forks + joins + locks
- multithreaded C: Cid, Cilk, ...
- an easy path for exploiting coarse-grain parallelism in existing codes
- error-prone if locks are used (see the sketch below)

Implicit: languages that specify only a partial order on operations
- functional languages: Id, pH, ...
- safe, high-level, but difficult to implement efficiently without shared memory & ...
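The explicit style, rendered in plain C with pthreads (a sketch added here; Cid and Cilk add language support for the same idea): the mutex makes the update from the earlier race example deterministic, but forgetting one lock, or acquiring two locks in different orders in different places, is exactly the kind of error this slide warns about.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);     /* protect the update ... */
            counter++;
            pthread_mutex_unlock(&lock);   /* ... and release */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);   /* fork */
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);                  /* join */
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);      /* always 2000000 */
        return 0;
    }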

Explicitly Parallel Matrix Multiply

A Cilk program:

    cilk void matrix_multiply(int** A, int** B, int** C, int n)
    {
        ...
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                C[i][j] = spawn IP(A, B, i, j, n);
        sync;
        ...
    }

    cilk int IP(...)
    {
        ...
    }

spawn forks off a call; sync waits for all spawns outstanding in the procedure.
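The slide leaves the body of IP elided. A plausible reading is an ordinary inner product, sketched here in plain C (a hypothetical reconstruction; in the Cilk program it is declared cilk int IP(...) so that it can be spawned): each spawned call computes one element of C from row i of A and column j of B, and the sync waits for all of them to finish.

    /* Inner product of row i of A and column j of B (a guess at the
       elided body, for illustration only). */
    int IP(int **A, int **B, int i, int j, int n)
    {
        int s = 0;
        for (int k = 0; k < n; k++)
            s += A[i][k] * B[k][j];
        return s;
    }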

Implicitly Parallel Matrix Multiply

A pH program:

    make_matrix ((1,1), (n,n)) f
      where f is the filling function \(i,j). ip (row i A) (column j B)

- make_matrix does not specify the order in which the matrix is to be filled!
- There is no implication regarding the distribution of computation and data over a machine.

Implicitly Parallel Languages

They expose all types of parallelism and permit very high-level programming, but they may be difficult to implement efficiently.

Why?
- some inherent property of the language?
- lack of maturity in the compiler? the run-time system? the architecture?

Future

[Figure: Id, pH, Cilk, and HPF compiling to a common multithreaded intermediate language, which targets notebooks, SMPs, and MPPs.]

Freshmen will be taught sequential programming as a special case of parallel programming.

6.827 Multithreaded Languages and Compilers

This subject is about implicit parallel programming in pH:
- functional languages and the λ-calculus, because they form the foundation of pH
- the multithreaded implementation of pH, aimed at SMPs
- some miscellaneous topics from functional languages, e.g., abstract implementation, confluence, semantics, TRSs, ...

This term I also plan to give 6 to 8 lectures on Bluespec, a new language for designing hardware. Bluespec also borrows heavily from functional languages but has a completely different execution model.