Distributed Shared Memory for High-Performance Computing

Size: px
Start display at page:

Download "Distributed Shared Memory for High-Performance Computing"

Transcription

1 Distributed Shared Memory for High-Performance Computing Stéphane Zuckerman Haitao Wei Guang R. Gao Computer Architecture & Parallel Systems Laboratory Electrical & Computer Engineering Dept. University of Delaware 140 Evans Hall Newark,DE 19716, USA {szuckerm, hwei, Tuesday, October 17, 2014 Zuckerman et al. PGAS 1 / 22

2 Outline 1 Introduction 2 Hardware Distributed Shared Memory: NUMA Systems Uniform Memory Access Systems Non-Uniform Memory Access Systems 3 Software Distributed Memory Systems Introduction to Global Partition Address Space Systems 4 Where to Learn More Zuckerman et al. PGAS 2 / 22

3 Introduction I Why Distributed Shared Memory? To ease the programmer s task productivity... And that is mostly it, really. Why Is Productivity Important? Let s ask Fred Brooks (Brooks, The Mythical Man-month (Anniversary Ed.) Chap. 8): IBM OS/360 experience, while not available in the detail of Harr s data, confirms it. Productivities in range of debugged instructions per man-year were experienced by control program groups. Productivities in the debugged instructions per man-year were achieved by language translator groups. These include planning done by the group, coding component test, system test, and some support activities (... ) Both Harr s data and OS/360 data are for assembly language programming. Little data seem to have been published on system programming productivity using higher-level languages. Corbatò of MIT s Project MAC reports, however, a mean productivity of 1200 lines of debugged PL/I statements per man-year on the MULTICS system (between 1 and 2 million words). (... ) Zuckerman et al. PGAS 3 / 22

4 Introduction II Why Is Productivity Important? (Cont d) Brooks, The Mythical Man-month (Anniversary Ed.) Chap. 8: This number is very exciting. Like the other projects, MULTICS includes control programs and language translators. Like the others, it is producing a system programming product, tested and documented. The data seem to be comparable in terms of kind of effort included. And the productivity number is a good average between the control program and translator productivities of other projects. But Corbatò s number is lines per man-year, not words! Each statement in his system corresponds to about three to five words of handwritten code! This suggests two important conclusions. Productivity seems constant in terms of elementary statements, a conclusion that is reasonable in terms of the thought a statement requires and the errors it may include. Programming productivity may be increased as much as five times when a suitable high-level language is used Zuckerman et al. PGAS 4 / 22

5 Introduction III Why Is Productivity Important? (cont d) Brooks, The Mythical Man-month (Anniversary Ed.) Chap. 12: The chief reasons for using a high-level language are productivity and debugging speed (... ) There is not a lot of numerical evidence [in the 1960s... ], but what there is suggests improvement by integral factors, not just incremental percentages. (... ) For me, these productivity and debugging reasons are overwhelming. I cannot easily conceive of a programming system I would build in assembly language. So really, productivity solves two major problems: Time-to-solution (i.e., software is produced faster), and how to produce bug-free code Zuckerman et al. PGAS 5 / 22

6 Introduction IV Different Ways of Implementing DSMs Hardware Software Hardware-software hybrids Zuckerman et al. PGAS 6 / 22

7 Uniform Memory Access Systems I Uniform Memory Access SMP systems used to propose a uniform access to memory banks Example: for x86, a single front-side bus (FSB) to access DRAM Advantages: For the hardware, easier to design and implement For the programmer, guarantees on latency Drawbacks: To guarantee uniform access, throughput is somewhat slowed down In general, UMA architectures do not scale beyond a single compute node. Even on a single compute node, UMA systems saturate easily Zuckerman et al. PGAS 7 / 22

8 Non-Uniform Memory Access Systems I Non-Uniform Memory Access Systems Hardware can be designed so memory banks are directly attached to a given (set of) socket(s) To maintain a single address space, an interconnection system must be implemented In theory NUMA systems need not be coherent In practice all NUMA systems currently available are really Cache Coherent NUMA (ccnuma) Examples: x86-based multi-processor compute nodes provide an interconnection network: AMD Opteron-based systems use HyperTransport Intel Xeon-based systems use QuickPath Interconnect (QPI) SGI proposed the Altix multiprocessor NUMA system (based on Intel Itanium2 processors) where an unmodified Linux OS could access up to 1024 processors (so up to 2048 cores) Zuckerman et al. PGAS 8 / 22

9 Non-Uniform Memory Access Systems II Limits of (cc-)numa Even in the case of large-scale NUMA like Altix systems, scalability remains an issue: At the hardware level: producing hardware for large-scale ccnuma requires it to be tightly coupled with the processors At the software level: ensuring data locality becomes a bigger problem At the operating system level: a choice must be made (by the user): Let the OS follow a first-touch page allocation policy best for when the software can easily be optimized for locality Require the OS to allocate pages in a random or round-robin way (when data access is truly random-ish). For very large scale computations, an additional software layer must be implemented to help access the fast network devices (e.g., Infiniband, Quadrics, etc.). Zuckerman et al. PGAS 9 / 22

10 Introduction to Global Partition Address Space Systems Basic Concepts Maintain a programmer-centric global address space Automagically partition arrays and other shared data structures across compute nodes Provide means to handle locality: if an object is supposed to be available locally, there should be a way to inform the system When shared data structures are accessed, the software automatically knows where to issue the request How to Implement PGAS Using a library (e.g., Gasnet) Using a programming language (e.g., X10, Chapel, Titanium,... ) Zuckerman et al. PGAS 10 / 22

11 PGAS Languages A Very Brief History DARPA s High Productivity Computing Systems (HPCS) program was launched in 2002 with five teams, each led by a hardware vendor: Cray Inc., Hewlett-Packard, IBM, SGI, and Sun Microsystems. Examples of PGAS Languages Several languages are following a PGAS approach: IBM s X10, Cray s Chapel, HP and Berkeley s Unified Parallel C (UPC), etc. They all propose constructs to express parallelism in a more or less implicit way. Most of these languages are either developed as an Open Source package or propose an open implementation a a This is important: languages get popular thanks to their availability or because they are the only ones on their market segments! Zuckerman et al. PGAS 11 / 22

12 Chapel I Overview Chapel Official web site: Open Source ( Targets general parallelism (i.e. any algorithm should be expressible as a Chapel program) Separates parallelism and locality: concurrent regions of code vs. data placement. Multi-resolution parallelism: either use implicit parallelism, or if parallel expert, use direct parallel constructs to drive parallel execution Targets productivity (type inference, iterator functions, OOP, various array types) Data-centric Zuckerman et al. PGAS 12 / 22

13 Chapel II Overview Task Parallelism Constructs begin{...: Creates an anonymous task using the code between braces cobegin{...: Fine-grain way to task creation Creates a task for each statement in the block Data Parallelism Constructs forall elem in Range do...: Creates a coarse-grain parallel loop akin to OpenMP s #pragma omp for coforall elem in Collection do...: Creates a fine-grain parallel loop each iteration is a task Zuckerman et al. PGAS 13 / 22

14 Chapel III Overview Synchronization sync{statement;: Creates a synchronization point for all tasks created within a parallel region. Akin to a barrier (coarse-grain synchronization) var variablename sync type;: Creates a (set of) synchronization variable(s) which acts as a full/empty bit (set of) location(s). var variablename atomic type;: Creates a (set of) variable(s) that are accessed atomically (accesses are sequentially consistent). Zuckerman et al. PGAS 14 / 22

15 Chapel IV Overview Locality Constructs The Locale type: used to confine portions of computations and data to a specific part of the machine (typically a compute node). The on clauses: to make a statement execute a specific locale. Locality and parallelism constructs can be combined, e.g., begin on Locale.left {... Zuckerman et al. PGAS 15 / 22

16 X10 I X10 Official web site: Open Source ( /x10dt linux.gtk.x86.zip/download) Built on top of Java VM Partially inspired by Scala (a mostly functional, but multi-paradigm language based on the JVM) Provides a back-end to both Java and C++ code (source-to-source translation) Also distinguishes between parallelism and locality Zuckerman et al. PGAS 16 / 22

17 X10 II Parallel Constructs async{...: Creates an anonymous task using the code between braces finish{statement;: Creates a synchronization point for all async tasks created within a parallel region. Can be combined: finish async{... Locality Constructs: Accessing Places at: Place shifting operation when: Concurrency control within a place atomic: Concurrency control within a place GlobalRef[T]: Distributed heap management PlaceLocalHandle[T]: Distributed heap management Zuckerman et al. PGAS 17 / 22

18 X10 III Combining Locality and Parallelism Constructs at(p) function(...): Remote evaluation at(p) async function(...): Active message finish for (p in Places.places()) { at(p) async runeverywhere(...) : SPMD at(ref) async atomic ref() += v: Atomic remote update Zuckerman et al. PGAS 18 / 22

19 X10 Examples Hello World import x10.io.console; class HelloWorld { public static def main(rail[string]) { Console.OUT.println("Hello World!" ); Zuckerman et al. PGAS 19 / 22

20 X10 Examples Hello World import x10.io.console; class HelloWorld { public static def main(rail[string]) { Console.OUT.println("Hello World!" ); import x10.io.console; class HelloWholeWorld { public static def main(args:rail[string]):void { if (args.size < 1) { Console.OUT.println("Usage: HelloWholeWorld message"); return; finish for (p in Place.places()) { at (p) async Console.OUT.println(here+" says hello and "+args(0)); Console.OUT.println("Goodbye"); Zuckerman et al. PGAS 19 / 22

21 X10 Examples Fibonacci import x10.io.console; public class Fibonacci { public static def fib(n:long) { if (n<2) return n; val f1:long; val f2:long; finish { async { f1 = fib(n-1); async { f2 = fib(n-2); return f1 + f2; public static def main(args:rail[string]) { val n = (args.size > 0)? Long.parse(args(0)) : 10; Console.OUT.println("Computing fib("+n+")"); val f = fib(n); Console.OUT.println("fib("+n+") = "+f); Zuckerman et al. PGAS 20 / 22

22 Learning More About Multi-Threading and OpenMP Internet Resources General PGAS web site: Chapel: X10: UPC: and The GASNet library (used in Berkeley s UPC): Tutorials Used for this Class Bradford L. Chamberlain s overview of Chapel: Chamberlain s slides to present Chapel: X10 tutorial slides: APGASProgrammingInX10/APGASprogrammingInX10-slides-V7.pdf UPC tutorial slides: Food for Thoughts Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software (available at Lee, The Problem with Threads (available at Zuckerman et al. PGAS 21 / 22

23 References I Boehm, Hans-J. Threads Cannot Be Implemented As a Library. In: SIGPLAN Not (June 2005), pp ISSN: DOI: / URL: Brooks Jr., Frederick P. The Mythical Man-month (Anniversary Ed.) Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., ISBN: Lee, Edward A. The Problem with Threads. In: Computer 39.5 (May 2006), pp ISSN: DOI: /MC URL: Sutter, Herb. The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. In: Dr. Dobb s Journal 30.3 (2005). Zuckerman et al. PGAS 22 / 22

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008

Parallel Computing Using OpenMP/MPI. Presented by - Jyotsna 29/01/2008 Parallel Computing Using OpenMP/MPI Presented by - Jyotsna 29/01/2008 Serial Computing Serially solving a problem Parallel Computing Parallelly solving a problem Parallel Computer Memory Architecture Shared

More information

CS 470 Spring Parallel Languages. Mike Lam, Professor

CS 470 Spring Parallel Languages. Mike Lam, Professor CS 470 Spring 2017 Mike Lam, Professor Parallel Languages Graphics and content taken from the following: http://dl.acm.org/citation.cfm?id=2716320 http://chapel.cray.com/papers/briefoverviewchapel.pdf

More information

Chapel Introduction and

Chapel Introduction and Lecture 24 Chapel Introduction and Overview of X10 and Fortress John Cavazos Dept of Computer & Information Sciences University of Delaware www.cis.udel.edu/~cavazos/cisc879 But before that Created a simple

More information

Non-uniform memory access (NUMA)

Non-uniform memory access (NUMA) Non-uniform memory access (NUMA) Memory access between processor core to main memory is not uniform. Memory resides in separate regions called NUMA domains. For highest performance, cores should only access

More information

Overview: Emerging Parallel Programming Models

Overview: Emerging Parallel Programming Models Overview: Emerging Parallel Programming Models the partitioned global address space paradigm the HPCS initiative; basic idea of PGAS the Chapel language: design principles, task and data parallelism, sum

More information

. Programming in Chapel. Kenjiro Taura. University of Tokyo

. Programming in Chapel. Kenjiro Taura. University of Tokyo .. Programming in Chapel Kenjiro Taura University of Tokyo 1 / 44 Contents. 1 Chapel Chapel overview Minimum introduction to syntax Task Parallelism Locales Data parallel constructs Ranges, domains, and

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Memory Systems in Pipelined Processors

Memory Systems in Pipelined Processors Advanced Computer Architecture (0630561) Lecture 12 Memory Systems in Pipelined Processors Prof. Kasim M. Al-Aubidy Computer Eng. Dept. Interleaved Memory: In a pipelined processor data is required every

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple

More information

The APGAS Programming Model for Heterogeneous Architectures. David E. Hudak, Ph.D. Program Director for HPC Engineering

The APGAS Programming Model for Heterogeneous Architectures. David E. Hudak, Ph.D. Program Director for HPC Engineering The APGAS Programming Model for Heterogeneous Architectures David E. Hudak, Ph.D. Program Director for HPC Engineering dhudak@osc.edu Overview Heterogeneous architectures and their software challenges

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

Parallel Computing Platforms

Parallel Computing Platforms Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

OpenMP for next generation heterogeneous clusters

OpenMP for next generation heterogeneous clusters OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great

More information

Convergence of Parallel Architecture

Convergence of Parallel Architecture Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

COSC 6374 Parallel Computation. Parallel Computer Architectures

COSC 6374 Parallel Computation. Parallel Computer Architectures OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Spring 2010 Flynn s Taxonomy SISD:

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains: The Lecture Contains: Four Organizations Hierarchical Design Cache Coherence Example What Went Wrong? Definitions Ordering Memory op Bus-based SMP s file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture10/10_1.htm[6/14/2012

More information

Introduction to OpenMP. Lecture 2: OpenMP fundamentals

Introduction to OpenMP. Lecture 2: OpenMP fundamentals Introduction to OpenMP Lecture 2: OpenMP fundamentals Overview 2 Basic Concepts in OpenMP History of OpenMP Compiling and running OpenMP programs What is OpenMP? 3 OpenMP is an API designed for programming

More information

COSC 6374 Parallel Computation. Parallel Computer Architectures

COSC 6374 Parallel Computation. Parallel Computer Architectures OS 6374 Parallel omputation Parallel omputer Architectures Some slides on network topologies based on a similar presentation by Michael Resch, University of Stuttgart Edgar Gabriel Fall 2015 Flynn s Taxonomy

More information

Parallel Architectures

Parallel Architectures Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

An Introduction to OpenMP

An Introduction to OpenMP An Introduction to OpenMP U N C L A S S I F I E D Slide 1 What Is OpenMP? OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism

More information

Chap. 6 Part 3. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1

Chap. 6 Part 3. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1 Chap. 6 Part 3 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 OpenMP popular for decade Compiler-based technique Start with plain old C, C++, or Fortran Insert #pragmas into source file You

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 Performance Workshop Multicore Programming Overview Shared memory systems Basic Concepts in OpenMP Brief history of OpenMP Compiling and running OpenMP programs 2 1 Shared memory systems OpenMP

More information

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems Reference Papers on SMP/NUMA Systems: EE 657, Lecture 5 September 14, 2007 SMP and ccnuma Multiprocessor Systems Professor Kai Hwang USC Internet and Grid Computing Laboratory Email: kaihwang@usc.edu [1]

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

Affine Loop Optimization using Modulo Unrolling in CHAPEL

Affine Loop Optimization using Modulo Unrolling in CHAPEL Affine Loop Optimization using Modulo Unrolling in CHAPEL Aroon Sharma, Joshua Koehler, Rajeev Barua LTS POC: Michael Ferguson 2 Overall Goal Improve the runtime of certain types of parallel computers

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Parallel Programming Concepts. Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04. Parallel Background. Why Bother?

Parallel Programming Concepts. Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04. Parallel Background. Why Bother? Parallel Programming Concepts Tom Logan Parallel Software Specialist Arctic Region Supercomputing Center 2/18/04 Parallel Background Why Bother? 1 What is Parallel Programming? Simultaneous use of multiple

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Lecture 2: OpenMP fundamentals Overview Basic Concepts in OpenMP History of OpenMP Compiling and running OpenMP programs 2 1 What is OpenMP? OpenMP is an API designed for programming

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Compute Node Linux (CNL) The Evolution of a Compute OS

Compute Node Linux (CNL) The Evolution of a Compute OS Compute Node Linux (CNL) The Evolution of a Compute OS Overview CNL The original scheme plan, goals, requirements Status of CNL Plans Features and directions Futures May 08 Cray Inc. Proprietary Slide

More information

Implementing X10. Towards Performance and Productivity at Scale. David Grove IBM Research NICTA Summer School February 5, 2013

Implementing X10. Towards Performance and Productivity at Scale.   David Grove IBM Research NICTA Summer School February 5, 2013 Implementing X10 Towards Performance and Productivity at Scale http://x10-lang.org David Grove NICTA Summer School February 5, 2013 This material is based upon work supported in part by the Defense Advanced

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

Module 18: "TLP on Chip: HT/SMT and CMP" Lecture 39: "Simultaneous Multithreading and Chip-multiprocessing" TLP on Chip: HT/SMT and CMP SMT

Module 18: TLP on Chip: HT/SMT and CMP Lecture 39: Simultaneous Multithreading and Chip-multiprocessing TLP on Chip: HT/SMT and CMP SMT TLP on Chip: HT/SMT and CMP SMT Multi-threading Problems of SMT CMP Why CMP? Moore s law Power consumption? Clustered arch. ABCs of CMP Shared cache design Hierarchical MP file:///e /parallel_com_arch/lecture39/39_1.htm[6/13/2012

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

Steve Deitz Chapel project, Cray Inc.

Steve Deitz Chapel project, Cray Inc. Parallel Programming in Chapel LACSI 2006 October 18 th, 2006 Steve Deitz Chapel project, Cray Inc. Why is Parallel Programming Hard? Partitioning of data across processors Partitioning of tasks across

More information

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP

UvA-SARA High Performance Computing Course June Clemens Grelck, University of Amsterdam. Parallel Programming with Compiler Directives: OpenMP Parallel Programming with Compiler Directives OpenMP Clemens Grelck University of Amsterdam UvA-SARA High Performance Computing Course June 2013 OpenMP at a Glance Loop Parallelization Scheduling Parallel

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message

More information

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

X10 and APGAS at Petascale

X10 and APGAS at Petascale X10 and APGAS at Petascale Olivier Tardieu 1, Benjamin Herta 1, David Cunningham 2, David Grove 1, Prabhanjan Kambadur 1, Vijay Saraswat 1, Avraham Shinnar 1, Mikio Takeuchi 3, Mandana Vaziri 1 1 IBM T.J.

More information

Tutorial: Introduction to POSIX Threads

Tutorial: Introduction to POSIX Threads Tutorial: Introduction to POSIX Threads Stéphane Zuckerman Haitao Wei Guang R. Gao Computer Architecture & Parallel Systems Laboratory Electrical & Computer Engineering Dept. University of Delaware 140

More information

A Uniform Programming Model for Petascale Computing

A Uniform Programming Model for Petascale Computing A Uniform Programming Model for Petascale Computing Barbara Chapman University of Houston WPSE 2009, Tsukuba March 25, 2009 High Performance Computing and Tools Group http://www.cs.uh.edu/~hpctools Agenda

More information

Tasking and OpenMP Success Stories

Tasking and OpenMP Success Stories Tasking and OpenMP Success Stories Christian Terboven 23.03.2011 / Aachen, Germany Stand: 21.03.2011 Version 2.3 Rechen- und Kommunikationszentrum (RZ) Agenda OpenMP: Tasking

More information

Online Course Evaluation. What we will do in the last week?

Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

Parallel Architecture. Hwansoo Han

Parallel Architecture. Hwansoo Han Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

SINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES. Basic usage of OpenSHMEM

SINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES. Basic usage of OpenSHMEM SINGLE-SIDED PGAS COMMUNICATIONS LIBRARIES Basic usage of OpenSHMEM 2 Outline Concept and Motivation Remote Read and Write Synchronisation Implementations OpenSHMEM Summary 3 Philosophy of the talks In

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Memory Hierarchy. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Memory Hierarchy Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Parallel Languages: Past, Present and Future

Parallel Languages: Past, Present and Future Parallel Languages: Past, Present and Future Katherine Yelick U.C. Berkeley and Lawrence Berkeley National Lab 1 Kathy Yelick Internal Outline Two components: control and data (communication/sharing) One

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Topic C Memory Models

Topic C Memory Models Memory Memory Non- Topic C Memory CPEG852 Spring 2014 Guang R. Gao CPEG 852 Memory Advance 1 / 29 Memory 1 Memory Memory Non- 2 Non- CPEG 852 Memory Advance 2 / 29 Memory Memory Memory Non- Introduction:

More information

Lecture 13. Shared memory: Architecture and programming

Lecture 13. Shared memory: Architecture and programming Lecture 13 Shared memory: Architecture and programming Announcements Special guest lecture on Parallel Programming Language Uniform Parallel C Thursday 11/2, 2:00 to 3:20 PM EBU3B 1202 See www.cse.ucsd.edu/classes/fa06/cse260/lectures/lec13

More information

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and

More information

Overview: Shared Memory Hardware

Overview: Shared Memory Hardware Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing

More information

Multi-core Programming - Introduction

Multi-core Programming - Introduction Multi-core Programming - Introduction Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

Multiprocessors - Flynn s Taxonomy (1966)

Multiprocessors - Flynn s Taxonomy (1966) Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence

Page 1. SMP Review. Multiprocessors. Bus Based Coherence. Bus Based Coherence. Characteristics. Cache coherence. Cache coherence SMP Review Multiprocessors Today s topics: SMP cache coherence general cache coherence issues snooping protocols Improved interaction lots of questions warning I m going to wait for answers granted it

More information

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory

COMP Parallel Computing. SMM (1) Memory Hierarchies and Shared Memory COMP 633 - Parallel Computing Lecture 6 September 6, 2018 SMM (1) Memory Hierarchies and Shared Memory 1 Topics Memory systems organization caches and the memory hierarchy influence of the memory hierarchy

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Parallel Programming with Coarray Fortran

Parallel Programming with Coarray Fortran Parallel Programming with Coarray Fortran SC10 Tutorial, November 15 th 2010 David Henty, Alan Simpson (EPCC) Harvey Richardson, Bill Long, Nathan Wichmann (Cray) Tutorial Overview The Fortran Programming

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Enforcing Textual Alignment of

Enforcing Textual Alignment of Parallel Hardware Parallel Applications IT industry (Silicon Valley) Parallel Software Users Enforcing Textual Alignment of Collectives using Dynamic Checks and Katherine Yelick UC Berkeley Parallel Computing

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

Why you should care about hardware locality and how.

Why you should care about hardware locality and how. Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient

More information

Barbara Chapman, Gabriele Jost, Ruud van der Pas

Barbara Chapman, Gabriele Jost, Ruud van der Pas Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology

More information

PCS - Part Two: Multiprocessor Architectures

PCS - Part Two: Multiprocessor Architectures PCS - Part Two: Multiprocessor Architectures Institute of Computer Engineering University of Lübeck, Germany Baltic Summer School, Tartu 2008 Part 2 - Contents Multiprocessor Systems Symmetrical Multiprocessors

More information

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.

Non-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors. CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Multiprocessor OS. COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2. Scalability of Multiprocessor OS.

Multiprocessor OS. COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2. Scalability of Multiprocessor OS. Multiprocessor OS COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2 2 Key design challenges: Correctness of (shared) data structures Scalability Scalability of Multiprocessor OS

More information

CS420: Operating Systems

CS420: Operating Systems Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Le Yan Scientific computing consultant User services group High Performance Computing @ LSU Goals Acquaint users with the concept of shared memory parallelism Acquaint users with

More information

Parallel and Distributed Computing

Parallel and Distributed Computing Parallel and Distributed Computing NUMA; OpenCL; MapReduce José Monteiro MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer Science and Engineering

More information

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform

More information

Lecture 9: MIMD Architecture

Lecture 9: MIMD Architecture Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is

More information

CS4961 Parallel Programming. Lecture 10: Data Locality, cont. Writing/Debugging Parallel Code 09/23/2010

CS4961 Parallel Programming. Lecture 10: Data Locality, cont. Writing/Debugging Parallel Code 09/23/2010 Parallel Programming Lecture 10: Data Locality, cont. Writing/Debugging Parallel Code Mary Hall September 23, 2010 1 Observations from the Assignment Many of you are doing really well Some more are doing

More information

27. Parallel Programming I

27. Parallel Programming I The Free Lunch 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism,

More information

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture

More information

Parallelization, OpenMP

Parallelization, OpenMP ~ Parallelization, OpenMP Scientific Computing Winter 2016/2017 Lecture 26 Jürgen Fuhrmann juergen.fuhrmann@wias-berlin.de made wit pandoc 1 / 18 Why parallelization? Computers became faster and faster

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Le Yan Objectives of Training Acquaint users with the concept of shared memory parallelism Acquaint users with the basics of programming with OpenMP Memory System: Shared Memory

More information

Lecture 32: Partitioned Global Address Space (PGAS) programming models

Lecture 32: Partitioned Global Address Space (PGAS) programming models COMP 322: Fundamentals of Parallel Programming Lecture 32: Partitioned Global Address Space (PGAS) programming models Zoran Budimlić and Mack Joyner {zoran, mjoyner}@rice.edu http://comp322.rice.edu COMP

More information

CPEG 852 Advanced Topics in Computing Systems The Dataflow Model of Computation

CPEG 852 Advanced Topics in Computing Systems The Dataflow Model of Computation CPEG 852 Advanced Topics in Computing Systems The Dataflow Model of Computation Stéphane Zuckerman Computer Architecture & Parallel Systems Laboratory Electrical & Computer Engineering Dept. University

More information