Overview. 2 Introduction to parallel computing. The control structure. Parallel computers

Size: px
Start display at page:

Download "Overview. 2 Introduction to parallel computing. The control structure. Parallel computers"

Transcription

1 Overview 2 Introduction to parallel computing Robert Mullins Parallel computing platforms Approaches to building parallel computers Today's chip-multiprocessor architectures Approaches to parallel programming Programming with threads and shared memory Message-passing libraries PGAS languages High-level parallel languages 2 Parallel computers The control structure How might we exploit multiple processing elements and memories in order to complete a large computation quickly? How are the processing elements controlled? How many processing elements, how powerful? How do they communicate and cooperate? Flynn's taxonomy: Single Instruction Multiple Data (SIMD) Multiple Instruction Multiple Data (MIMD) How are memories and processing elements interconnected? How is the memory hierarchy organised? Centrally from single control unit or can they work independently? How might we program such a machine? 3 4

2 The control structure The control structure Other possible organisations: SIMD Dataflow Systolic array Single-Program Multiple-Data (SPMD) Same program runs on all processors Commonly seen in MPI (message-passing) or PGAS programs The scalar pipelines execute in lockstep Data-independent logic is shared Efficient for highly data parallel applications Much simpler instruction fetch and supply mechanism SIMD hardware can support a SPMD model if the individual threads follow similar control flow Masked execution 5 A Generic Streaming Multiprocessor (for graphics applications) Reproduced from, "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow", W. W. L. Fung et al 6 The communication model The communication model A clear distinction is made between two common communication models: 2. Message-passing platforms 1. Shared-address-space platforms All processors have access to a shared data space accessed via a shared address space All communication takes place via a shared memory Each processing element may also have an area of memory that is private 7 Each processing element has its own exclusive address space Communication is achieved by sending explicit messages between processing elements The sending and receiving of messages can be used to both communicate between and synchronize the actions of multiple processing elements 8

3 Multi-core SMP multiprocessor Figure courtesy of Tim Harris, MSR 9 Figure courtesy of Tim Harris, MSR NUMA multiprocessor 10 Message-passing platforms Many early messagepassing machines provided hardware primitives that were close to the send/receive user-level communication commands e.g. a pair of processors may be interconnected with a hardware FIFO queue The network topology restricted which processors could be named in a send or receive operation (e.g. only neighbours could communicate in a mesh network) [Culler, Figure 1.22] Figure courtesy of Tim Harris, MSR

4 Message-passing platforms Message-passing platforms The Transputer (1984) Recently some chipmultiprocessors have taken a similar approach (RAW/Tilera and XMOS) The result of an earlier foray into the world of parallel computing! Transputer contained integrated serial links for building multiprocessors IN/OUT instructions in ISA for sending and receiving messages Programmed in OCCAM (based on CSP) IBM Victor V256 (1991) 16x16 array of transputers The processors could be partitioned dynamically between different users Message queues (or communication channels) may be register mapped or accessed via special instructions The processor stalls when reading an empty input queue or when trying to write to a full output buffer A wireless application mapped to the RAW processor. Data is streamed from one core to another over a statically scheduled network. Network input and output is register mapped. (See also the iwarp paper on wiki) Message-passing platforms Message-passing platforms For larger message-passing machines (typically scientific supercomputers) direct FIFO designs were soon replaced by designs that built message-passing upon remote memory copies (supported by DMA or a The most fundamental communication primitives in a message-passing machine are synchronous send and receive operations more general communication assist processor) The interconnection networks also became more powerful, supporting the automatic routing of messages between arbitrary nodes No restrictions on programmer or software support required Hardware and software evolution meant there was a general convergence of parallel machine organisations 15 Here data movement must be specified at both ends of the communication, this is known as two-sided communication. e.g. MPI_Send and MPI_Recv* Non-blocking versions of send and receive are also often provided to allow computation and communication to be overlapped *Message Passing Interface (MPI) is a portable message-passing system that is supported by a very wide range of parallel machines. 16

5 One-side communication The communication model SHMEM From a hardware perspective we would like to keep the machine simple (message-passing) But we inevitably need to simplify the programmer's and compiler's task Provides routines to access the memory of a remote processing element without any assistance from the remote process, e.g: shmem_put (target_addr, source_addr, length, remote_pe) shmem_get, shmem_barrier etc. One-sided communication may be used to reduce synchronization, simplify programming and reduce data movement 17 Today's chip multiprocessors Intel Nehalem-EX (2009) Efficiently support shared-memory programming Add support for transactional memory? Create a simple but high-performance target Trade-offs between hardware complexity and complexity of hardware and compiler. 18 Today's chip multiprocessors Intel Nahalem-EX (2009) 8-cores 2-way hyperthreaded (SMT) 16 hardware threads L1 L1I 32KB, L1D 32KB 256 KB L2 (Private) 24MB L3 (Shared) L2 8-banks Inclusive L3 Shared L3 Memory 19 20

6 Today's chip multiprocessors IBM Power 7 (2010) Today's chip multiprocessors IBM Power 7 (2010) 8 core (dual-chip module to hold 16 cores) 32MB shared edram L3 cache 2-channel DDR3 controllers Individual cores 4-thread SMT per core 6 ops/cycle 4GHz 21 Today's chip multiprocessors 22 Oracle M7 Processor (2014) Sun Niagara T1 (2005) 32 core Dual-issue, OOO Dynamic multithreading 1-8 threads/core 256KB I&D L2 caches shared by groups of 4 cores 64MB L3 Technology: 20nm, 13 metal layers 16 DDR channels 160GB/s (vs. ~20GB/s for T1) >10B transistors! Each core has its own level 1 cache (16KB for instructions, 8KB for data). The level 2 caches are 3MB in total and are effectively 12-way associative. They are interleaved by 64-byte cache lines

7 Manycore designs: Tilera Manycore designs: Celerity (2017) Tilera (now Mellanox) Evolution of MIT RAW 100-cores grid of identical tiles Low-power 3-way VLIW cores Cores interconnected by a selection of static and dynamic on-chip networks Tiered Accelerator Fabric General-purpose tier: 5 Rocket RISC-V cores Massively parallel tier: stage RISC-V cores, 16x31 tiled mesh array Specialised tier: Binarized Neural Network accelerator GPUs Communication latencies TESLA P100 Chip multiprocessor Some have very fast core to core communication, as low as 1-3 cycles Opportunities to add dedicated core-to-core links Typical L1-to-L1 communication latencies may be around cycles 56 Streaming multiprocessors x 64 cores = 3584 cores or lanes 732GB/s memory bandwidth 4MB L2 cache 15.3 billion transistors Other types of parallel machine: Shared memory multiprocessor ~500 Cluster/supercomputer ~ The NVIDIA GeForce 8800 GPU, Hot Chips

8 Approaches to parallel programming Approaches to parallel programming Principles of Parallel Programming, Calvin Lin and Lawrence Snyder, Pearson, 2009 This book provides a good overview of the different approaches to parallel programming There is also a significant amount of information on the course wiki Programming with threads and shared memory Message-passing libraries PGAS languages High level parallel languages Try some examples! A thread, or thread of execution, is a unit of parallelism How might we express threads in our code? fork/join It consists of everything necessary to execute a sequential stream of instructions program code, a call stack, set of registers (incl. a single program counter) It shares memory with other threads Threads cooperate and coordinate there actions by reading and writing to shared variables Special atomic operations are provided by the multiprocessor for synchronization 31 Fork/Join keywords can appear anywhere in code General, but unstructured p1 ; start p5 in fork(p5) p2 fork(p3) P4 ; wait for p5 to ; complete join(p5) p6 join(p3) p7 A forked procedure runs in parallel with main thread 32

9 fork/join using the pthreads library parbegin/parend (cobegin/coend) Limitations to bare metal thread programming? void *thread_func ( void *ptr) { int i = ((thread_args *) ptr) >input; ((thread_args *) ptr) >output = fib(i); return NULL; } args.input=n 1; // create and start first thread status = pthread_create(&thread, NULL, thread_func, (void*)&args ); // calc. fib(n 2) in parallel result = fib (n 2); // join pthread_join(thread, NULL); 33 Simple and structured, but not as general as fork/join, e.g. we cannot represent the graph on the previous slide. p1 parbegin p5 begin p2 parbegin p3 p4 parend end parend p6 p7 34 Even though parbegin..parend can only represent properly nested dependency graphs it is usually adequate Cilk style spawn/sync forall (doall, parfor) cilk int fib (int n) { if (n < 2) return n; else { int x, y; spawn indicates that the proceduce call can safely proceed in parallel sync wait until all previously spawned procedures have returned their results Simply allows a programmer to indicate that each iteration of the loop is independent and may be run in parallel OpenMP example: #pragma omp parallel for for (i=first; i<n; i+=prime) marked[i]=1; x = spawn fib (n 1); y = spawn fib (n 2); sync; } return (x+y); } 35 36

10 Futures Synchronization and coordination Future <expr> Evaluate the expression concurrently with calling program. An asynchronous function call If a thread requires the value of a future that has not been computed, stall the thread until it is available y=future (fn(x)) z=y+1; In addition to creating threads, we also need to be able to control the way threads interact. Often involves identifying critical sections Mechanisms Locks and barriers Mutexes and monitors Condition Variables (wait/signal) Transactional memory See reading group papers and examples The incremental garbage collection of processes, Baker/Hewitt, Message-passing PGAS languages Simple (perhaps primitive) programming model Partitioned Global Address Space Languages Programmer must distribute and explicitly move data The fact that the interactions are explicit can be seen as both an advantage and a disadvantage Potentially simple hardware implementation Processes communicate and synchronize by sending messages Message Passing Interface (MPI) standard Widely used on High-Performance Computing (HPC) platforms Programs tend to be portable Usually written in a Single-Program Multiple-Data (SPMD) style 39 Aimed at large-scale distributed memory machines Aim to improve on MPI PGAS languages overlay a global address space on the virtual memories of the distributed machines No expectation that memories will be coherent The programmer distinguishes between local and non-local data The compiler generates the necessary communication calls in response to non-local references Compiler exploits one-sided communication primitives rather than message-passing Co-Array Fortran, Unified Parallel C, Titanium (Ti) (Titanium extends Java) 40

11 High-level parallel languages Global view of computation Raise level of abstraction Hide low-level details of communication and synchronization Take a global view and describe the algorithm rather than per-task behavior e.g. ZPL forces programmer to think in parallel style using array operations (reference to neighboring elements, flood, remap, reduction,...) Compiler, runtime and libraries will manage implementation details Interesting examples: ZPL Array programming language NESL, Data Parallel Haskell (see wiki) See also Cray Chapel, IBM X10, Sun Fortress languages (DARPA HPCS project) 41

2 Introduction to parallel computing. Chip Multiprocessors (ACS MPhil) Robert Mullins

2 Introduction to parallel computing. Chip Multiprocessors (ACS MPhil) Robert Mullins 2 Introduction to parallel computing Robert Mullins Overview Parallel computing platforms Approaches to building parallel computers Today's chip-multiprocessor architectures Approaches to parallel programming

More information

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics

CS4230 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/28/12. Homework 1: Parallel Programming Basics CS4230 Parallel Programming Lecture 3: Introduction to Parallel Architectures Mary Hall August 28, 2012 Homework 1: Parallel Programming Basics Due before class, Thursday, August 30 Turn in electronically

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017

Design of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017 Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.

Master Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms. Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Introduction II. Overview

Introduction II. Overview Introduction II Overview Today we will introduce multicore hardware (we will introduce many-core hardware prior to learning OpenCL) We will also consider the relationship between computer hardware and

More information

Parallel and High Performance Computing CSE 745

Parallel and High Performance Computing CSE 745 Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011

CS4961 Parallel Programming. Lecture 3: Introduction to Parallel Architectures 8/30/11. Administrative UPDATE. Mary Hall August 30, 2011 CS4961 Parallel Programming Lecture 3: Introduction to Parallel Architectures Administrative UPDATE Nikhil office hours: - Monday, 2-3 PM, MEB 3115 Desk #12 - Lab hours on Tuesday afternoons during programming

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Parallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Processor Architecture and Interconnect

Processor Architecture and Interconnect Processor Architecture and Interconnect What is Parallelism? Parallel processing is a term used to denote simultaneous computation in CPU for the purpose of measuring its computation speeds. Parallel Processing

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

Parallel Architecture. Hwansoo Han

Parallel Architecture. Hwansoo Han Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Parallel Computing Platforms

Parallel Computing Platforms Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

CSE 392/CS 378: High-performance Computing - Principles and Practice

CSE 392/CS 378: High-performance Computing - Principles and Practice CSE 392/CS 378: High-performance Computing - Principles and Practice Parallel Computer Architectures A Conceptual Introduction for Software Developers Jim Browne browne@cs.utexas.edu Parallel Computer

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Portland State University ECE 588/688. Cray-1 and Cray T3E

Portland State University ECE 588/688. Cray-1 and Cray T3E Portland State University ECE 588/688 Cray-1 and Cray T3E Copyright by Alaa Alameldeen 2014 Cray-1 A successful Vector processor from the 1970s Vector instructions are examples of SIMD Contains vector

More information

High Performance Computing. University questions with solution

High Performance Computing. University questions with solution High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

Lecture 3: Intro to parallel machines and models

Lecture 3: Intro to parallel machines and models Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class

More information

Parallel Architectures

Parallel Architectures Parallel Architectures CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Parallel Architectures Spring 2018 1 / 36 Outline 1 Parallel Computer Classification Flynn s

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality)

Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality) COMP 322: Fundamentals of Parallel Programming Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality) Mack Joyner and Zoran Budimlić {mjoyner,

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

CS/COE1541: Intro. to Computer Architecture

CS/COE1541: Intro. to Computer Architecture CS/COE1541: Intro. to Computer Architecture Multiprocessors Sangyeun Cho Computer Science Department Tilera TILE64 IBM BlueGene/L nvidia GPGPU Intel Core 2 Duo 2 Why multiprocessors? For improved latency

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 11

More information

CS 5220: Shared memory programming. David Bindel

CS 5220: Shared memory programming. David Bindel CS 5220: Shared memory programming David Bindel 2017-09-26 1 Message passing pain Common message passing pattern Logical global structure Local representation per processor Local data may have redundancy

More information

Multiprocessors - Flynn s Taxonomy (1966)

Multiprocessors - Flynn s Taxonomy (1966) Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The

More information

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman) CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI

More information

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Overview. CS 472 Concurrent & Parallel Programming University of Evansville Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms

Complexity and Advanced Algorithms. Introduction to Parallel Algorithms Complexity and Advanced Algorithms Introduction to Parallel Algorithms Why Parallel Computing? Save time, resources, memory,... Who is using it? Academia Industry Government Individuals? Two practical

More information

Introduction to Runtime Systems

Introduction to Runtime Systems Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents

More information

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel

More information

Top500 Supercomputer list

Top500 Supercomputer list Top500 Supercomputer list Tends to represent parallel computers, so distributed systems such as SETI@Home are neglected. Does not consider storage or I/O issues Both custom designed machines and commodity

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017 VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Convergence of Parallel Architecture

Convergence of Parallel Architecture Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

The Art of Parallel Processing

The Art of Parallel Processing The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a

More information

Parallel Computing. Prof. Marco Bertini

Parallel Computing. Prof. Marco Bertini Parallel Computing Prof. Marco Bertini Shared memory: OpenMP Implicit threads: motivations Implicit threading frameworks and libraries take care of much of the minutiae needed to create, manage, and (to

More information

18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013

18-447: Computer Architecture Lecture 30B: Multiprocessors. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013 18-447: Computer Architecture Lecture 30B: Multiprocessors Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/22/2013 Readings: Multiprocessing Required Amdahl, Validity of the single processor

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

CS 470 Spring Parallel Languages. Mike Lam, Professor

CS 470 Spring Parallel Languages. Mike Lam, Professor CS 470 Spring 2017 Mike Lam, Professor Parallel Languages Graphics and content taken from the following: http://dl.acm.org/citation.cfm?id=2716320 http://chapel.cray.com/papers/briefoverviewchapel.pdf

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Barbara Chapman, Gabriele Jost, Ruud van der Pas

Barbara Chapman, Gabriele Jost, Ruud van der Pas Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology

More information

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization

Parallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Programming with Shared Memory. Nguyễn Quang Hùng

Programming with Shared Memory. Nguyễn Quang Hùng Programming with Shared Memory Nguyễn Quang Hùng Outline Introduction Shared memory multiprocessors Constructs for specifying parallelism Creating concurrent processes Threads Sharing data Creating shared

More information

Lecture 13. Shared memory: Architecture and programming

Lecture 13. Shared memory: Architecture and programming Lecture 13 Shared memory: Architecture and programming Announcements Special guest lecture on Parallel Programming Language Uniform Parallel C Thursday 11/2, 2:00 to 3:20 PM EBU3B 1202 See www.cse.ucsd.edu/classes/fa06/cse260/lectures/lec13

More information

Parallel Languages: Past, Present and Future

Parallel Languages: Past, Present and Future Parallel Languages: Past, Present and Future Katherine Yelick U.C. Berkeley and Lawrence Berkeley National Lab 1 Kathy Yelick Internal Outline Two components: control and data (communication/sharing) One

More information

CMSC 611: Advanced. Parallel Systems

CMSC 611: Advanced. Parallel Systems CMSC 611: Advanced Computer Architecture Parallel Systems Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Introduction to Multicore Programming

Introduction to Multicore Programming Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT-1 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year & Semester : III/VI Section : CSE-1 & CSE-2 Subject Code : CS2354 Subject Name : Advanced Computer Architecture Degree & Branch : B.E C.S.E. UNIT-1 1.

More information

27. Parallel Programming I

27. Parallel Programming I 771 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling

More information

A General Discussion on! Parallelism!

A General Discussion on! Parallelism! Lecture 2! A General Discussion on! Parallelism! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! Lecture 2: Overview Flynn s Taxonomy of

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

Multi-core Programming Evolution

Multi-core Programming Evolution Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

27. Parallel Programming I

27. Parallel Programming I 760 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling

More information

IBM Power Multithreaded Parallelism: Languages and Compilers. Fall Nirav Dave

IBM Power Multithreaded Parallelism: Languages and Compilers. Fall Nirav Dave 6.827 Multithreaded Parallelism: Languages and Compilers Fall 2006 Lecturer: TA: Assistant: Arvind Nirav Dave Sally Lee L01-1 IBM Power 5 130nm SOI CMOS with Cu 389mm 2 2GHz 276 million transistors Dual

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

6.1 Multiprocessor Computing Environment

6.1 Multiprocessor Computing Environment 6 Parallel Computing 6.1 Multiprocessor Computing Environment The high-performance computing environment used in this book for optimization of very large building structures is the Origin 2000 multiprocessor,

More information

Parallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?

Parallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer? Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and

More information

Parallel Computing Why & How?

Parallel Computing Why & How? Parallel Computing Why & How? Xing Cai Simula Research Laboratory Dept. of Informatics, University of Oslo Winter School on Parallel Computing Geilo January 20 25, 2008 Outline 1 Motivation 2 Parallel

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

OpenMP 4.0. Mark Bull, EPCC

OpenMP 4.0. Mark Bull, EPCC OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Introduction to GPU programming with CUDA

Introduction to GPU programming with CUDA Introduction to GPU programming with CUDA Dr. Juan C Zuniga University of Saskatchewan, WestGrid UBC Summer School, Vancouver. June 12th, 2018 Outline 1 Overview of GPU computing a. what is a GPU? b. GPU

More information

27. Parallel Programming I

27. Parallel Programming I The Free Lunch 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism,

More information

THREAD LEVEL PARALLELISM

THREAD LEVEL PARALLELISM THREAD LEVEL PARALLELISM Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 4 is due on Dec. 11 th This lecture

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Arquitecturas y Modelos de. Multicore

Arquitecturas y Modelos de. Multicore Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have

More information

OpenMP 4.0/4.5. Mark Bull, EPCC

OpenMP 4.0/4.5. Mark Bull, EPCC OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all

More information

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization

More information

Lecture 2. Memory locality optimizations Address space organization

Lecture 2. Memory locality optimizations Address space organization Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput

More information