Moore's Law. Computer architect goal. Software developer assumption.


Moore's Law: the number of transistors that can be placed inexpensively on an integrated circuit will double approximately every 18 months. A self-fulfilling prophecy: a goal for computer architects, an assumption for software developers.

Impediments to Moore's Law: theoretical limits; what to do with all that die space? Design complexity: how do you meet the expected performance increase?

von Neumann model: execute a stream of instructions (machine code). Instructions can specify arithmetic operations, data addresses, and the next instruction to execute. Complexity: track billions of data locations and millions of instructions. Manage with modular design and high-level programming languages.

Parallelism: continue to increase performance via parallelism. (Chart: number of cores and manufacturing process, 2004-2018.)

From a software point of view, we need to solve demanding problems: engineering simulations, scientific applications, and commercial applications. These need the performance and resource gains afforded by parallelism.

Engineering simulations: aerodynamics, engine efficiency.

Scientific applications: bioinformatics, thermonuclear processes, weather modeling.

Commercial applications: financial transaction processing, data mining, web indexing.

Unfortunately, parallelism greatly increases coding complexity: coordinating concurrent tasks, parallelizing algorithms, and a lack of standard environments and support.

The challenge: provide the abstractions, programming paradigms, and algorithms needed to effectively design, implement, and maintain applications that exploit the parallelism provided by the underlying hardware in order to solve modern problems.

Standard sequential architecture (diagram: CPU connected to RAM over a bus) and its bottlenecks.

Use multiple datapaths, memory units, and processing units.

SIMD: single instruction stream, multiple data streams. (Diagram: one control unit driving multiple processing units over an interconnect.)

SIMD advantages: performs vector/matrix operations well (e.g., Intel's MMX extensions). Disadvantages: too dependent on the type of computation (e.g., graphics); performance and resource utilization suffer if computations aren't embarrassingly parallel.
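To make the single-instruction, multiple-data idea concrete, here is a minimal C sketch using x86 SSE intrinsics (a floating-point successor to MMX); it assumes an SSE-capable processor and a compiler such as GCC or Clang. One intrinsic applies the same addition to four data elements at once.

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];

    /* One SIMD instruction adds four floats at a time:
       the same operation applied to multiple data elements. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* 4 additions in one instruction */
        _mm_storeu_ps(&c[i], vc);
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}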

MIMD: multiple instruction streams, multiple data streams. (Diagram: multiple independent processing/control units connected by an interconnect.)

MIMD advantages: can be built with off-the-shelf components; better suited to irregular data access patterns. Disadvantages: requires more hardware (control units are not shared); the program/OS must be stored at each processor. Example: the typical commodity SMP machines we see today.

Task communication. Shared address space: use common memory to exchange data; communication and replication are implicit. Message passing: use send()/receive() primitives to exchange data; communication and replication are explicit.

Shared address space. Uniform memory access (UMA): access time to a memory location is independent of which processing unit makes the request. Non-uniform memory access (NUMA): access time to a memory location depends on the location of the processing unit relative to the memory accessed.

Message passing: each processing unit has its own private memory; data is passed by exchanging messages. APIs: Message Passing Interface (MPI), Parallel Virtual Machine (PVM).

Algorithm: a finite sequence of instructions, often used for calculation and data processing. Parallel algorithm: an algorithm that can be executed a piece at a time on many different processing devices, with the pieces put back together at the end to get the correct result.
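A minimal sketch of the "pieces put back together" idea, in plain C: an array sum is split into independent chunks whose partial sums could each be computed on a different processing device, then combined at the end. The array contents and chunk count are arbitrary choices for illustration.

#include <stdio.h>

#define N 16
#define CHUNKS 4

int main(void) {
    int data[N], partial[CHUNKS] = {0};
    for (int i = 0; i < N; i++) data[i] = i + 1;

    /* Each chunk could be summed by a different processing unit,
       since the partial sums are independent of one another. */
    for (int c = 0; c < CHUNKS; c++)
        for (int i = c * (N / CHUNKS); i < (c + 1) * (N / CHUNKS); i++)
            partial[c] += data[i];

    /* Combine the pieces at the end to get the correct result. */
    int total = 0;
    for (int c = 0; c < CHUNKS; c++) total += partial[c];
    printf("sum = %d\n", total);   /* 1 + 2 + ... + 16 = 136 */
    return 0;
}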

Challenges: identifying work that can be done concurrently; mapping work to processing units; distributing the work; managing access to shared data; synchronizing the various stages of execution.

Models: a way to structure a parallel algorithm by selecting decomposition and mapping techniques that minimize interactions.

Models: data-parallel, task graph, work pool, master-slave, pipeline, hybrid.

Data-parallel. Mapping of work: static; tasks -> processes. Mapping of data: independent data items assigned to processes (data parallelism).

Data-parallel. Computation: tasks process data, synchronize to get new data or exchange results, and continue until all data is processed. Load balancing: uniform partitioning of data. Synchronization: minimal, or a barrier at the end of a phase. Example: ray tracing.

Data-parallel (diagram: each process P performs operations O on its own data items D).
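A minimal data-parallel sketch using POSIX threads (introduced later in these slides; link with -pthread): each thread is statically assigned an independent, uniformly sized slice of the data, and the joins act as the end-of-phase barrier. The array size, thread count, and the scaling operation are arbitrary.

#include <pthread.h>
#include <stdio.h>

#define N 8
#define NTHREADS 4

static double data[N];

/* Each thread scales its own, independent slice of the array
   (uniform partitioning of the data; no shared writes). */
static void *scale_slice(void *arg) {
    long t = (long)arg;
    for (long i = t * (N / NTHREADS); i < (t + 1) * (N / NTHREADS); i++)
        data[i] *= 2.0;
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++) data[i] = (double)i;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, scale_slice, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);   /* acts as the end-of-phase barrier */

    for (int i = 0; i < N; i++) printf("%.1f ", data[i]);
    printf("\n");
    return 0;
}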

Task graph. Mapping of work: static; tasks are mapped to nodes in a task-dependency graph (task parallelism). Mapping of data: data moves through the graph (source to sink).

Task graph. Computation: each node processes input from the previous node(s) and sends output to the next node(s) in the graph. Load balancing: assign more processes to a given task; eliminate graph bottlenecks. Synchronization: node data exchange. Examples: parallel quicksort, divide-and-conquer approaches, scientific applications that can be expressed as workflows (e.g., DAGs).

Task graph (diagram: data D flows between processes P arranged as a dependency graph, producing outputs O).

Work pool. Mapping of work/data: no desired pre-mapping; any task can be performed by any process; pull-model oriented. Computation: processes work as data becomes available (or requests arrive).

Work pool. Load balancing: dynamic mapping of tasks to processes. Synchronization: adding/removing work from the input queue. Examples: web server, bag-of-tasks.

Work pool (diagram: processes P pulling tasks from a shared input queue and placing results on an output queue).
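A minimal work-pool sketch using POSIX threads, with a shared task counter standing in for the input queue (an assumption made to keep the example short): workers dynamically pull the next available task, and a mutex synchronizes removal of work from the pool.

#include <pthread.h>
#include <stdio.h>

#define NTASKS 20
#define NWORKERS 4

static int next_task = 0;                 /* the "input queue": tasks 0..NTASKS-1 */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker repeatedly pulls the next available task (dynamic mapping);
   the mutex synchronizes removal of work from the shared queue. */
static void *worker(void *arg) {
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        int task = next_task < NTASKS ? next_task++ : -1;
        pthread_mutex_unlock(&qlock);
        if (task < 0) break;              /* queue empty: done */
        printf("worker %ld handled task %d\n", id, task);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NWORKERS];
    for (long w = 0; w < NWORKERS; w++)
        pthread_create(&tid[w], NULL, worker, (void *)w);
    for (long w = 0; w < NWORKERS; w++)
        pthread_join(tid[w], NULL);
    return 0;
}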

Master-slave. A modification of the work pool model: one or more master processes generate and assign work to worker processes; push-model oriented. Load balancing: a master process can better distribute load to the worker processes.

Pipeline. Mapping of work: static; processes are assigned tasks that correspond to stages in the pipeline. Mapping of data: data is processed in FIFO order (stream parallelism).

Pipeline. Computation: data is passed through a succession of processes, each of which performs some task on it. Load balancing: ensure all stages of the pipeline are balanced (contain the same amount of work). Synchronization: producer/consumer buffers between stages. Examples: processor pipeline, graphics pipeline.

Pipeline (diagram: input queue -> process P -> buffer -> process P -> buffer -> process P -> output queue).
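A minimal two-stage pipeline sketch using POSIX threads: a one-slot producer/consumer buffer (guarded by a mutex and condition variables) connects the stages, and items flow through in FIFO order. The per-stage work shown here is arbitrary.

#include <pthread.h>
#include <stdio.h>

#define NITEMS 10

/* One-slot producer/consumer buffer between two pipeline stages. */
static int slot, slot_full = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void *stage1(void *arg) {           /* producer stage */
    (void)arg;
    for (int i = 0; i < NITEMS; i++) {
        pthread_mutex_lock(&lock);
        while (slot_full) pthread_cond_wait(&not_full, &lock);
        slot = i * i;                       /* this stage's work */
        slot_full = 1;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *stage2(void *arg) {           /* consumer stage */
    (void)arg;
    for (int i = 0; i < NITEMS; i++) {
        pthread_mutex_lock(&lock);
        while (!slot_full) pthread_cond_wait(&not_empty, &lock);
        printf("stage 2 received %d\n", slot);
        slot_full = 0;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}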

Message-Passing vs. Shared Address Space

Message-passing. Most widely used for programming parallel computers (clusters of workstations). Key attributes: partitioned address space; explicit parallelization. Process interactions: send and receive data.

Message-passing communications: sending and receiving messages. Primitives: send(buff, size, destination), receive(buff, size, source); blocking vs. non-blocking; buffered vs. non-buffered. Message Passing Interface (MPI): a popular message-passing library with ~125 functions.

Message-passing (diagram: four workstations running processes P1-P4; P1 issues send(buff1, 1024, p3) and P3 issues receive(buff3, 1024, p1) to move the data).
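A minimal MPI sketch of the kind of exchange in the diagram, assuming an MPI installation (compile with mpicc, launch with mpirun): rank 0 issues a blocking MPI_Send and rank 1 a matching blocking MPI_Recv. The buffer size and rank numbers are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buff[1024] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buff[0] = 42;
        /* Blocking send of 1024 ints to rank 1 (explicit communication). */
        MPI_Send(buff, 1024, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive of 1024 ints from rank 0. */
        MPI_Recv(buff, 1024, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", buff[0]);
    }

    MPI_Finalize();
    return 0;
}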

Shared address space. Mostly used for programming SMP machines (multicore chips). Key attributes: shared address space (threads, or the UNIX shmget/shmat operations); implicit parallelization. Process/thread communication: memory reads and stores.

Shared address space. Communication: read/write memory (e.g., x++;). POSIX Thread API: a popular thread API. Operations: creation/deletion of threads; synchronization (mutexes, semaphores); thread management.

Shared address space (diagram: threads T1-T4 on an SMP workstation sharing data in RAM).
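A minimal shared-address-space sketch with the POSIX thread API: the threads communicate simply by reading and writing the shared variable x, and a mutex synchronizes the concurrent x++ updates so none are lost. The thread count and iteration count are arbitrary.

#include <pthread.h>
#include <stdio.h>

static long x = 0;                         /* shared variable: visible to all threads */
static pthread_mutex_t xlock = PTHREAD_MUTEX_INITIALIZER;

/* Threads communicate simply by reading and writing shared memory;
   the mutex keeps the concurrent x++ updates from being lost. */
static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&xlock);
        x++;
        pthread_mutex_unlock(&xlock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, increment, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("x = %ld\n", x);                /* 400000 with the mutex in place */
    return 0;
}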

Synchronization: deadlock, livelock, fairness. Efficiency: maximize parallelism. Reliability, correctness, debugging.
