Chapter 7. Multicores, Multiprocessors, and Clusters


Introduction (§7.1)

Goal: connecting multiple computers to get higher performance
- Multiprocessors: scalability, availability, power efficiency
- Job-level (process-level) parallelism: high throughput for independent jobs
- Parallel processing program: a single program run on multiple processors
- Multicore microprocessors: chips with multiple processors (cores)

Hardware and Software

Hardware
- Serial: e.g., Pentium 4
- Parallel: e.g., quad-core Xeon e5345

Software
- Sequential: e.g., matrix multiplication
- Concurrent: e.g., operating system, multithreaded matrix multiplication

Sequential/concurrent software can run on serial/parallel hardware. The challenge is making effective use of parallel hardware.

Parallel Programming (§7.2 The Difficulty of Creating Parallel Processing Programs)

Parallel software is the problem: we need a significant performance improvement, otherwise we would just use a faster uniprocessor, since that is easier!

Difficulties (illustrated in the sketch below)
- Partitioning
- Coordination
- Communications overhead
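To make the partitioning point concrete, here is a minimal sketch of the multithreaded matrix multiplication the slide mentions, written with POSIX threads. The names (N, NUM_THREADS, worker) are illustrative, not from the slides; each thread owns a disjoint band of rows of the result, so no locking is needed and coordination reduces to the final join.

    /* Sketch: matrix multiplication partitioned across POSIX threads
     * by row bands. Each thread writes a disjoint slice of c, so no
     * locks are needed; the only coordination is the final join. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 256
    #define NUM_THREADS 4

    static double a[N][N], b[N][N], c[N][N];

    static void *worker(void *arg) {
        long id = (long)arg;                  /* thread index 0..NUM_THREADS-1 */
        long lo = id * N / NUM_THREADS;       /* first row of this band */
        long hi = (id + 1) * N / NUM_THREADS; /* one past the last row */
        for (long i = lo; i < hi; i++)
            for (long j = 0; j < N; j++) {
                double sum = 0.0;
                for (long k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        return NULL;
    }

    int main(void) {
        pthread_t t[NUM_THREADS];
        for (long id = 0; id < NUM_THREADS; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (long id = 0; id < NUM_THREADS; id++)
            pthread_join(t[id], NULL);        /* coordination point */
        printf("c[0][0] = %f\n", c[0][0]);
        return 0;
    }

Compile with cc -O2 -pthread. The partitioning decision is the row split; the communications overhead the slide warns about is hidden here in cache traffic on the shared arrays.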

Amdahl's Law

The sequential part can limit speedup.

Example: 100 processors, 90× speedup?

    $T_{\text{new}} = T_{\text{parallelizable}}/100 + T_{\text{sequential}}$

    $\text{Speedup} = \dfrac{1}{(1 - F_{\text{parallelizable}}) + F_{\text{parallelizable}}/100} = 90$

Solving gives $F_{\text{parallelizable}} = 0.999$ (checked numerically after the next slide): the sequential part must be no more than 0.1% of the original time.

Scaling Example

Workload: sum of 10 scalars, plus a 10 × 10 matrix sum. Speedup going from 10 to 100 processors?

- Single processor: Time = (10 + 100) × t_add
- 10 processors: Time = 10 × t_add + (100/10) × t_add = 20 × t_add
  Speedup = 110/20 = 5.5 (55% of potential)
- 100 processors: Time = 10 × t_add + (100/100) × t_add = 11 × t_add
  Speedup = 110/11 = 10 (10% of potential)

Assumes the load can be balanced across the processors.
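A quick, hypothetical C check of these numbers (speedup and f_needed are illustrative names, not from the slides): solving Amdahl's Law for the parallelizable fraction with a 90× target on 100 processors gives F ≈ 0.9989, which the slide rounds to 0.999.

    #include <stdio.h>

    /* Amdahl's Law: speedup for parallelizable fraction f on p processors. */
    static double speedup(double f, double p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    /* Solve S = 1 / ((1-F) + F/P) for F:  F = (1 - 1/S) / (1 - 1/P). */
    static double f_needed(double s, double p) {
        return (1.0 - 1.0 / s) / (1.0 - 1.0 / p);
    }

    int main(void) {
        printf("F for 90x on 100 procs: %.4f\n", f_needed(90.0, 100.0)); /* 0.9989 */
        printf("speedup at F = 0.999:   %.1f\n", speedup(0.999, 100.0)); /* ~91    */
        return 0;
    }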

Scaling Example (cont.)

What if the matrix size is 100 × 100?

- Single processor: Time = (10 + 10000) × t_add
- 10 processors: Time = 10 × t_add + (10000/10) × t_add = 1010 × t_add
  Speedup = 10010/1010 = 9.9 (99% of potential)
- 100 processors: Time = 10 × t_add + (10000/100) × t_add = 110 × t_add
  Speedup = 10010/110 = 91 (91% of potential)

Again assuming the load is balanced.

Strong vs. Weak Scaling

Strong scaling: problem size fixed, as in the examples above.
Weak scaling: problem size proportional to the number of processors (see the time model below).
- 10 processors, 10 × 10 matrix: Time = 20 × t_add
- 100 processors, 32 × 32 matrix: Time = 10 × t_add + (1000/100) × t_add = 20 × t_add
- Constant performance in this example.
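Both regimes fall out of coding the slides' time model directly. This is a sketch under the slides' own assumption of perfect load balance; time_tadd is an illustrative name.

    #include <stdio.h>

    /* Time, in units of t_add, for 10 scalar adds (serial) plus an
     * n x n matrix sum split over p processors, assuming perfect
     * load balance as the slides do. */
    static double time_tadd(double n, double p) {
        return 10.0 + (n * n) / p;
    }

    int main(void) {
        /* Strong scaling: 100 x 100 matrix fixed, growing p. */
        printf("p=1:   %g t_add\n", time_tadd(100, 1));   /* 10010 */
        printf("p=10:  %g t_add\n", time_tadd(100, 10));  /* 1010  */
        printf("p=100: %g t_add\n", time_tadd(100, 100)); /* 110   */
        /* Weak scaling: matrix grows with p; time stays near 20 t_add
         * (the slides round 32*32 = 1024 down to 1000). */
        printf("p=10,  10x10: %g t_add\n", time_tadd(10, 10));  /* 20    */
        printf("p=100, 32x32: %g t_add\n", time_tadd(32, 100)); /* 20.24 */
        return 0;
    }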

Shared Memory (§7.3 Shared Memory Multiprocessors)

SMP: shared memory multiprocessor
- Hardware provides a single physical address space for all processors
- Shared variables are synchronized using locks (a lock sketch follows this slide)
- Cache coherence becomes an issue
- Memory access time: UMA (uniform) vs. NUMA (nonuniform)

Shared Memory Multicore Processors
[Figure: several cores, each with a private cache, sharing a single physical memory]
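A minimal sketch of "synchronize shared variables using locks", again with POSIX threads (the names counter and bump are illustrative): two threads increment a shared counter under a mutex. Without the lock, the read-modify-write races and updates are lost.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                     /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);           /* enter critical section */
            counter++;                           /* read-modify-write, now safe */
            pthread_mutex_unlock(&lock);         /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);      /* 2000000 with the lock */
        return 0;
    }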

Message Passing (§7.4 Clusters and Other Message-Passing Multiprocessors)

- Each processor has a private physical address space
- Hardware sends/receives messages between processors (a minimal sketch follows the next slide)

Loosely Coupled Clusters

A network of independent computers
- Each has private memory and its own OS
- Connected using the I/O system, e.g., Ethernet/switch, the Internet

Suitable for applications with independent tasks
- Web servers, databases, simulations, ...
- High availability, scalable, affordable

Problems
- Administration cost (prefer virtual machines)
- Low interconnect bandwidth (cf. processor/memory bandwidth on an SMP)
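The slides name the message-passing model, not a particular API; MPI is one common realization, so here is a minimal sketch in C. Rank 0 sends a value to rank 1 across the private-address-space boundary; nothing is shared except the message itself.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            int msg = 42;                        /* lives only in rank 0's memory */
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;                             /* filled in by the receive */
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }
        MPI_Finalize();
        return 0;
    }

Typically built with mpicc and launched with mpirun -np 2; on a cluster the two ranks would run on different machines, with the message crossing the interconnect whose bandwidth the slide flags as the bottleneck.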

Grid Computing

Separate computers interconnected by long-haul networks
- E.g., Internet connections
- Work units are farmed out, results sent back
- Can make use of idle time on PCs, e.g., SETI@home, World Community Grid

Other parallel computers: supercomputers, vector computers, GPUs, Cell, etc.

Concluding Remarks (§7.13)

Goal: higher performance by using multiple processors.

Difficulties
- Developing parallel software
- Devising appropriate architectures

Many reasons for optimism
- A changing software and application environment
- Chip-level multiprocessors with lower latency, higher bandwidth interconnect

An ongoing challenge for computer architects and for software, tool, and compiler developers and researchers!