Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors


Challenges of Parallel Processing
The first challenge is the fraction of a program that is inherently sequential; the second is the long latency of accesses to remote memory.
1. Application parallelism is addressed primarily via new algorithms that have better parallel performance.
2. The impact of long remote latency is addressed both by the architect and by the programmer.

Facing the Challenges of Parallel Processing
Reducing the impact of long remote latency: for example, reduce the frequency of remote accesses by
1) caching shared data (hardware), or
2) restructuring the data layout to make more accesses local (software); see the sketch below.
More challenges of parallel programming: complex algorithms, memory issues, communication costs, and load balancing.
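
As a concrete illustration of point 2 (a sketch added here, not from the original slides), restructuring an array-of-structs into a struct-of-arrays makes a scan over one field touch only local, contiguous memory:

```c
#include <stddef.h>

/* Array-of-structs: scanning only `x` drags every field of every
   struct through the cache, wasting bandwidth (and, on a NUMA
   machine, remote-memory traffic). */
struct particle_aos { double x, y, z, mass; };

double sum_x_aos(const struct particle_aos *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += p[i].x;   /* strided access */
    return s;
}

/* Struct-of-arrays: the `x` values are contiguous, so every cache
   line fetched is fully used -- more accesses become local. */
struct particles_soa { double *x, *y, *z, *mass; };

double sum_x_soa(const struct particles_soa *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += p->x[i];  /* unit-stride access */
    return s;
}
```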

Shared Memory Architectures
From multiple boards on a shared bus to multiple processors inside a single chip. Caches hold both private data (used by a single processor) and shared data (used by multiple processors). Caching shared data reduces the latency of accesses to shared data, the memory bandwidth consumed by shared data, and the interconnect bandwidth required, but it introduces the cache coherence problem.

Amdahl's Law
Architecture design is very bottleneck-driven:
1. make the common case fast;
2. do not waste resources on a component that has little impact on overall performance/power.
Amdahl's Law: the performance improvement gained through an enhancement is limited by the fraction of time the enhancement comes into play.
Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)
Speedup = Performance After Improvement / Performance Before Improvement = Execution Time Before / Execution Time After
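
A short worked example (added here; the numbers are illustrative, not from the slides): suppose a program runs in 100 s, 80 s of which are affected by an enhancement that speeds that portion up by a factor of 10.

```latex
T_{\text{after}} = T_{\text{unaffected}} + \frac{T_{\text{affected}}}{\text{improvement}}
                 = 20 + \frac{80}{10} = 28 \text{ s},
\qquad
\text{Speedup} = \frac{T_{\text{before}}}{T_{\text{after}}} = \frac{100}{28} \approx 3.57.
```

Note that even with infinite improvement of the affected part, the speedup is bounded by 100/20 = 5: the unaffected fraction limits the gain.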

Flynn's classification of parallel processing architectures (SISD, SIMD, MISD, MIMD) gives the basic multiprocessor architectures; Johnson's expansion further divides MIMD by memory organization (global vs. distributed) and communication (shared variables vs. message passing):

Single instr stream,    single data stream     -> SISD: uniprocessors
Single instr stream,    multiple data streams  -> SIMD: array or vector processors
Multiple instr streams, single data stream     -> MISD: rarely used
Multiple instr streams, multiple data streams  -> MIMD: multiprocessors or multicomputers
    Global memory, shared variables (GMSV)      -> shared-memory multiprocessors
    Global memory, message passing (GMMP)       -> rarely used
    Distributed memory, shared variables (DMSV) -> distributed shared memory
    Distributed memory, message passing (DMMP)  -> distributed-memory multicomputers

Flynn's Taxonomy
Flynn classified architectures by their data and control (instruction) streams in 1966:
- Single Instruction, Single Data (SISD): uniprocessors
- Multiple Instruction, Single Data (MISD): rarely built (e.g., ASIPs)
- Single Instruction, Multiple Data (SIMD): single PC; vector machines, CM-2
- Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers

SISD
No instruction parallelism, no data parallelism. Example SISD architecture: a personal computer processing instructions and data on a single processor.

SIMD
Multiple data streams are processed in parallel under a single instruction stream: at any one time, one instruction operates on many data items (a data-parallel architecture). Vector architectures have similar characteristics but achieve the parallelism with pipelining. Array processors: for example, a graphics processor applying a translation, rotation, or other operation to multiple data elements at once. Array or matrix computations are also natural SIMD workloads.
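
A minimal sketch of the SIMD idea using x86 SSE intrinsics (an illustration added here, assuming an x86 compiler with SSE support): a single add instruction performs four floating-point additions at once.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load 4 floats into one register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* ONE instruction, FOUR additions */
    _mm_storeu_ps(c, vc);

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```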

MISD
Multiple instruction streams operate in parallel on a single data stream. Example: processing for critical missile controls, where the single data stream is processed on different processors so that any faults during processing can be handled.

MIMD
Multiple instruction streams execute in parallel on multiple data streams. Example MIMD architectures: supercomputers and distributed computing systems, with either distributed memory or a single shared memory.

Flynn Classification (summary)
- SISD: traditional sequential architecture.
- SIMD: processor arrays, vector processors; parallel computing on a budget (reduced control-unit cost); many early supercomputers.
- MIMD: the most general-purpose parallel computers today; clusters, MPPs, data centers.
- MISD: not a general-purpose architecture.

Johnson's expansion (diagram; see the table above)

Classification of Parallel Processor Architectures

What is Parallel Architecture?
A parallel computer is a collection of processing elements that cooperate to solve large problems. The most important new element: it is all about communication! What does the programmer (or OS or compiler writer) think about?
1) Models of computation: sequential consistency?
2) Resource allocation: what mechanisms must be in hardware? A high-performance processor (SIMD, or a vector processor)?
3) Data access, communication, and synchronization.

Multiprocessor Basics
A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Parallel Architecture = Computer Architecture + Communication Architecture. Two classes of multiprocessors with respect to memory:
1. Centralized-memory multiprocessor: up to a few dozen cores; small enough to share a single, centralized memory with Uniform Memory Access latency (UMA).
2. Physically distributed-memory multiprocessor: a larger number of chips and cores than the first, whose bandwidth demands require memory distributed among the processors, giving Non-Uniform Memory Access latency (NUMA).
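
A small sketch of what NUMA means to software, using the Linux libnuma API (an added illustration; it assumes libnuma is installed and the machine has more than one NUMA node): memory can be placed on a specific node, and cores on that node access it with lower latency than cores on remote nodes.

```c
#include <stdio.h>
#include <numa.h>      /* Linux libnuma; link with -lnuma */

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    printf("NUMA nodes: %d\n", numa_max_node() + 1);

    /* Allocate 1 MiB on node 0: threads running on node 0's cores see
       local (fast) access; threads on other nodes pay remote latency. */
    size_t sz = 1 << 20;
    void *buf = numa_alloc_onnode(sz, 0);
    if (buf) numa_free(buf, sz);
    return 0;
}
```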

Multicore: Mainstream Multiprocessors
Multicore chips:
- IBM Power5: two 2+ GHz PowerPC cores; shared 1.5 MB L2; L3 tags on chip.
- AMD quad-core Phenom: four 2+ GHz cores; per-core 512 KB L2 cache; shared 2 MB L3 cache.
- Intel Core 2 Quad: four cores; two 4 MB L2 caches, each shared by a pair of cores.
- Sun Niagara: 8 cores, each 4-way threaded; shared 2 MB L2, shared FP unit; for servers, not desktops.
All offer the same shared-memory programming model.

Definitions of Threads and Processes
An instruction stream is divided into smaller streams (threads) for parallel execution. A thread in a multithreaded processor may or may not be the same as a software thread.
- Process: an instance of a program running on a computer.
- Thread: a dispatchable unit of work within a process; a thread of execution is the smallest sequence of programmed instructions that can be managed independently.
- Thread switch: switching the processor between threads within the same process; typically less costly than a process switch.

But First, Uniprocessor Concurrency
A software thread is an independent flow of execution with its own context state (PC, registers); threads generally share the same memory space. A process is like a thread, but with a separate memory space. Java has thread support built in; C/C++ support the Pthreads library. Generally, system software (the OS) manages threads: thread scheduling and context switching (switching the PC and registers). All threads share the one processor; quickly swapping threads gives the illusion of concurrent execution. A thread is a component of a process: multiple threads can exist within the same process, but different processes do not share these resources.
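
A minimal Pthreads sketch (added for illustration; the slides only name the library): two software threads share the process's memory, synchronize on a mutex, and are multiplexed by the OS onto the available processor(s).

```c
#include <pthread.h>
#include <stdio.h>

static int shared_counter = 0;               /* lives in the shared memory space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    const char *name = (const char *)arg;
    for (int i = 0; i < 3; i++) {
        pthread_mutex_lock(&lock);           /* synchronize access to shared data */
        shared_counter++;
        printf("%s: counter = %d\n", name, shared_counter);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {                             /* compile with: cc -pthread ... */
    pthread_t a, b;
    pthread_create(&a, NULL, worker, "thread A");
    pthread_create(&b, NULL, worker, "thread B");
    pthread_join(a, NULL);                   /* wait for both threads to finish */
    pthread_join(b, NULL);
    return 0;
}
```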

Conceptual diagram: threads A, B, C, and D time-share a single processor.

Hardware Multithreading (TLP): Coarse-Grained Multithreading
- A single thread runs until a costly stall, e.g., a second-level cache miss; another thread starts during the stall.
- Pipeline fill time requires several cycles, so it does not cover short stalls.
- Less likely to slow execution of a single thread (smaller latency impact).
- Needs hardware support: a PC and register file for each thread, but little other hardware.

Coarse Multithreading (diagram)
Assume a long stall at the end of each thread, as indicated in the previous diagram; the stalls for A and C would actually be longer than shown there.

Hardware Multithreading: Fine-Grained Multithreading
- Two or more threads interleave instructions in round-robin fashion, skipping stalled threads.
- Needs hardware support: a separate PC and register file for each thread, plus hardware to control the alternating pattern.
- Naturally hides delays (data hazards, cache misses); the pipeline runs with rare stalls.
- Does not make full use of a multi-issue architecture.

Fine multithreading (diagram): stalled threads (e.g., A) are skipped.

Hardware Multithreading: Simultaneous Multithreading (SMT)
- "Hyper-Threading" by Intel.
- Instructions from multiple threads are issued in the same cycle.
- Uses the register renaming and dynamic scheduling facilities of a multi-issue architecture.
- Needs more hardware support: register files and PCs for each thread; temporary result registers before commit; support to sort out which threads get results from which instructions.
- Maximizes utilization of the execution units.

Simultaneous multithreading (diagram): issue slots are filled from multiple threads; stalled threads (e.g., C, then A) are skipped.

Hardware Multithreading (MT): Summary
- Multiple threads dynamically share a single pipeline (and caches); thread contexts (PC and register file) are replicated.
- Coarse-grain MT: switch on L2 misses. Why? Because only stalls that long amortize the pipeline-fill cost of a thread switch.
- Fine-grain MT: does not make full use of a multi-issue architecture.
- Simultaneous MT: no explicit switching; it converts TLP into ILP.
- MT does not improve single-thread performance.

Multithreading categories (diagram)

Basic Principles
Systems provided with reconfigurable logic are often called Reconfigurable Instruction Set Processors (RISPs). The reconfigurable logic includes a set of programmable processing units, which can be reconfigured in the field to implement logic operations or functions, and programmable interconnections between them.

Reconfiguration Steps
To execute a program taking advantage of the reconfigurable logic, the following steps are used.
1. Code analysis: the first thing to do is identify the parts of the code that can be transformed for execution on the reconfigurable logic. The goal of this step is to find the best trade-off between performance and available resources.

Basic steps in a reconfigurable system (diagram)

Reconfiguration Steps (continued)
2. Code transformation: once the best candidate parts of the code to be accelerated are found, they are transformed into operations for the reconfigurable logic.
3. Reconfiguration: after code transformation, the configuration is sent to the reconfigurable system.
4. Input context loading: to perform a given reconfigurable operation, a set of input operands is fetched.
5. Execution: after the reconfigurable unit is set and the proper input operands are fetched, the operation executes.
6. Write back: the results of the reconfigurable operation are saved back to the register file.
A toy sketch of steps 3-6 follows.
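
The sketch below models steps 3-6 as a host-side flow. It is purely illustrative: the struct and every function name in it are hypothetical, invented for this example, not a real device API; steps 1-2 (analysis and transformation) happen offline.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical toy model of a reconfigurable unit. */
typedef struct {
    char config[16];   /* which operation the fabric currently implements */
    int  in[2], out;   /* input context and result register               */
} rec_unit;

static void load_bitstream(rec_unit *u, const char *cfg) {   /* step 3 */
    strncpy(u->config, cfg, sizeof u->config - 1);
}
static void load_inputs(rec_unit *u, int a, int b) {         /* step 4 */
    u->in[0] = a; u->in[1] = b;
}
static void run_unit(rec_unit *u) {                          /* step 5 */
    if (strcmp(u->config, "add") == 0) u->out = u->in[0] + u->in[1];
}
static int write_back(const rec_unit *u) {                   /* step 6 */
    return u->out;   /* in real hardware: copied to the register file */
}

int main(void) {
    rec_unit u = {{0}, {0, 0}, 0};
    load_bitstream(&u, "add");                /* 3. reconfiguration       */
    load_inputs(&u, 2, 3);                    /* 4. input context loading */
    run_unit(&u);                             /* 5. execution             */
    printf("result = %d\n", write_back(&u)); /* 6. write back            */
    return 0;
}
```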

Underlying Execution Mechanism
Why are reconfigurable systems interesting?
1. Some applications are poorly suited to microprocessors.
2. The VLSI explosion provides ever-increasing resources.
3. Hardware/software co-design.
4. It is a relatively new research area.
5. Many applications perform relatively the same computation repeatedly, which maps well onto configured logic.

Programming Multiprocessor Systems
What is concurrency?
- A sequential program: a single thread of control that executes one instruction and, when it is finished, executes the next logical instruction.
- A concurrent program: a collection of autonomous sequential threads executing (logically) in parallel.
The implementation (i.e., execution) of a collection of threads can be (see the sketch after this list):
1. Multiprogramming: threads multiplex their executions on a single processor.
2. Multiprocessing: threads multiplex their executions on a multiprocessor or multicore system.
3. Distributed processing: processes multiplex their executions on several different machines.
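
A small sketch (an added illustration, assuming a POSIX system): the very same threaded program is multiprogrammed (time-sliced) on one core or truly multiprocessed on many; only the processor count reported by the OS differs.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *task(void *arg) {
    printf("thread %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    /* On 1 CPU these threads are multiprogrammed (time-sliced);
       on a multicore they execute as true multiprocessing. */
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online processors: %ld\n", ncpu);

    enum { N = 4 };
    pthread_t t[N];
    for (long i = 0; i < N; i++)
        pthread_create(&t[i], NULL, task, (void *)i);
    for (int i = 0; i < N; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```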

Why Use Concurrent Programming?
1) Natural application structure: the world is not sequential! It is easier to program multiple independent and concurrent activities.
2) Increased application throughput and responsiveness: the entire application is not blocked by blocking I/O.
3) Performance from multiprocessor/multicore hardware: parallel execution.
4) Distributed systems: a single application on multiple machines; client/server or peer-to-peer systems.

Comparing Uniprocessing to Multiprocessing
A uniprocessor (UP) system has one processor; a driver in a UP system executes only one thread at any given time. A multiprocessor (MP) system has two or more processors; a driver in an MP system can execute multiple threads concurrently.