Computer Architecture, Chapter 7: Parallel Processing

Parallelism
Instruction-level parallelism (Ch. 6):
- pipelining
- superscalar execution
- latency issues
- hazards
Processor-level parallelism (Ch. 7):
- array/vector of processors (one control unit, limited applicability)
- multiprocessor (multiple CPUs, common memory)
- multicomputer (multiple CPUs, each with its own memory)

Processor-Level Parallelism
Instruction-level parallelism helps a little, but pipelining and superscalar operation rarely win more than a factor of five to ten. How to get even more speedup: an array processor, which consists of a large number of identical processors that perform the same sequence of instructions on different sets of data.


Problems
1. Array processors work well only on problems requiring the same computation to be performed on many data sets simultaneously.
2. They require much more hardware and are difficult to program.
3. The processing elements are not independent CPUs, since there is only one control unit!

Solution: Multiprocessors
A multiprocessor is a system with more than one CPU sharing a common memory.
Problem: conflicts result when the processors access the common bus!
Solution: use a multicomputer.


DESIGN ISSUES FOR PARALLEL COMPUTERS
What are the nature, size, and number of the processing elements?
What are the nature, size, and number of the memory modules?
How are the processing and memory elements interconnected?

CLASSIFICATION OF PARALLEL STRUCTURES
1) A single-processor system: Single Instruction stream, Single Data stream (SISD system).
2) A single stream of instructions is broadcast to a number of processors, each operating on its own data: Single Instruction stream, Multiple Data stream (SIMD system).

3) A number of independent processors, each executing a different program on its own sequence of data: Multiple Instruction stream, Multiple Data stream (MIMD system).
4) A common data structure is manipulated by separate processors, each executing a different program: Multiple Instruction stream, Single Data stream (MISD system). This form does not occur often in practice!


SIMD COMPUTERS: Array Processing
Idea: a single control unit drives many processing units.
Examples: ILLIAC IV, CM-2, MasPar MP-2.


SIMD COMPUTERS: Vector Processing
Vector processing has been much more successful commercially. It was developed by Seymour Cray at Cray Research. The machine takes two n-element vectors as input and operates on the corresponding elements in parallel, using a vector ALU that can operate on all n elements simultaneously, producing a vector result.
Example: the Cray-1 vector supercomputer.
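To make the idea concrete, here is a minimal C sketch of the element-wise operation a vector machine performs. Written as an ordinary scalar loop here, a vector ALU (or a SIMD unit) would carry out all n element additions in one operation; the array size and values are purely illustrative.

```c
#include <stdio.h>

#define N 8   /* n-element vectors; a vector machine handles all N lanes at once */

int main(void) {
    double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    double c[N];

    /* On a vector processor this whole loop corresponds to a single
       vector-add instruction; on a scalar CPU it is N separate additions. */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    for (int i = 0; i < N; i++)
        printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}
```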


CRAY-1 (figure)

MIMD SYSTEMS
These systems can be divided into two categories:
- Multiprocessors, also called shared-memory systems
- Multicomputers, also called distributed-memory systems

Multiprocessors
All processes working together on a multiprocessor can share a single virtual address space mapped onto the common memory. The ability of two (or more) processes to communicate by simply reading and writing memory is the reason multiprocessors are popular: it is an easy model for programmers to understand and is applicable to a very wide range of problems.
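As a rough illustration (using POSIX threads as a stand-in for processes sharing one address space, an assumption not made on the slide), communication really is just an ordinary store by one thread and an ordinary load by another:

```c
#include <pthread.h>
#include <stdio.h>

/* Both threads see the same address space, so "communication" is simply
   a STORE by one thread and a LOAD by the other. */
static int shared_value = 0;

static void *producer(void *arg) {
    (void)arg;
    shared_value = 42;          /* ordinary store into the common memory */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);      /* wait, so the load below is safe */
    printf("read %d from shared memory\n", shared_value);  /* ordinary load */
    return 0;
}
```

Compile with the threads library enabled (for example, gcc -pthread).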

The system runs one copy of the operating system. When every CPU has equal access to all the memory modules and all the I/O devices, the system is called an SMP (Symmetric MultiProcessor) architecture.

Multiprocessors (figure)

Example: UMA Bus-Based Architecture
UMA: Uniform Memory Access. An architecture based on a single bus suffers from bus contention.
Solution: add a cache to each CPU, and also add private memory that can be accessed over a dedicated (private) bus.
Result: much less bus traffic, so the system can support more CPUs.

UMA Bus-Based Architecture (figure)

NUMA Multiprocessors
NUMA: Non-Uniform Memory Access. To scale beyond about 100 CPUs, UMA fails because of hardware complexity and because all memory modules must have the same access time. Like UMA machines, NUMA machines provide a single address space across all the CPUs, but access to local memory modules is faster than access to remote ones.

NUMA (figure)

Characteristics of NUMA machines
1) There is a single address space visible to all CPUs.
2) Access to remote memory is done using LOAD and STORE instructions.
3) Access to remote memory is slower than access to local memory.

COMA Multiprocessors
COMA: Cache Only Memory Access. NUMA machines have the disadvantage that accesses to remote memory are much slower than accesses to local memory; even machines with excellent performance are limited in size and quite expensive.
Solution: using each CPU's main memory as a cache greatly increases the hit rate, and hence the performance.

MESSAGE-PASSING MULTICOMPUTERS
In MIMD architectures:
- Multiprocessors appear to the OS as having a shared memory that can be accessed with LOAD and STORE instructions.
- Multicomputers have one address space per CPU: a distributed-memory system.
Instead of reading and writing a common memory, multicomputers use another communication mechanism: they pass messages back and forth over the interconnection network, using the software primitives send and receive.
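A minimal sketch of the send/receive style, written here with MPI purely as one common realization of these primitives (the slide itself names no specific library): process 0 sends an integer to process 1 over the interconnect.

```c
/* Compile with an MPI wrapper (e.g. mpicc) and run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* No shared memory: the data travels over the interconnection network. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```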


Interconnection networks (figure)


Multiprocessors are easy to program, so why do we have to build multicomputers?
Answer: large multicomputers are much simpler and cheaper to build than multiprocessors with the same number of CPUs.

MPP: Massively Parallel Processors
MPPs are huge, multimillion-dollar supercomputers used in science, engineering, and industry for very large, complex calculations, for handling very large numbers of instructions per second, or for data warehousing (managing immense databases). Most of these machines use standard CPUs as their processors (Intel Pentium, Sun UltraSPARC, IBM RS/6000, DEC Alpha, ...).

Classic (old) example: the Intel/Sandia Option Red machine
4608 CPUs arranged in a 3D mesh (32 x 38 x 2): 4536 compute nodes, 32 service nodes, 32 disk nodes, 6 network nodes, 2 boot nodes. The I/O nodes manage 640 disks with 1 TB of data. Speed: up to 100 teraflops (10^14 floating-point operations per second).


COW: Cluster Of Workstations
Also called a NOW (Network Of Workstations). A COW consists of a few hundred PCs or workstations connected by a commercially available network board. The difference between MPPs and COWs is analogous to the difference between a mainframe and a PC!

Parallel computing performance depends on:
Hardware
- CPU speed of the individual processors
- I/O speed of the individual processors
- Interconnection network
- Scalability
Software
- Parallelizability of algorithms
- Application programming languages
- Operating systems
- Parallel system libraries

Hardware
CPU and I/O speed: the same factors as for single-processor machines.
Interconnection network:
- Latency (wait time): distance, collisions / collision resolution
- Bandwidth (bps): bus limitations, CPU and I/O limitations
Scalability: adding more processors affects latency and bandwidth.

Hardware
- Reducing latency
- Reducing collisions
- Resolving collisions
- Increasing bandwidth


Software: Parallelizability of algorithms
Factors: number of processors, sequential/parallel parts.
Amdahl's Law, with n = number of processors, f = fraction of the code that is sequential, and T = time to process the entire algorithm sequentially:
speedup = n / (1 + (n - 1) f)
Note: the total execution time is f T + ((1 - f) / n) T.

Example: Software
An algorithm takes 15 seconds to execute on a single 1.8 GHz processor. 30% of the algorithm is sequential. Assuming zero latency and perfect parallelism in the remaining code, how long should the algorithm take on a parallel machine with 20 such 1.8 GHz processors?

Solution:
speedup = n / (1 + (n - 1) f) = 20 / (1 + 0.3 x 19) = 20 / 6.7 ≈ 2.99
Therefore the expected time is T / speedup = 15 / (20 / 6.7) = 5.025 seconds.
Another way: (0.3 x 15) + (0.7 x 15) / 20 = 5.025 seconds (sequential part + parallel part).
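The same calculation, expressed as a small C sketch that plugs the slide's numbers (n = 20, f = 0.3, T = 15 s) into Amdahl's Law:

```c
#include <stdio.h>

/* Amdahl's Law: speedup(n) = n / (1 + (n - 1) * f) */
static double speedup(int n, double f) {
    return n / (1.0 + (n - 1) * f);
}

int main(void) {
    double f = 0.3;   /* sequential fraction */
    double T = 15.0;  /* single-processor time in seconds */
    int    n = 20;    /* number of processors */

    double s = speedup(n, f);                                /* 20 / 6.7 = ~2.99 */
    printf("speedup = %.3f\n", s);
    printf("time    = %.3f s\n", T / s);                     /* ~5.025 s */
    printf("check   = %.3f s\n", f * T + (1 - f) * T / n);   /* same answer */
    return 0;
}
```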

Software
speedup = n / (1 + (n - 1) f)
Assuming perfect scalability, what are the implications of Amdahl's Law as n → ∞?

Answer: as n → ∞, speedup → 1/f (assuming f ≠ 0).
So if f = 0.4, parallelism can never make the program run more than 2.5 times as fast.

Software
Parallel system libraries: precompiled functions designed for multiprocessing (e.g., matrix transformations); functions for control of communication (e.g., background printing).
Application programming languages: built-in constructs for creating child processes, threads, parallel looping, etc.
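As one concrete (assumed) example of such a built-in construct, an OpenMP parallel loop in C lets the compiler and runtime split the iterations of a loop across threads; the slide does not prescribe a particular language or library.

```c
/* Compile with OpenMP support, e.g. gcc -fopenmp. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* The runtime divides the iterations among threads; the reduction
       clause combines the per-thread partial sums safely. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %.1f\n", sum);
    return 0;
}
```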

Software issues: in order to really take advantage of hardware parallelism
1. Control models: single vs. multiple instruction threads, single vs. multiple data sets (SISD, SIMD, MISD, MIMD). Software (including the OS, compilers, etc.) must be designed to use these features.

Software issues (continued)
2. Granularity of parallelism: at what levels is parallelism implemented?
3. Computational paradigms: pipelining, divide and conquer, phased computation, replicated workers.


Software issues (continued)
4. Communication methods: shared variables, message passing.
5. Synchronization: semaphores, locks, etc.
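A minimal sketch of lock-based synchronization using a POSIX mutex (one possible realization; the slide does not prescribe a particular API): without the lock, concurrent increments of the shared counter could be lost.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;                         /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);               /* enter critical section */
        counter++;                               /* safe: one thread at a time */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}
```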