Parallel Computer Architectures. Lectured by: Phạm Trần Vũ Prepared by: Thoại Nam



Outline
- Flynn's Taxonomy
- Classification of Parallel Computers Based on Architectures

Flynn's Taxonomy
Based on notions of instruction and data streams:
- SISD (Single Instruction stream, Single Data stream)
- SIMD (Single Instruction stream, Multiple Data streams)
- MISD (Multiple Instruction streams, Single Data stream)
- MIMD (Multiple Instruction streams, Multiple Data streams)
Popularity: MIMD > SIMD > MISD

SISD
Conventional sequential machines.
IS: Instruction Stream, DS: Data Stream, CU: Control Unit, PU: Processing Unit, MU: Memory Unit.
[Figure: SISD organization: the CU sends the IS to the PU, which exchanges the DS with the MU and I/O.]

SIMD
Vector computers, processor arrays; special-purpose computations.
PE: Processing Element, LM: Local Memory.
[Figure: SIMD architecture with distributed memory: one CU (program loaded from the host) broadcasts the IS to PE1..PEn, each with its own local memory LM1..LMn; data sets are loaded from the host.]

MISD
Systolic arrays; special-purpose computations.
[Figure: MISD architecture (the systolic array): CU1..CUn each issue their own IS to PU1..PUn, while a single DS passes through the PUs in sequence; program and data reside in a common memory with I/O.]

MIMD
General-purpose parallel computers.
[Figure: MIMD architecture with shared memory: CU1..CUn drive PU1..PUn with independent instruction streams, and the PUs access a shared memory and I/O.]

Classification Based on Architecture
- Pipelined Computers
- Dataflow Architectures
- Data Parallel Systems
- Multiprocessors
- Multicomputers

Pipeline Computers (1)
Instructions are divided into a number of steps (segments, stages). Several instructions can be loaded in the machine at the same time, each executing in a different stage.

Pipeline Computers (2)
IF: instruction fetch, ID: instruction decode and register fetch, EX: execution and effective address calculation, MEM: memory access, WB: write back.

                  Cycle: 1    2    3    4    5    6    7    8    9
  Instruction i          IF   ID   EX   MEM  WB
  Instruction i+1             IF   ID   EX   MEM  WB
  Instruction i+2                  IF   ID   EX   MEM  WB
  Instruction i+3                       IF   ID   EX   MEM  WB
  Instruction i+4                            IF   ID   EX   MEM  WB
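The timing pattern of an ideal pipeline follows a simple formula: with k stages and one instruction issued per cycle, n instructions finish in k + (n - 1) cycles. A minimal sketch (the function name is ours, not from the slides):

```python
def pipeline_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Ideal pipeline, no hazards: the first instruction takes n_stages
    cycles, then one more instruction completes every cycle."""
    return n_stages + (n_instructions - 1)

# The 5 instructions in the diagram drain in cycle 9:
print(pipeline_cycles(5))        # 9
# Without pipelining the same 5 instructions would take 5 * 5 = 25 cycles.
print(5 * pipeline_cycles(1))    # 25
```

The gap between 9 and 25 cycles is exactly the overlap the diagram shows: up to five instructions occupy different stages in the same cycle.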

Dataflow Architecture
Data-driven model: a program is represented as a directed acyclic graph in which a node represents an instruction and an edge represents the data dependency between the connected nodes.
Firing rule: a node can be scheduled for execution if and only if all of its input data are valid for consumption.
Dataflow languages: Id, SISAL, Silage, LISP, ...
- Single-assignment, applicative (functional) languages
- Explicit parallelism

Dataflow Graph
z = (a + b) * c
[Figure: the dataflow representation of the arithmetic expression: a and b feed a + node, whose result joins c at a * node that produces z.]
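The firing rule can be demonstrated directly on this graph: a node executes as soon as all of its input tokens are present, regardless of program order. A minimal sketch, with the intermediate token name "t" and all other names ours:

```python
# Dataflow evaluation of z = (a + b) * c under the firing rule:
# a node fires only when every one of its input tokens is available.
nodes = [
    ("t", ("a", "b"), lambda x, y: x + y),   # the + node
    ("z", ("t", "c"), lambda x, y: x * y),   # the * node
]

def run(tokens):
    pending = list(nodes)
    while pending:
        for node in pending:
            out, ins, op = node
            if all(i in tokens for i in ins):   # firing rule
                tokens[out] = op(*(tokens[i] for i in ins))
                pending.remove(node)
                break
    return tokens

print(run({"a": 2, "b": 3, "c": 4})["z"])   # (2 + 3) * 4 = 20
```

Note that the * node cannot fire first even though it is listed second for no particular reason: its input "t" does not exist until the + node produces it. That data dependency, not instruction order, sequences the execution.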

Dataflow Computer
Execution of instructions is driven by data availability. What is the difference between this and normal (control-flow) computers?
Advantages:
- Very high potential for parallelism
- High throughput
- Free from side effects
Disadvantages:
- Time lost waiting for unneeded arguments
- High control overhead
- Difficulty in manipulating data structures

Dataflow Representation

  input d, e, f
  c0 := 0
  for i from 1 to 4 do
  begin
    ai := di / ei
    bi := ai * fi
    ci := bi + c(i-1)
  end
  output a, b, c

[Figure: the dataflow graph of the unrolled loop: four independent / nodes (di / ei -> ai), four * nodes (ai * fi -> bi), and a chain of + nodes accumulating c1..c4 from c0.]

Execution on a Control Flow Machine
Assume all the external inputs are available before entering the do loop, and that + takes 1 cycle, * takes 2 cycles, and / takes 3 cycles.
Sequential execution on a uniprocessor computes a1, b1, c1, a2, b2, c2, ..., a4, b4, c4 in 24 cycles.
How long will it take to execute this program on a dataflow computer with 4 processors?

Execution on a Dataflow Machine
[Figure: data-driven schedule: a1..a4 execute in parallel, then b1..b4 in parallel, then the dependent chain c1, c2, c3, c4.]
Data-driven execution on a 4-processor dataflow computer takes 9 cycles.
Can we further reduce the execution time of this program?
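Both cycle counts can be checked with a small cycle-by-cycle simulation of the unrolled graph (latencies from the slides: / = 3, * = 2, + = 1; the scheduling code itself is our sketch, not the lecture's):

```python
# Unrolled loop (i = 1..4): ai := di/ei, bi := ai*fi, ci := bi + c(i-1).
LAT = {"/": 3, "*": 2, "+": 1}
ops = {}
for i in range(1, 5):
    ops[f"a{i}"] = ("/", [])
    ops[f"b{i}"] = ("*", [f"a{i}"])
    ops[f"c{i}"] = ("+", [f"b{i}"] + ([f"c{i-1}"] if i > 1 else []))

def makespan(n_procs):
    """Data-driven schedule on n_procs processors, simulated per cycle."""
    done, running, remaining, t = {}, [], dict(ops), 0
    while remaining or running:
        for job in [j for j in running if j[0] <= t]:   # retire finished ops
            done[job[1]] = job[0]
            running.remove(job)
        for name in list(remaining):                    # fire ready ops
            if len(running) >= n_procs:
                break
            kind, deps = remaining[name]
            if all(d in done for d in deps):            # firing rule
                running.append((t + LAT[kind], name))
                del remaining[name]
        t += 1
    return max(done.values())

print(makespan(1))   # sequential: 4 * (3 + 2 + 1) = 24 cycles
print(makespan(4))   # data-driven, 4 processors: 9 cycles
```

The 9-cycle result also shows why more processors would not help: the critical path a4 -> b4 needs 3 + 2 = 5 cycles, after which the + chain c1..c4 still costs 4 more, so 9 cycles is the lower bound for this graph.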

Data Parallel Systems (1)
Programming model:
- Operations performed in parallel on each element of a data structure
- Logically a single thread of control, performing sequential or parallel steps
- Conceptually, a processor associated with each data element

Data Parallel Systems (2)
SIMD architectural model:
- An array of many simple, cheap processors, each with little memory; the processors do not sequence through instructions themselves
- Attached to a control processor that issues the instructions
- Specialized and general communication, cheap global synchronization
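The programming model can be sketched in a few lines: one logical thread issues a single operation that applies to every element at once, as if one PE were attached to each element (plain Python lists stand in for the PE array; the variable names are ours):

```python
# Data-parallel model: one operation applied uniformly to every element.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

# One logical step "c = a + b": each (conceptual) PE performs the same
# add on its own element, in lockstep, with no per-element control flow.
c = [x + y for x, y in zip(a, b)]
print(c)   # [11.0, 22.0, 33.0, 44.0]
```

The point of the model is what is absent: there are no loops, branches, or synchronization in the program's logic; parallelism is implicit in applying one operation to a whole data structure.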

Vector Processors
Instruction set includes operations on vectors as well as scalars. Two types of vector computers:
- Processor arrays
- Pipelined vector processors

Processor Array
A sequential (front-end) computer connected to a set of identical processing elements, all simultaneously performing the same operation on different data. Example: CM-200.
[Figure: a front-end computer (CPU, program and data memory, I/O processor) sends instructions along an instruction path to the processor array; each processing element has its own data memory, and the elements communicate through an interconnection network with its own I/O.]

Pipelined Vector Processor
Streams vectors from memory to the CPU and uses pipelined arithmetic units to manipulate the data. Examples: Cray-1, Cyber-205.

Multiprocessors
Consist of many fully programmable processors, each capable of executing its own program, in a shared address space architecture. Classified into two types:
- Uniform Memory Access (UMA) multiprocessors
- Non-Uniform Memory Access (NUMA) multiprocessors
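The shared address space can be illustrated with threads: every thread reads and writes the same variables directly, and synchronization (here a lock) keeps the shared data consistent. A minimal Python sketch (the names are ours):

```python
import threading

counter = 0                    # lives in the shared address space
lock = threading.Lock()

def worker(n):
    """Each 'processor' runs its own program but sees the same memory."""
    global counter
    for _ in range(n):
        with lock:             # synchronize access to the shared variable
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 40000: all four workers updated the same location
```

Without the lock, concurrent read-modify-write sequences on `counter` could interleave and lose updates, which is the software face of the hardware consistency problems discussed for UMA machines below.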

UMA Multiprocessor (1)
Uses a central switching mechanism to reach a centralized shared memory; all processors have equal access time to the global memory (a tightly coupled system). Problem: cache consistency.
[Figure: processors P1..Pn, each with a cache C1..Cn, connected through the switching mechanism to the memory banks and I/O.]

UMA Multiprocessor (2)
[Figure: crossbar switching mechanism: each processor (with its cache) has a crosspoint path to every memory module and I/O port.]

UMA Multiprocessor (3)
[Figure: shared-bus switching mechanism: the processors (with caches), memory modules, and I/O all attach to a single shared bus.]

UMA Multiprocessor (4)
[Figure: packet-switched network: the processors (with caches) reach the memory modules through a multistage packet-switched network.]

NUMA Multiprocessor
The shared address space combines the local memories of all processors into a distributed shared memory; memory access time depends on whether the location is local to the processor. Open question: should shared (particularly nonlocal) data be cached?
[Figure: nodes, each a processor with a cache and a local memory, connected by a network; together the local memories form the distributed memory.]

Current Types of Multiprocessors
- PVP (Parallel Vector Processor): a small number of proprietary vector processors connected by a high-bandwidth crossbar switch
- SMP (Symmetric Multiprocessor): a small number of COTS microprocessors connected by a high-speed bus or crossbar switch
- DSM (Distributed Shared Memory): similar to an SMP, but the memory is physically distributed among the nodes

PVP (Parallel Vector Processor)
VP: Vector Processor, SM: Shared Memory.
[Figure: vector processors VP connected through a crossbar switch to shared-memory modules SM.]

SMP (Symmetric Multiprocessor)
P/C: Microprocessor and Cache, SM: Shared Memory.
[Figure: P/C nodes connected through a bus or crossbar switch to shared-memory modules SM.]

DSM (Distributed Shared Memory)
MB: Memory Bus, P/C: Microprocessor and Cache, LM: Local Memory, DIR: Cache Directory, NIC: Network Interface Circuitry.
[Figure: each node joins a P/C, LM, and DIR on a memory bus, with a NIC connecting it to a custom-designed network.]

Multicomputers
Consist of many processors, each with its own memory; there is no shared memory, and the processors interact via message passing (a loosely coupled system).
P/C: Microprocessor and Cache, M: Memory.
[Figure: P/C-plus-memory nodes connected by a message-passing interconnection network.]

Current Types of Multicomputers
- MPP (Massively Parallel Processing): total number of processors > 1000
- Cluster: each node in the system has fewer than 16 processors
- Constellation: each node in the system has 16 or more processors

MPP (Massively Parallel Processing)
P/C: Microprocessor and Cache, LM: Local Memory, MB: Memory Bus, NIC: Network Interface Circuitry.
[Figure: nodes, each a P/C with LM on a memory bus and a NIC, connected by a custom-designed network.]

Clusters
MB: Memory Bus, P/C: Microprocessor and Cache, M: Memory, LD: Local Disk, IOB: I/O Bus, NIC: Network Interface Circuitry.
[Figure: nodes, each a complete computer (P/C and M on a memory bus, with a bridge to an I/O bus carrying the LD and NIC), connected by a commodity network such as Ethernet, ATM, Myrinet, or VIA.]

Constellations
P/C: Microprocessor and Cache, NIC: Network Interface Circuitry, IOC: I/O Controller, MB: Memory Bus, SM: Shared Memory, LD: Local Disk.
[Figure: nodes, each an SMP of 16 or more P/C connected through a hub to shared memory SM, a local disk LD, an IOC, and a NIC, joined by a custom or commodity network.]

Trend in Parallel Computer Architectures
[Figure: number of HPC systems of each type (MPPs, Constellations, Clusters, SMPs) from 1997 to 2002; the vertical axis runs from 0 to 400 systems.]