
Part VII  Advanced Architectures

Feb. 2011  Computer Architecture, Advanced Architectures  Slide 1

About This Presentation

This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited.  Behrooz Parhami

Edition   Released    Revised     Revised     Revised     Revised
First     July 2003   July 2004   July 2005   Mar. 2007   Feb. 2011*

* Minimal update, due to this part not being used for lectures in ECE 154 at UCSB

VII  Advanced Architectures

Performance enhancement beyond what we have seen:
  What else can we do at the instruction execution level?
  Data parallelism: vector and array processing
  Control parallelism: parallel and distributed processing

Topics in This Part
  Chapter 25  Road to Higher Performance

25  Road to Higher Performance

Review of past, current, and future architectural trends:
  General-purpose and special-purpose acceleration
  Introduction to data and control parallelism

Topics in This Chapter
  25.1  Past and Current Performance Trends
  25.2  Performance-Driven ISA Extensions
  25.3  Instruction-Level Parallelism
  25.4  Speculation and Value Prediction
  25.5  Special-Purpose Hardware Accelerators
  25.6  Vector, Array, and Parallel Processing

25.1  Past and Current Performance Trends

Intel 4004, the first microprocessor (1971): 0.06 MIPS (4-bit processor)
Intel Pentium 4, circa 2005: 10,000 MIPS (32-bit processor)

Evolution of the Intel line:
  8-bit:  8008, 8080, 8085
  16-bit: 8086, 8088, 80186, 80188, 80286
  32-bit: 80386, 80486, Pentium, MMX, Pentium Pro, II, Pentium III, M, Celeron

Architectural Innovations for Improved Performance

Computer performance grew by a factor of about 10,000 between 1980 and 2000:
  x 100 due to faster technology
  x 100 due to better architecture

Available computing power ca. 2000:
  GFLOPS on desktop
  TFLOPS in supercomputer center
  PFLOPS on drawing board

Architectural method                            Improvement factor

Established methods (previously discussed):
   1. Pipelining (and superpipelining)              3-8
   2. Cache memory, 2-3 levels                      2-5
   3. RISC and related ideas                        2-3
   4. Multiple instruction issue (superscalar)      2-3
   5. ISA extensions (e.g., for multimedia)         1-3

Newer methods (covered in Part VII):
   6. Multithreading (super-, hyper-)               2-5?
   7. Speculation and value prediction              2-3?
   8. Hardware acceleration                         2-10?
   9. Vector and array processing                   2-10?
  10. Parallel/distributed computing                2-1000s?

Peak Performance of Supercomputers

[Figure: peak supercomputer performance, GFLOPS to PFLOPS, 1980-2010, growing roughly x 10 every 5 years. Machines shown include Cray X-MP, Cray 2, TMC CM-2, Cray T3D, TMC CM-5, ASCI Red, ASCI White Pacific, and the Earth Simulator.]

Dongarra, J., Trends in High Performance Computing, Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]

25.2  Performance-Driven ISA Extensions

Adding instructions that do more work per cycle:
  Shift-add: replace two instructions with one (e.g., multiply by 5)
  Multiply-add: replace two instructions with one (x := c + a * b)
  Multiply-accumulate: reduce round-off error (s := s + a * b)
  Conditional copy: to avoid some branches (e.g., in if-then-else)

Subword parallelism (for multimedia applications):
  Intel MMX: multimedia extension; 64-bit registers can hold multiple integer operands
  Intel SSE: streaming SIMD extension; 128-bit registers can hold several floating-point operands
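The subword-parallelism idea can be sketched in software. This is a minimal illustration, assuming a hypothetical 4-lane, 16-bit packed format; pack, unpack, and paddw are illustrative names (the last borrowed from MMX's packed-add mnemonic), not a real ISA's encoding.

```python
# A minimal sketch of MMX-style subword parallelism: four 16-bit lanes
# packed into one 64-bit word and added lane-by-lane. Lane layout and
# function names are illustrative, not any real ISA's definition.

LANES, BITS = 4, 16
MASK = (1 << BITS) - 1

def pack(values):
    """Pack four 16-bit unsigned integers into one 64-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * BITS)
    return word

def unpack(word):
    """Extract the four 16-bit lanes from a 64-bit word."""
    return [(word >> (i * BITS)) & MASK for i in range(LANES)]

def paddw(a, b):
    """Lane-wise 16-bit addition with wraparound, like a packed add."""
    return pack([(x + y) & MASK for x, y in zip(unpack(a), unpack(b))])
```

For example, adding the packed operands [1, 2, 3, 65535] and [10, 20, 30, 1] yields the lanes [11, 22, 33, 0]: each lane wraps independently, with no carry into its neighbor, which is the point of subword parallelism.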

25.3  Instruction-Level Parallelism

[Figure: two plots, (a) fraction of cycles vs. issuable instructions per cycle, (b) speedup attained vs. instruction issue width.]

Figure 25.4  Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91].

Instruction-Level Parallelism

Figure 25.5  A computation with inherent instruction-level parallelism.

VLIW and EPIC Architectures

VLIW: very long instruction word architecture
EPIC: explicitly parallel instruction computing

[Figure: general registers (128), floating-point registers (128), and predicates (64) feeding multiple execution units and memory.]

Figure 25.6  Hardware organization for IA-64. General and floating-point registers are 64 bits wide. Predicates are single-bit registers.

25.4  Speculation and Value Prediction

[Figure: (a) control speculation, where a load is hoisted above a branch as a speculative load, with a check load at the original position; (b) data speculation, where a load is hoisted above a store as a speculative load, with a check load after the store.]

Figure 25.7  Examples of software speculation in IA-64.

Value Prediction

[Figure: a multiply/divide unit paired with a memo table; on a hit, the memoized output is selected through a mux, while on a miss the function unit computes the result at full latency.]

Figure 25.8  Value prediction for multiplication or division via a memo table.
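The memo-table idea in Figure 25.8 can be sketched as a software memoization of a long-latency operation, keyed by its inputs, so a repeated operand pair bypasses the function unit. Class and method names here are illustrative, not from the textbook.

```python
# A minimal sketch of a memo table shadowing a slow function unit
# (here, multiplication). A hit returns the stored result immediately;
# a miss computes it at full latency and fills the table.

class MemoFunctionUnit:
    def __init__(self, operation):
        self.operation = operation   # the slow function unit being shadowed
        self.table = {}              # memo table: (a, b) -> result

    def compute(self, a, b):
        """Return (result, hit); hit=True means the memo table supplied it."""
        key = (a, b)
        if key in self.table:
            return self.table[key], True   # mux selects the memoized value
        result = self.operation(a, b)      # full-latency computation
        self.table[key] = result           # fill the table for next time
        return result, False

mul_unit = MemoFunctionUnit(lambda a, b: a * b)
```

The first compute(6, 7) misses and computes 42; a second compute(6, 7) hits and returns the stored 42 without invoking the multiplier.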

25.5  Special-Purpose Hardware Accelerators

[Figure: a CPU with data/program memory and a configuration memory driving an FPGA-like unit on which accelerators (Accel. 1-3) can be formed via loading of configuration registers, with some resources left unused.]

Figure 25.9  General structure of a processor with configurable hardware accelerators.

Graphic Processors, Network Processors, etc.

[Figure: a 4 x 4 grid of processing elements PE 0 through PE 15 between input and output buffers, with column memories and a feedback path.]

Figure 25.10  Simplified block diagram of Toaster2, Cisco Systems' network processor.

25.6  Vector, Array, and Parallel Processing

Flynn's categories:
                            Single data stream      Multiple data streams
  Single instr stream       SISD (uniprocessors)    SIMD (array or vector processors)
  Multiple instr streams    MISD (rarely used)      MIMD (multiprocessors or multicomputers)

Johnson's expansion of MIMD:
                       Shared variables                       Message passing
  Global memory        GMSV (shared-memory multiprocessors)   GMMP (rarely used)
  Distributed memory   DMSV (distributed shared memory)       DMMP (distributed-memory multicomputers)

Figure 25.11  The Flynn-Johnson classification of computer systems.

SIMD Architectures

Data parallelism: executing one operation on multiple data streams
  Concurrency in time: vector processing
  Concurrency in space: array processing

Example to provide context: multiplying a coefficient vector by a data vector (e.g., in filtering)
  y[i] := c[i] * x[i],  0 <= i < n

Sources of performance improvement in vector processing:
  One instruction is fetched and decoded for the entire operation
  The multiplications are known to be independent (no checking)
  Pipelining/concurrency in memory access as well as in arithmetic
Array processing is similar.
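To make the data-parallel structure concrete, the filtering example can be sketched as a single conceptual "vector multiply". Every product is independent of the others, which is exactly what lets a vector or array processor pipeline or replicate the multiplications without dependence checks.

```python
# A minimal sketch of the SIMD example y[i] = c[i] * x[i]: one
# conceptual vector instruction forms all n independent products.

def vector_multiply(c, x):
    """Elementwise product of a coefficient vector and a data vector."""
    assert len(c) == len(x)
    return [ci * xi for ci, xi in zip(c, x)]
```

For instance, vector_multiply([2, 3, 4], [10, 20, 30]) yields [20, 60, 120]; the three multiplications could proceed in any order, or all at once.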

MISD Architecture Example

[Figure: instruction streams 1-5 operating in sequence on data flowing from data in to data out.]

Figure 25.12  Multiple instruction streams operating on a single data stream (MISD).

MIMD Architectures

Control parallelism: executing several instruction streams in parallel
  GMSV: shared global memory (symmetric multiprocessors)
  DMSV: shared distributed memory (asymmetric multiprocessors)
  DMMP: message passing (multicomputers)

[Figures: Figure 27.1, centralized shared memory, with processors connected to memory modules through a processor-to-memory network, plus a processor-to-processor network and parallel I/O; Figure 28.1, distributed memory, with computing nodes (memories, processors, and routers) connected by an interconnection network.]

Amdahl's Law Revisited

[Figure 4.4: speedup s, 0-50, vs. enhancement factor p, 0-50, plotted for sequential fractions f = 0, 0.01, 0.02, 0.05, 0.1.]

f = sequential fraction
p = speedup of the rest with p processors

s = 1 / (f + (1 - f)/p)  <=  min(p, 1/f)

Figure 4.4  Amdahl's law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 - f part runs p times as fast.
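As a quick numeric check of Amdahl's formula, the speedup s = 1 / (f + (1 - f)/p) can be evaluated directly:

```python
# Amdahl's law as stated on the slide: s = 1 / (f + (1 - f)/p),
# bounded above by min(p, 1/f).

def amdahl_speedup(f, p):
    """Speedup when a fraction f stays sequential and the rest runs p times as fast."""
    return 1.0 / (f + (1.0 - f) / p)
```

For example, amdahl_speedup(0.05, 50) is about 14.5: even with a 50-fold enhancement, a 5% sequential fraction keeps the overall speedup well below the 1/f = 20 ceiling.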

26  Vector and Array Processing

Single instruction stream operating on multiple data streams:
  Data parallelism in time = vector processing
  Data parallelism in space = array processing

Topics in This Chapter
  26.1  Operations on Vectors
  26.2  Vector Processor Implementation

26.1  Operations on Vectors

Sequential processor:
  for i = 0 to 63 do
    P[i] := W[i] * D[i]
  endfor

Vector processor:
  load W
  load D
  P := W * D
  store P

Unparallelizable (each iteration depends on the previous one):
  for i = 0 to 63 do
    X[i+1] := X[i] + Z[i]
    Y[i+1] := X[i+1] + Y[i]
  endfor
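The contrast between the two loops can be sketched directly. The elementwise product has no dependence between iterations, so a vector processor can issue it as one operation; the recurrence X[i+1] := X[i] + Z[i] is loop-carried, so each iteration must wait for the previous result.

```python
# A small sketch contrasting the parallelizable loop with the
# unparallelizable recurrence from the slide above.

def elementwise_product(w, d):
    """Parallelizable: every P[i] depends only on W[i] and D[i]."""
    return [wi * di for wi, di in zip(w, d)]

def recurrence(x0, z):
    """Serial: X[i+1] := X[i] + Z[i] chains through every iteration."""
    x = [x0]
    for zi in z:
        x.append(x[-1] + zi)   # needs the value produced one step earlier
    return x
```

For example, recurrence(0, [1, 2, 3]) produces the running sums [0, 1, 3, 6]; no reordering of the loop body can break that chain.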

26.2  Vector Processor Implementation

[Figure: a vector register file fed from scalar registers and, via load units A and B, from memory; forwarding muxes route operands to three pipelined function units, and a store unit writes results back to memory.]

Figure 26.1  Simplified generic structure of a vector processor.

Example Array Processor

[Figure: a control unit broadcasting to a processor array with switches and parallel I/O.]

Figure 26.7  Array processor with 2D torus interprocessor communication network.

27  Shared-Memory Multiprocessing

Multiple processors sharing a memory unit seems naïve:
  Didn't we conclude that memory is the bottleneck?
  How then does it make sense to share the memory?

Topics in This Chapter
  27.1  Centralized Shared Memory
  27.2  Multiple Caches and Cache Coherence
  27.3  Implementing Symmetric Multiprocessors
  27.6  Implementing Asymmetric Multiprocessors

Parallel Processing as a Topic of Study

An important area of study that allows us to overcome fundamental speed limits

Our treatment of the topic is quite brief (Chapters 26-27)

Graduate course ECE 254B: Adv. Computer Architecture: Parallel Processing

27.1  Centralized Shared Memory

[Figure: processors 0 to p-1 connected through a processor-to-memory network to memory modules 0 to m-1, with a processor-to-processor network and parallel I/O.]

Figure 27.1  Structure of a multiprocessor with centralized shared memory.

27.2  Multiple Caches and Cache Coherence

[Figure: the multiprocessor of Figure 27.1 with a private cache added between each processor and the processor-to-memory network.]

Private processor caches reduce memory access traffic through the interconnection network but lead to challenging consistency problems.

27.3  Implementing Symmetric Multiprocessors

[Figure: computing nodes (typically 1-4 CPUs and caches per node), interleaved memory, and I/O modules with standard interfaces, all attached through bus adapters to a very wide, high-bandwidth bus.]

Figure 27.6  Structure of a generic bus-based symmetric multiprocessor.

27.6  Implementing Asymmetric Multiprocessors

[Figure: computing nodes 0-3 (typically 1-4 CPUs and associated memory per node), each linked into a ring network, with connections to I/O controllers.]

Figure 27.11  Structure of a ring-based distributed-memory multiprocessor.

28  Distributed Multicomputing

Computer architects' dream: connect computers like toy blocks
  Building multicomputers from loosely connected nodes
  Internode communication is done via message passing

Topics in This Chapter
  28.1  Communication by Message Passing
  28.2  Interconnection Networks
  28.3  Message Composition and Routing
  28.6  Grid Computing and Beyond

28.1  Communication by Message Passing

[Figure: computing nodes 0 to p-1, each containing memory and a processor plus a router, connected by an interconnection network, with parallel input/output.]

Figure 28.1  Structure of a distributed multicomputer.

Router Design

[Figure: a generic router with an injection channel and an ejection channel, each with a link controller (LC) and a message queue (Q); input channels feed input queues, a switch governed by routing and arbitration logic connects them to output queues, and these drive the output channels.]

Figure 28.2  The structure of a generic router.

Interprocess Communication via Messages

[Figure: process A executes "send x"; process B executes "receive x" and is suspended until the message arrives, then awakened; the gap between send and receive is the communication latency.]

Figure 28.4  Use of send and receive message-passing primitives to synchronize two processes.
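The synchronization in Figure 28.4 can be sketched with two threads and a queue standing in for the interconnection network. As on the slide, the receiver blocks (is "suspended") until the sender's message arrives, so receive doubles as a synchronization point; the thread and channel names here are illustrative.

```python
# A minimal sketch of send/receive synchronization: process B blocks
# in receive until process A sends, then is awakened with the value.

import queue
import threading

channel = queue.Queue()          # stands in for the message network
log = []                         # records the observed event order

def process_a():
    log.append("A: computing")
    channel.put("x")             # send x
    log.append("A: sent x")

def process_b():
    x = channel.get()            # receive x: blocks until A sends
    log.append(f"B: received {x}")

b = threading.Thread(target=process_b)
a = threading.Thread(target=process_a)
b.start()                        # B starts first and suspends in receive
a.start()
a.join()
b.join()
```

Even though B starts first, its "received" event can only follow A's send, which is exactly the ordering the primitives enforce.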

28.2  Interconnection Networks

[Figure: (a) a direct network, where each node has its own router; (b) an indirect network, where nodes attach to a separate fabric of routers.]

Figure 28.5  Examples of direct and indirect interconnection networks.

Direct Interconnection Networks

[Figure: (a) 2D torus; (b) 4D hypercube; (c) chordal ring; (d) ring of rings.]

Figure 28.6  A sampling of common direct interconnection networks. Only routers are shown; a computing node is implicit for each router.

Indirect Interconnection Networks

[Figure: (a) hierarchical buses at levels 1-3; (b) an omega network.]

Figure 28.7  Two commonly used indirect interconnection networks.

28.3  Message Composition and Routing

[Figure: a message is split into packets (first through last, with padding as needed); each transmitted packet carries a header, data or payload, and a trailer, and is itself divided into flow control digits (flits).]

Figure 28.8  Messages and their parts for message passing.

Computing in the Cloud

Computational resources, both hardware and software, are provided by, and managed within, the cloud; users pay a fee for access.

Managing and upgrading resources is much more efficient in large, centralized facilities (warehouse-sized data centers or server farms).

This is a natural continuation of the outsourcing trend for special services, which lets companies focus their energies on their main business.

Warehouse-Sized Data Centers

[Image from IEEE Spectrum, June 2009]