CSE502 Graduate Computer Architecture, Lec 22: Goodbye to Computer Architecture and Review

Larry Wittie, Computer Science, Stony Brook University
http://www.cs.sunysb.edu/~cse502 and ~lw
Slides adapted from David Patterson, UC-Berkeley CS252-S06

Chapter 1: Introduction

- Computer science is at a crossroads, moving from sequential to parallel computing. Salvation requires innovation in many fields, including computer architecture.
- Computer architecture skill sets are distinctive: technology tracking and anticipation, solid interfaces that really work, and a quantitative approach to design.
- Five quantitative principles of design (the last two are written out below):
  1. Take advantage of parallelism
  2. Principle of locality
  3. Focus on the common case
  4. Amdahl's Law
  5. The processor performance equation
- Computer architecture >> instruction sets
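Both formulas are standard statements from Hennessy & Patterson, restated here for review:

```latex
% Amdahl's Law: overall speedup when a fraction f of execution
% time is sped up by a factor s; the untouched fraction (1 - f)
% limits the total gain.
\[
\text{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + \frac{f}{s}}
\]

% The processor performance equation: total CPU time factors into
% instruction count, cycles per instruction, and cycle time.
\[
\text{CPU time} = \text{Instruction count} \times \text{CPI}
                  \times \text{Clock cycle time}
\]
```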

Chapter 2: Dynamic Execution Processors

- Leverage implicit parallelism for performance: instruction-level parallelism (ILP)
  - Loop unrolling by the compiler to increase ILP
  - Branch prediction to increase ILP
- Dynamic HW exploitation of ILP
  - Works when dependences cannot be known at compile time
  - Can hide L1 cache misses
  - Code compiled for one machine runs well on another
- Reservation stations: renaming to a larger set of registers, plus buffering of source operands (see the renaming sketch below)
  - Prevents the architectural registers from becoming a bottleneck
  - Avoids WAR and WAW hazards
  - Allows loop unrolling in HW
  - Not limited to basic blocks (the integer units get ahead, beyond branches)
  - Helps with cache misses as well
- Lasting contributions: dynamic scheduling, register renaming, load/store disambiguation
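A minimal C sketch of the renaming idea; the structure and names are illustrative, and a real renamer also manages a free list and recovers mappings on branch mispredictions:

```c
#include <stdio.h>

#define NUM_ARCH_REGS 8    /* architectural registers r0..r7 */
#define NUM_PHYS_REGS 32   /* larger physical register file  */

static int rename_table[NUM_ARCH_REGS]; /* arch reg -> phys reg   */
static int next_free = NUM_ARCH_REGS;   /* naive free-list cursor */

/* Rename one instruction "rd = rs1 op rs2": the sources read the
 * CURRENT mapping, while the destination gets a FRESH physical
 * register, which is what removes WAR and WAW hazards. */
static void rename(int rd, int rs1, int rs2)
{
    int p1 = rename_table[rs1];
    int p2 = rename_table[rs2];
    int pd = next_free++ % NUM_PHYS_REGS; /* no real free list here */
    rename_table[rd] = pd;
    printf("p%d = p%d op p%d\n", pd, p1, p2);
}

int main(void)
{
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        rename_table[r] = r; /* identity mapping at start */

    rename(1, 2, 3); /* r1 = r2 op r3                        */
    rename(2, 1, 4); /* r2 = r1 op r4: WAR on r2 disappears  */
    rename(1, 5, 6); /* r1 = r5 op r6: WAW on r1 disappears  */
    return 0;
}
```

The second and third instructions would stall or conflict if they had to reuse r2 and r1 directly; after renaming, each writes a distinct physical register and all three can be in flight at once.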

Chapter 3: Static Instruction-Level Parallelism

- Limits to ILP (power efficiency, compilers, dependences) seem to cap practical designs at 3 to 6 issues per cycle
- Explicit parallelism (data-level or thread-level parallelism) is the next step in performance (see the threading sketch below)
- Coarse-grained vs. fine-grained multithreading: switch threads only on a big stall vs. on every clock cycle
- Simultaneous multithreading (SMT) is fine-grained multithreading built on an out-of-order superscalar microarchitecture: instead of replicating the registers, reuse the rename registers
- Itanium/EPIC/VLIW was not a breakthrough in ILP
- The balance of ILP and TLP will be decided in the marketplace
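As a concrete taste of explicit thread-level parallelism, here is a minimal POSIX-threads reduction in C; the slice partitioning and names are illustrative (compile with -lpthread):

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS]; /* one slot per thread: no data race */

/* Each thread sums its own contiguous slice of the array. */
static void *sum_slice(void *arg)
{
    long t = (long)arg;
    long lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[t] = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++) data[i] = 1.0;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("sum = %.0f\n", total); /* prints 1000000 */
    return 0;
}
```

Unlike ILP, which hardware extracts implicitly, this parallelism is visible in the source: the programmer (or compiler/runtime) must partition the work and join the results.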

Appendix F: Vector Architecture

- Vector processing is an alternative model for exploiting ILP
- If code is vectorizable, vectors give simpler hardware, better energy efficiency, and a better real-time model than out-of-order machines (see the SAXPY kernel below)
- Design issues include the number of lanes, number of functional units, number and length of vector registers, exception handling, and conditional operations
- The fundamental design issue is memory bandwidth, with virtual address translation and caching
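The textbook vectorizable kernel is SAXPY (y = a*x + y): every iteration is independent, so the loop maps directly onto vector lanes or SIMD units. A scalar C statement of it, as a reference point:

```c
#include <stddef.h>

/* SAXPY: y[i] = a * x[i] + y[i]. No iteration depends on any
 * other, so a vector machine can load a strip of x and y,
 * multiply-add across all lanes, store, and repeat. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```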

Bye-bye Uniprocessors & Questions for 2012

- Did vector processors ever become popular for multimedia and/or signal processing?
- Did EPIC and VLIW die?
- Given the new emphasis on power limits and the switch to multiprocessors, did microarchitecture complexity:
  1. Continue to increase (>6 issued, >6 completed instructions per cycle)?
  2. Freeze as of ~2002 (e.g., Power4, Opteron)?
  3. Regress (e.g., Niagara)?
- Did fine-grained multithreading help with efficiency, or did it make the parallel programming task harder (since the scale of parallelism is larger) and so decline in popularity?

Chapter 4: Multiprocessors

- Caches contain all information on the state of cached memory blocks
- Snooping caches over a shared medium work for smaller MPs by invalidating other cached copies on a write (see the MSI sketch below)
- Sharing cached data raises problems of coherence (what values a read returns) and consistency (when a written value will be returned by a read)
- Snooping and directory protocols are similar; a bus makes snooping easier because of broadcast (snooping ==> uniform memory access) but limits the number of processors
- A directory adds an extra data structure to track the state of all cache blocks
- Distributing the directory ==> scalable shared-address multiprocessor ==> cache-coherent, non-uniform memory access
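To make invalidation-on-write concrete, here is a minimal sketch of the classic three-state (MSI) protocol's per-line state transitions. This is a simplification: real protocols such as MESI add states, and the bus transactions are only noted in comments.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } mstate_t;

typedef enum {
    CPU_READ, CPU_WRITE,  /* requests from this cache's CPU   */
    BUS_READ, BUS_WRITE   /* requests snooped from other CPUs */
} event_t;

/* Next state of one cache line under the classic MSI protocol.
 * Bus actions (fetch, invalidate, write-back) are implied. */
static mstate_t msi_next(mstate_t s, event_t e)
{
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;   /* fetch block        */
        if (e == CPU_WRITE) return MODIFIED; /* fetch + invalidate */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE) return MODIFIED; /* invalidate others  */
        if (e == BUS_WRITE) return INVALID;  /* another CPU wrote  */
        return SHARED;
    case MODIFIED:
        if (e == BUS_READ)  return SHARED;   /* write back, share  */
        if (e == BUS_WRITE) return INVALID;  /* write back, yield  */
        return MODIFIED;
    }
    return INVALID;
}

int main(void)
{
    mstate_t s = INVALID;
    s = msi_next(s, CPU_READ);  /* INVALID  -> SHARED   */
    s = msi_next(s, CPU_WRITE); /* SHARED   -> MODIFIED */
    s = msi_next(s, BUS_READ);  /* MODIFIED -> SHARED   */
    printf("final state = %d\n", s);
    return 0;
}
```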

Bye-bye Multiprocessors & Questions for 2012

- Did processors per chip continue to double every 2 years? Desktop vs. servers vs. embedded?
- How many sockets can easily be put together?
- Any innovations in synchronization or communication that take advantage of cores on the same chip?
- Did everything remain cache coherent all the time, or can caches be turned off to use just message passing?
- Any changes to the ISA to make it easier to support parallel programming (e.g., transactional memory)?
- Did we need new languages/compilers to use these machines? Threads vs. messages? Is MPI still popular?
- Did enhancements to performance/accountability/predictability simplify parallel programming?
- Performance/power only, or SPUR (Security, Privacy, Usability, Reliability) too?
- What % of peak did the PETAFLOPS machines deliver?

Chapter 5: Advanced Memory Hierarchy

- The memory-CPU "wall" inspires optimizations, since so much performance is lost there
- Reducing hit time: small and simple caches, way prediction, trace caches
- Increasing cache bandwidth: pipelined caches, multibanked caches, nonblocking caches
- Reducing miss penalty: critical word first, merging write buffers
- Reducing miss rate: compiler optimizations that fit data accesses to caches (see the blocking example below)
- Reducing miss penalty or miss rate via parallelism: hardware prefetching, compiler prefetching
- Auto-tuners: is search replacing static compilation to explore the optimization space?
- Continuing DRAM bandwidth innovations: fast page mode, synchronous DRAM, double data rate
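A classic example of fitting data accesses to caches is loop blocking (tiling) for matrix multiply; a minimal sketch with illustrative sizes:

```c
#include <stdio.h>

#define N 256  /* matrix dimension (illustrative)                */
#define B  32  /* tile size: three BxB tiles should fit in cache */

static double a[N][N], b[N][N], c[N][N];

/* Blocked (tiled) matrix multiply, c += a * b. Working on BxB
 * tiles reuses each tile while it is still cache-resident,
 * cutting capacity misses vs. the naive triple loop. Assumes
 * N is a multiple of B. */
static void matmul_blocked(void)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = c[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 1.0; }
    matmul_blocked();
    printf("c[0][0] = %.0f (expect %d)\n", c[0][0], N);
    return 0;
}
```

Auto-tuners automate exactly the choice made by hand here: they search over tile sizes like B instead of trusting one statically compiled value.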

Chapter 5: Advanced Memory Hierarchy II

- Virtual machine revival:
  - Overcome security flaws of modern OSes
  - Processor performance is no longer the highest priority
  - Manage software, manage hardware
- Virtualization challenges for the processor, virtual memory, and I/O (especially I/O)
- Paravirtualization and ISA upgrades to cope with those difficulties
- Xen as an example of a VMM using paravirtualization

Bye-bye Memory Hierarchy & 2012 Questions

- What was the maximum number of levels of cache on chip? Off chip? (Did it go beyond 3?)
- Did local memory (IBM Cell) replace or enhance caches?
- Did DRAM latency improve? How much better is DRAM bandwidth? What is the rate of improvement in capacity: 2X / 3 years?
- Did virtual machines become very popular? Which was the primary reason: enhance security, manage hardware, or manage software? Is software shipped inside a VM?
- Is x86 now as low-overhead for a VMM as the IBM 370 was? How many iterations did it take?
- Compared to 2007, are security/privacy/break-ins: 1. Worse; 2. The same; 3. Better? If better, did architecture/hardware play an important role?

Chapter 6: Advanced Storage

- Disks: areal density now grows 30%/yr vs. 100%/yr in the early 2000s
- TPC benchmarks: performance/$ to normalize configurations; auditing to ensure no foul play; throughput with a restricted response time is the normal measure
- Fault ==> latent errors in the system ==> failure in service; components often fail slowly
- Real systems: problems in maintenance and operation as well as in hardware and software
- Queuing models assume a state of equilibrium: input rate = output rate
- Little's Law: Length_system = rate x Time_system (mean number of customers = arrival rate of customers x mean time each spends in the system); worked example below
- Appreciation of latency vs. utilization
- Clusters for storage as well as computation
- RAID: reliability >> performance for storage
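A worked instance of Little's Law with illustrative numbers:

```latex
% Little's Law: L = lambda * W, where L is the mean number of
% requests in the system, lambda the arrival rate, and W the mean
% time each request spends in the system.
\[
L = \lambda \, W
\]
% Example (illustrative numbers): a disk subsystem receiving
% lambda = 200 I/Os per second with a mean residence time of
% W = 0.02 s holds, on average,
\[
L = 200\ \text{IOPS} \times 0.02\ \text{s} = 4\ \text{requests}
\]
```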

Goodbye to Storage & Questions for 2012

- How did disk capacity improve? Seek time? Bandwidth?
- Are disks basically unchallenged for storage, or did flash or NVDRAM make major inroads?
- Is TPC-C still used as a benchmark?
- Did the SATA, SAS, and FC-AL interfaces all remain popular?
- Is RAID 6 the norm? RAID 7 (recover from 3 errors)?
- Did tapes finally die, replaced by remote disks?
- Which became the most popular model of distributed (storage/caching) services:
  1. Model A (Akamai): 1000s of data centers, each with dozens of computers
  2. Model G (Google): dozens of data centers, each with 1000s of computers
- Did Statistical Machine Learning (RAD Lab) significantly help storage/distributed systems?