Lecture 1: An Overview of High-Performance Computer Architecture


Lecture 1: An Overview of High-Performance Computer Architecture
ECE 463/521, Fall 2002
Edward F. Gehringer

Automobile Factory (note: non-animated version)
(A series of automobile assembly-line diagrams introduces the pipelining analogy used below.)

Basic Assembly Line
Unchangeable truth: it takes a long time to build one car.
Example: time spent in the assembly line is 1 hour (12 min. per station).
Basic assembly line: throughput = 1 car per hour.
We wait until the first car is fully assembled before starting the next one:
  only 1 car is in the assembly line at a time;
  only 1 station is active at a time; the other 4 are idle.


Pipelined Assembly Line
Unchangeable truth: it still takes a long time to build one car.
Pipelining: time to fill the pipeline = 1 hour; once filled, throughput = 1 car per 12 minutes.
Speedup due to pipelining is (an unusual definition)

  speedup = rate_pipelined / rate_unpipelined = (1 car / 12 min) / (1 car / 60 min) = 5
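To see where the two throughput figures come from, here is one way to write the time to build n cars (my own elaboration of the slide's numbers, assuming 5 stations of 12 minutes each):

  unpipelined: T(n) = n × 60 min (one car at a time)
  pipelined:   T(n) = 60 min to fill the line + (n − 1) × 12 min

As n grows, the ratio of the two approaches 60 / 12 = 5, the number of stations.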

Simple Processor Pipeline Example
Instruction: ADD r1, r2, r3 (r1 ← r2 + r3)
Fetch: fetch the instruction from memory using the current PC (program counter), then PC ← PC + 1.
Decode: decode the instruction to determine the opcode; read the values of r2 and r3 from the register file.
Execute: perform r2 + r3 in the ALU (arithmetic/logic unit).
Memory: do nothing (only loads/stores access memory).
Writeback: write the result of r2 + r3 into r1, in the register file.

Pipeline Performance Problems (1): Data dependences
  ADD r1, r2, r3
  ADD r4, r1, r9
The second instruction must wait ("stall") in the Decode stage until the first completes: the first writes the result r1 into the register file in Writeback; the second reads r1 from the register file in Decode.

Data Dependence Stalls
(Animated pipeline diagrams: the second instruction sits stalled in Decode, and bubbles enter the pipeline behind it, until the first instruction writes r1 into the register file.)
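A minimal sketch of how this stall can be worked out cycle by cycle (my own illustration, not code from the course; it assumes a 5-stage pipeline with no bypassing and a register file that cannot be written and read in the same cycle):

#include <stdio.h>

#define NINSTR 2

typedef struct { const char *text; int dest, src1, src2; } Instr;

int main(void) {
    /* The two instructions from the data-dependence example. */
    Instr prog[NINSTR] = {
        { "ADD r1,r2,r3", 1, 2, 3 },
        { "ADD r4,r1,r9", 4, 1, 9 },
    };
    int decode[NINSTR], wb[NINSTR];   /* cycle in which Decode completes / Writeback occurs */

    for (int i = 0; i < NINSTR; i++) {
        /* Earliest possible Decode: cycle 2 for the first instruction,
           otherwise one cycle after the previous instruction's Decode. */
        int d = (i == 0) ? 2 : decode[i - 1] + 1;
        /* No bypassing: stall until every producer of a source register
           has finished Writeback. */
        for (int j = 0; j < i; j++)
            if (prog[j].dest == prog[i].src1 || prog[j].dest == prog[i].src2)
                if (wb[j] + 1 > d)
                    d = wb[j] + 1;
        decode[i] = d;
        wb[i] = d + 3;                /* Execute, Memory, Writeback follow back to back */
        printf("%-14s F=%d D=%d X=%d M=%d W=%d\n",
               prog[i].text, i + 1, d, d + 1, d + 2, d + 3);
    }
    return 0;
}

For the second instruction this prints D=6 rather than D=3: three bubble cycles while it waits in Decode. (With a register file that writes in the first half of a cycle and reads in the second, the stall would be one cycle shorter.)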

Speedup with data dependences
What is the speedup of this pipeline (T_sequential / T_pipelined) if 1/10th of all instructions contain a data dependence? Can you give a general formula for a k-stage pipeline? What other information do you need to know?

Reducing Data Dependence Stalls
We could directly forward results from producer to consumer, bypassing the register file. The hardware is called a data bypass, or register-file bypass. The technique is called bypassing or forwarding.
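For the speedup question above, one possible general formula (my own sketch, assuming each pipeline stage takes one cycle, the unpipelined machine therefore takes k cycles per instruction, and each data dependence costs s stall cycles; the stall penalty s is the extra information needed):

  speedup = T_sequential / T_pipelined = k / (1 + f × s)

where f is the fraction of instructions with a data dependence (1/10 here). For example, with k = 5, f = 1/10, and s = 3 (as in the sketch above), speedup = 5 / 1.3 ≈ 3.8.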

Data Bypass
(Animated diagrams: the register file still holds a garbage value for r1, but the correct r1 is delivered to the dependent instruction over the data bypass.)
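A minimal sketch of the decision such bypass hardware makes for one ALU input, in the style of the classic MIPS forwarding unit (my own illustration; the ex_mem_*/mem_wb_* names are assumed stand-ins for the pipeline registers after Execute and Memory):

#include <stdbool.h>
#include <stdio.h>

typedef enum { FROM_REGFILE, FROM_EX_MEM, FROM_MEM_WB } BypassSel;

/* Decide where one ALU source operand should come from. */
BypassSel select_alu_input(int src_reg,
                           bool ex_mem_regwrite, int ex_mem_dest,
                           bool mem_wb_regwrite, int mem_wb_dest)
{
    /* Prefer the newest value: the instruction one stage ahead of us.
       Destination 0 is skipped because register 0 is hardwired to zero
       in MIPS-like ISAs. */
    if (ex_mem_regwrite && ex_mem_dest != 0 && ex_mem_dest == src_reg)
        return FROM_EX_MEM;   /* forward the just-computed ALU result */
    if (mem_wb_regwrite && mem_wb_dest != 0 && mem_wb_dest == src_reg)
        return FROM_MEM_WB;   /* forward the value about to be written back */
    return FROM_REGFILE;      /* no hazard: the register file value is current */
}

int main(void) {
    /* ADD r4,r1,r9 reads r1 while ADD r1,r2,r3 sits in the EX/MEM register. */
    BypassSel s = select_alu_input(1, true, 1, false, 0);
    printf("r1 comes from: %s\n",
           s == FROM_EX_MEM ? "EX/MEM bypass" :
           s == FROM_MEM_WB ? "MEM/WB bypass" : "register file");
    return 0;
}

This is the same priority rule the diagrams imply: a value still in flight beats the stale copy in the register file.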

Pipeline Performance Problems (2): Branches
  ADD r1, r2, r3
  BEQ X, r5, r7      (branch to X if r5 == r7; taken)
  ADD r4, r1, r9
  LD  r4, 10(r4)
X: AND r4, r10, r11
Which instruction should be fetched after the branch? The Fetch stage stalls until the branch reaches the Execute stage; the Execute stage evaluates the branch condition (r5 == r7).

Branch Stalls
(Animated diagrams: bubbles enter the pipeline behind the branch; once the outcome "taken" is known, fetching resumes with the AND instruction at X.)

Reducing Branch Stalls
Branch prediction: learn which way a given branch tends to go. Like predicting the economy, branch prediction is based on past history. Even simple predictors can be 80% accurate.
If correct: no branch stalls.
If incorrect: quash the instructions in the previous pipeline stages. Performance degrades to the stall case. There may be additional penalties to clean up the pipeline.
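As an illustration of "learning which way a given branch tends to go", here is a minimal sketch of a 2-bit saturating-counter predictor indexed by the low bits of the branch PC (my own example, not necessarily the predictor covered later in the course):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_BITS 10                      /* 1024-entry prediction table */
#define TABLE_SIZE (1u << TABLE_BITS)

static uint8_t counters[TABLE_SIZE];       /* 2-bit counters: 0-1 predict not taken, 2-3 predict taken */

static unsigned index_of(uint32_t pc) {
    return (pc >> 2) & (TABLE_SIZE - 1);   /* drop the byte offset, keep the low bits */
}

bool predict(uint32_t pc) {
    return counters[index_of(pc)] >= 2;
}

void train(uint32_t pc, bool taken) {
    uint8_t *c = &counters[index_of(pc)];
    if (taken && *c < 3) (*c)++;           /* saturate at "strongly taken" */
    if (!taken && *c > 0) (*c)--;          /* saturate at "strongly not taken" */
}

int main(void) {
    uint32_t pc = 0x400100;                /* hypothetical address of a loop branch */
    int correct = 0;
    for (int i = 0; i < 100; i++) {        /* taken 99 times, then falls through once */
        bool outcome = (i != 99);
        correct += (predict(pc) == outcome);
        train(pc, outcome);
    }
    printf("correct predictions: %d/100\n", correct);
    return 0;
}

On this loop branch the sketch is right 97 times out of 100, consistent with the slide's point that even simple predictors do well.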

Branch Prediction (correct)
(Animated diagrams: the branch is predicted taken, so the AND instruction at X is fetched right behind it and no bubbles are needed.)

Branch Prediction (incorrect)
(Animated diagrams: the branch is predicted not taken, so the fall-through instructions, including the LD, are fetched; when the outcome turns out to be taken, they are quashed and fetching restarts with the AND instruction at X.)

Speedup with branch stalls
What is the speedup of the pipeline if 1/5 of the instructions are branches, and 4/5 of those are correctly predicted? Can you give a general formula for a k-stage pipeline? What other information do you need to know? (One possible formula is sketched below.)

Sears Tower Repairman
The repair shop is in the basement. It has many tools. A few are used frequently, e.g., hammer, crescent wrench, screwdriver. Most are used infrequently, e.g., socket wrenches.

Sears Tower Repairman: Problem
The Sears Tower has 110 stories! Today, you are working on the top floor. You can't bring the entire shop with you, and you don't know exactly which tools to bring with you from the basement.
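For the branch-stall speedup question above, one possible general formula (my own sketch, assuming each stage takes one cycle, the unpipelined machine takes k cycles per instruction, and each mispredicted branch costs b bubble cycles; the misprediction penalty b is the extra information needed):

  speedup = T_sequential / T_pipelined = k / (1 + f × (1 − a) × b)

where f is the fraction of instructions that are branches (1/5 here) and a is the prediction accuracy (4/5 here). For example, with k = 5 and b = 2, speedup = 5 / (1 + 0.2 × 0.2 × 2) = 5 / 1.08 ≈ 4.6.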

Sears Tower Repairman: Solution
Carry frequently used tools in your tool belt. The tool belt becomes a cache of tools; it drastically reduces the number of trips down to the basement. When you have to fetch the ¼" socket wrench, common sense says to also fetch the ½", ¾", etc., just in case.

Caches
The processor-memory speed gap:
The processor is very fast. Intel Pentium-4: 1 GHz, 1 clock cycle = 1 ns.
Large memory is slow! Main memory: 50 ns to access, 50 times slower than the Pentium-4!
The processor wants memory that is both large and fast.
LARGE: the O/S and applications consume lots of memory.
FAST: otherwise the processor stalls nearly 100% of the time waiting for memory.

Caches
Processor: 1-ns clock.
Tool belt = 64 KB cache: 2-ns access time, hits 95% of the time.
Basement shop = 256 MB main memory: 50-ns access time.
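For the average-access-time question on the next slide, the standard formula would be (a sketch, assuming a miss pays the full 50-ns main-memory time in addition to the 2-ns cache access):

  average access time = cache time + miss rate × memory time
                      = 2 ns + 0.05 × 50 ns = 4.5 ns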

Average access time
What is the average access time in this system?

Caches
Caches are effective because of locality of reference.
Temporal locality: if you access an item, you are likely to access it again in the near future. The tool belt contains frequently used tools.
Spatial locality: if you access an item, you are likely to access a nearby item in the near future. This is why the repairman also fetched the ½" and ¾" socket wrenches when (s)he only needed the ¼".

Overview of Topics in 463/521
Measuring performance and cost
Caches and memory hierarchies
Instruction-set architecture (ISA)
  Defines the software/hardware interface
Simple pipelining
  Data and control (branch) dependences
  Data bypasses
  Branch prediction

Overview of Topics in 463/521
Complex pipelining and instruction-level parallelism (ILP)
  Data hazards
  Dynamic scheduling, register renaming, Tomasulo's algorithm
  Precise interrupts
  Superscalar, VLIW, and vector processors

Projects
Three projects:
  Cache simulator
  Branch-predictor simulator
  Dynamic-scheduling pipeline simulator
The programming for the projects is harder than anything most of you have encountered before.

Course Web site
http://courses.ncsu.edu/ece463/lec/001
http://courses.ncsu.edu/ece521/lec/001
These two home pages are linked to a common Web site. Any information specific to one course will be listed in the announcements section of that course's home page.