Lecture 1: Introduction

Lecture 1: Introduction. Dr. Eng. Amr T. Abdel-Hamid, Winter 2014, Computer Architecture. Text book slides: Computer Architecture: A Quantitative Approach, 5th Edition, John L. Hennessy & David A. Patterson, with modifications.

CPU History in a Flash
Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
A 125 mm² chip in 0.065 micron CMOS = 2312 copies of RISC II + FPU + I-cache + D-cache; RISC II shrinks to ~0.02 mm² at 65 nm
Caches via DRAM or 1-transistor SRAM
Is the processor the new transistor?

[Figure: single-processor performance over time - growth through the RISC era, then the move to multiprocessors.]

Snapdragon 805 processor specs

Classes of Computers
Personal Mobile Device (PMD), e.g. smart phones and tablet computers: emphasis on energy efficiency and real-time performance
Desktop computing: emphasis on price-performance
Servers: emphasis on availability, scalability, throughput
Clusters / warehouse-scale computers: used for Software as a Service (SaaS); emphasis on availability and price-performance. Sub-class: supercomputers, with emphasis on floating-point performance and fast internal networks
Embedded computers: emphasis on price

Instruction Set Architecture: Critical Interface
The instruction set sits between software and hardware. Properties of a good abstraction:
Lasts through many generations (portability)
Used in many different ways (generality)
Provides convenient functionality to higher levels
Permits an efficient implementation at lower levels

ISA vs. Computer Architecture
The old definition of computer architecture = instruction set design; other aspects of computer design were called "implementation", insinuating that implementation is uninteresting or less challenging. Our view is: computer architecture >> ISA. The architect's job is much more than instruction set design, and the technical hurdles today are more challenging than those in instruction set design.

Computer Architecture is Design and Analysis
Architecture is an iterative process: searching the space of possible designs, at all levels of computer systems, with creativity on the design side and cost/performance analysis separating the good ideas from the mediocre and bad ones.

Administrivia
Instructor: Dr. Amr T. Abdel-Hamid
Office: C3-320
Email: amr.talaat@guc.edu.eg
Office Hours: Monday, 3rd & 4th lectures
T.A.: ???

Administrivia

Course Grading
Quizzes: 3 quizzes, best 2 count, 10%
Mid Term: 30%
Final exam: 40%
Project: 20%
Assignments: not graded

Project
Phase 0: Select your partner (17/9/2014). Submit the list of your group members (2-4 per group) and a comparison between RISC and CISC processors.
Phase 1: ....
Phase N: Project implementation + report (2 weeks before finals)
FINAL: non-negotiable deadline

In Time & Too Late Policy
Phases 0 and 1: a 5% project-grade penalty per day for being late.
Phases 2 to N: no late presentation is possible.
Honor code: 100% penalty for both the copier and the copy-giver of any report/code.

Quantitative Principles of Design
1. Take advantage of parallelism
2. Principle of locality
3. Focus on the common case
4. Amdahl's Law
5. The processor performance equation

1) Taking Advantage of Parallelism
Increase the throughput of server computers via multiple processors or multiple disks.
Detailed HW design (DSD course, shortly): carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand; multiple memory banks are searched in parallel in set-associative caches.
Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence. Not every instruction depends on its immediate predecessor, so executing instructions completely or partially in parallel is possible.
Classic 5-stage pipeline: 1) Instruction Fetch (IF), 2) Register Read (Reg), 3) Execute (ALU), 4) Data Memory Access (Dmem), 5) Register Write (Reg)
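To make the overlap concrete, here is a small timing sketch (mine, not from the slides; the helper names are hypothetical): in an ideal pipeline with no stalls, instruction i enters stage s in cycle i + s + 1, so n instructions complete in depth + (n - 1) cycles instead of depth x n.

```python
# Ideal 5-stage pipeline timing sketch (hypothetical helpers, not from the slides).
STAGES = ["IF", "Reg", "ALU", "Dmem", "Reg"]

def stage_cycle(instr_index, stage_index):
    """1-based cycle in which instruction `instr_index` (0-based) occupies a stage."""
    return instr_index + stage_index + 1

def total_cycles(num_instructions, depth=len(STAGES)):
    """Cycles to complete a sequence with perfect overlap (no hazards or stalls)."""
    return depth + (num_instructions - 1)

# 4 instructions: 8 cycles pipelined vs. 20 cycles (5 x 4) unpipelined.
```

The diagonal pattern of `stage_cycle` is exactly the staircase shown in the pipeline timing diagrams later in the lecture.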

2) The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time. Two different types of locality:
Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
For the last 30 years, HW has relied on locality for memory performance.
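As a toy illustration (mine, not the slides'), both kinds of locality appear in even the simplest loop:

```python
def sum_array(a):
    """Sum a list: a stand-in for the access patterns named on the slide."""
    total = 0
    for x in a:        # elements touched at consecutive addresses: spatial locality
        total += x     # 'total' reused every iteration: temporal locality
    return total
```

Caches exploit exactly this: the reused accumulator stays in a register or cache line, and neighboring elements arrive together in one cache block.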

Levels of the Memory Hierarchy (capacity / access time / cost, with typical transfer unit and who manages it)
CPU registers: 100s of bytes, 300-500 ps (0.3-0.5 ns); transfer unit: instruction operands, 1-8 bytes (program/compiler)
L1 and L2 cache: 10s-100s of KBytes, ~1 ns to ~10 ns, $1000s/GByte; transfer unit: blocks, 32-64 bytes (L1) and 64-128 bytes (L2) (cache controller)
Main memory: GBytes, 80-200 ns, ~$100/GByte; transfer unit: pages, 4K-8K bytes (OS)
Disk: 10s of TBytes, 10 ms (10,000,000 ns), ~$1/GByte; transfer unit: files, MBytes (user/operator)
Tape: effectively infinite capacity, sec-min access time
Upper levels are faster; lower levels are larger.

3) Focus on the Common Case
Common sense guides computer design; since it is engineering, common sense is valuable. In making a design trade-off, favor the frequent case over the infrequent case.
E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first.
E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first.
The frequent case is often simpler and can be done faster than the infrequent case. E.g., overflow is rare when adding two numbers, so improve performance by optimizing the more common case of no overflow. This may slow down the overflow case, but overall performance improves by optimizing for the normal case.
What is the frequent case, and how much can performance improve by making that case faster? => Amdahl's Law

4) Amdahl's Law
Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Best you could ever hope to do:
Speedup_maximum = 1 / (1 - Fraction_enhanced)

Amdahl's Law example
The new CPU is 10X faster, but the server is I/O bound, so 60% of time is spent waiting for I/O.
Speedup_overall = 1 / ((1 - 0.4) + 0.4 / 10) = 1 / 0.64 = 1.56
Apparently it is human nature to be attracted by "10X faster" instead of keeping in perspective that it is just 1.6X faster.
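The law and the example can be checked in a few lines (a sketch; the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Slide's example: CPU 10x faster, but 60% of time is I/O wait,
# so only 40% of execution benefits: 1 / (0.6 + 0.04) = 1.5625.
overall = amdahl_speedup(0.4, 10)
```

Note that even with an infinitely fast CPU, `amdahl_speedup(0.4, float('inf'))` would cap at 1/0.6 ≈ 1.67, the Speedup_maximum bound from the previous slide.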

5) Processor Performance Equation
CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)
That is, CPU time is determined by instruction count, CPI, and clock rate. What affects each factor:
Program: instruction count
Compiler: instruction count, (CPI)
Instruction set: instruction count, CPI
Organization: CPI, clock rate
Technology: clock rate
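The equation translates directly into code (a hypothetical helper for experimenting with the three factors):

```python
def cpu_time_seconds(inst_count, cpi, clock_rate_hz):
    """CPU time = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)."""
    return inst_count * cpi * (1.0 / clock_rate_hz)

# 10^9 instructions at CPI 2.0 on a 1 GHz clock take 2 seconds;
# halving CPI (better organization) or doubling clock rate (better
# technology) each halve the CPU time.
```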

[Figure A.2, page A-8: the 5 steps of the MIPS datapath - Instruction Fetch, Instruction Decode/Register Fetch, Execute/Address Calculation, Memory Access, and Write Back - showing the PC and next-PC adder, instruction memory, register file (RS1, RS2, RD), sign-extended immediate, ALU muxes and zero test, data memory, and write-back mux.]

[Figure A.3, page A-9: the same datapath pipelined, with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers between the five stages.]

[Figure A.3, page A-9, again: the pipelined MIPS datapath.] Data stationary control: local decode for each instruction phase / pipeline stage.

[Figure A.2, page A-8: visualizing pipelining - instructions in program order vs. time in clock cycles, one stage per cycle across cycles 1-7.]

Pipelining is not quite that easy!
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away).
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock).
Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

[Figure A.4, page A-14: one memory port causes a structural hazard - a Load followed by Instr 1-4 over cycles 1-7, with two instructions needing memory in the same cycle.]

[Similar to Figure A.5, page A-15: the same sequence with a stall inserted - Load, Instr 1, Instr 2, stall (bubbles), Instr 3 - resolving the memory-port structural hazard.] How do you bubble the pipe?

Speedup Equation for Pipelining
CPI_pipelined = Ideal CPI + Average stall cycles per instruction
Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI) x (Cycle Time_unpipelined / Cycle Time_pipelined)
For a simple RISC pipeline, Ideal CPI = 1:
Speedup = Pipeline depth / (1 + Pipeline stall CPI) x (Cycle Time_unpipelined / Cycle Time_pipelined)

Example: Dual-port vs. Single-port
Machine A: dual-ported memory ("Harvard architecture").
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate.
Ideal CPI = 1 for both; loads are 40% of instructions executed.
SpeedUp_A = Pipeline depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline depth
SpeedUp_B = Pipeline depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline depth / 1.4) x 1.05 = 0.75 x Pipeline depth
SpeedUp_A / SpeedUp_B = Pipeline depth / (0.75 x Pipeline depth) = 1.33
Machine A is 1.33 times faster.
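The speedup equation and the machine A/B comparison can be reproduced with a short sketch (function and parameter names are mine):

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, clock_ratio=1.0):
    """Speedup = (Ideal CPI x depth) / (Ideal CPI + stall CPI) x clock_ratio,
    where clock_ratio = Cycle Time_unpipelined / Cycle Time_pipelined."""
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio

depth = 5                                    # example depth; the ratio below is depth-independent
speedup_a = pipeline_speedup(depth, 0.0)     # dual-ported memory: no structural stalls
speedup_b = pipeline_speedup(depth, 0.4, clock_ratio=1.05)  # 40% loads stall 1 cycle, 1.05x clock
ratio = speedup_a / speedup_b                # = 1 / 0.75 = 1.33
```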

[Figure A.6, page A-17: data hazard on r1 - add r1,r2,r3 followed by sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, each passing through the IF, ID/RF, EX, MEM, WB stages.]

Three Generic Data Hazards
Read After Write (RAW): instruction J tries to read an operand before instruction I writes it.
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature); this hazard results from an actual need for communication.

Three Generic Data Hazards
Write After Read (WAR): instruction J writes an operand before instruction I reads it.
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers; this results from reuse of the name r1.
Can't happen in the MIPS 5-stage pipeline because all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5.

Three Generic Data Hazards
Write After Write (WAW): instruction J writes an operand before instruction I writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "output dependence" by compiler writers; this also results from reuse of the name r1.
Can't happen in the MIPS 5-stage pipeline because all instructions take 5 stages and writes are always in stage 5.
We will see WAR and WAW in more complicated pipes.
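The three definitions can be checked mechanically. A toy classifier (hypothetical, not from the slides; each instruction is modeled as a destination register plus a set of source registers):

```python
def classify_hazards(instr_i, instr_j):
    """Hazards between earlier instruction I and later instruction J.

    Each instruction is a pair: (dest_register, set_of_source_registers)."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = []
    if dest_i in srcs_j:
        hazards.append("RAW")   # true dependence: J reads what I writes
    if dest_j in srcs_i:
        hazards.append("WAR")   # anti-dependence: J writes what I reads
    if dest_i == dest_j:
        hazards.append("WAW")   # output dependence: both write the same name
    return hazards

# add r1,r2,r3 then sub r4,r1,r3 -> RAW on r1
```

WAR and WAW come from name reuse, which is why register renaming (seen later in the course) can remove them while RAW remains.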

[Figure A.7, page A-19: forwarding to avoid the data hazard - the same add/sub/and/or/xor sequence, with ALU results forwarded to dependent instructions.]

[Figure A.23, page A-37: HW change for forwarding - muxes feed the ALU inputs from the ID/EX, EX/MEM, and MEM/WB pipeline registers.] What circuit detects and resolves this hazard?

[Figure A.8, page A-20: forwarding to avoid a LW-SW data hazard - add r1,r2,r3; lw r4, 0(r1); sw r4,12(r1); or r8,r6,r9; xor r10,r9,r11.]

[Figure A.9, page A-21: data hazard even with forwarding - lw r1, 0(r2) followed by sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9; the loaded value is not ready in time for the sub.]

[Similar to Figure A.10, page A-21: the load-use hazard resolved by inserting a bubble.] How is this detected?

Control Hazard on Branches: Three-Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
What do you do with the 3 instructions in between? How do you do it?

Branch Stall Impact
If CPI = 1, 30% of instructions are branches, and each stalls 3 cycles => new CPI = 1 + 0.3 x 3 = 1.9!
Two-part solution: determine whether the branch is taken sooner, AND compute the taken-branch address earlier.
A MIPS branch tests whether a register = 0 or ≠ 0.
MIPS solution: move the zero test to the ID/RF stage and add an adder to calculate the new PC in the ID/RF stage. Result: a 1-clock-cycle penalty for branches instead of 3.
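The CPI arithmetic on this slide is easy to script (a sketch; the names are mine):

```python
def cpi_with_branches(base_cpi, branch_fraction, branch_penalty):
    """Effective CPI when a fraction of instructions stall for branch_penalty cycles."""
    return base_cpi + branch_fraction * branch_penalty

naive = cpi_with_branches(1.0, 0.30, 3)     # 3-cycle stall -> CPI 1.9
improved = cpi_with_branches(1.0, 0.30, 1)  # zero test moved to ID/RF -> CPI 1.3
```

Moving the branch resolution one stage earlier cuts the branch contribution to CPI from 0.9 to 0.3 cycles per instruction.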

[Figure A.24, page A-38: pipelined MIPS datapath with the zero test and the branch-target adder moved into the instruction decode stage.] Interplay of instruction set design and cycle time.

Current Trends in Architecture
We cannot continue to leverage instruction-level parallelism (ILP) alone; single-processor performance improvement ended in 2003. New models for performance: data-level parallelism (DLP), thread-level parallelism (TLP), and request-level parallelism (RLP).

Parallelism
Classes of parallelism in applications: Data-Level Parallelism (DLP) and Task-Level Parallelism (TLP).
Classes of architectural parallelism: Instruction-Level Parallelism (ILP), vector architectures / Graphics Processing Units (GPUs), Thread-Level Parallelism, and Request-Level Parallelism.

Trends in Technology
Integrated circuit technology: transistor density +35%/year; die size +10-20%/year; overall integration +40-55%/year.
DRAM capacity: +25-40%/year (slowing).
Flash capacity: +50-60%/year; 15-20X cheaper per bit than DRAM.
Magnetic disk technology: +40%/year; 15-25X cheaper per bit than Flash, 300-500X cheaper per bit than DRAM.
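These are compound annual rates; a one-liner (assumption mine: the slide's percentages compound yearly) shows what they mean over a decade:

```python
def compound(rate_per_year, years):
    """Overall improvement factor from an annual growth rate."""
    return (1.0 + rate_per_year) ** years

# 35%/year transistor density compounds to roughly 20x over 10 years.
density_decade = compound(0.35, 10)
```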

Bandwidth and Latency
Bandwidth or throughput: the total work done in a given time. 10,000-25,000X improvement for processors; 300-1200X improvement for memory and disks.
Latency or response time: the time between start and completion of an event. 30-80X improvement for processors; 6-8X improvement for memory and disks.

Power and Energy
Problem: get power in, get power out.
Thermal Design Power (TDP) characterizes sustained power consumption. It is used as the target for the power supply and cooling system, and is lower than peak power but higher than average power consumption.
Clock rate can be reduced dynamically to limit power consumption; energy per task is often a better measurement.

Dynamic Energy and Power
Dynamic energy (per transistor switch from 0 -> 1 or 1 -> 0) = 1/2 x Capacitive load x Voltage^2
Dynamic power = 1/2 x Capacitive load x Voltage^2 x Frequency switched
Reducing the clock rate reduces power, not energy.
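The slide's two formulas, scripted (a sketch; the 1/2 activity factor follows the slide):

```python
def dynamic_energy(capacitive_load, voltage):
    """Energy per 0->1 or 1->0 transition: 1/2 x C x V^2 (joules)."""
    return 0.5 * capacitive_load * voltage ** 2

def dynamic_power(capacitive_load, voltage, frequency):
    """Dynamic power: 1/2 x C x V^2 x f (watts)."""
    return dynamic_energy(capacitive_load, voltage) * frequency

# Halving f halves power but not energy per task: the same number of
# transitions happen, so the task just runs twice as long at half the power.
```

This is why lowering voltage (which enters squared) matters far more than lowering frequency for energy efficiency.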

Power
The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W. Heat must be dissipated from a 1.5 x 1.5 cm chip, and this is about the limit of what can be cooled by air.

Reducing Power
Techniques for reducing power: do nothing well; dynamic voltage-frequency scaling (DVFS); low-power states for DRAM and disks; turning off cores.

Reading Assignment: Chapter 1, Appendix B