Lecture 1: Introduction. Dr. Eng. Amr T. Abdel-Hamid. Winter 2014. Computer Architecture. Textbook slides: Computer Architecture: A Quantitative Approach, 5th Edition, John L. Hennessy & David A. Patterson, with modifications.
CPU History in a Flash. Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip. RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip. A 125 mm² chip in 0.065 micron CMOS holds 2312 copies of RISC II + FPU + Icache + Dcache: RISC II shrinks to ~0.02 mm² at 65 nm. Caches via DRAM or 1-transistor SRAM. Is the processor the new transistor?
Processor Performance (figure: single-processor performance growth over time, the RISC era, and the subsequent move to multi-processors). Introduction.
Snapdragon 805 processor specs
Classes of Computers. Personal Mobile Device (PMD), e.g. smart phones and tablet computers: emphasis on energy efficiency and real-time performance. Desktop Computing: emphasis on price-performance. Servers: emphasis on availability, scalability, throughput. Clusters / Warehouse-Scale Computers, used for Software as a Service (SaaS): emphasis on availability and price-performance; sub-class Supercomputers: emphasis on floating-point performance and fast internal networks. Embedded Computers: emphasis on price.
Instruction Set Architecture: Critical Interface (the instruction set sits between software and hardware). Properties of a good abstraction: lasts through many generations (portability); used in many different ways (generality); provides convenient functionality to higher levels; permits an efficient implementation at lower levels.
ISA vs. Computer Architecture. Old definition of computer architecture = instruction set design; other aspects of computer design were called implementation, insinuating that implementation is uninteresting or less challenging. Our view is: computer architecture >> ISA. The architect's job is much more than instruction set design, and the technical hurdles today are more challenging than those in instruction set design.
Computer Architecture is Design and Analysis. Architecture is an iterative process: searching the space of possible designs, at all levels of computer systems, guided by cost/performance analysis (figure: design/analysis loop, with creativity feeding designs and analysis filtering bad, mediocre, and good ideas).
Administrivia. Instructor: Dr. Amr T. Abdel-Hamid. Office: C3-320. Email: amr.talaat@guc.edu.eg. Office Hours: Monday, 3rd & 4th Lectures. T.A.: ???
Course Grading. Exams and quizzes: 3 quizzes (best 2 count) 10%; Mid Term 30%; Final exam 40%; Project 20%. Assignments: not graded.
Project. Phase 0: select your partner (17/9/2014): submit the list of your group members (2-4 per group) and submit a comparison between RISC and CISC processors. Phase 1: .... Phase N: project implementation + report (2 weeks before finals). FINAL non-negotiable deadline.
In Time & Too LATE Policy. In phases 0 & 1: 5% of project grade penalty per day for being late. In phases 2 to N: no late presentation is possible. Honor code: 100% penalty for both copier and copy-giver of any report/code.
Quantitative Principles of Design: 1. Take Advantage of Parallelism; 2. Principle of Locality; 3. Focus on the Common Case; 4. Amdahl's Law; 5. The Processor Performance Equation.
1) Taking Advantage of Parallelism. Increasing throughput of server computers via multiple processors or multiple disks. Detailed HW design (DSD course shortly): carry-lookahead adders use parallelism to speed up computing sums from linear to logarithmic in the number of bits per operand; multiple memory banks are searched in parallel in set-associative caches. Pipelining: overlap instruction execution to reduce the total time to complete an instruction sequence. Not every instruction depends on its immediate predecessor, so executing instructions completely/partially in parallel is possible. Classic 5-stage pipeline: 1) Instruction Fetch (IF), 2) Register Read (ID), 3) Execute (EX), 4) Data Memory Access (MEM), 5) Register Write (WB).
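The payoff of overlapping execution can be sketched with a back-of-the-envelope cycle count (an illustrative model, not from the slides; the function names are mine):

```python
def pipelined_cycles(n_instructions, stages=5):
    # Ideal k-stage pipeline with no hazards: k cycles to fill,
    # then one instruction completes every cycle.
    return stages + (n_instructions - 1)

def unpipelined_cycles(n_instructions, stages=5):
    # Without overlap, each instruction occupies all k stages alone.
    return stages * n_instructions
```

For 100 instructions on a 5-stage pipeline this gives 104 cycles versus 500, roughly the depth-fold speedup that hazards will later erode.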
2) The Principle of Locality: programs access a relatively small portion of the address space at any instant of time. Two different types of locality: Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse). Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access). For the last 30 years, HW has relied on locality for memory performance. (figure: processor - cache ($) - memory)
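A toy cache model (my own sketch: a small fully-associative LRU cache with invented block size and capacity) makes spatial locality concrete:

```python
from collections import OrderedDict

def cache_hits(addresses, block_size=8, num_blocks=64):
    # Tiny fully-associative cache with LRU replacement:
    # track which blocks are resident and count hits.
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # refresh LRU position
        else:
            if len(cache) >= num_blocks:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = None
    return hits

sequential = list(range(1024))            # array scan: strong spatial locality
strided = [i * 64 for i in range(1024)]   # one touch per block: no locality
```

The sequential scan hits on 7 of every 8 accesses (896 hits out of 1024), while the strided pattern never hits at all.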
Levels of the Memory Hierarchy (capacity / access time / cost, with transfer unit and who manages staging):
CPU Registers: 100s of bytes, 300-500 ps (0.3-0.5 ns); unit: instruction operands, 1-8 bytes (program/compiler)
L1 and L2 Cache: 10s-100s of KBytes, ~1 ns - ~10 ns, $1000s/GByte; unit: blocks, 32-64 bytes (L1) and 64-128 bytes (L2) (cache controller)
Main Memory: GBytes, 80-200 ns, ~$100/GByte; unit: pages, 4K-8K bytes (OS)
Disk: 10s of TBytes, 10 ms (10,000,000 ns), ~$1/GByte; unit: files, MBytes (user/operator)
Tape: infinite capacity, sec-min access, ~$1/GByte
Upper levels are smaller and faster; lower levels are larger and cheaper.
3) Focus on the Common Case. Common sense guides computer design: since it's engineering, common sense is valuable. In making a design trade-off, favor the frequent case over the infrequent case. E.g., the instruction fetch and decode unit is used more frequently than the multiplier, so optimize it first. E.g., if a database server has 50 disks per processor, storage dependability dominates system dependability, so optimize it first. The frequent case is often simpler and can be done faster than the infrequent case. E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing the more common case of no overflow; this may slow down overflow, but overall performance improves by optimizing for the normal case. What is the frequent case, and how much can performance improve by making that case faster? => Amdahl's Law.
4) Amdahl's Law
ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Best you could ever hope to do: Speedup_maximum = 1 / (1 - Fraction_enhanced)
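Amdahl's Law is a one-line helper in code (a minimal sketch; the function name is mine):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup when `fraction_enhanced` of the old execution
    # time is accelerated by a factor of `speedup_enhanced`.
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)
```

With 40% of the time enhanced 10x, amdahl_speedup(0.4, 10) gives 1.5625; even an infinitely fast enhancement is capped at 1 / (1 - 0.4), about 1.67.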
Amdahl's Law example: new CPU is 10X faster, but the server is I/O bound, spending 60% of its time waiting for I/O (so Fraction_enhanced = 0.4).
Speedup_overall = 1 / [(1 - 0.4) + 0.4 / 10] = 1 / 0.64 = 1.56
Apparently it's human nature to be attracted by "10X faster", vs. keeping in perspective that it's just 1.6X faster.
5) Processor performance equation
CPU time = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle) = Inst Count x CPI x Cycle time
What each factor depends on:
Program: Inst Count
Compiler: Inst Count, (CPI)
Inst. Set: Inst Count, CPI
Organization: CPI, Clock Rate
Technology: Clock Rate
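The equation translates directly into code (a trivial sketch; names are mine):

```python
def cpu_time_seconds(inst_count, cpi, clock_rate_hz):
    # CPU time = instructions x cycles/instruction x seconds/cycle,
    # where seconds/cycle = 1 / clock rate.
    return inst_count * cpi / clock_rate_hz
```

A billion instructions at CPI 2 on a 1 GHz clock take 2 seconds; halving CPI or doubling clock rate each halve the time.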
5 Steps of MIPS Datapath (Figure A.2, Page A-8): figure of the unpipelined datapath: Instruction Fetch (Next PC, adder, instruction memory), Instruction Decode / Register Fetch (register file read ports RS1/RS2, sign-extended immediate), Execute / Address Calculation (ALU, Zero? test, MUXes), Memory Access (data memory, LMD), Write Back (write-back data MUX to register RD).
5 Steps of MIPS Datapath (Figure A.3, Page A-9): figure of the same datapath with pipeline latches IF/ID, ID/EX, EX/MEM, and MEM/WB separating the five stages; the destination register RD and write-back data travel forward through the latches.
5 Steps of MIPS Datapath (Figure A.3, Page A-9): data-stationary control: local decode for each instruction phase / pipeline stage, with control signals carried alongside the instruction through the IF/ID, ID/EX, EX/MEM, and MEM/WB latches.
Visualizing Pipelining (Figure A.2, Page A-8): figure showing instructions in program order (vertical axis) against time in clock cycles (Cycle 1 through Cycle 7); each instruction advances one stage per cycle, so up to five instructions overlap.
Pipelining is not quite that easy! Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle. Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away). Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock). Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
One Memory Port / Structural Hazards (Figure A.4, Page A-14): figure showing Load followed by Instr 1-4 over Cycles 1-7; with a single memory port, Load's MEM stage and Instr 3's IF stage need the memory in the same cycle.
One Memory Port / Structural Hazards (similar to Figure A.5, Page A-15): figure showing the same sequence with Instr 3 stalled one cycle; the bubble propagates through the pipeline behind it. How do you bubble the pipe?
Speed Up Equation for Pipelining
CPI_pipelined = Ideal CPI + Average stall cycles per instruction
Speedup = [Ideal CPI x Pipeline depth / (Ideal CPI + Pipeline stall CPI)] x (Cycle Time_unpipelined / Cycle Time_pipelined)
For simple RISC pipeline, Ideal CPI = 1:
Speedup = [Pipeline depth / (1 + Pipeline stall CPI)] x (Cycle Time_unpipelined / Cycle Time_pipelined)
Example: Dual-port vs. Single-port. Machine A: dual-ported memory ("Harvard Architecture"). Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate. Ideal CPI = 1 for both; loads are 40% of instructions executed.
SpeedUp_A = Pipeline Depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline Depth
SpeedUp_B = Pipeline Depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline Depth / 1.4) x 1.05 = 0.75 x Pipeline Depth
SpeedUp_A / SpeedUp_B = Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
Machine A is 1.33 times faster.
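The comparison can be checked numerically with the pipelining speedup equation (a sketch; the helper name and the pipeline depth of 5 are my choices, and the final ratio is depth-independent anyway):

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    # Speedup = ideal CPI x depth / (ideal CPI + stall CPI),
    # scaled by any clock-rate advantage of this implementation.
    return (ideal_cpi * depth / (ideal_cpi + stall_cpi)) * clock_ratio

depth = 5
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)                    # dual-ported memory
speedup_b = pipeline_speedup(depth, stall_cpi=0.4, clock_ratio=1.05)  # single port, faster clock
```

speedup_a / speedup_b comes out to 4/3, about 1.33, matching the slide.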
Data Hazard on R1 (Figure A.6, Page A-17): figure showing add r1,r2,r3 followed by sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, xor r10,r1,r11 through stages IF, ID/RF, EX, MEM, WB; the add writes r1 in WB, after the next three instructions have already read it in ID/RF.
Three Generic Data Hazards. Read After Write (RAW): Instr J tries to read an operand before Instr I writes it. I: add r1,r2,r3; J: sub r4,r1,r3. Caused by a "dependence" (in compiler nomenclature); this hazard results from an actual need for communication.
Three Generic Data Hazards. Write After Read (WAR): Instr J writes an operand before Instr I reads it. I: sub r4,r1,r3; J: add r1,r2,r3; K: mul r6,r1,r7. Called an "anti-dependence" by compiler writers; this results from reuse of the name r1. Can't happen in the MIPS 5-stage pipeline because: all instructions take 5 stages, reads are always in stage 2, and writes are always in stage 5.
Three Generic Data Hazards. Write After Write (WAW): Instr J writes an operand before Instr I writes it. I: sub r1,r4,r3; J: add r1,r2,r3; K: mul r6,r1,r7. Called an "output dependence" by compiler writers; this also results from the reuse of the name r1. Can't happen in the MIPS 5-stage pipeline because: all instructions take 5 stages, and writes are always in stage 5. We will see WAR and WAW in more complicated pipes.
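The three definitions can be captured in a small classifier (a toy model of my own; instructions are represented as a destination register plus a list of source registers):

```python
def classify_hazards(instr_i, instr_j):
    # instr_i is earlier in program order than instr_j; each is
    # (dest_reg, [source_regs]). Returns the hazard classes that could
    # arise if the pipeline reordered their register accesses.
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = []
    if dest_i in srcs_j:
        hazards.append("RAW")  # J reads what I writes (true dependence)
    if dest_j in srcs_i:
        hazards.append("WAR")  # J writes what I reads (anti-dependence)
    if dest_i == dest_j:
        hazards.append("WAW")  # both write same register (output dependence)
    return hazards
```

For example, classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])) reports ["RAW"], matching the add/sub pair above.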
Forwarding to Avoid Data Hazard (Figure A.7, Page A-19): figure showing add r1,r2,r3 followed by sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, xor r10,r1,r11; the ALU result is forwarded from the EX/MEM and MEM/WB latches back to the ALU inputs, so no stall is needed.
HW Change for Forwarding (Figure A.23, Page A-37): figure showing MUXes added at the ALU inputs to select among the register file (via ID/EX), the EX/MEM latch, and the MEM/WB latch; the NextPC and immediate paths are unchanged. What circuit detects and resolves this hazard?
Forwarding to Avoid LW-SW Data Hazard (Figure A.8, Page A-20): figure showing add r1,r2,r3; lw r4, 0(r1); sw r4,12(r1); or r8,r6,r9; xor r10,r9,r11; the loaded value is forwarded from MEM/WB to the store's MEM stage, so no stall is needed.
Data Hazard Even with Forwarding (Figure A.9, Page A-21): figure showing lw r1, 0(r2) followed by sub r4,r1,r6, and r6,r1,r7, or r8,r1,r9; the load produces r1 only at the end of MEM, too late to forward to the sub's EX stage in the very next cycle.
Data Hazard Even with Forwarding (similar to Figure A.10, Page A-21): figure showing the same sequence with a one-cycle bubble inserted before sub r4,r1,r6; the and and or slip one cycle as well. How is this detected?
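The detection amounts to a load interlock check in the decode stage (a sketch of the condition, not the actual hardware; the names are mine):

```python
def needs_load_use_stall(prev_is_load, prev_dest, cur_sources):
    # Stall one cycle when the instruction ahead is a load whose
    # destination register is a source of the current instruction:
    # the loaded value only exists at the end of MEM, too late to
    # forward into the current instruction's EX stage.
    return prev_is_load and prev_dest in cur_sources
```

For lw r1,0(r2) followed by sub r4,r1,r6 the check fires; for add r1,r2,r3 followed by the same sub, forwarding suffices and no stall is needed.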
Control Hazard on Branches Three Stage Stall 10: beq r1,r3,36 14: and r2,r3,r5 18: or r6,r1,r7 22: add r8,r1,r9 36: xor r10,r1,r11 What do you do with the 3 instructions in between? How do you do it?
Branch Stall Impact. If CPI = 1 and 30% of instructions are branches that stall 3 cycles, the new CPI = 1.9! Two-part solution: determine whether the branch is taken or not sooner, AND compute the taken-branch address earlier. MIPS branches test whether a register = 0 or != 0. MIPS solution: move the Zero test to the ID/RF stage and add an adder to calculate the new PC in the ID/RF stage: 1 clock cycle penalty for branch versus 3.
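The CPI arithmetic above is a one-liner (a sketch; names are mine):

```python
def cpi_with_branch_stalls(base_cpi, branch_frequency, branch_penalty):
    # Each branch adds `branch_penalty` stall cycles on average.
    return base_cpi + branch_frequency * branch_penalty
```

With 30% branches and a 3-cycle penalty the CPI climbs from 1 to 1.9; cutting the penalty to 1 cycle (Zero? test and target adder in ID/RF) brings it back to 1.3.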
Pipelined MIPS Datapath (Figure A.24, Page A-38): figure showing the branch Zero? test and a dedicated adder for the branch target moved into the Instr. Decode / Register Fetch stage, so the next PC is selected one cycle after fetch. Interplay of instruction set design and cycle time.
Current Trends in Architecture Cannot continue to leverage Instruction-Level parallelism (ILP) Single processor performance improvement ended in 2003 New models for performance: Data-level parallelism (DLP) Thread-level parallelism (TLP) Request-level parallelism (RLP)
Parallelism. Classes of parallelism in applications: Data-Level Parallelism (DLP), Task-Level Parallelism (TLP). Classes of architectural parallelism: Instruction-Level Parallelism (ILP), vector architectures / Graphics Processing Units (GPUs), Thread-Level Parallelism, Request-Level Parallelism.
Trends in Technology. Integrated circuit technology: transistor density 35%/year; die size 10-20%/year; overall integration 40-55%/year. DRAM capacity: 25-40%/year (slowing). Flash capacity: 50-60%/year, 15-20X cheaper per bit than DRAM. Magnetic disk technology: 40%/year, 15-25X cheaper per bit than Flash, 300-500X cheaper per bit than DRAM.
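These annual rates compound; a small helper (my own) converts a %/year figure into a multiplier over a period:

```python
def growth_after(years, annual_rate):
    # Compound improvement: annual_rate=0.35 means 35%/year.
    return (1.0 + annual_rate) ** years
```

Transistor density at 35%/year gives roughly a 20x increase over a decade, while DRAM capacity at 25%/year gives only about 9x over the same span.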
Bandwidth and Latency Bandwidth or throughput Total work done in a given time 10,000-25,000X improvement for processors 300-1200X improvement for memory and disks Latency or response time Time between start and completion of an event 30-80X improvement for processors 6-8X improvement for memory and disks
Power and Energy. Problem: get power in, get power out. Thermal Design Power (TDP): characterizes sustained power consumption; used as a target for the power supply and cooling system; lower than peak power, higher than average power consumption. Clock rate can be reduced dynamically to limit power consumption. Energy per task is often a better measurement.
Dynamic Energy and Power. Dynamic energy (per transistor switch from 0 -> 1 or 1 -> 0) = 1/2 x Capacitive load x Voltage^2. Dynamic power = 1/2 x Capacitive load x Voltage^2 x Frequency switched. Reducing clock rate reduces power, not energy.
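The two formulas in code (a direct transcription; the units and example values are my own):

```python
def dynamic_energy_joules(capacitance_f, voltage_v):
    # Energy per 0->1 or 1->0 transition: 1/2 x C x V^2.
    return 0.5 * capacitance_f * voltage_v ** 2

def dynamic_power_watts(capacitance_f, voltage_v, switching_freq_hz):
    # Power = energy per switch x switching frequency.
    return dynamic_energy_joules(capacitance_f, voltage_v) * switching_freq_hz
```

Halving the voltage cuts energy per switch to a quarter, which is why voltage scaling dominates power savings; halving only the frequency halves power but leaves energy per task unchanged.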
Power. The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W. That heat must be dissipated from a 1.5 x 1.5 cm chip, which is about the limit of what can be cooled by air.
Reducing Power. Techniques for reducing power: do nothing well; Dynamic Voltage-Frequency Scaling (DVFS); low-power states for DRAM and disks; turning off cores.
Reading Assignment: Chapter 1, Appendix B