CSE 141 Computer Architecture Summer Session Lecture 2 Performance, ALU. Pramod V. Argade

Similar documents
Measure, Report, and Summarize Make intelligent choices See through the marketing hype Key to understanding effects of underlying architecture

Computer Performance Evaluation: Cycles Per Instruction (CPI)

CSE 141 Computer Architecture Summer Session I, Lecture 3 Performance and Single Cycle CPU Part 1. Pramod V. Argade

Computer Performance. Reread Chapter Quiz on Friday. Study Session Wed Night FB 009, 5pm-6:30pm

Chapter 3 Arithmetic for Computers

The Von Neumann Computer Model

Computer Architecture Set Four. Arithmetic

Number Systems and Computer Arithmetic

CSE 141 Computer Architecture Summer Session Lecture 3 ALU Part 2 Single Cycle CPU Part 1. Pramod V. Argade

IC220 Slide Set #5B: Performance (Chapter 1: 1.6, )

Defining Performance. Performance. Which airplane has the best performance? Boeing 777. Boeing 777. Boeing 747. Boeing 747

CPE 335 Computer Organization. MIPS Arithmetic Part I. Content from Chapter 3 and Appendix B

Overview of Today s Lecture: Cost & Price, Performance { 1+ Administrative Matters Finish Lecture1 Cost and Price Add/Drop - See me after class

Part I: Translating & Starting a Program: Compiler, Linker, Assembler, Loader. Lecture 4

Defining Performance. Performance 1. Which airplane has the best performance? Computer Organization II Ribbens & McQuain.

The character of the instruction scheduling problem

Chapter 3. ALU Design

CSE 141 Computer Architecture Spring Lecture 3 Instruction Set Architecute. Course Schedule. Announcements

Quantifying Performance EEC 170 Fall 2005 Chapter 4

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

CO Computer Architecture and Programming Languages CAPL. Lecture 15

Performance, Cost and Amdahl s s Law. Arquitectura de Computadoras

Chapter 3 Arithmetic for Computers. ELEC 5200/ From P-H slides

ECE468 Computer Organization & Architecture. The Design Process & ALU Design

The Role of Performance

CPE300: Digital System Architecture and Design

Introduction to Pipelined Datapath

4. What is the average CPI of a 1.4 GHz machine that executes 12.5 million instructions in 12 seconds?

Lecture - 4. Measurement. Dr. Soner Onder CS 4431 Michigan Technological University 9/29/2009 1

CSE140: Components and Design Techniques for Digital Systems

CMSC 611: Advanced Computer Architecture

Lecture 2: Computer Performance. Assist.Prof.Dr. Gürhan Küçük Advanced Computer Architectures CSE 533

ECE369: Fundamentals of Computer Architecture

The bottom line: Performance. Measuring and Discussing Computer System Performance. Our definition of Performance. How to measure Execution Time?

CS Computer Architecture. 1. Explain Carry Look Ahead adders in detail

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 1. Computer Abstractions and Technology

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 1. Computer Abstractions and Technology

Lecture 3: Evaluating Computer Architectures. How to design something:

COSC3330 Computer Architecture Lecture 7. Datapath and Performance

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance evaluation. Performance evaluation. CS/COE0447: Computer Organization. It s an everyday process

Lets Build a Processor

Which is the best? Measuring & Improving Performance (if planes were computers...) An architecture example

Course web site: teaching/courses/car. Piazza discussion forum:

1.6 Computer Performance

CS 110 Computer Architecture

Tailoring the 32-Bit ALU to MIPS

CS61C Performance. Lecture 23. April 21, 1999 Dave Patterson (http.cs.berkeley.edu/~patterson)

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation

Week 7: Assignment Solutions

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. The Processor

ECE C61 Computer Architecture Lecture 2 performance. Prof. Alok N. Choudhary.

Lecture 7: Instruction Set Architectures - IV

Chapter 1. Computer Abstractions and Technology. Lesson 3: Understanding Performance

CpE 442 Introduction to Computer Architecture. The Role of Performance

Arithmetic for Computers

Microcomputers. Outline. Number Systems and Digital Logic Review

Review. Steps to writing (stateless) circuits: Create a logic function (one per output)

ECE 341 Midterm Exam

Lecture Topics. Announcements. Today: Single-Cycle Processors (P&H ) Next: continued. Milestone #3 (due 2/9) Milestone #4 (due 2/23)

Improving Performance: Pipelining

Performance of computer systems

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

CS/COE0447: Computer Organization

CS/COE0447: Computer Organization

Lecture 5. Other Adder Issues

We are quite familiar with adding two numbers in decimal

LECTURE 4. Logic Design

TDT4255 Computer Design. Lecture 1. Magnus Jahre

MIPS Integer ALU Requirements

Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA

M1 Computers and Data

Write only as much as necessary. Be brief!

CS61C : Machine Structures

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 3, 2015

Processor (I) - datapath & control. Hwansoo Han

Slide Set 5. for ENCM 369 Winter 2014 Lecture Section 01. Steve Norman, PhD, PEng

Instructor Information

Outline. EEL-4713 Computer Architecture Multipliers and shifters. Deriving requirements of ALU. MIPS arithmetic instructions

CS/COE1541: Introduction to Computer Architecture

GRE Architecture Session

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 2, 2016

Administrivia. CMSC 411 Computer Systems Architecture Lecture 14 Instruction Level Parallelism (cont.) Control Dependencies

Arithmetic Circuits. Nurul Hazlina Adder 2. Multiplier 3. Arithmetic Logic Unit (ALU) 4. HDL for Arithmetic Circuit

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

Slide Set 5. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng

Register Transfer Level (RTL) Design

Chapter 1. Computer Abstractions and Technology. Lesson 2: Understanding Performance

Designing for Performance. Patrick Happ Raul Feitosa

Systems Architecture

Basic Arithmetic (adding and subtracting)

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

Chapter 4. The Processor

ECE331: Hardware Organization and Design

Pipelined CPUs. Study Chapter 4 of Text. Where are the registers?

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Transcription:

CSE 141 Computer Architecture Summer Session 1 24 Lecture 2 Performance, ALU Pramod V. Argade

CSE141: Introduction to Computer Architecture Instructor: TA: Pramod V. Argade (p2argade@cs.ucsd.edu) Office Hours: Tue. 7:3-8:3 PM (AP&M 4141) Wed. 4:3-5:3 PM (AP&M 4141) Anjum Gupta (a3gupta@cs.ucsd.edu) Office Hour: Mon/Wed 12-1 PM Chengmo Yang (c5yang@cs.ucsd.edu) Office Hour: Mon 2-3 PM, Thu 3:3 4:3 PM Lecture: Mon/Wed. 6-8:5 PM, Center 19 Homework: Web-page: Computer Organization & Design The Hardware Software Interface, 2 nd Edition. Authors: Patterson and Hennessy http://www.cse.ucsd.edu/classes/su4/cse141 2

Two sections for 141: Announcements Friday 11-11:5 PM Center 15 Friday 2-2:5 PM CSB 4 Reading Assignment Chapter 2: Performance Chapter 4: Arithmetic for Computers, Sec. 4.6-4.11 Homework 2: Due Wed., July 7 in class 2.1 through 2.5, 2.18 through 2.21 4.9, 4.12, 4.14, 4.22, 4.23, 4.44 Quiz When: Wed., July 7, First 1 minutes of the class Topic: Performance, Chapter 2 Need: Paper, pen, calculator CSE 141 Lab Assignment has been posted on the web page http://www.cse.ucsd.edu/classes/su4/cse141l 3

CSE141 Course Schedule Lecture # Date Time Room Topic Quiz topic 1 Mon. 6/28 6-8:5 PM Center 19 2 Wed. 6/3 6-8:5 PM Center 19 Introduction, Ch. 1 ISA, Ch. 3 Performance, Ch. 2 Arithmetic, Ch. 4 Homework Due - - ISA Ch. 3 - Mon. 7/5 No Class July 4th Holiday - - 3 Wed. 7/7 6-8:5 PM Center 19 4 Mon. 7/12 6-8:5 PM Center 19 5 Tue. 7/13 7:3-8:5 PM Center 19 6 Wed. 7/14 6-8:5 PM Center 19 7 Mon. 7/19 6-8:5 PM Center 19 8 Tue. 7/2 7:3-8:5 PM Center 19 Arithmetic, Ch. 4 Cont. Single-cycle CPU Ch. 5 Single-cycle CPU Ch. 5 Cont. Multi-cycle CPU Ch. 5 Multi-cycle CPU Ch. 5 Cont. (July 5th make up class) Single and Multicycle CPU Examples and Review for Midterm Mid-term Exam Exceptions Pipelining Ch. 6 (July 5th make up class) Performance Ch. 2 #1 #2 Arithmetic, Ch. 4 #3 - - Single-cycle CPU Ch. 5 - #4 - - 9 Wed. 7/21 6-8:5 PM Center 19 Hazards, Ch. 6 - - 1 Mon. 7/26 6-8:5 PM Center 19 Memory Hierarchy & Caches Ch. 7 11 Wed. 7/28 6-8:5 PM Center 19 Virtual Memory, Ch. 7 Course Review Hazards Ch. 6 Cache Ch. 7 12 Sat. 7/31 7-1 PM Center 19 Final Exam - - - #5 #6 4

Performance Assessment Plane DC to Paris time Speed Passengers Throughput (p x mph) Boeing 747 6.5 hours 61 mph 47 286,7 Concorde 3 hours 135 mph 132 178,2 Example definitions of performance Passengers * MPH Cruising speed Passenger capacity (Passenger * MPH)/$ need more data! Clearly formulate your definition of performance 5

Why worry about performance? Learn to measure, report, and summarize Make intelligent choices See through the marketing hype As a designer or purchaser, assess: which system has best performance? which system has lowest cost? which system has highest performance/cost? Formulate metrics for performance measurement How to report relative performance? Understand impact of architectural choices on performance 6

Computer Performance: TIME, TIME, TIME Response Time (latency) How long does it take for my job to run? How long does it take to execute a job? How long must I wait for the database query? Throughput How many jobs can the machine run at once? What is the average execution rate? How much work is getting done? If we upgrade a machine with a new processor what do we decrease? If we add a new machine to the lab what do we increase? 7

% time program... program results... 9.7u 12.9s 2:39 65% % Execution Time Elapsed Time (2:39) Counts everything (disk and memory accesses, I/O, etc.) A useful number, but often not good for comparison purposes Total CPU time (13.6 s) Doesn't count I/O or time spent running other programs Comprised of system time (12.9 s), and user time (9.7 s) Our focus: user CPU time Time spent executing the lines of code that are "in" our program 8

A Definition of Performance For some program running on machine X, Performance X = 1 / Execution time X Machine X is n times faster than Y Problem: n = Performance X / Performance Y = (Execution time Y ) / (Execution time X ) Execution time: Machine A: 12 seconds, B: 2 seconds A/B =.6, so A is 4% faster, or 1.4X faster, or B is 4% slower B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower Need a precise definition X is n times faster than Y, n > 1 A is 1.67 times faster than B 9

Clock Cycles Operation of conventional computer is controlled by Clock ticks Clock rate (frequency) = cycles per second (Hertz) MHz = Million cycles per second GHz = Giga (Billion) cycles per second Cycle time = time between ticks = seconds per cycle A 2 GHz clock => Cycle time = 1/(2x1 9 ) s =.5 nanoseconds Instead of reporting execution time in seconds, we often use clock cycles seconds program = cycles program seconds cycle time 1

How many cycles are required for a program? Assumption: # of cycles = # of instructions 1st instruction 2nd instruction 3rd instruction 4th 5th 6th... time Assumption is incorrect Typically, different instructions take different number of clock cycles Integer Multiply instructions are multi-cycle Floating point instructions are multi-cycle 1st 2nd 3rd 4th 5th time 11

Now that we understand cycles A given program will require some number of instructions (machine instructions) some number of cycles some number of seconds We have a vocabulary that relates these quantities: cycle time (seconds per cycle) clock rate (cycles per second) CPI (cycles per instruction) a floating point intensive application might have a higher CPI 12

CPI Example 2.1 Suppose we have two implementations of the same instruction set architecture (ISA). For some program, Machine A has a clock cycle time of 1 ns. and a CPI of 2. Machine B has a clock cycle time of 2 ns. and a CPI of 1.2 What machine is faster for this program, and by how much? 13

Compiler Optimization: Example 2.2 A compiler designer is trying to decide between two code sequences for a particular machine with 3 instruction classes. CPI: Class A = 1, Class B = 2, and Class C = 3 Code sequence X has 5 instructions: 2 of A, 1 of B, and 2 of C Code sequence Y has 6 instructions: 4 of A, 1 of B, and 1 of C. Which sequence will be faster? How much? 14

Execution time: Contributing Factors CPU CPU time time = Seconds = Instructions x Cycles Cycles x Seconds Program Program Instruction Cycle Cycle Instruction Count CPI Program X Compiler X (X) ISA X X Clock Rate Organization X X Technology X Improve performance => reduce execution time Reduce instruction count (Programmer, Compiler) Reduce cycles per instruction (ISA, Machine designer) Reduce clock cycle time (Hardware designer, Process engineer) 15

Performance Performance is determined by program execution time Do any of the other variables equal performance? # of cycles to execute program? # of instructions in program? # of cycles per second? average # of cycles per instruction? average # of instructions per second? Common pitfall: thinking one of the variables is indicative of performance when it really isn t. 16

Performance Variation CPU Execution Time Instruction CPI = Count X X Clock Cycle Time Number of instructions CPI Clock Cycle Time Same machine different programs same programs, different machines, same ISA different similar same same different different Same programs, different machines somewhat different different different 17

Amdahl s Law The impact of a performance improvement is limited by the percent of execution time affected by the improvement Execution time after improvement = Execution Time Affected Amount of Improvement + Execution Time Unaffected Make the common case fast! Amdahl s law sets limit on how much improvement can be made 18

Amdahl s Law: Example 2.3 Example: A program runs in 1 seconds on a machine, with multiply responsible for 8 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster? Is it possible to get the program to run 5 times faster? 19

Metrics of Performance Application Programming Language Answers per month Useful Operations per second Compiler ISA Datapath Control Function Units TransistorsWiresPins (millions) of Instructions per second MIPS (millions) of (F.P.) operations per second MFLOP/s Megabytes per second Cycles per second (clock rate) Each metric has a place and a purpose, and each can be misused 2

Benchmarks Performance best determined by running a real application Use programs typical of expected workload Or, typical of expected class of applications e.g., compilers/editors, scientific applications, graphics, etc. Small benchmarks nice for architects and designers easy to standardize can be abused SPEC (System Performance Evaluation Cooperative) companies have agreed on a set of real program and inputs can still be abused (Intel s other bug) valuable indicator of performance (and compiler technology) 21

SPEC 89 Compiler enhancements and performance 8 7 6 SPEC performance ratio 5 4 3 2 1 gcc espresso spice doduc nasa7 li eqntott matrix3 fpppp tomcatv Benchmark Compiler Enhanced compiler 22

SPEC 95 Benchmark Description go Artificial intelligence; plays the game of Go m88ksim Motorola 88k chip simulator; runs test program gcc The Gnu C compiler generating SPARC code compress Compresses and decompresses file in memory li Lisp interpreter ijpeg Graphic compression and decompression perl Manipulates strings and prime numbers in the special-purpose programming language Perl vortex A database program tomcatv A mesh generation program swim Shallow water model with 513 x 513 grid su2cor quantum physics; Monte Carlo simulation hydro2d Astrophysics; Hydrodynamic Naiver Stokes equations mgrid Multigrid solver in 3-D potential field applu Parabolic/elliptic partial differential equations trub3d Simulates isotropic, homogeneous turbulence in a cube apsi Solves problems regarding temperature, wind velocity, and distribution of pollutant fpppp Quantum chemistry wave5 Plasma physics; electromagnetic particle simulation 23

SPEC 95 Does doubling the clock rate double the performance? Can a machine with a slower clock rate have better performance? 1 1 9 9 8 8 7 7 SPECint 6 5 4 SPECfp 6 5 4 3 3 2 2 1 1 5 1 15 Clock rate (MHz) 2 25 Pentium 5 1 Clock rate (MHz) 15 2 25 Pentium Pentium Pro Pentium Pro 24

MIPS, MFLOPS etc. MIPS - million instructions per second number of instructions executed in program Clock rate = = execution time in seconds * 1 6 CPI * 1 6 MFLOPS - million floating point operations per second = number of floating point operations executed in program execution time in seconds * 1 6 Hard to relate MIPS/MFLOPS with execution time CPU Execution Time Instruction CPI = Count X X Highly program-dependent metrics Deceptive (See pages 78-79 in the text) Clock Cycle Time MIPS would be higher for a program using simple instructions (e.g. a loop of NOPs!) 25

RISC Processor: Example 2.4 Base Machine (Reg / Reg) Op Freq Cycles CPI(i) % Time ALU 5% 1 Load 2% 5 Store 1% 3 Branch 2% 2 What is average CPI? What percentage of time is spent in each instruction class? How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? What if two ALU instructions could be executed at once? 26

Performance assessment is tricky Performance is specific to particular program(s) Total execution time is a consistent summary of performance For a given architecture performance increases come from: increases in clock rate (without adverse CPI affects) improvements in processor organization that lower CPI compiler enhancements that lower CPI and/or instruction count Pitfall: expecting improvement in one aspect of a machine s performance to affect the total performance You should not always believe everything you read! Read carefully! 27

Two sections for 141: Announcements Friday 11-11:5 PM Center 19 Friday 2-2:5 PM CSB 15 Reading Assignment Chapter 4: Arithmetic for Computers, Sec. 4.6-4.11 Chapter 5: The Processor: Datapath and Control, Sec. Homework 2: Due Wed., July 7 in class 2.1 through 2.5, 2.18 through 2.21 4.9, 4.12, 4.14, 4.22, 4.23, 4.44 Quiz When: Wed., July 7, First 1 minutes of the class Topic: Performance, Chapter 2 Need: Paper, pen, calculator CSE 141 Lab Assignment has been posted on the web page http://www.cse.ucsd.edu/classes/su4/cse141l 28

Computer Arithmetic and ALU Design

Bits everywhere What do they mean? bits (11111111...1) instruction R-format I-format... integer data number text chars... floating point signed unsigned single precision double precision............ 3

Questions About Numbers How do you represent negative numbers? fractions? really large numbers? really small numbers? How do you do arithmetic? identify errors (e.g. overflow)? What is an ALU and what does it look like? ALU=arithmetic logic unit 31

Introduction to Binary Numbers Consider a 4-bit binary number Decimal Binary 1 1 2 1 3 11 Decimal Binary 4 1 5 11 6 11 7 111 Examples of binary arithmetic: 3 + 2 = 5 3 + 3 = 6 1 1 1 1 1 1 1 + 1 + 1 1 1 1 1 1 32

Negative Numbers? We would like a number system that provides obvious representation of positive and negative integers uses the same adder for addition and subtraction single value of equal coverage of positive and negative numbers easy detection of sign easy negation 33

Possible Representations Sign Magnitude: One's Complement Two's Complement = + = + = + 1 = +1 1 = +1 1 = +1 1 = +2 1 = +2 1 = +2 11 = +3 11 = +3 11 = +3 1 = - 1 = -3 1 = -4 11 = -1 11 = -2 11 = -3 11 = -2 11 = -1 11 = -2 111 = -3 111 = - 111 = -1 2 s complement representation of negative numbers Take the bitwise inverse and add 1 Issues: balance, number of zeros, ease for HW/SW Which one is best? Why? 34

Two s Complement Arithmetic Decimal 2 s Complement Binary Decimal 2 s Complement Binary 1 1 2 1 3 11 4 1 5 11 6 11 7 111-1 1111-2 111-3 111-4 11-5 111-6 11-7 11-8 1 Examples: 7-6 = 7 + (- 6) = 1 3-5 = 3 + (- 5) = -2 1 1 1 1 1 1 1 1 1 1 + 1 1 + 1 1 1 1 1 1 1 35

Immediate Field in I-Format For LW, SW, BEQ, BNE, ADDI, ADDIU, SLT, SLTIU 16-bit Immediate field is signed Copy sign bit to all upper 16 bits It is sign extended to 32 bits before use Sign extension: 1 in 4 bits is the same as 1 in 8 bits (+2) 111 in 4 bits is the same as 1111 111 in 8 bits (-2) Why is there no SUBI instruction for MIPS? For ANDI, ORI 16-bit Immediate field is zero-extended to 32 bits before use Copy zero to all upper 16 bits Why? 36

Instruction Fetch Arithmetic -- The heart of instruction execution Instruction Decode Operand operation Fetch a 32 Execute ALU 32 result Result Store b 32 Next Instruction 37

Foundation of ICs: Field Effect Transistor (FET) N-FET: Conducts when Gate is high P-FET: Conducts when Gate is low Gate Source Gate Source Drain CMOS Inverter Input high, output low Input low, output high Power consumed only during transition VDD Drain In Out Current processing technology.9 µ, next.65 µ 38

39 Some basics of digital logic c = a. b b a 1 1 1 1 1 b a c b a c a c c = a + b b a 1 1 1 1 1 1 1 1 1 c = a a a b 1 c d 1 a c b d 1. AND gate (c = a. b) 2. OR gate (c = a + b) 3. Inverter (c = a) 4. Multiplexor (if d = =, c = a; else c = b)

A One-bit Full Adder CarryIn This is also called a (3, 2) adder Truth Table: A B 1-bit Full Adder C CarryOut Inputs Outputs A B CarryIn CarryOut Sum Comments + + = 1 1 + + 1 = 1 1 1 + 1 + = 1 1 1 1 + 1 + 1 = 1 1 1 1 + + = 1 1 1 1 1 + + 1 = 1 1 1 1 1 + 1 + = 1 1 1 1 1 1 1 + 1 + 1 = 11

Logic Equation for CarryOut Inputs Outputs A B CarryIn CarryOut Sum Comments + + = 1 1 + + 1 = 1 1 1 + 1 + = 1 1 1 1 + 1 + 1 = 1 1 1 1 + + = 1 1 1 1 1 + + 1 = 1 1 1 1 1 + 1 + = 1 1 1 1 1 1 1 + 1 + 1 = 11 CarryOut = (!A & B & CarryIn) (A &!B & CarryIn) (A & B &!CarryIn) (A & B & CarryIn) = (B & CarryIn) (A & CarryIn) (A & B) 41

Logic Equation for Sum Inputs Outputs A B CarryIn CarryOut Sum Comments + + = 1 1 + + 1 = 1 1 1 + 1 + = 1 1 1 1 + 1 + 1 = 1 1 1 1 + + = 1 1 1 1 1 + + 1 = 1 1 1 1 1 + 1 + = 1 1 1 1 1 1 1 + 1 + 1 = 11 Sum = (!A &!B & CarryIn) (!A & B &!CarryIn) (A &!B &!CarryIn) (A & B & CarryIn) 42

1-bit ALU Operation ALU Control Lines (ALUop) Function And 1 Or a b 1 Result CarryIn Operation ALU Control Lines (ALUop) Function And 1 Or 1 Add a b 1 2 Result But how do we make the adder? CarryOut 43

1-bit ALU 32-bit ALU CarryIn Operation CarryIn Operation a a b CarryIn ALU CarryOut Result 1 Result b 2 a1 b1 CarryIn ALU1 CarryOut Result1 CarryOut Implements functions: AND a2 b2 CarryIn ALU2 CarryOut Result2 OR ADD What about SUB and SLT? Ripple carry delay is large... a31 b31 CarryIn ALU31 Result31 The 32-bit ALU 44

32-bit ALU: Subtraction Keep in mind the following: (A - B) is the same as: A + (-B) Bit-wise inverse of B is!b: A - B = A + (-B) = A + (!B + 1) = A +!B + 1 Binvert Operation CarryIn a Binvert provides the negation 1 Result How about +1? For SUB, for LSB: b 1 2 set Binvert = 1 CarryIn = 1 CarryOut 45

Detecting Overflow No overflow when adding a positive and a negative number No overflow when signs are the same for subtraction Overflow occurs when the value affects the sign: overflow when adding two positives yields a negative or, adding two negatives gives a positive or, subtract a negative from a positive and get a negative or, subtract a positive from a negative and get a positive Consider the operations A + B, and A B Can overflow occur if B is? Can overflow occur if A is? Response of MIPS to overflow will be covered later in the course 46

Overflow Detection 1 1 1 1 2 1 1-4 + 1 1 3 + 1 1 1-2 1 1 5 1 1-6 1 1 1 1 1 1 1 1 7 1 1-4 + 1 1 3 + 1 1 1-5 1 1-6 1 1 1 7 So how do we detect overflow? 47

Overflow Detection Logic Carry into MSB! = Carry out of MSB For a N-bit ALU: Overflow = CarryIn[N - 1] XOR CarryOut[N - 1] CarryIn A B A1 B1 A2 B2 A3 B3 1-bit ALU CarryIn1 CarryOut 1-bit ALU CarryIn2 CarryOut1 CarryIn3 1-bit ALU 1-bit ALU CarryOut3 Result Result1 Result2 Result3 X Y X XOR Y 1 1 1 1 1 1 Overflow 48

BEQ/BNE: Zero Detection Logic Zero Detection Logic is just one BIG NOR gate Any non-zero input to the NOR gate will cause its output to be zero CarryIn A B A1 B1 A2 B2 A3 B3 1-bit Result ALU CarryIn1 CarryOut Result1 1-bit ALU CarryIn2 CarryOut1 Result2 1-bit ALU CarryIn3 CarryOut2 1-bit ALU CarryOut3 Result3 Zero 49

SLT: Set-on-less-than Logic SLT $1, $2, $3 if( $2 < $3) $1 = 1; else $1 = ; To test A < B, do a subtraction (A - B) (A < B) if (A - B) <, i.e. negative Use sign bit Route the sign bit to bit of result Set bits 1-31 to zero There is a complication due to overflow Work out solution in Homework problem 4.23 5

A Complete 32-bit ALU Bnegate Operation a b a1 b1 a2 b2 CarryIn ALU Less CarryOut CarryIn ALU1 Less CarryOut CarryIn ALU2 Less CarryOut Result Result1 Result2 Zero Functionality Arithmetic Operations: ADD, SUB Logical Operations: AND, OR Compare SLT Support for branch BEQ, BNE Exception detection Overflow a31 b31 CarryIn ALU31 Less Result31 Set Note: Less is connected to Set input for bit. For all other bits, less is connected to zero. Overflow 51

Designing an Arithmetic Logic Unit ALUop 3 A N Zero B N ALU N CarryOut Result Overflow ALU Control Lines (ALUop) Function And 1 Or 1 Add 11 Subtract 111 Set-on-less-than 52

Our ALU has functionality but lacks speed... CPU CPU time time = Seconds = Instructions x Cycles Cycles x Seconds Program Program Instruction Cycle Cycle Clk1 Clk2 Clock Skew............ CLK1 CLK2 Cycle Time = CLK-to-Q + Longest Gate Delay + Setup Time + Clock Skew 53

Adder in our ALU is in timing critical path The adder we just built is called a Ripple Carry Adder The carry bit may have to propagate from LSB to MSB Worst case delay for an N-bit RC adder: 2N-gate delay Bnegate Operation CarryIn A a b a1 b1 CarryIn ALU Less CarryOut CarryIn ALU1 Less CarryOut Result Result1 Zero a2 b2 CarryIn ALU2 Less CarryOut Result2 B CarryOut a31 b31 CarryIn ALU31 Less Result31 Set Overflow Single gate delay =.2 ns (inverter speed of 5 GHz) 32 bit adder => 64 gate delay => 1.28 ns delay Accounting for CLK2Q, set up time and clock skew, the ALU will run at << 789 MHz 54

Problem: ripple carry adder is slow Is there more than one way to do addition? two extremes: ripple carry and sum-of-products Can you see the ripple? How could you get rid of it? c 1 = b c + a c +a b c 2 = b 1 c 1 + a 1 c 1 +a 1 b 1 c 3 = b 2 c 2 + a 2 c 2 +a 2 b 2 c 4 = b 3 c 3 + a 3 c 3 +a 3 b 3 55

The Theory Behind Carry Look-ahead B1 A1 B A Cin2 Cout1 1-bit ALU Cin1 Cout 1-bit ALU Cin Recall: CarryOut = (B & CarryIn) (A & CarryIn) (A & B) Cin2 = Cout1 = (B1 & Cin1) (A1 & Cin1) (A1 & B1) Cin1 = Cout = (B & Cin) (A & Cin) (A & B) Substituting Cin1 into Cin2: Cin2 = (A1 & A & B) (A1 & A & Cin) (A1 & B & Cin) (B1 & A & B) (B1 & A & Cin) (B1 & B & Cin) (A1 & B1) Now define two new terms: Generate Carry at Bit i gi = Ai & Bi Propagate Carry via Bit i pi = Ai or Bi 56

Carry Lookahead: 1st Level Abstraction Using the two new terms we just defined: Generate Carry at Bit i gi = Ai & Bi Propagate Carry via Bit i pi = Ai Bi We can rewrite: Cin1 = g (p & Cin) Cin2 = g1 (p1 & g) (p1 & p & Cin) Cin3 = g2 (p2 & g1) (p2 & p1 & g) (p2 & p1 & p & Cin) Carry going into bit 3 is 1 if We generate a carry at bit 2 (g2) Or we generate a carry at bit 1 (g1) and bit 2 allows it to propagate (p2 & g1) Or we generate a carry at bit (g) and bit 1 as well as bit 2 allows it to propagate (p2 & p1 & g) Or we have a carry input at bit (Cin) and bit, 1, and 2 all allow it to propagate (p2 & p1 & p & Cin) 57

Carry Lookahead: 2nd Level of Abstraction Propagate signals for 4-bit adders (1 gate delay to combine p s) P = p3.p2.p1.p P1 = p7.p6.p5.p4 P2 = p11.p1.p9.p8 P3 = p15.p14.p13.p12 Generate signals for 4-bit adders (2 gate delays to combine p s and g s) G = g3 + (p3.g2) + (p3.p2.g1) + (p3.p2.p1.g) G1 = g7 + (p7.g6) + (p7.p6.g5) + (p7.p6.p5.g4) G2 = g11 + (p11.g1) + (p11.p1.g9) + (p11.p1.p9.g8) G3 = g15 + (p15.g14) + (p15.p14.g13) + (p15.p14.p13.g12) 4-bit adder carry computation (2 gate delays to combine G s P s and c) C1 = G + (P.c) C2 = G1 + (P1.G) + (P1.P.c) C3 = G2 + (P2.G1) + (P2.P1.G) + (P2.P1.P.c) C4 = G3 + (P3.G2) + (P3.P2.G1) + (P3.P2.P1.G) + (P3.P2.P1.P.c) Total 2 + 2 + 1 = 5 levels of logic to compute c32 C4 has 2 levels of logic with Gi, Pi Gi has 2 levels of logic with gi, pi gi & pi have 1 level of logic with inputs Ai, Bi 58

Carry Look-ahead Adder (CLA) CarryIn a b a1 b1 a2 b2 a3 b3 a4 b4 a5 b5 a6 b6 a7 b7 CarryIn ALU P G CarryIn ALU1 P1 G1 C1 C2 pi gi ci + 1 pi + 1 gi + 1 ci + 2 Result--3 Carry-lookahead unit Result4--7 16-bit Ripple carry (RC) adder has 16 x 2 = 32 gate delays 4-bit carry-look-ahead adder has 5 gate delays CLA adder is faster than RC by a factor of 32/5 ~ = 6 a8 b8 a9 b9 a1 b1 a11 b11 CarryIn ALU2 P2 G2 C3 pi + 2 gi + 2 ci + 3 Result8--11 a12 b12 a13 b13 a14 b14 a15 b15 CarryIn ALU3 P3 G3 C4 pi + 3 gi + 3 ci + 4 Result12--15 CarryOut 59

Conclusion We can build an ALU to support the MIPS instruction set Key idea: use multiplexor to select the output we want Efficiently perform subtraction using two s complement Replicate a 1-bit ALU to produce a 32-bit ALU Important points about hardware All of the gates are always working The speed of a gate is affected by the number of inputs to the gate The speed of a circuit is affected by the number of gates in series (on the critical path or the deepest level of logic ) For computer hardware, Speed is it! 6

Two sections for 141: Announcements Friday 11-11:5 PM Center 19 Friday 2-2:5 PM CSB 15 Reading Assignment Chapter 4: Arithmetic for Computers, Sec. 4.6-4.11 Chapter 5: The Processor: Datapath and Control, Sec. Homework 2: Due Wed., July 7 in class 2.1 through 2.5, 2.18 through 2.21 4.9, 4.12, 4.14, 4.22, 4.23, 4.44 Quiz When: Wed., July 7, First 1 minutes of the class Topic: Performance, Chapter 2 Need: Paper, pen, calculator CSE 141 Lab Assignment has been posted on the web page http://www.cse.ucsd.edu/classes/su4/cse141l 61