Advanced Embedded System Design with FPGA. Lecture 1 Introduction


Joshua Davis
5 years ago

1 Advanced Embedded System Design with FPGA Lecture 1 Introduction

2 Course Description
This course covers topics related to FPGA-based embedded systems: microprocessor architectures, embedded system architecture, firmware, bootloaders, JTAG, bare-metal processors vs. embedded OS, hard-core and soft-core IPs, interconnects between processor and FPGA, buses and interfaces, and external devices such as sensors and cameras. Labs are included to practice the design of FPGA-based embedded systems.

3 Course Outline
Lecture  Description                              Chapter
1        Introduction to Embedded System Design   1 & 2
2        Computer Architecture                    3 & Notes
3        Memory and DMA                           4
4        IO and Exceptions                        4
5        Bus Architectures                        5
6        Embedded C and Assembly                  6
7        Midterm Exam
8        Embedded C++ and Scripting               7
9        FPGAs and SoPC                           9 & ...
Later lectures: IDEs, Synthesis and Building (8 & ...); GPIO etc.; Embedded Arithmetic; Communication; Final Exam

4 Course Policy
Midterm Exam 35%, Final Exam 35%, Homework 30%
There are 6 homework assignments, worth 5 points each. An assignment loses 1 point for every day it is late and is not accepted once it is more than 5 days late.

5 Required Textbook / Course Material
Building Embedded Systems - Programmable Hardware, Gu, Changyi, 2016 ISBN
Classroom notes and recordings

6 Contact Information
Important dates are on the website.
Homework assignments will be available on the website.

7 Classroom Requests
Electronic devices on silent. Don't answer phones during lecture.
Ask questions during question breaks.
The university decides on inclement weather.
Do your best to be on time.
No electronic devices on exams.
If there is something you want to know, ask me!

8 Embedded systems overview
Computing systems are everywhere
Most of us think of desktop computers: PCs, laptops, mainframes, servers
But there's another type of computing system, far more common...
Acknowledgement: Slides 4-36 are based on Embedded System Design: A Unified Hardware/Software Introduction by Vahid/Givargis

9 Embedded systems overview
Embedded computing systems: computing systems embedded within electronic devices
Hard to define: nearly any computing system other than a desktop computer
Billions of units produced yearly, versus millions of desktop units
Perhaps 50 per household and per automobile
(Figure: computers are in here... and here... and even here... Lots more of these, though they cost a lot less each.)

10 A short list of embedded systems
Anti-lock brakes, auto-focus cameras, automatic teller machines, automatic toll systems, automatic transmission, avionic systems, battery chargers, camcorders, cell phones, cell-phone base stations, cordless phones, cruise control, curbside check-in systems, digital cameras, disk drives, electronic card readers, electronic instruments, electronic toys/games, factory control, fax machines, fingerprint identifiers, home security systems, life-support systems, medical testing systems, modems, MPEG decoders, network cards, network switches/routers, on-board navigation, pagers, photocopiers, point-of-sale systems, portable video games, printers, satellite phones, scanners, smart ovens/dishwashers, speech recognizers, stereo systems, teleconferencing systems, televisions, temperature controllers, theft tracking systems, TV set-top boxes, VCRs, DVD players, video game consoles, video phones, washers and dryers
And the list goes on and on

11 Some common characteristics of embedded systems
Single-functioned: executes a single program, repeatedly
Tightly constrained: low cost, low power, small, fast, etc.
Reactive and real-time: continually reacts to changes in the system's environment; must compute certain results in real time without delay

12 An embedded system example -- a digital camera
Digital camera chip (block diagram): lens, CCD, A2D, CCD preprocessor, pixel coprocessor, D2A, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, display controller, memory controller, ISA bus interface, UART, LCD controller
Single-functioned -- always a digital camera
Tightly constrained -- low cost, low power, small, fast
Reactive and real-time -- only to a small extent

13 Design challenge: optimizing design metrics
Obvious design goal: construct an implementation with the desired functionality
Key design challenge: simultaneously optimize numerous design metrics
Design metric: a measurable feature of a system's implementation
Optimizing design metrics is a key challenge

14 Design challenge: optimizing design metrics
Common metrics (SWaP-C: size, weight, power, and cost)
Unit cost: the monetary cost of manufacturing each copy of the system, excluding NRE cost
NRE cost (Non-Recurring Engineering cost): the one-time monetary cost of designing the system
Size: the physical space required by the system
Performance: the execution time or throughput of the system
Power: the amount of power consumed by the system
Flexibility: the ability to change the functionality of the system without incurring heavy NRE cost

15 Design challenge: optimizing design metrics
Common metrics (continued)
Time-to-prototype: the time needed to build a working version of the system
Time-to-market: the time required to develop a system to the point that it can be released and sold to customers
Maintainability: the ability to modify the system after its initial release
Correctness, safety, many more

16 Design metric competition -- improving one may worsen others
(Figure: performance, power, size, and NRE cost pulling against one another, shown next to the digital camera chip block diagram)
Expertise with both software and hardware is needed to optimize design metrics
Not just a hardware or software expert, as is common
A designer must be comfortable with various technologies in order to choose the best for a given application and constraints

17 Time-to-market: a demanding design metric
Time required to develop a product to the point it can be sold to customers
Market window: period during which the product would have highest sales
Average time-to-market constraint is about 8 months
Delays can be costly
(Figure: revenues ($) versus time (months))

18 Losses due to delayed market entry
Simplified revenue model
  Product life = 2W, peak at W
  Time of market entry defines a triangle, representing market penetration
  Triangle area equals revenue
Loss: the difference between the on-time and delayed triangle areas
(Figure: revenues ($) versus time, showing market rise and market fall, the on-time and delayed triangles for an entry delay D, peak revenue and peak revenue from delayed entry at W, and end of life at 2W)

19 Losses due to delayed market entry (cont.)
Area = 1/2 * base * height
  On-time = 1/2 * 2W * W
  Delayed = 1/2 * (W - D + W) * (W - D)
Percentage revenue loss: $\frac{D(3W - D)}{2W^2} \times 100\%$
Try some examples
  Lifetime 2W = 52 wks, delay D = 4 wks: $\frac{4(3 \cdot 26 - 4)}{2 \cdot 26^2} \times 100\% \approx 22\%$
  Lifetime 2W = 52 wks, delay D = 10 wks: $\frac{10(3 \cdot 26 - 10)}{2 \cdot 26^2} \times 100\% \approx 50\%$
Delays are costly!
(Figure: same revenue triangles as on the previous slide)
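The triangle-area model above can be checked numerically. The following is a minimal C sketch; the function name `revenue_loss_percent` is an illustrative choice, not something from the slides, and it assumes 0 <= D <= W.

```c
/* Percentage revenue loss for the simplified triangular revenue model.
 * w: half of the product lifetime (revenue peaks at time W)
 * d: delay of market entry, in the same time unit as w (assumes 0 <= d <= w)
 */
double revenue_loss_percent(double w, double d)
{
    double on_time = 0.5 * (2.0 * w) * w;            /* area of the on-time triangle */
    double delayed = 0.5 * (2.0 * w - d) * (w - d);  /* area of the delayed triangle */
    return (on_time - delayed) / on_time * 100.0;
}
```

Calling `revenue_loss_percent(26.0, 4.0)` corresponds to the first slide example (lifetime 52 weeks, delay 4 weeks), and `revenue_loss_percent(26.0, 10.0)` to the second.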

20 NRE and unit cost metrics
Costs
  Unit cost: the monetary cost of manufacturing each copy of the system, excluding NRE cost
  NRE cost (Non-Recurring Engineering cost): the one-time monetary cost of designing the system
  total cost = NRE cost + unit cost * # of units
  per-product cost = total cost / # of units = (NRE cost / # of units) + unit cost
Example: NRE = $2000, unit = $100, for 10 units
  total cost = $2000 + 10 * $100 = $3000
  per-product cost = $2000/10 + $100 = $300
Amortizing NRE cost over the units results in an additional $200 per unit

21 NRE and unit cost metrics
Compare technologies by costs -- best depends on quantity
  Technology A: NRE = $2,000, unit = $100
  Technology B: NRE = $30,000, unit = $30
  Technology C: NRE = $100,000, unit = $2
But, must also consider time-to-market
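To see how the best technology depends on quantity, the cost formula can be evaluated for each option. This is a small C sketch using the NRE and unit costs listed above; the function names `total_cost` and `cheapest_technology` are illustrative, and ties are arbitrarily resolved in favor of the technology listed first.

```c
/* Total cost of producing `units` copies: NRE cost plus unit cost per copy. */
long total_cost(long nre, long unit_cost, long units)
{
    return nre + unit_cost * units;
}

/* Returns 'A', 'B', or 'C': the cheapest technology for the given volume,
 * using the NRE and unit costs from the slide. On a tie, the technology
 * listed first wins (an arbitrary choice made for this sketch). */
char cheapest_technology(long units)
{
    long a = total_cost(2000, 100, units);
    long b = total_cost(30000, 30, units);
    long c = total_cost(100000, 2, units);

    if (a <= b && a <= c)
        return 'A';
    if (b <= c)
        return 'B';
    return 'C';
}
```

With these numbers, A and B cost the same at 400 units and B and C cost the same at 2,500 units, so A is cheapest for small runs, B for medium runs, and C for large runs.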

22 The performance design metric
Widely used measure of a system, widely abused
  Clock frequency and instructions per second are not good measures
  Digital camera example: a user cares about how fast it processes images, not clock speed or instructions per second
Latency (response time): time between task start and end
  e.g., Cameras A and B process images in 0.25 seconds
Throughput: tasks per second, e.g., Camera A processes 4 images per second
  Throughput can be more than latency seems to imply due to concurrency, e.g., Camera B may process 8 images per second (by capturing a new image while the previous image is being stored)
Speedup of B over A = B's performance / A's performance
  Throughput speedup = 8/4 = 2

23 Three key embedded system technologies
Technology: a manner of accomplishing a task, especially using technical processes, methods, or knowledge
Three key technologies for embedded systems
  Processor technology
  IC technology
  Design technology

24 Processor technology
The architecture of the computation engine used to implement a system's desired functionality
Processor does not have to be programmable
Processor is not equal to general-purpose processor
(Figure: three controller/datapath pairs implementing the same loop, total = 0; for i = 1 to ...)
  General-purpose (software): control logic and state register, IR, PC; register file and general ALU; program memory holding assembly code; data memory
  Application-specific: control logic and state register, IR, PC; registers and custom ALU; program memory holding assembly code; data memory
  Single-purpose (hardware): control logic and state register; index, total, and adder; data memory

25 Processor technology
Processors vary in their customization for the problem at hand
Desired functionality:
  total = 0
  for i = 1 to N loop
    total += M[i]
  end loop
Candidate implementations: general-purpose processor, application-specific processor, single-purpose processor
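For reference, the desired functionality above could be written in C roughly as follows when targeting a general-purpose processor. This is only a sketch: the function name `sum_array` is made up for illustration, and C arrays are indexed from 0 rather than from 1 as in the slide's pseudocode.

```c
#include <stddef.h>

/* Sums the n elements of m, mirroring the slide's
 * "total = 0; for i = 1 to N loop total += M[i]" pseudocode. */
int sum_array(const int *m, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += m[i];
    return total;
}
```

On a single-purpose processor the same loop would instead be realized directly as a controller plus a datapath containing the index register, the total register, and an adder.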

26 General-purpose processors
Programmable device used in a variety of applications
Also known as microprocessor
Features: program memory; general datapath with large register file and general ALU
User benefits: low time-to-market and NRE costs; high flexibility
Pentium is the most well-known, but there are hundreds of others
(Figure: controller with control logic and state register, IR, PC; datapath with register file and general ALU; program memory holding assembly code; data memory)

27 Single-purpose processors
Digital circuit designed to execute exactly one program
a.k.a. coprocessor, accelerator or peripheral
Features: contains only the components needed to execute a single program; no program memory
Benefits: fast, low power, small size
(Figure: controller with control logic and state register; datapath with index, total, and adder; data memory)

28 Application-specific processors
Programmable processor optimized for a particular class of applications having common characteristics
Compromise between general-purpose and single-purpose processors
Features: program memory; optimized datapath; special functional units
Benefits: some flexibility, good performance, size and power
(Figure: controller with control logic and state register, IR, PC; datapath with registers and custom ALU; program memory holding assembly code; data memory)

29 IC technology
The manner in which a digital (gate-level) implementation is mapped onto an IC
  IC: integrated circuit, or chip
  IC technologies differ in their customization to a design
  ICs consist of numerous layers (perhaps 10 or more)
  IC technologies differ with respect to who builds each layer and when
(Figure: IC package and IC, with a transistor cross-section showing source, gate, oxide, channel, drain, and silicon substrate)

30 IC technology
Three types of IC technologies
  Full-custom/VLSI
  Semi-custom ASIC (gate array and standard cell)
  PLD (Programmable Logic Device)

31 Full-custom/VLSI
All layers are optimized for an embedded system's particular digital implementation
  Placing transistors
  Sizing transistors
  Routing wires
Benefits: excellent performance, small size, low power
Drawbacks: high NRE cost (e.g., $300k), long time-to-market

32 Semi-custom
Lower layers are fully or partially built
  Designers are left with routing of wires and maybe placing some blocks
Benefits: good performance, good size, less NRE cost than a full-custom implementation (perhaps $10k to $100k)
Drawbacks: still requires weeks to months to develop

33 PLD (Programmable Logic Device)
All layers already exist
  Designers can purchase an IC
  Connections on the IC are either created or destroyed to implement desired functionality
  Field-Programmable Gate Array (FPGA) very popular
Benefits: low NRE costs, almost instant IC availability
Drawbacks: bigger, expensive (perhaps $30 per unit), power hungry, slower

34 Moore's law
The most important trend in embedded systems
Predicted in 1965 by Intel co-founder Gordon Moore
IC transistor capacity has doubled roughly every 18 months for the past several decades
(Figure: logic transistors per chip, in millions, on a logarithmic scale)

35 The Future of Moore's law
Does Moore's Law still hold?
  This growth rate was steady until ~2005
Factors limiting transistor counts
  Power consumption (leakage power)
  Temperature
  Development of new technology (65nm, 45nm, 30nm, ...)
Transistor density growth slows down
New trend
  Multiple computing cores
  No significantly higher frequency
  Nanotechnology

36 The co-design ladder
In the past: hardware and software design technologies were very different
Recent maturation of synthesis enables a unified view of hardware and software
Hardware/software codesign
Software side: sequential program code (e.g., C, VHDL) -> compilers (1960's, 1970's) -> assembly instructions -> assemblers, linkers (1950's, 1960's) -> machine instructions -> implementation as microprocessor plus program bits: software
Hardware side: sequential program code (e.g., C, VHDL) -> behavioral synthesis (1990's) -> register transfers -> RT synthesis (1980's, 1990's) -> logic equations / FSMs -> logic synthesis (1970's, 1980's) -> logic gates -> VLSI, ASIC, or PLD implementation: hardware
The choice of hardware versus software for a particular function is simply a tradeoff among various design metrics, like performance, power, size, NRE cost, and especially flexibility; there is no fundamental difference between what hardware or software can implement.

37 Independence of processor and IC technologies
Basic tradeoff: general vs. custom, with respect to processor technology or IC technology
The two technologies are independent
Processor technology, from general to customized: general-purpose processor, ASIP, single-purpose processor
IC technology, from general to customized: PLD, semi-custom, full-custom
General, providing improved: flexibility, maintainability, NRE cost, time-to-prototype, time-to-market, cost (low volume)
Customized, providing improved: power efficiency, performance, size, cost (high volume)

38 Design productivity gap
While designer productivity has grown at an impressive rate over the past decades, the rate of improvement has not kept pace with chip capacity
(Figure: IC capacity in logic transistors per chip (millions) and productivity in K transistors per staff-month, with the gap between them)

39 Design productivity gap
1981: a leading-edge chip required 100 designer-months (10,000 transistors / 100 transistors per month)
2002: a leading-edge chip requires 30,000 designer-months (150,000,000 transistors / 5,000 transistors per month)
Designer cost increased from $1M to $300M
(Figure: IC capacity and productivity curves with the gap between them, as on the previous slide)

40 The mythical man-month
The situation is even worse than the productivity gap indicates
In theory, adding designers to a team reduces project completion time
In reality, productivity per designer decreases due to the complexities of team management and communication
In the software community, known as the mythical man-month (Brooks 1975)
At some point, adding designers can actually lengthen project completion time! (Too many cooks)
Example: 1M transistors; 1 designer = 5000 trans/month; each additional designer reduces every designer's productivity by 100 trans/month
  So 2 designers produce 4900 trans/month each
(Figure: months until completion versus number of designers, with team and individual curves)
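The example on this slide can be worked through numerically. The C sketch below assumes exactly the model stated above (1M transistors, 5000 transistors/month for one designer, 100 transistors/month lost per designer for each additional team member); the function names and the 1..50 search range are illustrative choices, the range being picked so that per-designer productivity stays positive.

```c
/* Per-designer productivity in transistors/month for a team of n designers:
 * 5000 for a lone designer, minus 100 for each additional team member. */
long per_designer_rate(int n)
{
    return 5000 - 100L * (n - 1);
}

/* Months needed for n designers to complete a 1M-transistor design. */
double months_to_complete(int n)
{
    long team_rate = n * per_designer_rate(n);
    return 1000000.0 / (double)team_rate;
}

/* Smallest team size in 1..50 that minimizes completion time. */
int best_team_size(void)
{
    int best = 1;
    for (int n = 2; n <= 50; n++) {
        if (months_to_complete(n) < months_to_complete(best))
            best = n;
    }
    return best;
}
```

Under this model a single designer needs 200 months, completion time falls as designers are added up to roughly 25 people, and beyond that each extra designer lengthens the project again.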

41 Embedded Processors
Requirements for embedded processors: low power consumption, programmable, low cost
Examples
Microcontroller-type: self-contained (memory, I/O, ADC, etc.), GPIO, no OS, less interactive environment, e.g., sensor data acquisition
  Microchip PIC, Intel 8051, Parallax Propeller, Atmel AVR, TI MSP430
Microprocessor-type: SoC, demultiplexed address/data buses, co-processor, standard system buses, often with OS, more interactive environment, e.g., set-top box, in-vehicle entertainment system
  Intel Atom/Quark, TI OMAP4, Nvidia Tegra, Apple A8

42 PIC16F684

43 PIC18F45K20

44 Intel Quark

45 Intel Galileo Development Board

46 General Architecture
(Figure from Building Embedded Systems - Programmable Hardware, Gu, Changyi, 2016)

47 Power-On
(Figure from Building Embedded Systems - Programmable Hardware, Gu, Changyi, 2016)

48 Memory Organization
(Figure from Building Embedded Systems - Programmable Hardware, Gu, Changyi, 2016)

49 Advanced Embedded System Design with FPGA Lecture 2 Computer Architecture

50 Computer Components
Same components for all kinds of computer: desktop, server, embedded
Input/output includes
  User-interface devices: display, keyboard, mouse
  Storage devices: hard disk, CD/DVD, flash
  Network adapters: for communicating with other computers
Datapath: performs operations on data
Control: sequences datapath, memory, ...
Cache memory: small, fast SRAM memory for immediate access to data


51 Computer Components
FIGURE 1.4 The organization of a computer, showing the five classic components. The processor gets instructions and data from memory. Input writes data to memory, and output reads data from memory. Control sends the signals that determine the operations of the datapath, memory, input, and output.

52 Instruction Set
The repertoire of instructions of a computer
Different computers have different instruction sets, but with many aspects in common
Early computers had very simple instruction sets: simplified implementation
Many modern computers also have simple instruction sets (RISC)

53 MIPS Assembly Language
FIGURE 2.1 MIPS assembly language revealed in this chapter. This information is also found in Column 1 of the MIPS Reference Data Card at the front of this book.

54 Arithmetic Operations
Add and subtract, three operands: two sources and one destination
  add a, b, c   # a gets b + c
All arithmetic operations have this form
Design Principle 1: Simplicity favors regularity
  Regularity makes implementation simpler
  Simplicity enables higher performance at lower cost

55 Arithmetic Example
C code:
  f = (g + h) - (i + j);
Compiled MIPS code:
  add t0, g, h    # temp t0 = g + h
  add t1, i, j    # temp t1 = i + j
  sub f, t0, t1   # f = t0 - t1

56 Register Operands
Arithmetic instructions use register operands
MIPS has a 32 x 32-bit register file
  Use for frequently accessed data
  Numbered 0 to 31
  32-bit data called a word
Assembler names
  $t0, $t1, ..., $t9 for temporary values
  $s0, $s1, ..., $s7 for saved variables
Design Principle 2: Smaller is faster
  c.f. main memory: millions of locations

57 Register Operand Example
C code:
  f = (g + h) - (i + j);
  f, ..., j in $s0, ..., $s4
Compiled MIPS code:
  add $t0, $s1, $s2
  add $t1, $s3, $s4
  sub $s0, $t0, $t1

58 Memory Operations
Main memory used for composite data: arrays, structures, dynamic data
To apply arithmetic operations
  Load values from memory into registers
  Store result from register to memory
Memory is byte addressed: each address identifies an 8-bit byte
Words are aligned in memory: address must be a multiple of 4
MIPS is Big Endian
  Most-significant byte at the lowest address of a word
  c.f. Little Endian: least-significant byte at the lowest address
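The byte ordering and alignment rules above can be illustrated with a few helper functions. This is a C sketch with made-up names (`big_endian_byte`, `little_endian_byte`, `is_word_aligned`); it models where each byte of a 32-bit word would sit in memory rather than inspecting the host machine, and it assumes the byte offset is in the range 0..3.

```c
#include <stdint.h>

/* Byte found at (word address + offset) when `word` is stored big-endian:
 * offset 0 holds the most-significant byte. Assumes offset is 0..3. */
uint8_t big_endian_byte(uint32_t word, unsigned offset)
{
    return (uint8_t)(word >> (8 * (3 - offset)));
}

/* Byte found at (word address + offset) when `word` is stored little-endian:
 * offset 0 holds the least-significant byte. Assumes offset is 0..3. */
uint8_t little_endian_byte(uint32_t word, unsigned offset)
{
    return (uint8_t)(word >> (8 * offset));
}

/* A word address is aligned when it is a multiple of 4. */
int is_word_aligned(uint32_t address)
{
    return (address & 3u) == 0;
}
```

For the word 0x12345678, the byte at the lowest address is 0x12 in the big-endian layout and 0x78 in the little-endian layout.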

59 Registers vs. Memory
Registers are faster to access than memory
Operating on memory data requires loads and stores: more instructions to be executed
Compiler must use registers for variables as much as possible
  Only spill to memory for less frequently used variables
  Register optimization is important!

60 Immediate Operands
Constant data specified in an instruction
  addi $s3, $s3, 4
No subtract immediate instruction: just use a negative constant
  addi $s2, $s1, -1
Design Principle 3: Make the common case fast
  Small constants are common
  Immediate operand avoids a load instruction

61 The Constant Zero
MIPS register 0 ($zero) is the constant 0: cannot be overwritten
Useful for common operations, e.g., move between registers
  add $t2, $s1, $zero

62 Representing Instructions
Instructions are encoded in binary, called machine code
MIPS instructions
  Encoded as 32-bit instruction words
  Small number of formats encoding operation code (opcode), register numbers, ...
  Regularity!
Register numbers
  $t0 - $t7 are registers 8 - 15
  $t8 - $t9 are registers 24 - 25
  $s0 - $s7 are registers 16 - 23

63 MIPS R-format Instructions
op      rs      rt      rd      shamt   funct
6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
Instruction fields
  op: operation code (opcode)
  rs: first source register number
  rt: second source register number
  rd: destination register number
  shamt: shift amount (00000 for now)
  funct: function code (extends opcode)

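64 R-format Example
add $t0, $s1, $s2
op       rs      rt      rd      shamt   funct
6 bits   5 bits  5 bits  5 bits  5 bits  6 bits
special  $s1     $s2     $t0     0       add
0        17      18      8       0       32
000000   10001   10010   01000   00000   100000
00000010001100100100000000100000 (binary) = 02324020 (hexadecimal)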
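The field packing shown in this example can be expressed as shifts and ORs. The following C sketch uses an illustrative function name, `encode_r`, and masks each argument to its field width so that out-of-range values cannot spill into neighboring fields.

```c
#include <stdint.h>

/* Packs the six R-format fields into one 32-bit instruction word:
 * op[31:26] rs[25:21] rt[20:16] rd[15:11] shamt[10:6] funct[5:0]. */
uint32_t encode_r(uint32_t op, uint32_t rs, uint32_t rt,
                  uint32_t rd, uint32_t shamt, uint32_t funct)
{
    return ((op    & 0x3Fu) << 26)
         | ((rs    & 0x1Fu) << 21)
         | ((rt    & 0x1Fu) << 16)
         | ((rd    & 0x1Fu) << 11)
         | ((shamt & 0x1Fu) << 6)
         |  (funct & 0x3Fu);
}
```

For add $t0, $s1, $s2 the call would be `encode_r(0, 17, 18, 8, 0, 32)`, using the standard MIPS register numbers for $s1, $s2, and $t0 and the funct code 32 for add.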

65 MIPS I-format Instructions
op      rs      rt      constant or address
6 bits  5 bits  5 bits  16 bits
Immediate arithmetic and load/store instructions
  rt: destination or source register number
  Constant: $-2^{15}$ to $+2^{15} - 1$
  Address: offset added to base address in rs
Design Principle 4: Good design demands good compromises
  Different formats complicate decoding, but allow 32-bit instructions uniformly
  Keep formats as similar as possible
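As a rough illustration of the I-format layout and the signed 16-bit constant range, the sketch below packs an I-format word and recovers the sign-extended immediate. The names `encode_i` and `immediate_of` are hypothetical; opcode 8 used in the usage note is the standard MIPS opcode for addi.

```c
#include <stdint.h>

/* Packs the I-format fields: op[31:26] rs[25:21] rt[20:16] constant[15:0]. */
uint32_t encode_i(uint32_t op, uint32_t rs, uint32_t rt, int16_t constant)
{
    return ((op & 0x3Fu) << 26)
         | ((rs & 0x1Fu) << 21)
         | ((rt & 0x1Fu) << 16)
         |  (uint32_t)(uint16_t)constant;
}

/* Extracts the 16-bit constant and sign-extends it to 32 bits,
 * giving a value in the range -32768 .. 32767. */
int32_t immediate_of(uint32_t instruction)
{
    uint32_t low = instruction & 0xFFFFu;
    return (low & 0x8000u) ? (int32_t)low - 0x10000 : (int32_t)low;
}
```

For example, addi $s2, $s1, -1 would be `encode_i(8, 17, 18, -1)`, and `immediate_of` applied to the result gives back -1.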

66 MIPS Machine Language
FIGURE 2.6 MIPS architecture revealed through Section 2.5. The two MIPS instruction formats so far are R and I. The first 16 bits are the same: both contain an op field, giving the base operation; an rs field, giving one of the sources; and the rt field, which specifies the other source operand, except for load word, where it specifies the destination register. R-format divides the last 16 bits into an rd field, specifying the destination register; the shamt field, which Section 2.6 explains; and the funct field, which specifies the specific operation of R-format instructions. I-format combines the last 16 bits into a single address field.

67 Logical Operations
Instructions for bitwise manipulation
Operation     C     Java   MIPS
Shift left    <<    <<     sll
Shift right   >>    >>>    srl
Bitwise AND   &     &      and, andi
Bitwise OR    |     |      or, ori
Bitwise NOT   ~     ~      nor
Useful for extracting and inserting groups of bits in a word

68 Shift Operations
op      rs      rt      rd      shamt   funct
6 bits  5 bits  5 bits  5 bits  5 bits  6 bits
shamt: how many positions to shift
Shift left logical
  Shift left and fill with 0 bits
  sll by i bits multiplies by $2^i$
Shift right logical
  Shift right and fill with 0 bits
  srl by i bits divides by $2^i$ (unsigned only)
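The relationship between logical shifts and powers of two can be tried out in C, where shifts on unsigned values are logical (zero-filling). This sketch uses the MIPS mnemonics as function names purely for illustration and assumes a shift amount of 0..31.

```c
#include <stdint.h>

/* Shift left logical: fills with 0 bits, multiplies by 2^i (modulo 2^32).
 * Assumes i is in the range 0..31. */
uint32_t sll(uint32_t value, unsigned i)
{
    return value << i;
}

/* Shift right logical: fills with 0 bits, divides an unsigned value by 2^i.
 * Assumes i is in the range 0..31. */
uint32_t srl(uint32_t value, unsigned i)
{
    return value >> i;
}
```

For example, `sll(5, 3)` gives 5 * 8 and `srl(40, 3)` gives 40 / 8. The division rule holds only for unsigned values because the vacated high bits are filled with zeros rather than copies of the sign bit.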

69 AND Operations
Useful to mask bits in a word: select some bits, clear others to 0
  and $t0, $t1, $t2
(Figure: bit patterns of $t2, $t1, and the result $t0)

70 OR Operations
Useful to include bits in a word: set some bits to 1, leave others unchanged
  or $t0, $t1, $t2
(Figure: bit patterns of $t2, $t1, and the result $t0)

71 NOT Operations
Useful to invert bits in a word: change 0 to 1, and 1 to 0
MIPS has NOR 3-operand instruction
  a NOR b == NOT ( a OR b )
  nor $t0, $t1, $zero
(Figure: bit patterns of $t1 and the result $t0)
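The masking, setting, and inverting idioms on the last three slides map directly onto C's bitwise operators. The sketch below is illustrative only; the function names are invented, and `not_via_nor` shows how NOT falls out of NOR with a zero operand, as in nor $t0, $t1, $zero.

```c
#include <stdint.h>

/* AND with a mask: keeps the bits selected by mask, clears the others. */
uint32_t keep_bits(uint32_t word, uint32_t mask)
{
    return word & mask;
}

/* OR with a mask: sets the bits selected by mask, leaves the others unchanged. */
uint32_t set_bits(uint32_t word, uint32_t mask)
{
    return word | mask;
}

/* NOR as provided by MIPS: NOT (a OR b). */
uint32_t nor(uint32_t a, uint32_t b)
{
    return ~(a | b);
}

/* NOT built from NOR by using zero as the second operand. */
uint32_t not_via_nor(uint32_t a)
{
    return nor(a, 0);
}
```

Because a OR 0 is just a, NOR with $zero inverts every bit, which is why MIPS does not need a separate NOT instruction.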

72 Conditional Operations
Branch to a labeled instruction if a condition is true; otherwise, continue sequentially
  beq rs, rt, L1    if (rs == rt) branch to instruction labeled L1
  bne rs, rt, L1    if (rs != rt) branch to instruction labeled L1
  j L1              unconditional jump to instruction labeled L1

73 Compiling If Statements
C code:
  if (i==j) f = g+h;
  else f = g-h;
  f, g, ... in $s0, $s1, ...
Compiled MIPS code:
        bne $s3, $s4, Else
        add $s0, $s1, $s2
        j   Exit
  Else: sub $s0, $s1, $s2
  Exit:

74 Compiling Loop Statements
C code:
  while (save[i] == k) i += 1;
  i in $s3, k in $s5, address of save in $s6
Compiled MIPS code:
  Loop: sll  $t1, $s3, 2
        add  $t1, $t1, $s6
        lw   $t0, 0($t1)
        bne  $t0, $s5, Exit
        addi $s3, $s3, 1
        j    Loop
  Exit:
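To make the correspondence between the C loop and the compiled code explicit, the loop can be rewritten in C with the address arithmetic spelled out, one statement per MIPS instruction. This is only a sketch: the function name `find_loop_exit` is invented, it assumes 32-bit array elements and a non-negative index, and the caller must ensure the array contains a value different from k so the loop terminates.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the value of i when `while (save[i] == k) i += 1;` exits. */
int find_loop_exit(const int32_t *save, int i, int32_t k)
{
    const unsigned char *base = (const unsigned char *)save;  /* $s6 */

    for (;;) {                                      /* Loop:                  */
        size_t offset = (size_t)i << 2;             /* sll  $t1, $s3, 2       */
        const unsigned char *addr = base + offset;  /* add  $t1, $t1, $s6     */
        int32_t t0 = *(const int32_t *)addr;        /* lw   $t0, 0($t1)       */
        if (t0 != k)                                /* bne  $t0, $s5, Exit    */
            break;
        i += 1;                                     /* addi $s3, $s3, 1       */
    }                                               /* j    Loop              */
    return i;                                       /* Exit:                  */
}
```

The shift by 2 converts the word index i into a byte offset, since memory is byte addressed and each element occupies 4 bytes.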

75 More Conditional Operations
Set result to 1 if a condition is true; otherwise, set to 0
  slt rd, rs, rt          if (rs < rt) rd = 1; else rd = 0;
  slti rt, rs, constant   if (rs < constant) rt = 1; else rt = 0;
Use in combination with beq, bne
  slt $t0, $s1, $s2   # if ($s1 < $s2)
  bne $t0, $zero, L   # branch to L

76 Branch Instruction Design
Why not blt, bge, etc.?
Hardware for <, ≥, ... is slower than for =, ≠
  Combining with branch involves more work per instruction, requiring a slower clock
  All instructions penalized!
beq and bne are the common case
This is a good design compromise

77 Register Usage
FIGURE 2.14 MIPS register conventions. Register 1, called $at, is reserved for the assembler (see Section 2.12), and registers 26-27, called $k0-$k1, are reserved for the operating system. This information is also found in Column 2 of the MIPS Reference Data Card at the front of this book.

78 Procedure Call Instructions
Procedure call: jump and link
  jal ProcedureLabel
  Address of following instruction put in $ra
  Jumps to target address
Procedure return: jump register
  jr $ra
  Copies $ra to program counter
  Can also be used for computed jumps, e.g., for case/switch statements


79 Memory Layout
Text: program code
Static data: global variables
  e.g., static variables in C, constant arrays and strings
  $gp initialized to address allowing ±offsets into this segment
Dynamic data: heap
  e.g., malloc in C, new in Java
Stack: automatic storage

80 Byte/Halfword Operations
Could use bitwise operations
MIPS byte/halfword load/store: string processing is a common case
  lb rt, offset(rs)     lh rt, offset(rs)
    Sign extend to 32 bits in rt
  lbu rt, offset(rs)    lhu rt, offset(rs)
    Zero extend to 32 bits in rt
  sb rt, offset(rs)     sh rt, offset(rs)
    Store just rightmost byte/halfword
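The difference between the sign-extending and zero-extending loads can be shown on a single byte. In this C sketch the function names `load_byte_signed` and `load_byte_unsigned` are illustrative stand-ins for what lb and lbu leave in the destination register.

```c
#include <stdint.h>

/* Models lb: the byte's top bit is copied into the upper 24 bits. */
int32_t load_byte_signed(uint8_t byte)
{
    return (byte & 0x80u) ? (int32_t)byte - 256 : (int32_t)byte;
}

/* Models lbu: the upper 24 bits are filled with zeros. */
uint32_t load_byte_unsigned(uint8_t byte)
{
    return (uint32_t)byte;
}
```

The byte 0xFF therefore reads back as -1 through the signed load and as 255 through the unsigned load, while bytes below 0x80 give the same value either way.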

81 32-bit Constants
Most constants are small: a 16-bit immediate is sufficient
For the occasional 32-bit constant
  lui rt, constant
  Copies 16-bit constant to left 16 bits of rt
  Clears right 16 bits of rt to 0
Example:
  lui $s0, ...
  ori $s0, $s0, ...
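The lui/ori pair can be modeled in C as a shift followed by an OR. This is a sketch with the mnemonics reused as function names; the constants in the usage note (61 and 2304) are chosen here for illustration and are not taken from the slide.

```c
#include <stdint.h>

/* Models lui: places the 16-bit constant in the upper half of the register
 * and clears the lower half. */
uint32_t lui(uint16_t constant)
{
    return (uint32_t)constant << 16;
}

/* Models ori: ORs a zero-extended 16-bit constant into the register. */
uint32_t ori(uint32_t rs, uint16_t constant)
{
    return rs | (uint32_t)constant;
}
```

As an illustration, `ori(lui(61), 2304)` builds 61 * 65536 + 2304 = 4,000,000, a value too large for a single 16-bit immediate.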

82 Branch Addressing
Branch instructions specify opcode, two registers, target address
Most branch targets are near branch: forward or backward
op      rs      rt      constant or address
6 bits  5 bits  5 bits  16 bits
PC-relative addressing
  Target address = PC + offset × 4
  PC already incremented by 4 by this time

83 Jump Addressing
Jump (j and jal) targets could be anywhere in text segment
  Encode full address in instruction
op      address
6 bits  26 bits
(Pseudo)Direct jump addressing
  Target address = PC[31:28] : (address × 4)
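Both target-address rules can be written out as small C functions. In this sketch the names are invented, `pc` is the address of the branch or jump instruction itself, and the jump version assumes the upper four bits are taken from the already-incremented PC, consistent with the branch slide.

```c
#include <stdint.h>

/* PC-relative branch target: (PC + 4) + offset * 4, where offset is the
 * signed 16-bit word displacement stored in the instruction. */
uint32_t branch_target(uint32_t pc, int16_t offset)
{
    return pc + 4u + (uint32_t)((int32_t)offset * 4);
}

/* Pseudo-direct jump target: upper 4 bits of the incremented PC
 * concatenated with the 26-bit address field shifted left by 2. */
uint32_t jump_target(uint32_t pc, uint32_t address26)
{
    return ((pc + 4u) & 0xF0000000u) | ((address26 & 0x03FFFFFFu) << 2);
}
```

A branch at 0x1000 with offset 3 lands at 0x1010, and with offset -1 it lands back on the branch itself, since the displacement is measured from the instruction after the branch.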

84 Addressing Mode Summary

85 MIPS Instruction Formats
FIGURE 2.20 MIPS instruction formats.

86 Translation and Startup

87 MIPS Instruction Set
FIGURE 2.44 The MIPS instruction set covered so far, with the real MIPS instructions on the left and the pseudoinstructions on the right. Appendix B (Section B.10) describes the full MIPS architecture. Figure 2.1 shows more details of the MIPS architecture revealed in this chapter. The information given here is also found in Columns 1 and 2 of the MIPS Reference Data Card at the front of the book.

88 MIPS Instruction Classes
FIGURE 2.45 MIPS instruction classes, examples, correspondence to high-level program language constructs, and percentage of MIPS instructions executed by category for the average SPEC2006 benchmarks. Figure 3.26 in Chapter 3 shows average percentage of the individual MIPS instructions executed.

89 MIPS Instruction Set
FIGURE 3.13 MIPS core architecture. The memory and registers of the MIPS architecture are not included for space reasons, but this section added the Hi and Lo registers to support multiply and divide. MIPS machine language is listed in the MIPS Reference Data Card at the front of this book.

90 FIGURE 3.18 MIPS floating-point architecture revealed thus far. See Appendix B, Section B.10, for more detail. This information is also found in column 2 of the MIPS Reference Data Card at the front of this book.

91 FIGURE 3.24 The MIPS instruction set. This book concentrates on the instructions in the left column. This information is also found in columns 1 and 2 of the MIPS Reference Data Card at the front of this book.

92 Instruction Execution
PC -> instruction memory, fetch instruction
Register numbers -> register file, read registers
Depending on instruction class
  Use ALU to calculate: arithmetic result, memory address for load/store, branch target address
  Access data memory for load/store
  PC <- target address or PC + 4

93 CPU Overview

94 Multiplexers
Can't just join wires together: use multiplexers

95 Control

96 Single Cycle Disadvantages & Advantages
Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  Especially problematic for more complex instructions like floating-point multiply
  (Figure: clock with Cycle 1 and Cycle 2; lw fills its cycle, sw finishes early and the rest of its cycle is waste)
May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
But it is simple and easy to understand
(Irwin, PSU)

97 Building a Datapath
Datapath: elements that process data and addresses in the CPU
  Registers, ALUs, muxes, memories, ...
We will build a MIPS datapath incrementally, refining the overview design

98 Instruction Fetch
(Figure labels: 32-bit register; increment by 4 for next instruction)

99 R-Format Instructions
Read two register operands
Perform arithmetic/logical operation
Write register result

100 Load/Store Instructions
Read register operands
Calculate address using 16-bit offset: use ALU, but sign-extend offset
Load: read memory and update register
Store: write register value to memory

101 Branch Instructions
Read register operands
Compare operands: use ALU, subtract and check Zero output
Calculate target address
  Sign-extend displacement
  Shift left 2 places (word displacement)
  Add to PC + 4 (already calculated by instruction fetch)

102 Branch Instructions
(Figure labels: shift left just reroutes wires; sign-bit wire replicated)

103 R-Type/Load/Store Datapath

104 Full Datapath

105 ALU Control
ALU used for
  Load/Store: F = add
  Branch: F = subtract
  R-type: F depends on funct field
ALU control  Function
0000         AND
0001         OR
0010         add
0110         subtract
0111         set-on-less-than
1100         NOR

106 ALU Control
Assume 2-bit ALUOp derived from opcode
  Combinational logic derives ALU control
opcode  ALUOp  Operation         funct   ALU function      ALU control
lw      00     load word         XXXXXX  add               0010
sw      00     store word        XXXXXX  add               0010
beq     01     branch equal      XXXXXX  subtract          0110
R-type  10     add               100000  add               0010
               subtract          100010  subtract          0110
               AND               100100  AND               0000
               OR                100101  OR                0001
               set-on-less-than  101010  set-on-less-than  0111
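The table above is a small combinational function of ALUOp and funct, which can be sketched in C as follows. The function name `alu_control` and the return value 0xF for combinations outside the table are assumptions made for this sketch; the funct values are the standard MIPS encodings for add, sub, and, or, and slt.

```c
/* Derives the 4-bit ALU control value from the 2-bit ALUOp and the 6-bit
 * funct field. Returns 0xF (a marker chosen for this sketch) for any
 * combination that the table does not cover. */
unsigned alu_control(unsigned alu_op, unsigned funct)
{
    if (alu_op == 0)            /* ALUOp 00: lw/sw, address calculation */
        return 0x2;             /* add */
    if (alu_op == 1)            /* ALUOp 01: beq, equality test */
        return 0x6;             /* subtract */
    if (alu_op != 2)            /* ALUOp 11 is not used */
        return 0xF;

    switch (funct & 0x3Fu) {    /* ALUOp 10: R-type, decode funct */
    case 0x20: return 0x2;      /* add */
    case 0x22: return 0x6;      /* subtract */
    case 0x24: return 0x0;      /* AND */
    case 0x25: return 0x1;      /* OR */
    case 0x2A: return 0x7;      /* set-on-less-than */
    default:   return 0xF;      /* funct not covered by the table */
    }
}
```

Splitting the decoding this way keeps the main control unit small: it only has to produce the 2-bit ALUOp, and the funct field is examined only for R-type instructions.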

107 ALU Control Bits
FIGURE 4.13 The truth table for the 4 ALU control bits (called Operation). The inputs are the ALUOp and function code field. Only the entries for which the ALU control is asserted are shown. Some don't-care entries have been added. For example, the ALUOp does not use the encoding 11, so the truth table can contain entries 1X and X1, rather than 10 and 01. Note that when the function field is used, the first 2 bits (F5 and F4) of these instructions are always 10, so they are don't-care terms and are replaced with XX in the truth table.

108 The Main Control Unit
Control signals derived from instruction
R-type:      0 (31:26) | rs (25:21) | rt (20:16) | rd (15:11) | shamt (10:6) | funct (5:0)
Load/Store:  35 or 43 (31:26) | rs (25:21) | rt (20:16) | address (15:0)
Branch:      4 (31:26) | rs (25:21) | rt (20:16) | address (15:0)
Field roles: bits 31:26 are the opcode; rs is always read; rt is read, except for load; the destination register (rd or rt) is written for R-type and load; the address field is sign-extended and added

109 Datapath With Control


110 Control Signals
FIGURE 4.16 The effect of each of the seven control signals. When the 1-bit control to a two-way multiplexor is asserted, the multiplexor selects the input corresponding to 1. Otherwise, if the control is deasserted, the multiplexor selects the 0 input. Remember that the state elements all have the clock as an implicit input and that the clock is used in controlling writes. Gating the clock externally to a state element can create timing problems. (See Appendix C for further discussion of this problem.)

111 Control Signals Settings FIGURE 4.18 The setting of the control lines is completely determined by the opcode fields of the instruction. The first row of the table corresponds to the R-format instructions (add, sub, AND, OR, and slt). For all these instructions, the source register fields are rs and rt, and the destination register field is rd; this defines how the signals ALUSrc and RegDst are set. Furthermore, an R-type instruction writes a register (RegWrite = 1), but neither reads nor writes data memory. When the Branch control signal is 0, the PC is unconditionally replaced with PC + 4; otherwise, the PC is replaced by the branch target if the Zero output of the ALU is also high. The ALUOp field for R-type instructions is set to 10 to indicate that the ALU control should be generated from the funct field. The second and third rows of this table give the control signal settings for lw and sw. These ALUSrc and ALUOp fields are set to perform the address calculation. The MemRead and MemWrite are set to perform the memory access. Finally, RegDst and RegWrite are set for a load to cause the result to be stored into the rt register. The branch instruction is similar to an R-format operation, since it sends the rs and rt registers to the ALU. The ALUOp field for branch is set for a subtract (ALU control = 01), which is used to test for equality. Notice that the MemtoReg field is irrelevant when the RegWrite signal is 0: since the register is not being written, the value of the data on the register data write port is not used. Thus, the entry MemtoReg in the last two rows of the table is replaced with X for don't care. Don't cares can also be added to RegDst when RegWrite is 0. This type of don't care must be added by the designer, since it depends on knowledge of how the datapath works.

112 R-Type Instruction 112

113 Load Instruction 113

114 Branch-on-Equal Instruction 114

115 Control Function FIGURE 4.22 The control function for the simple single-cycle implementation is completely specified by this truth table. The top half of the table gives the combinations of input signals that correspond to the four opcodes, one per column, that determine the control output settings. (Remember that Op [5:0] corresponds to bits 31:26 of the instruction, which is the op field.) The bottom portion of the table gives the outputs for each of the four opcodes. Thus, the output RegWrite is asserted for two different combinations of the inputs. If we consider only the four opcodes shown in this table, then we can simplify the truth table by using don't cares in the input portion. For example, we can detect an R-format instruction with the expression (NOT Op5) AND (NOT Op2), since this is sufficient to distinguish the R-format instructions from lw, sw, and beq. We do not take advantage of this simplification, since the rest of the MIPS opcodes are used in a full implementation.

116 Implementing Jumps
Jump format: 2 (31:26) | address (25:0)
Jump uses word address
Update PC with concatenation of
  Top 4 bits of old PC
  26-bit jump address
  00
Need an extra control signal decoded from opcode
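The concatenation for the jump target can be sketched as bit operations. This is a minimal sketch; the function name `jump_target` is hypothetical, and it assumes the `pc` argument is whichever PC value supplies the top 4 bits (implementations typically use the already incremented PC).

```python
def jump_target(pc, instruction):
    """Form the 32-bit jump target: top 4 bits of pc, 26-bit address field, then 00."""
    address26 = instruction & 0x03FFFFFF       # bits 25:0 of the jump instruction
    upper4 = pc & 0xF0000000                   # bits 31:28 of the PC
    return upper4 | (address26 << 2)           # shifting left by 2 appends the 00

# Hypothetical example: opcode 2 (jump) with word address 0x100.
j_instruction = (2 << 26) | 0x100
target = jump_target(0x40000008, j_instruction)
```

Here the word address 0x100 becomes byte address 0x400 inside the 256 MB region selected by the upper PC bits.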

117 Datapath With Jumps Added 117

119 Pipelining the MIPS ISA What makes it easy all instructions are the same length (32 bits) can fetch in the 1st stage and decode in the 2nd stage few instruction formats (three) with symmetry across formats can begin reading register file in 2nd stage memory operations occur only in loads and stores can use the execute stage to calculate memory addresses each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB) operands must be aligned in memory so a single data transfer takes only one data memory access Irwin, PSU,

120 MIPS Pipeline Five stages, one step per stage 1. IF: Instruction fetch from memory 2. ID: Instruction decode & register read 3. EX: Execute operation or calculate address 4. MEM: Access memory operand 5. WB: Write result back to register 120

121 Pipeline Performance Single-cycle (Tc = 800ps) Pipelined (Tc = 200ps)

122 Pipeline Timeline FIGURE 4.28 Graphical representation of the instruction pipeline, similar in spirit to the laundry pipeline figure. Here we use symbols representing the physical resources with the abbreviations for pipeline stages used throughout the chapter. The symbols for the five stages: IF for the instruction fetch stage, with the box representing instruction memory; ID for the instruction decode/register file read stage, with the drawing showing the register file being read; EX for the execution stage, with the drawing representing the ALU; MEM for the memory access stage, with the box representing data memory; and WB for the write-back stage, with the drawing showing the register file being written. The shading indicates the element is used by the instruction. Hence, MEM has a white background because add does not access the data memory. Shading on the right half of the register file or memory means the element is read in that stage, and shading of the left half means it is written in that stage. Hence the right half of ID is shaded in the second stage because the register file is read, and the left half of WB is shaded in the fifth stage because the register file is written.

123 Pipelining and ISA Design MIPS ISA designed for pipelining All instructions are 32 bits Easier to fetch and decode in one cycle cf. x86: 1- to 17-byte instructions Few and regular instruction formats Can decode and read registers in one step Load/store addressing Can calculate address in 3rd stage, access memory in 4th stage Alignment of memory operands Memory access takes only one cycle

124 Hazards Situations that prevent starting the next instruction in the next cycle Structure hazards A required resource is busy Data hazard Need to wait for previous instruction to complete its data read/write Control hazard Deciding on control action depends on previous instruction 124

125 Structure Hazards Conflict for use of a resource In MIPS pipeline with a single memory Load/store requires data access Instruction fetch would have to stall for that cycle Would cause a pipeline bubble Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches 125

126 Data Hazards
An instruction depends on completion of data access by a previous instruction
  add $s0, $t0, $t1
  sub $t2, $s0, $t3

127 Forwarding (aka Bypassing) Use result when it is computed Don't wait for it to be stored in a register Requires extra connections in the datapath

128 Load-Use Data Hazard Can't always avoid stalls by forwarding If value not computed when needed Can't forward backward in time!

129 Code Scheduling to Avoid Stalls
Reorder code to avoid use of load result in the next instruction
C code for A = B + E; C = B + F;

Original order (13 cycles):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  (stall)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  lw  $t4, 8($t0)
  (stall)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)

Reordered (11 cycles):
  lw  $t1, 0($t0)
  lw  $t2, 4($t0)
  lw  $t4, 8($t0)
  add $t3, $t1, $t2
  sw  $t3, 12($t0)
  add $t5, $t1, $t4
  sw  $t5, 16($t0)
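The 13-cycle and 11-cycle counts can be checked with a rough model. This sketch assumes a 5-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses the result of the immediately preceding lw; the tuple layout and the name `count_cycles` are hypothetical.

```python
def count_cycles(program, stages=5):
    """Count cycles for a straight-line program: fill time plus load-use stalls.

    Each instruction is (opcode, destination register or None, source registers).
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, destination, _ = previous
        if opcode == "lw" and destination in current[2]:
            stalls += 1                      # load result is needed one cycle too early
    return len(program) + (stages - 1) + stalls

original = [
    ("lw", "$t1", ["$t0"]), ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]), ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]), ("sw", None, ["$t5", "$t0"]),
]
reordered = [
    ("lw", "$t1", ["$t0"]), ("lw", "$t2", ["$t0"]), ("lw", "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]), ("sw", None, ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]), ("sw", None, ["$t5", "$t0"]),
]
```

Moving the third lw up separates each load from its first use, so both stalls disappear.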

130 Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can't always fetch correct instruction Still working on ID stage of branch In MIPS pipeline Need to compare registers and compute target early in the pipeline Add hardware to do it in ID stage

131 Stall on Branch Wait until branch outcome determined before fetching next instruction 131

132 Branch Prediction Longer pipelines can't readily determine branch outcome early Stall penalty becomes unacceptable Predict outcome of branch Only stall if prediction is wrong In MIPS pipeline Can predict branches not taken Fetch instruction after branch, with no delay

133 MIPS with Predict Not Taken Prediction correct Prediction incorrect 133

134 More-Realistic Branch Prediction Static branch prediction Based on typical branch behavior Example: loop and if-statement branches Predict backward branches taken Predict forward branches not taken Dynamic branch prediction Hardware measures actual branch behavior e.g., record recent history of each branch Assume future behavior will continue the trend When wrong, stall while re-fetching, and update history 134

135 MIPS Pipelined Datapath MEM Right-to-left flow leads to hazards WB 135

136 Multiple Single-Cycle Datapaths FIGURE 4.34 Instructions being executed using the single-cycle datapath in Figure 4.33, assuming pipelined execution. Similar to Figures 4.28 through 4.30, this figure pretends that each instruction has its own datapath, and shades each portion according to use. Unlike those figures, each stage is labeled by the physical resource used in that stage, corresponding to the portions of the datapath in that figure. IM represents the instruction memory and the PC in the instruction fetch stage, Reg stands for the register file and sign extender in the instruction decode/register file read stage (ID), and so on. To maintain proper time order, this stylized datapath breaks the register file into two logical parts: registers read during register fetch (ID) and registers written during write back (WB). This dual use is represented by drawing the unshaded left half of the register file using dashed lines in the ID stage, when it is not being written, and the unshaded right half in dashed lines in the WB stage, when it is not being read. As before, we assume the register file is written in the first half of the clock cycle and the register file is read during the second half.

137 Pipeline registers Need registers between stages To hold information produced in previous cycle 137

138 IF for Load, Store, 138

139 ID for Load, Store, 139

140 EX for Load 140

141 MEM for Load 141

142 WB for Load Wrong register number 142

143 Corrected Datapath for Load 143

144 EX for Store 144

145 MEM for Store 145

146 WB for Store 146

147 Multi-Cycle Pipeline Diagram Form showing resource usage 147

148 Multi-Cycle Pipeline Diagram Traditional form 148

149 Single-Cycle Pipeline Diagram State of pipeline in a given cycle 149

150 Pipelined Control (Simplified) 150

151 Pipelined Control Control signals derived from instruction As in single-cycle implementation 151

152 Pipelined Control 152

153 Dependencies & Forwarding 153

154 Detecting the Need to Forward
Pass register numbers along pipeline
  e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
ALU operand register numbers in EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt
Data hazards when
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs (fwd from EX/MEM pipeline reg)
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt (fwd from EX/MEM pipeline reg)
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs (fwd from MEM/WB pipeline reg)
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt (fwd from MEM/WB pipeline reg)

155 Detecting the Need to Forward But only if forwarding instruction will write to a register! EX/MEM.RegWrite, MEM/WB.RegWrite And only if Rd for that instruction is not $zero EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

156 Forwarding Paths 156

157 Control Values FIGURE 4.55 The control values for the forwarding multiplexors in Figure 4.54. The signed immediate that is another input to the ALU is described in the Elaboration at the end of this section.

158 Forwarding Conditions
EX hazard
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM hazard
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
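The forwarding conditions translate almost directly into code. This is a sketch under assumptions: the dictionary keys and the name `forwarding_unit` are hypothetical, and testing the EX hazard first (so the most recent result wins when both stages match) is a design choice added here, not something stated on the slide.

```python
def forwarding_unit(ex_mem, mem_wb, id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) as 2-bit multiplexor select values.

    ex_mem and mem_wb are dicts with "RegWrite" (bool) and "Rd" (register number).
    """
    def select(source_register):
        # EX hazard: result is in the EX/MEM pipeline register.
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == source_register:
            return 0b10
        # MEM hazard: result is in the MEM/WB pipeline register.
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == source_register:
            return 0b01
        return 0b00                           # no hazard: use the register file value

    return select(id_ex_rs), select(id_ex_rt)
```

For instance, with an EX/MEM instruction writing register 1 and a MEM/WB instruction writing register 2, an ALU instruction reading registers 1 and 2 gets ForwardA = 10 and ForwardB = 01.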

159 Datapath with Forwarding 159

160 Load-Use Data Hazard Need to stall for one cycle 160

161 Load-Use Hazard Detection Check when using instruction is decoded in ID stage ALU operand register numbers in ID stage are given by IF/ID.RegisterRs, IF/ID.RegisterRt Load-use hazard when ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt)) If detected, stall and insert bubble 161
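The load-use check is a single Boolean expression. A minimal sketch, with a hypothetical function name and plain integers standing in for the pipeline register fields:

```python
def load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load currently in EX."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)

# lw $2, 20($1) in EX followed by and $4, $2, $5 in ID: stall and insert a bubble.
must_stall = load_use_hazard(True, 2, 2, 5)
```

When the function returns True, the hazard unit holds the PC and IF/ID register and zeroes the control values entering ID/EX.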

162 How to Stall the Pipeline Force control values in ID/EX register to 0 EX, MEM and WB do nop (no-operation) Prevent update of PC and IF/ID register Using instruction is decoded again Following instruction is fetched again 1-cycle stall allows MEM to read data for lw Can subsequently forward to EX stage 162

163 Stall/Bubble in the Pipeline Stall inserted here 163

164 Stall/Bubble in the Pipeline Or, more accurately 164

165 Datapath with Hazard Detection 165

166 Branch Hazards If branch outcome determined in MEM Flush these instructions (Set control values to 0) PC 166

167 Reducing Branch Delay
Move hardware to determine outcome to ID stage
  Target address adder
  Register comparator
Example: branch taken
  36: sub $10, $4, $8
  40: beq $1, $3, 7
  44: and $12, $2, $5
  48: or  $13, $2, $6
  52: add $14, $4, $2
  56: slt $15, $6, $7
  ...
  72: lw  $4, 50($7)

168 Example: Branch Taken 168

169 Example: Branch Taken 169

170 Data Hazards for Branches
If a comparison register is a destination of 2nd or 3rd preceding ALU instruction
  add $1, $2, $3        IF ID EX MEM WB
  add $4, $5, $6           IF ID EX MEM WB
  ...                         IF ID EX MEM WB
  beq $1, $4, target             IF ID EX MEM WB
Can resolve using forwarding

171 Data Hazards for Branches
If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
Need 1 stall cycle
  lw $1, addr           IF ID EX MEM WB
  add $4, $5, $6           IF ID EX MEM WB
  beq stalled                 IF ID
  beq $1, $4, target             ID EX MEM WB

172 Data Hazards for Branches
If a comparison register is a destination of immediately preceding load instruction
Need 2 stall cycles
  lw $1, addr           IF ID EX MEM WB
  beq stalled              IF ID
  beq stalled                 ID
  beq $1, $0, target             ID EX MEM WB

173 Dynamic Branch Prediction In deeper and superscalar pipelines, branch penalty is more significant Use dynamic prediction Branch prediction buffer (aka branch history table) Indexed by recent branch instruction addresses Stores outcome (taken/not taken) To execute a branch Check table, expect the same outcome Start fetching from fall-through or target If wrong, flush pipeline and flip prediction 173
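A branch prediction buffer with one bit per entry can be sketched as follows. The class name, table size, and word-address indexing are assumptions for illustration; real predictors often use 2-bit counters instead.

```python
class BranchHistoryTable:
    """1-bit predictor: each entry remembers the last outcome of branches mapping to it."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.taken = [False] * entries        # initial prediction: not taken

    def _index(self, pc):
        return (pc >> 2) % self.entries       # index by low bits of the word address

    def predict(self, pc):
        return self.taken[self._index(pc)]

    def update(self, pc, outcome):
        """Record the actual outcome; return True if the prediction was wrong."""
        index = self._index(pc)
        mispredicted = self.taken[index] != outcome
        self.taken[index] = outcome           # "flip prediction" when wrong
        return mispredicted
```

A loop branch that is taken nine times and then falls through is mispredicted twice per pass with this scheme: once on loop entry and once on loop exit.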

174 Exceptions and Interrupts Unexpected events requiring change in flow of control Different ISAs use the terms differently Exception: arises within the CPU (e.g., undefined opcode, overflow, syscall, ...) Interrupt: from an external I/O controller Dealing with them without sacrificing performance is hard

175 Two Types of Exceptions Interrupts: asynchronous to program execution; caused by external events; may be handled between instructions, so can let the instructions currently active in the pipeline complete before passing control to the OS interrupt handler; simply suspend and resume user program Traps (Exceptions): synchronous to program execution; caused by internal events; condition must be remedied by the trap handler for that instruction, so must stop the offending instruction midstream in the pipeline and pass control to the OS trap handler; the offending instruction may be retried (or simulated by the OS) and the program may continue, or it may be aborted Irwin, PSU,

176 Handling Exceptions In MIPS, exceptions managed by a System Control Coprocessor (CP0) Save PC of offending (or interrupted) instruction In MIPS: Exception Program Counter (EPC) Save indication of the problem In MIPS: Cause register We'll assume 1-bit: 0 for undefined opcode, 1 for overflow Jump to handler at

177 An Alternate Mechanism Vectored Interrupts Handler address determined by the cause Example: Undefined opcode: C Overflow: C : C Instructions either Deal with the interrupt, or Jump to real handler 177

178 Handler Actions Read cause, and transfer to relevant handler Determine action required If restartable Take corrective action use EPC to return to program Otherwise Terminate program Report error using EPC, cause, 178

179 Exceptions in a Pipeline Another form of control hazard Consider overflow on add in EX stage add $1, $2, $1 Prevent $1 from being clobbered Complete previous instructions Flush add and subsequent instructions Set Cause and EPC register values Transfer control to handler Similar to mispredicted branch Use much of the same hardware 179

180 Pipeline with Exceptions 180

181 Exception Properties Restartable exceptions Pipeline can flush the instruction Handler executes, then returns to the instruction Refetched and executed from scratch PC saved in EPC register Identifies causing instruction Actually PC + 4 is saved Handler must adjust 181

182 Multiple Exceptions Pipelining overlaps multiple instructions Could have multiple exceptions at once Simple approach: deal with exception from earliest instruction Flush subsequent instructions Precise exceptions In complex pipelines Multiple instructions issued per cycle Out-of-order completion Maintaining precise exceptions is difficult! 182

183 Imprecise Exceptions Just stop pipeline and save state Including exception cause(s) Let the handler work out Which instruction(s) had exceptions Which to complete or flush May require manual completion Simplifies hardware, but more complex handler software Not feasible for complex multiple-issue out-of-order pipelines 183

184 Memory Technology Static RAM (SRAM): 0.5ns to 2.5ns, $2000 to $5000 per GB Dynamic RAM (DRAM): 50ns to 70ns, $20 to $75 per GB Magnetic disk: 5ms to 20ms, $0.20 to $2 per GB Ideal memory: access time of SRAM, capacity and cost/GB of disk

185 Review: Major Components of a Computer Processor Devices Control Memory Input Datapath Output Secondary Memory (Disk) Main Memory Cache Irwin, PSU,

186 Advanced Embedded System Design with FPGA Lecture 3 Memory and DMA

187 Memory Arrays 2011 David Money Harris 187

188 The Memory Hierarchy: Why Does it Work? Temporal Locality (locality in time) If a memory location is referenced then it will tend to be referenced again soon Keep most recently accessed data items closer to the processor Spatial Locality (locality in space) If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon Move blocks consisting of contiguous words closer to the processor Irwin, PSU,

189 Taking Advantage of Locality Memory hierarchy Store everything on disk Copy recently accessed (and nearby) items from disk to smaller DRAM memory Main memory Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory Cache memory attached to CPU 189

190 Memory Hierarchy FIGURE 5.1 The basic structure of a memory hierarchy. By implementing the memory system as a hierarchy, the user has the illusion of a memory that is as large as the largest level of the hierarchy, but can be accessed as if it were all built from the fastest memory. Flash memory has replaced disks in many embedded devices, and may lead to a new level in the storage hierarchy for desktop and server computers; see Section

191 A Typical Memory Hierarchy
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology
On-Chip Components: Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache
Then: Second Level Cache (SRAM), Main Memory (DRAM), Secondary Memory (Disk)
Speed (%cycles): ½'s, 1's, 10's, 100's, 10,000's
Size (bytes): 100's, 10K's, M's, G's, T's
Cost: highest to lowest
Irwin, PSU,

192 The Memory Hierarchy: Terminology Block (or line): the minimum unit of information that is present (or not) in a cache Hit Rate: the fraction of memory accesses found in a level of the memory hierarchy Hit Time: Time to access that level which consists of Time to access the block + Time to determine hit/miss Miss Rate: the fraction of memory accesses not found in a level of the memory hierarchy 1 - (Hit Rate) Miss Penalty: Time to replace a block in that level with the corresponding block from a lower level which consists of Time to access the block in the lower level + Time to transmit that block to the level that experienced the miss + Time to insert the block in that level + Time to pass the block to the requestor Irwin, PSU, 2008 Hit Time << Miss Penalty 192

193 Characteristics of the Memory Hierarchy
Increasing distance from the processor in access time: Processor, L1$, L2$, Main Memory, Secondary Memory
Transfer units: 4-8 bytes (word) between processor and L1$; 8-32 bytes (block) between L1$ and L2$; 1 to 4 blocks between L2$ and main memory; 1,024+ bytes (disk sector = page) between main memory and secondary memory
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM
(Relative) size of the memory at each level
Irwin, PSU,

194 How is the Hierarchy Managed?
registers ↔ memory: by compiler (programmer?)
cache ↔ main memory: by the cache controller hardware
main memory ↔ disks: by the operating system (virtual memory), with virtual to physical address mapping assisted by the hardware (TLB); by the programmer (files)
Irwin, PSU,

195 Cache Basics Two questions to answer (in hardware): Q1: How do we know if a data item is in the cache? Q2: If it is, how do we find it? Direct mapped Each memory block is mapped to exactly one block in the cache lots of lower level blocks must share blocks in the cache Address mapping (to answer Q2): (block address) modulo (# of blocks in the cache) Have a tag associated with each cache block that contains the address information (the upper portion of the address) required to identify the block (to answer Q1) Irwin, PSU,

196 Direct Mapped Cache Location determined by address Direct mapped: only one choice (Block address) modulo (#Blocks in cache) #Blocks is a power of 2 Use low-order address bits 196

197 Tags and Valid Bits How do we know which particular block is stored in a cache location? Store block address as well as the data Actually, only need the high-order bits Called the tag What if there is no data in a location? Valid bit: 1 = present, 0 = not present Initially 0 197

198 Cache Example
8 blocks, 1 word/block, direct mapped
Initial state

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    N
111    N

199 Cache Example

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Miss      110

Index  V  Tag  Data
000    N
001    N
010    N
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

200 Cache Example

Word addr  Binary addr  Hit/miss  Cache block
26         11 010       Miss      010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

201 Cache Example

Word addr  Binary addr  Hit/miss  Cache block
22         10 110       Hit       110
26         11 010       Hit       010

Index  V  Tag  Data
000    N
001    N
010    Y  11   Mem[11010]
011    N
100    N
101    N
110    Y  10   Mem[10110]
111    N

202 Cache Example

Word addr  Binary addr  Hit/miss  Cache block
16         10 000       Miss      000
3          00 011       Miss      011
16         10 000       Hit       000

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  11   Mem[11010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N

203 Cache Example

Word addr  Binary addr  Hit/miss  Cache block
18         10 010       Miss      010

Index  V  Tag  Data
000    Y  10   Mem[10000]
001    N
010    Y  10   Mem[10010]
011    Y  00   Mem[00011]
100    N
101    N
110    Y  10   Mem[10110]
111    N
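The cache example can be replayed with a few lines of code. This is a sketch only: the access sequence 22, 26, 22, 26, 16, 3, 16, 18 is an assumption reconstructed from the Mem[...] entries in the tables, and the function name is hypothetical.

```python
def simulate_direct_mapped(word_addresses, num_blocks=8):
    """Replay word addresses on a direct-mapped cache with 1 word per block."""
    valid = [False] * num_blocks
    tags = [None] * num_blocks
    results = []
    for address in word_addresses:
        index = address % num_blocks          # low-order bits select the cache block
        tag = address // num_blocks           # high-order bits identify the memory block
        hit = valid[index] and tags[index] == tag
        results.append("hit" if hit else "miss")
        if not hit:                           # on a miss, fetch and replace the block
            valid[index] = True
            tags[index] = tag
    return results, tags

# Assumed access sequence (reconstructed, see lead-in).
results, final_tags = simulate_direct_mapped([22, 26, 22, 26, 16, 3, 16, 18])
```

The last access (18) maps to index 010 and replaces the block holding address 26, which is why the tag at that index changes from 11 to 10.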

204 Address Subdivision 204

205 Block Size Considerations Larger blocks should reduce miss rate Due to spatial locality But in a fixed-sized cache Larger blocks ⇒ fewer of them More competition ⇒ increased miss rate Larger blocks ⇒ pollution Larger miss penalty Can override benefit of reduced miss rate Early restart and critical-word-first can help

206 Handling Cache Hits
Read hits (I$ and D$)
  this is what we want!
Write hits (D$ only)
  require the cache and memory to be consistent
    always write the data into both the cache block and the next level in the memory hierarchy (write-through)
    writes run at the speed of the next level in the memory hierarchy, so slow! Or can use a write buffer and stall only if the write buffer is full
  allow cache and memory to be inconsistent
    write the data only into the cache block (write-back the cache block to the next level in the memory hierarchy when that cache block is evicted)
    need a dirty bit for each data cache block to tell if it needs to be written back to memory when it is evicted; can use a write buffer to help buffer write-backs of dirty blocks
Irwin, PSU,

207 Cache Misses On cache hit, CPU proceeds normally On cache miss Stall the CPU pipeline Fetch block from next level of hierarchy Instruction cache miss Restart instruction fetch Data cache miss Complete data access 207

208 Write-Through On data-write hit, could just update the block in cache But then cache and memory would be inconsistent Write through: also update memory But makes writes take longer e.g., if base CPI = 1, 10% of instructions are stores, write to memory takes 100 cycles Effective CPI = 1 + 0.1 × 100 = 11 Solution: write buffer Holds data waiting to be written to memory CPU continues immediately Only stalls on write if write buffer is already full

209 Write-Back Alternative: On data-write hit, just update the block in cache Keep track of whether each block is dirty When a dirty block is replaced Write it back to memory Can use a write buffer to allow replacing block to be read first 209

210 Write Allocation What should happen on a write miss? Alternatives for write-through Allocate on miss: fetch the block Write around: don't fetch the block Since programs often write a whole block before reading it (e.g., initialization) For write-back Usually fetch the block

211 Main Memory Supporting Caches Use DRAMs for main memory Fixed width (e.g., 1 word) Connected by fixed-width clocked bus Bus clock is typically slower than CPU clock Example cache block read 1 bus cycle for address transfer 15 bus cycles per DRAM access 1 bus cycle per data transfer For 4-word block, 1-word-wide DRAM Miss penalty = 1 + 4×15 + 4×1 = 65 bus cycles Bandwidth = 16 bytes / 65 cycles = 0.25 B/cycle

212 Increasing Memory Bandwidth 4-word wide memory Miss penalty = 1 + 15 + 1 = 17 bus cycles Bandwidth = 16 bytes / 17 cycles = 0.94 B/cycle 4-bank interleaved memory Miss penalty = 1 + 15 + 4×1 = 20 bus cycles Bandwidth = 16 bytes / 20 cycles = 0.8 B/cycle
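The three miss penalties follow one pattern and can be computed side by side. A minimal sketch using the slide's timing (1 bus cycle for the address, 15 per DRAM access, 1 per data transfer); the function name and the organization labels are hypothetical.

```python
def miss_penalty(words, organization, address_cycles=1, dram_cycles=15, transfer_cycles=1):
    """Bus cycles needed to fetch a block of `words` words from main memory."""
    if organization == "one-word-wide":
        # Every word needs its own DRAM access and its own transfer.
        return address_cycles + words * dram_cycles + words * transfer_cycles
    if organization == "wide":
        # Memory and bus are as wide as the block: one access, one transfer.
        return address_cycles + dram_cycles + transfer_cycles
    if organization == "interleaved":
        # Banks are accessed in parallel, words are transferred one at a time.
        return address_cycles + dram_cycles + words * transfer_cycles
    raise ValueError("unknown organization: " + organization)

def bandwidth(words, organization, bytes_per_word=4):
    """Bytes delivered per bus cycle during a block fetch."""
    return words * bytes_per_word / miss_penalty(words, organization)
```

Interleaving recovers most of the benefit of a wide memory (0.8 versus 0.94 B/cycle) without widening the bus.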

213 Associative Caches Fully associative Allow a given block to go in any cache entry Requires all entries to be searched at once Comparator per entry (expensive) n-way set associative Each set contains n entries Block number determines which set (Block number) modulo (#Sets in cache) Search all entries in a given set at once n comparators (less expensive) 213

214 Associative Cache Example 214

215 Spectrum of Associativity For a cache with 8 entries 215

216 How Much Associativity Increased associativity decreases miss rate But with diminishing returns Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000 1-way: 10.3% 2-way: 8.6% 4-way: 8.3% 8-way: 8.1%

217 Set Associative Cache Organization 217

218 Replacement Policy Direct mapped: no choice Set associative Prefer non-valid entry, if there is one Otherwise, choose among entries in the set Least-recently used (LRU) Choose the one unused for the longest time Simple for 2-way, manageable for 4-way, too hard beyond that Random Gives approximately the same performance as LRU for high associativity 218
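LRU replacement within a set can be sketched with an ordered dictionary per set. This is an illustrative software model, not how hardware tracks recency; the class name and the access sequence in the usage note are hypothetical.

```python
from collections import OrderedDict

class SetAssociativeCache:
    """n-way set-associative cache of block numbers with LRU replacement."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set: oldest (least recently used) tag comes first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_number):
        """Return True on a hit, False on a miss (the block is then brought in)."""
        entries = self.sets[block_number % self.num_sets]
        tag = block_number // self.num_sets
        if tag in entries:
            entries.move_to_end(tag)          # mark as most recently used
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)       # evict the least recently used entry
        entries[tag] = True                   # fills a non-valid (empty) way first
        return False
```

With 2 sets of 2 ways, the block sequence 0, 8, 0, 6, 8 gives miss, miss, hit, miss, miss: block 6 evicts block 8 because block 0 was used more recently.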

219 IO Introduction I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections 219

220 I/O System Characteristics Dependability is important Particularly for storage devices Performance measures Latency (response time) Throughput (bandwidth) Desktops & embedded systems Mainly interested in response time & diversity of devices Servers Mainly interested in throughput & expandability of devices 220

221 Bus Signals and Synchronization Data lines Carry address and data Multiplexed or separate Control lines Indicate data type, synchronize transactions Synchronous Uses a bus clock Asynchronous Uses request/acknowledge control lines for handshaking 221

222 Asynchronous Bus Handshaking Protocol
Output (read) data from memory to an I/O device
[Timing diagram: ReadReq, Data (addr, then data), Ack, and DataRdy signals, with transitions numbered 1 to 7]
I/O device signals a request by raising ReadReq and putting the addr on the data lines
1. Memory sees ReadReq, reads addr from data lines, and raises Ack
2. I/O device sees Ack and releases the ReadReq and data lines
3. Memory sees ReadReq go low and drops Ack
4. When memory has data ready, it places it on data lines and raises DataRdy
5. I/O device sees DataRdy, reads the data from data lines, and raises Ack
6. Memory sees Ack, releases the data lines, and drops DataRdy
7. I/O device sees DataRdy go low and drops Ack
Irwin, PSU,

223 DMA/Cache Interaction If DMA writes to a memory block that is cached Cached copy becomes stale If write-back cache has dirty block, and DMA reads memory block Reads stale data Need to ensure cache coherence Flush blocks from cache if they will be used for DMA Or use non-cacheable memory locations for I/O 223

224 Direct Memory Access Using the processor to service data from I/O devices causes performance degradation. Using the processor for memory-to-memory transfers also causes performance degradation. Direct Memory Access (DMA) relieves the processor of these tasks.

225 DMA Types
Cycle Stealing: During times when the bus is idle, the transfer to or from I/O or memory is done with little disruption to the processor. Favors instruction execution over I/O and memory transactions.
Burst Mode: The processor bus activity is suspended and a block of data is transferred to or from I/O and memory. Favors I/O and memory transactions over instruction execution.
Important tuning decisions are made here.

226 DMA Tuning If the processor system does not have a cache, or has a cache with a high miss rate, cycle stealing will likely be chosen, especially if the I/O bandwidth is low. If the processor system has a cache with a high hit rate, burst (block) mode will likely be chosen.

227 Memory and IO Management This is a case where memory contents and I/O operations are taking place out of the main line of processing Indications have to be made to the processor so it can keep track of memory and I/O status. 227

228 Descriptors Descriptors are tags that are associated with memory blocks to indicate status so a processor can manage the contents without reading them. Frequently used in DMA I/O systems.

229 Memory Allocation Memory allocation is done to remove the chance of collisions between I/O and memory accesses. Can be static or dynamic (or virtual).

230 BASIC DMA OPERATION Two control signals are used to request and acknowledge a direct memory access (DMA) transfer in the microprocessor-based system. The HOLD pin is an input used to request a DMA action. The HLDA pin is an output that acknowledges the DMA action. Figure 13-1 shows the timing that is typically found on these two DMA control pins. Copyright 2009 Barry B Brey, Pearson Education

231 Figure 13-1 HOLD and HLDA timing for the microprocessor. HOLD is sampled in any clocking cycle. When the processor recognizes the hold, it stops executing software and enters hold cycles. The HOLD input has higher priority than INTR or NMI. The only microprocessor pin that has a higher priority than HOLD is the RESET pin.

232 HLDA becomes active to indicate the processor has placed its buses in the high-impedance state. As can be seen in the timing diagram, there are a few clock cycles between the time that HOLD changes and the time that HLDA changes. The HLDA output is a signal to the requesting device that the processor has relinquished control of its memory and I/O space. One could call the HOLD input a DMA request input and the HLDA output a DMA grant signal.

233 Basic DMA Definitions Direct memory accesses normally occur between an I/O device and memory without the use of the microprocessor. A DMA read transfers data from the memory to the I/O device. A DMA write transfers data from an I/O device to memory. Memory and I/O are controlled simultaneously, which is why the system contains separate memory and I/O control signals.

234 A DMA read causes the MRDC and IOWC signals to activate simultaneously, transferring data from memory to the I/O device. A DMA write causes the MWTC and IORC signals to both activate. The 8086/8088 require a controller or circuit such as the one shown in Fig. 13-2 for control bus signal generation. The DMA controller provides memory with its address, and a controller signal (DACK) selects the I/O device during the transfer.
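The pairing of control strobes described above can be written down as a small lookup. This is only an illustrative sketch: the type and function names (`dma_dir`, `dma_strobes`, `dma_strobes_for`) are hypothetical, and each flag means "asserted", regardless of the electrical polarity of the real pin.

```c
#include <stdbool.h>

/* Direction of a DMA transfer, named from the memory's point of view. */
typedef enum { DMA_READ, DMA_WRITE } dma_dir;

/* Which of the four system control strobes are asserted. */
typedef struct {
    bool mrdc; /* memory read command  */
    bool mwtc; /* memory write command */
    bool iorc; /* I/O read command     */
    bool iowc; /* I/O write command    */
} dma_strobes;

/* A DMA read pairs MRDC with IOWC; a DMA write pairs MWTC with IORC. */
dma_strobes dma_strobes_for(dma_dir dir)
{
    dma_strobes s = { false, false, false, false };
    if (dir == DMA_READ) {
        s.mrdc = true;  /* memory drives the data bus   */
        s.iowc = true;  /* I/O device latches the data  */
    } else {
        s.iorc = true;  /* I/O device drives the data bus */
        s.mwtc = true;  /* memory stores the data         */
    }
    return s;
}
```

In both cases one memory strobe and one I/O strobe are active at once, which is the reason the system needs separate memory and I/O control signals.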

235 Figure 13-2 A circuit that generates system control signals in a DMA environment.

236 Data transfer speed is determined by the speed of the memory device or of the DMA controller. If memory speed is 50 ns, DMA transfers occur at rates up to 1/50 ns, or 20 M bytes per second. If the DMA controller functions at a maximum rate of 15 MHz with 50 ns memory, the maximum transfer rate is 15 MHz (15 M bytes per second), because the DMA controller is slower than the memory. In many cases, the DMA controller slows the speed of the system when transfers occur.
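The arithmetic on this slide amounts to taking the slower of the two limits. A minimal sketch, assuming one byte moves per transfer and using a hypothetical helper name `dma_max_rate_hz`:

```c
/* Maximum DMA transfer rate in transfers per second.
   mem_cycle_ns  : memory cycle time in nanoseconds
   controller_hz : maximum transfer rate of the DMA controller
   The slower of the two sets the limit. */
double dma_max_rate_hz(double mem_cycle_ns, double controller_hz)
{
    double mem_hz = 1e9 / mem_cycle_ns; /* 50 ns -> 20,000,000 per second */
    return (mem_hz < controller_hz) ? mem_hz : controller_hz;
}
```

With 50 ns memory and a 15 MHz controller the function returns 15,000,000, matching the slide; with a much faster controller the 20 M bytes per second memory limit applies instead.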

237 The switch to serial data transfers in modern systems has made DMA less important. The serial PCI Express bus transfers data at rates exceeding DMA transfers. The SATA (serial ATA) interface for disk drives uses serial transfers at the rate of 300 M bytes per second (a 3 Gbps line rate) and has replaced DMA transfers for hard disks. Serial transfers between components on main-boards can approach 20 Gbps for the PCI Express connection.

238 13-2 THE 8237 DMA CONTROLLER The 8237 supplies memory and I/O with control signals and memory address information during the DMA transfer. It is actually a special-purpose microprocessor whose job is high-speed data transfer between memory and I/O. Figure 13-3 shows the pin-out and block diagram of the 8237 programmable DMA controller.

239 Figure 13-3 The 8237A-5 programmable DMA controller. (a) Block diagram and (b) pin-out. (Courtesy of Intel Corporation.)

240 The 8237 is not a discrete component in modern microprocessor-based systems; it appears within many system controller chip sets. The 8237 is a four-channel device compatible with the 8086/8088, adequate for small systems and expandable to any number of DMA channel inputs. The 8237 is capable of DMA transfers at rates up to 1.6M bytes per second. Each channel is capable of addressing a full 64K-byte section of memory and can transfer up to 64K bytes with a single programming.

241 8237 Pin Definitions CLK: The clock input is connected to the system clock signal as long as that signal is 5 MHz or less. In the 8086/8088 system, the clock must be inverted for proper operation of the 8237.

242 8237 Pin Definitions CS: Chip select enables the 8237 for programming. The CS pin is normally connected to the output of a decoder. The decoder does not use the 8086/8088 control signal IO/M (M/IO), because the DMA system provides the new, separate memory and I/O control signals (MEMR, MEMW, IOR, and IOW).

243 8237 Pin Definitions RESET: The reset pin clears the command, status, request, and temporary registers. It also clears the first/last flip-flop and sets the mask register. This input primes the 8237 so it is disabled until programmed otherwise.

244 8237 Pin Definitions READY: A logic 0 on the ready input causes the 8237 to enter wait states for slower memory components. HLDA: A hold acknowledge signals the 8237 that the microprocessor has relinquished control of the address, data, and control buses.

245 8237 Pin Definitions DREQ0-DREQ3: DMA request inputs are used to request a transfer for each of the four DMA channels. The polarity of these inputs is programmable, so they are either active-high or active-low inputs. DB0-DB7: Data bus pins are connected to the processor data bus connections and used during the programming of the DMA controller.

246 8237 Pin Definitions IOR: I/O read is a bidirectional pin used during programming and during a DMA write cycle. IOW: I/O write is a bidirectional pin used during programming and during a DMA read cycle.

247 8237 Pin Definitions EOP: End-of-process is a bidirectional signal used as an input to terminate a DMA process or as an output to signal the end of the DMA transfer. It is often used to interrupt a DMA transfer at the end of a DMA cycle.

248 8237 Pin Definitions A0-A3: These address pins select an internal register during programming and provide part of the DMA transfer address during a DMA action. A4-A7: These address pins are outputs that provide part of the DMA transfer address during a DMA action.

249 8237 Pin Definitions HRQ: Hold request is an output that connects to the HOLD input of the microprocessor in order to request a DMA transfer.

250 8237 Pin Definitions DACK0-DACK3: DMA channel acknowledge outputs acknowledge a channel DMA request. These outputs are programmable as either active-high or active-low signals. DACK outputs are often used to select the DMA-controlled I/O device during the DMA transfer.

251 8237 Pin Definitions AEN: The address enable signal enables the DMA address latch connected to the DB7-DB0 pins on the 8237. It is also used to disable any buffers in the system connected to the microprocessor.

252 8237 Pin Definitions ADSTB: Address strobe functions as ALE, except it is used by the DMA controller to latch address bits A15-A8 during the DMA transfer. MEMR: Memory read is an output that causes memory to read data during a DMA read cycle.

253 8237 Pin Definitions MEMW: Memory write is an output that causes memory to write data during a DMA write cycle.

254 8237 Internal Registers CAR: The current address register holds a 16-bit memory address used for the DMA transfer. Each channel has its own current address register for this purpose. When a byte of data is transferred during a DMA operation, CAR is either incremented or decremented, depending on how it is programmed.

255 8237 Internal Registers CWCR: The current word count register programs a channel for the number of bytes (up to 64K) transferred during a DMA action. The number loaded into this register is one less than the number of bytes transferred. For example, if a 10 is loaded into CWCR, then 11 bytes are transferred during the DMA action.

256 8237 Internal Registers BA and BWC: The base address (BA) and base word count (BWC) registers are used when auto-initialization is selected for a channel. In auto-initialization mode, these registers are used to reload the CAR and CWCR after the DMA action is completed. This allows the same count and address to be used to transfer data from the same memory area.
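The interaction of CAR, CWCR, BA, and BWC can be illustrated with a small software model of one channel. This is a sketch for intuition only, not a description of the chip's internal logic: the struct and function names are hypothetical, and terminal count is modeled as the transfer on which the count passes through zero, which is consistent with "a count of 10 transfers 11 bytes".

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of one 8237 channel (names are illustrative). */
typedef struct {
    uint16_t car;   /* current address register     */
    uint16_t cwcr;  /* current word count register  */
    uint16_t ba;    /* base address (reload value)  */
    uint16_t bwc;   /* base word count (reload)     */
    bool decrement; /* true: CAR counts down        */
    bool autoinit;  /* true: reload at terminal count */
} dma_channel;

/* Program the channel to move nbytes (1..65536) starting at address.
   The count registers receive nbytes - 1. */
void dma_channel_load(dma_channel *ch, uint16_t address, uint32_t nbytes)
{
    ch->car = ch->ba = address;
    ch->cwcr = ch->bwc = (uint16_t)(nbytes - 1u);
}

/* Model one byte transfer. Returns true on the transfer that
   reaches terminal count. */
bool dma_channel_step(dma_channel *ch)
{
    bool tc = (ch->cwcr == 0);
    ch->car = (uint16_t)(ch->decrement ? ch->car - 1u : ch->car + 1u);
    ch->cwcr = (uint16_t)(ch->cwcr - 1u);
    if (tc && ch->autoinit) {
        /* Auto-initialization: restore address and count from the base copies. */
        ch->car = ch->ba;
        ch->cwcr = ch->bwc;
    }
    return tc;
}
```

Loading the model with 11 bytes puts 10 in the count register; stepping it 11 times reaches terminal count, and with auto-initialization on, the address and count return to their programmed values, ready for the same transfer again.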

257 8237 Internal Registers CR: The command register programs the operation of the 8237 DMA controller. The register uses bit position 0 to select the memory-to-memory DMA transfer mode. Memory-to-memory DMA transfers use DMA channel 0 to hold the source address, and DMA channel 1 holds the destination address. This is similar to the operation of a MOVSB instruction.

258 Figure: 8237A-5 command register. (Courtesy of Intel Corporation.)

259 8237 Internal Registers MR: The mode register programs the mode of operation for a channel. Each channel has its own mode register, as selected by bit positions 1 and 0. The remaining bits of the mode register select the operation, auto-initialization, increment/decrement, and mode for the channel.
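As an illustration of how these fields combine into one byte, the sketch below packs a mode register value. The bit positions beyond bits 1 and 0 are not spelled out on the slide; they are assumed here from the Intel 8237A data sheet layout (bits 3-2 operation, bit 4 auto-initialization, bit 5 address decrement, bits 7-6 mode) and should be checked against the mode register figure. All identifiers are hypothetical.

```c
#include <stdint.h>

/* Operation field, bits 3-2 (assumed data sheet encoding). */
typedef enum {
    DMA_XFER_VERIFY = 0,
    DMA_XFER_WRITE  = 1, /* I/O device to memory */
    DMA_XFER_READ   = 2  /* memory to I/O device */
} dma_xfer;

/* Mode field, bits 7-6 (assumed data sheet encoding). */
typedef enum {
    DMA_MODE_DEMAND  = 0,
    DMA_MODE_SINGLE  = 1,
    DMA_MODE_BLOCK   = 2,
    DMA_MODE_CASCADE = 3
} dma_mode;

/* Pack the mode register byte for one channel (0-3). */
uint8_t dma8237_mode_byte(unsigned channel, dma_xfer op,
                          int autoinit, int decrement, dma_mode mode)
{
    unsigned v = (channel & 3u)                /* bits 1-0: channel select   */
               | ((unsigned)op << 2)           /* bits 3-2: operation        */
               | (autoinit  ? 0x10u : 0u)      /* bit 4: auto-initialization */
               | (decrement ? 0x20u : 0u)      /* bit 5: address decrement   */
               | ((unsigned)mode << 6);        /* bits 7-6: transfer mode    */
    return (uint8_t)v;
}
```

Under these assumptions, a single-mode DMA write on channel 2 with incrementing addresses and no auto-initialization packs to 46H.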

260 Figure: 8237A-5 mode register. (Courtesy of Intel Corporation.)

261 8237 Internal Registers BR: The bus request register is used to request a DMA transfer via software. This is very useful in memory-to-memory transfers, where an external signal is not available to begin the DMA transfer.

262 Figure: 8237A-5 request register. (Courtesy of Intel Corporation.)

263 8237 Internal Registers MRSR: The mask register set/reset sets or clears the channel mask. If the mask is set, the channel is disabled. The RESET signal sets all channel masks to disable them.
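A minimal sketch of building the byte written to the mask register set/reset. The encoding is assumed from the Intel 8237A data sheet (bits 1-0 select the channel, bit 2 is the new mask value) rather than taken from the slide, and the function name is hypothetical.

```c
#include <stdint.h>

/* Build the mask set/reset command byte for one channel (0-3).
   disable != 0 sets the mask (channel disabled);
   disable == 0 clears it (channel enabled). */
uint8_t dma8237_mask_sr_byte(unsigned channel, int disable)
{
    return (uint8_t)((channel & 3u) | (disable ? 0x04u : 0u));
}
```

For channel 2 this gives 06H to disable the channel and 02H to enable it again.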

264 Figure: 8237A-5 mask register set/reset mode. (Courtesy of Intel Corporation.)

265 8237 Internal Registers MSR: The mask register clears or sets all of the masks with one command, instead of individual channels as with the MRSR.

266 Figure: 8237A-5 mask register. (Courtesy of Intel Corporation.)

267 8237 Internal Registers SR: The status register shows the status of each DMA channel. The TC bits indicate whether the channel has reached its terminal count (transferred all its bytes). When the terminal count is reached, the DMA transfer is terminated for most modes of operation. The request bits indicate whether the DREQ input for a given channel is active.
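Decoding a status byte might look like the sketch below. It assumes the data sheet layout in which bits 3-0 are the TC flags for channels 3-0 and bits 7-4 are the request flags for channels 3-0; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the given channel (0-3) has reached terminal count. */
bool dma8237_tc_reached(uint8_t status, unsigned channel)
{
    return ((status >> (channel & 3u)) & 1u) != 0;
}

/* True if the DREQ input for the given channel (0-3) is active. */
bool dma8237_request_pending(uint8_t status, unsigned channel)
{
    return ((status >> (4u + (channel & 3u))) & 1u) != 0;
}
```

For example, a status value of 21H would mean channel 0 has reached terminal count and channel 1 has a pending request.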

268 Figure: 8237A-5 status register. (Courtesy of Intel Corporation.)

269 Software Commands Three software commands are used to control the operation of the 8237. These commands do not have a binary bit pattern, as do the various control registers within the 8237. A simple output to the correct port number enables the software command. Fig. 13-10 shows the I/O port assignments that access all registers and the software commands.

270 Figure 13-10 8237A-5 command and control port assignments. (Courtesy of Intel Corporation.)

271 8237 Software Commands Master clear: Acts exactly the same as the RESET signal to the 8237. As with the RESET signal, this command disables all channels. Clear mask register: Enables all four DMA channels.

272 8237 Software Commands Clear the first/last flip-flop: Clears the first/last (F/L) flip-flop within the 8237. The F/L flip-flop selects which byte (low or high order) is read/written in the current address and current count registers. If F/L = 0, the low-order byte is selected; if F/L = 1, the high-order byte is selected. Any read or write to the address or count register automatically toggles the F/L flip-flop.
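The effect of the F/L flip-flop can be seen in a small software model. The sketch below is illustrative only (the struct and function names are hypothetical, and only the address registers are modeled): two byte writes land in the low and high halves of a 16-bit register according to the flip-flop state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model: four 16-bit address registers and one shared
   first/last flip-flop. */
typedef struct {
    uint16_t addr[4];
    bool fl; /* false: next access is low byte; true: high byte */
} dma8237_regs;

/* Software command: clear the F/L flip-flop. */
void dma8237_clear_fl(dma8237_regs *d)
{
    d->fl = false;
}

/* Write one byte to a channel's address register, then toggle F/L. */
void dma8237_write_addr(dma8237_regs *d, unsigned ch, uint8_t byte)
{
    uint16_t *r = &d->addr[ch & 3u];
    if (!d->fl)
        *r = (uint16_t)((*r & 0xFF00u) | byte);                  /* low-order byte  */
    else
        *r = (uint16_t)((*r & 0x00FFu) | ((unsigned)byte << 8)); /* high-order byte */
    d->fl = !d->fl;
}
```

After a clear, writing 34H and then 12H yields 1234H. If the flip-flop happened to be set beforehand, the same two writes would land in the opposite halves, which is why the flip-flop is cleared before address or count programming.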

273 Programming the Address and Count Registers The following figure shows the I/O port locations for programming the count and address registers for each channel. The state of the F/L flip-flop determines whether the LSB or MSB is programmed. If the state is unknown, the count and address could be programmed incorrectly. It is important to disable the DMA channel before the address and count are programmed.

274 Figure: 8237A-5 DMA channel I/O port addresses. (Courtesy of Intel Corporation.)

275 Four steps are required to program the 8237: (1) the F/L flip-flop is cleared using a clear F/L command; (2) the channel is disabled; (3) the LSB and MSB of the address are programmed; (4) the LSB and MSB of the count are programmed. Once these four operations are performed, the channel is programmed and ready to use. Additional programming is required to select the mode of operation before the channel is enabled and started.
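The four steps above can be sketched in C as follows. Several things here are assumptions rather than part of the slides: the register offsets (address at base + 2n, count at base + 2n + 1, mask set/reset at base + 0AH, clear F/L at base + 0CH) follow the usual 8237 port map and should be checked against the port assignment figures; `port_out` is a hypothetical stand-in for an OUT instruction that simply records each write so the sequence can be inspected; and all other names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed register offsets from the controller's base I/O address. */
enum {
    DMA_REG_MASK_SR  = 0x0A, /* mask register set/reset    */
    DMA_REG_CLEAR_FL = 0x0C  /* clear first/last flip-flop */
};

/* Hypothetical stand-in for OUT: log the writes instead of touching hardware. */
typedef struct { uint16_t port; uint8_t value; } port_write;
port_write port_log[16];
size_t port_log_len = 0;

void port_out(uint16_t port, uint8_t value)
{
    if (port_log_len < sizeof port_log / sizeof port_log[0]) {
        port_log[port_log_len].port = port;
        port_log[port_log_len].value = value;
        port_log_len++;
    }
}

/* Program address and count for one channel (0-3); nbytes is 1..65536. */
void dma8237_program_channel(uint16_t base, unsigned channel,
                             uint16_t address, uint32_t nbytes)
{
    uint16_t addr_port  = (uint16_t)(base + 2u * (channel & 3u));
    uint16_t count_port = (uint16_t)(addr_port + 1u);
    uint16_t count      = (uint16_t)(nbytes - 1u); /* register holds bytes - 1 */

    /* (1) clear the F/L flip-flop; the value written is ignored */
    port_out((uint16_t)(base + DMA_REG_CLEAR_FL), 0);
    /* (2) disable the channel by setting its mask bit */
    port_out((uint16_t)(base + DMA_REG_MASK_SR),
             (uint8_t)(0x04u | (channel & 3u)));
    /* (3) address: LSB first, then MSB */
    port_out(addr_port, (uint8_t)(address & 0xFFu));
    port_out(addr_port, (uint8_t)(address >> 8));
    /* (4) count: LSB first, then MSB */
    port_out(count_port, (uint8_t)(count & 0xFFu));
    port_out(count_port, (uint8_t)(count >> 8));
}
```

Calling `dma8237_program_channel(0x00, 2, 0x1234, 11)` logs six writes: the clear F/L command, the mask byte 06H, address bytes 34H and 12H, and count bytes 0AH and 00H. The mode register and the unmask step are left out on purpose, since the slide treats them as additional programming.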

276 The 8237 Connected to the 80X86 The address enable (AEN) output of the 8237 controls the output pins of the latches and the outputs of the 74LS257 (E). During normal operation (AEN = 0), latches A and C and the multiplexer (E) provide address bus bits A19-A16 and A7-A0. See the following figure.

277 Figure: Complete 8088 minimum mode DMA system.


COMPUTER ORGANIZATION AND DESIGN 5th Edition The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction CPU performance factors Instruction count CPI and Cycle time Determined