IT 3123 Hardware and Software Concepts
Notice: This session is being recorded.

CPU and Memory
June 11
Copyright 2005 by Bob Brown

Latches
  Can store one bit of data
  Can be ganged together to store more bits; e.g., an 8-bit latch is really eight one-bit latches.
  [Figure: a latch between an input bus and an output bus; signals shown: D-In, Out, Clock, Write, Read Enable]

Registers
  Small, fast storage within the CPU
  Dedicated to a particular purpose
  Sizes are bits or bytes
  At least conceptually, composed of the proper number of latches
  LMC has two explicit registers:
    Program counter
    Calculator display (accumulator)

The Little Man has Registers
  The Little Man could remember things; in a real computer, registers do that.
    The address of the next instruction: the program counter
    The calculator result: the accumulator
    The memory location to read or write: the memory address register
    The data transferred to or from memory: the memory data register
    An instruction read from memory: the instruction register

Program Counter
  Holds the address of the next instruction
  Updated shortly after an instruction is fetched
  Can be changed (by the CPU) to implement branching

Data Registers
  Generally thought of as part of the ALU
  Number: between one and around a hundred; 16 or 32 is typical.
  The LMC has one data register: the display of the calculator.
  When a computer has only one data register, it is called an accumulator. (That's the A in LDA!)

Memory Registers
  The memory address register holds an address in memory
    From which to read data
    To which to write data
  The memory data register holds data
    Read from memory
    To be written to memory

Instruction Register
  Holds the instruction fetched from the memory location pointed to by the program counter
  The decode phase examines the contents of the instruction register to decide what operation to perform
  The instruction was something the Little Man remembered; a real computer uses the instruction register

Register Operations
  Store data values temporarily
  Receive the results of arithmetic operations (addition, subtraction, etc.)
  Receive the results of logical operations (shift data, AND data)
  Test their contents for conditions such as zero or positive

A Closer Look at the Architecture
  [Figure: CPU containing the PC, IR, control unit, MAR, MDR, accumulator, ALU, and I-O, connected by a command line to a memory with cells 00 through 99]

Conversations with Memory
  The Memory Address Register (MAR)
    Holds one memory address
    If memory is 128 bytes, how big must the MAR be?
    What if memory is 16 MB?
    The address in the MAR determines what memory location will be read or written. (Only one location can be read or written at a time.)

Conversations with Memory
  The Memory Data Register (MDR)
    Holds one memory word
    If memory words are 32 bits, how big must the MDR be?
    The MDR receives one word of data from memory on a read
    The MDR holds one word of data to be transferred to memory on a write
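
The two MAR sizing questions above can be checked with a few lines of Python. This is a minimal sketch, assuming byte-addressable memory and taking 16 MB to mean 16 x 2^20 bytes; the function name address_bits is made up for illustration.

```python
def address_bits(cells: int) -> int:
    """Smallest k such that 2**k >= cells, i.e. the MAR width in bits."""
    if cells < 1:
        raise ValueError("memory must have at least one cell")
    # (cells - 1) is the largest address; its bit length is the width needed.
    return (cells - 1).bit_length()


print(address_bits(128))           # 128 bytes
print(address_bits(16 * 2**20))    # 16 MB, assuming 1 MB = 2**20 bytes
```

Under these assumptions the MAR needs 7 bits for 128 bytes and 24 bits for 16 MB. The MDR question needs no arithmetic: the MDR holds one word, so with 32-bit words it is 32 bits wide.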

Reading from Memory
  1. Place a memory address in the MAR.
  2. Send a command (electronic signal) to memory to read.
  3. The memory places the selected data word on a bus that is connected to the MDR.
  4. The control unit can transfer the data from the MDR.

Writing to Memory
  1. Place a memory address in the MAR.
  2. Place the data to be written in the MDR.
  3. The memory bus is connected to the MDR.
  4. Send a command (electronic signal) to memory to write.
  5. The memory stores data from the bus into the selected location.

Operation of Memory
  [Figure omitted]

Operation of Memory: Example
  [Figure; labels shown: individual memory bits, lsb]

Visual Analogy: Reading or Writing
  [Figure omitted]
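
The read and write sequences above can be sketched as a tiny Python model. The class name Memory and its attribute names are assumptions made for this illustration; real memory is driven by electrical signals on a bus, not method calls.

```python
class Memory:
    """Toy memory that is only reachable through a MAR and an MDR."""

    def __init__(self, size: int) -> None:
        self.cells = [0] * size
        self.mar = 0   # memory address register: which cell is selected
        self.mdr = 0   # memory data register: the word in transit

    def read(self) -> None:
        # "Read" command: the selected word travels over the bus into the MDR.
        self.mdr = self.cells[self.mar]

    def write(self) -> None:
        # "Write" command: the word in the MDR is stored in the selected cell.
        self.cells[self.mar] = self.mdr


mem = Memory(100)

# Writing: address into MAR, data into MDR, then the write command.
mem.mar, mem.mdr = 99, 137
mem.write()

# Reading: address into MAR, read command, then take the data from the MDR.
mem.mdr = 0
mem.mar = 99
mem.read()
print(mem.mdr)
```

Only one cell can be touched per command, because there is only one MAR to select it.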

Memory Capacity
  Limited by two factors
    Size of the MAR: a k-bit address can select 2^k cells
    Size of the address portion of an instruction (in the LMC, it is two digits)
  The amount of physical memory is important for performance

Random Access Memory (RAM)
  Called random access because any cell may be addressed as fast as any other
  Dynamic RAM (DRAM)
    Loses contents when power is removed (volatile)
    Must be refreshed thousands of times per second
  Static RAM (SRAM)
    More expensive, faster, no refresh needed
    Loses contents when power is removed

Read Only Memory (ROM)
  Non-volatile memory to hold software that is not expected to change over the life of the system
  EEPROM: Electrically Erasable Programmable ROM
    Slower and less flexible than Flash ROM
  Flash ROM
    Faster than disks but more expensive
    Uses
      BIOS: initial boot instructions and diagnostics
      Digital cameras, music players, thumb drives, etc.
  CMOS: very low power read-write memory; holds clock and configuration info

The Instruction Cycle
  The von Neumann instruction cycle
    Fetch: get an instruction from the memory location pointed to by the program counter, and advance the program counter
    Decode: determine what operation code is present, and what data to use
    Execute: perform the commanded operation

Register Transfer
  Basic operation of the execute part of the instruction cycle: send the contents of one or two registers through the ALU.
  The result is stored in a register, possibly the same as one of the sending registers.
  Data are transformed according to the command (add, shift, etc.) given to the ALU.
  A "no operation" command can move data without changing it.
  Register operations are described with RTL (register transfer language)

Register Transfer Example
  [Figure: the transfer PC → MAR among registers A, PC, IR, MAR, and MDR, showing the Enable and Write control signals]
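
The register-transfer idea above (one or two registers sent through the ALU, with the result landing in a register) can be sketched as follows. The command names "add", "shl", and "nop" and the helper names are assumptions chosen for this illustration, not the slides' notation.

```python
def alu(command: str, a: int, b: int = 0) -> int:
    """Transform the inputs according to the command given to the ALU."""
    if command == "add":
        return a + b
    if command == "shl":      # shift left one bit
        return a << 1
    if command == "nop":      # pass the data through unchanged
        return a
    raise ValueError(f"unknown ALU command: {command}")


def transfer(regs: dict, command: str, dest: str, src1: str, src2: str = None) -> None:
    """Send one or two source registers through the ALU into dest."""
    b = regs[src2] if src2 is not None else 0
    regs[dest] = alu(command, regs[src1], b)


regs = {"PC": 1, "MAR": 0, "MDR": 5, "A": 137}

transfer(regs, "nop", "MAR", "PC")        # PC -> MAR
transfer(regs, "add", "A", "A", "MDR")    # A + MDR -> A
print(regs)
```

Note that the destination may be one of the sources, as in the second transfer.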

The Complete Datapath
  [Figure: datapath connecting A, PC, IR, MAR, and MDR through "the real ALU!", with Function and Status lines]

The Instruction Cycle
  Fetch
    Read the program counter: PC → MAR (read memory)
    Get contents of indicated mailbox: MDR → IR
    Increment the program counter: PC + 1 → PC
  Decode
    Check op code: it's a STORE
    Determine next operation
  Execute
    Get calculator display value: A → MDR
    Read instruction address field: IR[add] → MAR
    Store calculator value there: (write memory)
  Done

The STO Instruction
  1. PC → MAR             Transfer the address from the PC to the MAR
  2. PC + 1 → PC          Program counter incremented; memory read completes
  3. MDR → IR             Transfer the instruction to the IR
  4. IR[address] → MAR    Address portion of the instruction loaded into the MAR
  5. A → MDR*             Accumulator copies its data into the MDR; write memory
  *Notice how step 5 differs for LOAD and STORE.

  [Figures: CPU and memory state after each step. Memory location 01 holds 399 (Store); 199 also appears in memory.]
  Step                                  PC   IR     MAR  MDR  A    Command
  PC → MAR; Read                        01          01        137  Read
  PC + 1 → PC; memory read completes    02          01   399  137
  MDR → IR                              02   3 99   01   399  137

  [Figures: CPU and memory state for the last two steps of STO]
  Step                 PC   IR     MAR  MDR  A    Command
  IR[address] → MAR    02   3 99   99   399  137
  A → MDR; Write       02   3 99   99   137  137  Write
  After the write, memory location 99 holds 137.

The LDA Instruction
  1. PC → MAR             Transfer the address from the PC to the MAR, read memory
  2. PC + 1 → PC          Program counter incremented (This is in a different place in Englander's diagram.)
  3. MDR → IR             Transfer the instruction to the IR
  4. IR[address] → MAR    Address portion of the instruction loaded into the MAR, read memory
  5. MDR → A              Actual data copied into the accumulator

The ADD Instruction
  1. PC → MAR             Transfer the address from the PC to the MAR
  2. PC + 1 → PC          Program counter incremented
  3. MDR → IR             Transfer the instruction to the IR
  4. IR[address] → MAR    Address portion of the instruction loaded into the MAR, read memory
  5. A + MDR → A          Contents of MDR added to contents of accumulator

Buses
  The physical connection that makes it possible to transfer data from one location in the computer system to another
  Group of electrical conductors for carrying signals from one location to another
  Line: each conductor in the bus
  4 kinds of signals
    Data (binary numbers: alphanumeric, numerical, instructions)
    Addresses
    Control signals
    Power (sometimes)

Buses
  Connect CPU and memory
  I/O peripherals: on the same bus as CPU/memory or on a separate bus
  Physical packaging commonly called backplane or motherboard
    Also called system bus or external bus
    Example of a broadcast bus
    Part of the printed circuit board, called the motherboard, that holds the CPU and related components
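
The five register-transfer steps listed earlier for STO, LDA, and ADD can be traced in a short Python simulation. This is a sketch under stated assumptions: opcode 3 for STO comes from the slides' "399 (Store)", while opcodes 5 for LDA and 1 for ADD follow the usual Little Man Computer numbering and are assumed here, as is the placement of 199 in mailbox 02. The names step and cpu are invented for the example, and the three-digit limit of LMC values is not enforced.

```python
STO, LDA, ADD = 3, 5, 1   # STO = 3 is from the slides; 5 and 1 are assumed


def step(cpu: dict, memory: list) -> None:
    """Run one instruction as the five register transfers."""
    cpu["MAR"] = cpu["PC"]                  # 1. PC -> MAR, read memory
    cpu["MDR"] = memory[cpu["MAR"]]         #    (the memory read completes)
    cpu["PC"] += 1                          # 2. PC + 1 -> PC
    cpu["IR"] = cpu["MDR"]                  # 3. MDR -> IR
    opcode, address = divmod(cpu["IR"], 100)
    cpu["MAR"] = address                    # 4. IR[address] -> MAR
    if opcode == STO:
        cpu["MDR"] = cpu["A"]               # 5. A -> MDR, write memory
        memory[cpu["MAR"]] = cpu["MDR"]
    elif opcode == LDA:
        cpu["MDR"] = memory[cpu["MAR"]]     #    read memory
        cpu["A"] = cpu["MDR"]               # 5. MDR -> A
    elif opcode == ADD:
        cpu["MDR"] = memory[cpu["MAR"]]     #    read memory
        cpu["A"] = cpu["A"] + cpu["MDR"]    # 5. A + MDR -> A
    else:
        raise ValueError(f"opcode {opcode} is not modelled in this sketch")


memory = [0] * 100
memory[1] = 399    # STO 99, as in the slides
memory[2] = 199    # assumed to sit in the next mailbox; ADD 99 under the assumed opcodes
cpu = {"PC": 1, "IR": 0, "MAR": 0, "MDR": 0, "A": 137}

step(cpu, memory)
print(cpu, memory[99])
step(cpu, memory)
print(cpu)
```

After the first step the registers match the last diagram of the STO walkthrough: PC 02, IR 399, MAR 99, MDR 137, and mailbox 99 holding 137.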

Bus Characteristics
  Point-to-point vs. multipoint
  Protocol
    Documented agreement for communication
    Specification that spells out the meaning of each line and each signal on each line
  Throughput, i.e., data transfer rate in bits per second
  Data width in bits carried simultaneously

Motherboard
  [Figure omitted]

Instructions
  Instruction
    Direction given to a computer
    Causes electrical signals to be sent through specific circuits for processing
  Instruction set: the collection of instructions a given computer can perform. (LMC has ten instructions; the list of them is its instruction set.)

Instruction Set Design
  Defines the functions performed by the processor
  Differentiates computer architectures by the
    Number of instructions
    Complexity of operations performed by individual instructions
    Data types supported
    Format (layout, fixed vs. variable length)
    Use of registers
    Addressing (size, modes)

Elements of an Instruction
  Operation code (op-code): tells the control unit and the ALU what to do
  Operands: tell the location of the data to be used in the instruction
    Source operand: where to get the data
    Result operand: where to put the result (also called the destination operand)
  The operands are (usually) addresses.

Operand Addresses
  Addresses may be explicit or implicit
    Explicit: encoded in the instruction. (The LMC memory address is explicit.)
    Implicit: implied by the nature of the operand. (The LMC uses the calculator display implicitly.)
  Addresses may refer to memory or to registers.

General Form of an Instruction
  [Figure: instruction layout with the fields OP-CODE, Source Operand, and Result Operand; widths shown: 4 bits, 20 bits]

Instruction Format
  Specific to a particular family of computers (architecture)
  Specifies the length of the op-code
  And the size and number of operand fields
  A single computer may have several different instruction formats.

Complex Instruction-Set Computers
  Many different kinds of instructions
  Many different instruction formats
  Several different instruction lengths
  A few different operation code lengths
  Often things done in high-level languages can be performed in one instruction.
  Emphasis is on flexibility

CISC Instruction Formats
  [Figure omitted]

Reduced Instruction-Set Computers
  A few kinds of instructions
  A small number of formats
  All instructions are the same length
  All operation codes are the same length
  High-level language statements generally require several instructions
  Emphasis is on speed

RISC Instruction Formats
  [Figure omitted]

Categories of Instructions
  Data transfer instructions
  Arithmetic instructions
  Logical operations
  Program control
  Stack manipulation
  I-O and machine control
  Multiple-data instructions

Data Transfer Instructions
  Move data between registers in the CPU
  Transfer data from memory to a CPU register (load)
  Transfer data from a CPU register to memory (store)
  Size of a single transfer: generally the size of a data register; a word
    Words are 8, 16, 32, 64, or 128 bits
    32-bit words are currently most common

Arithmetic Instructions
  The usual suspects: + - * /
  Separate instructions for integer and floating point operands
  Shift and rotate instructions
    A one-bit shift left multiplies by two
    A one-bit shift right divides by two
    Rotate: bits shifted out one end are used as replacement bits at the other end
  Increment, complement, etc.

Shift and Rotate Instructions
  [Figure omitted]

Logical Operations
  Logical AND and OR of two operands
  Sometimes others: XOR, NOR, NOT
  Relational operations: > < =
  Testing for zero, positive, negative
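
The shift and rotate behavior described under Arithmetic Instructions can be tried out in Python. This is a small sketch assuming an unsigned 8-bit register; the width and the function names are choices made for the example.

```python
WIDTH = 8                     # assumed register width
MASK = (1 << WIDTH) - 1       # 0xFF: keeps results within 8 bits


def shift_left(x: int) -> int:
    """One-bit shift left: multiplies by two (the top bit is lost)."""
    return (x << 1) & MASK


def shift_right(x: int) -> int:
    """One-bit logical shift right: divides an unsigned value by two."""
    return x >> 1


def rotate_left(x: int) -> int:
    """The bit shifted out of the top re-enters at the bottom."""
    return ((x << 1) & MASK) | (x >> (WIDTH - 1))


def rotate_right(x: int) -> int:
    """The bit shifted out of the bottom re-enters at the top."""
    return (x >> 1) | ((x & 1) << (WIDTH - 1))


x = 0b00010101                                   # 21
print(shift_left(x), shift_right(shift_left(x)))
print(format(rotate_left(0b10000001), "08b"))
print(format(rotate_right(0b10000001), "08b"))
```

The multiply-by-two rule holds only while the result still fits in the register: shifting 0b10000000 left gives 0 here because the top bit falls off.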

Program Control
  Branch instructions: conditional and unconditional
  Call instructions (save the program counter someplace)

Stack Manipulation
  Special instructions for dealing with LIFO data structures. (A stack is a good place to store program counters for subroutine linkage!)
    Push
    Pop

I-O and Machine Control
  Transfers from registers to I-O devices
  Direct memory access (DMA) I-O
    The I-O device communicates with memory independently of the CPU
  Machine state switching (privileged instructions)
  Interrupt control
  State saving
  Halt

Multiple-Data Instructions
  Perform the same operation on multiple data items simultaneously (example: Intel MMX)
  Commonly used in vector and array processing
  SIMD: single instruction, multiple data

Multiple Data Instructions
  [Figure omitted]

CISC Architecture
  Examples
    Intel x86, IBM Z-Series mainframes, older CPU architectures
  Characteristics
    Few general purpose registers (perhaps 16)
    Many addressing modes
    Large number of specialized, complex instructions
    Instructions are of varying sizes
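
The remark under Stack Manipulation, that a stack is a good place to store program counters for subroutine linkage, can be illustrated with a short sketch. The function names call and ret and the addresses used are hypothetical; a real CPU does this with dedicated call and return instructions.

```python
def call(stack: list, pc: int, target: int) -> int:
    """Push the return address (the already-incremented PC), then branch."""
    stack.append(pc)
    return target


def ret(stack: list) -> int:
    """Pop the most recently saved PC; LIFO order handles nested calls."""
    return stack.pop()


stack = []
pc = 11                        # address of the instruction after the first call
pc = call(stack, pc, 50)       # main calls a subroutine at 50
pc = call(stack, 53, 80)       # that subroutine calls another one at 80
print(stack)                   # saved return addresses, oldest first

pc = ret(stack)                # inner return: back to 53
inner_return = pc
pc = ret(stack)                # outer return: back to 11
print(inner_return, pc)
```

Because the last address pushed is the first one popped, each return lands in the caller that is still waiting, however deeply the calls nest.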

Limitations of CISC Architecture
  Some instructions are infrequently used by programmers and compilers
  Memory references, loads and stores, are slow and account for a significant fraction of all instructions
  Only a few of the many instructions are used frequently
  Procedure and function calls are a major bottleneck
    Passing arguments
    Storing and retrieving values in registers

RISC Features
  Limited and simple instruction set
  Fixed-length, fixed-format instruction words enable pipelining and parallel fetches and executions
  Limited addressing modes reduce complicated hardware
  Register-oriented instruction set reduces memory accesses
  Large bank of registers
    Reduces memory accesses
    Efficient procedure calls

CISC vs. RISC Processing
  [Figure omitted]

Speeding Up Procedure Calls
  Procedure calls help modularize programs.
  They cause major overhead at execution time:
    Saving state
    Setting up parameters
    Retrieving results
  What if we could make procedure calls without moving data around?

Circular Register Buffer (RISC)
  [Figure omitted]
  (It isn't really a circle; it's a linear space that wraps around.)

Circular Register Buffer - After Procedure Call
  [Figure omitted]
  The caller's "out" becomes the procedure's "in."
  No data was moved; a single pointer was changed by a fixed amount.
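
The circular register buffer can be sketched in Python to show how the caller's "out" registers become the procedure's "in" registers without copying. The window layout (8 in, 8 local, 8 out registers, four windows) and the class name RegisterWindows are assumptions for this illustration; the slides do not give sizes, and the sketch does not handle running out of windows.

```python
class RegisterWindows:
    """Linear register file that wraps around; a call only moves a pointer."""

    IN = LOCAL = OUT = 8          # assumed sizes of each group
    STEP = IN + LOCAL             # fixed amount the pointer moves per call

    def __init__(self, windows: int = 4) -> None:
        self.size = windows * self.STEP
        self.regs = [0] * self.size
        self.base = 0             # the single pointer: start of the current window

    def _slot(self, offset: int) -> int:
        return (self.base + offset) % self.size    # wrap around the end

    def get_in(self, i: int) -> int:
        return self.regs[self._slot(i)]

    def set_in(self, i: int, value: int) -> None:
        self.regs[self._slot(i)] = value

    def get_out(self, i: int) -> int:
        return self.regs[self._slot(self.STEP + i)]

    def set_out(self, i: int, value: int) -> None:
        self.regs[self._slot(self.STEP + i)] = value

    def call(self) -> None:
        self.base = (self.base + self.STEP) % self.size

    def ret(self) -> None:
        self.base = (self.base - self.STEP) % self.size


rw = RegisterWindows()
rw.set_out(0, 42)       # caller places an argument in its "out" register 0
rw.call()               # only the pointer changes
argument = rw.get_in(0) # the procedure sees it as "in" register 0
rw.set_in(0, 99)        # the procedure leaves a result in the same place
rw.ret()
print(argument, rw.get_out(0))
```

The argument and the result each pass between caller and procedure through the same physical register, which is the point of the overlap.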

CISC vs. RISC Performance Comparison
  RISC
    Simpler instructions → more instructions required for a program → more memory required to hold the program, maybe
  CISC
    More memory accesses for data, so more bus traffic and increased cache memory misses
    More registers would improve CISC performance, but there was formerly no space available for them
  Modern CISC and RISC architectures are becoming similar due to Moore's Law.

Memory Enhancements
  Memory is slow compared to CPU processing speeds!
    2 GHz CPU: 1 cycle in ½ of a billionth of a second (0.5 ns)
    70 ns DRAM: 1 access in 70 billionths of a second (140 times slower!)

Improving Memory Access
  Wide path memory access
    Retrieve multiple bytes instead of one byte at a time
  Memory interleaving
    Partition memory into subsections, each with its own address register and data register
  Cache memory

Memory Interleaving
  [Figure omitted]

Cache Memory
  A small, fast memory placed between the CPU and main memory
  Works because memory locations used once are likely to be used again. (Locality of reference.)

Cache Terminology
  Blocks: amount of data transferred
  Tags: point to a location in main memory
  Cache controller: hardware that checks tags
  Cache line: unit of transfer between storage and cache memory
  Hit ratio: ratio of hits to total requests
  Synchronizing cache and memory
    Write through
    Write back
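
The cache terms above (blocks, tags, hit ratio) can be made concrete with a toy simulation. This is a sketch assuming a direct-mapped cache with 8 lines and 4-word blocks; the slides do not specify an organization, so the mapping scheme, the sizes, and the class name are all assumptions. Only tags are tracked, not the cached data itself.

```python
class DirectMappedCache:
    """Toy cache controller: checks tags and counts hits."""

    def __init__(self, lines: int = 8, block_size: int = 4) -> None:
        self.lines = lines
        self.block_size = block_size
        self.tags = [None] * lines    # one tag per cache line
        self.hits = 0
        self.requests = 0

    def access(self, address: int) -> bool:
        """Return True on a hit; on a miss, load the block into its line."""
        block = address // self.block_size    # which memory block
        index = block % self.lines            # which cache line it maps to
        tag = block // self.lines             # identifies the block in that line
        self.requests += 1
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag                # miss: replace the line's contents
        return False

    def hit_ratio(self) -> float:
        return self.hits / self.requests if self.requests else 0.0


cache = DirectMappedCache()
for _ in range(10):                 # a small loop touching the same 16 words
    for address in range(16):
        cache.access(address)
print(cache.hits, cache.requests, cache.hit_ratio())
```

The loop misses only on the first touch of each 4-word block, so 156 of the 160 requests are hits. Repeated references to a small region are what make a high hit ratio possible.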

Step-by-Step Use of Cache: Hit
  [Figure omitted]

Step-by-Step Use of Cache: Miss
  [Figure omitted]

Performance Advantages
  Hit ratios of 90% are common
  50%+ improved execution speed
  Locality of reference
    Most memory references are confined to a small region of memory at any given time
    A well-written program runs in a small loop, procedure, or function
    Data are likely in an array
    Variables are stored together

Two-level Caches
  [Figure omitted]

Current and Emerging Trends
  CISC and RISC are re-converging because of greater chip densities.
  Multi-core chips: two, four, and even more CPUs in a single integrated circuit package.
  Cluster computing: hundreds or thousands of commodity computers working together.
    Example: the "Big Mac" at Virginia Tech
  Parallel computing
  Experimental architectures

VLIW Architecture
  Transmeta Crusoe CPU
  128-bit instruction bundle = molecule
    Four 32-bit atoms (atom = instruction)
    Parallel processing of 4 instructions
  64 general purpose registers
  Code morphing layer
    Translates instructions written for other CPUs into molecules
    Instructions are not written directly for the Crusoe CPU

EPIC Architecture
  Intel Itanium CPU
  128-bit instruction bundle
    Three 41-bit instructions
    5 bits to identify the type of instructions in the bundle
  128 64-bit general purpose registers
  128 82-bit floating point registers
  Intel x86 instruction set included
  Programmers and compilers follow guidelines to ensure parallel execution of instructions

Modern CPU Processing Methods
  Separate fetch/execute units
  Pipelining
  Scalar processing
  Superscalar processing

Alternative CPU Organization
  [Figure omitted]

Instruction Pipelining
  Assembly-line technique to allow overlapping between the fetch-execute cycles of sequences of instructions
  Only one instruction is being completed at a time

More on Pipelining
  Scalar processing: on average, about one instruction completes per clock cycle
  Problems from stalling:
    Instructions have different numbers of steps
    Problems of data latency
  Problems from branching
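
The benefit of overlapping fetch-execute cycles can be estimated with simple arithmetic. The sketch below assumes an ideal pipeline: every stage takes one clock cycle and there are no stalls or branches. That is an idealization, not a description of a real CPU, and the function names are made up for the example.

```python
def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Each instruction runs all of its stages before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int) -> int:
    """The first instruction fills the pipeline; after that one completes per cycle."""
    if instructions == 0:
        return 0
    return stages + (instructions - 1)


n, k = 100, 3    # 100 instructions; 3 stages (fetch, decode, execute)
print(unpipelined_cycles(n, k), pipelined_cycles(n, k))
```

With 100 instructions and three stages the ideal pipeline needs 102 cycles instead of 300, which is close to one instruction per clock. Stalls and branches, the problems listed above, push real pipelines away from this ideal.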

Branch Problem Solutions
  Separate pipelines for both possibilities
  Probabilistic approach
  Requiring the following instruction to not be dependent on the branch
  Instruction reordering

Pipelining Example
  [Figure omitted]

Superscalar Processing
  Process more than one instruction per clock cycle
  Separate fetch and execute cycles as much as possible
    Buffers for fetch and decode phases
    Parallel execution units

Superscalar CPU Block Diagram
  [Figure omitted]

Scalar vs. Superscalar Processing
  [Figure omitted]

Superscalar Issues
  Out-of-order processing: dependencies (hazards)
  Data dependencies
  Branch (flow) dependencies and speculative execution
    Parallel speculative execution or branch prediction
    Branch history table
  Register access conflicts
    Logical registers (register renaming)

Hardware Implementation
  Hardware implementation: operations are implemented using logic gates
  Advantage: speed
  RISC designs are simple and typically implemented in hardware

Hardware and Software
  Hardware and software are logically equivalent. (But there has to be some hardware someplace!)
  So, computer designers have a choice of implementing with hardware or software.

Microprogrammed Implementation
  Microcode: programs stored in ROM that replace hardwired CPU instructions
  Advantages
    More flexible
    Easier to implement complex instructions
    Can emulate other CPUs
    Can be changed!
  Disadvantage
    Usually requires more clock cycles

Questions