CS Computer Architecture


1 CS Computer Architecture Section 600 Dr. Angela Guercio Fall 2010

2 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. Instead, we will use a simple symbolic language to make it easier to understand. The language describes what happens at each clock cycle, rather than being a higher-level language. To copy something from one register to another, we will use an assignment statement: MDR = SP To indicate an operation more complicated than simply passing a value through the B bus: MDR = H + SP

3 An Example Implementation We must use only legal combinations in the assignment statement (e.g. no H = H - MDR). We can assign the result to multiple registers, so we can also write: SP = MDR = SP + 1 To indicate memory reads and writes of 4-byte data words, we will just put rd and wr in the microinstruction. Fetching a byte through the 1-byte port is indicated by fetch. Assignments and memory operations can occur in the same cycle. This is indicated by writing them on the same line.
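
For instance, combining these rules, the handler for IJVM's DUP (duplicate the top-of-stack word) needs only two microinstructions. This is a sketch in the notation just described, patterned on Tanenbaum's Mic-1 microprogram (the labels are illustrative; control then returns to the interpreter's main loop):

    dup1: MAR = SP = SP + 1    // increment SP; MAR now points at the new top-of-stack slot
    dup2: MDR = TOS; wr        // copy TOS to MDR and start the memory write in the same cycle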

4 An Example Implementation The figure on this slide lists all permitted operations. Any of these operations may be extended by adding << 8 to them to shift the result left by 1 byte. For example, a common operation is H = MBR << 8.

5 An Example Implementation Remember that each microinstruction must explicitly supply the address of the next microinstruction. To ease the programmer's job, the microassembler normally assigns an address to each microinstruction (not necessarily consecutive in the control store) and fills in the NEXT_ADDRESS field so that microinstructions written on consecutive lines are executed consecutively. Sometimes, however, the programmer wants to branch away unconditionally: goto label.

6 An Example Implementation To do a conditional branch, we can use the Z and N flip-flops: Z = TOS tests the TOS register to see if it is zero. We can then do: Z = TOS; if (Z) goto L1; else goto L2 Note that L2 must be exactly 256 more than L1 (this statement causes Z to be ORed into the high-order bit of MPC). To use the JMPC bit: goto (MBR OR value). This tells the microassembler to use value for NEXT_ADDRESS and to set the JMPC bit, so that MBR is ORed with NEXT_ADDRESS to form MPC.

7 An Example Implementation If value is 0, which is the normal case, we can just write: goto (MBR) The actual microprogram that interprets IJVM is 112 microinstructions long. Consecutive microinstructions are not necessarily located in consecutive addresses in the control store. Note the meanings of the registers: CPP is a pointer to the constant pool. LV is a pointer to the local variables. SP is a pointer to the top of the stack. PC holds the address of the next instruction.

8 An Example Implementation MBR is a 1-byte register that holds the bytes of the instruction stream as they come in to be executed. At the beginning and end of each instruction, TOS contains the value of the memory location pointed to by SP, i.e., the top word of the stack. For some instructions, POP for example, more work is necessary to maintain this invariant. The OPC register is a temporary (scratch) register. The main loop of the interpreter begins on the line labeled Main1 and is a single microinstruction.

9 An Example Implementation Control store addresses corresponding to IJVM opcodes must be reserved for the first word of the corresponding instruction interpreter. Assume that MBR contains 0x60 (IADD). The main loop must: Increment the PC, leaving it containing the address of the first byte after the opcode. Initiate a fetch of the next byte into MBR. Perform a multiway branch to the address contained in MBR at the start of Main1; that address is the opcode of the instruction currently being executed (0x60 in our example). These duties are sketched below.
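
Putting those three duties together, the main loop is a single microinstruction, and the IADD handler it dispatches to takes three more. This sketch follows the description above and Tanenbaum's published microprogram; treat the exact labels as illustrative:

    Main1: PC = PC + 1; fetch; goto (MBR)        // increment PC, fetch next opcode byte, dispatch on MBR
    iadd1: MAR = SP = SP - 1; rd                 // read in the next-to-top word of the stack
    iadd2: H = TOS                               // H = top of stack
    iadd3: MDR = TOS = MDR + H; wr; goto Main1   // add the two words; write the sum back to the stack top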

10 The Microprogram for the Mic-1 (1) The microprogram for the Mic-1.

11 The Microprogram for the Mic-1 (2)

12 The Microprogram for the Mic-1 (3)

13 The Microprogram for the Mic-1 (4)

14 The Microprogram for the Mic-1 (5)

15 Increasing the Speed In general, we trade off speed versus cost. A faster implementation requires more hardware complexity. There are three basic approaches for increasing the speed of execution: Reduce the number of clock cycles needed to execute an instruction. Simplify the organization so that the clock cycle can be shorter. Overlap the execution of instructions.

16 Increasing the Speed Original microprogram sequence for executing POP.
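
Since the figure is not reproduced here, the original sequence can be sketched as follows (after Tanenbaum's microprogram; note that the middle cycle is spent waiting for memory):

    pop1: MAR = SP = SP - 1; rd    // read in the next-to-top word on the stack
    pop2:                          // wait for the new TOS to be read from memory
    pop3: TOS = MDR; goto Main1    // copy the new word to TOS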

17 Increasing the Speed Enhanced microprogram sequence for executing POP.
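
The enhancement moves the main loop's work into POP's dead wait cycle, saving one cycle per POP. A sketch, again after Tanenbaum (Main1.pop is an illustrative label for the merged microinstruction):

    pop1:      MAR = SP = SP - 1; rd   // read in the next-to-top word on the stack
    Main1.pop: PC = PC + 1; fetch      // do Main1's job while waiting for the read
    pop3:      TOS = MDR; goto (MBR)   // copy the new word to TOS; dispatch on the next opcode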

18 Increasing the Speed As another speed increase, notice that for every instruction the following operations may occur: The PC is passed through the ALU and incremented. The PC is used to fetch the next byte in the instruction stream. Operands are read from memory. Operands are written to memory. The ALU does a computation and the results are stored back. We can free up the ALU by introducing an IFU (Instruction Fetch Unit) to handle instruction fetching and PC incrementing on its own.

19 Instruction Fetch Unit

20 The Mic-2 The Mic-2, shown on the next slide, incorporates the Instruction Fetch Unit (IFU). This results in a changed (smaller) microprogram to implement IJVM, at the expense of more hardware for the IFU (which is implemented in hardware). The Mic-2 has the new 16-bit MBR2 register, and it allows any register to be used as either operand (this simplifies the microprogram as well).

21 Instruction Fetch Unit The data path for Mic-2.

22 Pipelining We can further speed up processing by overlapping the execution of instructions. In order to do this, we must break the data path into three parts by inserting latches (registers) into it. Each of these parts has a shorter delay than the full original data path, so the clock cycle can be shorter.

23 Pipelining in Mic-3 In the Mic-3, shown on the next slide, latches have been inserted to allow the three components (drive the A and B buses, perform the ALU computation, write back the result) to operate concurrently. Along with the IFU, this gives us a 4-stage pipeline.

24 Pipelining The three-bus data path used in the Mic-3.

25 Pipelining The operation of the pipeline is shown on the next slide. Note that several instructions are operating concurrently.

26 Pipelining Graphical illustration of how a pipeline works.

27 Pipelining in Mic-4 The following slide shows the Mic-4, in which several more stages are added. The decoding unit finds the microprogram memory location of the next opcode. The queueing unit receives that micro-op index from the decoding unit and copies the corresponding micro-ops into a queue; it continues copying micro-ops until it reaches the last one in the sequence. There are separate MIRs for the several pipeline phases.

28 Pipelining The main components of the Mic-4.

29 Pipelining The Mic-4 pipeline.

30 Cache Memory The recent improvements in CPU speed have led to an even wider gap between CPU and memory speeds. One way to manage this problem is through the use of high-speed cache memory. One technique that is quite efficient involves the use of separate caches for data and instructions. This is called a split cache.

31 Cache Memory A split cache allows memory operations to be initiated independently in each cache, doubling the bandwidth of the memory system. Each cache has independent access to the main memory. An additional cache, called level 2 cache, may reside between the instruction and data caches and main memory. There may be three or more levels of cache as more sophisticated memory systems are required.

32 Cache Memory The following slide shows a typical arrangement. The CPU chip itself contains a small instruction cache and a small data cache, typically 16 KB to 64 KB. The level 2 cache is not on the CPU chip, but may be included in the CPU package, connected by a high-speed data path. This cache is generally unified (containing both data and instructions) and is between 512 KB and 1 MB.

33 Cache Memory The third level cache is on the processor board and contains a few megabytes of SRAM, which is faster than the main DRAM memory. Caches are generally inclusive, with the full contents of the level 1 cache being in the level 2 cache and the full contents of the level 2 cache being in the level 3 cache.

34 Cache Memory A system with three levels of cache.

35 Cache Memory Cache memories depend on two kinds of address locality to achieve their goal. Spatial locality is the observation that memory locations with addresses numerically similar to a recently accessed location are likely to be accessed. Temporal locality occurs when recently accessed memory locations are accessed again. Main memory is divided into fixed-size blocks called cache lines of 4 to 64 bytes. Lines are numbered consecutively starting at 0, so with a 32-byte line size, line 0 is bytes 0 to 31, etc.
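
As a quick check of that numbering, a minimal C sketch (assuming the 32-byte line size used in the text):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr   = 1052;        /* an example byte address */
        uint32_t line   = addr / 32;   /* line number: 1052 / 32 = 32 */
        uint32_t offset = addr % 32;   /* byte 28 within that line */
        printf("address %u is byte %u of line %u\n", addr, offset, line);
        return 0;
    }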

36 Cache Memory At any instant, some lines are in the cache. When memory is referenced, the cache controller circuit checks to see if the word referenced is currently in the cache. If it is, it is used. If not, some line entry is removed from the cache and the line needed is fetched from memory or some lower level cache to replace it. Many variations exist, but the central idea is always to keep the most heavily-used lines in the cache.

37 Cache Memory Cache memory can be organized several ways: Direct-mapped caches fix the cache entry in which a particular portion of main memory can be stored. This is the fastest kind of cache to search, but the most inflexible. Set-associative caches allow a portion of main memory to be stored in one of several cache entries. This allows fast retrieval and flexible storage. A fully-associative cache allows memory to be stored in any entry of the cache. Retrieval is very slow, so this organization is not very practical.

38 Direct-Mapped Caches Direct-mapped caches contain a number (say 2048) of entries. Each entry consists of: The Valid bit, which indicates whether there is any valid data in this entry or not. Initially, all entries are marked invalid. The Tag field, which consists of a unique value (in the example, 16 bits) identifying the corresponding line of memory from which the data came. The Data field, which contains a copy of the data in memory. This field holds one cache line of 32 bytes.

39 Direct-Mapped Caches A memory word can be stored in exactly one place within a direct-mapped cache. Given a memory address, there is only one place to look for it in the cache. For storing and retrieving data from the cache, the address is broken into 4 components: The TAG field corresponds to the Tag bits stored in a cache entry. The LINE field indicates which entry holds the corresponding data, if they are present.

40 Direct-Mapped Caches The WORD field tells which word within a line is referenced. The BYTE field is usually not used, but if only a single byte is requested, it tells which byte within the word is needed. For a cache supplying only 32-bit words, this field will always be 0. When the CPU produces an address, the hardware extracts the 11 LINE bits from the address and uses them to index into the cache to find one of the 2048 entries. If the entry is valid, the tags are compared; if they agree, a cache hit has occurred.
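
As a concrete illustration, here is a small C sketch of that address split, assuming the example geometry above: 2048 entries, 32-byte lines, and 32-bit addresses, giving a 16-bit TAG, an 11-bit LINE, a 3-bit WORD, and a 2-bit BYTE field:

    #include <stdint.h>

    typedef struct {
        uint32_t tag, line, word, byte;
    } addr_fields;

    static addr_fields split(uint32_t addr) {
        addr_fields f;
        f.byte = addr & 0x3;            /* bits 1..0:   byte within the word */
        f.word = (addr >> 2) & 0x7;     /* bits 4..2:   word within the line */
        f.line = (addr >> 5) & 0x7FF;   /* bits 15..5:  index into the cache */
        f.tag  = addr >> 16;            /* bits 31..16: tag to compare       */
        return f;
    }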

41 Direct-Mapped Caches (a) A direct-mapped cache. (b) A 32-bit virtual address.

42 MIPS Direct-Mapped Cache Example The slide diagrams a MIPS direct-mapped cache with one word per block and a cache size of 1K words: the address is split into a 20-bit tag, a 10-bit index, and a 2-bit byte offset, and each of the 1024 entries holds a valid bit, the 20-bit tag, and 32 bits of data; a hit is signaled when the indexed entry is valid and its tag matches. What kind of locality are we taking advantage of?

43 Direct-Mapped Caches If the cache entry is invalid or the tags do not match, a cache miss has occurred. In this case, the 32-byte cache line is fetched from memory and stored in the cache entry, replacing what was there. If the existing cache entry has been modified since being loaded, it must first be written back to main memory. The retrieval process is made faster by reading the data from the cache in parallel with the tag comparison; on a miss, the data read are simply discarded.
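
Continuing the C sketch above, the hit/miss check might look like this (the cache_entry layout is an assumption for illustration, not taken from the slides):

    typedef struct {
        int      valid;     /* does this entry hold valid data?          */
        int      dirty;     /* modified since loaded? (needs write back) */
        uint32_t tag;       /* identifies which line is cached here      */
        uint32_t data[8];   /* one 32-byte cache line, as 8 words        */
    } cache_entry;

    /* Returns 1 on a hit and delivers the word; returns 0 on a miss,
       in which case the caller fetches the line from memory, writing
       the old line back first if it is dirty. */
    static int lookup(cache_entry cache[2048], uint32_t addr, uint32_t *out) {
        addr_fields f = split(addr);
        cache_entry *e = &cache[f.line];
        if (e->valid && e->tag == f.tag) {
            *out = e->data[f.word];   /* hardware reads this in parallel with the tag compare */
            return 1;
        }
        return 0;
    }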

44 Direct-Mapped Caches Up to 64K bytes of contiguous data can be stored in the cache. However, two lines whose addresses differ by a multiple of 64K (65,536 bytes) map to the same entry and so cannot be stored in the cache at the same time. Direct-mapped caches are the most common kind of cache, and they perform quite well, since collisions of the kind described above don't happen often. A compiler can take the cache into account when placing data and instructions in memory.

45 Set-Associative Caches A solution to the problem of lines competing for the same cache entry is to allow two or more lines in each cache entry. A cache with n possible entries for each address is called an n-way set-associative cache. A set-associative cache is inherently more complicated than a direct-mapped cache because we need to check up to n tags to see if the needed line is present in an entry. Two-way and four-way caches perform quite well.

46 Set-Associative Caches A four-way set-associative cache.

47 Set-Associative Caches The use of a set-associative cache brings up the question of which line should be discarded when a new line is brought into the entry. A good choice is the LRU (Least Recently Used) algorithm, which replaces the line that was accessed longest ago. If we carry the set-associative idea to the extreme, we have a 2048-way cache, i.e., a fully-associative cache. These don't improve much over the performance of 4-way caches and so are not much used.
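
A minimal C sketch of LRU victim selection within one four-way set (a per-way timestamp is used for clarity; real hardware keeps a few LRU bits instead):

    #include <stdint.h>
    #define WAYS 4

    typedef struct {
        int      valid;
        uint32_t tag;
        uint64_t last_used;   /* time of the most recent access */
    } way_t;

    /* Choose the way to evict: any invalid way first, otherwise the
       way whose last access lies furthest in the past. */
    static int choose_victim(const way_t set[WAYS]) {
        int victim = 0;
        for (int i = 0; i < WAYS; i++) {
            if (!set[i].valid)
                return i;
            if (set[i].last_used < set[victim].last_used)
                victim = i;
        }
        return victim;
    }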

48 Cache Policy Writing a word of memory in the cache raises a policy question. We can use write through (immediately update main memory) or write back (only update main memory when the cache line is evicted). Further, we need to decide whether to use write allocation: should we bring a cache line into the cache when we write to an uncached line? Write allocation is good for write back, but not for write through.
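
To make the two decisions concrete, here is a hedged C skeleton of the write path, reusing the direct-mapped structures sketched earlier (write_through and write_allocate are illustrative flags, not names from the slides; the memory traffic itself is left as comments):

    static void write_word(cache_entry cache[2048], uint32_t addr, uint32_t value,
                           int write_through, int write_allocate) {
        addr_fields f = split(addr);
        cache_entry *e = &cache[f.line];
        int hit = e->valid && e->tag == f.tag;

        if (!hit && write_allocate) {
            /* fetch the line from memory (writing the old line back first
               if dirty) -- elided here; then install the new tag */
            e->valid = 1;
            e->tag   = f.tag;
            e->dirty = 0;
            hit = 1;
        }
        if (hit) {
            e->data[f.word] = value;
            if (write_through) {
                /* write through: update main memory immediately */
            } else {
                e->dirty = 1;   /* write back: memory updated on eviction */
            }
        } else {
            /* no write allocation: send the word straight to main memory */
        }
    }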

49 Cache Summary The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time. Temporal locality: locality in time. Spatial locality: locality in space. Three major categories of cache misses: Compulsory misses: sad facts of life (example: cold-start misses). Conflict misses: reduced by increasing cache size and/or associativity (nightmare scenario: the ping-pong effect!). Capacity misses: reduced by increasing cache size. The cache design space includes total size, block size, associativity (and replacement policy), write-hit policy (write-through, write-back), and write-miss policy (write allocate, write buffers).

50 Branch Prediction Modern computers are highly pipelined, having ten or more stages. Pipelining works best on linear code, so that consecutive words from memory can be read and sent off to be executed. Unfortunately, real code is full of branches. See, for example, the code on the next slide: two of the five instructions are branches, and the longest linear code sequence there is two instructions.

51 Branch Prediction (a) A program fragment. (b) Its translation to a generic assembly language.

52 Branch Prediction Even unconditional branches cause problems, since we have to decode the instruction (in the pipeline) before discovering that it is a branch. A number of pipelined machines (such as the UltraSPARC III) have the property that the instruction following an unconditional branch is executed, even though logically it should not be. The position after a branch is called a delay slot; the compiler tries to put a useful instruction (or a NOP) there. The Pentium 4 does not have this property, but hiding the delay slot adds hardware complexity.

53 Branch Prediction Conditional branches are even worse: not only do they have delay slots, but the fetch unit does not know where to read from until much later in the pipeline. Early pipelined machines just stalled until it was known whether the branch would be taken or not. What most machines do now when they hit a conditional branch is predict whether it will be taken. One simple static technique: assume all backward branches are taken and all forward ones are not.
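
The backward-taken/forward-not-taken heuristic fits in one line of C (a sketch assuming PC-relative branch targets):

    #include <stdint.h>

    /* Predict taken iff the branch jumps backward, as loop-closing
       branches do; forward branches are predicted not taken. */
    static int predict_taken(uint32_t branch_pc, uint32_t target_pc) {
        return target_pc < branch_pc;
    }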

54 Branch Prediction If we guess incorrectly, we have to undo the effects of the incorrectly executed instructions. We can allow the instructions to continue to execute until they try to change the machine's state: instead of overwriting the real register, the value is put into a (secret) scratch register and only copied to the real register after the prediction is known to be correct. Alternatively, we can record the value of each overwritten register in a (secret) scratch register and restore the old values if the prediction turns out to be wrong.

55 Dynamic Branch Prediction We can use a history table to record whether each branch was taken or not taken, and then consult this table when the branch occurs again. The prediction is simply that the branch will behave the same way it did the time before. This works well except at the end of loops, where the final iteration mispredicts. To handle this case, we might decide to change the prediction only when it is wrong twice in a row. We can organize the history table the same way a cache is organized.

56 Dynamic Branch Prediction (a) A 1-bit branch history. (b) A 2-bit branch history. (c) A mapping between branch instruction address and target address.

57 Dynamic Branch Prediction A 2-bit finite-state machine for branch prediction.
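
A C sketch of that finite-state machine (the state names are illustrative; the point is that one wrong guess, such as on the final iteration of a loop, does not flip the prediction):

    typedef enum { STRONG_NOT_TAKEN, WEAK_NOT_TAKEN,
                   WEAK_TAKEN, STRONG_TAKEN } pred_t;

    static int predict(pred_t s) {
        return s >= WEAK_TAKEN;   /* the upper two states predict taken */
    }

    /* Saturating 2-bit counter: move one step toward the actual outcome. */
    static pred_t update(pred_t s, int taken) {
        if (taken)
            return (s == STRONG_TAKEN) ? s : s + 1;
        else
            return (s == STRONG_NOT_TAKEN) ? s : s - 1;
    }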
