Microprocessors
Von Neumann architecture

The first computers ran a single fixed program (like a numeric calculator). To change the program, one had to re-wire, re-structure, or re-design the computer. The people who did this were not called computer programmers, as they are today, but computer architects. A Von Neumann computer uses a single memory to hold both instructions and data.
The program is written in an appropriate language and is not hardwired into the computer itself; the computer is re-programmable. In a Von Neumann computer programs can be seen as data; as a consequence, a malfunctioning program can overwrite itself or other programs and crash the computer.
In a Von Neumann processor an instruction is read from memory and decoded, the memory locations the instruction refers to are fetched, the operation is performed, and the results are written back to memory. The term von Neumann architecture dates from June 1945 and is named after the mathematician John von Neumann, although the architecture was not designed by von Neumann alone.
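The fetch-decode-execute cycle just described can be sketched as a tiny interpreter. The three-instruction machine below is hypothetical (the opcodes LOAD, ADD, STORE, and HALT are made up for illustration); what matters is that instructions and data live side by side in one memory.

```python
# Minimal sketch of a von Neumann fetch-decode-execute loop.
# The instruction set (LOAD/ADD/STORE/HALT) is hypothetical;
# instructions and data share the same memory.

def run(memory, pc=0, acc=0):
    while True:
        op, arg = memory[pc]          # fetch the instruction from memory
        pc += 1
        if op == "LOAD":              # decode, then execute
            acc = memory[arg]         # operand fetched from the same memory
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc         # result written back to memory
        elif op == "HALT":
            return memory

# Program (addresses 0-3) and data (addresses 10-12) in one memory:
mem = {0: ("LOAD", 10), 1: ("ADD", 11), 2: ("STORE", 12), 3: ("HALT", None),
       10: 2, 11: 3, 12: 0}
run(mem)
# mem[12] now holds 2 + 3 = 5
```

Because the program itself sits in the same memory as the data, a buggy STORE could just as easily overwrite an instruction, which is exactly why a malfunctioning program can crash such a machine.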
The Von Neumann bottleneck

The separation between the CPU and memory leads to what is known as the von Neumann bottleneck. The throughput (data transfer rate) between the CPU and memory is low in comparison with the amount of memory available and the rate at which the CPU can work. As a result, the CPU is continuously forced to wait for data to be transferred to or from memory. Since CPU speed and memory size have increased much faster than the throughput between the two, the bottleneck has become more severe. A cache memory between the CPU and main memory helps to alleviate the problem.
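The benefit of a cache can be quantified with the standard average memory access time (AMAT) formula: AMAT = hit time + miss rate × miss penalty. The cycle counts below are illustrative, not measurements of any particular machine.

```python
# Average memory access time with a cache between CPU and main memory.
# hit_time, miss_penalty, and hit_rate are illustrative numbers only.

def amat(hit_time, miss_penalty, hit_rate):
    return hit_time + (1.0 - hit_rate) * miss_penalty

# Without a cache, every access pays the full main-memory latency:
no_cache = amat(hit_time=100, miss_penalty=0, hit_rate=1.0)    # 100 cycles
# With a small fast cache that hits 95% of the time:
with_cache = amat(hit_time=2, miss_penalty=100, hit_rate=0.95) # about 7 cycles
print(no_cache, with_cache)
```

Even a modest hit rate cuts the average latency by an order of magnitude, which is why caches are the standard remedy for the bottleneck.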
The Harvard architecture

In the Harvard architecture there are separate storage and signal pathways for instructions and data. In this architecture, the word width, timing, implementation technology, and memory address structure can differ for program and data. Instruction memory is often wider than data memory. In some systems, instructions can be stored in read-only memory while data memory generally requires random-access memory. Typically there is much more instruction memory than data memory, so instruction addresses are wider than data addresses. The CPU can be either reading an instruction or reading/writing data from/to memory.
Both cannot occur at the same time in a Von Neumann architecture, since the instructions and data use the same signal pathways and memory. A computer following the Harvard architecture can be faster because it is able to fetch the next instruction at the same time it completes the current one (a simple form of pipelining). The speed is gained at the expense of more complex electrical circuitry. Modern high-performance CPU designs incorporate aspects of both the Harvard and von Neumann architectures: on-chip cache memory is divided into an instruction cache and a data cache, while main memory remains unified.
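The advantage of the separate pathways can be seen in an idealized cycle count. Assume each instruction needs one instruction fetch and one data access, each taking one memory cycle; the model below is a deliberate simplification (no caches, no stalls).

```python
# Idealized cycle counts for n instructions that each fetch one
# instruction word and make one data access (one cycle per access).

def von_neumann_cycles(n):
    # A single shared memory port serializes the two accesses.
    return n * 2

def harvard_cycles(n):
    # Separate instruction and data paths let the two accesses overlap.
    return n * 1

print(von_neumann_cycles(1000))  # 2000
print(harvard_cycles(1000))      # 1000
```

Under these assumptions the Harvard machine is twice as fast, which is exactly the gap that split instruction/data caches recover in modern designs.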
Complex instruction set computer (CISC)

A complex instruction set computer (CISC) is a microprocessor instruction set architecture in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store. The terms register-memory and memory-memory refer to the same concept. In the early days of computers, compilers did not exist: programming was done in either machine code or assembly language. To make programming easier, computer architects created more and more complex instructions, direct representations of high-level constructs in high-level programming languages. The attitude at the time was that hardware design was easier than compiler design, so the complexity went into the hardware.
Another force that encouraged complexity was the scarcity of memory. Because every byte of memory was precious (an entire system might have only a few kilobytes of storage), the industry moved toward highly encoded instructions, variable-length instructions, instructions that performed multiple operations, and instructions that combined data movement with computation. For these reasons, CPU designers tried to make instructions do as much work as possible. This led to single instructions that would do all of the work of a computation: load the two numbers to be added, add them, and store the result back directly to memory. The compact nature of CISC results in smaller program sizes and fewer accesses to main memory.
While many designs achieved the aim of higher throughput at lower cost and also allowed high-level language constructs to be expressed by fewer instructions, it was observed that programs did not actually take advantage of these complex instructions. This observation marked the point of departure from CISC to RISC. Examples of CISC processors are the Intel x86 CPUs and the Intel 8051 microcontroller. The terms RISC and CISC have become less meaningful with the continued evolution of both CISC and RISC designs and implementations.
Reduced instruction set computer (RISC)

The reduced instruction set computer, or RISC, is a CPU design philosophy that favors a reduced and simpler instruction set. The term load-store refers to the same concept. The idea was originally inspired by the discovery that many of the features included in traditional CPU designs (i.e., CISC) to facilitate coding were being ignored by actual programs. In the late 1970s researchers demonstrated that the majority of the many addressing modes present in CISC microprocessors were ignored by most programs. This was a side effect of the increasing use of compilers to generate programs, as opposed to writing them in assembly
language. In other words, compilers were not able to exploit the features of CISC instruction sets. At about the same time, CPUs started to run faster than the memory they talked to. It became apparent that more registers (and later caches) would be needed to support these higher operating frequencies. These additional registers and cache memories would require sizeable chip or board area that could be made available if the complexity of the CPU was reduced. Since real-world programs spent most of their time executing very simple operations, some researchers decided to focus on making those common operations as simple and as fast as possible. The goal of RISC was to make instructions so simple that each one could be executed in a single clock cycle.
However, RISC also had its drawbacks. Since a series of instructions is needed to complete even simple tasks, the total number of instructions read from memory is larger, and execution can therefore take longer (see the Von Neumann bottleneck). In the early 1980s it was thought that existing designs were reaching theoretical limits, and that future improvements in speed would come primarily from improved semiconductor processes, that is, smaller features (transistors and wires) on the chip. The complexity of the chip would remain largely the same, but the smaller size would allow it to run at higher clock rates (Moore's law). The CDC 6600 supercomputer, designed in 1964 by Jim Thornton and Seymour Cray and often regarded as a RISC precursor, had 74 opcodes, while the Intel 8086 has around 400.
RISC designs have led to a number of successful platforms and architectures, including the PlayStation, PlayStation 2, PlayStation Portable, and PlayStation 3; the Nintendo 64, GameCube, and Wii game consoles; Microsoft's Xbox 360; and Palm PDAs.
Pipeline

An instruction is made of micro-instructions. In a pipelined processor, the processor works on one micro-instruction of several different instructions at the same time. For example, the classic RISC pipeline is broken into five stages:

1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back

The key to pipelining is the observation that the processor can start reading the next instruction as soon as it finishes reading the last, meaning that it works on two instructions simultaneously: one is being read while the other is being decoded (a two-stage pipeline). While no single instruction completes any faster, the next instruction completes right after the previous one. The result is a much more efficient utilization of processor resources. Pipelining reduces the cycle time of a processor and hence increases instruction throughput, the number of instructions that can be executed in a unit of time. A typical CISC instruction to add two numbers might be ADD A, B, C, which adds the
values found in memory locations A and B and then puts the result in memory location C. In a pipelined processor, the pipeline controller would break this into a series of simpler instructions similar to:

LOAD A, R1
LOAD B, R2
ADD R1, R2, R3
STORE R3, C
LOAD next instruction

The R locations are registers, temporary storage inside the CPU that is quick to access. The end result is the same: the numbers are added and the result is placed in C, and the time taken to drive the addition to completion is no different (possibly even greater than in the CISC case) from the non-pipelined case. The key to understanding the advantage of pipelining is to consider what happens when
this sequence is half-way done, at the ADD instruction for instance. At this point the circuitry responsible for loading data from memory is no longer being used and would normally sit idle. Instead, the pipeline controller fetches the next instruction from memory and starts loading the data it needs into registers. That way, when the ADD instruction is complete, the data needed for the next ADD is already loaded and ready to go. The overall effective speed of the machine can be greatly increased because fewer parts of the CPU sit idle. Most microprocessors manufactured today use at least a two-stage pipeline (the Atmel AVR and the PIC microcontroller, for example, each have a two-stage pipeline). Advantages of pipelining: the cycle time of the processor is reduced, thus increasing instruction throughput in most cases.
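The throughput gain can be made concrete with the standard idealized timing formula: with no stalls, a k-stage pipeline finishes n instructions in k + (n − 1) cycles, because after the pipeline fills, one instruction completes every cycle. The comparison below assumes a perfectly balanced pipeline with no hazards.

```python
# Idealized pipeline timing (no stalls, no branches, balanced stages).

def unpipelined_cycles(n, k):
    # Each instruction must finish all k stages before the next starts.
    return n * k

def pipelined_cycles(n, k):
    # k cycles to fill the pipeline, then one completion per cycle.
    return k + (n - 1)

n, k = 100, 5   # 100 instructions through the five-stage RISC pipeline
print(unpipelined_cycles(n, k))  # 500
print(pipelined_cycles(n, k))    # 104
```

For long instruction streams the speedup approaches k, the number of stages, even though each individual instruction still takes k cycles of latency.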
Advantages of not pipelining: the processor executes only a single instruction at a time. This avoids branch delays (in a pipeline, every branch is potentially delayed) and problems with sequential instructions being executed concurrently. Consequently, the design is simpler and cheaper to manufacture. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent, because extra flip-flops must be added to the data path of a pipelined processor. A non-pipelined processor also has a stable instruction bandwidth, whereas the performance of a pipelined processor is much harder to
predict and may vary more widely between different programs. Many designs include pipelines as long as 7, 10, or even 31 stages (as in the later Intel Pentium 4). The downside of a long pipeline is that when a program branches, the entire pipeline must be flushed, a problem that branch prediction helps to alleviate. The higher throughput of pipelines falls short when the executed code contains many branches: the processor cannot know where to read the next instruction, and must wait for the branch instruction to finish, leaving the pipeline behind it empty. After the branch is resolved, the next instruction has to travel all the way through the pipeline before its result becomes available and the processor appears to work again. In the extreme case,
the performance of a pipelined processor can approach that of an unpipelined processor, or even fall slightly below it, when all but one pipeline stage are idle and a small overhead is present between stages.
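The cost of branches can be expressed as an effective cycles-per-instruction (CPI) figure: each branch that flushes the pipeline adds its flush penalty to the ideal CPI. The branch fraction and penalty below are illustrative values, not measurements of any real processor.

```python
# Effect of pipeline flushes on cycles per instruction (CPI).
# branch_fraction and flush_penalty are illustrative values.

def effective_cpi(base_cpi, branch_fraction, flush_penalty):
    # Each flushing branch adds flush_penalty cycles on top of base_cpi.
    return base_cpi + branch_fraction * flush_penalty

ideal = effective_cpi(1.0, 0.0, 0)     # straight-line code: 1 cycle/instr
branchy = effective_cpi(1.0, 0.2, 10)  # 20% branches, 10-cycle flush
print(ideal, branchy)                  # 1.0 vs 3.0 cycles per instruction
```

With 20% branches and a 10-cycle flush, throughput drops to a third of the ideal, which is why deep pipelines depend so heavily on accurate branch prediction.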
Bibliography

http://www.wikipedia.com