Designing a Dual Core Processor

Designing a Dual Core Processor

Manfred Georg
Applied Research Laboratory
Department of Computer Science and Engineering
Washington University in St. Louis
St. Louis, MO, USA
mgeorg@arl.wustl.edu

Abstract

In this paper we present the design of a dual core processor. Two simple five stage pipelined processing cores are combined on a single chip. A bus based memory hierarchy is used to communicate between low level caches and higher memories. Requests over the bus are queued until they can be serviced. Basic hardware stalling and hazard avoidance through data forwarding are built into each processing core. A TestAndSet operation is provided for easily synchronizing the processing cores.

1 Introduction

The development of increasingly compact and efficient chip manufacturing techniques has led to increased capabilities on chips. Shrinking feature sizes not only improve performance intrinsically, but also allow for larger, more complicated chip designs. Recently, an increasing amount of chip space has been devoted to caches in an effort to minimize memory latency. Complicated processor designs, such as superscalar processors, require substantially more space than their predecessors. These processors are geared towards increasing the instruction level parallelism in programs. Often there is no way, or little incentive, to explicitly parallelize a program at the application layer. In these cases, it is best to have a single fast processor which is able to exploit instruction level parallelism. However, many applications are easily split into multiple tasks at the application layer. When a workload is split into multiple threads, it is more effective to have several slower processors than a single fast one. A typical superscalar processor such as the Alpha EV6 takes as much as four times more chip area than its simpler, single issue counterpart, the Alpha EV5 [7].

2 Related Work

There has been much work, and many comparisons and proposals, in the area of multiprocessing. Multiple processors have been used extensively, and a wide variety of programming models accommodate the use of multiple processing cores [4]. In their seminal work, Olukotun et al. propose the use of multiple cores on a single chip for general purpose processors [8]. Kumar et al. present a comparison between the EV5 and EV6 Alpha cores in a heterogeneous chip multiprocessor (CMP). They are able to both improve performance [7] and reduce power consumption [5] using heterogeneous mixes of processing cores on a single chip. Furthermore, Kumar et al. are also able to better utilize resources such as memory, the bus, and floating point units through the sharing of resources between on-chip processing cores [6]. Crowley et al. present a comparison of different parallel processing techniques in network interfaces: fine grained multithreading (FGMT), chip multiprocessing (CMP), and simultaneous multithreading (SMT) [3]. Wun et al. use the Intel IXP 2850 network processor, which is extremely parallel and includes heterogeneous elements, to speed up easily parallelizable algorithms in computational biology [11]. In addition to adding complexity, deeply pipelined processors use more power and generate more heat. For this reason, even if cost/performance for a single thread is the only metric used, there is still a trade off between complexity and operational cost [9]. Industry has reacted to this phenomenon; every major chip manufacturer has now created a multi-core line of chips.
For example, IBM's POWER4 chip combines two PowerPC processors on a single chip and uses a high throughput crossbar switch for communication between the cores and the L2 cache [10]. Whenever multiple processing units are used in conjunction, communication becomes an important issue. The use of multiple cores within the same chip allows for tightly coupled communication that is not feasible between more distant components.

Adve and Gharachorloo discuss the benefits of different shared memory consistency models, arguing that sequential consistency is unnecessarily restrictive and causes performance degradation in multiprocessors [1]. Zhang and Asanović present a processor design which leverages simple, previously designed and optimized processing cores by shrinking them and placing many of them in a grid network on a single chip [12]. Cache coherence is improved through the use of victim replication and the sharing of L2 caches between nearby processors. Barroso et al. observed that designing increasingly complex processors to exploit instruction level parallelism is becoming cost prohibitive, while the use of many simple cores, along with simpler manufacturing techniques, can produce competitive processors able to outperform modern processors in on-line transaction processing (OLTP) [2].

3 Instruction Set Architecture

I use a small subset of the MIPS DLX instruction set. In the spirit of RISC, I have reduced the instruction set as much as possible while trying not to compromise functionality. A list of all instructions in the instruction set can be found in Table 1. All instructions are 32 bits long. As recommended by the principle of orthogonality, all R type instructions also have an I type analog. All calculations which are possible using the original DLX instruction set are also possible in the new instruction set, using a small number of instructions. The set-if instructions are replaced with corresponding branch instructions which test for less than zero or less than or equal to zero. Some functionality is lost in the removal of jump instructions, namely the ability to jump through a register, which is needed for function calls. However, this is the only place where a significant degradation in functionality is observed. We add a TestAndSet instruction to the instruction set so that we can easily synchronize the two processing cores.

There are several restrictions which apply to the entire instruction set. First, all memory locations are addressed in words (4 byte groups), with no provisions for byte or half word integers. This implies that memory accesses are implicitly word aligned. Furthermore, there are no packing or unpacking instructions. Additionally, for simplicity, all numbers are considered to be two's complement signed integers.

4 Five Stage Pipeline

Each core uses a fairly standard five stage pipeline design. The architecture is based on a subset of the MIPS DLX instruction set. There is a single branch delay slot.

4.1 Instruction Fetch

The first stage of the pipeline is concerned with the fetching of instructions. The address of the next instruction is stored in the Program Counter (PC). This register is incremented by 1 each clock cycle. Additionally, it can be set arbitrarily by a successful branch instruction. The instruction at this address is read from the memory hierarchy, which is discussed later.
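As a concrete illustration of this stage, the following VHDL fragment shows one way the PC logic just described could be written. It is a sketch, not the paper's actual source: the entity and port names are assumptions, and a stall input is included in anticipation of Section 5.1.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Program counter for the instruction fetch stage (Section 4.1).
    -- The PC advances by 1 each cycle (memory is word addressed), can be
    -- overwritten by a taken branch, and holds its value while stalled.
    entity pc_unit is
      port (
        clk           : in  std_logic;
        stall         : in  std_logic;             -- hold PC while stalled
        branch_taken  : in  std_logic;             -- from the decode stage
        branch_target : in  unsigned(31 downto 0);
        pc            : out unsigned(31 downto 0)
      );
    end entity pc_unit;

    architecture rtl of pc_unit is
      signal pc_reg : unsigned(31 downto 0) := (others => '0');
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if stall = '0' then
            if branch_taken = '1' then
              pc_reg <= branch_target;  -- successful branch overwrites the PC
            else
              pc_reg <= pc_reg + 1;     -- word addressing, so increment by 1
            end if;
          end if;
        end if;
      end process;
      pc <= pc_reg;
    end architecture rtl;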
4.2 Register Decode

The second stage of the pipeline is a decoding phase. In the first part, the instruction is parsed depending on its format. There are two possible instruction formats, register-register (R) and register-immediate (I); the latter is also used for branch instructions. Figure 1 shows the layout of each instruction type. In general, the operation code is 5 bits, each register field is 5 bits (32 registers), and the immediate is 17 bits. The second part of this phase is the fetching of the register contents. Two registers are fetched from the register file at the same time. The first register, register zero, always contains the value zero. This makes it easier to perform a number of simple operations which are degenerate cases of other operations, such as negation (subtraction from zero) or a register copy (XOR with zero). The destination register is not decoded or fetched; it is passed through the pipeline unaltered until it is needed in the last stage.

Branches are also evaluated in this stage. However, because there is not enough time after register fetching to perform intense calculations, branch conditions are always comparisons with zero. If a branch is taken, then the PC is overwritten with a new value. If a branch is not taken, then the PC will have been incremented as usual, and we continue with the pipelined execution as if a null operation had been executed. Effectively, this means that the processor always predicts that no branching will occur.

Since the immediate value is only 17 bits long and all internal numbers must be 32 bits long, the immediate is sign extended to 32 bits. This allows later stages to not differentiate between values which came from registers and values which came from the immediate.
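The field extraction and sign extension described above can be made concrete with a small VHDL sketch. The exact bit positions of each field are an assumption (the paper only gives the field widths), and the names are illustrative rather than the paper's code.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Decode-stage field extraction for the formats in Figure 1 (a sketch;
    -- bit positions are assumed: opcode in the high bits, immediate low).
    entity decode_fields is
      port (
        instr  : in  std_logic_vector(31 downto 0);
        opcode : out std_logic_vector(4 downto 0);
        r1     : out std_logic_vector(4 downto 0);  -- first source register
        r2     : out std_logic_vector(4 downto 0);  -- second source (R type)
        rd     : out std_logic_vector(4 downto 0);  -- destination (R type slot)
        imm32  : out std_logic_vector(31 downto 0)  -- sign extended immediate
      );
    end entity decode_fields;

    architecture rtl of decode_fields is
    begin
      opcode <= instr(31 downto 27);
      r1     <= instr(26 downto 22);
      r2     <= instr(21 downto 17);  -- for I type this slot holds Rd instead
      rd     <= instr(16 downto 12);  -- R type position; bits 11..0 unused
      -- The 17 bit immediate is sign extended to 32 bits so that later
      -- stages need not distinguish register values from immediates.
      imm32  <= std_logic_vector(resize(signed(instr(16 downto 0)), 32));
    end architecture rtl;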

4.3 Computation

The computation stage is the phase in which instructions are actually executed in the Arithmetic Logic Unit (ALU). The ALU is able to add and subtract numbers, perform bitwise operations, and perform bitwise shifts. It is also used to compute memory addresses through addition.

Table 1. Instruction Set

    Instruction   Description                                        Format
    ADD           add                                                R
    ADDI          add immediate                                      I
    AND           and                                                R
    ANDI          and immediate                                      I
    OR            or                                                 R
    ORI           or immediate                                       I
    SLA           shift left arithmetic                              R
    SLAI          shift left arithmetic immediate                    I
    SUB           subtract                                           R
    SUBI          subtract immediate                                 I
    XOR           xor                                                R
    XORI          xor immediate                                      I
    BEQZ          branch if register equal to zero                   I
    BNEZ          branch if register not equal to zero               I
    BLTZ          branch if register is less than zero               I
    BLEZ          branch if register is less than or equal to zero   I
    LW            load word                                          I
    SW            store word                                         I
    TAS           Test And Set                                       I

Figure 1. The instruction format for each type of instruction

    R type:  op code (5) | R1 (5) | R2 (5) | Rd (5) | Unused (12)
    I type:  op code (5) | R1 (5) | Rd (5) | Immediate (17)

4.4 Memory

In the memory stage, registers are able to interact with the memory hierarchy by being loaded or stored. For a normal operation, which does not concern memory, nothing happens during this stage and the result from the ALU is simply passed through. However, for a load or a store operation, the result from the ALU is the memory address. For a store operation, the value to be stored, which was retrieved from a register, is also provided by the previous stage. The write back register, when applicable, is also passed unmodified through this stage.

4.5 Write Back

The fifth and final stage of the pipeline is the write back stage. In this stage, the result from the ALU or from memory is stored into the destination register within the register file.

5 Hazard Avoidance

Any time a pipeline is used to execute instructions in parallel, there is the possibility of problems arising due to data dependencies. A hazard occurs when an instruction in the pipeline has an unfulfilled dependence on another instruction which has not yet completed execution. Although complicated pipelines can produce a number of different kinds of hazards, our simple pipeline only gives rise to the read after write hazard. In this case, a read requires the result of an operation which has not yet been written back to the register file.

5.1 Stalling

The first step in solving this problem is to add the ability to stall to the processor. Each stage of the processor can be stalled independently. While stalled, a particular stage maintains the same output, regardless of how the input changes. This allows a later stage in the pipeline to wait for a condition, such as the writing of a register, while earlier stages are stalled. The use of stalls is absolutely vital when caches are used, since cache misses will delay the arrival of data significantly and unpredictably.
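A minimal VHDL sketch of such a stallable stage boundary follows; the generic-width pipeline register and its names are illustrative assumptions, not the paper's code.

    library ieee;
    use ieee.std_logic_1164.all;

    -- A stallable pipeline register (Section 5.1). While stall is asserted
    -- the register holds its output regardless of input changes; otherwise
    -- it latches the input each cycle. One sits between adjacent stages.
    entity stage_reg is
      generic (WIDTH : natural := 32);
      port (
        clk   : in  std_logic;
        stall : in  std_logic;
        d     : in  std_logic_vector(WIDTH - 1 downto 0);
        q     : out std_logic_vector(WIDTH - 1 downto 0)
      );
    end entity stage_reg;

    architecture rtl of stage_reg is
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if stall = '0' then
            q <= d;    -- normal operation: advance the pipeline
          end if;      -- stalled: keep the previous output
        end if;
      end process;
    end architecture rtl;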

5.2 Data Forwarding

Although certain situations necessitate stalling, it is sometimes possible to obtain a result directly from a later stage without waiting for it to be written to a register. The most obvious example is an operation which requires a result that the ALU produced for the instruction now in the memory stage. Results from the ALU are not modified in the memory stage; therefore, the result that will be written to the register file is already present in the pipeline. By placing some data lines and multiplexers in front of the ALU, we can allow it to calculate on values that have not yet been written to registers. This minimizes the number of stalls that are necessary.
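The multiplexing in front of one ALU input can be sketched in VHDL as follows. The index comparison and signal names describe one plausible wiring and are assumptions, not the paper's implementation.

    library ieee;
    use ieee.std_logic_1164.all;

    -- Forwarding multiplexer for one ALU operand (Section 5.2). If the
    -- instruction in the memory stage is about to write the register this
    -- operand reads, use the in-flight ALU result instead of the stale
    -- value read from the register file.
    entity forward_mux is
      port (
        src_reg     : in  std_logic_vector(4 downto 0);   -- operand's register
        mem_dest    : in  std_logic_vector(4 downto 0);   -- memory stage's Rd
        mem_writes  : in  std_logic;                      -- will write back
        regfile_val : in  std_logic_vector(31 downto 0);  -- read in decode
        mem_result  : in  std_logic_vector(31 downto 0);  -- ALU result in MEM
        alu_operand : out std_logic_vector(31 downto 0)
      );
    end entity forward_mux;

    architecture rtl of forward_mux is
    begin
      alu_operand <= mem_result
                     when mem_writes = '1' and src_reg = mem_dest
                          and mem_dest /= "00000"  -- register zero: never forward
                     else regfile_val;
    end architecture rtl;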
6 Communication

The use of two cores on a chip would be trivially simple were it not for the need to communicate between them. The pipeline is simply duplicated for each core, and the two copies run in parallel at the same frequency. However, the need for the caches in each processor to communicate with main memory and keep coherent state complicates matters.

6.1 Bus

Communication between the caches and memory is performed over a 34 bit wide bus, 2 bits of which carry an operation code. This bus is synchronous, allowing communication to initiate quickly without any preamble. Arbitration is done through a simple slotting algorithm which alternates between the caches. To simplify matters, the bus is held by an individual cache until its request has been completed. However, since the bus is not wide enough to carry an entire cache line at once, cache lines must be sent over it serially.

6.2 Cache

Each cache is a standard one-way associative (direct mapped) cache. The low bits of the addressed word are used to look up the proper entry in the cache. The high bits of the address are then compared against the high address bits stored with the cache entry. If these bits match, and the dirty bit has not been set (signifying that the entry is valid), then the cache entry is used. Otherwise, a cache miss occurs and the information must be retrieved from memory. Each entry in the cache is a cache line of 4 words, equaling 128 bits. There are 4 dirty bits, allowing each word which comprises the cache line to be independently invalidated. Words are invalidated when a different cache performs a write at their address. Since all communication with memory is performed through the bus, all cache modules can hear each other's write requests and set dirty bits appropriately.

6.3 Cache Controller

Since requests can arrive at the cache faster than they can be served on the bus, we require a queue of outgoing requests which must be serviced. Any number of write requests may be in the queue at one time (up to a maximum of 8), since write requests do not need to stall the processor. However, a read or TestAndSet request will stall the processor until a response is received. In the process, the entire queue of requests will be drained. Every time an entry is written to the cache, the corresponding entry, if clean, is updated. Furthermore, the data is also placed in the queue to be transmitted over the bus to memory. In this way, each write to the cache is also immediately written through to memory.

6.4 Memory

Memory is accessed through the bus. It is assumed that all memory operations on a word require four cycles to complete and can be pipelined. Therefore, the bus is idle for four cycles between a request and the response. However, in this interval the bus is held and cannot be used for any other request.

There are three different kinds of memory requests: a read, a write, and a TestAndSet. For any of these requests, the cache module sends the address of the operation to the memory over the bus. A read request retrieves an entire cache line at once, which, after the delay, is sent on the bus in four consecutive cycles. A write request only acts on a single word of data. This word is sent on the bus in the second cycle of the request, after which the bus must remain empty for three cycles while the write operation completes within main memory. A TestAndSet is similar to a read, except that it only acts on a single word, and thus only requires one cycle to send information back. Naturally, a TestAndSet operation atomically both returns the current value of the memory address and sets that value to one.
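The atomic read-and-set behavior on the memory side can be sketched in VHDL as below. This models the semantics only, under assumed names and sizes (a small word-addressed RAM, single-cycle response); the paper's four-cycle latency and bus protocol are not modeled.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Memory-side TestAndSet (Section 6.4): in one operation, return the
    -- current word and overwrite it with one. Because the bus is held for
    -- the whole request, no other core can interleave an access.
    entity tas_memory is
      port (
        clk     : in  std_logic;
        tas_req : in  std_logic;
        addr    : in  unsigned(7 downto 0);           -- 256 words, for the sketch
        old_val : out std_logic_vector(31 downto 0)   -- value before the set
      );
    end entity tas_memory;

    architecture rtl of tas_memory is
      type ram_t is array (0 to 255) of std_logic_vector(31 downto 0);
      signal ram : ram_t := (others => (others => '0'));
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if tas_req = '1' then
            old_val               <= ram(to_integer(addr));  -- test: read out
            ram(to_integer(addr)) <= x"00000001";            -- set: write one
          end if;
        end if;
      end process;
    end architecture rtl;

A core acquires a lock when the returned value is zero; storing zero back with SW releases it.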

7 Simulation Details

The entire dual core processor was written in VHDL and simulated using Xilinx ModelSim. Little effort was made to use realistic gate and wire delays or to fine-tune the clock frequency. However, an effort was made to balance the load in each pipeline stage. Furthermore, main memory was assumed to require several clock cycles to respond, and the cost of the wires from memory to the cache controllers was minimized. These are some essential points which should be considered when creating a more realistic model.

8 Conclusion

The trend in computer design has always been towards exploiting more parallelism. However, we are quickly approaching the point where it no longer makes sense to create more complicated designs that exploit ever more elaborate and less significant amounts of instruction level parallelism. The solution is to build systems in which parallelism is explicitly given at the application layer. In such cases, the use of multiple simple cores to boost performance is much more effective than a single complex core. I present a simple dual core design, leveraging two five stage pipelined processing cores combined on a single chip. A bus based memory hierarchy is used to communicate between low level caches and higher memories. Requests over the bus are queued until they can be serviced. Basic hardware stalling and hazard avoidance through data forwarding are built into each processing core. Furthermore, a TestAndSet operation is provided for easily synchronizing the processing cores.

References

[1] S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. Technical Report, Digital Western Research Laboratory, September 1995.
[2] L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, and S. Qadeer. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In International Symposium on Computer Architecture, May 2000.
[3] P. Crowley, M. E. Fiuczynski, J.-L. Baer, and B. N. Bershad. Characterizing Processor Architectures for Programmable Network Interfaces. In International Conference on Supercomputing, May 2000.
[4] W. Gropp and E. Lusk. A Taxonomy of Programming Models for Symmetric Multiprocessors and SMP Clusters. In Programming Models for Massively Parallel Computers, 1995.
[5] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architecture: The Potential for Processor Power Reduction. In International Symposium on Microarchitecture, December 2003.
[6] R. Kumar, N. P. Jouppi, and D. M. Tullsen. Conjoined-Core Chip Multiprocessing. In International Symposium on Microarchitecture, 2004.
[7] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas. Single-ISA Heterogeneous Multi-Core Architecture for Multithreaded Workload Performance. In International Symposium on Computer Architecture, June 2004.
[8] K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The Case for a Single-Chip Multiprocessor. In International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.
[9] V. Srinivasan, D. Brooks, M. Gschwind, and P. Bose. Optimizing Pipelines for Power and Performance. In International Symposium on Microarchitecture, November 2002.
[10] J. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. POWER4 System Microarchitecture. IBM Journal of Research and Development, 2002.
[11] B. Wun, J. Buhler, and P. Crowley. Exploiting Coarse-Grained Parallelism to Accelerate Protein Motif Finding with a Network Processor. In International Conference on Parallel Architectures and Compilation Techniques, September 2005.
[12] M. Zhang and K. Asanović. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. In International Symposium on Computer Architecture, 2005.
