COFFEE A Core for Free


Juha Kylliäinen, Jari Nurmi and Mika Kuulusa
Tampere University of Technology, Finland
juha.p.kylliainen@tut.fi

Abstract

This paper presents the design and implementation of an open source processor core developed at Tampere University of Technology, Finland. The design guidelines of a RISC core are introduced and some of the typical design tradeoffs are presented. The architecture of the developed processor engine, the COFFEE¹ RISC core, is explained.

1. Introduction

The complexity of a processor design can vary from a multi-million gate design to a design with a few tens of kilogates. Likewise, power consumption can vary from the milliwatt range to tens of watts. In order to set the scope for comparison, we need to classify processors and define what we mean by a processor core in this context.

Processors targeted at personal computers and mainframe computers form a class with similar requirements. Computing performance and hardware support for operating systems are key requirements in that class, whereas power consumption is not an issue.

Processors targeted at embedded systems belong to another class. Embedded systems are products other than general purpose computing machines. Processors in such systems are used to implement certain functionality of a product; the capabilities of the processor are not important to the user as long as the product fulfils certain requirements. Some level of performance is needed, and usually a processor which has just enough performance, but no more, is selected. Especially in battery operated mobile devices, it is essential to select a processor which is not more powerful than needed in order to avoid excessive power consumption.

Independent of the target application domain, every processor has an execution core. Here we define the core of a processor as the unit responsible for interpretation and execution of the instruction set of the processor in question.
If we strip away all peripheral components, buses and cache memories, only the core is left. In many contexts, processor cores also contain cache memories, but we prefer to leave them out. This is because the memory architecture drastically affects performance as well as power consumption and chip area. Also, there is no single optimal solution for the memory hierarchy; the design of the memory architecture is guided by the application.

¹ COFFEE RISC Core is a trademark of Tampere University of Technology, Tampere, Finland.

We can roughly divide instruction sets into RISC (Reduced Instruction Set Computer), CISC (Complex Instruction Set Computer) and DSP (Digital Signal Processing) like instruction sets. DSP processing cores belong to the application specific group. RISC and CISC cores are suitable for general purpose processing. RISC and CISC instruction sets differ mainly because of a different design approach. We can argue that a CISC-like instruction set is designed for a human and a RISC-like instruction set is designed for a compiler. In the early days of minicomputers, programming was carried out using symbolic machine code, that is, assembly code. It was advantageous to have instructions that were understandable by humans and that each performed more work. Nowadays compilers produce RISC-like instructions even for CISC processors. Instructions which are not used by a compiler are quite useless and only make the hardware more complex. This in turn reduces performance and increases chip area and power consumption. This is one of the reasons why the current trend is towards RISC type processors. In our COFFEE core we have adopted the RISC philosophy to derive a good processing engine for embedded computing.

2. RISC Design Philosophy

We can point out a few rules which are followed by RISC designers. The abbreviation RISC focuses on reducing the number and complexity of instructions, but modern RISCs may have quite complex instructions included in their instruction set.
One instruction per cycle. This requirement does not make sense unless we refer to a pipelined design. If multiple parallel execution pipelines are used, more than one instruction can complete each cycle. The execution time of a program depends on the throughput of the pipeline, not the latency of individual instructions. Increasing the number of pipeline stages reduces the clock cycle time, but at the same time the number of stall and flush cycles

(wasted cycles) is increased. The number of wasted cycles can be reduced by careful scheduling of instructions but cannot be fully eliminated. This is a consequence of the fact that software is sequential in nature and operations tend to depend strongly on the results of previously executed operations. The penalty of branching also becomes significant with very deep pipelining. During the evaluation of the branch condition and branch target address, several instructions may enter the pipeline. If the branch is taken, those instructions have to be flushed. There are several ways to alleviate this problem. Delayed branching is quite efficient because a good compiler almost always finds instructions to be placed in the delay slot(s) after the branch instruction. Delayed branching together with efficient hardware for address calculation and condition evaluation is enough in designs with fewer than ten pipeline stages. With deep pipelines, pre-fetching and speculative execution are most often used. The latter requires more than one execution pipeline.

Fixed instruction length. This requirement aims at simplicity of decoding an instruction. Also, the previous requirement of issuing one instruction per cycle cannot be achieved if multiple memory accesses are needed in the fetch stage of the pipeline. A RISC instruction word usually contains all the information needed to execute the instruction. The width of the instruction is most often 32 or 64 bits.

Only load and store instructions access memory. This requirement aims at utilizing the pipeline in an optimal way as well as minimizing memory traffic. Modern RISC processors usually exploit a pre-fetch mechanism: the address of the needed data is passed to cache memory well before that data is actually needed. A compiler can schedule pre-fetch commands in order to minimize cache misses, which cause stalls.
RISC processors usually have many general purpose registers (typically from 16 to 64), which make it possible to handle most of the processing inside the core and use load and store instructions only to move finished results to memory and new data in. This is a simplified view, though; the actual amount of memory traffic depends heavily on the application (and compiler).

Simplified addressing modes. Compilers hardly ever use the complex address arithmetic supported by CISC hardware. Complex address calculations in hardware only lengthen the clock cycle, so there is little reason to include them. If very complex address arithmetic is needed, it can be synthesized using a few simple RISC instructions.

Fewer, simpler instructions. Simpler operations imply shorter clock cycles. Simple instructions also fit better into RISC-like pipelines. Many simple RISC designs perform most of their operations in one execution stage, that is, once the data has been fetched from the register file, it takes one clock cycle to evaluate the result. Demanding instructions, such as multiplication, use several cycles in the execution stage and in effect stall the rest of the pipeline. This approach is adopted by, for example, ARM (Advanced RISC Machines) [3]. In the COFFEE RISC core, multiplication is also pipelined in order to increase the throughput.

3. Defining COFFEE RISC ISA

One cannot prove that a certain set of instructions is better than another. We can easily measure or compute the number of instructions executed per time unit, but we can only compare cores which execute the same instruction set. Sometimes even this is not possible, because deeply pipelined architectures usually expect the compiler to schedule instructions in an optimal way. Measurements depend on the compiler and application. To alleviate the pain of performance measurements, several benchmarks have been developed. These benchmarks are usually a set of programs which are executed on the target processor while the execution time is measured.
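To make the link between stall cycles and measured execution time concrete, the following Python sketch compares a multiplier that blocks the execution stage with a pipelined one. It is only a back-of-the-envelope model: the function names, the instruction mix, the multiplier latency and the clock frequency are all illustrative assumptions, not figures from any real core, and pipeline fill time is ignored.

```python
def execution_time_s(instructions, stall_cycles, clock_hz):
    """Run time of a single-issue pipeline: one cycle per instruction plus stalls."""
    return (instructions + stall_cycles) / clock_hz


def multiply_stalls(n_multiplies, mul_latency, pipelined):
    """Extra cycles caused by multiplications.

    A multi-cycle multiplier that blocks the execution stage costs
    (mul_latency - 1) extra cycles per multiplication. A pipelined
    multiplier is modelled as costing none, which assumes no later
    instruction needs the product before it is ready.
    """
    return 0 if pipelined else n_multiplies * (mul_latency - 1)


# Illustrative numbers only (assumptions, not measurements).
N, MULS, MUL_LATENCY, CLOCK_HZ = 1_000_000, 100_000, 3, 200_000_000

blocking = execution_time_s(N, multiply_stalls(MULS, MUL_LATENCY, False), CLOCK_HZ)
pipelined = execution_time_s(N, multiply_stalls(MULS, MUL_LATENCY, True), CLOCK_HZ)
print(f"blocking multiplier:  {blocking * 1e3:.2f} ms")
print(f"pipelined multiplier: {pipelined * 1e3:.2f} ms")
```

In a real measurement the stall count also depends on how well the compiler schedules dependent instructions, which is one reason such figures describe the compiler and the core together.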
Even though it might be justified to compare benchmark results between different processors, it is clear that they are a measure of a system composed of the compiler, the processor core and the memory architecture.

The instruction set of COFFEE RISC was designed based on the instruction sets of RISC processors currently on the market. Instructions which were available in most processors were included and rare ones were excluded. Instructions which enable coprocessor support were also added as a way to extend the instruction set if needed. This might not seem a very analytical approach, but it was a good starting point for development. The penalty of implementing a particular instruction is not known before modeling the execution of that instruction with hardware timing. This makes it extremely difficult to decide whether an instruction should be included or not prior to the implementation phase of the design. Some instructions present in some RISC architectures were easy to exclude, for example division. Division is not a deterministic process, that is, its execution time cannot be predicted. This implies iterative execution, which in practice means stalling the pipeline until the result is ready. If needed, division should be done in software, and it is best avoided in time critical algorithms.

The basic implementation of COFFEE RISC has 66 instructions. A special instruction was included for future extensions: swm (switch mode). It is used to switch to different instruction decoding hardware. It can be used to implement application specific instruction sets and to develop 'better' instruction sets in the future without giving up compatibility with old software. The execution pipeline of COFFEE RISC (explained in section 5) together with the swm instruction makes it possible to integrate, for example, a MAC (Multiply and Accumulate)

instruction in the pipeline without deteriorating performance. Currently the swm instruction is used to switch to a compressed instruction mode where each half of the 32-bit instruction word is interpreted as an individual instruction.

The data processing instructions operate on two register operands or, alternatively, one register operand and one immediate operand. Instructions which produce data can write their result to any general purpose register. Three register indexes can be specified in one instruction word. There are fourteen arithmetic instructions, ten bit field manipulation instructions (bytes, halfwords, arbitrary bit fields), six boolean operations, eight conditional branches, four other jumps (linking jumps, absolute jumps etc.) and six shift instructions. Most of the instructions can be executed conditionally, making it more efficient to implement short conditional statements of a high level language (there is no need to jump over code if the condition is false). Conditional branching is implemented using two instructions: compare and branch. Compare instructions of COFFEE RISC produce condition flags which can be saved to one of eight condition flag registers for later use. Branch instructions evaluate the branch condition based on those flags. A delay slot of one instruction is present after any jump or branch.

4. Overview of COFFEE RISC Features

COFFEE RISC is a so-called load-store machine: memory operands have to be loaded into registers before performing any operation on them. Similarly, the result of an operation is written to a register, from where it can be written to memory using a special data transfer instruction. As in most RISC architectures, a large number of registers is provided to reduce memory traffic. A register bank with two register sets is provided. Each register set contains 32 registers. Both sets are available in privileged mode of operation, but only one set is accessible in user mode. Different operating modes are provided in order to support operating systems. A memory mapped register bank, the CCB (Core Configuration Block), is provided to further support operating systems and different configurations. It contains, for example, registers defining protected memory areas. The CCB can be remapped anywhere in the address space.

COFFEE RISC is a 32-bit architecture, that is, data is manipulated in 32-bit words. The memory interface is of Harvard type, having separate interfaces for data and instruction memory. Figure 1 shows an example of interfacing COFFEE. The memory interfaces of the COFFEE core do not restrict memories to be of any particular type as long as they conform to the interface timing. Multi-cycle access is supported, which makes it possible to interface large and slow main memories directly. The number of cycles per access can be configured by software separately for both interfaces. In simple systems with only one system bus and no cache memory, sharing the data bus might be considered. This is supported directly by the COFFEE core.

COFFEE RISC supports connecting up to four coprocessors. The coprocessor interface is much like a memory interface. Coprocessor addressing is limited to 7 bits, comprising a two-bit field for the coprocessor ID (identification) and a 5-bit field for the coprocessor register index. COFFEE has dedicated instructions to move data and instructions to/from a coprocessor. In addition, a coprocessor can interrupt the COFFEE core by asserting an exception signal included in the interface. An important feature of the coprocessor interface is its ability to connect to a different clock domain. This is achieved by synchronizing exception signals on the core side as well and by allowing the data transfer time to be up to sixteen clock cycles long. As with memories, the access time of coprocessors can be configured by software. Synchronizing circuitry on the coprocessor side is needed unless the clock frequency of the COFFEE core is an even multiple of the coprocessor clock frequency.

Figure 1, Interfacing COFFEE. (Block diagram: the COFFEE core connected to INST_CACHE, DATA_CACHE, INT_HANDLER, PCB, BOOT_CNTRL, BUS_CONTROL and COPROCESSOR_0 to COPROCESSOR_3. Signals shown: core_clock/clk, cop_exc(3:0), i_addr(31:0), i_word(31:0), i_cache_miss, ext_handler, ext_interrupt(7:0), offset(7:0), int_done, int_ack, cop_port(40:0), rd, wr, d_cache_miss, data(31:0), d_addr(31:0), pcb_rd, pcb_wr, stall, reset_x_out, rst_x, boot_sel, bus_ack, bus_req.)

The COFFEE core provides an internal interrupt controller which is adequate for many designs, but it can be extended. Connecting up to eight external interrupt sources is supported. If coprocessors are not connected, the four inputs reserved for coprocessor exception signalling can be used as interrupt request lines, making it possible to connect twelve sources. The interrupt controller has synchronization circuitry allowing asynchronous signals to be connected. If an external controller is used,

synchronization is bypassed in order to reduce signalling latency. Priorities between interrupt sources can be set by software via CCB registers. Interrupt sources can be masked individually, and all of them can be disabled or enabled at once using the di and ei instructions. All interrupts are vectored. Interrupt vectors reside in the CCB. The entry address of an interrupt service routine can be the corresponding vector directly or, if an external controller is used, a combination of the vector and an offset given externally.

A block called PCB (Peripheral Control Block), also seen in figure 1, requires some explanation. The interface to this block is provided to make it easy to communicate directly with peripheral devices around the COFFEE core. The memory space reserved for peripherals can be set by software. All accesses to that space assert the PCB_WR and PCB_RD signals, directing the access to the PCB, instead of the WR and RD signals that are used to access the data memory. Control and data registers of peripherals can be placed into one register bank having a single decoding logic, or they can reside inside each peripheral device and just share the bus. Note that the data part of the interface is shared with data memory.

The signals BOOT_SEL and STALL, which can be seen in figure 1, also deserve mention. BOOT_SEL can be used to select the address of the first executed instruction: if BOOT_SEL is high, the COFFEE core reads its boot address from the data bus. The address should be driven on the bus simultaneously with the reset signal. The STALL signal is provided to enable stalling the COFFEE core for whatever reason. In battery powered systems the STALL signal can be used to save power when there is nothing to be processed. Software execution resumes instantly after releasing the STALL signal because the clock of the core is not disabled; only the data in all registers is frozen.

5. Pipeline Structure

The COFFEE RISC core has a single pipeline with six pipeline stages.
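As an aid to picturing a six-stage pipeline that accepts one instruction per cycle, the short Python model below lists which instruction occupies which stage in each cycle. The stage names are the ones used in this section; the function names `ideal_cycles` and `occupancy` are made up for this sketch, and stalls, flushes and delay slots are deliberately left out, so it shows only the ideal case.

```python
STAGES = ["FETCH", "DECODE", "EXE1", "EXE2", "EXE3", "WB"]


def ideal_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles needed to complete n instructions with no stalls or flushes."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def occupancy(n_instructions):
    """One row per cycle; each row gives the instruction index in every stage."""
    rows = []
    for cycle in range(ideal_cycles(n_instructions)):
        row = []
        for stage_index in range(len(STAGES)):
            # Instruction i enters FETCH in cycle i and advances one stage per cycle.
            i = cycle - stage_index
            row.append(i if 0 <= i < n_instructions else None)
        rows.append(row)
    return rows


if __name__ == "__main__":
    print("cycle  " + "  ".join(f"{name:>6}" for name in STAGES))
    for cycle, row in enumerate(occupancy(4)):
        cells = "  ".join(f"{'-' if i is None else 'i' + str(i):>6}" for i in row)
        print(f"{cycle:>5}  {cells}")
```

Under these assumptions a single instruction needs six cycles, while a long sequence approaches one completed instruction per cycle because the fill time is paid only once.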
Figure 2 illustrates the different stages of the COFFEE RISC pipeline. Each block in the figure represents a data transformation or some other operation done during one clock cycle. At the end of each stage, intermediate or final results are clocked to the input registers of the following stage. Execution proceeds from left to right. As can be seen from the figure, it takes six clock cycles for an instruction to go through the pipeline. The datapath is fully pipelined, which means that every clock cycle a new instruction enters the first stage of the pipeline and one instruction completes at the last stage. This gives a throughput of one IPC (Instructions Per Cycle) in ideal conditions without any pipeline stalls. The design uses only one clock. In the following, each stage shown in figure 2 is described briefly.

Figure 2, COFFEE pipeline.

In the first pipeline stage, marked as FETCH in the figure, three operations are performed. A new 32-bit instruction is fetched from the location pointed to by the program counter, PC. In 16-bit mode, if the address is even, a 32-bit double instruction is fetched. The address in PC is checked and an exception is raised in case of a violation. Finally, the program counter is incremented by two or four depending on the mode.

The second pipeline stage, marked as DECODE in the figure, is the most important from the control point of view. This is the point where an instruction is identified and most of the decisions about its behavior in the next stages are made. If the COFFEE core is in 16-bit decoding mode, the 16-bit halfword is extended to an equivalent 32-bit instruction before it is passed to the decode logic. The execution condition, defined by special fields inside the instruction word, is evaluated. Evaluation involves checking pre-evaluated condition flags against the specified condition. If the execution condition is false, the instruction is simply flushed on the next rising edge of the clock.
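The execution condition check can be sketched in a few lines of Python. The Z, N and C flags and the bank of eight condition flag registers are taken from this paper; the condition names (`eq`, `ne`, `neg`, `carry`, `always`) and the function `should_execute` are hypothetical, since the actual condition encoding of COFFEE is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flags:
    z: bool  # zero
    n: bool  # negative
    c: bool  # carry


# Hypothetical condition names; the real COFFEE encoding is not described here.
CONDITIONS = {
    "always": lambda f: True,
    "eq": lambda f: f.z,
    "ne": lambda f: not f.z,
    "neg": lambda f: f.n,
    "carry": lambda f: f.c,
}


def should_execute(condition, flag_registers, creg_index):
    """Return True if the instruction proceeds, False if it is flushed.

    flag_registers models the eight condition flag registers; creg_index
    selects the one named by the instruction.
    """
    return CONDITIONS[condition](flag_registers[creg_index])


# Example: an earlier compare stored "equal" (Z set) into condition register 3.
registers = [Flags(z=False, n=False, c=False)] * 8
registers[3] = Flags(z=True, n=False, c=False)
print(should_execute("eq", registers, 3))   # instruction proceeds
print(should_execute("ne", registers, 3))   # instruction would be flushed
```

Because the flags are evaluated ahead of time by a compare instruction, the check itself is a simple lookup, which fits the description of it being done in parallel with the rest of decoding.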
In parallel with the execution condition check, signals needed during the current and following stages are decoded from the instruction word. Based on signals evaluated in the DECODE stage and signals decoded from previous instructions currently in the pipeline, the control logic checks for data dependencies. COFFEE RISC resolves all data dependencies by forwarding the needed data as soon as it becomes available. If data cannot be forwarded, the FETCH and DECODE stages are stalled until the data is available. Hardware support for resolving dependencies makes programming as well as compiler construction easier. In this simple six-stage pipeline, the forwarding logic has a delay of approximately one third of the clock cycle, so it does not reduce the clock frequency but only improves performance by avoiding unnecessary stalls. As can be

seen from figure 2, data can be forwarded from several points. Other operations in the DECODE stage are extending the immediate operand, calculating the PC relative jump address and evaluating new status flags if needed. All jump instructions and conditional branches (PC relative and absolute) are executed in the DECODE stage, that is, at the end of the stage the target address is clocked into the PC register. Conditional branching is based on pre-evaluated condition flags, as is conditional execution. To prepare for the next stage, register operands, whether forwarded or fetched from the register file, are clocked to the input registers of the EXE1 stage.

EXE1 is the first stage where data is manipulated. Integer addition, shifting, boolean and bit-field manipulating instructions are finished during this stage. All multiplication operations start in this stage, passing intermediate results to the next stage. The address for a data memory access is calculated using the adder of the ALU. At the end of the cycle, condition flags (Z = zero, N = negative, C = carry) are evaluated by compare instructions and some of the arithmetic instructions.

Execution of instructions requiring more than one cycle continues in stage EXE2. During this stage, 16-bit multiplication, producing a 32-bit result, is finished. Condition flags evaluated in the previous stage are written to the selected condition register. Note that the condition flags are available to the DECODE stage before they are written to the condition register bank. This is achieved by forwarding data inside the condition register bank from input to output if the target register is the same as the source register. The data memory address calculated in the previous stage is checked in EXE2. The address is compared against the memory limits set for the user. It is also checked whether the address points to the configuration block, CCB, in which case the memory access is not performed. Address calculation overflow is detected in this stage as well.
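The three address checks of EXE2 can be summarized as a small decision function. The Python sketch below is an assumption-laden illustration rather than the actual control logic: the order of the checks, the parameter names (`user_lo`, `user_hi`, `ccb_base`, `ccb_size`) and the example values are all hypothetical, and the hardware performs the checks in parallel rather than in sequence.

```python
from enum import Enum


class Access(Enum):
    MEMORY = "memory"        # normal data memory access
    CCB = "ccb"              # served by the core itself, no memory access
    VIOLATION = "violation"  # user-mode access outside the allowed range
    OVERFLOW = "overflow"    # address calculation did not fit in 32 bits


def check_data_address(base, offset, user_mode, user_lo, user_hi, ccb_base, ccb_size):
    """Classify a data access the way the EXE2 checks are described.

    The ordering of the checks is an assumption made for this sketch.
    """
    address = base + offset
    if not 0 <= address <= 0xFFFFFFFF:
        return Access.OVERFLOW
    if user_mode and not user_lo <= address <= user_hi:
        return Access.VIOLATION
    if ccb_base <= address < ccb_base + ccb_size:
        return Access.CCB
    return Access.MEMORY


# Hypothetical memory map used only for this example.
LIMITS = dict(user_lo=0x1000, user_hi=0x7FFF, ccb_base=0x10000, ccb_size=0x100)
print(check_data_address(0x2000, 0x10, True, **LIMITS))
print(check_data_address(0x10000, 0x04, False, **LIMITS))
print(check_data_address(0xFFFFFFFF, 0x04, False, **LIMITS))
```

Keeping the result as an explicit category makes it easy to see that only the `MEMORY` case leads to an access in the following EXE3 stage.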
All coprocessor accesses are performed in stage EXE2. If a coprocessor access takes multiple cycles, the pipeline is stalled during the wait cycles. This implies that if a slow coprocessor is used, performance will deteriorate unless a special interface block is used.

The stage marked EXE3 in figure 2 completes execution of 32-bit multiplication instructions. The ld and st instructions also complete their work during this stage by accessing data memory. If multi-cycle access is used, the rest of the pipeline is stalled during the wait cycles, since the instructions coming behind cannot bypass the EXE3 stage. This points out the importance of fast data memory or a data cache and prefetch capability.

The last stage, WB (write back), completes the execution of all instructions which produce data. Data is written to the selected destination register during this stage. The register file has internal forwarding which makes data in this stage visible to the DECODE stage.

6. Implementation of COFFEE RISC Core

The COFFEE core is an RTL (Register Transfer Level) VHDL description, that is, a soft core. It can be ported to any technology with basic library components. The VHDL description is written in a way that minimizes variation between different technology libraries. Arithmetic operations are coded at boolean level, which produces predictable results since a synthesis tool does not try to map operations to fixed hard implementations. The pipeline is balanced based on relative measures of the depth of the logic in each stage. This should ensure equal results between different synthesis tools. Mapping directly to technology without optimization should also produce acceptable results.

The COFFEE RISC core was designed to be a general purpose processing element suitable for most applications in either a SoC (System on Chip) environment or in more conventional embedded systems. One might think that a general purpose machine is too much of a compromise, that is, no good for anything.
While this might be true in some cases, COFFEE RISC is an exception. COFFEE RISC was designed to be a platform which can be tailored to suit the application. In practice this means that COFFEE is not a single fixed design but rather a family of designs. The basic version of COFFEE provides adequate resources and processing power for many applications, but it can be enhanced in various ways. The designer of a system can choose the combination of modules to get the best trade-offs. Usually this means getting just enough performance while minimizing power consumption and silicon area. If none of the ready made modules results in a satisfactory design, custom modifications can be made. Customizing the COFFEE core is straightforward since it was designed to be easily modifiable. In addition to tailoring the core, external modules can be connected to construct a suitable platform for an application. The COFFEE core provides simple interfaces for expansion and communication.

The COFFEE core was designed according to the guidelines for producing reusable IP (Intellectual Property) components [7]. A good IP is more than a good design; there are several things to consider. The importance of documentation cannot be stressed enough. An IP block without proper documentation is totally useless no matter how flexible or configurable it might be. Reusability and configurability were the main postulates for the COFFEE core design. Any design which does not constrain the implementation technology, has comprehensive documentation and is moderately easy to modify is reusable. If we add scalability and extendibility to the list, we have an IP block. In fact, our core is more

than IP. Since it is published as an open source component, it can be referred to as Intellectual Commons (IC), which enables innovation to be incrementally built on top of what we provide. It is the Linux of computation hardware.

Modularity gives the user the freedom to select the optimum modules or blocks for the design from a set of ready made blocks. In addition, a modular structure with well documented component interfaces allows custom blocks to be used. Module-wise synthesis allows each module to be optimized either for speed or area, resulting in an overall optimal design.

Because of its relatively simple interface, COFFEE is easy to instantiate anywhere. It was designed to be able to work as a stand-alone unit without any additional circuitry. It can, however, easily be equipped with cache memories and any number of peripheral devices. Peripherals can be connected via a direct register interface or an AMBA bus. The memory interfaces make no assumptions about the type of memories. The user can map the address space freely because there are no fixed addresses for peripherals or configuration registers. Even the boot address can be defined externally. A series of VCI [6] interface wrappers is provided, which allows easy connectivity to other VCI components.

The basic COFFEE core is the starting point for developing suitable platforms for applications [4]. It provides the common resources needed by every embedded system: a built-in interrupt controller which supports up to twelve sources, a simple memory protection mechanism and two timers. The system designer selects memories and I/O peripherals as needed by the application. Up to four coprocessors can be connected to boost, for example, floating point operations [5] or DSP processing. Preliminary synthesis results imply a clock frequency of over 200 MHz (0.18 µm CMOS) for COFFEE RISC version 1.0.

Software development tools for COFFEE are currently being developed at Tampere University of Technology.
The GNU compiler collection [8] has been ported to COFFEE RISC and is currently being tested. An in-house assembler, linker and instruction set simulator have also been developed.

7. Conclusion

It is quite straightforward to design and implement a general purpose processing core by following RISC design guidelines. Here, we have presented the open source COFFEE RISC core, which can be used in SoC design or in conventional embedded systems. The core forms a good starting point for developing application-specific platforms. If performance is not at a premium, the circuit implementation work could even be automated to a certain extent. Much of the work can be left to a synthesis tool if the implementation is done using hardware description languages. The RTL VHDL description of the COFFEE core makes this possible. Development and research work on more automated processor generators is currently going on. The problem is that if we want to achieve a short time to market, we also have to be able to generate software tools for a new architecture quickly.

8. References

[1] David A. Patterson and John L. Hennessy, Computer Organization & Design, Morgan Kaufmann Publishers Inc, San Francisco.
[2] Vincent P. Heuring and Harry F. Jordan, Computer Systems Design and Architecture, Addison-Wesley, California.
[3] Steve Furber, ARM System-on-Chip Architecture, second edition, Addison-Wesley.
[4] Tapani Ahonen et al, A Brunch from the COFFEE Table - Case Study in NOC Platform Design, in Jari Nurmi, Hannu Tenhunen, Jouni Isoaho, and Axel Jantsch (eds.): Interconnect-Centric Design for Advanced SoC and NoC, Kluwer Academic Publishers.
[5] Claudio Brunelli, Design of a Floating-Point Unit for a RISC Microprocessor, MSc Thesis, Tampere University of Technology.
[6] VSI Alliance, Virtual Component Interface Standard Version 2 (OCB 2 2.0), On-Chip Bus Development Working Group, April
[7] Michael Keating, Reuse Methodology Manual: for system-on-a-chip designs, Kluwer Academic.
[8]
