THE ICORE 520-MHZ SYNTHESIZABLE CPU CORE


THE ICORE 520-MHZ SYNTHESIZABLE CPU CORE

A NEW IMPLEMENTATION OF THE ST20-C2 CPU ARCHITECTURE INVOLVES AN EIGHT-STAGE PIPELINE WITH HARDWARE SUPPORT TO EXECUTE UP TO THREE INSTRUCTIONS PER CYCLE. THE DESIGN OPERATES UP TO 520 MHZ AT 1.8 V, AMONG THE HIGHEST REPORTED SPEEDS FOR A SYNTHESIZED CPU CORE.

Nick Richardson, Lun Bin Huang, Razak Hossain, Julian Lewis, Tommy Zounes, and Naresh Soni, STMicroelectronics

Synthesizable processor cores capable of execution speeds typically achievable only by complex custom solutions are important because of the ever-increasing CPU performance levels demanded by modern embedded applications. Another reason is that product design cycles have often decreased to only a few months, so in many cases using lengthy custom design flows is infeasible and prohibitively expensive. The icore architecture offers a solution to this problem. Although a high-performance version of STMicroelectronics' ST20-C2 embedded CPU architecture was our goal, we also had to show that the design was quickly and easily portable as a soft core across existing and future process technologies. This ruled out the use of extensive custom circuits and led to the adoption of a methodology close to that of a traditional application-specific integrated circuit (ASIC) design flow. Such a flow was necessary because of the project's aggressive performance goals.

The ST20-C2 architecture is a superset of the Inmos transputer first introduced in 1984 (STMicroelectronics acquired Inmos in 1989). The ST20-C2 added an exception mechanism (interrupts and traps), extensive software debugging capability, and improved support for embedded systems. High-volume consumer applications such as set-top boxes currently use chips based on the ST20-C2.

The ST20-C2's basic instruction set specifies a set of simple, RISC-like operations, making it a good candidate for relatively easy high-frequency implementation. However, it extends conventional RISC technology by having a variable-length instruction word to promote code compactness. It also uses instructions that specify complex operations to control hardware-implemented kernel functions, such as task scheduling and interprocess communications.1 Also, rather than being based on a RISC-like load/store architecture with many machine registers, the ST20-C2 employs a three-deep hardware evaluation stack, also known as the register stack, on which the majority of instructions operate. This scheme permits the code density necessary for embedded applications, but you must carefully match it with an efficient local memory or cache system to minimize the frequency of memory references outside the CPU core.

These characteristics of the architecture required special consideration to devise a design capable of maintaining high instruction throughput on a per-cycle basis (measured in instructions per cycle, or IPC) without compromising the CPU's frequency of operation. For a successful project, however, we had to demonstrate optimization in all the following areas, in approximate order of importance: frequency, execution efficiency (measured in terms of IPC), portability, core size, and power consumption.

Optimizing the microarchitecture
To achieve efficient instruction execution without excessive design complexity, previous implementations of the ST20-C2 architecture employed relatively short pipelines to reduce branch delays and operand or result feedback penalties, thus producing acceptable IPC over many applications. In icore's case, however, the aggressive frequency target dictated the use of a longer pipeline to minimize stage-to-stage combinatorial delays. Determining how to add the extra pipeline stages without decreasing instruction execution efficiency was a problem, because there would be little point in increasing raw clock frequency at the expense of IPC. Following an analysis using a C-based performance model, we chose a relatively conventional pipeline structure that had some important variations targeted at optimizing instruction flow for the unique ST20-C2 architecture.

Figure 1. Pipeline structure for icore.

Figure 1 shows the pipeline microarchitecture. It comprises four separate units of two pipeline stages each. The instruction fetch unit (IFU) comprises the IF1 and IF2 pipeline stages and is responsible for fetching instructions from the instruction cache and performing branch predictions. The instruction decode unit (IDU) comprises the ID1 and ID2 pipeline stages and is responsible for decoding instructions, generating operand addresses, renaming operand stack registers, and checking operand or result dependencies. It also maintains the program counter, known in this architecture as the instruction pointer, and a local work space pointer. The operand fetch unit (OFU) comprises the OF1 and OF2 pipeline stages and is responsible for fetching operands from the data cache, detecting load or store dependencies, and aligning and merging data supplied by the cache, buffered stores, and data-forwarding buses. Finally, the execute unit (EXU) comprises the execution (EXE) and write back (WBK) pipeline stages and performs all arithmetic operations other than address generation. It also stages results for writing back into the register file unit (RFU) or memory. The RFU contains the three-register-deep stack, conceptually organized as first-in, first-out (FIFO) and comprising the A, B, and C registers (in top-to-bottom order).
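To make this programming model concrete, here is a minimal C sketch of the architectural push/pop semantics (an illustration of the instruction set's view only, not the RFU hardware; all names are ours):

    #include <stdint.h>

    /* Conceptual ST20-C2 evaluation stack: A on top, then B, then C.
       A push shifts A->B->C (the old C falls off the bottom); a pop
       shifts B->A and C->B. */
    typedef struct { uint32_t a, b, c; } reg_stack;

    static void push(reg_stack *s, uint32_t v) {
        s->c = s->b;            /* old B drops to the bottom */
        s->b = s->a;            /* old A moves down one slot */
        s->a = v;               /* new value lands on top    */
    }

    static uint32_t pop(reg_stack *s) {
        uint32_t v = s->a;      /* top of stack comes off          */
        s->a = s->b;
        s->b = s->c;            /* C is left holding a stale copy  */
        return v;
    }

    /* add: pop the top two entries and push their sum, as most
       ST20-C2 arithmetic instructions implicitly do. */
    static void op_add(reg_stack *s) {
        uint32_t x = pop(s), y = pop(s);
        push(s, x + y);
    }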

In practice, the RFU implements this stack as a high-speed multiport register file. We couple the memory-write interface with a store buffer that temporarily holds data while waiting for access to the data cache. We couple the pipeline's instruction-fetch portion (the IFU) to the pipeline's execution portion (the IDU, OFU, and EXU) via a 12-byte instruction fetch buffer (IFB).

Early in the analysis of ST20-C2 program behavior, it became apparent that we should tune the microarchitecture not only for high frequency but also for very efficient memory access. This tuning would support the best use of the memory bandwidth available in the target environment: typically a highly integrated system on a chip with multiple processors sharing access to off-chip memory. To achieve this level of memory bandwidth use, we included two in-line operand caches in the pipeline, both accessible by a single instruction as it executes in the pipeline.

To understand the operand caches' function, we briefly explain the ST20-C2's two addressing modes. One mode is for addressing local variables, accessible by using an offset relative to an address pointer stored in a register known as the local work space pointer. This pointer holds the start address of a program's procedure stack (also known as its local work space), and so stores a new value on each procedure call and return. The other mode is for addressing nonlocal variables anywhere in the 4-Gbyte memory address space, using a full 32-bit address preloaded into the register stack's top location (the A register).

The first operand cache is for local variables. We call it the local work space cache (LWC), and it accelerates references to the first 16 words in the current local work space, indicated by the local work space pointer. Running several benchmark programs on the C-based statistical performance model helped us determine the LWC's size. We place the LWC early in the pipeline (in the ID1 stage) so that it can supply local variables used for nonlocal address calculations to the address generation unit (AGU). Doing so permits early calculation of the full nonlocal address in the next pipeline stage (ID2).

The second operand cache is an 8-Kbyte inline data cache that accelerates references to nonlocal variables, as well as local variables that miss in the LWC. It uses nonlocal addresses that the AGU calculates in the ID2 stage and supplies operands to the execution unit, and it thus occupies the two pipeline stages before the EXU. For critical-path optimization, its tag RAM and data RAM are in two different pipeline stages. Accessing the tag RAM before the data RAM eliminates the set-associative output multiplexing that would be required if the tag and data RAMs were in the same stage. It also lets cache reads and writes share the same control flow, thus simplifying the design. Together, these two highly integrated operand caches complement the ST20-C2 architecture's A, B, and C registers, effectively acting like a very large register file.
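To make the two addressing modes concrete, here is a hedged C sketch of how the effective addresses are formed (offsets are scaled to 32-bit words; the helper names are ours):

    #include <stdint.h>

    /* Local mode: offset n is relative to the local work space pointer,
       the base of the current procedure's frame. References with n < 16
       are the ones the LWC is built to accelerate. */
    static uint32_t local_addr(uint32_t wptr, uint32_t n) {
        return wptr + n * 4u;
    }

    /* Nonlocal mode: offset m is relative to a full 32-bit pointer that
       an earlier instruction preloaded into the A register. */
    static uint32_t nonlocal_addr(uint32_t a_reg, uint32_t m) {
        return a_reg + m * 4u;
    }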
Instruction folding
Increasing the instruction folding opportunities is an important effect of the inline operand caches. Folding is an instruction decoding technique that combines two or more machine instructions into a single pipeline operation. Because the ST20-C2 has a stack-based architecture, most of its instructions specify very simple operations. Such operations include loading and storing operands to and from the top of the three-deep register stack or performing arithmetic operations on operands already in the register stack. For simplicity, however, we never combine memory and arithmetic logic unit (ALU) operations in the same instruction. Without compound instructions that allow memory and other arithmetic operations to occur as single-slot pipeline operations, it would be impossible to take advantage of icore's inline memory structure. With folding, however, you can merge up to three successive instructions into one operation that occupies a single execution slot and thus fully use icore's pipeline resources. As an example, take the following very common three-instruction sequence:

    ldl <n>
    ldnl <m>
    add

The first instruction, load local, loads the variable at offset n from the current

value of the local work space pointer and pushes it onto the top of the register stack. The second instruction, load nonlocal, loads the variable at offset m from the value on top of the register stack (in this case, the value that the ldl instruction just placed there). Finally, the add instruction adds the top two values on the register stack (A and B) and pushes the result back onto the stack. Without folding, this instruction sequence would occupy three distinct pipeline slots and could stall because of a data dependency between the ldl and the ldnl instructions. However, with folding and icore's inline memory structure, these instructions can execute in a single pipeline slot, even though they require two memory reads. Table 1 describes the stage-by-stage execution.

Table 1. Example of compound instruction execution.

Stage   Execution description
IF1     Instruction fetch (instruction cache tag access).
IF2     Instruction fetch (instruction cache data RAM access and loading of the instruction fetch buffer).
ID1     Execute the ldl by reading the local variable from the LWC at offset n; issue ldl, ldnl, and add from the instruction fetch buffer as one operation if the ldl hits in the LWC.
ID2     Using the AGU, generate the address the ldnl instruction will use by adding the local operand that the ldl obtained in ID1 to offset m.
OF1     Fetch the data that ldnl requests from the data cache (tag RAM).
OF2     Fetch the data that ldnl requests from the data cache (data RAM). Fetch the second data operand from the register file unit (pop the stack).
EXE     Add the data that the ldnl obtained in OF2 to the second operand from the register stack.
WBK     Push the result onto the register stack.

Different operations work in different pipeline stages (ldl in ID1; ldnl in ID2, OF1, and OF2; and add in EXE and WBK) to separately execute the three folded instructions. But because each operation occupies only one pipeline stage at a time, the three instructions effectively execute in a single clock cycle. The icore supports many such folding sequences, but they all follow this same principle.
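For illustration only, here is a C sketch of the pattern match that such folding implies; the real logic is hardware in the decode stages, folding requires the ldl to hit in the LWC, and the opcode names and buffer layout here are ours:

    #include <stddef.h>

    typedef enum { OP_LDL, OP_LDNL, OP_ADD, OP_OTHER } opcode;
    typedef struct { opcode op; int operand; } insn;

    /* Try to fuse instructions at the head of the instruction fetch
       buffer into one pipeline operation, and return how many were
       taken. The trio fits one slot because ldl executes in ID1, ldnl
       in ID2/OF1/OF2, and add in EXE/WBK. */
    static size_t fold(const insn *buf, size_t n, int ldl_hits_lwc) {
        if (n >= 3 && ldl_hits_lwc &&
            buf[0].op == OP_LDL && buf[1].op == OP_LDNL &&
            buf[2].op == OP_ADD)
            return 3;                 /* ldl + ldnl + add, one slot */
        if (n >= 2 && ldl_hits_lwc &&
            buf[0].op == OP_LDL && buf[1].op == OP_LDNL)
            return 2;                 /* two loads folded into one  */
        return n ? 1 : 0;             /* otherwise issue singly     */
    }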
Data forwarding and register scoreboarding
In a deeply pipelined design, mitigating the effects of delays caused by pipeline latencies is important. Data dependency between successive instructions is one significant source of latency-based performance degradation. If the dependency is a true one (such as when an arithmetic operation uses the result of the instruction just before it), then it is impossible to eliminate the dependency, but microarchitecture techniques can reduce the pipeline stall this situation causes. These techniques include providing data forwarding paths between potentially interdependent pipeline stages. In icore, this strategy results in data bypass paths from WBK to OF2, EXE to OF2, WBK to ID2, and EXE to ID2, as Figure 1 shows.

The paths from EXE and WBK to OF2 allow an instruction either immediately after or up to two cycles behind an arithmetic operation to use the arithmetic operation's result without penalty. If more than two clocks separate them, then the dependent operation can read the result from the register stack without penalty. Similarly, the path from EXE to ID2 lets the AGU use an operand that one instruction fetches from the data cache as an operand address for another instruction that follows the fetching instruction. This situation commonly occurs when either an ldnl instruction or an ldl instruction that misses in the LWC loads an address from memory that a later ldnl instruction uses. Such a situation occurs when following a pointer chain, for example. Using the forwarding path under these circumstances results in a worst-case pipeline stall of two clock cycles. If we provided no forwarding paths, this penalty would be four clock cycles, to allow time for the operand address to cycle through the register stack.

The path from WBK to ID2 operates similarly. However, we use it when the calculated result of one instruction, rather than a result

just fetched from the data cache, serves as an AGU input for an instruction that follows. This path reduces the penalty for this type of addressing to a maximum of three clocks.

Activating the forwarding paths at the appropriate times requires register dependency-checking logic (also known as register scoreboarding). In icore, this logic resides in the ID2 stage, and its operation is quite straightforward: It compares the name of a register required as a source operand by one instruction with the names of all registers that instructions further ahead in the pipeline are about to modify. When this logic detects a match, the pipeline stalls the newer instruction in ID2 if the older instruction cannot supply its result in time. Such a situation typically occurs when an instruction uses an ALU result as an operand address or when a later instruction uses the result of an earlier multicycle ALU operation. After inserting any necessary stall cycles, the scoreboarding logic releases the instruction from ID2 and activates the appropriate forwarding path to provide the source operands for the instruction at the earliest possible time.

For the ST20-C2 architecture, register dependency checking must handle operands that constantly change position within the register stack's A, B, and C registers, because most instructions perform implicit or explicit pushes or pops of the stack. Thus, almost every instruction appears to have data dependencies on every older instruction in the pipeline, because nearly all of them cause the loading of new values into A, B, and C. This loading occurs even though most of those dependencies are false, in the sense that they arise from the movement of data from one register to another rather than from the creation of new results. Considerable performance degradation would result if the pipeline stalled newer instructions based on those false dependencies.

The icore solves this problem by using a simple renaming mechanism that maps the ST20-C2 architecture's conceptual A, B, and C registers onto fixed hardware registers, R0, R1, and R2. The key is that the same hardware register remains allocated to a particular instruction's result throughout its lifetime in the pipeline. This occurs despite the fact that subsequent instructions push and pop the architectural register stack and apparently cause the result to move among A, B, and C. A mapping table for each pipeline stage indicates which architectural register (A, B, or C) maps onto which fixed hardware register (R0, R1, or R2) for the instruction in that pipeline stage. Operations that only move operands between registers do so by simply renaming the R0, R1, and R2 registers to the new A, B, or C mapping, rather than actually moving data among them. The icore performs this mapping in ID2 and bases it on how the current instruction in ID2 affects the existing mapping up to, but not including, that instruction. Once it finishes the ID2 mapping, icore performs dependency checking based on R0, R1, and R2 source and destination values. Results never physically move among these registers, eliminating false dependencies.
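A small C sketch may help illustrate the renaming idea (our naming; the real mapping tables are per-stage hardware). Pushes and pops permute a three-entry map from the architectural A, B, and C names to the fixed registers R0 through R2, so results never move:

    /* map[0] holds the hardware register number (0, 1, or 2) currently
       acting as A, map[1] as B, map[2] as C. */
    typedef struct { int map[3]; } rename_map;

    /* Push: the hardware register backing the old C is recycled to hold
       the new A; old A becomes B and old B becomes C, by renaming only. */
    static rename_map push_map(rename_map m) {
        rename_map r = { { m.map[2], m.map[0], m.map[1] } };
        return r;
    }

    /* Pop: B's hardware register becomes A, and C's becomes B. */
    static rename_map pop_map(rename_map m) {
        rename_map r = { { m.map[1], m.map[2], m.map[0] } };
        return r;
    }

    /* Dependency checks then compare these hardware register numbers,
       not A, B, and C, so pure stack motion creates no dependency. */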
Branch prediction
The icore's relatively long pipeline (compared with those of previous ST20-C2 implementations) increases the danger of branches causing significant performance degradation. After performance simulations showed that the longer pipeline's effect on some benchmarks was significant, we incorporated a low-cost branch prediction scheme into icore's microarchitecture.

In a machine without branch prediction, the source of most branch penalties is twofold. First, the processor must determine the target address of a taken branch or other control-transfer instruction. It cannot do so until after decoding the branch, which requires identifying the branch and then (usually) performing an arithmetic operation, such as adding an offset to the current instruction pointer value to obtain the address of the next valid instruction in the program flow. By the time the instruction fetch unit can apply the next valid instruction's target address to the instruction cache to start fetching instructions from the new destination, it has fetched several instructions from the wrong address. The pipeline must cancel these instructions, creating a penalty of several clock cycles. Second, conditional branches, those that depend on the results of previous instructions to determine their direction, usually must progress all the way to the pipeline's end before the resolution of the condition on which their action depends. Only then can instruction fetches from a new target address proceed, causing an even greater branch penalty than that incurred by the target address calculation alone.

Figure 2. Branch prediction logic.

The icore's branch prediction mechanism reduces the penalties in both these cases. It uses a two-bit predictor scheme to predict branch and subroutine call behavior (as either taken or not taken), using a branch target buffer to predict target addresses for taken branch and call instructions. It also uses a four-entry return stack to predict the special case of ret instruction return addresses.

Figure 2 shows icore's prediction scheme, which employs a 1,024-entry branch history table (BHT). Each entry contains a two-bit predictor that implements a branch hysteresis algorithm2 characterizing branch instructions as either weakly or strongly taken or not taken. The BHT keeps the most current branching history for a set of instruction cache words. Predicted fetch addresses (such as the target addresses of predicted-taken branches) go to a 64-entry branch target buffer (BTB).

Whenever the instruction fetch unit fetches a new 4-byte word from the instruction cache, the instruction address that indexes the cache also indexes the BHT and BTB. The branch prediction logic uses the information retrieved from the BHT to predict whether a taken branch is at the current instruction address. If the prediction logic predicts a taken branch, the instruction fetch unit loads the address retrieved from the BTB into the instruction fetch pointer and uses it to fetch the next instruction word. Otherwise, it uses the instruction fetch pointer's sequential value. All this processing takes place in the very first pipeline stage, IF1, permitting immediate application of new target addresses to the instruction cache after the prediction of a taken branch. This effectively lets predicted-taken branches immediately change the instruction address applied to the instruction cache without any delay cycles, making the performance impact of successfully predicted branches negligible. We feed the predicted direction and target forward through IF2 and load them into the IFB (in the case of the prediction bits) and into a four-entry target buffer (in the case of the predicted target address). Later pipeline stages use these values to validate the predictions and update the instruction pointer.
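As a sketch of the prediction state machine and lookup, the two-bit hysteresis states and the one-entry-per-word, low-order-bit indexing below follow the text; the C rendering itself is our illustration:

    /* Two-bit predictor states: 0 strongly not taken, 1 weakly not
       taken, 2 weakly taken, 3 strongly taken. */
    typedef unsigned char bht_entry;

    static int predict_taken(bht_entry e) { return e >= 2; }

    /* Saturating update with the branch's resolved direction. */
    static bht_entry bht_update(bht_entry e, int taken) {
        if (taken) return (bht_entry)(e < 3 ? e + 1 : 3);
        return (bht_entry)(e > 0 ? e - 1 : 0);
    }

    /* 1,024 entries, one per 4-byte instruction word, indexed by the
       low-order fetch-address bits only; distinct branches can alias. */
    static unsigned bht_index(unsigned fetch_addr) {
        return (fetch_addr >> 2) & 0x3ffu;
    }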

Mispredictions can occur. One cost-saving measure, using only the low-order bits of the instruction fetch address to index BHT and BTB entries and ignoring the upper address bits, causes one type of misprediction. Providing only one BHT or BTB entry per 4-byte cache word, even though a word can hold more than one branch, is another cost-saving measure that causes mispredictions. These factors can cause many fetched instructions to share a single BHT or BTB entry, even though the entry is likely correct only for the one instruction that the processor originally used to update it. As a result, the instructions at other fetch addresses that share the entry might be incorrectly predicted, have incorrect target addresses, or both. Other misprediction sources include mispredicted directions of conditional branches and incorrect return addresses.

Later pipeline stages detect and resolve mispredictions. The instruction decoder in the ID1 stage can detect many mispredictions caused by BHT aliasing. For instance, if the prediction logic predicts a taken condition for an instruction that is not a branch, or predicts an untaken condition for an unconditional branch, the prediction is clearly in error. A repair operation can start immediately and thus incurs only a small penalty. To check predicted targets, an adder in the OF1 stage calculates a taken branch's correct target address and compares it with the predicted one (which ID2 feeds forward from the BTB). If a missing or aliased BTB entry causes a mismatch, then recovery can start from OF1. Finally, a conditional branch must wait until it reaches the EXE stage before it is possible to check the condition that determines the branch's action, and thus the branch has the longest repair time when incorrectly predicted.

The steps to repair a branch depend on the type of misprediction. When BHT aliasing causes an incorrect branch instruction to be taken, the pipeline flushes all instructions fetched after that instruction. The pipeline restarts instruction fetching from the source address of that instruction to ensure that it is correctly fetched, while updating the BHT to indicate not taken. Conversely, when the prediction logic incorrectly predicts a taken branch's target address, the pipeline flushes the incorrect target's instructions. It then restarts instruction fetching from the correct target address computed in the OF1 stage, while the IF1 stage updates the BTB with the new target address. Finally, repairs to correct a mispredicted conditional branch flush all subsequent instructions from the pipeline, which begins fetching again from either the branch's target address or the next sequential address, depending on whether the branch is taken or not. The repairs update the BHT and BTB appropriately.

Conventional branch prediction is ineffective for branches caused by subroutine returns, because the return's target address changes constantly, depending on the program location of the subroutine call. Thus, icore uses a four-entry return stack to predict the ret (return) instruction's target. The return stack saves the return addresses for the most recent call instructions. When the system encounters a ret instruction in the ID2 stage, it pops the return stack and uses the value as the predicted branch address for the ret instruction, incurring only a small penalty. The system validates the ret predictions in the EXE stage because, in the ST20-C2 architecture, the actual return addresses come from a memory location in the local work space and typically are available only after accessing the data cache.
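A minimal C sketch of such a return stack follows (our illustration; the wrap-around behavior on overflow is an assumption, and any resulting misprediction is caught by the EXE-stage check):

    #include <stdint.h>

    /* Four-entry return stack: a call pushes its return address; a ret
       pops the top entry as the predicted target. */
    typedef struct { uint32_t addr[4]; unsigned top; } ret_stack;

    static void on_call(ret_stack *s, uint32_t return_addr) {
        s->top = (s->top + 1) & 3;     /* overflow silently overwrites */
        s->addr[s->top] = return_addr;
    }

    static uint32_t on_ret(ret_stack *s) {
        uint32_t predicted = s->addr[s->top];
        s->top = (s->top + 3) & 3;     /* pop                          */
        return predicted;              /* validated later in EXE       */
    }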
Cache subsystem
The cache controller's design focuses on the need for simplicity (thereby enabling high frequency) and tight coupling to the CPU pipeline. These requirements resulted in a two-stage pipelined approach: The first stage handles the tag access and tag comparison, and the second stage handles the data access and data alignment. This arrangement balances delays, saving power and delay time by eliminating the need for physically separate data RAMs for each associative-cache bank (or way). This is possible because the system can predetermine the data RAM way number in the tag cycle and then simply encode it into extra address bits in the data RAM. To further reduce delays, we built integrated pipeline input registers into the tag and data RAMs to eliminate input wire delays.

Figure 3 shows a simplified block diagram of the IFU and OFU cache controllers. One controller occupies the IF1 and IF2 stages in the IFU; the other occupies the OF1 and OF2 stages in the OFU.

Figure 3. Cache controllers.

We use the same cache controller design in each case, so its main pipeline stages correspond to IF1/IF2 in one case and OF1/OF2 in the other. Before the IF1 or OF1 phase, the cache controller receives the request (read or write), address, and data from the CPU and loads them into its input pipeline registers. In the IF1 or OF1 phase, the cache controller of the instruction or data cache accesses the tag RAMs and performs the tag comparison. The first version of icore has 8-Kbyte caches and a 16-byte line size, so it uses 512 tag entries. At the end of IF1 or OF1, the result of the tag comparison (hit or miss) and the IF1 or OF1 address and data fields from the original CPU request go to the input registers of the IF2 or OF2 stage. In IF2 or OF2, the system accesses the data RAMs and returns the data to the CPU's IFB, in the case of the instruction cache, or to the data aligner on the inputs to the EXE stage, in the case of the data cache.

In addition to returning data on a cache hit, the cache controller in the IF2 or OF2 stage also initiates a cache-miss sequence if the requested data is not in the cache. It does this by asserting a hold signal, which freezes the CPU pipeline, and signaling the main-memory controller to fetch the missing line.
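In software form, the tag stage might be sketched as follows. The 8-Kbyte size, 16-byte lines, and 512 tag entries come from the text; the two-way organization, field widths, and names are our assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 2
    #define SETS 256                    /* 512 tag entries / 2 ways */

    struct tag_entry { uint32_t tag; bool valid; };
    static struct tag_entry tags[SETS][WAYS];

    /* Stage 1 (IF1/OF1): tag access and comparison. On a hit, the way
       number is encoded into extra data RAM address bits, so stage 2
       (IF2/OF2) needs neither per-way data RAMs nor an output mux. */
    static bool tag_stage(uint32_t addr, uint32_t *data_ram_row) {
        uint32_t set = (addr >> 4) & (SETS - 1);   /* 16-byte lines  */
        uint32_t tag = addr >> 12;                 /* high bits      */
        for (int way = 0; way < WAYS; way++) {
            if (tags[set][way].valid && tags[set][way].tag == tag) {
                *data_ram_row = set * WAYS + way;  /* way -> address */
                return true;                       /* hit            */
            }
        }
        return false;   /* miss: stage 2 asserts hold, starts a fill */
    }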

Once main memory returns the missing line, the memory controller loads the data into the data RAM and the upper address bits into the tag RAM, and the IF2 or OF2 stage deasserts the hold. The CPU must then restore the original state of the cache controller's internal pipeline registers by repeating the read or write requests that were in progress at the time of the cache miss. Once the CPU restores state, normal operation resumes. Having the CPU restore state simplifies the cache design, reducing the complexity of its critical paths.

The cache controller implements a write-back policy: Writes that hit the cache write only to the cache, and not to main memory (as opposed to a write-through policy, which would write to both). Thus the cache can contain a more up-to-date version of a line than main memory, necessitating that the cache write the line back to memory when a new line from a different address replaces an old one. To facilitate this, the caches keep a dirty bit with each line to indicate when the line has been modified and hence requires writing back to main memory on its replacement. The dirty bits reside in a special field in the data RAM, and there is only one per four-word location (because each word is 4 bytes and a line is 16 bytes). The cache controller sets the dirty bit for a given line when a write to that line hits the cache.

Write buffering and store/load bypass
Writes to the data cache presented a potential performance problem, because the WBK stage at the end of the CPU pipeline generates write requests. So a write in any given clock cycle could clash with newer read requests generated by the AGU at the end of the ID2 stage. Because the cache is single-ported, this clash would cause the CPU pipeline to stall, even though on average the cache controller is capable of handling the full read/write bandwidth. Performance simulations showed that we could avoid most collisions by adding a special write-buffering stage, SBF, to store blocked memory write requests from the WBK stage. With the SBF, icore can generate write requests either from the WBK stage, if the SBF stage is empty and the data cache is not busy, or from the SBF stage, if it is not empty and the data cache is not busy. This capability gives write requests more opportunities to find unused data cache request slots. Additional write buffers would further increase the opportunities for collision avoidance, but they provided much smaller returns than the first one, so we did not implement them in icore's demonstration version.

Finally, because of the high memory utilization of ST20-C2 programs, it was beneficial to add a store/load bypass mechanism. This mechanism permits icore to supply data from queued memory writes in the EXE, WBK, and SBF stages directly to read requests from the same address in OF2, replacing stale data obtained from the data cache. This approach avoids having to stall a memory read when it needs data from a data cache location with an outstanding write.
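A simplified C sketch of the bypass check follows (our illustration; word-granular matching and newest-first priority are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    /* One queued, not-yet-performed data cache write. */
    typedef struct { uint32_t addr, data; bool valid; } pending_store;

    /* Scan the queued writes in EXE, WBK, and SBF, newest first, and
       forward matching data to an OF2 read so the load need not wait
       for the write to reach the data cache. */
    static bool store_load_bypass(const pending_store q[3],
                                  uint32_t load_addr, uint32_t *data) {
        for (int i = 0; i < 3; i++) {        /* q[0] is newest (EXE) */
            if (q[i].valid && q[i].addr == (load_addr & ~3u)) {
                *data = q[i].data;       /* replaces stale cache data */
                return true;
            }
        }
        return false;                    /* no match: use cache data */
    }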
Effects of microarchitecture features on IPC
We first studied the effect of microarchitectural enhancements on IPC using a C-based statistical performance model. Such a model simulates the execution of various benchmark traces produced by an ST20-C2 instruction set simulator. These benchmarks measure the effects of basic instruction folding; the LWC and its alternative designs; enhanced instruction folding, enabled by using the LWC to fold two load instructions into one; and branch prediction schemes and designs. We fine-tuned some features of the architecture during register-transfer-level implementation, as more accurate performance and area tradeoff analyses became possible. One of the more noticeable implementation-level enhancements was modifying the branch predictor state transition algorithm to handle the case where multiple predicted branch instructions reside in the same cache word. Other enhancements include a dynamic LWC disable feature, which minimizes the time the LWC is unavailable because of coherency updates, and a rotating LWC pointer feature, which maximizes the reuse of valid operands in the LWC as the work space pointer increments and decrements. Table 2 summarizes the effects of these microarchitecture features on IPC.

Table 2. IPC enhancements of microarchitecture optimizations as measured on a Dhrystone 2.1 benchmark.

Feature                                            Increase in IPC (%)   Overall IPC improvement (%)
Basic instruction folding
LWC
Enhanced instruction folding                       7                     27
LWC enhancements                                   8                     35
Branch prediction                                  6                     41
Enhanced branch prediction transition algorithm    1                     42
Return stack                                       3                     45

Optimizing the implementation
We used two methods to optimize icore's implementation: a synthesis process and a higher-performance standard-cell library.

Synthesis strategy
We divided the synthesis process into two steps. The first step employed a bottom-up approach using Synopsys Design Compiler, in which we synthesized all top-level blocks using estimated constraints. We generally defined blocks as logic components with a single pipeline stage. The execute unit included a large multiplier, for which we used Synopsys Module Compiler because we found it to provide better results than Design Compiler. It took several incremental compilations of the design, merging blocks into larger and larger modules and finally into the full processor, to tune performance across the different design boundaries. The OF2 and IF2 stages were the only regions where synthesis produced unacceptable path delays. In both cases, the problems involved the use of complex multiplexer functions, so we designed custom multiplexer gates for these cases and manually inserted the logic with a dont_touch attribute to achieve the desired results. We resolved other complex multiplexer problems by rewriting the hardware description language to guide the synthesis tool toward the preferred implementation.

The synthesis strategy's second step employed a top-down approach, with cell placement using Synopsys Physical Compiler, which optimizes gates and drivers based on actual cell placement. Physical Compiler proved effective in eliminating certain design iterations. In an ASIC flow, iterations can arise from the discrepancy between the gate loading assumed by the wire load model and that produced by the final routed netlist. The delays based on the placement came within 10 percent of those obtained after final place and route. You can run Physical Compiler with either an RTL-to-placed-gates or a gates-to-placed-gates flow; the latter provided better results for this design.

Standard-cell library
We developed a higher-performance standard-cell library, H12, to improve icore's circuit speed over that provided by a generic standard-cell library. Using the H12 library provided a speed improvement of more than 20 percent in simulation and on silicon. H12 used the same conventional CMOS static designs as the generic library, but incorporated the following differences: a reduced ratio of PMOS to NMOS transistor widths (P/N) in the standard cells; larger logic gates (instead of the buffered-level gates in this particular cell library); more efficient layout; and increased cell height.

We determined the P/N ratio for each cell by measuring the fastest speed obtained when the gate was driving a light load. We observed that as the gate load increased, we needed to increase the P/N ratio to obtain the optimal speed. However, the physical-synthesis tool minimizes the fanout and line loading on the most

critical paths, so we optimized the H12 library for the typical output loading of those paths. We also reduced delays by creating larger, single-level cells for high-drive gates. The original library added an extra buffering level to its high-drive logic gates to keep their input capacitance low and their layout smaller, but these optimizations also increased their delay. Eliminating that buffer by increasing the cell size also increased the input capacitance but significantly reduced the net delay. The icore used these cells only in the most critical paths, so their size did not have much impact on the overall layout size. The H12 cells generally had larger transistors, requiring taller layouts. However, by using more-efficient layout techniques, we reduced many of the cells' widths compared with those of the generic library. Also, the cells' taller height let the router use extra metal tracks, which in turn increased the cell placement's utilization.

Physical-design strategy
We used Physical Compiler to generate the cell placement, freezing the data cache and instruction cache locations during the placement process and permitting the floorplanning to be data-flow driven. We shielded clock lines to reduce delay uncertainty by using Wroute in Cadence's Silicon Ensemble. To minimize the power drops inside the core, we designed a dense metal-layer-5 and metal-layer-6 (the upper metal layers available in our process) power grid mesh. We generated the balanced clock tree with Cadence's CTGen. The maximum simulated skew across the core was 120 ps under worst-case process and environmental conditions; typical clock skew would be much less. Using Synopsys Formality, we formally compared the final routed design to the synthesized netlist, and we used Synopsys PrimeTime to calculate delays on the netlist extracted with Synopsys Arcadia.

Results
We fabricated the icore processor in STMicroelectronics' 0.18-micron HCMOS8D process. Figure 4 shows a GDSII-based plot of the test chip. Excluding the memories, the entire chip has the distinctive random layout of synthesized logic. The large memories on the right and on the top and bottom of the plot are the data and instruction RAMs. The other small memories in the plot are the LWC, BHT, and BTB.

Figure 4. The icore test chip's GDSII-based plot.

Timing simulation indicated that we achieved a balance of delays between the pipeline stages, as Figure 5 shows. The slowest individual pipeline stage limits the overall performance of the processor; any silicon area spent on making the other stages faster than the slowest stage is wasted, because no net improvement in performance results. The optimally efficient design has balanced delays, so that all stages have equal performance. Such a deep, well-balanced pipeline structure is amenable to highly optimal synthesis.3

Silicon testing showed functional performance of the design from 475 MHz at 1.7 V to 612 MHz at 2.2 V, at 25 degrees Celsius (ambient). The design operates up to 520 MHz at 1.8 V, among the highest reported speeds for a synthesized CPU core.4 Analysis of results for the Dhrystone 2.1 benchmark showed that icore achieved an IPC of about 0.7, which met the goal of being the same or greater than that of previous ST20-C2 implementations. This indicates that the various pipeline optimizations were functioning correctly.

References
1. ST20-C2 Instruction Set Reference Manual, STMicroelectronics, 72-TRN, Nov.
2. J.K.F. Lee and A.J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," Computer, vol. 17, no. 1, Jan. 1984, pp. 6-22.

3. D.G. Chinnery and K. Keutzer, "Closing the Gap between ASIC and Custom: An ASIC Perspective," Proc. 38th Design Automation Conf. (DAC 2001), ACM Press, 2001, pp. 637-642.
4. C.D. Snyder, "Synthesizable Core Makeover: Is Lexra's Seven-Stage Core the Speed King?" Microprocessor Report, 2 July 2001.

Figure 5. Static analysis of worst-case pipeline path delays (ns) for the IF1, IF2, ID1, ID2, OF1, OF2, EXE, and WBK stages.

Nick Richardson is an STMicroelectronics Fellow who led the development of the icore microarchitecture. His research interests include the definition and design of microprocessor, cache memory, and network architectures. Richardson has a BSEE from the University of London.

Lun Bin Huang is a senior principal engineer at STMicroelectronics. His contributions to the icore project include the microarchitecture and design of branch prediction, instruction decode and issue, and operand fetch. His research interests include the architecture and design of high-performance CPU cores and network processors. Huang has an MSEE from California State University, Long Beach. He is a member of the IEEE.

Razak Hossain is a senior principal engineer at STMicroelectronics in La Jolla, Calif. His research interests include the design of high-speed circuits and interconnects. Hossain has a PhD in electrical engineering from the University of Rochester. He is a member of the IEEE.

Julian Lewis was the icore project manager for STMicroelectronics. He has worked as architect, methodologist, verifier, and project manager for ST20 cores. He now works on the architectural definition of ST's DVD chips and manages a team implementing audio software for DVD players. Lewis has a BSc in computer science from the University of Manchester.

Tommy Zounes is a senior principal engineer at STMicroelectronics who works in the advanced design group of ST's Central Research and Development Division in La Jolla, developing a Domino-style library for a synthesis flow. His research interests include custom high-speed digital design solutions, including front-to-back design of standard-cell libraries, RAMs, ROMs, and register files. Zounes has a BSEE and an MSEE from San Diego State University. He is a member of the IEEE.

Naresh Soni is vice president of US central research and development for STMicroelectronics. His research interests include high-performance circuits and systems. Soni has a BSEE from the University of Bombay, India, and an MSEE from the University of Texas, Austin. He is a member of the IEEE.

Direct questions and comments about this article to Naresh Soni, STMicroelectronics, 4690 Executive Dr., Ste. 200, San Diego, CA 92121; naresh.soni@st.com.


More information

Computer Architecture

Computer Architecture Computer Architecture Lecture 1: Digital logic circuits The digital computer is a digital system that performs various computational tasks. Digital computers use the binary number system, which has two

More information

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations? Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

1. PowerPC 970MP Overview

1. PowerPC 970MP Overview 1. The IBM PowerPC 970MP reduced instruction set computer (RISC) microprocessor is an implementation of the PowerPC Architecture. This chapter provides an overview of the features of the 970MP microprocessor

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

ECE 341. Lecture # 15

ECE 341. Lecture # 15 ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties

More information

Design and Implementation of a Super Scalar DLX based Microprocessor

Design and Implementation of a Super Scalar DLX based Microprocessor Design and Implementation of a Super Scalar DLX based Microprocessor 2 DLX Architecture As mentioned above, the Kishon is based on the original DLX as studies in (Hennessy & Patterson, 1996). By: Amnon

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II CS252 Spring 2017 Graduate Computer Architecture Lecture 8: Advanced Out-of-Order Superscalar Designs Part II Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Chapter 5: ASICs Vs. PLDs

Chapter 5: ASICs Vs. PLDs Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control,

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control, UNIT - 7 Basic Processing Unit: Some Fundamental Concepts, Execution of a Complete Instruction, Multiple Bus Organization, Hard-wired Control, Microprogrammed Control Page 178 UNIT - 7 BASIC PROCESSING

More information

What is Pipelining? RISC remainder (our assumptions)

What is Pipelining? RISC remainder (our assumptions) What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

MARIE: An Introduction to a Simple Computer

MARIE: An Introduction to a Simple Computer MARIE: An Introduction to a Simple Computer 4.2 CPU Basics The computer s CPU fetches, decodes, and executes program instructions. The two principal parts of the CPU are the datapath and the control unit.

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch

More information

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

CN310 Microprocessor Systems Design

CN310 Microprocessor Systems Design CN310 Microprocessor Systems Design Micro Architecture Nawin Somyat Department of Electrical and Computer Engineering Thammasat University 28 August 2018 Outline Course Contents 1 Introduction 2 Simple

More information

ECE 341 Final Exam Solution

ECE 341 Final Exam Solution ECE 341 Final Exam Solution Time allowed: 110 minutes Total Points: 100 Points Scored: Name: Problem No. 1 (10 points) For each of the following statements, indicate whether the statement is TRUE or FALSE.

More information

The Pentium II/III Processor Compiler on a Chip

The Pentium II/III Processor Compiler on a Chip The Pentium II/III Processor Compiler on a Chip Ronny Ronen Senior Principal Engineer Director of Architecture Research Intel Labs - Haifa Intel Corporation Tel Aviv University January 20, 2004 1 Agenda

More information

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

ECE 3056: Architecture, Concurrency, and Energy of Computation. Sample Problem Set: Memory Systems

ECE 3056: Architecture, Concurrency, and Energy of Computation. Sample Problem Set: Memory Systems ECE 356: Architecture, Concurrency, and Energy of Computation Sample Problem Set: Memory Systems TLB 1. Consider a processor system with 256 kbytes of memory, 64 Kbyte pages, and a 1 Mbyte virtual address

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

Alexandria University

Alexandria University Alexandria University Faculty of Engineering Computer and Communications Department CC322: CC423: Advanced Computer Architecture Sheet 3: Instruction- Level Parallelism and Its Exploitation 1. What would

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 Assigned 2/28/2018 CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 http://inst.eecs.berkeley.edu/~cs152/sp18

More information

An Instruction Stream Compression Technique 1

An Instruction Stream Compression Technique 1 An Instruction Stream Compression Technique 1 Peter L. Bird Trevor N. Mudge EECS Department University of Michigan {pbird,tnm}@eecs.umich.edu Abstract The performance of instruction memory is a critical

More information

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017 Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation

More information

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015 Advanced Parallel Architecture Lesson 3 Annalisa Massini - 2014/2015 Von Neumann Architecture 2 Summary of the traditional computer architecture: Von Neumann architecture http://williamstallings.com/coa/coa7e.html

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Dynamic Instruction Scheduling with Branch Prediction

More information

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

Digital System Design Using Verilog. - Processing Unit Design

Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15

More information