THE ICORE 520-MHZ SYNTHESIZABLE CPU CORE


THE ICORE 520-MHZ SYNTHESIZABLE CPU CORE

A NEW IMPLEMENTATION OF THE ST20-C2 CPU ARCHITECTURE INVOLVES AN EIGHT-STAGE PIPELINE WITH HARDWARE SUPPORT TO EXECUTE UP TO THREE INSTRUCTIONS PER CYCLE. THE DESIGN OPERATES UP TO 520 MHZ AT 1.8 V, AMONG THE HIGHEST REPORTED SPEEDS FOR A SYNTHESIZED CPU CORE.

Nick Richardson, Lun Bin Huang, Razak Hossain, Julian Lewis, Tommy Zounes, and Naresh Soni, STMicroelectronics

Synthesizable processor cores capable of execution speeds typically achievable only by complex custom solutions are important because of the ever-increasing CPU performance levels demanded by modern embedded applications. Another reason is that product design cycles have often decreased to only a few months, so in many cases using lengthy custom design flows is infeasible and prohibitively expensive. The icore architecture offers a solution to this problem. Although a high-performance version of STMicroelectronics' ST20-C2 embedded CPU architecture was our goal, we also had to show that the design was quickly and easily portable as a soft core across existing and future process technologies. This ruled out the use of extensive custom circuits and led to the adoption of a methodology close to that of a traditional application-specific integrated circuit (ASIC) design flow. Such a flow was necessary because of the project's aggressive performance goals.

The ST20-C2 architecture is a superset of the Inmos transputer first introduced in 1984 (STMicroelectronics acquired Inmos in 1989). The ST20-C2 added an exception mechanism (interrupts and traps), extensive software debugging capability, and improved support for embedded systems. High-volume consumer applications such as set-top boxes currently use chips based on the ST20-C2.

The ST20-C2's basic instruction set specifies a set of simple, RISC-like operations, making it a good candidate for relatively easy high-frequency implementation. However, it extends conventional RISC technology by having a variable-length instruction word to promote code compactness. It also uses instructions that specify complex operations to control hardware-implemented kernel functions, such as task scheduling and interprocess communications.1 Also, rather than being based on a RISC-like load/store architecture with many machine registers, the ST20-C2 employs a three-deep hardware evaluation stack, also known as the register stack, on which the majority of instructions operate. This scheme permits the code density necessary for embedded applications, but you must carefully match it with an efficient local memory or cache system to minimize the frequency of memory references outside the CPU core.

These characteristics of the architecture required special consideration to devise a design capable of maintaining high instruction throughput on a per-cycle basis (measured in instructions per cycle, or IPC) without compromising the CPU's frequency of operation. For a successful project, however, we had to demonstrate optimization in all the following areas, in approximate order of importance: frequency, execution efficiency (measured in terms of IPC), portability, core size, and power consumption.

Optimizing the microarchitecture
To achieve efficient instruction execution without excessive design complexity, previous implementations of the ST20-C2 architecture employed relatively short pipelines to reduce branch delays and operand or result feedback penalties, thus producing acceptable IPC over many applications. In icore's case, however, the aggressive frequency target dictated the use of a longer pipeline to minimize stage-to-stage combinatorial delays. Determining how to add the extra pipeline stages without decreasing instruction execution efficiency was a problem, because there would be little point in increasing raw clock frequency at the expense of IPC. Following an analysis using a C-based performance model, we chose a relatively conventional pipeline structure that had some important variations targeted at optimizing instruction flow for the unique ST20-C2 architecture.

Figure 1. Pipeline structure for icore.

Figure 1 shows the pipeline microarchitecture. It comprises four separate units of two pipeline stages each. The instruction fetch unit (IFU) comprises the IF1 and IF2 pipeline stages and is responsible for fetching instructions from the instruction cache and performing branch predictions. The instruction decode unit (IDU) comprises the ID1 and ID2 pipeline stages and is responsible for decoding instructions, generating operand addresses, renaming operand stack registers, and checking operand or result dependencies. It also maintains the program counter, known in this architecture as the instruction pointer, and a local work space pointer. The operand fetch unit (OFU) comprises the OF1 and OF2 pipeline stages and is responsible for fetching operands from the data cache, detecting load or store dependencies, and aligning and merging data supplied by the cache, buffered stores, and data-forwarding buses. Finally, the execute unit (EXU) comprises the execution (EXE) and write back (WBK) pipeline stages and performs all arithmetic operations other than address generation. It also stages results for writing back into the register file unit (RFU) or memory. The RFU contains the three-register-deep stack, conceptually organized as first-in, first-out (FIFO) and comprising the A, B, and C registers (in top-to-bottom order).
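To make this programming model concrete, here is a minimal C sketch of the architectural push/pop semantics (an illustration of the instruction set's view only, not the RFU hardware; all names are ours):

    #include <stdint.h>

    /* Conceptual ST20-C2 evaluation stack: A on top, then B, then C.
       A push shifts A->B->C (the old C falls off the bottom); a pop
       shifts B->A and C->B. */
    typedef struct { uint32_t a, b, c; } reg_stack;

    static void push(reg_stack *s, uint32_t v) {
        s->c = s->b;            /* old B drops to the bottom */
        s->b = s->a;            /* old A moves down one slot */
        s->a = v;               /* new value lands on top    */
    }

    static uint32_t pop(reg_stack *s) {
        uint32_t v = s->a;      /* top of stack comes off          */
        s->a = s->b;
        s->b = s->c;            /* C is left holding a stale copy  */
        return v;
    }

    /* add: pop the top two entries and push their sum, as most
       ST20-C2 arithmetic instructions implicitly do. */
    static void op_add(reg_stack *s) {
        uint32_t x = pop(s), y = pop(s);
        push(s, x + y);
    }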

In practice, the RFU implements this stack as a high-speed multiport register file. We couple the memory-write interface with a store buffer that temporarily holds data while waiting for access to the data cache. We couple the pipeline's instruction-fetch portion (the IFU) to the pipeline's execution portion (the IDU, OFU, and EXU) via a 12-byte instruction fetch buffer (IFB).

Early in the analysis of ST20-C2 program behavior, it became apparent that we should tune the microarchitecture not only for high frequency but also for very efficient memory access. This tuning would support the best use of the memory bandwidth available in the target environment: typically a highly integrated system on a chip with multiple processors sharing access to off-chip memory. To achieve this level of memory bandwidth use, we included two in-line operand caches in the pipeline, both accessible by a single instruction as it executes in the pipeline.

To understand the operand caches' function, we briefly explain the ST20-C2's two addressing modes. One mode is for addressing local variables, accessible by using an offset relative to an address pointer stored in a register known as the local work space pointer. This pointer holds the start address of a program's procedure stack (also known as its local work space), and so stores a new value on each procedure call and return. The other mode is for addressing nonlocal variables anywhere in the 4-Gbyte memory address space, using a full 32-bit address preloaded into the register stack's top location (the A register).

The first operand cache is for local variables. We call it the local work space cache (LWC), and it accelerates references to the first 16 words in the current local work space, indicated by the local work space pointer. Running several benchmark programs on the C-based statistical performance model helped us determine the LWC's size. We place the LWC early in the pipeline (in the ID1 stage) so that it can supply local variables used for nonlocal address calculations to the address generation unit (AGU). Doing so permits early calculation of the full nonlocal address in the next pipeline stage (ID2).

The second operand cache is an 8-Kbyte inline data cache that accelerates references to nonlocal variables, as well as local variables that miss in the LWC. It uses nonlocal addresses that the AGU calculates in the ID2 stage and supplies operands to the execution unit, and it thus occupies the two pipeline stages before the EXU. For critical-path optimization, its tag RAM and data RAM are in two different pipeline stages. Accessing the tag RAM before the data RAM eliminates the set-associative output multiplexing that would be required if the tag and data RAMs were in the same stage. It also lets cache reads and writes share the same control flow, thus simplifying the design. Together, these two highly integrated operand caches complement the ST20-C2 architecture's A, B, and C registers, effectively acting like a very large register file.
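To make the two addressing modes concrete, here is a hedged C sketch of how the effective addresses are formed (offsets are scaled to 32-bit words; the helper names are ours):

    #include <stdint.h>

    /* Local mode: offset n is relative to the local work space pointer,
       the base of the current procedure's frame. References with n < 16
       are the ones the LWC is built to accelerate. */
    static uint32_t local_addr(uint32_t wptr, uint32_t n) {
        return wptr + n * 4u;
    }

    /* Nonlocal mode: offset m is relative to a full 32-bit pointer that
       an earlier instruction preloaded into the A register. */
    static uint32_t nonlocal_addr(uint32_t a_reg, uint32_t m) {
        return a_reg + m * 4u;
    }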
Instruction folding
Increasing the instruction folding opportunities is an important effect of the inline operand caches. Folding is an instruction decoding technique that combines two or more machine instructions into a single pipeline operation. Because the ST20-C2 has a stack-based architecture, most of its instructions specify very simple operations. Such operations include loading and storing operands to and from the top of the three-deep register stack or performing arithmetic operations on operands already in the register stack. For simplicity, however, we never combine memory and arithmetic logic unit (ALU) operations in the same instruction. Without compound instructions that allow memory and other arithmetic operations to occur as single-slot pipeline operations, it would be impossible to take advantage of icore's inline memory structure. With folding, however, you can merge up to three successive instructions into one operation that occupies a single execution slot and thus fully use icore's pipeline resources. As an example, take the following very common three-instruction sequence:

    ldl <n>
    ldnl <m>
    add

The first instruction, load local, loads the variable at offset n from the current

value of the local work space pointer and pushes it onto the top of the register stack. The second instruction, load nonlocal, loads the variable at offset m from the value on top of the register stack (in this case, the value that the ldl instruction just placed there). Finally, the add instruction adds the top two values on the register stack (A and B) and pushes the result back onto the stack. Without folding, this instruction sequence would occupy three distinct pipeline slots and could stall because of a data dependency between the ldl and the ldnl instructions. However, with folding and icore's inline memory structure, these instructions can execute in a single pipeline slot, even though they require two memory reads. Table 1 describes the stage-by-stage execution.

Table 1. Example of compound instruction execution.

Stage   Execution description
IF1     Instruction fetch (instruction cache tag access).
IF2     Instruction fetch (instruction cache data RAM access and loading of the instruction fetch buffer).
ID1     Execute the ldl by reading the local variable from the LWC at offset n; issue ldl, ldnl, and add from the instruction fetch buffer as one operation if the ldl hits in the LWC.
ID2     Using the AGU, generate the address the ldnl instruction will use by adding the local operand that the ldl obtained in ID1 to offset m.
OF1     Fetch the data that ldnl requests from the data cache (tag RAM).
OF2     Fetch the data that ldnl requests from the data cache (data RAM). Fetch the second data operand from the register file unit (pop the stack).
EXE     Add the data that the ldnl obtained in OF2 to the second operand from the register stack.
WBK     Push the result onto the register stack.

Different operations work in different pipeline stages (ldl in ID1; ldnl in ID2, OF1, and OF2; and add in EXE and WBK) to separately execute the three folded instructions. But because each operation occupies only one pipeline stage at a time, the three instructions effectively execute in a single clock cycle. The icore supports many such folding sequences, but they all follow this same principle.
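For illustration only, here is a C sketch of the pattern match that such folding implies; the real logic is hardware in the decode stages, folding requires the ldl to hit in the LWC, and the opcode names and buffer layout here are ours:

    #include <stddef.h>

    typedef enum { OP_LDL, OP_LDNL, OP_ADD, OP_OTHER } opcode;
    typedef struct { opcode op; int operand; } insn;

    /* Try to fuse instructions at the head of the instruction fetch
       buffer into one pipeline operation, and return how many were
       taken. The trio fits one slot because ldl executes in ID1, ldnl
       in ID2/OF1/OF2, and add in EXE/WBK. */
    static size_t fold(const insn *buf, size_t n, int ldl_hits_lwc) {
        if (n >= 3 && ldl_hits_lwc &&
            buf[0].op == OP_LDL && buf[1].op == OP_LDNL &&
            buf[2].op == OP_ADD)
            return 3;                 /* ldl + ldnl + add, one slot */
        if (n >= 2 && ldl_hits_lwc &&
            buf[0].op == OP_LDL && buf[1].op == OP_LDNL)
            return 2;                 /* two loads folded into one  */
        return n ? 1 : 0;             /* otherwise issue singly     */
    }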
Data forwarding and register scoreboarding
In a deeply pipelined design, mitigating the effects of delays caused by pipeline latencies is important. Data dependency between successive instructions is one significant source of latency-based performance degradation. If the dependency is a true one (such as when an arithmetic operation uses the result of the instruction just before it), then it is impossible to eliminate the dependency, but microarchitecture techniques can reduce the pipeline stall this situation causes. These techniques include providing data forwarding paths between potentially interdependent pipeline stages. In icore, this strategy results in data bypass paths from WBK to OF2, EXE to OF2, WBK to ID2, and EXE to ID2, as Figure 1 shows.

The paths from EXE and WBK to OF2 allow an instruction either immediately after or up to two cycles behind an arithmetic operation to use the arithmetic operation's result without penalty. If more than two clocks separate them, then the dependent operation can read the result from the register stack without penalty. Similarly, the path from EXE to ID2 lets the AGU use an operand that one instruction fetches from the data cache as an operand address for another instruction that follows the fetching instruction. This situation commonly occurs when either an ldnl instruction or an ldl instruction that misses in the LWC loads an address from memory that a later ldnl instruction uses. Such a situation occurs when following a pointer chain, for example. Using the forwarding path under these circumstances results in a worst-case pipeline stall of two clock cycles. If we provided no forwarding paths, this penalty would be four clock cycles, to allow time for the operand address to cycle through the register stack.

The path from WBK to ID2 operates similarly. However, we use it when the calculated result of one instruction, rather than a result

just fetched from the data cache, serves as an AGU input for an instruction that follows. This path reduces the penalty for this type of addressing to a maximum of three clocks.

Activating the forwarding paths at the appropriate times requires register dependency-checking logic (also known as register scoreboarding). In icore, this logic resides in the ID2 stage, and its operation is quite straightforward: It compares the name of a register required as a source operand by one instruction with the names of all registers that instructions further ahead in the pipeline are about to modify. When this logic detects a match, the pipeline stalls the newer instruction in ID2 if the older instruction cannot supply its result in time. Such a situation typically occurs when an instruction uses an ALU result as an operand address or when a later instruction uses the result of an earlier multicycle ALU operation. After inserting any necessary stall cycles, the scoreboarding logic releases the instruction from ID2 and activates the appropriate forwarding path to provide the source operands for the instruction at the earliest possible time.

For the ST20-C2 architecture, register dependency checking must handle operands that constantly change position within the register stack's A, B, and C registers, because most instructions perform implicit or explicit pushes or pops of the stack. Thus, almost every instruction appears to have data dependencies on every older instruction in the pipeline, because nearly all of them cause the loading of new values into A, B, and C. This loading occurs even though most of those dependencies are false, in the sense that they arise from the movement of data from one register to another rather than from the creation of new results. Considerable performance degradation would result if the pipeline stalled newer instructions based on those false dependencies.

The icore solves this problem by using a simple renaming mechanism that maps the ST20-C2 architecture's conceptual A, B, and C registers onto fixed hardware registers, R0, R1, and R2. The key is that the same hardware register remains allocated to a particular instruction's result throughout its lifetime in the pipeline. This occurs despite the fact that subsequent instructions push and pop the architectural register stack and apparently cause the result to move among A, B, and C. A mapping table for each pipeline stage indicates which architectural register (A, B, or C) maps onto which fixed hardware register (R0, R1, or R2) for the instruction in that pipeline stage. Operations that only move operands between registers do so by simply renaming the R0, R1, and R2 registers to the new A, B, or C mapping, rather than actually moving data among them. The icore performs this mapping in ID2 and bases it on how the current instruction in ID2 affects the existing mapping up to, but not including, that instruction. Once it finishes the ID2 mapping, icore performs dependency checking based on R0, R1, and R2 source and destination values. Results never physically move among these registers, eliminating false dependencies.
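A small C sketch may help illustrate the renaming idea (our naming; the real mapping tables are per-stage hardware). Pushes and pops permute a three-entry map from the architectural A, B, and C names to the fixed registers R0 through R2, so results never move:

    /* map[0] holds the hardware register number (0, 1, or 2) currently
       acting as A, map[1] as B, map[2] as C. */
    typedef struct { int map[3]; } rename_map;

    /* Push: the hardware register backing the old C is recycled to hold
       the new A; old A becomes B and old B becomes C, by renaming only. */
    static rename_map push_map(rename_map m) {
        rename_map r = { { m.map[2], m.map[0], m.map[1] } };
        return r;
    }

    /* Pop: B's hardware register becomes A, and C's becomes B. */
    static rename_map pop_map(rename_map m) {
        rename_map r = { { m.map[1], m.map[2], m.map[0] } };
        return r;
    }

    /* Dependency checks then compare these hardware register numbers,
       not A, B, and C, so pure stack motion creates no dependency. */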
Branch prediction
The icore's relatively long pipeline (compared with those of previous ST20-C2 implementations) increases the danger of branches causing significant performance degradation. After performance simulations showed that the longer pipeline's effect on some benchmarks was significant, we incorporated a low-cost branch prediction scheme into icore's microarchitecture.

In a machine without branch prediction, the source of most branch penalties is twofold. First, the processor must determine the target address of a taken branch or other control-transfer instruction. It cannot do so until after decoding the branch, which requires identifying the branch and then (usually) performing an arithmetic operation, such as adding an offset to the current instruction pointer value to obtain the address of the next valid instruction in the program flow. By the time the instruction fetch unit can apply the next valid instruction's target address to the instruction cache to start fetching instructions from the new destination, it has fetched several instructions from the wrong address. The pipeline must cancel these instructions, creating a penalty of several clock cycles. Second, conditional branches, those that depend on the results of previous instructions to determine their direction, usually must progress all the way to the pipeline's end before the resolution of the condition on which their action depends. Only then can instruction fetches from a new target address proceed, causing an even greater branch penalty than that incurred by the target address calculation alone.

Figure 2. Branch prediction logic.

The icore's branch prediction mechanism reduces the penalties in both these cases. It uses a two-bit predictor scheme to predict branch and subroutine call behavior (as either taken or not taken), using a branch target buffer to predict target addresses for taken branch and call instructions. It also uses a four-entry return stack to predict the special case of ret instruction return addresses.

Figure 2 shows icore's prediction scheme, which employs a 1,024-entry branch history table (BHT). Each entry contains a two-bit predictor that implements a branch hysteresis algorithm2 characterizing branch instructions as either weakly or strongly taken or not taken. The BHT keeps the most current branching history for a set of instruction cache words. Predicted fetch addresses (such as the target addresses of predicted-taken branches) go to a 64-entry branch target buffer (BTB).

Whenever the instruction fetch unit fetches a new 4-byte word from the instruction cache, the instruction address that indexes the cache also indexes the BHT and BTB. The branch prediction logic uses the information retrieved from the BHT to predict whether a taken branch is at the current instruction address. If the prediction logic predicts a taken branch, the instruction fetch unit loads the address retrieved from the BTB into the instruction fetch pointer and uses it to fetch the next instruction word. Otherwise, it uses the instruction fetch pointer's sequential value. All this processing takes place in the very first pipeline stage, IF1, permitting immediate application of new target addresses to the instruction cache after the prediction of a taken branch. This effectively lets predicted-taken branches immediately change the instruction address applied to the instruction cache without any delay cycles, making the performance impact of successfully predicted branches negligible. We feed the predicted direction and target forward through IF2 and load them into the IFB (in the case of the prediction bits) and into a four-entry target buffer (in the case of the predicted target address). Later pipeline stages use these values to validate the predictions and update the instruction pointer.
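As a sketch of the prediction state machine and lookup, the two-bit hysteresis states and the one-entry-per-word, low-order-bit indexing below follow the text; the C rendering itself is our illustration:

    /* Two-bit predictor states: 0 strongly not taken, 1 weakly not
       taken, 2 weakly taken, 3 strongly taken. */
    typedef unsigned char bht_entry;

    static int predict_taken(bht_entry e) { return e >= 2; }

    /* Saturating update with the branch's resolved direction. */
    static bht_entry bht_update(bht_entry e, int taken) {
        if (taken) return (bht_entry)(e < 3 ? e + 1 : 3);
        return (bht_entry)(e > 0 ? e - 1 : 0);
    }

    /* 1,024 entries, one per 4-byte instruction word, indexed by the
       low-order fetch-address bits only; distinct branches can alias. */
    static unsigned bht_index(unsigned fetch_addr) {
        return (fetch_addr >> 2) & 0x3ffu;
    }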

Mispredictions can occur. One cost-saving measure, using only the low-order bits of the instruction fetch address to index BHT and BTB entries and ignoring the upper address bits, causes one type of misprediction. Providing only one BHT or BTB entry per 4-byte cache word, even though a word can hold more than one branch, is another cost-saving measure that causes mispredictions. These factors can cause many fetched instructions to share a single BHT or BTB entry, even though the entry is likely correct only for the one instruction that the processor originally used to update it. As a result, the instructions at other fetch addresses that share the entry might be incorrectly predicted, have incorrect target addresses, or both. Other misprediction sources include mispredicted directions of conditional branches and incorrect return addresses.

Later pipeline stages detect and resolve mispredictions. The instruction decoder in the ID1 stage can detect many mispredictions caused by BHT aliasing. For instance, if the prediction logic predicts a taken condition for an instruction that is not a branch, or predicts an untaken condition for an unconditional branch, the prediction is clearly in error. A repair operation can start immediately and thus incurs only a small penalty. To check predicted targets, an adder in the OF1 stage calculates a taken branch's correct target address and compares it with the predicted one (which ID2 feeds forward from the BTB). If a missing or aliased BTB entry causes a mismatch, then recovery can start from OF1. Finally, a conditional branch must wait until it reaches the EXE stage before it is possible to check the condition that determines the branch's action, and thus the branch has the longest repair time when incorrectly predicted.

The steps to repair a branch depend on the type of misprediction. When BHT aliasing causes an incorrect branch instruction to be taken, the pipeline flushes all instructions fetched after that instruction. The pipeline restarts instruction fetching from the source address of that instruction to ensure that it is correctly fetched, while updating the BHT to indicate not taken. Conversely, when the prediction logic incorrectly predicts a taken branch's target address, the pipeline flushes the incorrect target's instructions. It then restarts instruction fetching from the correct target address computed in the OF1 stage, while the IF1 stage updates the BTB with the new target address. Finally, repairs to correct a mispredicted conditional branch flush all subsequent instructions from the pipeline, which begins fetching again from either the branch's target address or the next sequential address, depending on whether the branch is taken or not. The repairs update the BHT and BTB appropriately.

Conventional branch prediction is ineffective for branches caused by subroutine returns, because the return's target address changes constantly, depending on the program location of the subroutine call. Thus, icore uses a four-entry return stack to predict the ret (return) instruction's target. The return stack saves the return addresses for the most recent call instructions. When the system encounters a ret instruction in the ID2 stage, it pops the return stack and uses the value as the predicted branch address for the ret instruction, incurring only a small penalty. The system validates the ret predictions in the EXE stage because, in the ST20-C2 architecture, the actual return addresses come from a memory location in the local work space and typically are available only after accessing the data cache.
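A minimal C sketch of such a return stack follows (our illustration; the wrap-around behavior on overflow is an assumption, and any resulting misprediction is caught by the EXE-stage check):

    #include <stdint.h>

    /* Four-entry return stack: a call pushes its return address; a ret
       pops the top entry as the predicted target. */
    typedef struct { uint32_t addr[4]; unsigned top; } ret_stack;

    static void on_call(ret_stack *s, uint32_t return_addr) {
        s->top = (s->top + 1) & 3;     /* overflow silently overwrites */
        s->addr[s->top] = return_addr;
    }

    static uint32_t on_ret(ret_stack *s) {
        uint32_t predicted = s->addr[s->top];
        s->top = (s->top + 3) & 3;     /* pop                          */
        return predicted;              /* validated later in EXE       */
    }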
Cache subsystem
The cache controller's design focuses on the need for simplicity (thereby enabling high frequency) and tight coupling to the CPU pipeline. These requirements resulted in a two-stage pipelined approach: The first stage handles the tag access and tag comparison, and the second stage handles the data access and data alignment. This arrangement balances delays, saving power and delay time by eliminating the need for physically separate data RAMs for each associative-cache bank (or way). This is possible because the system can predetermine the data RAM way number in the tag cycle and then simply encode it into extra address bits in the data RAM. To further reduce delays, we built integrated pipeline input registers into the tag and data RAMs to eliminate input wire delays.

Figure 3 shows a simplified block diagram of the IFU and OFU cache controllers. One controller occupies the IF1 and IF2 stages in the IFU; the other occupies the OF1 and OF2 stages in the OFU.

Figure 3. Cache controllers.

We use the same cache controller design in each case, so its main pipeline stages correspond to IF1/IF2 in one case and OF1/OF2 in the other. Before the IF1 or OF1 phase, the cache controller receives the request (read or write), address, and data from the CPU and loads them into its input pipeline registers. In the IF1 or OF1 phase, the cache controller of the instruction or data cache accesses the tag RAMs and performs the tag comparison. The first version of icore has 8-Kbyte caches and a 16-byte line size, so it uses 512 tag entries. At the end of IF1 or OF1, the result of the tag comparison (hit or miss) and the IF1 or OF1 address and data fields from the original CPU request go to the input registers of the IF2 or OF2 stage. In IF2 or OF2, the system accesses the data RAMs and returns the data to the CPU's IFB, in the case of the instruction cache, or to the data aligner on the inputs to the EXE stage, in the case of the data cache.

In addition to returning data on a cache hit, the cache controller in the IF2 or OF2 stage also initiates a cache-miss sequence if the requested data is not in the cache. It does this by asserting a hold signal, which freezes the CPU pipeline, and signaling the main-memory controller to fetch the missing line.
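In software form, the tag stage might be sketched as follows. The 8-Kbyte size, 16-byte lines, and 512 tag entries come from the text; the two-way organization, field widths, and names are our assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 2
    #define SETS 256                    /* 512 tag entries / 2 ways */

    struct tag_entry { uint32_t tag; bool valid; };
    static struct tag_entry tags[SETS][WAYS];

    /* Stage 1 (IF1/OF1): tag access and comparison. On a hit, the way
       number is encoded into extra data RAM address bits, so stage 2
       (IF2/OF2) needs neither per-way data RAMs nor an output mux. */
    static bool tag_stage(uint32_t addr, uint32_t *data_ram_row) {
        uint32_t set = (addr >> 4) & (SETS - 1);   /* 16-byte lines  */
        uint32_t tag = addr >> 12;                 /* high bits      */
        for (int way = 0; way < WAYS; way++) {
            if (tags[set][way].valid && tags[set][way].tag == tag) {
                *data_ram_row = set * WAYS + way;  /* way -> address */
                return true;                       /* hit            */
            }
        }
        return false;   /* miss: stage 2 asserts hold, starts a fill */
    }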

Once main memory returns the missing line, the memory controller loads the data into the data RAM and the upper address bits into the tag RAM, and the IF2 or OF2 stage deasserts the hold. The CPU must then restore the original state of the cache controller's internal pipeline registers by repeating the read or write requests that were in progress at the time of the cache miss. Once the CPU restores state, normal operation resumes. Having the CPU restore state simplifies the cache design, reducing the complexity of its critical paths.

The cache controller implements a write-back policy: Writes that hit the cache write only to the cache, and not to main memory (as opposed to a write-through policy, which would write to both). Thus the cache can contain a more up-to-date version of a line than main memory, necessitating that the cache write the line back to memory when a new line from a different address replaces an old one. To facilitate this, the caches keep a dirty bit with each line to indicate when the line has been modified and hence requires writing back to main memory on its replacement. The dirty bits reside in a special field in the data RAM, and there is only one per four-word location (because each word is 4 bytes and a line is 16 bytes). The cache controller sets the dirty bit for a given line when a write to that line hits the cache.

Write buffering and store/load bypass
Writes to the data cache presented a potential performance problem, because the WBK stage at the end of the CPU pipeline generates write requests. So a write in any given clock cycle could clash with newer read requests generated by the AGU at the end of the ID2 stage. Because the cache is single-ported, this clash would cause the CPU pipeline to stall, even though on average the cache controller is capable of handling the full read/write bandwidth. Performance simulations showed that we could avoid most collisions by adding a special write-buffering stage, SBF, to store blocked memory write requests from the WBK stage. With the SBF, icore can generate write requests either from the WBK stage, if the SBF stage is empty and the data cache is not busy, or from the SBF stage, if it is not empty and the data cache is not busy. This capability gives write requests more opportunities to find unused data cache request slots. Additional write buffers would further increase the opportunities for collision avoidance, but they provided much smaller returns than the first one, so we did not implement them in icore's demonstration version.

Finally, because of the high memory utilization of ST20-C2 programs, it was beneficial to add a store/load bypass mechanism. This mechanism permits icore to supply data from queued memory writes in the EXE, WBK, and SBF stages directly to read requests from the same address in OF2, replacing stale data obtained from the data cache. This approach avoids having to stall a memory read when it needs data from a data cache location with an outstanding write.
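A simplified C sketch of the bypass check follows (our illustration; word-granular matching and newest-first priority are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    /* One queued, not-yet-performed data cache write. */
    typedef struct { uint32_t addr, data; bool valid; } pending_store;

    /* Scan the queued writes in EXE, WBK, and SBF, newest first, and
       forward matching data to an OF2 read so the load need not wait
       for the write to reach the data cache. */
    static bool store_load_bypass(const pending_store q[3],
                                  uint32_t load_addr, uint32_t *data) {
        for (int i = 0; i < 3; i++) {        /* q[0] is newest (EXE) */
            if (q[i].valid && q[i].addr == (load_addr & ~3u)) {
                *data = q[i].data;       /* replaces stale cache data */
                return true;
            }
        }
        return false;                    /* no match: use cache data */
    }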
Effects of microarchitecture features on IPC
We first studied the effect of microarchitectural enhancements on IPC using a C-based statistical performance model. Such a model simulates the execution of various benchmark traces produced by an ST20-C2 instruction set simulator. These benchmarks measure the effects of basic instruction folding; the LWC and its alternative designs; enhanced instruction folding, enabled by using the LWC to fold two load instructions into one; and branch prediction schemes and designs. We fine-tuned some features of the architecture during register-transfer-level implementation, as more accurate performance and area tradeoff analyses became possible. One of the more noticeable implementation-level enhancements was modifying the branch predictor state transition algorithm to handle the case where multiple predicted branch instructions reside in the same cache word. Other enhancements include a dynamic LWC disable feature, which minimizes the time the LWC is unavailable because of coherency updates, and a rotating LWC pointer feature, which maximizes the reuse of valid operands in the LWC as the work space pointer increments and decrements. Table 2 summarizes the effects of these microarchitecture features on IPC.

Table 2. IPC enhancements of microarchitecture optimizations as measured on a Dhrystone 2.1 benchmark.

Feature                                            Increase in IPC (%)   Overall IPC improvement (%)
Basic instruction folding
LWC
Enhanced instruction folding                       7                     27
LWC enhancements                                   8                     35
Branch prediction                                  6                     41
Enhanced branch prediction transition algorithm    1                     42
Return stack                                       3                     45

Optimizing the implementation
We used two methods to optimize icore's implementation: a synthesis process and a higher-performance standard-cell library.

Synthesis strategy
We divided the synthesis process into two steps. The first step employed a bottom-up approach using Synopsys Design Compiler, in which we synthesized all top-level blocks using estimated constraints. We generally defined blocks as logic components with a single pipeline stage. The execute unit included a large multiplier, for which we used Synopsys Module Compiler because we found it to provide better results than Design Compiler. It took several incremental compilations of the design, merging blocks into larger and larger modules and finally into the full processor, to tune performance across the different design boundaries. The OF2 and IF2 stages were the only regions where synthesis produced unacceptable path delays. In both cases, the problems involved the use of complex multiplexer functions, so we designed custom multiplexer gates for these cases and manually inserted the logic with a dont_touch attribute to achieve the desired results. We resolved other complex multiplexer problems by rewriting the hardware description language to guide the synthesis tool toward the preferred implementation.

The synthesis strategy's second step employed a top-down approach, with cell placement using Synopsys Physical Compiler, which optimizes gates and drivers based on actual cell placement. Physical Compiler proved effective in eliminating certain design iterations. In an ASIC flow, iterations can arise from the discrepancy between the gate loading assumed by the wire load model and that produced by the final routed netlist. The delays based on the placement came within 10 percent of those obtained after final place and route. You can run Physical Compiler with either an RTL-to-placed-gates or a gates-to-placed-gates flow; the latter provided better results for this design.

Standard-cell library
We developed a higher-performance standard-cell library, H12, to improve icore's circuit speed over that provided by a generic standard-cell library. Using the H12 library provided a speed improvement of more than 20 percent in simulation and on silicon. H12 used the same conventional CMOS static designs as the generic library, but incorporated the following differences: a reduced ratio of PMOS to NMOS transistor widths (P/N) in the standard cells; larger logic gates (instead of the buffered-level gates in this particular cell library); more efficient layout; and increased cell height.

We determined the P/N ratio for each cell by measuring the fastest speed obtained when the gate was driving a light load. We observed that as the gate load increased, we needed to increase the P/N ratio to obtain the optimal speed. However, the physical-synthesis tool minimizes the fanout and line loading on the most

critical paths, so we optimized the H12 library for the typical output loading of those paths. We also reduced delays by creating larger, single-level cells for high-drive gates. The original library added an extra buffering level to its high-drive logic gates to keep their input capacitance low and their layout smaller, but these optimizations also increased their delay. Eliminating that buffer by increasing the cell size also increased the input capacitance but significantly reduced the net delay. The icore used these cells only in the most critical paths, so their size did not have much impact on the overall layout size. The H12 cells generally had larger transistors, requiring taller layouts. However, by using more-efficient layout techniques, we reduced many of the cells' widths compared with those of the generic library. Also, the cells' taller height let the router use extra metal tracks, which in turn increased the cell placement's utilization.

Physical-design strategy
We used Physical Compiler to generate the cell placement, freezing the data cache and instruction cache locations during the placement process and permitting the floorplanning to be data-flow driven. We shielded clock lines to reduce delay uncertainty by using Wroute in Cadence's Silicon Ensemble. To minimize the power drops inside the core, we designed a dense metal-layer-5 and metal-layer-6 (the upper metal layers available in our process) power grid mesh. We generated the balanced clock tree with Cadence's CTGen. The maximum simulated skew across the core was 120 ps under worst-case process and environmental conditions; typical clock skew would be much less. Using Synopsys Formality, we formally compared the final routed design to the synthesized netlist, and we used Synopsys PrimeTime to calculate delays on the netlist extracted with Synopsys Arcadia.

Results
We fabricated the icore processor in STMicroelectronics' 0.18-micron HCMOS8D process. Figure 4 shows a GDSII-based plot of the test chip. Excluding the memories, the entire chip has the distinctive random layout of synthesized logic. The large memories on the right and on the top and bottom of the plot are the data and instruction RAMs. The other small memories in the plot are the LWC, BHT, and BTB.

Figure 4. The icore test chip's GDSII-based plot.

Timing simulation indicated that we achieved a balance of delays between the pipeline stages, as Figure 5 shows. The slowest individual pipeline stage limits the overall performance of the processor; any silicon area spent on making the other stages faster than the slowest stage is wasted, because no net improvement in performance results. The optimally efficient design has balanced delays, so that all stages have equal performance. Such a deep, well-balanced pipeline structure is amenable to highly optimal synthesis.3

Silicon testing showed functional performance of the design from 475 MHz at 1.7 V to 612 MHz at 2.2 V, at 25 degrees Celsius (ambient). The design operates up to 520 MHz at 1.8 V, among the highest reported speeds for a synthesized CPU core.4 Analysis of results for the Dhrystone 2.1 benchmark showed that icore achieved an IPC of about 0.7, which met the goal of being the same or greater than that of previous ST20-C2 implementations. This indicates that the various pipeline optimizations were functioning correctly.

References
1. ST20-C2 Instruction Set Reference Manual, STMicroelectronics, 72-TRN, Nov.
2. J.K.F. Lee and A.J. Smith, "Branch Prediction Strategies and Branch Target Buffer Design," Computer, vol. 17, no. 1, Jan. 1984, pp. 6-22.

3. D.G. Chinnery and K. Keutzer, "Closing the Gap between ASIC and Custom: An ASIC Perspective," Proc. 38th Design Automation Conf. (DAC 2001), ACM Press, 2001, pp. 637-642.
4. C.D. Snyder, "Synthesizable Core Makeover: Is Lexra's Seven-Stage Core the Speed King?" Microprocessor Report, 2 July 2001.

Figure 5. Static analysis of worst-case pipeline path delays (ns) for the IF1, IF2, ID1, ID2, OF1, OF2, EXE, and WBK stages.

Nick Richardson is an STMicroelectronics Fellow who led the development of the icore microarchitecture. His research interests include the definition and design of microprocessor, cache memory, and network architectures. Richardson has a BSEE from the University of London.

Lun Bin Huang is a senior principal engineer at STMicroelectronics. His contributions to the icore project include the microarchitecture and design of branch prediction, instruction decode and issue, and operand fetch. His research interests include the architecture and design of high-performance CPU cores and network processors. Huang has an MSEE from California State University, Long Beach. He is a member of the IEEE.

Razak Hossain is a senior principal engineer at STMicroelectronics in La Jolla, Calif. His research interests include the design of high-speed circuits and interconnects. Hossain has a PhD in electrical engineering from the University of Rochester. He is a member of the IEEE.

Julian Lewis was the icore project manager for STMicroelectronics. He has worked as architect, methodologist, verifier, and project manager for ST20 cores. He now works on the architectural definition of ST's DVD chips and manages a team implementing audio software for DVD players. Lewis has a BSc in computer science from the University of Manchester.

Tommy Zounes is a senior principal engineer at STMicroelectronics who works in the advanced design group of ST's Central Research and Development Division in La Jolla, developing a Domino-style library for a synthesis flow. His research interests include custom high-speed digital design solutions, including front-to-back design of standard-cell libraries, RAMs, ROMs, and register files. Zounes has a BSEE and an MSEE from San Diego State University. He is a member of the IEEE.

Naresh Soni is vice president of US central research and development for STMicroelectronics. His research interests include high-performance circuits and systems. Soni has a BSEE from the University of Bombay, India, and an MSEE from the University of Texas, Austin. He is a member of the IEEE.

Direct questions and comments about this article to Naresh Soni, STMicroelectronics, 4690 Executive Dr., Ste. 200, San Diego, CA 92121; naresh.soni@st.com.


More information

Computer Architecture

Computer Architecture Computer Architecture Lecture 1: Digital logic circuits The digital computer is a digital system that performs various computational tasks. Digital computers use the binary number system, which has two

More information

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations?

c. What are the machine cycle times (in nanoseconds) of the non-pipelined and the pipelined implementations? Brown University School of Engineering ENGN 164 Design of Computing Systems Professor Sherief Reda Homework 07. 140 points. Due Date: Monday May 12th in B&H 349 1. [30 points] Consider the non-pipelined

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

1. PowerPC 970MP Overview

1. PowerPC 970MP Overview 1. The IBM PowerPC 970MP reduced instruction set computer (RISC) microprocessor is an implementation of the PowerPC Architecture. This chapter provides an overview of the features of the 970MP microprocessor

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

ECE 341. Lecture # 15

ECE 341. Lecture # 15 ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties

More information

Design and Implementation of a Super Scalar DLX based Microprocessor

Design and Implementation of a Super Scalar DLX based Microprocessor Design and Implementation of a Super Scalar DLX based Microprocessor 2 DLX Architecture As mentioned above, the Kishon is based on the original DLX as studies in (Hennessy & Patterson, 1996). By: Amnon

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II

CS252 Spring 2017 Graduate Computer Architecture. Lecture 8: Advanced Out-of-Order Superscalar Designs Part II CS252 Spring 2017 Graduate Computer Architecture Lecture 8: Advanced Out-of-Order Superscalar Designs Part II Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Chapter 5: ASICs Vs. PLDs

Chapter 5: ASICs Vs. PLDs Chapter 5: ASICs Vs. PLDs 5.1 Introduction A general definition of the term Application Specific Integrated Circuit (ASIC) is virtually every type of chip that is designed to perform a dedicated task.

More information

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25

CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 CS152 Computer Architecture and Engineering March 13, 2008 Out of Order Execution and Branch Prediction Assigned March 13 Problem Set #4 Due March 25 http://inst.eecs.berkeley.edu/~cs152/sp08 The problem

More information

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars

Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 12 Branch Prediction and Advanced Out-of-Order Superscalars Krste Asanovic Electrical Engineering and Computer

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control,

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control, UNIT - 7 Basic Processing Unit: Some Fundamental Concepts, Execution of a Complete Instruction, Multiple Bus Organization, Hard-wired Control, Microprogrammed Control Page 178 UNIT - 7 BASIC PROCESSING

More information

What is Pipelining? RISC remainder (our assumptions)

What is Pipelining? RISC remainder (our assumptions) What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

MARIE: An Introduction to a Simple Computer

MARIE: An Introduction to a Simple Computer MARIE: An Introduction to a Simple Computer 4.2 CPU Basics The computer s CPU fetches, decodes, and executes program instructions. The two principal parts of the CPU are the datapath and the control unit.

More information

UNIT I (Two Marks Questions & Answers)

UNIT I (Two Marks Questions & Answers) UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-

More information

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction

Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction ISA Support Needed By CPU Reduction of Control Hazards (Branch) Stalls with Dynamic Branch Prediction So far we have dealt with control hazards in instruction pipelines by: 1 2 3 4 Assuming that the branch

More information

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

CS Computer Architecture

CS Computer Architecture CS 35101 Computer Architecture Section 600 Dr. Angela Guercio Fall 2010 An Example Implementation In principle, we could describe the control store in binary, 36 bits per word. We will use a simple symbolic

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

CN310 Microprocessor Systems Design

CN310 Microprocessor Systems Design CN310 Microprocessor Systems Design Micro Architecture Nawin Somyat Department of Electrical and Computer Engineering Thammasat University 28 August 2018 Outline Course Contents 1 Introduction 2 Simple

More information

ECE 341 Final Exam Solution

ECE 341 Final Exam Solution ECE 341 Final Exam Solution Time allowed: 110 minutes Total Points: 100 Points Scored: Name: Problem No. 1 (10 points) For each of the following statements, indicate whether the statement is TRUE or FALSE.

More information

The Pentium II/III Processor Compiler on a Chip

The Pentium II/III Processor Compiler on a Chip The Pentium II/III Processor Compiler on a Chip Ronny Ronen Senior Principal Engineer Director of Architecture Research Intel Labs - Haifa Intel Corporation Tel Aviv University January 20, 2004 1 Agenda

More information

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

ECE 3056: Architecture, Concurrency, and Energy of Computation. Sample Problem Set: Memory Systems

ECE 3056: Architecture, Concurrency, and Energy of Computation. Sample Problem Set: Memory Systems ECE 356: Architecture, Concurrency, and Energy of Computation Sample Problem Set: Memory Systems TLB 1. Consider a processor system with 256 kbytes of memory, 64 Kbyte pages, and a 1 Mbyte virtual address

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

Alexandria University

Alexandria University Alexandria University Faculty of Engineering Computer and Communications Department CC322: CC423: Advanced Computer Architecture Sheet 3: Instruction- Level Parallelism and Its Exploitation 1. What would

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars

CS 152 Computer Architecture and Engineering. Lecture 12 - Advanced Out-of-Order Superscalars CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory

More information

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12

CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 Assigned 2/28/2018 CS152 Computer Architecture and Engineering SOLUTIONS Complex Pipelines, Out-of-Order Execution, and Speculation Problem Set #3 Due March 12 http://inst.eecs.berkeley.edu/~cs152/sp18

More information

An Instruction Stream Compression Technique 1

An Instruction Stream Compression Technique 1 An Instruction Stream Compression Technique 1 Peter L. Bird Trevor N. Mudge EECS Department University of Michigan {pbird,tnm}@eecs.umich.edu Abstract The performance of instruction memory is a critical

More information

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017 Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation

More information

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015

Advanced Parallel Architecture Lesson 3. Annalisa Massini /2015 Advanced Parallel Architecture Lesson 3 Annalisa Massini - 2014/2015 Von Neumann Architecture 2 Summary of the traditional computer architecture: Von Neumann architecture http://williamstallings.com/coa/coa7e.html

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Dynamic Instruction Scheduling with Branch Prediction

More information

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

Digital System Design Using Verilog. - Processing Unit Design

Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15

More information