Complexity-effective Enhancements to a RISC CPU Architecture


Jeff Scott, John Arends, Bill Moyer
Embedded Platform Systems, Motorola, Inc., West Parmer Lane, Building C, MD PL31, Austin, TX
{Jeff.Scott,John.Arends,Bill.Moyer}@motorola.com

Abstract

The M CORE(TM) RISC architecture has been developed to address the growing need for long battery life among today's embedded applications [4]. In this paper, we present several architectural enhancements to the M CORE M3 processor. Specifically, we discuss the burst mode memory enhancements, the instruction fetch enhancements, the selectable branch prediction implementation, and the improvements for software patching. These additions to the M CORE processor were carefully selected to increase performance at minimal cost and complexity, in order to meet the requirements of the portable, embedded marketplace.

1 Introduction

An increase in the number of portable computing devices requiring extended battery life has led to innovative architectural techniques that increase performance while minimizing both energy consumption and design complexity. Complexity-effective design techniques have proven to significantly reduce die size as well as enhance frequency of execution. In the M3 processor, the newest member of the M CORE family of RISC solutions, new techniques are directed toward these goals.

Due to the increased performance requirements of portable computing devices, memory system performance has been a focus for innovation. A commonly practiced technique of bursting sequential memory accesses is used to minimize memory access time. In addition, microprocessors have frequently adopted a Harvard architecture to increase instruction fetch bandwidth and reduce the performance penalty for simultaneous instruction and data memory requests.
The M3 processor achieves comparable performance through additional instruction buffering for pipelining load, store, and floating point operations, without requiring a Harvard architecture [5]. As a result, interfacing the M3 processor to burst mode memory devices involves providing newly defined memory interface signals to achieve enhanced performance.

Off-chip accesses in portable computing devices are expensive in terms of both performance degradation and energy consumption. Cost is also an important aspect of portable devices. Many cost-sensitive designs employ a single external memory device of a given type (SRAM, Flash, or ROM) in order to minimize size and cost, and these packages are often 16 bits in data width. Therefore, it is common for memory subsystems to implement a reduced bus width for off-chip/external accesses, while maintaining a wider internal bus width to performance-critical memories [3]. New techniques in the M3 processor offer the flexibility to maximize performance for both 16-bit external accesses and internal 32-bit memory accesses.

Increasingly longer RISC pipelines and instruction buffering dictate the use of branch prediction in order to minimize the performance degradation due to conditional branches [6]. Commonly used branch prediction techniques are expensive in terms of increased silicon die area and energy consumption. A new technique for selectable branch prediction will be introduced that has proven effective at optimizing memory access time while minimizing complexity overhead and power consumption.

Software developers often use software patching as a means of correcting already-installed ROM-based code. A similar technique is oftentimes used to modify data values dynamically. Most patching techniques require a context switch, or require the implementation to introduce logic into the processor's time-critical memory access path.
A new technique for both instruction and data patching has proven to be cost-effective in various embedded control applications.

The paper is organized as follows. Section 2 describes the burst mode enhancements, including the inefficiencies they address and their solutions. Section 3 describes the instruction fetch enhancements for 16-bit and 32-bit memory subsystems. Section 4 discusses selectable branch prediction, which optionally improves memory address setup time. Section 5 presents a software patching scheme for both instructions and data. Section 6 summarizes the paper.

2 Burst Mode Interface Enhancements

Many data processing systems include memories capable of burst mode operation. Burst mode memory devices
are capable of providing greater throughput and reduced latency compared to standard memory systems. Burst mode operation takes advantage of the fact that successive memory accesses are often to sequential addresses. After an initial latency for the first data item requested, subsequent burst mode accesses can be completed in fewer cycles than the initial access. Interruptions to a burst sequence by events such as change-of-flow instructions or interleaved instruction and data accesses cause the burst sequence to end, resulting in longer memory access latencies.

Traditionally, a processor designed to take advantage of burst mode memories asserted a single output signal signifying that a current or upcoming bus fetch was sequential to the previous bus fetch. Usually, to accommodate a wide variety of memory sub-systems, this signal was valid as early as possible in the bus cycle.

2.1 Burst Mode Inefficiencies Defined

Consider the example code for a 16-bit instruction set shown in Figure 1. This code will be used to illustrate how a typical system would assert a sequential address indicator (SEQ) to a memory sub-system.

1000 add  r1,r2     ; accumulate
1002 ld   r2,(2000) ; load
1004 sub  r1,r2     ; subtract
1006 cmp  r1,r2     ; compare
1008 beq  1030      ; branch if equal (taken)
100a mult r2,r3     ; multiply
  :
1030 add  r1,r2     ; accumulate
1032 cmp  r1,r2     ; compare
1034 beq  1050      ; branch if equal (not taken)
1036 and  r1,r2     ; and
1038 xor  r1,r2     ; exclusive or
103a not  r1        ; not

Figure 1: Example code for burst mode memory sequence

Figure 2 illustrates the memory access sequence for the code listed above, along with the SEQ signal value for each access.

Figure 2: Memory sequence for example code (table not reproducible from this transcription)

As can be seen in the figure, SEQ is negated for address 1038, even though it is indeed sequential to address 1036. This is due to the fact that SEQ is required to be valid as early in the clock cycle as possible to accommodate a wide range of memory systems. In order to meet this requirement, the accuracy of SEQ is sacrificed, causing SEQ to be negated unnecessarily in order to improve timing. SEQ is negated in this example because the processor encounters a branch instruction in the pipeline at address 1034. Since there is no branch prediction, the next address fetched (either 1038 or the branch target 1050) will depend on the resolution of a condition code. Typically this condition code is resolved late in the clock cycle as a result of the computation of the previous instruction (compare).

In addition, SEQ is negated for the data fetch to address 2000. This is necessary because the processor has no idea whether the data access is in the same memory as, or a separate memory from, the instruction space. If the data access is in a separate memory system from the instructions, then SEQ is once again unnecessarily negated at the cost of increased memory latency.

2.2 Burst Mode Solution

Our solution to these problems is to define a set of three signals that encompass all possible sequential scenarios. These signals, SEQ, ASEQ, and ISEQ, provide the system designer the greatest flexibility to maximize memory throughput without adding inordinate amounts of complexity. ASEQ stands for accurate SEQ, and asserts later in the clock cycle than SEQ. However, ASEQ is completely accurate, since it relies on the completion of branch resolution. For memory systems that can handle the later timing of ASEQ, a significant reduction in effective memory latency can be achieved. ISEQ stands for instruction SEQ.
This signal is used exclusively for instruction fetches and is a don't care for data fetches. It can be utilized by a memory subsystem dedicated to instruction memory, allowing interleaved data accesses to not interrupt a burstable sequence in the instruction memory system.

Figure 3 shows the same memory sequence as before, along with the newly defined signals. ASEQ is asserted for the fetch of address 1038, although it is not valid as early in the cycle as SEQ. Also, ISEQ is asserted for the instruction fetch of 1006, which follows the interleaved data access to 2000; this is because 1006 is a sequential instruction fetch with respect to 1004. Finally, ISEQ is asserted similarly to ASEQ for address 1038, at the cost of a later setup time for ISEQ compared to SEQ. The assertion of the sequential indicator signals allows the burst memory sequence to continue uninterrupted, thereby avoiding the penalty of a burst sequence re-start.
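The behavior of the three indicators can be sketched in software. The following Python model is an illustrative abstraction only: the access-trace format and the unresolved-branch flag are assumptions for demonstration, not the M3 pin-level protocol.

```python
def sequential_indicators(trace):
    """trace: list of (addr, is_insn, branch_unresolved) tuples, one per
    bus access, with 16-bit instructions (stride of 2 bytes).
    Returns a list of (SEQ, ASEQ, ISEQ) per access; None = don't care."""
    results = []
    prev_addr = None        # previous bus access of any kind
    prev_insn_addr = None   # previous *instruction* access only
    for addr, is_insn, branch_unresolved in trace:
        seq_actual = prev_addr is not None and addr == prev_addr + 2
        # SEQ must be valid early, so it is conservatively negated
        # whenever a pending conditional branch could redirect the fetch.
        seq = seq_actual and not branch_unresolved
        # ASEQ waits for branch resolution, so it is always accurate.
        aseq = seq_actual
        # ISEQ tracks only the instruction stream; don't care for data.
        if is_insn:
            iseq = prev_insn_addr is not None and addr == prev_insn_addr + 2
            prev_insn_addr = addr
        else:
            iseq = None
        prev_addr = addr
        results.append((seq, aseq, iseq))
    return results
```

On the example sequence, the model reproduces the two cases discussed above: ISEQ remains asserted for the instruction fetch that follows the interleaved data access, and ASEQ (but not SEQ) is asserted for the sequential fetch behind an unresolved branch.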

Figure 3: Memory sequence with ASEQ and ISEQ (table not reproducible from this transcription)

2.3 Burst Mode Interface Results

The addition of the two extra signals, ASEQ and ISEQ, results in a significant increase in performance at minimal cost. In embedded applications for the M CORE M3 processor, where change-of-flow instructions account for roughly 15% of the dynamic instruction stream and load/store instructions account for 20%, the addition of these signals has a significant impact on overall system performance, with a minimal impact on area. Powerstone benchmark analysis shows ASEQ signal assertion to be 40% greater than that of SEQ, resulting in a 40% decrease in burst sequence interruptions [4].

3 Instruction Fetch Enhancements

In many processors, pipeline throughput is improved by the addition of instruction buffers and wider paths to memory [5]. As buffers and wider datapaths are added, instruction fetch bandwidth is increased, allowing for more efficient pipeline utilization. If a CPU's instruction length is 16 bits, two instructions may be accessed each cycle from a 32-bit memory system. These instructions are stored in instruction buffers until needed, allowing for a surplus of instructions during periods of lower instruction memory throughput.

It has been shown that the addition of instruction buffers and the increase of instruction fetch size provide significant performance advantages. The 16-bit instruction set architecture M CORE M3 processor showed a 28% performance improvement from 32-bit memories with the addition of three instruction buffers and a doubling of the instruction fetch size [5]. However, these improvements are only seen when accessing a memory as wide as, or wider than, the instruction fetch size.
For a memory system narrower than the instruction fetch size, there is a performance penalty. This penalty is associated with the fetch of unused opcodes around change-of-flow instructions, because a narrower memory device is not capable of supplying a pair of instructions with the same latency as a single instruction. In order to gain performance in wider memory systems without degrading performance in cost-conscious narrower memory systems, a dynamic input signal was defined. The signal, IFSIZ, is asserted when 16-bit memory systems are accessed and negated when wider memory systems are accessed. This signal changes dynamically during program execution, depending on the memory width accessed, and is based simply on an address decoder. Upon a switch from one memory width to another, the processor detects the change in IFSIZ and dynamically changes the instruction fetch size at the next properly aligned boundary.

Figure 4 shows the transition from a 32-bit memory to a 16-bit memory for the M CORE M3 processor. The instruction fetch to address A0 uses a transfer size request (TSIZ) of 32 bits and returns instructions I0 and I1 the following cycle. When address A4 is fetched, an address decode indicates that the fetch is to a 16-bit memory, and IFSIZ is asserted accordingly. The processor recognizes this assertion and changes the instruction fetch size to 16 bits on the following access. Because the fetch to address A4 is to a 16-bit memory and the fetch size was still 32 bits, the processor must wait for the memory system to sequentially fetch both half-words (instructions I2 and I3) and return them to the processor. Once the initial fetch to the 16-bit memory system is complete, all subsequent fetches use a 16-bit TSIZ.

Figure 4: IFSIZ transition from 32-bit to 16-bit memory (timing diagram not reproducible from this transcription)

Figure 5 shows the case of a transition from 16-bit memory to 32-bit memory.
The instruction fetch to address A0 uses a TSIZ of 16 bits and returns instruction I0 the following cycle. When address A0 is fetched, an address decode indicates that the fetch is to a 32-bit memory, and IFSIZ is negated accordingly. The processor recognizes this negation and changes the instruction fetch size to 32 bits on the next properly aligned access. Because address A0 is a word-aligned address, the transfer size cannot be changed until both halves of the word have been fetched. Once the initial fetch (or fetches) to the 32-bit memory system is complete, all subsequent fetches use a 32-bit TSIZ, as shown in the figure.
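The fetch-size adjustment can be modeled abstractly. The Python sketch below is a simplification under stated assumptions: the address-decode predicate stands in for the system's decoder, and the decode result is assumed available before each access issues, whereas in the M3 the first access into a new region may still use the old transfer size.

```python
def fetch_sequence(start, count, is_16bit_region):
    """Return a list of (address, TSIZ) pairs for `count` sequential
    instruction fetches starting at `start`. Fetches narrow immediately
    in a 16-bit region, and widen to 32 bits only at properly aligned
    (word) boundaries, mirroring the behavior described above."""
    fetches = []
    addr = start
    for _ in range(count):
        if is_16bit_region(addr):
            tsiz = 16                        # IFSIZ asserted: 16-bit memory
        else:
            # IFSIZ negated: widen only once the address is word aligned
            tsiz = 32 if addr % 4 == 0 else 16
        fetches.append((addr, tsiz))
        addr += tsiz // 8                    # advance by 2 or 4 bytes
    return fetches
```

A region map such as `lambda a: a >= 0x100` (16-bit memory above address 0x100, an assumed layout) exercises both transitions.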

Figure 5: IFSIZ transition from 16-bit to 32-bit memory (timing diagram not reproducible from this transcription)

The addition of the IFSIZ input signal to the M CORE M3 processor requires a small state machine to detect the current memory region and the next memory region. This state machine generates control signals to properly assert the TSIZ output pin and to selectively add 2 or 4 to the previous address to generate the next address. Timing is not an issue, since at least a full cycle is available to make the transition, providing ample setup time. The addition of this flexibility in the system design resulted in an 11% performance improvement when tested using the Powerstone benchmark suite [4].

4 Selectable Branch Prediction

Branch prediction is one technique used to improve processor performance. In most instances, processors that predict the outcome of conditional branch instructions gain performance because they can make educated guesses and fetch the branch target instruction before the actual outcome of the branch is resolved. However, there is usually a performance penalty associated with mispredictions. Since branch instructions cause a change in the sequential instruction stream fetch pattern, an incorrect address speculation can result in lost processor cycles. These cycles are lost because, in the case of a misprediction, the incorrectly fetched instruction stream must be discarded and the correct instruction stream reloaded into the processor pipeline.

The M CORE M2 processor has an aggressive branch implementation in which a taken branch is performed in two clock cycles (assuming zero-wait-state memory) [2]. There is no branch prediction, meaning the correct address is driven only after resolution of the condition bit late in the decode cycle of the branch.
For the M CORE M3 processor, the addition of instruction buffers and the increase in the instruction fetch size reduced the penalty associated with a branch misprediction. Therefore, a simple branch prediction scheme was implemented in which all branches are predicted taken [5]. If this prediction turns out to be false, an ABORT signal is asserted to the memory system during the next cycle to abort the transfer. This results in a single bus cycle penalty for a misprediction. When executing code from a 32-bit memory system, this single bus cycle penalty does not affect overall performance in most cases because of the large surplus of sequential instructions available in the instruction buffers. The advantage of this scheme is the increase in setup time of the address bus to the memory system, allowing for a more cost-effective memory subsystem: branch resolution is no longer in the critical path of the next address calculation.

Figure 6 shows the case of a misprediction in the M3 processor. In the figure, BT is a branch-on-condition-true instruction.

Figure 6: Branch misprediction in the M3 CPU from 32-bit memory (timing diagram not reproducible from this transcription)

However, for 16-bit memory systems, this enhancement is costly. With the limited instruction fetch bandwidth, even a one-bus-cycle penalty will negatively affect overall processor performance (see Figure 7). The solution is to allow the system designer to selectively choose whether to use the branch prediction mechanism (and gain more memory access time) or not (and lose memory access time). Therefore, a signal called APRED (for address prediction) was added to the processor. This signal is tied during system integration and statically selects whether branch prediction is enabled.
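As a rough illustration of the trade-off that APRED selects, the toy model below counts extra bus cycles spent on conditional branches under each policy. The one-cycle penalty values are illustrative assumptions, not measured M3 numbers.

```python
def branch_penalty_cycles(branch_outcomes, apred):
    """branch_outcomes: list of booleans, True = branch actually taken.
    apred: True = predict-taken scheme enabled, False = no prediction.
    Returns total extra bus cycles attributable to conditional branches."""
    extra = 0
    for taken in branch_outcomes:
        if apred:
            # predict-taken: only a not-taken branch pays, via the ABORT
            # of the speculatively started target fetch
            extra += 0 if taken else 1
        else:
            # no prediction: every conditional branch waits for
            # condition-code resolution (assume one cycle each) before
            # the next address can be driven
            extra += 1
    return extra
```

Under these assumptions, predict-taken wins whenever most branches are taken, while a system with scarce fetch bandwidth may still prefer APRED negated to avoid wasting bus cycles on aborted fetches.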
Figure 7: Branch misprediction in the M3 CPU from 16-bit memory (timing diagram not reproducible from this transcription)

The addition of this signal turned out to be a very complexity-effective design decision. The chosen branch prediction methodology is very simple, yet improves address bus timing significantly. In fact, synthesis can be performed which completely eliminates all logic related to the mode that will not be used, saving area and eliminating timing false paths through the unused logic.

5 Software Patching Enhancements

Many systems today are Read-Only-Memory (ROM) based systems that require program code to be installed at the time silicon masks are created. Since the turnaround time from mask generation to silicon can be anywhere from three weeks to three months, it is desirable to have an effective means of modifying ROM code built into the hardware. There are also examples in certain automotive applications where tuning data values while a program is running is required. In both of these examples, a complexity-effective design is required in order to minimize silicon cost, design time, and software overhead, as well as to achieve a minimal performance penalty when software patching is needed.

5.1 Traditional ROM-based patching

Traditionally, ROM patching involves a set of programmable address or address-range comparators which cause an interrupt exception to occur on a match. Exceptions dictate that the processor be in a state where they can be handled. This requires the processor to alter the current execution context and then save and restore the processor state. Furthermore, during portions of code where exceptions are inhibited, no patching may occur if the patching mechanism is implemented via forcing an exception. Other traditional ROM patching schemes include opcode substitution, where a delay path is added to the time-critical data input path to force a substituted instruction such as an absolute jump. This method is used in the Motorola DSP56600 family [1]. Using this method, memory timing must be taken into account to avoid additional speed paths in the memory subsystem and memory control.

5.2 Software Patching Enhancements

The software patching implementation has several aspects. One aspect is the way in which program memory is defined.
Each portion of instructions or data to be patched corresponds to a patch code pointer, which is located in a patch pointer table as shown in Figure 8.

Figure 8: Program memory for patching (diagram not reproducible from this transcription; it shows program memory with the instructions to be patched, a patch pointer table of patch code pointers 0 through N-1 located by a patch base address and offset, and the patch code memory)

A patch pointer table includes patch code pointer 0 through patch code pointer N-1, where each patch code pointer may correspond to one instruction or group of instructions to be patched, or to substitute data values. Upon accessing an instruction address requiring a substitution, the processor utilizes the patch pointer table to locate the corresponding patch code. Patch code pointers provide a patch code address which redirects program flow to a patch code memory containing the actual patch code to be executed. At the end of the patch code, a flow redirection instruction may return flow back to program memory to continue normal execution of the code until the next patch pointer address is encountered. Likewise, in response to a data access requiring a substitution, the processor utilizes the patch pointer table to provide substitute data values.

The patch routines may each reside in separate memories or may even reside within program memory. Likewise, patch pointer tables may be stored within any memory and are defined to be user programmable. Each patch code pointer within the patch pointer table is referenced by a patch base address and a corresponding patch offset.

Another aspect of the software patching implementation is the hardware circuitry needed to implement the scheme, which is illustrated in Figure 9.
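The pointer-table lookup described above can be sketched as follows. The table layout (one 32-bit pointer per entry) and all addresses in the example are assumptions for illustration, not the actual M3 memory map.

```python
def patch_redirect(addr, comparators, memory, patch_base):
    """comparators: dict mapping patched addresses to their patch offsets
    (modeling the programmable address comparators and offset encoder).
    memory: dict modeling the patch pointer table (address -> word).
    Returns the patch code address for a matched access, otherwise the
    original address (no redirection)."""
    if addr not in comparators:
        return addr
    # combine the patch offset with the patch base address to locate
    # the patch code pointer within the patch pointer table
    entry_addr = patch_base + comparators[addr] * 4
    return memory[entry_addr]
```

For example, with a (hypothetical) table at 0x8000 whose entry 1 holds the patch code address 0x9040, an access to a patched address redirects there, while unpatched addresses pass through unchanged.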

Figure 9: Software Patching Implementation (block diagram not reproducible from this transcription; visible elements include address comparators with address masks, a patch base register, an offset request register and encoder, and a mux into the instruction register)

The hardware circuitry redirects program flow by generating an address comparison match when it identifies an address for which program execution should be redirected. The circuitry generates a control field having an offset specifically corresponding to the address. The instruction data is tagged with an identifying patch bit, which is recognized when the instruction is decoded by the data processor. The data processor receives the instruction but discards it prior to execution. The circuitry then creates a redirected address value by combining the patch offset with the patch base address, and implements redirection of program flow by utilizing the redirected address value.

The circuitry is also capable of redirecting a data access by generating an address comparison match when it identifies a data access for which a substitute data value should be provided. The circuitry generates a control field having the offset, and the data access is subsequently tagged with the data patch bit during the termination cycle of the data access. Once again, the processor discards the data prior to execution completion, and the circuitry creates a redirected address value by combining the patch offset with the patch base register. The data processor implements redirection of the data access by utilizing the redirected address value to access the substitute data value.

5.3 Software Patching Conclusion

In the software patching scheme, program flow redirection can be performed by utilizing a control field, without changing the current execution context through an exception and without providing the processor a substitute opcode such as an absolute jump.
This method may also be utilized for patching exception handlers, or any code that is executed in supervisor or user space.

6 Conclusion

The M CORE M3 architecture enhancements strike a delicate balance between the goals of increased performance and minimal cost and complexity. This balance is the foundation of the low-cost, low-power portable embedded marketplace. The improvements discussed in this paper resulted in significant performance increases without a substantial increase in cost or complexity.

7 References

[1] DSP56600 Digital Signal Processor Family Manual, Motorola Inc.
[2] M CORE Reference Manual, Motorola Inc.
[3] B. Moyer, J. Arends, "RISC Gets Small," Byte Magazine, February.
[4] J. Scott, L. Lee, J. Arends, B. Moyer, "Designing the Low-Power M CORE Architecture," Proc. Int'l. Symp. on Computer Architecture Power Driven Microarchitecture Workshop, Barcelona, Spain, July 1998.
[5] J. Scott, L. Lee, A. Chin, J. Arends, B. Moyer, "Designing the M CORE M3 CPU Architecture," Proc. IEEE Int'l. Conf. on Computer Design, Austin, Texas, October 1999.
[6] D. Patterson, J. Hennessy, Computer Architecture: A Quantitative Approach, 2nd ed., San Francisco: Morgan Kaufmann Publishers, Inc., 1996.

M CORE is a trademark of Motorola, Inc.


More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts

EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts EN2910A: Advanced Computer Architecture Topic 02: Review of classical concepts Prof. Sherief Reda School of Engineering Brown University S. Reda EN2910A FALL'15 1 Classical concepts (prerequisite) 1. Instruction

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

Cache Justification for Digital Signal Processors

Cache Justification for Digital Signal Processors Cache Justification for Digital Signal Processors by Michael J. Lee December 3, 1999 Cache Justification for Digital Signal Processors By Michael J. Lee Abstract Caches are commonly used on general-purpose

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM

Computer Architecture Computer Science & Engineering. Chapter 4. The Processor BK TP.HCM Computer Architecture Computer Science & Engineering Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware

More information

Digital System Design Using Verilog. - Processing Unit Design

Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

Full Datapath. Chapter 4 The Processor 2

Full Datapath. Chapter 4 The Processor 2 Pipelining Full Datapath Chapter 4 The Processor 2 Datapath With Control Chapter 4 The Processor 3 Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory

More information

COSC 6385 Computer Architecture - Pipelining

COSC 6385 Computer Architecture - Pipelining COSC 6385 Computer Architecture - Pipelining Fall 2006 Some of the slides are based on a lecture by David Culler, Instruction Set Architecture Relevant features for distinguishing ISA s Internal storage

More information

Chapter 3 - Top Level View of Computer Function

Chapter 3 - Top Level View of Computer Function Chapter 3 - Top Level View of Computer Function Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Top Level View 1 / 127 Table of Contents I 1 Introduction 2 Computer Components

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 1 Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A

More information

ELE 655 Microprocessor System Design

ELE 655 Microprocessor System Design ELE 655 Microprocessor System Design Section 2 Instruction Level Parallelism Class 1 Basic Pipeline Notes: Reg shows up two places but actually is the same register file Writes occur on the second half

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle?

3/12/2014. Single Cycle (Review) CSE 2021: Computer Organization. Single Cycle with Jump. Multi-Cycle Implementation. Why Multi-Cycle? CSE 2021: Computer Organization Single Cycle (Review) Lecture-10b CPU Design : Pipelining-1 Overview, Datapath and control Shakil M. Khan 2 Single Cycle with Jump Multi-Cycle Implementation Instruction:

More information

15CS44: MICROPROCESSORS AND MICROCONTROLLERS. QUESTION BANK with SOLUTIONS MODULE-4

15CS44: MICROPROCESSORS AND MICROCONTROLLERS. QUESTION BANK with SOLUTIONS MODULE-4 15CS44: MICROPROCESSORS AND MICROCONTROLLERS QUESTION BANK with SOLUTIONS MODULE-4 1) Differentiate CISC and RISC architectures. 2) Explain the important design rules of RISC philosophy. The RISC philosophy

More information

V8-uRISC 8-bit RISC Microprocessor AllianceCORE Facts Core Specifics VAutomation, Inc. Supported Devices/Resources Remaining I/O CLBs

V8-uRISC 8-bit RISC Microprocessor AllianceCORE Facts Core Specifics VAutomation, Inc. Supported Devices/Resources Remaining I/O CLBs V8-uRISC 8-bit RISC Microprocessor February 8, 1998 Product Specification VAutomation, Inc. 20 Trafalgar Square Nashua, NH 03063 Phone: +1 603-882-2282 Fax: +1 603-882-1587 E-mail: sales@vautomation.com

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information

Advanced Memory Organizations

Advanced Memory Organizations CSE 3421: Introduction to Computer Architecture Advanced Memory Organizations Study: 5.1, 5.2, 5.3, 5.4 (only parts) Gojko Babić 03-29-2018 1 Growth in Performance of DRAM & CPU Huge mismatch between CPU

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Wed. Aug 23 Announcements

Wed. Aug 23 Announcements Wed. Aug 23 Announcements Professor Office Hours 1:30 to 2:30 Wed/Fri EE 326A You should all be signed up for piazza Most labs done individually (if not called out in the doc) Make sure to register your

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Dynamic Control Hazard Avoidance

Dynamic Control Hazard Avoidance Dynamic Control Hazard Avoidance Consider Effects of Increasing the ILP Control dependencies rapidly become the limiting factor they tend to not get optimized by the compiler more instructions/sec ==>

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version

Determined by ISA and compiler. We will examine two MIPS implementations. A simplified version A more realistic pipelined version MIPS Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

Chapter 14 - Processor Structure and Function

Chapter 14 - Processor Structure and Function Chapter 14 - Processor Structure and Function Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 14 - Processor Structure and Function 1 / 94 Table of Contents I 1 Processor Organization

More information

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1

Architectural Performance. Superscalar Processing. 740 October 31, i486 Pipeline. Pipeline Stage Details. Page 1 Superscalar Processing 740 October 31, 2012 Evolution of Intel Processor Pipelines 486, Pentium, Pentium Pro Superscalar Processor Design Speculative Execution Register Renaming Branch Prediction Architectural

More information

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141

EECS151/251A Spring 2018 Digital Design and Integrated Circuits. Instructors: John Wawrzynek and Nick Weaver. Lecture 19: Caches EE141 EECS151/251A Spring 2018 Digital Design and Integrated Circuits Instructors: John Wawrzynek and Nick Weaver Lecture 19: Caches Cache Introduction 40% of this ARM CPU is devoted to SRAM cache. But the role

More information

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor

HP PA-8000 RISC CPU. A High Performance Out-of-Order Processor The A High Performance Out-of-Order Processor Hot Chips VIII IEEE Computer Society Stanford University August 19, 1996 Hewlett-Packard Company Engineering Systems Lab - Fort Collins, CO - Cupertino, CA

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

ECE550 PRACTICE Final

ECE550 PRACTICE Final ECE550 PRACTICE Final This is a full length practice midterm exam. If you want to take it at exam pace, give yourself 175 minutes to take the entire test. Just like the real exam, each question has a point

More information

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1 Introduction Chapter 4.1 Chapter 4.2 Review: MIPS (RISC) Design Principles Simplicity favors regularity fixed size instructions small number

More information

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK]

Adapted from instructor s supplementary material from Computer. Patterson & Hennessy, 2008, MK] Lecture 17 Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] SRAM / / Flash / RRAM / HDD SRAM / / Flash / RRAM/ HDD SRAM

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

EIE/ENE 334 Microprocessors

EIE/ENE 334 Microprocessors EIE/ENE 334 Microprocessors Lecture 6: The Processor Week #06/07 : Dejwoot KHAWPARISUTH Adapted from Computer Organization and Design, 4 th Edition, Patterson & Hennessy, 2009, Elsevier (MK) http://webstaff.kmutt.ac.th/~dejwoot.kha/

More information

Micro-programmed Control Ch 15

Micro-programmed Control Ch 15 Micro-programmed Control Ch 15 Micro-instructions Micro-programmed Control Unit Sequencing Execution Characteristics 1 Hardwired Control (4) Complex Fast Difficult to design Difficult to modify Lots of

More information

UNIT- 5. Chapter 12 Processor Structure and Function

UNIT- 5. Chapter 12 Processor Structure and Function UNIT- 5 Chapter 12 Processor Structure and Function CPU Structure CPU must: Fetch instructions Interpret instructions Fetch data Process data Write data CPU With Systems Bus CPU Internal Structure Registers

More information

Machine Instructions vs. Micro-instructions. Micro-programmed Control Ch 15. Machine Instructions vs. Micro-instructions (2) Hardwired Control (4)

Machine Instructions vs. Micro-instructions. Micro-programmed Control Ch 15. Machine Instructions vs. Micro-instructions (2) Hardwired Control (4) Micro-programmed Control Ch 15 Micro-instructions Micro-programmed Control Unit Sequencing Execution Characteristics 1 Machine Instructions vs. Micro-instructions Memory execution unit CPU control memory

More information

Micro-programmed Control Ch 15

Micro-programmed Control Ch 15 Micro-programmed Control Ch 15 Micro-instructions Micro-programmed Control Unit Sequencing Execution Characteristics 1 Hardwired Control (4) Complex Fast Difficult to design Difficult to modify Lots of

More information

TMS320C5x Interrupt Response Time

TMS320C5x Interrupt Response Time TMS320 DSP DESIGNER S NOTEBOOK TMS320C5x Interrupt Response Time APPLICATION BRIEF: SPRA220 Jeff Beinart Digital Signal Processing Products Semiconductor Group Texas Instruments March 1993 IMPORTANT NOTICE

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

CS 101, Mock Computer Architecture

CS 101, Mock Computer Architecture CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically

More information

Hardwired Control (4) Micro-programmed Control Ch 17. Micro-programmed Control (3) Machine Instructions vs. Micro-instructions

Hardwired Control (4) Micro-programmed Control Ch 17. Micro-programmed Control (3) Machine Instructions vs. Micro-instructions Micro-programmed Control Ch 17 Micro-instructions Micro-programmed Control Unit Sequencing Execution Characteristics Course Summary Hardwired Control (4) Complex Fast Difficult to design Difficult to modify

More information

EN1640: Design of Computing Systems Topic 06: Memory System

EN1640: Design of Computing Systems Topic 06: Memory System EN164: Design of Computing Systems Topic 6: Memory System Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University Spring

More information

THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION

THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION Radu Balaban Computer Science student, Technical University of Cluj Napoca, Romania horizon3d@yahoo.com Horea Hopârtean Computer Science student,

More information

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science Pipeline Overview Dr. Jiang Li Adapted from the slides provided by the authors Outline MIPS An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Pipelining 11142011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Review I/O Chapter 5 Overview Pipelining Pipelining

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

The Processor: Instruction-Level Parallelism

The Processor: Instruction-Level Parallelism The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy

More information

Processing Unit CS206T

Processing Unit CS206T Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct

More information

CSCE 5610: Computer Architecture

CSCE 5610: Computer Architecture HW #1 1.3, 1.5, 1.9, 1.12 Due: Sept 12, 2018 Review: Execution time of a program Arithmetic Average, Weighted Arithmetic Average Geometric Mean Benchmarks, kernels and synthetic benchmarks Computing CPI

More information

A First Look at Microprocessors

A First Look at Microprocessors A First Look at Microprocessors using the The General Prototype Computer (GPC) model Part 2 Can you identify an opcode to: Decrement the contents of R1, and store the result in R5? Invert the contents

More information

CN310 Microprocessor Systems Design

CN310 Microprocessor Systems Design CN310 Microprocessor Systems Design Micro Architecture Nawin Somyat Department of Electrical and Computer Engineering Thammasat University 28 August 2018 Outline Course Contents 1 Introduction 2 Simple

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

What is Pipelining? RISC remainder (our assumptions)

What is Pipelining? RISC remainder (our assumptions) What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations Determined by ISA

More information

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions.

These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously on different instructions. MIPS Pipe Line 2 Introduction Pipelining To complete an instruction a computer needs to perform a number of actions. These actions may use different parts of the CPU. Pipelining is when the parts run simultaneously

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN ARM COMPUTER ORGANIZATION AND DESIGN Edition The Hardware/Software Interface Chapter 4 The Processor Modified and extended by R.J. Leduc - 2016 To understand this chapter, you will need to understand some

More information