Lode DSP Core. Features. Overview

Features

- Two multiplier accumulator units
- Single cycle 16 x 16-bit signed and unsigned multiply-accumulate
- 40-bit arithmetic logical unit (ALU)
- Four 40-bit accumulators (32-bit + 8 guard bits)
- Pre-shifter, post-shifter and exponent extractor
- Divide step instruction
- Bit manipulation instructions
- Separate program and data memories
  - Program address space: 64K x 32 bit
  - Data memory address space: 64K x 16 bit
- Two data memory accesses per cycle
  - Read, read
  - Read, write
- Eight 16-bit data memory pointers
- Eight 8-bit data memory pointer modifiers
- Addressing
  - Direct
  - Pointer
  - Indirect
  - Immediate
  - Circular
- Sixteen level hardware stack
- Five level instruction pipeline: Fetch, Decode, Read, Execute, & Write
- Two true zero overhead loops, nestable
- Context saving on interrupts
- Low power
  - Fully static CMOS device
  - Wait instruction
- JTAG support

Lode DSP Core Overview

Lode is a 16-bit digital signal processor (DSP) core designed to run with a 40 MHz clock in 0.5 µm CMOS at 3.0 Volts. It consists of several major functional blocks shown in Figure 1. These are a Program Control Unit (PCU), an Arithmetic Computation Unit (ACU) and a Memory Management Unit (MMU). The core contains no memory, but can interface with off-core program and data memory.

The Lode PCU employs a five stage pipeline structure. It also includes a 16 deep hardware stack, two zero overhead loop units (for multiple instruction looping), one instruction repeat module and interrupt circuitry.

The ACU consists of two 16 x 16-bit Multiplier Accumulator Units (MACs), a 40-bit Arithmetic Manipulation Unit (AMU), four 40-bit Accumulators and a Post-shifter.

The MMU consists of two address generation units for simultaneously addressing the two ports of data memory. It also includes two modulo addressing units.

The Lode core also contains a register file distributed across the three functional blocks depending on their functions. All the loop, stack and repeat related registers reside in the PCU.
The pointer registers, their modifiers, the modulo addressing control registers, the base address register for page addressing and m0/m1 reside in the MMU. The accumulators and other control and status registers belong to the ACU. The programmer does not need to know where a register resides in order to access it. Table 1 lists the core registers and their functions.

0795A-A 11/97

As a convention throughout this data sheet, bit[39] and bit[15]/bit[7] represent the Most Significant Bit (MSB) of the accumulators and 16/8-bit registers respectively. Bit[0] represents the Least Significant Bit (LSB) of any register.

Lode DSP Core

Lode is an advanced, 16-bit Digital Signal Processor (DSP) core designed for optimal performance in digital cellular, speech and voice communications applications. The Lode core architecture efficiently performs the baseband functions (speech compression, forward error correction, and modem functions) required in digital cellular standards. Lode is the first general-purpose DSP that provides two multiplier-accumulators (MACs), which reduce power consumption by effectively cutting cycle times in half. Lode's suite of user-friendly development tools is easy to learn, which shortens the time it takes to get your product to market.

Figure 1. Lode DSP Engine Block Diagram
[Figure: blocks shown are Program Control Unit, Program Memory, AMU, MAC0, MAC1, Registers, Memory Management Unit, Data Memory & IO, Accumulators and Post Shift.]

Table 1. Lode Core Registers

Register                  #Bits  Function
a0-a3                     40     4 accumulators
r0-r7                     16     Pointer registers
m0, m1                    16     Read pipeline registers
tstk                      16     Top of the stack
lreg                      16     Local register
pc                        16     Program counter
lrp0, lrp1                16     Loop repeat count registers
lcc0, lcc1                16     Loop current counters
lsa0, lea0, lsa1, lea1    16     Loop registers, start and end addresses
cl0, cl1                  16     Circular buffer length registers
s0-s7                     8      Pointer modifiers for r0-r7
asr                       8      AMU shift register [-32..31]
psr                       8      Post shifter register [-8..16]
rc                        8      Repeat count register
badr                      8      Base address register
sra, srm, srg             8      Status registers
cmr                       8      Control/mode register

Lode Benchmark Summary

The Lode execution benchmarks are presented for basic signal processing operations in Table 2 below. The cycle count is presented as a function of the number of taps or entries, N, and processing length, L, including any overhead for the kernel operation.

Table 2. Benchmarks

Benchmark                                     Cycles w/overhead
FIR real x real                               L(N+4)/2
FIR real x complex                            L(N+4)
FIR complex x complex                         L(2N+12)
IIR real x real, Nth order                    L(N+6)/2
IIR real, biquad                              4.5L
Double precision multiplication, N terms      3N+2
Adaptive LMS, real, N taps w/rounding         3.5N+6
40 bit CRC over N bits                        N
Vector normalization                          2N+2
Square distance                               N+2
Radix 2 Viterbi butterfly for channel         4
Radix 2 Viterbi butterfly for equalizer       5
Traceback (constraint length <= 5)            6 /stage
Interleave/De-interleave                      3 /bit

Functional Description

The block diagram of the functional units and buses is shown in Figure 1. The diagram shows the major data buses connecting the different units.

Program Control Unit

The Program Control Unit (PCU) manages the global operation of the core. The core operates with a five stage pipeline:

- Fetch stage: During this stage, the instruction is fetched from program memory and placed in the instruction register. In parallel, the program counter is incremented.
- Decode stage: In this stage, the instruction is decoded.
- Read stage: During this stage, data is read from one and/or two data memory locations. Post modification of the addresses is also executed during this cycle (including write address).
- Execute stage: During this stage, computations are performed in the Arithmetic Computation Unit (ACU) on the data from the selected sources.
- Write stage: During this stage, data is written back to memory through one of the data memory ports. Optionally, some post shifting is applied on the data before storing it.

The PCU contains a few special units, of which the most important are described below.

- Stack unit: A 16 level deep hardware stack is included in the PCU for the execution of program calls and returns for subroutine nesting.
- Two loop units: The PCU contains two loop units to implement two levels of true zero overhead nested loops. This provides a means to repeat the execution of a group of instructions a pre-specified number of times.
- Interrupt unit: The application independent part of the interrupt unit is inside the PCU, the rest is outside the core. The PCU controls the branching to the interrupt subroutine, the context saving/restoring, and handling of the control interface signals such as the interrupt acknowledgments and other pipeline related issues. The interrupt controller is application dependent. The interrupt controller provided with the I/O peripherals library supports 16 interrupts.
- Memory interface unit: The PCU handles pipeline issues such as inserting wait states, and generates control signals to interact with fast or slow data and program memories (which are outside the core). This unit also controls the reading and writing to many registers in the core.
- Status unit: The PCU contains three status registers and one control/mode register. The status registers contain flags from the ACU and from the PCU. The status bits are used for conditional execution of instructions. The control/mode register allows the programmer to enable or disable operation modes of the ACU and the PCU.
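As a rough illustration of how the Table 2 cycle counts translate into execution time, the sketch below evaluates a few of the formulas and converts cycles to time at the 40 MHz clock quoted in the overview (25 ns per cycle). The function and constant names are hypothetical and are not part of any Lode tool; only the formulas and the clock rate come from this data sheet.

```python
# Cycle-count formulas copied from Table 2.
# N = number of taps or entries, L = processing length.
# All names here are illustrative assumptions, not Lode tool APIs.

CLOCK_HZ = 40_000_000  # 40 MHz core clock quoted in the overview


def fir_real_real_cycles(n_taps: int, length: int) -> float:
    """FIR real x real: L(N+4)/2 cycles."""
    return length * (n_taps + 4) / 2


def fir_complex_complex_cycles(n_taps: int, length: int) -> int:
    """FIR complex x complex: L(2N+12) cycles."""
    return length * (2 * n_taps + 12)


def lms_real_cycles(n_taps: int) -> float:
    """Adaptive LMS, real, N taps with rounding: 3.5N+6 cycles."""
    return 3.5 * n_taps + 6


def cycles_to_microseconds(cycles: float, clock_hz: int = CLOCK_HZ) -> float:
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6


if __name__ == "__main__":
    cycles = fir_real_real_cycles(n_taps=32, length=160)
    print(cycles, "cycles =", cycles_to_microseconds(cycles), "us")
```

For a 32-tap real FIR over 160 samples the formula gives 160 x 36 / 2 = 2880 cycles, which is 72 microseconds at 25 ns per cycle.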

Arithmetic Computational Unit

The Arithmetic Computational Unit (ACU) consists of one Arithmetic Manipulation Unit (AMU) and two Multiply/Accumulate (MAC) units. The outputs of the AMU and the MAC units go to the Accumulator Register File (ARF), which contains 4 accumulators. Any of the accumulator outputs can be routed back to the input of the ACU or be stored to memory through the Post Shifter.

Arithmetic Manipulation Unit

The Arithmetic Manipulation Unit (AMU) consists of an exponent extraction module, a pre-shifter and a 40-bit ALU. The AMU gets its input from either of the 2 data buses and/or any of the 4 accumulators. The data buses can route the operands from the memory or from the registers.

The Exponent Unit

The exponent unit extracts the number of redundant sign bits such that, had the input value been shifted by this number, the sign bit would reside at bit[31] and would be different from bit[30]. This is also called normalizing factor extraction with respect to bit[31]. The input to the exponent extractor is 40 bits wide and it accepts data either from the accumulators or from the data bus. The output of the exponent unit is sign extended and loaded in the destination accumulator as well as in the AMU Shift Register (asr). The range of the exponent extractor output is [-8..31].

The exponent module with the pre-shifter can be used in two ways:

- Normalizing a single value
- Normalizing a vector array

To normalize a single data value, it is first passed through the exponent module to extract the shift amount needed. The shift value is loaded automatically into asr. The original data is then loaded to the ARF through the shifter.

To normalize an array of data, the extract exponent and find minimum instruction (expmn) is used in a loop. When repeated, this instruction will find the minimum of a previously extracted exponent residing in one of the accumulators and the new extracted exponent.
At the end of the loop, the destination accumulator and asr hold the shift value required to normalize the array. The array is then read back through the pre-shifter.

The Pre-Shifter

The pre-shifter performs both arithmetic and logical shifts on its input data. The input can be either a 16-bit value from the data bus or a 40-bit value from the ARF. The shift value is determined by asr or can be specified in the instruction word. The shifter range is [-32..31]. When no shift value is specified, the default is zero. For arithmetic right shifts, the sign is extended from bit[39]. Otherwise zeroes are shifted into bit[39] for right shifts and into bit[0] for left shifts. An arithmetic shift operation can overflow and cause saturation if the saturation bit is enabled in cmr.

The pre-shifter also performs a rotate through the carry operation. A 16-bit value from the data bus can be rotated through a range of [-15..15]. Bit[31] is copied to bit[16] and also to the carry bit for left shifts, and bit[16] is copied to bit[31] and to the carry for right shifts. A 40-bit accumulator can only be rotated by 1 bit left or right. Bit[31] is copied to bit[0] and to the carry for a left shift. For a right shift, bit[0] is copied to bit[31] and to the carry.

The Arithmetic Logical Unit

The 40-bit Arithmetic Logical Unit (ALU) performs both arithmetic and Boolean operations on its inputs. All operations use all 40 bits of the ALU. The inputs to the ALU can come from the ARF, the data memory, the instruction word or the output of the exponent unit. The basic functions performed by the ALU are: addition, subtraction, AND, OR, XOR, absolute distance, minimum and maximum operations. The ALU has several special instructions such as signed division, Galois Field (GF(2)) division and multiplication, bit test and max/min with pointer saving. Associated with the ALU are six status flags: carry, negative, zero, overflow, 40-bit overflow and test bit.
All flags can be tested individually or combined.

Guard Bits

The 8 MSBs, bit[39..32], of the ARF are known as the guard bits. All ALU operations are 40 bits wide and therefore use the guard bits. The usefulness of the guard bits is apparent in multiply/accumulate operations where the partial sums may exceed the 32-bit width.

Rounding

Rounding is available in the AMU. Data can be rounded using the rounding instruction, which performs a round to nearest operation by adding 0x8000 to the lower order word of an accumulator and then zeroing bit[15..0].

Saturation on Overflow

The saturation feature can be enabled or disabled in the control/mode register (cmr). When enabled, it provides saturation on overflow into the guard bits when the overflow condition is true after an ALU or MAC add/subtract. There are two saturation control bits in cmr: one for the ALU and another for the MAC unit. A saturate instruction is provided which overrides the saturation mode and can be used anytime a value in the ARF is required to be limited to 32 bits.

Dual Multiply/Accumulate

The Dual Multiply/Accumulate units (DMAC) consist of two identical multiply/accumulate modules, MAC0 and MAC1, which have different source and destination options. The MAC0 inputs come from the data buses and/or the ARF. For MAC1, in addition to the inputs mentioned above, there is also an input from the Local Register (lreg). This register stores the previous MAC0 operation input and hence the data can be reused in a subsequent MAC1 operation. When both MAC units are used in parallel, this architecture speeds up operations such as filtering and Viterbi decoding.

The feedback input to the MAC0 adder/subtractor unit can come from any of the accumulators in the ARF, or directly from the output of a0 for certain instructions. All of the feedback paths are 40 bits wide. However, MAC1 gets the feedback input to its adder/subtractor only from the a1 accumulator. The destination of MAC0 is the ARF, or a0 in some instructions, and that of MAC1 is always the a1 accumulator.

The two multipliers are 16 x 16-bit multipliers, capable of multiplying signed or unsigned 2's complement numbers. The output of the multipliers can be shifted left by one in the case of a fractional multiplication. This shift can be enabled or disabled in cmr. The multipliers also have a pass mode to allow data to pass to the output without change. In this case the dual MAC unit can be used as a dual add/sub unit for computations such as those required by Viterbi decoding.

Accumulator Register File

Each of the four accumulators is a 40-bit register consisting of 8 guard bits, a 16-bit high order word and a 16-bit low order word. Local data buses interconnect the output and input of the ARF and the ACU. A write-back path is used for memory writes through the post shifter. Accumulator a2 is a special accumulator in that it can shift its contents left by one and simultaneously shift in a logical one or zero when certain instructions such as division, Galois field operations, bit test or max/min are executed.

Post-shifter

The post shifter takes the 40-bit output of any accumulator, shifts it by any value from 8 to the right to 16 to the left, and puts the 16 lower order bits (bit[15..0]) or the 16 higher order bits (bit[31..16]), as specified in the instruction, on the write-back bus. Effectively any 16-bit window in the accumulator can be stored to the memory.
The value by which the data is shifted can be specified either as a constant in the instruction word or as a variable in the Post Shifter Register, psr.

Memory Management Unit

The Memory Management Unit (MMU) consists of two independent data address generation units, AGU0 and AGU1, for data memory addressing. They have an addressing range of 64K words. This allows two simultaneous memory fetches from the data memory in one cycle.

Addressing Modes

The Lode core provides five addressing modes for data memory accesses and fetching operands. The following is a description of each mode:

1. Memory direct addressing mode: The lower 8 bits of the data memory address are specified as an 8-bit value in the instruction. The upper 8 bits are programmed in the badr register. These two values are concatenated to generate a 16-bit memory address.
2. Register indirect addressing mode: In this mode the memory address is specified in one of the 8 pointer registers r0-r7. The data in the memory location pointed to by the register is fetched. Each of these pointer registers is associated with an 8-bit pointer modifier register s0-s7. The pointers can be post incremented or decremented by one or any number in the range from -128 to 127 specified in the pointer modifier register. Two full modulo circular addressing units are provided as well. Any of the 8 registers, r0-r7, can be configured in a circular mode. A pointer can be modified in the following seven different ways:
   - Don't modify: r0
   - Increment: r0++
   - Decrement: r0--
   - Modify by the associated s register: r0+=s0
   - Increment in circular mode: (circ0) r0++
   - Decrement in circular mode: (circ0) r0--
   - Modify by the associated s register in circular mode: (circ0) r0+=s0
3. Register direct addressing mode: The contents of the registers r0-r7 can directly be used as an operand.
4. Short immediate: An 8-bit immediate value can be specified in the instruction word and fetched from the instruction register as an operand.
The 8-bit value can be sign extended or zero extended.
5. Integer (16 bits) immediate: A 16-bit value can be specified in the second word of an instruction and is addressed by the PC as an operand.

Instruction Set

The assembly syntax of Lode is similar to the C language. Operations are represented either in a function form, for example rnd(a0) to round the contents of accumulator a0, or by their symbolic representation in a mathematical expression, such as a0=a0 ^ a1 to XOR a0 with a1. This syntax allows for a short learning curve and makes the code easier to read and understand later on.

The Lode core has 8 instruction categories: AMU, MAC, DMAC, Compound, Program Flow, Status, Move and Pointer.
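The address arithmetic described under Addressing Modes can be sketched in a few lines. This is a minimal model under stated assumptions: the helper names are hypothetical, wrapping the pointer to 16 bits is inferred from the 16-bit pointer width rather than stated in this data sheet, and circular (modulo) addressing is left out because its exact rules are not given here.

```python
# Sketch of two data-addressing calculations from the Addressing Modes section.
# Helper names are hypothetical; circular (modulo) addressing is not modelled.


def direct_address(badr: int, imm8: int) -> int:
    """Memory direct mode: badr gives the upper 8 bits, the instruction the lower 8."""
    return ((badr & 0xFF) << 8) | (imm8 & 0xFF)


def sign_extend_8(value: int) -> int:
    """Interpret an 8-bit register value as a signed number in [-128, 127]."""
    value &= 0xFF
    return value - 0x100 if value & 0x80 else value


def post_modify(pointer: int, modifier: int) -> int:
    """Register indirect mode with r += s: add the signed 8-bit modifier.

    Wrapping to 16 bits is an assumption based on the 16-bit pointer width.
    """
    return (pointer + sign_extend_8(modifier)) & 0xFFFF


if __name__ == "__main__":
    print(hex(direct_address(badr=0x12, imm8=0x34)))
    print(hex(post_modify(pointer=0x0010, modifier=0xFE)))
```

With badr set to 0x12 and an instruction offset of 0x34, the concatenated address is 0x1234. A modifier value of 0xFE is read as -2, so a pointer at 0x0010 moves to 0x000E.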

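The exponent extraction, rounding and saturation behaviours described under The Exponent Unit, Rounding and Saturation on Overflow can be modelled with ordinary integers. The sketch below is illustrative only: the function names are hypothetical, the result for an all-zero input is an assumption (the data sheet gives only the output range), and rounding here wraps at 40 bits without modelling saturation.

```python
# Sketch of three AMU behaviours on a 40-bit accumulator, modelled with Python ints.
# Function names are hypothetical; the zero-input exponent result is an assumption.

ACC_BITS = 40


def to_signed40(raw: int) -> int:
    """Interpret the low 40 bits of raw as a two's complement value."""
    raw &= (1 << ACC_BITS) - 1
    return raw - (1 << ACC_BITS) if raw >> (ACC_BITS - 1) else raw


def exponent(acc: int) -> int:
    """Shift count that puts the sign at bit[31] with bit[30] different from it.

    Positive means shift left, negative means shift right; range is [-8, 31].
    """
    acc = to_signed40(acc)
    magnitude = ~acc if acc < 0 else acc  # drop the redundant sign bits
    return 31 - magnitude.bit_length()


def normalize(acc: int) -> int:
    """Apply the extracted exponent, as the pre-shifter would with asr."""
    acc = to_signed40(acc)
    shift = exponent(acc)
    return acc << shift if shift >= 0 else acc >> -shift


def round_acc(acc: int) -> int:
    """Round to nearest: add 0x8000 to the low word, then zero bit[15..0]."""
    return to_signed40((to_signed40(acc) + 0x8000) & ~0xFFFF)


def saturate32(acc: int) -> int:
    """Limit a 40-bit value to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, to_signed40(acc)))


if __name__ == "__main__":
    print(exponent(1), hex(normalize(1)))
    print(hex(round_acc(0x00018000)), hex(saturate32(0x0100000000)))
```

A value that spills into the guard bits, such as 0x7FFFFFFFFF, yields an exponent of -8, which is the lower end of the documented range; a value of 1 yields 30 and normalizes to 0x40000000.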
AMU

The AMU instructions are those instructions that use the AMU functional block of the Lode core. They consist of the 8 basic functions plus the special instructions such as exponent, division, bit test, GF(2) division/multiplication. Table 3 gives the complete list of the AMU instructions.

Table 3. AMU Instructions

Absolute distance
Add two operands
Add with carry
AND two operands
Arithmetic shift
Bit test
Decrement
Divide two operands
Double precision load
Extract exponent
Extract exponent and find minimum
GF(2) polynomial division
GF(2) polynomial multiplication
Logical shift
Maximum of two operands
Maximum test of two operands
Maximum and save pointer
Minimum of two operands
Minimum test of two operands
Minimum and save pointer
NOT
OR two operands
Rotate
Round accumulator
Saturate accumulator
Subtract two operands
Subtract with borrow
XOR

MAC

The MAC instructions are those instructions that use only the MAC0 unit of the Lode core. These include a variety of combinations of inputs and signed/unsigned multiplications. Table 4 gives the complete list of the MAC instructions.

Table 4. MAC Instructions

Multiply two numbers
Multiply accumulate
Multiply and subtract
Square
Square and accumulate
Square and subtract

DMAC

The DMAC instructions are those instructions that use both MAC0 and MAC1 units of the Lode core. This category also allows for bypassing the multiplier units to perform dual addition or subtraction operations. These are useful for Viterbi butterfly decoding. Table 5 gives the complete list of the DMAC instructions.

Table 5. DMAC Instructions

Multiply and accumulate in MAC1
Multiply and subtract in MAC1
Multiply in MAC0 and zero a1
Dual multiply
Dual multiply and accumulate
Dual multiply and subtract
Dual multiply accumulate in MAC0 and subtract in MAC1
Dual multiply subtract in MAC0 and accumulate in MAC1
Dual accumulate
Dual subtract
Dual subtract with shift
Dual accumulate in MAC0 and subtract in MAC1
Dual accumulate in MAC0 and subtract in MAC1 with shift (same value)
Dual subtract in MAC0 and accumulate in MAC1
Dual subtract in MAC0 and accumulate in MAC1 with shift (same value)
Dual load a0 and a1
Load a0 and LREG with the same value

Compound

The Compound category is the most powerful class of Lode instructions. These instructions use two (MAC0 and AMU) or all three (MAC0, MAC1 and AMU) functional blocks in parallel. Some of the instructions can write the result back to memory in the same cycle as well. Instructions in this category are used to implement Viterbi butterfly decoding and distance square type of functions very efficiently. Table 6 gives the complete list of the Compound instructions.

Table 6. Compound Instructions

Accumulate squared distance
Accumulate absolute distance
Dual accumulate in MAC0 and subtract in MAC1 with writeback
Dual subtract in MAC0 and accumulate in MAC1 with writeback
Dual accumulate in MAC0 and subtract in MAC1 with shift and find maximum
Dual accumulate in MAC0 and subtract in MAC1 with shift and find minimum
Dual subtract in MAC0 and accumulate in MAC1 with shift and find maximum
Dual subtract in MAC0 and accumulate in MAC1 with shift and find minimum
Dual subtract and find maximum
Dual subtract and find minimum
Dual load a0 and a1 and find maximum
Dual load a0 and a1 and find minimum
Squared distance

Program Flow

The Program Flow instructions are those that affect the linear flow of execution such as branch and call. Table 7 gives the complete list of the Program Flow instructions.

Table 7. Program Flow Instructions

Branch
Branch delayed
Branch top of stack
Call subroutine
Call subroutine delayed
Call top of stack
Repeat n
Repeat RC
No operation
Return from subroutine
Return from subroutine delayed
Return from interrupt
Software trap

Status

The Status category instructions are those instructions that act upon a status or control bit, such as clearing/setting them or executing an instruction conditioned upon a status bit. The wait instruction is part of this category. Table 8 gives the complete list of the Status instructions.

Table 8. Status Instructions

Conditionally execute next instruction
Clear
Set
System reset
Wait for interrupt

Move

The Move category instructions are those instructions that transfer data between different resources of the Lode core, such as memory, registers and accumulators. Table 9 gives the complete list of the Move instructions.

Table 9. Move Instructions

Load accumulator
Store accumulator
Move
Dual Move

Pointer

The Pointer instruction modifies pointers without accessing memory.

Peripherals

An extensive library of peripheral modules is available for the Lode core. Selected peripheral modules can be integrated with the Lode core along with the required program and data memory to form an application specific DSP. The library includes the following peripheral units:

1. Data Bus Sequencer Unit: It takes care of data accesses to both ports of the Data Memory and to the peripheral I/Os.
2. Serial I/O Port Interface Unit: This is a 16-bit bidirectional unit providing direct communication with industry standard serial devices, such as speech codecs and audio data converters, with a minimum of external hardware.
3. Parallel I/O Port Interface Unit: The PIO unit serves as a 16-bit, general purpose I/O port providing bitwise control over the port pins as inputs or outputs.
4. Counter / Timer Unit: The CTU is a 16-bit counter / timer unit for generating accurate time delays under software control.
5. Interrupt Control Unit: The ICU relieves the Lode Core from the task of prioritizing a multi-interrupt system. It can handle up to 16 priority interrupts for the Lode Core.
6. Bus Isolation Unit: The BIU is responsible for keeping track of external bus requests and granting access to them when not in use.
7. Bus Interface Unit: The BIU connects the signals of the Lode Core to the external logic through tri-state buffers.
8. Wait State Generator Unit: The Wait State Generator unit inserts wait states for accesses to slow program and data memories.
9. Microcontroller FIFO I/O Interface Unit: This unit provides a 16-bit data interface for high speed, semaphore controlled communication with external devices.
10. JTAG Controller Unit: It includes JTAG test logic and special pins to provide a standardized approach for testing and debugging the logic on the chip.
11. RAM Interface Unit: This unit lets the Lode core interface with slow memories (EPROM) and high speed RAMs. During boot mode, it takes a snapshot of the data in the slow EPROM and writes it into the RAM.

Lode Tools

Software tools for Lode enable the DSP programmer to develop and verify applications. The following software tools are available for Lode:

- LodeAsm: The Lode assembler converts Lode source files in assembly language into object files that can be linked into executable files.
- LodeLink: The Lode linker converts assembled files into executable files that can be used by the Lode core or by the simulator.
- LodeSim: The Lode simulator operates on Lode executable files to validate the instruction set and provide a platform for code development.

A Lode Applications Development System (LADS) using the Lode DSP is also available. LADS provides means for future program development, algorithm testing and demonstrations. LADS incorporates a chip with both the Lode core and I/O to provide complete ICE functions. LADS provides not only hardware emulation, single stepping, and register interrogation, but also JTAG test functionality.

Figure 2. Lode Core Signals
[Figure: the Lode DSP Engine block with its program memory, data memory and control signals; each signal is listed under Signal Description.]

Signal Description

Signal Name   I/O  Description

Program memory interface
PAB[15:0]     O    Program memory address bus
PDB[31:0]     I    Program memory data bus
PMEL          O    Program memory enable

Data memory interface
AB0[15:0]     O    Data memory 0 address bus
AB1[15:0]     O    Data memory 1 address bus
RB0[15:0]     IO   Data memory 0 data bus
RB1[15:0]     IO   Data memory 1 data bus
DRW0L         O    Data memory read/write control for port 0
DRW1L         O    Data memory read/write control for port 1
DME0L         O    Data memory 0 enable line
DME1L         O    Data memory 1 enable line
DRW0LIN       O    Early data memory 0 read/write control
DRW1LIN       O    Early data memory 1 read/write control
DME0LIN       O    Early data memory 0 enable
DME1LIN       O    Early data memory 1 enable

Control interface
ACTIVE        O    Active or Wait Mode
RESETL        I    Processor reset. Minimum duration is 2 cycles.
RESETO        O    Processor reset output
PCUINT        I    PCU interrupt request
PCUACK        O    PCU interrupt acknowledge
INTDIS        O    Interrupt disable status signal
FREEZEL       I    External data wait state control
CK            I    Master clock signal

Timing Specification

All timing specifications are for 40 MHz operation.

Parameter                                                         Min (ns)  Max (ns)  Notes
CK cycle time                                                     25        --        For 40 MHz clk
CK high pulse duration                                            10.5      12.5      For 40 MHz clk
CK low pulse duration                                             10.5      12.5      For 40 MHz clk
CK fall time                                                      0         1.0
CK rise time                                                      0         1.0
PDB[31:0] setup time before CK high                               2.0       --        0.24 pF load cap
PDB[31:0] hold time after CK high                                 0         --
PAB[15:0] valid delay from CK high                                0         4.0       2.0 pF load cap
PAB[15:0] hold time after CK high                                 0         --
PMEL valid after CK high                                          0         4.0
PMEL hold time after CK high                                      0         --        2.0 pF load cap
AB0[15:0], AB1[15:0] valid delay                                  --        6.0       3.0 pF load cap
AB0[15:0], AB1[15:0] hold time after CK high                      0.0       --
RB0[15:0], RB1[15:0] read setup time                              4.0       --        0.24 pF load cap
RB0[15:0], RB1[15:0] read hold time                               0         --
RB0[15:0], RB1[15:0] write setup time                             11.0      --        4.0 pF load cap
RB0[15:0], RB1[15:0] write hold time (tri-stated after CK high)   0         5.0
DRW0L, DME0L, DRW1L, DME1L valid delay                            0         6.0       3.0 pF load cap
DRW0L, DME0L, DRW1L, DME1L hold time after CK high                0         --
DRW0LIN, DME0LIN, DRW1LIN, DME1LIN valid delay                    0         13.2
DRW0LIN, DME0LIN, DRW1LIN, DME1LIN hold time after CK high        0         --
RESETL setup time before CK high                                  5.0       --        0.24 pF load cap
RESETL hold time after CK high                                    0         --
RESETL low pulse duration                                         25.0      --        For 40 MHz clk
RESETO valid after CK high (output)                               0         5.0       2.0 pF load cap
ACTIVEL valid (output)                                            0         5.0       2.0 pF load cap
PCUINT valid after CK high (input)                                0         7.0       0.24 pF load cap
PCUACK valid delay (output)                                       0         5.0       2.0 pF load cap
INTDIS valid after CK high (output)                               0         5.0       2.0 pF load cap
FREEZEL setup time before CK high (input)                         3.0       --        0.24 pF load cap
FREEZEL hold time after CK low                                    0         --
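One way to read the timing figures is as a memory access-time budget. The sketch below assumes that an address is driven after one rising CK edge and the read data is captured at the next rising edge, and that each wait state adds one full CK cycle; neither assumption is stated explicitly in this data sheet. Under those assumptions the time left for the memory is the cycle time minus the address valid delay and the data setup time. The function name is hypothetical.

```python
# Rough read-access budget derived from the timing figures (all values in ns).
# Assumption: address launched on one rising CK edge, data captured on the next.
# Assumption: each wait state extends the access by one full CK cycle.

CK_CYCLE_NS = 25.0  # CK cycle time at 40 MHz


def access_budget_ns(valid_delay_ns: float, setup_ns: float,
                     cycle_ns: float = CK_CYCLE_NS, wait_states: int = 0) -> float:
    """Time available to the memory between a valid address and required data setup."""
    return (1 + wait_states) * cycle_ns - valid_delay_ns - setup_ns


if __name__ == "__main__":
    # Program memory: PAB valid delay (max 4.0) and PDB setup time (min 2.0).
    print("program memory:", access_budget_ns(4.0, 2.0), "ns")
    # Data memory: AB0/AB1 valid delay (max 6.0) and RB0/RB1 read setup (min 4.0).
    print("data memory:", access_budget_ns(6.0, 4.0), "ns")
```

With zero wait states this leaves 19 ns for program memory and 15 ns for a data memory read; slower memories would need wait states, for example from the Wait State Generator unit in the peripherals library.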