DSP Platforms Lab (AD-SHARC) Session 05

University of Miami - Frost School of Music DSP Platforms Lab (AD-SHARC) Session 05 Description This session will be dedicated to give an introduction to the hardware architecture and assembly programming of the SHARC-21364. Additional Lab material can be found on http://gmuelibrary.pbwiki.com in the DSP Platforms section. Introduction: A digital signal processor is a microprocessor with architectural optimizations to speed up processing. One of the main features of a DSP-chip as this one is the implementation of Harvard architecture, which means that the data memory and the program memory have independent buses to fetch multiple data and instructions at the same time. The name SHARC stands for super-harvard architecture, which besides having separated busses for data and program memory has an independent I/O processor with its associated dedicated busses. The SHARC can be used as a floating-point or fixed-point processor. SHARC-21364 Hardware Architecture This lab is intended as an introductory overview and it is not a replacement of the actual manuals. For an in-depth understanding of the DSP-chip and its programming capabilities refer to the user s manuals located at the following url: http://www.analog.com/processors/sharc/technicallibrary/manuals/index.html For a quick interactive overview of general SHARC-DSP architecture, download and run the app Navigator 3.1 at: http://www.enel.ucalgary.ca/people/smith/2002webs/encm515_02/02general/background _info/sharcnav.zip Note: Start in the functional block diagram an click on the blocks for an interactive tutorial. It is suggested to read this interactive tutorial before the lab session. The figure below shows a block diagram of the SHARC architecture, a short description of each block is given as follows:

The SHARC has three main blocks that communicate through the busses. The Core Processor has two processing elements PEX and PEY, each with its correspondent computational units: Arithmetic Logic Unit (ALU), Multiply Accumulate unit (MAC), Shifter; and a data register. The other blocks are a Program Sequencer, the Timer, the Instruction Cache, and two Data Address Generators. The I/O Processor is an independent input/output-processing unit that manages the offchip data I/O transfers and alleviates the Core processor of this burden. The Memory is organized in four blocks, with up to 3Mbits of internal RAM using SRAM (static RAM: stores information while power is on without the need of refreshing), an up to 4Mbits or ROM (read-only memory).

1. Core Processor Note: Complete information in the SHARC-21364 programming reference manual 1.1. Processing Elements The processor s processing elements (PEx and PEy) perform numeric processing for processor algorithms. Each processing element contains a data register file and three computation units an arithmetic/logic unit (ALU), a multiplier, and a shifter. Computational instructions for these elements include both fixed-point and floating-point operations. ALU (Arithmetic Logic Unit): The ALU performs arithmetic operations on fixed-point or floating-point data and logical operations on fixed-point data. ALU fixed-point instructions operate on 32-bit fixed-point operands and output 32-bit fixed-point results, and ALU floating-point instructions operate on 32-bit or 40-bit floating-point operands and output 32-bit or 40-bit floatingpoint results. ALU instructions include: Floating-point addition, subtraction, add/subtract, average Fixed-point addition, subtraction, add/subtract, average Floating-point manipulation: binary log, scale, and mantissa Fixed-point add with carry, subtract with borrow, increment, decrement Logical And, Or, Xor, Not Functions: ABS, PASS, MIN, MAX, CLIP, and COMPARE Format conversion Reciprocal and reciprocal square root primitives ALU instructions take one or two inputs: X input and Y input. These inputs (known as operands) can be any data registers in the register file. Most ALU operations return one result; in add/subtract operations, the ALU operation returns two results; in compare operations, the ALU operation returns no result (only flags are updated). ALU results can be returned to any location in the register file. MAC (Multiply Accumulator): The multiplier performs fixed-point or floating-point multiplication and fixed-point multiply/accumulate operations. Multiplier floating-point instructions operate on 32-bit or 40-bit floating-point operands and output 32-bit or 40-bit floating-point results. Multiplier fixed-point instructions operate on 32-bit fixed-point data and produce 80-bit results. Inputs are treated as fractional or integer, unsigned or two s-complement. Multiplier instructions include:

Floating-point multiplication Fixed-point multiplication Fixed-point multiply/accumulate with addition, rounding optional Fixed-point multiply/accumulate with subtraction, rounding optional Rounding multiplier result register Saturating multiplier result register Clearing multiplier result register The multiplier takes two inputs: X and Y. These inputs (also known as operands) can be any data registers in the register file. The multiplier can accumulate fixed-point results in the local multiplier result (MRF) registers or write results back to the register file. The results in MRF can also be rounded or saturated in separate operations. Floating-point multiplies yield floating-point results, which the multiplier writes directly to the register file. Barrel Shifter (Shifter) The shifter performs bit-wise operations on 32-bit fixed-point operands. Shifter operations include: Shifts and rotates from off-scale left to off-scale right Bit manipulation operations, including bit set, clear, toggle, and test Bit field manipulation operations, including extract and deposit Fixed-point/floating-point conversion operations, including exponent extract, number of leading 1s or 0s The shifter takes one to three inputs: X, Y, and Z. The inputs (known as operands) can be any register in the register file. Within a shifter instruction, the inputs serve as follows. The X input provides data that is operated on. The Y input specifies shift magnitudes, bit field lengths, or bit positions. The Z input provides data that is operated on and updated. In the following example, Rx is the X input, Ry is the Y input, and Rn is the Z input. The shifter returns one output (Rn) to the register file. Rn = Rn OR LSHIFT Rx BY Ry; Data Register File Each of the processor s processing elements has a data register file, which is a set of data registers that transfers data between the data buses and the computational units. These registers also provide local storage for operands and results. The two register files consist of 16 primary registers and 16 alternate (secondary) registers. The data registers are 40 bits wide. Within these registers, 32-bit data is left justified. If an operation specifies a 32-bit data transfer to these 40-bit registers, the eight

LSBs are ignored on register reads, and the LSBs are cleared to zeros on writes. When a program refers to these registers as R0 through R15, the computational units treat the contents of these registers as fixed-point data. To perform floating-point computations, refer to these registers as F0 through F15. For example, the following instructions refer to the same registers, but direct the computational units to perform different operations: F0 = F1 * F2; /* floating-point multiply */ R0 = R1 * R2; /* fixed-point multiply */ The rules for using register names are: R0 through R15 and F0 through F15 refer to PEx registers for data move and computational instructions, whether the processor is in SISD or SIMD mode. R0 through R15 and F0 through F15 refer to both PEx and PEy register for computational instructions in SIMD mode. S0 through S15 refer to PEy registers for data move instructions, when the processor is in SISD or SIMD mode. SISD: Single Instruction Single Data. An operation performed on one processing element is not duplicated on the other processing element. SIMD: Single Instruction Multiple Data. The dual processing elements execute the same instruction, but operate on different data. 1.2. Data Address Generators (DAGS) The processor s data address generators (DAGs) generate addresses for data moves to and from data memory (DM) and program memory (PM). Each DAG has four types of registers. These registers hold the values that the DAG uses for generating addresses. The four types of registers are: Index registers (I0 I7 for DAG1 and I8 I15 for DAG2). An index register holds an address and acts as a pointer to memory. For example, the DAG interprets DM(I0,0) and PM(I8,0) syntax in an instruction as addresses. Modify registers (M0 M7 for DAG1 and M8 M15 for DAG2). A modify register provides the increment or step size by which an index register is pre- or post-modified during a register move. For example, the DM(I0,M1) instruction directs the DAG to output the address in register I0 then modify the contents of I0 using the M1 register. Length and base registers (L0 L7 and B0 B7 for DAG1 and L8 L15 and B8 B15 for DAG2). Length and base registers set the range of addresses and the starting address for a circular buffer.

Examples: R0 = dm(i0,m0); /* store in register R0 the value pointed by I0 and increment I0 by the amount stored in m0 afterwards */ dm(i0,m0) = F0; /* store in address pointed by I0 the value stored in register F0 and increment I0 by the amount stored in m0 afterwards */ An important feature of the DAGs is the circular buffer addressing capability. This is defined as addressing a range of addresses, which contain data that the DAG steps through repeatedly, wrapping around to repeat stepping through the range of addresses in a circular pattern. To address a circular buffer, the DAG steps the index pointer (I register) through the buffer, post-modifying and updating the index on each access with a positive or negative modify value (M register or immediate value). If the index pointer falls outside the buffer, the DAG subtracts or adds the buffer length to the index value, wrapping the index pointer back within the start and end boundaries of the buffer. Example of how to set up a circular buffer: 1. Enable circular buffering (BIT SET MODE1 CBUFEN;). This operation is only needed once in a program. 2. Load the buffer s base address into the B register. This operation automatically loads the corresponding I register. The starting address that the DAG wraps around is called the buffer s base address (B register). There are no restrictions on the value of the base address for a circular buffer. 3. Load the buffer s length into the corresponding L register. For example, L0 corresponds to B0. 4. Load the modify value (step size) into an M register in the corresponding DAG. For example, M0 through M7 correspond to B0. Alternatively, the program can use an immediate value for the modifier. THE FOLLOWING SYNTAX SETSUP AND ACCESSES A CIRCULAR BUFFER WITH: LENGTH=11 BASEADDRESS=0X80500 MODIFIER=4 BIT SET MODE1 CBUFEN; /*ENABLES CIRCULAR BUFFER ADDRESSING JUST ONCE IN A PROGRAM*/ B0=0X80500; /*LOADS B0 AND L0 REGISTERS WITH BASE ADDRESS*/ L0=11; /*LOADS L0 REGISTER WITH LENGTH OF BUFFER*/ M1=4; /*LOADS M1 WITH MODIFIER OR STEP SIZE*/ LCNTR=11, DO MY_CIR_BUFFER UNTIL LCE; /*SETS UP A LOOP CONTAINING BUFFER ACCESSES*/ R0=DM(I0,M1); /*AN ACCESS WITHIN THE BUFFER USES POST MODIFY ADDRESSING*/... /*OTHER INSTRUCTIONS IN THE MY_CIR_BUFFERLOOP*/ MY_CIR_BUFFER: NOP; /*END OF MY_CIR_BUFFERLOOP*/

1.3. Program Sequencer The program sequencer controls program flow by constantly providing the address of the next instruction to be fetched for execution. Program flow in the processor is mostly linear, with the processor executing instructions sequentially. This linear flow varies occasionally when the program branches due to nonsequential program structures, such as those described below. Nonsequential structures direct the processor to execute an instruction that is not at the next sequential address following the current instruction. To achieve a high execution rate while maintaining a simple programming model, the processor employs a five stage pipeline to process instructions - fetch1, fetch2, decode, address and execute. The non-sequencial structures include: Loops. One sequence of instructions executes multiple times with zero overhead. Subroutines. A traditional CALL structure where the processor temporarily breaks sequential flow to execute instructions from another part of program memory. Jumps. Program flow is permanently transferred to another part of program memory. Interrupts. A runtime event (generally not an instruction) triggers the program sequencer to branch to interrupt-handling subroutines. Idle. An instruction that causes the core to stop executing further instructions and hold its current state until an interrupt occurs. Then, after the processor services the interrupt, the sequencer resumes normal program execution.

1.4 Timer The core s programmable interval timer provides periodic interrupt generation. When enabled, the timer decrements a 32-bit count register every cycle. When this count registers reaches zero, the ADSP-2136x processors generate an interrupt and assert their timer expired output. The count register is automatically reloaded from a 32-bit period register and the countdown resumes immediately. In addition to the internal core timer, the ADSP-2136x processor contains three identical 32-bit timers that can be used to interface with external devices. 1.5. Instruction Cache The program sequencer includes a 32-word instruction cache that effectively provides three-bus operation for fetching an instruction and two data values. The cache is selective; only instructions whose fetches conflict with data accesses using the PM bus are cached. This caching allows full speed execution of core, looped operations such as digital filter multiply-accumulates, and FFT butterfly processing. 1.6. Interrupts The ADSP-2136x processors have three external hardware interrupts. The processor also provides three general-purpose interrupts, and a special interrupt for reset. The processor has internally-generated interrupts for the timer, DMA controller operations, circular buffer overflow, stack overflows, arithmetic exceptions, and user-defined software interrupts. For the general-purpose interrupts and the internal timer interrupt, the processor automatically stacks the arithmetic status (ASTATx) register and mode (MODE1) registers in parallel with the interrupt servicing, allowing 15 nesting levels of very fast service for these interrupts. 2. The I/O Processor In applications that use extensive off-chip data I/O, programs may find it beneficial to use a processor resource other than the processor core to perform data transfers. The ADSP- 2136x processor contains an I/O processor (IOP) that supports a variety of DMA (direct memory access) operations. Each DMA operation transfers an entire block of data. These operations include the transfer types listed below: Internal memory external memory devices Internal memory serial port I/O

Internal memory SPI I/O Internal memory Digital Applications Interface (DAI) Internal memory Internal memory * For a detailed explanation refer to the hardware reference Manual 3. Memory The ADSP-2136x processors contain four blocks of single-ported internal memory. In most programs this memory is available for single cycle, simultaneous, independent accesses by the core processor and I/O processor. The single-ported memory, in combination with three separate on-chip buses, allows two data transfers from the core and one transfer from the I/O processor in a single cycle, provided the access is from three different blocks of the memory. Using the I/O bus, the I/O processor provides data transfers between internal memory and the processor s communication ports. The processor memory is organized as four blocks block 0, block 1, block 2 and block 3, containing up to 3M bits of internal RAM and 4M bits of internal ROM. The memory on the ADSP-21364 processor has the following additional features. Each block can be configured for different combinations of code and data storage. Each block consists of four columns and each column is 16 bits wide. Each block maps to separate regions in memory address space and can be accessed as 16-bit, 32-bit, 48-bit, or 64-bit words. Each block also has its own two-deep self clearing shadow write buffers with automatic hit detection and data forwarding logic for read access. The processor features a 16-bit floating-point storage format that effectively doubles the amount of data that may be stored on-chip. A single instruction converts the format from 32-bit floating-point to 16-bit floating-point. While each memory block can store combinations of code and data, accesses are most efficient when one block stores data using the DM bus, for transfers, the second block stores instructions and data using the PM bus and a third and fourth block stores data using the I/O bus. Using the DM and PM buses in this way assures single-cycle execution with two data transfers. In this case, the instruction must be available in the cache. * For a detailed explanation refer to the programming reference Manual

SHARC Assembly Language Programming In this section a simple example is provided to show how an FIR filter structure may be implemented.

The code above is for instructional purposes and resources initialization steps have been omitted for simplicity. As shown at the top of the main program, there are three variables that need to be defined before jumping into the main section of code. These are the number of points in the filter kernel, NR_COEF; a circular buffer that holds the past samples from the input signal, dline[ ]; and a circular buffer that holds the filter kernel, coef[ ]. We also need to give the program two other pieces of information: the sampling rate of the codec, and the name of the file containing the filter kernel, so that it can be read into coef[ ]. All these steps are easy; nothing more than a single line of code each. We don't show them in this example because they are contained in the sections of code that we are ignoring for simplicity. The main section of the program performs two functions. In lines 6 to 13, the dataaddress-generators (DAGs) are configured to manage the circular buffers: dline[ ], and coef[ ]. Three parameters are needed for each buffer: the starting location of the buffer in memory (b0 and b8), the length of the buffer (l0 and l8), and the step size of the data being stored in the buffer (m0 and m8). These parameters that control the circular buffers are stored in hardware registers in the DAGs, allowing them to access and manage the data very efficiently. The second action of the main program is an idle loop, implemented in lines 15 to 19. This does nothing but wait for an interrupt indicating that an input sample has been acquired. All of the processing in this program occurs on a sample-by-sample basis. Each time a sample is read from the input, a sample in the output signal is calculated and routed to the codec. Most time-domain algorithms, such as FIR and IIR filters, fall into this category. The alternative is frame-by-frame processing, which is required for frequency-domain techniques. In the frame-by-frame method, a group of samples is read from the input, calculations are conducted, and a group of samples is written to the output. The subroutine that services the sample-ready interrupt is broken into three sections. The first section (lines 27 to 33) fetches the sample from the codec as a fixed point number, and converts it to floating point. In SHARC assembly language, a data register holding a fixed point number is referred to by "r" (such as r0, r8, r15, etc.), and by "f" if it is holding a floating point number (i.e., f0, f8, or f15.). For instance, in line 32, the fixed point number in data register 0 (i.e., r0) is converted into a floating point number and overwrites data register 0 (i.e., f0). This conversion is done according to a scaling specified by the fixed point number in data register 1 (i.e. r1). In the third section (lines 47 to 53), the opposite steps take place; the floating point number for the output sample is converted to fixed point and sent to the codec. The FIR filter that converts the input samples into the output samples is contained in lines 35 to 45. All the calculations are carried out in floating point, avoiding the need to worry about scaling and overflow. This section of code is optimized to take advantage of the SHARC DSP's ability to execute multiple instructions each clock cycle.

After we have the assembly program written and the filter kernel designed, we are ready to create a program that can be executed on the SHARC DSP. This is done by running the compiler, the assembler, and then the linker; three programs provided with the EZ-KIT Lite. The compiler converts a C program into the SHARC's assembly language. If you directly write the program in assembly, such as in this example, you bypass this step. The assembler and linker convert the program and external files (such as the architecture file, codec initialization routines, filter kernel, etc.) into the final executable file. All this takes about 30 seconds, with the final result being a SHARC program residing on the harddisk of your PC. The EZ-KIT Lite host is then used to run the program on the EZ-KIT Lite. Exercise: The files below may be used to use VisualDSP++ to compile Assembly language codes. The outline and scope of the exercise is left to the course instructor to define. Download from the URL below the examples for Single Channel block FIR and Single Channel single sample FIR. http://www.analog.com/processors/sharc/technicallibrary/codeexamples/2136x_simd_co de.html Also open the code example for Talkthru Analog In-Out (ASM), that is normally located in the path below. C:\Program Files\Analog Devices\VisualDSP 5.0\213xx\Examples\ADSP-21364 EZ-KIT Lite\Talkthru Analog In-Out (ASM) You can also look in the URL below for some audio effects example for the SHARC. Note: Even though these examples are meant for a different family of SHARC processors, they can be implemented in the 21364 with a few modifications. http://www.analog.com/processors/sharc/technicallibrary/codeexamples/21065l_audiod emos.html