
Embedded Processor Cores National Chiao Tung University Chun-Jen Tsai 5/30/2011

ARM History
- The first ARM processor was designed by Acorn Computers Limited, Cambridge, England, between 1983 and 1985
  - Based on the RISC concepts developed at Stanford and Berkeley around 1980
  - First RISC processor for commercial use
- ARM (Advanced RISC Machines) was established in 1990
  - ARM is not an IC vendor, but a fabless design house
- ARM also develops tools to facilitate ARM-based system development: software tools, boards, debug hardware, application software, bus architectures, peripherals, etc.

ARM vs. Berkeley RISC
- Features used
  - Load/store architecture
  - Fixed-length instructions (32-bit or 16-bit)
  - Large register banks
  - 3-address instruction format:

        f bits      n bits          n bits       n bits
        function    dest. operand   operand 1    operand 2

- Features rejected
  - Register windows: too costly, use shadow registers instead
  - Delayed branch: replaced by branch prediction
  - Single-cycle execution of all instructions: load/store instructions require at least two cycles, unless data and instruction memories are separated

ARM vs. Intel Atom

                 ARM Cortex-A8               Intel Atom (Core Architecture)
    Process      TSMC 65 nm                  Intel high-k 45 nm
    Die size     4 mm² (with I/D-cache)      25 mm²
    Clock freq.  1 GHz                       0.8 GHz ~ 1.8 GHz
    Power        0.45 W                      0.6 W ~ 2.5 W

Generic ARM7 Architecture
- Register bank
  - 2 read ports, 1 write port, access to any register
  - Additional read and write ports for r15 (PC)
- Barrel shifter
  - Shifts or rotates the operand by any number of bits
- Multiplier
- ALU
- Address register and incrementer
- Instruction decode and control
- Data in/out registers
[Figure: ARM7 datapath. Labels: A[31:0], address register, incrementer, register bank (31 x 32-bit), PC, multiplier, barrel shifter, ALU, A bus, B bus, ALU bus, instruction decode & control, data out register, data in register, D[31:0]]

Data Processing Instructions
- All operations take place in a single clock cycle
[Figure: datapath activity for (a) register-register operations and (b) register-immediate operations, where the immediate comes from instruction bits [7:0]]

Data Transfer Instructions
- Address calculation is similar to a data processing instruction
[Figure: datapath activity for (a) 1st cycle: compute address from Rn and the offset in instruction bits [11:0]; (b) 2nd cycle: store data and auto-index]
- For a load instruction, the data from memory only gets as far as the data-in register in the 2nd cycle, and a 3rd cycle is needed to transfer the data from there to the destination register

Branch Instructions
- A branch instruction takes three cycles
[Figure: datapath activity for (a) 1st cycle: compute branch target from the PC and instruction bits [23:0] shifted left by 2; (b) 2nd cycle: save return address in r14]
- The third cycle, which is required to complete the pipeline refilling, is also used to update the value of the link register so that it points directly at the instruction which follows the branch

Pipeline Organization
[Figure: pipeline organization]

ARM9 vs. ARM7 Pipeline Operations
- Due to the higher clock rate of the ARM9, there is not enough time to translate Thumb instructions into ARM instructions and then decode them, so Thumb instructions are decoded directly

ARM9 Forwarding Architecture
- Forwarding mechanism:
  - The ALU result from the EX/MEM register is always fed back to the ALU input latches
  - If the forwarding hardware detects that the previous ALU operation has written the register corresponding to the source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file
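The selection rule above can be sketched in C. This is only an illustrative model under simplifying assumptions: the `ExMemLatch` type and `select_alu_input` function are hypothetical names, and only the single EX/MEM forwarding path is modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the EX/MEM pipeline latch. */
typedef struct {
    bool     writes_reg;  /* previous instruction writes a register */
    unsigned dest;        /* its destination register number */
    uint32_t alu_result;  /* its ALU result, not yet written back */
} ExMemLatch;

/* Choose the ALU input for source register `src`. */
uint32_t select_alu_input(const ExMemLatch *prev, unsigned src,
                          const uint32_t regfile[16])
{
    if (prev->writes_reg && prev->dest == src)
        return prev->alu_result;  /* forwarded result */
    return regfile[src];          /* value read from the register file */
}
```

With the previous instruction writing r1, a read of r1 returns the forwarded result while a read of any other register still comes from the register file.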

Data Hazards
- Data hazards occur when the pipeline changes the order of read/write accesses to operands, so that an operand is needed before it has been updated

    Clock cycle       1    2    3       4       5       6       7    8    9
    ADD R1,R2,R3      IF   ID   EX      MEM     WB
    SUB R4,R5,R1           IF   ID_sub  EX      MEM     WB
    AND R6,R1,R7                IF      ID_and  EX      MEM     WB
    OR  R8,R1,R9                        IF      ID_or   EX      MEM  WB
    XOR R10,R1,R11                              IF      ID_xor  EX   MEM  WB

Forwarding
- The problem with data hazards can sometimes be solved with a simple hardware technique called data forwarding

    Clock cycle       1    2    3       4       5    6    7
    ADD R1,R2,R3      IF   ID   EX      MEM     WB
    SUB R4,R5,R1           IF   ID_sub  EX      MEM  WB
    AND R6,R1,R7                IF      ID_and  EX   MEM  WB

Instruction Reshuffling
- The load instruction has a delay or latency that cannot be eliminated by forwarding alone

    Clock cycle       1    2    3    4      5       6    7    8
    LDR R1,@(R2)      IF   ID   EX   MEM    WB
    SUB R4,R1,R5           IF   ID   stall  EX_sub  MEM  WB
    AND R6,R7,R8                IF   stall  ID      EX   MEM  WB

- We can reshuffle the instructions to avoid the stall

    Clock cycle       1    2    3    4    5       6    7
    LDR R1,@(R2)      IF   ID   EX   MEM  WB
    AND R6,R7,R8           IF   ID   EX   MEM     WB
    SUB R4,R1,R5                IF   ID   EX_sub  MEM  WB

Co-Processor Interface
- ARM has a general-purpose coprocessor interface:
  - Coprocessors use user-defined extensions of the ARM instructions
  - Support for up to 16 logical coprocessors
  - Each coprocessor can have up to 16 registers of any size
  - Coprocessors use a load-store architecture: the LSU moves data between private registers and memory/ARM registers
[Figure: coprocessor interface. The ARM core (DIN, DOUT), the memory system, and the coprocessor (CPDIN, CPOUT, CPDRIVE) are connected through multiplexers controlled by ASEL and CSEL]
- Note: the coprocessor does not control the address bus; the ARM generates the required addresses

ARM Data Types
- ARM has two states of operation:
  - Regular (ARM) state: 32-bit instructions, word-aligned
  - Thumb state: 16-bit instructions, half-word aligned
- ARM processor supports 6 data types
  - 8-bit signed and unsigned bytes
  - 16-bit signed and unsigned half-words, aligned on 2-byte boundaries
  - 32-bit signed and unsigned words, aligned on 4-byte boundaries
- Unaligned data accesses are invalid, so be careful with pointer arithmetic in C
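One common way to stay safe when C pointer arithmetic may produce an unaligned address is to copy the bytes instead of dereferencing a cast pointer. The sketch below is illustrative only; the function name `load_u32_unaligned` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit word from an address that may not be 4-byte aligned.
   memcpy copies byte by byte in effect, so no aligned word access is
   required; casting p to (const uint32_t *) would not be safe here. */
uint32_t load_u32_unaligned(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

The value is returned in the host byte order, exactly as an aligned word load from the same bytes would produce.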

ARM Registers
- ARM has 37 registers, all of which are 32 bits long
  - 1 program counter (PC), 1 current program status register (CPSR), 5 saved program status registers (SPSR), and 30 general-purpose registers
- Convention: r13 as stack pointer, r14 as link register
- Processor mode governs which bank is accessible

    Mode         Registers                            Status register
    user         r0-r7, r8-r12, r13, r14, r15 (PC)    CPSR
    fiq          r8_fiq-r12_fiq, r13_fiq, r14_fiq     SPSR_fiq
    svc          r13_svc, r14_svc                     SPSR_svc
    abort        r13_abt, r14_abt                     SPSR_abt
    irq          r13_irq, r14_irq                     SPSR_irq
    undefined    r13_und, r14_und                     SPSR_und

- The user-mode row is usable in user mode; the banked registers and SPSRs are usable in system modes only

Exception Operations
- Upon an exception, ARM does the following:
  - Completes the current instruction as best it can
  - Departs from the current instruction sequence to handle the exception, starting from a specific location
- The processor performs the following sequence:
  1. Change to the operating mode of the particular exception
  2. Save the return address in r14 of the new mode
  3. Save the old value of CPSR in the SPSR of the new mode
  4. Mask out further IRQ or FIQ by setting the related CPSR bit
  5. Set PC to the entry point of the exception handling routine
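The entry sequence can be modeled in a few lines of C. This is a sketch under stated assumptions, not a description of the hardware: the `Cpu` struct and `enter_exception` function are hypothetical, only one banked r14/SPSR pair is modeled, and the CPSR field positions (mode in bits 4:0, F in bit 6, I in bit 7) are taken from the ARM architecture rather than from these slides. The return address is passed in by the caller because its exact value depends on the exception type.

```c
#include <stdint.h>

#define MODE_MASK 0x1Fu      /* CPSR[4:0]: processor mode */
#define F_BIT     (1u << 6)  /* CPSR[6]: FIQ disable */
#define I_BIT     (1u << 7)  /* CPSR[7]: IRQ disable */

/* Hypothetical, heavily simplified CPU state. */
typedef struct {
    uint32_t cpsr;
    uint32_t pc;
    uint32_t r14_banked;   /* r14 of the mode being entered */
    uint32_t spsr_banked;  /* SPSR of the mode being entered */
} Cpu;

void enter_exception(Cpu *cpu, uint32_t mode, uint32_t vector,
                     uint32_t return_addr, int mask_fiq)
{
    uint32_t old_cpsr = cpu->cpsr;

    cpu->cpsr = (old_cpsr & ~MODE_MASK) | mode;  /* 1. change mode */
    cpu->r14_banked = return_addr;               /* 2. save return address */
    cpu->spsr_banked = old_cpsr;                 /* 3. save old CPSR */
    cpu->cpsr |= I_BIT;                          /* 4. mask IRQ ... */
    if (mask_fiq)
        cpu->cpsr |= F_BIT;                      /*    ... and FIQ if requested */
    cpu->pc = vector;                            /* 5. jump to the entry point */
}
```

For example, an IRQ taken from user mode would call `enter_exception(&cpu, 0x12, 0x18, ret, 0)`, where 0x12 is the IRQ mode encoding.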

Exception Entry Point
- The entry points of exceptions:

    Exception type                                   Mode    Vector address   Priority
    Reset                                            SVC     0x0000 0000      1
    Undefined instruction                            UND     0x0000 0004      6
    Software interrupt (SWI)                         SVC     0x0000 0008      6
    Prefetch abort (instruction fetch memory fault)  Abort   0x0000 000C      5
    Data abort (data access memory fault)            Abort   0x0000 0010      2
    IRQ (normal interrupt)                           IRQ     0x0000 0018      4
    FIQ (fast interrupt)                             FIQ     0x0000 001C      3

- Note that the vector address is the ENTRY POINT of the exception handling routine
  - This is different from the x86 interrupt vector table!
  - Normally, the entry point contains a branch instruction to the real interrupt handling routine

Exception Return
- Once the exception has been handled, the user task is normally resumed
- The code sequence is:
  1. Restore any modified user registers from the handler's stack
  2. Restore CPSR from the appropriate SPSR
  3. Set PC to the return address
- The last two steps happen atomically as part of a single instruction

ARM Instruction Set Features
- Conditional execution of every instruction
  - If the condition is false, the instruction becomes a NOP
- Shift and ALU operation in a single instruction
- Compression of immediate operands and address offsets in 32-bit instructions
  - Your high-level language coding style may have a noticeable performance impact
- Compact instruction mode: Thumb mode

ARM Instruction Coding Format
[Figure: ARM instruction coding format]

Conditional Execution
- Other processors typically only allow branches to be executed conditionally
- ARM allows conditional execution on everything
  - All instructions contain a condition field which determines whether the CPU will execute them or not
  - Non-executed instructions still take up 1 cycle
- This greatly reduces the number of branches, which stall the pipeline
  - Allows very dense in-line code
  - The time penalty of not executing several conditional instructions is frequently less than that of a branch
- However, for a large block of conditional code, you should still use conditional branches

Conditional Execution Example
- For a short conditional sequence, it is better to exploit conditional execution than to use a branch

  With a branch:

            CMP   r0,#5
            BEQ   Bypass        ;if (r0!=5) {
            ADD   r1,r1,r0      ;  r1=r1+r0
            SUB   r1,r1,r2      ;  r1=r1-r2
    Bypass                      ;}

  With conditional execution:

            CMP   r0,#5
            ADDNE r1,r1,r0
            SUBNE r1,r1,r2

- Conditional execution can also implement short-circuit expressions, e.g. if ((a==b) && (c==d)) e++;

            CMP   r0,r1
            CMPEQ r2,r3
            ADDEQ r4,r4,#1

Using Barrel Shifter for 2nd Operand
- Register with shift operation
  - The shift value can be either a 5-bit unsigned integer or specified in the bottom byte of another register
  - Example: ADD r3,r2,r1,lsl #3 ; r3 := r2 + 8*r1
  - Used for multiplication by a constant
- Immediate value
  - 8-bit number, with a range of 0 ~ 255
  - Rotated right through an even number of positions
  - Allows an increased range of 32-bit constants

Multiplication by a Constant
- Multiplication by a constant equal to 2^n ± 1 can be done in a single cycle, for example r0 = r1 × 5 = r1 + (r1 × 4):

        ADD r0,r1,r1,lsl #2

- Other types of constant multiplication can be carried out by combining several instructions, for example r2 = r3 × 119 = r3 × 17 × 7 = r3 × (16 + 1) × (8 - 1):

        ADD r2,r3,r3,lsl #4    ; r2 := r3*17
        RSB r2,r2,r2,lsl #3    ; r2 := r2*7
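The same shift-and-add decompositions can be written in C to check the arithmetic. This is only an illustrative sketch; the function names `mul5` and `mul119` are hypothetical, and each statement mirrors one of the ARM instructions above.

```c
#include <stdint.h>

/* x*5 = x + (x << 2), mirrors ADD r0,r1,r1,lsl #2 */
uint32_t mul5(uint32_t x)
{
    return x + (x << 2);
}

/* x*119 = (x*17)*7 */
uint32_t mul119(uint32_t x)
{
    uint32_t t = x + (x << 4);  /* t = x*17,  mirrors ADD r2,r3,r3,lsl #4 */
    return (t << 3) - t;        /* t*8 - t,   mirrors RSB r2,r2,r2,lsl #3 */
}
```

As with the assembly, results wrap modulo 2^32 when the product does not fit in 32 bits.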

Loading Constants (1/2)
- No single ARM instruction can load a 32-bit immediate constant directly into a register
  - All ARM instructions are 32 bits long
  - ARM instructions do not use the instruction stream as data
- A data processing instruction has 12 bits available for operand 2:
  - 8 bits for the constant, giving a range of 0-255
  - 4 bits to specify the number of right-rotation bits
- This gives a much larger range of constants that can be directly loaded, though some constants will still need to be loaded from memory
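Whether a given constant fits this encoding can be checked mechanically. The sketch below assumes the standard ARM rule that the 4-bit field specifies a right rotation by twice its value (0, 2, ..., 30); the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n taken modulo 32). */
static uint32_t rol32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v << n) | (v >> (32u - n)) : v;
}

/* True if `value` is an 8-bit constant rotated right by an even amount. */
bool arm_imm_encodable(uint32_t value)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a right rotation by `rot`; the result must fit in 8 bits. */
        if (rol32(value, rot) <= 0xFFu)
            return true;
    }
    return false;
}
```

For instance, 0x1000 is encodable (0x40 rotated right by 26), while 0x55555555 and 0x1FE are not and must be built another way or loaded from memory.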

Loading Constants (2/2)
- To load a constant, simply move the required value into a register; the assembler will convert it to the rotated form for us

        MOV r0,#4096          ;MOV r0,#0x1000 (0x40 ror 26)

- The bitwise complements can be formed using MVN:

        MOV r0,#&ffffffff     ;MVN r0,#0

- Values that cannot be generated automatically will cause an error from the assembler

Loading 32-bit Constants
- To allow larger constants to be loaded, the assembler offers a pseudo-instruction: LDR Rd,=const
- This will either:
  - Produce a MOV or MVN to generate the value, if possible, or
  - Generate an LDR instruction with a PC-relative address to read the constant from a literal pool (a constant data area embedded in the code)
- For example:

        LDR r0,=&ff           ;MOV r0,#0xff
        LDR r0,=&55555555     ;LDR r0,[pc,#imm12]

Multiple Register Data Transfer
- The load and store multiple instructions (LDM/STM) allow between 1 and 16 registers to be transferred to/from memory
  - The transfer order cannot be specified; the order of registers in the list is insignificant
  - The lowest register number is always transferred to/from the lowest memory location

        LDMIA r1,{r0,r2,r5}   ;r0:=mem[r1]
                              ;r2:=mem[r1+4]
                              ;r5:=mem[r1+8]

- The transferred registers can be either:
  - Any subset of the current bank of registers (default)
  - Any subset of the user mode bank of registers when in a privileged mode (postfix the instruction with a ^)
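A small C model makes the ordering rule concrete. It assumes the register list is held as a 16-bit mask with bit i selecting r_i, which is why the order written in the list cannot matter; the function name `ldmia` and the word-indexed memory array are hypothetical simplifications.

```c
#include <stdint.h>

/* Model of LDMIA base,{reglist} without base write-back.
   `mem` is word-indexed memory and `base_index` the word index held in
   the base register. Returns the index just past the last word read. */
uint32_t ldmia(uint32_t regs[16], uint16_t reg_mask,
               const uint32_t *mem, uint32_t base_index)
{
    uint32_t idx = base_index;
    for (unsigned r = 0; r < 16; r++) {
        if (reg_mask & (1u << r))
            regs[r] = mem[idx++];  /* lowest register gets lowest address */
    }
    return idx;
}
```

Calling it with the mask for {r0,r2,r5} loads three consecutive words into r0, r2 and r5 in that order, matching the LDMIA example.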

Stack Operations Using LDM/STM
- The stack pointer (SP) points to the top of stack (TOS)
  - Full stack: sp points to the data item at TOS
  - Empty stack: sp points to the next vacant slot

Branch and Link Instructions
- For a single-level function call (faster):
  - Perform a branch and save the return address in the link register, r14

            BL    SUBR                    ;branch to SUBR
            ...                           ;return here
    SUBR    ...                           ;subroutine entry point
            MOV   PC,r14                  ;return

- For nested function calls, r14 and work registers must be pushed onto a stack in memory

            BL    SUBR
            ...
    SUBR    STMFD r13!,{r0-r2,r14}        ;save work and link reg.
            ...
            LDMFD r13!,{r0-r2,pc}         ;return

Why Do We Need Assembly?
- Today, there are few reasons to write assembly code for RISC processors
  - The Interrupt Service Routine (ISR) is one of the reasons
- In addition to ISRs, there are some manipulations that can still be done more efficiently in assembly than in C code
  - Key point: compilers do not understand algorithms
  - Example:

        int b, c;         b = c / 2;
    vs.
        unsigned b, c;    b = c / 2;
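The difference between the two declarations can be made explicit. For unsigned operands, division by 2 is a single logical shift; for signed operands, C division truncates toward zero, so a compiler typically has to add a correction for negative values before shifting. The sketch below is illustrative: the function names are hypothetical, and `half_signed` assumes that right-shifting a negative value is an arithmetic shift, which is implementation-defined in C.

```c
#include <stdint.h>

/* Unsigned: c / 2 is exactly one logical shift right. */
uint32_t half_unsigned(uint32_t c)
{
    return c >> 1;
}

/* Signed: bias negative values by 1 so the shift rounds toward zero,
   matching the result of c / 2. Assumes arithmetic right shift. */
int32_t half_signed(int32_t c)
{
    int32_t bias = (c < 0) ? 1 : 0;
    return (c + bias) >> 1;
}
```

If the programmer knows the value is never negative, declaring it unsigned lets the compiler drop the correction.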

Example 1: Data-Packing/Unpacking
- Expand an array of signed half-words into an array of words (or vice versa):

            ADR   r1,array1       ;half-word array start
            ADR   r2,array2       ;word array start
            ADR   r3,endarr1      ;ARRAY1 end + 2
    Loop    LDRSH r0,[r1],#2      ;get signed half-word
            STR   r0,[r2],#4      ;save word
            CMP   r1,r3           ;check for end of array
            BLT   Loop            ;if not finished, loop

- Code segments for packing/unpacking are important in HW/SW codesign, but cannot be done efficiently in C

Example 2: Endian Swapping
- Swapping the endianness of a data stream is useful in multimedia standards and communication protocols
- Swapping the bytes in r0 can be done in merely 4 instructions in assembly

    byteswap                                ; R0 = A, B, C, D
            EOR R1, R0, R0, ROR #16         ; R1 = A^C, B^D, C^A, D^B
            BIC R1, R1, #0x00FF0000         ; R1 = A^C, 0, C^A, D^B
            MOV R0, R0, ROR #8              ; R0 = D, A, B, C
            EOR R0, R0, R1, LSR #8          ; R0 = D, C, B, A

- Ref.: David Seal, Ed., ARM Architecture Reference Manual, 2nd Ed., Addison-Wesley, 2000
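The four-instruction sequence can be checked by transcribing it into C, one statement per instruction. This is an illustrative sketch with hypothetical function names; the BIC mask used here is 0x00FF0000, a value that fits the 8-bit rotated immediate encoding of a single ARM instruction.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n taken modulo 32). */
static uint32_t ror32(uint32_t v, unsigned n)
{
    n &= 31u;
    return n ? (v >> n) | (v << (32u - n)) : v;
}

/* Reverse the byte order of r0 = A,B,C,D to D,C,B,A. */
uint32_t byteswap(uint32_t r0)
{
    uint32_t r1 = r0 ^ ror32(r0, 16);  /* EOR R1,R0,R0,ROR #16 */
    r1 &= ~0x00FF0000u;                /* BIC R1,R1,#0x00FF0000 */
    r0 = ror32(r0, 8);                 /* MOV R0,R0,ROR #8 */
    return r0 ^ (r1 >> 8);             /* EOR R0,R0,R1,LSR #8 */
}
```

Applying `byteswap` twice returns the original word.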

Thumb-mode Instr. Compression
- The Thumb instruction set is a subset of the ARM instruction set
  - R0-R7: fully accessible
  - High registers R8-R12: only accessible with MOV, ADD, CMP; only CMP sets the condition code flags
- Most Thumb instructions use unconditional execution
- Many Thumb data processing instructions use a two-address format, i.e., the destination register is the same as one of the source registers
- Thumb instruction formats are less regular than ARM instruction formats because of the dense encoding

Mixed-State Operations
- The code density of Thumb and its performance on narrow memory systems make it ideal for the bulk of C code in embedded systems
- However, there is still a need to switch between ARM and Thumb state within most applications:
  - ARM state provides better performance for wide memory
  - Some functions are only provided in ARM state
    - Access to CPSR
    - Access to coprocessors
  - Exception handling
    - ARM state is automatically entered upon exception

Switching between States
- State switching is achieved using the Branch Exchange instructions
  - In Thumb state: BX Rn
  - In ARM state (on Thumb-aware cores only): BX<condition> Rn
  - Rn can be any register (R0 to R15)
- Bit 0 of Rn specifies the state to change to
  - Bit 0 of Rn is copied into the T bit in CPSR
  - Bits 31:1 of Rn are copied into PC
  - Bit 0 of PC is set to 0 in both modes
  - Bit 1 of PC is set to 0 in ARM mode
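The bit rules for BX can be summarized as a small C function. This is an illustrative model only: the `BxResult` type and `branch_exchange` function are hypothetical, and the model simply applies the rules listed above to the value in Rn.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of executing BX Rn. */
typedef struct {
    bool     thumb;  /* new value of the T bit in CPSR */
    uint32_t pc;     /* new program counter */
} BxResult;

BxResult branch_exchange(uint32_t rn)
{
    BxResult s;
    s.thumb = (rn & 1u) != 0;  /* bit 0 of Rn selects the state */
    s.pc = rn & ~1u;           /* bits 31:1 go to PC, bit 0 is cleared */
    if (!s.thumb)
        s.pc &= ~2u;           /* ARM state: bit 1 is cleared as well */
    return s;
}
```

So a branch to a Thumb routine at address 0x8000 is requested by loading 0x8001 into Rn.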

Review: Converged Multimedia Platform
- If hardware/software co-design is done correctly, the only architecture we need is possibly just:
[Figure: converged platform block diagram. Labels: GPP core, 2nd-level cache, ASIP cores, DSP core, smart DMA, smart interconnect interface, external memory interface, shared RAM, single-port RAM, dual-port RAM]

A Real Example: TI OMAP
- Open Multimedia Application Platform (OMAP):
[Figure: OMAP block diagram. External memories: SDRAM, SRAM, Flash. On-chip: internal SRAM / frame buffer, DMA, traffic controller, memory I/F, LCD controller. 55x DSP subsystem: DARAM, SARAM, DMA, MPUI. TI-enhanced ARM925T subsystem. Peripherals: UART, USB host/client, McBSP, SD/MMC, timers (x3), WDT, RTC, interrupt controllers, GPIO, mailbox, I2C host, camera I/F, Memory Stick]

General DSP Concept
- DSPs are special-purpose processors that are designed to provide the flexibility of a CPU and speed close to that of an ASIC (for certain applications)
- DSP programs are usually stored in on-chip memory
  - To reduce the code size, assembly programming is usually used for DSP software development
- Key issues for a DSP design:
  - Speed
  - Power consumption
  - Code density

Filter: Motivating Problem for DSP
- DSPs were originally designed to perform fast calculation of FIR filters: $y(n) = \sum_{k=0}^{K-1} h(k)\, x(n-k)$
- The key architecture: Multiply-and-Accumulate (MAC)
[Figure: MAC datapath. h(k) and x(n-k) feed the multiplier, whose output passes through ADD/SUB into the accumulator, producing y(n)]
- Each MAC operation requires two data fetches, one multiply, one accumulate, and a write-back
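The FIR sum maps directly onto a MAC loop. The C sketch below is illustrative only: the function name `fir_sample` and the choice of 16-bit samples with a 64-bit accumulator are assumptions made here, not details from the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute y(n) = sum over k = 0..K-1 of h(k) * x(n-k).
   The caller must ensure n >= K-1 so that x[n-k] stays in bounds. */
int64_t fir_sample(const int16_t *h, size_t K, const int16_t *x, size_t n)
{
    int64_t acc = 0;  /* wide accumulator, so partial sums cannot overflow */
    for (size_t k = 0; k < K; k++)
        acc += (int32_t)h[k] * x[n - k];  /* one multiply-accumulate */
    return acc;
}
```

Each loop iteration performs the two data fetches, one multiply and one accumulate that a MAC unit is built to execute in a single cycle.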

TI-C55x CPU Architecture
[Figure: TI-C55x CPU architecture]

Memory-Mapped Registers
- Most C55x registers are mapped to memory addresses. For example:
  - Auxiliary registers are mapped to 0x10 ~ 0x17
- This is very convenient for MCU-DSP inter-processor communication

Instruction Buffer Unit (I Unit)
- During each CPU cycle, the I unit:
  - Receives four bytes of code from the 32-bit program bus
  - Decodes one to six bytes of code at the head of the queue
- The I unit passes the decoded information to the P unit, the A unit, and the D unit for execution of the instructions
- Great for looping block code!

Address Data Flow Unit (A Unit)
- Generates the addresses for data read/write accesses
  - Contains all the logic and registers necessary to calculate the addresses for the three data-read address buses and the two data-write address buses
- Contains a general-purpose 16-bit arithmetic logic unit (ALU) with shifting capability
  - Typically used for address calculation
- Ref.: C55x Technical Overview, TI spru393, page 2-9

Data Computation Unit (D Unit) (1/2)
- Primary part of the CPU where data is processed:
  - Three data-read buses feed the two MAC units and the 40-bit ALU
  - The parallelism of the D unit minimizes the required task cycles
- Ref.: TMS320C55x Technical Overview, TI spru393, page 2-11

Data Computation Unit (D Unit) (2/2)
- Dual MAC architecture:
  - In a single cycle, each MAC unit can perform a 17-bit by 17-bit multiplication and a 40-bit addition or subtraction with optional 32-/40-bit saturation
  - The three data-read buses can carry two data streams and a common coefficient stream to the two MAC units
  - The results from the MAC units can be placed in any of the four 40-bit accumulators within the D unit
- Other modules, e.g. the ALU and shifter, are common in general-purpose processors (like ARM)

C55x Image/Video HW Extension
- C55x has the following hardware extensions for image and video applications:
  - DCT/IDCT
  - Pixel interpolation
  - Motion estimation
- Ref.: TMS320C55x Hardware Extensions for Image/Video Applications Programmer's Reference, TI spru098

Pixel Interpolation HW Ext. (1/2)
- Pixel interpolation extension for video encoding
  - Note: Rnd = 0 or 1
- An instruction is provided to handle a 2x2 block

Pixel Interpolation HW Ext. (2/2)
- Pixel interpolation extension for video decoding
  - Note: only one of the U-, M-, or R-type pixels is required for decoding; M is shown as an example here
- No special instruction is designed for decoding; the existing architecture can perform this efficiently

Optimizing Data Types
- The natural data types of DSP processors are very different from those of RISC processors. For example, the TI C55x data types are:

    Type         Size
    char         16 bits
    short        16 bits
    int          16 bits
    long         32 bits
    long long    40 bits
    float        32 bits
    double       64 bits

- Exchanging optimal data structures between RISC and DSP becomes a crucial issue

Coding for Compiler Optimization
- How do we write C code to use the 16-bit × 16-bit → 32-bit engine in the DSP?

        long res = (long)(int)src1 * (long)(int)src2;

- How about using the 32-bit × 16-bit → 32-bit MAC engine?

        long mult(int a, int b)
        {
            long result;
            result = a * b;           /* incorrect: multiplied as 16-bit int */
            result = (long)(a * b);   /* incorrect: cast applied after the 16-bit multiply */
            result = (long)a * b;     /* correct: operands widened before the multiply */
            return result;
        }

Software Pipelining
- With proper hints to the compiler, you can enable software pipelining in a loop using C code:

        int a[8], b[8];
        long e[8], f[8], c[8], d[8];
        ...
        for (idx = 0; idx < 8; idx++)
        {
            e[idx] = (long) a[idx]*b[idx];                    /* use MAC */
            f[idx] = (long)(int)c[idx] * (long)(int)d[idx];   /* use ALU-16 */
        }
