Efficient Loop Handling for DSP algorithms on CISC, RISC and DSP processors


Slide 1 -- Efficient Loop Handling for DSP algorithms on CISC, RISC and DSP processors
M. Smith, Electrical and Computer Engineering, University of Calgary, Alberta, Canada (smithmr@ucalgary.ca)

Slide 2 -- Key elements in DSP algorithms
- Instruction fetches must be efficient
- Data fetches / stores (often multiple) must be efficient
- Multiplication must be efficient and accurate, and remain precise
- Addition must be efficient and accurate, and remain precise
- Decision logic to control all the operations above must be efficient
- Program flow is the key control operation

Slide 3 -- To be tackled today
- Performing operations on an array
- Loop overhead can steal many cycles
- Loop overhead depends on implementation:
  - Standard loop with test at the start -- while ()
  - Initial test with additional test at end -- do-while ()
  - Down-counting loops
- Special efficiencies: CISC -- hardware; RISC -- intelligent compilers; DSP -- hardware

Slide 4 -- Background to Audio Channel Modelling
(Diagram: audio channels with DELAY 0, DELAY_LEFT, DELAY_RIGHT)
- No relative delay modelled into the audio channel -- the sound is perceived in the centre of the head
- Modelling a relative delay into the right-ear audio channel -- the sound arrival will shift the sound to the left, as the sound seems to reach the left ear first
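
To make the relative-delay idea on slide 4 concrete, here is a minimal C sketch (not taken from the course code) in which the left channel passes straight through while the right channel is read DELAY_RIGHT samples late, so the sound appears to reach the left ear first. All names here (apply_relative_delay, DELAY_RIGHT, BLOCK, the buffers) are illustrative assumptions.

    /* Sketch only: delay the right channel by DELAY_RIGHT samples relative
     * to the left channel.  Assumes DELAY_RIGHT <= n <= BLOCK.            */
    #include <stddef.h>

    #define DELAY_RIGHT 32          /* relative delay in samples (assumed) */
    #define BLOCK       256

    static float right_history[DELAY_RIGHT + BLOCK];  /* old + new right samples */

    void apply_relative_delay(const float *left_in, const float *right_in,
                              float *left_out, float *right_out, size_t n)
    {
        /* append the new right-channel samples after the saved history */
        for (size_t i = 0; i < n; i++)
            right_history[DELAY_RIGHT + i] = right_in[i];

        for (size_t i = 0; i < n; i++) {
            left_out[i]  = left_in[i];          /* left channel: no delay */
            right_out[i] = right_history[i];    /* right channel: delayed */
        }

        /* keep the most recent DELAY_RIGHT right samples for the next block */
        for (size_t i = 0; i < DELAY_RIGHT; i++)
            right_history[i] = right_history[n + i];
    }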

Slide 5 -- FIFO buffer update via Memory Move Example
Implement a delay line. Try zero delay on the left channel and a large delay on the right channel, then try reading in from a microphone.

    void MemoryMove_Delay_CPP(void) {
        int count;

        // Insert new value into the back of the FIFO delay line
        left_delayline[0 + LEFT_DELAY_VALUE] = channel1_in;

        // Grab delayed value from the front of the FIFO delay line
        channel1_out = left_delayline[0];

        // Update the FIFO delay line using inefficient
        // memory-to-memory moves
        for (count = 0; count < LEFT_DELAY_VALUE; count++)
            left_delayline[count] = left_delayline[count + 1];
    }

Slide 6 -- Pointer FIFO Delay Line

    float FIFO[N];
    float *pt_in  = &FIFO[DELAY];
    float *pt_out = &FIFO[0];

    void PointerFIFO_CPP(void) {
        // Insert new value into the back of the FIFO delay line
        *pt_in++ = channel1_in;     // read pt_in value, use it, store new pt_in value

        // Grab delayed value from the front of the FIFO delay line
        channel1_out = *pt_out++;

        // Wrap the pointers back to the start of the buffer
        if (pt_in > &FIFO[DELAY])
            pt_in = pt_in - DELAY;
        if (pt_out > &FIFO[DELAY])
            pt_out = pt_out - DELAY;
    }

- Requires additional reads and stores of the static memory locations where the pointers are stored
- Requires compares and jumps -- pipeline issues on jumps

Slide 7 -- Labs delay line -- Concept
(Diagram: FILTER 30 LEFT, FILTER 330 LEFT, FILTER 330 RIGHT, FILTER 30 RIGHT)
- Get ambience by taking into account constructive and destructive interference around the face
- This implies knowing the characteristics of the audio channel and modelling them using an FIR filter -- 2 FIR filters per speaker -- the processing requirement is increasing

Slide 8 -- Real-time FIR Filter

    float fir_30[], fir_330[];

    void FIRFilter(void) {
        // Insert new value into FIFO delay line
        left_delayline[0 + N]  = (float) channelleft_in;
        right_delayline[0 + N] = (float) channelright_in;

        channel_one_30 = channel_one_330 = 0;

        // Need the equivalent of the following loop for EACH sound source
        for (count = 0; count < FIRlength - 1; count++) {
            channel_one_30  = channel_one_30  + left_delayline[count]  * fir_30[count];
            channel_one_330 = channel_one_330 + right_delayline[count] * fir_330[count];
        }

        channelleft_out = (int) (channel_one_30 + scale_factor * channel_one_330);
        // ditto for channel 2 (right output)

        // Update Left Channel delay line; update Right Channel delay line
    }
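
The memory-move update on slide 5 and the pointer wrap tests on slide 6 both add per-sample overhead. One common alternative, shown below as a sketch with assumed names (it is not the lab's actual code), is to leave the samples where they are and walk a circular index through a fixed buffer, which is also the access pattern the FIR loop on slide 8 needs.

    #define N_TAPS 300

    static float state[N_TAPS];   /* circular delay line                        */
    static int   write_idx = 0;   /* index of the most recently written sample  */

    float fir_sample(float new_sample, const float *coeffs /* N_TAPS long */)
    {
        /* overwrite the oldest sample with the newest one */
        state[write_idx] = new_sample;

        /* accumulate coeffs[k] * x[n-k], walking back from the newest
           sample and wrapping at the start of the buffer               */
        float acc = 0.0f;
        int   idx = write_idx;
        for (int k = 0; k < N_TAPS; k++) {
            acc += coeffs[k] * state[idx];
            idx = (idx == 0) ? N_TAPS - 1 : idx - 1;   /* circular wrap */
        }

        /* advance the write index for the next call */
        write_idx = (write_idx + 1 == N_TAPS) ? 0 : write_idx + 1;
        return acc;
    }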

Slide 9 -- Real-time FIR Hard-coded loop

    channel_one_30 = channel_one_30 + arrayleft[0] * fir_30[0];
    channel_one_30 = channel_one_30 + arrayleft[1] * fir_30[1];
    channel_one_30 = channel_one_30 + arrayleft[2] * fir_30[2];
    channel_one_30 = channel_one_30 + arrayleft[3] * fir_30[3];
    channel_one_30 = channel_one_30 + arrayleft[4] * fir_30[4];
    channel_one_30 = channel_one_30 + arrayleft[5] * fir_30[5];
    channel_one_30 = channel_one_30 + arrayleft[6] * fir_30[6];
    channel_one_30 = channel_one_30 + arrayleft[7] * fir_30[7];

- No loop overhead, but a heavy memory penalty -- the FIR filters are 300 taps * 4 filters
- Using pt++ style memory operations rather than direct memory access with an offset is faster on SOME processors!!

Slide 10 -- Timing required to handle DSP loops
- for k = 0 to (N-1) -- could require many lines of code
- Body of Code -- BofC cycles -- could be 1 line
- Endfor -- could require many lines of code, jumps and counter updates
- Important feature -- how much overhead time is used in handling the loop construct itself?
- Three components: set-up time, body of code time (BofC cycles), handling the loop itself

Slide 11 -- Basic loop body
- Set up loop -- loop overhead -- done once
- Check conditions -- loop overhead -- done many times
- Do code body -- done many times -- useful
- Loop back + counter increment -- loop overhead -- done many times
- Define Loop Efficiency = useful cycles / total cycles
    = N * Tcodebody / (Tsetup + N * (Tcodebody + Tconditions + Tloopback))
- Different efficiencies depending on the size of the loop
- Need to learn good approximation techniques and recognize the two extremes

Slide 12 -- 3 different basic loop constructs
- While loop: main compare test at the top of the loop
- Modified do-while loop with initial test: initial compare test at the top, main compare test at the bottom of the loop
- Down-counting do-while loop with initial test: no compare operations in the test; relies on the condition code flags set while adjusting the loop counter; can increase overhead in some algorithms
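
The loop-efficiency definition on slide 11 is easy to play with numerically. The short C program below simply transcribes the formula; the cycle counts passed in are made-up placeholders, not measurements for any of the processors discussed later.

    /* efficiency = N * Tcodebody /
     *              (Tsetup + N * (Tcodebody + Tconditions + Tloopback))  */
    #include <stdio.h>

    static double loop_efficiency(int n, double t_setup, double t_body,
                                  double t_cond, double t_loopback)
    {
        return (n * t_body) /
               (t_setup + n * (t_body + t_cond + t_loopback));
    }

    int main(void)
    {
        /* illustrative numbers only: small body vs large body, small N vs large N */
        printf("small body (4 cycles),   N = 1000: %.2f\n",
               loop_efficiency(1000, 12.0, 4.0, 28.0, 32.0));
        printf("large body (400 cycles), N = 1000: %.2f\n",
               loop_efficiency(1000, 12.0, 400.0, 28.0, 32.0));
        printf("large body (400 cycles), N = 4:    %.2f\n",
               loop_efficiency(4, 12.0, 400.0, 28.0, 32.0));
        return 0;
    }

With a small body, the per-iteration overhead dominates and the efficiency collapses; with a large body, the set-up term barely matters even for small N, which is the pair of extremes the slide asks you to recognize.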

Slide 13 -- Memory read timing
Reference: Clements, Microprocessor Systems Design, PWS Publishing.
(Timing diagram: data from the memory appears near the end of the read cycle.)

Slide 14 -- Review -- CISC processor instruction phases
- Fetch -- obtain op-code: PC value out on the Address Bus; instruction op-code at Memory[PC] on the Data Bus and then into the Instruction Register
- Decode -- bringing the required values (internal or external) to the ALU input:
  - Immediate -- additional memory access for the value -- Memory[PC]
  - Absolute -- additional memory access for the address value and then a further access for the value -- Memory[Memory[PC]]
  - Indirect -- additional memory access to obtain the value at Memory[AddressReg]
- Execute -- ALU operation
- Writeback -- ALU value to internal/external storage; may require additional memory accesses to obtain the address used during storage; may require additional memory operations to perform the storage

Slide 15 -- Basic 68K CISC loop -- Test at start

    MOVE.L  #0, count        ; set up -- count in register
                             ; fetch instr. (FI 4) + fetch 32-bit constant (FC 2*4) + operation (OP 0)
    LOOP:
    CMP.L   #N, count        ; (FI 4, FC 8, OP -- 32-bit subtract)
    BGE     ENDLOOP          ; actually ADD.L #(ENDLOOP - 4), PC
                             ; (add of 16-bit displacement to PC -- FI 4, FC 4, OP 0 or 4)
    Body cycles              ; doing FIR perhaps
    ADD.L   #1, count
    JMP     LOOP
    ENDLOOP:                 ; this is actually a numerical value (an address)

Total cycles = 12 + N * (28 + BodyCycles + 32)
Since (28 + 32) >> 12 (5 times), ignore the start-up cycles even if N is small.

Slide 16 -- Check at end -- 68K CISC loop

    MOVE.L  #0, count        ; (FI 4, FC 8, OP 0)
    JMP     LOOPTEST         ; NOTE the JUMP down to the test
    LOOP:
    Body cycles              ; doing FIR perhaps
    ADD.L   #1, count
    LOOPTEST:
    CMP.L   #N, count
    BLT     LOOP

Total cycles = N * BodyCycles + 44 * (N + 1) plus start-up
Since 44 > 26 (only 1.8 times), you can't ignore the start-up cycles when N is small and BodyCycles is small -- a small loop means an inefficient loop.
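
For reference, these are the C shapes that the two 68K fragments above correspond to (a sketch only; process() stands in for the "Body cycles" and is not an identifier from the slides). Compilers commonly rotate a plain for loop into the second, jump-to-the-test form.

    void process(int i);                 /* stands in for "Body cycles" (assumed) */

    /* Test at the start (slide 15): compare and conditional branch every
     * iteration, plus an unconditional jump back to the top.              */
    void test_at_start(int n)
    {
        for (int count = 0; count < n; count++)
            process(count);
    }

    /* Check at the end (slide 16): one initial jump down to the test, then
     * each trip around pays only the body, increment, compare and branch.  */
    void check_at_end(int n)
    {
        int count = 0;
        if (count < n) {                 /* the initial test reached by the jump */
            do {
                process(count);
                count = count + 1;
            } while (count < n);
        }
    }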

Slide 17 -- Down Count -- 68K CISC loop

    MOVEQ.L #0, array_index        ; (FI 4, FC 0, OP 0)
    MOVE.L  #N, count              ; (FI 4, FC 0, OP 0)
    LOOP:
    Body cycles                    ; using instructions of the form OPERATION (Addreg, Index)
    ADDQ.L  #1, array_index        ; (FI 4, FC 0, OP 0?)
    SUBQ.L  #1, count              ; (FI 4, FC 0, OP 0?)
    LOOPTEST:
    BGT     LOOP

Total cycles = 24 + N * BodyCycles + 20 * (N + 1)   (was 44 * (N + 1))
Since 20 < 24, you can't ignore the start-up cycles if N is small and BodyCycles is small.

Slide 18 -- Down Count -- Possible sometimes

    MOVEA.L #array_start, Addreg   ; (FI 4, FC 0, OP 0)
    MOVE.L  #N, count              ; (FI 4, FC 0, OP 0)
    LOOP:
    Body cycles                    ; using auto-increment mode: OPCODE (Addreg)+
    SUBQ.L  #1, count
    LOOPTEST:
    BGT     LOOP                   ; (FI 4, FC 0, OP 0?)

Total cycles = 24 + N * BodyCycles + 16 * (N + 1)   (was 20 * (N + 1))
Since 16 < 24, you can't ignore the start-up cycles if N is small and BodyCycles is small.
NOTE -- the number of cycles needed in the body of the loop also decreases in this case.

Slide 19 -- Loop efficiency on a CISC processor
- Efficiency depends on how the loop is constructed:
  - Standard while-loop
  - Check at end -- modified do-while
  - Down counting -- with/without auto-incrementing addressing modes
- Compiler versus hand-coded efficiency -- see Embedded System Design magazine, Sept./Oct. 2000; a local copy is available on the ENCM515 web pages
- What happens with different processor architectures?

Slide 20 -- Check at end -- 29K RISC loop

    CONST   count, 0
    JUMP    LOOPTEST
    NOP                            ; delay slot
    LOOP:
    Body cycles                    ; auto-incrementing mode -- NOT AN OPTION ON 29K
    ADDU    count, count, 1
    LOOPTEST:
    CPLE    TruthReg, count, N     ; (1 cycle; should be 2 -- register forwarding)
                                   ; (Boolean truth flag in TruthReg -- which could be any register)
    JMPT    TruthReg, LOOP
    NOP                            ; delay slot

Total cycles = 3 + N * BodyCycles + 4 * (N + 1)
Since 4 is comparable to 3, you can't ignore the start-up cycles if N is small and BodyCycles is small.
Since we are dealing with single-cycle operations, the body cycle count is smaller than on the CISC. This means the loop overhead becomes more problematic the more efficient the processor is.
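
Here is a C rendering of the down-counting, auto-incrementing idea on slides 17 and 18 (a sketch; array, n and process_element() are assumed names, not course identifiers). Counting down lets the flags set by the decrement drive the branch, so no separate compare against N is needed.

    void process_element(float x);        /* stands in for "Body cycles" (assumed) */

    void down_count(float *array, int n)  /* array plays the role of Addreg        */
    {
        float *p = array;                 /* MOVEA.L #array_start, Addreg          */
        int count = n;                    /* MOVE.L  #N, count                     */
        while (count > 0) {
            process_element(*p++);        /* body uses (Addreg)+ auto-increment    */
            count = count - 1;            /* SUBQ.L #1, count sets the flags       */
        }                                 /* BGT LOOP reuses those flags           */
    }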

Slide 21 -- Down Count -- 29K RISC loop

    CONST   index, 0               ; 1 cycle
    JUMP    LOOPTEST               ; 1 cycle
    CONST   count, N               ; in delay slot
    LOOP:
    Body cycles
    SUBU    count, count, 1        ; 1 cycle
    LOOPTEST:
    CPGT    TruthReg, count, 0     ; 1 cycle
    JMPT    TruthReg, LOOP         ; 1 cycle
    ADDS    index, index, 1        ; in delay slot

Total cycles = 3 + N * BodyCycles + 4 * (N + 1)

Slide 22 -- Efficiency on RISC processors
- Not much difference between the test-at-end and down-count loop formats
- HOWEVER the body-cycle count has decreased
- The processor is highly pipelined -- loop jumps cause the pipeline to stall
- Need to take advantage of the delay slots
- Efficiency depends on the DSP algorithm being implemented?
- What about DSP processors? Their architecture is designed for efficiency in this area.

Slide 23 -- Check at end -- ADSP-21K loop

    count = 0;
    number = N;
    JUMP LOOPTEST (DB);            // jump to loop end (delayed branch)
    NOP;
    NOP;
    LOOP:
    Body cycles
    count = count + 1;
    LOOPTEST:
    Comp(count, number);
    IF LT JUMP LOOP (DB);
    NOP;
    NOP;

Total cycles = N * BodyCycles + 5 * (N + 1) plus start-up

Slide 24 -- Speed improvement -- Possible?

    count = 1;
    number = N;
    JUMP LOOPTEST (DB);
    count = count - 1;             // ADJUST
    number = number - 1;
    LOOP:
    Body cycles
    count = count + 1;
    LOOPTEST:
    Comp(count, number);
    IF LT JUMP LOOP (DB);
    count = count + 1;
    NOP;

Total cycles = N * BodyCycles + 4 * (N + 1)

Slide 25 -- Down Count -- ADSP-21K loop

    number = 0;
    JUMP (PC, LOOPTEST) (DB);
    index = 0;
    count = N;
    LOOP:
    Body cycles
    count = count - 1;
    LOOPTEST:
    Comp(count, number);
    IF GT JUMP (PC, LOOP) (DB);
    index = index + 1;
    NOP;

Total cycles = 4 + N * BodyCycles + 5 * (N + 1)

Slide 26 -- Improved Down Count -- ADSP-21K loop
Is the code valid -- or one off in the number of times around the loop? (A quick check of this question follows slide 28 below.)

    number = -1;                   // bias the loop counter (1 cycle)
    JUMP (PC, LOOPTEST) (DB);
    index = 0;
    count = (N - 1);
    LOOP:
    Body cycles
    LOOPTEST:
    Comp(count, number);
    IF GT JUMP (PC, LOOP);
    index = index + 1;
    count = count - 1;

Total cycles = 4 + N * BodyCycles + 4 * (N + 1)

Slide 27 -- Faster loops
Need to go to special features:
- CISC -- a special Test, Conditional Jump and Decrement in one instruction
- RISC -- change the algorithm format
- DSP -- special hardware for loops
  - Maximum of six nested loops (or just 2 on some processors)
  - Can be a hidden trap when mixing C and assembly code

Slide 28 -- Recap -- 68K CISC loop, down count

    MOVEQ.L #0, index              ; (FI 4, FC 0, OP 0)
    MOVE.L  #N, count              ; (FI 4, FC 0, OP 0)
    LOOP:
    Body cycles
    ADDQ.L  #1, index              ; (FI 4, FC 0, OP 0?)
    SUBQ.L  #1, count              ; (FI 4, FC 0, OP 0?)
    LOOPTEST:
    BGT     LOOP

Total cycles = 24 + N * BodyCycles + 20 * (N + 1)
Since 24 is comparable to 20, you can't ignore the start-up cycles if N is small and BodyCycles is small.
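
Slide 26 asks whether biasing the counter (number = -1, count = N-1) leaves the loop one iteration off. One quick host-side way to convince yourself is the C harness below (a sketch, not SHARC code; trips() is an invented name). It mirrors the compare-then-branch structure and assumes the two instructions after the conditional jump execute each time around, i.e. delayed-branch behaviour; under that assumption the body runs exactly N times.

    #include <stdio.h>

    static int trips(int n)
    {
        int number = -1;       /* biased loop "limit"   */
        int count  = n - 1;    /* biased loop counter   */
        int index  = 0;
        int body_runs = 0;

        goto looptest;                    /* JUMP (PC, LOOPTEST)           */
    loop:
        body_runs++;                      /* "Body cycles"                 */
    looptest:
        if (count > number) {             /* Comp + IF GT JUMP             */
            index = index + 1;            /* executed on the way around    */
            count = count - 1;
            goto loop;
        }
        (void)index;                      /* index kept only to mirror the slide */
        return body_runs;
    }

    int main(void)
    {
        for (int n = 0; n <= 4; n++)
            printf("N = %d -> body executed %d times\n", n, trips(n));
        return 0;
    }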

Slide 29 -- Hardware 68K CISC loop

    MOVEQ.L #0, index              ; (FI 4, FC 0, OP 0)
    MOVE.L  #(N-1), count          ; (FI 4, FC 0, OP 0)
    LOOP:
    Body cycles
    ADDQ.L  #1, index              ; (FI 4, FC 0, OP 0?)
    DBCC    count, LOOP

Total cycles = N * BodyCycles + 16 * (N + 1) plus start-up
There is a possibility that the efficiency is almost 100% if the body instructions are small enough to fit into the cache.

Slide 30 -- Custom loop hardware on RISC
- For long loops -- the loop overhead is small -- no need to be concerned about it (unless there is a loop within the loop)
- For small loops -- unroll the loop, so that 20 instructions are hard-coded rather than 1 instruction looped 20 times (see the sketch after slide 32 below)
- For medium loops -- the advantage over CISC is normally that the instructions are more efficient -- 1 cycle compared to many cycles
- For medium loops -- the advantage over DSP is normally that the instructions are more efficient -- 1 RISC cycle compared to 2 DSP cycles (not on the 21K, since it is 1 to 1)
- For more information, see the Micro 1992 articles and the CCI articles

Slide 31 -- 21K processor architecture
(Block diagram of the ADSP-21K processor architecture)

Slide 32 -- Recap -- Improved Down Count -- 21K DSP loop

    number_r1 = -1;
    JUMP (PC, LOOPTEST) (DB);
    index_m4 = 0;
    count_r2 = (N - 1);
    LOOP:
    Body cycles
    LOOPTEST:
    Comp(count_r2, number_r1);     // (1 cycle)
    IF GT JUMP (PC, LOOP);
    index_m4 = index_m4 + 1;
    count_r2 = count_r2 - 1;

Total cycles = N * BodyCycles + 4 * (N + 1) plus start-up
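
As a concrete illustration of the "unroll small loops" advice on slide 30, here is a hedged C sketch of an FIR sum unrolled by four: the branch and counter overhead is paid once per four taps instead of once per tap. The function name and parameters are invented, and the sketch assumes n_taps is a multiple of 4.

    float fir_unrolled_by_4(const float *x, const float *coeff, int n_taps)
    {
        float acc = 0.0f;
        for (int k = 0; k < n_taps; k += 4) {   /* one test/branch per 4 taps */
            acc += x[k]     * coeff[k];
            acc += x[k + 1] * coeff[k + 1];
            acc += x[k + 2] * coeff[k + 2];
            acc += x[k + 3] * coeff[k + 3];
        }
        return acc;
    }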

Slide 33 -- Hardware Loop -- 21K DSP loop

    count_r0 = N;
    count_r0 = PASS count_r0;      // sets the condition codes without doing any math
                                   // (allows a parallel operation)
    IF LE JUMP (PC, PASTLOOP) (DB);
    index = 0;
    nop;
    LCNTR = count_r0, do (PC, PASTLOOP-1) until LCE;   // hardware loop -- 1 cycle -- parallel instruction
    Body cycles
    PASTLOOP:                      // the last cycle of the loop is at location PASTLOOP - 1
    Rest of the program code

Total cycles = N * BodyCycles plus a small start-up, so the efficiency
    N * BodyCycles / (start-up + N * BodyCycles)
approaches 100%.

Slide 34 -- High Speed Loops -- Hardware and Software (section title)

Slide 35 -- DSP hardware loop
Efficiency comes from a number of areas:
- Hardware counter
- No overhead for the decrement
- No overhead for the compare
- Pipelining is efficient -- the processor knows to fetch instructions from the start of the loop, not from past the loop
- Has some problems if the loop size is too small -- the loop timing is longer than expected, as the processor needs to flush the pipeline and restart it

Slide 36 -- Tackled today
- Performing accesses to memory in a loop
- Loop overhead can steal many cycles
- Loop overhead -- depends on implementation:
  - Standard loop with test at the start -- while ()
  - Initial test with test at end -- do-while ()
  - Down-counting loops
- Special efficiencies: CISC -- hardware; RISC -- intelligent compilers; DSP -- hardware
