Design of Embedded DSP Processors Unit 8: Firmware design and benchmarking. 9/27/2017 Unit 8 of TSEA H1 1

2 Contents
Introduction to FW and its coding flow
1. Application modeling under HW constraints
2. Stream-kernel (master/slave) programming
3. Programming algorithm / computing kernels
4. Assembly code implementation
5. Code benchmarking and integration

3 FW design flow

4 Firmware
- FW is SW with fixed functions that is "firmed" into a system (fixed, but not yet HW).
- FW is permanently installed in non-volatile memory and rarely changed.
- Typical examples: baseband firmware in an SDR processor, video CODEC firmware in a TV or in a surveillance camera.

5 FW coding / implementation flow
[Flow diagram; boxes in order of appearance]
- Documents, STD
- High level behavior modeling, followed by code inspection
- HW constraints
- HW related C-modeling
- Assembly programming (source xx.asm, assembler) or C-compiler (source xxx.c), followed by code inspection
- Object files xxx.bin, LIB
- Object linker
- Simulator / debugger

6 The role of programmer / compiler
1. Programmer: partitions the application and assigns the parts to different instruction domains / streams, codes and debugs each domain, and integrates the heterogeneous codes. Within an instruction stream, the programmer writes kernel code to approach the best performance.
2. Compiler: translates C into the machine language of its target and optimizes the translation.
3. API: finally added by a programming model.

7 Understand applications
- Product: portable audio player, DTV and video player
- Application components: RTOS, audio decoder, voice encoder, DVB modem, video decoder
- Function kernels: filter, (I)DCT, Huffman decoder, waveform generator, (I)FFT
- Innermost loop design

8 Task partition, allocation, scheduling before coding / compiling
- Mostly done by hand; tools are rarely available.
- Based on computing cost prediction (code profile), algorithm features, and HW constraints.
- There are different partition objectives:
  - highest performance
  - lowest power (lower speed, less communication)
  - lowest memory cost
  - job balancing

9 FW design flow (simplified firmware design flow)
[Figure: design steps on one side, modeling levels on the other]
- Steps: understanding applications, HW aware algorithm selections, high level language modeling, finite length design, coding finite length firmware, expose memory costs, coding FW with memory costs, run time budget, coding cycle accurate FW, re-allocatable assembly coding, binary machine code
- Modeling levels: behavior modeling, bit accurate modeling, memory accurate modeling, timing budget, assembly coding
- Design entries 1, 2, and 3
Image sources: embedded.com, codehelp.co.uk
Copyright of Linköping University, all rights reserved

10 High level FW design

11 Algorithm selection
- Function! Do not forget your function!
- Select algorithms for the architecture (adapt to the HW's 1. advanced features and 2. constraints).
- Reuse available algorithms (SW reuse).
- Minimize computing cost (innermost loop).
- Minimize code cost (of high level code).
- Minimize data accesses (the main focus today).

12 Stream-kernel based programming
Stream
- The main program consists of an FSM that prepares and uses subroutines.
- Prolog (start a subroutine in a device)
- Epilog (finish a subroutine in a device, hand over results)
- API insertion: CUDA, OpenCL, OpenGL, OpenMP
Kernel
- Interwork, task/resource management, and function calls
- Speed up innermost loops by assembly level coding
- That is what we are going to do today!

13 Assembly kernel coding

14 Finite length
Finite length
- Integer/fractional data with limited dynamic range
- Low cost/power with acceptable quantization noise
Technique
- Integer/fractional guard bits for iterations
- Scaling, and rounding before truncation
- Saturation instead of exception
- Block floating point, half precision floating point
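
Two of the techniques listed here, rounding before truncation and saturation instead of an exception, can be sketched in C as follows. This is a minimal illustration, not the course's reference code: the helper name `round_sat_q15`, the 64-bit accumulator with guard bits, and the Q15 target format are assumptions.

```c
#include <stdint.h>

/* Hypothetical helper: reduce a wide accumulator that holds a Q30 sum of
   products (plus guard bits) to a 16-bit Q15 result. */
int16_t round_sat_q15(int64_t acc)
{
    /* Round: add half an LSB of the target format before truncating. */
    acc += (int64_t)1 << 14;

    /* Truncate: drop the 15 low-order fractional bits.
       Assumes an arithmetic right shift for negative values, which is
       implementation-defined in C but usual on mainstream compilers. */
    acc >>= 15;

    /* Saturate: clamp to the 16-bit range instead of raising an exception. */
    if (acc > INT16_MAX) return INT16_MAX;
    if (acc < INT16_MIN) return INT16_MIN;
    return (int16_t)acc;
}
```

For example, the Q30 product of 0.5 and 0.5 (16384 * 16384) comes back as 8192, which is 0.25 in Q15, while an accumulator far outside the range is clamped to 32767 or -32768.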

15 Added quality control codes
[Figure: main task flow from A/D through DSP stages (filter, DEC, filter) to D/A, with scaling of data, coefficients, and parameters between the stages, and MAX / AVG counters]
- Measurement flow tasks are executed only when needed.
- Scaling flow tasks are executed only after running the measurement flow.

16 Firmware in a fixed point processing
- Start: program booting and parameter initialization
- Loading inputs and pre-processing
- Main task flow: executing the kernel part algorithms
- Data quality control flow
  - Default: no operation
  - In case needed: measurement flow
  - After measurement: scaling flow
- Post processing, result storing
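
The measurement flow and the scaling flow can be sketched in C roughly as below. The function names, the use of a MAX measurement only, and the choice of a power-of-two scale factor are assumptions made for this illustration; the slide does not prescribe them.

```c
#include <stdint.h>
#include <stddef.h>

/* Measurement flow (hypothetical): largest magnitude in a block of samples. */
int32_t block_abs_max(const int16_t *x, size_t n)
{
    int32_t max = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t v = x[i];
        if (v < 0) v = -v;      /* computed in 32 bits, so 32768 is representable */
        if (v > max) max = v;
    }
    return max;
}

/* Number of bits the block can be scaled up without leaving the 16-bit range. */
int headroom_bits(int32_t abs_max)
{
    int bits = 0;
    if (abs_max == 0) return 0; /* nothing to scale */
    while ((abs_max << 1) <= INT16_MAX) {
        abs_max <<= 1;
        bits++;
    }
    return bits;
}

/* Scaling flow (hypothetical): scale the block up by 2^shift.
   Multiplication is used instead of << to avoid shifting negative values. */
void scale_block(int16_t *x, size_t n, int shift)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = (int16_t)(x[i] * (1 << shift));
    }
}
```

In this sketch the scaling flow only makes sense after the measurement flow has produced `abs_max`, which mirrors the ordering stated on the slides.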

17 Bit accurate behavior coding
- Fractional vs. integer: A = 0.25 vs. 8192 = 0.25 * 32768
- Mask including guard: A = (long)(int)A & 0x0001FFFF
- Arithmetic, for example: yn = yn + ((long)(int)a * xn >> 15)
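
The fractional-to-integer mapping and the multiply-accumulate line above can be written out as a small C sketch. The names `to_q15` and `mac_q15` and the 32-bit accumulator are assumptions for illustration; only the scale factor 32768 and the `>> 15` come from the slide.

```c
#include <stdint.h>

#define Q15_SCALE 32768.0   /* 2^15, as in 8192 = 0.25 * 32768 */

/* Hypothetical helper: convert a real value in [-1, 1) to Q15,
   rounding to nearest and clamping at the ends of the range. */
int16_t to_q15(double v)
{
    double s = v * Q15_SCALE;
    if (s > 32767.0)  s = 32767.0;
    if (s < -32768.0) s = -32768.0;
    return (int16_t)(s >= 0.0 ? s + 0.5 : s - 0.5);
}

/* yn = yn + (a * xn >> 15): the Q15 x Q15 product is Q30, so shifting
   right by 15 brings it back to Q15 before accumulation.
   The right shift of a negative product is assumed to be arithmetic. */
int32_t mac_q15(int32_t yn, int16_t a, int16_t xn)
{
    int32_t prod = (int32_t)a * (int32_t)xn;   /* Q30 */
    return yn + (prod >> 15);                  /* Q15 */
}
```

With these definitions `to_q15(0.25)` gives 8192, and accumulating 0.25 * 0.5 onto zero gives 4096, which is 0.125 in Q15.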

18 Bit accurate specification
[Figure: signal level scale with the labels HW ceiling, headroom, ADC resolution, MAX gain result (0 dB), and footroom]
- Scale up to avoid accumulated quantization errors.

19 Measuring data quality
D_{RMS} = \sqrt{\frac{(R_1 - r_1)^2 + (R_2 - r_2)^2 + \dots + (R_n - r_n)^2}{N}}
D_{ABSMAX} = \max\{ |R_1 - r_1|, |R_2 - r_2|, \dots, |R_{n-1} - r_{n-1}|, |R_n - r_n| \}
SNR = 20 \log_{10} \frac{MAX_{headroom}}{D_{RMS}} \ \mathrm{dBV}
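
The two difference measures on this slide can be computed with a few lines of C. This is a sketch under stated assumptions: the arrays `R` and `r` hold the two result sequences being compared, the function names are hypothetical, and the first function returns the mean square difference, so D_RMS is its square root (for example via `sqrt` from `<math.h>`), which is left out here to keep the block free of library dependencies.

```c
#include <stddef.h>

/* Mean of the squared differences (R_i - r_i)^2; D_RMS is its square root. */
double d_mean_square(const double *R, const double *r, size_t n)
{
    double sum = 0.0;
    if (n == 0) return 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = R[i] - r[i];
        sum += d * d;
    }
    return sum / (double)n;
}

/* D_ABSMAX: the largest absolute difference |R_i - r_i|. */
double d_absmax(const double *R, const double *r, size_t n)
{
    double max = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = R[i] - r[i];
        if (d < 0.0) d = -d;
        if (d > max) max = d;
    }
    return max;
}
```

For the sequences {1, 2, 3, 4} and {1, 2, 3, 2} the mean square difference is 1.0 and the absolute maximum difference is 2.0.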

20 Memory and memory access
- Use SPM (scratchpad memory) instead of cache.
- Expose flexibilities for data access.
- Minimize memory cost or access cost?
- Memory hardware constraints may induce extra execution time: code loading, loading/storing data, and swapping data when the memory size is not sufficient.
- Adapt your implementation to the memory HW.

21 Memory efficiency
1. Minimize memory costs
- Low program memory cost, low data memory cost
2. Minimize memory access costs
- Minimize on-chip / off-chip swapping (SPM efficiency?)
- Multiple tasks/threads sharing data
- Memory block re-connect (sharing out/in FIFO)

22 Memory efficient
- Select algorithms with full memory access predictability.
- Much of the data can then be stored in off-chip memory and prefetched when needed.

23 Reduce register cost
[Figure: number of registers required over cycles; the variables a, b, c, d, s, t, u, v, x, y are mapped to the registers ACR0, ACR1, and R0 to R5]

24 Real-time firmware implementation
- Correct = correct result + results available in time
- Find the critical path and time constraints, WCET, minimize memory uncertainty

25 Real time
- Cycle true: based on known cycle count
- Short distance between WCET (worst case execution time) and BCET (best case execution time)
- Dynamic / static run time analysis
- Quality coding of innermost loops

26 Code compiling
- The closer the C code is to the HW, the better the C-compiler result can be.
- Understand the compiler in detail.
- Annotate enough (compiler known).
- Do we trust the compiler?
- Functional verification of compiled code

27 Low cycle cost assembly kernels
- Focus on low cycle cost of innermost loops!
- Use REPEAT instead of conditional jumps.
- Loop unrolling and low cycle cost scheduling!
- Do not care much about the code cost of the inner loop!
- Use as many vector instructions as possible.
- Keep useful data in the RF as long as possible.
Reading: C Algorithms for Real-Time DSP, Prentice Hall, ISBN; Hacker's Delight, Addison-Wesley, ISBN

28 Low cycle cost assembly kernels: implementation models
[Table residue; the column layout is not recoverable]
- Function: Matrix, Basic, Video, Baseband, HPC
- Entries: large matrix, transform, larger size T, filter, ISP, CODEC, post process, coding, searching, sorting, FSM, storage, channel decoding, FEC, Taylor series
- Models: task partition, data partition, grouping, pipeline, recursive, SPMD, master-slave, fork-join, BSPM, data sharing
Reading: A Pattern Language for Parallel Programming

29 Reading: A Pattern Language for Parallel Programming

30 Kernel programming tips
- CISC (if available) vs. RISC (always there)
  - RISC: memory -> RF -> computing -> RF -> memory
  - DSP loop: memory -> computing -> RF
- Trade off 10% - 90%: prolog, epilog, iterations
- Minimize cycle cost by acceleration / quality coding
- Amdahl's law: minimize the parts that cannot run in parallel
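
Amdahl's law, mentioned in the last bullet, bounds the overall speedup by the fraction of the work that cannot be accelerated. A minimal C sketch follows; the function and parameter names are assumptions for illustration.

```c
/* Amdahl's law: overall speedup when a fraction p of the run time
   (0 <= p <= 1) is accelerated by a factor n and the rest is unchanged. */
double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

If, say, 90% of the run time sits in a kernel that is accelerated 10 times, the overall speedup is only about 5.26, and even with unlimited acceleration of that kernel it stays below 10. This is why the remaining serial part (prolog, epilog, control) deserves attention.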

31 Code integration
- Oh my god! Where are the cycles consumed?
- Extra cycles are needed during SW integration.
- Be sure you predicted / accounted for these cycles during the early SW plan / design phases.
- Extra cost can come from (but is not limited to):
  - Control: prolog/epilog, asynchronous events, synchronization
  - Data dependencies: loading, waiting for data to become available
  - Communications: master/device (slave, I/O)

32 Assembly-level release
- WCET (worst-case execution time) should be analyzed based on static timing analysis.
- Remove paths that can never be taken.
- Avoid releasing code based on dynamic timing (code simulation).
- Stack overflow should be checked if multiple tasks run simultaneously and involve many interrupts and subroutine calls.

33 Benchmark

34 Benchmark
- A benchmark is a type of program used to measure the performance of a processor.
- Benchmarking is the execution of such programs, which lets processor users measure the machine clock cycles consumed by a specific section of code.

35 ASIP design flow
[Flow diagram]
- Source code analysis, decision for the ISA of the ASIP
- Design the instruction set and toolchain for prototyping
- Benchmark (kernel), evaluate the microarchitecture
- Satisfied? If no: change the ISA. If yes: continue.
- Microarchitecture design, VLSI design, verification

36 Third party benchmarks
- BDTI (Berkeley Design Technology, Inc.): hand written assembly by professional engineers
- EEMBC (the EDN Embedded Microprocessor Benchmark Consortium), five classes: automotive/industrial, consumer, networking, office automation, and telecommunication

37 Benchmark example: for a simple DSP
[Table; values not shown]
Columns: Algorithm kernels | Number of samples | Taps | Total cycle cost | Kernel cycle cost | P-Mem cost | D-Mem cost
Rows (algorithm kernels):
- Block transfer
- point complex FFT
- Single data sample FIR
- Frame FIR (multi samples)
- Complex FIR
- IIR biquad type I
- LMS adaptive FIR
- bit division
- Vector add
- Vector dot
- Vector max
- Floating to fixed
- Fixed to floating
- X8DCT
- FSM (packet classification)

38 How to write a benchmark
- All operations, operands, and results use the native word length.
- Try to keep high precision in the MAC.
- Round and saturate before storing data from the MAC (after truncation) to memory or registers.
- All programs are implemented by experienced DSP firmware engineers.
- Use the complete program, including loop prolog and epilog, program initialization, and wrapping up.
- All related memory access costs shall be included.

39 An example: FIR benchmark
A FIR filter is a weighted sum of a finite set of inputs:
y(n) = \sum_{k=0}^{m-1} a_k \, x(n-k)
- x(n) is the input
- y(n) is the output
- a_k is the vector of filter coefficients

40 An example: FIR benchmark
[Figure: FIR filter structure; x(n) passes through a chain of delay elements T, the taps are weighted by a_0, a_1, ..., a_n and summed to give y(n)]

41 An example: FIR benchmark
Behavior level code (single sample FIR)
{
  Reset ACR
  DM(DP) <= the latest sample
  DP <= DP + 1    /* Store the latest sample in the computing buffer, and then load the oldest sample, using the same pointer. */
  For i = 0 to 15 do {
    ACR <= ACR + DM(DP) * TM(TP)    /* 16-tap convolution for a sample */
    DP <= DP + 1                    /* implied modulo DP */
    TP <= TP + 1
  }
  Round and Sat ACR
  Output result
  Store the data pointer DP
}
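
The behavior level code above can be modeled bit-accurately in C along the following lines. This is a sketch under stated assumptions, not the benchmark source: the type `fir_state`, the function `fir_sample`, the Q15 data format, and the 64-bit accumulator are all hypothetical. The pseudocode walks both DP and TP upward, which corresponds to a coefficient table stored in reverse order; here `coeff[k]` holds a_k, so the table is indexed backward instead.

```c
#include <stdint.h>

#define TAPS 16

typedef struct {
    int16_t  delay[TAPS];   /* circular sample buffer (DM in the pseudocode) */
    unsigned pos;           /* index of the oldest sample (DP) */
} fir_state;

/* Filter one Q15 sample through a 16-tap FIR and return the Q15 result. */
int16_t fir_sample(fir_state *s, const int16_t coeff[TAPS], int16_t x)
{
    int64_t acc = 0;                       /* Reset ACR */

    s->delay[s->pos] = x;                  /* latest sample replaces the oldest */
    s->pos = (s->pos + 1) % TAPS;          /* now points to the oldest sample */

    unsigned dp = s->pos;
    for (int i = 0; i < TAPS; i++) {       /* oldest sample first, newest last */
        acc += (int32_t)s->delay[dp] * coeff[TAPS - 1 - i];
        dp = (dp + 1) % TAPS;              /* modulo addressing */
    }

    /* Round and saturate; arithmetic right shift of negatives is assumed. */
    acc = (acc + ((int64_t)1 << 14)) >> 15;
    if (acc > INT16_MAX) acc = INT16_MAX;
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;                   /* s->pos is kept for the next call */
}
```

Feeding a single sample of 0.5 (16384) followed by zeros returns half of each coefficient in turn, which is a quick way to check the buffer indexing.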

42 An example: FIR benchmark
The first part of the program
  Set AP1, $SEG_FIR       -- load segment (block) address to DM1 pointer
  Set LoopR, N            -- load the loop counter
                          -- filter program parameters are stored in DM1
  Set R15, $Resultpt      -- result pointer to R15
  Set AP0, $Datapt        -- data pointer to AP0
  Set BTR, $Bottom        -- FIFO bottom pointer
  Set TPR, $Top           -- FIFO top pointer
  Set AP1, $Coeffpt       -- coefficient pointer to AP1
                          -- the prolog consumes 7 cycles
  Repeat N                -- number of samples; for every data sample
    Store DM0(AP0++), R1  -- a sample from R1 to DM0(DM0 pointer)
    CLR ACR1              -- clear the accumulator ACR1

43 An example: FIR benchmark
The second part of the program
    CONV ACR1 SSF 16 DM0(AP0) DM1(AP1)  -- signed fractional convolution
                          -- the iteration uses N+1 = 16 (17) clock cycles
                          -- the convolution iteration consumes 16 cycles if the
                          -- following instruction does not use ACR1

44 An example: FIR benchmark
The third part of the program
    PostOP R1, ACR1       -- Sat(Round(ACR)), store result in ACRH and R1
    Store DM1(R15), R1    -- store result in R1 to DM1(GRX++)
    INC R15               -- position to the next result
  End repeat
  Store DM1(AP1++), R15   -- store Y pointer after updating result Y
  Store DM1(AP1), AP1     -- store X pointer of the FIFO filter
                          -- the epilog consumes 6 cycles

45 The data memory space: the FIFO buffer
[Figure (a) The FIFO behavior: samples X(0) ... X(15) between the MIN address (Bottom) and the MAX address (Top), with the removed data leaving the buffer]
[Figure (b) The FIFO implementation: DM addresses Btm + 0 ... Btm + 15 shown in states 0 to 3, holding X(n-15) ... X(n), with the address pointers R0, R5, and R7]
- Push new data once per FIR tap.
- Load each data item once for the signal processing of a FIR tap.
- Read a new value to replace the oldest value in the buffer, x(n-15).
- Increase the address counter R0. It points to the (next) oldest value in the FIFO.
- Replace the (next) oldest value x(n-15) with the new incoming value.

46 Example: Frame sample FIR
C-code: a 16-tap FIR filter runs 40 samples
- Kernel cycle cost: 17 x 40 = 680 cycles
- Prolog and epilog of the inner loop: 40 x 5 = 200 cycles
- Prolog and epilog of the top loop: 9 cycles
Typical BDTI benchmarking
Algorithm | Innermost loop pro/epilogue | Kernel cycle cost | Total code cost | DM cost
40 sample 16-tap FIR | 5 x 40 = 200 | 17 x 40 = 680 | |
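
The cycle budget on this slide is simple enough to capture in a helper, which is handy when the tap count or frame length changes during planning. The function name and parameter names are assumptions for illustration; the figures 17, 5, 40, and 9 are the ones quoted above.

```c
/* Total cycles for a frame: per-sample kernel cost plus per-sample
   inner-loop prolog/epilog, repeated for every sample, plus the one-time
   prolog/epilog of the top loop. */
unsigned long frame_cycles(unsigned long samples,
                           unsigned long kernel_per_sample,
                           unsigned long overhead_per_sample,
                           unsigned long top_overhead)
{
    return samples * (kernel_per_sample + overhead_per_sample) + top_overhead;
}
```

With the figures above, `frame_cycles(40, 17, 5, 9)` evaluates to 680 + 200 + 9 = 889 cycles, of which the kernel itself accounts for 680.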

47 Review of today's discussion
- Quality firmware design is based on rich FW experience and a deep understanding of the applications and the HW. A formal design will never offer quality code.
- Firmware design can be divided into three steps: algorithm selection and behavior modeling, C-coding under hardware constraints, and assembly language coding.
- Benchmark fundamentals
- Learn heterogeneous programming models in other courses.

48 Summarize what/how to learn
[Figure residue; grouping reconstructed from the labels]
Concepts
- Firmware plan & design
- Bit accurate, memory accurate, cycle accurate
- Plan vs. code: to find the extra cycle cost that you could not find while coding subroutines
Skills
- System understanding, skills to select algorithms
- FW coding, assembly coding tools, debug skill
- Integration, verification
- Further understanding of tools after reading chapter 18

49 Self reading after the lecture
- Your hardware knowledge will help you to design quality firmware; try to summarize it by yourself.
- Read chapter 18 and chapter 9:
  1. Collect experience in designing quality innermost loop code.
  2. How to accelerate the innermost loop in HW.

50 Exciting time now! Let us discuss
- Whatever you want to discuss that is related to HW
- You will have the chance after each lecture (Fö), do take the chance!
- Prepare your questions for next time

51 You are welcome to ask any questions you want; I can answer, or we can discuss together. I want to know what you want.
Dake Liu, Room 556 corridor B, Hus-B, phone , dake.liu@liu.se
