Design of Embedded DSP Processors Unit 8: Firmware design and benchmarking. 9/27/2017 Unit 8 of TSEA H1 1
2 Contents
Introduction to FW and its coding flow
1. Application modeling under HW constraints
2. Stream-kernel (master / slave) programming
3. Programming algorithm / computing kernels
4. Assembly code implementation
5. Code benchmarking and integration
3 FW design flow
4 Firmware
- Firmware (FW) is software with fixed functions that has been "firmed" into a system, without yet being hardware.
- FW is permanently installed in non-volatile memory and is rarely changed.
- Typical examples: baseband firmware in an SDR processor; video CODEC firmware in a TV or a surveillance camera.
5 FW coding / implementation flow
[Figure: flow chart. Documents and standards feed high-level behavior modeling; HW constraints feed HW-related C modeling; code inspection follows the modeling steps. Two source paths follow: assembly programming (source xx.asm, assembler) and C programming (source xxx.c, C compiler). Both produce object files (xxx.bin), which the object linker combines with the library (LIB) for the simulator / debugger.]
6 The role of the programmer and the compiler
1. The programmer partitions the application and assigns the parts to different instruction domains / streams, codes and debugs each domain, and integrates the heterogeneous code. Within an instruction stream, the programmer writes the kernel code to approach the best performance.
2. The compiler translates C into the machine language of its target and optimizes the translation.
3. The API is finally added by a programming model.
7 Understand applications
- Product: portable audio player, DTV and video player
- Application components: RTOS, audio decoder, voice encoder, DVB modem, video decoder
- Function kernels: filter, (I)DCT, Huffman decoder, waveform generator, (I)FFT
- Innermost loop design
8 Task partition, allocation, and scheduling before coding / compiling
- Mostly done by hand; tools are rarely available.
- Based on computing cost prediction (code profiling), algorithm features, and HW constraints.
- Partition objectives differ:
  - highest performance
  - lowest power (lower speed, less communication)
  - lowest memory cost
  - job balancing
9 FW design flow
Copyright of Linköping University, all rights reserved
[Figure: detailed flow beside a simplified firmware design flow, with design entries 1, 2, and 3.]
- Detailed steps: understanding applications; HW-aware algorithm selection; high-level language modeling; finite length design; coding finite length firmware; exposing memory costs; coding FW with memory costs; run-time budget; coding cycle-accurate FW; re-allocatable assembly coding; binary machine code.
- Simplified firmware design flow: behavior modeling; bit accurate modeling; memory accurate modeling; timing budget; assembly coding.
10 High level FW design
11 Algorithm selection
- Function! Do not forget your function!
- Select algorithms for the architecture, adapting to (1) the advanced features and (2) the constraints of the HW.
- Reuse available algorithms (SW reuse).
- Minimize computing cost (innermost loop).
- Minimize code cost (of high-level code).
- Minimize data accesses (the main focus today).
12 Stream-kernel based programming
Stream
- The main program consists of an FSM that prepares and uses subroutines.
- Prolog: start a subroutine in a device.
- Epilog: finish the subroutine in the device and hand over the results.
- API insertion: CUDA, OpenCL, OpenGL, OpenMP.
Kernel
- Interwork, task / resource management, and function calls.
- Speed up innermost loops by assembly-level coding. That is what we are going to do today!
13 Assembly kernel coding
14 Finite length
- Finite length: integer / fractional data with limited dynamic range.
- Goal: low cost and low power with acceptable quantization noise.
- Techniques:
  - integer / fractional guard bits for iterations
  - scaling, and rounding before truncation
  - saturation instead of exception
  - block floating point, half-precision floating point
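Two of the techniques listed above, rounding before truncation and saturation instead of exception, can be sketched in C as follows. This is a minimal sketch, not code from the course: the function name `round_shift_sat16` and the 64-bit accumulator type are assumptions chosen for illustration, and the wide accumulator merely stands in for hardware guard bits.

```c
#include <assert.h>
#include <stdint.h>

/* Scale a wide accumulator down by `shift` fractional bits, rounding before
 * truncation, then saturate to 16 bits instead of wrapping or trapping.
 * Hypothetical helper. Assumes >> on a negative value is an arithmetic
 * shift (true for mainstream compilers, implementation-defined in C). */
static int16_t round_shift_sat16(int64_t acc, int shift)
{
    assert(shift >= 1 && shift <= 31);

    /* Add half an output LSB, then drop the fractional bits. */
    int64_t rounded = (acc + ((int64_t)1 << (shift - 1))) >> shift;

    /* Clamp to the 16-bit range rather than raising an exception. */
    if (rounded > INT16_MAX) return INT16_MAX;
    if (rounded < INT16_MIN) return INT16_MIN;
    return (int16_t)rounded;
}
```

For a Q30 product of two Q15 operands, a call such as `round_shift_sat16(acc, 15)` returns a Q15 result; values outside the 16-bit range are clamped to the nearest limit.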
15 Added quality control codes
[Figure: main task flow from A/D through three DSP stages (filter, DEC, filter) to D/A, with scaling applied to the data, coefficients, and parameters, and a measurement flow using MAX and AVG counters.]
- Scaling flow tasks are executed only after running the measurement flow.
- Measurement flow tasks are executed only when needed.
16 Firmware in fixed-point processing
- Start
- Program booting and parameter initialization
- Loading inputs and pre-processing
- Main task flow: executing the kernel part algorithms
- Data quality control flow:
  - default: no operation
  - in case needed: measurement flow
  - after measurement: scaling flow
- Post-processing, result storing
17 Bit accurate behavior coding
- Fractional vs. integer: A = 0.25 vs. 8192 = 0.25 * 32768
- Mask including guard: A = (long)(int)A & 0x0001FFFF
- Arithmetic, for example: yn = yn + ((long)(int)a * xn >> 15)
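The three expressions on this slide can be tried out with a small C model. It is a sketch under stated assumptions: the names `to_q15`, `mask_with_guard`, and `mac_q15` are hypothetical, 16-bit data is assumed to be in Q15 format (scale factor 32768), and reading the mask as 16 data bits plus one guard bit is an interpretation of the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a fraction in [-1, 1) to Q15 (hypothetical helper).
 * The cast truncates toward zero. */
static int16_t to_q15(double f)
{
    assert(f >= -1.0 && f < 1.0);
    return (int16_t)(f * 32768.0);
}

/* Keep 17 bits, read here as 16 data bits plus one guard bit (assumption). */
static int32_t mask_with_guard(int32_t v)
{
    return v & 0x0001FFFF;
}

/* yn + a*xn: the Q30 product is shifted back to Q15 before accumulation,
 * mirroring yn = yn + ((long)(int)a * xn >> 15). */
static int32_t mac_q15(int32_t yn, int16_t a, int16_t xn)
{
    return yn + (((int32_t)a * xn) >> 15);
}
```

With these definitions, `to_q15(0.25)` gives 8192, and multiplying 0.25 by 0.5 through `mac_q15(0, 8192, 16384)` gives 4096, which is 0.125 in Q15.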
18 Bit accurate specification
[Figure: dynamic range diagram with the labels HW ceiling, headroom, MAX gain result (0 dB), ADC resolution, and foot-room.]
- Scale up to avoid accumulated quantization errors.
19 Measuring data quality
$D_{RMS} = \sqrt{\dfrac{(R_1 - r_1)^2 + (R_2 - r_2)^2 + \dots + (R_n - r_n)^2}{N}}$
$D_{ABSMAX} = \mathrm{MAX}\{\, |R_1 - r_1|,\ |R_2 - r_2|,\ \dots,\ |R_{n-1} - r_{n-1}|,\ |R_n - r_n| \,\}$
$SNR = 20 \log_{10} \dfrac{\mathrm{MAX}_{headroom}}{D_{RMS}}$ dBV
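The error measures D_RMS and D_ABSMAX can be computed with a few lines of C. This is a minimal sketch: the function names are hypothetical, and treating R as a reference result and r as the finite-length result is an assumption about the slide's notation. The first function returns the mean squared difference, so D_RMS is its square root (for example via `sqrt` from `<math.h>`); the root is left out to keep the sketch free of library dependencies.

```c
#include <assert.h>
#include <stddef.h>

/* Mean squared difference between reference R and result r (D_RMS squared).
 * Assumption: ref holds R_1..R_n and res holds r_1..r_n. */
static double mean_square_error(const double *ref, const double *res, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = ref[i] - res[i];
        sum += d * d;
    }
    return sum / (double)n;
}

/* Largest absolute difference between reference and result (D_ABSMAX). */
static double abs_max_error(const double *ref, const double *res, size_t n)
{
    assert(n > 0);
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = ref[i] - res[i];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

D_RMS describes the average error over the data set, while D_ABSMAX exposes a single worst-case sample that the average would hide.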
20 Memory and memory access
- Use SPM (scratchpad memory) instead of cache.
- Expose flexibilities for data access.
- Minimize memory cost or access cost?
- Memory hardware constraints may induce extra execution time: code loading, loading / storing data, and swapping data when the memory size is not sufficient.
- Adapt your implementation to the memory HW.
21 Memory efficiency
1. Minimize memory costs: low program memory cost, low data memory cost.
2. Minimize memory access costs:
   - minimize on-chip / off-chip swapping (SPM efficiency?)
   - let multiple tasks / threads share data
   - reconnect memory blocks (sharing out / in FIFO)
22 Memory efficient
Select algorithms with full memory access predictability. Much of the data can then be stored in off-chip memory and prefetched when needed.
23 Reduce register cost
[Figure: chart of the number of registers required over cycles for the variables a, b, c, d, s, t, u, v, x, y, showing their allocation to ACR0, ACR1, and the registers R0 to R5.]
24 Real-time firmware implementation
- Correct = correct result + results available in time.
- Find the critical path and time constraints, determine the WCET, and minimize memory uncertainty.
25 Real time
- Cycle true: based on a known cycle count.
- Keep a short distance between WCET (worst-case execution time) and BCET (best-case execution time).
- Dynamic / static run-time analysis.
- Quality coding of innermost loops.
26 Code compiling
- The closer the C code is to the HW, the better the C compiler result can be.
- Understand the compiler in detail. Annotate enough.
- Compiler known.
- Do we trust the compiler? Functionally verify the compiled code.
27 Low cycle cost assembly kernels
- Focus on the low cycle cost of innermost loops!
- Use REPEAT instead of conditional jumps.
- Use loop unrolling and low-cycle-cost scheduling!
- Do not care much about the code cost of the inner loop!
- Use as many vector instructions as possible.
- Keep useful data in the RF (register file) as long as possible.
Reading: C Algorithms for Real-Time DSP, Prentice Hall; Hacker's Delight, Addison-Wesley.
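Loop unrolling, one of the tips above, can be illustrated at C level with a dot product. This is a sketch only: the function names are hypothetical, the factor of four is an arbitrary choice, and on a DSP with a hardware REPEAT instruction the gain comes mainly from the freedom to schedule the four multiply-accumulate operations rather than from removing branch overhead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference version: one multiply-accumulate per loop iteration. */
static int64_t dot_rolled(const int16_t *x, const int16_t *a, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * a[i];
    return acc;
}

/* Unrolled by four: a quarter of the loop-control work, at the price of a
 * larger loop body. Assumes n is a multiple of 4. */
static int64_t dot_unrolled4(const int16_t *x, const int16_t *a, size_t n)
{
    assert(n % 4 == 0);
    int64_t acc = 0;
    for (size_t i = 0; i < n; i += 4) {
        acc += (int32_t)x[i]     * a[i];
        acc += (int32_t)x[i + 1] * a[i + 1];
        acc += (int32_t)x[i + 2] * a[i + 2];
        acc += (int32_t)x[i + 3] * a[i + 3];
    }
    return acc;
}
```

Both functions return the same sum; the unrolled form trades code size for cycle count, which matches the advice not to care much about the code cost of the inner loop.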
28 Low cycle cost assembly kernels
[Figure: table of implementation models per function class.]
- Function classes: matrix, basic, video, baseband, HPC.
- Functions: large matrix, transform, larger size T, filter, ISP, CODEC, post process, coding, searching, sorting, FSM, storage, channel decoding, FEC, Taylor series.
- Implementation models: task partition, data partition, grouping, pipeline, recursive, SPMD, master-slave, fork-join, BSPM, data sharing.
Reading: A Pattern Language for Parallel Programming
29 Reading: A Pattern Language for Parallel Programming
30 Kernel programming tips
- CISC (if available) vs. RISC (always there).
- RISC: memory -> RF -> computing -> RF -> memory.
- DSP loop: memory -> computing -> RF.
- Trade-off 10% - 90%: prolog, epilog, iterations.
- Minimize cycle cost by acceleration / quality coding.
- Amdahl's law: minimize the parts that cannot run in parallel.
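Amdahl's law, cited in the last tip, bounds the overall speedup by the part of the work that cannot be parallelized or accelerated. The standard formula can be sketched in C as below; the function and parameter names are hypothetical.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the run time is spread
 * over n parallel units and the remaining (1 - p) stays sequential. */
static double amdahl_speedup(double parallel_fraction, double n_units)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(n_units >= 1.0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_units);
}
```

With 75% of the run time parallelized over three units the speedup is only 2, and no number of units can lift it beyond 1 / (1 - 0.75) = 4, which is why the sequential prolog and epilog deserve attention.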
31 Code integration
- Oh my god! Where are the cycles consumed?
- Extra cycles are needed during SW integration.
- Be sure you predicted and accounted for these cycles during the early SW planning / design phases.
- Extra cost can come from (but is not limited to):
  - control: prolog / epilog, asynchrony, synchronization
  - data dependencies: loading, waiting for data to become available
  - communications: master / device (slave, I/O)
32 Assembly-level release
- The WCET (worst-case execution time) should be analyzed by static timing analysis.
- Remove paths which can never be taken.
- Avoid releasing code based on dynamic timing (code simulation).
- Stack overflow should be checked if multiple tasks run simultaneously and involve many interrupts and subroutine calls.
33 Benchmark
34 Benchmark
A benchmark is a type of program used to measure the performance of a processor. Benchmarking is the execution of such programs, which allows processor users to measure the machine clock cycles consumed by a specific section of code.
35 ASIP design flow
1. Source code analysis; decision on the ISA of the ASIP.
2. Design the instruction set and the toolchain for prototyping.
3. Benchmark (kernel), evaluate the microarchitecture.
4. Satisfied? If no: change the ISA? If yes: continue.
5. Microarchitecture design, VLSI design, verification.
36 Third party benchmarks
- BDTI (Berkeley Design Technology, Inc.): hand-written assembly by professional engineers.
- EEMBC (the EDN Embedded Microprocessor Benchmark Consortium), five classes: automotive / industrial, consumer, networking, office automation, and telecommunication.
37 Benchmark example: for a simple DSP
Table columns: algorithm kernels; number of samples; taps; total cycle cost; kernel cycle cost; P-mem cost; D-mem cost.
Algorithm kernels:
- Block transfer
- point complex FFT
- Single data sample FIR
- Frame FIR (multi samples)
- Complex FIR
- IIR biquad type I
- LMS adaptive FIR
- bit division
- Vector add
- Vector dot
- Vector max
- Floating to fixed
- Fixed to floating
- X8 DCT
- FSM (packet classification)
38 How to write a benchmark
- All operations, operands, and results are of native length.
- Try to keep high precision in the MAC. Round and saturate before storing data from the MAC (after truncation) to memory or registers.
- All programs are implemented by experienced DSP firmware engineers.
- The program is complete, including loop prolog and epilog, program initialization, and wrapping up.
- All related memory access costs shall be included.
39 An example: FIR benchmark
An FIR filter is a weighted sum of a finite set of inputs:
$y(n) = \sum_{k=0}^{m-1} a_k \, x(n-k)$
- x(n) is the input
- y(n) is the output
- a_k is the vector of filter coefficients
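40 An example: FIR benchmark
[Figure: FIR filter structure. The input x(n) passes through a chain of delay elements T; the taps are weighted by the coefficients a_0, a_1, ..., a_n and summed to give y(n).]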
41 An example: FIR benchmark
Behavior level code (single sample FIR)
{
  Reset ACR
  DM(DP) <= The latest sample
  DP <= DP + 1    /* Store the latest sample in the computing buffer, and then load the oldest sample, using the same pointer. */
  For i = 0 to 15 do {
    ACR <= ACR + DM(DP)*TM(TP)    /* 16-tap convolution for a sample */
    DP <= DP + 1                  /* implied modulo DP */
    TP <= TP + 1
  }
  Round and Sat ACR
  Output result
  Store the data pointer DP
}
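The behavior level code for the single sample FIR can be turned into a small bit-accurate C model. This is a sketch under assumptions not stated on the slide: samples and coefficients are taken to be Q15, the coefficient memory TM is assumed to be ordered so that TM(0) meets the oldest sample and TM(15) the newest, the 64-bit variable `acr` stands in for an accumulator with guard bits, and the names `fir_state_t` and `fir_single_sample` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FIR_TAPS 16

typedef struct {
    int16_t  dm[FIR_TAPS];  /* circular computing buffer (DM) */
    unsigned dp;            /* data pointer (DP): slot of the oldest sample */
} fir_state_t;

/* One output sample of a 16-tap FIR, following the behavior level code.
 * Assumes arithmetic >> on negative values (implementation-defined in C). */
static int16_t fir_single_sample(fir_state_t *st,
                                 const int16_t tm[FIR_TAPS], int16_t x)
{
    assert(st != 0 && st->dp < FIR_TAPS);

    int64_t acr = 0;                        /* Reset ACR */
    st->dm[st->dp] = x;                     /* DM(DP) <= the latest sample */
    st->dp = (st->dp + 1) % FIR_TAPS;       /* DP now points to the oldest */

    for (int tp = 0; tp < FIR_TAPS; tp++) { /* 16-tap convolution */
        acr += (int32_t)st->dm[st->dp] * tm[tp];
        st->dp = (st->dp + 1) % FIR_TAPS;   /* modulo addressing of DP */
    }

    acr = (acr + (1 << 14)) >> 15;          /* round, Q30 -> Q15 */
    if (acr > INT16_MAX) acr = INT16_MAX;   /* saturate */
    if (acr < INT16_MIN) acr = INT16_MIN;
    return (int16_t)acr;                    /* DP stays stored in *st */
}
```

After the 16 modulo increments, `dp` is back at the slot holding the oldest sample, so the next call overwrites exactly that slot; keeping `dp` in the state structure corresponds to the final "Store the data pointer DP" step.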
42 An example: FIR benchmark
The first part of the program
  Set AP1, $SEG_FIR       -- load segment (block) address to DM1 pointer
  Set LoopR, N            -- load the loop counter
                          -- filter program parameters are stored in DM1
  Set R15, $Resultpt      -- result pointer to R15
  Set AP0, $Datapt        -- data pointer to AP0
  Set BTR, $Bottom        -- FIFO bottom pointer
  Set TPR, $Top           -- FIFO top pointer
  Set AP1, $Coeffpt       -- coefficient pointer to AP
The prolog consumes 7 cycles.
  Repeat N                -- number of samples; for every data sample
  Store DM0(AP0++), R1    -- a sample data from R1 to DM0 (DM0 pointer)
  CLR ACR1                -- clear the accumulator buffer ACR1
43 An example: FIR benchmark
The second part of the program
  CONV ACR1 SSF 16 DM0(AP0) DM1(AP1)    -- signed fractional convolution
                                         -- iteration uses N+1 = 16(17) clock cycles
The convolution iteration consumes 16 cycles if the following instruction does not use ACR1.
44 An example: FIR benchmark
The third part of the program
  PostOP R1, ACR1         -- Sat Round(ACR), store result in ACRH and R1
  Store DM1(R15), R1      -- store result in R1 to DM1(GRX++)
  INC R15                 -- position to the next result
  End repeat
  Store DM1(AP1++), R15   -- store Y pointer after updating result Y
  Store DM1(AP1), AP1     -- store X pointer of the FIFO filter
The epilog consumes 6 cycles.
45 The data memory space: the FIFO buffer
Example: frame sample FIR C-code, 40 samples filtered by a 16-tap FIR
- Push new data once per FIR tap.
- Load each data once for the signal processing of a FIR tap.
[Figure (a), the FIFO behavior: a buffer holding X(0), X(1), ..., X(15) between the MIN address (Bottom) and the MAX address (Top) of DM, at addresses Btm + 0 to Btm + 15, with removed data leaving the buffer.]
[Figure (b), the FIFO implementation: states 0 to 3 of the buffer holding X(n-15) ... X(n), with the pointers R0, R5, and R7.]
- Read a new value to replace the oldest value in the buffer, x(n-15).
- Increase the address counter R0. It points to the (next) oldest value in the FIFO.
- Replace the (next) oldest value x(n-15) with the new incoming value.
46 Example: Frame sample FIR
C-code: a 16-tap FIR filter runs over 40 samples
- Kernel cycle cost: 17 x 40 = 680 cycles
- Prolog and epilog of the inner loop: 40 x 5 = 200 cycles
- Prolog and epilog of the top loop: 9 cycles
Typical BDTI benchmarking
- Algorithm: 40-sample 16-tap FIR
- Innermost loop prolog / epilog: 5 x 40 = 200
- Kernel cycle cost: 17 x 40 = 680
- Further columns: total code cost, DM cost
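The three cycle figures for the frame FIR add up as in the following C sketch. The function name and parameter names are hypothetical; the numbers in the usage note are the ones given on the slide.

```c
#include <assert.h>

/* Total cycle cost of a frame FIR: per-sample kernel cost plus per-sample
 * inner-loop prolog/epilog, plus the one-time top-loop prolog/epilog. */
static unsigned long frame_fir_cycles(unsigned long samples,
                                      unsigned long kernel_cycles,
                                      unsigned long inner_overhead,
                                      unsigned long outer_overhead)
{
    assert(samples > 0);
    return samples * (kernel_cycles + inner_overhead) + outer_overhead;
}
```

For 40 samples, 17 kernel cycles and 5 overhead cycles per sample, and 9 cycles for the top loop, `frame_fir_cycles(40, 17, 5, 9)` gives 680 + 200 + 9 = 889 cycles, so roughly a quarter of the budget lies outside the convolution kernel.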
47 Review of today's discussion
- Quality firmware design is based on rich FW experience and a deep understanding of the applications and the HW. A formal design will never offer quality code.
- Firmware design can be divided into three steps: algorithm selection and behavior modeling; C coding under hardware constraints; assembly language coding.
- Benchmark fundamentals.
- Learn heterogeneous programming models in other courses.
48 Summarize what / how to learn
- Concepts: firmware plan and design; skills to select algorithms; bit accurate, memory accurate, and cycle accurate modeling; plan vs. code.
- Skills: system understanding; FW coding; integration; assembly coding tools (further understanding of the tools after reading chapter 18); debug skill; verification.
- Integration: find the extra cycle cost which you could not find while coding the subroutines.
49 Self reading after the lecture
- Your hardware knowledge will help you to design quality firmware; try to summarize it by yourself.
- Read chapter 18 and chapter 9:
  1. Collect experience in designing quality innermost-loop code.
  2. How to accelerate the innermost loop in HW.
50 Exciting time now! Let us discuss
- Whatever you want to discuss that is related to HW.
- You will have the chance after each lecture (Fö); do take the chance!
- Prepare your questions for the next time.
51 You are welcome to ask any questions you want; I can answer them, or we can discuss them together. I want to know what you want.
Dake Liu, Room 556, corridor B, Hus-B, dake.liu@liu.se
More informationProcessing Unit CS206T
Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct
More informationMicroprocessors, Lecture 1: Introduction to Microprocessors
Microprocessors, Lecture 1: Introduction to Microprocessors Computing Systems General-purpose standalone systems (سيستم ھای نھفته ( systems Embedded 2 General-purpose standalone systems Stand-alone computer
More informationAn introduction to Digital Signal Processors (DSP) Using the C55xx family
An introduction to Digital Signal Processors (DSP) Using the C55xx family Group status (~2 minutes each) 5 groups stand up What processor(s) you are using Wireless? If so, what technologies/chips are you
More informationOPERATING SYSTEMS. After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD
OPERATING SYSTEMS #8 After A.S.Tanenbaum, Modern Operating Systems 3rd edition Uses content with permission from Assoc. Prof. Florin Fortis, PhD MEMORY MANAGEMENT MEMORY MANAGEMENT The memory is one of
More informationEvaluating MMX Technology Using DSP and Multimedia Applications
Evaluating MMX Technology Using DSP and Multimedia Applications Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * November 22, 1999 The University of Texas at Austin Department of Electrical
More informationMIPS Technologies MIPS32 M4K Synthesizable Processor Core By the staff of
An Independent Analysis of the: MIPS Technologies MIPS32 M4K Synthesizable Processor Core By the staff of Berkeley Design Technology, Inc. OVERVIEW MIPS Technologies, Inc. is an Intellectual Property (IP)
More informationModeling and Simulation of System-on. Platorms. Politecnico di Milano. Donatella Sciuto. Piazza Leonardo da Vinci 32, 20131, Milano
Modeling and Simulation of System-on on-chip Platorms Donatella Sciuto 10/01/2007 Politecnico di Milano Dipartimento di Elettronica e Informazione Piazza Leonardo da Vinci 32, 20131, Milano Key SoC Market
More informationMain Points of the Computer Organization and System Software Module
Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 22 Title: and Extended
More informationELC4438: Embedded System Design Embedded Processor
ELC4438: Embedded System Design Embedded Processor Liang Dong Electrical and Computer Engineering Baylor University 1. Processor Architecture General PC Von Neumann Architecture a.k.a. Princeton Architecture
More informationTHE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION
THE OPTIUM MICROPROCESSOR AN FPGA-BASED IMPLEMENTATION Radu Balaban Computer Science student, Technical University of Cluj Napoca, Romania horizon3d@yahoo.com Horea Hopârtean Computer Science student,
More informationCourse web site: teaching/courses/car. Piazza discussion forum:
Announcements Course web site: http://www.inf.ed.ac.uk/ teaching/courses/car Lecture slides Tutorial problems Courseworks Piazza discussion forum: http://piazza.com/ed.ac.uk/spring2018/car Tutorials start
More informationRISC-V CUSTOMIZATION WITH STUDIO 8
RISC-V CUSTOMIZATION WITH STUDIO 8 Zdeněk Přikryl CTO, Codasip GmbH WHO IS CODASIP Leading provider of RISC-V processor IP Introduced its first RISC-V processor in November 2015 Offers its own portfolio
More informationCS146 Computer Architecture. Fall Midterm Exam
CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state
More informationAnand Raghunathan
ECE 695R: SYSTEM-ON-CHIP DESIGN Module 2: HW/SW Partitioning Lecture 2.15: ASIP: Approaches to Design Anand Raghunathan raghunathan@purdue.edu ECE 695R: System-on-Chip Design, Fall 2014 Fall 2014, ME 1052,
More information55:132/22C:160, HPCA Spring 2011
55:132/22C:160, HPCA Spring 2011 Second Lecture Slide Set Instruction Set Architecture Instruction Set Architecture ISA, the boundary between software and hardware Specifies the logical machine that is
More informationEmbedded Systems. 7. System Components
Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic
More informationEmbedded Systems Design (630414) Lecture 1 Introduction to Embedded Systems Prof. Kasim M. Al-Aubidy Computer Eng. Dept.
Embedded Systems Design (630414) Lecture 1 Introduction to Embedded Systems Prof. Kasim M. Al-Aubidy Computer Eng. Dept. Definition of an E.S. It is a system whose principal function is not computational,
More informationLow-Power Processor Solutions for Always-on Devices
Low-Power Processor Solutions for Always-on Devices Pieter van der Wolf MPSoC 2014 July 7 11, 2014 2014 Synopsys, Inc. All rights reserved. 1 Always-on Mobile Devices Mobile devices on the move Mobile
More informationPorting LLVM to a Next Generation DSP
Porting LLVM to a Next Generation DSP Presented by: L. Taylor Simpson LLVM Developers Meeting: 11/18/2011 PAGE 1 Agenda Hexagon DSP Initial porting Performance improvement Future plans PAGE 2 Hexagon DSP
More informationLecture 4: Instruction Set Architecture
Lecture 4: Instruction Set Architecture ISA types, register usage, memory addressing, endian and alignment, quantitative evaluation Reading: Textbook (5 th edition) Appendix A Appendix B (4 th edition)
More informationRUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch
RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,
More informationCase study: Performance-efficient Implementation of Robust Header Compression (ROHC) using an Application-Specific Processor
Case study: Performance-efficient Implementation of Robust Header Compression (ROHC) using an Application-Specific Processor Gert Goossens, Patrick Verbist, Erik Brockmeyer, Luc De Coster Synopsys 1 Agenda
More informationMICROPROCESSOR BASED SYSTEM DESIGN
MICROPROCESSOR BASED SYSTEM DESIGN Lecture 5 Xmega 128 B1: Architecture MUHAMMAD AMIR YOUSAF VON NEUMAN ARCHITECTURE CPU Memory Execution unit ALU Registers Both data and instructions at the same system
More informationInstruction-set Design Issues: what is the ML instruction format(s) ML instruction Opcode Dest. Operand Source Operand 1...
Instruction-set Design Issues: what is the format(s) Opcode Dest. Operand Source Operand 1... 1) Which instructions to include: How many? Complexity - simple ADD R1, R2, R3 complex e.g., VAX MATCHC substrlength,
More informationURL: Offered by: Should already know: Will learn: 01 1 EE 4720 Computer Architecture
01 1 EE 4720 Computer Architecture 01 1 URL: https://www.ece.lsu.edu/ee4720/ RSS: https://www.ece.lsu.edu/ee4720/rss home.xml Offered by: David M. Koppelman 3316R P. F. Taylor Hall, 578-5482, koppel@ece.lsu.edu,
More information14.1 Control Path in General
AGU PC FSM Configuration and status Program address Instruction Instruction decoder DM Operand & result control Exec unit ALU/MAC Results RF Control Path Design Hardware organization and micro architecture
More informationComputer Architecture. Fall Dongkun Shin, SKKU
Computer Architecture Fall 2018 1 Syllabus Instructors: Dongkun Shin Office : Room 85470 E-mail : dongkun@skku.edu Office Hours: Wed. 15:00-17:30 or by appointment Lecture notes nyx.skku.ac.kr Courses
More informationAltera SDK for OpenCL
Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group
More informationLecture Topics. Principle #1: Exploit Parallelism ECE 486/586. Computer Architecture. Lecture # 5. Key Principles of Computer Architecture
Lecture Topics ECE 486/586 Computer Architecture Lecture # 5 Spring 2015 Portland State University Quantitative Principles of Computer Design Fallacies and Pitfalls Instruction Set Principles Introduction
More informationHardware/Software Co-design
Hardware/Software Co-design Zebo Peng, Department of Computer and Information Science (IDA) Linköping University Course page: http://www.ida.liu.se/~petel/codesign/ 1 of 52 Lecture 1/2: Outline : an Introduction
More informationWhen addressing VLSI design most books start from a welldefined
Objectives An ASIC application MSDAP Analyze the application requirement System level setting of an application Define operation mode Define signals and pins Top level model Write a specification When
More informationFixed-Point Math and Other Optimizations
Fixed-Point Math and Other Optimizations Embedded Systems 8-1 Fixed Point Math Why and How Floating point is too slow and integers truncate the data Floating point subroutines: slower than native, overhead
More informationChapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST
Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism
More information