Better sharc data such as vliw format, number of kind of functional units

Similar documents
DSP Platforms Lab (AD-SHARC) Session 05

REAL TIME DIGITAL SIGNAL PROCESSING

An introduction to DSP s. Examples of DSP applications Why a DSP? Characteristics of a DSP Architectures

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

REAL TIME DIGITAL SIGNAL PROCESSING

Digital Signal Processor Core Technology

General Purpose Signal Processors

REAL TIME DIGITAL SIGNAL PROCESSING

Processing Unit CS206T

LOW-COST SIMD. Considerations For Selecting a DSP Processor Why Buy The ADSP-21161?

Efficient Implementation of Transform Based Audio Coders using SIMD Paradigm and Multifunction Computations

Basic Computer Architecture

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Separating Reality from Hype in Processors' DSP Performance. Evaluating DSP Performance

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

Embedded Systems Development

Microprocessors vs. DSPs (ESC-223)

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009

Key elements in DSP algorithms. for DSP algorithms on. To be tackled today. Instruction fetches must be efficient.

TMS320C3X Floating Point DSP

04 - DSP Architecture and Microarchitecture

Last Time. Response time analysis Blocking terms Priority inversion. Other extensions. And solutions

3 PROGRAM SEQUENCER. Overview. Figure 3-0. Table 3-0. Listing 3-0.

Implementation of DSP Algorithms

VLSI Signal Processing

ARM Cortex core microcontrollers 3. Cortex-M0, M4, M7

The SHARC in the C. Mike Smith

Chapter 2 Instructions Sets. Hsung-Pin Chang Department of Computer Science National ChungHsing University

Principles of Audio Coding

The Processor: Instruction-Level Parallelism

Getting CPI under 1: Outline

DSP VLSI Design. Addressing. Byungin Moon. Yonsei University

An introduction to Digital Signal Processors (DSP) Using the C55xx family

William Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors

Chapter 13 Reduced Instruction Set Computers

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

FEATURE ARTICLE. Michael Smith

Hardware-based Speculation

Homework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures

INTRODUCTION TO DIGITAL SIGNAL PROCESSOR

Typical DSP application

Classification of Semiconductor LSI

DSP VLSI Design. Instruction Set. Byungin Moon. Yonsei University

Embedded Computation

Last time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls

RISC & Superscalar. COMP 212 Computer Organization & Architecture. COMP 212 Fall Lecture 12. Instruction Pipeline no hazard.

03 - The Junior Processor

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Lode DSP Core. Features. Overview

William Stallings Computer Organization and Architecture. Chapter 12 Reduced Instruction Set Computers

Chapter 4 The Processor (Part 4)

Typical Processor Execution Cycle

Graduate Institute of Electronics Engineering, NTU 9/16/2004

ADDRESS GENERATION UNIT (AGU)

CS 101, Mock Computer Architecture

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

Advanced processor designs

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter

The following revision history lists the anomaly list revisions and major changes for each anomaly list revision.

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Independent DSP Benchmarks: Methodologies and Results. Outline

Computer Science 246 Computer Architecture

Perceptual coding. A psychoacoustic model is used to identify those signals that are influenced by both these effects.

M.Tech. credit seminar report, Electronic Systems Group, EE Dept, IIT Bombay, Submitted: November Evolution of DSPs

Introducing Audio Signal Processing & Audio Coding. Dr Michael Mason Senior Manager, CE Technology Dolby Australia Pty Ltd

HW 2 is out! Due 9/25!

Digital Signal Processors: fundamentals & system design. Lecture 1. Maria Elena Angoletta CERN

Dynamic Control Hazard Avoidance

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan

Multi-cycle Instructions in the Pipeline (Floating Point)

CISC 662 Graduate Computer Architecture Lecture 13 - CPI < 1

Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal.

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

HY425 Lecture 09: Software to exploit ILP

02 - Numerical Representation and Introduction to Junior

In this tutorial, we will discuss the architecture, pin diagram and other key concepts of microprocessors.

Compiler Structure. Backend. Input program. Intermediate representation. Assembly or machine code. Code generation Code optimization

HY425 Lecture 09: Software to exploit ILP

VLIW/EPIC: Statically Scheduled ILP

Embedded Systems: Hardware Components (part I) Todor Stefanov

IF1/IF2. Dout2[31:0] Data Memory. Addr[31:0] Din[31:0] Zero. Res ALU << 2. CPU Registers. extension. sign. W_add[4:0] Din[31:0] Dout[31:0] PC+4

The Evolution of DSP Processors

Cache Justification for Digital Signal Processors

CPU1. D $, 16-K Dual Ported South UPA

Ascenium: A Continuously Reconfigurable Architecture. Robert Mykland Founder/CTO August, 2005

Digital Signal Processing Introduction to Finite-Precision Numerical Effects

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Micro-programmed Control Ch 17

SMS045 - DSP Systems in Practice. Lab 2 - ADSP-2181 EZ-KIT Lite and VisualDSP++ Due date: Tuesday Nov 18, 2003

Multiple Instruction Issue. Superscalars

Hardwired Control (4) Micro-programmed Control Ch 17. Micro-programmed Control (3) Machine Instructions vs. Micro-instructions

Micro-programmed Control Ch 15

Audio Fundamentals, Compression Techniques & Standards. Hamid R. Rabiee Mostafa Salehi, Fatemeh Dabiran, Hoda Ayatollahi Spring 2011

Machine Instructions vs. Micro-instructions. Micro-programmed Control Ch 15. Machine Instructions vs. Micro-instructions (2) Hardwired Control (4)

Micro-programmed Control Ch 15

Digital Signal Processor 2010/1/4

Instruction Set Architecture ISA ISA

ECE 486/586. Computer Architecture. Lecture # 7

BASIC COMPUTER ORGANIZATION. Operating System Concepts 8 th Edition

Transcription:

Better sharc data such as vliw format, number of kind of functional units Pictures of pipe would help Build up zero overhead loop example better FIR inner loop in coldfire Mine more material from bsdi.com Update costs for ADSP chips Tie in audio stuff to DSP stuff better DSP perf of various stages or similar?

Today Digital signal processors VLIW SHARC details Quick look at audio processing

Digital Signal Processors Microcontrollers are optimized for control-intensive apps Average general-purpose application branches every seven instructions Branches often not very predictable Memory accesses often not very predictable DSPs are optimized for math, loops, and data movement Both fixed-point and floating-point math Fast loop operations for simple loop structures Lots of I/O Instructions and memory accesses very predictable

Important DSPs Texas Instruments TMS320C2000, TMS320C5000, and TMS320C6000 Motorola StarCore: DSP56300, DSP56800, and MSC8100 Agere Systems DSP16000 series Analog Devices SHARC: ADSP-2100 and ADSP-21000

At the low end DSP: All key arithmetic ops in 1 cycle GPP: Often some math (multiply at least) is multiplecycle DSP: Support for 8 and 16 bit quantities as both integers and fractions GPP: Fixed word size, integer only DSP: HW support for managing numerical fidelity Saturation, flexible rounding, etc. GPP: These are often implemented in SW

At the high end DSP: Up to 8 arithmetic units GPP: 1-3 arithmetic units DSP: Highly specialized functional units Multiply and accumulate, Viterbi, etc. GPP: General-purpose functional units Integer, floating point, etc. DSP: Very limited use of dynamic features Branch predication, superscalar, etc. GPP: Extensive use of dynamic features

More CPU vs. DSP DSPs are Harvard architecture even at the high end No high end CPU is Harvard architecture DSPs offer better cache control Lockable cache regions Cache can be turned into scratchpad RAM Scratchpad == explicitly addressable fast RAM

SHARC High-performance DSP architecture Similarities to MCF52233 Separate instruction and data memories Some pipelining (3 stage vs. 4) SHARC is more CISC than ColdFire CISC main idea Give people complex instructions that match what they are trying to do This gives good performance and high code density SHARC Instructions are highly specialized for DSP

Quick VLIW Intro VLIW == Very Long Instruction Word Aggressive superscalar, out-of-order processors like P4 and Athlon Single operation per instruction Get high IPC through superscalar and out-of-order execution Requires lots of logic (and energy) to detect and avoid problematic dependencies VLIW Dependencies detected and avoided at compile time VLIW can get high IPC with simpler HW Compiler technology is difficult Also, compiler becomes very sensitive to the architectural details and program structure

More SHARC Stuff Supports saturating ALU operations Can issue some computations in parallel Dual add-subtract Multiplication and dual add/subtract Floating-point multiply and ALU operation Example SHARC instruction: R6 = R0*R4, R9 = R8 + R12, R10 = R8 - R12;

Parallelism Example We want to compute: if (a>b) y = c-d; else y = c+d; Strategy: Compute both results in parallel and then pick the right one! Load values (DM == data memory) R1=DM(_a); R2=DM(_b); R3=DM(_c); R4=DM(_d);! Compute both sum and difference R12 = R2+R4, R0 = R2-R4;! Choose which one to save COMP(R1,R2); IF LE R0=R12; DM(_y) = R0! Write to y

SHARC Addressing Immediate value R0 = DM(0x20000000); Direct load R0 = DM(_a);! Loads contents of _a Direct store DM(_a)= R0;! Stores R0 at _a Post-modify with update Used to sweep through a buffer I register holds base address M register/immediate holds modifier value R0 = DM(I3,M3)! Load DM(I2,1) = R1! Store

Data in Program Memory Can put constant data in program memory to read two values per cycle: F0 = DM(M0,I0), F1 = PM(M8,I9); Compiler allows programmer to control which memory values are stored in

Circular Buffers Fundamental data structure for DSP New sample always overwrites oldest sample Sample 523 Sample 524 Sample 525 Sample 526 Sample 519 Sample 520 Sample 521 Sample 522 Read sample 527 from ADC Sample 523 Sample 524 Sample 525 Sample 526 Sample 527 Sample 520 Sample 521 Sample 522

SHARC Circular Buffers Uses special Data Address Generator registers: L register gets buffer size B register buffer base address I, M registers in post-modify mode I is automatically wrapped around the circular buffer when it reaches B+L

SHARC Zero Overhead Loop No cost for jumping back to start of loop Hardware decrements counter, compares, then jumps back Loop length Last instruction In loop Termination condition (Loop Counter Expired) LCNTR=30, DO L UNTIL LCE; R0=DM(I0,M0), F2=PM(I8,M8); R1=R0-R15; L: F4=F2+F3; Nested loops also handled HW provides a 6-deep loop counter stack

FIR in Detail 1. Obtain sample from ADC, generate interrupt 2. Move the sample into the input circular buffer 3. Update the pointer for the circular buffer 4. Zero the accumulator 5. Loop through all coefficients 1. Fetch coefficient from coefficient circular buffer 2. Update pointer to coefficient circular buffer 3. Fetch sample from input circular buffer 4. Update the pointer to the input circular buffer 5. Multiply coefficient and sample 6. Add result to accumulator 6. Move output sample to a holding buffer 7. Move output sample from holding buffer to DAC

FIR Inner Loop in C int fir_inner (void) { } int i, f; for (i=0, f=0; i<n; i++) f = f + c[i]*x[i]; return f;

FIR Inner in SHARC! loop setup I0=a;! I0 points to a[0] M0=1;! set up increment I8=b;! I8 points to b[0] M8=1;! set up postincrement mode! loop body LCNTR=N, DO loopend UNTIL LCE; R1=DM(I0,M0), R2=PM(I8,M8); R8=R1*R2; loopend: R12=R12+R8;

FIR Inner in ColdFire fir_inner: link a6,#0 moveq #0,d2 moveq #0,d0 lea _x,a1 lea _c,a0 addq.l #1,d2 move.l (a1)+,d1 muls.l (a0)+,d1 add.l d1,d0 cmpi.l #10,d2 blt.s *-16 unlk a6 rts

DSP C Compilers Most of the compiler is the same as for standard architectures Lexer, parser, type checker IR generator High-level optimizations CSE, constant folding and propagation, loop unrolling Target-dependent optimizations are different Software pipelining Instruction scheduling Peephole optimizations Register allocation DSP compilers are typically very sensitive to issues like arrays vs. pointers

SHARC Stats $17-$18 $55-$65 ADSP-21261 SIMD ADSP-21262 ADSP- 21266 SIMD ADSP-21375 SIMD ADSP-21364 ADSP- 21365 SIMD Clock Cycle 150 MHz 200 MHz 266 MHz 333 MHz Instruction Cycle Time 6.67 ns 5 ns 3.75 ns 3 ns MFLOPS Sustained 600 MFLOPS 800 MFLOPS 1064 MFLOPS 1332 MFLOPS MFLOPS Peak 900 MFLOPS 1200 MFLOPS 1596 MFLOPS 1998 MFLOPS 1024 Point Complex FFT (Radix 4, with bit reversal) 61.3 µs 46 µs 34.5 µs 28 µs FIR Filter (per tap) 3.3 ns 2.5 ns 1.88 ns 1.5 ns IIR Filter (per biquad) 13.3 ns 10 ns 7.5 ns 6 ns

Performance for <$10

Performance for more $$

Human Hearing The ear is basically a frequency spectrum analyzer Sound intensity measured in decibel sound power level On a log scale 20 db = 10x change in air pressure 0 db = weakest detectable sound 60 db = normal speech 140 db = pain and damage Ear can detect 1 db change in volume Normal frequency range 20 Hz to 20 khz But most sensitive between 1 and 4 khz

Equal Loudness Curves

More Hearing We perceive Loudness Pitch Timbre harmonic content Amplitude 440 880 1320 1760 2200 2640 Hz Fundamental frequency Harmonics = integer multiples of the fundamental frequency

Phase Insensitivity Hearing is quite phase insensitive These waveforms sound the same: Why don t we hear phase?

Sound Quality vs. Data Rate Quality Bandwidth Sampling rate Number of bits Data rate CD 5 Hz-20 khz 44.1 khz 16 706 kbps Telephone 200 Hz-3.2 khz 8 khz 12 96 kbps Telephone with companding Compressed speech 200 Hz-3.2 khz 8 khz 8 64 kbps 200 Hz-3.2 khz 8 khz 12 4 kbps

Why Look at Hearing? Understanding hearing supports efficient audio processing Alternative to understanding is overkill E.g., CD-quality audio MP3 exploits limitations of hearing Notes with similar frequencies cannot be distinguished Sounds close in time cannot be distinguished Loud notes drown quieter ones Ear is not uniformly sensitive to all frequencies

MP3 Encoding 1. Break data into frames 2. Convert into frequency domain 3. Use psychoacoustic model to sort frequency components by importance Drop less important components subject to bit-rate constraints 4. Perform Huffman encoding on coefficients 5. Put frame data together into a bit stream Which of these are DSP-intensive?

Summary DSPs are cool Far more bang for the buck than microcontrollers for signal processing Interesting instruction sets, architectures, and compilers Sound processing Significant user of DSP chips Need to understand capabilities / limitations of human hearing