Digital Signal Processor Core Technology

The World Leader in High Performance Signal Processing Solutions
Digital Signal Processor Core Technology
Abhijit Giri
Satya Simha
November 4th 2009

Outline
- Introduction to SHARC DSP ADSP21469
- ADSP2146x Core
  - Compute unit
  - ISA
  - Memory Architecture
  - Connectivity
- Implementation Methodology
- ADSP 2146x Core as an IP

SHARC DSP & I/Os ADSP21469

SHARC ADSP2146x Core

SHARC Architecture
- High-performance IEEE-754 32-bit/40-bit floating-point processor
- Upward compatibility with the ADSP-21020 (SISD)
- Very deterministic architecture
- 5 instruction pipeline stages (protected)
- 450 MHz (2.22 ns) core instruction rate
- Performs 2.7 GFLOPS / 900 MMACS
- Single-instruction multiple-data (SIMD) computational architecture provides:
  - Two 32-bit floating-point / 32-bit fixed-point / 40-bit extended-precision floating-point computational units
  - Each of the two units has a multiplier, an arithmetic logic unit, a shifter, and a register file
- Concurrent code execution
  - Single-cycle execution of a multiply or ALU operation, a dual memory read or write, and an instruction fetch
- Transfers data between core and memory at a sustained 5.4 GB/s bandwidth

SHARC Instruction Set
- Multiple parallel operations packed in compact instructions
- Variable-length (16/32/48-bit) instruction encoding achieves compact code
- Almost all instructions can be conditional; many also take an ELSE clause
  - If..then..else constructs are compiled into compact and efficient code
- Branches can have a delay slot
  - Minimizes wasted cycles
- Hardware looping instructions
  - Zero-overhead looping
- Most instructions have a compute part
  - Compute can be single-function or multifunction
- Algebraic-style instructions
  - Makes hand-coding easier

SHARC Instruction Set (contd.)
- Multifunction compute packs multiply, add, and subtract operations
- Example: inner loop of an FFT butterfly computation
    f13 = f1*f4, f12 = f8+f12, f14 = f8-f12, f4 = dm(i2,m0), f1 = pm(i15,m9);
- Peak performance: 6*f MFLOPS when operating in SIMD, where f is the core frequency in MHz
  - 2.7 GFLOPS for a 450 MHz processor
- Sustained peak MFLOPS is realizable due to:
  - Single-cycle multifunction compute
  - Parallel data load/store aided by DAGs from fast on-chip (L1) memory
  - Zero-overhead hardware looping
  - Shallow pipeline
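The 6*f figure above can be checked with a small model. The following is a minimal sketch, not the processor's actual semantics: it assumes that all three compute operations in the example instruction read their source registers before any result is written, and that SIMD mode runs the same instruction on two processing elements. The function names are hypothetical.

```python
def multifunction_step(f1, f4, f8, f12):
    """Model one multifunction compute: multiply, add and subtract together.

    Mirrors  f13 = f1*f4, f12 = f8+f12, f14 = f8-f12
    Assumption: every operation sees the register values from the start
    of the cycle, so the subtract uses the OLD value of f12.
    """
    f13 = f1 * f4
    new_f12 = f8 + f12
    f14 = f8 - f12  # old f12, not new_f12
    return f13, new_f12, f14


def peak_mflops(core_mhz, simd=True):
    """Peak MFLOPS: 3 floating-point ops per unit per cycle, 2 units in SIMD."""
    ops_per_unit = 3  # one multiply + one add + one subtract
    units = 2 if simd else 1
    return ops_per_unit * units * core_mhz


if __name__ == "__main__":
    print(multifunction_step(2.0, 3.0, 5.0, 1.0))
    print(peak_mflops(450))
```

With f = 450 MHz the model gives 2700 MFLOPS, matching the 2.7 GFLOPS quoted on the slide. Counting one multiply-accumulate per unit per cycle gives 2 * 450 = 900 MMACS in the same way.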

Performance Benchmarks at 400 MHz

Benchmark Algorithm                                 Speed at 400 MHz
1024-Point Complex FFT (Radix 4, with Reversal)     23.25 µs
FIR Filter (per Tap)                                1.50 ns
IIR Filter (per Biquad)                             5.00 ns
Matrix Multiply (Pipelined)
  [3 x 3] x [3 x 1]                                 11.25 ns
  [4 x 4] x [4 x 1]                                 20.00 ns
Divide (y/x)                                        8.75 ns
Inverse Square Root                                 13.50 ns
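To compare these timings with the architecture, it can help to express them in core cycles. The sketch below simply multiplies each quoted time by the 400 MHz clock rate; the dictionary keys and the function name are hypothetical, and the slide does not say how the per-item figures were derived, so fractional results should be read as averages per item rather than as instruction counts.

```python
# Times copied from the benchmark slide, all converted to nanoseconds.
BENCHMARKS_NS = {
    "fft_1024_complex": 23250.0,  # 23.25 us
    "fir_per_tap": 1.50,
    "iir_per_biquad": 5.00,
    "matmul_3x3_3x1": 11.25,
    "matmul_4x4_4x1": 20.00,
    "divide": 8.75,
    "inverse_sqrt": 13.50,
}


def to_cycles(time_ns, core_mhz=400.0):
    """Convert a duration in ns to core cycles at the given clock in MHz."""
    return time_ns * core_mhz / 1000.0


if __name__ == "__main__":
    for name, time_ns in BENCHMARKS_NS.items():
        print(f"{name:20s} {to_cycles(time_ns):10.2f} cycles")
```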

Memory System
[Diagram: SHARC core, IO processor, and external memory interface (8/16/32-bit) connected to memory through a crossbar; bus labels DMD, PMD, IOX, IOY, CMD, EPD; all busses 64-bit except those marked 32]
- Harvard architecture
  - Instruction and data busses
- 4-banked on-chip L1 at core speed
  - Fully addressable
  - Each bank 64 bits wide
  - Standard 1-deep pipelined interface
  - Can be compiler-generated
- Full crossbar interconnect
  - 4 accesses if no conflict
- Supports a large amount (16 Mb) of on-chip memory
  - ADSP21469 populated with 5 Mb RAM + 4 Mb ROM
- Directly addressable off-chip memory
- DMA transfer between on-chip and off-chip memory
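The "4 accesses if no conflict" point can be illustrated with a toy model. This is only a sketch under an assumption the slide does not state: that requests targeting the same bank in the same cycle are served one per cycle, while requests to different banks proceed in parallel through the crossbar. The function name and the bank numbering are hypothetical.

```python
from collections import Counter

NUM_BANKS = 4  # the slide describes a 4-banked on-chip L1 memory


def cycles_for_accesses(banks):
    """Cycles needed to serve simultaneous requests, one bank index each.

    Assumption: each bank serves one request per cycle, so the busiest
    bank determines the total.
    """
    if not banks:
        return 0
    for bank in banks:
        if not 0 <= bank < NUM_BANKS:
            raise ValueError(f"bank index out of range: {bank}")
    return max(Counter(banks).values())


if __name__ == "__main__":
    print(cycles_for_accesses([0, 1, 2, 3]))  # four requests, four banks
    print(cycles_for_accesses([0, 0, 1, 2]))  # two requests collide on bank 0
```

Under this model, placing operands that are fetched together in different banks keeps every access in a single cycle.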

System Bus Interfaces
[Diagram: internal memory (up to 4 banks) and the pahb, dahb, eahb, and edahb busses, each end marked M or S]
- Core interfaces with the IOP system over 4 AHB busses through appropriate bridges
  - pahb for MMR accesses
  - eahb for MMR accesses in the external memory interface, as well as direct off-chip access
  - edahb for DMA to/from external memory
  - dahb for DMA to/from all other peripherals
- M: unit can master the bus
- S: unit is a slave on the bus

Implementation Methodology
[Flow diagram. Inputs: library, RTL, scripts, constraints, guidelines, verification environment & vectors. Steps: synthesis, scan insertion & vector generation, floorplanning, P&R, extraction, functional simulation, netlist simulation, static timing, full chip build, LVS & DRC, "All OK?" check, tapeout]
- Design in Verilog HDL
- ASIC standard design flow
  - Cell-based synthesis
  - Some custom cells used for performance
  - Auto P&R
  - Clock-tree synthesis
  - Static-timing-based timing signoff
  - Coupling analysis/fixing
  - IR and EM checks/fixes
  - LVS/DRC etc.
- Verification
  - Self-checking directed tests
  - Constrained random test
  - Formal and semi-formal

SHARC Core
- 5-stage pipeline core (SIMD SHARC-V)
- Standard memory interface for internal memory
- AHB-compliant interfaces for peripherals
- Flip-flop-based design with few latches
- Design in synthesizable Verilog RTL
- Scan-ready
- Frequency depends on choice of technology / implementation
  - 450 MHz in a 65 nm technology (optimized for high performance)

SHARC Core IP collateral
- Core in Verilog RTL
  - Synthesizable Verilog RTL for simulation as well as synthesis
- Simulation environment
  - Standalone core environment
  - Tests for design verification
- Synthesis scripts and guidelines for DCT (Synopsys)
  - Clock descriptions, timing and other constraints, exceptions
- Documentation of interfaces (memory, peripherals)
- C simulator of the core
- Documentation on clocking guidelines (inside core), DFT, input clocking requirements of the core, power-on and reset
- Physical design guidelines
- Design support as required

ADSP 21469 I/O Peripherals
- Serial ports
- SPDIF
- I2C-compatible 2-wire interface
- UART
- SPI
- Timers
  - Pulse width count
  - PWM waveform generation
- Link ports
  - 8-bit bi-directional port with clock and ACK for fast link
- External memory interface
  - DDR2 and AMI
- All are RTL-based designs and are implemented in a standard ASIC flow in one hierarchy.

ADI VDSP++ Development Tools
VisualDSP++ provides an IDDE, which gives easy access to:
- Editor
- Compilers
  - C/C++
- VDK
  - RTOS/kernel
- Assembler
- Linker
- Simulator, including MP
- Emulator/debugger
- Plug-ins for easy programming of some of the peripherals

Summary
- ADSP 21469 is a modern 32-bit floating-point DSP
  - Delivers 2.7 GFLOPS at 450 MHz
- DSP core can easily be integrated into any SoC
  - Synthesis- and P&R-ready
  - Includes standard interfaces

Thank You

SHARC ADSP2146x Core - Summary
- 32-bit architecture
- Code compatibility with other SHARC family members at the assembly level
- Single-instruction multiple-data (SIMD) architecture provides
  - Two computational processing elements
  - Concurrent execution
- Compute units support
  - IEEE single-precision floating point (32-bit)
  - 40-bit extended-precision floating point for 32-bit resolution in floating-point computations
  - Also 32-bit fixed point
- Dual data address generators (DAGs)
  - Modulo (for circular buffers) and bit-reverse (FFT) addressing
- Sequencer supports
  - Zero-overhead looping with single-cycle loop setup
  - Low-overhead branching
  - VISA (variable instruction set) execution support
- Parallelism in buses and computational units allows
  - Single-cycle execution (with or without SIMD) of a multiply operation, an ALU operation, a dual memory read or write, and an instruction fetch
- 5-deep pipeline, fully interlocked
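The two DAG addressing modes named above can be sketched in software. This is an illustrative model of the general techniques (modulo addressing for circular buffers, bit-reversed indexing for FFT reordering), not a description of the DAG hardware registers; the function names and parameters are hypothetical.

```python
def circular_next(index, modify, base, length):
    """Post-modify an address and wrap it inside [base, base + length)."""
    return base + (index - base + modify) % length


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value, e.g. 0b001 -> 0b100 for bits=3."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


if __name__ == "__main__":
    # Walk an 8-entry circular buffer starting at address 100 in steps of 3.
    addr = 100
    for _ in range(5):
        print(addr)
        addr = circular_next(addr, 3, base=100, length=8)

    # Input ordering used by an 8-point radix-2 FFT.
    print([bit_reverse(i, 3) for i in range(8)])
```

In software both operations cost extra instructions per access; doing them in the address generators is what lets a filter or FFT inner loop spend its cycles on computation.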