An Ultra High Performance Scalable DSP Family for Multimedia
Hot Chips 17, August 2005, Stanford, CA
Erik Machnicki

Media Processing Challenges
- Increasing performance requirements
- Need for flexibility and scalability
- Increasing cost and time-to-market pressure
- Need to provide a system solution
  - System software is very different from media processing software
  - Need to provide the required I/O

New Shift to Thread-Parallel Computing
- VLIW alone does not solve all these challenges
- Thread-level parallel architectures are becoming viable
  - Increased transistor budgets at newer process technologies
  - Multithreaded programming model and tools maturing
  - Reduced power achieved versus complex, single-core processors
- Industry is switching to thread-parallel models
  - Cell processor, Xbox 360
  - Multi-core roadmaps emerging everywhere

MDSP Solution
Multiprocessor DSP (MDSP) architecture addresses media processing challenges
- Performance
  - Multiple area-efficient engines => higher true MIPS/mm²
  - Scalable memory hierarchy for high bandwidth
- Scalability
  - Can scale performance by number of cores, not just frequency
  - Loosely coupled programming model makes it easy to scale software as the number of cores changes
- Ease of optimization
  - Dedicated IMem => no I-cache misses
  - Threads with a single instruction per cycle simplify optimization
- System solution
  - GPP used for control code so DSP can be kept busy with computation
  - Multiple processors => multiple control threads => improved real-time behavior
  - Programmable I/O provides user-defined mix of I/O interfaces

CT3616 Block Diagram
[Figure: block diagram. PCI, DDR controller, 8-channel DMA and global functions attach to the global bus. Quad 0 and Quad 1 each contain 128 KB data memory, 32 KB I-cache, a PE (4 GPPs), a DSE (8 DSPs) and a PIO (9 pin groups).]

Quad Structure
[Figure: Quad 1. DSP0 to DSP7, each with its own IMem; GPP0 to GPP3; BIU1; GBI; 128 KB data memory; 32 KB I-cache.]
- 2 compute subsystems (quads), each with:
  - 8 DSP engines
  - 4 general purpose processors
  - 128 KB of shared data memory
  - 32 KB of shared instruction cache
  - Separate instruction and data buses
- Close I/O subsystem integration
  - Integrated bus interface unit
  - Efficient input stream pre-processing
  - Reduce DRAM accesses

DSP Cores
[Figure: DSP core datapath. Register bank (192 x 32-bit) and program memory (2K x 20-bit); DMA and FIFO paths from and to the memories; ALU, SIMD, FPU and PIMAC units; 4 FP + 4/16 PI MACs.]
- High memory bandwidth provided by 192 registers with background transfers to/from local data memory
- MAC with 32-bit floating point or 4x8-bit, 2x16-bit, or 1x32-bit integer operands
- 16 parallel signed/unsigned 8x8-bit MAC operations per clock per DSP (256 MACs/cycle across the 16 DSPs)
- SIMD operations on packed 8-bit or 16-bit data
- Sum of Absolute Differences (SAD) for video encoding
- Zero-cycle arbitrary bit-field access and packing
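The SAD operation named above is the basic cost function of block-matching motion estimation. The following is a minimal software sketch of what it computes, not the CT3616 instruction itself; the function names `sad` and `best_match` and the sample pixel values are illustrative assumptions.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match(current, candidates):
    """Return (index, sad) of the candidate block with the lowest SAD."""
    scores = [sad(current, c) for c in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical 8-bit pixel values, flattened into short lists for brevity.
current = [10, 20, 30, 40]
candidates = [[0, 0, 0, 0], [12, 18, 33, 40], [10, 20, 30, 41]]

print(best_match(current, candidates))
```

A hardware SAD unit evaluates many of these absolute differences per cycle on packed 8-bit data, which is why the slide lists it alongside the SIMD operations.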

Memory Architecture Offers High Bandwidth
[Figure: Quad 0 and Quad 1, each with 8 DSPs (each DSP with its own register bank), 4 GPPs and a shared memory; both quads connect to off-chip memory.]
- Three-level memory and bus hierarchy (90% of accesses are to local DSP memory)
- At 350 MHz DSP clock, 333 MHz DRAM clock:
  - 16 DSPs, each with 3 (2R, 1W) x 4 bytes x 350 MHz bandwidth to registers (67 GB/s)
  - 2 shared memories, each with 8 bytes x 350 MHz bandwidth (5.6 GB/s)
  - 1 DDR DRAM with 8 bytes x 333 MHz bandwidth (2.7 GB/s)
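The three bandwidth figures follow directly from the port counts, widths and clock rates on the slide. The short calculation below reproduces them; the helper name `bandwidth_gb_per_s` is illustrative, and GB is taken as 10^9 bytes.

```python
DSP_CLOCK_HZ = 350e6   # DSP clock from the slide
DRAM_RATE_HZ = 333e6   # DRAM rate from the slide


def bandwidth_gb_per_s(units, bytes_per_transfer, rate_hz):
    """Aggregate bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return units * bytes_per_transfer * rate_hz / 1e9


# 16 DSPs, each with 3 ports (2 read, 1 write) of 4 bytes.
register_bw = bandwidth_gb_per_s(16, 3 * 4, DSP_CLOCK_HZ)
# 2 shared memories, 8 bytes wide.
shared_bw = bandwidth_gb_per_s(2, 8, DSP_CLOCK_HZ)
# 1 DDR DRAM interface, 8 bytes wide.
dram_bw = bandwidth_gb_per_s(1, 8, DRAM_RATE_HZ)

print(f"registers: {register_bw:.1f} GB/s")
print(f"shared:    {shared_bw:.1f} GB/s")
print(f"DRAM:      {dram_bw:.1f} GB/s")
```

Each level offers roughly an order of magnitude less bandwidth than the one above it, which is why keeping 90% of accesses local to the DSP matters.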

DRAM Performance
[Figure: GBI; DMA0 to DMA3; ARB; ATU/Remap; DDR controller; MSS.]
- DDR 333 MHz data rate
- GPP/DSP DRAM writes fast
- 4 DMA engines: DRAM reads fast
  - Pre-fetch 1D and 2D data from DRAM to shared memory
  - Support chained transfers
- Memory remapping
  - Minimize page misses for 2D
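To make the idea of a 2D pre-fetch concrete, the sketch below expands a rectangular block request into the row-by-row bursts a DMA engine would have to issue, and chains several such requests. The `Transfer2D` descriptor, its field names and the address values are hypothetical; the slides do not describe the actual CT3616 descriptor format.

```python
from dataclasses import dataclass


@dataclass
class Transfer2D:
    """Hypothetical 2D transfer descriptor (not the real CT3616 layout)."""
    src_base: int    # byte address of the block's top-left pixel in DRAM
    row_bytes: int   # bytes to copy per row
    rows: int        # number of rows in the block
    src_stride: int  # bytes between the starts of consecutive rows


def row_bursts(t):
    """Expand one 2D transfer into (address, length) bursts, one per row."""
    return [(t.src_base + r * t.src_stride, t.row_bytes) for r in range(t.rows)]


def chain(transfers):
    """Process a chain of transfers back to back, as one list of bursts."""
    bursts = []
    for t in transfers:
        bursts.extend(row_bursts(t))
    return bursts


# Example: a 16x16 block of 8-bit pixels from a 720-byte-wide frame.
block = Transfer2D(src_base=0x1000, row_bytes=16, rows=16, src_stride=720)
bursts = chain([block])
print(bursts[:2], "total bytes:", sum(length for _, length in bursts))
```

Because consecutive rows sit a full stride apart, a naive address mapping can land each row in a different DRAM page; this is the situation the memory remapping bullet is aimed at.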

I/O Subsystem
[Figure: DSPs and GPPs on the instruction bus and data bus; global bus interface; instruction cache; data memory; IO bus; bus interface unit; 9 pin groups.]
- BIU transfers data from/to pin groups (which implement pin-level signaling interfaces)
- Scatter/gather hardware in the BIU streams data directly between memories and the BIU
- DSPs do ECC checking (CRC, Reed-Solomon) and low-level command/status/data processing
- DSPs can also do preprocessing of data (downsampling, scaling, etc.)
- GPPs support device-level APIs for high-level command/status/data interfaces

Pin Groups
[Figure: data bus; bus interface unit; IO bus; programmable logic array; FIFOs; ancillary logic unit; control; 8 control/data pins.]
- 18 pin groups (CT3616) of 8 pins each
- The PLA provides a fast state machine for implementing pin-level signaling interfaces
- FIFOs and serial I/O units assemble words
- Ancillary logic has commonly used units, e.g. timers, parity, comparators
- Pin groups can be grouped together for wider devices

CT3600 Programming Model
- Two main types of parallelism
  - Data parallelism: each processor works on its own chunk of data
  - Pipelined parallelism: each processor does part of the algorithm on a chunk of data, then passes the chunk on to the next stage in the pipeline
  - Most applications have some of both
- Loosely coupled, coarse-grain multiprocessing
  - Tasks running on different processors process large chunks of data independently
  - Synchronization and sharing of data is a small percentage of the work
  - Media applications fit this model very well
- Master/slave model
  - GPPs do control code and synchronization (masters)
  - DSPs do heavy computation and processing (slaves)
  - Each processor can be optimized for its respective type of code => efficient use of silicon
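The two kinds of parallelism can be contrasted with a small sketch. `split_rows` divides a frame's macroblock rows among workers (data parallelism), while `run_pipeline` passes one chunk through successive stages (pipelined parallelism). Both function names and the toy stages are illustrative assumptions, and the stages run sequentially here purely to show the data flow.

```python
def split_rows(total_rows, workers):
    """Data parallelism: give each worker a contiguous (start, end) row range."""
    base, extra = divmod(total_rows, workers)
    ranges = []
    start = 0
    for w in range(workers):
        count = base + (1 if w < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def run_pipeline(chunk, stages):
    """Pipelined parallelism: each stage transforms the chunk and passes it on."""
    for stage in stages:
        chunk = stage(chunk)
    return chunk


# A CIF frame (352x288) has 18 rows of 16x16 macroblocks; split them over 4 workers.
print(split_rows(18, 4))

# Two toy stages standing in for steps of a real algorithm.
print(run_pipeline(3, [lambda x: x + 1, lambda x: x * 2]))
```

On a real multiprocessor each row range, or each pipeline stage, would run on a different core, with synchronization only at chunk boundaries.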

Dynamic Scheduling
- Dynamic scheduling of tasks
  - Two classes of tasks: GPP tasks and DSP tasks
  - Algorithm is statically partitioned into these two classes
  - One or more software queues for each class
  - When processors finish their current task, they grab the next task from the appropriate queue
[Figure: a DSP task queue (Task 1 ... Task X) served by DSP 0 ... DSP N, and a GPP task queue (Task A ... Task Z) served by GPP 0 ... GPP M.]

Dynamic Scheduling (Cont'd)
- Advantages of dynamic scheduling
  - Easier to partition: only need to worry about partitioning tasks into GPP or DSP tasks, not about which GPP or DSP to run them on
  - Scalable: the same application can run with any number of GPP or DSP resources to get the desired performance
  - Automatic load balancing: even with software pipelining, no processor is idle, since each grabs the next task as soon as it finishes
  - Low overhead: for the H.264 encoder, overhead is < 2%
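The queue-based scheme can be sketched in software with one shared queue per task class and a pool of workers that pull from it. This is only an illustration using Python threads; the names `worker` and `run_class`, the task tuple layout and the squaring workload are assumptions, and the Cradle runtime itself is not described in the slides.

```python
import queue
import threading


def worker(name, tasks, results, lock):
    """Grab tasks from the shared queue until it is empty."""
    while True:
        try:
            task_id, func, arg = tasks.get_nowait()
        except queue.Empty:
            return  # no work left for this class of processor
        value = func(arg)
        with lock:
            results[task_id] = (name, value)


def run_class(prefix, num_workers, task_list):
    """Run one class of tasks (e.g. DSP tasks) on num_workers workers."""
    tasks = queue.Queue()
    for item in task_list:
        tasks.put(item)
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(f"{prefix}{i}", tasks, results, lock))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Eight hypothetical DSP tasks: (task id, function, argument).
dsp_tasks = [(i, lambda x: x * x, i) for i in range(8)]

# The same task list runs unchanged on 1 or 4 workers.
for n in (1, 4):
    done = run_class("DSP", n, dsp_tasks)
    print(n, "workers:", sorted((k, v[1]) for k, v in done.items()))
```

The task list never names a specific worker, which is what lets the same application run on any number of processors and balances load automatically.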

INSPECTOR
Graphical multiprocessor debugger
- Unlimited breakpoints across multiple processors
- Symbolic debugging support
- Debug multiple processors at once
- Processor status summary
- View multiple processor registers

Multiprocessor Profiling and Performance Analysis Tools
- Customized profiling
- Identify performance bottlenecks quickly
- Function and instruction profiling
- True multiprocessing support

Performance

Chip            DSP Cores  GPP Cores  Freq (MHz)  16-bit MMACs/sec  MPEG4 Channels*  $/Channel**  mW/Channel
TI DM642™       1          0          600         2400              4                $10          450
Cradle CT3616™  16         8          350         22400             16               $5.50        300

* Channel: MPEG4 SP L3, CIF encoder @ 30 fps
** Based on 10K book price
Trademarks are property of their respective owners.
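The MMACs figures in the table are consistent with a simple cores x MACs-per-cycle x frequency product. The check below assumes 4 16-bit MACs per DSP per cycle for both chips and 16 8-bit MACs per cycle for the CT3616; these per-cycle counts are inferred from the table and the DSP core slide rather than stated as such, and the helper names are illustrative.

```python
def peak_mmacs(dsp_cores, macs_per_cycle, freq_mhz):
    """Peak millions of MACs per second for a whole chip."""
    return dsp_cores * macs_per_cycle * freq_mhz


def total_power_mw(channels, mw_per_channel):
    """Power implied by the per-channel figure when all channels are in use."""
    return channels * mw_per_channel


# Assumed per-cycle MAC counts (inferred, see lead-in).
dm642_16bit = peak_mmacs(1, 4, 600)
ct3616_16bit = peak_mmacs(16, 4, 350)
ct3616_8bit = peak_mmacs(16, 16, 350)

print("DM642 16-bit MMACs/s: ", dm642_16bit)
print("CT3616 16-bit MMACs/s:", ct3616_16bit)
print("CT3616 8-bit GMACs/s: ", ct3616_8bit / 1000)
print("CT3616 power at 16 channels (mW):", total_power_mw(16, 300))
```

The 8-bit result matches the 89.6 GMACs quoted in the summary, and the implied 4.8 W at 16 channels agrees with the < 5 W figure on the silicon status slide.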

Performance (cont'd)

CODEC      Resolution     % of 3616 Used for Encode
MJPEG      D1 @ 30 FPS    8%
MPEG4 SP   CIF @ 30 FPS   6%
MPEG4 SP   D1 @ 30 FPS    22%
H.264 MP   CIF @ 30 FPS   17%
H.264 MP   D1 @ 30 FPS    75%
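The utilization percentages translate into a rough channel count per chip. The estimate below assumes that load scales linearly with the number of channels and ignores any fixed overhead, which the slides do not confirm; the dictionary layout and the name `max_channels` are illustrative.

```python
# Percent of one CT3616 used to encode a single channel (from the table).
usage_percent = {
    ("MJPEG", "D1"): 8,
    ("MPEG4 SP", "CIF"): 6,
    ("MPEG4 SP", "D1"): 22,
    ("H.264 MP", "CIF"): 17,
    ("H.264 MP", "D1"): 75,
}


def max_channels(percent_per_channel):
    """Whole channels that fit in 100% of the chip, assuming linear scaling."""
    return 100 // percent_per_channel


for (codec, resolution), percent in usage_percent.items():
    print(f"{codec:9s} {resolution:4s} {max_channels(percent):2d} channel(s)")
```

Under that assumption the table gives 16 MPEG4 CIF channels or 1 H.264 D1 channel per chip, the same figures quoted in the summary.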

CT3616 Silicon Status
- TSMC 0.13 µm LV process
- 55 million transistors
- < 5 W @ 1.2 V, 350 MHz, running DV stress test
- 400 KB on-chip RAM/cache
- First silicon received 1Q 05
- General samples with development board and BSP available in 3Q 05

CT3600 Family

Summary
- Multicore architecture for media processing
  - 16 DSP cores optimized for video/media: SIMD, SAD, PIMAC with up to 89.6 GMACs (8x8)
  - 8 GPP cores for efficient handling of control code
  - High data throughput: DDR, DMAs, high-speed internal bus
  - Flexible: programmable cores, programmable I/O
  - Scalable programming model to simplify design
- State-of-the-art development tools
  - Graphical multiprocessor debugger/profiler
  - Multiprocessor dynamic scheduling and run-time analysis
- High performance
  - 1 H.264 D1 or 16 MPEG4 CIF channels, encode
  - With resources to spare for audio, intelligent video, I/O, etc. => true single-chip solution

Thank you
For more information, or to schedule a demo, visit www.cradle.com or contact sales@cradle.com