Choosing a Processor: Benchmarks and Beyond (S043)

Similar documents
Benchmarking Processors for DSP Applications

Benchmarking Multithreaded, Multicore and Reconfigurable Processors

Benchmarking H.264 Hardware/Software Solutions

Smart Processor Picks for Consumer Media Applications

Separating Reality from Hype in Processors' DSP Performance. Evaluating DSP Performance

Evaluating the DSP Processor Options (DSP-522)

Independent DSP Benchmarks: Methodologies and Results. Outline

Smart Processor Picks for Consumer Audio/Video Applications

Microprocessors vs. DSPs (ESC-223)

Introduction to Video Compression

Processors for Embedded Digital Signal Processing

Motivation and scope Selection criteria and methodology Benchmarking options Processor architecture options Conclusions

Insider Insights on the ARM11 s Signal-Processing Capabilities

Audio Signal Processing Hardware Trends

Evaluating DSP Processor Performance

REAL TIME DIGITAL SIGNAL PROCESSING

REAL TIME DIGITAL SIGNAL PROCESSING

Consumer Media Products: Trends and Technologies (Workshop 206)

Developing and Integrating FPGA Co-processors with the Tic6x Family of DSP Processors

Implementing FFT in an FPGA Co-Processor

Evaluating the Latest DSPs for Communications Applications. Evaluating the Latest DSPs for Portable Communications Applications

The Nios II Family of Configurable Soft-core Processors

OpenRadio. A programmable wireless dataplane. Manu Bansal Stanford University. Joint work with Jeff Mehlman, Sachin Katti, Phil Levis

Hardware Implementation and Verification by Model-Based Design Workflow - Communication Models to FPGA-based Radio

FPGAs Provide Reconfigurable DSP Solutions

REAL TIME DIGITAL SIGNAL PROCESSING

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

DSP Co-Processing in FPGAs: Embedding High-Performance, Low-Cost DSP Functions

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Multimedia Decoder Using the Nios II Processor

Implementing Embedded Streaming Media: 10 Secrets of Success

Five Ways to Build Flexibility into Industrial Applications with FPGAs

FPGA system development What you need to think about. Frédéric Leens, CEO

Leveraging Mobile GPUs for Flexible High-speed Wireless Communication

Basic Computer Architecture

Digital Signal Processors: fundamentals & system design. Lecture 1. Maria Elena Angoletta CERN

DSP Solutions For High Quality Video Systems. Todd Hiers Texas Instruments

MIPS Technologies MIPS32 M4K Synthesizable Processor Core By the staff of

Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India

Altera SDK for OpenCL

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS

Energy scalability and the RESUME scalable video codec

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006

Research Challenges for FPGAs

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

MPSoC Design Space Exploration Framework

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Voice Analysis for Mobile Networks

Video Compression An Introduction

Blackfin Optimizations for Performance and Power Consumption

ESE Back End 2.0. D. Gajski, S. Abdi. (with contributions from H. Cho, D. Shin, A. Gerstlauer)

Emerging Architectures for HD Video Transcoding. Jeremiah Golston CTO, Digital Entertainment Products Texas Instruments

General Purpose Signal Processors

ECE332, Week 2, Lecture 3. September 5, 2007

ECE332, Week 2, Lecture 3

The Shrink Wrapped Myth:

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc.

Cut DSP Development Time Use C for High Performance, No Assembly Required

Embedded Computation

TKT-2431 SoC design. Introduction to exercises

ARM Processors for Embedded Applications

The Evolution of DSP Processors

White Paper Assessing FPGA DSP Benchmarks at 40 nm

Digital Signal Processor 2010/1/4

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany

Cache Justification for Digital Signal Processors

Classification of Semiconductor LSI

Xilinx DSP. High Performance Signal Processing. January 1998

FPGAs: FAST TRACK TO DSP

VICP Signal Processing Library. Further extending the performance and ease of use for VICP enabled devices

Higher Level Programming Abstractions for FPGAs using OpenCL

Emerging Architectures for HD Video Transcoding. Leon Adams Worldwide Manager Catalog DSP Marketing Texas Instruments

Agenda. Introduction FPGA DSP platforms Design challenges New programming models for FPGAs

Cover TBD. intel Quartus prime Design software

2008/12/23. System Arch 2008 (Fire Tom Wada) 1

VIDEO COMPRESSION STANDARDS

Cover TBD. intel Quartus prime Design software

VLSI Design Automation. Maurizio Palesi

ENHANCED TOOLS FOR RISC-V PROCESSOR DEVELOPMENT

The Role of Performance

TI TMS320C6000 DSP Online Seminar

PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM

Coarse Grain Reconfigurable Arrays are Signal Processing Engines!

Memory Systems IRAM. Principle of IRAM

Performance Verification for ESL Design Methodology from AADL Models

Platform-based Design

INSIGHTS. FPGA - Beyond Market Data. Financial Markets

Software Defined Modem A commercial platform for wireless handsets

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009

WS_CCESSH-OUT-v1.00.doc Page 1 of 8

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

Lecture 1: Course Introduction and Overview Prof. Randy H. Katz Computer Science 252 Spring 1996

EEM870 Embedded System and Experiment Lecture 4: SoC Design Flow and Tools

Understanding Peak Floating-Point Performance Claims

TKT-2431 SoC design. Introduction to exercises. SoC design / September 10

White Paper Using Cyclone III FPGAs for Emerging Wireless Applications

RapidIO.org Update. Mar RapidIO.org 1

Distributed Vision Processing in Smart Camera Networks

Storage validation at GoDaddy Best practices from the world s #1 web hosting provider

Transcription:

Insight, Analysis, and Advice on Signal Processing Technology Choosing a Processor: Benchmarks and Beyond (S043) Jeff Bier Berkeley Design Technology, Inc. Berkeley, California USA +1 (510) 665-1600 info@bdti.com http://www.bdti.com Presentation Goals By the end of this workshop, you should know: What to consider when choosing a processor How to make the selection process manageable How to use benchmarks Benchmark results for TI processors and selected competitors 2 TI Developer s Conference Page 1

Outline Processor selection criteria Processor selection methodology Processor benchmarking approaches Sample benchmark results Benchmarking hardware+software solutions Conclusions 3 Processor Selection Criteria Performance on relevant tasks Speed Numeric fidelity Fixed-point vs. floating-point Data word size(s) Execution-time predictability Dynamic features confound determinism Energy consumption Memory bandwidth: on-chip, off-chip Memory usage 4 TI Developer s Conference Page 2

Processor Selection Criteria Cost On-chip integration Coprocessors Memory I/O interfaces Other peripherals Packaging options Sizes Temperature ranges Ease of manufacture 5 Availability and Roadmap Risk Availability; reliability of supply Multi-vendor architectures a plus What does the errata list look like? Roadmap Vendor commitment to evolving the chip, e.g., improved integration, reduced cost Roadmap for next-generation architectures Compatibility of future parts What is your confidence that the vendor will execute on its roadmap? 6 TI Developer s Conference Page 3

Development Considerations Programming model complexity Developer familiarity Compatibility Tools (vendor, 3rd party) Accurate cycle-count and memory profiling Visibility into cache, pipeline Debug/development benefit from tools with: Peripheral and multi-processor simulation Non-intrusive, real-time debug 7 Development Considerations Language support Quality of C compiler; availability of C++ compiler Support for assembly language optimization Software availability Signal processing components Device drivers and other general-purpose software Operating systems Hardware/software reference designs 8 TI Developer s Conference Page 4

Processor Selection Methodology Use a hierarchical approach to make the problem manageable: Determine selection criteria Prioritize or assign weights to selection criteria Use critical criteria to eliminate obviously unsuitable choices Begin with classes of processors Evaluate and rank candidates Weigh trade-offs among non-critical criteria Iterate as necessary Refine criteria and analysis of candidates 9 Outline Processor selection criteria Processor selection methodology Processor benchmarking approaches Sample benchmark results Benchmarking hardware+software solutions Conclusions 10 TI Developer s Conference Page 5

Why Do Benchmarks Matter? Assess key processor metrics accurately, e.g., Speed (not cycle counts!) Cost efficiency Energy efficiency (not power consumption!) Memory efficiency Use limited engineering resources effectively Compare performance across a wide range of architectures, applications 11 Typical Application Decomposition Applications Portable audio player Wireless handset Video conf. system Application Components OS Audio decoder Audio encoder Speech codec Video decoder Video encoder Algorithm Kernels FIR FFT DCT VECADD Operations Add Mult/MAC Shift Load 12 TI Developer s Conference Page 6

Flexibility vs. Accuracy Application Accuracy Application Component Algorithm Kernel Operation Flexibility 13 Flexibility vs. Accuracy Portable Video Player Application Cellular Base Station Application Component Algorithm Kernel Operation Personal Video Recorder WiMAX Base Station Set-Top Box DSL Gateway Cable Head-end 14 TI Developer s Conference Page 7

What s Wrong with MMACS? MMACS approximates performance on some signal processing algorithms like FIR filters, but: It ignores other operations required to sustain repeated MACs It ignores memory bandwidth bottlenecks Many important signal processing algorithms don t use MACs! Example: C5510 and PXA260 200 MHz C5510: 400 MMACS and 1,200 million bytes/sec 400 MHz PXA260: 800 MMACS and 1,600 million bytes/sec These two processors have comparable signal processing speed! 15 Algorithm Kernels Computationally intensive portions of signal processing applications FFTs, filters, bit unpack,! Strong predictors of performance "Do not measure system-level performance or OS overhead!modest programming effort IDCT 39% Window 25%!Results for common kernels widely available "Difficult to apply to multi-core processors, hardware accelerators, FPGAs, etc. Examples: BDTI DSP Kernel Benchmarks, BDTI Video Kernel Benchmarks Other 25% Denorm 11% 16 TI Developer s Conference Page 8

Example: BDTI DSP Kernel Benchmarks Hand optimized! Reflects common coding practice! Accurate representation of architecture capability Moderate level of effort Detailed programming rules! Ensures fair comparison between architectures " Complicates programming!large base of results available for comparison! Nearly 100 architectures already benchmarked! Provides easy means for quick and accurate analysis 17 Benchmark Results Example: C64x Family BDTImark2000 TM BDTImark2000 Power Consumption BDTImemMark2000 TM BDTImark2000 10 ku Price 18 TI Developer s Conference Page 9

Benchmark Results C64x C64x+ TigerSHARC 19 Benchmark Results C64x Blackfin 20 TI Developer s Conference Page 10

Benchmark Results C55x Blackfin MSC71xx 21 Typical Application Decomposition Applications Portable audio player Wireless handset Video conf. system Application Components OS Audio decoder Audio encoder Speech codec Video decoder Video encoder Algorithm Kernels FIR FFT DCT VECADD Operations Add Mult/MAC Shift Load 22 TI Developer s Conference Page 11

Application Components Model a key signal processing task!often representative of overall workload!easier to implement than a full application "Less general than a set of kernel benchmarks Larger workload vs. kernel benchmarks!allows comparison of different types of architectures!simplifies programming rules Can benchmark the entire system Capture effects of memory size, bandwidth, etc. "Does not capture effects of combining multiple tasks 23 Example Application Component Benchmark BDTI Communications Benchmark (OFDM) is based on a simplified 10 Mbps OFDM receiver Closely resembles a real-world task Simplified to enable optimized implementations Constrained to ensure consistent, reasonable implementation practices IQ Demodulator FIR FFT Slicer Viterbi Decoder 24 TI Developer s Conference Page 12

BDTI Communications Benchmark Freescale MSC7110 (200 MHz) TI C6410 (400 MHz) Altera Stratix 1S20-6 Altera Stratix 1S80-6 Bit rate 5.6 Mbit/s 12 Mbit/s 800 Mbit/s 2400 Mbit/s Cost (1 ku) $14 $18 $120 $600 Cost per Mbit/s $2.50 $1.45 $0.15 $0.25 From BDTI s report FPGAs for DSP and unpublished benchmarks. 25 BDTI Communications Benchmark Estimated Engineering Effort (for an Optimized Implementation) With Block Libraries Without Block Libraries Typical DSP 1-2 weeks 8-10 weeks Typical FPGA* ~40 weeks??? *Assumes traditional HDL design flow 26 TI Developer s Conference Page 13

Full Application Benchmarks! Potential for highly accurate results "Results useful only for specific application (or highly similar applications) "Applications tend to be ill-defined!may be able to use existing application code as a benchmark Example: BDTI H.264 Decoder Solution Certification " but costly and time-consuming to implement a new application "For processors, similar results via simpler approaches But this is not true for all implementation technologies 27 Outline Processor selection criteria Processor selection methodology Processor benchmarking approaches Sample benchmark results Benchmarking hardware+software solutions Conclusions 28 TI Developer s Conference Page 14

The Problem with Solutions Vendors increasingly offer HW +SW solutions But solution performance claims are very difficult to use and compare Hantro s H.264 player for series 60 handsets is based on the 6100 software decoder and PlayEngine middleware. Running on the Nokia 7610 handset, full screen video (208x176 resolution) at 15 frames per second can be achieved. We're shipping today, running a 90-MHz processor and delivering 20-frame per second QCIF video, which is a very acceptable level. Agere H.264 player on 600 MHz Blackfin, CIF (360 x 240) at 30 fps: 111 MHz ADI 29 Application Code as a Benchmark!Actual application code can give the most accurate and relevant measure of performance "Usually impractical to implement application code solely for benchmarking purposes "Vendor s data is often difficult to interpret "Varying configurations and conditions "Varying performance metrics "Inability to quickly distinguish real solutions from vaporware 30 TI Developer s Conference Page 15

BDTI s Methodology Standardization: Operating points Test streams Metrics Certification (independent verification): Functionality Performance Benefits: Meaningful, comparable performance data Real solutions distinguished from vaporware 31 BDTI H.264 Decoder Solution Certification Primary Operating Point: Baseline profile, level 1.3 D1 resolution (720 480) 30 frames per second 2 Mbit/second bitstream Secondary Operating Points are used to provide a complete performance picture Metrics: CPU use (MHz, % loading) Memory bandwidth use (Mbit/second, % loading) Energy consumption (mj/frame) Cost or die area ($ or mm²) Program and data memory use (Mbytes) 32 TI Developer s Conference Page 16

Outline Processor selection criteria Processor selection methodology Processor benchmarking approaches Sample benchmark results Benchmarking hardware+software solutions Conclusions 33 Conclusions Know what to look for in a processor Clearly define the application requirements Consider all the processor options Be alert for incomplete or misleading information Use a hierarchical approach to pick a processor Develop a hierarchy of requirements Start with critical criteria; iteratively narrow the field Expect to make trade-offs 34 TI Developer s Conference Page 17

Conclusions Benchmarks are invaluable, if you Choose the right benchmarking approach for the task at hand Different approaches make different trade-offs Consider all the relevant metrics Beware the many benchmarking pitfalls Don t loose sight of non-performance considerations 35 For More Information www.bdti.com Inside DSP newsletter and quarterly reports Benchmark scores for dozens of processors Pocket Guide to Processors for DSP Basic stats on over 40 processors Articles, white papers, and presentation slides Processor architectures and performance Signal processing applications Signal processing software optimization comp.dsp FAQ 6 th Edition 36 TI Developer s Conference Page 18