Evaluating MMX Technology Using DSP and Multimedia Applications

Similar documents
Characterization of Native Signal Processing Extensions

Intel s MMX. Why MMX?

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Evaluating Signal Processing and Multimedia Applications on SIMD, VLIW and Superscalar Architectures

An introduction to DSP s. Examples of DSP applications Why a DSP? Characteristics of a DSP Architectures

Real-Time High-Throughput Sonar Beamforming Kernels Using Native Signal Processing and Memory Latency Hiding Techniques

Audio-coding standards

2 1 Introduction Recent times have produced an increasing demand for digital signal processing and multimedia capabilities on a personal computer. The

REAL-TIME DIGITAL SIGNAL PROCESSING

Image Processing Tricks in OpenGL. Simon Green NVIDIA Corporation

Using Streaming SIMD Extensions in a Fast DCT Algorithm for MPEG Encoding

Media Instructions, Coprocessors, and Hardware Accelerators. Overview

Audio-coding standards

What is multimedia? Multimedia. Continuous media. Most common media types. Continuous media processing. Interactivity. What is multimedia?

Estimating Multimedia Instruction Performance Based on Workload Characterization and Measurement

Multimedia. What is multimedia? Media types. Interchange formats. + Text +Graphics +Audio +Image +Video. Petri Vuorimaa 1

VICP Signal Processing Library. Further extending the performance and ease of use for VICP enabled devices

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM

Impact of Source-Level Loop Optimization on DSP Architecture Design

DSP. Presented to the IEEE Central Texas Consultants Network by Sergio Liberman

FAST FIR FILTERS FOR SIMD PROCESSORS WITH LIMITED MEMORY BANDWIDTH

2 1 Introduction The rst addition to the X86 instruction set architecture in almost a decade was implemented to increase the performance and handling

DIGITAL VS. ANALOG SIGNAL PROCESSING Digital signal processing (DSP) characterized by: OUTLINE APPLICATIONS OF DIGITAL SIGNAL PROCESSING

Microprocessor Extensions for Wireless Communications

An Optimizing Compiler for the TMS320C25 DSP Chip

Native Signal Processing on the UltraSparc in the Ptolemy Environment

Real-time Signal Processing on the Ultrasparc

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi

PERFORMANCE ANALYSIS OF ALTERNATIVE STRUCTURES FOR 16-BIT INTEGER FIR FILTER IMPLEMENTED ON ALTIVEC SIMD PROCESSING UNIT

Fatima Michael College of Engineering & Technology

Implementing a Speech Recognition System on a GPU using CUDA. Presented by Omid Talakoub Astrid Yi

MULTIMEDIA COMMUNICATION

Representation of Numbers and Arithmetic in Signal Processors

Microprocessors, Lecture 1: Introduction to Microprocessors

More on Conjunctive Selection Condition and Branch Prediction

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Evaluation of Static and Dynamic Scheduling for Media Processors. Overview

D. Richard Brown III Associate Professor Worcester Polytechnic Institute Electrical and Computer Engineering Department

SWAR: MMX, SSE, SSE 2 Multiplatform Programming

Data encoding. Lauri Võsandi

Chapter 14 MPEG Audio Compression

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

REAL TIME DIGITAL SIGNAL PROCESSING

D. Richard Brown III Professor Worcester Polytechnic Institute Electrical and Computer Engineering Department

Multimedia Communications. Transform Coding

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:

Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors. Overview

Floating-point to Fixed-point Conversion. Digital Signal Processing Programs (Short Version for FPGA DSP)

Head, Dept of Electronics & Communication National Institute of Technology Karnataka, Surathkal, India

AUDIOVISUAL COMMUNICATION

A Flexible IIR Filtering Implementation for Audio Processing Juergen Schmidt, Technicolor R&I, Hannover

Typical DSP application

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

CS 335 Graphics and Multimedia. Image Compression

Audio Compression. Audio Compression. Absolute Threshold. CD quality audio:

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

Digital Signal Processing with Field Programmable Gate Arrays

COSC 6385 Computer Architecture. Instruction Set Architectures

Scalable Multi-core Sonar Beamforming with Computational Process Networks

BLIND MEASUREMENT OF BLOCKING ARTIFACTS IN IMAGES Zhou Wang, Alan C. Bovik, and Brian L. Evans. (

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply

Robert Matthew Buckley. Nova Southeastern University. Dr. Laszlo. MCIS625 On Line. Module 2 Graphics File Format Essay

Better sharc data such as vliw format, number of kind of functional units

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

Complex Instruction and Software Library Mapping for Embedded Software Using Symbolic Algebra

SIMD Implementation of the Discrete Wavelet Transform

How to Write Fast Numerical Code

A Framework for Real-Time High-Throughput Signal and Image Processing Systems on Workstations

All MSEE students are required to take the following two core courses: Linear systems Probability and Random Processes

How to Write Fast Numerical Code

Module 9 AUDIO CODING. Version 2 ECE IIT, Kharagpur

In this article, we present and analyze

Principles of Audio Coding

Using MMX Instructions to Implement a Synthesis Sub-Band Filter for MPEG Audio Decoding

John W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands

Optical Storage Technology. MPEG Data Compression

Source Coding Basics and Speech Coding. Yao Wang Polytechnic University, Brooklyn, NY11201

Simulation and Analysis of Interpolation Techniques for Image and Video Transcoding

CISC 7610 Lecture 3 Multimedia data and data formats

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

REAL TIME DIGITAL SIGNAL PROCESSING

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

Embedded Computation

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

S2CBench : Synthesizable SystemC Benchmark Suite for High-Level Synthesis

Performance of computer systems

Memory Access and Computational Behavior. of MP3 Encoding

high performance medical reconstruction using stream programming paradigms

A Characterization of High Performance DSP Kernels on the TRIPS Architecture

CS802 Parallel Processing Class Notes

Understanding multimedia application chacteristics for designing programmable media processors

Flexible wireless communication architectures

Cache Justification for Digital Signal Processors

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

DigiPoints Volume 1. Student Workbook. Module 8 Digital Compression

Transcription:

Evaluating MMX Technology Using DSP and Multimedia Applications Ravi Bhargava * Lizy K. John * Brian L. Evans Ramesh Radhakrishnan * November 22, 1999 The University of Texas at Austin Department of Electrical and Computer Engineering * Laboratory of Computer Architecture Telecommunications and Signal Processing Seminar 24-1

Evaluating MMX Technology Using DSP and Multimedia Applications This talk is a condensed version of a presentation given at: The 31st International Symposium on Microarchitecture (MICRO-31) Dallas, Texas November 30, 1998 http://www.ece.utexas.edu/~ravib/mmxdsp/ Telecommunications and Signal Processing Seminar 24-2

57 New assembly instructions 64-bit registers Aliased to FP registers EMMS Instruction No compiler support Telecommunications and Signal Processing Seminar 24-3

8, 16, 32, 64-bit fixed-point data Packing, unpacking of data Packed moves 16-bit multiply-accumulate Saturation arithmetic Telecommunications and Signal Processing Seminar 24-4

Independent evaluation of MMX How much speedup is possible? What tradeoffs are involved? Time, complexity, performance, precision Characterization of MMX workloads Instruction mix, memory accesses, etc. Telecommunications and Signal Processing Seminar 24-5

Finite Impulse Response Filter Speech, general filtering Fast Fourier Transform MPEG, spectral analysis Matrix &Vector Multiplication Image processing Infinite Impulse Response Filter Audio, LPC Telecommunications and Signal Processing Seminar 24-6

JPEG Image Compression Bitmap Image to JPEG Image 2D DCT G.722 Speech Encoding Compression, Encoding of Speech ADPCM Image Processing Uniform Color Manipulation Vector Arithmetic Doppler Radar Processing Vector Arithmetic, FFT Telecommunications and Signal Processing Seminar 24-7

Adjust non-mmx benchmark DSP environment Create MMX version Setup like non-mmx Use Intel Assembly Libraries Microsoft Visual C++ 5.0 Simulate with VTune 2.5.1 Telecommunications and Signal Processing Seminar 24-8

Not just function swapping Different input data types Fixed-point versus floating-point 16-bit versus 32-bit Reordering of data Ex: Arrangement of filter coefficients Row-order versus column-order Telecommunications and Signal Processing Seminar 24-9

Intel performance profiling tool Designed for hot spots Simulate sections of code Pentium with MMX CPU penalties Instruction mix Library calls Hardware performance counters Telecommunications and Signal Processing Seminar 24-10

12 Ratios (Non-MMX:MMX) 10 8 6 4 2 0 Cycles Dynamic Instructions Memory References jpeg g722 radar fir fft iir image matvec Ratio of non-mmx to MMX Programs Telecommunications and Signal Processing Seminar 24-11

JPEG and G722 show slowdowns Superlinear speedup in MatVec 16-bit data, 6.6X speedup Free unrolling MMX related overhead FIR, Radar, JPEG, G722 MMX multiplication Fewer cycles Requires unpacking Telecommunications and Signal Processing Seminar 24-12

%MMX Instructions 100 90 80 70 60 50 40 30 20 10 0 Emms Packed Moves MMX Arithmetic MMX Packs/Unpacks jpeg g722 radar fir fft iir image matvec % MMX Instructions and MMX Instruction Mix. Speedup increasing from left to right Telecommunications and Signal Processing Seminar 24-13

Input set size Small: FIR, Radar, G722, JPEG Large: IIR, Image, MatVec, FFT Affects MMX %, speedup Automatic Packing Less than 50% MMX arithmetic FFT Converts to FP Old version: 40% MMX, less speedup Telecommunications and Signal Processing Seminar 24-14

Ratio (Opt. Non-MMX:MMX) 2.5 2 1.5 1 0.5 0 Cycles Dynamic Instructions Memory References fft fir iir Ratio of Non-MMX Assembly to MMX Telecommunications and Signal Processing Seminar 24-15

Non-MMX version 1.98X faster But... inserted MMX code 1.6X faster Function call overhead 8.8X more in MMX version MMX Maintenance Instructions Accounting for precision Non-sequential data accesses Telecommunications and Signal Processing Seminar 24-16

Slowdown possible JPEG and G722 Parallel, contiguous data Hard to find Precision Obtainable at a price Library function call overhead Hand-coded assembly, inlining Telecommunications and Signal Processing Seminar 24-17

Speedup available with libraries Kernels: 1.25 to 6.6 Applications: 1.21 to 5.5 Versus optimized FP: 1.25 to 1.71 General Characteristics of MMX More static instructions used Fewer dynamic instructions Fewer memory references Less than 50% of MMX is arithmetic Telecommunications and Signal Processing Seminar 24-18

This concludes this portion of the talk. The following slides provide further information on: methodology, benchmarks, results, and additional work. Telecommunications and Signal Processing Seminar 24-19

Unreal 1.0 Doom-like game Command-line MMX switch Hardware Performance Counters 48% MMX Instructions Real-time. What is speedup? 1.34X more frame/second Same trends as benchmarks Telecommunications and Signal Processing Seminar 24-20

Focus on Important Code Buffer Inputs and Outputs No OS Effects Measured Real-time Atmosphere Telecommunications and Signal Processing Seminar 24-21

Some functions use MMX 8-bit and 16-bit data Scale factors Vector inputs Library-specific structures Signal Processing Library 4.0 Recognition Primitives Library 3.1 Image Processing Library 2.0 Telecommunications and Signal Processing Seminar 24-22

Precision JPEG Non-MMX SNR: 31.05 db MMX SNR: 31.04 db Image: No Change G722 Non-MMX SNR: 5.46 db MMX SNR: 5.18 db Doppler Radar Less than 1% Telecommunications and Signal Processing Seminar 24-23

Profiled Program 2D DCT Quantization Color Conversion 74% of execution time Small Block Size 8x8 blocks of pixels Telecommunications and Signal Processing Seminar 24-24

2D DCT Library only has 1D DCT Data in different order Quantization Not enough data parallelism Color conversion Create and fill vectors Telecommunications and Signal Processing Seminar 24-25

FIR Filter Finite Impulse Response Filter Moving averages filter Process one input at a time Non-MMX: 32-bit FP MMX: 16-bit fixed-point Filter length is 35 Telecommunications and Signal Processing Seminar 24-26

FFT Fast Fourier Transform Computes discrete Fourier Transform 4096-point In-place Whole FFT to MMX function Non-MMX: 32-bit FP MMX: 16-bit fixed-point Telecommunications and Signal Processing Seminar 24-27

MatVec Matrix & Vector Multiplication 512x512 matrix times 512-entry vector Dot product of two 512-entry vectors Both versions: 16-bit data Telecommunications and Signal Processing Seminar 24-28

IIR Infinite Impulse Response Filter Butterworth coefficients Direct form, Bandpass Filter length of 8, 17 coefficients Requires high precision Feedback Our versions unstable Telecommunications and Signal Processing Seminar 24-29

Doppler Radar Processing Subtract complex echo signals Removing stationary targets Estimates power spectrum Dominant frequency from peak of FFT 16-point, in-place FFT Telecommunications and Signal Processing Seminar 24-30

G.722 Speech Encoding Input signal: 16-bit, 16 khz Output signal: 8-bit, 8 khz 6 kb speech file Telecommunications and Signal Processing Seminar 24-31