Lightweight Arithmetic for Mobile Multimedia Devices. IEEE Transactions on Multimedia

Similar documents
Lightweight Arithmetic for Mobile Multimedia Devices

Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform

Floating-Point Data Representation and Manipulation 198:231 Introduction to Computer Organization Lecture 3

Floating Point : Introduction to Computer Systems 4 th Lecture, May 25, Instructor: Brian Railing. Carnegie Mellon

Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition. Carnegie Mellon

Floating-point Arithmetic. where you sum up the integer to the left of the decimal point and the fraction to the right.

Floating Point (with contributions from Dr. Bin Ren, William & Mary Computer Science)

Foundations of Computer Systems

Floating Point. CSE 351 Autumn Instructor: Justin Hsia

Floating point. Today! IEEE Floating Point Standard! Rounding! Floating Point Operations! Mathematical properties. Next time. !

ECE232: Hardware Organization and Design

Written Homework 3. Floating-Point Example (1/2)

Data Representation Floating Point

Data Representation Floating Point

Data Representation Type of Data Representation Integers Bits Unsigned 2 s Comp Excess 7 Excess 8

EE 109 Unit 20. IEEE 754 Floating Point Representation Floating Point Arithmetic

Floating Point Puzzles. Lecture 3B Floating Point. IEEE Floating Point. Fractional Binary Numbers. Topics. IEEE Standard 754

Data Representation Floating Point

3 Data Storage 3.1. Foundations of Computer Science Cengage Learning

Floating Point Numbers

Floating Point Numbers

Representing and Manipulating Floating Points. Jo, Heeseung

EE 109 Unit 19. IEEE 754 Floating Point Representation Floating Point Arithmetic

CSCI 402: Computer Architectures. Arithmetic for Computers (3) Fengguang Song Department of Computer & Information Science IUPUI.

3.5 Floating Point: Overview

Representing and Manipulating Floating Points

Floating Point. CSE 351 Autumn Instructor: Justin Hsia

Representing and Manipulating Floating Points. Computer Systems Laboratory Sungkyunkwan University

CS61c Midterm Review (fa06) Number representation and Floating points From your friendly reader

02 - Numerical Representations

Chapter 2 Float Point Arithmetic. Real Numbers in Decimal Notation. Real Numbers in Decimal Notation

Giving credit where credit is due

Giving credit where credit is due

Description Hex M E V smallest value > largest denormalized negative infinity number with hex representation 3BB0 ---

Floating point. Today. IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties Next time.

Floating Point Puzzles The course that gives CMU its Zip! Floating Point Jan 22, IEEE Floating Point. Fractional Binary Numbers.

System Programming CISC 360. Floating Point September 16, 2008

ECE 450:DIGITAL SIGNAL. Lecture 10: DSP Arithmetic

Floating Point. CSE 351 Autumn Instructor: Justin Hsia

Project Goals. Project Approach

Representing and Manipulating Floating Points

Systems I. Floating Point. Topics IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties

Computer Arithmetic Floating Point

Audio Compression. Audio Compression. Absolute Threshold. CD quality audio:

Floating Point Arithmetic

Floating Point January 24, 2008

Computer Organization: A Programmer's Perspective

Up next. Midterm. Today s lecture. To follow

Introduction to Computer Science (I1100) Data Storage

Floating Point Puzzles. Lecture 3B Floating Point. IEEE Floating Point. Fractional Binary Numbers. Topics. IEEE Standard 754

Measuring Improvement When Using HUB Formats to Implement Floating-Point Systems under Round-to- Nearest

Floating Point. EE 109 Unit 20. Floating Point Representation. Fixed Point

Implementation of DSP and Communication Systems

Floating Point. CSE 238/2038/2138: Systems Programming. Instructor: Fatma CORUT ERGİN. Slides adapted from Bryant & O Hallaron s slides

Representing and Manipulating Floating Points

Audio-coding standards

Numeric Encodings Prof. James L. Frankel Harvard University

ESE532: System-on-a-Chip Architecture. Today. Message. Fixed Point. Operator Sizes. Observe. Fixed Point

VLSI Signal Processing

Systems Programming and Computer Architecture ( )

Numerical computing. How computers store real numbers and the problems that result

Floating-point to Fixed-point Conversion. Digital Signal Processing Programs (Short Version for FPGA DSP)

Lecture 13: (Integer Multiplication and Division) FLOATING POINT NUMBERS

VERILOG 1: AN OVERVIEW

Arithmetic for Computers. Hwansoo Han

10.1. Unit 10. Signed Representation Systems Binary Arithmetic

Integers and Floating Point

Floating-point representations

Floating-point representations

Computer Organization and Levels of Abstraction

Inf2C - Computer Systems Lecture 2 Data Representation

CPE300: Digital System Architecture and Design

CS 33. Data Representation (Part 3) CS33 Intro to Computer Systems VIII 1 Copyright 2018 Thomas W. Doeppner. All rights reserved.

COMP2611: Computer Organization. Data Representation

CS 261 Fall Floating-Point Numbers. Mike Lam, Professor.

Perceptual Coding. Lossless vs. lossy compression Perceptual models Selecting info to eliminate Quantization and entropy encoding

5: Music Compression. Music Coding. Mark Handley

CS 261 Fall Floating-Point Numbers. Mike Lam, Professor.

Lecture 10: Floating Point, Digital Design

Implementation of IEEE754 Floating Point Multiplier

Floating Point Arithmetic

CS101 Introduction to computing Floating Point Numbers

The course that gives CMU its Zip! Floating Point Arithmetic Feb 17, 2000

Audio-coding standards

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Introduction to C. Why C? Difference between Python and C C compiler stages Basic syntax in C

2 General issues in multi-operand addition

CS429: Computer Organization and Architecture

Class 5. Data Representation and Introduction to Visualization

Integer Multiplication. Back to Arithmetic. Integer Multiplication. Example (Fig 4.25)

Floating Point Numbers

Number Systems. Binary Numbers. Appendix. Decimal notation represents numbers as powers of 10, for example

Lecture Information. Mod 01 Part 1: The Need for Compression. Why Digital Signal Coding? (1)

Divide: Paper & Pencil

Platform Selection Motivating Example and Case Study

Computer Organization and Levels of Abstraction

Lecture Information Multimedia Video Coding & Architectures

Introduction to Video Compression

Today: Floating Point. Floating Point. Fractional Binary Numbers. Fractional binary numbers. bi bi 1 b2 b1 b0 b 1 b 2 b 3 b j

MIPS Integer ALU Requirements

Transcription:

Lightweight Arithmetic for Mobile Multimedia Devices Tsuhan Chen Carnegie Mellon University tsuhan@cmu.edu Thanks to Fang Fang and Rob Rutenbar IEEE Transactions on Multimedia EDICS Signal Processing for Multimedia Applications Components and Technologies for Multimedia Systems Human Factor, Interface and Interaction Multimedia Databases and File Systems Multimedia Communication and Networking System Integration Applications Standards and Related Issues 1

Multimedia Applications on Mobile Devices Multimedia Processing More and more applications are ported from PCs to mobile devices Floating-point computational intensive Multimedia System Development Media designers use 32-64bit floats in C for algorithms ASIC designers use 10-20bit fixed-point units for hardware Serious design disconnect Multimedia Applications on Mobile Devices Multimedia Processing More and more applications are ported from PCs to mobile devices Floating-point computational intensive Multimedia System Development Media designers use 32-64bit floats in C for algorithms ASIC designers use 10-20bit fixed-point units in hardware Serious design disconnect 2

Fixed-Point vs. Floating-Point Fixed point Floating-point s Integer fraction s exp fraction Limited dynamic range and precision Sm, less power consumption From SW to HW: timeconsuming and error-prone Wide dynamic range & high precision Big, power intensive Easy translation from SW to HW How about make this lightweight? Don t use more than necessary. What Does Lightweight Mean s exp? fraction Lightweight === Less bits Actuy it s more than this. IEEE Standard FP Formats and ops for ordinary numbers Very sm nums:denormals Delicate rounding modes We can work on each dimension 3

Lightweight Floating-Point Arithmetic Lightweight FP arithmetic is a middle-ground solution Lightweight FP Fixed-Point Better numerical features than fixedpoint Less complicated than IEEE FP Acceptable energy consumption Easy to prototype algorithms with Easy to implement into hardware Floating-Point Software to Hardware Cycles Media algorithms Parameters Lightweight Arithmetic C class Chip hardware Lightweight FP Op Synthesizable Verilog 4

Design Flow Comparison Lightweight FP Design C FP Fixed-Point Design C FP Lightweight FP Fixed-point SW Lib SW simulation SW tuning Pass? Pass? HW Lib HW design HW design Design Flow Comparison Lightweight FP Design C FP Fixed-Point Design C FP Lightweight FP Fixed-point SW Lib SW simulation SW tuning Pass? Pass? HW Lib HW design HW design 5

IEEE Standard vs. Lightweight IP IEEE FP Standard 32 / 64 bits 8 / 11 bits exponent 23 / 52 bits mantissa 1 sign bit Lightweight Arithmetic IP Fewer bits Fewer bits of fraction less numerical precision Fewer bits of exponent less dynamic range Specs normal numbers as well as special values (infinity), edge cases (INF - INF), etc. Which of the special cases/numbers should be supported? IEEE Floats vs. CMUfloats CMUfloats IEEE Floats FP Formats and ops for ordinary numbers Very sm nums:denormals Delicate rounding modes Customizable format providing variable dynamic range and precision Fraction [1, 23], exponent width [1,8] On-off switch for denormalization Multiple choices for rounding mode Real-rounding / Jamming / Truncation 6

Rounding in CMUfloat We support not only IEEE rounding, but also two quick & dirty modes Allowed precision b2 b1 b0 Allowed precision b2 b1 b0 Allowed precision b2 b1 b0 IEEE Rounding (Real rounding) Truncation kill Jamming Simple logic, no adders Allowed precision Final, rounded result Allowed precision Allowed precision Achieves best results, but requires costly hardware What most ASIC hardware designers do, for efficiency Invented in 1940s, better than truncate, similar HW C CMUfloat library Supported operators Cmufloat double float int short = Cmufloat - * / == >=, > <=, <!= Cmufloat double float int short Other supported C features Pointer Cmufloat * a; Reference Cmufloat & a ; Array Cmufloat a[10][10] ; Argument passing func ( Cmufloat a ) I/O stream cout << a; 7

C Cmufloat Library Supported operators Cmufloat double float int short = Cmufloat - * / == >=, > <=, <!= Cmufloat double float int short Other supported C features Pointer Cmufloat * a; Reference Cmufloat & a ; Array Cmufloat a[10][10] ; Argument passing func ( Cmufloat a ) I/O stream cout << a; Software Library: Advantages Transparent mechanism to embed Cmufloat in the algorithm The over structure of the source code can be preserved Minimal effort in translating standard FP to lightweight FP Cmufloat <14,5> a = 0.5; // 14 bit fraction and 5 bit exponent Cmufloat <> b= 1.5; // Default Cmufloat is IEEE float Cmufloat <18,6> c[2]; // Define an array float fa; c[1] = a b; fa = a * b; // Assign the result to float c[2] = fa b; // Operation between float and Cmfloat 8

Software Library: Advantages (Cont.) Arithmetic operators are implemented by bit-level manipulation: more precise Our approach: Emulates the hardware implementation exactly Previous approach Add( b, c) { a = b c; a = round (a ); Built-in FP operator Round to limited bit-width } Summary: Features Supported Bit widths Variable from 2 bits (1 sign 1 exp 0 man) to 32 bits ( IEEE std) Rounding Use jamming (1.00011 rounds to 1.01) Experiments show jamming is nearly as good as full IEEE rounding, always superior to truncation, yet same complexity as truncation Denormalized numbers Not supported--our experiments on video/audio codecs suggest that denormal numbers do not improve the performance Exceptions Support only the exceptional values for infinity, zero and NAN Helps make the smer FP sizes more robust 9

sign0 sign1 mantissa0 swap 2 s >>n mantissa1 complement 2 s complement exponent0 exponent1 swap > ones ones zeros zeros ones ones zeros zeros sign-out sign_out logic bias Special case logic Special case logic >>1 >>1 <<n Leading zeros detection _ round nan zero infinity nan zero infinity over flow under flow nan zero infinity nan zero infinity over flow under flow mantissa out exponent out Hardware Library: ASIC Design Flow Verilog to layout flow Timing & area analysis Power analysis Verilog design Technology Library Standard Cells, Logical, physical, and timing views SIMULATION: ModelSim LOGIC: Synopsys DesignCompiler LAYOUT: Cadence Silicon Ensemble Synopsys DesignWare * Mux... Basic blocks: Integer arithmetic, multiplexors, etc POWER: Synopsys DesignPower Lightweight FP Adders/Multipliers Feature Supported Bit widths: Variable from 3 bits (1 sign 1 exp 1 frac) to 32 bits ( IEEE std) Rounding: Jamming / Truncation Addition Normalization Alignment Rounding Special case detection Design Issues Design method Subcomponent structures Core integer adder structure? Core shifter structure? Core integer multiplier structure? sign0 sign1 mantissa0 mantissa1 exponent0 exponent1 M ultiplication Normalization Rounding x sign_out - round mantissa out exponent out Special case detection 10

Floating Pt Adder Blue modules have large area and / or delay sign-out sign_out logic Addition Normalization sign0 sign1 Alignment Rounding mantissa0 mantissa1 swap >>n 2 s complement 2 s complement >>1 <<n Leading zeros detection _ round nan zero infinity mantissa out exponent0 exponent1 swap > ones _ nan zero infinity over flow exponent out ones zeros zeros Special case logic Special case detection under flow Floating Pt Multiplier Blue modules have large area and / or delay sign0 sign1 sign_out Multiplication Normalization Rounding mantissa0 mantissa1 x >>1 round nan zero infinity mantissa out exponent0 exponent1 ones ones zeros bias Special case logic over flow under flow nan zero infinity exponent out zeros Special case detection We see that the multiplier has less over-head than the adder 11

Design Examples: Adders 32-bit Floating-point 20-bit Fixed-pt 14-bit Floating-pt 32-bit FP 20-bit FIX 14-bit FP Area( um 2 ) - post layout 26634 4866 10096 Delay(ns) - post synthesys 48.95 2.44 25.77 Design Examples: Multipliers 32-bit Floating-point 20-bit Fixed-point 14-bit Floating-pt 32-bit FP 20-bit FIX 14-bit FP Area( um 2 ) - post layout 60713 40738 8851 Delay(ns) - post synthesis 24.14 22.82 15.89 12

Power Analysis IDCT in 32-bit IEEE FP 15-bit radix-16 lightweight FP Fixed-point implementation 12-bit accuracy for constants Widest bit-width is 24 in the whole algorithm (not fine tuned) Implementation Area(um 2 ) Delay(ns) Power(mw) IEEE FP 926810 111 1360 Lightweight FP 216236 46.75 143 Fixed-point 106598 36.11 110 Multimedia Encoding/Decoding Video camera Encoding of Media We can choose how accurately we wish to decode the data or Encoder 101110010... Playback of Media Decoder Uncompressed multimedia file Encoded bitstream 13

8-bit Video Codec H.261/263, MPEG-1/2/4, and even JPEG 8-bit DCT Q 9-bit Floating point Floating point 12-bit IQ IDCT Transmit 12-bit IQ IDCT Floating point IDCT requires floating point, and has an IEEE quality spec (1180-1990) that requires comparison against a 64-bit IEEE double implementation 9-bit 9-bit Motion Comp D D Motion Comp 8-bit Video Quality vs. Bit-width Use PSNR (Peak-Signal-to-Noise) to measure perceptual video quality Test video Proposed Codec CMUfloat Qi Pi Measure PSNR (Noise = Qi Pi ) CMUfloat can go very sm, ~14bits ( 5 exponent 8 fraction 1 sign bits = 14 total bits ) 39 37 PSNR (db) 35 Yellow pts show where PSNR decreases by 0.2dB from asymptotic value 5 7 9 11 13 5 7 9 11 13 15 17 19 21 23 2 3 15 17 19 21 33 31 Fraction-width (bit) 14

Rounding Modes Compare 3 rounding modes using IDCT video streams PSNR (db) Jamming is nearly as good as real rounding in precision, but as simple as truncation in hardware. 39 38 37 36 35 Comparison of Rounding Methods Real Jamming Truncation 34 33 6 7 8 9 10 11 12 13 14 15 16 17 Fraction-width(bit) Video Demo IEEE double vs. variable-precision CMUfloats Decoded with 14-bit lightweight IDCT Decoded with 64-bit double IDCT Decoded with 11-bit lightweight IDCT 15

Hardware Reduction Using Lightweight FP Comparison in Area/Delay/Power 32-bit IEEE FP IDCT / 14-bit lightweight FP IDCT with Jamming rounding / 20-bit fixed point IDCT 1.2 IEEE FP Lightweight FP Fixed-point 1 0.8 0.6 0.4 0.2 0 Area Delay Power Low-Resolution Display Media software commonly done in full precision (32 64 bits) Why do this if the display cannot handle it? On a portable video player: Full precision decode Display image on portable device Encoded bitstream Decoded Image This is rey inefficient Decoded Image on 2-bit B&W display Can t we do better than this, with smarter operators? 16

Low-Resolution Display (cont.) 0~255 Encoder Bitstream Decoder 0~255 Simplest (dumbest): just encode/decode as usual, let the display figure it out 0~255 0 255 Decoder quantizer Better: decode with just enough precision so a quantizer can retrieve right 2-bit pixel values 0~255 quantizer 0 255 Encoder Decoder quantizer 0 255 Best: Encode and decode with min precision needed so quantizer gets right pixels Results Simplest: needs ~20-bit lightweight floats to work Better: needs 16-bit lightweight floats; even just 11-bits looks decent Best: needs just 9-bit floats (4 fraction bits) to work just fine. Full Precision (64 bit) Video Demo Using 23 bits (IEEE 1180 passed) Using 11 bits (IEEE 1180 failed) 17

How About Audio? MPEG-1/2 Layer 3 (MP3).wav Analysis DCT Filter Bank (MDCT) MP3 Bitstream Filter Bank (IMDCT) Synthesis IDCT.wav Joint Stereo Coding Mux De Mux Joint Stereo Decoding Scale& Quantizer Scale& IQuantizer Perceptual Model Perceptual Model Use CMUfloat No standard tests for quality Audio Quality Need to rely on subjective testing on perceptual quality Mean Opinion Score (MOS) From 5 imperceptible difference to 1 rey annoying Results 8 subjects. 6-bit exponent and 3~7 bit fraction 5 4 MOS 3 2 Music 1 (Classic) Music 2 (Pop) 1 0 2 3 4 5 6 7 8 # of Fraction Bits 18

Conclusion Tradeoff between the lightweight FP and the fixed-point Design time Hardware cost Numerical performance Lighweight FP Fixed-point Ongoing Work : Automatic Design Flow Standard C FP algorithm main( ) { double x,y; x = 2*x y; } CMUfloat C class Exhaustive Bit-width search optimization for bit-width engine C lightweight FP algorithm with optimal bit-width main( ) {) { CMUfloatx,y; x = 2*x y; y; } FP arithmetic Verilog library Lightweight FP hardware design 19

Recap Accomplishments C lightweight FP arithmetic library Verilog lightweight FP arithmetic library Extensive experiments on video/audio/speech Is the lightweight FP solution universal? No, tradeoff between fixed-point solution and lightweight FP solution Ongoing work Automatic design flow Important for multimedia on low-power mobile devices Advanced Multimedia Processing Lab Please visit us at: http://amp.ece.cmu.edu 20