IMPLEMENTATION OF A FAST MPEG-2 COMPLIANT HUFFMAN DECODER

Mikael Karlsson Rudberg (mikaelr@isy.liu.se) and Lars Wanhammar (larsw@isy.liu.se)
Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
Tel: +46 13 284059; fax: +46 13 139282

ABSTRACT

In this paper a 100 Mbit/s Huffman decoder implementation is presented. It uses a novel approach in which parallel decoding of data is combined with a serial input. The critical path is thereby reduced and a significant increase in throughput is achieved. The decoder is aimed at the MPEG-2 Video decoding standard and has been designed to meet the performance that standard requires.

1. INTRODUCTION

Huffman coding is a lossless compression technique often used in combination with lossy compression methods, for instance in digital video and audio applications. The method uses code words of different lengths, where symbols with high probability are assigned shorter codes than symbols with lower probability. Because the coded symbols have unequal lengths, the boundaries between them cannot be known without first decoding them, which makes the decoding process difficult to parallelize. This becomes a problem for compressed video data, where high data rates are necessary.

The architecture of the Huffman decoder presented in this paper is based on a novel hardware structure [1] that allows high-speed decoding. The decoder can handle all Huffman tables required for decoding MPEG-2 Video at the Main Profile, Main Level resolutions [2]. The design is completely MPEG-2 adapted, with automatic handling of the MPEG-2 specific escape and end of block codes. In total our decoder supports 11 code tables with more than 600 different code words. Since the code books are static in the MPEG-2 standard, the Huffman decoder has been optimized for these specific codes. A decoding rate of 100 Mbit/s is required and is also achieved in our implementation.

2. HUFFMAN DECODER

Huffman decoding can be performed in numerous ways.
One common principle is to decode the incoming bit stream in parallel [3, 4]. The simplified decoding process is:

1. Feed a symbol decoder and a length decoder with M bits, where M is the length of the longest code word.
2. The symbol decoder maps the input vector to the corresponding symbol. At the same time, the length decoder finds the length of the code word in the input vector.
3. The information from the length decoder is used in the input register to fill up the register again (with between one and M bits, figure 1).

The problem with this solution is the long critical path through the length decoder to the register that shifts in new data (figure 1).

[Figure 1. A constant output rate Huffman decoder. Diagram labels: input, shifting, M bit/cycle, critical path, output.]
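The three steps above can be illustrated with a small software model. This is only a sketch under stated assumptions: the code table is a toy prefix code and not one of the MPEG-2 tables, and the names `CODE_TABLE`, `decode_window` and `parallel_decode` are hypothetical. It shows why the shift amount for the next cycle depends on the length found in the current cycle, which is the critical path mentioned above.

```python
# Toy prefix code, NOT an MPEG-2 table (assumption for illustration only).
CODE_TABLE = {"0": "a", "10": "b", "110": "c", "111": "d"}
M = max(len(code) for code in CODE_TABLE)  # length of the longest code word


def decode_window(window):
    """Model of the symbol and length decoders: both see the same M-bit window."""
    for code, symbol in CODE_TABLE.items():
        if window.startswith(code):
            return symbol, len(code)
    raise ValueError(f"no code word matches window {window!r}")


def parallel_decode(bits):
    """Decode one symbol per cycle; the input consumes between 1 and M bits per cycle."""
    symbols = []
    pos = 0
    while pos < len(bits):
        # Step 1: present M bits (zero-padded at the end of the stream).
        window = bits[pos:pos + M].ljust(M, "0")
        # Step 2: symbol and length are found at the same time.
        symbol, length = decode_window(window)
        # Guard: a matched code word must not extend into the padding.
        if pos + length > len(bits):
            raise ValueError("truncated code word at end of stream")
        symbols.append(symbol)
        # Step 3: the decoded length decides how far the input is shifted.
        pos += length
    return symbols
```

In this model every loop iteration corresponds to one clock cycle, so the output rate is constant (one symbol per cycle) while the input rate varies with the code length.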

In our decoder the shifting is realized with a shift register that continuously shifts new data into the decoder (figure 2). The symbol decoder and length decoder are supplied from registers that are loaded every time a new code word is present at the input. The decoding process is:

1. Load the input registers of the symbol decoder and length decoder.
2. If the coded data has a length of one, go back to point 1.
3. If the coded data has a length of two, go back to point 1.

and so on for code lengths of three, four and up to M.

[Figure 2. Huffman decoder with relaxed evaluation time. Diagram labels: =M?, select, force, input, M bits, M-1 bits, >1, shift, =2?, =1?, output.]

This structure allows longer evaluation times for longer code words. The delay in the critical path is reduced to the time it takes to evaluate the length of code words that are one or two bits long. Codes of other lengths may be evaluated over several cycles, i.e. code words of length three must be evaluated in two cycles, and so on. Comparing this algorithm with the previous one, we note the following:

- The input rate of the new structure is constant, while the original has a variable input rate.
- The new structure evaluates short code words in a few cycles but requires more cycles for longer words. The original structure has a constant evaluation time for all code words.
- The new structure allows a higher clock rate since the critical path is reduced. This also means that the symbol decoder must be faster, since in the worst case it receives new data every clock cycle.
- The new structure has a variable output rate, while the original has a constant output rate.
- The new structure requires a higher clock rate to perform the same amount of work. However, if the average code length is short enough, the new structure will be faster thanks to the significantly higher clock rates that can be achieved.

Normally the shorter code words dominate in Huffman coded data, and therefore the new decoder is faster under normal circumstances.
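The trade-off in the comparison above can be made concrete with a simplified cycle-count model. This is a sketch and not the hardware described in the paper: it assumes that one bit is shifted in per clock cycle, so that a code word of length k occupies k cycles, and it uses a toy prefix code rather than an MPEG-2 table. The names `serial_decode` and `serial_is_faster` are hypothetical.

```python
# Toy prefix code, NOT an MPEG-2 table (assumption for illustration only).
CODE_TABLE = {"0": "a", "10": "b", "110": "c", "111": "d"}


def serial_decode(bits):
    """Shift in one bit per clock cycle and emit a symbol when a code word is complete.

    Returns the decoded symbols and the number of clock cycles used.
    """
    symbols = []
    cycles = 0
    current = ""  # bits of the code word shifted in so far
    for bit in bits:
        cycles += 1  # constant input rate: one bit per cycle
        current += bit
        if current in CODE_TABLE:  # the length test for len(current) succeeds
            symbols.append(CODE_TABLE[current])
            current = ""  # back to point 1: start on the next code word
    if current:
        raise ValueError("truncated code word at end of stream")
    return symbols, cycles


def serial_is_faster(avg_code_length, f_serial_hz, f_parallel_hz):
    """Compare symbol rates of the two structures.

    The serial structure needs about avg_code_length cycles per symbol,
    the parallel structure one cycle per symbol.
    """
    return f_serial_hz / avg_code_length > f_parallel_hz
```

Under this model the serial structure wins whenever the ratio of the two achievable clock rates exceeds the average code length, which is the condition stated in the last item of the comparison.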
2.1 Handling of special markers

Special markers are placed in the data stream to indicate, for example, end of block (eob) at the end of a coded block of data. After this marker other types of data, such as uncompressed stream information, follow (figure 3). It is essential that the Huffman decoder detects this marker so that it can stop decoding and let other units process the data that follows. This decoder detects the eob marker in the symbol decoder and halts the decoding process until a new start signal is applied.

[Figure 3. Markers in the MPEG-2 stream requiring special decoding. Stream layouts shown: Huffman codes, eob, header data, Huffman codes; and Huffman codes, mb_escape, fixed-length data, Huffman codes.]

The mb_escape marker is also important. The data that follows it has a fixed length. This marker is also detected in the symbol decoder, and it causes the following data to be passed through the decoder unchanged (figure 3).

3. IMPLEMENTATION

The MPEG-2 standard requires that the input data be decoded at a rate of about 100 Mbit/s.

During the implementation, special care had to be taken in the partitioning of the decoder, and a few critical paths had to be optimized manually. A few modifications of the new decoding algorithm had to be made to achieve the targeted performance.

3.1 Improvements of the length decoder

The length decoder turned out to be too slow when evaluating codes with lengths of one or two bits ('length = 1?' and 'length = 2?' in figure 2). These paths had to be broken up. How this was done is shown in figure 4. The 'length = 1?' signal is evaluated by taking data one step earlier from the shift register (i.e. from position i-1 instead of i) and adding a flip-flop after the evaluation. For the 'length = 2?' signal the register was moved to after the evaluation.

[Figure 4. Optimization of critical paths in the length decoder, shown before and after optimization. Diagram labels: i-2, i-1, i, =2?, =1?.]

Note that this way of breaking up the critical loops can be generalized to remove all critical loops in this structure, see [1].

3.2 Symbol decoder

The symbol decoding task is more complicated than the length decoding. The symbol decoder could not be designed to receive data at 100 Mbit/s. Code words with a length of one bit are rare in MPEG coded data, and the most frequently used code tables contain only codes with more than two bits. Therefore the symbol decoder is fed with data no more often than every second clock cycle. The input shift register is halted for one clock period every time a one-bit code is found, and hence the symbol decoder only needs to process 50 Mbit/s, without a significant loss in performance. However, this modification causes the input rate to vary.

The symbol decoder was split into five separate units, each of which takes care of its own part of the code tables (figure 5). Every unit consists of an input register that holds the data and a combinatorial block that maps the input vector to the symbol. The output of one of the five units is chosen and passed to the output. Data is in two's complement format after the mb_escape marker, while other data is decoded to signed magnitude format. A post-processing stage converts the two's complement data to signed magnitude representation.

[Figure 5. Realization of the symbol decoder. Diagram labels: unit 1 to unit 5, current table, post processing, >1, symbol present.]

3.3 Interface

The interface of the Huffman decoder consists of an eight-bit parallel input port for coded data. A signal indicates when a new input vector can be applied. The decoded data is delivered at a maximum of 50 Msymbols/s. The 'symbol present' signal (figure 5) indicates when data is valid at the output.

The shift register at the input of the decoder (figure 2) can be read and controlled externally. This is necessary since the Huffman coded data is interleaved with other information.

3.4 Synthesis

The decoder has been described in VHDL and then transformed to a circuit using synthesis tools, mapping to a 0.8 µm CMOS standard cell library. Some post-processing had to be done after the synthesis step to achieve the necessary performance. The main problem was to get the symbol decoder to work fast enough; therefore the symbol decoding has been split into five separate units. The core area is about 8.4 mm² and the total area is 14.5 mm² (3.9 x 3.75 mm). About two thirds of the area is occupied by the symbol decoder (figure 6). The power supply is 5 V and the transistor count is 26900.

[Figure 6. Layout of the Huffman decoder. Labelled blocks: Shift Register, Control Unit, Symbol Decoder, Length Decoder, Clock Buffer.]

3.5 Symbol tables

To get all tables correct, the VHDL code for the symbol decoding as well as for the length decoding has been generated from a thoroughly verified template file. However, this method also yielded a sub-optimal decoder. It could be optimized further, but with an increased probability of introducing design errors in the code tables. This was not done in this implementation because of lack of time and because a functionally correct implementation was considered more important.

4. CONCLUSIONS

In this paper an implementation of a novel Huffman decoder architecture has been presented. We have shown that the new structure can be used for fast Huffman decoding while still keeping a simple architecture. The throughput has been increased by using a serial input combined with a serial/parallel evaluation. Since the current implementation uses standard cells, it is reasonable to believe that a full custom version of the same circuit can reach significantly higher speed.

5. REFERENCES

[1] M. K. Rudberg and L. Wanhammar, "New Approaches to High Speed Huffman Decoding," IEEE Proc. ISCAS '96, May 1996.
[2] ISO/IEC IS 13818-2, Generic coding of moving pictures and associated audio information, Part 2: Video (MPEG-2), June 1994.
[3] S. F. Chang and D. G. Messerschmitt, "Designing High-Throughput VLC Decoder Part I: Concurrent VLSI Architectures," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 187-196, June 1992.
[4] H. D. Lin and D. G. Messerschmitt, "Designing High-Throughput VLC Decoder Part II: Parallel Decoding Methods," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 197-206, June 1992.