The BroadVoice Speech Coding Algorithm. Juin-Hwey (Raymond) Chen, Ph.D. Senior Technical Director Broadcom Corporation March 22, 2010

Similar documents
Presents 2006 IMTC Forum ITU-T T Workshop

Perceptual Pre-weighting and Post-inverse weighting for Speech Coding

ROBUST SPEECH CODING WITH EVS Anssi Rämö, Adriana Vasilache and Henri Toukomaa Nokia Techonologies, Tampere, Finland

the Audio Engineering Society. Convention Paper Presented at the 120th Convention 2006 May Paris, France

MULTIMODE TREE CODING OF SPEECH WITH PERCEPTUAL PRE-WEIGHTING AND POST-WEIGHTING

AN EFFICIENT TRANSCODING SCHEME FOR G.729 AND G SPEECH CODECS: INTEROPERABILITY OVER THE INTERNET. Received July 2010; revised October 2011

Digital Speech Coding

A MULTI-RATE SPEECH AND CHANNEL CODEC: A GSM AMR HALF-RATE CANDIDATE

The Steganography In Inactive Frames Of Voip

ON-LINE SIMULATION MODULES FOR TEACHING SPEECH AND AUDIO COMPRESSION TECHNIQUES

Opus, a free, high-quality speech and audio codec

AUDIO. Henning Schulzrinne Dept. of Computer Science Columbia University Spring 2015

The MPEG-4 General Audio Coder

14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006, copyright by EURASIP

Speech and audio coding

On Improving the Performance of an ACELP Speech Coder

Multi-Pulse Based Code Excited Linear Predictive Speech Coder with Fine Granularity Scalability for Tonal Language

Center for Multimedia Signal Processing

Speech-Coding Techniques. Chapter 3

Implementation of G.729E Speech Coding Algorithm based on TMS320VC5416 YANG Xiaojin 1, a, PAN Jinjin 2,b

SIGNAL COMPRESSION. 9. Lossy image compression: SPIHT and S+P

Source Coding Basics and Speech Coding. Yao Wang Polytechnic University, Brooklyn, NY11201

Real Time Implementation of TETRA Speech Codec on TMS320C54x

Abstract. 1. Introduction

Audio Fundamentals, Compression Techniques & Standards. Hamid R. Rabiee Mostafa Salehi, Fatemeh Dabiran, Hoda Ayatollahi Spring 2011

MPEG-4 General Audio Coding

New Results in Low Bit Rate Speech Coding and Bandwidth Extension

RTP implemented in Abacus

Multimedia Systems Speech II Mahdi Amiri February 2012 Sharif University of Technology

2.4 Audio Compression

Perspectives on Multimedia Quality Prediction Methodologies for Advanced Mobile and IP-based Telephony

5: Music Compression. Music Coding. Mark Handley

Mahdi Amiri. February Sharif University of Technology

1.1 February B

On the Importance of a VoIP Packet

HD Voice and Wideband Codecs (HD-02) Panel Discussion (ITEXPO West 2009) September 02, 2009 Los Angeles, CA

Multimedia Systems Speech II Hmid R. Rabiee Mahdi Amiri February 2015 Sharif University of Technology

Audio-coding standards

Configure Core Settings for Device Pools

Lecture 7: Audio Compression & Coding

Application of wavelet filtering to image compression

Adaptive Quantization for Video Compression in Frequency Domain

A CUSTOM VLSI ARCHITECTURE FOR IMPLEMENTING LOW-DELAY ANALY SIS-BY-SYNTHESIS SPEECH CODING ALGORITHMS

End-to-end speech and audio quality evaluation of networks using AQuA - competitive alternative for PESQ (P.862) Endre Domiczi Sevana Oy

Synopsis of Basic VoIP Concepts

REAL-TIME DIGITAL SIGNAL PROCESSING

CODING METHOD FOR EMBEDDING AUDIO IN VIDEO STREAM. Harri Sorokin, Jari Koivusaari, Moncef Gabbouj, and Jarmo Takala

DRA AUDIO CODING STANDARD

Audio Coding and MP3

Design of a CELP Speech Coder and Study of Complexity vs Quality Trade-offs for Different Codebooks.

CT516 Advanced Digital Communications Lecture 7: Speech Encoder

SAOC and USAC. Spatial Audio Object Coding / Unified Speech and Audio Coding. Lecture Audio Coding WS 2013/14. Dr.-Ing.

Missing Frame Recovery Method for G Based on Neural Networks

Opus Generated by Doxygen Thu May :22:05

Optical Storage Technology. MPEG Data Compression

MPEG-4 ALS International Standard for Lossless Audio Coding

Ch. 5: Audio Compression Multimedia Systems

VoIP Forgery Detection

Date. Next Generation in Speech Quality ETSI STQ Workshop, Nov 2012 Dr. Imre Varga Qualcomm Inc.

6MPEG-4 audio coding tools

MOS x and Voice Outage Rate in Wireless

Preface Preliminaries. Introduction to VoIP Networks. Public Switched Telephone Network (PSTN) Switching Routing Connection hierarchy Telephone

Audio-coding standards

THE OPTIMIZATION AND REAL-TIME IMPLEMENTATION OF

Lecture 5: Error Resilience & Scalability

Audiovisual QoS for communication over IP networks

Zhitao Lu and William A. Pearlman. Rensselaer Polytechnic Institute. Abstract

Real-time Audio Quality Evaluation for Adaptive Multimedia Protocols

EE482: Digital Signal Processing Applications

Nokia Q. Xie Motorola April 2007

Distributed Signal Processing for Binaural Hearing Aids

The Sensitivity Matrix

SM40: Measuring Maturity and Preparedness

A Synchronization Scheme for Hiding Information in Encoded Bitstream of Inactive Speech Signal

S.K.R Engineering College, Chennai, India. 1 2

Voice Quality Assessment for Mobile to SIP Call over Live 3G Network

Predicting the Quality Level of a VoIP Communication through Intelligent Learning Techniques

Deployment note. Products for conferencing platform developers Product deployment note

GSM Network and Services

ELL 788 Computational Perception & Cognition July November 2015

Project Name SmartPSS

Scalable Perceptual and Lossless Audio Coding based on MPEG-4 AAC

Determination of Bit-Rate Adaptation Thresholds for the Opus Codec for VoIP Services

MULTIMEDIA COMMUNICATION

Avaya IP Softphone R6 Feature Matrix

Designing Software for Mobile VoIP and Video. Fred Wydler VP VoIP Products SPIRIT DSP

Network Working Group Request for Comments: 4060 Category: Standards Track May 2005

Configure Core Settings for Device Pools

Hik-Connect Mobile Client

Adaptive data hiding based on VQ compressed images

e2020 ereader Student s Guide

Module 8: Video Coding Basics Lecture 42: Sub-band coding, Second generation coding, 3D coding. The Lecture Contains: Performance Measures

Scalable Coding of Image Collections with Embedded Descriptors

Appendix 4. Audio coding algorithms

Net: EUR Gross: EUR

MPEG-4 Version 2 Audio Workshop: HILN - Parametric Audio Coding

INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND AUDIO

VOICE OVER INTERNET PROTOCOL (VOIP)

25*$1,6$7,21,17(51$7,21$/('(1250$/,6$7,21,62,(&-7&6&:* &2',1*2)029,1*3,&785(6$1'$8',2 &DOOIRU3URSRVDOVIRU1HZ7RROVIRU$XGLR&RGLQJ

Data Compression. Audio compression

Transcription:

The BroadVoice Speech Coding Algorithm Juin-Hwey (Raymond) Chen, Ph.D. Senior Technical Director Broadcom Corporation March 22, 2010

Outline 1. Introduction 2. Basic Codec Structures 3. Short-Term Prediction / Noise Spectral Shaping 4. Long-Term Prediction / Noise Spectral Shaping 5. Gain Quantization 6. Excitation Vector Quantization 7. Bit Allocation 8. Postfiltering and Packet Loss Concealment 9. Complexity 10. Performance 11. Conclusion 2

Introduction BroadVoice16 (BV16): 16 kb/s narrowband speech codec with 8 khz sampling Selected by CableLabs in 2004 as a standard codec in PacketCable 1.5 for Voice over Cable applications; later also became a standard codec in PacketCable 2.0 Standardized by SCTE and ANSI in 2006 as ANSI/SCTE 24-21 2006 standard One of the standard codecs listed in the ITU-T Recommendation J.161 BroadVoice32 (BV32): 32 kb/s wideband speech codec with 16 khz sampling Standard codecs in PacketCable 2.0, ANSI/SCTE 24-23 2007, and ITU-T Recommendation J.361 BV16 and BV32 are: based on Two-Stage Noise Feedback Coding (TSNFC) optimized for low delay, low complexity, and high speech quality Royalty-free and open source (both floating-point and fixed-point C) Visit http://www.broadcom.com/broadvoice for info & code download 3

BV16 Encoder Structure LSPI Short-term predictive analysis & quantization s(n) Long-term predictive analysis & quantization e(n) PPI PPTI Bit multiplexer GI CI Output bit stream Input signal Highpass pre-filter v(n) u(n) Prediction uq(n) dq(n) residual - quantizer - ltnf(n) Long-term noise feedback filter - q(n) ppv(n) - Longterm predictor sq(n) Shortterm predictor stnf(n) Short-term noise feedback filter qs(n) BV16 uses TSNFC Form 3 structure in our ICASSP 2006 paper 4

BV32 Encoder Structure LSPI Short-term predictive analysis & quantization Long-term predictive analysis & quantization e(n) PPI PPTI Bit multiplexer GI CI Output bit stream Input signal s(n) Preemphasis filter Highpass pre-filter Shortterm predictor - d(n) v(n) - ltnf(n) u(n) Prediction residual quantizer Long-term noise feedback filter - uq(n) q(n) ppv(n) - Longterm predictor dq(n) stnf(n) Short-term noise feedback filter qs(n) BV32 uses TSNFC Form 2 structure in our ICASSP 2006 paper 5

BV16/BV32 Decoder Structure Similar to a CELP decoder BV32 uses a de-emphasis filter but not a postfilter BV16 does not use a de-emphasis filter but may add a postfilter 6

Short-Term Prediction Use 8 th -order short-term prediction to keep complexity low LSP quantized using 8 th -order MA prediction and two-stage VQ: 1 st -stage: 8-dimensional VQ with 7-bit codebook 2 nd -stage: BV16 uses 8-dimensional VQ with 1-bit sign and 6-bit shape BV32 uses split VQ with 3-5 split and 5 bits each BroadVoice might be used in non-voip applications with bit errors Desirable to make it robust to bit errors Only codevectors that preserve the order of first 3 LSPs are allowed in the 2 nd -stage VQ codebook search order reversal at decoder indicates bit errors last LSP vector used greatly reduces distortion due to bit errors without sending redundant information essentially no degradation to clear-channel quality 7

Short-Term Noise Spectral Shaping TSNFC Form 2 structure of BV32 has a lower complexity but gives a more constrained noise spectral shape of N BV 32( z) = A ~ ( z / γ ) ~ A( z) TSNFC Form 3 structure of BV16 has a higher complexity but gives a more general noise spectral shape of N BV ( z) = 16 A( z / γ ) A( z / γ ) ~ A ( z ) uses quantized coefficients while A(z) uses unquantized ones γ = 0.75 for BV32; γ 1 = 0. 5 and γ 2 = 0. 85 for BV16 1 2 8

Long-Term Prediction and Noise Spectral Shaping Long-Term Prediction: 3-tap pitch predictor with integer pitch period pitch period encoded to 7 bits for BV16 and 8 bits for BV32 pitch period range: 10 to 136 for BV16 and 10 to 264 for BV32 3 pitch predictor taps vector quantized to 5 bits pitch period and pitch taps determined in open-loop fashion to save complexity Long-Term Noise Spectral Shaping: To keep the complexity low, the noise feedback filter has a simple form of pp Fl ( z) = Nl ( z) 1 = λ z λ is half of optimal single-tap pitch predictor coefficient, range-limited to [0, 1] pp The corresponding noise spectral shape is given by Nl( z) = 1 λ z Magnitude of the Frequency Response Example: 5 Magnitude (db) 0-5 -10 0 500 1000 1500 2000 2500 3000 3500 4000 Frequency 9

Gain Quantization Excitation gain derived and quantized in open-loop to save complexity 1 gain/frame for BV16, and 2 gains/frame for BV32 Gain: base-2 logarithm of average power of open-loop prediction residual Fixed moving-average (MA) prediction of gain using 40 ms worth of previous data: 8 th -order MA predictor for BV16 16 th -order MA predictor for BV32 Scalar quantization of MA prediction residual of log-gain: 4 bits for BV16 5 bits for BV32 10

Gain Change Limitation Problem: Bit errors can cause large gain pops in decoded speech Solution: Limit the maximum gain increase allowed, conditioned on the previous log-gain and previous log-gain change Train a constraint threshold matrix off-line: Row: log-gain relative to a long-term average log-gain Column: log-gain change between adjacent gains Matrix element values: 99.x percentile of observed log-gain change in natural speech In gain encoding, if quantized gain gives a log-gain change > threshold, reduce the quantized gain until < threshold, or until the smallest gain in gain codebook In gain decoding, if the gain code is not for the smallest gain in gain codebook and the decoded gain gives a log-gain change > threshold, then the gain is corrupted by bit errors replace with the last decoded gain value Result: All severe gain pops eliminated, no redundant bit needed, and clear-channel performance hardly affected 11

Excitation Vector Quantization Excitation VQ dimension = 4 BV16: 1-bit sign, 4-bit shape, (14)/4 = 1.25 bits/sample BV32: 1-bit sign, 5-bit shape, (15)/4 = 1.5 bits/sample VQ codebook closed-loop trained Analysis-by-synthesis codebook search: concept: pass all codevectors through TSNFC structure, pick the one that gives minimum energy of quantization error Efficient VQ codebook search: treat TSNFC structure as a linear system with VQ codevector as input and quantization error vector as output decompose quantization error vector into Zero-Input Response (ZIR) and Zero- State Response (ZSR) see our ICASSP 2006 paper further complexity reduction see our Interspeech 2006 paper 12

Bit Allocation Parameter BV16 BV32 LSP 77=14 7(55)=17 Pitch period 7 8 3 pitch taps 5 5 Excitation gain(s) 4 55=10 Excitation vectors (14) 10=50 (15) 20=120 Total per frame 80 bits/40 samples 160 bits/80 samples 13

Postfiltering (PF) and Packet Loss Concealment (PLC) BV16 and BV32 are not bit-exact standards PF and PLC are both post-processing steps after decoding PF and PLC do not affect bit-stream compatibility PF and PLC are not really part of the BV16/BV32 standards BV16 specification gives an example PF BV16/BV32 specifications each gives an example PLC Other PF and PLC schemes can be used without affecting interoperability with the BV16/BV32 standards 14

Complexity Comparison with Other CELP-Based Standard Codecs* Codec MIPS RAM (kwords) ROM (kwords) Total Memory Footprint Algorithmic Delay (ms) G.728 36 2.2 6.7 9 0.625 G.729 22 2.6 14 17 15 G.729E 27 2.6 20 23 15 G.723.1 19 2.1 20 22 37.5 EVRC 25 2.5?? 30 AMR 20 4.6 17 22 25 BV16 12 2 11 13 5 G.722.2 40 5.3 18 23 26.875 VMR-WB 40 9.05?? 33.75 G.729.1 40 8.7 40.5 49 48.9375 BV32 17 3 10 13 5 * Most data extracted from PacketCable 2.0 spec audio codec comparison table 15

PESQ Narrowband Speech Quality Measured by PESQ Using 13 Languages 4.4 4.3 4.2 4.1 4.0 3.9 3.8 3.7 3.6 3.5 3.4 Arabic Chinese English French German Hindi Japanese Korean Portuguese Russian Spanish Swedish Thai G.711 u-law at 64 kb/s G.726 at 32 kb/s G.728 at 16 kb/s BV16 at 16 kb/s ilbc 20 ms at 15.2 kb/s ilbc 30 ms at 13.3 kb/s G.729E at 11.8 kb/s G.729 at 8 kb/s G.723.1 at 6.3 kb/s G.723.1 at 5.3 kb/s All 96 sentence pairs of 13 languages in NTT 1994 database were used BV16 was rated higher than all other codecs here except 64 kb/s G.711 16

4.3 Wideband Speech Quality Measured by Wideband PESQ Using 13 Languages Wideband PESQ 4.1 3.9 3.7 3.5 3.3 G.711 u-law at 128 kb/s G.722 at 64 kb/s G.722 at 56 kb/s G.722 at 48 kb/s G.722.1 at 32 kb/s G.722.1 at 24 kb/s BV32 at 32 kb/s G.722.2 at 23.85 kb/s 3.1 2.9 Arabic Chinese English French German Hindi Japanese Korean Portuguese Russian Spanish Swedish Thai G.722.2 at 15.85 kb/s G.722.2 at 8.85 kb/s All 96 sentence pairs of 13 languages in NTT 1994 database were used BV32 was rated higher than all other codecs listed here 17

Narrowband Listening Test Results 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 0.5 0.0 Nominal level, 0% PLR 2 tandems -10 db level 10 db level 1% PLR 3% PLR MOS BV16 at 16 kb/s G.728 at 16 kb/s G.729 at 8 kb/s G.711 at 64 kb/s G.726 at 32 kb/s 18

Wideband Listening Test Results 4.5 4.0 3.5 3.0 2.5 2.0 1.5 BV32 at 32 kb/s G.722 at 64 kb/s G.722 at 56 kb/s G.722 at 48 kbt/s 1.0 0.5 0.0 19 MOS Nominal Level, 0% PLR 2 tandem -10 db level 10 db level 0.01% BER 1% PLR 3% PLR

BroadVoice Subjective Speech Quality Relative to Reference Codecs Dynastat did narrowband MOS test; Comsat Labs did wideband test 32 naïve listeners in each test BV16 rated statistically better than G.728, G.729, and G.726 at 32 kb/s BV32 rated statistically better than G.722 at 64 kb/s BV16/BV32 give 0.5 MOS degradation at about 5% random packet loss, versus 2% to 3% for most other standard speech codecs Narrowband Codec MOS Wideband Codec MOS G.711 µ-law 3.91 BV32 4.11 BV16 3.76 G.722 at 64 kb/s 3.96 G.729 3.56 G.722 at 56 kb/s 3.88 G.726 at 32 kb/s 3.56 G.722 at 48 kb/s 3.60 G.728 3.54 20

Conclusion BroadVoice16 and BroadVoice32 are based on novel Two-Stage Noise Feedback Coding with following design emphases: Low delay: 3x to 8X lower algorithmic delay than most competing codecs Low complexity: 2X to 3X lower MIPS, 1.3X to 3.8X lower memory footprint High speech quality: BV16 statistically better than toll-quality codecs G.726 at 32 kb/s, G.728, G.729 BV32 statistically better than G.722 at 64 kb/s Slower degradation with increasing packet loss rate than most other codecs BV16 and BV32 are standard speech codecs of PacketCable 1.5/2.0, ANSI, SCTE, and ITU-T J.161/J.361 for VoIP over Cable applications BV16 and BV32 are royalty-free and open source BV16 and BV32 can potentially be a base layer codec of IETF Internet Interactive Audio Codec benefit: can make IIAC inter-operable with existing ANSI/SCTE BV16/BV32 standards 21