Speech and audio coding

Similar documents
Audio Coding and MP3

Principles of Audio Coding

The MPEG-4 General Audio Coder

Audio-coding standards

2.4 Audio Compression

MPEG-4 General Audio Coding

Audio-coding standards

Audio Fundamentals, Compression Techniques & Standards. Hamid R. Rabiee Mostafa Salehi, Fatemeh Dabiran, Hoda Ayatollahi Spring 2011

Optical Storage Technology. MPEG Data Compression

5: Music Compression. Music Coding. Mark Handley

Perceptual coding. A psychoacoustic model is used to identify those signals that are influenced by both these effects.

Audio Compression. Audio Compression. Absolute Threshold. CD quality audio:

Lecture 7: Audio Compression & Coding

Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal.

Data Compression. Audio compression

Mahdi Amiri. February Sharif University of Technology

Multimedia Systems Speech II Hmid R. Rabiee Mahdi Amiri February 2015 Sharif University of Technology

Digital Speech Coding

Appendix 4. Audio coding algorithms

SAOC and USAC. Spatial Audio Object Coding / Unified Speech and Audio Coding. Lecture Audio Coding WS 2013/14. Dr.-Ing.

CT516 Advanced Digital Communications Lecture 7: Speech Encoder

Mpeg 1 layer 3 (mp3) general overview

Multimedia Communications. Audio coding

ON-LINE SIMULATION MODULES FOR TEACHING SPEECH AND AUDIO COMPRESSION TECHNIQUES

Parametric Coding of High-Quality Audio

MPEG-4 Version 2 Audio Workshop: HILN - Parametric Audio Coding

Perceptual Pre-weighting and Post-inverse weighting for Speech Coding

EE482: Digital Signal Processing Applications

Source Coding Basics and Speech Coding. Yao Wang Polytechnic University, Brooklyn, NY11201

Audio and video compression

Ch. 5: Audio Compression Multimedia Systems

Compressed Audio Demystified by Hendrik Gideonse and Connor Smith. All Rights Reserved.

New Results in Low Bit Rate Speech Coding and Bandwidth Extension

Perceptual Coding. Lossless vs. lossy compression Perceptual models Selecting info to eliminate Quantization and entropy encoding

Multimedia Systems Speech II Mahdi Amiri February 2012 Sharif University of Technology

Application of wavelet filtering to image compression

Speech-Coding Techniques. Chapter 3

Audio Coding Standards

The Sensitivity Matrix

AUDIO. Henning Schulzrinne Dept. of Computer Science Columbia University Spring 2015

the Audio Engineering Society. Convention Paper Presented at the 120th Convention 2006 May Paris, France

MULTIMODE TREE CODING OF SPEECH WITH PERCEPTUAL PRE-WEIGHTING AND POST-WEIGHTING

Chapter 4: Audio Coding

Lecture 16 Perceptual Audio Coding

Introducing Audio Signal Processing & Audio Coding. Dr Michael Mason Snr Staff Eng., Team Lead (Applied Research) Dolby Australia Pty Ltd

Chapter 14 MPEG Audio Compression

Audio Engineering Society. Convention Paper. Presented at the 126th Convention 2009 May 7 10 Munich, Germany

ITNP80: Multimedia! Sound-II!

Parametric Coding of Spatial Audio

On Improving the Performance of an ACELP Speech Coder

AN EFFICIENT TRANSCODING SCHEME FOR G.729 AND G SPEECH CODECS: INTEROPERABILITY OVER THE INTERNET. Received July 2010; revised October 2011

ROW.mp3. Colin Raffel, Jieun Oh, Isaac Wang Music 422 Final Project 3/12/2010

An adaptive wavelet-based approach for perceptual low bit rate audio coding attending to entropy-type criteria

Using Noise Substitution for Backwards-Compatible Audio Codec Improvement

Squeeze Play: The State of Ady0 Cmprshn. Scott Selfon Senior Development Lead Xbox Advanced Technology Group Microsoft

ROBUST SPEECH CODING WITH EVS Anssi Rämö, Adriana Vasilache and Henri Toukomaa Nokia Techonologies, Tampere, Finland

Module 9 AUDIO CODING. Version 2 ECE IIT, Kharagpur

Introducing Audio Signal Processing & Audio Coding. Dr Michael Mason Senior Manager, CE Technology Dolby Australia Pty Ltd

Fundamentals of Perceptual Audio Encoding. Craig Lewiston HST.723 Lab II 3/23/06

CISC 7610 Lecture 3 Multimedia data and data formats

ijdsp Interactive Illustrations of Speech/Audio Processing Concepts

Design of a CELP Speech Coder and Study of Complexity vs Quality Trade-offs for Different Codebooks.

What is multimedia? Multimedia. Continuous media. Most common media types. Continuous media processing. Interactivity. What is multimedia?

Principles of MPEG audio compression

DRA AUDIO CODING STANDARD

KINGS COLLEGE OF ENGINEERING DEPARTMENT OF INFORMATION TECHNOLOGY ACADEMIC YEAR / ODD SEMESTER QUESTION BANK

Filterbanks and transforms

Perceptual Audio Coders What to listen for: Artifacts of Parametric Coding

Technical PapER. between speech and audio coding. Fraunhofer Institute for Integrated Circuits IIS

Distributed Signal Processing for Binaural Hearing Aids

Multimedia. What is multimedia? Media types. Interchange formats. + Text +Graphics +Audio +Image +Video. Petri Vuorimaa 1

Contents. 3 Vector Quantization The VQ Advantage Formulation Optimality Conditions... 48

GSM Network and Services

ELL 788 Computational Perception & Cognition July November 2015

DigiPoints Volume 1. Student Workbook. Module 8 Digital Compression

Bit or Noise Allocation

DAB. Digital Audio Broadcasting

MPEG-1. Overview of MPEG-1 1 Standard. Introduction to perceptual and entropy codings

The Effect of Bit-Errors on Compressed Speech, Music and Images

A Generic Audio Classification and Segmentation Approach for Multimedia Indexing and Retrieval

Extraction and Representation of Features, Spring Lecture 4: Speech and Audio: Basics and Resources. Zheng-Hua Tan

/ / _ / _ / _ / / / / /_/ _/_/ _/_/ _/_/ _\ / All-American-Advanced-Audio-Codec

MATLAB Apps for Teaching Digital Speech Processing

The Steganography In Inactive Frames Of Voip

A NEW DCT-BASED WATERMARKING METHOD FOR COPYRIGHT PROTECTION OF DIGITAL AUDIO

Implementation of G.729E Speech Coding Algorithm based on TMS320VC5416 YANG Xiaojin 1, a, PAN Jinjin 2,b

Port of a fixed point MPEG2-AAC encoder on a ARM platform

ISO/IEC INTERNATIONAL STANDARD. Information technology MPEG audio technologies Part 3: Unified speech and audio coding

AUDIOVISUAL COMMUNICATION

DSP. Presented to the IEEE Central Texas Consultants Network by Sergio Liberman

Figure 1. Generic Encoder. Window. Spectral Analysis. Psychoacoustic Model. Quantize. Pack Data into Frames. Additional Coding.

CS 074 The Digital World. Digital Audio

Lecture Information Multimedia Video Coding & Architectures

The BroadVoice Speech Coding Algorithm. Juin-Hwey (Raymond) Chen, Ph.D. Senior Technical Director Broadcom Corporation March 22, 2010

A MULTI-RATE SPEECH AND CHANNEL CODEC: A GSM AMR HALF-RATE CANDIDATE

6MPEG-4 audio coding tools

Center for Multimedia Signal Processing

AUDIOVISUAL COMMUNICATION

Bluray (

ABSTRACT AUTOMATIC SPEECH CODEC IDENTIFICATION WITH APPLICATIONS TO TAMPERING DETECTION OF SPEECH RECORDINGS

Transcription:

Institut Mines-Telecom Speech and audio coding Marco Cagnazzo, cagnazzo@telecom-paristech.fr MN910 Advanced compression

Outline Introduction Introduction Speech signal Music signal Masking Codeurs simples MP3 2/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Outline Introduction Speech signal Music signal Introduction Speech signal Music signal 3/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Speech signal Speech signal Music signal Non-stationary, but locally stationary (20 ms) Voiced sounds (vowels, some consonants) non-voiced (some consonants), other (transitions) Simple and effective prediction models: Linear filters AR of impulsions for voiced sounds Same filter on white noise for non-voiced sounds 4/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Speech signal Speech signal Music signal Digitalization PCM (Pulse Code Modulation) Bandwidth 200 Hz 3400 Hz Enough for understanding Sampling at 8000 Hz : Fs > 2F max 8 bits per sample rightarrow 64 kbps Extended bandwidth 50-200 Hz : more natural and 3.4-7 khz : better understanding 5/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Speech signal Speech signal Music signal Coding standards G.711 (1972) PCM (no compression), 8 samples per ms codés sur 8 bits : 64 kbps G.721 (1984) ADPCM : 32 kbps G.728 (1991) CELP (Code Excited Linear Predictive), low latency: 16 kbps G.729 (1995) CELP, without latency constraint: 8kbps G.723.1 (1995) Encoder at 6.3 kbps 6/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Speech signal Speech signal Music signal Normes de compression : Mobiles GSM 06.10 (1988) RPE-LTP : Regular Pulse Excitation Long Term Predictor. 13 (22.8) kbps GSM 06.20 (1994) Half-Rate. 5.6 (11.4) kbps GSM 06.60 (1996) ACELP. 12.2 (22.8) kbit/s GSM 06.90 (1999) Source/channel coding at variable bit-rate 4.75 12.2 (11.4 22.8)kbps (ACELP-AMR : Adaptive Multi Rate) G.722 Wideband speec coder, rates 6.6, 8.85, 12.65 kbps (AMR-WB) ; further rates 15.85 and 23.85 kbps 7/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Music signal Introduction Speech signal Music signal Large dynamics: 90dB Locally stationary signal No simple model 8/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Music signal Introduction Speech signal Music signal Compression CD: sampling at 44.1 khz, quantization on 16 bits: 705 kbps (mono) MP3 : Audio part of MPEG-1. Three layers of increasing complexity, at 192, 128 and 96 kbps AAC : Audio part of MPEG-2, reputed as the best audio encoder MPEG-4 : Sound object representation 9/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Quality Introduction Speech signal Music signal Objective criteria (MSE) not satisfying Subjective stest: Speech : understandability Music: transparency criterion. Double blind method with triple stimuli and hidden reference (UIT-T BS.1116) 10/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Outline Introduction Masking Introduction Masking 11/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Audition threshold Masking Power db 120 80 S a(f) 40 0 20 50 100 200 500 1000 2000 5000 10000 Frequency Hz 12/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Masking Power db 120 80 40 0 20 50 100 200 500 1000 2000 5000 10000 Frequency Hz 13/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Masking Frequency masking functions S m (f 0,σ 2, f) Power db S m (f 1,σ 2 1, f) S m (f 2,σ 2 2, f) For a given f 0 and σ 2, S m(f) has a triangular shape The maximum is in f = f 0 Masking index: S m(f,σ 2, f) σ 2 Frequency Hz 14/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Time Masking Masking Power db Time Pre-masking : 2 5 ms Post-masking : 100 200 ms 15/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Outline Introduction Codeurs simples Introduction Codeurs simples 16/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples LPC10 encoder at 2.4 kbps Linear Prediction Coding avec 10 échantillons Only for teaching! Window N = 160 Sample prediction using P = 10 previous samples Scheme x(n) y(n) ŷ(n) A(z) Q LPC a â 17/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Filter Introduction Codeurs simples LPC10 encoder at 2.4 kbps P y(n) = x(n) x P (n) x P (n) = h k x(n k) k=1 P y(n) = a k x(n k) a 0 = 1 a k = h k k=0 A(z) = 1+a 1 z 1 + a 2 z 2 +...+ a P z P Y(z) = A(z)X(z) A is computed by minimisation of the residual power in the current window 18/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Filter Introduction Codeurs simples LPC10 encoder at 2.4 kbps c = [r X (1) r X (2)... r X (P)] T R = a = R 1 c r X (0) r X (1)... r X (P 1) r X (1) r X (0)... r X (P 2)............ r X (P 1) r X (P 2)... r X (0) Estimation of r X : using N samples of the current window r X (k) = 1 N 1 x(n)x(n k) N k n=k k {0, 1,...,P} 19/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples LPC10 encoder at 2.4 kbps Non-voiced sounds If X was deprived of all correlation, Y would be white noise We dont send Y : the decoder will filter WN using 1/A(z) Resulting in unaudible phase noise WN is a good model only for non-voiced suonds For voice sounds we have residual periodicity 20/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples LPC10 encoder at 2.4 kbps Non-voiced sounds : Filtering WN with 1/A(z) Voiced sounds : Filtering 1/A(z) of a pulse train: ŷ(n) = α m Z δ(n mt 0 +φ) α 0 φ T 0 1 21/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples LPC10 encoder at 2.4 kbps Estimation of auto-correlation r x (k) r x (0) estimates opowerσy 2 or α If r x decreases towards zero, it is a non-voiced sound If r x is periodical, it is a voiced sound with period T 0 22/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples LPC10 encoder at 2.4 kbps Each 20 ms we send: The P filter coefficients P = 10 over 3 or 4 bits. b P = 36 bits One bit for voiced/non-voiced b v = 1 bit For voiced bα = 6 The period T0 : b T0 = 7 For non-voiced: bσ 2 = 6 bits 23/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples LPC10 encoder at 2.4 kbps Total R = ( b P + b v + b T0 + b α ) /(20ms) = (36+1+6+7)/0.02 = 2500bps or R = (b P + b v + b σ 2)/(20ms) = (36+1+6)/0.02 = 2150bps R = 2.4 kbps 24/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples Find a filter and the prediction error signal Filter A(z) : LPC idea Error y(n) : from a GS-VQ dictionary x(n) ŷ(n) x(n) ǫ(n) 1 1 L Min A(z) i g a 25/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples Filter coefficient coding Line Spectrum Pairs: effective mathematical representation of A(z) Perceptual weighting function (WF) Noise can be tolored where the signal is strong The WF depends thus on A(z) 26/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Perceptual weighting Codeurs simples We use W(z) = A(z) A(z/γ) to weight the signal Noise can be higher in the formantic areas The filter 1/A(z) has peaks in the formantic areas The filter 1/A(z/γ), with γ (0, 1) has peaks at the sames frequencies, but smaller Let pi be the poles of 1/A(z) Thus γpi are the poles of 1/A(z/γ), which are thus closer to the center 27/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Perceptual weighting Codeurs simples Blue : 1/A(z) ; Red : 1/A(z/γ) ; Black: W(z) = A(z)/1/A(z/γ) 28/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples Excitation model. It is the sum of K = 2 or K = 3 vectors Vector and gain selection High complexity Sub-optimal algorithms such as the standard iterative algorithm 29/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples Global scheme x(n) A(z) 1 A(z/γ) p(n) p 0 (n) C Min 1 L bz Q 1 A(z/γ) p(n) 30/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples Coding rate Filter coefficients, sent once per 10 ms : P = 10, bp = 18 bits Long-term predictor, sent once per 5 ms Pitch coded on 7 bits Puissance, codée sur 3 bits Residual, coded by GS-VQ, once per 5 ms : Shape : 17 bits dictionary Gain : 4 bits dictionary R = 18/0.010+(7+3+17+4)/0.005 = 8 kbps 31/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples UIT-T G.722.2 State of the art in speech coding Introduction of 50-200 Hz: more natural, presence effect Extension 3.4-7 khz : better understandability ACELP-like encoder with: Modification of perceptual weighting (wide-band) Modification of pitch information very large dictionary Rates: 6.6 to 23.85 kbps 32/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Codeurs simples Harmonic exploitation Amplitude modulation 33/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Outline Introduction MP3 Introduction MP3 34/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Perceptual coding MP3 Overlapping windows Buffer of N samples; M new samples are coded at time Examples : M = 32 et N = 512 (MP3) ; M = 1024 et N = 2048 (AAC) Three tools used for encoding the current window Time-frequence analysis Bit allocation (controlled by audion model) Quantization and lossless coding 35/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Perceptual coding MP3 Coder x(n) Estimateur spectral Modele d audition Allocation des bits Flux Binaire N M H Q x 36/41 05.03.183 Institut Mines-Telecom Speech and audio coding

Perceptual coding MP3 Decoder Flux Binaire Q* F x(n) 37/41 05.03.183 Institut Mines-Telecom Speech and audio coding

MPEG-1/MP3 MP3 Transparent music encoder: perfect subjective quality Three complexity layers MP3 : third layer (max complexity) We coniser f e = 44.1kHz 38/41 05.03.183 Institut Mines-Telecom Speech and audio coding

MPEG-1/MP3 MP3 TF transform 32 filters for the filterbank Uniform repartition of frequencies between 0 and 22 khz 700 Hz per sub-band Critical sampling and quasi-perfect reconstruction (SNR>90dB) A vector of 12 samples from the same subband is encoded jointly y k, it corresponds to 10ms 39/41 05.03.183 Institut Mines-Telecom Speech and audio coding

MPEG-1/MP3 MP3 Subband vectors are normalized y k = g k a k g k : scale factor, the larges coefficient. It is quantified on 6 bits a k : normalized vector The samples x of the current window are also used to compute the perceptual function ŜX(f) and the masking threshold Φ(f) based on psychoacustical models 40/41 05.03.183 Institut Mines-Telecom Speech and audio coding

MPEG-1/MP3 MP3 Rate allocation and quantization The signal to mask ratio is known for each subband The available bits are first given to subbands with the highest ratio, then to others For each subband, we can choose the number of bit to be used for the scalar uniform quantizer 41/41 05.03.183 Institut Mines-Telecom Speech and audio coding