Lecture 7: Audio Compression & Coding


EE E682: Speech & Audio Processing & Recognition
Lecture 7: Audio Compression & Coding
1 Information, compression & quantization
2 Speech coding
3 Wide bandwidth audio coding
Dan Ellis <dpwe@ee.columbia.edu>
http://www.ee.columbia.edu/~dpwe/e682/
Columbia University Dept. of Electrical Engineering
Spring 2003
E682 SAPR - Dan Ellis, L7 - Coding, 2003-03-31

1 Compression & Quantization
How big is audio data? What is the bitrate?
  - F_s frames/second (e.g. 8000 or 44100)
    x C samples/frame (e.g. 1 or 2 channels)
    x B bits/sample (e.g. 8 or 16)
    = F_s x C x B bits/second (e.g. 64 Kbps or 1.4 Mbps)
[Figure: bits/frame against frames/sec for Mobile (~13 Kbps), Telephony (64 Kbps) and CD Audio (1.4 Mbps)]
How to reduce?
  - lower sampling rate -> less bandwidth (muffled)
  - lower channel count -> no stereo image
  - lower sample size -> quantization noise
Or: use data compression
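Worked example (not part of the original slides): the product F_s x C x B can be checked directly. The sketch below assumes the usual rates of 8000 frames/second for telephony and 44100 frames/second for CD audio; the function name `pcm_bitrate` is made up for illustration.

```python
def pcm_bitrate(frames_per_sec: int, channels: int, bits_per_sample: int) -> int:
    """Raw PCM bitrate in bits/second: Fs x C x B."""
    return frames_per_sec * channels * bits_per_sample

# Telephony: 8000 frames/s, mono, 8-bit samples
telephony = pcm_bitrate(8000, 1, 8)
# CD audio: 44100 frames/s, stereo, 16-bit samples
cd_audio = pcm_bitrate(44100, 2, 16)

print(telephony)  # 64000   (64 Kbps)
print(cd_audio)   # 1411200 (about 1.4 Mbps)
```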

Data compression: Redundancy vs. Irrelevance
Two main principles in compression:
  - remove redundant information
  - remove irrelevant information
Redundant info is implicit in remainder
  - e.g. signal bandlimited to 2 kHz, but sampled at 8 kHz: can recover every other sample by interpolation
[Figure: sample vs. time. In a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples]
Irrelevant info is unique but unnecessary
  - e.g. recording a microphone signal at 80 kHz sampling rate

Irrelevant data in audio coding
For coding of audio signals, irrelevant means perceptually insignificant
  - an empirical property
Compact Disc standard is adequate:
  - 44 kHz sampling for 20 kHz bandwidth
  - 16 bit linear samples for ~96 dB peak SNR
Reflect limits of human sensitivity:
  - 20 kHz bandwidth, 100 dB intensity
  - sinusoid phase, detail of noise structure
  - dynamic properties: hard to characterize
Problem: separating salient & irrelevant

Quantization
Represent waveform with discrete levels
[Figure: staircase quantizer characteristic Q[x] against x; waveform x[n], its quantized version Q[x[n]] and the error e[n] = x[n] - Q[x[n]]]
Equivalent to adding error e[n]: $x[n] = Q[x[n]] + e[n]$
e[n] ~ uncorrelated, uniform white noise
  - p(e[n]) is uniform over -D/2 .. +D/2
  - variance $\sigma_e^2 = \frac{D^2}{12}$

Quantization noise (Q-noise)
Uncorrelated noise has flat spectrum
With a B bit word and a quantization step D
  - max signal range (x) = $-(2^{B-1}) D \;..\; (2^{B-1}-1) D$
  - quantization noise (e) = -D/2 .. D/2
Best signal-to-noise ratio (power):
  $SNR = \frac{E[x^2]}{E[e^2]} = (2^B)^2$
  .. or, in dB, $20 \log_{10} 2^B \approx 6 \cdot B$ dB
[Figure: level / dB vs. freq / Hz for a signal quantized at 7 bits]
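Worked example (not part of the original slides): a quick numerical check of the two results above, the D^2/12 error variance and the roughly 6 dB per bit SNR. It is only a sketch under stated assumptions: the test signal is uniformly distributed over the full range -1..1, the quantizer is a plain rounding quantizer with no clipping, and the names `quantize` and `measure` are invented here.

```python
import math
import random

def quantize(x: float, step: float) -> float:
    """Uniform quantizer: round x to the nearest multiple of step."""
    return step * round(x / step)

def measure(bits: int, n: int = 50_000, seed: int = 0):
    """Return (error variance, SNR in dB, step) for a full-scale uniform test signal."""
    rng = random.Random(seed)
    step = 2.0 / (2 ** bits)      # range -1..1 divided into 2^B steps
    sig_pow = 0.0
    err_pow = 0.0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        e = x - quantize(x, step)  # e[n] = x[n] - Q[x[n]]
        sig_pow += x * x
        err_pow += e * e
    return err_pow / n, 10 * math.log10(sig_pow / err_pow), step

for bits in (7, 8, 16):
    var, snr_db, step = measure(bits)
    # Compare the measured variance with D^2/12 and the SNR with 6.02 dB per bit
    print(bits, var / (step ** 2 / 12), round(snr_db, 1), round(6.02 * bits, 1))
```

For other signal distributions the SNR is offset by a constant, but the slope of about 6 dB per extra bit stays the same.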

Redundant information
Redundancy removal is lossless
Signal correlation implies redundant information
  - e.g. if x[n] = x[n - 1] + v[n], x[n] has a greater amplitude range -> uses more bits than v[n]
  - sending v[n] = x[n] - x[n - 1] can reduce amplitude, hence bitrate
[Figure: x[n] and the difference x[n] - x[n-1]]
  - white noise sequence has no redundancy
Problem: separating unique & redundant
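Worked example (not part of the original slides): first-difference coding of a correlated integer signal. The random-walk test signal and the function names are assumptions made for illustration; the point is only that the differences span a much smaller range than the signal and that decoding is exact.

```python
import random

def diff_encode(x):
    """Send the first sample, then v[n] = x[n] - x[n-1]."""
    return [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]

def diff_decode(v):
    """Rebuild x[n] = x[n-1] + v[n] with a running sum."""
    x = [v[0]]
    for d in v[1:]:
        x.append(x[-1] + d)
    return x

# Correlated test signal: a random walk with steps in -3..3
rng = random.Random(1)
x = [0]
for _ in range(999):
    x.append(x[-1] + rng.randint(-3, 3))

v = diff_encode(x)
x_rec = diff_decode(v)

print(x_rec == x)                       # lossless round trip
print(max(abs(s) for s in x))           # amplitude range of x[n]
print(max(abs(d) for d in v[1:]))       # amplitude range of v[n], at most 3
```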

Optimal coding
Shannon information: An unlikely occurrence is more informative
  p(A) = 0.5, p(B) = 0.5: ABBBBAAABBABBABBABB (A, B equiprobable)
  p(A) = 0.9, p(B) = 0.1: AAAAABBAAAAAABAAAAB (A is expected; B is big news)
Information in bits: $I = -\log_2(\text{probability})$
  - clearly works when all possibilities equiprobable
Optimal bitrate -> average token length = entropy H = E[I]
  - i.e. equal-length tokens are equally likely
How to achieve this?
  - transform signal to have uniform pdf
  - nonuniform quantization for equiprobable tokens
  - variable-length tokens -> Huffman coding
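Worked example (not part of the original slides): the information and entropy of the two A/B sources above, taking information as the negative base-2 log of the probability. The helper names are invented for this sketch.

```python
import math

def info_bits(p: float) -> float:
    """Shannon information of an event with probability p."""
    return -math.log2(p)

def entropy_bits(probs) -> float:
    """Entropy H = E[I] of a discrete distribution, in bits per token."""
    return sum(p * info_bits(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))   # 1.0 bit per token: nothing to gain
print(entropy_bits([0.9, 0.1]))   # about 0.469 bits per token
print(info_bits(0.1))             # about 3.32 bits: B is big news
```

With p(A) = 0.9 an ideal code would need less than half a bit per token on average, which a one-bit-per-token code cannot reach.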

Quantization for optimum bitrate
Quantization should reflect pdf of signal:
[Figure: pdf p(x = x_0) and cumulative pdf p(x < x_0) against x, with the mapping from x to x']
  - cumulative pdf p(x < x_0) maps to uniform x'
  - or: nonuniform quantization bins
Or, codeword length per Shannon, $-\log_2(p(x))$:
[Figure: p(x), Shannon info / bits and the corresponding variable-length codewords against x]

Huffman coding
Variable-length bit sequence tokens can code unequally probable events
Tree-structure for unambiguous decoding:
[Figure: binary code tree with leaf probabilities p = 0.5, p = 0.25 and four leaves of p = 0.0625]
Can build tables to approximate arbitrary distributions
Eliminates irrelevance.. within limits
Problem: very probable events -> short tokens
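Worked example (not part of the original slides): a minimal Huffman code builder using a priority queue. The symbol names and the probabilities (0.5, 0.25 and four symbols at 0.0625) are chosen for illustration; because they are all powers of 1/2, the average code length equals the entropy exactly. With less convenient probabilities the code only approximates the entropy.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return {symbol: bitstring} for a dict of symbol probabilities."""
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codes with 0 / 1
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c0.items()}
        merged.update({s: "1" + b for s, b in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.0625, "d": 0.0625, "e": 0.0625, "f": 0.0625}
code = huffman_code(probs)
avg_len = sum(p * len(code[s]) for s, p in probs.items())

for sym in sorted(code):
    print(sym, probs[sym], code[sym])
print(avg_len)  # 2.0 bits per symbol
```

No codeword is a prefix of another, which is what makes decoding unambiguous. The shortest possible token is still one whole bit, so a symbol with probability close to 1 cannot be coded efficiently on its own.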

Vector Quantization
Quantize mutually dependent values in joint space:
[Figure: scatter of x_1 against x_2 with joint quantization cells]
May help even if values are largely independent
  - larger space {x_1, x_2} is easier for Huffman
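Worked example (not part of the original slides): the encoding step of vector quantization is a nearest-codeword search in the joint space. The four-entry codebook below is hypothetical and hand-picked to lie along a correlated direction; a real codebook would be trained on data.

```python
def vq_encode(vec, codebook):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))

def vq_decode(index, codebook):
    """The decoder just looks the codeword up."""
    return codebook[index]

# Hypothetical 2-bit codebook for correlated pairs (x1, x2)
codebook = [(-2.0, -1.0), (0.0, 0.0), (2.0, 1.0), (4.0, 2.0)]

idx = vq_encode((1.7, 1.2), codebook)
print(idx, vq_decode(idx, codebook))  # 2 (2.0, 1.0)
```

Here one 2-bit index stands for both values of the pair, i.e. 1 bit per value.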

Compression & Representation
As always, success depends on representation
Appropriate domain may be naturally bandlimited
  - e.g. vocal-tract-shape coefficients
  - can reduce sampling rate without data loss
In right domain, irrelevance is easier to get at
  - e.g. STFT to separate magnitude and phase

Aside: Coding standards
Coding is only useful when recipient knows the code!
Standardization efforts are important
Federal Standards: Low bit-rate secure voice:
  - FS1015e: LPC-10 2.4 Kbps
  - FS1016: 4.8 Kbps CELP
ITU G.series
  - G.726 ADPCM
  - G.729 Low delay CELP
MPEG
  - MPEG-Audio layers 1,2,3
  - MPEG 2 Advanced Audio Codec
  - MPEG 4 Synthetic-Natural Hybrid Codec
etc...

Outline
1 Information, compression & quantization
2 Speech coding
  - General principles
  - CELP & friends
3 Wide bandwidth audio coding

2 Speech coding
Standard voice channel:
  - analog: 4 kHz slot (~40 dB SNR)
  - digital: 64 Kbps = 8 bit µ-law x 8 kHz
How to compress?
Redundant
  - signal assumed to be a single voice, not any possible waveform
Irrelevant
  - need to code only enough for intelligibility, speaker identification (c/w analog channel)
Specifically, source-filter decomposition
  - vocal tract & fund. frequency change slowly
Applications:
  - live communications
  - offline storage

Channel Vocoder (1940s-1960s)
Basic source-filter decomposition
  - filterbank breaks into spectral bands
  - transmit slowly-changing energy in each band
[Figure: encoder: input -> bandpass filters 1..N -> smoothed energy -> downsample & encode -> E_1..E_N, plus voicing analysis -> V/UV, pitch; decoder: pulse generator / noise source -> bandpass filters 1..N scaled by E_1..E_N -> output]
  - 10-20 bands, perceptually spaced
Downsampling?
Excitation?
  - pitch / noise model
  - or: baseband + flattening...

LPC encoding
The classic source-filter model
[Figure: encoder: input s[n] -> LPC analysis -> filter coefficients {a_i} (spectral envelope 1/A(e^{jω})) and residual e[n], each represented & encoded; decoder: excitation generator ê[n] -> all-pole filter -> output ŝ[n]]
All-pole filter: $H(z) = \frac{1}{1 - \sum_i a_i z^{-i}}$
Compression gains:
  - filter parameters are ~slowly changing
  - excitation can be represented many ways
[Figure: update timeline: filter parameters (20 ms), excitation/pitch parameters (5 ms)]
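Worked example (not part of the original slides): one common way to do the LPC analysis step is the autocorrelation method solved with the Levinson-Durbin recursion. The slides do not say which method is intended, so treat this as an assumption. The sketch uses the predictor convention s[n] ~ sum_i a_i s[n-i], which matches H(z) = 1 / (1 - sum_i a_i z^-i), and tests it on a synthetic second-order signal with made-up coefficients.

```python
import random

def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion.

    Returns (a, ks, err): predictor coefficients a[0..order-1] for
    s[n] ~ sum_i a[i-1] * s[n-i], the reflection coefficients, and the
    final prediction-error energy.
    """
    a = [0.0] * (order + 1)
    ks = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err
        ks.append(k)
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], ks, err

# Synthetic test signal: white noise through a known all-pole filter
rng = random.Random(0)
true_a = [1.3, -0.6]
s = [0.0, 0.0]
for _ in range(20000):
    s.append(true_a[0] * s[-1] + true_a[1] * s[-2] + rng.gauss(0.0, 1.0))

r = autocorr(s, 2)
a_est, ks, err = levinson(r, 2)
print(a_est)          # close to [1.3, -0.6]
print(ks)             # reflection coefficients, all with magnitude below 1
print(err / r[0])     # residual energy as a fraction of signal energy
```

The reflection coefficients fall out of the recursion as a by-product. Their magnitude stays below 1 for a valid autocorrelation sequence, which gives a simple stability check.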

Encoding LPC filter parameters
For communications quality:
  - 8 kHz sampling (4 kHz bandwidth)
  - ~10th order LPC (up to 5 pole pairs)
  - update every 20-30 ms -> 300-500 param/s
Representation & quantization
  - {a_i}: poor distribution, can't interpolate
  - reflection coefficients {k_i}: guaranteed stable
  - LSPs: lovely!
[Figure: distributions of a_i, k_i and LSP frequencies f_Li]
Bit allocation (filter):
  - GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps
  - FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps

Line Spectral Pairs (LSPs)
LSPs encode LPC filter by a set of frequencies
Excellent for quantization & interpolation
Definition: zeros of
  $P(z) = A(z) + z^{-p-1} A(z^{-1})$
  $Q(z) = A(z) - z^{-p-1} A(z^{-1})$
  - $z = e^{j\omega} \Rightarrow z^{-1} = e^{-j\omega} \Rightarrow |A(z)| = |A(z^{-1})|$ on the unit circle
  - P(z), Q(z) have (interleaved) zeros when $\angle\{A(z)\} = \pm \angle\{z^{-p-1} A(z^{-1})\}$
  - reconstruct P(z), Q(z) = $\prod_i (1 - \zeta_i z^{-1})$ etc.
  - A(z) = [P(z) + Q(z)]/2
[Figure: A(z), A(z^{-1}), P(z) and Q(z)]

Excitation
Excitation already better than raw signal:
[Figure: original signal and LPC residual vs. time / s]
  - save several bits/sample, still > 32 Kbps
Crude model: U/V flag + pitch period
  - ~7 bits / 5 ms = 1.4 Kbps -> LPC10 @ 2.4 Kbps
[Figure: pitch period values and 16 ms frame boundaries vs. time / s]
Band-limit then re-extend (RELP)

Encoding excitation
Something between full-quality residual (32 Kbps) and pitch parameters (2.4 kbps)?
Analysis by synthesis loop:
[Figure: input s[n] -> LPC analysis -> filter coefficients A(z); excitation code -> synthetic excitation control c[n] -> pitch-cycle predictor (+ b z^{-N_L}) -> x̂[n] -> LPC filter 1/A(z) -> error x̂[n]*h_a[n] - s[n] -> perceptual weighting W(z) -> MS error minimization -> excitation encoding]
Perceptual weighting: $W(z) = \frac{A(z)}{A(z/\gamma)}$
Perceptual weighting discounts peaks:
[Figure: gain / dB vs. freq / Hz for A(z), A(z/γ) and W(z) = A(z)/A(z/γ)]
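Worked example (not part of the original slides): with A(z) = 1 - sum_i a_i z^-i, replacing z by z/γ just scales each coefficient, a_i -> a_i γ^i, so the weighting filter W(z) = A(z)/A(z/γ) is cheap to build. The sketch below uses a made-up second-order filter and γ = 0.8 (both assumptions) and evaluates the gain of W(z) near the spectral peak of 1/A(z).

```python
import cmath
import math

def A(a, z):
    """A(z) = 1 - sum_i a_i z^-i, with a = [a_1, a_2, ...]."""
    return 1 - sum(ai * z ** -(i + 1) for i, ai in enumerate(a))

def expand(a, gamma):
    """Coefficients of A(z/gamma): a_i -> a_i * gamma^i."""
    return [ai * gamma ** (i + 1) for i, ai in enumerate(a)]

def weight_gain_db(a, gamma, omega):
    """Gain of W(z) = A(z) / A(z/gamma) at z = exp(j*omega), in dB."""
    z = cmath.exp(1j * omega)
    w = A(a, z) / A(expand(a, gamma), z)
    return 20 * math.log10(abs(w))

a = [1.3, -0.6]        # hypothetical LPC coefficients, resonance near 0.575 rad
gamma = 0.8            # hypothetical weighting factor

print(weight_gain_db(a, gamma, 0.575))   # negative: error near the peak counts less
print(weight_gain_db(a, gamma, math.pi)) # positive: error far from the peak counts more
```

A negative gain at the resonance means coding error near a formant peak is weighted down, which is the sense in which the weighting discounts peaks.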

Multi-Pulse Excitation (MPE-LPC)
Stylize excitation as N discrete pulses
[Figure: original LPC residual and multi-pulse excitation vs. time / samps, with pulses of amplitude m_i at times t_i]
  - encode as N x (t_i, m_i) pairs
Greedy algorithm places one pulse at a time:
  $E_{pcp} = \frac{A(z)}{A(z/\gamma)} \left[ \frac{X(z)}{A(z)} - S(z) \right] = \frac{X(z)}{A(z/\gamma)} - \frac{R(z)}{A(z/\gamma)}$
  (with R(z) = A(z) S(z), the LPC residual)
  - cross-correlate h_γ and r * h_γ, iterate

CELP
Represent excitation with codebook, e.g. 512 sparse excitation vectors
[Figure: codebook of excitation vectors by search index, excitation vs. time / samples]
  - linear search for minimum weighted error?
FS1016 4.8 Kbps CELP (30 ms frame = 144 bits):
  10 LSPs        4x4 + 6x3 bits = 34 bits
  Pitch delay    4 x 7 bits     = 28 bits
  Pitch gain     4 x 5 bits     = 20 bits
  Codebk index   4 x 9 bits     = 36 bits
  Codebk gain    4 x 5 bits     = 20 bits
  Total                           138 bits
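Worked example (not part of the original slides): the bit budget of the 4.8 Kbps CELP frame can be checked by simple arithmetic. The sketch assumes a 30 ms frame with four subframes, which is what 144 bits per frame at 4.8 Kbps implies; the dictionary keys are just labels.

```python
# Bits per 30 ms frame for each parameter group
fields = {
    "LSPs": 4 * 4 + 6 * 3,      # ten LSPs: four at 4 bits, six at 3 bits
    "pitch delay": 4 * 7,       # four subframes x 7 bits
    "pitch gain": 4 * 5,
    "codebook index": 4 * 9,    # 9 bits address 2**9 = 512 codebook entries
    "codebook gain": 4 * 5,
}

total_bits = sum(fields.values())
frame_bits = 144
frame_ms = 30
bitrate = frame_bits * 1000 // frame_ms

print(total_bits)               # 138
print(frame_bits - total_bits)  # 6 bits per frame not itemised in the table
print(bitrate)                  # 4800 bits/second
print(2 ** 9)                   # 512 excitation vectors
```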

Aside: CELP for nonspeech?
CELP is sometimes called a hybrid coder:
  - originally based on source-filter voice model
  - CELP residual is waveform coding (no model)
CELP does not break with multiple voices etc.
[Figure: spectrograms (freq / Hz vs. time / s) of the original (mrzebra-8k) and of 4.8 Kbps CELP]
  - just does the best it can
LPC filter models vocal tract; also matches auditory system?
  - i.e. the source-filter separation is good for relevance as well as redundancy?

Outline
1 Information, compression & quantization
2 Speech coding
3 Wide bandwidth audio coding
  - General principles
  - MPEG-Audio

3 Wide-Bandwidth Audio Coding
Goals:
  - transparent coding, i.e. no perceptible effect
  - general purpose: handles any signal
Simple approaches (redundancy removal)
  - Adaptive Differential PCM (ADPCM)
[Figure: ADPCM loop. D[n] = X[n] - X_p[n]; adaptive quantizer C[n] = Q[D[n]]; dequantizer D'[n] = Q^{-1}[C[n]]; X'[n] = X_p[n] + D'[n]; predictor X_p[n] = F[X'[n-i]]]
  - as prediction gets smarter, becomes LPC, e.g. shorten: lossless LPC encoding
Larger compression gains need irrelevance
  - hide quantization noise with psychoacoustic masking
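Worked example (not part of the original slides): a stripped-down differential PCM loop. It is deliberately simpler than ADPCM: the quantizer step is fixed rather than adaptive and the predictor is just the previous reconstructed sample, both assumptions made to keep the sketch short. It does keep the key structural point, which is that the encoder predicts from the decoder's reconstruction X'[n], so quantization errors do not accumulate.

```python
import math

def dpcm_encode(x, step, coef=1.0):
    """Return integer codes C[n]; the predictor uses the reconstruction X'[n-1]."""
    codes = []
    recon_prev = 0.0
    for sample in x:
        pred = coef * recon_prev          # X_p[n] = F[X'[n-1]]
        d = sample - pred                 # D[n] = X[n] - X_p[n]
        c = round(d / step)               # C[n] = Q[D[n]]
        codes.append(c)
        recon_prev = pred + c * step      # X'[n] = X_p[n] + D'[n]
    return codes

def dpcm_decode(codes, step, coef=1.0):
    """Rebuild X'[n] from the codes with the same predictor as the encoder."""
    out = []
    recon_prev = 0.0
    for c in codes:
        recon_prev = coef * recon_prev + c * step
        out.append(recon_prev)
    return out

step = 0.05
x = [math.sin(2 * math.pi * n / 200) for n in range(400)]   # slowly varying test tone
codes = dpcm_encode(x, step)
y = dpcm_decode(codes, step)

print(max(abs(c) for c in codes))                 # small codes: few bits needed
print(max(abs(a - b) for a, b in zip(x, y)))      # error stays within step / 2
```

Quantizing x[n] directly with the same step would need codes up to about 20 in magnitude for this signal, against 1 or 2 for the differences.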

Noise shaping
Plain Q-noise sounds like added white noise
  - actually, not all that disturbing
  - .. but worst-case for exploiting masking
Have Q-noise scale with signal level
  - i.e. quantizer step gets larger with amplitude
  - minimum distortion for some center-heavy pdf
[Figure: mu-law quantization: mu(x) and x - mu(x) against x]
Or: put Q-noise around peaks in spectrum
  - key to getting benefit of perceptual masking
[Figure: level / dB vs. freq / Hz showing signal, linear Q-noise and transform Q-noise]
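Worked example (not part of the original slides): mu-law companding is one way to make the quantizer step grow with amplitude. The sketch uses the continuous textbook mu-law curve with mu = 255, not the exact segmented encoding of any telephony standard, and the function names are invented here.

```python
import math

MU = 255.0

def mu_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] with logarithmic compression."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def mu_quantize(x, bits=8, mu=MU):
    """Quantize uniformly in the compressed domain, then expand again."""
    levels = 2 ** (bits - 1)
    y = round(mu_compress(x, mu) * levels) / levels
    return mu_expand(y, mu)

levels = 2 ** 7
small_step = mu_expand(1 / levels) - mu_expand(0.0)       # step size near zero
large_step = mu_expand(1.0) - mu_expand(1 - 1 / levels)   # step size near full scale

print(small_step, large_step)
print(mu_quantize(0.01), mu_quantize(0.9))
```

The step near full scale is a couple of hundred times larger than the step near zero, so the quantization noise roughly follows the signal level.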

Subband coding
Idea: Quantize separately in separate bands
[Figure: encoder: input -> analysis filters (M bandpass channels) -> downsample by M -> quantize Q[.]; decoder: Q^{-1}[.] -> upsample by M -> reconstruction filters -> sum -> output]
  - Q-noise stays within band, gets masked
Critical sampling -> 1/M of spectrum per band
[Figure: gain / dB vs. normalized freq, showing alias energy between adjacent bands]
  - some aliasing inevitable
Trick is to cancel with alias of adjacent band -> quadrature-mirror filters

MPEG-Audio
Basic idea: Subband coding plus psychoacoustic masking model to choose dynamic Q-levels in subbands
[Figure: input -> 32 band polyphase filterbank -> scale normalize -> quantize -> format & pack bitstream -> bitstream; psychoacoustic masking model -> per-band masking margins; scale indices; data frame = control & scales + 32 chans x 36 samples (24 ms)]
  - 22 kHz / 32 equal bands = 690 Hz bandwidth
  - 8 / 24 ms frames = 12 / 36 subband samples
  - fixed bitrates 32-256 Kbps/chan (1-6 bits/samp)
  - scale factors like LPC envelope?

MPEG Psychoacoustic model
Based on simultaneous masking experiments
Difficulties:
  - noise energy masks ~10 dB better than tones
  - masking level nonlinear in frequency & intensity
  - complex, dynamic sounds not well understood
Procedure
  - pick tonal peaks in NB FFT spectrum
  - remaining energy -> noisy peaks
  - apply nonlinear spreading function
  - sum all masking & threshold in power domain
[Figure: three panels of SPL / dB vs. freq / Bark: signal spectrum with tonal peaks; masking spread with non-tonal energy; resultant masking]

MPEG Bit allocation
Result of psychoacoustic model is maximum tolerable noise per subband
[Figure: level vs. freq for subband N: masking tone, masked threshold, safe noise level and quantization noise, with SNR ~ 6·B]
  - safe noise level -> required SNR -> bits B
Bit allocation procedure (fixed bit rate):
  - pick channel with worst noise-masker ratio
  - improve its quantization by one step
  - repeat while more bits available for this frame
Bands with no signal above masking curve can be skipped entirely
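Worked example (not part of the original slides): the greedy procedure above can be sketched in a few lines. This is a simplified illustration, not the allocation loop of any real MPEG encoder: it assumes each extra bit buys 6 dB of SNR, takes a per-band signal-to-mask ratio in dB as input, and ignores the cost of scale factors and other side information. The names `allocate_bits`, `smr_db` and `max_bits` are invented here.

```python
def allocate_bits(smr_db, total_bits, max_bits=15, db_per_bit=6.0):
    """Greedy allocation: give one more bit to the band whose noise is
    furthest above its masked threshold, until bits run out or all noise
    is masked."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        # Noise-to-mask ratio per band: NMR = SMR - SNR, with SNR ~ 6 dB per bit
        nmr = [s - db_per_bit * b if b < max_bits else float("-inf")
               for s, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:
            break  # all quantization noise is already at or below the mask
        bits[worst] += 1
    return bits

# Hypothetical signal-to-mask ratios for four subbands, in dB
smr = [20.0, 5.0, -3.0, 12.0]

print(allocate_bits(smr, 8))   # [4, 1, 0, 2]
print(allocate_bits(smr, 3))   # [2, 0, 0, 1]
```

The third band has its signal below the masking curve (negative ratio), so it never receives bits and can be skipped. With a budget of 8 the loop stops after 7 bits because nothing audible is left to fix.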

MPEG Audio Layer III
Transform coder on top of subband coder
[Figure: Layer III encoder: digital audio signal (PCM, 768 kbit/s) -> filterbank (32 subbands, 0..31) -> MDCT with window switching (lines 0..575) -> distortion control loop / nonuniform quantization -> rate control loop / Huffman encoding -> coding of side information -> bitstream formatting, CRC check -> coded audio signal (192 kbit/s .. 32 kbit/s); FFT (1024 points) -> psychoacoustic model; external control]
Blocks of 36 subband time-domain samples become 18 pairs of frequency-domain samples
  - more redundancy in spectral domain
  - finer control e.g. of aliasing, masking
  - scale factors now in band-blocks
Fixed Huffman tables optimized for audio data
Power-law quantizer

Adaptive time window
Time window relies on temporal masking
  - single quantization level over 8-24 ms window
Nightmare scenario: pre-echo distortion
  - backward masking saves in most cases
Adaptive switching of time window:
[Figure: window level vs. time / ms for normal, transition and short windows]

The effects of MP3
[Figure: three spectrograms (freq / kHz vs. time / sec): Josie, direct from CD; after MP3 encode (160 kbps) and decode; residual (after aligning 1148 sample delay)]
  - chop off high frequency (above 16 kHz)
  - occasional other time-frequency holes
  - quantization noise under signal

MP3 & Beyond
MP3 is transparent at ~128 Kbps for stereo (11x smaller than 1.4 Mbps CD rate)
  - only decoder is standardized: better psych. models -> better encoders
MPEG2 AAC
  - rebuild of MP3 without backwards compatibility
  - 30% better (stereo at 96 Kbps?)
  - multichannel etc.
MPEG4-Audio
  - wide range of component encodings
  - MPEG Audio, LSPs, ...
SAOL
  - synthetic component of MPEG-4 Audio
  - complete DSP/computer music language!
  - how to encode into it?

Summary
For coding, every bit counts
  - take care over quantization domain & effects
  - Shannon limits...
Speech coding
  - LPC modeling is old but good
  - CELP residual modeling can go beyond speech
Wide-band coding
  - noise shaping hides quantization noise
  - detailed psychoacoustic models are key