Parametric Coding of Spatial Audio Ph.D. Thesis Christof Faller, September 24, 2004 Thesis advisor: Prof. Martin Vetterli Audiovisual Communications Laboratory, EPFL Lausanne
Parametric Coding of Spatial Audio Contents: - Audio Coding and Thesis Motivation - Background - Binaural Cue Coding (BCC) - Variations of BCC - Source Localization in Complex Listening Scenarios - Conclusions
Audio Coding Audio coding Transmission x Encoder Decoder x Convert audio signal into a representation suitable for: - Transmission - Storage Minimize bitrate - Optimal for storage - Optimal for transmission Storage
Audio Coding Lossless coding x Encoder Decoder x 50% Memory/bitrate reduction (Put music of 10 CDs on 1 CD) x = x Redundancy reduction: -Time to frequency transform -Entropy coding
Perceptual audio coding Audio Coding x Encoder Decoder x 90% Memory/bitrate reduction (Put music of 10 CDs on 1 CD) x x x x x x
Parametric audio coding Audio Coding x Encoder Decoder x >90% Memory/bitrate reduction Drawback: Quality loss
Thesis Motivation Receiver properties and source properties have been extensively considered in audio coding for monaural audio signals. This thesis: Extensively consider receiver and source properties for stereo and multi-channel audio signals.
Parametric Coding of Spatial Audio Contents: - Audio Coding and Thesis Motivation - Background - Binaural Cue Coding (BCC) - Variations of BCC - Source Localization in Complex Listening Scenarios - Conclusions
Background Spatial audio playback Two-channel stereo (headphone and loudspeaker playback) 5.1 Surround LFE Left Center Right Left Right Left Right Rear left Rear right Spatial audio playback: - perception of an auditory spatial image
Background Interaural differences One source, free-field s Source azimuth and ITD/ILD: (Narrowband signal) φ = 90 ILD φ = 0 ITD φ φ = -90 distance difference, head shadowing interaural time and level difference (ITD and ILD)
Background Inter-channel differences: Inter-channel time-difference (ICTD) Inter-channel level-difference (ICLD) Inter-channel coherence (ICC) ICTD, ICLD ICC
Background Mixing stereo signals: a 1 a 1 d 1 d 1 s 1 a 2 s 1 a 2 d 2 d 2 Σ Σ Σ Σ a 3 a 3 d 3 d 3 s 2 a 4 s 2 a 4 d 4 d 4
Background Other auditory spatial image attributes: Auditory event distance Auditory event width Listener envelopment - Power of ear-input signals - Ratio of power of direct to reflected sound - Lateral fraction - Late lateral energy fraction Spatial audio playback: These attributes are controlled by adding reflections to the signal channels.
Parametric Coding of Spatial Audio Contents: - Audio Coding and Thesis Motivation - Background - Binaural Cue Coding (BCC) - Variations of BCC - Source Localization in Complex Listening Scenarios - Conclusions
Binaural Cue Coding (BCC) Conventional multi-channel transmission Proposed multi-channel transmission MONO SIGNAL ANALYSIS + DOWNMIX SYNTHESIS
Binaural Cue Coding (BCC) Estimation of inter-channel cues: ICTD, ICLD, and ICC Frequency ICC ICTD, ICLD Time ICTD/ICLD: 1-3 kb/s (per channel pair) ICC: 1-3 kb/s
Binaural Cue Coding (BCC) ICTD/ICLD/ICC synthesis s FB s d 1 d 2 a 1 a 2 h 1 h 2 x 1 x 2 IFB IFB x 1 x 2 d C a C h C x C IFB x C d i : delays a i : scale factors h i : ICTD ICLD h i : filters f f
Binaural Cue Coding (BCC) ICTD, ICLD, and ICC and auditory spatial image attributes Auditory event localization: ICTD, ICLD Effect of late reflections: (ambience, listener envelopment, auditory event distance) ICC (de-correlation) ICLD (reverberation decay) Effect of early reflections: (auditory event width, coloration) ICC (de-correlation) ICLD (comb filter)
Binaural Cue Coding (BCC) Subjective evaluation: Stereo audio quality GRADING 100 80 60 40 20 0 anchor 1 anchor 2 no ICC ICC reference INDIVIDUAL ITEMS A B C D E F G H I J K L M N AVERAGE Excellent Good Fair Poor Bad MUSHRA (ITU-R BS.1534), headphone listening, 7 experienced subjects BCC: "excellent"
Surround 5.1 Demo: Binaural Cue Coding (BCC) Original Mono Synthesized ANALYSIS + DOWNMIX SYNTHESIS 15 kb/s
Binaural Cue Coding (BCC) Surround 5.1 Demo: 44 kb/s (lowest bitrate ever demonstrated) Memory of 1 CD = 32 CDs worth of music! Not stereo, but surround! Original Synthesized ANALYSIS + DOWNMIX 32 kb/s HE-AAC SYNTHESIS 12 kb/s
Parametric Coding of Spatial Audio Contents: - Audio Coding and Thesis Motivation - Background - Binaural Cue Coding (BCC) - Variations of BCC - Source Localization in Complex Listening Scenarios - Conclusions
Variations of BCC "MP3 Surround": Stereo backwards compatible coding of 5.1 Stereo-downmix: 2-to-5 Synthesis: Lt FB y 1 Upmixing.. s 1 s 4 d 1 d 4 a 1 a 4 A x 1 x 4 IFB IFB L sl + s 3 a 3 x 3 IFB C Lt = L + 0.7C + sl Rt = R + 0.7C + sr Rt FB y 2.. s 2 s 5 d 2 d 5 a 2 a 5 A x 2 x 5 IFB IFB R sr
Variations of BCC Low complexity 5-to-2 BCC: Subjective evaluation: MUSHRA (ITU-R BS.1534) Hidden reference, Anchor, AAC, 5-to-2 BCC + MP3, Dolby Prologic II 256 kb/s 192 kb/s PCM
Variations of BCC MP3 Surround Demo: Lt Rt L LFE C R Ls Rs MP3 Surround
Variations of BCC Variations of BCC Conventional flexible rendering BCC for flexible rendering ANALYSIS + DOWNMIX MONO SIGNAL
Variations of BCC BCC for Flexible Rendering Side information: Time-frequency structure of sum signal FREQUENCY 2 1 2 3 1 TIME SUBBAND INDEX... 1 1 1 2 3 3 1 1 1 2 3 3 2 2 2 2 1 1... TIME INDEX k
Binaural Cue Coding (BCC) BCC for Flexible Rendering Demo: 4 Instruments playing together. ANALYSIS + DOWNMIX MONO SIGNAL 2.5kb/s
Parametric Coding of Spatial Audio Contents: - Audio Coding and Thesis Motivation - Background - Binaural Cue Coding (BCC) - Variations of BCC - Source Localization in Complex Listening Scenarios - Conclusions
Source Localization in Complex Listening Scenarios Model for source localization in complex listening scenarios: - concurrently active sources - direct and reflected sound
Source Localization in Complex Listening Scenarios IC and concurrent sources s 1 s 2 e 1 e 2 1 IC 0.5 0 30 20 10 0 10 20 30 p 1 /p 2 [db] IC can be used as a "single-source-active indicator"
Source Localization in Complex Listening Scenarios IC and reflections s' e 1 s e 2 e 1 e 2 IC 1 n 1 n 2 n IC can be used as a "first-wavefront indicator"
Source Localization in Complex Listening Scenarios Model for source localization Psychological aspect of model Cue selection: Use ITD and ILD when IC > c 0 ITD, ILD, IC Binaural processor Inner ear (cochlea) Middle ear [db] [db] f [Hz] External ear f [Hz] h 1, h 2
Source Localization in Complex Listening Scenarios Two concurrent male speech sources, free-field 500Hz 1 IC c 0.9 c 0 = 0.95 s 1 s 2 0.8 5 40 40 Δ L [db] ILD 0 5 1 ITD τ [ms] 0 1 0 0.2 0.4 0.6 0.8 1 TIME [s]
Source Localization in Complex Listening Scenarios Precedence effect: (Broadband pulses, 5ms lead/lag delay, free-field) 500Hz 1 s 1 s 2 200 ms IC c 0.9 0.8 c 0 = 0.95 5 ms s 1 s 2 Δ L [db] ILD 5 0 5 40 40 ITD τ [ms] 1 0 1 0 0.1 0.2 0.3 0.4 TIME [s]
Source Localization in Complex Listening Scenarios Source localization in reverberant environment: (2 male speech sources, 500Hz: RT = 2s, 2kHz: RT = 1.4s) 5 p(δl) Without cue (B) selection ΔL [db] ILD 0 s 1 s 2 5 p(τ) 30 30 1 0 1 τ [ms] ITD p(δl c>c 0 ) With cue (D) selection 5 ΔL [db] ILD 0 5 p(τ c>c 0 ) 1 0 1 τ [ms] ITD
Parametric Coding of Spatial Audio Contents: - Audio Coding and Thesis Motivation - Background - Binaural Cue Coding (BCC) - Variations of BCC - Source Localization in Complex Listening Scenarios - Conclusions
Parametric Coding of Spatial Audio Conclusions Binaural Cue Coding (BCC): - low bitrate coding of stereo and multi-channel audio signals - low bitrate transmission of independent sources - bridging between different audio formats Proposed source localization model: - speculates about role of IC for "cue selection" - attempts at explaining source localization in complex listening
Parametric Coding of Spatial Audio Future work: Multi-channel BCC parameters: Different more efficient and flexible parametrization. BCC applied to binaural recordings: Further investigate. Psychoacoustic experiments with BCC stimuli. Source localization model: - Adaptation of cue selection threshold - More simulations, psychoacoustic experiments - Can model be related to other attributes of auditory spatial image?
Parametric Coding of Spatial Audio Acknowledgements: Martin Vetterli, Peter Kroon, and all colleagues I collaborated with. Thanks to all of you for coming to this defense!
Audio Coding Hybrid audio coding: Intensity stereo coding (ISC): Frequency Frequency Frequency Azimuth parameters Frequency Spectral bandwidth replication (SBR): Frequency Spectral parameters Frequency
Binaural Cue Coding (BCC) Downmix with equalization x 1 x 2 FB FB x 1 x 2! x e s IFB s x C FB x C Numerical example: x 1 x 1 x 2 x 2 x + x 1 +x 2 2 1 0.5 0 1 0.5 0 1 0.5 0 0 5 10 15 20 TIME [ms] Time [ms] X 1 [db] X 1 X 2 X 2 [db] X 1 [db] +X 2 10 5 0 5 10 10 5 0 5 10 10 5 0 5 10 0 0.5 1 1.5 2 FREQUENCY [khz] Frequency [Hz]
Binaural Cue Coding (BCC) Alternative ICTD/ICLD/ICC synthesis s FB s d 1 a 1 b 1! x 1 IFB x 1 LR LR s 1 s 2 d 2 a 2 b 2! x 2 IFB x 2 a i, b i : scale factors LR: late reverberation filter Coherent/incoherent channel power:
Binaural Cue Coding (BCC) Subjective evaluation: 5-channel audio quality GRADING 5 4 3 2 1 INDIVIDUAL ITEMS a b c d e f g h AVERAGE Imperceptible Perceptible but not annoying Slightly annoying Annoying Very annoying Hidden reference test (ITU-R BS.1116), loudspeaker listening, 9 subjects
Binaural Cue Coding (BCC) Subjective evaluation: ICLD time resolution stable Auditory spatial image stability image stability grading instable 5 grading 4 3 2 1 speech singing percussions applause 32 2048 ms 16 1024 ms 8512 ms 4256 ms window size Overall overall quality speech singing percussions applause 32048 ms 16 1024 ms 8512 ms 4256 ms window size imperceptible perceptible but not annoying slightly annoying annoying very annoying
Variations of BCC C-to-E BCC x 1 x 2 x C ENCODER DOWNMIX C-to-E y 1 y E DECODER ICTD, ICLD, ICC SYNTHESIS x 1 x 2 x C ICTD, ICLD, ICC ESTIMATION SIDE INFORMATION - Potentially better quality than C-to-1 BCC - E-channel backwards compatible coding of C-channel audio
Variations of BCC 6-to-5 BCC: 5.1 backwards compatible coding of 6.1 Downmix: 5-to-6 Synthesis: y 1 y 2 Upmixing delay delay x 1 x 2 y 3 delay x 3 y 4 FB y 4. + s 4 s 6 a 4 a 6 x 4 x 6 IFB IFB x 4 x 6 Dolby Surround EX loudspeaker positioning y 5 FB y 5. s 5 a 5 x 5 IFB x 5
Variations of BCC 7-to-5 BCC: 5.1 backwards compatible coding of 7.1 Downmix: 5-to-7 Synthesis: y 1 y 2 y 3 y 4 FB y 1 Upmixing. s 4 s 6 delay delay delay a 4 d 4 a 6 d 6 A x 4 x 6 IFB IFB x 1 x 2 x 3 x 4 x 6 Lexicon Logic 7 loudspeaker positioning y 5 FB y 2. s 5 s 7 d 5 d 7 a 5 a 7 A x 5 x 7 IFB IFB x 5 x 7
Variations of BCC Hybrid BCC Frequency Frequency Frequency f 0 ICTD, ICLD ICC Frequency 100 QUALITY Hybrid BCC Conventional coder BCC BIT RATE Differences to intensity stereo coding: - consider more cues (not only intensities) - use separate filterbank (better performance)
Variations of BCC Hybrid BCC 100 anchor 1 anchor 2 f 0 =0kHz f 0 =1kHz f 0 =2kHz f 0 =6kHz f 0 =16kHz AVERAGE 80 GRADING 60 40 20 0 x 1 x 2 FB FB BCC ENCODER 59 kb/s 68 kb/s y 1 IFB y 2 IFB 74 kb/s FB FB 84 kb/s BCC DECODER 101 kb/s IFB IFB x 1 x 2 x C FB IFB y C FB IFB x C SIDE INFORMATION
Source Localization in Complex Listening Scenarios Three and five concurrent male speech sources: (-30, 0, 30, -80, 80, free-field) Without cue selection With cue selection ΔL [db] ILD ΔL [db] ILD p(δl) (A) 5 0 5 p(τ) 1 0 1 τ [ms] ITD p(δl c 12 >c 0 ) (C) 5 0 ΔL [db] ILD ILD ΔL [db] p(δl) (B) 5 0 5 p(τ) 1 0 1 τ [ms] ITD p(δl c 12 >c 0 ) (D) 5 0 5 5 p(τ c 12 >c 0 ) 1 0 1 τ [ms] ITD p(τ c 12 >c 0 ) 1 0 1 τ [ms] ITD
Source Localization in Complex Listening Scenarios Cue selection: Three phases of the precedence effect, 500Hz critical band WITHOUT Without CUE cue SELECTION selection 200 ms ITD τ [ms] ILD Δ L [db] ITD τ [ms] 1 0 1 5 0 5 1 0 0 0.2 0.4 0.6 0.8 1 ICI [ms] With cue selection WITH CUE SELECTION 5 10 15 20 ICI [ms] lead/lag delay ILD Δ L [db] p 0, c 0 1 5 0 5 1 0.5 p 0 c 0 0 0 0.2 0.4 0.6 0.8 1 ICI [ms] lead/lag delay [ms] 5 10 15 20 ICI [ms]
Source Localization in Complex Listening Scenarios Amplitude panning, free-field, 500Hz critical band Reference REFERENCE BCC BCC (no without ICC) 1 IC c 0.9 c 0 = 0.995 c 0 = 0.995 c 0 = 0.995 0.8 0.5 ILD τ [ms] 0 0.5 5 Δ L [db] ITD Standard stereo setup Two male speech sources +-8dB amplitude panning. 0 5 0 1 2 3 4 TIME [s] 0 1 2 3 4 TIME [s] 0 1 2 3 4 TIME [s]