Principles of Audio Coding

Rosalind Caldwell

1 Principles of Audio Coding

2 Topics today: Introduction, Vocoders, Psychoacoustics, Equal-Loudness Curve, Frequency Masking, Temporal Masking (CSIT 410)

3 Introduction: Speech compression algorithms focus on exploiting temporal redundancy: PCM, DPCM, ADPCM. Variants of these algorithms take the properties of speech into consideration. Figure: (a) linear PCM at 16 bits per sample at 8 kHz; (b) speech restored from G.721 compressed audio at 4 bits per sample; (c) difference between (a) and (b).
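The two formats compared in the figure differ only in bits per sample, so the saving is simple arithmetic. A minimal sketch; the helper name `bitrate_kbps` is made up for illustration:

```python
def bitrate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Raw bitrate in kbps of uniformly sampled audio."""
    return sample_rate_hz * bits_per_sample * channels / 1000


pcm = bitrate_kbps(8000, 16)    # linear PCM, 16 bits per sample at 8 kHz
adpcm = bitrate_kbps(8000, 4)   # G.721 ADPCM, 4 bits per sample at 8 kHz
print(pcm, adpcm, pcm / adpcm)  # 128.0 32.0 4.0
```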

4 Introduction [2]: G.726 ADPCM (supersedes G.721 and G.723) defines a multiplier constant $\alpha$ that changes for every difference value $e_n$, depending on the current scale of the signals. The scaled difference signal $g_n$ is defined as $e_n = s_n - \hat{s}_n$, $g_n = e_n / \alpha$, where $\hat{s}_n$ is the predicted signal value.

5 Introduction [3]: $g_n$ is sent for quantization. The quantizer is backward adaptive: it notices whether too many values are being quantized to levels far from 0, or too many fall close to 0 most of the time, and changes the size of the quantizer steps accordingly.
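The backward-adaptive idea can be sketched in a few lines. This is a toy illustration, not the G.726 adaptation logic: the number of levels and the `grow` and `shrink` factors are assumptions chosen for readability. The key point is that the step size is updated only from codes already emitted, so a decoder can track it without side information.

```python
def _next_step(step, code, levels, grow, shrink):
    """Step-size update rule shared by encoder and decoder (assumed factors)."""
    if abs(code) >= levels - 1:   # codes landing far from 0: widen the steps
        return step * grow
    if abs(code) <= 1:            # codes hugging 0: narrow the steps
        return step * shrink
    return step


def adaptive_quantize(values, step=1.0, levels=4, grow=1.5, shrink=0.8):
    """Return (codes, steps): integer codes and the step used for each one."""
    codes, steps = [], []
    for v in values:
        code = max(-levels, min(levels, round(v / step)))
        codes.append(code)
        steps.append(step)
        step = _next_step(step, code, levels, grow, shrink)
    return codes, steps


def adaptive_dequantize(codes, step=1.0, levels=4, grow=1.5, shrink=0.8):
    """Rebuild values from codes alone, re-deriving the same step sequence."""
    out = []
    for code in codes:
        out.append(code * step)
        step = _next_step(step, code, levels, grow, shrink)
    return out
```

Feeding a run of large values makes the step grow; a run of near-zero values makes it shrink, which is the behavior the slide describes.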

6 VOCODERS: Voice coders are concerned with modeling speech so that its salient features are captured in as few bits as possible. They model the speech waveform either in the time domain (Linear Predictive Coding) or in the frequency domain (channel vocoders and formant vocoders).

7 VOCODERS Phase Insensitivity: Phase is a shift in the time argument. Perceptually, the sound waves $\cos(\omega t) + \cos(2\omega t + \pi/2)$ and $\cos(\omega t) + \cos(2\omega t)$ sound similar, so the energy spectrum is what matters, not the shape of the waveform.

8 VOCODERS Phase Insensitivity [2]: Figure: the solid line shows the phase-shifted superposition of two cosine waves; the dashed line shows the unshifted superposition.
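The phase-insensitivity claim can be checked numerically: the two waveforms differ sample by sample, yet their magnitude spectra coincide. A small sketch using a direct DFT from the standard library; the window length and base frequency are arbitrary assumptions.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes of the first len(x)//2 DFT bins (direct O(n^2) sum)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


N = 64                       # window length (assumed)
w = 2 * math.pi * 4 / N      # base tone: 4 cycles per window (assumed)
shifted = [math.cos(w * t) + math.cos(2 * w * t + math.pi / 2) for t in range(N)]
unshifted = [math.cos(w * t) + math.cos(2 * w * t) for t in range(N)]

max_waveform_gap = max(abs(a - b) for a, b in zip(shifted, unshifted))
max_spectrum_gap = max(
    abs(a - b) for a, b in zip(dft_magnitudes(shifted), dft_magnitudes(unshifted))
)
print(max_waveform_gap, max_spectrum_gap)
```

The waveform gap is large while the spectrum gap is at the level of floating-point rounding.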

9 VOCODERS Channel Vocoder: Subband filtering applies a bank of band-pass filters to achieve subband coding. Only the energy in each band is used, so the waveform is rectified to its absolute value. ITU G.722, for instance, filters analog signals into two bands: (1) 50 Hz to 3.5 kHz at 48 kbps; (2) 3.5 kHz to 7 kHz at 16 kbps. Vocoders can operate at low bitrates, 1-2 kbps. Figure: waveform for the word "audio".

10 VOCODERS Channel Vocoder [2]: Besides the absolute value, it also analyzes the pitch and excitation of the speech. Excitation concerns whether a sound is voiced or unvoiced: an unvoiced signal looks like noise (s, f); a voiced signal is fairly periodic (a, e, o).

11 VOCODERS Channel Vocoder [3]: Uses a vocal-tract model to generate a vector of excitation parameters that describe the sound. Guesses whether a sound is voiced or unvoiced. If the sound is voiced, identifies its period. The coder runs at 2400 bps.

12 VOCODERS Channel Vocoder [4]: Voiced sounds: periodic wave generator. Unvoiced sounds: pseudo-noise generator plus an estimate of energy given by a band-pass filter. Achieves intelligible synthetic voice at 2400 bps.
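The two excitation sources on the decoder side can be sketched as follows. This is only an illustration of the voiced/unvoiced switch: the function name, the impulse-train shape, the default pitch period, and the `gain` parameter (standing in for the band energy estimate) are all assumptions.

```python
import random


def excitation(voiced, n_samples, pitch_period=80, gain=1.0, seed=0):
    """Return an excitation signal for one synthesis segment.

    Voiced: an impulse every `pitch_period` samples (periodic generator).
    Unvoiced: uniform pseudo-noise scaled by `gain`.
    """
    if voiced:
        return [gain if t % pitch_period == 0 else 0.0 for t in range(n_samples)]
    rng = random.Random(seed)
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


vowel = excitation(True, 240)              # e.g. "a": three pitch pulses
fricative = excitation(False, 240, gain=0.5)  # e.g. "s": noise burst
```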

13 VOCODERS Channel Vocoder [5]: Figure: channel vocoder.

14 VOCODERS Formant Vocoder: Not all frequencies present in speech are equally represented: certain frequency components are strong while others are weak. The important frequency peaks are called formants. A formant vocoder works by encoding only the most important frequencies, and can produce intelligible speech at 1000 bps. Figure: formants of two signals.

15 VOCODERS Linear Predictive Coding: Extracts the features of the signal directly from the waveform, without converting to the frequency domain. Transmits a set of parameters modeling the shape and excitation of the vocal tract, not actual signals or differences. Bitrates using LPC are small because we send instructions rather than the sound itself (somewhat similar to MIDI).

16 Psychoacoustics Introduction: The range of human hearing is 20 Hz to 20 kHz; the range of the human voice is from 500 Hz to 4 kHz. Temporal masking: ever attended a musical performance and found that for some time afterward you could hear nothing? Frequency masking: have you noticed the band's singing drowned out by the lead guitar?

17 Psychoacoustics Introduction [2]: Any coding technique that takes advantage of such a psychoacoustic model of hearing is referred to as perceptual coding.

18 Psychoacoustics Equal-Loudness Relations: The ear does not hear low and high frequencies as well as those in the middle. Fletcher-Munson curves are equal-loudness curves: perceived loudness (in phons) plotted as sound level (dB) versus frequency (Hz).

19 Psychoacoustics Equal-Loudness Relations [2]: Figure: Fletcher-Munson equal-loudness response curves.

20 Psychoacoustics Equal-Loudness Relations [3]: At 4 kHz, 2 dB gives the perception of 10 dB. At 10 kHz, 20 dB gives the perception of 10 dB. At 0.1 kHz, 30 dB gives the perception of 10 dB.

21 Psychoacoustics Equal-Loudness Relations [4]: Observe the curves in the 2.5 kHz to 4 kHz range: why is the ear so sensitive there? Reason: the ear canal amplifies frequencies from 2.5 kHz to 4 kHz.

22 Psychoacoustics Frequency Masking: Frequency masking answers two questions: how does one tone interfere with another, and at what level does one frequency drown out another? Masking curves answer these questions.

23 Psychoacoustics Frequency Masking [2]: Scenarios: A lower tone can effectively mask higher tones. A higher tone does not mask a lower tone as effectively as a lower tone masks a higher one. The greater the power in the masking tone, the wider its influence, that is, the broader the range of frequencies it can mask. If two tones are widely separated in frequency, little masking occurs.

24 Psychoacoustics Frequency Masking [3]: Threshold of hearing: 1. Generate one particular frequency (say 1 kHz). 2. Reduce its volume to 0 in a quiet room. 3. Turn it up until the sound is barely audible. Generate data points this way for all audible frequencies and plot them.

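25 Psychoacoustics Frequency Masking [4]: Threshold of hearing: a sound can be heard only if it is above the threshold level at its frequency. The formula that approximates the threshold curve is $\mathrm{Threshold}(f) = 3.64\,(f/1000)^{-0.8} - 6.5\,e^{-0.6\,(f/1000 - 3.3)^2} + 10^{-3}\,(f/1000)^4$, with $f$ in Hz. The threshold units are dB.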
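A commonly quoted closed-form approximation of the threshold-in-quiet curve is 3.64 (f/1000)^-0.8 - 6.5 exp(-0.6 (f/1000 - 3.3)^2) + 10^-3 (f/1000)^4 dB, with f in Hz. The sketch below evaluates it; treating this as the formula the slide refers to is an assumption, and the function name is made up.

```python
import math


def threshold_db(f_hz):
    """Approximate threshold of hearing in dB at frequency f_hz (Hz)."""
    k = f_hz / 1000.0
    return (
        3.64 * k ** -0.8
        - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
        + 1e-3 * k ** 4
    )


for f in (100, 1000, 3300, 10000):
    print(f, round(threshold_db(f), 2))
```

The curve is high at low frequencies, dips below 0 dB around 3 to 4 kHz where the ear is most sensitive, and rises again toward the top of the audible range.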

26 Psychoacoustics Frequency Masking [5]: Frequency masking curves are generated by playing a pure tone (say at 1 kHz) at a loud volume and determining how this tone affects our ability to hear tones at nearby frequencies: play a 1 kHz masking tone at 60 dB, then raise the level of a nearby tone, say 1.1 kHz, until it is just audible.

27 Psychoacoustics Frequency Masking [6]: The higher the frequency of the masking tone, the broader the range of its influence. Figure: effect of masking tones.

28 Psychoacoustics Frequency Masking [7]: Masking by loudness. Figure: effect of loudness of tones.

29 Psychoacoustics Critical Bands: Critical bandwidth represents the ear's resolving power for simultaneous tones. The human hearing range is divided into critical bands. The auditory frequency-analysis mechanism cannot resolve inputs whose frequency difference is smaller than the critical bandwidth. Result: reduced audibility of a sound signal in the presence of a second signal of higher intensity within the same critical band.

30 Psychoacoustics Critical Bands [2]: At lower frequencies, the critical bandwidth is approximately 100 Hz. For frequencies above 500 Hz, the critical bandwidth increases approximately linearly with frequency. The ear is not very discriminating within a critical band, because of masking.

31 Psychoacoustics Critical Bands [3]: Figure: critical bands and their bandwidths.

32 Psychoacoustics Bark Unit: The higher the masking-tone frequency, the broader the range of frequencies masked. The Bark is an alternative frequency unit in which the masking curves all have the same width. The unit is named after Heinrich Barkhausen. One Bark corresponds to the width of one critical band, for any masking frequency.

33 Psychoacoustics Bark Unit [2]: Figure: effects of masking tones expressed in Bark units. The conversion between frequency $f$ (in Hz) and the critical band number $b$ (in Bark) is $b = f/100$ for $f < 500$, and $b = 9 + 4\log_2(f/1000)$ for $f \ge 500$.
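One widely used piecewise conversion is b = f/100 below 500 Hz and b = 9 + 4 log2(f/1000) from 500 Hz upward, which matches the earlier statement that critical bands are about 100 Hz wide at low frequencies. A small sketch, assuming that this is the conversion intended; the function name is made up:

```python
import math


def hz_to_bark(f_hz):
    """Critical band number (Bark) for a frequency in Hz, piecewise form."""
    if f_hz < 500:
        return f_hz / 100.0          # constant 100 Hz bands at low frequency
    return 9.0 + 4.0 * math.log2(f_hz / 1000.0)   # logarithmic above 500 Hz


for f in (100, 500, 1000, 2000, 4000):
    print(f, hz_to_bark(f))
```

The two branches meet at 500 Hz (5 Bark), and above that point each doubling of frequency adds 4 Bark.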

34 Psychoacoustics Temporal Masking: It takes quite a while for our hearing to return to normal after a musical performance. Any loud tone causes the hearing receptors in the inner ear to become saturated, and they require time to recover. Human eyes show a similar effect.

35 Psychoacoustics Temporal Masking [2]: Masking experiment: 1. Play a masking tone, say at 1 kHz, at a volume level of 60 dB. 2. Play another (test) tone at 1.1 kHz, at 40 dB. This may not be heard in the presence of the masking tone. 3. Turn off the masking tone. It will take a while before the test tone can be heard. 4. Now turn off the test tone shortly after the masking tone is turned off. Adjust this delay so that the test tone is turned off just when it can first be distinguished. It may take 500 ms to discern the test tone after a masking tone at 60 dB is turned off.

36 Psychoacoustics Temporal Masking [3]: The louder the test tone, the shorter the delay before it can be heard after the masking signal is removed.

37 Psychoacoustics Temporal Masking [4]: Figure: solid line, masking tone played for 200 ms; dashed line, masking tone played for 100 ms. The saturation phenomenon also depends on how long the masking signal is applied.

38 MPEG Introduction: First, a filter bank is applied to the input to break it into frequency components. A psychoacoustic model is applied to the data, and this model is used in a bit-allocation block. The number of bits allocated is then used to quantize the information from the filter bank.

39 MPEG Layers: The audio part of the MPEG standard defines three downward-compatible layers, each able to understand the lower layers. Higher layers have more complexity in the psychoacoustic model and better compression, at the cost of more delay. Layer 1: DAT (good quality at a high bitrate). Layer 2: DAB. Layer 3 (MP3): audio transmission over ISDN.

40 MPEG Layers [2]: Each layer uses a different frequency transform and psychoacoustic model. Encoders are more complex, but decoders are simpler (MP3 players, for instance). Quality in terms of listening-test scores at 64 kbps, on a scale of 5: Layer 2 scores 2.1 to 2.6; Layer 3 scores 3.6 to 3.8.

41 MPEG Audio Strategy: Compression is called for. It relies on quantization but also uses the idea of critical bands. MPEG-1 aims at 256 kbps for audio. The encoder employs a bank of filters that analyze the frequency components of the audio signal. Frequency masking is brought in to determine the just-noticeable noise level. The encoder balances the masking behavior against the available number of bits by discarding inaudible frequencies.

42 MPEG Audio Strategy [2]: The frequency analysis filters have uniform width across all frequencies: 32 overlapping subbands. For each frequency band, the sound level above the masking level dictates how many bits must be assigned to code the signal values. Quantization noise is kept below the masking level and so cannot be heard.
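How far the signal sits above the masking level translates into a bit count through the usual rule of thumb that each extra bit of uniform quantization lowers the noise by about 6.02 dB. The sketch below is an assumption-laden illustration of that rule, not the allocation procedure of the MPEG standard; the function name and the 15-bit cap are hypothetical parameters.

```python
import math

DB_PER_BIT = 6.02  # approximate SNR gain per bit of a uniform quantizer


def bits_for_smr(smr_db, max_bits=15):
    """Fewest bits that push quantization noise below the mask (rule of thumb).

    smr_db is how far the signal level is above the masking level, in dB.
    A band at or below the mask needs no bits at all.
    """
    if smr_db <= 0:
        return 0
    return min(max_bits, math.ceil(smr_db / DB_PER_BIT))


print([bits_for_smr(s) for s in (-3.0, 12.0, 13.0, 200.0)])  # [0, 2, 3, 15]
```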

43 MPEG Audio Strategy [3]: Layer 1 uses only frequency masking; bitrates range from 32 kbps (mono) to 448 kbps (stereo). Layer 2 also uses temporal masking, accumulating more blocks of samples and comparing the current block with the neighboring blocks; bitrates range from 32 kbps (mono) to 384 kbps (stereo). Layer 3 is directed toward lower-bitrate applications and uses more sophisticated subband analysis, nonuniform quantization, and entropy coding; bitrates range from 32 to 320 kbps.

44 Audio Compression Algorithm: Figure: MPEG Audio encoder and decoder.

45 Audio Compression Algorithm [2]: The encoder divides the input into 32 frequency subbands via a filter bank, which takes 32 PCM samples (sampled in time) as input and produces 32 frequency coefficients as output. If the sampling rate is fs = 48 ksps, the maximum frequency mapped is fs/2 (by the Nyquist theorem).
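With 32 equal-width subbands covering 0 to fs/2, the width of each band follows directly. A minimal sketch; the helper name is made up, and real filter-bank bands overlap at their edges rather than ending sharply:

```python
def subband_edges(fs_hz, n_subbands=32):
    """Nominal (low, high) edges in Hz of equal-width subbands up to fs/2."""
    width = (fs_hz / 2) / n_subbands
    return [(i * width, (i + 1) * width) for i in range(n_subbands)]


edges = subband_edges(48000)
print(edges[0], edges[-1])   # (0.0, 750.0) (23250.0, 24000.0)
```

At 48 ksps each subband is nominally 750 Hz wide, far wider than a 100 Hz critical band at low frequencies.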

46 Audio Compression Algorithm [3]: Layer 1: The sets of 32 PCM values are assembled into 12 groups of 32 (segments), so there is a delay to accumulate 384 samples. Quantization is decided per segment. Consider the 32 x 12 segment as a 32 x 12 matrix.
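The 32 x 12 arrangement and the buffering delay it implies can be sketched as follows. The helper names are hypothetical, and the 48 kHz rate is taken from the earlier slide as an example.

```python
def frame_delay_ms(fs_hz, groups=12, subbands=32):
    """Time needed to accumulate one segment of groups * subbands samples."""
    return 1000.0 * groups * subbands / fs_hz


def to_subband_rows(groups_of_32):
    """Transpose 12 groups of 32 values into 32 rows of 12 (one row per subband)."""
    return [list(row) for row in zip(*groups_of_32)]


# Dummy filter-bank output: group g, subband s holds the value 100 * g + s.
groups = [[100 * g + s for s in range(32)] for g in range(12)]
rows = to_subband_rows(groups)
print(frame_delay_ms(48000), len(rows), len(rows[0]))   # 8.0 32 12
```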

47 Audio Compression Algorithm [4]: The quantization is set for each of the 32 subbands. The maximum amplitude in a row of 12 samples is taken as the scaling factor of that subband and subsequently dictates the bit allocation. A decision is made as to whether the signal is noise or tone. This decision and the scaling factor are used to calculate the masking threshold for each band, which is then compared to the threshold of hearing. The output of the frequency masking model is the SMR (Signal-to-Mask Ratio): the ratio of short-term signal power to the minimum masking threshold for the subband. The SMR directs the amplitude resolution and influences the bit allocation.

48 Audio Compression Algorithm [5]: More bits are used in regions where hearing is more sensitive. The scaling factor is quantized using 6 bits. The 12 values in each subband are quantized. The bit allocation for each subband is also transmitted. The maximum quantizer resolution is 15 bits.
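Per-subband scaling and quantization can be sketched like this. It is a simplified illustration: the real coder picks the scale factor from a fixed table indexed by the 6-bit code, whereas here the raw maximum is used directly, and the uniform quantizer and function names are assumptions.

```python
def scale_factors(rows):
    """Scaling factor per subband: the maximum amplitude among its 12 samples."""
    return [max(abs(v) for v in row) for row in rows]


def quantize_row(row, scale, bits):
    """Uniformly quantize one subband row after normalizing by its scale factor."""
    if bits == 0 or scale == 0:
        return [0] * len(row)          # band not coded
    levels = 2 ** (bits - 1) - 1       # symmetric range of integer codes
    return [round(v / scale * levels) for v in row]


row = [0.5, -0.5, 0.0, 0.25]
scale = scale_factors([row])[0]
print(scale, quantize_row(row, scale, bits=3))
```

Normalizing by the scale factor lets the allocated bits cover exactly the range the subband occupies in this segment.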

49 Audio Compression Algorithm [6]: Figure: MPEG Audio frame size.

50 Audio Compression Algorithm [7]: Layer 2: Reduced bitrate, improved quality, and increased complexity. Three groups of 12 samples are encoded in each frame, and temporal masking is brought into play. If the scaling factor is similar for each of the three groups, only one needs to be sent. Bit allocation is applied to a window length of 36 samples (before, current, and next) instead of 12 as in Layer 1. Increased quantizer resolution: 16 bits.

51 Audio Compression Algorithm [8]: Layer 3: Takes stereo redundancy into account. Uses a refinement of the Fourier transform, the MDCT, which addresses problems the DCT had at the boundaries of the window used. The window size can optionally be reduced to 12 samples, or a mixture of the two sizes can be used (18 for lower frequencies). Better compression ratio. Table: MP3 compression performance.
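A direct, unoptimized MDCT can be written in a few lines. This sketch uses one common textbook form of the transform; windowing, 50% block overlap, and the inverse transform that real coders rely on are omitted, and the function name is made up.

```python
import math


def mdct(block):
    """Map a block of n samples (n even) to n/2 MDCT coefficients."""
    n = len(block)
    shift = (n / 2 + 1) / 2          # phase offset of the MDCT basis
    return [
        2 * sum(
            block[i] * math.cos(2 * math.pi / n * (i + shift) * (u + 0.5))
            for i in range(n)
        )
        for u in range(n // 2)
    ]


long_block = mdct([math.sin(0.3 * i) for i in range(36)])
short_block = mdct([math.sin(0.3 * i) for i in range(12)])
print(len(long_block), len(short_block))   # 18 6
```

A 36-sample window yields 18 coefficients and a 12-sample window yields 6, which is consistent with the figure of 18 quoted for the lower frequencies.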

52 MPEG-2 AAC (Advanced Audio Coding): Standard for DVDs; adopted by XM radio. Capable of delivering high-quality sound over 5 channels, so it can be played from 5 directions, at 320 kbps. A 5.1 channel system adds a low-frequency enhancement channel (woofer = LFE channel). Also capable of delivering good-quality stereo sound at 128 kbps. Supports 3 profiles: Main, Low Complexity (LC), and Scalable Sampling Rate (SSR).

53 MPEG-4, 7, 21: MPEG-4 integrates several audio coders, perceptual and structured: speech compression, perceptually based coders, text-to-speech, MIDI. MPEG-7 promotes the search of audio objects, and coding is based on audio objects; it is not based on a complete model; ASR is supported. MPEG-21 is an ongoing effort addressing interoperability.

54 Reference: Chapter 13, 14
