Subjective Audiovisual Quality in Mobile Environment


Vienna University of Technology
Faculty of Electrical Engineering and Information Technology
Institute of Communications and Radio-Frequency Engineering

Master of Science Thesis

Subjective Audiovisual Quality in Mobile Environment

by Bruno Gardlo

Supervisor: Dr. Michal RIES
Professor: Prof. Markus RUPP

Vienna, 2009


Abstract

In today's world, mobile devices have become an important part of people's lives. We use mobile phones, MP3 players, handheld devices, portable DVD players, laptops and cameras every day. In recent years, the development of mobile phones has made a great step towards multifunctional devices. Nowadays we use our mobile phones not just for telephony and messaging, but also for many other purposes, including watching, listening to and sharing multimedia content. There is great potential in television streaming (DVB-H) in the telecommunication market, and since mobile phones are turning into small pocket PCs, more and more customers demand video streaming services and video content accessible on their mobile devices. Another changing aspect of today's telecommunication market is the increasing number of customers using video call services instead of simple voice calls. All these multimedia services are critical for content and service providers in terms of perceived end-user quality. Users want high-quality multimedia content, while providers want to save as much bandwidth as possible. A trade-off between perceived quality and bandwidth has to be found, and for this purpose audiovisual quality tests are very important. It is inconvenient and expensive to measure end-user quality by subjective tests. Moreover, whenever the transmission scenario changes, the perceived quality changes as well, and the subjective test results no longer apply. Thus, there is great potential in exploring and developing new objective measurement methods for evaluating perceived audiovisual quality. At the time of writing of this thesis, several objective metrics exist for measuring either audio or video perceived quality. Most of them are defined as reference metrics, so a reference signal is needed for evaluation of the perceived quality. Another problem is that they deal with audio and video separately. Research in this area shows, however, that the audio and video components are closely connected with each other, which is known as the mutual compensatory property of audiovisual content. The goal of this work is to propose a non-reference objective audiovisual quality metric suitable for use in the mobile environment. Since the most used audio

codecs in the mobile environment are the Advanced Audio Codec (AAC) and the Adaptive Multi-Rate codec (AMR), we will explore the quality properties of these codecs. The explored video codec will be H.264/AVC, since at this time it is the most advanced video codec. New reference-free approaches to quality estimation are presented, based on the motion characteristics of the video and on a reference-free audio quality metric. Moreover, the proposed metric is compared with the most recent audiovisual metrics.

Contents

Abstract
1 Introduction
  1.1 Motivation
2 Audio quality
  2.1 Introduction
  2.2 Psychoacoustics
    2.2.1 Human Auditory System
    2.2.2 Psychoacoustic Principles
  2.3 Speech and Audio Coding Technologies
    2.3.1 Speech Coding standards
  2.4 Audio Content Estimation
    2.4.1 Audio parameters
    2.4.2 Speech detector
    2.4.3 LLR Test Based on κ and HZCRR_M
    2.4.4 LLR Test Based on Mel-Frequency Cepstrum Coefficients
    2.4.5 Performance evaluation and comparison
  2.5 Audio Quality Estimation Algorithms
    2.5.1 Reference Audio Quality Metrics
    2.5.2 Non-reference Audio Quality Metrics
3 Video Quality
  3.1 Introduction
  3.2 Basic principles of video coding
    3.2.1 Video and Colour sampling
    3.2.2 New features in H.264
  3.3 Video Quality Estimation
    3.3.1 Quality estimation based on content sensitive parameters
4 Audiovisual quality
  4.1 Introduction
  4.2 Audiovisual quality assessment
    4.2.1 Test Methodology
    4.2.2 Encoder Settings
  4.3 Prior Art
  4.4 Feature extraction
    4.4.1 Video feature extraction
    4.4.2 Audio feature extraction
  4.5 Audiovisual quality estimation
  4.6 Performance evaluation
5 Conclusions
Bibliography
List of Symbols and Abbreviations
List of Figures
List of Tables

Chapter 1
Introduction

1.1 Motivation

Massive provisioning of mobile multimedia services and higher end-user quality expectations bring new challenges for service and content providers. The most challenging part is improving the subjective quality of audio and audiovisual services. Thanks to the compression improvements of the video coding standard MPEG-4/AVC and the encoding efficiency of the AMR and AAC audio coding standards, provisioning of audiovisual services is possible at low bit and frame rates while preserving perceptual quality. This is especially suitable for video applications in broadband wireless networks. The Universal Mobile Telecommunications System (UMTS) release 4 (implemented by the first UMTS network elements and terminals) provides a maximum data rate of 1920 kbit/s shared by all users in a cell; release 5 offers up to 14.4 Mbit/s in the downlink (DL) direction with High Speed Downlink Packet Access (HSDPA). The following codecs are supported for UMTS video services. For audio encoding [2]: the AMR speech codec, AAC Low Complexity (AAC-LC) and AAC Long Term Prediction (AAC-LTP). For video encoding [2]: H.263, MPEG-4 and MPEG-4/AVC. The appropriate encoder settings for UMTS video services differ for different contents and streaming application settings (resolution, frame and bit rate) [3]. The end-user quality is influenced by the following aspects: the mutual compensation effect between audio and video, the content, the encoding and network settings and, finally, the transmission conditions. Moreover, the audio and video media not only interact, there is even a synergy of the component media [4]. Therefore, the perceptual mutual compensation effect performs differently in videos with a dominant human voice than in other video contents [5]. Video contents with a dominant human voice are mainly news, interviews and talk shows. Finally,

the audiovisual quality estimation models tuned for video contents with a dominant human voice perform better than universal ones [5]. Therefore, our focus within this work is on the design of speech detection algorithms for the mobile environment. This thesis is organised as follows: Section 2 describes the audio properties of audiovisual content. The main goal of the audio part was to design a speech detector. In recent years, speech detection has been extensively studied [6], [7], [8], [9]. The proposed algorithms for speech detection differ in computational complexity, environment of usage, accuracy and application. Our approach is to design a real-time speech detection algorithm suitable for the mobile environment. Therefore, the design was focused on an accurate, low-complexity method which is robust against audio compression artifacts. After the audio part, Section 3 describes some basics of video processing and gives a short overview of the video quality estimator based on the former work of my supervisor M. Ries [3]. Section 4 describes the results of the subjective audiovisual survey. Finally, the new audiovisual metric for mobile streaming services is introduced.

Chapter 2
Audio quality

2.1 Introduction

As the main topic of this thesis is audiovisual quality, it is important to give at least a brief overview of several audio properties. These sections explain the basics of psychoacoustic perception. Moreover, state-of-the-art audio metrics are explained. Reference and reference-free audio metrics are differentiated and explained in separate sections. Finally, the new audio content estimator developed for the purpose of this thesis is described. The output of this estimator and the output of a non-reference audio quality metric will be further used in the audiovisual quality estimation.

2.2 Psychoacoustics

2.2.1 Human Auditory System

This subsection gives an overview of sound signal processing in the human auditory system and the main psychoacoustic phenomena. Most modern lossy audio codecs and perceptual quality assessment methods are developed based on these psychoacoustic effects. In the following, the psychoacoustic mechanism of each component is explained:

Pinna: The pinna pre-filters the incoming sound with a filter characteristic given by the Head Related Transfer Function (HRTF) [18].

Figure 2.1: Outer Ear.

Ear canal: The ear canal filters the sound further, with a resonance at around 5 kHz.

Cochlea: The cochlea is a fluid-filled coil within the ear and is partially protected by small bones.

Basilar membrane (BM): The basilar membrane semi-partitions the cochlea and acts as a spectrum analyser by spatially decomposing the signal into frequency components. Each point on the basilar membrane resonates at a different frequency (frequency-to-place transformation), and the frequency selectivity is given by the width of the filter at each of these points.

Outer hair cells: The outer hair cells are distributed along the length of the basilar membrane and change its resonant properties by reacting to feedback from the brainstem.

Inner hair cells:

Figure 2.2: Cochlea and basilar membrane.

Figure 2.3: Hair cells.

The inner hair cells transform the basilar membrane motion into neural firing, where stronger motion causes more impulses. The neuronal firing starts when the BM moves upwards; this is the moment of transformation from physical waves to physiological information, transducing the sound wave at each point into a

signal on the auditory nerve. Each cell needs a certain time to recover between firings, so the average response during a steady tone is lower than at its onset. Thus, the inner hair cells act as an automatic gain control. The firing of any individual cell is pseudo-random, modulated by the movement of the BM [20]. In relation to audio signal processing and telecommunication, the human auditory system seems to encode an audio signal, which has a relatively wide bandwidth and a large dynamic range, for transmission along nerves which each offer a much narrower bandwidth and a limited dynamic range. The critical point is that any information lost during the transduction process within the cochlea is not available to the brain; the cochlea is effectively a lossy coder.

2.2.2 Psychoacoustic Principles

Psychoacoustics deals with the relationship between physical sounds and the human brain's interpretation of them. The field of psychoacoustics has made significant progress toward characterising human auditory perception, particularly the time-frequency analysis capabilities of the inner ear. Auditory models for assessing the perceived quality of coded audio signals simulate the functionality of the human ear and its characteristics, and have been validated against predictions of audible and inaudible conditions in a variety of psychoacoustic listening tests. Several psychoacoustic principles simulate the function of the human auditory system and are used to identify the irrelevant information which is not detectable even by well trained listeners ("golden ears"). These psychoacoustic principles are:

- the absolute hearing threshold
- the critical-band frequency analysis
- the simultaneous masking and the spread of masking (along the basilar membrane)
- the temporal masking

The absolute threshold of hearing characterises the amount of energy needed in a pure tone such that it can be detected by a listener in a noiseless environment; it is expressed in terms of Sound Pressure Level (dB SPL). The quiet threshold is well approximated by the non-linear function [19]:

T_q(f) = 3.64 (f/1000)^{-0.8} - 6.5 e^{-0.6 ((f/1000) - 3.3)^2} + 10^{-3} (f/1000)^4  [dB SPL]  (2.1)

The curve is often referenced by audio codec designers by scaling it in such a way that the lowest point, near 4 kHz, corresponds to the smallest possible output signal of their decoder, i.e. close to 0 dB SPL. The absolute hearing threshold curve is illustrated later in Figure 2.5, together with the effects of frequency masking. The inner ear separates the frequencies and concentrates them at certain locations along the basilar membrane (frequency-to-place transformation), so it can be regarded as a complex system of overlapping band-pass filters with asymmetrical, non-linear and level-dependent magnitude responses. The bandwidths of the cochlear band-pass filters are non-uniform and increase with frequency. Where these bands should be centred, and how wide they should be, has been analysed in several psychoacoustic experiments. One psychoacoustic model for the centre frequencies of these band-pass filters is the critical-band rate scale, where frequencies are bundled into 25 critical bands with the unit name Bark. A distance of one critical band is commonly referred to as one Bark, and the following equation is often used to convert from frequency f (in kHz) to the Bark scale [19]:

z(f) = 13 arctan(0.76 f) + 3.5 arctan((f/7.5)^2)  (2.2)

The Bark scale is a nonlinear scale that describes the nonlinear, almost logarithmic processing in the ear. Table 2.1 gives the centre frequencies and bandwidths of the 25 Bark bands.

Table 2.1: The centre frequencies and bandwidths of the 25 Bark bands.

The critical bandwidth in Hz, as a function of the centre frequency f (in Hz), quantifies the cochlear band-pass filter and is conveniently approximated by [19]:

BW_c(f) = 25 + 75 [1 + 1.4 (f/1000)^2]^{0.69}  (2.3)
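As an illustration, the three functions above can be evaluated numerically. The following Python sketch implements Eqs. (2.1)-(2.3); the function names are illustrative and not part of any standard library:

import numpy as np

def absolute_threshold_db_spl(f_hz):
    # Absolute threshold of hearing T_q(f) in dB SPL, Eq. (2.1)
    f = f_hz / 1000.0                  # frequency in kHz
    return 3.64 * f**(-0.8) - 6.5 * np.exp(-0.6 * (f - 3.3)**2) + 1e-3 * f**4

def bark(f_hz):
    # Frequency to critical-band rate in Bark, Eq. (2.2); f in kHz
    f = f_hz / 1000.0
    return 13.0 * np.arctan(0.76 * f) + 3.5 * np.arctan((f / 7.5)**2)

def critical_bandwidth_hz(f_hz):
    # Critical bandwidth around centre frequency f in Hz, Eq. (2.3)
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0)**2)**0.69

f = np.array([100.0, 1000.0, 4000.0])
print(absolute_threshold_db_spl(f))    # threshold is lowest near 4 kHz
print(bark(f))                         # approx. 1, 8.5 and 17 Bark
print(critical_bandwidth_hz(f))        # bandwidth grows with frequency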

Masking is the phenomenon whereby the perception of one sound is obscured by the perception of another; it can be explained in the frequency and time domains. Simultaneous masking (strong and softer tones at the same frequency) and the spread of masking (strong and softer tones at nearby frequencies) refer to frequency-domain phenomena that can be observed whenever two or more stimuli are simultaneously presented to the auditory system. The response of the auditory system is nonlinear, and the perception of a given tone is affected by the presence of other tones. The auditory channels for different tones interfere with each other, which leads to a complex auditory response: frequency masking. This means that a single tone (the masker) is surrounded by a so-called masking threshold curve and masking bandwidth. Every tone within this masking bandwidth whose sound pressure level (SPL) falls below the masking curve will be masked and thus inaudible. The masking bandwidth depends on the frequency of the masking tone and increases with the SPL of the masking tone. The relation between the sound pressure level and the masking threshold curve of a single 440 Hz masker is illustrated in Figure 2.4.

Figure 2.4: Simultaneous masking: relation between the masking threshold curve and the sound pressure level of a 440 Hz masker.

There is also a frequency relation to the width of the masking threshold curve: louder tones with higher frequencies mask more neighbouring frequencies than softer tones with lower frequencies. Thus, ignoring the frequency components in the masking band whose levels fall below the masking curve does not cause any perceptual loss. In Figure 2.5, the masking phenomena of two maskers and the absolute hearing threshold curve are illustrated by a 1 kHz tone with a narrower masking threshold curve and a 4 kHz tone with a wider range under

the masking threshold curve. Both tones have a sound pressure level of 60 dB to demonstrate the frequency dependence of the width of the masking threshold curves. Figure 2.5 shows the superimposed individual masking threshold curves representing the complex auditory response to just two stimuli. To be audible, a third tone of 2 kHz must have a sound pressure level that lies above the superposition of the masking thresholds (labelled SP in Figure 2.5) at 2 kHz. The overall masking threshold curve formed by superposition of the individual ones depends on the frequency distance between the maskers.

Figure 2.5: Absolute hearing threshold and frequency masking: a 2 kHz tone must have a sound pressure level over SP to be audible.
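To make the superposition idea concrete, the following hedged Python sketch checks whether a probe tone is audible above two maskers. The triangular spreading function (27 dB/Bark toward lower and 15 dB/Bark toward higher frequencies, 10 dB below the masker level) is an assumed, simplified shape for illustration only; as described above, real masking curves are masker- and level-dependent:

import numpy as np

def single_masker_threshold(z_probe, z_masker, spl_masker):
    # Illustrative triangular spreading function in the Bark domain
    # (assumed slopes and offset, chosen only for this example)
    dz = z_probe - z_masker
    slope = 27.0 if dz < 0.0 else 15.0
    return spl_masker - 10.0 - slope * abs(dz)

def is_audible(z_probe, spl_probe, maskers):
    # Superpose the individual masking thresholds (intensity addition)
    # and compare the probe level against the overall threshold "SP"
    intensities = [10.0 ** (single_masker_threshold(z_probe, z, spl) / 10.0)
                   for (z, spl) in maskers]
    overall_db = 10.0 * np.log10(sum(intensities))
    return spl_probe > overall_db

# Two 60 dB SPL maskers at 1 kHz (~8.5 Bark) and 4 kHz (~17 Bark):
# check a 2 kHz (~13 Bark) probe tone, as in the Figure 2.5 example.
print(is_audible(13.0, 30.0, [(8.5, 60.0), (17.0, 60.0)]))  # True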

In the time domain, masking appears as temporal masking. Temporal masking describes the masking of a softer (test) tone by the presence of a stronger one (the masking tone). The level of the masked signal depends on the time between the masker and the test tone. The stronger tone masks softer tones with lower levels which appear a short time later (decay time). This temporal masking effect is based on the functionality of the human auditory system: the inner hair cells within the human ear need a recovery time after the strong masking tone until they are able to register the existence of the softer test tone. In the case of audio signals (e.g., the onset of a percussive musical instrument), abrupt signal transients create pre- and post-masking regions in time during which a listener will not perceive signals beneath the audibility thresholds produced by the masker. Pre- or backward masking is based on processing times in the ear and means that signals appearing just before the strong masker are masked [20]. Pre-masking occurs prior to masker onset and lasts only a few milliseconds, while post-masking may persist for more than 100 milliseconds after the masker is removed. Figure 2.6 shows an example of the masking curve, including pre- and post-masking.

Figure 2.6: Temporal pre- and post-masking.

2.3 Speech and Audio Coding Technologies

In speech and audio coding, digitised speech or audio signals are represented with as few bits as possible by removing the redundancies and irrelevancies from the original signal. The perceptual quality of such digitised speech or audio signals is a function of the available bit rate. Speech and audio codecs must take account of the encoding/decoding delay, the sound quality of the decoded signal and the transmission bandwidth. All of these differ between speech and audio: for example, certain signal modifications that are acceptable for speech are audible degradations for music. Most speech coding standards are developed to handle narrowband speech at a sampling frequency of 8 kHz (or wideband speech at a sampling frequency of 16 kHz) based on a model of speech production. For non-speech signals like music or background noise, such a source model does not work, so modern audio codecs employ psychoacoustic principles to model human auditory perception, with the goal of a transparent reproduction of the information that is relevant to human auditory perception.

AMR bit rate | GSM GMSK FR | GSM GMSK HR | GSM 8-PSK HR | WCDMA
4.75 kbps    | Yes         | Yes         | Yes          | Yes
5.15 kbps    | Yes         | Yes         | Yes          | Yes
5.90 kbps    | Yes         | Yes         | Yes          | Yes
6.70 kbps    | Yes         | Yes         | Yes          | Yes
7.40 kbps    | Yes         | Yes         | Yes          | Yes
7.95 kbps    | Yes         | Yes         | Yes          | Yes
10.2 kbps    | Yes         | -           | Yes          | Yes
12.2 kbps    | Yes         | -           | Yes          | Yes

Table 2.2: AMR modes in GSM and WCDMA.

2.3.1 Speech Coding standards

Adaptive Multi-Rate Codec

Adaptive Multi-Rate (AMR) was designed as an improved standard for voice quality in cellular services and greater capacity for the GSM system and UMTS technology. It was standardised for GSM Release 98 and 3GPP Release 99. Its great advantages over previous GSM speech codecs are the variable bit rate and its adaptive error concealment, in which the number of bits for error correction depends on the transmission conditions. The AMR speech codec adapts its error protection level to the local radio channel and traffic conditions and is a mandatory codec for 3G wireless networks. The narrowband AMR voice codec supports eight speech coding modes with bit rates ranging from 4.75 kbps to 12.2 kbps at a sampling rate of 8 kHz [17] (see Table 2.2). The wideband codec AMR-WB was developed as a multi-rate codec consisting of several codec modes, like the AMR-NB codec, and brings speech quality exceeding that of (narrowband) wireline quality to 3G and GSM/GERAN systems. Consequently, the wideband codec is referred to as the AMR Wideband (AMR-WB) codec. As in AMR-NB, the codec mode is chosen based on the operating conditions on the radio channel. Adapting the coding to the channel quality provides high robustness against transmission errors. The codec also includes a source-controlled rate operation mechanism, which allows it to encode speech at a lower average rate by taking speech inactivity into account.

Advanced Audio Codec

Advanced Audio Codec (AAC) was specified and declared an international standard by MPEG in 1997 to increase the quality of audio coding of mono, stereo and multichannel signals. An international cooperation of the Fraunhofer Institute and companies like AT&T, Sony and Dolby developed this efficient

method for audio data compression. The driving force behind AAC was the quest for an efficient coding method for surround signals, like 5-channel signals. MPEG-2 AAC is the continuation of the coding method MPEG Audio Layer-3; the sampling frequencies used lie between 8 kHz and 96 kHz. It is not backward-compatible with the MPEG-1 standard, but its core supports newer coding standards such as MPEG-4. Compared to coding methods such as MPEG-2 Layer-2, it is possible to cut the required bit rate by a factor of two with no loss of subjective quality. Further, the stereo width of difficult-to-encode signals is reduced at bit rates below 60 kbit/s. Like all perceptual coding schemes, MPEG-2 AAC basically makes use of the signal masking properties of the human ear in order to reduce the amount of data. In doing so, the quantisation noise is distributed over the frequency bands in such a way that it is masked by the total signal and remains inaudible.

2.4 Audio Content Estimation

End-user quality is influenced by a number of factors, including mutual compensation effects between audio and video, content, encoding and network settings, as well as transmission conditions. Moreover, audio and video are not only mixed in the multimedia stream, there is even a synergy of the component media [4]. As previous work has shown, mutual compensation effects cause perceptual differences in video with a dominant voice in the audio track rather than in video with other types of audio [5]. Video content with a dominant voice includes news, interviews, talk shows, etc. Finally, audiovisual quality estimation models tuned for video content with a dominant human voice perform better than universal models [5]. Therefore, our focus within this work is on the design of automatic speech detection algorithms for the mobile environment. In recent years, speech detection has been extensively studied [6], [7], [8], [9]. The proposed algorithms for speech detection differ in computational complexity, application environment, and accuracy. Our approach is to design a speech detection algorithm suitable for real-time implementation in the mobile environment. Therefore, our work is focussed on accurate and low-complexity methods which are robust against audio compression artifacts. Our proposed low-complexity algorithm has a first stage based on kurtosis [10] and a second stage based on hypothesis testing using a Log-Likelihood Ratio (LLR). In the second stage, we use the High Zero Crossing Rate Ratio (HZCRR) [11] or the Mel-Frequency Cepstral Coefficients (MFCCs) extracted from the audio signal. The HZCRR has a lower complexity than MFCCs but also lower accuracy. The proposed method shows a good balance between accuracy and computational complexity. Finally, the performance and complexity of these methods are compared. The proposed methods were submitted as a conference paper and accepted for the IWSSIP

2009 conference [1].

2.4.1 Audio parameters

Due to the low-complexity requirement of the algorithm, our investigation initially focused on time-domain methods. Initial inspection of various audio signals shows significantly different characteristics of speech and non-speech signals (see Figures 2.7 and 2.8). The wide dynamic range of the speech signal (compared to non-speech signals) is clearly visible.

Figure 2.7: Example of a speech signal (time-domain).

Both the kurtosis and the HZCRR features have been used in blind speech separation [12] and music information retrieval [11]. The kurtosis of a zero-mean random process x(n) is defined as the dimensionless, scale-invariant quantity^1

κ_x = (1/N ∑_{n=1}^{N} [x(n) - x̄]^4) / (1/N ∑_{n=1}^{N} [x(n) - x̄]^2)^2  (2.4)

where in our case x(n) represents the n-th sample of an audio signal. A higher κ value corresponds to a more peaked distribution of samples, as found in speech signals, whereas a lower value implies a flatter distribution, as found in other types of audio signals (see Figure 2.9). Therefore, kurtosis was selected as a basis for the detection of speech. However, accurate detection of speech in short-time frames is not always possible by kurtosis alone.

^1 The reader is cautioned that some texts define kurtosis as κ_x = (1/N ∑_{n=1}^{N} [x(n) - x̄]^4) / (1/N ∑_{n=1}^{N} [x(n) - x̄]^2)^2 - 3. We shall, however, follow the definition in [10].
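A direct Python transcription of Eq. (2.4) is given below; the synthetic Laplacian and Gaussian samples merely stand in for speech-like and noise-like signals:

import numpy as np

def kurtosis(x):
    # Kurtosis of Eq. (2.4) -- the non-excess form, following [10]
    xc = np.asarray(x, dtype=float)
    xc = xc - xc.mean()
    m2 = np.mean(xc**2)       # second central moment
    m4 = np.mean(xc**4)       # fourth central moment
    return m4 / m2**2

rng = np.random.default_rng(0)
print(kurtosis(rng.laplace(size=32000)))   # ~6: peaked, "speech-like"
print(kurtosis(rng.normal(size=32000)))    # ~3: flatter, "non-speech-like"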

Figure 2.8: Example of a non-speech signal (time-domain).

Figure 2.9: Probability density function of the speech and non-speech audio samples.

The second objective parameter under consideration is the HZCRR, defined as the ratio of the number of frames whose Zero Crossing Rate (ZCR) is greater

than 1.5 times the average ZCR of the audio file [11]:

HZCRR_M = 1/(2N) ∑_{n=0}^{N-1} [sgn(ZCR(n, M) - 1.5 ZCR̄) + 1]  (2.5)

where ZCR(n, M) is the zero crossing rate of the n-th, length-M frame (equation given below), N is the total number of frames, and ZCR̄ is the average ZCR over the audio file. The ZCR is given by

ZCR(n, M) = 1/M ∑_{m=0}^{M-1} 1_{<0}[x(nM + m) x(nM + m + 1)]  (2.6)

where m denotes the sample index within the frame and the indicator function is defined as

1_{<0}(q) = 1 if q < 0, and 0 if q ≥ 0.

In the proposed algorithms, we use a frame length of 10 ms and the framing windows overlap by 50%. The 10 ms frame length^2 contains a sufficient number of audio samples for further statistical processing. Moreover, a longer framing window would increase the computational complexity and the length of the investigated audio sequence necessary for speech detection. Figure 2.10 shows the ZCR curves for both speech and non-speech signals. The ZCR of the non-speech signal has a small amplitude range and low variance. The ZCR of the speech signal, on the other hand, has a wider amplitude range, large variance, and a relatively low and stable baseline with occasional high peaks. However, as can be seen in Figure 2.10, many frames of the speech and non-speech signals have similar ZCRs, and thus accurate detection of speech in short-time frames is also not possible with ZCR, and subsequently HZCRR, alone.

Audio Corpus

The training and evaluation of our speech detector was performed on a large audio corpus. Our corpus consists of 3032 speech and non-speech audio files (see details in Tables 2.3 and 2.4). The speech part of the corpus is in the German language and consists of ten speakers. The non-speech part of the corpus consists mainly of music files of various genres (e.g. rock, pop, hip-hop, live music). All audio files were encoded using typical settings for the UMTS environment. Each audio file was encoded using three codec types at different sampling rates: AAC, AMR-WB at 16 kHz and AMR-NB at 8 kHz. Due to the limitations of mobile radio resources, bit rates were selected in the range 8-32 kbps. Encoded audio files with insufficient audio quality were excluded.

^2 e.g. for a sample rate (SR) of 32 kHz, the framing window contains M = 320 samples
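Eqs. (2.5) and (2.6) translate directly into Python. The sketch below uses the 10 ms frames with 50% overlap described above; the helper names are illustrative:

import numpy as np

def zcr(frame):
    # Eq. (2.6): fraction of adjacent sample pairs with a sign change
    return np.mean(frame[:-1] * frame[1:] < 0.0)

def hzcrr(x, sr):
    # Eq. (2.5): fraction of frames whose ZCR exceeds 1.5x the average;
    # (sgn(ZCR - 1.5*avg) + 1)/2 is 1 above the threshold and 0 below it
    m = int(0.010 * sr)                 # 10 ms frame, e.g. M = 320 at 32 kHz
    hop = m // 2                        # 50% frame overlap
    frames = [x[i:i + m] for i in range(0, len(x) - m + 1, hop)]
    rates = np.array([zcr(f) for f in frames])
    return np.mean(rates > 1.5 * rates.mean())

rng = np.random.default_rng(1)
print(hzcrr(rng.normal(size=32000), sr=32000))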

Figure 2.10: Plot of the ZCR of the speech signal.

Table 2.3: Speech audio corpus.
Codec  | Encoding settings | Number of audio files
AAC    | 16 kHz            | 1817
AMR-NB | 7.9 kHz           | 1856
AMR-WB | 16 kHz            | 1856

Table 2.4: Non-speech audio corpus.
Codec  | Encoding settings | Number of audio files
AAC    | 32 kHz            | 1169
AMR-NB | 7.9 kHz           | 1172
AMR-WB | 16 kHz            | 1176

For the purpose of determining speech and non-speech detection parameters, 2273 audio files without a dominant voice and 3194 audio files with a dominant voice were used in training. These files were selected from all codec and encoding combinations. The rest of the audio corpus was used for testing and performance evaluation. Kurtosis and HZCRR_M measurements on the training files are given in Figures 2.11 and 2.12. It can be seen that kurtosis is a better speech indicator than HZCRR_M; however, HZCRR_M may be used as an additional indicator.

Figure 2.11: Kurtosis values of speech and non-speech signals.

Figure 2.12: HZCRR_M values of speech and non-speech signals.

2.4.2 Speech detector

In order to reduce complexity, we propose a two-stage voice detection algorithm where the first stage is based on a threshold comparison of the kurtosis and the second stage is based on an LLR test (see Figure 2.14). For the second stage, two solutions are proposed. The first, based on HZCRR_M, has significantly lower complexity than the second, based on MFCCs, but also lower accuracy. In the first solution, non-speech audio frames are first detected by a simple decision based on whether the kurtosis is less than a pre-defined threshold, i.e. κ < c_0 where c_0 = 4.96 (see Figure 2.11). The first stage is capable of recognising 62.3% of the non-speech frames from our corpus with 97% accuracy. In the second solution, we set the threshold in the first stage to c_0 = 4 using the Least Absolute Errors optimisation technique. All sequences with κ ≤ 4 are recognised as non-speech sequences. The first stage is capable of recognising 40% of the non-speech sequences from our corpus with 99.7% accuracy. In both solutions, if non-speech content is detected in the first stage, we do not carry out

the second stage, in order to reduce computational complexity. The reason for these thresholds can also be seen in the CDF of the kurtosis for speech and non-speech audio content in Figure 2.13.

Figure 2.13: Cumulative distribution function of κ.

Figure 2.14: Two-stage speech detector.

2.4.3 LLR Test Based on κ and HZCRR_M

In the second stage of the first solution, we derive a more general decision rule based on a hypothesis test (LLR), and we use both the kurtosis and the HZCRR_M of the

frame as elements in a feature vector

X = [κ, HZCRR_M].

For speech signals, we denote the mean vector of the speech feature vectors as μ_s and the covariance matrix as Σ_s; for non-speech feature vectors, we denote the mean vector as μ_m and the covariance matrix as Σ_m. Furthermore, the LLR test is performed on the first 20 frames only, in order to reduce computational complexity. The log-likelihood ratio is calculated as follows:

Λ = ∑_{i=1}^{20} log[ 1/√((2π)^2 |Σ_s|) exp(-(1/2) (X_i - μ_s) Σ_s^{-1} (X_i - μ_s)^T) ]
  - ∑_{i=1}^{20} log[ 1/√((2π)^2 |Σ_m|) exp(-(1/2) (X_i - μ_m) Σ_m^{-1} (X_i - μ_m)^T) ]  (2.7)

If the LLR is greater than the decision threshold, c = 2.2 (see Figure 2.14), we declare a non-speech frame; otherwise we declare a speech frame. Note that the mean vectors and covariance matrices in (2.7) are estimated ahead of time in the training stage.

2.4.4 LLR Test Based on Mel-Frequency Cepstrum Coefficients

In the second stage of the second solution, we consider the use of MFCCs extracted from the frame as the feature vector. MFCCs are widely used in speech and audio processing as a feature vector in a variety of applications. The algorithm in [13] is used for the calculation of the first 14 MFCCs; the feature vector, mean vectors and covariance matrices are of correspondingly higher dimension. The LLR test is again performed on the first 20 frames. The LLR is calculated as

Λ = ∑_{i=1}^{20} log[ 1/√((2π)^{13} |Σ_m|) exp(-(1/2) (X_i - μ_m) Σ_m^{-1} (X_i - μ_m)^T) ]
  - ∑_{i=1}^{20} log[ 1/√((2π)^{13} |Σ_s|) exp(-(1/2) (X_i - μ_s) Σ_s^{-1} (X_i - μ_s)^T) ]  (2.8)

If the LLR is greater than the decision threshold, c = 1.04 (see Figure 2.14), we declare a speech frame; otherwise we declare a non-speech frame.
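Putting the two stages together, a minimal Python sketch of the first solution might look as follows. The Gaussian parameters mu_s, cov_s, mu_m and cov_m are assumed to come from the training stage; all names are illustrative:

import numpy as np

def gaussian_loglik(X, mu, cov):
    # Sum of multivariate Gaussian log-likelihoods over the rows of X
    d = len(mu)
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    return np.sum(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))

def is_speech(kappa, features, mu_s, cov_s, mu_m, cov_m, c0=4.96, c=2.2):
    # Stage 1: cheap kurtosis threshold (Figure 2.14)
    if kappa < c0:
        return False                      # non-speech, stage 2 skipped
    # Stage 2: LLR test of Eq. (2.7) over the first 20 feature vectors
    X = np.asarray(features[:20], dtype=float)
    llr = gaussian_loglik(X, mu_s, cov_s) - gaussian_loglik(X, mu_m, cov_m)
    return llr <= c                       # LLR above c means non-speech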

2.4.5 Performance evaluation and comparison

We evaluate both two-stage algorithms: the LLR test based on kurtosis and HZCRR_M, and the LLR test based on MFCCs. The first algorithm is a relatively low-complexity solution based on the time-domain audio parameters κ and HZCRR_M. The second algorithm provides a more sophisticated solution based on MFCCs. The performance and complexity (measured in terms of computation time) of both methods were evaluated using 1770 speech files and 1181 non-speech files. The audio corpora for training and evaluation were approximately the same size. The overall accuracy of both proposed methods exceeds 92% (see Table 2.5) for speech and non-speech content averaged over all codecs. The accuracy of the second algorithm, however, is higher than that of the first, at an increased computation cost.

Table 2.5: Accuracy results for detection of non-speech and speech from coded audio.
Content    | Codec  | κ & HZCRR_M | MFCC
Non-speech | AAC    | %           | %
Non-speech | AMR-NB | %           | 100 %
Non-speech | AMR-WB | %           | %
Speech     | AAC    | %           | %
Speech     | AMR-NB | %           | 100 %
Speech     | AMR-WB | %           | %
Overall    |        | %           | %

In order to evaluate complexity, the computation time was measured using 6091 audio files (3759 speech files, 2332 non-speech files). The algorithms were executed in the MATLAB environment on a Core 2 Duo processor. In order to obtain accurate results, the test was repeated ten times. Table 2.6 gives the average computation times. The first algorithm is approximately 2x faster than the second algorithm. The efficiency reflects the number of processed files per second (see Table 2.6). The computing time and efficiency results show that both methods allow fast detection of speech frames and are suitable for real-time implementation in mobile devices.

Table 2.6: Time needed for content estimation.
Method      | Time [s] | Efficiency [files/s]
κ & HZCRR_M |          |
MFCC        |          |

Conclusion

The goal of this part of the work was to design a speech detector for the mobile environment. The design was focused on accurate, low-complexity methods which are robust against audio compression artifacts. Both proposed algorithms show very good accuracy (over 92%) and relatively low complexity. However, the method based on kurtosis and HZCRR_M is about 2x faster (lower complexity).

2.5 Audio Quality Estimation Algorithms

In many applications it is important to know what the content quality on the user side looks like. For this purpose, several measurement techniques have been developed, both subjective and objective. Although the

subjective measurements are very precise, and with them the network operator knows exactly how the network setup impacts the perceived quality, they are also very complex, and the end-user perceived quality is often hard to obtain. Above all, subjective tests are expensive. Therefore, in practice, only objective measurement techniques are used. For audio quality estimation there exist many different types of objective estimation algorithms. They can be divided into two main groups:

1. Reference audio quality metrics
2. Non-reference audio quality metrics

Generally, for the evaluation of perceived quality there exists a standardised technique which describes the quality by 5 scores: the Mean Opinion Score (MOS).

Table 2.7: Description of Mean Opinion Scores.
MOS | Quality   | Impairment
5   | Excellent | Imperceptible
4   | Good      | Perceptible but not annoying
3   | Fair      | Slightly annoying
2   | Poor      | Annoying
1   | Bad       | Very annoying

The aim of an objective test is to obtain objective MOS values that correlate as well as possible with the subjective values.

2.5.1 Reference Audio Quality Metrics

The first research works in the audio quality measurement area date back to the early eighties. The first algorithms were based on the work of Zwicker, Schroeder and Brandenburg. The first algorithm used in real measurements was the Noise to Mask Ratio (NMR) in 1989 [14]. In terms of standardisation and adoption in the field, the most advanced objective perceptual quality assessment methods may be found in the areas of audio and speech. This is due to the observation that the psychoacoustic effects known from masking experiments seem to differ significantly when comparing the perception of speech and music signals. For wideband audio signals, the Perceptual Evaluation of Audio Quality (PEAQ) method has been developed and recommended as ITU-R Rec. BS.1387 [15]. PEAQ was originally developed as an automated method to evaluate the perceptual quality of different wideband audio codecs. Several objective perceptual quality assessment methods have been developed for speech signals. The main ones in use today include the Perceptual Analysis Measurement System (PAMS), Perceptual Speech Quality Measurement (PSQM),

and Perceptual Evaluation of Speech Quality (PESQ) [16]. Although they differ significantly in the way they try to model human perception, they also show a very high degree of similarity in their basic structure. Comparing all of these measurement algorithms, they can be broken down into a common block diagram, as shown in Figure 2.15.

Figure 2.15: The structure of the generic perceptual measurement algorithm.

The following subsections describe two types of reference metrics.

Perceptual Evaluation of Speech Quality (PESQ)

In modern mobile networks, just like in VoIP, the measurement algorithm has to deal with much higher distortions than with GSM codecs, and the most eminent factor is that the delay between the reference and the test signal is no longer constant (the delay in each time interval is significantly different from that in the previous one). These varying delays are handled in PESQ by a time alignment algorithm. PESQ was developed for use over a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay. It gives accurate predictions of subjective quality under a very wide range of conditions, including those with background noise, analogue filtering, channel errors, coding distortions, or variable delay. PESQ addresses these effects with transfer function equalisation, time alignment, and an algorithm for averaging distortions over time. Further, PESQ is suitable for many applications in assessing the speech quality of mobile networks or narrowband speech codecs and for end-to-end measurements. While PESQ measures only the effects of one-way speech distortion

and noise on speech quality, the effects of loudness loss, delay, sidetone, echo, and other impairments related to two-way interaction are not reflected in the PESQ score. Figure 2.16 presents the structure of the PESQ model [16], which compares a reference signal with a degraded signal that is the result of the original signal passing through a communication system. The output of PESQ is a prediction of the perceived quality that would be given to the degraded signal by subjects in a subjective listening test.

Figure 2.16: Structure of the PESQ algorithm.

In a first pre-processing step, level alignment brings the reference and degraded signals to a standard listening level. After that, they are filtered with an input filter to model a standard telephone handset. Next, the signals are aligned in time and equalised, and then processed through an auditory transform using a perceptual model. The key to this process is the transformation of both signals to an internal representation that is analogous to the psychophysical representation of audio signals in the human auditory system, taking account of perceptual frequency (Bark) and loudness (Sone). The steps included are: time alignment, level alignment to a calibrated listening level, time-frequency mapping, frequency warping, and compressive loudness scaling. The transformation also involves equalising for linear filtering in the system and for gain variations which have little perceptual significance. From the difference between the transformed signals, the so-called disturbance, two distortion parameters are extracted, aggregated in frequency and time, and mapped to a prediction of the subjective mean opinion score (MOS) to enable a direct comparison between the objective and subjective scores. The mapping of the objective PESQ score onto the subjective MOS score is done using a linear, monotonic function to preserve all the information during the so-called regression process. This process uses a regression

mapping to remove any systematic offset between the objective scores and the subjective MOS, minimising the mean square of the residual errors

e_i = x_i - y_i.  (2.9)

Various measures may be applied to the residual errors to give an alternative view of the closeness of the objective scores to the subjective MOS.

Perceptual Evaluation of Audio Quality (PEAQ)

Perceptual Evaluation of Audio Quality is standardised in ITU-R recommendation BS.1387 [15] and is available as a basic and an advanced model. Figure 2.17 presents the structure of the PEAQ algorithm.

Figure 2.17: Structure of the PEAQ algorithm.

The proposed Method for Objective Measurement of Perceived Audio Quality consists of a peripheral ear model, several intermediate steps (here referred to as pre-processing of excitation patterns), the calculation of (mostly) psychoacoustically based Model Output Variables (MOVs), and a mapping from a set of Model Output Variables to a single value representing the basic audio quality of the Signal Under Test. It includes two peripheral ear models, one based on an FFT and one based on a filter bank. Except for the calculation of the error signal (which is only used with the FFT-based part of the ear model), the general structure is the same for both peripheral ear models.

The inputs for the MOV calculation are:

- The excitation patterns for both the test and the Reference Signal.
- The spectrally adapted excitation patterns for both the test and the Reference Signal.
- The specific loudness patterns for both the test and the Reference Signal.
- The modulation patterns for both the test and the Reference Signal.
- The error signal calculated as the spectral difference between the test and the Reference Signal (only for the FFT-based ear model).

If not indicated differently, in the case of stereo signals all computations are performed independently and in the same manner for the left and right channels. The description defines two setups, one called the Basic Version and one called the Advanced Version. In all given equations, the index Ref stands for all patterns calculated from the Reference Signal and the index Test stands for all patterns calculated from the Signal Under Test. The index k stands for the discrete frequency variable (i.e. the frequency band) and n stands for the discrete time variable (i.e. either the frame counter or the sample counter). If the values for k or n are not explicitly defined, the computations are to be carried out for all possible values of k and n. All other abbreviations are explained at the place they occur. In the names of the Model Output Variables, the index A stands for all variables calculated using the filter bank-based part of the ear model and the index B stands for all variables calculated using the FFT-based part of the ear model.

Basic Version

The Basic Version includes only MOVs that are calculated from the FFT-based ear model. The filter bank-based part of the model is not used. The Basic Version uses a total of 11 MOVs for the prediction of the perceived basic audio quality.

Advanced Version

The Advanced Version includes MOVs that are calculated from the filter bank-based ear model as well as MOVs that are calculated from the FFT-based ear model. The spectrally adapted excitation patterns and the modulation patterns are computed from the filter bank-based part of the model only. The Advanced Version uses a total of 5 MOVs for the prediction of the perceived basic audio quality.

2.5.2 Non-reference Audio Quality Metrics

For our work we decided not to use reference audio metrics, because in mobile scenarios it is difficult to work with the reference signal. For example, if a UMTS network operator wants to know the quality on the user's side, it is not possible to compare the signal at the output of the receiver with the reference signal at the transmitter. The main reason for this is that the transmission conditions in the mobile environment change both in the time and the space domain. So we decided

to work with a non-intrusive audio metric, the Single Sided Speech Quality Metric (3SQM) [14]. Although it is designed primarily for the evaluation of speech quality, in our audiovisual metric setup it also works well for the evaluation of non-speech signals. Non-intrusive measurement, compared to intrusive measurement, offers the possibility to measure at almost any point of the network with any real-world audio signal. Of course, there is a disadvantage: the accuracy of the measurement is somewhat lower compared to intrusive measurement techniques such as PESQ. These types of metrics can be divided according to two fundamentally different principles. The first principle works only at the signal-processing level: it takes into account the transmission conditions and the transmission path itself, but not the signal. If the transmission path changes its properties, this type of measurement will fail. A priori knowledge of the exact transmission path and the equipment used is important. The advantage of this type of measurement is that it is slightly less computationally complex than other methods. The second principle of non-intrusive voice quality measurement is more universal. It analyses the voice stream and not the transmission path. Any kind of voice signal can be analysed, and this principle can be used at every point of the network. It does not depend on the network elements. This principle is more complex than the first type, but it is much more flexible and can be used in any scenario. In today's heterogeneous networks, this type of non-intrusive measurement is the only one which can be used with hardly any restrictions.

Single-sided Speech Quality Measurement (3SQM)

The 3SQM algorithm is an example of the second type of non-reference audio metric. As can be seen in Figure 2.18, it combines three different and fundamentally independent parts. The input signal is first pre-processed: the signal is filtered through a filter which simulates the standard headset used in laboratories for subjective testing; after filtering, the signal is adjusted to speech level; and finally the signal is divided into voiced and unvoiced parts. After pre-processing, the signal is led into the next stage, where the distortion and speech parameters are extracted. The parameters are divided into three main distortion classes:

1. Vocal tract analysis and unnaturalness of speech
2. Analysis of strong additional noise
3. Interruptions, mutes and time clipping

Figure 2.18: Block diagram of the 3SQM analysis algorithm.

After this division, several parameters are clustered to define a single isolated distortion class. This is done in analogy to subjective testing with real listeners: listeners focus on the foreground of the signal and do not judge the quality by a simple sum of all occurring distortions, but rather by the single dominant artifact in the signal. The dominant distortion classes used in the 3SQM metric are:

1. Low static SNR
2. Mutes
3. Low segmental SNR
4. Unnatural voice (robotisation)
5. Basic speech quality

Finally, for each dominant distortion the final quality estimation is calculated, based on a selection of the MOVs. The result of the estimation is equivalent to the Mean Opinion Score - Objective Listening Quality (MOS-LQO), which is defined in ITU-T recommendation P.800.1.


Chapter 3
Video Quality

3.1 Introduction

This chapter gives a basic overview of the video encoding process and of the basic steps which are common to several video codecs: video and colour sampling, frame structure, and the creation of motion vectors. The encoding properties of H.264/AVC are then described in more detail. At the end of this chapter, a video quality metric based on the motion characteristics of the video is presented. The objective parameters extracted from the motion vectors, which are further used in the audiovisual quality metric, are also described in more detail.

3.2 Basic principles of video coding

If we take the simplest case, storing uncompressed digital video data, we need an enormous amount of storage space. If we use 8-bit code words for each sample, then for the standard VGA (640x480 pixels) picture size and 25 frames per second (Frame Rate, FR) we need approximately 155 GB^1 of disc space for a two-hour movie. It is obvious that we must use compression methods in order to need less storage capacity, or lower bit rates for streaming of the video data. This is a major issue in the mobile environment, since here we are coping with limited bandwidth and worse transmission conditions, and service providers also want to use as little media capacity as possible.

^1 3 colours x 8 bits x 640 pixels x 480 pixels x 25 fps x 60 s x 120 min = 1.327 x 10^12 bits ≈ 155 GB
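The footnote's arithmetic generalises to any resolution and frame rate; here is a small Python helper (illustrative, not from the thesis toolchain):

def raw_video_bits(width, height, fps, seconds, colours=3, bits_per_sample=8):
    # One 8-bit sample per colour component per pixel, no compression
    return colours * bits_per_sample * width * height * fps * seconds

bits = raw_video_bits(640, 480, 25, 120 * 60)   # VGA, 25 fps, two hours
print(bits / 8 / 1024**3)                       # ~154.5 GB, as in the footnote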

For video streaming services, the following video compression standards are used today: H.263 [23], standardised by the International Telecommunication Union (ITU); MPEG-4 Part 2 [21], standardised by the International Organization for Standardization (ISO) Motion Picture Experts Group; and the newest H.264/AVC [22] (also known as MPEG-4 Part 10), standardised by the Joint Video Team (JVT) of experts from both ISO/IEC and ITU. All of the mentioned codecs use similar basic principles for encoding the video, so a brief overview of these principles is given in the following sections. In this work we focus on the newest codec, H.264, since it is at this time the most advanced technique and offers the best coding efficiency. It also has great potential for wide use in UMTS networks and in Digital Video Broadcasting (DVB).

3.2.1 Video and Colour sampling

Redundancy and irrelevance characterise the removable information in the video signal: a video signal from which this information has been eliminated is perceived by a human in much the same way as before. The difference between the redundant and the irrelevant part lies only in the subjective or objective point of view.

1. The redundant part of the video signal is the part of the information which can be removed such that the signal can be reconstructed again without any loss of information and without any distortion.

2. The irrelevant part of the video signal is the part of the signal whose absence will not be recognised by the human eye. This means that a signal without the irrelevant information is perceived in the same way as the original signal.

These two types of information allow two types of compression, namely:

1. Loss-less compression
2. Lossy compression

Loss-less compression removes the redundant information from the video signal. Because the redundant information can be 100% reconstructed, we use the term loss-less compression. There are many types of compression algorithms, but the most commonly used in video processing are Run Length Encoding (RLE), Lempel-Ziv-Welch (LZW) and Huffman coding. The compression ratio which can be obtained by this type of signal processing is relatively small compared to the lossy one.
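As a small illustration of the loss-less idea, here is a minimal run-length encoder and decoder in Python; the round trip reconstructs the input exactly:

def run_length_encode(data):
    # Collapse runs of identical values into (value, run length) pairs
    runs = []
    for v in data:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def run_length_decode(runs):
    return [v for v, n in runs for _ in range(n)]

stream = [0, 0, 0, 0, 5, 5, 0, 0, 0, 7]
encoded = run_length_encode(stream)
print(encoded)                                  # [(0, 4), (5, 2), (0, 3), (7, 1)]
assert run_length_decode(encoded) == stream     # 100% reconstruction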

Lossy compression lowers the amount of data by removing the irrelevant information. The compression ratio varies, because the encoder is able to set the compression ratio and to choose which irrelevant information is removed. This adjustable, variable compression ratio is very suitable for multimedia applications; it allows a high reduction of the bit rate and a reduction of the transmission bandwidth.

If we take a look at the video signal itself, we can define several levels or parts of the video stream. The video signal consists of several pictures which follow continually one after another. The number of these pictures in one second defines the basic video parameter, the Frame Rate (FR), sometimes also known as Frames Per Second (FPS). We can easily reduce the bit rate by simply lowering the FR. Various standards define various values of the FR, but the most commonly used value is 25. Frame rates below 10 FPS are sometimes used for very low bit-rate video communications, but motion is jerky and unnatural at this rate. Frame rates between 10 and 20 frames per second are typical for low bit-rate video communications, sampling at 25 FPS is standard for television (with interlacing to improve the appearance of the movement), and 50 or 60 FPS produces smooth apparent motion.

Processing the whole frame at once is inconvenient and difficult, so video processing works with smaller parts of the frames: pixels. Several pixels form a row; the number of pixels in one row defines the width of the video frame. Several rows form slices, which are used in video processing for better control over the video stream and for better error concealment. The number of rows defines the height of the video frame. The height and width of the video frame form the resolution, or picture size. Several picture sizes are commonly used in the mobile environment; the most used are listed in Table 3.1.

Table 3.1: Video resolutions used in mobile environment.
Abbreviation | Size    | Description
VGA          | 640x480 | Video Graphics Array
QVGA         | 320x240 | Quarter Video Graphics Array
CIF          | 352x288 | Common Intermediate Format
QCIF         | 176x144 | Quarter Common Intermediate Format

In our subjective measurements, only the VGA and QVGA picture resolutions were used. We also worked with colour video, so some basics of the colour storage mechanisms and of colour space sub-sampling are described here. The main principle of colour picturing devices is drawing each colour as a mixture of red, green and blue. Any colour can be created by combining these three basic colours. The colour of a pixel in the frame can be described as a 1-by-3 matrix, where the number in each row describes the relative proportions of red, green and blue. A picture divided into each significant colour can be seen in Figure 3.1.

Figure 3.1: Red, Green and Blue components of an image.

Since the Human Visual System (HVS) is more sensitive to luminance than to chrominance, there exists another way to reduce the size of the image or video sequence. The colour space which exploits the higher sensitivity to luminance is called YCbCr. Y is the luminance (luma) component and can be calculated as a weighted average of R, G and B:

Y = k_r R + k_g G + k_b B  (3.1)

The other components of YCbCr can easily be computed from the components of the RGB colour space. The following equations are used for the derivation of YCbCr from RGB [25]:

Y = k_r R + (1 - k_b - k_r) G + k_b B
C_b = 0.5 (B - Y) / (1 - k_b)  (3.2)
C_r = 0.5 (R - Y) / (1 - k_r)

Here k_r and k_b are constants, and ITU-R recommendation BT.601 defines them as k_b = 0.114 and k_r = 0.299, which leads to:

Y = 0.299 R + 0.587 G + 0.114 B
C_b = 0.564 (B - Y)  (3.3)
C_r = 0.713 (R - Y)

A picture divided into the separate components of the YCbCr colour space can be seen in Figure 3.2. It is obvious that a simple transformation into another colour space cannot by itself achieve lower transmission bit rates. But, as mentioned above, the HVS is more sensitive to the luma component and less sensitive to the chrominance information of the picture or video signal. So the Cb and Cr components can be sampled at a lower sampling rate, and this results in a saving of the data needed for the digital picture representation.
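Eq. (3.3) in Python (a sketch; the helper name and the [0, 1] value range are assumptions of this example):

import numpy as np

def rgb_to_ycbcr(rgb):
    # BT.601 constants of Eq. (3.3); rgb has shape (..., 3), values in [0, 1]
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)
    cr = 0.713 * (r - y)
    return np.stack([y, cb, cr], axis=-1)

pixel = np.array([[1.0, 0.5, 0.25]])    # an orange-like pixel
print(rgb_to_ycbcr(pixel))              # luma plus two chroma differences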

Figure 3.2: Luminance and chrominance components of the picture in YCbCr colour space.

There exist several colour sub-sampling schemes; the main ones are listed below:

1. 4:4:4 sub-sampling means that all three components (Y, Cb, Cr) have the same resolution, and hence a sample of each component exists at every pixel position. The numbers indicate the relative sampling rate of each component in the horizontal direction.

2. 4:2:2 sub-sampling means that the chrominance components have the same vertical resolution as the luminance component, but half the horizontal resolution. To calculate the saved bandwidth we can use a simple rule: dividing the sum of the sampling ratio coefficients (4:2:2) by 12 gives the ratio of the sub-sampled video bandwidth to the bandwidth of the original video. Hence this sampling scheme requires two thirds^2 of the bits needed by the 4:4:4 version.

3. 4:2:0 sub-sampling, also called YV12, means that the chrominance components (Cb, Cr) each have half the horizontal and half the vertical resolution of the luma component Y. Hence, with this scheme we need half the bandwidth of the original 4:4:4 video version.

Figure 3.3: 4:2:2 Colour sampling pattern [24].

Figure 3.4: 4:2:0 Colour sampling pattern [24].

The sampled pixels form a macroblock, and the further encoding processing is done mainly at the macroblock level. A macroblock consists of 16x16 luma and 8x8 chroma pixels. The further processing differs for the three types of frames, namely Intra-coded frames (I frames), Predicted frames (P frames) and Bi-directionally predicted frames (B frames). If an I frame is encoded, the frame is divided into 8x8 pixel blocks. These blocks are transformed using the Discrete Cosine Transform (DCT), but the transformed data as such offers no data saving yet. The data saving is obtained by quantisation of the DCT coefficients, after which many of the coefficients, mainly the higher-frequency ones, are equal to zero. By zigzagging the coefficient matrix, the picture data bit-stream is obtained. This data stream has distinctive groups of zeros, which can be exploited for data saving by using run-length codes. After this step, finally, Huffman coding is applied, and so the data are transformed into smaller groups of numbers.

^2 (4+2+2)/12 = 2/3
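The I-frame pipeline sketched above (level shift, 8x8 DCT, quantisation, zig-zag scan) can be illustrated in a few lines of Python. The flat step size of 16 is an assumed toy quantiser; real codecs use frequency-dependent quantisation matrices:

import numpy as np

def dct2(block):
    # Orthonormal 2-D DCT-II of an n x n block via the 1-D DCT matrix
    n = block.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c @ block @ c.T

def zigzag(block):
    # Read the block in zig-zag order, from low to high frequencies
    n = block.shape[0]
    order = sorted(((r, col) for r in range(n) for col in range(n)),
                   key=lambda p: (p[0] + p[1],
                                  -p[1] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([block[r, col] for r, col in order])

block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 8 + 96  # smooth ramp
coeffs = dct2(block - 128.0)            # level shift, then transform
quantised = np.round(coeffs / 16.0)     # toy uniform quantiser, step 16
print(zigzag(quantised))                # long runs of zeros at the tail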

Figure 3.3: 4:2:2 Colour sampling pattern [24].

Figure 3.4: 4:2:0 Colour sampling pattern [24].

This data stream will contain distinctive groups of zeros, which can be exploited for data saving by using run-length codes. Finally, Huffman coding is applied, so that the data are transformed into an even smaller group of numbers.

If a P or B frame is encoded, a different process is used. First, the present frame is compared with a previous or future reference frame (or with two reference frames in the case of a B frame). This comparison is done at the macroblock level. Using an energy measure, the most similar (or identical) macroblock is searched for in the reference frame; a sketch of such a block-matching search is given below.
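This is a minimal sketch of such a block-matching motion search, assuming a sum-of-absolute-differences (SAD) energy measure and an exhaustive search window; real encoders use much faster search strategies:

```python
import numpy as np

def best_motion_vector(cur, ref, top, left, block=16, radius=8):
    """Exhaustive block matching: find the (dy, dx) displacement in the
    reference frame that minimises the SAD for one macroblock."""
    target = cur[top:top + block, left:left + block].astype(int)
    best = (0, 0), np.inf
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + block > ref.shape[0] or c + block > ref.shape[1]:
                continue   # candidate block would leave the reference frame
            cand = ref[r:r + block, c:c + block].astype(int)
            sad = np.abs(target - cand).sum()   # "energy" of the difference
            if sad < best[1]:
                best = (dy, dx), sad
    (dy, dx), _ = best
    residual = target - ref[top + dy:top + dy + block,
                            left + dx:left + dx + block].astype(int)
    return (dy, dx), residual   # motion vector and residual to be encoded
```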

If found, the motion vector and the difference between the present and the reference macroblock are calculated (this is also known as motion-compensated prediction). The residual is then encoded by the same process as the one used in I frame encoding.

New features in H.264

Above, some of the elementary encoding processes were mentioned, but H.264 was developed to achieve better coding efficiency than its predecessors, so new features had to be introduced in the new codec. The following is a list of the new features present in the H.264/AVC codec [26].

Variable block-size motion compensation with small block sizes: Fixed motion compensation block sizes were inefficient, so this standard supports more flexibility in the selection of motion compensation block sizes and shapes. The luma motion compensation block can be as small as 4x4 pixels.

Quarter-sample-accurate motion compensation: Most prior standards enable half-sample motion vector accuracy at most. H.264 takes this to another level by enabling quarter-sample motion vector accuracy.

Motion vectors over picture boundaries: This feature was previously implemented in the H.263 standard and is now also included in H.264/AVC. The technique is known as boundary extrapolation.

Multiple reference picture motion compensation: While previous codecs used only one previously decoded picture to predict the values in an incoming picture, the new standard enables the use of more than one previously decoded picture for prediction. This is also possible for bi-directionally predicted frames - B frames.

Decoupling of referencing order from display order: In prior standards, there was a strict dependency between the ordering of pictures for motion compensation referencing purposes and the ordering of pictures for display purposes. In H.264/AVC, these restrictions are largely removed, allowing the encoder to choose the ordering of pictures for referencing and display purposes with a high degree of flexibility, constrained only by a total memory capacity bound imposed to ensure decoding ability. Removal of the restriction also enables removing the extra delay previously associated with bi-predictive coding.

Decoupling of picture representation methods from picture referencing capability: Bi-directionally predicted pictures can now be used for the prediction of other pictures in the video sequence.

Weighted prediction: A new innovation in H.264/AVC allows the motion-compensated prediction signal to be weighted and offset by amounts specified by the encoder. This can dramatically improve coding efficiency for scenes containing fades, and can be used flexibly for other purposes as well.

Improved skipped and direct motion inference: The H.264/AVC design infers motion in skipped areas, and also includes an enhanced motion inference method known as direct motion compensation, which improves on previous direct prediction designs.

Directional spatial prediction for intra coding: Extrapolation of the edges of the previously decoded parts of the current picture helps to improve the prediction quality in regions of pictures which were coded without referencing the content of reference pictures.

In-the-loop deblocking filtering: For improving the video quality and preventing blocking artefacts, a new adaptive deblocking filter is used in H.264/AVC. The deblocking filter in the H.264/AVC design is brought within the motion-compensated prediction loop, so that this improvement in quality can be used in inter-picture prediction to improve the ability to predict other pictures as well.

Small block-size transform: While prior video coding standards used a transform block size of 8x8, H.264/AVC is based primarily on a 4x4 transform.

Hierarchical block transform: For signals that contain sufficient correlation, longer (longer than 4x4) basis functions can be used for the transformation. H.264/AVC also achieves better coding efficiency and higher picture quality by allowing the encoder to select a special coding type for intra coding, in which transform matrices of size 4x4, 8x8 or 16x16 can be used.

Short word-length transform: While previous designs have generally required 32-bit processing for transform computation, the H.264/AVC design requires only 16-bit arithmetic.

Exact-match inverse transform: In previous video coding standards, the transform used for representing the video was generally specified only within an error tolerance bound, due to the impracticality of obtaining an exact match to the ideal specified inverse transform. As a result, each decoder design would produce slightly different decoded video, causing a drift between the encoder and decoder representations of the video and reducing effective video quality. Building on a path laid out as an optional feature in the H.263++ effort, H.264/AVC is the first standard to achieve exact equality of decoded video content from all decoders.

Arithmetic entropy coding: This new coding standard makes more effective use of arithmetic entropy coding than the H.263 codec did.

Context-adaptive entropy coding: The two entropy coding methods applied in H.264/AVC, termed CAVLC (context-adaptive variable-length coding) and CABAC (context-adaptive binary arithmetic coding), both use context-based adaptivity to improve performance relative to prior standard designs.

Parameter set structure: The parameter set design provides for robust and efficient conveyance of header information. As the loss of a few key bits of information (such as sequence header or picture header information) could have a severe negative impact on the decoding process when using prior standards, this key information is separated for handling in a more flexible and specialised manner in the H.264/AVC design.

NAL unit syntax structure: Each syntax structure in H.264/AVC is placed into a logical data packet called a NAL unit. The NAL unit syntax structure allows greater customisation of the method of carrying the video content in a manner appropriate for each specific network.

Flexible slice size: Slice sizes in H.264/AVC are highly flexible, as was the case earlier in MPEG-1.

Flexible macroblock ordering (FMO): The picture is partitioned into regions called slice groups, and each slice becomes an independently decodable subset of a slice group.

Arbitrary slice ordering (ASO): As a result of the previously mentioned feature, H.264/AVC enables sending and receiving the slices of a picture in any order relative to each other.

Redundant pictures: The encoder can send redundant representations of regions of pictures. This enables the decoder to reconstruct regions of pictures whose primary representation has been lost during data transmission.

Data partitioning: The syntax of each slice can be separated into up to three different partitions for transmission, depending on the categorisation of the syntax elements. This categorisation reflects the fact that some coded information is more valuable than other for the representation of the video content.

SP/SI synchronisation/switching pictures: The H.264/AVC design includes a new feature consisting of picture types that allow exact synchronisation of the decoding process of some decoders with an ongoing video stream produced by other decoders, without penalising all decoders with the loss of efficiency resulting from sending an I picture. This can enable switching a decoder between representations of the video content that use different data rates, recovery from data losses or errors, as well as trick modes such as fast-forward, fast-reverse, etc.

3.3 Video Quality Estimation

In recent years a great effort has been put into research on video quality estimation. Several video quality metrics and several methods for measuring subjective and objective video quality have been designed. At the time of writing of this thesis, there is still no well-received and commonly used video quality metric based on objective parameters with a good correlation to subjective test results. So, for the evaluation of video quality, subjective tests are still the usual way to obtain the quality perceived by various people. But, as the goal of this thesis is to propose an objective audiovisual quality metric, it was essential to also propose an independent video quality metric based on objective parameters of the video. This part of the work is based on the former research published in [3], which describes several ways of measuring video quality. All of them work independently of the original video signal (non-reference metrics) and use various objective parameters.

The following paragraphs describe the objective parameters which will later be used in video quality estimation and, subsequently, in audiovisual quality estimation.

Quality estimation based on content sensitive parameters

For the estimation of perceived video quality, several content classes of video can be defined. Each of these content classes has some specific features which can be used for content classification. For the subjective tests, which will be described later, we decided to use three very specific video content classes which differ in several ways. Taking a global look at them, and focusing also on the differences in the audio domain, these three video content classes are very distinctive there as well. For our work we chose the Video clip, the Video call and the Soccer clip. The difference in the audio domain is obvious - the Video clip contains mainly music, perhaps with speech in the background; the Video call contains mainly voice data; and the Soccer clip contains speech, strongly corrupted by the noise of the audience in the soccer stadium.

Let us now focus on the differences in the video domain. The Video clip is characterised by fast picture changes, fast movements, highly varied colours and often a person in the foreground. This type of video content contains both global and local movement. On the other hand, the Video call contains a very slowly moving object in the foreground and a slowly moving or static background. There is very little local movement and at most slow global movement. The Soccer content type contains a lot of specific colours, mainly green. The audience can be regarded as the background, and especially at very low bitrates the background appears almost static to a human observer, while the foreground contains fast local movements.

These specific properties can be exploited very well in feature extraction. If we take the motion vector (MV) as one of the objective parameters of the video, then by calculating various statistics of the MVs we can easily capture the particular properties of the movement in the video. The list below specifies several statistical properties of the MVs and other video properties which will later be used in the complex audiovisual quality metric, but which can also be used for estimating video quality alone (without the presence of audio) [3]; a sketch of how these statistics can be computed follows the list.

Zero MV ratio within one shot Z: The percentage of zero MVs is the proportion of the frame that does not change at all (or changes only very slightly) between two consecutive frames, averaged over all frames in the shot. This feature detects the proportion of still regions. A high proportion of still regions refers to a very static sequence with small but significant local movement.

Figure 3.5: Example of various video content types: Video clip (left), Video call (middle), Soccer (right).

The viewer's attention is focused mainly on this small moving region. A low proportion of still regions indicates uniform global movement and/or a lot of local movement.

Mean MV size within one shot N: This is the mean size of the non-zero MVs, expressed as a percentage normalised to the screen width. This parameter determines the intensity of movement within a moving region. Low intensity indicates a static sequence. High intensity within a large moving region indicates a rapidly changing scene.

Ratio of MV deviation within one shot S: The percentage ratio of the standard deviation of the MVs to the mean MV size within one shot. A high deviation indicates a lot of local movement, while a low deviation indicates global movement.

Uniformity of movement within one shot U: The percentage of MVs pointing in the dominant direction (the most frequent direction of MVs) within one shot. For this purpose, the direction is resolved with a granularity of 10 degrees. This feature expresses the proportion of uniform and local movement within one sequence.

Average BR: This parameter refers to the pure video payload. The parameter BR is calculated as an average over the whole stream. Furthermore, the parameter BR reflects the compression gain in the spatial and temporal domain. Moreover, the encoder performance depends on the motion characteristics. A BR reduction causes a loss of spatial and temporal information, which is usually annoying for viewers.
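As announced above, this is a minimal sketch of how the four motion features Z, N, S and U could be computed, assuming the motion vectors of one shot are already available as one (dy, dx) array per frame; the exact normalisations used in [3] may differ in detail:

```python
import numpy as np

def motion_features(shot_mvs, screen_width, angle_bin_deg=10):
    """Compute Z, N, S, U (all in percent) for one shot.

    shot_mvs: list of arrays of shape (num_blocks, 2) holding the
    (dy, dx) motion vectors of each frame in the shot."""
    mvs = np.vstack(shot_mvs).astype(float)
    sizes = np.hypot(mvs[:, 0], mvs[:, 1])

    # Z: proportion of zero MVs (still regions) over the whole shot.
    Z = 100.0 * np.mean(sizes == 0)

    # N: mean size of the non-zero MVs, normalised to the screen width.
    nonzero = sizes[sizes > 0]
    N = 100.0 * nonzero.mean() / screen_width if nonzero.size else 0.0

    # S: standard deviation of the non-zero MV sizes relative to their mean.
    S = 100.0 * nonzero.std() / nonzero.mean() if nonzero.size else 0.0

    # U: share of MVs pointing in the dominant direction (10-degree bins).
    angles = np.degrees(np.arctan2(mvs[:, 0], mvs[:, 1])) % 360
    hist, _ = np.histogram(angles, bins=np.arange(0, 361, angle_bin_deg))
    U = 100.0 * hist.max() / len(mvs)

    return Z, N, S, U
```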

As mentioned in the section above, these parameters can be used for video quality estimation with no audio stream present. For the purposes of this thesis (based on the work in [3]) I decided to use ensemble based quality estimation, which is easy to implement in the MATLAB environment. For training the ensemble model, Entool 1.1 [27] was used, which is available for free download as a MATLAB toolbox. The aim was to train a defined ensemble of models with a set of four motion sensitive objective parameters (Z, N, S, U) and BR. The ensemble consists of different model classes to improve the performance in regression problems. A closer view of ensemble modelling will be given later in the section about the audiovisual quality estimation metric. The simplified scheme for the estimation of video quality based on content adaptive parameters is depicted in Figure 3.6.

Figure 3.6: Video quality estimation based on content adaptive parameters.

Quite good objective MOS scores can be achieved with this method. For the validation of the ensemble based quality estimation the Pearson correlation was used:

r = ((x - x̄)^T (y - ȳ)) / sqrt(((x - x̄)^T (x - x̄)) ((y - ȳ)^T (y - ȳ))),    (3.4)

where x is the vector of MOS values from the subjective tests (averaged over all subjective evaluations), x̄ is the average MOS value over x, y is the vector of objective MOS values obtained by the estimation metric and ȳ is the average value over y. The best result achieved by this metric, described by the Pearson correlation factor, was 85.85%. A similar method will also be described later in the section on audiovisual quality estimation.
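A small sketch of the validation step in equation (3.4), assuming two vectors of subjective and estimated MOS values; the numbers below are made-up placeholders, not results from this work:

```python
import numpy as np

def pearson_correlation(x, y):
    """Pearson correlation between subjective MOS (x) and estimated
    MOS (y), as in equation (3.4)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

# Illustrative placeholder values only:
subjective = np.array([4.2, 3.1, 2.5, 3.8, 1.9])
estimated = np.array([4.0, 3.3, 2.2, 3.5, 2.1])
print(f"r = {100 * pearson_correlation(subjective, estimated):.2f}%")
```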


Chapter 4

Audiovisual quality

4.1 Introduction

Provisioning of mobile video services is a difficult challenge, since in the mobile environment bandwidth and processing resources are limited. Audiovisual content is present in most multimedia services; however, the user expectation of perceived audiovisual quality differs for speech and non-speech content. One of the challenges is to improve the subjective quality of audio and audiovisual services. Due to advances in audio and video compression and the wide-spread use of standard codecs such as AMR and AAC (audio) and MPEG-4/AVC (video), provisioning of audiovisual services is possible at low bit rates while preserving perceptual quality. The Universal Mobile Telecommunications System (UMTS) release 4 (implemented by the first UMTS network elements and terminals) provides a maximum data rate of 1920 kbps shared by all users in a cell, and release 5 offers up to 14.4 Mbps in the downlink direction for High Speed Downlink Packet Access (HSDPA). The following audio and video codecs are supported for UMTS video services: for audio the AMR speech codec, AAC Low Complexity (AAC-LC) and AAC Long Term Prediction (AAC-LTP) [2], and for video H.263, MPEG-4 and MPEG-4/AVC [2]. The appropriate encoder settings for UMTS video services differ for various contents and streaming application settings (resolution, frame and bit rate) [3].

End-user quality is influenced by a number of factors including mutual compensation effects between audio and video, content, encoding and network settings, as well as transmission conditions. Moreover, audio and video are not only mixed in the multimedia stream; there is even a synergy of the component media (audio and video) [4]. As previous work has shown, mutual compensation effects cause perceptual differences in video with a dominant voice in the audio track rather than in video with other types of audio [5]. Video content with a dominant voice

includes news, interviews, talk shows, etc. Finally, audiovisual quality estimation models tuned for video content with a dominant human voice perform better than universal models [5]. Therefore, our focus within this work is on the design of an audiovisual metric based on audio and video content adaptive features. We are looking at measures that do not need the original (non-compressed) sequence for the estimation of quality, because this reduces the complexity and at the same time broadens the possibilities of deploying the quality prediction. Furthermore, we investigated novel ensemble based estimation systems. Work on ensemble based estimation shows that ensemble based systems are more beneficial than their single classifier counterparts [28].

4.2 Audiovisual quality assessment

Test Methodology

The proposed test methodology is based on ITU-T P.911 [29] and adapted to our specific purpose and limitations. For this particular application it was considered that the most suitable experimental method, among those proposed in the ITU-T Recommendation, is ACR (Absolute Category Rating), also called the Single Stimulus Method. The ACR method is a category judgement in which the test sequences are presented one at a time and are rated independently on a category scale. Only degraded sequences are displayed, and they are presented in arbitrary order. This method imitates the real-world scenario, because the customers of mobile video services do not have access to the original videos (high quality versions). On the other hand, ACR introduces a higher variance in the results compared to other methods in which the original sequence is also presented and serves as a reference for the test subjects. After each presentation the test subjects were asked to evaluate the overall quality of the sequence shown. In order to measure the perceived quality, a subjective scaling method is required. However, whatever the rating method, this measurement will only be meaningful if there actually exists a relation between the characteristics of the video sequence presented and the magnitude and nature of the sensation that it causes in the subject. The existence of this relation is assumed. Test subjects evaluated the quality after each sequence in a prepared form using a five grade MOS scale: 5 Excellent, 4 Good, 3 Fair, 2 Poor, 1 Bad. Higher discriminative power was not required, because our test subjects were used to five grade scales (school). Furthermore, a five grade MOS scale offers the best trade-off between the evaluation interval and the reliability of the results. Higher discriminative power can introduce higher variations in the MOS results. For emulating the real-world conditions of the UMTS video service, all the audio

Figure 4.1: Snapshots of selected sequences for the audiovisual test: Video clip (left), Soccer (middle), Video call (right).

and video sequences were played on the UE (Vodafone VPA IV). In this single respect the proposed methodology for audiovisual quality testing is not compliant with ITU-T P.911 [29]. Furthermore, since one of our intentions is to study the relation between audio quality and video quality, we decided to run all the tests with a standard stereo headset. During the training session of three sequences the subjects were allowed to adjust the volume level of the headset to a comfortable level. The viewing distance from the phone was not fixed and was selected by the test person, but we noticed that all subjects were comfortable holding the cell-phone at a distance of cm.

Encoder Settings

Resolution   Audio Codec   Video BR [kbps]   Video FR [fps]   Audio BR [kbps]   Audio SR [kHz]
QVGA         AAC
VGA          AAC
QVGA         AAC
VGA          AAC
QVGA         AAC

Table 4.1: Encoding settings of the Video clip sequence.

All video sequences were encoded using typical settings for the UMTS environment. Due to the limitations of mobile radio resources, bit rates were selected in the range kbps. Encoded audio files with insufficient audio quality were excluded. The test sequences were encoded with the H.264/AVC baseline profile 1b codec. The audio was encoded with the AAC or AMR codec. The encoding parameters were selected according to our former experience described in [3] and [5]. In total, 12 encoding combinations were tested (see Tables 4.1, 4.2, 4.3).

To evaluate the subjective perceptual audiovisual quality, a group of 15 people was chosen for the training set and a group of 16 people for the evaluation set. The chosen group spanned different ages (between 22 and 30), genders, educations and experience.

The sequences were presented in an arbitrary order, with the additional condition that the same sequence (even differently degraded) did not appear twice in succession. In the further processing of the results we rejected the sequences which were evaluated with an individual variance higher than one; in total, 6% of the obtained results were rejected (a sketch of this screening step is given after Table 4.3 below). Two rounds of each test were taken. The duration of each test round was about 20 minutes.

Resolution   Audio Codec   Video BR [kbps]   Video FR [fps]   Audio BR [kbps]   Audio SR [kHz]
QVGA         AAC
QVGA         AAC
QVGA         AAC
QVGA         AAC
QVGA         AAC

Table 4.2: Encoding settings of the Soccer sequence.

For the audiovisual quality tests three different content types (Video clip, Soccer and Video call) were selected, with different perception of the video and audio media. The video snapshots are depicted in Figure 4.1. The first two sequences, Video clip and Soccer, contain a lot of local and global movement. The main difference between them is in their audio part. In Soccer, the speaker's voice as well as loud support from the audience is present, where the speaker's voice is rather important. The results depicted in Figure 4.2 show the importance of video quality. Especially important are the small moving objects: players and ball. Furthermore, it can be seen that a higher audio BR does not significantly improve the audiovisual quality. Hence, the video medium is more dominant for soccer content. In Video clip, instrumental music with voice is present in the foreground. The results depicted in Figure 4.3 show the importance of audio quality. In Video call, a human voice is the most dominant. Furthermore, the obtained results for Video call and Soccer show that a higher resolution has no or little impact on the audiovisual quality. This was influenced by the granularity of the LCD of the test PDA.

Resolution   Audio Codec   Video BR [kbps]   Video FR [fps]   Audio BR [kbps]   Audio SR [kHz]
QVGA         AAC
QVGA         AMR

Table 4.3: Encoding settings of the Video call sequence.
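As referenced above, this is a minimal sketch of one plausible reading of the screening step: a subject's ratings of a sequence are dropped when that subject's own ratings across rounds vary too much, and the surviving scores are averaged into MOS. The data layout and threshold handling are assumptions, not the exact procedure of this work:

```python
import numpy as np

def screen_and_average(ratings, max_var=1.0):
    """ratings: dict mapping sequence id -> array of shape
    (num_subjects, num_rounds) with five-grade ACR scores.

    Drops a subject's ratings for a sequence when the variance of that
    subject's own ratings exceeds max_var, then averages the remaining
    scores into one MOS value per sequence."""
    mos = {}
    for seq, scores in ratings.items():
        keep = scores.var(axis=1) <= max_var      # per-subject variance
        mos[seq] = scores[keep].mean()
    return mos

# Toy example: 3 subjects, 2 rounds each for one sequence.
ratings = {"soccer_qvga_aac": np.array([[4, 4], [3, 5], [2, 5]])}
print(screen_and_average(ratings))   # third subject (var 2.25) is dropped
```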

Figure 4.2: Measured MOS results for the Soccer video sequences.

Figure 4.3: Measured MOS results for the Video clip sequences.

Prior Art

In former work [3], [5] we investigated audiovisual quality for different content classes, codecs and encoding settings. The obtained subjective quality results clearly show the existence of a mutual influence of audio and video and the presence of the mutual compensation effect. Figures 4.4 and 4.5 (the colour code serves only for better visualisation of the results) show results of audiovisual quality assessment based on H.263 encoding. The mutual compensation effect was more dominant for the Cinema trailer and Video clip contents (see Figure 4.5). In Video call, the audiovisual quality is more influenced by the audio quality than by the video quality (see Figure 4.4). More details can be found in [3].

Figure 4.4: MOS results for the Video call content - codec combination H.263/AAC.

Further investigation within this work shows that it is beneficial to propose one audiovisual model with different coefficients for various video contents, depending on the presence (Video call) or absence of a dominant human voice (Video clip and Cinema trailer). Therefore, within the new work presented in this contribution, an additional parameter was introduced for detecting speech and non-speech audio content (cf. Section 4.3.2).

4.3 Feature extraction

The proposed method is focused on reference free audiovisual quality estimation. The character of the sequence is determined by content dependent audio and video features in between two scene changes. Therefore, the investigation of the audio and video stream was focused on sequence motion features as well as on audio content and quality. The video content significantly influences the subjective video quality [3], [31], and the sequence motion features reflect the video content very well. The well-known ITU-T standard P.563 [14] was used for audio quality estimation. Furthermore, a speech/non-speech detector was introduced to eliminate the different influence of the mutual compensation effect between audio and video in speech and non-speech content. Finally, temporal segmentation was also used as a prerequisite in the process of video quality estimation. For this purpose a scene change detector was designed with an adaptive threshold based on the video dynamics. The scene change detector design is described in detail in [3].
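A minimal sketch of such a scene change detector, assuming frame-difference energy as the dynamics measure and a threshold that adapts to a running average of recent differences; the actual detector in [3] may differ:

```python
import numpy as np

def detect_scene_changes(frames, k=3.0, window=10):
    """frames: iterable of greyscale frames as 2-D numpy arrays.
    Flags a cut when the mean absolute luma difference between two
    consecutive frames exceeds k times the recent average difference."""
    cuts, history = [], []
    prev = None
    for i, frame in enumerate(frames):
        if prev is not None:
            diff = np.abs(frame.astype(int) - prev.astype(int)).mean()
            if history and diff > k * np.mean(history):   # adaptive threshold
                cuts.append(i)
            history.append(diff)
            history = history[-window:]                   # sliding window
        prev = frame
    return cuts
```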

Figure 4.5: MOS results for the Video clip content - codec combination H.263/AAC.

Video feature extraction

The focus of our investigation is on the motion features of the video sequences. The motion features can be used directly as an input to estimation formulas or models; both possibilities were investigated in [41], [32] and [3]. The investigated motion features concentrate on motion vector statistics, including the size distribution and the directional features of the motion vectors (MV) within one sequence of frames between two cuts. Zero MVs allow estimating the size of the still regions in the video pictures. That, in turn, allows analysing the MV features of the regions with movement separately. This particular MV feature makes it possible to distinguish between rapid local movements and global movement. Moreover, the perceptual quality reduction in the spatial and temporal domain is very sensitive to the chosen motion features, making them very suitable for reference free quality estimation, because a higher compression does not necessarily reduce the subjective video quality (e.g. in static sequences).


More information

Performance Analysis of Voice Call using Skype

Performance Analysis of Voice Call using Skype Abstract Performance Analysis of Voice Call using Skype L. Liu and L. Sun Centre for Security, Communications and Network Research Plymouth University, United Kingdom e-mail: info@cscan.org The purpose

More information

MUSIC A Darker Phonetic Audio Coder

MUSIC A Darker Phonetic Audio Coder MUSIC 422 - A Darker Phonetic Audio Coder Prateek Murgai and Orchisama Das Abstract In this project we develop an audio coder that tries to improve the quality of the audio at 128kbps per channel by employing

More information

Dr Andrew Abel University of Stirling, Scotland

Dr Andrew Abel University of Stirling, Scotland Dr Andrew Abel University of Stirling, Scotland University of Stirling - Scotland Cognitive Signal Image and Control Processing Research (COSIPRA) Cognitive Computation neurobiology, cognitive psychology

More information

Lecture Information. Mod 01 Part 1: The Need for Compression. Why Digital Signal Coding? (1)

Lecture Information. Mod 01 Part 1: The Need for Compression. Why Digital Signal Coding? (1) Multimedia Video Coding & Architectures (5LSE0), Module 01 Introduction to coding aspects 1 Lecture Information Lecturer Prof.dr.ir. Peter H.N. de With Faculty Electrical Engineering, University Technology

More information

QoE Characterization for Video-On-Demand Services in 4G WiMAX Networks

QoE Characterization for Video-On-Demand Services in 4G WiMAX Networks QoE Characterization for Video-On-Demand Services in 4G WiMAX Networks Amitabha Ghosh IBM India Research Laboratory Department of Electrical Engineering University of Southern California, Los Angeles http://anrg.usc.edu/~amitabhg

More information