PSYCHOPHYSICS AND MODERN DIGITAL AUDIO TECHNOLOGY

Philips J. Res. 47 (1992) 3-14 R1263

by A.J.M. HOUTSMA
Institute for Perception Research (IPO), P.O. Box 513, 5600 MB Eindhoven, The Netherlands

Abstract

Most of us today are quite familiar with digital sound through the compact disc (CD). The sound coding in CD technology is largely based on the simple psychoacoustic facts that our auditory system's frequency range is limited to about 20 kHz and its effective dynamic range for music not much more than 90 dB. This resulted in a bit rate of about 1.4 megabits s⁻¹. In some present applications such as the digital compact cassette (DCC) or in future applications such as digital audio broadcasting (DAB), these high bit rates pose serious technical problems. Considerable bit saving can be achieved, however, by (1) allowing quantization noise in such a way that it is always masked by the music signal, and by (2) not coding sound elements which are masked by other sound elements. Psychoacoustic tests have shown that thresholds for discrimination between full 16 bits/sample CD sound and variable-bit-rate DCC sound are somewhere between 2.5 and 3.0 bits/sample, depending on the type of music fragment and playback conditions.

Keywords: bit rate reduction, digital recording, masking, MUSICAM, sound quality.

1. Introduction

When we listen to the radio or to a compact disc, we perceive acoustical images which are, on the one hand, sufficiently realistic to be interesting and enjoyable but are, on the other hand, also easily distinguishable from the real situation. Hearing a symphony in high-fidelity stereo may be a real pleasure, but it is not the same as being in the concert hall. The difference between the sensation of a real event and a played-back image had in the past a lot to do with the relatively poor technical quality of the image.
The noisy mono AM radio broadcasts and the scratchy phonograph records of the 40s and 50s are examples for which many of us still remember how our imagination fills in the voids that exist in less-than-perfect sound representations. Technology has advanced over the years, however, from 78 rpm mono discs to 33 rpm stereo LPs and on to CDs, and from mono AM to stereo FM radio. With the compact disc in particular, we seem to have reached a new perceptual sound quality standard, in the sense that the public is very unlikely to accept any lesser sound quality in the future.

Fig. 1. Equal-loudness contours, according to Fletcher and Munson 1). [Axes: intensity level (dB) versus frequency in cycles per second.]

Historically, the development of sound technology has been primarily but not exclusively a matter of physics and engineering. Perceptual psychology or psychophysics has also played a significant role. The employment by Bell Telephone Laboratories in the USA of people such as Harvey Fletcher, Bela Julesz, and Roger Shepard indicates an awareness, at least at Bell Telephone, that knowledge of the working and operational limits of the human senses is an essential element in the development of high-quality communication equipment. Although few other companies developing radio, HiFi or telephone equipment had this foresight, the Philips Research Laboratories did have a pioneer in the field of psychophysics well before the Second World War. Professor Jan F. Schouten's almost solo effort was to result in 1957 in the founding of the Institute for Perception Research as a cooperative endeavour between Philips and Eindhoven University of Technology.

Broadly speaking, the role of perception research in the development of telecommunication and broadcasting equipment is twofold.
Firstly, this research provides fundamental knowledge about hearing on which designs of sound coding, transmission and representation can be based. An example of such a perceptual data base is the set of equal-loudness or iso-phone contours measured by Fletcher and Munson 1) at Bell Laboratories, and shown in fig. 1. Each contour represents the locus of intensities and frequencies of sinusoidal tones which subjectively sound equally loud. They were originally measured to obtain insight into the loudness summation of noise that interfered with the voice in telephone communication, but have since proved to be extremely relevant for the manner of processing sound in high-fidelity sound systems. In fact, it is difficult to find a stereo amplifier today that does not have a "loudness" button. This button, when engaged, activates a network of filters that have the same shape as the iso-phone contours, thus maintaining a proper subjective tone balance at any selected playback intensity.

The second function of perception research in the development process of audio equipment is that its methodology can be used for testing prototypes from a perceptual viewpoint during the research and development process. Tests comprising blind subjective comparisons, two-alternative forced-choice procedures and scaling methods, originally developed in perception laboratories for the study of auditory behaviour, are increasingly finding their way into industrial R&D laboratories and consumer organizations' test facilities for subjective performance evaluations of loudspeakers and other sound equipment. International organizations such as the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) have developed standards for some of these test procedures.

Section 2 contains a description of a recent development in sound coding technology in which psychoacoustics has played an essential role.
This technology will form the backbone of the digital audio broadcasting (DAB) system to be implemented in Europe after 1995; it is also used in the digital compact cassette (DCC) recorder recently developed at Philips. Although several international standards with respect to particular applications have been agreed upon, the technology is still under further development in a cooperative research effort by the Institut für Rundfunk Technik in Germany, Philips Research in the Netherlands, the Centre Commun d'Études de Télédiffusion et Télécommunications in France and, since recently, the Matsushita Electric Corporation of Japan. It is known under the name MUSICAM, an acronym for Masking-pattern Universal Sub-band Integrated Coding And Multiplexing. Detailed technical information can be found in the literature 2-5). Alternative technical approaches to the same fundamental objective are described by Johnston 6) and Brandenburg 7).

Section 3 illustrates the role psychoacoustics can play for testing prototypes from a perceptual viewpoint.

2. MUSICAM: bit-rate reduction without loss of sound quality

The problem which MUSICAM addresses can briefly be stated as follows. A compact disc (CD) player operates at a rate of two times 44100 samples of 16 bits each every second in order to obtain its high audio quality. The 44100 samples per second for each stereo channel are needed in order to reproduce faithfully frequencies up to 20 kHz, about the uppermost limit of human hearing. The 16 bits per sample are needed to allow coding of the instantaneous amplitude of the sound waveform in sufficiently fine steps to obtain a dynamic (amplitude) range of 90 dB. The question is whether the resulting high rate of 1411200 bits s⁻¹ is always absolutely necessary to obtain the desired high-quality sound. For an application such as the DCC, for instance, the requirement of backwards compatibility with analog tape cassettes, which entails a fixed tape head and a tape speed of 1 7/8 in s⁻¹, only allows a bit rate of less than half that of the CD. In the case of DAB the bit rate can be directly translated into transmission bandwidth and operating cost. A lower bit rate almost always saves money in the long run, even with the initial investments necessary to achieve it.

As it turns out, the high CD bit rate is not always necessary to obtain CD sound quality. The same perceptual quality can be obtained at much lower bit rates by reduction of redundancy and irrelevance in the sound signal to be coded, stored or transmitted. "Reduction of redundancy" simply means providing an efficient digital representation of a signal that does not contain more information than is necessary to reconstruct it exactly from the digital code. This is mostly a question of logic and mathematics, and does not involve any knowledge about hearing. "Reduction of irrelevance", on the other hand, means that quantization noise, which is a necessary byproduct of digital sound representation and is inversely related to the number of bits by which samples are represented, is allowed to such a level that it just fails to be heard. It also means that only those features of a sound which are audible are coded. MUSICAM primarily addresses reduction of irrelevance and is therefore intricately based on fundamental knowledge of our hearing system.

2.1. Quantization noise, masking and subband coding

Quantization noise is a direct consequence of the fact that the amplitude of an audio sample is digitally represented by a discrete number taken from a limited set of integers. The smaller this set is, the higher will be the level of the quantization noise. A crude rule of thumb is that $L_{qn}$, the sound pressure level of the quantization noise in decibels, is given by the expression:

$$L_{qn} = L_{sm} - 20 \log_{10} 2^{n} \qquad (1)$$

where $L_{sm}$ is the maximum sound pressure level (in decibels) that can be reached by the digital sound converter, and $n$ is the number of bits used in the conversion.

Fig. 2. Threshold level ($L_T$) of a test tone in the quiet and in the presence of a masking sound comprising narrow bands of noise centered around the frequencies $f_m$ (250, 1000 and 4000 Hz) having equal power (according to Zwicker and Feldtkeller 8)). The horizontal line illustrates the broadband spectrum of digital quantization noise. [Axes: $L_T$ (dB) versus test-tone frequency $f_T$ (kHz).]

Quantization noise is broadband and may therefore occur at frequencies far away from the signal frequencies that are being played. Figure 2 shows the average human hearing threshold and also shows how this threshold is elevated in the presence of a sound signal. In this case the sound consists of three very narrow bands of noise, centered around 250, 1000 and 4000 Hz, having equal power. The resulting threshold curve, i.e. the limit of audibility for all other tones in the presence of these three noise bands, shows a pattern that is locally elevated in an asymmetric manner, with low-frequency slopes about twice as steep as the high-frequency slopes. If the masker, which can be thought of as a simple music signal, is represented digitally, an amount of spectrally flat quantization noise will be generated, which is also shown in the figure. The representation of this quantization noise can be thought of as the noise power in 1 Hz wide bands and can therefore be directly compared at each frequency with the masked threshold curve caused by the signal. One can easily see that, if the digital steps taken to encode the signal amplitude are too large, quantization noise may become audible in the deep valleys between the tone frequencies. Such situations can occur when 8-bit or even 12-bit digital signal representations are used since, according to eq. (1), quantization noise will then be 48 or 72 dB below the maximum sound levels. In CD this level difference is more than 90 dB, rendering it very unlikely that under normal playback conditions quantization noise will ever be heard.
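The figures quoted in this section can be checked with a few lines of Python. The sketch below recomputes the CD bit rate from the sampling parameters given above and evaluates eq. (1) for 8, 12 and 16 bits. The function names are illustrative only and are not taken from any MUSICAM implementation.

```python
import math


def pcm_bit_rate(sample_rate_hz: int, bits_per_sample: int, channels: int = 2) -> int:
    """Bit rate of uncompressed PCM audio in bits per second."""
    return channels * sample_rate_hz * bits_per_sample


def quantization_noise_margin_db(n_bits: int) -> float:
    """L_sm - L_qn according to eq. (1): 20 * log10(2**n), about 6.02 dB per bit."""
    return 20.0 * math.log10(2 ** n_bits)


# CD: two channels of 44100 samples per second, 16 bits each.
print(pcm_bit_rate(44100, 16))

for n in (8, 12, 16):
    margin = quantization_noise_margin_db(n)
    print(f"{n:2d} bits: noise {margin:.1f} dB below the maximum level")
```

The margins come out at roughly 48, 72 and 96 dB, in line with the 48 dB, 72 dB and "more than 90 dB" stated in the text.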

It is also apparent from fig. 2 why our ears are so sensitive to quantization noise. If we could manage to shape the spectrum of this noise according to the spectrum of the signal, we could allow much larger amounts of quantization noise without it actually being heard. MUSICAM achieves this by first passing the signal through a set of bandpass filters, similar to the filtering process that takes place in our ears. The optimal way to choose these filters appears to be in accordance with the critical bands of our hearing system 9). The output of each of these filters, i.e. each spectral slice of the signal, is then coded separately into digital format. This limits quantization noise to that particular filter band. The advantage of this subband coding scheme is that it allows fairly precise control of the amount of quantization noise in each of the subbands, which, if properly implemented, yields a noise spectrum similar to the masking pattern of the signal. Such an "ideal" situation is illustrated in fig. 3.

Fig. 3. Same as in Fig. 2, but quantization noise allowed in 24 subbands. (From Stoll et al. 2)) [Axes: $L_T$ (dB) versus frequency (kHz), with subband numbers along the top.]

In practice, however, it is much easier to make digital filters with constant bandwidth. The MUSICAM standard as applied to DCC and DAB therefore uses a bank of 32 filters of equal bandwidth. This bandwidth, which is half the sampling rate divided by 32, comes out somewhere around 700 Hz, dependent on the exact sampling rate used. An example is shown in fig. 4 (see Sec. 2.2).

2.2. Dynamic bit allocation

The typical spectra of music or speech, simplistically represented in figs 2 and 3 as stationary functions, should actually not be thought of as being stationary. The filtering process performed by our ears is a spectral analysis performed over a very short sliding time window that runs from about 5 to 15 ms in the past up to the present time. In DCC applications the signal to be coded is similarly divided up into successive time frames of 8 ms, and for groups of three successive frames a signal spectrum is computed. In the simplest form this spectrum is no more than a set of 32 numbers representing the amounts of short-term signal energy in each subband. In DAB applications of MUSICAM a 1024-point fast Fourier transform is computed every 24 ms, parallel to the computation of the signal energies in each subband.

From the "instantaneous" spectrum a masking function is determined based on fundamental psychoacoustic rules and models. These masking rules mostly involve simultaneous masking, i.e. masking effects that occur within one time frame, but could in principle also incorporate forward and backward masking, i.e. masking effects of the signal in the present frame on the noise in the next or in the previous frame. The masking function obtained for a particular time frame now allows bit allocation for the signal in each subband of that frame according to the following rules:

(a) If the amount of signal energy in a subband falls below the masking threshold, that portion of the signal will be inaudible and is allocated 0 bits (i.e. it is not coded).
(b) In all other subbands enough bits should be allocated to yield a level of quantization noise just below the masking threshold. "Just below" implies a certain safety range known as the "mask-to-noise reserve".

The result of coding a fragment of a vowel sound /ə/ (as in the word "battle") is shown in fig. 4. One sees that at around 3 kHz some harmonics of this vowel fall below the masking threshold and are therefore not coded. Quantization noise has been kept about 5 dB below masked threshold in each subband.
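Rules (a) and (b) can be sketched in a few lines of Python. This is a minimal illustration and not the MUSICAM algorithm itself: it assumes that the rule of thumb of eq. (1) holds within each subband, with the subband signal level taking the place of the maximum level. The function names, the example levels and the 16-bit cap are hypothetical; the 5 dB default reserve follows the fig. 4 example.

```python
import math

DB_PER_BIT = 20.0 * math.log10(2.0)  # about 6.02 dB per bit, from eq. (1)


def subband_width_hz(sample_rate_hz: float, n_subbands: int = 32) -> float:
    """Width of each equal-bandwidth subband: half the sampling rate divided by 32."""
    return (sample_rate_hz / 2.0) / n_subbands


def allocate_bits(signal_db, threshold_db, reserve_db=5.0, max_bits=16):
    """Return a bit count per subband following rules (a) and (b).

    Assumption: noise level = signal level - n * DB_PER_BIT in each subband.
    """
    bits = []
    for sig, thr in zip(signal_db, threshold_db):
        if sig < thr:
            # Rule (a): signal is masked, so it is not coded at all.
            bits.append(0)
            continue
        # Rule (b): smallest n that puts the noise at or below
        # the masked threshold minus the mask-to-noise reserve.
        needed_db = sig - (thr - reserve_db)
        n = math.ceil(needed_db / DB_PER_BIT)
        bits.append(min(max(n, 1), max_bits))
    return bits


# Hypothetical levels (dB) for four subbands of one frame.
signal = [60.0, 40.0, 20.0, 55.0]
threshold = [30.0, 45.0, 10.0, 50.0]
allocation = allocate_bits(signal, threshold)
print(allocation)
print(sum(allocation) / len(allocation), "bits/sample on average")
print(subband_width_hz(44100), "Hz per subband at 44.1 kHz")
```

With a 44.1 kHz sampling rate the subband width is about 689 Hz, consistent with the "around 700 Hz" mentioned above. In this sketch the second subband receives no bits because its signal level lies below the masked threshold.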
Presumably, if the psychoacoustical laws about masking of noise by tones were better known than they are today, more precise estimates could be made and the mask-to-noise reserve could be decreased for further bit savings.

Because spectral analysis, threshold computation and bit allocation are done for very short signal segments, the coding system is dynamic and can keep up with all temporal (transient) and spectral details of a speech or music signal at least as well as our ears can.

Fig. 4. Amplitude spectrum (sound pressure level, SPL) of the vowel /ə/, masking pattern $L_T$, and quantization noise, resulting after coding by the 700 Hz constant-bandwidth MUSICAM system. (From Stoll and Wiese 4)) [Axes: subband number and frequency (kHz).]

3. How does it sound?

As mentioned in the introduction, psychoacoustics not only provides essential ground rules for the coding algorithm of MUSICAM, but can also be used to test its performance. From a fragment of music recorded on CD or DAT one can produce a series of versions, using the MUSICAM coding scheme, that run at a progressively decreasing bit rate and therefore contain more and more quantization noise. In terms of fig. 4 this means that the mask-to-noise margin is made progressively smaller. It can even reach negative values when the noise levels exceed the masked threshold levels, in which case the noise will be audible.

3.1. Perception experiment

In a two-interval two-alternative forced-choice (2I2AFC) test procedure listeners hear two sequential music fragments, one taken directly from the CD and the other with a reduced bit rate, and have to respond whether the CD version came first or second. Feedback of the correct answer is provided after each trial. When the bit rate of the reduced version is high, for instance close to 16 bits/sample, the fragments are presumably indistinguishable and 50% of the responses will be correct (chance level). When the bit rate is lowered, the difference becomes audible and the score will asymptotically approach 100% correct. The resulting function, called the "psychometric function", shows the percentage of correct responses as a function of the independent experimental variable, the bit rate. Such a 2I2AFC blind listening test was performed with six subjects and two different music fragments, using an adaptive DCC coding application of MUSICAM as it had been developed by the summer of 1990. Figure 5 shows a psychometric function produced by one subject for a 3 s tenor and orchestra fragment taken from Mozart's Requiem. The bit rate corresponding to a performance of 75% correct is usually taken as the discrimination threshold.

Fig. 5. Psychometric function of one listener for a music fragment from Mozart's Requiem. Sound was presented in stereo through broadband insert (ER-2) earphones. Coding was according to DCC protocol. [Axes: percent correct versus average bit rate (bits/sample).]

Such thresholds can also be found without measuring the entire psychometric function by following a so-called "adaptive" procedure 10). Subjects respond to two sequential 2I2AFC trials, after which an immediate evaluation is made. If both responses are correct, the bit rate is increased by one step, i.e. the task is made a little more difficult for the next two trials. If one or both responses are incorrect, the bit rate is decreased by one step, making the task easier. Such an adaptive procedure can be shown to converge to a bit-rate level which corresponds to a score of 71% correct. Adaptive thresholds of several subjects, measured for two different music signals (the Mozart Requiem fragment and a simple C4-E4 interval played on a viola without accompaniment), are shown in fig. 6. In all of these experiments the dynamic bit allocation was done in the same manner as it is being implemented in DCC, i.e. with subband filters of constant 689 Hz bandwidth, with masking threshold functions computed directly from the amounts of energy in the various subbands during 24 ms time frames, and using only simultaneous masking.

Fig. 6. Adaptive discrimination thresholds for two music fragments and groups of 6 and 4 listeners. Coding was according to DCC protocol. Averages (AV) and standard deviations (SD) are indicated for each group. [Vertical axis: average bit rate (bits/sample). Values shown: tenor and orchestra, AV 2.48 bits/sample, SD 0.26 bits/sample; viola, AV 3.16 bits/sample, SD 0.09 bits/sample.]

One can generally observe that:

(a) The psychometric function of fig. 5 is rather steep, indicating that most of the transition from perfect discriminability to total indiscriminability happens within the span of 1 bit/sample.
(b) Discrimination thresholds vary somewhat between subjects, but vary much more between the two music fragments that were studied. A higher bit rate is necessary to represent the viola sound adequately because this fragment contained most of its acoustical energy in the two lowest subbands. These subbands are, in the present protocol, considerably wider than the corresponding critical bands in human hearing.
(c) The average bit rate to be used in the DCC, 4 bits/sample or roughly 353000 bits s⁻¹, seems sufficient to ensure a subjective sound quality as good as that of CD music, at least for the fragments of music tested so far.

DCC performance tests with much more varied program material, executed with professional listeners by the Product Division Consumer Electronics, are now indicating that, at a fixed average rate of 4 bits/sample, these listener groups hardly ever score significantly better than chance level when asked to distinguish blindly between frozen CD and DCC music fragments.

3.2. Physical versus psychological measures

Everyone involved in the sale of audio and video equipment knows that physical performance specifications play an important and sometimes dominant role in the choices people make. Someone may readily be willing to pay twice as much for an audio amplifier which extends to 100000 Hz compared with another that has a frequency response up to only 50000 Hz, despite the fact that this difference is perceptually quite irrelevant. The bit-rate reduction scheme, when implemented commercially, might cause an acute marketing dilemma. From the publicity around CD technology the public has probably concluded that a signal-to-noise (S/N) ratio of at least 90 dB is necessary to obtain a "good" sound. If the S/N ratio of the sound from a DCC recorder or a future DAB receiver is physically measured, one may find a value of somewhere between 10 and 20 dB. This is because, as was explained earlier, quantization noise is purposely allowed to a level just below the audible. Should then the public, including the professional reviewers of HiFi equipment, be re-educated to put more trust in psychological, perceptual criteria rather than in the hard physical performance specifications? Or should new physical test equipment be developed that measures, for instance, not physical noise but audible noise? The speech transmission index (STI) and its simplified version, the rapid speech transmission index (RASTI) 11,12), are examples of an apparently well-functioning physical measure of a subjective, psychological attribute of sound, in this case the intelligibility of speech in noisy and reverberant environments. The development of a device that measures the true mask-to-noise reserve would perhaps be an adequate solution, but such a device would only be reliable if we knew precisely how to model the filtering and masking operation of our hearing system for complex and dynamic sounds. As long as this knowledge is less than complete, the best thing to do is to keep pointing at the greater reliability of psychoacoustical measures compared with physical measures.

4. Conclusions

The applications of MUSICAM to DAB and DCC are good examples of consumer-oriented high-tech developments which have drawn from the fields of signal processing mathematics, engineering, perceptual psychology and marketing.
Because they are solidly based on fundamental knowledge of the functioning of our hearing system, they provide a reliable source of information for rational decisions when, in a particular application, trade-offs have to be made between perceptual quality, technical feasibility, market requirements and costs. They could be models for many technical developments in the future that involve interaction between man and machine.

Acknowledgements

DCC-coded music material for listening tests was provided by R. Veldhuis and R. v.d. Waal. Helpful discussions with R. Veldhuis and P. de Wit concerning the manuscript are gratefully acknowledged.

REFERENCES

1) H. Fletcher and W.A. Munson, J. Acoust. Soc. Am., 5, 82-108 (1933).
2) G. Stoll, G. Theile and M. Link, MASCAM; using psychoacoustic masking effects of low-bitrate coding of high quality complex sounds, in Structure and Perception of Electroacoustic Sound and Music, eds S. Nielzén and O. Olsson, Elsevier, Amsterdam, 1989.
3) R.N.J. Veldhuis, M. Breeuwer and R.G. van der Waal, Philips J. Res., 44, 329-343 (1989).
4) G. Stoll and D. Wiese, High-quality audio bit-rate reduction considering the psychoacoustic phenomena of human sound perception, in Proc. Int. Symp. on Subjective and Objective Evaluation of Sound, ed. E. Ozimek, World Scientific, London, 1990.
5) G. Stoll and Y.F. Dehery, High-quality audio bit-rate reduction system family for different applications, Proc. IEEE Int. Conf. on Communications, Atlanta, GA, USA, 322.2, pp. 937-941, 1990.
6) J.D. Johnston, IEEE J. Selected Areas Commun., 6, 314-323 (1988).
7) K. Brandenburg, High quality sound coding at 2.5 bit/sample, AES Preprint 2582, 1988.
8) E. Zwicker and R. Feldtkeller, Das Ohr als Nachrichtenempfänger, Hirzel, Stuttgart, 1967.
9) B.C.J. Moore and B.R. Glasberg, J. Acoust. Soc. Am., 74, 750-753 (1983).
10) H. Levitt, J. Acoust. Soc. Am., 49, 467-476 (1970).
11) T. Houtgast, H.J.M. Steeneken and R. Plomp, Acustica, 46, 60-72 (1980).
12) P.V. Brüel, Intelligibility in classrooms, in Proc. Int. Symp. on Subjective and Objective Evaluation of Sound, ed. E. Ozimek, World Scientific, London, 1990.

Author

A.J.M. Houtsma: State Diploma A (Music), Municipal School of Music, Arnhem, The Netherlands, 1961; B.A. degree (Theology), Augustinian School of Theology, Nijmegen, The Netherlands, 1963; S.B. degree (Electrical Engineering), Villanova University, USA, 1965; S.M. degree (Electrical Engineering), Massachusetts Institute of Technology (MIT), USA, 1966; Ph.D., MIT, USA, 1971; MIT Departments of Electrical Engineering and Humanities, 1971-1982; research staff of the Hearing and Speech Department of the Institute for Perception Research, Eindhoven, 1982; Professor of Psychoacoustics and its Technical Applications at the Eindhoven University of Technology, 1989.