An auditory localization model based on high frequency spectral cues. Dibyendu Nandy and Jezekiel Ben-Arie


An auditory localization model based on high frequency spectral cues

Dibyendu Nandy and Jezekiel Ben-Arie
Department of Electrical Engineering and Computer Science
The University of Illinois at Chicago

Running Head: Localization using spectral cues
MS: 943   Revision: 3

Contact Address:
Dr. Jezekiel Ben-Arie
The University of Illinois at Chicago
Department of Electrical Engineering and Computer Science (M/C 154)
851 South Morgan Street
Chicago, IL
Phone: (312)   Fax: (312)
benarie@eecs.uic.edu

Abstract

We present in this paper a connectionist model that extracts interaural intensity differences (IID) from head related transfer functions (HRTF) in the form of spectral cues to localize broadband high frequency auditory stimuli, both in azimuth and elevation. A novel discriminative matching measure (DMM) is defined and optimized to characterize the matching of this IID spectrum. The optimal DMM approach, as well as a novel back-propagation based fuzzy model of localization, are shown to be capable of localizing sources in azimuth using only spectral IID cues. The fuzzy neural network model is extended to include localization in elevation. The use of training data with additive noise provides robustness to input errors. Outputs are modeled as two dimensional Gaussians which act as membership functions for the fuzzy sets of sound locations. Error back-propagation is used to train the network to correlate input patterns with the desired output patterns. The fuzzy outputs are used to estimate the location of the source by detecting Gaussians using the max-energy paradigm. The proposed model shows that HRTF based spectral IID patterns can provide sufficient information for extracting localization cues using a connectionist paradigm. Successful recognition in the presence of additive noise in the inputs indicates that the computational framework of this model is robust to errors made in estimating the IID patterns. The localization errors for such noisy patterns at various elevations and azimuths are compared and found to be within the limits of localization blurs observed in humans.

INDEX TERMS: auditory localization, pattern recognition, discriminative matching measure (DMM), fuzzy neural networks, back-propagation

1 Introduction

The physics of sound propagation and the anatomy of the head produce interaural time difference (ITD) cues and interaural intensity difference (IID) cues at the two ears. The auditory system has been shown to be sensitive to the presence of these cues. It is well known that these ITDs and IIDs are the two primary cues which help in localization [3]. Most models of localization have been based on the processing of ITD cues, by extracting a measure of similarity between the inputs at the two ears, with minor significance attached to the variation of the IID with frequency. These models include the approaches based on signal correlation [3, ] and its extensions. Other approaches include auditory nerve based models, Durlach's equalization and cancellation model, and count comparison models (suggested by von Bekesy), which are elaborated in [5]. The ITD cues have been shown to be of major importance in the localization process [3, 5]. However, the complete localization phenomenon cannot be explained solely by ITDs. Sound sources located in the medial sagittal plane relative to the head, or on the cone of confusion, cannot be localized using only ITDs. In addition, the gradual loss of synchrony of auditory nerve fibers for signal frequencies above about 1.3 kHz [26], together with a declining ability to detect envelope interaural phase differences (EIPDs) for modulation frequencies above 500 Hz and carrier frequencies above 4 kHz [4], suggests that ITD cues cannot be reliably estimated for high frequency sound sources. Furthermore, the ability of humans to adapt to conditions of localizing monaurally [] and the strong influence of the spectral content of narrowband noise sources on localization [5] imply that the localization mechanism must also use spectral information. It is well known that primarily the ITD cues, and to some extent the head shadow effect on IID cues, dominate azimuthal localization at low frequencies. Consequently, it has been suggested that spectral cues are extracted monaurally for localization only in elevation [4]. This is an oversimplification of a complex localization task. It is now accepted that the pinna, head and torso provide important localization cues by spectrally modifying the sound

signal received at the tympanic membrane according to the direction of origin [4, 2, 23]. Psychoacoustic experiments have also indicated that high frequency spectral cues are essential for a genuine spatial sensation of sound, which includes the outside-the-head sensation or externalization [8]. The ability to localize high frequency broadband sound, as well as externalization, is most effective for binaural audition and becomes less so for monaural audition. This indicates that there exist means to process binaural spectral cues for localization in the human auditory system. The spectral modifications imposed on the free-field source emanating from a particular direction can be estimated by identifying the transfer function of the contributing effects of the pinnae, head and torso [2, 23]. This transfer function has been called the head related transfer function (HRTF). The HRTF is able to encode directional spectral cues, which can be detected by the human localization system [22, 23]. The cochlea acts as a set of bandpass filters and channels incoming stimuli to excite auditory nerve fibers which have bandlimited transfer characteristics [25, pp. 65, 86]. This has been modeled by Lyon [2], Yang et al. [24] and others. The nerve fibers of the inner hair cells pick up the displacement of the basilar membrane. These fibers have been shown to have discharge rates which are proportional to the stimulus intensity (in dB) over a sizeable range [2] [25, pp. 84-86]. These nerve fibers are tonotopically arranged [7] [25, pp. 99-2], i.e., they are spatially arranged in the order of the characteristic frequencies (CFs) to which they are tuned to respond maximally. This indicates that frequency dependent intensities of stimuli are encoded in the nerve discharge rates of the peripheral auditory system, and are thus available for localization. Neti et al. [7] demonstrated the use of neural network models using spectral IID cues in both monaural and binaural form. However, their approach of using HRTF amplitude spectra assumed a source with a flat spectrum. Localization when the sound spectrum was not flat was not explained. Reported results [7] indicated that azimuthal and elevation coding in the HRTF could not be independent. A random training set of HRTFs was used to try and show that it is possible to generalize the localization phenomenon by using the smoothness

property of the HRTFs as they vary in elevation and azimuth. Neti et al. were able to show that response maps of hidden and output layer units were similar to observed responses of neurons in the superior colliculus (SC). However, actual results of individual localization experiments were not reported. The mean errors in localizing using the test set were shown to be rather large. In this paper, we present a connectionist approach that can successfully estimate the direction of a source from spectral IID cues. We do not claim this model to be physiologically exact, but rather that it can represent a systematic approach to localization using spectral IID cues. It has long been known that human perception of intensity is approximately logarithmic [25, pp. 63-64] and therefore perceived loudness is measured in decibels (dB). This perceptual relation has been backed by physiological observations. Measurements made in the auditory (VIIIth) nerve have shown that discharge rates in auditory nerve fibers are proportional to input stimulus intensity levels [2] [25, pp. 84-86]. Sachs et al. [2] showed that the discharge rates of nerve fibers for single tone stimuli are approximately proportional to the sound intensity (in dB). This relationship was shown to hold over a sizeable range, from the threshold levels of the individual neurons to saturation levels at least 30 dB above these threshold levels. Sound intensities were measured on a decibel scale. Thus firing rates of the auditory nerve fibers were shown to be proportional to the logarithm of the actual sound stimulus (pressure) level. The difference of the intensities of the stimuli at the two ears results in an IID characterized by frequency. This hypothesis is motivated by the tonotopic arrangement of nerve fibers in the auditory nervous system [7] [25, pp. 99-2] and the response of the cells in the lateral superior olive (LSO) [4, 7]. The majority of the LSO units are most sensitive to mid-to-high frequency stimuli, having characteristic frequencies (CFs) greater than about 2 kHz. They are excited by stimuli at the ipsilateral ear and inhibited by stimuli at the contralateral ear [4, 7]. This response characteristic is termed E/I, and can be modeled as the difference in intensities, i.e., the IID, at a given frequency. As shown in Section 2, such a model effectively removes the dependence on the source spectrum while retaining spectral

features due to pinnae filtering. The result is a source invariant internal representation of IIDs as a function of frequency and location. In Section 3, the process by which the internal IID spectral patterns are computed from the HRTF data is discussed. We show in Section 5 that, by using a training set generated from HRTFs uniformly distributed over the auditory space, it is possible to achieve adequate generalization and thus accurately localize sources from HRTF based IID data not used in training. An approach based on such IIDs to estimate localization was simultaneously investigated by Duda [6]. Duda used information theoretic concepts based on the maximum likelihood estimation procedure. An implementation of that scheme by means of a neural network is not obvious. We use optimization techniques to shape the localization response to be unimodal and sharp even in the presence of noise. The approach is simple and is implemented using connectionist models with typical correlative (multiply with weights and add) neuron models. We use several algorithms for calculating suitable weights (correlation, optimization of a novel discriminative matching measure (DMM), error back-propagation) for networks with single or multiple layers, with and without output nonlinearities (Section 4). We show that azimuthal localization is also possible using high frequency IID data. In later sections, localization in both azimuth and elevation is demonstrated using the back-propagation based fuzzy neural network model. The results of the simulations are presented in Section 6 and are compared with the observations of Makous et al. [3] on their experiments with human localization. Section 7 concludes this manuscript with a discussion of our experiments, results and the inferences that can be drawn from the model presented.

2 A Localization Model based on HRTFs

The anatomical structures of the pinna, head and torso affect incident acoustic signals by modifying their phase and amplitude spectra. Differences in the acoustic path lengths at all frequencies lead to ITDs between the perceived auditory signals at the two ears. Head shadow effects and multiple reflections off the pinnae lead to IIDs and interaural phase

differences (IPD) which vary as a function of frequency. The effects of such head related filtering are captured by modeling the transfer characteristics of the incoming signal path as a linear filter from the source to the ear canal. The filter thus obtained is a direction dependent linear filter called the head related transfer function (HRTF). The HRTF model implicitly includes IIDs in its amplitude spectrum. As outlined by Wightman et al. [23], Wenzel [22] and others, the HRTF depends upon the directional parameters of the azimuth and elevation of the sound source relative to the head location. In the following discussion, localization of a broadband signal source in anechoic environments is considered. It is assumed that the source is stationary relative to the head. The stimulus is assumed to be a signal having non-zero components over a broad band of frequencies \omega \in [2\,\mathrm{kHz}, 20\,\mathrm{kHz}], located at an azimuth \theta and elevation \phi and a constant distance r from the center of the head. With an assumed constant distance r, the elevation and azimuth are akin to latitudinal and longitudinal mappings of a spherical auditory space, as described by Wightman and Kistler [23]. The stimulus spectrum at the left tympanic membrane can be expressed as

X_L(\omega, \theta, \phi, t) = H_L(\omega, \theta, \phi)\, X(\omega, t),    (1)

where X(\omega, t) is the short time Fourier transform of the free-field sound source from the direction (\theta, \phi), and H_L(\omega, \theta, \phi) is the HRTF of the left ear associated with the direction (\theta, \phi). The spectrum at the right tympanic membrane has a similar form using the corresponding HRTF H_R(\omega, \theta, \phi) for the right ear. The vibrations of the tympanic membrane are transduced through the middle ear bones to the cochlea. The stimulus energy is transferred to the basilar membrane, which carries the stimulus as traveling waves. The basilar membrane can be modeled as a set of bandpass filters [2, 24] [25, pp. 6-66]. The center frequencies of these bandpass filters are approximately logarithmically spaced on the basilar membrane. The logarithmic spacing of the basilar membrane filters is modeled by a space-frequency variable s corresponding to the stimulus frequency \omega_s. Thus if the response of the basilar membrane is

sampled at n frequencies spaced logarithmically between \omega_0 and \omega_{n-1} and indexed by s,

\omega_s = \omega_0 \, e^{\frac{s}{n-1}\ln\!\left[\frac{\omega_{n-1}}{\omega_0}\right]}, \quad s = 0, 1, \ldots, n-1; \quad \omega_0 = 2\,\mathrm{kHz}, \;\; \omega_{n-1} = 20\,\mathrm{kHz}.    (2)

The mechanical motion of the basilar membrane at a particular location is picked up by the inner hair cells (IHC) of the organ of Corti. Auditory nerve fibers which end in the IHC fire in proportion to the neurotransmitter release caused by the depolarization of the IHC. These fibers have been shown to have firing rates which are proportional to the intensity of the stimulus (in dB) [2][25, pp. 84-86]. The proportional range of response varies from the threshold levels of the individual neurons to their saturation levels, which are on average about 30 dB above the threshold levels. The intensity due to a bandpass (basilar membrane) filter H_{BM}(s, \omega) with characteristic frequency \omega_s can be given by

I_L(s, \theta, \phi, t) = \log \int \left| X_L(\omega, \theta, \phi, t)\, H_{BM}(s, \omega) \right|^2 d\omega.    (3)

When the filters H_{BM} are sufficiently narrow, we can assume that H_{BM}(s, \omega) \approx \delta(\omega - \omega_s) and the spectrum of the intensity level carried by these fibers is given by

I_L(s, \theta, \phi, t) \approx \log \left| H_L(\omega_s, \theta, \phi)\, X(\omega_s, t) \right|^2.    (4)

The modeled intensity spectrum given by Eq. (4) forms the auditory representation in the auditory nerve fibers which synapse with the left cochlear nucleus. A more sophisticated treatment, as done by Yang et al. [24], also gives rise to a similar form of representation in the auditory nerve, which they call the auditory spectrum. As the frequency content of the stimulus changes over time, so will the actual discharge pattern of the nerve fibers. Let us consider a specific time frame in the localization process. Some of the nerve fibers from the ipsilateral and contralateral cochlear nuclei synapse at the superior olivary complex (while others may pass without synapse to the inferior colliculus). The majority of the units in the lateral superior olive (LSO) have been observed to exhibit an E/I response [4, 7]. The E/I response can be modeled as a binaural difference operator which evaluates the difference between ipsilateral and contralateral stimuli at the input of the

LSO units. From Eq. (4) corresponding to the ipsilateral cochlea and its equivalent for the contralateral cochlea, the output of an LSO unit can be modeled as being dependent on the IID at its characteristic frequency. Here, we consider \omega \in [2\,\mathrm{kHz}, 20\,\mathrm{kHz}], the approximate range of the frequency response of the LSO units. We drop the variable t as we model a specific time frame during the localization process. Thus,

h(s, \theta, \phi) = I_I(s, \theta, \phi) - I_C(s, \theta, \phi),    (5)

where I_I \equiv I_L and I_C \equiv I_R. h(s, \theta, \phi) is the difference in the intensities of the ipsilateral and contralateral stimuli at the frequency \omega_s (Eq. (2)). h(s, \theta, \phi) is an internal representation of the IID, processed through the cochlea, the auditory nerve, the cochlear nucleus and the LSO, at that frequency. From Eq. (5), Eq. (4) and Eq. (1) we get

h(s, \theta, \phi) = \log \left[ \frac{|H_I(\omega_s, \theta, \phi)\, X(\omega_s)|}{|H_C(\omega_s, \theta, \phi)\, X(\omega_s)|} \right]^2 = 2 \log \frac{|H_I(\omega_s, \theta, \phi)|}{|H_C(\omega_s, \theta, \phi)|}.    (6)

The internal IIDs h(s, \theta, \phi), over the range of the space-frequency index s, form the vector

h(\theta, \phi) = [h(0, \theta, \phi),\; h(1, \theta, \phi),\; \ldots,\; h(s, \theta, \phi),\; \ldots,\; h(n-1, \theta, \phi)]^T,    (7)

which we shall refer to as the internal IID spectrum. In the process outlined above we have assumed the bandpass filtering effects of the peripheral auditory system to have very narrow pass bands. We have also ignored saturation effects of the nerve discharge rates. Thus the internal IID spectrum represents an idealization of the modeled nerve responses. To compensate partially for this idealization, we model variations from the input stimulus as estimation errors. The estimate of the internal IID spectrum may be expected to have variations with changes in the input stimulus or in the firing rates of the nerve fibers or due to other factors. In Appendix A, we show that small additive distortions in the intensity encoding lead to corresponding additive distortions in the internal IID (Eq. (8)). Thus the estimated internal IID spectrum may be modeled as

h_e(\theta, \phi) = h(\theta, \phi) + \eta.    (8)
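The internal IID spectrum defined by Eqs. (2), (6) and (7) can be sketched numerically. The following Python fragment is only an illustration under assumptions: the helper names, the use of a 20 log10 (dB) scale, and the simple linear interpolation of the HRTF magnitudes onto the cochlear grid are ours, not the authors'.

```python
import numpy as np

def log_frequency_grid(n=128, f_lo=2e3, f_hi=20e3):
    """Eq. (2): n frequencies spaced logarithmically between 2 kHz and 20 kHz."""
    s = np.arange(n)
    return f_lo * np.exp(s / (n - 1) * np.log(f_hi / f_lo))

def internal_iid_spectrum(freqs, H_ipsi_mag, H_contra_mag, grid):
    """Eqs. (4)-(7): intensity difference of the two ears on the log-frequency grid.

    freqs       : frequencies (Hz) at which the HRTF magnitudes are known
    H_*_mag     : |H(omega)| for the ipsilateral / contralateral ear
    grid        : log-spaced frequencies omega_s from Eq. (2)
    """
    # interpolate the magnitude responses onto the cochlear grid
    Hi = np.interp(grid, freqs, H_ipsi_mag)
    Hc = np.interp(grid, freqs, H_contra_mag)
    # Eq. (6): h(s) = 2 log |H_I| / |H_C|, written here on a dB scale (assumption)
    return 20.0 * np.log10(Hi / Hc)
```

Applying the function to the left and right HRTF magnitude responses of one direction yields one internal IID spectrum, i.e., one column of the template library used in the following sections.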

It has been observed that the HRTF magnitude spectrum varies in a relatively smooth manner in both azimuth and elevation [22, 6, 16]. The smoothly varying nature of the HRTFs constrains the internal IID spectrum to be a continuous smooth function of azimuth and elevation for a particular frequency. We assume that the internal IID spectrum for different azimuths and elevations has the information to define the auditory space for a human. Our model is based on this hypothesis. Thus the auditory system is modeled as being able to extract an internal representation of the direction-dependent IID spectrum in real time from incoming audio signal spectra. The internal IID spectrum is invariant to the source signal spectrum, but depends on the HRTF encoded directionally varying spectral cues. If x(t) is a broadband signal with nonzero frequency components, then the above formulation applies over the entire frequency band under consideration, and spectral pattern recognition (and thus localization) is possible, independent of the free-field signal x(t).

Figure 1: The proposed binaural localization model using an internal representation of spectral IID cues encoded in the head related transfer function. Ipsilateral and contralateral intensity-sensitive units feed IID responsive neural units, which project through a hidden layer of the localization network onto an auditory space map organized by azimuth and elevation.

The internal IID spectrum h_e(\theta, \phi) has unique characteristics for every direction (\theta, \phi) [6, 16].

A feed forward network which projects such an HRTF dependent internal IID spectrum onto a set of connection weights can extract the direction of a sound source. These weights correspond to a set of localization filters. Such localization filters may be computed using several optimization techniques discussed below. We hypothesize that in the biological localization systems of humans and other animals, connection strengths (weights) may be adaptively formed by a combination of reinforcement and supervised training induced by auditory and visual cues as well as other sensory data. Such feedback learning would be optimal in the context of minimizing localization error. Thus our optimization techniques are also geared towards characterizing and minimizing the localization error. Figure 1 shows a diagram of an artificial neural network model with one hidden layer. The output response of such a network is modeled as activity in the units corresponding to the estimated azimuth and elevation. This output array thus forms a map which models the auditory space.

3 Computing the internal IID spectrum

The internal IID spectral patterns are derived from the HRTF data set (SDO MP44.DAT) of a good human localizer used in the Convolvotron spatial audio system [23, 22]. The SDO data provide HRTFs for 12 azimuthal directions 30° apart, from -180° to +150°, at each of 6 elevations 18° apart, from -36° to 54°, as finite impulse responses with 128 coefficients each. Fig. 2, corresponding to the left LSO nucleus, shows the internal IID spectral data for the elevations of 0°, -18° and -36° and indicates how it varies in azimuth and elevation. The SDO data are resampled (at 44.1 kHz) minimum phase approximations of finite impulse response data originally measured at a 50 kHz sampling rate by Wightman and Kistler [23]. The fast Fourier transform (FFT) is used to determine the amplitude spectrum of these HRTFs. The amplitude spectra are then interpolated for each frequency over 360° of azimuth and 90° of elevation with a resolution of 1°, using cubic splines. In azimuth, the splines are constrained to be continuous over the wraparound region from +150° to -180°. This results in an HRTF map for either ear which varies smoothly in azimuth and elevation for each sampled frequency.

Figure 2: The internal IID spectrum corresponding to the left LSO units from the HRTF data set SDO MP44.DAT for elevations of -36°, -18° and 0°, plotted against frequency (kHz) and azimuth (deg).

The amplitude spectra of the HRTFs are warped to a logarithmic scale (Eq. (2)) between 2 kHz and 20 kHz, the range of response of the LSO units. The warped frequency response is resampled uniformly at 128 points. Linear interpolation is used to determine the amplitude value for intermediate frequencies at which the magnitude is not known. The logarithmic warping models the cochlear response along the basilar membrane. The internal IID spectral data vectors are computed from the resampled HRTF data using Eq. (6).
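As a rough illustration of the preprocessing just described, the sketch below computes HRTF amplitude spectra from 128-tap impulse responses and interpolates a single frequency bin across azimuth with a periodic cubic spline, so the map stays continuous over the ±180° wraparound. The function names and the SciPy-based implementation are assumptions; the paper's full pipeline also interpolates over elevation and then warps each spectrum to the logarithmic grid of Eq. (2).

```python
import numpy as np
from scipy.interpolate import CubicSpline

def hrtf_magnitude(hrir, n_fft=256):
    """Amplitude spectrum of a 128-tap head related impulse response."""
    return np.abs(np.fft.rfft(hrir, n=n_fft))

def interpolate_over_azimuth(az_meas, mag_meas, az_query):
    """Cubic-spline interpolation of one frequency bin across azimuth.

    az_meas  : measured azimuths in degrees, e.g. -180, -150, ..., +150
    mag_meas : HRTF magnitude at those azimuths for a single frequency bin
    az_query : azimuths (deg) at which the smooth HRTF map is wanted
    """
    # close the circle so the spline is continuous across the +/-180 deg seam
    az = np.append(az_meas, az_meas[0] + 360.0)
    mag = np.append(mag_meas, mag_meas[0])
    spline = CubicSpline(az, mag, bc_type="periodic")
    return spline(np.mod(az_query + 180.0, 360.0) - 180.0)
```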

4 Localization in Azimuth

In previous work [6], we have analyzed the discriminative capabilities of the internal IID spectrum in providing localization cues in azimuth. Three different pattern matching techniques, including normalized correlation matching, fuzzy back-propagation matching, and an approach based on optimizing a novel discriminative matching measure (DMM), were investigated for accuracy and discriminative ability in localization. As discussed in Section 2, the internal IID spectrum of an incoming signal is projected onto a library of template patterns, each representing a particular direction. The estimate of this internal IID spectrum is assumed to be distorted by additive noise. Details of this distortion model are given in Appendix A. For simplicity in analysis, we assume this additive noise term to be zero mean Gaussian, independent and identically distributed. Normalized correlation involves projecting the incoming IID spectral vector onto the set of library IID spectral vectors. It is well known that matching using the correlation approach optimizes the signal to noise ratio (SNR). Detailed treatment of the correlation approach and its relation to Duda's maximum likelihood approach [6] is given in [6]. The fuzzy model using the back-propagation training approach is detailed in Section 5 for localization in both azimuth and elevation. A brief summary of the approach optimizing the novel discriminative matching measure (DMM) is provided here. Given an M-dimensional IID spectral vector h_e and a set of M-dimensional template vectors \psi_i, i \in \{1, \ldots, N_h\}, associated with azimuthal directions \theta_i \in [-180°, +180°), it is to be determined which of the N_h templates responds best to h_e. The effectiveness of the match can be quantified by the DMM. Let the response score of the vector h_e with the i-th template \psi_i be given by c(i) = \psi_i^T h_e, i \in \{1, \ldots, N_h\}. \psi_i is the template associated with azimuth \theta_i that is to be correlated with h_e. Let the vector h_e evoke the highest response from the l-th template \psi_l. The discriminative matching measure (DMM) is defined as the ratio

\mathrm{DMM} = \frac{(N_h - 1)\,(\psi_l^T h_e)^2}{\sum_{i=1}^{N_h} |\theta_l - \theta_i|^2_{\bmod 180°}\,(\psi_i^T h_e)^2},    (9)

where \theta_i, i \in \{1, \ldots, N_h\}, is the azimuthal angle of the i-th template and |\cdot|_{\bmod 180°} is the modulo operator. This function ensures that the maximum measured azimuthal difference is less than or equal to 180°. The DMM penalizes the matching score in proportion to the squared difference of the estimated direction from the true direction. The smooth nature of the IID spectrum as a function of location is exploited by introducing a dependence on the angular distance, which allows the response to fall gradually away from the peak.
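A small sketch of the measure follows, based on the reconstruction of Eq. (9) above; the exact normalization used by the authors is not fully recoverable from the manuscript, so the (N_h - 1) factor should be read as an assumption of this example.

```python
import numpy as np

def discriminative_matching_measure(h_e, templates, azimuths):
    """Eq. (9): DMM of an IID spectral vector against a template library.

    h_e       : estimated internal IID spectrum (length-M vector)
    templates : array of shape (N_h, M), one template per azimuth
    azimuths  : template azimuths in degrees
    """
    scores = templates @ h_e                 # c(i) = psi_i^T h_e
    l = int(np.argmax(scores))               # best-matching template
    # azimuthal distance folded so that it never exceeds 180 degrees
    d = np.abs(azimuths - azimuths[l])
    d = np.minimum(d, 360.0 - d)
    penalty = np.sum(d**2 * scores**2)       # the i = l term vanishes (d = 0)
    n_h = len(azimuths)
    return (n_h - 1) * scores[l]**2 / penalty
```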

Figure 3: Typical responses for localizing in azimuth using the optimal DMM approach, fuzzy back-propagation and normalized correlation. Shown are the responses for \theta = -99°. Note the sharply peaked responses of the optimal DMM approach and the fuzzy neural network as compared to the broad peak of the normalized correlation method.

The optimal filters \psi_i, which yield the maximal possible DMM when correlated with h_e, are formulated in Appendix B. In this formulation, additive noise and expectations of random variables are used to define the DMM in a slightly different manner. The optimal DMM filter is given by

\psi_l = \frac{u_l \left[ \frac{1}{N_h - 1} \sum_{i=1}^{N_h} |\theta_i - \theta_l|^2_{\bmod 180°}\, (R_{h_i} + R_\eta) \right]^{-1} h_l}{h_l^T \left[ \frac{1}{N_h - 1} \sum_{i=1}^{N_h} |\theta_i - \theta_l|^2_{\bmod 180°}\, (R_{h_i} + R_\eta) \right]^{-1} h_l},    (10)

where u_l is a user determined constraint, usually set to 1, R_{h_i} = h_i h_i^T is the sample autocorrelation matrix of the IID spectral data, and R_\eta = E[\eta \eta^T] is the noise autocorrelation matrix. In the limiting case of increasing additive white noise, it is easy to show that the optimal DMM filter converges to the original template h_l normalized to unit amplitude, thus indicating that the DMM method performs as well as or better than correlation matching in terms of DMM. In implementing the optimal DMM approach, an incoming IID spectral vector is correlated with optimal DMM filters for each available azimuth.
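The filter of Eq. (10) can be sketched as below; the direct linear solve and the 1/(N_h - 1) scaling follow the reading of the equation given above and are not taken from the authors' implementation.

```python
import numpy as np

def optimal_dmm_filter(l, templates, azimuths, R_noise, u_l=1.0):
    """Eq. (10): filter psi_l that maximizes the DMM for template h_l.

    templates : array (N_h, M) of internal IID spectra h_i
    R_noise   : noise autocorrelation matrix R_eta (M x M)
    """
    n_h, m = templates.shape
    d = np.abs(azimuths - azimuths[l])
    d = np.minimum(d, 360.0 - d)                       # |theta_i - theta_l| mod 180
    # distance-weighted sum of signal-plus-noise autocorrelation matrices
    A = np.zeros((m, m))
    for i in range(n_h):
        R_hi = np.outer(templates[i], templates[i])    # R_{h_i} = h_i h_i^T
        A += d[i]**2 * (R_hi + R_noise)
    A /= (n_h - 1)
    w = np.linalg.solve(A, templates[l])               # A^{-1} h_l
    return u_l * w / (templates[l] @ w)                # divide by h_l^T A^{-1} h_l
```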

Figure 4: Average DMM obtained by matching using optimal DMM filtering, fuzzy neural networks and normalized correlation, plotted against the input HRTF template SNR (dB). As is evident, the optimal DMM method has the best DMM scores, followed by the fuzzy neural network.

Figure 5: RMS errors (degrees) of optimal DMM filter matching, the fuzzy neural network and normalized correlation matching, plotted against the input HRTF template SNR (dB). Here the fuzzy network is seen to perform best in localizing, with minimum RMS error, followed by the optimal DMM approach.

The response is a vector with components corresponding to all these azimuths. Fig. 3 shows a result of localizing a sound source located at an azimuth of \theta = -99°. Three methods are compared in this figure, namely normalized correlation, fuzzy back-propagation and the optimal DMM approach. In these experiments, the response is estimated for azimuths 1° apart. Note that the optimal DMM approach and the back-propagation based fuzzy neural network have sharply peaked responses at the correct azimuth. The normalized correlation approach has a broad peak, which is less discriminative. The estimate of the azimuth is less robust to variations for such a broadly peaked response than for the narrower peaks of the optimal DMM and the fuzzy neural network approaches. Fig. 4 and Fig. 5 show the average DMM scores and the RMS errors in localization using these three approaches over a large range of additive noise levels. The DMM approach and the fuzzy neural network are seen to have similar performances, both in DMM and in localization accuracy. As expected, the DMM approach has a better average DMM score than the fuzzy neural network, but on the other hand has slightly higher RMS errors. The correlation approach is inferior to both of these methods in terms of DMM and RMS error. The optimal DMM approach is theoretically capable of being extended to localization in

both azimuth and elevation. In order to formulate this, it is required simply to consider the absolute angular distance \Delta\gamma as defined in Eq. (12), rather than just the azimuthal angular distance, together with the corresponding IID spectral pattern for the relevant direction. In the experiments described above, a resolution of 1° in azimuth was used. This is smaller than the localization acuity of humans [3], and thus is convenient for evaluating the performance bounds of the model. For localization in both azimuth and elevation, it would be necessary to construct 360 x 91 optimal filters for the desired resolution of 1°. In our experiments, the computation of each filter takes approximately 20 minutes on a 486 DX2/66. Thus it would take more than one year to compute all 360 x 91 filters with the resources available to us. The numerical complexity precludes such an implementation. It must be noted that the simulation of the localization process after the filter computations is primarily a set of parallel vector correlations and is quite fast, even on serial architectures. It can thus be expected that the performance of such a model will be similar to the results obtained for localization in azimuth only and thus be comparable to human localization acuity.

5 Localization Using Fuzzy Neural Networks

As discussed above, the final stage in the localization process, after the extraction of the IID spectrum, involves mapping this IID pattern vector to a location in the modeled auditory space of the listener. The assumption is that the modeled IID responses are mapped from a tonotopic ordering to a location oriented response at some stage in the auditory nervous system. This mapping is done by a neural network (shown in Fig. 1) based on principles of fuzzy logic []. The model is a feed forward fully connected network with one hidden layer. Each of the processing units in the network correlates the input vector with its associated weight vector. The logistic function is used as the output nonlinearity for all processing nodes. The error back-propagation algorithm is used to train the network in a supervised mode. The back-propagation algorithm has been extensively used. It is capable of extracting general characteristics of pattern classes quite readily. The training of the neural network

consists of presenting a set of training patterns or exemplars to the network and demanding a corresponding set of desired outputs from the network. The network defines a nonlinear mapping from the IID spectral pattern space to the modeled auditory space. The mapping is a function of the number of layers, the weight matrix associated with each layer and the nonlinearity parameters of each neuron. The learning algorithm modifies the weights of the network in accordance with the error between the actual and desired outputs. The topic of back-propagation is well developed [9] and we do not present any details here, other than noting that the algorithm provides a method to perform a gradient descent on the mean squared error surface in the weight space of the neural network. Neti et al. [7] used a fault tolerant learning method to model the behavior of biological neural networks, where it is observed that cognitive functions are left relatively unimpaired despite damage to individual processing units. Neural units carried evenly distributed information and were modularly fault tolerant. However, the manner in which the training data were selected did not allow for good generalization [7]. In the neural network we describe here, the problem of poor generalization has been overcome by using fuzzy variables and IID data uniformly distributed over the auditory space the model is expected to analyze. The use of fuzzy variables in the input allows the network to be robust to distortions in the input IID spectrum. The fuzzy model allows a set of outputs defined by the fuzzy membership function to have high activity values, i.e., the estimated location can have varying memberships in a range of locations. We assume that this models the actual perception of sound source location by humans. Humans are generally able to localize sounds to a small region of uncertainty. This region is akin to the localization blur or the minimum audible angle (MAA) measured by various experimenters [3, 4]. The use of fuzzy variables also reduces the burden on the training algorithm and allows faster convergence. This is because the input IID vectors corresponding to nearby locations are highly correlated. Requiring an orthogonal set of outputs would require a more complex mapping than requiring a set of fuzzy outputs as described below. Fuzzy outputs also allow for a far greater resolution than is possible

otherwise, without much greater computational burden.

5.1 Fuzzy Model of the Input Stimulus

The input to the neural network is given by the internal IID spectral pattern (Eq. (6) and Eq. (7)). It is desired that the network be robust to variances induced in the internal IID spectrum by estimation errors at various stages of auditory processing. Such errors are modeled as additive white noise \eta(s), as explained in Appendix A. \eta is assumed to be a zero mean independent and identically distributed (i.i.d.) random vector constructed from the estimation errors at each frequency. In training the network, each target direction is associated with several input templates, each one corresponding to the internal IID spectral pattern for that direction with different levels of additive noise.

Figure 6: The magnitude of the HRTF for the left and right ear for the direction \theta = -55° and \phi = +23°, shown in dB.

Figure 7: The HRTF magnitude spectra warped to a logarithmic scale between 2 kHz and 20 kHz.

The input intensity at a particular frequency for a given direction is converted to a fuzzy variable [], allowing the network to learn expected variances of the input amplitude at that frequency. The h(s, \theta, \phi) value is scaled to lie in the range [-1, 1]. Thus if h_{max} and h_{min} are the maximum and minimum gain values of h(s, \theta, \phi) over all input patterns from all directions, the input to the network is modeled as follows:

\tilde{h}_e(s, \theta, \phi) = \frac{2\,(h(s, \theta, \phi) + \eta(s)) - h_{max} - h_{min}}{h_{max} - h_{min}}.    (11)

The number of input nodes is fixed by the choice of the input vector length, in this case 128 points, spaced logarithmically between 2 kHz and 20 kHz and indexed by s (Eq. (2)).
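A minimal sketch of the input fuzzification of Eq. (11) follows; specifying the estimation noise through a target SNR, and the helper name, are assumptions made for the example.

```python
import numpy as np

def fuzzy_input(h, h_max, h_min, snr_db=20.0, rng=None):
    """Eq. (11): add estimation noise to an IID spectrum and scale it to [-1, 1].

    h            : clean internal IID spectrum h(s) for one direction
    h_max, h_min : extreme IID values over all training patterns
    snr_db       : assumed SNR of the additive estimation noise
    """
    rng = rng or np.random.default_rng(0)
    noise_power = np.mean(h**2) / 10.0**(snr_db / 10.0)
    eta = rng.normal(0.0, np.sqrt(noise_power), size=h.shape)
    return (2.0 * (h + eta) - h_max - h_min) / (h_max - h_min)
```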

Figure 8: The internal IID pattern for the lateral superior olive units corresponding to the direction \theta = -55° and \phi = +23°.

Figure 9: The internal IID pattern for the lateral superior olive units with additive noise to model estimation errors, and after scaling (SNR = 5 dB).

Fig. 6 shows the original magnitude spectra of the HRTFs for the direction (\theta = -55°, \phi = +23°), Fig. 7 the intensity spectrum as modeled in the cochlear response, Fig. 8 the modeled IID spectrum corresponding to the response of the LSO units, and Fig. 9 the result of adding noise and scaling it to lie between h_{max} and h_{min} respectively.

5.2 Fuzzy Model of the Output Response

The output response is modeled as a map of the auditory space in terms of elevation and azimuth. The output layer is a 20 x 10 array of neurons (Fig. 1), with the azimuth ranging from -180° to 162°, spaced 18° apart along the 20-neuron side, and the elevation ranging from -72° to +90°, also spaced 18° apart, along the 10-neuron side. The output activity is modeled as a Gaussian with its mean at the location of the input stimulus being localized, sampled at the locations corresponding to each neuron in the output array. This enables coding of locations which do not correspond to the locations of neurons in the output array. Each Gaussian is plotted to be circularly symmetric in terms of the absolute angular difference \Delta\gamma between the mean location of the Gaussian and the location of each neuron in the output array, by using the direction cosines of these two directions. We note that at increasing elevations farther from the horizontal plane at 0°, the azimuthal locations are sampled more

densely. Hence the relative contribution of the elevation and azimuthal angles to the absolute angular difference changes. Thus the use of \Delta\gamma is particularly relevant in estimating the localized direction.

Figure 10: The modeled output of the network at location (-55°, +23°) is a Gaussian with \sigma = 25° when mapped on a spherical surface. This target provides fuzzy memberships in a set of azimuths and elevations at and close to the source direction.

Figure 11: This figure illustrates how the desired response at the location (+55°, +22°) is modeled to be continuous in azimuth by introducing a wraparound.

The absolute angular difference \Delta\gamma between the locations (\theta_i, \phi_i) and (\theta_j, \phi_j) is given by

\Delta\gamma = \cos^{-1}(l_i l_j + m_i m_j + n_i n_j),    (12)

where

l_i = \cos(\theta_i)\cos(\phi_i), \quad m_i = \sin(\theta_i)\cos(\phi_i), \quad n_i = \sin(\phi_i)    (13)

are the direction cosines for the direction (\theta_i, \phi_i). The direction cosines for (\theta_j, \phi_j) are similarly defined. Thus the value of the Gaussian, with its mean at (\theta_i, \phi_i), at a neuron (k, l) corresponding to the location (\theta_j, \phi_j) in the output array, is given by

G_i(k, l) = 1.9\, e^{-\frac{\Delta\gamma^2}{2\sigma_d^2}} - 0.95,    (14)

where \sigma_d (= 25°) is the standard deviation of the output Gaussian. The Gaussian in Eq. (14) is limited in amplitude to [-0.95, 0.95]. This avoids demanding saturated responses from the output nonlinearity of each of the network output units. This form of output coding results in a map of the auditory space that is continuous on the sphere, and thus in elevation and azimuth. It is capable of localizing sources from all directions.
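The target map of Eqs. (12)-(14) can be sketched as follows; the grid spacing matches the 20 x 10 output array described above, while the function names are illustrative.

```python
import numpy as np

def angular_difference(az1, el1, az2, el2):
    """Eqs. (12)-(13): absolute angular difference via direction cosines (degrees)."""
    a1, e1, a2, e2 = np.radians([az1, el1, az2, el2])
    cos_dg = (np.cos(a1) * np.cos(e1) * np.cos(a2) * np.cos(e2)
              + np.sin(a1) * np.cos(e1) * np.sin(a2) * np.cos(e2)
              + np.sin(e1) * np.sin(e2))
    return np.degrees(np.arccos(np.clip(cos_dg, -1.0, 1.0)))

def gaussian_target(az0, el0, sigma_d=25.0,
                    az_grid=np.arange(-180.0, 180.0, 18.0),
                    el_grid=np.arange(-72.0, 91.0, 18.0)):
    """Eq. (14): fuzzy membership map, one value per output neuron (k, l)."""
    target = np.empty((len(az_grid), len(el_grid)))
    for k, az in enumerate(az_grid):
        for l, el in enumerate(el_grid):
            dg = angular_difference(az0, el0, az, el)
            target[k, l] = 1.9 * np.exp(-dg**2 / (2.0 * sigma_d**2)) - 0.95
    return target
```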

Figure 12: The response of the localization network to the input (SNR: 5 dB) from the direction (-55°, +23°). The response is defuzzified according to Eq. (15) and the estimated direction is (-59°, +22°), with an absolute angular error (\Delta\gamma) of 3.8283°.

Figure 13: The response of the localization network to the input (SNR: 5 dB) from the direction (+55°, +23°). The estimated location is (+54°, +22°), with an absolute angular error (\Delta\gamma) of 1.364°.

An example of the desired output for the direction corresponding to (-55°, +23°) is shown in Fig. 10. Another example, for the direction (+55°, +22°), is shown in Fig. 11. Note that the response wraps around in azimuth, and extends from +90° to -90° in elevation to cover the whole surface of the sphere. The Gaussian response model forms a fuzzy membership function [], with the output having memberships in a multiplicity of directions given by the response of each neuron. Thus the resolution of localization is not limited by the number of output neurons. The network is trained with a set of patterns which are selected to be uniformly distributed over the set of all available directions. The network is able to generalize and interpolate to responses for intermediate directions for which it has not been trained. Examples of responses to inputs with which the network has not been trained are shown in Fig. 13 and Fig. 12. Results are discussed further in Section 6.

5.3 Error Back-propagation Training

The weight assignment is done by iteratively converging to a minimum close to the global minimum on the mean squared error surface of the network mapping function. The network

is initialized with small random weights. The error associated with each output node is defined as the difference between the node's target output and the achieved node output. Errors associated with hidden layer nodes are determined by back-propagating the error associated with each node of the next layer through the output nonlinearity. The use of the logistic function, which is continuous, as the output nonlinearity enables an analytic derivation of the back-propagation algorithm as described by Rumelhart et al. [9]. The network is trained in a batch mode. The training data include IID spectral templates for 20 azimuths x 6 elevations, uniformly spaced 18° in azimuth and elevation. The actual training data correspond to directions over the range of azimuths \theta \in [-180°, +180°) and elevations \phi \in [-36°, 54°]. The inclusion of training patterns corrupted by additive noise of variance \sigma^2 (additive Gaussian noise resulting in 20 dB SNR and 25 dB SNR is used) ensures robustness and better generalization capabilities [9], by enabling the network to model the estimation errors in the IID spectral patterns. The target pattern for each direction is a Gaussian as defined in Eq. (14). The network consists of 128 inputs, a varying number of hidden units and a 20 x 10 output array. The network was trained with heuristically adjusted learning and momentum rates. The convergence criterion was that the average error per node per pattern drop to less than 0.02. Varying numbers of hidden layer units, from 40 to 120, were tried. The best error performance was observed for a network using 50 hidden units, which converged in about 450 iterations.
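A minimal NumPy sketch of the architecture described in this section follows (128 inputs, 50 hidden units, a 20 x 10 output map, logistic nonlinearities). The weight initialization scale and learning rate are placeholders, the momentum term is omitted for brevity, and the targets of Eq. (14) are assumed here to be rescaled into the (0, 1) range of the logistic output units.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

class FuzzyLocalizationNet:
    """Feed-forward net: 128 IID inputs -> 50 hidden -> 20 x 10 output map."""

    def __init__(self, n_in=128, n_hidden=50, n_out=200, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))   # small random weights
        self.W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))

    def forward(self, x):
        self.h = logistic(self.W1 @ x)        # hidden activations
        self.y = logistic(self.W2 @ self.h)   # output map, flattened 20 x 10
        return self.y

    def backprop_step(self, x, target, lr=0.1):
        """One gradient-descent step on the squared error for a single pattern."""
        y = self.forward(x)
        # error propagated back through the logistic nonlinearities
        delta_out = (y - target) * y * (1.0 - y)
        delta_hid = (self.W2.T @ delta_out) * self.h * (1.0 - self.h)
        self.W2 -= lr * np.outer(delta_out, self.h)
        self.W1 -= lr * np.outer(delta_hid, x)
        return float(np.mean((y - target) ** 2))
```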

5.4 Estimating the Auditory Source Location

The modeled output response is a Gaussian sampled on a 20 x 10 array as given by Eq. (14). As discussed above, the network is expected to interpolate responses for directions which do not correspond to a direction represented by the output nodes. The estimated location due to an output response corresponds to the mean of the Gaussian which best fits the response in some measure. Thus, the aim is to detect the location of the Gaussian at a higher resolution than afforded by the 20 x 10 output array. Ben-Arie and Rao [2] have shown that a signal may be decomposed on a set of Gaussian basis functions by recursively subtracting the Gaussian which provides the maximal amount of energy in the signal. A single Gaussian of known variance, as in the present case, can be detected by the mean of the Gaussian which provides the minimum mean squared error fit to the signal. Following the max-energy paradigm of Ben-Arie and Rao, it is attempted to resolve the output response at a resolution of 1° in both azimuth and elevation. Modeled Gaussian responses G_i of standard deviation \sigma_d = 25° are generated for i such that \theta_i \in [-180°, +179°] and \phi_i \in [-36°, +54°], sampled on a 20 x 10 array as in Eq. (14). Let the output response be approximately Gaussian in nature and be given by G_{op} of some standard deviation \sigma_d^{op}. The ordered pair (k, l) indexes the output layer neurons. The estimated location (\theta_e, \phi_e) is given by

(\theta_e, \phi_e) = \arg\min_{(\theta_i, \phi_i)} \| G_i - G_{op} \|_F,    (15)

where

\| G_i - G_{op} \|_F = \sqrt{ \sum_k \sum_l \left[ G_i(\sigma_d; k, l) - G_{op}(\sigma_d^{op}; k, l) \right]^2 }    (16)

is the Frobenius norm of the error matrix. The absolute angular error in the estimate, i.e., the absolute angular difference \Delta\gamma between the locations (\theta_i, \phi_i) and (\theta_e, \phi_e), can be computed using Eq. (12). This scheme is seen to be robust to variations in \sigma_d^{op} and in the output response amplitude. In previous work [6], described in Section 4, the back-propagation method was evaluated for azimuthal localization using both the criterion of angular error and the DMM measure (Fig. 4 and Fig. 5). In order to estimate the DMM of the response, a slightly different scheme to defuzzify the output response was implemented and is described in [6]. In brief, an interpolated response was generated at the required level of angular resolution (1°). The azimuth was estimated as the centroid of this interpolated response.
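The defuzzification of Eqs. (15)-(16) amounts to a brute-force match of the output map against Gaussian templates on a 1° grid. The sketch below reuses the hypothetical gaussian_target helper from Section 5.2 and is not an efficient implementation, only an illustration of the search.

```python
import numpy as np

def defuzzify(G_op, az_step=1.0, el_step=1.0):
    """Eqs. (15)-(16): find the Gaussian template closest to the network output.

    G_op : 20 x 10 output map produced by the localization network
    Scans candidate means on a 1-degree grid and returns the best (az, el).
    """
    best, best_err = None, np.inf
    for az0 in np.arange(-180.0, 180.0, az_step):
        for el0 in np.arange(-36.0, 54.0 + el_step, el_step):
            G_i = gaussian_target(az0, el0)        # Eq. (14), sigma_d = 25 deg
            err = np.linalg.norm(G_i - G_op)       # Frobenius norm, Eq. (16)
            if err < best_err:
                best, best_err = (az0, el0), err
    return best
```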

6 Simulation Results

The performance of the network is evaluated by running simulations over a large range of input SNR. Each test set for a given input SNR consists of randomly selected test IID spectral patterns uniformly distributed over the available range of input directions, i.e., \theta \in [-180°, +179°], \phi \in [-36°, +54°]. The response of the network for each input is defuzzified as described in Section 5.4. Samples of the network response are shown in Fig. 12 and Fig. 13. Evidently, the results (shown for noisy input at 20 dB SNR) are robust to additive noise. The defuzzified results have minimum mean absolute angular errors of about 3.3°. The absolute angular error is calculated as the angle between the true and estimated locations projected at the center of a sphere, using Eq. (12). Fig. 14 shows the distribution of the estimated elevation against the actual elevation and Fig. 15 shows the distribution of the estimated azimuth against the actual azimuth for inputs with 20 dB SNR.

Figure 14: The distribution of estimated elevation against actual elevation for inputs at 20 dB SNR. The average elevation error is 0.5° with a standard deviation of 3.5°, and the average absolute angular error is 4.37°.

Figure 15: The distribution of estimated azimuth against actual azimuth for inputs at 20 dB SNR. The average azimuthal error is 0.68° with a standard deviation of 4.38°. The average absolute angular error is 4.37°.

From Fig. 14 and Fig. 15, it is observed that the model performs very well in estimating both elevation and azimuth from IID spectral cues. Table 1 summarizes the results of localization over the range of test SNR from 5 dB to 40 dB, in terms of the means and standard deviations of the elevation and azimuth errors and the mean absolute error. All results are given in degrees. Evidently, the errors increase with greater levels of noise. Note that the standard deviation is a more accurate indicator of performance when measuring localization acuity in azimuth or elevation alone, due to the signed nature of the error. It is seen that the mean absolute error in localization varies from

about 3.3° to about 7.3° over a wide range of noise levels. The error is seen to plateau for SNRs above 20 dB.

Table 1: Average localization errors over a range of input SNR. The columns list the SNR (dB), the mean elevation error, the mean azimuth error and the mean absolute error (all in degrees).

Makous et al. [3] used broadband flat stimuli in the range from 1.8 kHz to 16 kHz to estimate localization acuity in humans, the approximate range of frequencies investigated for our model. Results were reported for open-loop and closed-loop localization experiments. The open-loop trials used short bursts of stimuli to remove the contributions due to orienting the head. Here, it is suitable to compare our model responses with the open-loop experiments. It was observed that localization acuity in humans was usually better in azimuth than in elevation. Makous et al. reported signed errors which were at a minimum of about 0.7° for azimuth and -0.3° ± 3.5° for elevation. These minima corresponded to locations directly ahead. Maximum azimuthal errors were of the order of 8.0° ± 6.2° for sources to one side and about -3° directly behind. Similarly, maximum elevation errors were observed to be about 7° for sources behind and slightly above the horizontal interaural plane. The minimum elevation errors reported by Makous et al. compare well with our results. The minimum azimuthal errors are smaller than those achieved by our model. This may seem to invalidate the approach of using IID information for azimuthal localization. However, it must be noted that ITD cues, especially onset ITD cues, provide very informative cues for azimuthal localization. It might be expected that the use of ITD cues would improve the acuity of this model. It cannot be denied that even without ITD cues, the modeled IID spectrum

provides a reliable source of information for azimuthal localization. From Fig. 14 and Fig. 15, it can be observed that our model has errors which are approximately uniformly distributed. The model cannot accurately predict the variation in localization acuity with the location of the source, nor can it emulate the front-to-back and back-to-front confusions observed in humans. This is because these phenomena are closely related to the manner in which ITD cues are combined with IID cues to form a listener's subjective auditory space.

7 Conclusions

The localization process has been modeled using the interaural intensity difference (IID) spectral vectors defined in Eq. (8). This IID spectrum is the difference of the intensities of the ipsilateral and contralateral stimuli. The intensity responses of units in the auditory system can be approximated as the power spectrum of the stimulus exciting each bandpass filter of the basilar membrane. The localization process is modeled as a spatial correlate which maps the IID spectrum to the subjective auditory space. This model attempts to explain localization in the mid-to-high frequency range, based on directional frequency cues that are imposed by the pinnae, head and torso on the IID at the two ears. The experiments described in this manuscript provide an insight into how the peripheral auditory system might be able to process spectral cues for localization. Thus it complements currently popular localization models that are based only on interaural time difference (ITD) cues. It must be noted that this model has been defined only for broadband medium and high frequency sound sources (> 1.3 kHz). The modeled IID spectrum is an idealization of the response of the units in the lateral superior olive (LSO). In obtaining this representation, the cochlear bandpass filters were assumed to be narrowband and in effect ignored. Saturation effects of the nerve fibers in the auditory (VIIIth) nerve were ignored. Stimuli were assumed to evoke responses in the approximately linear range of the nerve fibers, and nonlinearities were discounted. These limitations of the model must be kept in mind when deriving conclusions from the above model.


Richard S. Zemel 1 Georey E. Hinton North Torrey Pines Rd. Toronto, ONT M5S 1A4. Abstract

Richard S. Zemel 1 Georey E. Hinton North Torrey Pines Rd. Toronto, ONT M5S 1A4. Abstract Developing Population Codes By Minimizing Description Length Richard S Zemel 1 Georey E Hinton University oftoronto & Computer Science Department The Salk Institute, CNL University oftoronto 0 North Torrey

More information

Function approximation using RBF network. 10 basis functions and 25 data points.

Function approximation using RBF network. 10 basis functions and 25 data points. 1 Function approximation using RBF network F (x j ) = m 1 w i ϕ( x j t i ) i=1 j = 1... N, m 1 = 10, N = 25 10 basis functions and 25 data points. Basis function centers are plotted with circles and data

More information

AN ALGORITHM FOR BLIND RESTORATION OF BLURRED AND NOISY IMAGES

AN ALGORITHM FOR BLIND RESTORATION OF BLURRED AND NOISY IMAGES AN ALGORITHM FOR BLIND RESTORATION OF BLURRED AND NOISY IMAGES Nader Moayeri and Konstantinos Konstantinides Hewlett-Packard Laboratories 1501 Page Mill Road Palo Alto, CA 94304-1120 moayeri,konstant@hpl.hp.com

More information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 SUBJECTIVE AND OBJECTIVE QUALITY EVALUATION FOR AUDIO WATERMARKING BASED ON SINUSOIDAL AMPLITUDE MODULATION PACS: 43.10.Pr, 43.60.Ek

More information

Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal.

Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal. Perceptual coding Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal. Perceptual encoders, however, have been designed for the compression of general

More information

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS

CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS CHAPTER 6 PERCEPTUAL ORGANIZATION BASED ON TEMPORAL DYNAMICS This chapter presents a computational model for perceptual organization. A figure-ground segregation network is proposed based on a novel boundary

More information

BINAURAL SOUND LOCALIZATION FOR UNTRAINED DIRECTIONS BASED ON A GAUSSIAN MIXTURE MODEL

BINAURAL SOUND LOCALIZATION FOR UNTRAINED DIRECTIONS BASED ON A GAUSSIAN MIXTURE MODEL BINAURAL SOUND LOCALIZATION FOR UNTRAINED DIRECTIONS BASED ON A GAUSSIAN MIXTURE MODEL Takanori Nishino and Kazuya Takeda Center for Information Media Studies, Nagoya University Furo-cho, Chikusa-ku, Nagoya,

More information

EE482: Digital Signal Processing Applications

EE482: Digital Signal Processing Applications Professor Brendan Morris, SEB 3216, brendan.morris@unlv.edu EE482: Digital Signal Processing Applications Spring 2014 TTh 14:30-15:45 CBC C222 Lecture 13 Audio Signal Processing 14/04/01 http://www.ee.unlv.edu/~b1morris/ee482/

More information

Parametric Coding of Spatial Audio

Parametric Coding of Spatial Audio Parametric Coding of Spatial Audio Ph.D. Thesis Christof Faller, September 24, 2004 Thesis advisor: Prof. Martin Vetterli Audiovisual Communications Laboratory, EPFL Lausanne Parametric Coding of Spatial

More information

Multimedia Communications. Audio coding

Multimedia Communications. Audio coding Multimedia Communications Audio coding Introduction Lossy compression schemes can be based on source model (e.g., speech compression) or user model (audio coding) Unlike speech, audio signals can be generated

More information

Networks for Control. California Institute of Technology. Pasadena, CA Abstract

Networks for Control. California Institute of Technology. Pasadena, CA Abstract Learning Fuzzy Rule-Based Neural Networks for Control Charles M. Higgins and Rodney M. Goodman Department of Electrical Engineering, 116-81 California Institute of Technology Pasadena, CA 91125 Abstract

More information

Random Search Report An objective look at random search performance for 4 problem sets

Random Search Report An objective look at random search performance for 4 problem sets Random Search Report An objective look at random search performance for 4 problem sets Dudon Wai Georgia Institute of Technology CS 7641: Machine Learning Atlanta, GA dwai3@gatech.edu Abstract: This report

More information

1 INTRODUCTION The LMS adaptive algorithm is the most popular algorithm for adaptive ltering because of its simplicity and robustness. However, its ma

1 INTRODUCTION The LMS adaptive algorithm is the most popular algorithm for adaptive ltering because of its simplicity and robustness. However, its ma MULTIPLE SUBSPACE ULV ALGORITHM AND LMS TRACKING S. HOSUR, A. H. TEWFIK, D. BOLEY University of Minnesota 200 Union St. S.E. Minneapolis, MN 55455 U.S.A fhosur@ee,tewk@ee,boley@csg.umn.edu ABSTRACT. The

More information

Topographic Mapping with fmri

Topographic Mapping with fmri Topographic Mapping with fmri Retinotopy in visual cortex Tonotopy in auditory cortex signal processing + neuroimaging = beauty! Topographic Mapping with fmri Retinotopy in visual cortex Tonotopy in auditory

More information

CHAPTER 3. Preprocessing and Feature Extraction. Techniques

CHAPTER 3. Preprocessing and Feature Extraction. Techniques CHAPTER 3 Preprocessing and Feature Extraction Techniques CHAPTER 3 Preprocessing and Feature Extraction Techniques 3.1 Need for Preprocessing and Feature Extraction schemes for Pattern Recognition and

More information

Effects of multi-scale velocity heterogeneities on wave-equation migration Yong Ma and Paul Sava, Center for Wave Phenomena, Colorado School of Mines

Effects of multi-scale velocity heterogeneities on wave-equation migration Yong Ma and Paul Sava, Center for Wave Phenomena, Colorado School of Mines Effects of multi-scale velocity heterogeneities on wave-equation migration Yong Ma and Paul Sava, Center for Wave Phenomena, Colorado School of Mines SUMMARY Velocity models used for wavefield-based seismic

More information

Ensemble methods in machine learning. Example. Neural networks. Neural networks

Ensemble methods in machine learning. Example. Neural networks. Neural networks Ensemble methods in machine learning Bootstrap aggregating (bagging) train an ensemble of models based on randomly resampled versions of the training set, then take a majority vote Example What if you

More information

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude

Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude Advanced phase retrieval: maximum likelihood technique with sparse regularization of phase and amplitude A. Migukin *, V. atkovnik and J. Astola Department of Signal Processing, Tampere University of Technology,

More information

Mpeg 1 layer 3 (mp3) general overview

Mpeg 1 layer 3 (mp3) general overview Mpeg 1 layer 3 (mp3) general overview 1 Digital Audio! CD Audio:! 16 bit encoding! 2 Channels (Stereo)! 44.1 khz sampling rate 2 * 44.1 khz * 16 bits = 1.41 Mb/s + Overhead (synchronization, error correction,

More information

Sontacchi A., Noisternig M., Majdak P., Höldrich R.

Sontacchi A., Noisternig M., Majdak P., Höldrich R. $Q2EMHFWLYHRGHORI/RFDOLVDWLRQLQ %LQDXUDO6RXQG5HSURGXFWLRQ6\VWHPV Sontacchi A., Noisternig M., Majdak P., Höldrich R. $(6VW,QWHUQDWLRQDO&RQIHUHQFH -XQH 6W3HWHUVEXUJ5XVVLD ,QVWLWXWHRI(OHFWURQLFXVLF DQG$FRXVWLFV

More information

m Environment Output Activation 0.8 Output Activation Input Value

m Environment Output Activation 0.8 Output Activation Input Value Learning Sensory-Motor Cortical Mappings Without Training Mike Spratling Gillian Hayes Department of Articial Intelligence University of Edinburgh mikes@dai.ed.ac.uk gmh@dai.ed.ac.uk Abstract. This paper

More information

Appendix 4. Audio coding algorithms

Appendix 4. Audio coding algorithms Appendix 4. Audio coding algorithms 1 Introduction The main application of audio compression systems is to obtain compact digital representations of high-quality (CD-quality) wideband audio signals. Typically

More information

IEEE Proof Web Version

IEEE Proof Web Version IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 8, NOVEMBER 2009 1 Model-Based Expectation-Maximization Source Separation and Localization Michael I. Mandel, Member, IEEE, Ron

More information

Digital Image Processing. Prof. P. K. Biswas. Department of Electronic & Electrical Communication Engineering

Digital Image Processing. Prof. P. K. Biswas. Department of Electronic & Electrical Communication Engineering Digital Image Processing Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 21 Image Enhancement Frequency Domain Processing

More information

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used.

4.12 Generalization. In back-propagation learning, as many training examples as possible are typically used. 1 4.12 Generalization In back-propagation learning, as many training examples as possible are typically used. It is hoped that the network so designed generalizes well. A network generalizes well when

More information

A Neural Network for Real-Time Signal Processing

A Neural Network for Real-Time Signal Processing 248 MalkofT A Neural Network for Real-Time Signal Processing Donald B. Malkoff General Electric / Advanced Technology Laboratories Moorestown Corporate Center Building 145-2, Route 38 Moorestown, NJ 08057

More information

Scaling and Power Spectra of Natural Images

Scaling and Power Spectra of Natural Images Scaling and Power Spectra of Natural Images R. P. Millane, S. Alzaidi and W. H. Hsiao Department of Electrical and Computer Engineering University of Canterbury Private Bag 4800, Christchurch, New Zealand

More information

mywbut.com Diffraction

mywbut.com Diffraction Diffraction If an opaque obstacle (or aperture) is placed between a source of light and screen, a sufficiently distinct shadow of opaque (or an illuminated aperture) is obtained on the screen.this shows

More information

Face Recognition using Eigenfaces SMAI Course Project

Face Recognition using Eigenfaces SMAI Course Project Face Recognition using Eigenfaces SMAI Course Project Satarupa Guha IIIT Hyderabad 201307566 satarupa.guha@research.iiit.ac.in Ayushi Dalmia IIIT Hyderabad 201307565 ayushi.dalmia@research.iiit.ac.in Abstract

More information

Image Compression: An Artificial Neural Network Approach

Image Compression: An Artificial Neural Network Approach Image Compression: An Artificial Neural Network Approach Anjana B 1, Mrs Shreeja R 2 1 Department of Computer Science and Engineering, Calicut University, Kuttippuram 2 Department of Computer Science and

More information

Recognition. Clark F. Olson. Cornell University. work on separate feature sets can be performed in

Recognition. Clark F. Olson. Cornell University. work on separate feature sets can be performed in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 907-912, 1996. Connectionist Networks for Feature Indexing and Object Recognition Clark F. Olson Department of Computer

More information

ET4254 Communications and Networking 1

ET4254 Communications and Networking 1 Topic 2 Aims:- Communications System Model and Concepts Protocols and Architecture Analog and Digital Signal Concepts Frequency Spectrum and Bandwidth 1 A Communications Model 2 Communications Tasks Transmission

More information

Audio Compression Using Decibel chirp Wavelet in Psycho- Acoustic Model

Audio Compression Using Decibel chirp Wavelet in Psycho- Acoustic Model Audio Compression Using Decibel chirp Wavelet in Psycho- Acoustic Model 1 M. Chinna Rao M.Tech,(Ph.D) Research scholar, JNTUK,kakinada chinnarao.mortha@gmail.com 2 Dr. A.V.S.N. Murthy Professor of Mathematics,

More information

Department of. Computer Science. Remapping Subpartitions of. Hyperspace Using Iterative. Genetic Search. Keith Mathias and Darrell Whitley

Department of. Computer Science. Remapping Subpartitions of. Hyperspace Using Iterative. Genetic Search. Keith Mathias and Darrell Whitley Department of Computer Science Remapping Subpartitions of Hyperspace Using Iterative Genetic Search Keith Mathias and Darrell Whitley Technical Report CS-4-11 January 7, 14 Colorado State University Remapping

More information

Edge and local feature detection - 2. Importance of edge detection in computer vision

Edge and local feature detection - 2. Importance of edge detection in computer vision Edge and local feature detection Gradient based edge detection Edge detection by function fitting Second derivative edge detectors Edge linking and the construction of the chain graph Edge and local feature

More information

IMPROVEMENTS TO THE BACKPROPAGATION ALGORITHM

IMPROVEMENTS TO THE BACKPROPAGATION ALGORITHM Annals of the University of Petroşani, Economics, 12(4), 2012, 185-192 185 IMPROVEMENTS TO THE BACKPROPAGATION ALGORITHM MIRCEA PETRINI * ABSTACT: This paper presents some simple techniques to improve

More information

Guidelines for proper use of Plate elements

Guidelines for proper use of Plate elements Guidelines for proper use of Plate elements In structural analysis using finite element method, the analysis model is created by dividing the entire structure into finite elements. This procedure is known

More information

ALMA Memo No An Imaging Study for ACA. Min S. Yun. University of Massachusetts. April 1, Abstract

ALMA Memo No An Imaging Study for ACA. Min S. Yun. University of Massachusetts. April 1, Abstract ALMA Memo No. 368 An Imaging Study for ACA Min S. Yun University of Massachusetts April 1, 2001 Abstract 1 Introduction The ALMA Complementary Array (ACA) is one of the several new capabilities for ALMA

More information

Metrics for performance assessment of mixed-order Ambisonics spherical microphone arrays

Metrics for performance assessment of mixed-order Ambisonics spherical microphone arrays Downloaded from orbit.dtu.dk on: Oct 6, 28 Metrics for performance assessment of mixed-order Ambisonics spherical microphone arrays Favrot, Sylvain Emmanuel; Marschall, Marton Published in: Proceedings

More information

Neural Networks Based Time-Delay Estimation using DCT Coefficients

Neural Networks Based Time-Delay Estimation using DCT Coefficients American Journal of Applied Sciences 6 (4): 73-78, 9 ISSN 1546-939 9 Science Publications Neural Networks Based Time-Delay Estimation using DCT Coefficients Samir J. Shaltaf and Ahmad A. Mohammad Department

More information

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra

1. Introduction. 2. Motivation and Problem Definition. Volume 8 Issue 2, February Susmita Mohapatra Pattern Recall Analysis of the Hopfield Neural Network with a Genetic Algorithm Susmita Mohapatra Department of Computer Science, Utkal University, India Abstract: This paper is focused on the implementation

More information

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi

Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi Image Transformation Techniques Dr. Rajeev Srivastava Dept. of Computer Engineering, ITBHU, Varanasi 1. Introduction The choice of a particular transform in a given application depends on the amount of

More information

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)

More information

Evaluation of a new Ambisonic decoder for irregular loudspeaker arrays using interaural cues

Evaluation of a new Ambisonic decoder for irregular loudspeaker arrays using interaural cues 3rd International Symposium on Ambisonics & Spherical Acoustics@Lexington, Kentucky, USA, 2nd June 2011 Evaluation of a new Ambisonic decoder for irregular loudspeaker arrays using interaural cues J. Treviño

More information

Analysis of Directional Beam Patterns from Firefly Optimization

Analysis of Directional Beam Patterns from Firefly Optimization Analysis of Directional Beam Patterns from Firefly Optimization Nicholas Misiunas, Charles Thompson and Kavitha Chandra Center for Advanced Computation and Telecommunications Department of Electrical and

More information

Ultrasonic Multi-Skip Tomography for Pipe Inspection

Ultrasonic Multi-Skip Tomography for Pipe Inspection 18 th World Conference on Non destructive Testing, 16-2 April 212, Durban, South Africa Ultrasonic Multi-Skip Tomography for Pipe Inspection Arno VOLKER 1, Rik VOS 1 Alan HUNTER 1 1 TNO, Stieltjesweg 1,

More information

Data Mining. Neural Networks

Data Mining. Neural Networks Data Mining Neural Networks Goals for this Unit Basic understanding of Neural Networks and how they work Ability to use Neural Networks to solve real problems Understand when neural networks may be most

More information

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions

Assignment 2. Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions ENEE 739Q: STATISTICAL AND NEURAL PATTERN RECOGNITION Spring 2002 Assignment 2 Classification and Regression using Linear Networks, Multilayer Perceptron Networks, and Radial Basis Functions Aravind Sundaresan

More information

Lecture 16 Perceptual Audio Coding

Lecture 16 Perceptual Audio Coding EECS 225D Audio Signal Processing in Humans and Machines Lecture 16 Perceptual Audio Coding 2012-3-14 Professor Nelson Morgan today s lecture by John Lazzaro www.icsi.berkeley.edu/eecs225d/spr12/ Hero

More information

Dr Andrew Abel University of Stirling, Scotland

Dr Andrew Abel University of Stirling, Scotland Dr Andrew Abel University of Stirling, Scotland University of Stirling - Scotland Cognitive Signal Image and Control Processing Research (COSIPRA) Cognitive Computation neurobiology, cognitive psychology

More information

Empirical transfer function determination by. BP 100, Universit de PARIS 6

Empirical transfer function determination by. BP 100, Universit de PARIS 6 Empirical transfer function determination by the use of Multilayer Perceptron F. Badran b, M. Crepon a, C. Mejia a, S. Thiria a and N. Tran a a Laboratoire d'oc anographie Dynamique et de Climatologie

More information

Accurate Image Registration from Local Phase Information

Accurate Image Registration from Local Phase Information Accurate Image Registration from Local Phase Information Himanshu Arora, Anoop M. Namboodiri, and C.V. Jawahar Center for Visual Information Technology, IIIT, Hyderabad, India { himanshu@research., anoop@,

More information

Wevelet Neuron Filter with the Local Statistics. Oriented to the Pre-processor for the Image Signals

Wevelet Neuron Filter with the Local Statistics. Oriented to the Pre-processor for the Image Signals Wevelet Neuron Filter with the Local Statistics Oriented to the Pre-processor for the Image Signals Noriaki Suetake Naoki Yamauchi 3 Takeshi Yamakawa y epartment of Control Engineering and Science Kyushu

More information

AUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS

AUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS AUDIO SIGNAL PROCESSING FOR NEXT- GENERATION MULTIMEDIA COMMUNI CATION SYSTEMS Edited by YITENG (ARDEN) HUANG Bell Laboratories, Lucent Technologies JACOB BENESTY Universite du Quebec, INRS-EMT Kluwer

More information

Journal of Asian Scientific Research FEATURES COMPOSITION FOR PROFICIENT AND REAL TIME RETRIEVAL IN CBIR SYSTEM. Tohid Sedghi

Journal of Asian Scientific Research FEATURES COMPOSITION FOR PROFICIENT AND REAL TIME RETRIEVAL IN CBIR SYSTEM. Tohid Sedghi Journal of Asian Scientific Research, 013, 3(1):68-74 Journal of Asian Scientific Research journal homepage: http://aessweb.com/journal-detail.php?id=5003 FEATURES COMPOSTON FOR PROFCENT AND REAL TME RETREVAL

More information

Motion Estimation. There are three main types (or applications) of motion estimation:

Motion Estimation. There are three main types (or applications) of motion estimation: Members: D91922016 朱威達 R93922010 林聖凱 R93922044 謝俊瑋 Motion Estimation There are three main types (or applications) of motion estimation: Parametric motion (image alignment) The main idea of parametric motion

More information

Ghosts In the Image Aliasing Problems With Incoherent Synthetic Aperture Using A Sparse Array

Ghosts In the Image Aliasing Problems With Incoherent Synthetic Aperture Using A Sparse Array Ghosts In the Image Aliasing Problems With Incoherent Synthetic Aperture Using A Sparse Array M. Hoffmann-Kuhnt tmsmh@nus.edu.sg M. A. Chitre tmsmac@nus.edu.sg J. R. Potter tmsmh@nus.edu.sg Abstract -

More information

Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited

Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited Summary We present a new method for performing full-waveform inversion that appears

More information

Interfacing of CASA and Multistream recognition. Cedex, France. CH-1920, Martigny, Switzerland

Interfacing of CASA and Multistream recognition. Cedex, France. CH-1920, Martigny, Switzerland Interfacing of CASA and Multistream recognition Herv Glotin 2;, Fr d ric Berthommier, Emmanuel Tessier, Herv Bourlard 2 Institut de la Communication Parl e (ICP), 46 Av F lix Viallet, 3803 Grenoble Cedex,

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

Optimum Array Processing

Optimum Array Processing Optimum Array Processing Part IV of Detection, Estimation, and Modulation Theory Harry L. Van Trees WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Preface xix 1 Introduction 1 1.1 Array Processing

More information

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS

SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS SUMMARY: DISTINCTIVE IMAGE FEATURES FROM SCALE- INVARIANT KEYPOINTS Cognitive Robotics Original: David G. Lowe, 004 Summary: Coen van Leeuwen, s1460919 Abstract: This article presents a method to extract

More information

3.1. Solution for white Gaussian noise

3.1. Solution for white Gaussian noise Low complexity M-hypotheses detection: M vectors case Mohammed Nae and Ahmed H. Tewk Dept. of Electrical Engineering University of Minnesota, Minneapolis, MN 55455 mnae,tewk@ece.umn.edu Abstract Low complexity

More information

From Shapes to Sounds: A perceptual mapping

From Shapes to Sounds: A perceptual mapping From Shapes to Sounds: A perceptual mapping Vikas Chandrakant Raykar vikas@umiacs.umd.edu Abstract In this report we present a perceptually inspired mapping to convert a simple two dimensional image consisting

More information

Digital Image Processing

Digital Image Processing Digital Image Processing Third Edition Rafael C. Gonzalez University of Tennessee Richard E. Woods MedData Interactive PEARSON Prentice Hall Pearson Education International Contents Preface xv Acknowledgments

More information

SELECTION OF A MULTIVARIATE CALIBRATION METHOD

SELECTION OF A MULTIVARIATE CALIBRATION METHOD SELECTION OF A MULTIVARIATE CALIBRATION METHOD 0. Aim of this document Different types of multivariate calibration methods are available. The aim of this document is to help the user select the proper

More information

Array Shape Tracking Using Active Sonar Reverberation

Array Shape Tracking Using Active Sonar Reverberation Lincoln Laboratory ASAP-2003 Worshop Array Shape Tracing Using Active Sonar Reverberation Vijay Varadarajan and Jeffrey Kroli Due University Department of Electrical and Computer Engineering Durham, NC

More information

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction

Akarsh Pokkunuru EECS Department Contractive Auto-Encoders: Explicit Invariance During Feature Extraction Akarsh Pokkunuru EECS Department 03-16-2017 Contractive Auto-Encoders: Explicit Invariance During Feature Extraction 1 AGENDA Introduction to Auto-encoders Types of Auto-encoders Analysis of different

More information

Automatic Machinery Fault Detection and Diagnosis Using Fuzzy Logic

Automatic Machinery Fault Detection and Diagnosis Using Fuzzy Logic Automatic Machinery Fault Detection and Diagnosis Using Fuzzy Logic Chris K. Mechefske Department of Mechanical and Materials Engineering The University of Western Ontario London, Ontario, Canada N6A5B9

More information

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection

Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Using the Deformable Part Model with Autoencoded Feature Descriptors for Object Detection Hyunghoon Cho and David Wu December 10, 2010 1 Introduction Given its performance in recent years' PASCAL Visual

More information

BMVC 1996 doi: /c.10.41

BMVC 1996 doi: /c.10.41 On the use of the 1D Boolean model for the description of binary textures M Petrou, M Arrigo and J A Vons Dept. of Electronic and Electrical Engineering, University of Surrey, Guildford GU2 5XH, United

More information

Accelerometer Gesture Recognition

Accelerometer Gesture Recognition Accelerometer Gesture Recognition Michael Xie xie@cs.stanford.edu David Pan napdivad@stanford.edu December 12, 2014 Abstract Our goal is to make gesture-based input for smartphones and smartwatches accurate

More information

Norbert Schuff VA Medical Center and UCSF

Norbert Schuff VA Medical Center and UCSF Norbert Schuff Medical Center and UCSF Norbert.schuff@ucsf.edu Medical Imaging Informatics N.Schuff Course # 170.03 Slide 1/67 Objective Learn the principle segmentation techniques Understand the role

More information