Proceedings of Meetings on Acoustics
Proceedings of Meetings on Acoustics, Volume 1, 153rd Meeting of the Acoustical Society of America, Salt Lake City, Utah, 4-8 June 2007. Session 1pSC: Speech Communication. 1pSC9. Temporal characterization of auditory-visual coupling in speech. Adriano V. Barbosa, Hani C. Yehia and Eric Vatikiotis-Bateson.

This work examines the coupling between the acoustic and visual components of speech as it evolves through time. Previous work has shown a consistent correspondence between face motion and spectral acoustics, and between fundamental frequency (F0) and rigid body motion of the head [Yehia et al. (2002), Journal of Phonetics, 30]. Although these correspondences have been estimated both for sentences and for running speech, the analyses have not taken into account the temporal structure of speech. As a result, the role of temporal organization in multimodal speech cannot be assessed. The current study is a first effort to correct this deficit. We have developed an algorithm, based on recursive correlation, that computes the correlation between measurement domains (e.g., head motion and F0) as a time-varying function. Using this method, regions of high or low correlation, or of rapid transition (e.g., from high to low), can be associated with visual and auditory events. This analysis of the time-varying coupling of multimodal events has implications for speech planning and synchronization between speaker and listener.

Published by the Acoustical Society of America through the American Institute of Physics. 2008 Acoustical Society of America [DOI: 10.1121/...]. Received 27 Feb 2008; published 24 Apr 2008.

Proceedings of Meetings on Acoustics, Vol. 1 (2008), Page 1
Temporal characterization of auditory-visual coupling in speech

Adriano V. Barbosa (1), Hani C. Yehia (2) and Eric Vatikiotis-Bateson (1)
(1) Linguistics, University of British Columbia, Vancouver, Canada
(2) Electronics, Federal University of Minas Gerais, Belo Horizonte, Brazil
adriano.vilela@gmail.com, evb@interchange.ubc.ca, hani@cefala.org

1 Introduction

This paper introduces two important improvements to our system for processing multimodal signals that can be measured during spoken communication: 1) the computation of correspondences between time-varying measures that are sensitive to temporally local fluctuations; 2) the transduction of visible motion from simple video recordings in a field situation where the use of passive markers or makeup is unacceptable.

Over the past decade, we have applied both linear and nonlinear estimation techniques to characterize the largely linear correspondences between vocal tract articulation, the speech acoustics, and visible motions of the head and face. Since the time-varying vocal tract shapes both the acoustics and the face through positioning of the tongue, jaw and lips, it is not surprising that measures made in these three domains should be related somehow. When applied to isolated sentences and longer stretches of connected speech, our previous analyses have shown there to be largely linear correspondences, for example, between spectral acoustic parameters (Line Spectrum Pairs, LSP; Sugamura and Itakura, 1986) and motion of the lips, cheeks, and chin (Yehia et al., 1998, 1999), and between fundamental frequency (F0) and rigid body head motion (Yehia et al., 2002). These correspondences have been computed using relatively few (5-10) parameters for each measurement domain. For example, the first 5-6 principal
components of 2D midsagittal motion of the tongue, jaw, and lips and a similar number of components for face and head motion are typically sufficient to recover more than 95% of the variance in each of these domains. Simple linear models applied to these reduced numbers of principal components are then usually able to account for 80-90% of the cross-domain variance. The simplicity of these correspondences greatly facilitated the creation of a linguistically valid talking head animation system that, running in real time, can be driven by measured vocal tract, acoustics, or visible motion of the head and face (for details of the animation system, see Kuratate et al., 2005; for the perceptual validation, see Munhall et al., 2004).

A major limitation of this system, however, has been that the correspondences, for example, between face motion and acoustic LSPs, are computed globally over the entire signal. For isolated sentences spanning 1-2 seconds, this is fine. However, when longer stretches of data are considered, the correspondences do not improve; if anything, they weaken somewhat (Vatikiotis-Bateson and Yehia, 2002). That is, the computation of correspondences between signals is based on a static set of parameters, computed once, which means that there is no distinction between spatiotemporal variations that characterize the behavioral structure and local fluctuations that degrade the result when the computed parameters are applied recursively to estimate the time-varying behavior (Moreira and Yehia, 2006). To address this limitation, we introduce an algorithm, based on recursive correlation (Aarts et al., 2002), that computes the instantaneous cross-correlation between measurement domains, e.g., head motion and acoustic amplitude (root mean square, RMS).
This allows rapid changes in correspondence between auditory-visual events to be evaluated as a function of time, while also potentially improving the accuracy of cross-domain correspondences computed over analysis windows of any size.

A second limitation of our system has been the dependence on markers placed on the face and head for tracking 2D or 3D motion. The use of markers, either active (e.g., wired infrared LEDs) or passive, is physically invasive and distracting for naive experimental subjects, and restricts data collection to the laboratory, a situation which is stressful for the elderly and other populations unaccustomed to formal research. In this study, however, the motion data were all extracted from video recordings made in the field using a relatively unobtrusive digital video (DV) camcorder. Through the simple technique of computing the optical flow (Horn and Schunck, 1981) and then summing the amplitudes (and discarding the directions) of the
pixel motion vectors for each frame step, signals similar to those derived from marker tracking were created. As shown below (see Figures 3-5), even a single channel representing all of the motion in a video frame captures significant aspects of the spatiotemporal behavior. For the purpose of demonstration, the algorithm is applied to audiovisual behavior produced by a speaker of Plains Cree (Alberta, Canada) as part of an investigation of language as performance with R-M. Déchaine and J. Deschamps.

The analysis of the time-varying coupling of multimodal events clearly has implications for our understanding of speech organization and the assessment of communicative coordination between speaker and listener. In this study, we focus in particular on the coordination between orofacial motions, the amplitude (RMS) of the speech acoustics, and the motion of the speaker's hands. The specific use to which this has been put in the investigation of Cree is in assessing the coordination of hand gestures and speech acoustic parameters in the collaborative construction of meaning. This includes instances where the explicit meaning does not reside solely in the words (or the visible gestures), and instances where iconic use of the hands shows secondary iconic specification in anaphoric structures (e.g., bringing the hand to touch the head once in a narrative to indicate that the speaker was thinking about something, and then subsequently producing a reduced motion in the direction of the head to indicate the same thing). The timing and structure of these gestures are coordinated with, but not necessarily determined by, the speech acoustics.

The remainder of this paper is organized as follows. Section 2 presents the mathematical formulation of our instantaneous correlation algorithm. Section 3 describes the data acquisition process and discusses the optical flow analysis applied to the acquired video sequences. Results are presented and discussed in Section 4.
Finally, the paper is summarized in Section 5.

2 Instantaneous correlation algorithm

The instantaneous correlation coefficient $\rho(k)$ between signals $x(k)$ and $y(k)$ is computed as

$$\rho(k) = \frac{S_{xy}(k)}{\sqrt{S_{xx}(k)\, S_{yy}(k)}}, \qquad (1)$$
where the instantaneous covariance $S_{xy}$ between signals $x(k)$ and $y(k)$ is given by

$$S_{xy}(k) = c \sum_{l=0}^{\infty} e^{-\eta l}\, \big(x(k-l) - \bar{x}(k-l)\big)\, \big(y(k-l) - \bar{y}(k-l)\big), \qquad (2)$$

which is a modification of Equation (4) in Aarts et al. (2002). The instantaneous means $\bar{x}(k)$ and $\bar{y}(k)$ are computed as

$$\bar{x}(k) = c \sum_{m=0}^{\infty} e^{-\eta m}\, x(k-m), \qquad (3)$$

$$\bar{y}(k) = c \sum_{m=0}^{\infty} e^{-\eta m}\, y(k-m), \qquad (4)$$

with the constant $c$ given by

$$c = 1 - e^{-\eta}, \qquad (5)$$

where $\eta$ is a small positive number. It is interesting to note that the signal $\bar{x}(k)$ can be seen as the product of the constant $c$ and the output of a first-order low-pass linear filter excited by the signal $x(k)$ (the same is valid for $\bar{y}(k)$). The z-transform representation of this linear filter is given by

$$H(z) = \frac{1}{1 - e^{-\eta} z^{-1}}, \qquad |z| > e^{-\eta}. \qquad (6)$$

Furthermore, the covariance $S_{xy}$ as defined by Equation (2) can also be seen as the product of the constant $c$ and the output of the filter in Equation (6) when excited by the signal

$$\big(x(k) - \bar{x}(k)\big)\, \big(y(k) - \bar{y}(k)\big). \qquad (7)$$

3 Data recording and processing

The speaker's hands and face were recorded simultaneously at 24 frames per second (fps) using two DV cameras during a 6-minute interview. Stereo sound was recorded digitally by each camera at 48 kHz via two professional microphones (Tram-50 lapel, Sennheiser 416 shotgun). Measures of 2D motion were extracted from the video recordings using the optical flow algorithm developed by Horn and Schunck (1981).
Figure 1: A snapshot of the video and the optical flow between the frame shown and the next frame in the video sequence.

Figure 1 shows a video frame for each camera and the resulting optical flow computed between that frame and the next frame in the sequence. There are many algorithms for computing optical flow (Barron et al., 1994). However, they all have the same goal of calculating optical flow fields corresponding to the projection of the 3D motion of objects in the world onto the 2D image. A standard definition (SDTV) frame of NTSC digital video is 640 pixels wide by 480 pixels high. Each pixel has a luminance, or intensity, value within an 8-bit (0-255) range. Pixels also have values for color, but these are discarded in calculating optical flow. Moving images are recorded as changes in the intensity (and color) values for the pixels in the image array that are influenced by the motion. The optical flow algorithm does not merely register the change of intensity from one image to the next for each pixel; rather, it attempts to keep track of specific intensity values, corresponding to image objects, as they change location within the pixel array. Thus, the
algorithm assigns a motion vector, consisting of a magnitude and a direction, to each pixel based on where the intensity associated with that pixel in one image is located in the next image in the sequence. The direction is simply the line from the first pixel to the second, and the magnitude corresponds to the Euclidean distance between them. The array of motion vectors comprises the optical flow field.

For the purposes of the current analysis, only the magnitude (speed) of pixel motion is needed to assess the coordination of the hand and face-head motion with respect to each other and to the speech acoustics. The richness of this information is readily apparent in movies constructed by representing the magnitude component of optical flow as intensity, so that more rapid changes appear brighter (have higher intensity) in the image sequence. Optical flow captures the motion and makes it possible to assess the coordination of hand motion, eye blinks, head motion, and even events in the speech acoustics. In principle, the motion associated with specific regions of interest, such as the eyes, mouth, and head or the left and right hands, can be examined independently. At this early stage of development, however, our goals are to introduce the instantaneous correlation algorithm, use optical flow to recover motion from video, and show how these techniques can be used to assess the time-varying correspondences between speech acoustics, head and orofacial motion, and hand gestures. To do this, the 640x480 magnitudes of motion associated with each pair of consecutive frames are summed and stored as unidimensional streams for the video sequences acquired by the two cameras (one of the face and head, the other of the hands). Summing the motion for the entire video frame obscures the contribution of specific components (for example, the potentially differential contribution of each hand is lost) and reduces the dimensionality of each measurement domain to one time-varying measure.
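The flow-to-motion-stream reduction just described is straightforward to sketch. The following Python/NumPy fragment is an illustrative simplification, not the authors' implementation: it uses a minimal Horn-Schunck iteration with simple gradient estimates, and the parameter values (alpha, n_iter) are arbitrary assumptions.

```python
import numpy as np

def horn_schunck(frame1, frame2, alpha=1.0, n_iter=50):
    """Estimate a dense optical flow field (u, v) between two grayscale
    frames with the classic Horn-Schunck iteration (simplified sketch)."""
    I1 = frame1.astype(float)
    I2 = frame2.astype(float)
    # Spatial and temporal intensity derivatives.
    Ix = np.gradient(I1, axis=1)
    Iy = np.gradient(I1, axis=0)
    It = I2 - I1
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)
    for _ in range(n_iter):
        # Local flow averages (4-neighbour mean) enforce smoothness.
        u_avg = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                 np.roll(u, 1, 1) + np.roll(u, -1, 1)) / 4.0
        v_avg = (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                 np.roll(v, 1, 1) + np.roll(v, -1, 1)) / 4.0
        # Horn-Schunck update: brightness constancy vs. smoothness.
        num = Ix * u_avg + Iy * v_avg + It
        den = alpha ** 2 + Ix ** 2 + Iy ** 2
        u = u_avg - Ix * num / den
        v = v_avg - Iy * num / den
    return u, v

def summed_motion(u, v):
    """Collapse a flow field to a single number: the summed magnitude
    of the pixel motion vectors, discarding direction, as in the text."""
    return float(np.hypot(u, v).sum())
```

Applying summed_motion to the flow between every pair of consecutive frames yields the kind of unidimensional motion stream used here; a production implementation would follow Horn and Schunck's original derivative estimators and add a convergence test.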
However, as a first step, this has two advantages: 1) no a priori decisions have been made about which aspects of the motion or which physical regions are the most relevant to the cross-domain correspondences; and 2) as shown below, these supposedly impoverished measures are surprisingly well coordinated across domains. In what follows, the instantaneous correlation algorithm is used to compare the two streams of motion magnitudes, summed from the optical flow results, with each other (Figure 3) and with the time-varying RMS amplitude of the acoustics (Figures 4-5).
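Because the exponential sums in Equations (2)-(4) satisfy one-sample recursions (e.g., the mean obeys x̄(k) = e^(-η) x̄(k-1) + c x(k)), the instantaneous correlation can be computed in a single pass over the signals. The following Python/NumPy sketch is our own rendering of the equations, not the original code:

```python
import numpy as np

def instantaneous_correlation(x, y, eta=0.02):
    """Time-varying correlation rho(k) between two signals, computed with
    the exponentially weighted recursions of Equations (1)-(5).
    Larger eta weights recent samples more heavily (greater sensitivity)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.exp(-eta)        # forgetting factor e^(-eta)
    c = 1.0 - a             # normalizing constant, Equation (5)
    xbar = ybar = 0.0       # instantaneous means, Equations (3)-(4)
    sxx = syy = sxy = 0.0   # instantaneous (co)variances, Equation (2)
    rho = np.zeros(len(x))
    for k in range(len(x)):
        xbar = a * xbar + c * x[k]
        ybar = a * ybar + c * y[k]
        dx = x[k] - xbar
        dy = y[k] - ybar
        sxy = a * sxy + c * dx * dy
        sxx = a * sxx + c * dx * dx
        syy = a * syy + c * dy * dy
        denom = np.sqrt(sxx * syy)
        rho[k] = sxy / denom if denom > 0.0 else 0.0
    return rho
```

Feeding in two coupled streams, such as the summed hand-motion and face-motion signals resampled to a common rate, drives rho(k) toward +1 where they co-vary and toward -1 where they oppose.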
4 Results

Computing instantaneous correlations that match what we readily perceive qualitatively from watching the optical flow movies requires tuning the algorithm to the behavioral data. The algorithm should not be too sensitive to rapid changes in correspondence, e.g., changes due to noise or to higher frequency components in the behavior. Signal noise can be reduced by low-pass filtering (10 Hz used here). Behavioral noise is another issue: fluctuations in synchronization and higher frequencies will confound a sensitive function, while a less sensitive function will miss the subtle changes in spatiotemporal patterning.

Figure 2: Comparison of the temporal effects on relative weight for two values of η: 0.2 and 0.02 (the latter used in our analysis).

Sensitivity is determined by the exponent η in Equation (3). The larger its value, the more sensitive the correlation is to rapid changes in the correspondence between signals. Figure 2 shows, for two different values of η, how preceding samples influence the correlation estimate. The smaller value (η = 0.02) gives a slower decline in the weight of preceding samples, decaying to less than 1% after 250 samples, and has proven to be a good value for the continuous correlation of the running speech data recorded for Plains Cree. This amounts to about a 5 sec. window when applied to signals resampled at
a common rate of 48 Hz.

Figure 3: Time-series plots of instantaneous correlation (top) calculated from two starting points (solid 8 sec. prior to window, dashed from window onset), summed optical flow for the hands video (middle), and summed optical flow for the head and face video (bottom).

This can be seen in the top panel of Figure 3, where the dashed correlation trace was computed from the start of the window and the solid trace 8 sec. earlier. The relatively high correlation in the first 3-4 sec. of Figure 3 can be seen by inspecting the motion traces for the hands and head/face. Also, comparing the audio waveform (Figure 3, top) and the motion traces shows that the 5 bursts of hand motion activity are apparently more synchronous with the speech signal than is the face motion. This latter observation is supported by the instantaneous correlation results shown for RMS amplitude and hand motion in Figure 4, and for RMS amplitude and face/head motion in Figure 5. Throughout the segments of data depicted in these figures, the correlation between RMS amplitude and face/head motion suffers from both poor synchronization and frequency mismatches, in which the optical flow for the
face/head sums across the perhaps semi-independent orofacial deformations due to speech articulation, eye-blinks, and head motion.

Figure 4: Time-series plots of instantaneous correlation (top), RMS amplitude (middle), and summed hand motion (bottom).

The time-course of RMS amplitude clearly corresponds to the time-course of vocal tract opening (vowels) and closing (consonants). It has also been suggested that head motion is associated with RMS amplitude, but not necessarily in strict synchrony with successive syllables (Munhall et al., 2004). This difference in phasing alone would contribute to the complex frequencies observed for face/head motion. Eye-blinks add yet another dimension to the temporal stream of events. Although beyond the scope of the current paper, Matlab tools have been created to accommodate variable synchronization between signals and to do spectral decomposition prior to computing the correlations (Barbosa et al., 2007a).
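The temporal scope behind Figure 2 is easy to verify numerically: the weight on a sample l steps in the past is e^(-ηl), which drops below 1% once l exceeds ln(100)/η, i.e., after roughly 23 samples for η = 0.2 and roughly 230 samples for η = 0.02 (about 5 s at the common 48 Hz rate). A quick Python check:

```python
import numpy as np

def samples_to_one_percent(eta, max_l=1000):
    """Number of past samples after which the exponential weight
    e^(-eta * l) falls below 1% (cf. Figure 2)."""
    l = np.arange(max_l)
    w = np.exp(-eta * l)
    return int(np.argmax(w < 0.01))

for eta, label in ((0.2, "sensitive"), (0.02, "used in our analysis")):
    n = samples_to_one_percent(eta)
    print(f"eta={eta} ({label}): weight < 1% after {n} samples "
          f"(~{n / 48.0:.1f} s at 48 Hz)")
```

The 48 Hz conversion is only for comparison with the roughly 5 sec. window quoted in the text.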
Figure 5: Time-series plots of instantaneous correlation (top), RMS amplitude (middle), and summed face motion (bottom).

5 Summary discussion

As can be seen from this simple demonstration, both optical flow and continuous instantaneous correlation promise to be useful in assessing the spatiotemporal coupling between behavioral signals that can be measured as non-invasively as possible. In the implementation presented here, relatively high correlations are observed only for events that are tightly synchronized. Since presenting this poster in June 2007 (Barbosa et al., 2007b), the algorithm has been modified to compute correlations 1) weighted by any combination of preceding and following samples, and 2) at any temporal offset between the two signals. These improvements preclude real-time processing, but provide more robust assessments of the correspondence between signals. Both modifications should afford larger values of η that asymptote more
quickly, thus effectively reducing the size of the filter window (Figure 2). While this results in greater sensitivity to fluctuations in the instantaneous correlation coefficient, the greater sensitivity can be used to assess shifts in temporal lag without necessarily reducing the degree of correspondence that occurs naturally during the production of coordinated behaviors such as speech and music. Most of this expanded functionality has already been incorporated in the Matlab toolbox that we have created for processing and analyzing multimodal speech data (Barbosa et al., 2007a), and is available to the research community.

Even with these improvements, there are still instances of coupling between the speech and gestural behavior that cannot be easily captured. These are due, in part, to summing the motion for the face/head. Therefore, we are currently attempting a more fine-grained decomposition of this complex into head, perioral, and eye components, which we know are each coordinated with the production of speech. Finally, if we need to make the algorithm smarter so that its sensitivity can be modified on the fly, we will replace the current instantaneous correlation algorithm with a learning algorithm, e.g., Kalman filtering (Kalman, 1960), that combines prediction from previous patterns of behavior with local estimates of the instantaneous correlation.

Acknowledgment

Support for this work was provided by NSERC and SSHRC grants to E. Vatikiotis-Bateson. The Plains Cree data were collected in collaboration with Rose-Marie Déchaine, Clancy Dennehy, and Joseph Deschamps.

References

Aarts, R. M., Irwan, R., and Janssen, A. J. E. M. (2002). Efficient tracking of the cross-correlation coefficient. IEEE Transactions on Speech and Audio Processing, 10(6).

Barbosa, A. V., Yehia, H. C., and Vatikiotis-Bateson, E. (2007a). Matlab toolbox for audiovisual speech processing.
In Vroomen, J., Swerts, M., and Krahmer, E., editors, International Conference on Auditory-Visual Speech Processing (AVSP 2007), pages 32-37, The Netherlands. ISCA.
Barbosa, A. V., Yehia, H. C., and Vatikiotis-Bateson, E. (2007b). Temporal characterization of auditory-visual coupling in speech. Journal of the Acoustical Society of America, 121:3044.

Barron, J. L., Fleet, D. J., and Beauchemin, S. S. (1994). Performance of optical flow techniques. International Journal of Computer Vision, 12.

Horn, B. K. P. and Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17.

Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82 (Series D).

Kuratate, T., Vatikiotis-Bateson, E., and Yehia, H. C. (2005). Estimation and animation of faces using facial motion mapping and a 3D face database. In Clement, J. G. and Marks, M. K., editors, Computer-Graphic Facial Reconstruction. Academic Press, Amsterdam.

Moreira, K. S. and Yehia, H. C. (2006). Analysis of the variability of the coupling between facial motion and speech acoustics. In Yehia, H. C., Demolin, D., and Laboissière, R., editors, International Seminar on Speech Production (ISSP 2006), pages 109-116, Brazil. UFMG.

Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., and Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15(2).

Sugamura, N. and Itakura, F. (1986). Speech analysis and synthesis methods developed at ECL in NTT: from LPC to LSP. Speech Communication, 5.

Vatikiotis-Bateson, E. and Yehia, H. C. (2002). Speaking mode variability in multimodal speech production. IEEE Transactions on Neural Networks, 13(4).

Yehia, H. C., Kuratate, T., and Vatikiotis-Bateson, E. (1999). Using speech acoustics to drive facial motion. In Ohala, J. J., Hasegawa, Y., Ohala,
M., Granville, D., and Bailey, A. C., editors, Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco, CA. Linguistics Dept., UC Berkeley.

Yehia, H. C., Kuratate, T., and Vatikiotis-Bateson, E. (2002). Linking facial animation, head motion, and speech acoustics. Journal of Phonetics, 30(3).

Yehia, H. C., Rubin, P. E., and Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26.
ADVANCED IMAGE PROCESSING METHODS FOR ULTRASONIC NDE RESEARCH C. H. Chen, University of Massachusetts Dartmouth, N. Dartmouth, MA USA Abstract: The significant progress in ultrasonic NDE systems has now
More informationComparing computer vision analysis of signed language video with motion capture recordings
Comparing computer vision analysis of signed language video with motion capture recordings Matti Karppa 1, Tommi Jantunen 2, Ville Viitaniemi 1, Jorma Laaksonen 1, Birgitta Burger 3, and Danny De Weerdt
More informationBoth LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal.
Perceptual coding Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal. Perceptual encoders, however, have been designed for the compression of general
More informationFacial Expression Analysis for Model-Based Coding of Video Sequences
Picture Coding Symposium, pp. 33-38, Berlin, September 1997. Facial Expression Analysis for Model-Based Coding of Video Sequences Peter Eisert and Bernd Girod Telecommunications Institute, University of
More informationAcoustic to Articulatory Mapping using Memory Based Regression and Trajectory Smoothing
Acoustic to Articulatory Mapping using Memory Based Regression and Trajectory Smoothing Samer Al Moubayed Center for Speech Technology, Department of Speech, Music, and Hearing, KTH, Sweden. sameram@kth.se
More information17. SEISMIC ANALYSIS MODELING TO SATISFY BUILDING CODES
17. SEISMIC ANALYSIS MODELING TO SATISFY BUILDING CODES The Current Building Codes Use the Terminology: Principal Direction without a Unique Definition 17.1 INTRODUCTION { XE "Building Codes" }Currently
More informationAccurate 3D Face and Body Modeling from a Single Fixed Kinect
Accurate 3D Face and Body Modeling from a Single Fixed Kinect Ruizhe Wang*, Matthias Hernandez*, Jongmoo Choi, Gérard Medioni Computer Vision Lab, IRIS University of Southern California Abstract In this
More informationReal Time Motion Detection Using Background Subtraction Method and Frame Difference
Real Time Motion Detection Using Background Subtraction Method and Frame Difference Lavanya M P PG Scholar, Department of ECE, Channabasaveshwara Institute of Technology, Gubbi, Tumkur Abstract: In today
More informationComparison Between The Optical Flow Computational Techniques
Comparison Between The Optical Flow Computational Techniques Sri Devi Thota #1, Kanaka Sunanda Vemulapalli* 2, Kartheek Chintalapati* 3, Phanindra Sai Srinivas Gudipudi* 4 # Associate Professor, Dept.
More informationSynthesizing Realistic Facial Expressions from Photographs
Synthesizing Realistic Facial Expressions from Photographs 1998 F. Pighin, J Hecker, D. Lischinskiy, R. Szeliskiz and D. H. Salesin University of Washington, The Hebrew University Microsoft Research 1
More informationEfficient Block Matching Algorithm for Motion Estimation
Efficient Block Matching Algorithm for Motion Estimation Zong Chen International Science Inde Computer and Information Engineering waset.org/publication/1581 Abstract Motion estimation is a key problem
More informationHand-Eye Calibration from Image Derivatives
Hand-Eye Calibration from Image Derivatives Abstract In this paper it is shown how to perform hand-eye calibration using only the normal flow field and knowledge about the motion of the hand. The proposed
More information15 Data Compression 2014/9/21. Objectives After studying this chapter, the student should be able to: 15-1 LOSSLESS COMPRESSION
15 Data Compression Data compression implies sending or storing a smaller number of bits. Although many methods are used for this purpose, in general these methods can be divided into two broad categories:
More informationTree-based Cluster Weighted Modeling: Towards A Massively Parallel Real- Time Digital Stradivarius
Tree-based Cluster Weighted Modeling: Towards A Massively Parallel Real- Time Digital Stradivarius Edward S. Boyden III e@media.mit.edu Physics and Media Group MIT Media Lab 0 Ames St. Cambridge, MA 039
More informationDigital Volume Correlation for Materials Characterization
19 th World Conference on Non-Destructive Testing 2016 Digital Volume Correlation for Materials Characterization Enrico QUINTANA, Phillip REU, Edward JIMENEZ, Kyle THOMPSON, Sharlotte KRAMER Sandia National
More informationSPREAD SPECTRUM AUDIO WATERMARKING SCHEME BASED ON PSYCHOACOUSTIC MODEL
SPREAD SPECTRUM WATERMARKING SCHEME BASED ON PSYCHOACOUSTIC MODEL 1 Yüksel Tokur 2 Ergun Erçelebi e-mail: tokur@gantep.edu.tr e-mail: ercelebi@gantep.edu.tr 1 Gaziantep University, MYO, 27310, Gaziantep,
More informationText-Independent Speaker Identification
December 8, 1999 Text-Independent Speaker Identification Til T. Phan and Thomas Soong 1.0 Introduction 1.1 Motivation The problem of speaker identification is an area with many different applications.
More informationJoint Matrix Quantization of Face Parameters and LPC Coefficients for Low Bit Rate Audiovisual Speech Coding
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 12, NO. 3, MAY 2004 265 Joint Matrix Quantization of Face Parameters and LPC Coefficients for Low Bit Rate Audiovisual Speech Coding Laurent Girin
More informationCOS 116 The Computational Universe Laboratory 4: Digital Sound and Music
COS 116 The Computational Universe Laboratory 4: Digital Sound and Music In this lab you will learn about digital representations of sound and music, especially focusing on the role played by frequency
More informationSPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION
Far East Journal of Electronics and Communications Volume 3, Number 2, 2009, Pages 125-140 Published Online: September 14, 2009 This paper is available online at http://www.pphmj.com 2009 Pushpa Publishing
More informationMixture Models and EM
Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering
More informationCOS 116 The Computational Universe Laboratory 4: Digital Sound and Music
COS 116 The Computational Universe Laboratory 4: Digital Sound and Music In this lab you will learn about digital representations of sound and music, especially focusing on the role played by frequency
More informationPerformance Evaluation of Internet Telephony Systems Through Quantitative Assessment
Performance evaluation of Internet telephony systems through quantitative assessment Ng, C.H., Foo, S., & Hui, S.C. (1997). Proc. of 1997 National Undergraduate Research (NUR) Congress, Singapore, 1097-1102.
More informationAnimated Talking Head With Personalized 3D Head Model
Animated Talking Head With Personalized 3D Head Model L.S.Chen, T.S.Huang - Beckman Institute & CSL University of Illinois, Urbana, IL 61801, USA; lchen@ifp.uiuc.edu Jörn Ostermann, AT&T Labs-Research,
More informationModeling of an MPEG Audio Layer-3 Encoder in Ptolemy
Modeling of an MPEG Audio Layer-3 Encoder in Ptolemy Patrick Brown EE382C Embedded Software Systems May 10, 2000 $EVWUDFW MPEG Audio Layer-3 is a standard for the compression of high-quality digital audio.
More informationStatistical image models
Chapter 4 Statistical image models 4. Introduction 4.. Visual worlds Figure 4. shows images that belong to different visual worlds. The first world (fig. 4..a) is the world of white noise. It is the world
More informationEvaluation of Moving Object Tracking Techniques for Video Surveillance Applications
International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Evaluation
More informationAn analysis of the dimensionality of jawmotion
Journal of Phonetics (1995) 23, 101-117 An analysis of the dimensionality of jawmotion in speech Eric Vatikiotis-Bateson A TR Human nformation Processing Research Laboratories, Kyoto, Japan and David J.
More informationPerformance Evaluation of the Eigenface Algorithm on Plain-Feature Images in Comparison with Those of Distinct Features
American Journal of Signal Processing 2015, 5(2): 32-39 DOI: 10.5923/j.ajsp.20150502.02 Performance Evaluation of the Eigenface Algorithm on Plain-Feature Images in Comparison with Those of Distinct Features
More informationBuilding speaker-specific lip models for talking heads from 3D face data
Building speaker-specific lip models for talking heads from 3D face data Takaaki Kuratate 1,2, Marcia Riley 1 1 Institute for Cognitive Systems, Technical University Munich, Germany 2 MARCS Auditory Laboratories,
More informationGetting Started with Crazy Talk 6
Getting Started with Crazy Talk 6 Crazy Talk 6 is an application that generates talking characters from an image or photo, as well as facial animation for video. Importing an Image Launch Crazy Talk and
More informationSURVEY OF LOCAL AND GLOBAL OPTICAL FLOW WITH COARSE TO FINE METHOD
SURVEY OF LOCAL AND GLOBAL OPTICAL FLOW WITH COARSE TO FINE METHOD M.E-II, Department of Computer Engineering, PICT, Pune ABSTRACT: Optical flow as an image processing technique finds its applications
More informationA new take on FWI: Wavefield Reconstruction Inversion
A new take on FWI: Wavefield Reconstruction Inversion T. van Leeuwen 1, F.J. Herrmann and B. Peters 1 Centrum Wiskunde & Informatica, Amsterdam, The Netherlands University of British Columbia, dept. of
More informationModule 10 MULTIMEDIA SYNCHRONIZATION
Module 10 MULTIMEDIA SYNCHRONIZATION Lesson 33 Basic definitions and requirements Instructional objectives At the end of this lesson, the students should be able to: 1. Define synchronization between media
More informationLow Cost Motion Capture
Low Cost Motion Capture R. Budiman M. Bennamoun D.Q. Huynh School of Computer Science and Software Engineering The University of Western Australia Crawley WA 6009 AUSTRALIA Email: budimr01@tartarus.uwa.edu.au,
More informationChapter 3 Set Redundancy in Magnetic Resonance Brain Images
16 Chapter 3 Set Redundancy in Magnetic Resonance Brain Images 3.1 MRI (magnetic resonance imaging) MRI is a technique of measuring physical structure within the human anatomy. Our proposed research focuses
More informationA NEURAL NETWORK APPLICATION FOR A COMPUTER ACCESS SECURITY SYSTEM: KEYSTROKE DYNAMICS VERSUS VOICE PATTERNS
A NEURAL NETWORK APPLICATION FOR A COMPUTER ACCESS SECURITY SYSTEM: KEYSTROKE DYNAMICS VERSUS VOICE PATTERNS A. SERMET ANAGUN Industrial Engineering Department, Osmangazi University, Eskisehir, Turkey
More informationProduction of Video Images by Computer Controlled Cameras and Its Application to TV Conference System
Proc. of IEEE Conference on Computer Vision and Pattern Recognition, vol.2, II-131 II-137, Dec. 2001. Production of Video Images by Computer Controlled Cameras and Its Application to TV Conference System
More informationMusic 209 Advanced Topics in Computer Music Lecture 8 Off-line Concatenation Control
Music 209 Advanced Topics in Computer Music Lecture 8 Off-line Concatenation Control Pre-recorded audio and MIDI performances: we know data for future t s. 2006-3-9 Professor David Wessel (with John Lazzaro)
More informationSupplementary Figure 1. Decoding results broken down for different ROIs
Supplementary Figure 1 Decoding results broken down for different ROIs Decoding results for areas V1, V2, V3, and V1 V3 combined. (a) Decoded and presented orientations are strongly correlated in areas
More informationUsing the rear projection of the Socibot Desktop robot for creation of applications with facial expressions
IOP Conference Series: Materials Science and Engineering PAPER OPEN ACCESS Using the rear projection of the Socibot Desktop robot for creation of applications with facial expressions To cite this article:
More informationStatics: the abusive power of trimming
John C. Bancroft, Alan Richards, and Charles P. Ursenbach Statics: the abusive power of trimming ABSTRACT The application of trim statics directly to each seismic trace can be very dangerous as any seismic
More information3D Mesh Sequence Compression Using Thin-plate Spline based Prediction
Appl. Math. Inf. Sci. 10, No. 4, 1603-1608 (2016) 1603 Applied Mathematics & Information Sciences An International Journal http://dx.doi.org/10.18576/amis/100440 3D Mesh Sequence Compression Using Thin-plate
More informationProceedings of Meetings on Acoustics
Proceedings of Meetings on Acoustics Volume 9, http://acousticalsociety.org/ ICA Montreal Montreal, Canada - June Speech Communication Session asc: Linking Perception and Production (Poster Session) asc.
More informationCS 4495 Computer Vision A. Bobick. Motion and Optic Flow. Stereo Matching
Stereo Matching Fundamental matrix Let p be a point in left image, p in right image l l Epipolar relation p maps to epipolar line l p maps to epipolar line l p p Epipolar mapping described by a 3x3 matrix
More informationFace Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm
Face Tracking : An implementation of the Kanade-Lucas-Tomasi Tracking algorithm Dirk W. Wagener, Ben Herbst Department of Applied Mathematics, University of Stellenbosch, Private Bag X1, Matieland 762,
More informationBOSS. Quick Start Guide For research use only. Blackrock Microsystems, LLC. Blackrock Offline Spike Sorter. User s Manual. 630 Komas Drive Suite 200
BOSS Quick Start Guide For research use only Blackrock Microsystems, LLC 630 Komas Drive Suite 200 Salt Lake City UT 84108 T: +1 801 582 5533 www.blackrockmicro.com support@blackrockmicro.com 1 2 1.0 Table
More informationCreating a Lip Sync and Using the X-Sheet in Dragonframe
Creating a Lip Sync and Using the X-Sheet in Dragonframe Contents A. Creating a Lip Sync in Dragonframe B. Loading the X-Sheet in Dragon Frame C. Setting Notes and Flag/Reminders in the X-Sheet 1. Trackreading/Breaking
More informationTopics in Linguistic Theory: Laboratory Phonology Spring 2007
MIT OpenCourseWare http://ocw.mit.edu 24.910 Topics in Linguistic Theory: Laboratory Phonology Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationLOW-DIMENSIONAL MOTION FEATURES FOR AUDIO-VISUAL SPEECH RECOGNITION
LOW-DIMENSIONAL MOTION FEATURES FOR AUDIO-VISUAL SPEECH Andrés Vallés Carboneras, Mihai Gurban +, and Jean-Philippe Thiran + + Signal Processing Institute, E.T.S.I. de Telecomunicación Ecole Polytechnique
More informationAdaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited
Adaptive Waveform Inversion: Theory Mike Warner*, Imperial College London, and Lluís Guasch, Sub Salt Solutions Limited Summary We present a new method for performing full-waveform inversion that appears
More informationHEALTH MONITORING OF INDUCTION MOTOR FOR VIBRATION ANALYSIS
HEALTH MONITORING OF INDUCTION MOTOR FOR VIBRATION ANALYSIS Chockalingam ARAVIND VAITHILINGAM aravind_147@yahoo.com UCSI University Kualalumpur Gilbert THIO gthio@ucsi.edu.my UCSI University Kualalumpur
More informationInternational Journal of Advance Engineering and Research Development
Scientific Journal of Impact Factor (SJIF): 4.72 International Journal of Advance Engineering and Research Development Volume 4, Issue 11, November -2017 e-issn (O): 2348-4470 p-issn (P): 2348-6406 Comparative
More informationCompression; Error detection & correction
Compression; Error detection & correction compression: squeeze out redundancy to use less memory or use less network bandwidth encode the same information in fewer bits some bits carry no information some
More informationBluray (
Bluray (http://www.blu-ray.com/faq) MPEG-2 - enhanced for HD, also used for playback of DVDs and HDTV recordings MPEG-4 AVC - part of the MPEG-4 standard also known as H.264 (High Profile and Main Profile)
More informationPredictive Interpolation for Registration
Predictive Interpolation for Registration D.G. Bailey Institute of Information Sciences and Technology, Massey University, Private bag 11222, Palmerston North D.G.Bailey@massey.ac.nz Abstract Predictive
More informationREALISTIC FACIAL EXPRESSION SYNTHESIS FOR AN IMAGE-BASED TALKING HEAD. Kang Liu and Joern Ostermann
REALISTIC FACIAL EXPRESSION SYNTHESIS FOR AN IMAGE-BASED TALKING HEAD Kang Liu and Joern Ostermann Institut für Informationsverarbeitung, Leibniz Universität Hannover Appelstr. 9A, 3167 Hannover, Germany
More informationEffects of multi-scale velocity heterogeneities on wave-equation migration Yong Ma and Paul Sava, Center for Wave Phenomena, Colorado School of Mines
Effects of multi-scale velocity heterogeneities on wave-equation migration Yong Ma and Paul Sava, Center for Wave Phenomena, Colorado School of Mines SUMMARY Velocity models used for wavefield-based seismic
More informationVIDEO DENOISING BASED ON ADAPTIVE TEMPORAL AVERAGING
Engineering Review Vol. 32, Issue 2, 64-69, 2012. 64 VIDEO DENOISING BASED ON ADAPTIVE TEMPORAL AVERAGING David BARTOVČAK Miroslav VRANKIĆ Abstract: This paper proposes a video denoising algorithm based
More informationFacial Motion Capture Editing by Automated Orthogonal Blendshape Construction and Weight Propagation
Facial Motion Capture Editing by Automated Orthogonal Blendshape Construction and Weight Propagation Qing Li and Zhigang Deng Department of Computer Science University of Houston Houston, TX, 77204, USA
More informationPerspectives on Multimedia Quality Prediction Methodologies for Advanced Mobile and IP-based Telephony
Perspectives on Multimedia Quality Prediction Methodologies for Advanced Mobile and IP-based Telephony Nobuhiko Kitawaki University of Tsukuba 1-1-1, Tennoudai, Tsukuba-shi, 305-8573 Japan. E-mail: kitawaki@cs.tsukuba.ac.jp
More information3D Face and Hand Tracking for American Sign Language Recognition
3D Face and Hand Tracking for American Sign Language Recognition NSF-ITR (2004-2008) D. Metaxas, A. Elgammal, V. Pavlovic (Rutgers Univ.) C. Neidle (Boston Univ.) C. Vogler (Gallaudet) The need for automated
More informationSpeech to Head Gesture Mapping in Multimodal Human-Robot Interaction
1 Speech to Head Gesture Mapping in Multimodal Human-Robot Interaction Amir Aly and Adriana Tapus Cognitive Robotics Lab, ENSTA-ParisTech, France {amir.aly, adriana.tapus}@ensta-paristech.fr Abstract In
More informationarxiv: v1 [cs.cv] 2 May 2016
16-811 Math Fundamentals for Robotics Comparison of Optimization Methods in Optical Flow Estimation Final Report, Fall 2015 arxiv:1605.00572v1 [cs.cv] 2 May 2016 Contents Noranart Vesdapunt Master of Computer
More information