Lipreading using Profile Lips Rebuilt by 3D Data from the Kinect


Journal of Computational Information Systems 11: 7 (2015) Available at

Lipreading using Profile Lips Rebuilt by 3D Data from the Kinect

Jianrong WANG 1, Yongchun GAO 1, Ju ZHANG 1, Jianguo WEI 2, Jianwu DANG 1
1 School of Computer Science and Technology, Tianjin University, Tianjin, China
2 School of Computer Software, Tianjin University, Tianjin, China

Abstract

Lipreading plays an important role in helping hearing-impaired people understand fluent speech. Most studies of lipreading assume frontal images of the speaker's face, which are easily affected by variations in each speaker's lip size and in illumination. This paper concentrates on the contribution of 3D data captured by the Kinect to the robustness of a lipreading system. To supplement the information of the frontal lip, left profile lips and right profile lips are rebuilt from the 3D coordinates. In the experiments, the feature that integrates the profile lips yields superior performance over the visual-only feature, with a relative increase of 7% in recognition rate.

Keywords: Lipreading; 3D Data; Kinect; Profile Lip

1 Introduction

Lipreading improves the robustness of speech recognition either by establishing and analyzing parameters of mouth movement, or by using the image sequence directly for classification and identification. Lipreading systems that use image sequences of the speaker's lips have recently attracted significant interest, and a great deal of progress has been achieved [1, 2, 11]. Research on lipreading emphasizes lip detection and feature extraction. Feature extraction is a crucial part of a lipreading system, and various visual features have been proposed in the literature. In general, they can be categorized into three kinds: 1. pixel based, where the entire image containing the speaker's lip is considered informative; 2. lip contour based, in which a lip contour model is used as the visual feature; and 3. a combination of 1 and 2. Among these approaches, the pixel-based one is considered the most effective [12, 13, 16]. However, differences may arise when collecting the data, such as different lip sizes across speakers, local and global changes in illumination, and variations in head pose; in addition, poor mouth ROI localization may occur during lip detection. These differences can significantly degrade the performance of a lipreading system.

Supported in part by the National Natural Science Foundation (General Program No. , National Key Basic Research Program No. 2013CB and Key Projects No. ). Corresponding author. Email: jianguo.fr@gmail.com (Jianguo WEI)

Copyright 2015 Binary Information Press. DOI: /jcis13691. April 1, 2015

To alleviate the problems above, a few datasets and experimental results have been published that utilize some form of 3D information from the speaker's face. For example, [10] developed a lip tracking system that allows the speaker's head to move in 3D and rotate up to 30 degrees away from the camera. In [19], three-dimensional characteristics were used for word recognition, and the results indicated that the recognition rate for three-dimensional characteristics was higher than that for two-dimensional ones. The in-car Spanish database AV@CAR was captured from six different angles in order to reconstruct a 3D textured mesh of the speaker's face [14]. Recently, the MS Kinect has become available; its sensor is supported by an SDK that provides tools for real-time face tracking and predefines the face with 121 3D coordinate points. As a result, some researchers have concentrated on multi-modal AVSR systems: the University of Texas recorded its own BAVCD database [4] and built a multi-modal AVSR system investigating the use of facial depth information [5, 6], and a Turkish university employed angles computed from the 3D coordinate points as the feature, with a KNN classifier used to classify the words [22]. However, most research on lipreading has been confined to the frontal face, whether using visual features or 3D data; in the real world, though, it is hard for anyone to keep a frontal view all along. Consequently, some work has focused on non-frontal video data for AVASR [23, 24], and the results demonstrated that useful speech information can be gained from non-frontal visual features; the profile lip feature even yielded superior results in [8]. The main purpose of this paper is to build a new lipreading system that integrates the speech information extracted from the visual data with profile lips rebuilt from the 3D data. This constitutes the first attempt at a lipreading system using this novel feature. In this paper, a Chinese audio-visual corpus with 3D data was collected, and a projection technique using 3D coordinates to locate the lip is introduced. Considering that changes in the speaker's head pose may leave different information in the two profile lips, the main contribution of this paper is to rebuild both sides of the profile lips from the 3D coordinates captured by the Kinect, to supplement the information of the frontal lip. The remainder of this paper is organized as follows: Section 2 introduces the new lipreading system, which includes locating the lip by 3D projection, rebuilding the profile lips, and integrating them with the visual information. The results are presented in Section 3. Finally, Section 4 concludes this work.

2 The Lipreading System

This part consists of lip location by 3D projection, rebuilding of the profile lips, feature extraction applied to the visual data as well as the 3D data, and model training and testing on the feature that integrates the visual feature with the profile lip feature. These are discussed in more detail in the following subsections. The lipreading system is depicted in Fig. 1.

2.1 Lip location by 3D projection

Before feature extraction, the primary task is lip location. This paper adopts 3D projection rather than traditional methods based on image processing [13, 20]. The 3D projection uses the 3D data captured by the Kinect together with the imaging principle, shown in Fig. 2, to estimate the coordinate of the center pixel of the lip; then, taking this center as the midpoint, it extends outward to

obtain the lip portion: a 32*32-pixel area of the lip.

Fig. 1: Overview of the lipreading system integrating the visual data with 3D data

Fig. 2: The schematic of the Kinect imaging principle

The schematic of the Kinect imaging principle defines three coordinate systems: x1o1y1 is the camera coordinate system, x2o2y2 is the imaging-plane coordinate system, and uv is the image (pixel) coordinate system. The center of the camera coordinate system lies on the same straight line as that of the imaging-plane coordinate system. Assuming the distance between them is m1, the coordinate of o2 is (0, 0, m1). Let the horizontal view angle be α and the vertical view angle be β. If m1 is known, the real length and width of the imaging plane are, respectively,

  Length: L = 2 m1 tan(α/2)    (1)
  Width:  S = 2 m1 tan(β/2)    (2)

Suppose there is a point in space with coordinate (x1, y1, z1), and its image point on the imaging plane is a, with coordinate (x2, y2, z2), where z2 = m1. Since the point, its image, and the camera center are collinear, similar triangles give

  x1/x2 = y1/y2 = z1/z2    (3)
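As a sanity check on this geometry, the projection can be sketched in code. This is a minimal illustration rather than the authors' implementation; the 57/43-degree view angles and the 640*480 resolution are the Kinect values used later in this section, and shifting the optical center to the image center (with a vertical flip) is an assumption of this sketch.

```python
import math

def project_to_pixel(point, m1=0.7, alpha=math.radians(57),
                     beta=math.radians(43), res=(640, 480)):
    """Map a 3D point (x1, y1, z1) in camera coordinates to a pixel.

    Follows Eqs. (1)-(3): the imaging plane sits at distance m1 from the
    camera center, and similar triangles scale the point onto it.
    """
    x1, y1, z1 = point
    # Eq. (1)-(2): physical size of the imaging plane.
    L = 2 * m1 * math.tan(alpha / 2)
    S = 2 * m1 * math.tan(beta / 2)
    # Eq. (3): similar triangles, with z2 = m1.
    x2 = x1 * m1 / z1
    y2 = y1 * m1 / z1
    # Pixels are proportional to the imaging plane; moving the optical
    # center to the image center is an assumption of this sketch.
    m, n = res
    u = m / 2 + x2 * m / L
    v = n / 2 - y2 * n / S
    return u, v
```

A point on the optical axis, e.g. project_to_pixel((0.0, 0.0, 1.0)), maps to the image center (320.0, 240.0), and the result depends on m1 only through the plane-size terms, which cancel for on-axis points.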

In addition, assume the pixel coordinate of a in the imaging plane, with o2 as the center, is (x3, y3, z3). If the resolution of the image is m*n, the formulas below follow from the fact that the pixel grid is proportional to the imaging plane:

  x2/x3 = L/m    (4)
  y2/y3 = S/n    (5)

Since the horizontal view angle of the Kinect is 57 degrees and the vertical view angle is 43 degrees, together with the principle of coordinate transformation in the Kinect SDK, the real pixel coordinate of a in the picture is (320 + x3, y3). Combining these formulas shows that m1, the distance between the imaging plane and the camera, is the only unknown. This paper takes m1 as 0.7 m, and then transforms the color images into grayscale. The lips obtained from the above steps are shown in Fig. 3.

Fig. 3: The gray image of the lip area for all the speakers

2.2 Profile lips rebuilt from 3D data

The Kinect is supported by the Face Tracking SDK, which predefines 121 face points with 3D coordinates; 18 of them represent the lip, and each lip point is assigned an integer ID value. Take the rebuilding of the right profile lip as an example. To locate the lip region and identify the graphics border, the first step is to generate the grid map of the right lip. To avoid interference from the left lip, only the 11 3D coordinate points corresponding to the right lip are chosen. Fig. 4(a) is the right lip contour plotted from these 11 3D points, which only gives information about the two-dimensional plane of the lip. The lip contour is then interpolated to a grid map according to the correspondence between the z-axis and the x- and y-axes, as exhibited in Fig. 4(b). The second step is filling the grid map with color. Fig.
4(c) shows clearly that the color shading corresponds to the z-axis: the color deepens as the distance gets closer. The final step is projection and rotation: what has been obtained so far is only the right lip from the front view. Projection by changing the viewpoint is necessary to generate the profile lip, i.e., the right profile lip as seen from the speaker's right side. Because this view of the profile lip faces downward, it must be rotated 90 degrees to obtain the right profile lip in the normal orientation. Finally, the rebuilt profile lip is saved as a 60*60-pixel picture in BMP format. Fig. 4 provides a flowchart of these steps. Since the right lip and the left lip of each speaker contain different information, this paper rebuilds the right profile lips and the left profile lips following the same procedure as in the flowchart. Fig. 5 presents the right profile lip and the left profile lip rebuilt from the 3D data of the same frame.

Fig. 4: The flowchart of rebuilding the profile lip from 3D data

Fig. 5: The left profile lip (a) and the right profile lip (b) rebuilt from the 3D data of the same frame

2.3 Feature extraction

After obtaining the gray-level image of the ROI and the profile lips rebuilt in the previous processing, the next step is to transform this image information into feature vectors that capture the speech information. This paper applies the same feature-extraction method to the ROI images and to the profile lip images; the process is illustrated in Fig. 6.

Fig. 6: The block diagram of feature extraction

Motivated by previous work [13, 17], this paper chooses the DCT transform; variants include the one-dimensional DCT, the two-dimensional DCT, and the block-based DCT. This paper applies the two-dimensional DCT, as the variants work similarly for the lipreading task [19], and uses the Zig-Zag method to unroll the DCT matrix into a 1*1024 row vector. Before applying the DCT to the profile lip rebuilt from the 3D data, however, the image needs to be compressed to 32*32 pixels, because the DCT allows fast implementations when the dimensions are powers of 2 [13, 18]. To avoid the curse of dimensionality, this paper applies PCA to reduce the dimensionality, in view of its excellent ability for information compression. This combination is assumed to take the advantages of these

two transforms: the DCT is preferable for differentiating frequencies, while PCA is beneficial for selecting the most important components [21]. This combination outperforms the traditional Zig-Zag method [7]. This paper projects the features down to 52 dimensions. Normalization is necessary to improve the robustness of the features. This paper applies feature mean normalization (FMN) by simply subtracting the feature mean computed over the entire utterance of length T:

  x̂_i = x_i − (1/T) Σ_{j=1}^{T} x_j,  i = 1, 2, ..., T    (6)

where i is the time frame, T is the total number of frames in one word, and x_i is the visual feature vector of frame i. To capture the lip movement, take J as the window length and H as the hop, and concatenate the J frames of features within a window, similar to the windowing of audio signal processing, to obtain the dynamic lip information:

  C_t = [x_{t−⌊J/2⌋}^T, ..., x_t^T, ..., x_{t+⌊J/2⌋}^T]^T    (7)

where x_t is the feature vector of frame t. This paper takes J = 3, H = 1.

Fig. 7: The schematic of windowing for the visual feature

3 Experiments and Results

This paper performs speech recognition on isolated Chinese words. The baseline experiments are implemented with a single feature (i.e., the visual-only feature and each side of the profile lips). Visual-only lipreading is then compared with lipreading fused with the 3D data, and lipreading with a single profile lip is compared with lipreading using both profile lips. The HTK toolkit is used for both training and testing [25], implementing three-state phoneme HMMs with a mixture of two Gaussians per state. As windowing of the integrated feature increases the dimensionality, which easily causes the curse of dimensionality, PCA is applied to reduce the dimensionality of the integrated feature accordingly.
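The feature pipeline of Section 2.3 (2D DCT of the 32*32 lip image, Zig-Zag scan, the mean normalization of Eq. (6), and the J-frame windowing of Eq. (7)) can be sketched as below. This is a schematic re-implementation, not the authors' code, and it omits the PCA projection to 52 dimensions.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(img):
    """Two-dimensional DCT of a 32*32 grayscale lip image."""
    return dct(dct(img, axis=0, norm="ortho"), axis=1, norm="ortho")

def zigzag(mat):
    """Zig-zag scan a square matrix into a row vector (1*1024 for 32*32)."""
    n = mat.shape[0]
    order = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda ij: (ij[0] + ij[1],                  # diagonal index
                                   ij[0] if (ij[0] + ij[1]) % 2    # odd diag: downward
                                   else ij[1]))                    # even diag: upward
    return np.array([mat[i, j] for i, j in order])

def fmn(feats):
    """Feature mean normalization, Eq. (6): subtract the utterance mean."""
    return feats - feats.mean(axis=0, keepdims=True)

def window(feats, J=3, H=1):
    """Eq. (7): concatenate J neighboring frames (hop H) for lip dynamics."""
    half = J // 2
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")  # replicate edges
    return np.stack([padded[t:t + J].ravel()
                     for t in range(0, len(feats), H)])
```

For a T*52 feature matrix, window(fmn(feats)) yields a T*156 matrix with J = 3 and H = 1, which the experiments then reduce again by PCA.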
When integrating the right profile lip with the left profile lip (called RL in this paper), each has dimension 39. When integrating the visual feature with the ones rebuilt from the 3D data, the visual feature dimension is 52 and each 3D feature dimension is 26. The dimensions of all integrated features are reduced to 78 after PCA.

3.1 Database

To allow experiments on a Chinese audio-visual corpus with 3D data, a suitable database was collected in the recording studio of the computer science and technology department at Tianjin University,

which provides clean acoustics and controlled illumination. Each speaker sat 0.9 m from the camera, against a solid blue background. The corpus consists of audio and full-face frontal video with 3D data from 10 speakers, an equal number of men and women. 40 Chinese words were compiled to guarantee phoneme balance, and each word is pronounced 10 times by each speaker. The capture device is the Microsoft Kinect. The Kinect uses 4 microphones to capture the audio and a color camera to capture the color video images; the 3D data is collected by a laser projector and an IR camera. This enables the Kinect to capture the audio, visual, and 3D data at the same time. The audio is two-track, 16-bit, 44.1 kHz PCM; the color video is 640*480 pixels, 24-bit RGB at 30 fps. For each frame of the color image, the Kinect yields 3D data in the format shown in Fig. 8, providing 121 3D points that describe the face contour: the first and second columns are the timestamp of the 3D data, the third column is the ID number of each of the 121 points, and the next three columns are, respectively, the x, y and z coordinates of the point.

Fig. 8: The format of the 3D coordinate data captured by the Kinect

3.2 Experimental results

The experimental results are given in the following tables: Table 1 lists the baseline results, and Table 2 reports the integrated-feature experiments, including the left profile lip fused with the right profile lip, and the visual feature integrated respectively with the left profile lip, the right profile lip, and RL.
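A per-frame 3D file in the Fig. 8 layout described in Section 3.1 can be read with a sketch like the following; whitespace-separated columns and the function name are assumptions of this illustration, not part of the corpus specification.

```python
def load_kinect_frame(lines):
    """Parse one frame of Kinect 3D face data in the Fig. 8 layout:
    two timestamp columns, a point ID, then the x, y, z coordinates,
    one row for each of the 121 face points."""
    points = {}
    for line in lines:
        cols = line.split()
        point_id = int(cols[2])                            # third column: ID
        points[point_id] = tuple(float(c) for c in cols[3:6])  # x, y, z
    return points
```

The 18 lip entries can then be picked out of the returned dict by their SDK-defined ID values (the IDs themselves are not listed in the paper).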
Table 1: Word recognition accuracy based on visual-only and profile lip features rebuilt from 3D data

  Feature          Recognition accuracy
  Visual-only
  Left profile
  Right profile

Table 2: Word recognition accuracy with integrated features

  Integrated feature                           Before PCA    After PCA
  Left profile lip + right profile lip (RL)
  Visual + left profile lip
  Visual + right profile lip
  Visual + RL

Checking the database, it is easy to find that some speakers' heads turn slightly to the left for some words, owing to the fact that it is hard to keep a frontal view continuously. This leads to the result

that the rebuilt right profile lip outperforms the left profile lip, as shown in Table 1. This is consistent with the results in [3, 9, 15] that recognition accuracy degrades as the speaker's head pose deviates from the frontal pose. Table 2 presents the effect of integrating the 3D data with the visual data, as well as the benefit of feature transformation by PCA. It is obvious that the integrated features do improve recognition accuracy compared with the single features; furthermore, PCA significantly improves the lipreading accuracy. The feature fusing the left profile lip with the right profile lip outperforms those using a single profile lip, reflecting that the information contained in one profile lip alone is limited and cannot fully represent the side-lip information of the speaker. It can also be noted that the features integrating the visual data with the 3D data obtain better performance than visual-only, which demonstrates that the 3D data contributes substantially to the robustness of lipreading.

4 Conclusion

This paper explores a new lipreading system that adopts 3D data captured by the Kinect, integrating the profile lips rebuilt from the 3D data with the visual feature to improve on traditional lipreading. In addition, this paper employs 3D projection to locate the lip, and applies the same feature-extraction method to the visual data and the 3D data. The results reveal that the 3D data does improve the robustness of visual-only lipreading. The results also indicate that the right profile lips rebuilt from the 3D data outperform the left ones, consistent with the conclusion in [3, 9, 15] that performance degrades as the head pose deviates from the frontal view.
However, most work has neglected the contribution of 3D data to profile-lip lipreading, and there is no database to support such work, so the future work of this paper is to build a database containing 3D data as well as audio and visual data, in order to explore whether the 3D data provides sufficient information to improve the robustness of multi-pose lipreading.

Acknowledgement

The research was supported in part by the National Natural Science Foundation (General Program No. , National Key Basic Research Program No. 2013CB and Key Projects No. ).

References

[1] C. Bregler and Y. Konig. "Eigenlips" for robust speech recognition. In Proc. ICASSP-94, vol. 2, pp. II-669. IEEE, 1994.
[2] G. I. Chiou and J.-N. Hwang. Lipreading from color video. IEEE Transactions on Image Processing, 6 (8), 1997.

[3] V. Estellers and J.-P. Thiran. Multipose audio-visual speech recognition. In Proc. EUSIPCO, number EPFL-CONF- .
[4] G. Galatas, G. Potamianos, D. I. Kosmopoulos, C. McMurrough, and F. Makedon. Bilingual corpus for AVASR using multiple sensors and depth information. In Proc. AVSP.
[5] G. Galatas, G. Potamianos, and F. Makedon. Audio-visual speech recognition incorporating facial depth information captured by the Kinect. In Proc. 20th European Signal Processing Conference (EUSIPCO 2012). IEEE.
[6] G. Galatas, G. Potamianos, and F. Makedon. Audio-visual speech recognition using depth information from the Kinect in noisy video conditions. In Proc. 5th International Conference on PErvasive Technologies Related to Assistive Environments, p. 2. ACM.
[7] X. Hong, H. Yao, Y. Wan, and R. Chen. A PCA based visual DCT feature extraction method for lipreading. In Proc. Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 06). IEEE.
[8] K. Kumar, T. Chen, and R. M. Stern. Profile view lip reading. In Proc. ICASSP, vol. 4, pp. IV-429. IEEE.
[9] K. Kumatani and R. Stiefelhagen. State synchronous modeling on phone boundary for audio visual speech recognition and application to multi-view face images. In Proc. ICASSP, vol. 4, pp. IV-417. IEEE.
[10] G. Loy, E.-J. Holden, and R. Owens. 3D head tracker for an automatic lipreading system. In Proc. Australian Conf. on Robotics and Automation (ACRA 2000).
[11] J. Luettin, N. A. Thacker, and S. W. Beet. Speechreading using shape and intensity information. In Proc. ICSLP 96, vol. 1. IEEE.
[12] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey.
Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (2).
[13] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou. Audio-visual speech recognition. In Final Workshop 2000 Report, vol. 764.
[14] A. Ortega, F. Sukno, E. Lleida, A. F. Frangi, A. Miguel, L. Buera, and E. Zacur. AV@CAR: A Spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition. In Proc. LREC.
[15] A. Pass, J. Zhang, and D. Stewart. An investigation into features for multi-view lipreading. In Proc. IEEE International Conference on Image Processing (ICIP). IEEE.
[16] G. Potamianos, H. P. Graf, and E. Cosatto. An image transform approach for HMM based automatic lipreading. In Proc. ICIP 98. IEEE.
[17] G. Potamianos, C. Neti, G. Iyengar, A. W. Senior, and A. Verma. A cascade visual front end for speaker independent automatic speechreading. International Journal of Speech Technology, 4 (3-4).
[18] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing.
[19] K. Uda, N. Tagawa, A. Minagawa, and T. Moriya. Effectiveness evaluation of word characteristics obtained from 3D image information for lipreading. In Proc. 11th International Conference on Image Analysis and Processing. IEEE, 2001.

[20] S. Werda, W. Mahdi, and A. B. Hamadou. Lip localization and viseme classification for visual speech recognition. arXiv preprint.
[21] Q. Yang and X. Chen. An improved grid search algorithm and its application in PCA and SVM based face recognition. Journal of Computational Information Systems, 10 (3).
[22] A. Yargic and M. Dogan. A lip reading application on MS Kinect camera. In Proc. IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA 2013). IEEE.
[23] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui. Audio-visual speech recognition using lip movement extracted from side-face images. In Proc. AVSP 2003.
[24] T. Yoshinaga, S. Tamura, K. Iwano, and S. Furui. Audio-visual speech recognition using new lip features extracted from side-face images. In COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction.
[25] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, et al. The HTK Book (for HTK version 3.4). Cambridge University Engineering Department, 2006.


More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth Common Classification Tasks Recognition of individual objects/faces Analyze object-specific features (e.g., key points) Train with images from different viewing angles Recognition of object classes Analyze

More information

A GENERIC FACE REPRESENTATION APPROACH FOR LOCAL APPEARANCE BASED FACE VERIFICATION

A GENERIC FACE REPRESENTATION APPROACH FOR LOCAL APPEARANCE BASED FACE VERIFICATION A GENERIC FACE REPRESENTATION APPROACH FOR LOCAL APPEARANCE BASED FACE VERIFICATION Hazim Kemal Ekenel, Rainer Stiefelhagen Interactive Systems Labs, Universität Karlsruhe (TH) 76131 Karlsruhe, Germany

More information

Xing Fan, Carlos Busso and John H.L. Hansen

Xing Fan, Carlos Busso and John H.L. Hansen Xing Fan, Carlos Busso and John H.L. Hansen Center for Robust Speech Systems (CRSS) Erik Jonsson School of Engineering & Computer Science Department of Electrical Engineering University of Texas at Dallas

More information

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries

Improving Latent Fingerprint Matching Performance by Orientation Field Estimation using Localized Dictionaries Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

An Adaptive Threshold LBP Algorithm for Face Recognition

An Adaptive Threshold LBP Algorithm for Face Recognition An Adaptive Threshold LBP Algorithm for Face Recognition Xiaoping Jiang 1, Chuyu Guo 1,*, Hua Zhang 1, and Chenghua Li 1 1 College of Electronics and Information Engineering, Hubei Key Laboratory of Intelligent

More information

Combining Dynamic Texture and Structural Features for Speaker Identification

Combining Dynamic Texture and Structural Features for Speaker Identification Combining Dynamic Texture and Structural Features for Speaker Identification Guoying Zhao Machine Vision Group Infotech Oulu and Department of Electrical and Information Engineering P. O. Box 4500 FI-90014

More information

Visual Front-End Wars: Viola-Jones Face Detector vs Fourier Lucas-Kanade

Visual Front-End Wars: Viola-Jones Face Detector vs Fourier Lucas-Kanade ISCA Archive http://www.isca-speech.org/archive Auditory-Visual Speech Processing (AVSP) 2013 Annecy, France August 29 - September 1, 2013 Visual Front-End Wars: Viola-Jones Face Detector vs Fourier Lucas-Kanade

More information

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions Edith Cowan University Research Online ECU Publications Pre. JPEG compression of monochrome D-barcode images using DCT coefficient distributions Keng Teong Tan Hong Kong Baptist University Douglas Chai

More information

LOCAL APPEARANCE BASED FACE RECOGNITION USING DISCRETE COSINE TRANSFORM

LOCAL APPEARANCE BASED FACE RECOGNITION USING DISCRETE COSINE TRANSFORM LOCAL APPEARANCE BASED FACE RECOGNITION USING DISCRETE COSINE TRANSFORM Hazim Kemal Ekenel, Rainer Stiefelhagen Interactive Systems Labs, University of Karlsruhe Am Fasanengarten 5, 76131, Karlsruhe, Germany

More information

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications Anil K Goswami 1, Swati Sharma 2, Praveen Kumar 3 1 DRDO, New Delhi, India 2 PDM College of Engineering for

More information

Multifactor Fusion for Audio-Visual Speaker Recognition

Multifactor Fusion for Audio-Visual Speaker Recognition Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 2007 70 Multifactor Fusion for Audio-Visual Speaker Recognition GIRIJA CHETTY

More information

Human Motion Detection and Tracking for Video Surveillance

Human Motion Detection and Tracking for Video Surveillance Human Motion Detection and Tracking for Video Surveillance Prithviraj Banerjee and Somnath Sengupta Department of Electronics and Electrical Communication Engineering Indian Institute of Technology, Kharagpur,

More information

Component-based Face Recognition with 3D Morphable Models

Component-based Face Recognition with 3D Morphable Models Component-based Face Recognition with 3D Morphable Models Jennifer Huang 1, Bernd Heisele 1,2, and Volker Blanz 3 1 Center for Biological and Computational Learning, M.I.T., Cambridge, MA, USA 2 Honda

More information

Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction

Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction Face Recognition At-a-Distance Based on Sparse-Stereo Reconstruction Ham Rara, Shireen Elhabian, Asem Ali University of Louisville Louisville, KY {hmrara01,syelha01,amali003}@louisville.edu Mike Miller,

More information

Research on Emotion Recognition for Facial Expression Images Based on Hidden Markov Model

Research on Emotion Recognition for Facial Expression Images Based on Hidden Markov Model e-issn: 2349-9745 p-issn: 2393-8161 Scientific Journal Impact Factor (SJIF): 1.711 International Journal of Modern Trends in Engineering and Research www.ijmter.com Research on Emotion Recognition for

More information

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Yongying Gao and Hayder Radha Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48823 email:

More information

An ICA based Approach for Complex Color Scene Text Binarization

An ICA based Approach for Complex Color Scene Text Binarization An ICA based Approach for Complex Color Scene Text Binarization Siddharth Kherada IIIT-Hyderabad, India siddharth.kherada@research.iiit.ac.in Anoop M. Namboodiri IIIT-Hyderabad, India anoop@iiit.ac.in

More information

Face Quality Assessment System in Video Sequences

Face Quality Assessment System in Video Sequences Face Quality Assessment System in Video Sequences Kamal Nasrollahi, Thomas B. Moeslund Laboratory of Computer Vision and Media Technology, Aalborg University Niels Jernes Vej 14, 9220 Aalborg Øst, Denmark

More information

Human pose estimation using Active Shape Models

Human pose estimation using Active Shape Models Human pose estimation using Active Shape Models Changhyuk Jang and Keechul Jung Abstract Human pose estimation can be executed using Active Shape Models. The existing techniques for applying to human-body

More information

On Modeling Variations for Face Authentication

On Modeling Variations for Face Authentication On Modeling Variations for Face Authentication Xiaoming Liu Tsuhan Chen B.V.K. Vijaya Kumar Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 xiaoming@andrew.cmu.edu

More information

Hand gesture recognition with Leap Motion and Kinect devices

Hand gesture recognition with Leap Motion and Kinect devices Hand gesture recognition with Leap Motion and devices Giulio Marin, Fabio Dominio and Pietro Zanuttigh Department of Information Engineering University of Padova, Italy Abstract The recent introduction

More information

Iris Recognition for Eyelash Detection Using Gabor Filter

Iris Recognition for Eyelash Detection Using Gabor Filter Iris Recognition for Eyelash Detection Using Gabor Filter Rupesh Mude 1, Meenakshi R Patel 2 Computer Science and Engineering Rungta College of Engineering and Technology, Bhilai Abstract :- Iris recognition

More information

Head Frontal-View Identification Using Extended LLE

Head Frontal-View Identification Using Extended LLE Head Frontal-View Identification Using Extended LLE Chao Wang Center for Spoken Language Understanding, Oregon Health and Science University Abstract Automatic head frontal-view identification is challenging

More information

3D LIP TRACKING AND CO-INERTIA ANALYSIS FOR IMPROVED ROBUSTNESS OF AUDIO-VIDEO AUTOMATIC SPEECH RECOGNITION

3D LIP TRACKING AND CO-INERTIA ANALYSIS FOR IMPROVED ROBUSTNESS OF AUDIO-VIDEO AUTOMATIC SPEECH RECOGNITION 3D LIP TRACKING AND CO-INERTIA ANALYSIS FOR IMPROVED ROBUSTNESS OF AUDIO-VIDEO AUTOMATIC SPEECH RECOGNITION Roland Goecke 1,2 1 Autonomous System and Sensing Technologies, National ICT Australia, Canberra,

More information

Combining Audio and Video for Detection of Spontaneous Emotions

Combining Audio and Video for Detection of Spontaneous Emotions Combining Audio and Video for Detection of Spontaneous Emotions Rok Gajšek, Vitomir Štruc, Simon Dobrišek, Janez Žibert, France Mihelič, and Nikola Pavešić Faculty of Electrical Engineering, University

More information

arxiv: v1 [cs.cv] 3 Oct 2017

arxiv: v1 [cs.cv] 3 Oct 2017 Which phoneme-to-viseme maps best improve visual-only computer lip-reading? Helen L. Bear, Richard W. Harvey, Barry-John Theobald and Yuxuan Lan School of Computing Sciences, University of East Anglia,

More information

Mouth Region Localization Method Based on Gaussian Mixture Model

Mouth Region Localization Method Based on Gaussian Mixture Model Mouth Region Localization Method Based on Gaussian Mixture Model Kenichi Kumatani and Rainer Stiefelhagen Universitaet Karlsruhe (TH), Interactive Systems Labs, Am Fasanengarten 5, 76131 Karlsruhe, Germany

More information

Robust Steganography Using Texture Synthesis

Robust Steganography Using Texture Synthesis Robust Steganography Using Texture Synthesis Zhenxing Qian 1, Hang Zhou 2, Weiming Zhang 2, Xinpeng Zhang 1 1. School of Communication and Information Engineering, Shanghai University, Shanghai, 200444,

More information

Generic Face Alignment Using an Improved Active Shape Model

Generic Face Alignment Using an Improved Active Shape Model Generic Face Alignment Using an Improved Active Shape Model Liting Wang, Xiaoqing Ding, Chi Fang Electronic Engineering Department, Tsinghua University, Beijing, China {wanglt, dxq, fangchi} @ocrserv.ee.tsinghua.edu.cn

More information

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur

IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS. Kirthiga, M.E-Communication system, PREC, Thanjavur IMPROVED FACE RECOGNITION USING ICP TECHNIQUES INCAMERA SURVEILLANCE SYSTEMS Kirthiga, M.E-Communication system, PREC, Thanjavur R.Kannan,Assistant professor,prec Abstract: Face Recognition is important

More information

A reversible data hiding based on adaptive prediction technique and histogram shifting

A reversible data hiding based on adaptive prediction technique and histogram shifting A reversible data hiding based on adaptive prediction technique and histogram shifting Rui Liu, Rongrong Ni, Yao Zhao Institute of Information Science Beijing Jiaotong University E-mail: rrni@bjtu.edu.cn

More information

Face Alignment Under Various Poses and Expressions

Face Alignment Under Various Poses and Expressions Face Alignment Under Various Poses and Expressions Shengjun Xin and Haizhou Ai Computer Science and Technology Department, Tsinghua University, Beijing 100084, China ahz@mail.tsinghua.edu.cn Abstract.

More information

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach

Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach Human Detection and Tracking for Video Surveillance: A Cognitive Science Approach Vandit Gajjar gajjar.vandit.381@ldce.ac.in Ayesha Gurnani gurnani.ayesha.52@ldce.ac.in Yash Khandhediya khandhediya.yash.364@ldce.ac.in

More information

A Study on Similarity Computations in Template Matching Technique for Identity Verification

A Study on Similarity Computations in Template Matching Technique for Identity Verification A Study on Similarity Computations in Template Matching Technique for Identity Verification Lam, S. K., Yeong, C. Y., Yew, C. T., Chai, W. S., Suandi, S. A. Intelligent Biometric Group, School of Electrical

More information

Towards Lipreading Sentences with Active Appearance Models

Towards Lipreading Sentences with Active Appearance Models Towards Lipreading Sentences with Active Appearance Models George Sterpu, Naomi Harte Sigmedia, ADAPT Centre, School of Engineering, Trinity College Dublin, Ireland sterpug@tcd.ie, nharte@tcd.ie Abstract

More information

Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

Audio-visual speech recognition using deep bottleneck features and high-performance lipreading Proceedings of APSIPA Annual Summit and Conference 215 16-19 December 215 Audio-visual speech recognition using deep bottleneck features and high-performance lipreading Satoshi TAMURA, Hiroshi NINOMIYA,

More information

Performance analysis of robust road sign identification

Performance analysis of robust road sign identification IOP Conference Series: Materials Science and Engineering OPEN ACCESS Performance analysis of robust road sign identification To cite this article: Nursabillilah M Ali et al 2013 IOP Conf. Ser.: Mater.

More information

Automatic Shadow Removal by Illuminance in HSV Color Space

Automatic Shadow Removal by Illuminance in HSV Color Space Computer Science and Information Technology 3(3): 70-75, 2015 DOI: 10.13189/csit.2015.030303 http://www.hrpub.org Automatic Shadow Removal by Illuminance in HSV Color Space Wenbo Huang 1, KyoungYeon Kim

More information

SOME stereo image-matching methods require a user-selected

SOME stereo image-matching methods require a user-selected IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 3, NO. 2, APRIL 2006 207 Seed Point Selection Method for Triangle Constrained Image Matching Propagation Qing Zhu, Bo Wu, and Zhi-Xiang Xu Abstract In order

More information

Multidirectional 2DPCA Based Face Recognition System

Multidirectional 2DPCA Based Face Recognition System Multidirectional 2DPCA Based Face Recognition System Shilpi Soni 1, Raj Kumar Sahu 2 1 M.E. Scholar, Department of E&Tc Engg, CSIT, Durg 2 Associate Professor, Department of E&Tc Engg, CSIT, Durg Email:

More information

Audio Visual Isolated Oriya Digit Recognition Using HMM and DWT

Audio Visual Isolated Oriya Digit Recognition Using HMM and DWT Conference on Advances in Communication and Control Systems 2013 (CAC2S 2013) Audio Visual Isolated Oriya Digit Recognition Using HMM and DWT Astik Biswas Department of Electrical Engineering, NIT Rourkela,Orrisa

More information

1. INTRODUCTION ABSTRACT

1. INTRODUCTION ABSTRACT Weighted Fusion of Depth and Inertial Data to Improve View Invariance for Human Action Recognition Chen Chen a, Huiyan Hao a,b, Roozbeh Jafari c, Nasser Kehtarnavaz a a Center for Research in Computer

More information

Image Processing Pipeline for Facial Expression Recognition under Variable Lighting

Image Processing Pipeline for Facial Expression Recognition under Variable Lighting Image Processing Pipeline for Facial Expression Recognition under Variable Lighting Ralph Ma, Amr Mohamed ralphma@stanford.edu, amr1@stanford.edu Abstract Much research has been done in the field of automated

More information

Algorithm research of 3D point cloud registration based on iterative closest point 1

Algorithm research of 3D point cloud registration based on iterative closest point 1 Acta Technica 62, No. 3B/2017, 189 196 c 2017 Institute of Thermomechanics CAS, v.v.i. Algorithm research of 3D point cloud registration based on iterative closest point 1 Qian Gao 2, Yujian Wang 2,3,

More information

The Novel Approach for 3D Face Recognition Using Simple Preprocessing Method

The Novel Approach for 3D Face Recognition Using Simple Preprocessing Method The Novel Approach for 3D Face Recognition Using Simple Preprocessing Method Parvin Aminnejad 1, Ahmad Ayatollahi 2, Siamak Aminnejad 3, Reihaneh Asghari Abstract In this work, we presented a novel approach

More information

Image Inpainting by Hyperbolic Selection of Pixels for Two Dimensional Bicubic Interpolations

Image Inpainting by Hyperbolic Selection of Pixels for Two Dimensional Bicubic Interpolations Image Inpainting by Hyperbolic Selection of Pixels for Two Dimensional Bicubic Interpolations Mehran Motmaen motmaen73@gmail.com Majid Mohrekesh mmohrekesh@yahoo.com Mojtaba Akbari mojtaba.akbari@ec.iut.ac.ir

More information

Deduction and Logic Implementation of the Fractal Scan Algorithm

Deduction and Logic Implementation of the Fractal Scan Algorithm Deduction and Logic Implementation of the Fractal Scan Algorithm Zhangjin Chen, Feng Ran, Zheming Jin Microelectronic R&D center, Shanghai University Shanghai, China and Meihua Xu School of Mechatronical

More information

Scene Text Detection Using Machine Learning Classifiers

Scene Text Detection Using Machine Learning Classifiers 601 Scene Text Detection Using Machine Learning Classifiers Nafla C.N. 1, Sneha K. 2, Divya K.P. 3 1 (Department of CSE, RCET, Akkikkvu, Thrissur) 2 (Department of CSE, RCET, Akkikkvu, Thrissur) 3 (Department

More information

Recognition of Gurmukhi Text from Sign Board Images Captured from Mobile Camera

Recognition of Gurmukhi Text from Sign Board Images Captured from Mobile Camera International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 17 (2014), pp. 1839-1845 International Research Publications House http://www. irphouse.com Recognition of

More information

SCALE BASED FEATURES FOR AUDIOVISUAL SPEECH RECOGNITION

SCALE BASED FEATURES FOR AUDIOVISUAL SPEECH RECOGNITION IEE Colloquium on Integrated Audio-Visual Processing for Recognition, Synthesis and Communication, pp 8/1 8/7, 1996 1 SCALE BASED FEATURES FOR AUDIOVISUAL SPEECH RECOGNITION I A Matthews, J A Bangham and

More information

Audio-Visual Speech Processing System for Polish with Dynamic Bayesian Network Models

Audio-Visual Speech Processing System for Polish with Dynamic Bayesian Network Models Proceedings of the orld Congress on Electrical Engineering and Computer Systems and Science (EECSS 2015) Barcelona, Spain, July 13-14, 2015 Paper No. 343 Audio-Visual Speech Processing System for Polish

More information

AUDIOVISUAL SPEECH RECOGNITION USING MULTISCALE NONLINEAR IMAGE DECOMPOSITION

AUDIOVISUAL SPEECH RECOGNITION USING MULTISCALE NONLINEAR IMAGE DECOMPOSITION AUDIOVISUAL SPEECH RECOGNITION USING MULTISCALE NONLINEAR IMAGE DECOMPOSITION Iain Matthews, J. Andrew Bangham and Stephen Cox School of Information Systems, University of East Anglia, Norwich, NR4 7TJ,

More information

An Approach for Real Time Moving Object Extraction based on Edge Region Determination

An Approach for Real Time Moving Object Extraction based on Edge Region Determination An Approach for Real Time Moving Object Extraction based on Edge Region Determination Sabrina Hoque Tuli Department of Computer Science and Engineering, Chittagong University of Engineering and Technology,

More information

Tri-modal Human Body Segmentation

Tri-modal Human Body Segmentation Tri-modal Human Body Segmentation Master of Science Thesis Cristina Palmero Cantariño Advisor: Sergio Escalera Guerrero February 6, 2014 Outline 1 Introduction 2 Tri-modal dataset 3 Proposed baseline 4

More information

A NEW ROBUST IMAGE WATERMARKING SCHEME BASED ON DWT WITH SVD

A NEW ROBUST IMAGE WATERMARKING SCHEME BASED ON DWT WITH SVD A NEW ROBUST IMAGE WATERMARKING SCHEME BASED ON WITH S.Shanmugaprabha PG Scholar, Dept of Computer Science & Engineering VMKV Engineering College, Salem India N.Malmurugan Director Sri Ranganathar Institute

More information

Measurement of pinna flare angle and its effect on individualized head-related transfer functions

Measurement of pinna flare angle and its effect on individualized head-related transfer functions PROCEEDINGS of the 22 nd International Congress on Acoustics Free-Field Virtual Psychoacoustics and Hearing Impairment: Paper ICA2016-53 Measurement of pinna flare angle and its effect on individualized

More information

EUSIPCO A SPACE-VARIANT CUBIC-SPLINE INTERPOLATION

EUSIPCO A SPACE-VARIANT CUBIC-SPLINE INTERPOLATION EUSIPCO 213 1569744341 A SPACE-VARIAN CUBIC-SPLINE INERPOLAION Jianxing Jiang, Shaohua Hong, Lin Wang Department of Communication Engineering, Xiamen University, Xiamen, Fujian, 3615, P.R. China. ABSRAC

More information

3-D MRI Brain Scan Classification Using A Point Series Based Representation

3-D MRI Brain Scan Classification Using A Point Series Based Representation 3-D MRI Brain Scan Classification Using A Point Series Based Representation Akadej Udomchaiporn 1, Frans Coenen 1, Marta García-Fiñana 2, and Vanessa Sluming 3 1 Department of Computer Science, University

More information

FACE ANALYSIS AND SYNTHESIS FOR INTERACTIVE ENTERTAINMENT

FACE ANALYSIS AND SYNTHESIS FOR INTERACTIVE ENTERTAINMENT FACE ANALYSIS AND SYNTHESIS FOR INTERACTIVE ENTERTAINMENT Shoichiro IWASAWA*I, Tatsuo YOTSUKURA*2, Shigeo MORISHIMA*2 */ Telecommunication Advancement Organization *2Facu!ty of Engineering, Seikei University

More information

Video Inter-frame Forgery Identification Based on Optical Flow Consistency

Video Inter-frame Forgery Identification Based on Optical Flow Consistency Sensors & Transducers 24 by IFSA Publishing, S. L. http://www.sensorsportal.com Video Inter-frame Forgery Identification Based on Optical Flow Consistency Qi Wang, Zhaohong Li, Zhenzhen Zhang, Qinglong

More information

Adaptive Skin Color Classifier for Face Outline Models

Adaptive Skin Color Classifier for Face Outline Models Adaptive Skin Color Classifier for Face Outline Models M. Wimmer, B. Radig, M. Beetz Informatik IX, Technische Universität München, Germany Boltzmannstr. 3, 87548 Garching, Germany [wimmerm, radig, beetz]@informatik.tu-muenchen.de

More information

A QR code identification technology in package auto-sorting system

A QR code identification technology in package auto-sorting system Modern Physics Letters B Vol. 31, Nos. 19 21 (2017) 1740035 (5 pages) c World Scientific Publishing Company DOI: 10.1142/S0217984917400358 A QR code identification technology in package auto-sorting system

More information

REAL-TIME FACE SWAPPING IN VIDEO SEQUENCES: MAGIC MIRROR

REAL-TIME FACE SWAPPING IN VIDEO SEQUENCES: MAGIC MIRROR REAL-TIME FACE SWAPPING IN VIDEO SEQUENCES: MAGIC MIRROR Nuri Murat Arar1, Fatma Gu ney1, Nasuh Kaan Bekmezci1, Hua Gao2 and Hazım Kemal Ekenel1,2,3 1 Department of Computer Engineering, Bogazici University,

More information

Intensity-Depth Face Alignment Using Cascade Shape Regression

Intensity-Depth Face Alignment Using Cascade Shape Regression Intensity-Depth Face Alignment Using Cascade Shape Regression Yang Cao 1 and Bao-Liang Lu 1,2 1 Center for Brain-like Computing and Machine Intelligence Department of Computer Science and Engineering Shanghai

More information

Articulatory Features for Robust Visual Speech Recognition

Articulatory Features for Robust Visual Speech Recognition Articulatory Features for Robust Visual Speech Recognition Kate Saenko, Trevor Darrell, and James Glass MIT Computer Science and Artificial Intelligence Laboratory 32 Vassar Street Cambridge, Massachusetts,

More information

Robust biometric image watermarking for fingerprint and face template protection

Robust biometric image watermarking for fingerprint and face template protection Robust biometric image watermarking for fingerprint and face template protection Mayank Vatsa 1, Richa Singh 1, Afzel Noore 1a),MaxM.Houck 2, and Keith Morris 2 1 West Virginia University, Morgantown,

More information

IRIS SEGMENTATION OF NON-IDEAL IMAGES

IRIS SEGMENTATION OF NON-IDEAL IMAGES IRIS SEGMENTATION OF NON-IDEAL IMAGES William S. Weld St. Lawrence University Computer Science Department Canton, NY 13617 Xiaojun Qi, Ph.D Utah State University Computer Science Department Logan, UT 84322

More information

A Fast Personal Palm print Authentication based on 3D-Multi Wavelet Transformation

A Fast Personal Palm print Authentication based on 3D-Multi Wavelet Transformation A Fast Personal Palm print Authentication based on 3D-Multi Wavelet Transformation * A. H. M. Al-Helali, * W. A. Mahmmoud, and * H. A. Ali * Al- Isra Private University Email: adnan_hadi@yahoo.com Abstract:

More information

Moving Object Detection and Tracking for Video Survelliance

Moving Object Detection and Tracking for Video Survelliance Moving Object Detection and Tracking for Video Survelliance Ms Jyoti J. Jadhav 1 E&TC Department, Dr.D.Y.Patil College of Engineering, Pune University, Ambi-Pune E-mail- Jyotijadhav48@gmail.com, Contact

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study

Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study H.J. Nock, G. Iyengar, and C. Neti IBM TJ Watson Research Center, PO Box 218, Yorktown Heights, NY 10598. USA. Abstract. This paper

More information

Conversion of 2D Image into 3D and Face Recognition Based Attendance System

Conversion of 2D Image into 3D and Face Recognition Based Attendance System Conversion of 2D Image into 3D and Face Recognition Based Attendance System Warsha Kandlikar, Toradmal Savita Laxman, Deshmukh Sonali Jagannath Scientist C, Electronics Design and Technology, NIELIT Aurangabad,

More information