Phoneme Analysis of Image Feature in Utterance Recognition Using AAM in Lip Area

MIRU 7 AAM 657 81 1 1 657 81 1 1 E-mail: {komai,miyamoto}@me.cs.scitec.kobe-u.ac.jp, {takigu,ariki}@kobe-u.ac.jp Active Appearance Model Active Appearance Model combined DCT Active Appearance Model combined Phoneme Analysis of Image Feature in Utterance Recognition Using AAM in Lip Area Yuto KOMAI, Chikoto MIYAMOTO, Tetsuya TAKIGUCHI, and Yasuo ARIKI Graduate School of System Informatics, Kobe University Rokkodaicho 1 1, Nada-ku, Kobe, Hyogo, 657 81 Japan Organization of Advanced Science and Technology, Kobe University Rokkodaicho 1 1, Nada-ku, Kobe, Hyogo, 657 81 Japan E-mail: {komai,miyamoto}@me.cs.scitec.kobe-u.ac.jp, {takigu,ariki}@kobe-u.ac.jp Abstract Lip-reading, that recognizes the content of the utterance from the movement of the lip enables the recognition under a noise environment. Multimodal speech recognition using lip dynamic scene information together with acoustic information is attracting attention and the research is advanced in recent years. Since visual information plays a great role in multimodal speech recognition, what to select as the image feature becomes a significant point. This paper proposes, for spoken word recognition, to utilize c combined parameter as the image feature extracted by Active Appearance Model applied to a face image including the lip area. Combined parameter contains information of the coordinate value and the intensity value as the image feature. The recognition rate was improved by the proposed method compared to the conventional method such as DCT and the principal component score. Moreover, we integrated with high accuracy the phoneme score from acoustic information and the viseme score from visual information. Key words Lip area, Active Appearance Model, Combined parameter, Integration of audio and visual, Viseme 1. IS3-31:1771

[1] [2] [3] [4] [5] RGB [6] [7] [8] [9] SNAKE [] Active Shape Model [11] Active Appearance Model [12] [13] [14] [15] [2] [3] [15] [16] DCT [14] [17] Active Appearance Models AAM, AAM combined c shape texture c HMM Haar-like AdaBoost [18] [19] [] 2. 3. AAM 4. 5. 216 6. 7. 2. 1 Haar-like AdaBoost AAM AAM AdaBoost AAM 音声特徴抽出 MFCC HMM q q q 1 2 3 q4 1 入力動画結果統合認識画像顔検出顔領域 AAM 適用唇領域 AAM 適用特徴抽出 c パラメータ HMM q q q 1 2 3 q4 AAM AAM AAM AAM 2 AAM AAM AAM c texture [21] AAM AAM 2 AAM AAM AAM c HMM HMM IS3-31:1772

HMM 3. 3. 1 Active Appearance Models AAM Cootes shape texture shape s s s s texture texture g s g 1 2 s = x 1, y 1,..., x n, y n T 1 AAM のモデル構築に用いた 63 点の特徴点を与えた画像 g = g 1,..., g m T 2 x i, y i i < = n g j j < = m s s ḡ s g s ḡ P s P g 3 4 s = s + P s b s 3 顔領域の shape 顔領域 AAM 唇領域の shape 唇領域 AAM g = ḡ + P g b g 4 2 2 AAM b s b g shape texture shape texture shape texture b s b g 5 6 W s b s W s P T s s s b = = = Qc 5 b g P T g g ḡ Q = Q s Q g 6 W s shape texture Q c shape texture combined c s g 7 8 sc = s + P s W s 1 Q s c 7 gc = ḡ + P g Q g c 8 c shape texture 3. 2 2. 2 AAM AAM shape texture 2 8 11 12 8 63 AAM shape texture 3. 3 Combined AAM 3 c c c c I i I i Wp AAM gc e 9 e c p IS3-31:1773

閉じた口平均テクスチャう 4. HMM HMM HMM HMM MFCC12, 39 2. [4] 3 あい c L A+V = αl A + 1 αl V, < = α < = 1 L A+V L A L V α 5. 5. 1 ec, p = gc I i Wp 2 9 p W c shape texture 95% 78 c 3 1 3 c c, 3. 4 c 2 DCT [22] AAM 32 32 2 DCT 32 32 16 16 16 16 % DCT 4 4 16 48 ATR 216 ATR 1 Logicool Qcam Orbit MP 9 7 fps SONY ECM-PC / / cm 1 SN db - 216 leave-one-out 9 1 closed 216 1 open HMM 3 4 HMM closed HMM monophone closed open open 5. 2 4 4 closed1 HMM closed2 closed HMM open open HMM DCT 2 DCT PCA C parameterface AAM c C parameterlip AAM c IS3-31:1774

率識 7 認 DCT PCA C parameterface C parameterlip Speech 98.8 99.198.9 99 99.9 97.9 96 81.3 82.7 8.5 8.2 72 68 67 61 5 8 % 率 7 識認 1 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18 19 DCT PCA C parameter 混合数 closed2 closed1 closed2 open 4 closed1 c closed2 c DCT 2.2 PCA 1.4 c 2.5 open DCT 11 PCA 4 c 5 closed2 open c closed2 open closed1 closed1 closed2 HMM HMM HMM HMM open closed2 5 6 HMM 5 6 closed2 closed2 closed2 open closed2 closed2 open 8 % 率 7 識認 6 1 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18 19 DCT PCA C parameter 混合数 open 5. 3 SN db - c HMM HMM.1 7 8 9 7 8 9 HMM closed1 closed HMM closed2 open HMM open 1 clean SN 8 % 7.1.2.3.4.5.6.7.8.9 1 7 closed1 db - IS3-31:1775

7 7.1.2.3.4.5.6.7.8.9 1 8 closed2 db - db - a a: b by ch d dy e e: f g gy h hy i i: j k ky m my n N ny o o: p py q r ry s sh t ts u u: w y z 欠落 a 84 3 1 a: 3 b 24 3 1 1 1 1 2 by 1 ch 13 1 1 1 d 11 1 1 2 5 dy e 33 e: 1 1 f 2 11 1 6 1 5 g 2 8 1 1 1 1 4 gy h 3 1 1 hy i 2 63 1 3 5 i: 1 1 j 1 11 k 2 1 1 26 1 2 ky 1 1 1 m 13 1 my 1 n 5 1 N 1 39 1 ny o 52 1 o: 1 p 6 py q 3 3 r 1 19 6 ry 1 s 1 sh 1 17 t 1 1 ts 5 u 1 1 1 52 34 u: 4 2 w 1 3 y 1 2 3 z 5 挿入 1 1 1 8 confusion matrixopen.1.2.3.4.5.6.7.8.9 1 9 open SN db - SN - 6. 6. 1 1 1 216 HMM open Julius c 11 open confusion matrix 11 c open confusion matrix Correct Accuracy a a: b by ch d dy e e: f g gy h hy i i: j k ky m my n N ny o o: p py q r ry s sh t ts u u: w y z 欠落 a 63 3 11 2 9 a: 3 b 9 1 11 2 11 by 1 ch 5 1 1 2 3 4 d 1 1 1 1 3 3 2 3 2 3 dy e 4 21 3 2 3 e: 2 f 9 1 1 2 3 g 1 1 1 1 1 1 4 1 2 5 gy h 1 1 2 1 hy i 1 1 46 4 1 2 1 18 i: 2 j 6 1 3 2 k 1 2 4 6 3 1 16 ky 1 1 1 m 3 1 9 1 my 1 n 1 1 1 3 N 2 3 7 3 26 ny o 1 47 5 o: 1 p 1 2 3 py q 1 5 r 2 1 2 7 1 13 ry 1 s 3 1 3 1 3 sh 1 1 2 1 3 1 2 4 3 t 3 1 1 1 2 1 1 2 ts 1 1 2 1 u 4 1 63 1 u: 6 w 1 3 y 2 1 3 z 2 1 1 1 挿入 1 1 3 6 6 1 1 3 9 1 2 11 c confusion matrixopen 1 open open closed2 Accuracy Correct Accuracy Correct Accuracy Correct 82.91 82.91 67.81 68.38 65.46 66.21 72.4 75.38 11.85 21.58 37.46 45.87 77.65 79.26.74 45.74 52.86 57.5 1 1 8 open 7 1 2 4 IS3-31:1776

6. 2 11 c i u ii uu ii i uu u ii uu r r a ra N N N N N k g n r b m p 6. 3 6. 2 ka ga viseme closed2 [17] 5. HMM 12 eikyou eigyou 2 eisyou 2 12 13 7.1.2.3.4.5.6.7.8.9 1 closed2 7.1.2.3.4.5.6.7.8.9 1 db - open 12 closed2 13 open closed2 open 1 8 9 8 9 14 confusion matrix 2 14 a i u e o p r sy w t s y vf N 欠落 a 73 8 i 57 21 u 68 6 27 e 3 3 26 4 o 5 p 54 1 r 1 11 2 3 9 sy 1 44 2 2 2 w 1 26 3 t 5 2 2 15 1 2 5 7 s 5 7 8 1 y 1 2 2 vf 2 8 7 3 18 17 N 2 29 挿入 18 1 3 1 db - c confusion matrixopen 2 1 IS3-31:1777

2 open closed2 Accuracy Correct Accuracy Correct 75.9 75.9 78.21 78.58 47.69 57.85 63.28 68.44 62.54 67.35 71.59 74.8 closed2 14 N 6. 1 t t t d n t t vf N 7. AAM combined confusion matrix 1 AAM monophone HMM triphone HMM [1] G Potamianos, H.P. Graf, Discriminative Training of HMM Stream Exponents for Audio-Visual Speech Recognition, Proc. ICASSP98, Seattle, U.S.A., vol.6, pp.3733-3736, 1998. [2] pp.3-4, Sep, 2. [3],,,,,, pp.193-194, Apr. 3. [4] WIT8-7 pp.37-42, 8. [5] HMM pp.111-112,. [6] Vol.J85-D-, No.12, pp.1813-1822, 2. [7] Vol.J89-D, No.9, pp.25-32, 6. [8] Rainer Stiefelhagen, Uwe Meiger, Jie Yang, Real- Time Lip-Tracking For LipReading, Eurospeech 97, 1997. [9] Vol.J84-D-, No.3, pp.459-47, 1. [] PRMU3-276, pp.121-126, 4. [11] Juergen Luettin, Neil A. Thacker, Steve W.Beet, Visual Speech Recognition using Active Shape Models And Hidden Markov Models ICASSP-96, Vol.2, pp.817-8, 1996. [12] Cootes, T.F., Active Appearance Model, Proc. European Conference on Computer Vision, Vol2, pp.484-498, 1998. [13] Cootes, T.F., K Walker, C.J.Taylor, View-based Active Appearance Models Image and Vision Computing, pp.657-664, 2. [14] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, Audio-visual speech recognition, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD, Final Workshop Report, Oct.. [15] MIRU8, IS-17, pp.434-439, 8. [16] HMM Vol.PRMU2-124, pp.25-, 2. [17] JSAI4, 1E2-2, pp.1-4, 4. [18] P.Viola, M. Jones, Rapid Object Detection Using Boosted Cascade of Simple Features In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp.1-9, 1. [19] Joint Haar-like MIRU5 pp.4-111 5. [] SP8-7, pp.1-6, 8. [21] AAM MIRU9 pp.769-776 9. [22] Vol.1, No.385, pp.25-32, 1. IS3-31:1778