Phoneme Analysis of Image Feature in Utterance Recognition Using AAM in Lip Area

Similar documents
Facial Age Estimation Based on KNN-SVR Regression and AAM Parameters

2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto , Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology

Joint Processing of Audio and Visual Information for Speech Recognition

Mobile Robot Control Interface Using Augmented Free-viewpoint Image Generation

Audio-Visual Speech Processing System for Polish with Dynamic Bayesian Network Models

A New Manifold Representation for Visual Speech Recognition

3D LIP TRACKING AND CO-INERTIA ANALYSIS FOR IMPROVED ROBUSTNESS OF AUDIO-VIDEO AUTOMATIC SPEECH RECOGNITION

Confusability of Phonemes Grouped According to their Viseme Classes in Noisy Environments

Lipreading using Profile Lips Rebuilt by 3D Data from the Kinect

SPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION

2. 集団の注目位置推定 提案手法では 複数の人物が同一の対象を注視している状況 置 を推定する手法を検討する この状況下では 図 1 のよう. 顔画像からそれぞれの注目位置を推定する ただし f は 1 枚 この仮説に基づいて 複数の人物を同時に撮影した低解像度顔

Two-Layered Audio-Visual Speech Recognition for Robots in Noisy Environments

Research Article Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

Audio Visual Isolated Oriya Digit Recognition Using HMM and DWT

A Comparison of Visual Features for Audio-Visual Automatic Speech Recognition

arxiv: v1 [cs.cv] 3 Oct 2017

END-TO-END VISUAL SPEECH RECOGNITION WITH LSTMS

Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

LOW-DIMENSIONAL MOTION FEATURES FOR AUDIO-VISUAL SPEECH RECOGNITION

Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR

An Introduction to Pattern Recognition

Lip Detection and Lip Geometric Feature Extraction using Constrained Local Model for Spoken Language Identification using Visual Speech Recognition

Mouse Pointer Tracking with Eyes

Object Recognition by Integrated Information Using Web Images

フラクタル 1 ( ジュリア集合 ) 解説 : ジュリア集合 ( 自己平方フラクタル ) 入力パラメータの例 ( 小さな数値の変化で模様が大きく変化します. Ar や Ai の数値を少しずつ変化させて描画する. ) プログラムコード. 2010, AGU, M.

第 6 回画像センシングシンポジウム, 横浜,00 年 6 月 and pose variation, thus they will fail under these imaging conditions. n [7] a method for facial landmarks detection

Mouth Center Detection under Active Near Infrared Illumination

Mouth Region Localization Method Based on Gaussian Mixture Model

Temporal Multimodal Learning in Audiovisual Speech Recognition

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition

Methods to Detect Malicious MS Document File using File Structure Inspection

Probabilistic Facial Feature Extraction Using Joint Distribution of Location and Texture Information

Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features

Face. Feature Extraction. Visual. Feature Extraction. Acoustic Feature Extraction

PAPER Graph Cuts Segmentation by Using Local Texture Features of Multiresolution Analysis

Articulatory Features for Robust Visual Speech Recognition

MySQL Cluster 7.3 リリース記念!! 5 分で作る MySQL Cluster 環境

Generic object recognition using graph embedding into a vector space

ISCA Archive

Combining Audio and Video for Detection of Spontaneous Emotions

SCALE BASED FEATURES FOR AUDIOVISUAL SPEECH RECOGNITION

Bengt J. Borgström, Student Member, IEEE, and Abeer Alwan, Senior Member, IEEE

IP Network Technology

AUDIO-VISUAL SPEECH-PROCESSING SYSTEM FOR POLISH APPLICABLE TO HUMAN-COMPUTER INTERACTION

Combining Dynamic Texture and Structural Features for Speaker Identification

Image Denoising AGAIN!?

Automatic speech reading by oral motion tracking for user authentication system

Sparse Coding Based Lip Texture Representation For Visual Speaker Identification

MAPPING FROM SPEECH TO IMAGES USING CONTINUOUS STATE SPACE MODELS. Tue Lehn-Schiøler, Lars Kai Hansen & Jan Larsen

Spoken Term Detection Using Multiple Speech Recognizers Outputs at NTCIR-9 SpokenDoc STD subtask

Tagging Video Contents with Positive/Negative Interest Based on User s Facial Expression

Face Objects Detection in still images using Viola-Jones Algorithm through MATLAB TOOLS

暗い Lena トーンマッピング とは? 明るい Lena. 元の Lena. tone mapped. image. original. image. tone mapped. tone mapped image. image. original image. original.

A Real Time Facial Expression Classification System Using Local Binary Patterns

Detection of a Single Hand Shape in the Foreground of Still Images

Analysis on the Multi-stakeholder Structure in the IGF Discussions

A Novel Visual Speech Representation and HMM Classification for Visual Speech Recognition

Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study

Xing Fan, Carlos Busso and John H.L. Hansen

Robust Audio-Visual Speech Recognition under Noisy Audio-Video Conditions

Definition of Visual Speech Element and Research on a Method of Extracting Feature Vector for Korean Lip-Reading

Audio-Visual Speech Recognition for Slavonic Languages (Czech and Russian)

Relaxed Consistency models and software distributed memory. Computer Architecture Textbook pp.79-83

Face Recognition for Mobile Devices

Studies of Large-Scale Data Visualization: EXTRAWING and Visual Data Mining

A Multimodal Sensor Fusion Architecture for Audio-Visual Speech Recognition

Audio Visual Speech Recognition

Facial Feature Detection

Simultaneous Design of Feature Extractor and Pattern Classifer Using the Minimum Classification Error Training Algorithm

Hierarchical Audio-Visual Cue Integration Framework for Activity Analysis in Intelligent Meeting Rooms

Yamaha Steinberg USB Driver V for Windows Release Notes

Realistic Talking Head for Human-Car-Entertainment Services

Video Annotation and Retrieval Using Vague Shot Intervals

Yamaha Steinberg USB Driver V for Mac Release Notes

M I RA Lab. Speech Animation. Where do we stand today? Speech Animation : Hierarchy. What are the technologies?

A long, deep and wide artificial neural net for robust speech recognition in unknown noise

DECODING VISEMES: IMPROVING MACHINE LIP-READING. Helen L. Bear and Richard Harvey

Face Recognition-based Automatic Tagging Scheme for SNS

Improving Speaker Verification Performance in Presence of Spoofing Attacks Using Out-of-Domain Spoofed Data

LIP ACTIVITY DETECTION FOR TALKING FACES CLASSIFICATION IN TV-CONTENT

PHONEME-TO-VISEME MAPPING FOR VISUAL SPEECH RECOGNITION

AUDIOVISUAL SPEECH RECOGNITION USING MULTISCALE NONLINEAR IMAGE DECOMPOSITION

Skeleton Cube for Lighting Environment Estimation

A Robust Method of Facial Feature Tracking for Moving Images

Discriminative training and Feature combination

Lip Tracking for MPEG-4 Facial Animation

Hybrid Approach of Facial Expression Recognition

電脳梁山泊烏賊塾 構造体のサイズ. Visual Basic

Multifactor Fusion for Audio-Visual Speaker Recognition

Baseball Game Highlight & Event Detection

SVD-based Universal DNN Modeling for Multiple Scenarios

The Nottingham eprints service makes this work by researchers of the University of Nottingham available open access under the following conditions.

Query Translation for CLIR: EWC vs. Google Translate

Motion Path Searches for Maritime Robots

Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos

Progress Report of Final Year Project

Vehicle Calibration Techniques Established and Substantiated for Motorcycles

PCIe SSD PACC EP P3700 Intel Solid-State Drive Data Center Tool

Off-line Character Recognition using On-line Character Writing Information

Adaptive Feature Extraction with Haar-like Features for Visual Tracking

Transcription:

MIRU 7 AAM 657 81 1 1 657 81 1 1 E-mail: {komai,miyamoto}@me.cs.scitec.kobe-u.ac.jp, {takigu,ariki}@kobe-u.ac.jp Active Appearance Model Active Appearance Model combined DCT Active Appearance Model combined Phoneme Analysis of Image Feature in Utterance Recognition Using AAM in Lip Area Yuto KOMAI, Chikoto MIYAMOTO, Tetsuya TAKIGUCHI, and Yasuo ARIKI Graduate School of System Informatics, Kobe University Rokkodaicho 1 1, Nada-ku, Kobe, Hyogo, 657 81 Japan Organization of Advanced Science and Technology, Kobe University Rokkodaicho 1 1, Nada-ku, Kobe, Hyogo, 657 81 Japan E-mail: {komai,miyamoto}@me.cs.scitec.kobe-u.ac.jp, {takigu,ariki}@kobe-u.ac.jp Abstract Lip-reading, that recognizes the content of the utterance from the movement of the lip enables the recognition under a noise environment. Multimodal speech recognition using lip dynamic scene information together with acoustic information is attracting attention and the research is advanced in recent years. Since visual information plays a great role in multimodal speech recognition, what to select as the image feature becomes a significant point. This paper proposes, for spoken word recognition, to utilize c combined parameter as the image feature extracted by Active Appearance Model applied to a face image including the lip area. Combined parameter contains information of the coordinate value and the intensity value as the image feature. The recognition rate was improved by the proposed method compared to the conventional method such as DCT and the principal component score. Moreover, we integrated with high accuracy the phoneme score from acoustic information and the viseme score from visual information. Key words Lip area, Active Appearance Model, Combined parameter, Integration of audio and visual, Viseme 1. IS3-31:1771

[1] [2] [3] [4] [5] RGB [6] [7] [8] [9] SNAKE [] Active Shape Model [11] Active Appearance Model [12] [13] [14] [15] [2] [3] [15] [16] DCT [14] [17] Active Appearance Models AAM, AAM combined c shape texture c HMM Haar-like AdaBoost [18] [19] [] 2. 3. AAM 4. 5. 216 6. 7. 2. 1 Haar-like AdaBoost AAM AAM AdaBoost AAM 音声 特徴抽出 MFCC HMM q q q 1 2 3 q4 1 入力動画 結果統合 認識 画像 顔検出 顔領域 AAM 適用 唇領域 AAM 適用 特徴抽出 c パラメータ HMM q q q 1 2 3 q4 AAM AAM AAM AAM 2 AAM AAM AAM c texture [21] AAM AAM 2 AAM AAM AAM c HMM HMM IS3-31:1772

HMM 3. 3. 1 Active Appearance Models AAM Cootes shape texture shape s s s s texture texture g s g 1 2 s = x 1, y 1,..., x n, y n T 1 AAM のモデル構築に用いた 63 点の特徴点を与えた画像 g = g 1,..., g m T 2 x i, y i i < = n g j j < = m s s ḡ s g s ḡ P s P g 3 4 s = s + P s b s 3 顔領域の shape 顔領域 AAM 唇領域の shape 唇領域 AAM g = ḡ + P g b g 4 2 2 AAM b s b g shape texture shape texture shape texture b s b g 5 6 W s b s W s P T s s s b = = = Qc 5 b g P T g g ḡ Q = Q s Q g 6 W s shape texture Q c shape texture combined c s g 7 8 sc = s + P s W s 1 Q s c 7 gc = ḡ + P g Q g c 8 c shape texture 3. 2 2. 2 AAM AAM shape texture 2 8 11 12 8 63 AAM shape texture 3. 3 Combined AAM 3 c c c c I i I i Wp AAM gc e 9 e c p IS3-31:1773

閉じた口 平均テクスチャ う 4. HMM HMM HMM HMM MFCC12, 39 2. [4] 3 あい c L A+V = αl A + 1 αl V, < = α < = 1 L A+V L A L V α 5. 5. 1 ec, p = gc I i Wp 2 9 p W c shape texture 95% 78 c 3 1 3 c c, 3. 4 c 2 DCT [22] AAM 32 32 2 DCT 32 32 16 16 16 16 % DCT 4 4 16 48 ATR 216 ATR 1 Logicool Qcam Orbit MP 9 7 fps SONY ECM-PC / / cm 1 SN db - 216 leave-one-out 9 1 closed 216 1 open HMM 3 4 HMM closed HMM monophone closed open open 5. 2 4 4 closed1 HMM closed2 closed HMM open open HMM DCT 2 DCT PCA C parameterface AAM c C parameterlip AAM c IS3-31:1774

率識 7 認 DCT PCA C parameterface C parameterlip Speech 98.8 99.198.9 99 99.9 97.9 96 81.3 82.7 8.5 8.2 72 68 67 61 5 8 % 率 7 識認 1 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18 19 DCT PCA C parameter 混合数 closed2 closed1 closed2 open 4 closed1 c closed2 c DCT 2.2 PCA 1.4 c 2.5 open DCT 11 PCA 4 c 5 closed2 open c closed2 open closed1 closed1 closed2 HMM HMM HMM HMM open closed2 5 6 HMM 5 6 closed2 closed2 closed2 open closed2 closed2 open 8 % 率 7 識認 6 1 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 18 19 DCT PCA C parameter 混合数 open 5. 3 SN db - c HMM HMM.1 7 8 9 7 8 9 HMM closed1 closed HMM closed2 open HMM open 1 clean SN 8 % 7.1.2.3.4.5.6.7.8.9 1 7 closed1 db - IS3-31:1775

7 7.1.2.3.4.5.6.7.8.9 1 8 closed2 db - db - a a: b by ch d dy e e: f g gy h hy i i: j k ky m my n N ny o o: p py q r ry s sh t ts u u: w y z 欠落 a 84 3 1 a: 3 b 24 3 1 1 1 1 2 by 1 ch 13 1 1 1 d 11 1 1 2 5 dy e 33 e: 1 1 f 2 11 1 6 1 5 g 2 8 1 1 1 1 4 gy h 3 1 1 hy i 2 63 1 3 5 i: 1 1 j 1 11 k 2 1 1 26 1 2 ky 1 1 1 m 13 1 my 1 n 5 1 N 1 39 1 ny o 52 1 o: 1 p 6 py q 3 3 r 1 19 6 ry 1 s 1 sh 1 17 t 1 1 ts 5 u 1 1 1 52 34 u: 4 2 w 1 3 y 1 2 3 z 5 挿入 1 1 1 8 confusion matrixopen.1.2.3.4.5.6.7.8.9 1 9 open SN db - SN - 6. 6. 1 1 1 216 HMM open Julius c 11 open confusion matrix 11 c open confusion matrix Correct Accuracy a a: b by ch d dy e e: f g gy h hy i i: j k ky m my n N ny o o: p py q r ry s sh t ts u u: w y z 欠落 a 63 3 11 2 9 a: 3 b 9 1 11 2 11 by 1 ch 5 1 1 2 3 4 d 1 1 1 1 3 3 2 3 2 3 dy e 4 21 3 2 3 e: 2 f 9 1 1 2 3 g 1 1 1 1 1 1 4 1 2 5 gy h 1 1 2 1 hy i 1 1 46 4 1 2 1 18 i: 2 j 6 1 3 2 k 1 2 4 6 3 1 16 ky 1 1 1 m 3 1 9 1 my 1 n 1 1 1 3 N 2 3 7 3 26 ny o 1 47 5 o: 1 p 1 2 3 py q 1 5 r 2 1 2 7 1 13 ry 1 s 3 1 3 1 3 sh 1 1 2 1 3 1 2 4 3 t 3 1 1 1 2 1 1 2 ts 1 1 2 1 u 4 1 63 1 u: 6 w 1 3 y 2 1 3 z 2 1 1 1 挿入 1 1 3 6 6 1 1 3 9 1 2 11 c confusion matrixopen 1 open open closed2 Accuracy Correct Accuracy Correct Accuracy Correct 82.91 82.91 67.81 68.38 65.46 66.21 72.4 75.38 11.85 21.58 37.46 45.87 77.65 79.26.74 45.74 52.86 57.5 1 1 8 open 7 1 2 4 IS3-31:1776

6. 2 11 c i u ii uu ii i uu u ii uu r r a ra N N N N N k g n r b m p 6. 3 6. 2 ka ga viseme closed2 [17] 5. HMM 12 eikyou eigyou 2 eisyou 2 12 13 7.1.2.3.4.5.6.7.8.9 1 closed2 7.1.2.3.4.5.6.7.8.9 1 db - open 12 closed2 13 open closed2 open 1 8 9 8 9 14 confusion matrix 2 14 a i u e o p r sy w t s y vf N 欠落 a 73 8 i 57 21 u 68 6 27 e 3 3 26 4 o 5 p 54 1 r 1 11 2 3 9 sy 1 44 2 2 2 w 1 26 3 t 5 2 2 15 1 2 5 7 s 5 7 8 1 y 1 2 2 vf 2 8 7 3 18 17 N 2 29 挿入 18 1 3 1 db - c confusion matrixopen 2 1 IS3-31:1777

2 open closed2 Accuracy Correct Accuracy Correct 75.9 75.9 78.21 78.58 47.69 57.85 63.28 68.44 62.54 67.35 71.59 74.8 closed2 14 N 6. 1 t t t d n t t vf N 7. AAM combined confusion matrix 1 AAM monophone HMM triphone HMM [1] G Potamianos, H.P. Graf, Discriminative Training of HMM Stream Exponents for Audio-Visual Speech Recognition, Proc. ICASSP98, Seattle, U.S.A., vol.6, pp.3733-3736, 1998. [2] pp.3-4, Sep, 2. [3],,,,,, pp.193-194, Apr. 3. [4] WIT8-7 pp.37-42, 8. [5] HMM pp.111-112,. [6] Vol.J85-D-, No.12, pp.1813-1822, 2. [7] Vol.J89-D, No.9, pp.25-32, 6. [8] Rainer Stiefelhagen, Uwe Meiger, Jie Yang, Real- Time Lip-Tracking For LipReading, Eurospeech 97, 1997. [9] Vol.J84-D-, No.3, pp.459-47, 1. [] PRMU3-276, pp.121-126, 4. [11] Juergen Luettin, Neil A. Thacker, Steve W.Beet, Visual Speech Recognition using Active Shape Models And Hidden Markov Models ICASSP-96, Vol.2, pp.817-8, 1996. [12] Cootes, T.F., Active Appearance Model, Proc. European Conference on Computer Vision, Vol2, pp.484-498, 1998. [13] Cootes, T.F., K Walker, C.J.Taylor, View-based Active Appearance Models Image and Vision Computing, pp.657-664, 2. [14] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, D. Vergyri, J. Sison, A. Mashari, and J. Zhou, Audio-visual speech recognition, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD, Final Workshop Report, Oct.. [15] MIRU8, IS-17, pp.434-439, 8. [16] HMM Vol.PRMU2-124, pp.25-, 2. [17] JSAI4, 1E2-2, pp.1-4, 4. [18] P.Viola, M. Jones, Rapid Object Detection Using Boosted Cascade of Simple Features In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp.1-9, 1. [19] Joint Haar-like MIRU5 pp.4-111 5. [] SP8-7, pp.1-6, 8. [21] AAM MIRU9 pp.769-776 9. [22] Vol.1, No.385, pp.25-32, 1. IS3-31:1778