Joint Processing of Audio and Visual Information for Speech Recognition
1 Joint Processing of Audio and Visual Information for Speech Recognition Chalapathy Neti and Gerasimos Potamianos, IBM T.J. Watson Research Center, Yorktown Heights, NY. With Iain Matthews (CMU, USA) and Juergen Luettin (IDIAP, Switzerland). Baltimore, 07/14/00
2 JHU Workshop, 2000 Theme: Audio-visual speech recognition. Goal: Use visual information to improve audio-based speech recognition, particularly in acoustically degraded conditions. TEAM: Senior researchers: Chalapathy Neti (lead), Gerasimos Potamianos (IBM), Iain Matthews (CMU), Juergen Luettin (IDIAP, Switzerland), Andreas Andreou (JHU). Graduate students: Herve Glotin (ICP, Grenoble), Eugenio Culurciello (JHU). Undergraduates: Azad Mashari (U. Toronto), June Sison (UCSC). DATA: Subset of IBM VVAV data, LDC broadcast video data.
3 Perceptual Information Interfaces (PII) Joint processing of audio, visual and other information for active acquisition, recognition and interpretation of objects and events. Is anybody speaking? Speech, music, etc. Who is speaking? What is being said? Speaker state? When speaking; emotional state. What is the environment? Office, studio, street, etc. Who is in the environment? Crowd, meeting, anchor, etc.
4 Audio-Visual Processing Technologies Exploit visual information to improve audio-based processing in adverse acoustic conditions. Example contexts: Audio-Visual Speech Recognition (MMSP99, ICASSP00, ICME00): Combine visual speech (visemes) with audio speech (phones) recognition. Audio-Visual Speaker Recognition (AVSP99, MMSP99, ICME00): Combine audio signatures of a person with visual signatures (face-id). Audio-Visual Speaker Segmentation (RIAO00): Combine visual scene change with audio-based speaker change. Audio-Visual Speech-Event Detection (ICASSP00): Combine visual speech onset cues with audio-based speech energy. Audio-Visual Speech Synthesis (ICME00). Applications: Human-computer interaction, multimedia content indexing, HCI for people with special needs. In this workshop, we propose to attack the problem of LV audio-visual speech recognition. [Diagram: audio and video data streams combined by decision fusion into an A-V descriptor.]
5 Audio-Visual ASR: Motivation OBJECTIVE: To significantly improve ASR performance by joint use of audio and visual information. FACTS / MOTIVATION: Lack of ASR robustness to noise / mismatched training-testing conditions. Humans speechread to increase intelligibility of noisy speech (Summerfield experiment). Humans fuse audio and visual information in speech recognition (McGurk effect: acoustic /ba/ + visual /ga/ is perceived as /da/). Hearing-impaired people speechread well. Visual and audio information are partially complementary: acoustically confusable phonemes belong to different viseme classes. Inexpensive capture of visual information.
6 Transcription Accuracy - Results from the '98 DARPA Evaluation (IBM Research System) [Chart: transcription accuracy by focus condition: prepared, conversation, telephone, background music, background noise, foreign accent, combinations, overall.]
7 Bimodal Human Speech Recognition (Summerfield, '79) Experiment: Human transcription of audio-visual speech at SNR < 0 dB. A - Acoustic only. B - Acoustic + full video. C - Acoustic + lip region. D - Acoustic + 4 lip points. [Chart: word recognition rates for conditions A-D.]
8 Visemes vs. Phonemes Viseme is a unit of visual speech. Phoneme to viseme is a many-to-one mapping: 1. f, v 2. th, dh 3. s, z 4. sh, zh 5. p, b, m 6. w 7. r 8. g, k, n, t, d, y 9. l 10. iy 11. ah 12. eh. Viseme classes are different from phonetic classes: e.g., /m/ and /n/ belong to the same acoustic class (nasals), whereas they belong to separate viseme classes.
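The many-to-one phoneme-to-viseme mapping above can be sketched directly as a lookup table. This is an illustrative encoding of the slide's 12 classes; the dict and function names are not from the talk.

```python
# The 12 viseme classes from the slide, each listing its member phonemes.
VISEME_CLASSES = {
    1: ["f", "v"],
    2: ["th", "dh"],
    3: ["s", "z"],
    4: ["sh", "zh"],
    5: ["p", "b", "m"],
    6: ["w"],
    7: ["r"],
    8: ["g", "k", "n", "t", "d", "y"],
    9: ["l"],
    10: ["iy"],
    11: ["ah"],
    12: ["eh"],
}

# Invert into the many-to-one phoneme -> viseme mapping.
PHONEME_TO_VISEME = {p: c for c, phones in VISEME_CLASSES.items() for p in phones}

def same_viseme(a, b):
    """True if two phonemes are visually confusable (same viseme class)."""
    return PHONEME_TO_VISEME[a] == PHONEME_TO_VISEME[b]
```

Note how /m/ (class 5) and /n/ (class 8) are acoustically both nasals yet visually distinct, which is exactly the complementarity the motivation slide relies on.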
9 McGurk Effect
10 Visual-Only ASR Performance Digit / alpha recognition (Potamianos): Single-speaker digits = 97.0%. Single-speaker alphas = 71.1%. 50-speaker alphas = 36.5%. Phonetic classification (Potamianos, Neti, Verma, Iyengar): 162 speakers (VV-AV), LVCSR = 37.3%. [Chart: single-speaker digit accuracy for lip-shape, DWT, PCA, and LDA features.]
11 Audio-Visual ASR: Challenges Suitable databases are just emerging. At IBM (Giri, Potamianos, Neti): Collected a one-of-a-kind LVCSR audio-visual database (50 hrs, >260 subjects). Previous work: AT&T (Potamianos): 1.5 hr, 50 subjects, connected letters. Surrey (Kittler): M2VTS (37 subjects, connected digits). Visual front end: Face and lip/mouth tracking. Mouth center location and size estimates suffice, and are robust (Senior). Visual features suitable for speechreading: Pixel-based, image transforms of video ROI (DCT, DWT, PCA, LDA). Audio-visual fusion (integration) for bimodal LVCSR: Decision fusion, by means of the multi-stream HMM. Exponent training. Confidence estimation of individual streams. Integration level (state, phone, word).
12 The Audio-Visual Database Previous work (CMU, AT&T, U. Grenoble; (X)M2VTS, Tulips databases): Small vocabulary (digits, alphas). Small number of subjects (1-10). Isolated or connected word speech. The IBM ViaVoice audio-visual database (VV-AV): ViaVoice enrollment sentences. 50 hrs of speech. >260 subjects. MPEG2 compressed, 60 Hz, 704 x 480 pixels, frontal subject view. Largest database to date.
13 Facial feature location and tracking
14 Face and Facial Feature Tracking Locate the face in each frame (Senior, '99): Search image pyramid across scales and locations. Each square mxn region is considered as a face candidate. Statistical, pixel-based approach, using LDA and PCA. Locate 29 facial features (6 mouth ones) within located faces: Statistical, pixel-based approach. Uses face geometry statistics to prune erroneous feature candidates.
15 Visual Speech Features (I) Lip shape based features: Lip area, height, width, perimeter (Petajan, Benoit). Lip image moments (Potamianos). Lip contour parametrization: Deformable template parameters (Stork). Active shape models (Luettin, Blake). Fourier descriptors (Potamianos). These fail to capture oral cavity information.
16 Visual Speech Features (II) Image transform, pixel-based approach. Considers a video region of interest: V(t) = { V(x,y,z): x = 1...m, y = 1...n, z = t-t_L...t+t_R }. Compresses the region of interest using an appropriate image transform (trained from the data), such as: Discrete cosine transform (DCT). Discrete wavelet transform (DWT). Principal component analysis (PCA). Linear discriminant analysis (LDA). Final feature vector: o_v(t) = S G P' V(t), a cascade of trained linear transforms applied to the ROI pixels.
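The pixel-based front end above can be sketched as: stack the mouth ROI over a window of t_L past and t_R future frames, then project the pixels with a trained linear transform. This is a minimal sketch in which a PCA fitted on synthetic data stands in for the DCT/PCA/LDA cascades trained in the talk; all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, t_L, t_R = 16, 16, 2, 2          # ROI size and temporal context (assumed)
D = m * n * (t_L + 1 + t_R)            # raw dimensionality of the stacked ROI

# Synthetic training ROIs (frames x pixels), only to fit a projection.
train = rng.normal(size=(500, D))
mean = train.mean(axis=0)
# PCA via SVD of the centered data; keep 40 components.
_, _, Vt = np.linalg.svd(train - mean, full_matrices=False)
P = Vt[:40]                            # 40 x D trained projection matrix

def visual_features(roi_window):
    """Map an ROI window V(t) of shape (t_L+1+t_R, m, n) to o_v(t)."""
    v = roi_window.reshape(-1)          # flatten pixels across frames
    return P @ (v - mean)               # compressed 40-dim feature vector

o_v = visual_features(rng.normal(size=(t_L + 1 + t_R, m, n)))
```

The key design point is that the compression is a single linear map learned from data, so at test time feature extraction is one matrix-vector product per frame.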
17 Cascade Transformation [Chart: comparison of PCA, PCA+LDA, and PCA+LDA+MLLT feature cascades.]
18 AAM Features (Iain) Statistical model of: Shape: point distribution model (PDM). Appearance: shape-free eigenface. Combined shape + appearance: combined appearance model. Iterative fit algorithm: Active Appearance Model (AAM) algorithm. [Figure: mode variation from -2 s.d. through the mean to +2 s.d.]
19 Point Distribution Model
20 Active Shape Models Active Shape Model (ASM): PDM in action; iterative fitting constrained in a statistically learned shape space. Iterative fitting algorithm (Cootes et al.): Point-wise search along the normal for an edge, or best fit of a function. Find pose update and apply shape constraints. Reiterate until convergence. Simplex minimisation (Luettin et al.): Concatenate grey-level profiles. Use PCA to calculate a grey-level profile distribution model (GLDM): x_p = x̄_p + P_p b_p.
21 Active Appearance Model (AAM) Extension of ASMs to model both appearance and shape in a single statistical model (Cootes and Taylor, 1998). Shape model: same PDM used with ASM tracking: x = x̄ + P_s b_s. Color appearance model: warp the example landmark point model to the mean shape; sample color and normalise; use PCA to calculate the appearance model: g = ḡ + P_g b_g. Combined shape-appearance model: concatenate shape and appearance parameters.
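The linear shape model x = x̄ + P_s b_s that underlies the PDM/ASM/AAM slides can be sketched as follows. Real models are trained on hand-labelled lip or face landmarks; here synthetic landmark data stands in, and the variable names mirror the slide notation rather than any released implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points = 20                                  # landmarks per shape (assumed)
shapes = rng.normal(size=(200, 2 * n_points))  # training shapes (x1, y1, x2, ...)

x_bar = shapes.mean(axis=0)                    # mean shape
_, _, Vt = np.linalg.svd(shapes - x_bar, full_matrices=False)
P_s = Vt[:5].T                                 # first 5 modes of shape variation

def to_params(x):
    """Project a shape into the learned shape space: b_s = P_s' (x - x_bar)."""
    return P_s.T @ (x - x_bar)

def to_shape(b_s):
    """Reconstruct a shape from its parameters: x = x_bar + P_s b_s."""
    return x_bar + P_s @ b_s

b = to_params(shapes[0])
x_rec = to_shape(b)
```

Constraining b_s during iterative fitting, e.g. clipping each parameter to about +/-2 standard deviations as in the mode-variation figure, is what keeps ASM/AAM fits inside the space of plausible shapes.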
22 Face appearance models
23 Audio-Visual Sensor Fusion Feature fusion vs. decision fusion
24 Feature Fusion Method: Concatenate audio and visual observation vectors (after time-alignment): O(t) = [ O_A(t), O_V(t) ]. Train a single-stream HMM: P[ O(t) | state j ] = Σ_{m=1...M_j} w_jm N( O(t); μ_jm, Σ_jm ).
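The feature-fusion step above is just vector concatenation per time frame. A minimal sketch, with illustrative dimensionalities (the talk's actual audio and visual feature sizes may differ):

```python
import numpy as np

# Feature fusion: after time-aligning the two streams (the visual stream is
# typically interpolated up to the audio frame rate), concatenate the
# per-frame observation vectors and model the result with one HMM.
o_audio = np.random.randn(60)    # audio observation O_A(t), e.g. cepstra+LDA
o_visual = np.random.randn(40)   # visual observation O_V(t), e.g. DWT+LDA

o_av = np.concatenate([o_audio, o_visual])   # O(t) = [ O_A(t), O_V(t) ]
```

The appeal is simplicity: standard single-stream HMM training applies unchanged. The cost, addressed by decision fusion on the next slide, is that one set of Gaussians must model both modalities jointly and cannot reweight a degraded stream.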
25 Decision Fusion Formulation (general setting): S information sources about classes j ∈ J = {1,...,J}. Time-synchronous observations per source: O_s(t) ∈ R^{D_s}, s = 1,...,S. Multi-modal observation vector: O(t) = [ O_1(t), O_2(t),..., O_S(t) ] ∈ R^D. Single-modal HMMs: Pr[ O_s(t) | j ] = Σ_{m=1...M_js} w_jms N_{D_s}( O_s(t); μ_jms, Σ_jms ). Multi-stream HMM: Score[ O(t) | j ] = Σ_{s=1...S} λ_js log Pr[ O_s(t) | j ]. Stream exponents: 0 ≤ λ_js ≤ S, Σ_{s=1...S} λ_js = S, j ∈ J.
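The multi-stream score is a log-linear combination of per-stream likelihoods under the exponent constraints stated above. A minimal sketch with placeholder likelihood values (the real per-stream likelihoods come from the single-modal HMMs):

```python
import math

def multistream_score(stream_likelihoods, exponents):
    """Score[O(t)|j] = sum_s lambda_js * log Pr[O_s(t)|j],
    with 0 <= lambda_js <= S and sum_s lambda_js = S."""
    S = len(stream_likelihoods)
    assert all(0.0 <= lam <= S for lam in exponents)
    assert abs(sum(exponents) - S) < 1e-9       # exponents sum to S
    return sum(lam * math.log(p)
               for lam, p in zip(exponents, stream_likelihoods))

# Two streams (audio, visual); weight the audio stream more, as one would
# in clean acoustic conditions. Likelihoods 0.9 and 0.4 are placeholders.
score = multistream_score([0.9, 0.4], [1.4, 0.6])
```

Because the exponents multiply log-likelihoods, setting an exponent toward 0 effectively mutes an unreliable stream, which is why their estimation (GPD training, confidence measures) is a central workshop topic.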
26 Multi-Stream HMM Training HMM parameters: Stream parameters: θ_s = [ (w_jms, μ_jms, Σ_jms), m = 1,...,M_js, j ∈ J ]. Stream exponents: θ_str = [ λ_js, j ∈ J, s = 1,...,S ]. Combined parameters: θ = [ θ_s, s = 1,...,S; θ_str ]. Train single-stream HMMs first (EM training): θ_s^(k+1) = argmax_{θ_s} Q( θ^(k), θ_s | O ), s = 1,...,S. Discriminatively train exponents by means of the GPD algorithm. Results (50-subject connected letters, clean speech): Audio-only word accuracy: 85.9%. Visual-only word accuracy: 36.5%. Audio-visual word accuracy: 88.9%.
27 Multi-Stream HMM (Juergen et al.) Can allow for asynchrony between the acoustic and visual feature streams.
28 Audio-visual decoding
29 Fusion Research Issues Stream independence assumption is unrealistic. Probability is preferable to a "score" formulation. Integration level: feature, state, phone, word. Classes of interest: visemes vs. phonemes. Stream confidence estimation: entropy, SNR based, smoothly time varying? General decision fusion function: Score[ O(t) | j ] = f( Pr[ O_s(t) | j ], s = 1,...,S ). Lattice / n-best rescoring techniques.
30 Audio-Visual Speech Recognition Word recognition experiments: 5.5 hrs of training, 162 speakers; 1.4 hrs of testing, 162 speakers. Audio features: cepstra+LDA+MLLT. Visual features: DWT+LDA+MLLT. Audio-visual feature fusion: (cepstra, DWT)+LDA+MLLT. Noise: speech noise at 10 dB SNR. [Chart: word accuracy for audio, visual, and A-V systems, clean and at 10 dB SNR.]
31 Summary Workshop concentration will be on decision fusion techniques for audio-visual ASR. Multi-stream HMM. Lattice rescoring. Others. We (IBM) are providing the LVCSR AV database and baseline lattices and models. This workshop is a unique opportunity to advance the state of the art in audio-visual ASR.
32 Scientific Merit Significantly improve LVCSR robustness: digit/alpha SD -> LVCSR SI. First step towards perceptual computing: use of diverse sources of information (audio; visual - gaze, pointing, facial gestures; touch) to robustly extract human activity and intent for human information interaction (HII). An exemplar for developing a generalized framework for information fusion: measures for reliability of information sources and their use in making a joint decision; fusion functions.
33 Applications Perceptual interfaces for HII: speech recognition, speaker identification, speaker segmentation, human intent detection, speech understanding, speech synthesis, scene understanding. Multimedia content indexing and retrieval (MPEG7 context): Internet digital media content is exploding. Network companies have aggressive plans for digitizing archival content. HCI for people with special needs, e.g., speech-impaired people due to hearing impairment.
34 Workshop Planning Pre-workshop: Data preparation (IBM VVAV and LDC CNN): data collection, transcription and annotation. Audio and visual front ends (DCT); setup for AAM feature extraction. Trained audio-only and visual-only HMMs. Audio-only lattices. Workshop: AAM-based visual front end. Baseline audio-visual HMM using feature fusion. Lattice rescoring experiments. Time-varying HMM stream exponent estimation. Stream information / confidence measures. Adaptation to CNN data. Other recombination functions.
More informationRecognizing faces in broadcast video
To appear in the proceedings of the IEEE workshop on Real-Time Analysis and Tracking of Face and Gesture in Real-Time 1 Systems, Kerkyra, Greece, September 1999 Recognizing faces in broadcast video A.W.Senior
More informationAAM Based Facial Feature Tracking with Kinect
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 15, No 3 Sofia 2015 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.1515/cait-2015-0046 AAM Based Facial Feature Tracking
More informationFACE ANALYSIS AND SYNTHESIS FOR INTERACTIVE ENTERTAINMENT
FACE ANALYSIS AND SYNTHESIS FOR INTERACTIVE ENTERTAINMENT Shoichiro IWASAWA*I, Tatsuo YOTSUKURA*2, Shigeo MORISHIMA*2 */ Telecommunication Advancement Organization *2Facu!ty of Engineering, Seikei University
More informationP-CNN: Pose-based CNN Features for Action Recognition. Iman Rezazadeh
P-CNN: Pose-based CNN Features for Action Recognition Iman Rezazadeh Introduction automatic understanding of dynamic scenes strong variations of people and scenes in motion and appearance Fine-grained
More informationWorkshop W14 - Audio Gets Smart: Semantic Audio Analysis & Metadata Standards
Workshop W14 - Audio Gets Smart: Semantic Audio Analysis & Metadata Standards Jürgen Herre for Integrated Circuits (FhG-IIS) Erlangen, Germany Jürgen Herre, hrr@iis.fhg.de Page 1 Overview Extracting meaning
More informationLip Tracking for MPEG-4 Facial Animation
Lip Tracking for MPEG-4 Facial Animation Zhilin Wu, Petar S. Aleksic, and Aggelos. atsaggelos Department of Electrical and Computer Engineering Northwestern University 45 North Sheridan Road, Evanston,
More informationClassifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao
Classifying Images with Visual/Textual Cues By Steven Kappes and Yan Cao Motivation Image search Building large sets of classified images Robotics Background Object recognition is unsolved Deformable shaped
More informationModeling Coarticulation in Continuous Speech
ing in Oregon Health & Science University Center for Spoken Language Understanding December 16, 2013 Outline in 1 2 3 4 5 2 / 40 in is the influence of one phoneme on another Figure: of coarticulation
More informationClassification of Face Images for Gender, Age, Facial Expression, and Identity 1
Proc. Int. Conf. on Artificial Neural Networks (ICANN 05), Warsaw, LNCS 3696, vol. I, pp. 569-574, Springer Verlag 2005 Classification of Face Images for Gender, Age, Facial Expression, and Identity 1
More informationPerceptual coding. A psychoacoustic model is used to identify those signals that are influenced by both these effects.
Perceptual coding Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal. Perceptual encoders, however, have been designed for the compression of general
More informationState of The Art In 3D Face Recognition
State of The Art In 3D Face Recognition Index 1 FROM 2D TO 3D 3 2 SHORT BACKGROUND 4 2.1 THE MOST INTERESTING 3D RECOGNITION SYSTEMS 4 2.1.1 FACE RECOGNITION USING RANGE IMAGES [1] 4 2.1.2 FACE RECOGNITION
More information22 October, 2012 MVA ENS Cachan. Lecture 5: Introduction to generative models Iasonas Kokkinos
Machine Learning for Computer Vision 1 22 October, 2012 MVA ENS Cachan Lecture 5: Introduction to generative models Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Center for Visual Computing Ecole Centrale Paris
More informationDynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR
INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR Sebastian Gergen 1, Steffen Zeiler 1, Ahmed Hussen Abdelaziz 2, Robert Nickel
More informationA Novel Visual Speech Representation and HMM Classification for Visual Speech Recognition
A Novel Visual Speech Representation and HMM Classification for Visual Speech Recognition Dahai Yu, Ovidiu Ghita, Alistair Sutherland*, Paul F Whelan Vision Systems Group, School of Electronic Engineering
More informationOn Modeling Variations for Face Authentication
On Modeling Variations for Face Authentication Xiaoming Liu Tsuhan Chen B.V.K. Vijaya Kumar Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 xiaoming@andrew.cmu.edu
More informationLearning The Lexicon!
Learning The Lexicon! A Pronunciation Mixture Model! Ian McGraw! (imcgraw@mit.edu)! Ibrahim Badr Jim Glass! Computer Science and Artificial Intelligence Lab! Massachusetts Institute of Technology! Cambridge,
More informationFurther Details Contact: A. Vinay , , #301, 303 & 304,3rdFloor, AVR Buildings, Opp to SV Music College, Balaji
S.No TITLES DOMAIN DIGITAL 1 Image Haze Removal via Reference Retrieval and Scene Prior 2 Segmentation of Optic Disc from Fundus images 3 Active Contour Segmentation of Polyps in Capsule Endoscopic Images
More informationFace anti-spoofing using Image Quality Assessment
Face anti-spoofing using Image Quality Assessment Speakers Prisme Polytech Orléans Aladine Chetouani R&D Trusted Services Emna Fourati Outline Face spoofing attacks Image Quality Assessment Proposed method
More informationXing Fan, Carlos Busso and John H.L. Hansen
Xing Fan, Carlos Busso and John H.L. Hansen Center for Robust Speech Systems (CRSS) Erik Jonsson School of Engineering & Computer Science Department of Electrical Engineering University of Texas at Dallas
More informationA long, deep and wide artificial neural net for robust speech recognition in unknown noise
A long, deep and wide artificial neural net for robust speech recognition in unknown noise Feipeng Li, Phani S. Nidadavolu, and Hynek Hermansky Center for Language and Speech Processing Johns Hopkins University,
More informationDetection of Mouth Movements and Its Applications to Cross-Modal Analysis of Planning Meetings
2009 International Conference on Multimedia Information Networking and Security Detection of Mouth Movements and Its Applications to Cross-Modal Analysis of Planning Meetings Yingen Xiong Nokia Research
More informationOptical Storage Technology. MPEG Data Compression
Optical Storage Technology MPEG Data Compression MPEG-1 1 Audio Standard Moving Pictures Expert Group (MPEG) was formed in 1988 to devise compression techniques for audio and video. It first devised the
More informationColumbia University High-Level Feature Detection: Parts-based Concept Detectors
TRECVID 2005 Workshop Columbia University High-Level Feature Detection: Parts-based Concept Detectors Dong-Qing Zhang, Shih-Fu Chang, Winston Hsu, Lexin Xie, Eric Zavesky Digital Video and Multimedia Lab
More information