Real-Time Semi-Blind Speech Extraction with Speaker Direction Tracking on Kinect


Yuji Onuma, Noriyoshi Kamado, Hiroshi Saruwatari, Kiyohiro Shikano
Graduate School of Information Science, Nara Institute of Science and Technology, Nara 630-0192, Japan
E-mail: yuji-o@is.naist.jp  Tel: +81-743-72-5287

Abstract: In this paper, we address the improvement of speech recognition accuracy for ICA-based multichannel noise reduction in a spoken-dialogue robot. First, to achieve high recognition accuracy for the early utterances of the target speaker, we introduce a new rapid ICA initialization method combining robot image information with a prestored initial-separation-filter bank. From this image information, an ICA initial filter fitted to the user's direction can be selected to save the user's first utterance. Next, a new permutation-solving method using a probability statistics model is proposed for realistic sound mixtures consisting of point-source speech and diffuse noise. We implement these methods using user tracking on Microsoft Kinect and evaluate the system in a speech recognition experiment in a real environment. The experimental results show that the proposed approaches markedly improve the word recognition accuracy.

I. Introduction

In a hands-free robot dialogue system, the user's voice is picked up at a distance with a microphone array, resulting in a more natural and stress-free interface for humans. In such a system, however, it is difficult to achieve accurate speech recognition because background environmental noise always degrades the target speech quality [1]. In this paper, we address the improvement of speech recognition accuracy for multichannel noise reduction in a spoken-dialogue robot.

As a conventional noise reduction method, independent component analysis (ICA) [2] has been proposed; ICA is a well-known method that is well suited to estimating environmental noise components. Hence, one of the authors has proposed the blind spatial subtraction array (BSSA) [3], which enhances target speech by subtracting an ICA-based noise estimate from the noisy observations via spectral subtraction or Wiener filtering [4]. BSSA has been extended to real-time processing [5], but two inherent problems remain: (a) it is difficult to remove the ambiguity of the source order, i.e., to solve the permutation problem, under diffuse noise conditions, and (b) BSSA's initial filter is always fixed, e.g., to a null beamformer (NBF) [6] steered to the array normal, so the first utterance of the target user cannot be recognized.

In this paper, first, to achieve high recognition accuracy for the early utterances of the target speaker, we introduce a new rapid ICA initialization method combining image information and a prestored initial-separation-filter bank. To cope with problem (b), we assume that the robot has its own video camera, so the direction of the target user can be detected immediately. From this image information, a prestored ICA filter bank can be established with user-direction tags, and an ICA initial filter fitted to the user's direction can be used to save the user's first utterance. Next, a new permutation-solving method using a probability statistics model is proposed; in this method, the difference in shape between the probability density functions of the sources resolves the permutation problem in realistic sound mixtures consisting of point-source speech and diffuse noise.

Fig. 1. Appearance of Kita-systems (Kita-chan and Kita-robo) with Kinect.
Finally, we implement a real-time target speech extraction system consisting of the above-mentioned algorithms using the user tracking ability of Microsoft Kinect (hereafter, Kinect) [7], and we evaluate the system in a speech recognition experiment in a real environment. This Kinect-based speech enhancement system is integrated with our previously developed spoken-dialogue robot, Kita-chan [5], to construct a hands-free robot dialogue system (see Fig. 1). The experimental results in the real environment show that the proposed approaches markedly improve the word recognition accuracy of the real-time ICA-based noise reduction in the robot dialogue system.

II. Related Work: Target Speech Extraction by BSSA

A. Noise Estimation by ICA

In this study, we assume a mixture of point-source target speech and diffuse noise, which typically arises in robot dialogue systems. ICA is known to be proficient at noise estimation under such non-point-source noise conditions [3]. The observed signal vector of a J-channel array in the time-frequency domain is given by

$\mathbf{x}(f,\tau) = \mathbf{h}(f,\theta)\, s(f,\tau,\theta) + \mathbf{n}(f,\tau)$,   (1)

where $f$ is the frequency bin, $\tau$ is the frame number, $\mathbf{x}(f,\tau) = [x_1(f,\tau), \ldots, x_J(f,\tau)]^{\mathrm{T}}$ is the observed signal vector, $\mathbf{h}(f,\theta) = [h_1(f,\theta), \ldots, h_J(f,\theta)]^{\mathrm{T}}$ is a column vector of transfer functions from the target signal to each microphone, $s(f,\tau,\theta)$ is the target speech signal component, $\mathbf{n}(f,\tau) = [n_1(f,\tau), \ldots, n_J(f,\tau)]^{\mathrm{T}}$ is a column vector of the additive noise signals, and $\theta$ is the direction of arrival (DOA) of the source.

In ICA, we perform signal separation using a complex-valued matrix $\mathbf{W}_{\mathrm{ICA}}(f,\theta)$ such that the output signals $\mathbf{o}(f,\tau,\theta)$ become mutually independent; this procedure can be represented as

$\mathbf{o}(f,\tau,\theta) = [o_1(f,\tau,\theta), \ldots, o_K(f,\tau,\theta)]^{\mathrm{T}} = \mathbf{W}_{\mathrm{ICA}}(f,\theta)\, \mathbf{x}(f,\tau)$,   (2)

where $\mathbf{o}(f,\tau,\theta)$ is the separated signal vector, $K$ is the number of outputs, and $\mathbf{W}_{\mathrm{ICA}}(f,\theta)$ is the separation matrix for canceling out the signal arriving from $\theta$. The update rule of $\mathbf{W}_{\mathrm{ICA}}(f,\theta)$ is

$\mathbf{W}_{\mathrm{ICA}}^{[p+1]}(f,\theta) = \mu \left[ \mathbf{I} - \langle \boldsymbol{\varphi}(\mathbf{o}(f,\tau,\theta))\, \mathbf{o}^{\mathrm{H}}(f,\tau,\theta) \rangle_\tau \right] \mathbf{W}_{\mathrm{ICA}}^{[p]}(f,\theta) + \mathbf{W}_{\mathrm{ICA}}^{[p]}(f,\theta)$,   (3)

where $\mu$ is the step-size parameter, $[p]$ denotes the value at the $p$-th iteration, $\mathbf{I}$ is the identity matrix, $\langle \cdot \rangle_\tau$ denotes a time-averaging operator, $\mathbf{M}^{\mathrm{H}}$ denotes the conjugate transpose of matrix $\mathbf{M}$, and $\boldsymbol{\varphi}(\cdot)$ is an appropriate nonlinear vector function.

Next, the estimated target speech signal is discarded because we want to estimate only the noise component. Taking $o_U(f,\tau,\theta)$ to be the speech component, we construct the noise-only vector $\mathbf{q}(f,\tau,\theta)$ as

$\mathbf{q}(f,\tau,\theta) = [o_1(f,\tau,\theta), \ldots, o_{U-1}(f,\tau,\theta), 0, o_{U+1}(f,\tau,\theta), \ldots, o_K(f,\tau,\theta)]^{\mathrm{T}}$.   (4)

We then apply the projection-back operation to remove the amplitude ambiguity:

$\hat{\mathbf{q}}(f,\tau,\theta) = [\hat{q}_1(f,\tau,\theta), \ldots, \hat{q}_J(f,\tau,\theta)]^{\mathrm{T}} = \mathbf{W}_{\mathrm{ICA}}^{+}(f,\theta)\, \mathbf{q}(f,\tau,\theta)$,   (5)

where $\mathbf{M}^{+}$ is the Moore-Penrose generalized inverse of $\mathbf{M}$. We must also remove the ambiguity of the source order, i.e., solve the permutation problem; our solution is discussed in Sect. III-C.

Fig. 2. Microphone array in Kinect.

B. Target Speech Extraction

In another path, we enhance the speech signal by delay-and-sum (DS) beamforming; the enhanced signal $y_{\mathrm{DS}}(f,\tau,\theta)$ is given by

$y_{\mathrm{DS}}(f,\tau,\theta) = \mathbf{g}_{\mathrm{DS}}(f,\theta)^{\mathrm{T}}\, \mathbf{x}(f,\tau)$,   (6)

$\mathbf{g}_{\mathrm{DS}}(f,\theta) = [g_1^{(\mathrm{DS})}(f,\theta), \ldots, g_J^{(\mathrm{DS})}(f,\theta)]^{\mathrm{T}}$,   (7)

$g_j^{(\mathrm{DS})}(f,\theta) = \frac{1}{J} \exp\left( -i 2\pi (f/M) f_{\mathrm{s}}\, d_j \sin\theta / c \right)$,   (8)

where $\mathbf{g}_{\mathrm{DS}}(f,\theta)$ is the DS coefficient vector, $f_{\mathrm{s}}$ is the sampling frequency, $d_j$ ($j = 1, \ldots, J$) is the position of the $j$-th microphone, $M$ is the DFT size, and $c$ is the sound velocity.

Finally, we apply generalized spectral subtraction (GSS) [8] as post-processing for target speech extraction, which yields the target speech estimate $y_{\mathrm{BSSA}}(f,\tau,\theta)$:

$y_{\mathrm{BSSA}}(f,\tau,\theta) = \begin{cases} \left( |y_{\mathrm{DS}}(f,\tau,\theta)|^{2n} - \beta\, |z(f,\tau,\theta)|^{2n} \right)^{\frac{1}{2n}} & \text{if } |y_{\mathrm{DS}}(f,\tau,\theta)|^{2n} - \beta\, |z(f,\tau,\theta)|^{2n} \ge 0, \\ \gamma\, |y_{\mathrm{DS}}(f,\tau,\theta)| & \text{otherwise}, \end{cases}$   (9)

where $2n$ is the exponent parameter in GSS, $\beta$ is the subtraction parameter, $\gamma$ is the flooring parameter, and $z(f,\tau,\theta)$ is the noise estimate after DS beamforming, given by

$z(f,\tau,\theta) = \mathbf{g}_{\mathrm{DS}}(f,\theta)^{\mathrm{T}}\, \hat{\mathbf{q}}(f,\tau,\theta)$.   (10)
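The processing chain above is compact enough to state directly in code. The following is a minimal NumPy sketch of Eqs. (2)-(10), one function per block; the choice of nonlinearity $\boldsymbol{\varphi}(\cdot)$, the parameter defaults, and all function and variable names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ica_update(W, X, mu=0.1):
    """One natural-gradient update of Eq. (3) for a single frequency bin.

    W : (K, J) separation matrix W_ICA(f)
    X : (J, T) observed STFT frames x(f, tau) for this bin
    """
    O = W @ X                                            # separated signals, Eq. (2)
    Phi = np.tanh(np.abs(O)) * np.exp(1j * np.angle(O))  # a common choice of phi(.)
    C = (Phi @ O.conj().T) / X.shape[1]                  # <phi(o) o^H>_tau
    return mu * (np.eye(W.shape[0]) - C) @ W + W         # Eq. (3)

def noise_projection_back(W, O, U):
    """Zero out the speech output U and project back, Eqs. (4)-(5)."""
    Q = O.copy()
    Q[U, :] = 0.0                        # discard the speech estimate
    return np.linalg.pinv(W) @ Q         # Moore-Penrose inverse, Eq. (5)

def ds_weights(f_bin, theta, d, fs=16000, M=512, c=340.0):
    """Delay-and-sum weights of Eq. (8) for microphone positions d (meters)."""
    d = np.asarray(d, dtype=float)
    freq = (f_bin / M) * fs
    return np.exp(-1j * 2 * np.pi * freq * d * np.sin(theta) / c) / len(d)

def gss(y_ds, z, beta=1.4, gamma=0.2, n=1.0):
    """Generalized spectral subtraction magnitude of Eq. (9).

    Defaults mirror the parameter values reported in Sect. IV (2n = 2.0,
    beta = 1.4, gamma = 0.2).
    """
    diff = np.abs(y_ds) ** (2 * n) - beta * np.abs(z) ** (2 * n)
    return np.where(diff >= 0,
                    np.maximum(diff, 0.0) ** (1.0 / (2 * n)),
                    gamma * np.abs(y_ds))
```

In the full system, ica_update would be iterated per frequency bin over a buffered signal block, with the permutation of outputs resolved as described in Sect. III-C before the projection-back step.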
III. Real-Time BSSA with Speaker Tracking on Kinect

A. Outline of Kinect

In this paper, we introduce speaker position estimation by skeleton tracking on Kinect to supply DOA information to the real-time BSSA. With this, we implement a robot auditory interface that achieves high recognition accuracy even for the early utterances of the target speaker.

Kinect is a multi-modal interface comprising an RGB camera, a depth sensor, and a microphone array, which adds motion capture and voice recognition capabilities to the Microsoft Xbox 360 game console. Using the Kinect for Windows SDK [9], the microphone array can be accessed from a PC as a four-channel USB audio input device. The signal from the microphone array (16 kHz sampling frequency, 16-bit quantization) is output over USB via a built-in preamplifier, an A/D converter, and an audio stream controller. The SDK also provides a human skeleton tracking function that instantly yields the positions of the user's head and of each joint with millimeter accuracy.

Figure 2 shows the microphone array of Kinect: four unidirectional microphones with unequal inter-element spacing.
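Because the Kinect array appears to the host as an ordinary four-channel USB audio device, the capture side needs nothing Kinect-specific. Below is a minimal sketch using the generic sounddevice library as a stand-in; the device selection, block size, and callback wiring are assumptions (the authors used the Kinect for Windows SDK on Windows).

```python
import sounddevice as sd

FS = 16000       # 16 kHz sampling, as delivered by the Kinect preamp/ADC chain
CHANNELS = 4     # four unidirectional microphones
BLOCK = 128      # hop size of the STFT front end (see Sect. III-D)

def audio_callback(indata, frames, time_info, status):
    # indata is a (frames, 4) int16 block; hand it to the STFT/BSSA front end here.
    if status:
        print(status)

# With device=None the default input is used; with Kinect attached, select
# its device index instead.
with sd.InputStream(samplerate=FS, channels=CHANNELS, dtype='int16',
                    blocksize=BLOCK, callback=audio_callback):
    sd.sleep(1000)   # capture for one second in this sketch
```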

B. Semi-Blind Speech Extraction with Video Information for Saving Early Utterances

Figure 3 shows a block diagram of the implemented system.

Fig. 3. Block diagram of BSSA with image-information-based user direction detection.

In real-time BSSA, the DS beamforming and spectral subtraction stages can run in real time. However, owing to its huge computational load, ICA cannot provide an effective separation matrix in real time, especially for the first segment of a user utterance. This causes a serious problem: poor recognition accuracy for the first words, which often convey the most important messages of the user, such as commands, decisions, and greetings.

To improve the speech recognition accuracy for early utterances, we introduce a rapid ICA initialization approach combining video image information from the robot's (Kinect's) eyes with a prestored initial-separation-filter bank. If no tag of the prestored ICA filters in the filter bank coincides with the current user's DOA detected from the image information, the system enters a training phase: a new separation filter is updated by ICA and added to the filter bank with the detected-DOA tag. Otherwise, if the system finds a DOA tag corresponding to the current user's DOA, the immediate detection of the user's DOA allows ICA to start from the most appropriate prestored separation matrix, saving the user's early utterances. Figure 4 and the following steps show the flow of processing; a code sketch of the bookkeeping follows the list.

Step 1: Construction of the ICA filter bank. We quantize the user's DOA for the ICA filter bank into five directions in 11.7° steps, dividing the angle of view of Kinect into five sectors, where 0° is normal to the robot face. The system holds a filter bank of ICA separation filters that have been sufficiently trained for each direction. If no trained filter exists for a direction, the system sets an NBF steered to the front as the ICA initial value.

Step 2: User DOA detection by Kinect user tracking. To estimate the direction of the user within the angle of view instantly, we use the skeleton tracking of Kinect. Skeleton tracking detects the user's direction, and our algorithm quantizes it into one of the five DOAs (see Fig. 5). This calculation completes within about 50 ms, which is fast enough to detect the user before the utterance begins. The quantized DOA $\theta$ nearest to the user's DOA estimated via Kinect is used as the tag of $\mathbf{W}_{\mathrm{ICA}}(f,\theta)$, which has already been iteratively optimized.

Step 3: Enhancement of the early utterance. Based on the user direction $\theta$ obtained in Step 2, the system loads the ICA filter $\mathbf{W}_{\mathrm{ICA}}(f,\theta)$ from the filter bank. Using this filter, BSSA performs noise suppression on the current audio input and outputs the target speech signal.

Step 4: Enhancement after the second-block utterances. From the second block of user utterances onward, we perform ICA and the speech enhancement processing described in Sect. II in a blockwise batch manner (see Fig. 6). When the filter training finishes, the current $\mathbf{W}_{\mathrm{ICA}}(f,\theta)$ overwrites the corresponding filter-bank entry in preparation for the next user, and the system returns to Step 2.
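The bookkeeping of Steps 1-4 amounts to quantizing the tracked azimuth onto the five 11.7° sectors and keying the filter bank on the resulting tag. Below is a minimal sketch under these assumptions; the sector labeling, fallback behavior, and all names are illustrative, not taken from the authors' code.

```python
import numpy as np

SECTOR_DEG = 11.7    # five sectors spanning Kinect's angle of view
N_SECTORS = 5        # tag 0 = normal to the robot face

filter_bank = {}     # DOA tag -> trained W_ICA(f, theta) for all frequency bins

def quantize_doa(theta_deg):
    """Step 2: map a tracked azimuth (degrees) to the nearest of five DOA tags."""
    half = N_SECTORS // 2
    idx = int(round(theta_deg / SECTOR_DEG))
    return int(np.clip(idx, -half, half))

def select_initial_filter(theta_deg, nbf_front):
    """Step 3: load a prestored filter for the detected DOA; otherwise fall
    back to the front-steered NBF and let the training phase take over."""
    tag = quantize_doa(theta_deg)
    if tag in filter_bank:
        return tag, filter_bank[tag]   # saves the user's first utterance
    return tag, nbf_front              # training phase: start from the NBF

def update_filter_bank(tag, W_optimized):
    """Step 4: after blockwise ICA converges, overwrite the bank entry."""
    filter_bank[tag] = W_optimized
```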
C. Permutation Solver

In the ICA literature, many methods exist for solving the permutation problem, such as the source-DOA-based method [6] and its combination with subband correlation [10]. These approaches, however, often fail in real environments because of the inherent problem that noise is sometimes diffuse and then has no discriminable DOA. Therefore, in this paper we introduce a new cue for classifying the sources, namely the difference in source statistics, instead of DOA information.

We assume that each signal separated by ICA, $o_k$, can be modeled by the gamma distribution

$P(o_k) = \frac{o_k^{\alpha-1}}{\sigma^{\alpha}\,\Gamma(\alpha)} \exp\left( -\frac{o_k}{\sigma} \right)$,   (11)

where $k$ is the channel index of the separated signal, $\alpha$ is the shape parameter corresponding to the type of source (e.g., $\alpha = 1$ for a Gaussian and $\alpha < 1$ for a super-Gaussian signal), $\sigma$ is the scale parameter of the gamma distribution, and $\Gamma(\alpha)$ is the gamma function, defined as

$\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1} \exp(-t)\, dt$.   (12)

Modeling by the gamma distribution enables us to estimate the scale parameter $\sigma$ and shape parameter $\alpha$ easily from observed raw data samples. These parameters can be estimated by maximum likelihood as

$\hat{\alpha}(o_k) = \frac{3 - \omega + \sqrt{(\omega - 3)^2 + 24\omega}}{12\omega}$,   (13)

$\hat{\sigma}(o_k) = \frac{E[o_k]}{\hat{\alpha}(o_k)}$,   (14)

$\omega = \log\left( E[o_k] \right) - E\left[ \log o_k \right]$.   (15)
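These closed-form estimates are cheap to evaluate on the separated samples of each ICA output. Below is a minimal sketch of (13)-(15), with the channel-selection rule of (16) below anticipated in the last function; the magnitude flooring is an assumption added to keep the logarithm finite.

```python
import numpy as np

def gamma_shape(samples, eps=1e-12):
    """ML shape estimate alpha_hat of Eqs. (13) and (15).

    samples : power spectral (or magnitude) samples of one ICA output channel.
    """
    s = np.maximum(np.abs(samples), eps)                # positive support for log
    omega = np.log(np.mean(s)) - np.mean(np.log(s))     # Eq. (15)
    return (3.0 - omega + np.sqrt((omega - 3.0) ** 2 + 24.0 * omega)) \
           / (12.0 * omega)                             # Eq. (13)

def gamma_scale(samples, alpha_hat):
    """ML scale estimate sigma_hat of Eq. (14)."""
    return np.mean(np.abs(samples)) / alpha_hat

def speech_channel(outputs):
    """Pick the spikiest output (smallest alpha_hat) as speech, cf. Eq. (16)."""
    alphas = [gamma_shape(o) for o in outputs]
    return int(np.argmin(alphas))
```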

Fig. 4. Signal flow in the real-time implementation of the proposed system using user tracking.

For example, it is well known that the subband power spectral components of speech are well modeled by a very spiky distribution, i.e., (11) with $\alpha = 0.1$–$0.2$, whereas those of diffuse noise are typically closer to Gaussian (i.e., $\alpha \simeq 1$) owing to the central limit theorem. We can therefore discriminate the sources and solve the permutation problem using this difference in statistical properties. Finally, in the construction of the noise-only vector in (4), we set the user speech component index $U$ as

$U = \underset{k}{\operatorname{argmin}}\ \hat{\alpha}\left( o_k(f,\tau,\theta) \right)$.   (16)

Thus, the component with the smallest $\hat{\alpha}(o_k)$ is regarded as speech, and it is the one cancelled out in the noise estimation.

Fig. 5. DOA estimation by skeleton tracking on Kinect.

D. Implementation

To perform noise suppression with the robot's audiovisual information, Kinect is connected via USB to a PC (8 GB of memory, 1.86 GHz Core i7 CPU), on which the system was built using Microsoft Visual C++ 2010. In this implementation, the STFT frame length is 512 points, the frame shift is 128 points, and the signal analysis window for the separation filter update by ICA is 256 points.
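With these sizes, the analysis front end is a standard hop-synchronous STFT. A minimal sketch of the framing follows; the window type is an assumption, since the paper does not name one.

```python
import numpy as np

FRAME = 512               # STFT frame length (points)
HOP = 128                 # frame shift (points)
WIN = np.hanning(FRAME)   # assumed analysis window; not specified in the paper

def stft_frames(x):
    """Yield the x(f, tau) spectra for one channel of the 16 kHz input."""
    for start in range(0, len(x) - FRAME + 1, HOP):
        yield np.fft.rfft(WIN * x[start:start + FRAME])
```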

Fig. 6. Configuration for updating the ICA separation filter in real-time BSSA with image information.

IV. Speech Recognition Experiment

A. Experimental Conditions

We conducted a speech recognition experiment to evaluate the effectiveness of the proposed method, comparing the following four speech enhancement systems:

- No noise suppression (Unprocessed).
- DS beamforming (DS).
- Conventional BSSA initialized by an NBF steered to 0° with the conventional DOA-based permutation solver [6] (Conventional).
- Proposed method (Proposed).

TABLE I
Experimental conditions for speech recognition

Database: JNAS [11], 46 speakers (200 sentences)
Speech recognition task: 20k-word newspaper dictation
Acoustic model: phonetic tied mixture (PTM), clean model
Training data for acoustic model: JNAS, 260 speakers (150 sentences per speaker)
Decoder: Julius ver. 4.2 [11]

The experimental environment is shown in Fig. 7; the reverberation time is 200 ms. We emitted speech from the JNAS [11] database through a dummy head as the target user's voice, and real railway-station noise from eight loudspeakers surrounding the Kinect. The SNR between speech and noise was set to 10 dB and 15 dB. The speaker direction $\theta$ was varied over three conditions, namely 0°, 10°, and 20°. The remaining experimental conditions are summarized in Table I. In GSS, the exponent parameter is 2.0, the subtraction parameter is 1.4, and the flooring parameter is 0.2; these values were determined experimentally. Note that this experiment was conducted with real-time processing (not batch processing); the total latency of the processing is about 50 ms.

Fig. 7. Acoustical environment used in the real-world experiment (6.0 m x 4.2 m room; loudspeakers and Kinect at a height of 1.2 m, dummy head at 1.6 m).

B. Results

Figures 8-10 show the results of the speech recognition experiments.

Fig. 8. Results of the speech recognition test in the real-world experiment for a speaker direction of 0°: (a) word correct and (b) word accuracy.

Fig. 9. Results of the speech recognition test in the real-world experiment for a speaker direction of 10°: (a) word correct and (b) word accuracy.

Fig. 10. Results of the speech recognition test in the real-world experiment for a speaker direction of 20°: (a) word correct and (b) word accuracy.

From these figures, we can confirm that the proposed method significantly outperforms the conventional methods for all user directions. In particular, although the proposed method aims only to save the user's early utterances, the total word accuracy over the full sentence is considerably improved; this indicates the importance of recognizing the first utterance. With the conventional method, the recognition rate is markedly low when the user direction is shifted from 0° because the filter does not converge sufficiently. In contrast, the recognition rate of the proposed method remains high because the optimal filter becomes rapidly available with the assistance of the image information.

V. Conclusions

In this paper, first, to save the user's first utterance, we introduced a new rapid ICA initialization method combining the robot's video information with a prestored initial-separation-filter bank. Next, a signal-statistics-based permutation-solving approach was proposed. The experimental results show that the proposed real-time ICA-based noise reduction developed on Kinect can markedly improve the word recognition accuracy of the robot dialogue system.

Acknowledgement

This work was partly supported by the JST Core Research of Evolutional Science and Technology (CREST), Japan.

References

[1] R. Prasad, et al., "Robots that can hear, understand and talk," Advanced Robotics, vol. 18, pp. 533-564, 2004.
[2] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36, pp. 287-314, 1994.
[3] Y. Takahashi, H. Saruwatari, et al., "Blind spatial subtraction array for noisy environment," IEEE Trans. Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 650-664, 2009.
[4] P. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007.
[5] H. Saruwatari, et al., "Hands-free speech recognition challenge for real-world speech dialogue systems," Proc. ICASSP, pp. 3729-3782, 2009.
[6] H. Saruwatari, et al., "Blind source separation combining independent component analysis and beamforming," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1135-1146, 2003.
[7] Microsoft, "Kinect - Xbox.com," http://www.xbox.com/ja-jp/kinect
[8] B. L. Sim, et al., "A parametric formulation of the generalized spectral subtraction method," IEEE Trans. Speech and Audio Processing, vol. 6, no. 4, pp. 328-337, 1998.

[9] Microsoft, "Microsoft Kinect SDK for Developers," http://kinectforwindows.org/
[10] H. Sawada, et al., "A robust and precise method for solving the permutation problem of frequency-domain blind source separation," IEEE Trans. Speech and Audio Processing, vol. 12, no. 5, pp. 530-538, 2004.
[11] A. Lee, et al., "Julius - an open source real-time large vocabulary recognition engine," Proc. Eur. Conf. Speech Commun. Technol., pp. 1691-1694, 2001.