Triggering Memories of Conversations using Multimodal Classifiers

Similar documents
Detection and Recognition of Alert Traffic Signs

Journal of World s Electrical Engineering and Technology J. World. Elect. Eng. Tech. 1(1): 12-16, 2012

Controlled Information Maximization for SOM Knowledge Induced Learning

An Unsupervised Segmentation Framework For Texture Image Queries

Point-Biserial Correlation Analysis of Fuzzy Attributes

IP Network Design by Modified Branch Exchange Method

A Two-stage and Parameter-free Binarization Method for Degraded Document Images

Frequency Domain Approach for Face Recognition Using Optical Vanderlugt Filters

Optical Flow for Large Motion Using Gradient Technique

An Extension to the Local Binary Patterns for Image Retrieval

A ROI Focusing Mechanism for Digital Cameras

Meta-classification of Multimedia Classifiers

A Recommender System for Online Personalization in the WUM Applications

POMDP: Introduction to Partially Observable Markov Decision Processes Hossein Kamalzadeh, Michael Hahsler

Segmentation of Casting Defects in X-Ray Images Based on Fractal Dimension

Spiral Recognition Methodology and Its Application for Recognition of Chinese Bank Checks

IP Multicast Simulation in OPNET

Slotted Random Access Protocol with Dynamic Transmission Probability Control in CDMA System

A Novel Automatic White Balance Method For Digital Still Cameras

Towards Adaptive Information Merging Using Selected XML Fragments

Information Retrieval. CS630 Representing and Accessing Digital Information. IR Basics. User Task. Basic IR Processes

A Memory Efficient Array Architecture for Real-Time Motion Estimation

HISTOGRAMS are an important statistic reflecting the

XFVHDL: A Tool for the Synthesis of Fuzzy Logic Controllers

A modal estimation based multitype sensor placement method

ADDING REALISM TO SOURCE CHARACTERIZATION USING A GENETIC ALGORITHM

3D Hand Trajectory Segmentation by Curvatures and Hand Orientation for Classification through a Probabilistic Approach

A Minutiae-based Fingerprint Matching Algorithm Using Phase Correlation

Image Registration among UAV Image Sequence and Google Satellite Image Under Quality Mismatch

AUTOMATED LOCATION OF ICE REGIONS IN RADARSAT SAR IMAGERY

Color Correction Using 3D Multiview Geometry

BUPT at TREC 2006: Spam Track

Assessment of query reweighing, by rocchio method in farsi information retrieval

Illumination methods for optical wear detection

Any modern computer system will incorporate (at least) two levels of storage:

Communication vs Distributed Computation: an alternative trade-off curve

Parallel processing model for XML parsing

A Shape-preserving Affine Takagi-Sugeno Model Based on a Piecewise Constant Nonuniform Fuzzification Transform

Effects of Model Complexity on Generalization Performance of Convolutional Neural Networks

And Ph.D. Candidate of Computer Science, University of Putra Malaysia 2 Faculty of Computer Science and Information Technology,

ANALYTIC PERFORMANCE MODELS FOR SINGLE CLASS AND MULTIPLE CLASS MULTITHREADED SOFTWARE SERVERS

Assessment of Track Sequence Optimization based on Recorded Field Operations

Obstacle Avoidance of Autonomous Mobile Robot using Stereo Vision Sensor

Module 6 STILL IMAGE COMPRESSION STANDARDS

Improvement of First-order Takagi-Sugeno Models Using Local Uniform B-splines 1

Topic -3 Image Enhancement

Multidimensional Testing

Optimal Adaptive Learning for Image Retrieval

a Not yet implemented in current version SPARK: Research Kit Pointer Analysis Parameters Soot Pointer analysis. Objectives

Positioning of a robot based on binocular vision for hand / foot fusion Long Han

SYSTEM LEVEL REUSE METRICS FOR OBJECT ORIENTED SOFTWARE : AN ALTERNATIVE APPROACH

A New Finite Word-length Optimization Method Design for LDPC Decoder

INFORMATION DISSEMINATION DELAY IN VEHICLE-TO-VEHICLE COMMUNICATION NETWORKS IN A TRAFFIC STREAM

RANDOM IRREGULAR BLOCK-HIERARCHICAL NETWORKS: ALGORITHMS FOR COMPUTATION OF MAIN PROPERTIES

High performance CUDA based CNN image processor

vaiation than the fome. Howeve, these methods also beak down as shadowing becomes vey signicant. As we will see, the pesented algoithm based on the il

Haptic Glove. Chan-Su Lee. Abstract. This is a final report for the DIMACS grant of student-initiated project. I implemented Boundary

Lecture # 04. Image Enhancement in Spatial Domain

Effective Data Co-Reduction for Multimedia Similarity Search

Machine Learning for Automatic Classification of Web Service Interface Descriptions

A Neural Network Model for Storing and Retrieving 2D Images of Rotated 3D Object Using Principal Components

A New and Efficient 2D Collision Detection Method Based on Contact Theory Xiaolong CHENG, Jun XIAO a, Ying WANG, Qinghai MIAO, Jian XUE

Accurate Diffraction Efficiency Control for Multiplexed Volume Holographic Gratings. Xuliang Han, Gicherl Kim, and Ray T. Chen

Extract Object Boundaries in Noisy Images using Level Set. Final Report

Prioritized Traffic Recovery over GMPLS Networks

Transmission Lines Modeling Based on Vector Fitting Algorithm and RLC Active/Passive Filter Design

COLOR EDGE DETECTION IN RGB USING JOINTLY EUCLIDEAN DISTANCE AND VECTOR ANGLE

FINITE ELEMENT MODEL UPDATING OF AN EXPERIMENTAL VEHICLE MODEL USING MEASURED MODAL CHARACTERISTICS

Multi-azimuth Prestack Time Migration for General Anisotropic, Weakly Heterogeneous Media - Field Data Examples

ANN Models for Coplanar Strip Line Analysis and Synthesis

Adaptation of Motion Capture Data of Human Arms to a Humanoid Robot Using Optimization

The EigenRumor Algorithm for Ranking Blogs

MULTI-TEMPORAL AND MULTI-SENSOR IMAGE MATCHING BASED ON LOCAL FREQUENCY INFORMATION

Real-Time Speech-Driven Face Animation. Pengyu Hong, Zhen Wen, Tom Huang. Beckman Institute for Advanced Science and Technology

An Improved Resource Reservation Protocol

Modelling, simulation, and performance analysis of a CAN FD system with SAE benchmark based message set

Experimental and numerical simulation of the flow over a spillway

Mobility Pattern Recognition in Mobile Ad-Hoc Networks

Erasure-Coding Based Routing for Opportunistic Networks

A Mathematical Implementation of a Global Human Walking Model with Real-Time Kinematic Personification by Boulic, Thalmann and Thalmann.

IBM Optim Query Tuning Offerings Optimize Performance and Cut Costs

Color Interpolation for Single CCD Color Camera

Concomitants of Upper Record Statistics for Bivariate Pseudo Weibull Distribution

Fifth Wheel Modelling and Testing

Improved Fourier-transform profilometry

On Error Estimation in Runge-Kutta Methods

(a, b) x y r. For this problem, is a point in the - coordinate plane and is a positive number.

INDEXATION OF WEB PAGES BASED ON THEIR VISUAL RENDERING

Multiview plus depth video coding with temporal prediction view synthesis

Exploring non-typical memcache architectures for decreased latency and distributed network usage.

Cellular Neural Network Based PTV

Effective Missing Data Prediction for Collaborative Filtering

A Two-level Pose Estimation Framework Using Majority Voting of Gabor Wavelets and Bunch Graph Analysis

Combinatorial Mobile IP: A New Efficient Mobility Management Using Minimized Paging and Local Registration in Mobile IP Environments

Clustering Interval-valued Data Using an Overlapped Interval Divergence

Cardiac C-Arm CT. SNR Enhancement by Combining Multiple Retrospectively Motion Corrected FDK-Like Reconstructions

A Texture Feature Extraction Based On Two Fractal Dimensions for Content_based Image Retrieval

3D Periodic Human Motion Reconstruction from 2D Motion Sequences

Scaling Location-based Services with Dynamically Composed Location Index

1996 European Conference on Computer Vision. Recognition Using Class Specic Linear Projection

Transcription:

Fom: AAAI Technical Repot WS-02-08. Compilation copyight 2002, AAAI (www.aaai.og). All ights eseved. Tiggeing Memoies of Convesations using Multimodal Classifies Wei-Hao Lin, Rong Jin, and Alexande Hauptmann Language Technologies Institute School of Compute Science Canegie Mellon Univesity 5000 Fobes Avenue Pittsbugh, Pennsylvania 523 USA {whlin,ong,alex}@cs.cmu.edu Abstact Ou pesonal convesation memoy agent is a weaable expeience collection system, which unobtusively ecods the weae s convesation, ecognizes the face of the dialog patne and emembes his/he voice. When the system sees the same peson s face o heas the same voice it uses a summay of the last convesation with this peson to emind the weae. To coectly identify a peson and help emembe the ealie convesation, the system must be awae of the cuent situation, as analyzed fom audio and video steams, and classify the situation by combining these modalities. Multimodal classifies, howeve, ae elatively unstable in the uncontolled eal wod envionments, and a simple linea intepolation of multiple classification judgments cannot effectively combine multimodal classifies. We popose a meta-classification stategy using a Suppot Vecto Machine as a new combination stategy. Expeimental esults show that combining face ecognition and speae identification by meta-classification is damatically moe effective than a linea combination. This meta-classification appoach is geneal enough to be applied to any situation-awae application that needs to combine multiple classifies. Intoduction A memoy agent is a weaable compute system that can povide infomation elevant to the cuent context of the weae without use intevention. This context-sensitive infomation can seve as a passive eminde of things to do, o poactive association of past events. Fo example, Rhodes emembance agent [0] can povide text summaies based on cuent time, location, and convesation patne though a head-up display. Schiele et al. s DyPERS [5] can etieve video and audio that was peviously associated with the physical object the weae is watching. In this pape, we extend this wo futhe by ecognizing high-level objects such as specific faces and speaes, and emembeing a summay of an ealie encounte. Ou eseach aims to develop a system that allows people to captue and etieve fom a complete ecod of thei pesonal expeiences. This assumes that within ten yeas technology will be in place fo ceating a continuously ecoded, digital, high fidelity ecod of one s whole life in video fom [4]. Weaable, pesonal digital memoy systems units will ecod audio, video, location and electonic communications. This eseach aims to fulfill the vision of Vanneva Bush s pesonal Memex [], captuing and emembeing whateve is seen and head, and quicly etuning any item on equest o based on the context. While ou vision outlines a eseach pogam expected to last fo many yeas, we have educed cetain aspects of this vision into an opeational pesonal memoy pototype that emembes the faces and voices associated with a convesation and can etieve snippets of that convesation when confonted with the same face and voice. The system cuently combines face detection/ecognition with speae identification, audio ecoding and analysis. The face detection and speae id enables the stoing of the audio convesation associated with a face and a voice. Audio analysis and speech ecognition compacts the convesation, etieving only impotant phases. All of this happens unobtusively, somewhat lie an intelligent assistant who whispes elevant pesonal bacgound infomation to you when you meet someone you don t quite emembe. One ey component in the afoementioned pototype is the combination of multimodal classifies that detemine the identity of the weae s acquaintance as soon as the system ecognizes a voice o a face. Multiple classifies can impove the accuacy of the classification when the classifies ae complementay to each othe. In tass lie peson identification o validation, classifies fom diffeent modalities use distinct featues to classify samples, and thus they seldom mae coelated mistaes at the same time. Using an ensemble of multimedia classifies has been peviously eseached in identity veification studies [3][7], which demonstated the effectiveness of linealy combining up to thee multimodal classifies. In ode to achieve a high accuacy of classification, the stategy of combining evidence fom a goup of classifies plays a citical ole. Majoity voting and linea intepolation [5] ae the most common ways of combining classifie output. Howeve, these methods only utilize the final decision fom each classifie, ignoing the complete pictue of all judgments. The othe poblem is that the weights between classifies ae assigned equally (summing all pobabilities) o fixed empiically based on the eliability of each classifie. To

addess these poblems, we popose a new combination method called meta-classification, which maes the final decision by e-classifying the esult each classifie etuns. Expeiment esults show meta-classification is moe effective than weighted linea combinations. This pape is oganized as follows: Ou system fo collecting and the etieving digital human memoy is descibed in Section 2. The multimedia classifies ae detailed in Section 3. Section 4 explains metaclassification, which is ou poposed new combination stategy. Expeimental esults ae given in Section 5 and conclusions pesented Section 6. Pesonal Convesation Memoy Agent Thee ae cuently two modes of system opeation: Memoy collection (leaning) and memoy etieval. Inteface Contol: Use commands, pefeences and demonstation puposes. The inteface contol has the ability to povide a set of shot audio notifications to infom the use what the system is cuently doing. These notifications include: Memoy collection is stating Memoy collection is ending Memoy etieval is stating Memoy etieval is ending Face was detected in audio steam Failue to detect faces Vaious intenal system failue states Ou expeience has been that these use notifications wee extemely impotant to eep the use infomed of what the system is doing, since thee is no visual feedbac in nomal use. Futue vesions of the system will liely have the inteface manage module contolled though a limited vocabulay, command speech inteface (e.g. Recod memoy o Who is this peson? ). We envision that eventually the inteface system will be tiggeed entiely by contextdependent ecognized dialogue phases, fo example, ``Nice to meet you'', My name is, etc., instead of mouse clics o specific commands by the use. Pesonal Memoy Collection Pesonal Memoy Retieval Figue. The basic achitectue of the Pesonal Memoy system The basic hadwae components of the system ae a weaable miniatue digital video camea, 2 micophones ( close-taling and omni-diectional), headphones to eceive use output and a laptop type compute fo pocessing. The softwae modules involved in the system ae a module fo speae identification, a speech ecognition module, a face detection and ecognition module, a database, and an inteface contol manage module. The basic achitectue of the system, as outlined in Figue, shows the inteface contol manage selecting between the acquisition and the etieval module as needed. Face Detection Time Stamp What is you name It s a pleasue to meet you Than you I m fine. What do you wo on That s inteesting Tansci pt Summa Name y pleasue,,, meeet,,, wot Pesonal Memoy Speae ID Inteface Contol The inteface contol module detemines which of 3 states the system is cuently in: Idle, Collecting Memoy o Retieving Convesations fom Pesonal Memoy. The cuent use input inteface to this contol module consists of a weaable mouse, with one button to stat collecting convesation memoy, one button to stat etieving the convesations (o to sip to the next elevant one) and one button to set the system into an idle state whee neithe collection no etieval taes place. On the output side, it is assumed that all output will be povided though the headphones to the weae. A visual display inteface is available mostly fo test and Figue 2. Pocess fo Pesonal Memoy Collection Pesonal Memoy Collection The system wos as a weaable device consisting of a miniatue spy camea, a cadioid lapel micophone and an omni-diectional micophone all attached to a laptop compute. The system wos by detecting the face of the peson you ae taling to in the video, and listening to the convesation fom both the close-taling (weae) audio tac and the omni-diectional (dialogue patne) audio tac. An oveview of the leaning system fo memoy collection is shown in Figue 2.

The close-taling audio is tanscibed by a speech ecognition system to poduce a ough, appoximate tanscipt. The omni-diectional audio steam is pocessed though a speae identification module. An encoded epesentation of the face of you cuent dialog patne, the dialog patne speae chaacteistics, and the aw audio of the cuent convesation is saved to a database. The next time the system sees the same peson (by detecting a face and matching it to the stoed faces in the database), it can etieve and play bac the audio fom the last convesation. The audio can optionally be pocessed though audio analysis (silence emoval, emphasis detection) and geneal speech ecognition to efficiently eplay only the peson names and the majo issues that wee mentioned in the convesation. Pesonal Memoy Retieval In the etieval (emembeing) mode, the system immediately seaches fo a face in the video steam and pefoms speae identification on the omni-diectional audio steam. Once a face is detected, the face and speae chaacteistics will be matched to the instances of faces and speae chaacteistics stoed in the memoy database. The scoe of both faced and speae matches is combined using ou meta-classification stategy. When a sufficiently high scoing match is found, the system will etun a bief summay of the last convesation with the peson. Figue 3 shows the pocess of pesonal memoy etieval. Face Detection/ Recognition Pesonal Memoy..,..... Name wo pleasue meet Summa y Figue 3. Pocess fo Pesonal Memoy Retieval Multimodal Classifies Speae ID Face Detection and Recognition Extensive wo in face detection has been done at CMU by Rowley [][2][3]. This appoach modeled the statistics of appeaance implicitly using an atificial neual netwo. Cuently we use Schneideman s appoach [8], which applies statistical modeling to captue the vaiation in facial appeaance. We lean the statistics of both object appeaance and "non-object" appeaance using a poduct of histogams. Each histogam epesents the joint statistics of a subset of wavelet coefficients and thei position on the object. Ou appoach is to use many such histogams epesenting a wide vaiety of visual attibutes. The detecto then applies a set of models that each descibes the statistical behavio of a goup of wavelet coefficients. Face matching was used in [4] with the eigenface appoach. Meanwhile thee have been seveal commecial systems offeing face detection and identification, such as Visionics [9]. In ou implementation we have been using both the Visionics FaceIt toolit fo face detection and matching as well as the Schneideman face detecto and eigenfaces [8] fo matching simila faces. Eigenfaces teat a face image as a two-dimensional N by N aay of intensity values. Fom a set of taining images, a set of eigenvectos can be deived that constitute the eigenfaces. Evey unnown new face is mapped into this eigenvecto subspace and we calculate the distance between faces though coesponding points within the subspace [20]. Speae Identification Speae identification is done though an implementation of Gaussian Mixtue Models (GMM) as descibed by Gish [6]. Gaussian Mixtue Models have poven effective in speae identification tass in lage databases of ove 2000 speaes [7][6]. Pio to classification, Mel-Fequency Cepstal Coefficients (o MFCC) featues ae extacted fom the audio channel. Fo taining, egions of audio ae labeled with a speae code, and then modeled in thei espective class (speae). Once taining models have been geneated, the system must classify novel audio sections. The pocess begins by segmenting the audio channel into -second, ovelapping egions and computing the GMM. The esulting model is compaed to existing tained models using a maximum lielihood distance function. Based on the compaisons to each class, a decision is made as to the classification of the data into speech, noise, nown speae X, etc. The speae identification system also uses the fundamental pitch fequency to eliminate false alams. Geneally, about 4 seconds of speech ae equied to get eliable speae identification, unde benign envionmental conditions. Combining Classifies The main idea of meta-classification is to epesent the judgment of each classifie fo each class as a featue vecto, and then to e-classify again in the new featue space. The final decision is made by the meta-classifies instead of just linealy combining each classifie s judgment. In this section, we fist intoduce the geneation of the new featues, followed by a desciption of the widely

used linea intepolation method, then finally ou method of building a meta-classifie. Featue Synthesis Multimedia classifies mae judgments at diffeent time peiods because of discepant chaacteistics of the individual modalities. Fo example, the speae identification module usually taes longe to epot a esult than the face ecognition module because the fome wos with a lage time window while the latte can mae a judgment as soon as an image is eady to be analyzed. Consequently, the classification esults fom these multimodal classifies will be fed into the meta-classifie asynchonously, and a method of combining them appopiately is needed. i Given an example with an unnown class, set x j is the degee of lielihood that the example belongs to the class i as made by the classifie j. Depending on the natue of the i classifie, x j can be a similaity scoe o pobability. A multimedia classifie j geneates a classification vecto x j once it finishes analyzing input fom the audio o video steam. The classification vecto x j = (x j, x 2 j,, x j ), whee is the numbe of classes, i.e. the numbe of people in cuent pool of digital human memoy. Given classifies, at a time point t at which any classifie can mae a judgment, a new featue vecto x syn (t) = (x (t ), x 2 (t ),, x (t )), whee t is the time point that is the closest to time point t when the classifie maes a judgment. In othe wods, wheneve a classifie maes a judgment, a new featue vecto combines evey classifie s judgment by concatenating classification vectos geneated at the time point neaest to cuent time. Suppose we have thee classifies and two classes. At the fifth second, the fist classifie maes a classification judgment x (5) = (0, 2). The most ecent judgments made by the othe two classifies ae x 2 (4.375) = (0.8, 0.) and x 3 (3) = (60, 50) at 4.375 and 3 seconds, espectively. Theefoe, the synthesized vecto is x syn (5) = (0, 2, 0.8, 0., 60, 50), and the meta-classifie leans in this new featue space. The synthesis method is based on the assumption that between t and t, thee is no damatic change with espect to the last judgment, which holds tue in the cuent situation. When the use attempts to etieve an acquaintance fom memoy by matching the face o voice, he o she usually continues to loo o listen to the peson who is to be identified. Linea Combination Kittle et al. [8] poposed a pobability famewo to explain vaious schemes of combining judgments fom multiple classifies. If each peson in ou convesation memoy system epesents a class (w, w 2,, w p ), whee p is the numbe of people in the memoy pool, the tas of identifying the unnown peson in the video o audio steam can be fomulated as classifying the peson by combing multimodal featues x i, i =,,, whee is the numbe of multimodal classifies. Based on the assumption that each classifie is conditionally independent, the decision ule classifies the unnown peson into the class w j if -( -) -( R-) P ( w ) P( w x ) = max P ( w ) P( w j j i i= = i= Futhemoe, unde the assumption that the posteio pobability p(w x) is not vey fa fom the pio pobability p(w), the decision ule becomes a sum of posteio pobabilities as follows, Ê ˆ ( - ) P( w j ) +  P( w j xi ) = maxá( - ) P( w ) +  P( w xi ) i= = Ë i= The sum ule combines classifies by summing the judgments made by each classifie fo each class, and the class with the highest pobability is chosen as the final decision. It may be counte-intuitive at fist to assume the posteio pobability and pio pobability ae close, but the sum ule is widely used as a combination scheme, and outpefomed othe linea combination stategies such as max ule, min ule, median ule, and majoity voting [8]. In this pape, we use the sum ule with equal pio pobabilities fo compaison against ou new method. One constaint imposed by the sum ule, o othe pobability-based combination stategies, is that each classifie must expess its decision as a tue pobability. Since not evey classifie is designed to output pobability, we tansfom thei judgments into pobability by nomalizing the total sum of the similaity scoes geneated by the classifie, i.e. f ( x j, wi ) p( wi x j ) = p  f ( x, w ) = j p x ) whee f(x, w) is a function of measuing the similaity between x and the class w made judged by each classifie. Meta-Classification Meta-classification is e-classifying the classification esults made by classifies. Conside that thee ae one deaf face ecognition expets and two blind speae identification expets esiding in ou system. Once the system detects an unnown peson appoaching the use o the use actively tigges the ecognition mode, each expet stats to mae his o he own decision based on the input fom the coesponding modality. Instead of maing the final decision by voting, o summing up pobabilities and then picing the most pomising one, we pesent thei decisions as a synthesized vecto to anothe judge, i.e. the meta-classifie, who ultimately decides the identity of the peson in the cuent video o audio steam. A vey pomising classification technique, Suppot Vecto Machine (SVM) [2] is used hee as a meta-classifie. The basic of idea of SVM is to sepaate samples with a hypeplane that has a maximal magin between two classes. To fomulate the poblem of classifying synthesized featue vectos, the taining data ae epesented as {x i, y i }, i =, 2,, R, y i is eithe (negative examples) o (positive i

examples), R is the numbe of taining samples. Suppose all taining data satisfy following constaints: xi w + b + when yi = () xi w + b - when yi = - The distance between the hypeplane x i _w+b= and the hypeplane x i _w+b=- ae 2/ w, whee w is the Euclidean nom of w. Theefoe, by minimizing w 2 we get the two hypeplanes with maximal magins. Quadatic pogamming povides well-studied optimizations to maximize the quadatic functions subject to the linea constaints in Equation, which guaantees finding the global maximum. SVM is not only theoetically sound, but outpefoms othe classification algoithms in empiical poblems with high dimensionality [7]. The SVM metaclassifie maes its binay decision by classifying synthesized featue vectos, and we build one such metaclassifie fo each class. The maximal magin suggests having bette genealization ability. Unlie othe combination schemes that equie each classifie to have the same output fom, hee featue vectos can consist of scoes o similaities without any estiction. We implement the SVM meta-classifie using SVM light [6] with a linea enel. The advantage of applying meta-classification is two-fold. Fist, when combining multiple classifies, the similaity scoe o pobability poduced by each classifie does not necessaily convey all of the infomation. The distibution of the scoes fo each class judged by the classifie eveals how confident it is in maing the decision, which is a chaacteistic that can only be captued by a classification featue vecto x but not in nomal combination schemes such as linea intepolation. Second, thee may be some pattens acoss seveal classification vectos, which can be leaned by a meta-classifie. Fo example, one of the uses fiends was fist met in a vey noisy envionment, esulting in poo quality voice fo taining speae identification but eeping the visual featues of face intact. Metaclassification can lean the patten fom synthesized featue vectos. Theefoe, when the use meets the fiend again, the face ecognition module will be cetain about identifying the fiend while the voice ecognition module is confused. The nomal linea combination stategy will act unstable in this cicumstance. The meta-classifie, on the contay, can mae a bette decision by obseving the pattens in the esults fom the multimodal classifies. The meta-classification stategy can be applied to othe classifies with little effot. Any existing multimedia classifie can be plugged into the famewo to combine with othe classifies to geneate synthesized featue vectos, and meta-classification taining is pocessed in the same way. It does not matte that the classifie is pobability-based o similaity-based, and both pobabilities and similaity scoes can be combined into the featue vecto. Evaluation Window To exploit the continuousness of audio and video input in a context-awae application, we can mae the classification decision not only by combining multimodal classifies, but also accumulating classification esults ove time. The numbe of times of classification judgments is defined as the size of the evaluation window. The decision ule that we choose subject j as the final decision is as follows: i+ w  t = i i+ w  s ( t) = max s ( t) j t = i whee s (t) is the judgment (pobability o similaity scoes) the classifie made fo the -th class at the time t, i is the stating time when we begin to accumulate judgments, and w is the size of the window. Expeiment Data Collection and Pocedue We collected two convesations each with 22 people while weaing ou pototype memoy captue unit. Each convesation was at least 20 seconds long, and was analyzed fo faces and speae audio chaacteistics as descibed above. The lighting condition and bacgound wee diffeent between the two convesations. The fist of each convesation seved as the taining example fo multimedia classifie, while the second convesation was used as a quey o etieval pompt to emembe the fist convesation. The etieval was consideed successful only when the combining stategy coectly identified the peson in question. The fist convesation was used to tain the face detection/ecognition and speae identification classifies, and the meta-classifies as well. The second convesation was used to test the pefomance. Since the SVM metaclassifie is a binay classifie, we have to tain a metaclassifie fo each of the 22 people. The taining data consisted of synthesized featue vectos fom the given peson, and featue vectos fom the othe 2 people. To account fo the discepancy between the numbe of positive examples and negative examples, the cost of misclassifying positive taining examples into negative examples was 22 times the cost incued in the evese situations. Thee wee 936 testing featue vectos fo the total 22 classes. Note that 936 is not multiple of 22 because the numbe of featue vectos geneated fom each peson was not the same. If one of the multimedia classifie had had time maing judgment at a time slot, thee would not be a featue vecto synthesized at that time point. Result We used the aveage an as the evaluation metic, i.e. on aveage, at what an was the coect convesation found. The bette the classifie o the combining stategy

pefoms, the close its aveage an is to one. The esults of ou expeiment is shown in Table, suggesting that the Visionics face ecognition system found the coect convesation at an 3.33 of the 22 possible convesation candidates. Speae identification by acoustic MFCC similaity poved to be moe eliable with an aveage an of 3.92, and the accuacy of speae identification by pitch was the wost at 6.22 among all the multimodal classifies. The meta-classifie combined the face classifie and the speae identification, and esulted in an aveage an of 2.6. All the above esults ae calculated with the size of the evaluation window of one, which means that no infomation ove time is used. The sum ule does not outpefom single classification, and achieved a pefomance between face ecognition and speae identification. The esult of meta-classifie does not only significantly outpefom individual multimedia classifies, but also outpefoms the sum ule combination stategy. Classifies Aveage Ran Visionics Face Recognition 3.33 Speae ID with Similaity 3.92 Speae ID with Pitch 6.22 Combine using Sum ule 3.87 Combine using SVM metaclassification 2.6 Table Expeimental esults fom each classifie and combination stategies (window size = ) showing the advantage of meta-classification We also evaluated the effect of window size, and the plot of the classifies and combination stategies vesus the window size is shown in Figue 4. We expect that with the inceasing size of the evaluation window, the pefomance, i.e. aveage an, should impove because the classifie is moe confident about its decision by obseving moe situation ove time. Since classifies fom diffeent modalities mae decisions at diffeent pace, the plot in Figue 4 has two x-axes, the one above the plot with fewe window sizes fo 25 seconds is fo slow data ate speae identification, and the bottom one with moe window sizes is fo the faste data ate of face ecognition and the combination stategies. Inteestingly, the two audio classifies do not impove with the size of window, which suggests that the speae identification modules have stable, but not vey accuate pefomance with a onesecond audio sample. The pefomance may not impove unless the sample size expands, which is not toleable in context-awae applications that need a quic esponse fom each modality. On the othe hand, the face ecognition module and the combination stategies impove with the size of the evaluation window. Note that afte a window size of 250 (coesponding to about 5 seconds of video), the meta-classifie achieved the pefect pefomance. Moeove, the meta-classifie combining stategy showed the cuve declining quicly in the fist seveal window sizes, which means that the stategy is effective at combining multimodal classifies to mae the best classification judgment to etieve the coect convesation. To achieve an aveage an of two, the meta-classification stategies only need a window size of 20 (about 5 seconds). 7 6 5 4 3 2 5 0 5 20 25 0 00 200 300 400 500 Window Size Pitch Acoustic Similaity Sum Rule Face Recognition Meta-classification Figue 4 Expeimental esults of manipulating the size of the evaluation window showed the apid convegence of the SVM metaclassification Conclusion We pesent a weaable convesation collection system that unobtusively ecods a convesation in audio and video, builds an association between the patne s biometic (audio/video) chaacteistics and that convesation, and automatically etieves the pevious convesation once the peson is seen by the camea o is head via the micophone. This pototype is based on the citical assumption that the system can effectively and efficiently identifies an unnown peson by combining evidence fom multiple modalities, including face ecognition and speae identification. We poposed a novel meta-classification stategy of combining multimedia classifies. Based on the expeiment esults in this tas of identifying the same peson though audio and video signals, meta-classification was shown to be much moe effective than single classifies as well as a linea combination stategy. The esult also showed that meta-classification stategy could impove quicly with the incease size of evaluation window. Although the esult is not pefect in tems of speed and classification esult, we expect the esults can be impoved, as moe context infomation is included. With the emeging need of combining diffeent modalities in situation-awae applications, ou method can povide a geneal famewo to integate diffeent souces of context infomation, and povide moe confident classification judgments.

Refeence [] V. Bush, As we may thin, Atlantic Monthly, Vol.76, No., pp. 0-08, 945 [2] C. J. C. Buges, A Tutoial on Suppot Vecto Machines fo Patten Recognition, Knowledge Discovey and Data Mining, Vol. 2, No. 2, 998 [3] R. W. Fischholz and U. Diecmann, BioID: A Multimodal Biometic Identification System, IEEE Compute, Feb 200, pp. 64-68 [4] J. Gay, What next? A few emaining poblems in Infomation Technology, ACM Fedeated Reseach Compute Confeence, Atlanta, GA, May 999 [5] S. Hashem and B. Schmeise, Impoving Model Accuacy using Optimal Linea Combinations of Tained Neual Netwos, IEEE Tansactions on Neual Netwos, Vol. 6, No. 3, May 995, pp. 792-794 [6] T. Joachims, Maing lage-scale SVM Leaning Pactical, Advances in Kenel Methods - Suppot Vecto Leaning, B. Schölopf and C. Buges and A. Smola (ed.), MIT-Pess, 999 [7] O. Kimball, M. Schmidt, H. Gish, andj. Wateman, Speae veification with limited enollment data, ICCSLP-96, Intenational Confeence on Spoen Language Pocessing, volume 2, pages 967 970, Philadelphia, PA, 996. [8] J. Kittle, M. Hatef, R. P.W. Duin, and J. Matas, On Combining Classifies, IEEE Tansactions on Patten Analysis and Machine Intelligence, Vol. 20, No. 3, Ma 998 [9] A. Pentland, and T. Choudhuy Face Recognition fo Smat Envionments, IEEE Compute, Feb 2000, pp. 50-55 [0] Badley J. Rhodes, The Weaable Remembance Agent: a System fo Augemented Memoy, Peosnal Technologies, () [] H. A. Rowley, S. Baluja, and T. Kanade Neual Netwo- Based Face Detection, IEEE Tansaction on Pateen Analysis and Machine Intelligence, Jan 998, pp. 23-38 [2] H. A. Rowley, S. Baluja and T. Kanade, Human Face Detection in Visual Scenes, Canegie Mellon Univesity, Technical Repot CMU-CS-95-58, Pittsbugh, PA [3] H. A. Rowley, S. Baluja, and T. Kanade, Rotation invaiant neual netwo-based face detection, IEEE CVPR, Santa Babaa, 998. [4] M. Tu and A. Pentland, Eigenfaces fo ecognition, Jounal of Cognitive Neuoscience, 3():7 86, 99. [5] B. Schiele, N. Olive, T. Jebaa, and A. Pentland, An inteactive compute vision system, DyPERS:dynamic and pesonal enhanced eality system, Intenational Confeence on Compute Vision Systems, 999 [6] M. Schmidt, J. Golden, and H. Gish GMM sample statistic log-lielihoods fo text-independent speae ecognition, Euospeech-9, Rhodes, Geece, Sep 997, pp. 855-858 [7] B. Scholöpf, K.-K. Sung, C. Buges, F. Giosi, P. Niyogi, T. Poggio, and V. Vapni, Compaing Suppot Vecto Machines with Gaussian Kenels to Radial Basis Function Classifies, IEEE Tansactions on Signal Pocessing, Vol. 45, No., Nov 997 [8] H. Schneideman and T. Kanade, Pobabilistic Modeling of Local Appeaance and Spatial Relationships of Object Recognition, IEEE CVPR, Santa Babaa, 998 [9] Visionics FaceIt Develope Kit, http://www.visionics.com [20] S. Satoh and T. Kanade, Name-It: Association of Face and Name Video, tech. epot CMU-CS-96-205, Compute Science Depatment, Canegie Mellon Univesity, 996.