High Five: Recognising human interactions in TV shows


Alonso Patron-Perez (alonso@robots.ox.ac.uk), Marcin Marszalek (marcin@robots.ox.ac.uk), Andrew Zisserman (az@robots.ox.ac.uk), Ian Reid (ian@robots.ox.ac.uk)
Department of Engineering Science, University of Oxford, Oxford, UK

© 2010. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

Abstract

In this paper we address the problem of recognising interactions between two people in realistic scenarios for video retrieval purposes. We develop a per-person descriptor that uses attention (head orientation) and the local spatial and temporal context in a neighbourhood of each detected person. Using head orientation mitigates camera view ambiguities, while the local context, comprised of histograms of gradients and motion, aims to capture cues such as hand and arm movement. We also employ structured learning to capture spatial relationships between interacting individuals. We train an initial set of one-vs-the-rest linear SVM classifiers, one for each interaction, using this descriptor. Noting that people generally face each other while interacting, we learn a structured SVM that combines head orientation and the relative location of people in a frame to improve upon the initial classification obtained with our descriptor. To test the efficacy of our method, we have created a new dataset of realistic human interactions comprised of clips extracted from TV shows, which represents a very difficult challenge. Our experiments show that using structured learning improves the retrieval results compared to using the interaction classifiers independently.

1 Introduction

The aim of this paper is the recognition of interactions between two people in videos in the context of video retrieval. In particular we focus on four interactions: hand shakes, high fives, hugs and kisses. Recognising human interactions can be considered an extension of single-person action recognition and provides different criteria for content-based video retrieval. Two-person interactions can also be used directly, or as a building block, to create complex systems in applications like surveillance, video games and human-computer interaction.

Previous work on two-person interaction recognition is scarce compared to closely related areas such as single-person action recognition [7, 10, 12, 22], group action recognition [14, 24] and human-object interaction recognition [16, 23].

Figure 1: Dataset snapshots. Note the variation in the actors, scale and camera views.

Closer to our work are [4, 17, 19], where interactions are generally recognised in a hierarchical manner, putting special attention on higher-level descriptions and using very constrained data. These approaches rely heavily on many low-level image pre-processing steps, such as background subtraction and segmentation of body parts, which are by themselves very difficult problems to solve when working with more complex scenarios. In contrast, recent publications on single-action recognition have shown a natural move from simplified and constrained datasets to more realistic ones [11, 12, 13, 21, 22]. One of the contributions of this paper is the compilation of a realistic human interaction dataset extracted from a collection of TV shows (Section 2).

Working with realistic datasets introduces a new set of challenges that have to be addressed in order to achieve successful recognition: background clutter, a varying number of people in the scene, camera motion and changes of camera viewpoint, to name a few. Our approach is to introduce a person-centred descriptor that uses a combination of simple features to deal with these challenges in a systematic way. An upper body detector [6] is first used to find people in every frame of the video (Section 3). The detections are then clustered to form tracks. A track is defined as a set of upper body bounding boxes, in consecutive frames, corresponding to the same person. The aim of this first step is to reduce the search space for interactions to a linear search along each track, in an analogous way to [9]. We then calculate descriptors along these tracks and use them to learn a Support Vector Machine (SVM) classifier for each interaction. Interaction scores are then computed for each bounding box of each track.

We also use the head orientation of detected people in two novel ways: first to achieve a weak view invariance in the descriptor (see Section 3), and second to learn interaction-based spatial relations between people (Section 4). The latter is based on our assumption that people generally face each other while interacting. This assumption is used to learn a structured SVM [20] that is trained to obtain the best joint classification of a group of people in a frame. We show that using structured learning (SL) can improve the retrieval results obtained by independently classifying each track. An additional characteristic of our structured formulation is that it provides information about which people are interacting. In Section 4.2 we show the retrieval results obtained by the individual and structured track classification. Section 5 presents our conclusions and future work.

2 Dataset

We have compiled a dataset of 300 video clips extracted from 23 different TV shows (http://www.robots.ox.ac.uk/~vgg/data/tv_human_interactions).

Each of the clips contains one of four interactions: hand shake, high five, hug and kiss (each appearing in 50 videos). Negative examples (clips that do not contain any of the interactions) make up the remaining 100 videos. The length of the video clips ranges from 30 to 600 frames. The interactions are not temporally aligned (i.e. a clip containing a hand shake might start with people walking towards each other, or directly at the moment of the hand shake). There is a great degree of variation between different clips and, in several cases, also within the same clip (Figure 1). Such variation includes the number of actors in each scene, their scales and the camera angle, including abrupt viewpoint changes (shot boundaries).

To provide a ground truth for the evaluation of the methods developed in this paper, we have annotated every frame of each video with the following: the upper body, discrete head orientation and interaction label of all persons present whose upper body size is within a certain range. This range goes from far shots that show the whole body to medium shots where only the upper body is visible, and is equivalent to 50-350 pixels in our videos. We have also annotated which persons, if any, are interacting in each frame. For the purposes of training and testing, the dataset has been split into two groups, each containing videos of mutually exclusive TV shows. The experiments shown in the following sections were performed using one set for training and the other for testing, and vice versa.

3 Modeling human activity

Because of the complexity and variability of the videos in our dataset, finding relevant and distinctive features becomes increasingly difficult. The descriptor has to be simultaneously (i) relatively coarse, to deal with variation, and (ii) to some extent focused, to avoid learning background noise when codifying the interaction. We address these points by making our descriptor person-centred, and by further organising the data based on head orientation. The person-centred descriptor focuses on the area around the upper body of a single person, enabling us to localise regions of potential interest and to learn relevant information inside them. Our descriptor does this by coarsely quantifying appearance and motion inside this region. This is in contrast to other approaches in single-action recognition [7, 11, 12, 15, 22], where features are estimated in the whole frame or video and then clustered to localise where the action is happening. Another advantage of a person-centred descriptor is that, depending on the camera angle, both persons are not always visible in a given frame, and we would like to be able to provide a classification in these instances. For the moment, we assume that we know the location and scale of people in each frame and leave the detection method for Section 3.2.

3.1 Person-centred descriptor

The following describes the process for obtaining a descriptor given an upper body location; it is repeated for each person detected in a frame. Our descriptor superimposes an 8×8 grid around an upper body detection. The size of the grid, being dependent on the detection size, deals with changes of scale. We then calculate histograms of gradients and optical flow in each of its cells; an example can be seen in Figure 2b. This technique of using histograms of gradients and flow is a coarse analog of the descriptor used in [3, 11]. Gradients are discretised into five bins: horizontal, vertical, two diagonal orientations and a no-gradient bin. Optical flow is also discretised into five bins: no-motion, left, right, up and down. The histograms are independently normalised and concatenated to create an initial grid descriptor g. (Note on notation: whenever a vector is used in this paper it is considered to be in row format by default.)
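To make the descriptor construction concrete, below is a minimal sketch of computing the grid descriptor g for one detection, assuming grayscale frames and OpenCV's Farneback optical flow; the context-expansion factor, bin thresholds and L1 normalisation are illustrative choices rather than the paper's exact settings.

```python
# Minimal sketch of the person-centred grid descriptor (illustrative parameters).
import numpy as np
import cv2

N_GRID = 8  # 8x8 grid superimposed on (and around) the upper body detection

def grid_descriptor(prev_gray, gray, box, mag_thresh=1.0, flow_thresh=0.5):
    """Per-cell histograms of gradient orientation and optical flow direction.

    prev_gray, gray : consecutive grayscale frames (uint8 arrays)
    box             : (x, y, w, h) upper body detection in pixels
    Returns a row vector g of length N_GRID * N_GRID * (5 + 5).
    """
    x, y, w, h = box
    # Expand the box so the grid also covers context around the upper body (illustrative factor).
    x0, y0 = max(0, x - w // 2), max(0, y - h // 2)
    x1, y1 = min(gray.shape[1], x + 3 * w // 2), min(gray.shape[0], y + 3 * h // 2)

    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    cells = []
    ys = np.linspace(y0, y1, N_GRID + 1).astype(int)
    xs = np.linspace(x0, x1, N_GRID + 1).astype(int)
    for i in range(N_GRID):
        for j in range(N_GRID):
            cgx = gx[ys[i]:ys[i+1], xs[j]:xs[j+1]].ravel()
            cgy = gy[ys[i]:ys[i+1], xs[j]:xs[j+1]].ravel()
            cf = flow[ys[i]:ys[i+1], xs[j]:xs[j+1]].reshape(-1, 2)

            # Gradient orientation quantised into four 45-degree bins plus a no-gradient bin
            # (standing in for the paper's horizontal / vertical / two diagonals / no-gradient).
            mag = np.hypot(cgx, cgy)
            ang = np.degrees(np.arctan2(cgy, cgx)) % 180.0
            gbin = np.where(mag < mag_thresh, 4, (ang // 45).astype(int) % 4)
            ghist = np.bincount(gbin, minlength=5).astype(float)

            # Flow bins: no-motion, left, right, up, down.
            fmag = np.hypot(cf[:, 0], cf[:, 1])
            horiz = np.abs(cf[:, 0]) >= np.abs(cf[:, 1])
            fbin = np.where(fmag < flow_thresh, 0,
                            np.where(horiz, np.where(cf[:, 0] < 0, 1, 2),
                                     np.where(cf[:, 1] < 0, 3, 4)))
            fhist = np.bincount(fbin, minlength=5).astype(float)

            # Independent L1 normalisation of each histogram (one of the variants tested below).
            cells.append(ghist / max(ghist.sum(), 1e-6))
            cells.append(fhist / max(fhist.sum(), 1e-6))
    return np.concatenate(cells)  # row vector g
```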

Figure 2: (a) Upper body detections and estimated discrete head orientation. (b) Grid showing dominant cell gradient and significant motion (red cells) for a hand shake.

We also experimented with several variants of the grid descriptor: using only motion, only gradients, or only the information from the cells outside the upper body detection, as well as different normalisations. The experiments described in Section 3.3 show the results obtained with different parameter choices.

To obtain the final descriptor d, we take into account the head orientation, discretised into one of five orientations: profile-left, front-left, front-right, profile-right and backwards. Perfect frontal views are very rare and are included in either of the two frontal categories. Effectively, we want to create a compact and automatic representation from which we can learn a different classifier for each discrete head orientation. To do this, the discrete head orientation θ is used to perform the following operation:

    g⁺ = g ⊗ δ_θ,    d = [g⁺  g]    (1)

where ⊗ is the Kronecker product and δ_θ is an indicator vector with five elements (corresponding to the discrete head orientations) having a one at position θ and zeros everywhere else. By using the head orientation, we aim to capture information correlated with it. Assuming that an interaction occurs in the direction a person is facing (Figure 2a), this can provide a weak kind of view invariance. We add an extra copy of g at the end of the descriptor d to account for any information that is independent of the head orientation, and to help in cases where the automatic estimation of the head orientation is wrong. We can double the number of training examples by horizontally flipping the video frames, which results in opposite head orientations (i.e. profile-left becomes profile-right).

The descriptor d is used as a data vector for training a linear SVM classifier. An illustrative example of the results that we obtain, Figure 3, shows the motion regions (outside the upper body detection) learnt by a linear SVM classifier trained to discriminate between hand shakes and high fives. As expected, the important motion regions are correlated with the head orientation and occur in lower locations for hand shakes and higher ones for high fives.

Figure 3: Motion weights outside the upper body detection (blue square) learnt by a linear SVM classifier trained to discriminate between hand shakes and high fives. Higher weights are indicated by lighter areas. As expected, the more important motion regions are in lower locations for hand shakes and in higher ones for high fives. These also follow the direction of the face.
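Equation (1) and the flip augmentation can be illustrated as follows; the orientation index encoding and the FLIP mapping are assumptions made for this sketch, with numpy's kron standing in for the Kronecker product.

```python
# Sketch of equation (1): head-orientation-indexed descriptor d = [g ⊗ δ_θ, g].
import numpy as np

N_ORIENT = 5  # profile-left, front-left, front-right, profile-right, backwards (assumed indices)
FLIP = {0: 3, 1: 2, 2: 1, 3: 0, 4: 4}  # orientation after horizontal mirroring

def lift_descriptor(g, theta):
    """g: row-vector grid descriptor; theta: discrete head orientation in {0,...,4}."""
    delta = np.zeros(N_ORIENT)
    delta[theta] = 1.0                  # indicator vector δ_θ
    g_plus = np.kron(g, delta)          # each entry of g lands in the θ-th slot of its block, zeros elsewhere
    return np.concatenate([g_plus, g])  # extra copy of g is orientation-independent

def flipped_example(g_flipped, theta):
    """Training-set doubling: descriptor of the horizontally mirrored frame.

    g_flipped is the grid descriptor recomputed on the mirrored frame (left/right
    gradient and flow bins swap); the head orientation is mirrored as well.
    """
    return lift_descriptor(g_flipped, FLIP[theta])
```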

3.2 Localising humans and estimating head orientation

To be able to use the descriptor proposed above, we need to pre-process our video clips. The pre-processing follows the same steps as in [6], and we briefly explain them here for completeness. First we run an upper body detector in each frame. This detector is trained using a standard Histogram of Oriented Gradients (HOG) descriptor [2] and a simple linear SVM classifier. We train two such detectors at different initial scales (to improve the detection rate). Next, we cluster these detections using clique partitioning to form tracks. Very short tracks and tracks with low average SVM scores are eliminated, and those that remain are used in the experiments. As in [1, 18], we learn a classifier for discrete head orientations; however, we simply train a one-vs-the-rest linear SVM using HOG descriptors. Once the classifier is learnt, we estimate the head location in each bounding box of each track and obtain a discrete head orientation classification.

3.3 Experiments

Given that people's tracks have been computed in every video as described above, we want to evaluate the accuracy of our descriptor when classifying interactions. We have designed a set of experiments to show the effect of: (i) not using head orientation information vs adding it either by manual annotation or by automatic classification; (ii) changing descriptor information: using only motion, only gradients or both; (iii) adding weak temporal information by concatenating descriptors of consecutive frames to form a single descriptor. The term n-frame descriptor refers to a concatenation of n descriptors from consecutive frames.

To be able to compare the results obtained, all of the experiments follow the same steps. From each clip we manually select five consecutive frames that lie inside the temporal region where the interaction is happening. From these frames we extract descriptors from a track of one of the people performing the interaction (again, we manually select the track). The same process is applied to the negative videos. As described in Section 2, the dataset is divided into two sets for training and testing. We use the descriptors of each set in turn to train a one-vs-the-rest linear SVM classifier for each interaction in a supervised way. The classification of a clip is obtained by adding the SVM classification scores of each of the descriptors extracted from its five selected frames.
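The clip-level decision rule just described (summing the one-vs-the-rest SVM scores of the five selected frames) is small enough to sketch directly; the classifier interface (scikit-learn style decision_function) and the class names are illustrative.

```python
# Sketch of the clip-level classification used in the Section 3.3 experiments:
# per-frame one-vs-the-rest SVM scores are summed over the five selected frames.
import numpy as np

INTERACTIONS = ["hand_shake", "high_five", "hug", "kiss"]

def classify_clip(frame_descriptors, classifiers):
    """frame_descriptors: list of five lifted descriptors d (one per selected frame).
    classifiers: dict mapping interaction name -> trained one-vs-the-rest linear SVM
                 exposing decision_function(X), e.g. sklearn.svm.LinearSVC.
    Returns per-interaction clip scores and the highest-scoring interaction."""
    D = np.vstack(frame_descriptors)                       # 5 x dim matrix
    scores = {name: float(clf.decision_function(D).sum())  # sum of the five frame scores
              for name, clf in classifiers.items()}
    best = max(scores, key=scores.get)
    return scores, best
```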

Figure 4: Average classification accuracy results with different parameter combinations. No consistent improvement is observed when using higher n-frame descriptors. Motion information is a more discriminative feature than gradients in three of the four interactions. On average, using head information improves the accuracy. (Best viewed in colour.)

Figure 4 provides a visual representation of the results. Column-wise, we report the accuracy obtained using different n-frame descriptors. Row-wise, we report the average accuracy when choosing different information to include in the descriptor: only motion, only gradients, and both. Each row is an average over tests using full or external cells and different normalisations (L1, L2 or no normalisation). The table itself is an average of the results obtained when testing on both sets.

Several things can be concluded from this representation. First, we can readily observe that the use of head orientation improves the classification accuracy when correctly estimated, but errors in the automatic classification of the head orientation reduce it. Taking the best combination of parameters for each interaction (using 1-frame descriptors), the average accuracy is 59.4% with manually annotated head orientation, 52.2% with automatic head orientation and 48.8% with no head orientation. We also noted that the concatenation of descriptors did not consistently improve the classification results. Another easily distinguishable characteristic is that motion features alone perform better when classifying high fives and kisses, while a combination of both works better for hugs. This is very intuitive, because hugs contain minimal motion in contrast to the other actions. The poor performance of using only gradients could be explained by the coarseness of our descriptor, which results in learning gradients that are too general to be distinctive. We tried to improve these results by increasing the number of cells, but the resulting increase in the size of the descriptor, combined with a reduced number of training examples, led to worse classification results.

4 Learning human interactions

As mentioned before, sometimes only one of the two people performing an interaction appears in the video clip. However, when the locations of two or more people are available in a specific frame, we should use this to improve our classification. The assumption we make is that people face each other while interacting. Thus we want to learn the relative locations of people given both their head orientation and an interaction label. We propose to do this using a structured learning (SL) framework similar to the one described in [5]. The goal is to simultaneously estimate the best joint classification for a set of detections in a video frame, rather than classifying each detection independently. In contrast to [5], where SL is used to learn spatial relations between object classes, we want to learn spatial relations between people given their interaction class and head orientation.

4.1 Structured learning

We pose the SL problem in the following terms: in each frame we have a set of upper body detections X = [x_1 ... x_M]. Each detection x_i = [l_x l_y s θ v] carries its upper-left corner location (l_x, l_y), scale (s), discrete head orientation (θ), and the SVM classification scores (v) obtained by classifying the descriptor associated with this detection using the interaction classifiers previously learnt.
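For concreteness, one possible representation of the structured-learning input and label is sketched below; the field names and containers are illustrative choices, not the paper's notation.

```python
# Illustrative containers for the structured-learning formulation: per-frame
# detections X = [x_1 ... x_M] and a joint label Y = [y_1 ... y_M, y_c].
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

N_CLASSES = 5  # no-interaction (0) + hand shake, high five, hug, kiss

@dataclass
class Detection:                 # one upper body detection x_i
    lx: float                    # upper-left corner x
    ly: float                    # upper-left corner y
    scale: float                 # detection scale s
    theta: int                   # discrete head orientation in {0,...,4}
    v: np.ndarray                # SVM scores, one per class (length N_CLASSES)

@dataclass
class FrameLabel:                # joint label Y for one frame
    y: List[int]                 # class label per detection, 0 = no interaction
    pairs: List[Tuple[int, int]] # configuration y_c as a list of interacting index pairs
```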

Figure 5: (a) Spatial relations (δ_ij) used in our structured learning method. The black square at the centre represents the head location inside an upper body detection. (b) Weights (β) learnt for each interaction class and head orientation combination. Lighter intensity indicates a higher weight.

Associated with each frame is a label Y = [y_1 ... y_M y_c]. This label is formed by a class label y_i ∈ {0,...,K} for each detection (where K is the number of interaction classes, with 0 representing the no-interaction class) and a configuration label y_c that serves as an index for one of the valid pairings of the detections. For example, for three detections there are four valid configurations: {(1,0), (2,0), (3,0)}, {(1,0), (2,3)}, {(1,3), (2,0)} and {(1,2), (3,0)}, where (i, j) indicates that detection i is interacting with detection j and the index 0 means there is no interaction. We measure the match between an input X and a labelling Y by the following cost function:

    S(X,Y) = Σ_{i=1}^{M} α⁰_{y_i θ_i} v_{i y_i} + Σ_{i=1}^{M} α¹_{y_i θ_i} + Σ_{(i,j) ∈ P_{y_c}} (δ_ij β^T_{y_i θ_i} + δ_ji β^T_{y_j θ_j})    (2)

where v_{i y_i} is the SVM classification score for class y_i of detection i, P_{y_c} is the set of valid pairs defined by the configuration index y_c, and δ_ij and δ_ji are indicator vectors codifying the relative location of detection j with respect to detection i (and vice versa) into one of the R = 6 spatial relations shown in Figure 5a. α⁰_{y_i θ_i} and α¹_{y_i θ_i} are scalar weighting and bias parameters that measure the confidence that we have in the SVM score of class y_i when the discrete head orientation is θ_i ∈ {1,...,D}. β_{y_i θ_i} is a vector that weights each spatial configuration given a class label and a discrete head orientation. Once the weights are learnt, we can find the label that maximises the cost function by exhaustive search, which is possible given the small number of interaction classes and of people in each frame.

Learning. We use the SVM^struct package [8] to learn the weights α and β described previously. To do this, we must first re-arrange equation (2) to define a single weight vector and encapsulate the X and Y components into a potential function Ψ (see [20]), and second we need to define a suitable loss function. We start by defining δ⁺_ij = δ_ij ⊗ δ_{y_i θ_i} and δ⁺_ji = δ_ji ⊗ δ_{y_j θ_j}, where ⊗ is the Kronecker product and δ_{y_i θ_i} is an indicator vector of size KD having a one at position y_i K + θ_i and zeros everywhere else. Also, let α⁰ = [α⁰_01 ... α⁰_KD], α¹ = [α¹_01 ... α¹_KD] and β = [β_01 ... β_KD]. By substituting into equation (2) we obtain:

    S(X,Y) = [α⁰ α¹ β] [ Σ_{i=1}^{M} v_{i y_i} δ_{y_i θ_i}    Σ_{i=1}^{M} δ_{y_i θ_i}    Σ_{(i,j) ∈ P_{y_c}} (δ⁺_ij + δ⁺_ji) ]^T    (3)

where the first factor is the single weight vector w and the bracketed term is the potential function Ψ.
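A minimal sketch of the joint inference is given below: it enumerates the valid pairing configurations (each detection interacts with at most one other), evaluates the cost of equation (2) and keeps the best labelling by exhaustive search. The way the weights are stored and the six-bin spatial-relation quantisation are assumptions made for illustration.

```python
# Sketch of exhaustive inference over the joint label Y = (y_1..y_M, y_c) that
# maximises the cost S(X, Y) of equation (2).
from itertools import product
import numpy as np

K, D, R = 4, 5, 6   # interaction classes, discrete head orientations, spatial relations

def pairings(indices):
    """All valid configurations: each detection interacts with at most one other."""
    if len(indices) < 2:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for tail in pairings(rest):              # 'first' is not interacting
        yield tail
    for j in rest:                           # 'first' interacts with j
        remaining = [k for k in rest if k != j]
        for tail in pairings(remaining):
            yield [(first, j)] + tail

def spatial_relation(det_i, det_j):
    """Quantise the location of detection j relative to detection i into one of R bins.
    (Illustrative: left/right of i crossed with above/level/below, scaled by i's size.)"""
    dx = det_j["lx"] - det_i["lx"]
    dy = det_j["ly"] - det_i["ly"]
    horiz = 0 if dx < 0 else 1
    if dy < -0.5 * det_i["scale"]:
        vert = 0
    elif dy > 0.5 * det_i["scale"]:
        vert = 2
    else:
        vert = 1
    return horiz * 3 + vert

def score(dets, y, pairs, alpha0, alpha1, beta):
    """Cost S(X, Y) of equation (2).
    dets   : list of dicts with keys lx, ly, scale, theta, v (v = per-class SVM scores)
    y      : class label per detection (0 = no interaction)
    pairs  : list of (i, j) index pairs given by the configuration label
    alpha0, alpha1 : arrays of shape (K+1, D) with per-(class, orientation) weight and bias
    beta   : array of shape (K+1, D, R) weighting each spatial relation
    """
    s = 0.0
    for i, d in enumerate(dets):
        s += alpha0[y[i], d["theta"]] * d["v"][y[i]] + alpha1[y[i], d["theta"]]
    for (i, j) in pairs:
        s += beta[y[i], dets[i]["theta"], spatial_relation(dets[i], dets[j])]
        s += beta[y[j], dets[j]["theta"], spatial_relation(dets[j], dets[i])]
    return s

def best_joint_labelling(dets, alpha0, alpha1, beta):
    """Exhaustive search over class labels and pairings (feasible for small M)."""
    best = (-np.inf, None, None)
    for pairs in pairings(list(range(len(dets)))):
        for y in product(range(K + 1), repeat=len(dets)):
            s = score(dets, list(y), pairs, alpha0, alpha1, beta)
            if s > best[0]:
                best = (s, list(y), pairs)
    return best  # (score, class labels, interacting pairs)
```

For three detections, `pairings` yields exactly the four configurations listed above.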

A key element of an SL framework is to define an adequate loss function for the problem under consideration. Here we would like the loss function not only to penalise wrong assignments of interaction labels but also wrong configuration labels. In addition, we want to penalise a label mismatch between detections that are labelled as interacting. Taking these elements into consideration, we define our loss function as:

    Δ(Y, Ŷ) = Σ_{i=1}^{M} Δ_01(y_i, ŷ_i) + Σ_{(i,j) ∈ P_{y_c}} Δ_c(i, j)    (4)

    Δ_c(i, j) = 1 if (i, j) ∉ P_{ŷ_c};  1 if (i, j) ∈ P_{ŷ_c} and ŷ_i ≠ ŷ_j;  0 otherwise    (5)

where Δ_01 is the zero-one loss, Y is the ground-truth labelling and Ŷ is a labelling hypothesis. Consider a frame with three people, two of them interacting. A candidate label that assigns an incorrect interaction label to the person who is not interacting will result in a loss of 1 from Δ_01. If instead this error occurs for one of the people who are interacting, the loss will be 2 (1 for the incorrect label in Δ_01, plus 1 for assigning different labels to interacting people in Δ_c). Errors in the configuration label (y_c) tend to increase the loss significantly, depending on the number of actors present. An example of the spatial weights learned using this method can be seen in Figure 5b.

4.2 Experiments

In this section we compare the retrieval results obtained by individual classification and by SL. As indicated in Section 3.3, the concatenation of descriptors did not consistently improve the classification accuracy, so we selected a simple 1-frame descriptor that uses both motion and gradients with L1 normalisation and all cells in the grid. The classifiers were trained to discriminate between five classes: the four interactions and a no-interaction class.

For a retrieval task we need to define a scoring function for a video clip. We propose a score based on the classification of each track extracted from the clip. In each frame, a detection belonging to a track is classified either independently, using the classifiers learned in Section 3.3, or using the SL framework. The score of each interaction in a track is simply the percentage of its detections that were classified as that interaction. The overall interaction scores of a clip are the average of the track scores, where the average is calculated over the tracks in which at least one frame was classified as an interaction. This is to avoid assigning low interaction scores to videos with many actors (most of whom are not interacting). The score for no-interaction is an average over all tracks. The same process is used for scoring the negative videos, and we evaluate the effect that including these clips has on the overall ranking.

Method        HS      HF      HG      KS      AVG
M + ID        0.5433  0.4300  0.4846  0.5349  0.5032
M + SL        0.5783  0.5108  0.7116  0.7654  0.6415
M + ID + N    0.4069  0.3348  0.3952  0.5003  0.4093
M + SL + N    0.4530  0.4507  0.6200  0.7058  0.5574
A + ID        0.4765  0.3194  0.4184  0.3153  0.3824
A + SL        0.4423  0.3255  0.4462  0.3592  0.3933
A + ID + N    0.3981  0.2745  0.3267  0.2613  0.3151
A + SL + N    0.3517  0.2569  0.3769  0.3250  0.3276

Table 1: Average precision results for the video retrieval task, when using manual (M) or automatic (A) annotations, independent (ID) or structured (SL) classification, and when including the negative (N) videos as part of the retrieval task. Columns give AP for hand shake (HS), high five (HF), hug (HG), kiss (KS) and their mean (AVG). In every case, the use of structured learning improves the average results.
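A minimal sketch of this clip scoring scheme is given below, assuming per-frame class labels (0 = no interaction) are already available for every track; the rule for which tracks enter the interaction averages follows one reading of the description above.

```python
# Sketch of the clip scoring used for retrieval: each detection in each track has
# already been classified (independently or via the structured model).
import numpy as np

N_CLASSES = 5  # no-interaction + hand shake, high five, hug, kiss

def clip_scores(track_labels):
    """track_labels: list of 1-D integer arrays, one array of per-frame labels per track.
    Returns an array of N_CLASSES clip-level scores."""
    track_scores = []
    for labels in track_labels:
        labels = np.asarray(labels)
        # fraction of the track's detections assigned to each class
        frac = np.bincount(labels, minlength=N_CLASSES) / len(labels)
        track_scores.append(frac)
    track_scores = np.vstack(track_scores)

    scores = np.zeros(N_CLASSES)
    # no-interaction score: averaged over all tracks
    scores[0] = track_scores[:, 0].mean()
    # interaction scores: averaged only over tracks in which at least one frame was
    # classified as an interaction, so clips with many non-interacting actors are
    # not penalised (one reading of the rule in the text above).
    active = track_scores[:, 1:].sum(axis=1) > 0
    if active.any():
        scores[1:] = track_scores[active, 1:].mean(axis=0)
    return scores
```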

Figure 6: Highest ranked true and false positives for each interaction obtained using the automatic method. The red square indicates negative videos.

Average precision (AP) results obtained using this ranking measure are shown in Table 1. We tested the influence of using SL when we have manually labelled upper body detections and head orientations, and when we use the automatic method described in Section 3.2. Considering the substantial challenges of the task, our results fall within the range obtained by state-of-the-art methods in single-action recognition that use similar datasets [7, 10, 12, 22], although a direct comparison is not possible. In every case the mean AP is improved by the use of SL. This improvement is more obvious in the manually labelled case. When using the automatic method, several factors can account for the smaller degree of improvement from SL, namely: the inability to always detect both people performing the interaction (SL, as we have employed it, cannot improve the results in this case), the appearance of false positives, and the incorrect automatic classification of head orientation. In the last two cases, the input to the SL method is corrupted, and attempts to derive a joint classification will most likely produce incorrect results.

To give an insight into the difficulty of this task, Figure 6 shows the best ranked true and false positives when generating tracks automatically and using the full dataset including negative videos (complete average precision results for this setup are shown in the last two rows of Table 1). We observed that hand shakes tend to be detected where no interaction is happening; this could be because the natural motion of the arms (when walking or talking) resembles the motion pattern of a hand shake in some frames.

5 Conclusion and future work

In this paper we have proposed a new descriptor for human interactions that captures information in a region around a person and uses head orientation to focus attention on specific places inside this region. We have also introduced a new dataset of realistic interactions extracted from TV shows, and have shown good classification and retrieval results using our descriptor. Furthermore, we have shown that using SL to incorporate spatial relationships between detected people in the scene improves the retrieval results obtained by independently classifying each detection.

Several ideas for future work follow readily from the results obtained in Sections 3.3 and 4.2. It is clear that improvements in the automatic head orientation classification and in the automatic generation of video tracks will have a positive effect on the classification and retrieval results. Although concatenating descriptors of consecutive frames did not improve the classification scores in a consistent way, this may be because there was not much temporal variation to capture in the five frames of an interaction that these experiments considered. It is likely that capturing motion and appearance information over longer periods of time could give a better classification.

Acknowledgements. We are grateful for financial support from CONACYT and ERC grant VisRec no. 228180.

References

[1] B. Benfold and I. Reid. Guiding visual surveillance by tracking human attention. In British Machine Vision Conference, 2009.

[2] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In Conference on Computer Vision and Pattern Recognition, 2005.

[3] N. Dalal, B. Triggs, and C. Schmid. Human Detection Using Oriented Histograms of Flow and Appearance. In European Conference on Computer Vision, 2006.

[4] A. Datta, M. Shah, and N. Da Vitoria Lobo. Person-on-Person Violence Detection in Video Data. In International Conference on Pattern Recognition, 2002.

[5] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. In International Conference on Computer Vision, 2009.

[6] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Pose Search: retrieving people using their pose. In Conference on Computer Vision and Pattern Recognition, 2009.

[7] A. Gilbert, J. Illingworth, and R. Bowden. Fast Realistic Multi-Action Recognition using Mined Dense Spatio-temporal Features. In International Conference on Computer Vision, 2009.

[8] T. Joachims, T. Finley, and C. Yu. Cutting plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.

[9] A. Kläser, M. Marszalek, C. Schmid, and A. Zisserman. Human Focused Action Localization in Video. In SGA, 2010.

[10] I. Laptev and P. Perez. Retrieving Actions in Movies. In International Conference on Computer Vision, 2007.

[11] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Conference on Computer Vision and Pattern Recognition, 2008.

[12] J. Liu, J. Luo, and M. Shah. Recognizing Realistic Actions from Videos "in the Wild". In Conference on Computer Vision and Pattern Recognition, 2009.

[13] M. Marszalek, I. Laptev, and C. Schmid. Actions in Context. In Conference on Computer Vision and Pattern Recognition, 2009.

[14] B. Ni, S. Yan, and A. Kassim. Recognizing Human Group Activities with Localized Causalities. In Conference on Computer Vision and Pattern Recognition, 2009.

[15] J. Niebles, H. Wang, and L. Fei-Fei. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. In British Machine Vision Conference, 2006.

[16] K. Ogawara, Y. Tanabe, R. Kurazume, and T. Hasegawa. Learning Meaningful Interactions from Repetitious Motion Patterns. In International Conference on Intelligent Robots and Systems, 2008.

[17] S. Park and J. K. Aggarwal. Simultaneous tracking of multiple body parts of interacting persons. Computer Vision and Image Understanding, 102(1):1-21, 2006.

[18] N. Robertson and I. Reid. Estimating gaze direction from low-resolution faces in video. In European Conference on Computer Vision, 2006.

[19] M. S. Ryoo and J. K. Aggarwal. Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities. In International Conference on Computer Vision, 2009.

[20] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. In International Conference on Machine Learning, 2004.

[21] G. Willems, J. H. Becker, T. Tuytelaars, and L. Van Gool. Exemplar-based Action Recognition in Video. In British Machine Vision Conference, 2009.

[22] X. Wu, C. W. Ngo, J. Li, and Y. Zhang. Localizing Volumetric Motion for Action Recognition in Realistic Videos. In ACM International Conference on Multimedia, 2009.

[23] B. Yao and L. Fei-Fei. Grouplet: a Structured Image Representation for Recognizing Human and Object Interactions. In Conference on Computer Vision and Pattern Recognition, 2010.

[24] W. Zhang, F. Chen, W. Xu, and Y. Du. Hierarchical group process representation in multi-agent activity recognition. Signal Processing: Image Communication, 23(10):739-753, 2008.