Lower Body Pose Estimation in Team Sports Videos Using Label-Grid Classifier Integrated with Tracking-by-Detection

Size: px

Start display at page:

Download "Lower Body Pose Estimation in Team Sports Videos Using Label-Grid Classifier Integrated with Tracking-by-Detection"

Darren Hicks
5 years ago
Views:

1 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Research Paper Lower Body Pose Estmaton n Team Sports Vdeos Usng Label-Grd Classfer Integrated wth Trackng-by-Detecton Masak Hayash 1,a) Kyoko Oshma 2,b) Masamoto Tanabk 2,c) Yoshmtsu Aok 1,d) Receved: May 16, 2014, Accepted: November 6, 2014, Released: February 16, 2015 Abstract: We propose a human lower body pose estmaton method for team sport vdeos, whch s ntegrated wth trackng-by-detecton technque. The proposed Label-Grd classfer uses the grd hstogram feature of the tracked wndow from the tracker and estmates the lower body jont poston of a specfc jont as the class label of the multclass classfers, whose classes correspond to the canddate jont postons on the grd. By learnng varous types of player poses and scales of Hstogram-of-Orented Gradents features wthn one team sport, our method can estmate poses even f the players are moton-blurred and low-resoluton mages wthout requrng a moton-model regresson or part-based model, whch are popular vson-based human pose estmaton technques. Moreover, our method can estmate poses wth part-occlusons and non-uprght sde poses, whch part-detector-based methods fnd t dffcult to estmate wth only one model. Expermental results show the advantage of our method for sde runnng poses and non-walkng poses. The results also show the robustness of our method for a large varety of poses and scales n team sports vdeos. Keywords: human pose estmaton, people trackng, trackng-by-detecton, Random Forests, feature selecton 1. Introducton Vdeo-based player trackng has drawn nterest n computer vson. Snce vdeo-based object detecton and trackng technques have shown rapd mprovement [1], the applcatons of trackng team sports players are becomng ncreasngly more attractve for professonal sports. Body pose nformaton would be a mddle-level feature for classfyng the detaled acton of each player. Lke actvty recognton methods [2], [3] usng a pose estmated by sngle mage pose estmaton technques [4] suggest, the pose (jont locaton of the person n 2D or 3D) can be a stable and clear cue for detaled and fne-graned actvty recognton. Whle acton recognton methods usng spato-temporal local features [5], [6] can estmate the semantc acton class (e.g., runnng or standng) of the player, nner-class acton dfference (e.g., how wdely the person moves hs or her legs whle runnng) cannot be easly estmated. Even for semantc acton recognton, an acton classfer usng a pose feature (annotated jonts) performs better than one usng low-level feature (dense flow) as Ref. [7] llustrated n ther experments. In partcular, usng the lower body pose (or lower body jont postons) from team sports vdeos would be a new way of recognzng each acton of a player n detal. Snce runnng s the very 1 Keo Unversty, Yokohama, Kanagawa , Japan 2 R&D Dvson, Panasonc Corporaton, Yokohama, Kanagawa , Japan a) mhayash@aok-medalab.org b) ohshma.kyoko@jp.panasonc.com c) tanabk.masamoto@jp.panasonc.com d) aok@elec.keo.ac.jp basc and most frequent acton n all team sports, leg movements are one of the most mportant cues for recognzng player actons. For example, when the subject player s runnng, the jont pose can precsely measure the number of steps of the player, whch can be the cue for classfyng the step types (normal step or cross step), and whch can be the contact pont wth the foot n soccer. However lower body pose estmaton has rarely been nvestgated n computer vson. Human pose estmaton from vdeo s stll an open problem n computer vson whle depth-based methods usng RGB-D sensors has already been realzed as a hghly robust system [8]. We cannot yet estmate the pose of the sports players n all types of sports vdeos, whle pose estmaton of some lmted perodc actons (such as walkng) has only been solved usng nonparametrc regresson technques [9]. On the other hand, the frontal human pose of sports players can be robustly estmated from an mage wth methods usng part detectors and pctoral structures such as the flexble mxture-of-parts model (FMP)[4]. However, part-based methods usually fal to estmate non-frontal poses n team sports vdeos where players tend to frequently be dsplayed as sde poses. The reason s that these methods need enough space between parts to become a star-shaped tree confguraton of body parts. Even mult-vew part-based pctoral structure technques [10] wth pan-tled cameras cannot robustly estmate the sde and part-occluded poses n team sports vdeos because they stll depend on the dscrmnatve part classfcaton as Ref. [8] does. If the sde poses of sports players could be estmated wth monocular vdeos, a broad range of possbltes of 246

Informaton and Meda Technologes 10(2): 246-258 (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: 18-30 (2015) Informaton Processng Socety of Japan Fg. 1 Fg.

The lower mage shows an nput frame of the vdeo. The rectangle s the tracked player wndow, and the green crcle s the center of the wndow. Label-Grd Classfer.

In ths example, the Label-Grd classfer for the j-th jont s on the (6 8) grd structure, and the estmated Label-Grd s on l j t = (1, 7) on all three mages.

2 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Fg. 1 Fg. 2 Example result of our framework. Images n the upper row show the tracked player mages n each frame of the nput vdeo wth the estmated jont poston as colored crcles n each frame. The lower mage shows an nput frame of the vdeo. The rectangle s the tracked player wndow, and the green crcle s the center of the wndow. Label-Grd Classfer. The red crcle on the grd s the classfed jont locaton l j t N 2 on each player wndow from the learned Label- Grd class canddates (pnk crcles). In ths example, the Label-Grd classfer for the j-th jont s on the (6 8) grd structure, and the estmated Label-Grd s on l j t = (1, 7) on all three mages. The number of the class of the Label-Grd classfer of the left foot s 21 (sum of pnk crcles and red crcles). vson-based human moton and behavoral understandng would be opened up. We propose a novel grd-wse pose estmaton classfer for monocular team sports vdeos, whch we call the Label-Grd classfer, ntegrated wth a standard trackng-by-detecton framework, such as Ref. [11]. Our Label-Grd classfer estmates the lower body human jont locaton wth Label-Grd resoluton usng the whole body appearance of the tracked wndow estmated by the player tracker (Fg. 1). In other words, the player tracker frst tracks the player wndow n each frame, then the Label-Grd classfer estmates the jont locaton (grd poston n the player wdow (See Fg. 2). To the best of our knowledge, our method s the frst human pose estmaton method that can estmate the pose of sde-runnng players wth scale changes, whch part-based methods [4], [12] cannot estmate very well. As the example results n Fg. 1 show, our method can robustly estmate the poses of the sde runnng sequence. Smlar to (frontal) facal recognton technques [13], our Label-Grd classfer embeds all types of algned player wndow mage nto only one mult-class classfer, whch enables pose estmaton when they are runnng sdeways. Snce our framework employs Hstograms of Orented Gradents (HOG) [14] as whole body gradent hstogram features, t can estmate the jont locaton even from a low-resoluton vdeos owng to the deformaton nvarance and contrast nvarance of the HOG feature. In addton, t can estmate poses that have smlar appearances between parts (e.g., pants wth only one color) that part-based methods fnd t dffcult to estmate, because part appearances become too ambguous to detect (e.g., whle crossng the legs). The contrbutons of ths paper can be summarzed as havng the followng advantages: Label-Grd classfer whose 2D unt blocks (Label-Grd) are synchronzed wth the resoluton of Grd-Hstogram features va mult-class classfer (n ths paper we use HOG features randomzed by Random Decson Forests). Algn the trackng wndow wth a pelvs-algned detector to provde center-algned vsual features (selected from HOG by Random Decson Forests) to easly classfy the class of Label-Grd classfers. Can also estmate the pose of sde runnng sequences, whch frequently appear n team sports vdeos. Per-frame estmaton wthout temporal pose moton models of a specfc acton, such as Ref. [9], whch uses a walkng pose manfold or temporal pose pror. Per-frame pose estmaton for vdeos wthout pctoral structure and part detectors. Our gnorance of pctoral structure framework acheved fast pose estmaton (about 1 fps computatonal tme). The rest of the paper s organzed as follows. In Secton 2 we frst revew the related work for human pose estmaton methods for mages and vdeos. Our framework s presented n Secton 3. Secton 4 descrbes how to learn the Label-Grd classfer wth a prepared dataset. Secton 5 llustrates the expermental results and evaluates the performance of our framework. Fnally, Secton 6 s the concluson. 2. Related Work Frst, we revew human pose estmaton methods usng the classcal slhouette-based template matchng technque for team sport vdeos (Secton 2.1), whch s not robust and s just for graphcs vsualzaton. Then we revew two types of human pose estmaton technques: human pose estmaton from a sngle mage usng pctoral structure (Secton 2.2), and moton model regresson (Secton 2.3). Fnally, we revew nstance template detecton technques such as poselets [15] and Exemplar-SVMs (Secton 2.4). 2.1 Exemplar-based Pose Search Methods Germann et al. [16] proposed slhouette-based database matchng methods for soccer players. These methods frst construct an exemplar-pose database wth slhouette mages of players. At test tme, ther method frst fnds the most smlar slhouette from the 247

3 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan database n each frame, and then apples optmzaton through multple temporal multple frames. The resultng poses are not very accurate because the exemplars cannot nclude every type of human poses, and are senstve when extractng the slhouette va background subtracton. Moreover, ths approach nvolves a hgh computatonal optmzaton cost. 2.2 Sngle Image Pose Estmaton Usng Deformable Part Models The FMP [4] s the extenson of the deformable part models (DPM) [12] that are used to estmate the pose of the human by nferrng from the best part confguraton. DPM was orgnally a weakly-supervsed model usng latent support vector machnes (latent SVM) to learn the appearances and locatons of each part detector automatcally from the labeled whole object wndow. On the other hand, FMP s a supervsed verson of DPM and uses mxture-of-parts to represent the dscrete changes of each part appearance. FMP [12] accurately estmates frontal poses of subjects openng ther legs and arms. However, FMP cannot precsely estmate the poses that have feet and arms occluded, because t depends on the part-detecton scores and depends on the tree-graph where each subnode (arms and feet) s wdely open. For ths reason, FMP tends to estmate the pose of a person who looks rght or left and even frontal poses ncorrectly, because t s hard to detect each arm or leg part n those poses owng to ther ambguous and ncomplete part appearances. Moreover, t s dffcult to model the confguraton wth one tree-structure model for frontal, sde and bendng poses. The model needs to learn those models separately. More recently, poselets-based [15] part-based approaches have been studed [17], [18]. Although those methods overcome the weakness of FMP by representng the relatonshp between parts usng poselets (whch are larger parts than parts of FMP), they are stll poor at hard occluson cases because they stll depend on the pctoral structure. A mult-camera approach [10] helps to deal wth part-occlusons, but even ths methods s not able to tackle wth sde runnng scenaros where part-occlusons occur frequently. There s also a method nvolvng an occluson handlng scheme usng an occluson detector and part-based regresson [19]. Whle ths method provdes good results wth small occlusons between parts, t also cannot estmate sde poses because t stll depends on the pctoral structure. 2.3 Moton Model Regresson and Pose Manfold Methods for Pedestrans If the human pose model only ncludes the pose types wthn one acton class, such as walkng or swngng a golf club, pose estmaton can be solved usng regresson technques wth fxedvew tranng mages. The trackng method usng Gaussan Process Regresson [9] s popular for learnng a (latent) 3D pose manfold from cyclc pedestran mages from one camera vew. For pedestran pose estmaton, Gammeter et al. [20] proposed a people trackng and pose estmaton method for pedestrans usng Gaussan Process Regresson. Rogez et al. [21] proposed a perframe pose estmaton of human pose estmaton based on Random Decson Forests [22] and pose manfolds of a gat sequence wth HOG features. Reference [21] s close to our method n usng Random Decson Forests for pose classfcaton. However, ther Random Decson Forests class s based on camera vews and gat manfold cycle, whle our Label-Grd class s the grd of HOG features. Addtonally, they do not predct the jont postons precsely because they just fnd the most smlar walkng pose exemplar on the gat manfold. These methods only nvestgated the pose dstrbuton of walkng people. They can learn the latent transtons between the poses of pedestrans, but cannot afford to nclude every types of poses. 2.4 Poselets and Exemplar-SVMs Our pelvs-algned detector and Label-Grd classfer are both nspred by the poselets framework [15]. Poselets are the detector of one specfc pose of a mddle-level human part detector, whch can be learned from tranng mages wth the same algned pose and same scale mages but from dfferent subjects (e.g., upper body Poselets wth ther arms crossed). Exemplar-SVM [23] s an object detecton method usng perexemplar detectors. It detects exemplars usng each exemplar- SVM to detect multple appearance types of an object class. Exemplar-SVMs separately learn each pose or vew wth one exemplar-svm (e.g., left-vew car SVM, frontal-vew car SVM, jump-pose human SVM). On the other hand, our pelvs-algned detector learns multple scales and poses of an object class altogether. Ths one-detector soluton makes t easy to ntegrate wth a trackng-by-detecton scheme, whle the goal of Exemplar-SVMs s robust object detecton even wth only one mage usng multple SVMs that know the hard-negatves. 3. Proposed Framework The proposed framework (Fg. 3) s composed of two modules: a trackng player wth trackng-by-detecton technque wth the pelvs-algned detector (Secton 3.1), and estmatng the four jont postons on the grd structure ndependently usng Label- Grd classfers (Secton 3.2). These two modules share the tracked player wndow as HOG (Hstogram-of-Orented Gradents) feature [14] to estmate the pose (locatons of four jonts) n each frame of the vdeo (Secton 3.2). At test tme, the only nput of our framework s the player wndow poston (rectangle) of the subject player n the frst frame of the vdeo. All the lower body poses n each frame are estmated automatcally by trackng the player and are estmated by Label-Grd classfers n each frame. We learn four Label-Grd classfers of each lower-body jont separately; left-knee L lk (x), rght-knee L rk (x), left-foot L lf (x), and rght-foot L rf (x). Note that left or rght means left n the mage and rght n the mage respectvely. Our Label-Grd classfer learns the poston of the left and rght jonts n the mage plane just lke the other pose estmators such as FMP [4] *1. *1 Ths s the typcal lmtaton of the two-dmensonal human pose estmaton methods. To overcome ths, we wll use the three dmensonal nformaton nferred by the other approaches n the future. 248

[11], to provde an algned player wndow for the second module. We also used ths pelvs-algned detector n our upper body pose estmaton framework [24], the explanaton n Ref.

3. For trackng-by-detecton, we use the player detector learned from the dataset D all (Secton 3.3). We use a Kalman Flter (nstead of Partcle Flter n Ref.

4 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Fg. 3 Flowchart of the proposed framework. 3.1 Player Trackng wth Pelvs-algned Detector Our frst module s the player trackng method wth standard trackng-by-detecton, such as Ref. [11], to provde an algned player wndow for the second module. We also used ths pelvs-algned detector n our upper body pose estmaton framework [24], the explanaton n Ref. [24] was very short because of the page lmt. Hence, ths paper provdes a more detaled procedure to prepare the dataset of our pelvs-algned player detector n Secton 3.3. For trackng-by-detecton, we use the player detector learned from the dataset D all (Secton 3.3). We use a Kalman Flter (nstead of Partcle Flter n Ref. [11]) to track the player whose lkelhood n each frame s a non-maxmum suppresson result of the detectons wthn the local area around the predcted player locaton. Snce our method manly targets at estmatng sde poses durng runnng or walkng, whch occurs very often n team sports vdeos (as mentoned n the ntroducton), we choose a Kalman Flter by assumng the smple and monotonous trajectory of the subject players n typcal team sports vdeos on large felds. Ths trackng procedure provdes the smoothed and the center of the tracked wndow n each frame, and these tracked wndows are expected to be algned to the pelvs poston. Ths frst module provdes the algned wndow that works well for the Label- Grd Classfer n the second module. Snce we use HOG feature for classfyng the pose, we expect around 1 or 1.5 grd errors n ths tracker to ensure that the Label-Grd classfers can use the algned HOG feature leaned from the pelvs-algned dataset. 3.2 Label-Grd Classfer for Estmatng Jont Grd Poston The second module estmates the four jont locatons usng four Label-Grd classfers. The proposed Label-Grd classfer s a mult-class classfer whose label classes are assgned to the grd locatons of grd hstogram feature such as HOG features [14]. The Label Grd classfer L j (x t ) = {F j (x t ), M j (ŷ j t )} (for the j- th jont) conssts of a mult-class classfer F j (x t ) and the classto-grd mappng functon M j (ŷ j t ), where we denote the nput vsual feature vector (n our case, normalzed three-level HOG) as x t R D at frame t, and the estmated Label-Grd class label of F j (x t )sy j t {l = 1, 2,...,L}. Each class l of F j (x t ) learns the appearances (or poses) of players wth ts j-th jont s on the same grd (See Fg. 2). At test tme, the classfer F j (x t )ofl j (x t ) frst estmates the class y t from the nput vsual feature vector x t : ŷ j t = F j (x t ) (1) The reason why we use not only the lower body but also the full body wndow for the HOG feature s that we am to leverage the whole upper body appearance for classfers, whch result n capturng a wde varaton n upper body appearances n each lower body jont poston. We expect that ths strategy of ncludng upper body appearance makes the Label-Grd classfer easer to dscrmnate the pose of a specfc jont poston from the other poses even when the pelvs s not algned, whle the HOG of only the lower body could cause too much senstvty to the ms-regstraton of the tracker *2. After estmatng the class label ŷ j t from the nput feature vector, we map ŷ j t to the correspondng 2D grd locaton wth M j (ŷ j t ): l j t = M j (ŷ j t ) (2) where M j s the dctonary functon for the j-th jont to map each class ŷ j t to the correspondng grd locaton l j t N 2, whch we call Label-Grd. Ths mappng dctonary M j s bult durng tranng. We typcally assgn each class y j t to the grd from left to rght and from top to bottom f there s more than one sample labeled on the grd (see Fg. 2 for the example grd ndex assgnment). Note that we need the nverse mappng of M j (ŷ j t ) durng tranng, because we frst have to assgn each Label-Grd l j t n the D all to the l-th class n L-class classfer. However, at test tme, we need only M j (ŷ j j t ) for convertng ŷt to l j t. The tranng dataset for the Label-Grd classfer must nclude most of the types and scales of players appearances that could occur n the target vdeos. Hence, our system can estmate the lower body pose n any locaton of the mage (n our case, the *2 If the estmaton of the pelvs s perfect wth any other methods, we need to use only the lower body appearance. 249

5 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan sports feld) where player scale vares accordng to the poston and the pose of the camera. 3.3 Dataset Preparaton We share the same data-augmented dataset D all between two modules to learn the pelvs-algned Detector and four Label-Grd classfers. To share the player wndow wth ts center algned to the pelvs of a player, the player detector and the Label-Grd classfer are learned from the same mages and labels from D all.to create D all, we perform data augmentaton wth dfferent scales Fg. 4 Data Augmentaton. Usng h pla (heght of the blue wndow), mages are scaled to the scale s so that the center poston p pel keeps to the center of the Label-Grd even n the scaled mages. By performng ths algned mage samplng of the tranng dataset, the feature space of the Label-Grd classfer can be augmented to the multscale player szes wthn the Label-Grd wndow. and mrrored mages from the orgnal dataset D or (Fg. 4). We frst prepare a tranng dataset D all = {D sca, D mr } from realstc team sports vdeos to learn both player detector and Label- Grd classfers. D sca (Fg. 4 (a)) s the resampled player wndow mages and ther labels from the orgnal dataset D or for whch we should only need to prepare the labels. D mr (Fg. 4 (b)) s the mrrored dataset of D sca whose mages and labels are flpped horzontally. See the left half of the Fg. 5 for ths dataset preparaton procedure. Frst, we prepare a dataset D or wth N mages I = [I 1, I 2,...,I N ] and labels of each mage I so that the mages ncludes varous types of poses of players n one specfc sport. Each player wndow mage I n D or has labels L = [p 1, p2, p3, p4, ppel, h pla ], where p j R D s the j-th jont locaton of the -th mage I on the mage plane, and h pla s the player heght for resamplng the orgnal mages. p pel R D s the locaton of the pelvs of the -th mage I on the mage plane, whch s always at the center of the player wndow and becomes the nformaton of the locaton of the player wndow. Images I are all clpped from the team sports vdeos so that ther wndow centers p pel are all algned to the center of the wndow, and the labelng person manually nputs h pla as a length between the top of head and the bottom of the foot of the player n I. Ths labelng procedure determnes the scale of the player heght h pla to the fxed sze wndow n each sample. For resamplng the orgnal mage of D or to multple scales, we resze all the mages and labels n D or to the resampled player scales s = h pla /h wn, where h wn s the heght of the Label-Grd wdnow. D sca ncludes several player scales wth regular ntervals (e.g., 0.80, 0.85,...,1.00) by reszng D or (Fg. 4 (a)). Fnally, we acqure D mr by flppng the mages and labels of D sca horzontally to learn the mrrored features and labels (Fg. 4 (b)). The player detector uses only the mages of D all because the Fg. 5 Learnng procedure. 250

6 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan centers of the players wndow mages n D all are all algned to the pelvs of the players. In contrast, mages and Label-Grd classes of D all are used to learn the classfer. Any types of multclass classfer can be used as a Label-Grd classfer. However, for ts hgh precson and fast computatonal tme, we use Random Decson Forests [22] as a Label-Grd classfer n our experments. 4. Learnng Label-Grd Classfer The Label-Grd classfer s a mult-class classfer wth ts class labels (Label-Grd) assgned to the 2D grd locaton of a feature type wth a grd structure such as HOG features. The Label-Grd classfer can be any types of mult-class classfer (e.g., Support Vector Machne, Random Decson Forests, etc.), but preparng the dataset for a Label-Grd classfer to classfy the lower body grd-poston of the player s our orgnal approach to usng a general mult-class classfer. Every one class (Label-Grd) of the Label-Grd classfer learns the HOG features whose jont s on the same grd poston (Fg. 2). Gven the grd feature of a player n a W H wndow, classfyng the Label-Grd of a specfc jont (e.g., left-knee) can be regarded as an L-class grd classfcaton problem, where the task s to choose a grd poston (, j) from L canddate postons (pnk crcles are marked as canddate postons n Fg. 2). The other N grds n the grd feature are just gnored from label-grds to learn. The number of Label-Grds L s decded when buldng M j (y j t ) (see Secton 4.1). For nstance, f you use HOG features wth 6 10 cells n a player wndow and f there are 35 classes where the jont label-grd exsts more than one jont poston n the tranng dataset, the Label-Grd classfer becomes 35-class classfers. The other 25 (= ) grds, whch have no jont labels n the tranng dataset, are gnored for the classfcaton of the jont. 4.1 Learnng Procedure We wll explan how to learn a Label-Grd Classfer for classfyng the j-th jont (e.g., the left knee jont). Fgure 5 shows the whole procedure, used to learn the Label-Grd Classfers (L lk (x), L rk (x), L lf (x), and L rf (x)) and the pelvs-algned detector by preparng a dataset D all. Gven a data-augmented dataset D all, we frst calculate the grd locaton l j from the j-th jont p j of the -th mage n D all usng pelvs poston and the sze of Label-Grd (e.g., each grd s 8 8 and the wndow sze s ). After calculatng all l j n D all, we then buld the mappng functon M j (y j ) to decde the number of the class N of the Label-Grd Classfer and all the Label-Grd ndces y j of the -th mage n D all. After M j (y j ) has been bult, we can fnally learn F j (x ) (usng Random Decson Forests, n ths paper) wth Label-Grd s and the calculated feature x all that we wll explan n the next subsecton. 4.2 Mult-level HOG Feature and Feature Selecton We use a three-level mage pyramd from the player wndow for calculatng three-level HOG features x 1 t, x 2 t, x 3 t for makng the feature vector x t = [x 1 t x 2 t x 3 t ] for a Label-Grd classfer. Learnng multple resoluton of the HOG appearance makes the Label-Grd classfer restrct the label-grd canddates at each resoluton level, whch helps to avod the bad classfcaton result far from the true poston. To decrease the effect of the dfference of feature scales between the three levels, we normalze the feature vector x t to L 2 unt vector both at tranng tme and test tme. We use L-class Random Decson Forests as the L-class Classfcaton Forests [22] as F j (x t ) of the Label-Grd classfer, whch results n performng feature selecton from these normalzed three-level HOG features x t. After learnng the L-class, each splt functon uses two randomly selected values of the mult-level feature vector x t to estmate the class label y j t. 5. Experments We tested our framework n two scenaros: frontal pose sequences (Secton 5.3.1), whch part-based pose estmator [4] can also estmate robustly, and sde pose sequences (Secton 5.3.2), whch part-based pose estmator cannnot predct properly as argued n Secton Expermental setup We performed expermental evaluatons on our system wth Amercan Football vdeos n professonal league matches. The sze of each vdeo s 1, The vdeos are taken from the matches of Panasonc IMPULSE *3. All vdeos were captured from the hgh place n the stadum wth fxed cameras. All vdeos were converted to 29 fps vdeos whle the orgnal vdeos were recorded at 59 fps. These vdeos nclude players from a team wth a whte-colored unform and players from the other team wth a black-colored unform. Although we captured hgh-resoluton vdeos, moton blur of movng legs and arms sometmes occurs and the players are captured wth a relatvely low resoluton. We created 10 test sequences (test (1) (10)) for the fve frontal pose tests and fve sde pose tests from these vdeos (see Fg. 6 to see the player trajectores on each sequence). Each test vdeo s composed of 40 frames. We wll show the detal behavor (pose) on each sequence n Secton for the frontal pose sequences and Secton for sde pose sequences. We manually clpped player wndows from vdeo frames and assgned labels to create our orgnal dataset D or for tranng both Label-Grd classfers and the player detector. We tred to nclude as many pose patterns (and also vews of the pose) as possble n the dataset D or to make the Label-Grd classfers learn the whole possble appearance patterns n the Amercan Football vdeos. We randomly selected the mages from all the vdeos so that the orgnal dataset ncludes more versatle player poses, and the orgnal dataset becomes 977 mages and ts labels. Note that approxmately 10% of the orgnal tranng mages shares the same mages wth test dataset sequences of test (7) and test (8). Then we resampled 977 mages and labels of D or wth 13 scales s = {0.70, 0.725, 0.75,...,0.975, 1.0} and fnally prepared 25,402 mages (25,402 = ) for tranng four Label-Grd classfers ndependently. *

Informaton and Meda Technologes 10(2): 246-258 (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: 18-30

Each red crcle shows the canddate Label-Grd class of the classfer.

Consequently, we had 34-class left knee Label- Grd classfer, 34-class rght knee Label-Grd classfer, 38-class left foot Label-Grd

To apply FMP [4] as a baselne, we used PartsBasedDetector software [25].

wth our jont locaton classfers (ndex 12 as left knee jont, ndex 13 as left foot jont, ndex 24 as rght knee jont, ndex 25 as rght

6 Tracked results of all tests (1) (10).

Green dots show the player wndow center locatons n each frame. 5.2

1 Pxel Error of the Jont Poston For measurng the performance of our lower body pose system, we defne the Eucldean dstance error as

player detector and the four Label-Grd classfers.

For the Label-Grd classfers, we also created pyramd mages 48 64 and 24 32 from the tracked 64 128 wndow mage n one frame.

Fnally, we obtaned a 2268-dmensonal L 2 -normalzed feature vector x t as the nput of each Label-Grd classfer.

7 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Fg. 7 Four Label-Grd classfers wth 8 12 Label-Grds, whch we use n our experments. Each red crcle shows the canddate Label-Grd class of the classfer. ndependently (rght/left knee and rght/left foot) from the tranng dataset D all. Consequently, we had 34-class left knee Label- Grd classfer, 34-class rght knee Label-Grd classfer, 38-class left foot Label-Grd classfer, and 39-class rght foot Label-Grd classfer (see Fg. 7 for the class assgnment). To apply FMP [4] as a baselne, we used PartsBasedDetector software [25]. We used 26-parts frontal person models as FMP and regard center postons of the 4 part-detector as four lower body jonts to compare wth our jont locaton classfers (ndex 12 as left knee jont, ndex 13 as left foot jont, ndex 24 as rght knee jont, ndex 25 as rght foot jont). We assume that the center of the rectangle of each detected part s the correspondng jont poston n the mage. Fg. 6 Tracked results of all tests (1) (10). Test (1) (5) are the results of the frontal pose sequences whle tests (6) (10) are the results on the sde pose sequences. Green dots show the player wndow center locatons n each frame. 5.2 Evaluaton Manner Pxel Error of the Jont Poston For measurng the performance of our lower body pose system, we defne the Eucldean dstance error as below: E t = d( ˆp t, p GT t ) (3) HOG wndow sze was pxels (wdth heght), and the cell sze was 8 8 both for our pelvs-algned player detector and the four Label-Grd classfers. For learnng the pelvsalgned player detector, we only used HOG and labels of D all. For the Label-Grd classfers, we also created pyramd mages and from the tracked wndow mage n one frame. We then calculated three-level HOG x 1 t, x 2 t, x 3 t from each level pyramd mage wth 8 8 cell sze and combned them. Fnally, we obtaned a 2268-dmensonal L 2 -normalzed feature vector x t as the nput of each Label-Grd classfer. We learned four Label-Grd classfers as Random Decson Forests [22] wth the feature vector x t for each lower body jont where ˆp t = (x,y) s the center pont of the estmated Label-Grd locaton and the p GT t s the ground truth poston. Snce our Label- Grd classfers are learned wth 8 8 Label-Grd, the center poston ˆp t becomes (4, 4) from the left-top pont (0, 0) n each Label- Grd. Note that the runnng speed of the player s fast n most of our expermental vdeos because we apply our method to the solated runnng players, such as Quarterback, Runnngback, and Lnebacker. For ths reason, the length of each vdeo s very short (40 frames). Another reason s that we cannot collect many sequences of long runnng solated play easly, because each Amercan football play s around only 10 seconds and players tend to be occluded and congested frequently. 252

The left tow panels n each subfgure (a) (e) show the results of our Label-Grd classfers, and the rght panels show the result of the FMP, where only the four detected jonts are shown (no vsualzaton of

Although we would lke to compare our methods wth FMP usng the PCP [26] score, whch s broadly appled to the evaluaton of part-based methods, we cannot calculate the PCP score because our method does

2 Detecton Rate of FMP and How to Apply FMP to Our Vdeos To test the lmtatons of FMP for sde poses n our vdeos, we defned the detecton rate R as R = k/n, where k s the number of detectons aganst the

Snce FMP [4] was the object detector (but jontly estmate the pose whle detectng the object), we automatcally clpped the magnfed and margn-added mage to apply the FMP detector.

8 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Fg. 8 Example results from frontal pose experments. The left tow panels n each subfgure (a) (e) show the results of our Label-Grd classfers, and the rght panels show the result of the FMP, where only the four detected jonts are shown (no vsualzaton of jonts means that FMP could not detect anyone n the frame). Although we would lke to compare our methods wth FMP usng the PCP [26] score, whch s broadly appled to the evaluaton of part-based methods, we cannot calculate the PCP score because our method does not nfer the stck area of each part whch s needed for calculatng PCP scores. Ths s one of the reasons why we used the Eucldean dstance for the evaluaton Detecton Rate of FMP and How to Apply FMP to Our Vdeos To test the lmtatons of FMP for sde poses n our vdeos, we defned the detecton rate R as R = k/n, where k s the number of detectons aganst the number of frames N n one test vdeo (n our case, N = 40 frames). Snce FMP [4] was the object detector (but jontly estmate the pose whle detectng the object), we automatcally clpped the magnfed and margn-added mage to apply the FMP detector. We frst clpped the player wndow from the trackng-bydetecton trackng module (Secton 3.2) by addng margn, and magnfed t 200% to enable FMP to detect the players n our vdeo. Each of mages n Fg. 8 shows the clpped mages wth ths procedure. 5.3 Expermental Results Fgure 9 (a) shows the detecton rate of FMP [4] n tests (1) (5). Snce the movement of the players n tests (2) (4) are dagonal and curved (Fg. 6), most of ther poses were dffcult for FMP wth hard occluson and low-resoluton, even though we defned those tests as frontal tests. However, snce our method does not use any part-models and just use tracked (whole) player wndow appearance n each frame, t can classfy the jont poston n all frames n test (1) (5). For example, n the Fg. 8 (d), whle FMP could not detect the player n each frame, our Label-Grd classfer estmated the jont postons correctly from the same mages. Whle we wanted to compare the poston error between our methods and FMP, we abandoned the calculaton of the pxel 253

9 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Table 2 Average estmaton error of each jont n the sde pose tests (6) (10). All errors are n pxels. Jonts (6) (7) (8) (9) (10) Left knee Rght knee Left foot Rght foot Pelvs (a) Detecton rate of FMP [4] n frontal pose tests (1) (5). (b) Detecton rate of FMP [4] n sde pose tests (6) (10). Fg. 9 Detecton Rate of FMP [4] n each test. Table 1 Average estmaton error of each jont n the frontal pose tests (1) (5). All errors are n pxels. Jonts (1) (2) (3) (4) (5) Left knee Rght knee Left foot Rght foot Pelvs error for FMP snce we were not able to get enough detectons from even frontal poses (Fg. 9 (a)) Frontal Pose Experment We tested our system on the followng frontal pose scenaros to compare the performance or detecton rate wth FMP [4]. We prepared the followng fve sequences for a frontal pose dataset (Fg. 6, left column): Test (1): The player walks to the start poston whle facng ther frontal upper body to the camera. Test (2): The player runs up to the upper sde of the feld. Test (3): The player begns to run from the start poston. Test (4): The player runs dagonally. Test (5): A large player walks to the outsde of the feld. Fgure 6 shows the tracked trajectory of pelvs poston (center of the player wndow) n each of tests (1) (10). The panels n the left column show the results of frontal pose tests (1) (5). Table 1 shows the average error of our method and FMP [4] for each jont. Note that our Label-Grd s 8 8 pxels for all tests. Whle FMP sometmes faled to detect a player who had occluded parts, our method could detect non tree-structured poses. Fgure 8 shows the example results of our method and FMP to compare wth each other Sde Pose Experment Just as for the frontal pose experment n the prevous secton, we also performed evaluaton of our method and FMP wth the followng fve sde pose scenaros (Fg. 6, rght column): Test (6): The player runs straght (namely, almost no scale change) at relatvely slow speed from left to rght. Test (7): The player runs very fast from rght to left. Test (8): The Runnngback player runs dagonally from the startng poston. Test (9): The Runnngback runs straght from the startng poston. Test (10): The player walks backward. We collected these sde pose test vdeos so that the upper body drecton of the player was almost the same as the lower body drecton. Table 2 shows the average error of our method n these sde pose tests and Fg. 10 shows the example results of tests (6) (8). Fgure 9 (b) shows the detecton rate of FMP [4] n sde pose tests (6) (8). As Fg. 10 shows, FMP can rarely detect the player n sde pose tests. FMP detects player wth a detecton rate 0.15 to Compared wth these results of FMP, our method could estmate the pose n all frames va ts trackng and classfcaton procedure wthn about two Label-Grd errors (Table 2). 5.4 Dscussons by Topc Whole Body Appearance Feature as Mult-level HOG As already argued n Secton 3.2, we use the whole body appearance to classfy the lower body pose. Our HOG-based classfcaton approach can be vewed as the modern replacement of the classcal slhouette-matchng schemes usng background subtracton, such as Ref. [16]. We nstead use randomzed HOG features (learned by Random Decson Forests) to robustly classfy the pose wth machne learnng. Our strategy has rcher nformaton wth whch to dscrmnate Label-Grd classes than usng only the lower body appearance. We nstead use the whole body HOG appearance to estmate the jont poston. Moreover, ownng to the deformaton nvarance of the HOG features, our Label-Grd classfer can estmate the pose of a larger or slmmer person untl the gradent dstrbuton changes from the feature dstrbuton of the dataset. For nstance, we performed an experment usng a large player n test (5) (see Fg. 8 (e)). Even though we only ncluded mddle-szed and thn players n the orgnal dataset D or, our classfers could stll est- 254

Grd classﬁer s learned from the same clothng type whle the pctoral structure ncludes varous types of clothng.

4 Algnment and Scale of the Wndow s Important As already mentoned, our framework depends on the algnment of the player wndow between the trackng module and the classﬁcaton module.

Ths means that Label-Grd classﬁers can only deal wth the wndow patterns wthn one or two (at most) cell sze drft.

11 (a), the left foot poston gradually becomes unstable as the left leg s out of the wndow. Ths sequence shows the nature of our wndow algnment scheme.

ncreases wth a wder Label-Grd wdow. Fgure 11 (b) shows the error values of four lower body jonts and the pelvs poston n each frame of the test (9).

10 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Fg. 10 Example results from sde pose experments. Each row shows the results of sde pose test (6), (7), and (8). Grd classﬁer s learned from the same clothng type whle the pctoral structure ncludes varous types of clothng. In our experments, we learned Label-Grd classﬁers from two Amercan Football teams, but the classﬁers can classfy the lower body pose wth the appearances of both teams Algnment and Scale of the Wndow s Important As already mentoned, our framework depends on the algnment of the player wndow between the trackng module and the classﬁcaton module. In our experments, our player detector s learned wth a HOG of 8 8 cell sze, whch tracks the player wthn a cell error. Ths means that Label-Grd classﬁers can only deal wth the wndow patterns wthn one or two (at most) cell sze drft. Hence, Label-Grd classﬁers can robustly estmate the grd locaton f the tracker can provde well-algned wndows. Fgure 11 shows the temporal analyss of the sde pose test (9). By observng Fg. 11 (a), the left foot poston gradually becomes unstable as the left leg s out of the wndow. Ths sequence shows the nature of our wndow algnment scheme. Whle you can use wder wndows for the Label-Grd classﬁers to prevent ths case by restrctng the person wthn the wndow, you need to make more patterns for the dataset because the number of classes ncreases wth a wder Label-Grd wdow. Fgure 11 (b) shows the error values of four lower body jonts and the pelvs poston n each frame of the test (9). Owng to the mate the jont locaton of the large-szed player Dsregard of Pctoral Structure and Low-resoluton Invarance of Our Method Another mportant part of the nature of our method s that our framework dsregards the pctoral structure strategy [27] whle t depends on the algned wndow appearance. As we showed n our expermental results for sde pose tests (Fg. 10), our method can model any types of pose ncludng hardly occluded sde poses, whch pctoral pcture models cannot nfer very well owng to ther tree-models. As we already argued n Secton 4.2, our threelevel HOG feature and the Randomzed feature selecton helps to restrct the error as much as possble. Snce the Random Decson Forests technque takes advantage of the spatal grd structure to learn the dstance between classes, the error seems to be restrcted wthn neghborng Label-Grd (See Fg. 8 and Fg. 10). Whle FMP and the other part-detector technques assume clear and non-blurred mages, our mult-level HOG-based LabelGrd classﬁers can even classfy the poses n low-resoluton and moton-blurred mages because the HOG feature s robust for contrast change (between mage scales) usng grd-wse edge hstogram poolng and block-wse normalzaton [14] Sport-specﬁc Classﬁer Whle our method s able to estmate the lower body pose wth varous types of poses n Amercan Football robustly, our Label- 255

60). 5.4.5 Per Frame Estmaton for Movng back Players In team sports vdeos, players durng defensve acton tend to have a pose or body drecton that s not the same as the player s movng drecton.

11 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Label-Grd class. Even though these two advantages help to embed as many (contnuous) scales of features (n Random Decson Forests feature space) as possble, the falure happens f the player s an unknown scale (e.g., s = 0.60) Per Frame Estmaton for Movng back Players In team sports vdeos, players durng defensve acton tend to have a pose or body drecton that s not the same as the player s movng drecton. Fgure 12 shows the result of test (10). Ths shows the ablty of our method to classfy the pose correctly even when the player s movng backward. Ths feature shows the hgh applcablty to team sports vdeos, whereas walkng and runnng backwards only rarely appears n survellance vdeos. Fg. 11 Temporal analyss of test (9). Fg. 12 Walkng back result of test (10). dependence of the algnment of the tracker, Label-Grd classfers tend to have more errors along wth an ncrease n pelvs poston error. In Fg. 11 (a), each jont errors tends to ncrease when pelvs error becomes large. In addton, whether the player s wthn the scales of the dataset durng the test tme s also mportant. In the experments, we used scales s = {0.70, 0.725,..., 1.0} for creatng the augmented tranng mages. If the player scale wthn the wndow s too small or too large, the three-level pyramd HOG features wll become an unknown pattern for the Label-Grd classfer (Random Decson Forests). Our framework has two advantages for overcomng ths problem. Frst, Random Decson Forests can learn the nter-class dstance, as our Label-Grd classfer tends to msclassfy the sample wth the neghborng Label-Grd. In addton, the three-level pyramd HOG features also help the appearance over three resoluton levels and help to evade msclassfcaton to the dstant 6. Concluson To estmate lower body poses from low-resoluton mages, we proposed a new human pose estmaton method usng a Label- Grd classfer that s ntegrated wth an object tracker. Our Label- Grd classfer does not use the pctoral structures, but use the algnment of the player s pelvs poston to classfy varous types and scales of poses nto a grd structure wth off-the-shelf multclass classfers (we use Random Decson Forests n ths paper). Algnment between the trackng-by-detecton module and the Label-Grd classfcaton module s the key to realze the estmaton of lower body poses wth all poses n team sports vdeos. Our system can even estmate poses of the solated player wth part-occlusons and non-uprght poses, whch are dffcult to estmate wth the methods usng pctoral structures and part detectors. Our pose classfcaton strategy usng a whole person HOG makes t possble to classfy the lower body jont locatons of a player even durng the sde runnng poses. Our framework can be vewed as a revsted verson of Ref. [16] by usng machnelearnng and dense vsual features. In other words, tradtonal slhouette matchng strategy for pose estmaton was nnovated by our approach usng HOG features and Random Decson Forests to embed all pose appearance patterns nto the randomzed feature space. In ths work, we only nvestgated the possblty of our framework for an solated player wthout any occluson between players. However, the lower body pose estmaton of solated players wll be useful for many team sports vdeos because players are mostly solated durng play. As our experments showed, our system can estmate all types of poses wth only monocular RGB vdeos f a suffcent amount of poses are prepared n datasets. In addton, our method can estmate sde-runnng poses whle FMP [4] can only detect starshaped part confguratons and cannot estmate non-star sde runnng poses. However, the estmaton fals when the algnment s not very accurate because of the drft of the tracker. We beleve that ths method s advantage over prevous pose estmaton methods wll open up the wde range of potental of player actvty recognton from the estmated jont postons usng only monocular cameras and any low-resoluton settngs of people trackng. Jont postons of the lower body wll provde a new source of rcher nformaton for sports data analyss wth only passve sensng. 256

Moreover, we wll use 3D pose nformaton such as the upper body pose, whch we are explorng n other research [24], or movement drecton to restrct the pose feature space usng these types of nformaton as

Acknowledgments The data were provded by the Panasonc Corporaton and the Japan Amercan Football Assocaton.

12 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Our future work ncludes jont estmaton of multple jonts and probablstc formulaton usng a knematc body model (whle ths paper only nvestgates one-shot classfcaton as a frst smple proposal of grd-wse classfcaton for pose estmaton). Moreover, we wll use 3D pose nformaton such as the upper body pose, whch we are explorng n other research [24], or movement drecton to restrct the pose feature space usng these types of nformaton as prors. For behavor understandng usng the lower body pose, we wll nvestgate the leg-based actvty recognton such as recognzng foot steps and key-pose extracton for sports vdeo summarzaton and retreval. Acknowledgments The data were provded by the Panasonc Corporaton and the Japan Amercan Football Assocaton. The authors also would lke to thank Kyosh Matsuo for hs help on the dscusson of the expermental settngs. References [1] Wu, Y., Lm, J. and Yang, M.-H.: Onlne object trackng: A benchmark, 2013 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp , IEEE (2013). [2] Wang, C., Wang, Y. and Yulle, A.L.: An approach to pose-based acton recognton, 2013 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp , IEEE (2013). [3] Yang, Y., Baker, S., Kannan, A. and Ramanan, D.: Recognzng proxemcs n personal photos, 2012 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp , IEEE (2012). [4] Yang, Y. and Ramanan, D.: Artculated pose estmaton wth flexble mxtures-of-parts, 2011 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp , IEEE (2011). [5] Klaser, A. and Marszalek, M.: A spato-temporal descrptor based on 3d-gradents (2008). [6] Sadanand, S. and Corso, J.J.: Acton bank: A hgh-level representaton of actvty n vdeo, 2012 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp , IEEE (2012). [7] Jhuang, H., Gall, J., Zuff, S., Schmd, C. and Black, M.J.: Towards understandng acton recognton, IEEE Internatonal Conference on Computer Vson (ICCV), Sydney, Australa, pp , IEEE (2013). [8] Shotton, J., Sharp, T., Kpman, A., Ftzgbbon, A., Fnoccho, M., Blake, A., Cook, M. and Moore, R.: Real-tme human pose recognton n parts from sngle depth mages, Comm. ACM, Vol.56, No.1, pp (2013). [9] Urtasun, R., Fleet, D.J. and Fua, P.: 3D people trackng wth Gaussan process dynamcal models, 2006 IEEE Computer Socety Conference on Computer Vson and Pattern Recognton, Vol.1, pp , IEEE (2006). [10] Kazem, V.: Mult-vew Body Part Recognton wth Random Forests, 2013 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp , IEEE (2013). [11] Bretensten, M.D., Rechln, F., Lebe, B., Koller-Meer, E. and Van Gool, L.: Robust trackng-by-detecton usng a detector confdence partcle flter, 2009 IEEE 12th Internatonal Conference on Computer Vson, pp , IEEE (2009). [12] Felzenszwalb, P.F., Grshck, R.B., McAllester, D.A. and Ramanan, D.: Object Detecton wth Dscrmnatvely Traned Part-Based Models, IEEE Trans. Pattern Analyss and Machne Intellgence, Vol.32, pp (onlne), DOI: /TPAMI (2010). [13] Jan, A.K. and L, S.Z.: Handbook of face recognton, Sprnger (2005). [14] Dalal, N. and Trggs, B.: Hstograms of Orented Gradents for Human Detecton, Computer Vson and Pattern Recognton, Vol.1, pp (2005). [15] Bourdev, L. and Malk, J.: Poselets: Body Part Detectors Traned Usng 3D Human Pose Annotatons, Internatonal Conference on Computer Vson (ICCV) (2009). [16] Germann, M., Popa, T., Zegler, R., Keser, R. and Gross, M.: Space- Tme Body Pose Estmaton n Uncontrolled Envronments, Proc Internatonal Conference on 3D Imagng, Modelng, Processng, Vsualzaton and Transmsson (3DIMPVT 11), pp , Washngton, DC, USA, IEEE Computer Socety (2011). [17] Wang, Y., Tran, D. and Lao, Z.: Learnng herarchcal poselets for human parsng, 2011 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp , IEEE (2011). [18] Pshchuln, L., Andrluka, M., Gehler, P. and Schele, B.: Poselet condtoned pctoral structures, 2013 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp , IEEE (2013). [19] Radwan, I., Dhall, A., Josh, J. and Goecke, R.: Regresson based pose estmaton wth automatc occluson detecton and rectfcaton, 2012 IEEE Internatonal Conference on Multmeda and Expo (ICME), pp , IEEE (2012). [20] Gammeter, S., Ess, A., Jäggl, T., Schndler, K., Lebe, B. and Van Gool, L.: Artculated mult-body trackng under egomoton, Computer Vson ECCV 2008, pp , Sprnger (2008). [21] Rogez, G., Rhan, J., Orrte-Uruñuela, C. and Torr, P.H.: Fast human pose detecton usng randomzed herarchcal cascades of rejectors, Internatonal Journal of Computer Vson, Vol.99, No.1, pp (2012). [22] Crmns, A. and Shotton, J.: Decson forests for computer vson and medcal mage analyss, Sprnger (2013). [23] Malsewcz, T., Gupta, A. and Efros, A.A.: Ensemble of Exemplar- SVMs for Object Detecton and Beyond, Internatonal Conference on Computer Vson (ICCV) (2011). [24] Hayash, M., Yamamoto, T., Ohshma, K., Tanabk, M. and Aok, Y.: Head and Upper Body Pose Estmaton n Team Sport Vdeos, nd IAPR Asan Conference on Pattern Recognton (ACPR), pp , IEEE (2013). [25] Brstow, H.: C++ mplementaton of Deva Ramanan s Artculated Pose Estmaton wth Flexble Mxtures of Parts. [26] Ferrar, V., Marn-Jmenez, M. and Zsserman, A.: Progressve search space reducton for human pose estmaton, 2008 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp.1 8, IEEE (2008). [27] Andrluka, M., Roth, S. and Schele, B.: Pctoral structures revsted: People detecton and artculated pose estmaton, 2009 IEEE Conference on Computer Vson and Pattern Recognton (CVPR), pp , IEEE (2009). Masak Hayash receved an M.Sc. degree n Computer Vson and Image Processng from Keo Unversty n Snce 2012, he has been a Ph.D. canddate at Keo Unversty. Hs research nterests are vdeo-based human pose estmaton and actvty recognton, especally for team sports vdeos. Kyoko Oshma receved a B.A. of Lberal Arts from Internatonal Chrstan Unversty n She s a staff engneer n the Panasonc Corporaton and works on research and development of computer vson system. Masamoto Tanabk receved an M.Sc. degree n Electrcal Engneerng from Waseda Unversty n Hs research nterests are vdeo-based human trackng and pose estmaton, especally for securty, busness ntellgence, and sports. 257

13 Informaton and Meda Technologes 10(2): (2015) reprnted from: IPSJ Transactons on Computer Vson and Applcatons 7: (2015) Informaton Processng Socety of Japan Yoshmtsu Aok s an Assocate Professor, Department of Electroncs and Electrcal Engneerng, Keo Unversty. He receved hs Ph.D. n Engneerng from Waseda Unversty n From 2002 to 2008, he was an Assocate Professor n Shbaura Insttute of Technology. Snce 2008, he has been an Assocate Professor at Department of Electroncs and Electrcal Engneerng n Keo Unversty. He performs research n the areas of Computer Vson, Pattern Recognton, and Meda Sensng/Understandng. (Communcated by Greg Mor) 258

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today: