Book chapter for Machine Learning for Human Motion Analysis: Theory and Practice

Size: px

Start display at page:

Download "Book chapter for Machine Learning for Human Motion Analysis: Theory and Practice"

Marcus Richard
5 years ago
Views:

1 YngL Tan 1, Rogero Fers 2, Lsa Brown 2, Danel Vaquero 3, Yun Zha 2, Arun Hampapur 2 1 Department of Electrcal Engneerng Cty College, Cty Unversty of New York, New York, NY ytan@ccny.cuny.edu 2 IBM T. J. Watson Research Center, Hawthorne, NY {rsfers,lsabr,yunzha,arunh}@us.bm.com 3 Department of Computer Scence Unversty of Calforna, Santa Barbara danel@cs.ucsb.edu Vsual processng of people, ncludng detecton, trackng, recognton, and behavor nterpretaton, s a key component of ntellgent vdeo survellance systems. Computer vson algorthms wth the capablty of lookng at people at multple scales can be appled n dfferent survellance scenaros, such as far-feld people detecton for wde-area permeter protecton, mdfeld people detecton for retal/bankng applcatons or parkng lot montorng, and near-feld people/face detecton for faclty securty and access. In ths chapter, we address the people detecton problem n dfferent scales as well as human trackng and moton analyss for real vdeo survellance applcatons ncludng people search, retal loss preventon, people countng, and dsplay effectveness. Keywords: Vdeo survellance, object detecton, face trackng, human trackng, moton analyss, color classfcaton, people search, multple scales. 1. INTRODUCTION As the number of cameras deployed for survellance ncreases, the challenge of effectvely extractng useful nformaton from the torrent of camera data becomes formdable. The nablty of human vglance to effectvely montor survellance cameras s well recognzed n the scentfc communty [Green 1999]. Addtonally, the cost of employng securty staff to montor hundreds of cameras by manually watchng vdeos s prohbtve. Intellgent (smart) survellance systems, whch are now watchng the vdeo and provdng alerts and content-based search capabltes, make the vdeo montorng and nvestgaton process

2 scalable and effectve. The software algorthms that analyze the vdeo and provde alerts are commonly referred to as vdeo analytcs. These are responsble for turnng vdeo cameras from a mere data gatherng tool nto smart survellance systems for proactve securty. Advances n computer vson, vdeo analyss, pattern recognton, and multmeda ndexng technologes have enabled smart survellance systems over the past decade. People detecton, trackng, recognton, and behavor nterpretaton play very mportant roles n vdeo survellance. For dfferent survellance scenaros, dfferent algorthms are employed to detect people n dstnct scales, such as far-feld people detecton for wde-area permeter protecton, md-feld people detecton for retal/bankng applcatons or parkng lot montorng, and near-feld people/face detecton for faclty securty and access. People detecton and trackng has been an actve area of research. The approaches for people detecton can be classfed as ether model-based or learnng-based. The latter can use dfferent knds of features such as edge templates [Gavrla 2000], Haar features [Vola et al. 2001, 2003], hstogram-of-orentedgradents descrptors [Dalal & Trggs 2005, Han et al. 2006], shapelet features [Sabzmeydan 2007], etc. To deal wth occlusons, some approaches use part-based detectors [Wu & Nevata 2005, Lebe 2005]. In our system, learnng-based methods are employed to detect humans at dfferent scales. For each person enterng and leavng the feld of vew of a survellance camera, our goal s to detect the person and to store n a database a key frame contanng the mage of the person, assocated wth a correspondng vdeo. Ths allows the user to perform queres such as Show me all people who entered the faclty yesterday from 1pm to 5pm. The retreved key frames can then be used for recognton, ether manually or by an automatc face recognton system (f the face mage s avalable). To acheve ths goal, we developed a novel face detector algorthm that uses local feature adaptaton pror to Adaboost learnng. Local features have been wdely used n learnngbased object detecton systems. As noted by Munder and Gavrla [Munder & Gavrla 2006], they offer advantages over global features such as Prncpal Component Analyss [Zhang et al. 2004] or Fsher Dscrmnant Analyss [Wang & J 2005], whch tend to smooth out mportant detals. In order to detect trajectory anomales, our system tracks faces and people, analyzes the paths of tracked people, learns a set of repeated patterns that occur frequently, and detects when a person moves n a way nconsstent wth these normal patterns. We mplement two types of trackng methods: person-detecton-based and movng-object-based. The person-detecton-based trackng method s used to track faces and people n near-feld scenaros. In far-feld scenaros, the movng-object-based trackng method s employed because faces are too small to be accurately detected. The movng objects are frst detected by an adaptve background subtracton method, and are then tracked by usng a trackng method based on appearance. An object classfer further labels each tracked object as a car, person, group of people, anmal, etc. To buld the model of moton patterns, the trajectores of all tracks wth a gven start/end locaton labelng are resampled and clustered together. Ths gves an average or prototypcal track along wth standard devatons. Most tracks from a gven entry locaton to a gven ext wll le close to the prototypcal track, wth typcal normal varaton ndcated by the length of the crossbars. Tracks that wander outsde ths normal area can be labeled as anomalous and may warrant further nvestgaton. The prncpal components of the cluster ndcate typcal modes of varaton or egentracks, provdng a more accurate model of normal vs. abnormal. The algorthms for people detecton and moton analyss can then be used n several hgher-level survellance applcatons. Startng from a detected person, we perform clothng color classfcaton based on two body parts (torso and legs), whch are segmented from the human slhouette. Ths enables searchng for people based on the color of ther clothes. We also present

3 applcatons of our system n retal loss preventon, people countng and dsplay effectveness. These have been successfully deployed n commercal establshments. The rest of ths chapter s organzed as follows: Secton 2 revews the IBM Smart Survellance System (SSS). Secton 3 presents our learnng framework for selectng and combnng multple types of vsual features for object detecton. The applcaton of ths framework to human detecton n multple scales s dscussed n Secton 4. Secton 5 covers human trackng and moton analyss. Several case studes n vdeo survellance ncludng people search by clothes color, retal loss preventon, people countng, and dsplay effectveness are demonstrated n Secton 6. We conclude our work n Secton THE IBM SMART SURVEILLANCE SYSTEM The IBM Smart Survellance System (SSS) employs a number of dstnct and hghly specalzed technques and algorthms [Hampapur et al. 2005], whch can be summarzed as follows: Plug and Play Analytcs Frameworks: Vdeo cameras capture a wde range of nformaton about people, vehcles and events. The type of nformaton captured s dependent on a number of parameters lke camera type, angle, feld of vew, resoluton, etc. Automatcally detectng each type of nformaton requres specalzed sets of algorthms. For example, automatcally readng lcense plates requres specalzed OCR algorthms; capturng face mages requres face detecton algorthms and recognzng behavors; fndng abandoned packages requres detecton and trackng algorthms. A smart survellance system needs to support all of these algorthms, typcally through a plug and play framework; Object Detecton and Trackng: One of the core capabltes of smart survellance systems s the ablty to detect and track movng objects. Object detecton algorthms are typcally statstcal learnng algorthms that dynamcally learn the scene background model and use the reference model to determne whch parts of the scene correspond to movng objects [Tan et al. 2005]. Trackng algorthms assocate the movement of objects over tme generatng a trajectory [Senor et al. 2001]. These two algorthms together take a vdeo stream and decompose t nto objects and events, effectvely creatng a parse tree for the survellance vdeo; Object and Color Classfcaton: Object classfcaton algorthms classfy objects nto dfferent classes (such as people, vehcles, anmals, etc.), usng tranng data and calbraton schemes. Color classfcaton algorthms classfy the domnant color of the object nto one of the standard colors (red, green, blue, yellow, black and whte). These attrbutes become part of the searchable ndex, allowng users to query for red vehcles or people wearng blue clothes [Brown 2008, Chen et al. 2008]. Alert Defnton and Detecton: Typcal smart survellance systems support a varety of user-defned behavor detecton capabltes such as detectng moton wthn a defned zone, detectng objects that cross a user-defned vrtual boundary, detectng abandoned objects, etc. The user uses graphcal user nterface tools to specfy zones of nterest, object szes and other parameters that defne the behavor. When the behavor of nterest occurs wthn a camera s feld of vew, the system automatcally generates an alert message that can be transmtted to a workstaton, PDA or emal reader, dependng on the users preference [Tan et al. 2008, Zha et al. 2008].

4 Database Event Indexng: The events detected by the vdeo analyss algorthms are ndexed by content and stored n a database. Ths allows events to be cross-referenced across multple spatally dstrbuted cameras and creates a hstorcal archve of events. The event ndex nformaton typcally ncludes tme of occurrence, camera dentfer, event type, object type, object appearance attrbutes and an ndex nto the vdeo repostory, whch allows the user to play back the relevant vdeo at the touch of a button. Search and Retreval: Users can use a varety of GUI tools to defne complex search crtera to retreve specfc events. Events are typcally presented as shown n the mddle secton of Fgure 1. Search crtera nclude object sze, color, locaton n the scene, velocty, tme of occurrence, and several other parameters. The results of a search can also be rendered n a varety of summary vews, one of whch (called track summary) s shown n the rght secton of Fgure 1. (a) (b) (c) Fgure 1: (a) The home page, wth 1) cameras (bottom rght) runnng a varety of vdeo analyss capabltes, such as lcense plate recognton, face capture and behavor analyss; 2) Real-tme alert panel (top rght); 3) Map of the area (top left); 4) Vdeo player (bottom left). (b) The results of searchng for a red car. (c) A summary vew of all actvty n a camera over a selected perod, represented as object tracks. 3. FEATURE ADAPTATION PRIOR TO LEARNING FOR OBJECT DETECTION In ths secton, we descrbe a novel learnng framework for selectng and combnng multple types of vsual features for object detecton [Fers et al. 2008]. The applcaton of ths framework to human detecton n multple scales s covered n Secton Motvaton Current machne learnng methods for object detecton based on local mage features suffer from a scalablty problem n the feature selecton process. For a specfc feature type (e.g., Gabor flters), most methods nclude many feature confguratons n a feature pool (e.g., Gabor flters unformly sampled at dfferent postons, orentatons and scales), and then use a learnng algorthm, such as Adaboost or SVM, to select the features that best dscrmnate object mages from mages that do not contan the object. Therefore, as new feature types are consdered, the feature pool sze ncreases dramatcally, leadng to computatonal problems. Ths scalablty ssue has several mplcatons. Frst, the tranng tme can be excessvely long due to the large feature

pool and the brute-force strategy for feature selecton.

ncluded n the feature pool due to the samplng process, whereas many features that are less meanngful for dscrmnaton

In order to overcome the lmtatons dscussed above, we propose a novel framework to combne and select multple types

Our approach reles on the observaton that the local features selected for dscrmnaton tend to match the local

Fgure 2 shows the frst features selected by Adaboost n the context of face detecton [Vola & Jones 2001] and

In the mddle mage of the top row, the dark part of the flter concdes wth the dark mage regon (the eyes), whle the

5 pool and the brute-force strategy for feature selecton. Most methods consder a pool wth hundreds of thousands of local features, and tranng can take weeks on conventonal machnes. Second, the detecton/recognton accuracy can be sgnfcantly affected, as mportant feature confguratons may not be ncluded n the feature pool due to the samplng process, whereas many features that are less meanngful for dscrmnaton may be present n the pool. In order to overcome the lmtatons dscussed above, we propose a novel framework to combne and select multple types of vsual features n the learnng process. Our approach reles on the observaton that the local features selected for dscrmnaton tend to match the local structure of the object. Fgure 2 shows the frst features selected by Adaboost n the context of face detecton [Vola & Jones 2001] and recognton [Yang et al. 2004]. In ths example, the selected Haar flters capture the local mage contrast. In the mddle mage of the top row, the dark part of the flter concdes wth the dark mage regon (the eyes), whle the brght part of the flter matches the brght mage regon under the eyes (the cheek and nose). Smlarly, n the bottom row, Gabor wavelets capture local structures of the face. In fact, Lu and Shum [Lu & Shum 2003], n ther Kullback-Lebler boostng framework, argued that features should resemble the face semantcs, matchng the face structure ether locally or globally. [Vola & Jones, 2001] [Yang et al., 2004] Fgure 2: Features selected by Adaboost n the context of face detecton (top) and face recognton (bottom). Note that the features tend to adapt to the local face structure.

6 Fgure 3: Our approach has two stages: frst, we compute adaptve features for each tranng sample; then, we use a feature selecton mechansm (such as Adaboost) to obtan general features. Based on ths observaton, we present an approach that pre-selects local mage flters based on how well they ft the local mage structures of an object, and then use tradtonal learnng technques such as Adaboost or SVMs to select the fnal set of features. Our technque can be summarzed n two stages. In the frst stage, features that adapt to each partcular sample n the tranng set are determned. Ths s carred out by a non-lnear optmzaton method that determnes local mage flter parameters (such as poston, orentaton and scale) that match the geometrc structure of each object sample. By combnng adaptve features of dfferent types from multple tranng samples, a compact and dversfed feature pool s generated. Thus, the computatonal cost of learnng s reduced, snce t s proportonal to the feature pool sze. The effcency and the accuracy of the detector are also mproved, as the use of adaptve features allows for the desgn of classfers composed of fewer features, whch better descrbe the structure of the objects to be detected. In the second stage, Adaboost feature selecton s appled to the pool of adaptve features n order to select the fnal set of dscrmnatve features. As the pool contans features adapted to ndvdual object samples, ths process selects features whch encode common characterstcs of all tranng samples, and thus are sutable for detecton. Fgure 3 llustrates ths process for the partcular case when the objects to be detected are faces. Throughout the remander of ths chapter, we use the term adaptve features to descrbe features that match the geometrc structure of an object n a partcular tranng mage and general features to descrbe the fnal set of dscrmnatve features that encode common characterstcs of all tranng mages. 3.2 Learnng Usng Locally Adapted Features We now descrbe our framework to ncorporate local feature adaptaton n the tranng process. We begn by presentng the feature adaptaton algorthm, whch should be appled to each ndvdual tranng mage contanng the target object. We then show how to use ths adaptaton method to create a meanngful feature pool contanng multple types of wavelet features. Lastly, the adapted feature pool s used n Adaboost learnng to desgn a classfer to detect the object present n the tranng mages. Although we have used wavelet flters n our work, the technque s general n the sense that other local mage flters could also have been appled n the same settngs Feature Adaptaton In ths secton, we address the problem of generatng a set of local adaptve features for a gven mage of the object that we would lke to detect. In other words, our goal s to learn the parameters of wavelet features, ncludng poston, scale, and orentaton, such that the wavelet features match the local structures of the object. Ths s motvated by the wavelet networks proposed by Zhang [Zhang 1997] and ntroduced n computer vson by Krueger [Krueger 2001]. Consder a famly Ψ = ψ,, } of N two-dmensonal wavelet functons, where { n ψ 1 n N ψ ( x, y) s a partcular mother wavelet (e.g., Haar or Gabor) wth parameters n n ) T = ( cx, c y, θ, sx, s y. Here, x c and c y denote the translaton of the wavelet, s x and s y

7 denote the dlaton, and θ denotes the orentaton. The choce of N depends on the desred degree of precson for the representaton. Let I be an nput tranng mage, whch contans an nstance of the object for whch a detector s to be desgned. Frst, we ntalze the set of wavelets Ψ along the mage n a grd, wth the wavelets havng random orentatons, and scales ntalzed to a unfed value that s related to the densty wth whch the wavelets are dstrbuted. Fgure 4(a) llustrates ths process, for the specfc case when the objects to be detected are faces. Then, assumng I s dc-free, wthout loss of generalty, we mnmze the energy functon E = n, w mn I wψ wth respect to the wavelet parameter vectors n and ther correspondng weghts w. Fgures 4(b-e) show the wavelet features beng optmzed one by one to match the local mage structure of the object. In ths example, a Gabor wavelet was adopted as the mother wavelet, and we used the Levenberg-Marquardt method to solve the optmzaton problem. n 2 Fgure 4: Learnng adaptve features for a partcular tranng mage. (a) Input tranng mage wth wavelets ntalzed as a grd along the object regon, wth random orentatons and scales. (b-e) Wavelet features beng selected one by one, wth parameters (poston, scale, and orentaton) optmzed to match the local structure of the nput mage. Dfferently from most exstng dscrete approaches, the parameters n are optmzed n the contnuous doman and the wavelets are postoned wth sub-pxel accuracy. Ths assures that a maxmum of the mage nformaton can be encoded wth a small number of wavelets. Usng the optmal wavelets ψ n and weghts w, the mage I can be closely reconstructed by a lnear combnaton of the weghted wavelets: N Iˆ = ψ n. = 1 w There s an alternatve procedure to drectly compute the wavelet weghts w once the wavelet parameters n have been optmzed. Ths soluton s faster and more accurate than usng Levenberg-Marquardt optmzaton. If the wavelet functons are orthogonal, the weghts can be calculated by computng the nner products of the mage I wth each wavelet flter,.e., w < I, ψ >. In the more general cases where the wavelet functons may not be orthogonal, a = n

8 ~ famly of dual wavelets Ψ = { ~ ψ,, ~ n ψ } 1 n has to be consdered. Recall that the wavelet ~ ψ N n j s the dual wavelet of ψ f t satsfes the b-orthogonalty condton: < ψ n, ~ ψ n > = δ j, where, j n j, Ψ = { ψ n,, ψ 1 n N, ~ 1 ψ n = ( A, j ) ψ n j, j δ s the Kronecker delta functon. Gven an mage I and a set of wavelets } the optmal weghts are gven by w < I, ~ ψ >. It can be shown that where A = < ψ, ψ >, j n n j. = n Integratng Multple Features We have descrbed how to obtan adaptve features for a sngle object tranng mage. Now we proceed to generate a pool of adaptve features obtaned from multple tranng mages. Let χ = I,, I } be a set of object tranng mages. For each mage I, we generate a set of { 1 M adaptve features Ψ, usng the optmzaton method descrbed n the prevous secton. It s possble to ntegrate multple feature types by usng dfferent wavelet settngs for each tranng mage. More specfcally, each set Ψ s learned wth dfferent parameters, ncludng: Number of wavelets. The number of wavelets ndcates how many wavelet functons wll be optmzed for a partcular object mage. From ths set, we can further select a subset of functons that have the largest weghts, as wavelet flters wth larger assocated weghts n general tend to concde wth more sgnfcant local mage varatons; Wavelet type. In our system, we used only Haar and Gabor wavelets for the wavelet type parameter, but other feature types could also be consdered; Wavelet frequency. The frequency parameter controls the number of oscllatons for the wavelet flters; Group of features treated as a sngle feature. We also allow a group of wavelet functons to be treated as a sngle feature, whch s mportant to encode global object nformaton. Those parameters can be randomly ntalzed for each tranng mage n order to allow a varety of dfferent features to be n the pool. Ths ntalzaton process s fully automatc and allows the creaton of a compact and dversfed feature pool. All generated adaptve features for all object mages are then put together n a sngle pool of features Ω, defned as M Ω = Ψ. =1 As an example, Fgure 5 shows some adaptve features (Haar and Gabor wavelets wth dfferent frequences, orentatons and aspect ratos) learned from a dataset of frontal face mages. In the resultng feature pool, dfferent types of local wavelet flters and global features, whch are obtaned by groupng ndvdual wavelet functons, are present.

Fgure 5: Some examples of features present n the pool of learned adaptve features for a frontal face dataset. A large varety of dfferent wavelet flters are consdered.

9 Fgure 5: Some examples of features present n the pool of learned adaptve features for a frontal face dataset. A large varety of dfferent wavelet flters are consdered. The top row shows local wavelet functons, whereas the bottom row shows global features generated by combnng a set of local flters Learnng General Features for Detecton In Sectons and 3.2.2, we have descrbed a method to generate a pool of adaptve features from a set of tranng mages. Those features are selected accordng to how well they match each ndvdual tranng example. Now, n order to desgn an object detector, the goal s to select general features,.e., features from the pool that encode common characterstcs to all object samples. We use Adaboost learnng for both selectng general features and desgnng a classfer [Vola & Jones 2001]. A large set of negatve examples (.e., mages that do not contan the object to be detected) s used n addton to the prevously mentoned tranng mages (whch contan nstances of the object of nterest). The general features are those that best separate the whole set of object samples from non-object (negatve) samples durng classfcaton. We refer the reader to [Vola & Jones 2001] for more detals about the Adaboost classfer and the feature selecton mechansm. It s mportant to notce that other boostng technques mght be used n ths step, such as GentleBoost [Fredman et al. 2000], Real Adaboost [Schapre & Snger 1999], and Vector Boostng [Huang et al. 2005]. In a more nterestng way, our method could be ntegrated nto the learnng method recently proposed by Pham and Cham [Pham & Cham 2007], whch acheves extremely fast learnng tme n comparson to prevous methods. We beleve that ths method would have an even larger reducton n computatonal learnng tme f locally adapted features are used. Fgure 6: We use a Haar flter n the frst levels of the cascade detector n order to acheve realtme performance.

10 3.2.4 Applyng the Classfer and Effcency Consderatons Once the cascade classfer s desgned, we would lke to apply t to mages n order to fnd the object of nterest. A sldng wndow s moved pxel by pxel at dfferent mage scales. Startng wth the orgnal scale, the mage s rescaled by a gven factor (typcally 1.1 or 1.2) n each teraton, and the detector s appled to every wndow. Overlappng detecton results can be merged to produce a sngle result for each locaton and scale. However, even usng a cascade classfer, real-tme performance (25/30Hz) can not be acheved due to the tme requred to compute our features. We addressed ths problem by usng tradtonal Haar-lke/rectangle features n the frst levels of the cascade. Ths allows for effcent rejecton of background patches durng classfcaton. The mage patches that are not rejected by the Haar cascade are then fed nto a cascade of classfers usng our features. The choce of whch cascade level should be used to swtch from Haar features to our features s applcaton dependent. Swtchng n lower levels allows for more accuracy, but swtchng n hgher levels allows for more effcency. Fgure 6 depcts ths archtecture. 4. MULTI-SCALE HUMAN DETECTION IN SURVEILLANCE VIDEOS In the prevous secton, we presented a learnng framework to desgn object detectors based on adaptve local features. Ths framework can be appled to desgn people detectors n near-feld, md-feld, and far-feld survellance scenaros, whch deal wth mages wth dfferent levels of detal. In order to account for these dfferences, for each scenaro we desgned a person detector n a scale specfcally talored to the avalable resoluton. We now descrbe n detal our mplementaton of the framework for near-feld person detecton and dscuss ts advantages. The same concepts could be smlarly appled to mprove our detectors n the other scenaros (mdfeld and far-feld), as these detectors are based on local mage features as well. 4.1 Near-feld Person Detecton In near-feld survellance vdeos, resoluton s suffcent to make facal features of people clearly vsble. We developed a face detector and a trackng system usng the learnng method descrbed above to detect people n near-feld scenes. To desgn the face detector, we used a frontal face dataset contanng 4000 face mages for tranng purposes. Each tranng mage was cropped and rescaled to a 24x24 patch sze. A pool of adaptve features was generated by runnng the optmzaton process descrbed n Secton 3.2.1, wth dfferent wavelet settngs (wavelet type, frequency, etc.) for each sample. As a result, a pool of adaptve features was generated, contanng a large varety of wavelet flters. It takes less than a second to create hundreds of adaptve features for a partcular 24x24 sample n a conventonal 3GHz desktop computer. For the second step of the algorthm (learnng general features), we used an addtonal database of about 1000 background (non-face) mages from whch 24x24 patches are sampled. A cascade classfer was traned by consderng 4000 faces and 4000 non-faces at each level, where the nonface samples were obtaned through bootstrap [Rowley et al. 1998]. Each level n the cascade was traned to reject about half of the negatve patterns, whle correctly acceptng 99.9% of the face patterns. A fully traned cascade conssted of 24 levels. A Haar flter correspondng to the frst 18 levels of the cascade was used n our experments, n order to acheve real-tme performance. Fgure 7 shows the frst three general features selected by Adaboost. The frst selected feature gves more mportance to the eyes regon. The second selected feature s a local coarse-scale

Gabor wavelet wth three oscllatons, whch algn wth the eyes, nose, and mouth regons. The thrd feature s a global feature that encodes the rounded face shape.

A face s consdered to be correctly detected f the Eucldean dstance between the center of the detected box and the ground-truth s less than 50% of the wdth of the ground-truth box, and the wdth (.e., sze) of the detected face box s wthn ±70% of the wdth of the groundtruth box.

11 Gabor wavelet wth three oscllatons, whch algn wth the eyes, nose, and mouth regons. The thrd feature s a global feature that encodes the rounded face shape. Fgure 7: The frst three general features selected whle desgnng the face detector. The CMU+MIT frontal face test set, contanng 130 gray-scale mages wth 511 faces, was used for evaluaton. A face s consdered to be correctly detected f the Eucldean dstance between the center of the detected box and the ground-truth s less than 50% of the wdth of the ground-truth box, and the wdth (.e., sze) of the detected face box s wthn ±70% of the wdth of the groundtruth box. Fgure 8 shows the detected faces n one of the mages from the CMU+MIT dataset, and four vdeo frames taken from our survellance system. Fgure 8: Face detecton results n one of the CMU+MIT dataset mages, and four vdeo frames taken from our survellance system.

12 In order to show the effectveness of feature adaptaton pror to learnng, we compared our face detector to a classfer learned from a smlar feature pool, contanng the same number and type of features (Haar, Gabor, etc.) sampled unformly from the parameter space (at dscrete postons, orentatons, and scales), rather than adapted to the local structure of the tranng samples. Fgure 9(a) shows a plot of the Recever Operatng Characterstc (ROC) curves for ths comparson, demonstratng the superor performance of our method. In addton to achevng mproved detecton accuracy, the number of weak classfers needed for each strong classfer s sgnfcantly smaller n our method (see Fgure 9(b)). Ths has a drect mpact n both tranng and testng computatonal costs. We observed a reducton of about 50%. Fgure 9: (a) ROC Curve comparng classfers learned from adaptve (optmzed) and nonadaptve (non-optmzed) features n the CMU+MIT dataset. (b) Number of classfers for each level of the cascade n both methods. Our approach offers advantages n terms of detecton accuracy and reduced computatonal costs over tradtonal methods that use local features unformly sampled from the parameter space. Fgure 10: ROC Curves for our approach and tradtonal Haar features n the CMU+MIT dataset. We used only half of the number of features n the feature pool compared to Haar features and stll get superor performance.

13 Fgure 10 shows the comparson between our approach and tradtonal Haar-lke/rectangle features. The cascade detector based on Haar features also conssted of 24 levels, and was learned usng the same tranng set. The feature pool, however, was twce as large as the one used by our approach, contanng about features. Wth half of the features n the pool, we acheve superor accuracy and a faster learnng tme. Table 1 confrms ths. Although our optmzed features can cause overfttng when a small number of tranng samples are consdered, ths ssue does not arse when thousands of faces are used for learnng. Table 1. By learnng adaptve and general features, we can use a smaller feature pool, whch results n a reducton n tranng tme, whle stll mantanng superor performance n detecton rate, when compared to a tradtonal pool of Haar features. 4.2 Md-feld Person Detecton Feature Pool Number of Features Learnng Tme Haar Features About 5 days Our Approach About 3 days In md-feld scenes, facal features may not be vsble due to poor resoluton. However, the lnes that delmt the head and shoulders of an ndvdual are stll nformatve cues to fnd people n mages. For these scenes, we developed a system for trackng and detecton whch locates people by scannng a wndow through the mage and applyng a head and shoulders detector at every poston and scale. Ths detector s desgned accordng to the same learnng framework from [Vola & Jones 2001] (as we mplemented ths classfer pror to our research on feature optmzaton),.e., t s a cascade classfer based on Adaboost learnng and Haar features. Smlarly to the face detector, a tranng set of 4000 mages contanng the head and shoulders regon was used for tranng. As ths classfer s based on feature selecton from a pool of local features, t s part of our current work to apply the learnng framework from Secton 3 to frst create a pool of adaptve local features and then select the most dscrmnatve features usng Adaboost. Fgure 11 llustrates the detecton results from the head and shoulders detector n our system for md-feld scenes. Fgure 11: Sample detectons n md-feld scenes, usng a head and shoulders detector based on Haar features.

14 4.3 Far-feld Person Detecton In far-feld magery, pedestrans may appear as small as 30-pxels tall. In our scenaro, the camera s known to be n a fxed poston, makng t feasble to use background modelng technques to segment movng objects. In [Chen et al. 2008], we descrbed how our far-feld survellance system classfes blobs obtaned from background subtracton nto one of three classes: cars, people and groups of people. In [Vola et al. 2003], a system for pedestran detecton n vdeos whch extends the technque from [Vola & Jones 2001] was proposed. It augments the feature space by ncludng Haar features computed from dfferences between pars of subsequent frames. Our locally adaptve feature learnng framework could also be appled n ths case, n order to pre-select the features that best adapt to pedestrans n ndvdual frames or to dfferences n subsequent frames due to movement. Then, the resultng detector could be combned wth our classfer based on foreground measurements n order to confrm that the blobs classfed as people n fact do correspond to people. 5. HUMAN TRACKING AND MOTION ANALYSIS Human trackng and moton analyss are key components of vdeo survellance systems, and very actve research areas. Recently, researchers focused on combnng detecton and trackng nto unfed frameworks. Andrluka et al. [Andrluka et al. 2008] proposed a unfed framework that combnes both pedestran detecton and trackng technques. Detecton results of human artculatons usng a herarchcal Gaussan process latent varable model are further appled n a Hdden Markov Model to perform pedestran trackng. Okuma et al. [Okuma et al. 2004] proposed a target detecton and trackng framework by combnng two very popular technques: mxture partcle flters for mult-target trackng and Adaboost for object detecton. Another Adaboost-based method s presented n [Avdan 2005] by Avdan. In hs work, weak classfers are combned to dstngush foreground objects from the background. The mean-shft algorthm s appled to track the objects usng the confdence map generated by the Adaboost method. Lebe et al. [Lebe et al. 2005, 2007] used a top-down segmentaton approach to localze pedestrans n the mage usng both local and global cues. An mplct human shape model s bult to detect pedestran canddates that are further refned by the segmentaton process, usng Chamfer matchng on the slhouettes. Ramanan et al. [Ramanan et al. 2007] proposed a trackng by model-buldng and detecton framework for trackng people n vdeos. Predefned human models are combned wth canddates detected n the vdeo to form actual human clusters (models). These models are then used to detect the person n subsequent mages and perform trackng. Other co-tranng based approaches are [Javed et al. 2005, Grabner et al. 2006]. They proposed detectng objects and further usng them for onlne updatng of the object classfers. Object appearance features derved from PCA are teratvely updated usng the samples wth hgh detecton confdence. For moton analyss, Stauffer et al. [Stauffer et al. 2000] proposed a moton trackng framework. Each pxel n the mage s modeled by a mxture of Gaussan dstrbutons that represents ts color statstcs. Object tracks are formed by correlatng moton segments across frames usng a cooccurrence matrx. Buzan et al. [Buzan et al. 2004] proposed a clusterng technque to group three-dmensonal trajectores of tracked objects n vdeos. A novel measure called the Longest Common Subsequence s employed to match trajectory projectons on the coordnate axes. Junejo and Foroosh [Junejo & Foroosh 2008] proposed a trajectory groupng method based on normalzed cuts. The matchng crtera nclude spatal proxmty, moton characterstcs, curvature, and absolute world velocty, whch are based on automated camera calbraton.

15 The IBM SSS provdes functons for trackng people and detectng trajectory anomales. The system tracks faces and people, analyzes the paths of tracked people, learns a set of repeated patterns that occur frequently, and detects when a person moves n a way nconsstent wth these normal patterns. 5.1 Face Trackng Once a face s detected n a partcular vdeo frame, t s necessary to track t n order to analyze the trajectory of the person and dentfy a sngle key frame of the face, to be stored n a database. Our face trackng method s based on applyng the face detector to every frame of the vdeo sequence. In order to mantan the track of the face even when the face detector fals, we also use a smple correlaton-based tracker. More specfcally, when a face s detected, the correlatonbased tracker s trggered. For the subsequent frame, f the face detecton fals, the track s updated wth the wndow gven by the correlaton tracker. Otherwse, f the face detector reports a wndow result wth a close poston and sze to the current trackng wndow, then ths result s used to update the track. Ths mechansm s mportant to avod drftng. In order to mprove the effcency of our detector and enable real-tme face trackng (25/30Hz) on conventonal desktop computers, we use the followng technques: We only apply the detector at specfc scales provded by the user and at moton regons detected by background subtracton; An nterleavng technque (explaned below) combnes vew-based detectors and trackng. Fgure 12: Our survellance system nterleaves vew-based detectors to save frame rate for trackng. In most survellance scenaros, human faces appear n mages n a certan range of scales. In our system, the user can specfy the mnmum and maxmum possble face szes for a partcular camera, so that face detecton s appled only for sub-wndows wthn ths range of scales. We also apply background subtracton, usng statstcal mxture modelng [Tan et al. 2005], to prune sub-wndows that do not le n moton regons. A skn color detector could also be used to speed up the processng. A problem faced by most exstng systems s the computatonal tme requred to run a set of vewbased detectors n each frame. Ths causes large nter-frame mage varaton, posng a problem for trackng. We handled ths ssue by usng an nterleavng technque that alternates vew based

16 detectors n each frame. Ths dea s llustrated for frontal and profle face detecton n Fgure 12. In each frame, rather than runnng both frontal and profle detectors, we run just one detector for a specfc vew (e.g., frontal vew). The detector for the other vew (profle vew) s appled n the subsequent frame and so forth. Ths allows us to mprove the frame rate by 50%, whle facltatng trackng due to less nter-frame varaton. Trackng s termnated when there are no foreground regons (obtaned from the background subtracton module) near the current trackng wndow or when the face detector fals consecutvely for a gven tme or number of frames specfed by the user. 5.2 Human Trackng and Moton Analyss Human Trackng Trackng can be seen as a problem of assgnng consstent denttes to vsble objects. We obtan a number of observatons of objects (detectons by the background subtracton algorthm) over tme, and we need to label these so that all observatons of a gven person are gven the same label. When one object passes n front of another, partal or total occluson takes place, and the background subtracton algorthm detects a sngle movng regon. By handlng occlusons, we hope to be able to segment ths regon, approprately labelng each part and stll mantanng the correct labels when the objects separate. In more complex scenes, occlusons between many objects must be handled [Senor et al. 2001]. When objects are wdely separated, a smple boundng box tracker s suffcent to assocate a track dentty wth each foreground regon. Boundng box trackng works by measurng the dstance between each foreground regon n the current frame and each object that was tracked n the prevous frame. If the object overlaps wth the regon or les very close to t, then a match s declared. If the foreground regons and tracks form a one-to-one mappng, then trackng s complete and the tracks are extended to nclude the regons n the new frame usng ths assocaton. If a foreground regon s not matched by any track, then a new track s created. If a track does not match any foreground regons, then t contnues at a constant velocty, but t s consdered to have left the scene once t fals to match any regons for a few frames. Occasonally, a sngle track may be assocated wth two regons. For a few frames, ths s assumed to be a falure of background subtracton and both regons are assocated wth the track. If there are consstently two or more foreground regons, then the track s splt nto two, to model cases as when a group of people separate, a person leaves a vehcle, or an object s deposted by a person. Fgure 13 shows some people trackng results Moton Analyss In order to perform moton analyss, as shown n Fgure 14, the system begns by detectng the locatons where objects enter and ext the scene. The start and end ponts of tracks are clustered to

fnd regons where tracks often begn or end. These ponts tend to be where paths or roads reach the edge of the camera s feld of vew.

from the center), enterng/extng to the rght sde (from the road on the rght or from the center), or movng horzontally across the road.

17 fnd regons where tracks often begn or end. These ponts tend to be where paths or roads reach the edge of the camera s feld of vew. Havng clustered these locatons, we classfy the trajectores by labelng a track wth ts start and end locaton (or as an anomaly when t starts or ends n an unusual locaton, such as a person walkng through the bushes). For example, when we cluster trajectores for the camera placed at the entrance to our buldng, trajectores are classfed nto one of 5 classes enterng/extng to the left sde (from the road on the left or from the center), enterng/extng to the rght sde (from the road on the rght or from the center), or movng horzontally across the road. We then apply a secondary clusterng scheme to further detect anomalous behavor. Ths scheme operates as follows: the trajectores of all tracks wth a gven start/end locaton labelng are resampled and clustered together. Ths gves an average or prototypcal track together wth standard devatons. Most tracks gong from a gven entry locaton to a gven ext wll le close to the prototypcal track, wth typcal normal varaton ndcated by the length of the crossbars. Tracks that wander outsde ths normal area can be labeled as anomalous and may warrant further nvestgaton. The prncpal components of the cluster ndcate typcal modes of varaton or egentracks, provdng a more accurate model of normal vs. abnormal. Fgure 15 shows some examples of abnormal actvtes (people loterng) n front of the IBM Hawthorne buldng. (a) (b) Fgure 13: Examples of people trackng results. (a) People trackng n a shoppng mall; (b) trackng of hockey players Mon Tue Wed AM 9AM 11AM 1PM (a) (b) (c) Fgure 14: (a) Summary vew showng the retreval of the trajectores of all events that occurred n the parkng lot over a 24 hour perod. The trajectory colors are coded: startng ponts are shown n whte and endng ponts are dsplayed n red. (b) Actvty dstrbuton over an extended tme perod, where the tme s shown on the x-axs and the number of people n the area s shown on the y-axs. Each lne represents a dfferent day of the week. (c) Unsupervsed behavor

18 analyss. Object entrance/departure zones (green ellpses) and prototypcal tracks (brown curves) wth typcal varaton (crossbars) are dsplayed. Fgure 15: Abnormal actvty analyss (people loterng n front of the IBM Hawthorne buldng). 6. CASE STUDY FOR REAL VIDEO SURVEILLANCE APPLICATIONS There are many applcatons for people detecton, trackng, and moton analyss n vdeo survellance. In ths secton, we present several case studes ncludng people search, retal loss preventon, people countng, and dsplay effectveness. 6.1 People Search by Clothng Color An mportant aspect of human dentfcaton for fndng and/or matchng a person to another person or descrpton, n the short term (e.g., the same day), s based on clothng. Color s one of the most promnent cues for descrbng clothng. Here, we present our methodology for categorzng the color of people s clothes for the purposes of people search. We perform clothng color classfcaton based on two body parts segmented from the human slhouette: the torso and the legs. These are determned usng the normatve spatal relatonshp to the face locaton, as detected n Secton 4.1. The torso or upper body regon represents what s prmarly the shrt or jacket, whle the legs or lower body regon represents the pants or skrt of the tracked human extracted from the camera. Three ssues are crtcal for successful color classfcaton. The frst s the ssue of color constancy. People perceve an object to be of the same color across a wde range of llumnaton condtons. However, the actual pxels of an object, whch are perceved by a human to be of the same color, may have values (when sensed by a camera) whch range across the color spectrum dependng on the lghtng condtons. Secondly, movng objects extracted from vdeo are not perfectly segmented from the background. Shadows are often part of the object and errors exst n the segmentaton due to the smlarty of the object and the background model. Lastly, complex objects (torsos and legs, n ths case) may have more than one color. The method descrbed here s based on acqurng a normalzed cumulatve color hstogram for each tracked object n the b-conc HSL (hue, saturaton, lumnance) space. The method s desgned wth mechansms to dvde ths space va parameters that can be set by a user or by actve color measurements of the scene. The key dea s to ntellgently quantze the color space based on the relatonshps between hue, saturaton and lumnance. As color nformaton s lmted by both lack of saturaton and ntensty, t s necessary to separate the chromatc from the

19 achromatc space along surfaces defned by a functon of saturaton and ntensty n the b-conc space. In partcular, for each tracked body part (torso or legs), the color classfer wll create a hstogram of a small number of colors. We usually use sx colors: black, whte, red, blue, green and yellow. However, for clothng from people detected by the cameras used n ths study (our faclty), we use the followng colors based on the emprcal dstrbuton and our ablty to dscern: red, green, blue, yellow, orange, purple, black, grey, brown, and bege. The hstogram s created as follows. Each frame of the tracked object that s used for hstogram accumulaton (every frame the body parts detector fnds) s frst converted from RGB to HSL color space. Next, the HSL space s quantzed nto a small number of colors. In order to quantze ths space nto a small number of colors, we determne the angular cutoffs between colors. When we use sx colors (black, whte, yellow, green, blue, red), we need only four cutoffs between the hues: yellow/green, green/blue, blue/red and red/yellow. However, varatons due to lghtng condtons, object textures and object-to-camera vewpont lead to dfferences n brghtness and color saturaton. Therefore, we also need to specfy lghtness and saturaton cutoffs. Here, t s nterestng to note that saturaton and ntensty are related. Both propertes can make the hue of a pxel ndscernble. For ntensty, ths occurs when the lght s too brght or too dark. For saturaton, t happens when there s nsuffcent saturaton. However, as the brghtness gets too low or too hgh, the necessary saturaton ncreases. In general, as ntensty ncreases from 0 up to halfway (the central horzontal cross-secton of the b-conc) or decreases from the maxmum (whte) down to halfway, the range of pxels wth vsble or dscernable hue ncreases. In summary, we frst quantze the HSL space based on hue. We subsequently re-label pxels as ether whte or black dependng on whether they le outsde the lghtness/saturaton curve above or below the horzontal md-plane. Ths s related to earler work n color segmentaton performed by Tseng and Chang [Tseng & Chang 1994]. Fgure 16: Results of search for people wth red torso (shrts). Usng ths color classfcaton scheme, the system records the colors of torso and legs for each person for whom the detecton of the face and the body parts s successful. Ths nformaton s sent to the database for retreval. Users may search for people wth specfc colors of torso and legs. Fgure 16 shows the results of a search for people wth red torsos (shrts). Fgure 17 shows the results of a search for people wth whte torso and red legs. The fnal fgure (Fgure 18) n ths secton shows the results of a search for people wth yellow torso and black legs. In each pcture, the key frames for each event (person) are shown on the rght. The user may select any key frame

20 to watch the assocated vdeo clp of the person as they walk through the scene. A sngle frame of ths vdeo s shown at the left of each fgure. Fgure 17: Results of search for people wth whte torso (shrt) and red legs (pants/skrt). Fgure 18: Results of search for people wth yellow torso (shrts) and black legs (pants/skrt). 6.2 Retal Loss Preventon There are many practcal applcatons for the people search capabltes descrbed n ths chapter. Many of these applcatons nvolve matchng people across cameras, wth dentfcaton nformaton, or wth queres. Matchng people across cameras s useful to track a person across cameras, thereby lnkng a person n one camera to ther movements and actons n another. In ths way, a person can be tracked across non-overlappng cameras. Many mportant securty tasks nvolve ths capablty. For example, a person who breaks through a securty checkpont n an arport could be tracked to a dfferent part of the buldng. A person can also be matched aganst dentfcaton nformaton. Ths nformaton may be assocated wth a badge or other type of ID, and may be used to corroborate the legtmacy of ther access to a buldng or a cash regster. Lastly, a person can be matched aganst a query to locate them across several cameras and tme perods. We have bult an applcaton to match

people to mprove the detecton of returns fraud. In ths applcaton, a user s lookng to match a person at the returns counter n a department store wth people enterng the store.

The return could be done ether wthout a recept (n stores that have lberal return polces), or usng a recept from a prevously purchased (and kept) tem.

21 people to mprove the detecton of returns fraud. In ths applcaton, a user s lookng to match a person at the returns counter n a department store wth people enterng the store. Returns fraud can take one of a number of forms. Our retal customer was partcularly nterested n the case of returnng tems that were just pcked up n the store, but were never bought. The return could be done ether wthout a recept (n stores that have lberal return polces), or usng a recept from a prevously purchased (and kept) tem. Our approach, descrbed n more detal n a prevous paper [Senor, 2007], allows loss preventon staff to quckly determne whether a person returnng an tem entered the store carryng that tem. The system works by detectng and trackng customers at entrances and customer servce desks and assocatng the two events. Ths soluton only requres cameras at the store entrances and the return counters, beng smpler than approaches that rely on trackng customers throughout the store (requrng many cameras and very relable camera hand-off algorthms) and must be contnuously montored to determne whether tems are pcked up. Two cameras at the customer servce desk record actvty there, ncludng the appearance of customers returnng tems. A separate set of cameras ponts at the doors and captures all actvty of people enterng and leavng the store. Fgure 19 shows the felds of vew of two such cameras. Our approach automatcally segments events n each of these cameras, flters them and then provdes a user nterface whch allows the assocaton of each returns event wth the door entrance event that corresponds to when the person came nto the store. At the customer servce desk, the face trackng algorthm tracks customers faces, generatng a sngle event per customer. Customers at the doors are tracked wth our appearance-based tracker [Senor 2001]. The returns fraud nterface provdes ntutve selecton and browsng of the events, summarzed by presentaton of key frames (at both scales), tmestamps and orgnal vdeo clps (from DVR or meda server). Search typcally begns by selectng a return event from the TLOG (see Fgure 20). In response to ths, the nterface dsplays the people found at the customer servce counter near that tme. Selectng one of these then dsplays people enterng the store shortly before the selected event. The user can then browse through the entrance events, usng full-frame and zoomed-n key frames as well as orgnal vdeo, and, when a match s found, determne whether a fraud has taken place. The fundamental ndexng attrbute of the database s tme. All devces are synchronzed and events are tme-stamped. Temporal constrants from real world condtons are exploted to lmt the events dsplayed. In our mplementaton, a user may also use color nformaton (Secton 6.1) to assst n fndng lkely matches between the people enterng and the person at the returns counter. Fgure 19: Vews from the customer servce desk (left) and a door (rght). The regon of unnterest n whch detectons are gnored to reduce false postves s outlned n blue. Alert trpwres (enter and leave) are drawn on the door vew.

In the retal sector, accurate people counts provde a relable base for hghlevel store operaton analyss and mprovements, such as traffc load montorng, staffng assgnment, converson rate estmaton, etc.

22 Fgure 20: Interface for browsng and searchng the TLOG (transacton log), showng a table of return transactons. 6.3 People Countng As a drect applcaton of person detecton and search, people countng has sgnfcant mportance n varous scenaros. In the retal sector, accurate people counts provde a relable base for hghlevel store operaton analyss and mprovements, such as traffc load montorng, staffng assgnment, converson rate estmaton, etc. In our soluton, we ncorporate a people countng framework whch effectvely detects and tracks movng people n b-drectonal traffc flows, denoted as entres and exts. Countng results are ndexed nto the logcal relatonal database for generatng future statstcal reports. An example of a report s shown n Fgure 21. We demonstrate the people countng applcaton n the cafetera traffc scenaro at our nsttuton. In addton to the statstcal reports, ndvdual countng events are also avalable for nvestgaton by exhbtng correspondng key frames. Dsjont search ntervals enable the system s capablty of performng cross-tme traffc comparson. For nstance, n Fgure 22, the average traffc counts between three tme ntervals on fve days of a week are compared. For mornng tmes (8:00AM-11:00AM), Monday has the fewest count. Ths s due to the fact that employees tend to have a buser mornng after the weekend, and thus do not vst the cafetera as often as on other weekdays. For lunch tme (11:00AM-2:00PM) and afternoon tea tme (2:00PM-5:00PM), Frday has the least traffc volume. Ths s due to the fact that some employees leave work earler, and thus do not have lunch or afternoon tea/coffee at the company. Fgure 21: People countng statstcal report.

Fgure 22: Weekly comparson of the average people counts for three tme ntervals at the IBM Hawthorne cafetera. People countng can also be assocated wth non-vsual sensor data.

23 Fgure 22: Weekly comparson of the average people counts for three tme ntervals at the IBM Hawthorne cafetera. People countng can also be assocated wth non-vsual sensor data. One applcaton s badge use enforcement, where securty wants to ensure that employees always carry dentfcaton badges. In certan workng envronments, employees need to swpe ther badges to access crtcal areas. In order to detect people enterng the crtcal areas wthout usng ther badges (e.g., by talgatng other people who properly swped ther badges), the number of enterng people s estmated by applyng a person detector (Secton 4.1). The people count s then matched wth the nput sgnals obtaned from the badge reader. If the number obtaned from the people countng algorthm s greater than the one ndcated by the badge read sgnals, then an alert s generated. 6.4 Dsplay Effectveness In retal stores, a crtcal problem s to effectvely dsplay merchandse at certan places to acheve maxmum customer attenton and optmal revenue. For nstance, the store management s very nterested n the average number of people who have stopped n front of a partcular tem each hour. If the number of people lookng at ths tem s consstently far less than the number of people who have stopped by ts neghborng tems, then the store management could conclude that ths partcular tem does not attract enough customers. Thus, t brngs less value to the store revenue, and may be removed from the dsplay or relocated to a less valued regon. Ths analyss can be accomplshed by constructng a heat map of the store plan, whch represents the traffc densty n the store. One such heat map s demonstrated n Fgure 23, where warmer values represent hgher number of people passng through and stoppng by the correspondng locaton, and colder values represent lower traffc. Based on ths analyss, the duraton of the stay can also

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a