Switching Convolutional Neural Network for Crowd Counting


Deepak Babu Sam, Shiv Surya, R. Venkatesh Babu
Indian Institute of Science, Bangalore, INDIA 560012
bsdeepak@grads.cds.iisc.ac.in, shiv.surya314@gmail.
(Footnote: Equal contribution)

Abstract

We propose a novel crowd counting model that maps a given crowd scene to its density. Crowd analysis is compounded by a myriad of factors like inter-occlusion between people due to extreme crowding, high similarity of appearance between people and background elements, and large variability of camera view-points. Current state-of-the-art approaches tackle these factors by using multi-scale CNN architectures, recurrent networks and late fusion of features from multi-column CNN with different receptive fields. We propose a switching convolutional neural network that leverages variation of crowd density within an image to improve the accuracy and localization of the predicted crowd count. Patches from a grid within a crowd scene are relayed to independent CNN regressors based on the crowd count prediction quality of the CNN established during training. The independent CNN regressors are designed to have different receptive fields and a switch classifier is trained to relay the crowd scene patch to the best CNN regressor. We perform extensive experiments on all major crowd counting datasets and evidence better performance compared to current state-of-the-art methods. We provide interpretable representations of the multichotomy of space of crowd scene patches inferred from the switch. It is observed that the switch relays an image patch to a particular CNN column based on density of crowd.

Figure 1. Sample crowd scenes from the ShanghaiTech dataset [19] are shown.

1. Introduction

Crowd analysis has important geo-political and civic applications. Massive crowd gatherings are commonplace at candle-light vigils, democratic protests, religious gatherings and presidential rallies. Civic agencies and planners rely on crowd estimates to regulate access points and plan disaster contingency for such events. Critical to such analysis is crowd count and density. In principle, the key idea behind crowd counting is self-evident: density times area.
However, crowds are not regular across the scene. They cluster in certain regions and are spread out in others. Typical static crowd scenes from the ShanghaiTech dataset [19] are shown in Figure 1. We see extreme crowding and high visual resemblance between people and background elements (e.g. urban facade) in these crowd scenes, which factors in further complexity. Different camera view-points in various scenes create perspective effects resulting in large variability of scales of people.

Crowd counting as a computer vision problem has seen drastic changes in the approaches, from early HOG based head detections [6] to CNN regressors [18, 19, 9] predicting the crowd density. CNN based regressors have largely outperformed traditional crowd counting approaches based on weak representations from local features. We build on the performance of CNN based architectures for crowd counting and propose Switching Convolutional Neural Network (Switch-CNN) to map a given crowd scene to its density.

Switch-CNN leverages the variation of crowd density within an image to improve the quality and localization of the predicted crowd count. Independent CNN crowd density regressors are trained on patches sampled from a grid in a given crowd scene. The independent CNN regressors are chosen such that they have different receptive fields and field of view. This ensures that the features learned by each CNN regressor are adapted to a particular scale. This renders Switch-CNN robust to large scale and perspective variations of people observed in a typical crowd scene. A particular CNN regressor is trained on a crowd scene patch if the performance of the regressor on the patch is the best. A switch classifier is trained alternately with the training of multiple CNN regressors to correctly relay a patch to a particular regressor. The joint training of the switch and regressors helps augment the ability of the switch to learn the complex multichotomy of space of crowd scenes learnt in the differential training stage.

To summarize, in this paper we present:

• A novel generic CNN architecture, Switch-CNN, trained end-to-end to predict crowd density for a crowd scene.
• Switch-CNN maps crowd patches from a crowd scene to independent CNN regressors to minimize count error and improve density localization, exploiting the density variation within a scene.
• We evidence state-of-the-art performance on all major crowd counting datasets including the ShanghaiTech dataset [19], the UCF_CC_50 dataset [6] and the WorldExpo'10 dataset [18].

2. Related Work

Crowd counting has been tackled in computer vision by a myriad of techniques. Crowd counting via head detections has been tackled by [17, 16, 14] using motion cues and appearance features to train detectors. A recurrent network framework has been used for head detections in crowd scenes by [12]. They use the deep features from GoogLeNet [13] in an LSTM framework to regress bounding boxes for heads in a crowd scene. However, crowd counting using head detections has limitations as it fails in dense crowds, which are characterized by high inter-occlusion between people.

In crowd counting from videos, [3] use image features like Tomasi-Kanade features in a motion clustering framework. Video is processed by [10] into a set of trajectories using a KLT tracker. To prevent fragmentation of trajectories, they condition the signal temporally and spatially. Such tracking methods are unlikely to work for single image crowd counting due to lack of temporal information.
Early works in still image crowd counting like [6] employ a combination of handcrafted features, namely HOG based detections, interest points based counting and Fourier analysis. These weak representations based on local features are outperformed by modern deep representations. In [18], CNNs are trained to regress the crowd density map. They retrieve images from the training data similar to a test image using density and perspective information as the similarity metric. The retrieved images are used to fine-tune the trained network for a specific target test scene and the density map is predicted. However, the model's applicability is limited by the fine-tuning required for each test scene and by perspective maps for train and test sequences which are not readily available. An AlexNet [7] style CNN model is trained by [15] to regress the crowd count. However, the application of such a model is limited for crowd analysis as it does not predict the distribution of the crowd. In [9], a multi-scale CNN architecture is used to tackle the large scale variations in crowd scenes. They use a custom CNN network, trained separately for each scale. Fully-connected layers are used to fuse the maps from each of the CNNs trained at a particular scale, and regress the density map. However, the counting performance of this model is sensitive to the number of levels in the image pyramid as indicated by performance across datasets.

Multi-column CNNs used by [2, 19] perform late fusion of features from different CNN columns to regress the density map for a crowd scene. In [19], shallow CNN columns with varied receptive fields are used to capture the large variation in scale and perspective in crowd scenes. Transfer learning is employed by [2] using a VGG network employing dilated layers complemented by a shallow network with different receptive field and field of view. Both models fuse the feature maps from the CNN columns by weighted averaging via a 1×1 convolutional layer to predict the density map of the crowd.
However, the weighted averaging technique is global in nature and does not take into account the intra-scene density variation. We build on the performance of multi-column CNN and incorporate a patch based switching architecture in our proposed architecture, Switch-CNN, to exploit local crowd density variation within a scene (see Sec 3.1 for more details of architecture).

3. Our Approach

Convolutional architectures like [18, 19, 9] have learnt effective image representations, which they leverage to perform crowd counting and density prediction in a regression framework. Traditional convolutional architectures have been modified to model the extreme variations in scale induced in dense crowds by using multi-column CNN architectures with feature fusion techniques to regress crowd density.

In this paper, we consider a switching CNN architecture (Switch-CNN) that relays patches from a grid within a crowd scene to independent CNN regressors based on a switch classifier. The independent CNN regressors are chosen with different receptive fields and field-of-view as in multi-column CNN networks to augment the ability to model large scale variations. A particular CNN regressor is trained on a crowd scene patch if the performance of the regressor on the patch is the best. A switch classifier is trained alternately with the training of multiple CNN regressors to correctly relay a patch to a particular regressor. The salient properties that make this model excellent for crowd analysis are (1) the ability to model large scale variations (2) the facility to leverage local variations in density within a crowd scene. The ability to leverage local variations in density is important as the weighted averaging technique used in multi-column networks to fuse the features is global in nature.

Figure 2. Architecture of the proposed model, Switch-CNN, is shown. A patch from the crowd scene is highlighted in red. This patch is relayed to one of the three CNN regressor networks based on the CNN label inferred from Switch. The highlighted patch is relayed to regressor $R_3$ which predicts the corresponding crowd density map. The element-wise sum over the entire density map gives the crowd count of the crowd scene patch.

input: $N$ training image patches $\{X_i\}_{i=1}^{N}$ with ground truth density maps $\{D^{GT}_{X_i}\}_{i=1}^{N}$
output: Trained parameters $\{\Theta_k\}_{k=1}^{3}$ for $R_k$ and $\Theta_{sw}$ for the switch

Initialize $\Theta_k\ \forall k$ with random Gaussian weights
Pretrain $\{R_k\}_{k=1}^{3}$ for $T_p$ epochs: $R_k \leftarrow f_k(\cdot\,;\Theta_k)$

/* Differential Training for $T_d$ epochs */
/* $C_i^k$ is count predicted by $R_k$ for input $X_i$ */
/* $C_i^{GT}$ is ground truth count for input $X_i$ */
for $t = 1$ to $T_d$ do
    for $i = 1$ to $N$ do
        $l_i^{best} = \arg\min_k |C_i^k - C_i^{GT}|$
        Backpropagate $R_{l_i^{best}}$ and update $\Theta_{l_i^{best}}$

/* Coupled Training for $T_c$ epochs */
Initialize $\Theta_{sw}$ with VGG-16 weights
for $t = 1$ to $T_c$ do
    /* generate labels for training switch */
    for $i = 1$ to $N$ do
        $l_i^{best} = \arg\min_k |C_i^k - C_i^{GT}|$
    $S_{train} = \{(X_i, l_i^{best}) \mid i \in [1, N]\}$
    /* Training switch for 1 epoch */
    Train switch with $S_{train}$ and update $\Theta_{sw}$
    /* Switched Differential Training */
    for $i = 1$ to $N$ do
        /* Infer choice of $R_k$ from switch */
        $l_i^{sw} = \arg\max f_{switch}(X_i;\Theta_{sw})$
        Backpropagate $R_{l_i^{sw}}$ and update $\Theta_{l_i^{sw}}$

Algorithm 1: Switch-CNN training algorithm is shown. The training algorithm is divided into stages coded by color. Color code index: Differential Training, Coupled Training, Switch Training.

3.1. Switch-CNN

Our proposed architecture, Switch-CNN, consists of three CNN regressors with different architectures and a classifier (switch) to select the optimal regressor for an input crowd scene patch. Figure 2 shows the overall architecture of Switch-CNN. The input image is divided into 9 non-overlapping patches such that each patch is 1/3rd of the image. For such a division of the image, crowd characteristics like density, appearance etc. can be assumed to be consistent in a given patch for a crowd scene. Feeding patches as input to the network helps in regressing different regions of the image independently by a CNN regressor most suited to patch attributes like density, background, scale and perspective variations of crowd in the patch.

We use three CNN regressors introduced in [19], $R_1$ through $R_3$, in Switch-CNN to predict the density of crowd. These CNN regressors have varying receptive fields that can capture people at different scales. The architecture of each of the shallow CNN regressors is similar: four convolutional layers with two pooling layers. $R_1$ has a large initial filter size of 9×9 which can capture high level abstractions within the scene like faces, urban facade etc. $R_2$ and $R_3$ with initial filter sizes 7×7 and 5×5 capture crowds at lower scales, detecting blob like abstractions.

Patches are relayed to a regressor using a switch. The switch consists of a switch classifier and a switch layer. The switch classifier infers the label of the regressor to which the patch is to be relayed. A switch layer takes the label inferred from the switch classifier and relays the patch to the correct regressor. For example, in Figure 2, the switch classifier relays the patch highlighted in red to regressor $R_3$. The patch has a very high crowd density. Switch relays it to regressor $R_3$ which has a smaller receptive field: ideal for detecting blob like abstractions characteristic of patches with

high crowd density. We use an adaptation of the VGG16 [11] network as the switch classifier to perform 3-way classification. The fully-connected layers in VGG16 are removed. We use global average pool (GAP) on Conv5 features to remove the spatial information and aggregate discriminative features. GAP is followed by a smaller fully connected layer and a 3-class softmax classifier corresponding to the three regressor networks in Switch-CNN.

Ground Truth. Annotations for crowd images are provided as point annotations at the center of the head of a person. We generate our ground truth by blurring each head annotation with a Gaussian kernel normalized to sum to one to generate a density map. Summing the resultant density map gives the crowd count. Density maps ease the difficulty of regression for the CNN as the task of predicting the exact point of head annotation is reduced to predicting a coarse location. The spread of the Gaussian in the above density map is fixed. However, a density map generated from a fixed spread Gaussian is inappropriate if the variation in crowd density is large. We use geometry-adaptive kernels [19] to vary the spread parameter of the Gaussian depending on the local crowd density. It sets the spread of the Gaussian in proportion to the average distance of k-nearest neighboring head annotations. The inter-head distance is a good substitute for perspective maps which are laborious to generate and unavailable for every dataset. This results in a lower degree of Gaussian blur for dense crowds and a higher degree for regions of sparse density in a crowd scene. In our experiments, we use both the geometry-adaptive kernel method and the fixed spread Gaussian method to generate ground truth density depending on the dataset. The geometry-adaptive kernel method is used to generate ground truth density maps for datasets with dense crowds and large variation in count across scenes. Datasets that have sparse crowds are trained using density maps generated from the fixed spread Gaussian method.
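As an illustration of the ground truth generation described above, the following is a minimal sketch in plain Python rather than the implementation used for the experiments. The function names, the proportionality factor `beta`, the neighbour count `k` and the fallback spread are assumptions made for this example; only the idea (one unit-mass Gaussian per head, with a spread that is either fixed or proportional to the mean distance to the k nearest heads) comes from the text.

```python
import math

def adaptive_sigmas(heads, k=3, beta=0.3, default=4.0):
    """Per-head Gaussian spread, proportional to the mean distance to the
    k nearest annotated heads. k, beta and default are illustrative values."""
    sigmas = []
    for i, (xi, yi) in enumerate(heads):
        dists = sorted(math.hypot(xi - xj, yi - yj)
                       for j, (xj, yj) in enumerate(heads) if j != i)
        nearest = dists[:k]
        sigmas.append(beta * sum(nearest) / len(nearest) if nearest else default)
    return sigmas

def density_map(heads, height, width, sigmas):
    """Place one Gaussian per head (integer pixel coordinates (x, y)).
    Each Gaussian is normalized over the pixels it covers, so every head
    contributes exactly one to the sum of the map."""
    dmap = [[0.0] * width for _ in range(height)]
    for (hx, hy), sigma in zip(heads, sigmas):
        radius = max(1, int(round(3 * sigma)))
        cells = []
        for y in range(max(0, hy - radius), min(height, hy + radius + 1)):
            for x in range(max(0, hx - radius), min(width, hx + radius + 1)):
                w = math.exp(-((x - hx) ** 2 + (y - hy) ** 2) / (2.0 * sigma ** 2))
                cells.append((y, x, w))
        total = sum(w for _, _, w in cells)
        for y, x, w in cells:
            dmap[y][x] += w / total
    return dmap

heads = [(5, 5), (20, 8), (22, 9)]                  # hypothetical annotations
fixed = density_map(heads, 16, 32, [2.0] * len(heads))
adaptive = density_map(heads, 16, 32, adaptive_sigmas(heads, k=1))
```

Summing either map recovers the number of annotated heads, and in the adaptive variant the two heads that lie close together receive a smaller spread than the isolated one.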
Training of Switch-CNN is done in three stages, namely pretraining, differential training and coupled training, described in the following sections.

3.2. Pretraining

The three CNN regressors $R_1$ through $R_3$ are pretrained separately to regress density maps. Pretraining helps in learning good initial features which improves later fine-tuning stages. Individual CNN regressors are trained to minimize the Euclidean distance between the estimated density map and ground truth. Let $D_{X_i}(\cdot\,;\Theta)$ represent the output of a CNN regressor with parameters $\Theta$ for an input image $X_i$. The $l_2$ loss function is given by

$$L_{l_2}(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| D_{X_i}(\cdot\,;\Theta) - D^{GT}_{X_i}(\cdot) \right\|_2^2, \tag{1}$$

where $N$ is the number of training samples and $D^{GT}_{X_i}(\cdot)$ indicates the ground truth density map for image $X_i$. The loss $L_{l_2}$ is optimized by backpropagating the CNN via stochastic gradient descent (SGD). Here, the $l_2$ loss function acts as a proxy for count error between the regressor estimated count and true count. It indirectly minimizes count error. The regressors $R_k$ are pretrained until the validation accuracy plateaus.

3.3. Differential Training

CNN regressors $R_{1-3}$ are pretrained with the entire training data. The count prediction performance varies due to the inherent difference in network structure of $R_{1-3}$ like receptive field and effective field-of-view. Though we optimize the $l_2$-loss between the estimated and ground truth density maps for training a CNN regressor, factoring in count error during training leads to better crowd counting performance. Hence, we measure CNN performance using count error. Let the count estimated by the $k$th regressor for the $i$th image be $C_i^k = \sum_x D_{X_i}(x;\Theta_k)$. Let the reference count inferred from ground truth be $C_i^{GT} = \sum_x D^{GT}_{X_i}(x)$. Then the count error for the $i$th sample evaluated by $R_k$ is

$$E_{C_i}(k) = \left| C_i^k - C_i^{GT} \right|, \tag{2}$$

the absolute count difference between prediction and true count. Patches with particular crowd attributes give lower count error with a regressor having complementary network structure. For example, a CNN regressor with a large receptive field captures high level abstractions like background elements and faces.
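To make the difference between the training loss of Equation 1 and the count error of Equation 2 concrete, here is a small sketch on nested Python lists. The function names and the toy maps are made up for illustration and are not part of the method.

```python
def count(density):
    """Crowd count of a density map: the sum over all pixels."""
    return sum(sum(row) for row in density)

def l2_loss(predictions, ground_truths):
    """Equation 1: squared Euclidean distance between predicted and ground
    truth density maps, summed over samples and divided by 2N."""
    n = len(predictions)
    total = 0.0
    for pred, gt in zip(predictions, ground_truths):
        total += sum((p - g) ** 2
                     for prow, grow in zip(pred, gt)
                     for p, g in zip(prow, grow))
    return total / (2.0 * n)

def count_error(pred, gt):
    """Equation 2: absolute difference between predicted and true count."""
    return abs(count(pred) - count(gt))

# Toy 2x2 maps: the prediction spreads one person over two pixels.
pred = [[0.5, 0.5], [0.0, 0.0]]
gt = [[1.0, 0.0], [0.0, 0.0]]
loss = l2_loss([pred], [gt])     # 0.25
error = count_error(pred, gt)    # 0.0
```

In this toy case the density is mislocalized, so the l2 loss is non-zero even though the count is exact, which is why the loss is only a proxy for count error.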
To amplify the network differences, differential training is proposed (shown in blue in Algorithm 1). The key idea in differential training is to backpropagate the regressor $R_k$ with minimum count error for a given training crowd scene patch. For every training patch $i$, we choose the regressor $l_i^{best}$ such that $E_{C_i}(l_i^{best})$ is lowest across all regressors $R_{1-3}$. This amounts to greedily choosing the regressor that predicts the most accurate count amongst the $k$ regressors. Formally, we define the label of the chosen regressor $l_i^{best}$ as:

$$l_i^{best} = \arg\min_k \left| C_i^k - C_i^{GT} \right|. \tag{3}$$

The count error for the $i$th sample is

$$E_{C_i} = \min_k \left| C_i^k - C_i^{GT} \right|. \tag{4}$$

This training regime encourages a regressor $R_k$ to prefer a particular set of the training data patches with particular patch attributes so as to minimize the loss. While the backpropagation of an independent regressor $R_k$ is still done with the $l_2$-loss, the choice of CNN regressor for backpropagation is based on the count error. Differential training indirectly minimizes the mean absolute count error (MAE) over the training images. For $N$ images, MAE in this case is given by

$$E_C = \frac{1}{N} \sum_{i=1}^{N} \min_k \left| C_i^k - C_i^{GT} \right|, \tag{5}$$

which can be thought of as the minimum count error achievable if each sample is relayed correctly to the right CNN. However during testing, achieving this full accuracy may not be possible as the switch classifier is not ideal. To summarize, differential training generates three disjoint groups of training patches and each network is fine-tuned on its own group. The regressors $R_k$ are differentially trained until the validation accuracy plateaus.

3.4. Switch Training

Once the multichotomy of space of patches is inferred via differential training, a patch classifier (switch) is trained to relay a patch to the correct regressor $R_k$. The manifold that separates the space of crowd scene patches is complex and hence a deep classifier is required to infer the group of patches in the multichotomy. We use the VGG16 [11] network as the switch classifier to perform 3-way classification. The classifier is trained on the labels of the multichotomy generated from differential training. The number of training patches in each group can be highly skewed, with the majority of patches being relayed to a single regressor depending on the attributes of the crowd scene. To alleviate class imbalance during switch classifier training, the labels collected from the differential training are equalized so that the number of samples in each group is the same. This is done by randomly sampling from the smaller group to balance the training set of the switch classifier.

3.5. Coupled Training

Differential training on the CNN regressors $R_1$ through $R_3$ generates a multichotomy that minimizes the error of the predicted count by choosing the best regressor for a given crowd scene patch. However, the trained switch is not ideal and the manifold separating the space of patches is complex to learn. To mitigate the effect of switch inaccuracy and the inherent complexity of the task, we co-adapt the patch classifier and the CNN regressors by training the switch and regressors in an alternating fashion. We refer to this stage of training as Coupled training (shown in green in Algorithm 1).
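The label selection of Equation 3, the error bound of Equation 5 and the balancing of the switch training set can be sketched as below. This is an illustrative reading only: the function names and the toy counts are hypothetical, ties are broken in favour of the lowest regressor index, and balancing is interpreted here as oversampling the smaller groups up to the size of the largest one, which is an assumption about the sampling the text describes.

```python
import random
from collections import defaultdict

def best_regressor_labels(pred_counts, gt_counts):
    """Equation 3. pred_counts[i][k] is the count predicted by regressor k
    for patch i; the label is the index of the regressor with least error."""
    labels = []
    for preds, gt in zip(pred_counts, gt_counts):
        errors = [abs(c - gt) for c in preds]
        labels.append(errors.index(min(errors)))
    return labels

def oracle_mae(pred_counts, gt_counts):
    """Equation 5: mean count error if every patch went to its best regressor."""
    return sum(min(abs(c - gt) for c in preds)
               for preds, gt in zip(pred_counts, gt_counts)) / len(gt_counts)

def balance_by_oversampling(labels, seed=0):
    """Return patch indices in which every label group has as many entries as
    the largest group, re-drawing at random from the smaller groups."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        groups[lab].append(idx)
    target = max(len(members) for members in groups.values())
    balanced = []
    for lab in sorted(groups):
        members = groups[lab]
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced

# Hypothetical counts from three regressors on five patches.
pred_counts = [[10, 12, 20], [50, 40, 31], [5, 9, 9], [100, 80, 70], [7, 3, 9]]
gt_counts = [11, 30, 5, 69, 3]
labels = best_regressor_labels(pred_counts, gt_counts)
balanced = balance_by_oversampling(labels)
```

With these toy numbers the middle regressor wins only one patch, so that patch is drawn twice and each of the three groups ends up with two entries in the switch training set.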
The switch classifier is first trained with labels from the multichotomy inferred in differential training for one epoch (shown in red in Algorithm 1). In the next stage, the three CNN regressors are made to co-adapt with the switch classifier (shown in blue in Algorithm 1). We refer to this stage of training, which enforces co-adaption of the switch and regressors $R_{1-3}$, as Switched differential training.

In switched differential training, the individual CNN regressors are trained using crowd scene patches relayed by the switch for one epoch. For a given training crowd scene patch $X_i$, the switch is forward propagated on $X_i$ to infer the choice of regressor $R_k$. The switch layer then relays $X_i$ to the particular regressor and backpropagates $R_k$ using the loss defined in Equation 1, and $\Theta_k$ is updated. This training regime is executed for an epoch.

In the next epoch, the labels for training the switch classifier are recomputed using the criterion in Equation 3 and the switch is again trained as described above. This process of alternating switch training and switched training of CNN regressors is repeated every epoch until the validation accuracy plateaus.

4. Experiments

4.1. Testing

We evaluate the performance of our proposed architecture, Switch-CNN, on four major crowd counting datasets. At test time, the image patches are fed to the switch classifier which relays the patch to the best CNN regressor $R_k$. The selected CNN regressor predicts a crowd density map for the relayed crowd scene patch. The generated density maps are assembled into an image to get the final density map for the entire scene. Because of the two pooling layers in the CNN regressors, the predicted density maps are 1/4th the size of the input.

Evaluation Metric. We use Mean Absolute Error (MAE) and Mean Squared Error (MSE) as the metrics for comparing the performance of Switch-CNN against the state-of-the-art crowd counting methods. For a test sequence with $N$ images, MAE is defined as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|, \tag{6}$$

where $C_i$ is the crowd count predicted by the model being evaluated, and $C_i^{GT}$ is the crowd count from human labelled annotations.
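The two evaluation metrics named above can be computed as in the following sketch. The function names and the toy counts are illustrative, and MSE is taken here in the root-mean-square form that crowd counting work usually reports under that name, which should be read as an assumption of this example.

```python
import math

def mae(pred_counts, gt_counts):
    """Mean absolute difference between predicted and annotated counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)

def mse(pred_counts, gt_counts):
    """Square root of the mean squared count difference (assumed form)."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts))
                     / len(gt_counts))

pred_counts = [10, 20, 33]   # hypothetical per-image predictions
gt_counts = [12, 20, 30]     # hypothetical annotated counts
scene_mae = mae(pred_counts, gt_counts)
scene_mse = mse(pred_counts, gt_counts)
```

Because the squared term penalizes large per-image misses more heavily, two models with the same MAE can differ noticeably in MSE.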
MAE is an indicator of the accuracy of the predicted crowd count across the test sequence. MSE is a metric complementary to MAE and indicates the robustness of the predicted count. For a test sequence, MSE is defined as follows:

$$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2}. \tag{7}$$

4.2. ShanghaiTech dataset

We perform extensive experiments on the ShanghaiTech crowd counting dataset [19] that consists of 1198 annotated images. The dataset is divided into two parts named Part A and Part B. The former contains dense crowd scenes parsed from the internet and the latter is relatively sparse crowd scenes captured in urban surface streets. We use the train-test splits provided by the authors for both parts in our experiments. We train Switch-CNN as elucidated by Algorithm 1 on both parts of the dataset. Ground truth is generated using the geometry-adaptive kernels method as the variance in crowd density within a scene due to perspective effects is high (see Sec 3.1 for details about ground truth generation). With an ideal switch (100% switching accuracy), Switch-CNN performs with an MAE of 51.4. However, the accuracy of the switch is 73.2% in Part A and 76.3% in Part B of the dataset, resulting in a higher MAE than with the ideal switch.

Table 1 shows that Switch-CNN outperforms all other state-of-the-art methods by a significant margin on both the MAE and MSE metrics. Switch-CNN shows a 19.8 point improvement in MAE on Part A and a 4.8 point improvement in Part B of the dataset over MCNN [19]. Switch-CNN also outperforms all other models on the MSE metric, indicating that the predictions have a lower variance than MCNN across the dataset. This is an indicator of the robustness of Switch-CNN's predicted crowd count.

We show sample predictions of Switch-CNN for sample test scenes from the ShanghaiTech dataset along with the ground truth in Figure 3. The predicted density maps closely follow the crowd distribution visually. This indicates that Switch-CNN is able to localize the spatial distribution of crowd within a scene accurately.

Figure 3. Sample predictions by Switch-CNN for crowd scenes from the ShanghaiTech dataset [19] are shown. The top and bottom rows depict a crowd image, corresponding ground truth and prediction from Part A and Part B of the dataset respectively.

Table 1. Comparison of Switch-CNN with other state-of-the-art crowd counting methods on ShanghaiTech dataset [19]. Columns: MAE and MSE for Part A, MAE and MSE for Part B. Rows: Zhang et al. [18], MCNN [19], Switch-CNN.

4.3. UCF_CC_50 dataset

UCF_CC_50 [6] is a 50 image collection of annotated crowd scenes.
The dataset exhibits a large variance in the crowd count, with the lowest count being 94. The small size of the dataset and the large variance in crowd count make it a very challenging dataset. We follow the approach of other state-of-the-art models [18, 2, 9, 19] and use 5-fold cross-validation to validate the performance of Switch-CNN on UCF_CC_50.

In Table 2, we compare the performance of Switch-CNN with other methods using MAE and MSE as metrics. Switch-CNN outperforms all other methods and evidences a 15.7 point improvement in MAE over Hydra2s [9]. Switch-CNN also gets a competitive MSE score compared to Hydra2s, indicating the robustness of the predicted count. The accuracy of the switch is 54.3%. The switch accuracy is relatively low as the dataset has very few training examples and a large variation in crowd density. This limits the ability of the switch to learn the multichotomy of space of crowd scene patches.

Table 2. Comparison of Switch-CNN with other state-of-the-art crowd counting methods on UCF_CC_50 dataset [6]. Columns: MAE, MSE. Rows: Lempitsky et al. [8], Idrees et al. [6], Zhang et al. [18], CrowdNet [2], MCNN [19], Hydra2s [9], Switch-CNN.

4.4. The UCSD dataset

The UCSD crowd counting dataset consists of 2000 frames from a single scene. The scenes are characterized by sparse crowd with the number of people ranging from 11 to 46 per frame. A region of interest (ROI) is provided for the scene in the dataset. We use the train-test splits used by [4]. Of the 2000 frames, frames 601 through 1400 are used for training while the remaining frames are held out for testing. Following the setting used in [19], we prune the feature maps of the last layer with the ROI provided. Hence, error is backpropagated during training only for areas inside the ROI. We use a fixed spread Gaussian to generate ground truth density maps for training Switch-CNN as the crowd is relatively sparse. At test time, MAE is computed only for the specified ROI in test images for benchmarking Switch-CNN against other approaches.
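Restricting the count to a region of interest, as done for this dataset, amounts to masking the density map before summing. The sketch below assumes the ROI is available as a binary mask already resampled to the resolution of the density map; the function name and the toy values are illustrative.

```python
def masked_count(density, roi):
    """Sum the density map only over pixels where the ROI mask is non-zero."""
    return sum(d
               for drow, rrow in zip(density, roi)
               for d, m in zip(drow, rrow)
               if m)

density = [[0.2, 0.3],
           [0.5, 1.0]]
roi = [[1, 0],
       [1, 1]]          # hypothetical mask: top-right pixel lies outside the ROI
inside = masked_count(density, roi)
```

Applying the same mask to both the predicted and the ground truth maps keeps the per-image count error confined to the ROI.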
Table 3 reports the MAE and MSE results for Switch-CNN and other state-of-the-art approaches. Switch-CNN performs competitively compared to other approaches in terms of MAE. The switch accuracy in relaying the patches to regressors $R_1$ through $R_3$ is 60.9%. However, the dataset is characterized by low variability of crowd density set in a single scene. This limits the performance gain achieved by Switch-CNN from leveraging intra-scene crowd density variation.

Table 3. Comparison of Switch-CNN with other state-of-the-art crowd counting methods on UCSD crowd-counting dataset [4]. Columns: MAE, MSE. Rows: Kernel Ridge Regression [1], Cumulative Attribute Regression [5], Zhang et al. [18], MCNN [19], CCNN [9] (1.51), Switch-CNN.

Table 4. Comparison of Switch-CNN with other state-of-the-art crowd counting methods on WorldExpo'10 dataset [18]. Mean Absolute Error (MAE) for individual test scenes and average performance across scenes is shown. Columns: S1, S2, S3, S4, S5, Avg. MAE. Rows: Zhang et al. [18], MCNN [19], Switch-CNN (GT with perspective map), Switch-CNN (GT without perspective).

4.5. The WorldExpo'10 dataset

The WorldExpo'10 dataset consists of 1132 video sequences captured with 108 surveillance cameras. Five different video sequences, each from a different scene, are held out for testing. Every test scene sequence has 120 frames. The crowds are relatively sparse in comparison to other datasets, with an average of 50 people per image. A region of interest (ROI) is provided for both training and test scenes. In addition, perspective maps are provided for all scenes. The maps specify the number of pixels in the image that cover one square meter at every location in the frame. These maps are used by [19, 18] to adaptively choose the spread of the Gaussian while generating ground truth density maps. We evaluate performance of the Switch-CNN using ground truth generated with and without perspective maps. We prune the feature maps of the last layer with the ROI provided. Hence, error is backpropagated during training only for areas inside the ROI. Similarly at test time, MAE is computed only for the specified ROI in test images for benchmarking Switch-CNN against other approaches.

MAE is computed separately for each test scene and averaged to determine the overall performance of Switch-CNN across test scenes. Table 4 shows that the average MAE of Switch-CNN across scenes is better by a margin of 2.2 points over the performance obtained by the state-of-the-art approach MCNN [19]. The switch accuracy is 52.72%.

5. Analysis

5.1. Effect of number of regressors on Switch-CNN

Differential training makes use of the structural variations across the individual regressors to learn a multichotomy of the training data. To investigate the effect of structural variations of the regressors $R_1$ through $R_3$, we train Switch-CNN with combinations of regressors ($R_1$, $R_2$), ($R_2$, $R_3$), ($R_1$, $R_3$) and ($R_1$, $R_2$, $R_3$) on Part A of the ShanghaiTech dataset. Table 5 shows the MAE performance of Switch-CNN for different combinations of regressors $R_k$. Switch-CNN with CNN regressors $R_1$ and $R_3$ has lower MAE than Switch-CNN with regressors ($R_1$, $R_2$) and ($R_2$, $R_3$). This can be attributed to the former model having a higher switching accuracy than the latter. Switch-CNN with all three regressors outperforms both the models as it is able to model the scale and perspective variations better with three independent CNN regressors $R_1$, $R_2$ and $R_3$ that are structurally distinct. Switch-CNN leverages multiple independent CNN regressors with different receptive fields. In Table 5, we also compare the performance of individual CNN regressors with Switch-CNN. Here each of the individual regressors is trained on the full training data from Part A of the ShanghaiTech dataset. The higher MAE of the individual CNN regressors is attributed to the inability of a single regressor to model the scale and perspective variations in the crowd scene.

Table 5. Comparison of MAE for Switch-CNN variants and CNN regressors $R_1$ through $R_3$ on Part A of the ShanghaiTech dataset [19]. Column: MAE. Rows: $R_1$, $R_2$, $R_3$, Switch-CNN with ($R_1$, $R_3$), Switch-CNN with ($R_1$, $R_2$), Switch-CNN with ($R_2$, $R_3$), Switch-CNN with ($R_1$, $R_2$, $R_3$).

5.2. Switch Multichotomy Characteristics

The principal idea of Switch-CNN is to divide the training patches into disjoint groups to train individual CNN regressors so that overall count accuracy is maximized. This multichotomy in space of crowd scene patches is created automatically through differential training. We examine the underlying structure of the patches to understand the correlation between the learnt multichotomy and attributes of the patch like crowd count and density.
However, the unavailability of perspective maps renders computation of actual density intractable. We believe inter-head distance between people is a candidate measure of crowd density. In a highly dense crowd, the separation between people is low and hence density is high. On the other hand, for low density scenes, people are far away and mean inter-head distance is large. Thus mean inter-head distance is a proxy for crowd density. This measure of density is robust to scale variations as the inter-head distance naturally subsumes the scale variations.
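A minimal sketch of such a density proxy follows, assuming it is computed as the distance from each annotated head to its k nearest neighbours, averaged first per head and then over the patch. The function name is illustrative, the default of k = 10 follows the neighbour count used in this analysis, and returning None for patches with fewer than two heads is a choice made only for this example.

```python
import math

def mean_inter_head_distance(heads, k=10):
    """Average, over all heads in a patch, of the mean distance to the k
    nearest other heads. Returns None when the patch has fewer than two heads."""
    if len(heads) < 2:
        return None
    per_head = []
    for i, (xi, yi) in enumerate(heads):
        dists = sorted(math.hypot(xi - xj, yi - yj)
                       for j, (xj, yj) in enumerate(heads) if j != i)
        nearest = dists[:k]
        per_head.append(sum(nearest) / len(nearest))
    return sum(per_head) / len(per_head)

dense = [(0, 0), (1, 0), (0, 1), (1, 1)]           # hypothetical tight cluster
sparse = [(10 * x, 10 * y) for x, y in dense]      # same layout, ten times wider
d_dense = mean_inter_head_distance(dense)
d_sparse = mean_inter_head_distance(sparse)
```

The measure scales linearly with the spacing between heads, so the sparse layout yields a value ten times that of the dense one.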


To analyze the multichotomy in the space of patches, we compute the average inter-head distance of each patch in the Part A test set of ShanghaiTech. For each head annotation, the average distance to its 10 nearest neighbors is calculated. These distances are averaged over the entire patch, representing the density of the patch. We plot a histogram of these distances in Figure 4 and group the patches by color on the basis of the regressor Rk used to infer the count of the patch. A separation of patch space based on crowd density is observed in Figure 4. R1, which has the largest receptive field of 9x9, evaluates patches of low crowd density (corresponding to large mean inter-head distance). An interesting observation is that patches from the crowd scene that have no people in them (patches in Figure 4 with zero average inter-head distance) are relayed to R1 by the switch. We believe that the patches with no people are relayed to R1 as it has a large receptive field that helps capture background attributes in such patches like urban facade and foliage. Figure 5 displays some sample patches that are relayed to each of the CNN regressors R1 through R3. The density of crowd in the patches increases from CNN regressor R1 through R3.

Figure 4. Histogram of average inter-head distance for crowd scene patches from Part A test set of ShanghaiTech dataset [19] is shown. (Horizontal axis: mean inter-head distance per patch; vertical axis: number of patches; series: R1: 9x9, R2: 7x7, R3: 5x5, No People.) We see that the multichotomy of space of crowd scene patches inferred from the switch separates patches based on latent factors correlated with crowd density.

Figure 5. Sample crowd scene patches from Part A test set of ShanghaiTech dataset [19] are shown. We see that the density of crowd in the patches increases from CNN regressor R1 to R3.

5.3. Attribute Clustering Vs Differential Training

We saw in Sec 5.2 that differential training approximately divides training set patches into a multichotomy based on density. We investigate the effect of manually clustering the patches based on a patch attribute like crowd count or density. We use patch count as the metric to cluster patches. Training patches are divided into three groups based on the patch count such that the training patches are equally distributed amongst the three CNN regressors R1 through R3. R1, having a large receptive field, is trained on patches with low crowd count. R2 is trained on medium count patches while high count patches are relayed to R3. The training procedure for this experiment is identical to Switch-CNN, except for the differential training stage. We repeat this experiment with average inter-head distance of the patches as the metric for grouping the patches. Patches with high mean inter-head distance are relayed to R1. R2 is relayed patches with low inter-head distance by the switch while the remaining patches are relayed to R3.

Table 6 reports MAE performance for the two clustering methods. Both crowd count and average inter-head distance based clustering give a higher MAE than Switch-CNN. Average inter-head distance based clustering performs comparably with Switch-CNN. This evidence reinforces the observation that Switch-CNN learns a multichotomy in the space of patches that is highly correlated with mean inter-head distance of the crowd scene. The differential training regime employed by Switch-CNN is able to infer this grouping automatically, independent of the dataset.

Method                               MAE
Cluster by count
Cluster by mean inter-head distance
Switch-CNN

Table 6. Comparison of MAE for Switch-CNN and manual clustering of patches based on patch attributes on Part A of the ShanghaiTech dataset [19].

6. Conclusion

In this paper, we propose switching convolutional neural network that leverages intra-image crowd density variation to improve the accuracy and localization of the predicted crowd count. We utilize the inherent structural and functional differences in multiple CNN regressors capable of tackling large scale and perspective variations by enforcing a differential training regime. Extensive experiments on multiple datasets show that our model exhibits state-of-the-art performance on major datasets.
Further, we show that our model learns to group crowd patches based on latent factors correlated with crowd density.

7. Acknowledgements

This work was supported by SERB, Department of Science and Technology (DST), Government of India (Proj No. SB/S3/EECE/0127/2015).

References

[1] S. An, W. Liu, and S. Venkatesh. Face recognition using kernel ridge regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-7.
[2] L. Boominathan, S. S. Kruthiventi, and R. V. Babu. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the 2016 ACM on Multimedia Conference.
[3] G. J. Brostow and R. Cipolla. Unsupervised bayesian detection of independent motion in crowds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1.
[4] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-7.
[5] K. Chen, C. C. Loy, S. Gong, and T. Xiang. Feature mining for localised crowd counting. In BMVC, volume 1, page 3.
[6] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
[8] V. Lempitsky and A. Zisserman. Learning to count objects in images. In Advances in Neural Information Processing Systems.
[9] D. Onoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision. Springer.
[10] V. Rabaud and S. Belongie. Counting crowded moving objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1.
[11] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint.
[12] R. Stewart and M. Andriluka. End-to-end people detection in crowded scenes. arXiv preprint.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9.
[14] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision, 63(2).
[15] C. Wang, H. Zhang, L. Yang, S. Liu, and X. Cao. Deep people counting in extremely dense crowds. In Proceedings of the 2015 ACM on Multimedia Conference.
[16] M. Wang and X. Wang. Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[17] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In IEEE International Conference on Computer Vision, volume 1, pages 90-97.
[18] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[19] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

arXiv:1708.00199v2 [cs.CV] 3 Aug 2017