Multi-Object Tracking Based on Tracking-Learning-Detection Framework

Songlin Piao, Karsten Berns
Robotics Research Lab, University of Kaiserslautern

Abstract. This paper presents a framework for robust, long-term, real-time tracking of multiple objects against a dynamic background. The tracked objects may be of the same type or of entirely different types. For each tracked object, a dedicated classifier is trained on-line using positive and negative constraints inside a local region. An external Detector module is integrated into the framework to handle objects that disappear from view. The framework selects the best result from several independent components and estimates the error at the same time. A Kalman filter and a particle filter are used inside the Filtering component to predict possible positions of the object in the next frame. Finally, test results in a variety of situations show that the proposed algorithm is general and extensible.

1 Introduction

Object detection and tracking are becoming increasingly important. One of the main applications is Advanced Driver Assistance Systems for driving safety [6], and the same applies to robotics. A robot needs to detect and track the objects around it, not only for environment perception but also for safety: a robot should not harm the humans around it. The proposed algorithm was originally designed to detect and track humans around a low-speed robot, as shown in Fig. 1. The detection and tracking module is located inside the perception layer, which is the safety component of the whole system. The framework itself is general and can be used to detect and track any object.

For detection, a classifier normally has to be trained off-line for the specific object. Two greedy search strategies are mainly used. The first fixes the size of the search window and changes the size of the whole image. The second fixes the size of the whole image and changes the size of the search window instead. Dalal and Triggs used histograms of oriented gradients to detect humans [5]. They trained a human detection classifier using a linear SVM and used the first strategy to scan the whole image. Viola and Jones instead used Haar-like features and the second strategy to scan for face regions in the image [18]. Their classifier was trained using AdaBoost. The trick they used is the integral image, which allows the sum of the pixels inside a rectangular region to be computed in constant time. The performance of a detection algorithm depends on the features used to build the descriptor, the learning algorithm, and the computational cost. Wu et al. proposed a new visual shape descriptor called CENTRIST, which


is similar to LBP [17], for scene categorization, and showed real-time performance in human detection [20].

Fig. 1. Designed System

Fig. 2. Framework Overview

A tracking method is used to locate the same object in consecutive frames for as long as possible. Multi-object tracking in the real world is difficult because of occlusions, changing backgrounds, noise, and so on. Recently, tracking-by-detection methods have become increasingly popular. Breitenstein et al. proposed a particle-filter-based on-line tracking-by-detection algorithm [3]. They used the detector confidence when reasoning about the next frame, and used the on-line boosting described in [7] to learn an object classifier at runtime. Babenko et al. proposed on-line multiple instance learning (MIL) for robust object tracking [1]. A positive bag consisting of several image patches, rather than several individual positive patches, is used to update the MIL classifier, so that the drift problem of traditional tracking-by-detection algorithms can be solved. Saffari et al. proposed on-line random forests [16] and on-line multi-class LPBoost [15] to address multi-class classification, which on-line boosting [7] does not cover because it considers only binary classification.

1.1 Related work

The Tracking-Learning-Detection framework was first proposed by Kalal et al. in [11]. They explicitly decomposed the long-term tracking task into tracking, learning, and detection parts. For tracking, they used forward-backward errors to detect tracking failures automatically [10]. The main concept is based on the Lucas-Kanade feature tracker [2]. For the learning part, they proposed the P-N learning framework, which consists of a P-expert and an N-expert. The P-expert analyzes examples classified as negative, estimates the false negatives, and adds them to the training set with a positive label; the N-expert analyzes examples classified as positive, estimates the false positives, and adds them to the training set with a negative label. They used an iterative procedure to model this learning process and analyzed

its error convergence conditions based on the well-founded theory of dynamical systems [14]. For detection, they used a cascaded classifier for speed. The fern-like features described in [13] were used for classification, and a nearest-neighbor classifier was chosen as the final classifier. Our work was mainly inspired by this Tracking-Learning-Detection framework, which we refer to as TLD in the following sections.

1.2 Proposed framework overview

This paper extends the original TLD framework. An overview of the proposed framework is shown in Fig. 2. Each process contains a single instance of TrackingController, which holds all the information about the trackers. In each time frame, TrackingController updates the positions of all trackers by calling each tracker's update function in turn. Each tracker updates its own state using information from the various modules and the filtering strategy. At the same time, it updates its appearance model based on the P-N learning theory described in [9]. Finally, TrackingController calls a post-processing function to analyze and correct errors based on the current status computed by each tracker. For example, two trackers may end up tracking the same target when several targets cross each other. The external detector is trained using the method described in [21]; the median flow tracker is implemented using the method described in [10]; the recognizer is similar to the detector described in [11]; mean-shift is implemented using the method in [4]; and the filtering framework currently implements the classical Kalman filter [19] and the Condensation particle filter [8]. The recognizer uses the on-line adapted appearance model of the object to evaluate the confidence of each candidate. The main contribution is that we extend the original TLD framework to a multi-target version while keeping the framework general and extensible. General means that each module in the framework can be replaced by any other state-of-the-art algorithm, and extensible means that additional modules can easily be added to the framework.

The remainder of this paper is organized as follows. Section 2 discusses the proposed framework step by step: Subsection 2.1 briefly introduces P-N learning; Subsection 2.2 introduces the basic detection technique integrated in the framework; Subsection 2.3 introduces the local search strategy; and Subsection 2.4 introduces the fusion strategy. Experimental results are described in Section 3, and Section 4 gives the conclusion and future work.

2 Proposed Framework

This section introduces the proposed framework in more detail. All related concepts are discussed step by step. The main structure of each tracker is shown in Fig. 3. In each frame, four kinds of candidate regions, estimated by the Mean-shift, Detector, Recognition, and Median Flow modules respectively, are passed to the Fusion module. The Fusion module uses the known appearance model to judge which candidate best matches the current appearance model of the tracked object. The Learning module then updates the object's appearance model again based on P-N learning

theory.

Fig. 3. Structure of Tracker

We search inside a local region instead of the whole image area for speed. We tested local-region search on all the videos mentioned in [11], and the results were even better than we expected. One example is shown in Fig. 4(a). The large green box around each tracked object in Fig. 4(b) is this local region. In the particle filter case, we add Gaussian noise to the transition function, as shown in Fig. 4(c). The potential problems of local search, and how to handle them, are discussed in the following subsections.

2.1 Introduction to P-N learning

P-N learning is a learning strategy that uses positive and negative constraints. Many researchers have in fact used this kind of method in earlier work without a strict proof. For example, the authors of [3] and [12] used on-line learning to update the corresponding classifiers, and the way they sampled positive and negative training data conforms to the P-N learning constraints. They update the classifier in every frame, which can cause unnecessary computational cost. According to P-N learning theory, the classifier does not need to be updated in every frame, but only when certain conditions are satisfied. These conditions are discussed in detail in the fusion subsection. This mechanism can save a great deal of computation time. Kalal et al. provide a detailed proof of P-N learning theory in [9].

2.2 Introduction to Detector

As shown in Fig. 2, both the TrackingController and Tracker classes need the external Detector class to scan a specified region. Detection should be as fast as possible


without accuracy loss.

(a) Example (b) Case of Kalman Filter (c) Case of Particle Filter
Fig. 4. Local Region Searching

Normally the detector uses a greedy search strategy, as mentioned previously. During scanning, the descriptor of each patch is computed and passed to the predefined classifier, which judges whether the patch contains the specific object. Wu et al. proposed a real-time human detection method in [20] using the CENTRIST descriptor, which is very similar to LBP. This method achieved 20 fps on VGA-resolution images using only an embedded 1.2 GHz CPU. It is almost 80 times faster than the HOG-based descriptor [5] while achieving similar accuracy. As shown in Fig. 5, the patch (here, a Sobel edge image) is divided into a grid of cells, and the green box then scans the whole patch and builds the descriptor. In the case of Fig. 5, each patch is divided into 3 by 4 cells, so there are $2 \times 3 = 6$ possible positions for the green box. For each green box the length of the LBP descriptor is 256, so the total length of the final descriptor is $256 \times 6 = 1536$. For each pixel inside the green box, the index into the LBP descriptor is computed as shown in Fig. 5(b). The index can be represented as $(C_1 C_2 C_3 C_4 C_5 C_6 C_7 C_8)_2$, where $C_i$ is set to 1 if the corresponding neighbor pixel value is higher than the current pixel value and to 0 otherwise. We trained a cascaded detector using a linear SVM and a histogram intersection kernel SVM, as in Fig. 6.
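The descriptor arithmetic above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the neighbor ordering inside the 8-bit index is an assumption, and the green box is assumed to span 2 by 2 cells (which is what yields the 2 x 3 = 6 positions on a 3 by 4 grid). The bit convention (1 when the neighbor is brighter than the center) follows the text.

```python
def census_transform(img):
    """Return the 8-bit index (C1..C8)_2 for every interior pixel of a 2-D list.

    Bit C_i is 1 when neighbor i is brighter than the center pixel.
    The neighbor order (row-major, C1 as most significant bit) is an assumption.
    """
    h, w = len(img), len(img[0])
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            code = 0
            for dy, dx in offsets:
                bit = 1 if img[y + dy][x + dx] > img[y][x] else 0
                code = (code << 1) | bit
            out[y - 1][x - 1] = code
    return out


def centrist_descriptor(img, grid_rows=4, grid_cols=3):
    """Concatenate one 256-bin histogram per 2x2-cell block position.

    With a 4 x 3 cell grid there are (4 - 1) * (3 - 1) = 6 block positions,
    so the result has 6 * 256 = 1536 entries.
    """
    ct = census_transform(img)
    cell_h = len(ct) // grid_rows
    cell_w = len(ct[0]) // grid_cols
    descriptor = []
    for r in range(grid_rows - 1):
        for c in range(grid_cols - 1):
            hist = [0] * 256
            # Each block covers two cells vertically and two horizontally.
            for y in range(r * cell_h, (r + 2) * cell_h):
                for x in range(c * cell_w, (c + 2) * cell_w):
                    hist[ct[y][x]] += 1
            descriptor.extend(hist)
    return descriptor
```

In a real detector the input would be the Sobel edge image of the patch, and the resulting vector would be fed to the cascaded SVM classifiers.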


(a) (b)
Fig. 5. CENTRIST Descriptor

Fig. 6. Cascade In Detection

2.3 Local Region Searching

The original TLD framework searches for the object in the whole image because it tracks only one object. Here we search in a local region instead. This has two advantages: first, it dramatically reduces the processing time for each frame; second, it makes multi-object tracking possible. Four results go into the Fusion module, as shown in Fig. 3. All of them except the result from the Median Flow module come from searching in the local region. We do not restrict Median Flow to the local region in order to overcome the drift problem, which happens more often with local search than with search over the global image.

First, the Mean-shift module uses two kinds of information. One is the back-projected image computed from the initial color histogram of the object. The other is the detection confidence image, which comes from the Detector module. The concept of a detection confidence map was proposed in [3], whose authors used the continuous confidence of pedestrian detectors and on-line trained, instance-specific classifiers as a graded observation

model in addition to final high-confidence detection results.

Fig. 7. Mean-shift Example

In our case, the detector confidence density $d_c(p)$ corresponds to the raw SVM output before non-maximum suppression is applied, which is further scaled to $[0, 1]$ using $f = 1 - \exp(-\rho)$. Fig. 7 illustrates this concept. When only the classical back-projected image is used, the Mean-shift module may drift to a wrong position because the background color is similar to the target. When the density image is combined with it, the result is much more robust. The starting position of the Mean-shift module depends on the previous tracking result. If the Fusion module selected the result from the Mean-shift module as the final result, the last result is considered stable, and we start the Mean-shift algorithm from the last tracked position. This is similar to the continuous Mean-shift algorithm. If the last selected tracking result is not from the Mean-shift module, or there is no confident tracking result at all, the Mean-shift algorithm starts from the center of the local search area, which is predicted by the Filtering component.

Second, there are two search strategies for the Detector module. The first is carried out by TrackingController, not by Tracker. TrackingController detects objects either in the whole frame or only in the boundary area around the blue box in Fig. 8. It then associates each detected object with a tracker using a method similar to the one described in [3]. After this step, each tracker either has an associated detection result or does not. As shown in Fig. 8, in the first image the tracker gets an associated detection result, but in the second image it does not, because half of the human is in the non-searched area. In this case we use the Detector module to search the local area for a human. In this Detector module the scanning window size is fixed; instead, the size of the whole image is changed at each level.
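The fixed-window, variable-image scanning strategy can be sketched as a generator over an image pyramid. This is a minimal illustration only: the function name, the scale step of 1.25 and the stride of 8 pixels are assumptions, not values from the paper. No pixels are resized here; only the window coordinates at each pyramid level are enumerated and mapped back to the original image.

```python
def pyramid_windows(img_w, img_h, win_w, win_h, scale_step=1.25, stride=8):
    """Yield (x, y, w, h) windows in original-image coordinates.

    The scanning window keeps a fixed size at every level while the image
    is shrunk by `scale_step` per level, so a window found at a coarser
    level corresponds to a larger box in the original image.
    """
    scale = 1.0
    while True:
        level_w = int(img_w / scale)
        level_h = int(img_h / scale)
        if level_w < win_w or level_h < win_h:
            break  # the window no longer fits into the shrunken image
        for y in range(0, level_h - win_h + 1, stride):
            for x in range(0, level_w - win_w + 1, stride):
                # Map the fixed-size window back to the original image.
                yield (int(x * scale), int(y * scale),
                       int(win_w * scale), int(win_h * scale))
        scale *= scale_step
```

For local-region search, `img_w` and `img_h` would be the size of the local region rather than the full frame, which is where the saving in processing time comes from.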


Fig. 8. Different Detection Strategy

Third, the Recognition module uses the cascaded detector proposed in [11] to search for the exact tracked object inside the local region. The classifier has three stages: (i) patch variance, (ii) ensemble classifier, and (iii) nearest neighbor. Each stage either passes the patch to the next stage or rejects it. For more detail, please refer to [11]. Searching only the local region instead of the whole image reduces computation time significantly.

2.4 Introduction to Fusion

As shown in Fig. 3, four results enter this module: two from the tracker side and two from the detection side. First, we compare the two results from the tracker side. One is from the Mean-shift module, denoted $T_t^{\text{meanshift}}$, and the other is from the Median Flow module, denoted $T_t^{\text{medianflow}}$. For each of them, the confidence is calculated by the on-line trained classifier, and the result with the higher confidence is selected. We then compare this confidence with two predefined threshold values, on which the learning procedure depends. Learning can be triggered in two situations: either the current confidence is greater than $\theta_{\text{plus}}$, or the current confidence is greater than $\theta_{\text{minus}}$ and the previous learning state is also true. We set $\theta_{\text{plus}}$ to 0.65 and $\theta_{\text{minus}}$ to 0.55 in our system. However, if the Detector side yields exactly one rectangle with a confidence higher than that of the Tracking side, the response of the Detector module is assigned to the final result, which causes re-initialization. The details are described in Algorithm 1. In the original TLD paper, the authors updated the classifier only when the tracking result was valid. Here, the result from the Detector side is also used for

learning when its confidence is greater than some threshold, because the search area is already a local region around the object. That is, a high-confidence detection result inside the local region can be treated as a clue for the P-expert. Note that different fusion strategies result in different performance. What we show here is just the strategy of the example system; it can be adapted to any specific requirements. Once the current state is determined to be confident enough, learning begins. As stated in [9], the P-expert collects all patches that highly overlap the final state and labels them as positive; the N-expert collects all patches that do not overlap the final state and labels them as negative. A bounding box B is highly overlapped if the overlap ratio is over 60%, and not overlapped if the overlap ratio is less than 20%.

3 Experiments

We tested the proposed framework on all the datasets described in [11] and [1]. Since the framework is designed for the multi-target case, it should also work in the single-target case for compatibility. All of these test datasets are for single-target tracking, and our framework works well on all of them. Beyond these datasets, we ran additional experiments, divided into two parts: one without a detector and one with a specific detector. The specific detectors used in the second part are a face detector and a human detector. With the human detector, we tested outdoors, where the images are not normal images (footnote 3), the background changes fast, and the sunlight changes abruptly. Fig. 9 shows tracking of a face and a cup, two objects of different types. We initialized each object's position with the mouse. Even under partial occlusion, the tracker does not lose the target. Fig. 10 shows tracking of two eyes and one mouth.
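The overlap rule that ends Section 2.4 (over 60% is positive, under 20% is negative) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, boxes are assumed to be (x, y, w, h) tuples, and the overlap ratio is assumed to be intersection over union, which the text does not state explicitly.

```python
def overlap_ratio(a, b):
    """Intersection over union of two (x, y, w, h) boxes (assumed measure)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def label_patches(final_box, candidates, pos_thr=0.6, neg_thr=0.2):
    """Split candidate boxes into P-expert positives and N-expert negatives.

    Boxes whose overlap lies between the two thresholds are ambiguous
    and are left out of the training set.
    """
    positives, negatives = [], []
    for box in candidates:
        ratio = overlap_ratio(final_box, box)
        if ratio > pos_thr:
            positives.append(box)
        elif ratio < neg_thr:
            negatives.append(box)
    return positives, negatives
```

Leaving the middle band unlabeled keeps borderline patches from polluting either class.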
In the second part, both the face detector and the human detector are trained using the method described in [20]. The result is shown in Fig. 11. Fig. 12 and Fig. 8 show a more challenging situation. In this case the original TLD algorithm does not work at all, but our system works well. This is because we added two components to the original algorithm: the external Detector module and the Mean-shift module. Here we set the maximum number of trackers to 1 in order to show why the proposed framework is useful. As mentioned before, we search only inside the green box, which is the local area. There are four kinds of small rectangles inside the big green box: the red one is the result from the Detector module, the yellow one from the Median Flow module, the black one from the Mean-shift module, and the blue one from the Recognition module. As can be seen in Fig. 8, there are only yellow and black rectangles inside the green area, which means that only the results from the Median Flow and Mean-shift modules are currently available. We analyze these two results using the on-line trained object model and judge whether they are reliable. In the bottom image of Fig. 12, there is no reliable tracker result, only a reliable detection. This procedure is carried out inside the Fusion module.

Footnote 3: The omni-view image is transformed to a panorama view, so the noise is larger than in a normal view.

If in this situation

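Algorithm 1 Fusion strategy

Require: $T_t^{\text{meanshift}}$, $T_t^{\text{medianflow}}$, $D_t^{\text{detector}}$, $D_t^{\text{recognizer}}$
Ensure: $B_t \leftarrow \emptyset$, valid($B_t$) $\leftarrow$ false
$C_{\text{detect}} \leftarrow 0$, $C_{\text{tracker}} \leftarrow 0$
tracked $\leftarrow$ false
reinitialization $\leftarrow$ false
confidencedetections $\leftarrow 0$
for $R_t$ in $T_t^{\text{meanshift}} \cup T_t^{\text{medianflow}}$ do
    if $C(R_t) \geq C_{\text{tracker}}$ then
        $B_t^{\text{tracker}} \leftarrow R_t$
        $C_{\text{tracker}} \leftarrow C(R_t)$
    end if
end for
if $C_{\text{tracker}} \geq \theta_{\text{plus}}$ then
    $B_t \leftarrow B_t^{\text{tracker}}$
    tracked $\leftarrow$ true, valid($B_t$) $\leftarrow$ true
else if valid($B_{t-1}$) & ($C_{\text{tracker}} > \theta_{\text{minus}}$) then
    $B_t \leftarrow B_t^{\text{tracker}}$
    tracked $\leftarrow$ true, valid($B_t$) $\leftarrow$ true
end if
if tracked then
    for $R_t$ in $D_t^{\text{detector}} \cup D_t^{\text{recognizer}}$ do
        if Overlapping($R_t$, $B_t^{\text{tracker}}$) $< 0.5$ & $C(R_t) > C_{\text{tracker}}$ then
            confidencedetections++
        end if
    end for
    if confidencedetections == 1 then
        reinitialization $\leftarrow$ true
    end if
else
    for $R_t$ in $D_t^{\text{detector}} \cup D_t^{\text{recognizer}}$ do
        if $C(R_t) \geq C_{\text{detect}}$ then
            $B_t^{\text{detector}} \leftarrow R_t$
            $C_{\text{detect}} \leftarrow C(R_t)$
        end if
    end for
    if $C_{\text{detect}} \geq \theta_{\text{detector}}$ then
        $B_t \leftarrow B_t^{\text{detector}}$
        valid($B_t$) $\leftarrow$ true
    end if
end if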
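Algorithm 1 can be transcribed into a short Python sketch to make the control flow concrete. Several details are assumptions rather than facts from the paper: the function names, the (x, y, w, h) box format, the use of intersection over union for Overlapping, and the default value of `theta_detector`, which the text does not give. Assigning the single rival detection to the final box follows the prose of Section 2.4, not an explicit line of the algorithm.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes (assumed measure)."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


def fuse(tracker_cands, detection_cands, conf, prev_valid,
         theta_plus=0.65, theta_minus=0.55, theta_detector=0.65):
    """Return (box, valid, reinitialization) for one tracker in one frame.

    `conf` maps a box to its confidence under the on-line trained model.
    `theta_detector` is a hypothetical default; the paper does not state it.
    """
    # Pick the most confident tracker-side result (Mean-shift / Median Flow).
    best_tracker, c_tracker = None, 0.0
    for r in tracker_cands:
        if conf(r) >= c_tracker:
            best_tracker, c_tracker = r, conf(r)

    box, valid, reinit = None, False, False
    tracked = best_tracker is not None and (
        c_tracker >= theta_plus or (prev_valid and c_tracker > theta_minus))

    if tracked:
        box, valid = best_tracker, True
        # Detections that disagree with the tracker but are more confident.
        rivals = [r for r in detection_cands
                  if iou(r, best_tracker) < 0.5 and conf(r) > c_tracker]
        if len(rivals) == 1:
            # Per the text, the detector response becomes the final result.
            reinit, box = True, rivals[0]
    else:
        # No reliable tracker result: fall back to the best detection.
        best_det, c_detect = None, 0.0
        for r in detection_cands:
            if conf(r) >= c_detect:
                best_det, c_detect = r, conf(r)
        if best_det is not None and c_detect >= theta_detector:
            box, valid = best_det, True
    return box, valid, reinit
```

The returned `valid` flag would be passed back as `prev_valid` in the next frame, which is what gives the two thresholds their hysteresis effect.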


we apply only detection, most targets would be missed because of the sunlight and the fast movement of the target.

Fig. 9. First Test Without Detector

To show the generality of the proposed framework, we replaced the Kalman filter with the particle filter and applied the framework to the ETH benchmark dataset. Many other benchmark datasets are available, but we are not interested in static backgrounds and indoor environments. The test video is 999 frames long at a resolution of 640 by 480. The maximum number of trackers inside TrackingController (Fig. 2) is set to 6, and we used 100 particles for each tracker. Fig. 13 shows the results for the 132nd, 284th, 432nd, and 544th frames of the test sequence. There are several tracking errors, but thanks to our error detection mechanism these wrong trackers are re-initialized within several frames. Thanks to the fast runtime of the Detector module and the Kalman filter, our algorithm runs fast. Our face detector takes 18 ms per frame on a 640 by 480 image, compared with 132 ms for OpenCV, using i CPU. On average, processing one tracking target takes 64 ms per frame in total, including all the procedures.
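The particle filter's role in predicting the center of the local search region can be sketched as below. This is a simplified illustration under stated assumptions: a random-walk transition on 2-D positions with additive Gaussian noise (the paper only says that Gaussian noise is added to the transition function), a hypothetical noise level `sigma`, and hypothetical function names. The weighting and resampling steps of Condensation [8] are omitted.

```python
import random


def predict_particles(particles, sigma=5.0, rng=random):
    """Move every (x, y) particle by a random-walk step with Gaussian noise.

    `sigma` (in pixels) is an assumed value, not one reported in the paper.
    """
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for x, y in particles]


def weighted_centre(particles, weights):
    """Weighted mean position, usable as the center of the local search area."""
    total = sum(weights)
    cx = sum(w * x for (x, _), w in zip(particles, weights)) / total
    cy = sum(w * y for (_, y), w in zip(particles, weights)) / total
    return cx, cy
```

With 100 particles per tracker, as in the experiment above, `predict_particles` would be called once per frame and the weights would come from the on-line trained appearance model.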


Fig. 10. Second Test Without Detector

4 Conclusion and future work

A novel object tracking framework is introduced in this paper. Its feasibility has been analyzed in terms of generality and scalability. The proposed framework computes the tracking result of each component separately and then selects the best one in the Fusion module. At the same time, the framework can learn the object's model at runtime. With the proposed framework, the original TLD framework can be extended to a multi-target version by adding the external Detector and Mean-shift components. Test results show that the proposed framework is robust to a certain extent, even in outdoor environments. For very fast-moving objects and deformable objects, however, the algorithm is not robust enough, especially outdoors. This problem remains the main task for future work.

References

1. B. Babenko, M.-H. Yang, and S. Belongie. Robust object tracking with online multiple instance learning.
2. J.-Y. Bouguet. Pyramidal implementation of the Lucas Kanade feature tracker: description of the algorithm.


Fig. 11. Test With Face Detector

3. M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online multiperson tracking-by-detection from a single, uncalibrated camera. IEEE Trans. Pattern Anal. Mach. Intell., 33(9), Sept.
4. D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell., 24(5), May.
5. N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In C. Schmid, S. Soatto, and C. Tomasi, editors, International Conference on Computer Vision & Pattern Recognition, volume 2, INRIA Rhône-Alpes, ZIRST-655, av. de l'Europe, Montbonnot-38334, June.
6. M. Enzweiler and D. Gavrila. Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12), Dec.
7. H. Grabner and H. Bischof. On-line boosting and vision. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, June.
8. M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 29:5-28.
9. Z. Kalal, J. Matas, and K. Mikolajczyk. P-N Learning: Bootstrapping Binary Classifiers by Structural Constraints. Conference on Computer Vision and Pattern Recognition.
10. Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking failures. In International Conference on Pattern Recognition (ICPR), Aug.
11. Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7), July.
12. C.-H. Kuo and R. Nevatia. How does person identity recognition help multi-person tracking? In CVPR.
13. M. Oezuysal, P. Fua, and V. Lepetit. Fast keypoint recognition in ten lines of code. In Proc. IEEE Conference on Computing Vision and Pattern Recognition.
14. K. Ogata. Modern Control Engineering. Tsinghua University Press.
15. A. Saffari, M. Godec, T. Pock, C. Leistner, and H. Bischof. Online multi-class LPBoost. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June.

Fig. 12. Test With Human Detector

Fig. 13. Tracking result on ETH data set

16. A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof. On-line random forests. In 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), Oct.
17. T. Ojala, M. Pietikäinen, and D. Harwood. Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In Proceedings of the 12th IAPR International Conference on Pattern Recognition (ICPR 1994), vol. 1.
18. P. Viola and M. J. Jones. Robust real-time face detection. Int. J. Comput. Vision, 57(2):137-154, May 2004.
19. G. Welch and G. Bishop. An introduction to the Kalman filter. Technical report, Chapel Hill, NC, USA, 1995.
20. J. Wu, C. Geyer, and J. M. Rehg. Real-time human detection using contour cues. In ICRA. IEEE.
21. J. Wu and J. M. Rehg. CENTRIST: A visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33.
