Robust Object Detection at Regions of Interest with an Application in Ball Recognition


Sara Mitri, Simone Frintrop, Kai Pervölz, Hartmut Surmann
Fraunhofer Institute for Autonomous Intelligent Systems (AIS)
Schloss Birlinghoven, D-53754 Sankt Augustin, Germany
simone.frintrop@ais.fraunhofer.de

Andreas Nüchter
University of Osnabrück, Institute for Computer Science
Knowledge-Based Systems Research Group
Albrechtstraße 28, D-49069 Osnabrück, Germany

Abstract: In this paper, we present a new combination of a biologically inspired attention system (VOCUS - Visual Object detection with a CompUtational attention System) with a robust object detection method. As an application, we built a reliable system for ball recognition in the RoboCup context. Firstly, VOCUS finds regions of interest, generating hypotheses for possible locations of the ball. Secondly, a fast classifier verifies the hypotheses by detecting balls at the regions of interest. The combination of both approaches makes the system highly robust and eliminates false detections. Furthermore, the system is quickly adaptable to balls in different scenarios: the complex classifier is universally applicable to balls in every context, and the attention system improves the performance by learning scenario-specific features quickly from only a few training examples.

Index Terms: visual attention, object classification.

I. INTRODUCTION

A fundamental problem in the field of robotics is the perception of the environment. Our work is inspired by the biological two-stage process of searching for an object in a visual scene [17]: first, human attention is caught by regions with object-specific features such as color or orientation; second, recognition processes restricted to these regions verify or falsify the hypotheses. Our system is designed after these two stages. This paper proposes a scheme for learning and detecting soccer balls through the combination of the computational attention system VOCUS with a classifier. Recognizing soccer balls as an application in the Robot World Cup Soccer Games and Conferences (RoboCup) [8] has been a tough problem to solve because of the lack of definite characteristics describing a ball. Our solution is reliable, scale-independent and color-adaptable in the sense that it can be applied to balls of any size, surface pattern and color.

Our approach consists of a training phase, an adaptation phase, and a detection phase. In the training phase, the classifier is exhaustively trained using balls of different sizes, colors, and surface patterns from a wide variety of training images. The output of the training is a cascade of classifiers that in turn consist of a set of decision trees. In the adaptation phase, VOCUS is quickly adapted to a special scenario: it learns from a few example images (here: 2) the properties of the scenario, e.g., the color of the ball and its intensity contrast to the environment. This adaptation results in a set of feature weights describing the ball in its surroundings. In the detection phase, VOCUS first computes regions of interest by weighting the image features with the learned weights; second, the classifier is applied to these regions, verifying the object hypothesis (Fig. 1). This approach makes the system flexible as well as robust.

Fig. 1. The recognition system consists of the attention system VOCUS providing object candidates and a classification system verifying the hypotheses. The combination yields a flexible and robust system.

The visual attention system VOCUS consists of a bottom-up part computing data-driven saliency and a top-down part enabling goal-directed search.
Bottom-up saliency results from the uniqueness of features, e.g., a black sheep among white ones, whereas top-down saliency uses features that belong to a specified target, e.g., red when searching for a red ball. The bottom-up part, also described in [7], is based on the well-known model of visual attention by Koch & Ullman [11] used by many computational attention systems [12], [1]. It computes saliencies according to the features intensity, orientation, and color and combines them in a saliency map. The most salient region in this map yields the focus of attention. The top-down part is new: it uses previously learned feature weights to excite target-specific features and inhibit others.

Balls are classified according to the Viola-Jones classifier [22]: the shape of the ball is learned by using edge-filtered and thresholded images, represented by computationally efficient integral images [22]. The Gentle AdaBoost learning technique [5] is used to learn a selection of Classification and Regression Trees (CARTs) that select an arrangement of Haar-like features to classify the object. Several selections are combined into a cascade of classifiers. This learning phase is relatively time-consuming, but it only needs to be executed once, since the classifier is then general enough to apply to any ball-shaped object.

The most common techniques for ball detection in RoboCup rely on color information. In the last few years, fast color segmentation algorithms have been developed to detect and track objects in this scenario [10], [19]. The community agreed that in the near future, visual cues like color coding will be removed to come to a more realistic setup with robots playing with a normal soccer ball [20].

Treptow and Zell learn with AdaBoost conglomerations of Haar-like classifiers and arrange them in a cascade to recognize balls without color information [20]. However, in previous work [16] we showed problems with learning non-symmetric object patterns in differently illuminated environments. To overcome this problem, we preprocessed the input with edge detection and learned classification and regression trees (CARTs) instead of simple conglomerations of feature classifiers, and accomplished color-independent ball detection for various balls. To reduce a significant amount of false detections, where the classifier marked various round shapes, e.g., the heads in Fig. 7, we propose here an attention algorithm that is quickly adapted on the spot to a specific ball. It yields several region hypotheses. With the combination of both systems, we eliminate the false detections and identify only the intersection of the two classified sets as correct. In this way, the ball detector can efficiently be applied to more complex images, without worrying about false detections.

The combination of an attention system with classification has also been done by Miau, Papageorgiou and Itti, who detect pedestrians on attentionally focused image regions using a support vector machine algorithm [15]. Walther and colleagues combine in [23] an attention system with the object recognizer of Lowe [14] and show that the recognition results are improved by the attentional front-end. Nevertheless, all of these approaches focus on bottom-up attention and do not enable goal-directed search. To our knowledge, this is the first approach combining a top-down modulated attention system with a classifier.

The rest of the paper is structured as follows: first, we describe the attention system VOCUS in section II. We then briefly discuss the process of learning and detecting balls in section III. The results of each algorithm independently as well as in combination are given in section IV and, finally, section V concludes the paper.

II. THE ATTENTION SYSTEM VOCUS

In this section, we present the goal-directed visual attention system VOCUS (Visual Object detection with a CompUtational attention System) (cf. Fig. 2). With visual attention we mean a selective search-optimization mechanism that tunes the visual processing machinery to approach an optimal configuration [21]. VOCUS consists of a bottom-up part computing data-driven saliency and a top-down part enabling goal-directed search. The global saliency is determined from bottom-up and top-down cues. In the following, we first describe the computation of the bottom-up and then of the top-down saliency.

A. Bottom-up saliency

1) Feature Computations: The first step for computing bottom-up saliency is to generate image pyramids for each feature to enable computations on different scales. Three features are considered: intensity, orientation, and color. For the feature intensity, we convert the input image into gray-scale and generate a Gaussian pyramid with 5 scales s0 to s4 by successively low-pass filtering and subsampling the input image, i.e., scale (i+1) has half the width and height of scale i.

Fig. 2. The goal-directed visual attention system VOCUS with a bottom-up part (left) and a top-down part (right). In learn mode, target weights are learned (blue line arrows). These are used in search mode (red short arrows).
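As an illustration of the pyramid construction just described, the following Python sketch builds a 5-scale Gaussian pyramid by repeated low-pass filtering and subsampling. The function name and the use of scipy are our own assumptions for illustration, not part of the original VOCUS implementation.

```python
# Illustrative sketch (not the original VOCUS code): build a 5-scale
# Gaussian pyramid s0..s4 by low-pass filtering and halving each scale.
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(gray, num_scales=5, sigma=1.0):
    """Return [s0, ..., s4]; each scale has half the width/height of the previous one."""
    scales = [gray.astype(np.float32)]
    for _ in range(num_scales - 1):
        smoothed = gaussian_filter(scales[-1], sigma=sigma)  # low-pass filter
        scales.append(smoothed[::2, ::2])                    # subsample by factor 2
    return scales

# Example: a 240x320 gray-scale image yields scales of size
# 240x320, 120x160, 60x80, 30x40, 15x20.
pyramid = gaussian_pyramid(np.random.rand(240, 320))
print([p.shape for p in pyramid])
```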
The intensity maps are created by center-surround mechanisms, which compute the intensity differences between image regions and their surroundings. We compute two kinds of maps, the on-center maps I_on for bright regions on dark background, and the off-center maps I_off: each pixel in these maps is computed by the difference between a center c and a surround σ (I_on) or vice versa (I_off). Here, c is a pixel in one of the scales s2 to s4, and σ is the average of the surrounding pixels for two different radii. This yields 12 intensity scale maps I_{i,s,σ} with i ∈ {on, off}, s ∈ {s2, s3, s4}, and σ ∈ {3, 7}. The maps for each i are summed up by inter-scale addition, i.e., all maps are resized to scale 2 and then added up pixel by pixel, yielding the intensity feature maps I_i = Σ_{s,σ} I_{i,s,σ}.

To obtain the orientation maps, four oriented Gabor pyramids are created, detecting bar-like features of the orientations θ = {0°, 45°, 90°, 135°}. The maps 2 to 4 of each pyramid are summed up by inter-scale addition, yielding 4 orientation feature maps O_θ.

To compute the color feature maps, the color image is first converted into the uniform CIE LAB color space [2]. It represents colors similar to human perception. The three parameters in the model represent the luminance of the color (L), its position between red and green (A) and its position between yellow and blue (B). From the LAB image, a color image pyramid P_LAB is generated, from which four color pyramids P_R, P_G, P_B, and P_Y are computed for the colors red, green, blue, and yellow. The maps of these pyramids show to which degree a color is represented in an image, i.e., the maps in P_R show the brightest values at red regions and the darkest values at green regions. Luminance is already considered in the intensity maps, so we ignore this channel here. The pixel value P_{R,s}(x, y) in map s of pyramid P_R is obtained from the distance between the corresponding pixel P_LAB(x, y) and the prototype for red, r = (r_a, r_b) = (255, 127). Since P_LAB(x, y) is of the form (p_a, p_b), this yields:

P_{R,s}(x, y) = ||(p_a, p_b) - (r_a, r_b)|| = sqrt((p_a - r_a)^2 + (p_b - r_b)^2).
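To make the prototype-distance computation concrete, here is a small Python sketch that derives a "redness" map from the A/B channels of a LAB image. The inversion of the distance (so that red regions become bright, as the maps in P_R are described) and the helper names are our own assumptions for illustration, not code from the paper.

```python
# Illustrative sketch: color map from the distance to a color prototype
# in the CIE LAB a/b plane, as described above. Assumes a and b channels
# scaled to [0, 255]; the inversion (bright = close to red) is our assumption.
import numpy as np

RED_PROTOTYPE = (255.0, 127.0)  # (r_a, r_b) from the text

def color_map(lab_a, lab_b, prototype):
    """Per-pixel distance of (a, b) to the color prototype, inverted."""
    pa, pb = prototype
    dist = np.sqrt((lab_a - pa) ** 2 + (lab_b - pb) ** 2)
    # Turn the distance into a saliency-like value: 0 for the farthest
    # color, maximal for pixels that match the prototype exactly.
    return dist.max() - dist

# Example with random a/b channels of a 60x80 image:
a_chan = np.random.rand(60, 80) * 255
b_chan = np.random.rand(60, 80) * 255
P_R = color_map(a_chan, b_chan, RED_PROTOTYPE)
```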

On these pyramids, the color contrast is computed by on-center-off-surround differences, yielding 24 color scale maps C_{γ,s,σ} with γ ∈ {red, green, blue, yellow}, s ∈ {s2, s3, s4}, and σ ∈ {3, 7}. The maps of each color are inter-scale added into 4 color feature maps C_γ = Σ_{s,σ} C_{γ,s,σ}.

2) Fusing Saliencies: All feature maps of one feature are combined into a conspicuity map, yielding one map for each feature: I = Σ_i W(I_i), O = Σ_θ W(O_θ), C = Σ_γ W(C_γ). The bottom-up saliency map S_bu is finally determined by fusing the conspicuity maps: S_bu = W(I) + W(O) + W(C).

The exclusivity weighting W is a very important strategy since it increases the impact of relevant maps. Otherwise, a region peaking out in a single feature would be lost in the bulk of maps and no pop-out would be possible. In our context, important maps are those that have few highly salient peaks. For weighting maps according to the number of peaks, each map M is divided by the square root of the number m of local maxima that exceed a threshold t: W(M) = M / sqrt(m). Furthermore, the maps are normalized after summation relative to the largest value within the summed maps. This yields advantages over normalization relative to a fixed value (details in [7]).

3) The Focus of Attention (FOA): To determine the most salient location in S_bu, the point of maximal activation is located. Starting from this point, region growing recursively finds all neighbors with similar values within a threshold, and the FOA is directed to this region. Finally, the salient region is inhibited in the saliency map by zeroing, enabling the computation of the next FOA.

B. Top-down saliency

1) Learning mode: In learning mode, the user marks a rectangle in a training image specifying the region that has to be learned. Then, VOCUS computes the bottom-up saliency map and the most salient region inside the rectangle. So, the system is able to determine automatically what is important in a specified region: it concentrates on the parts that are most salient and disregards the background or less salient parts. Next, weights are determined for the feature and conspicuity maps, indicating how important a feature is in the specified region. The weights are the quotient of the mean saliency in the target region, m_(target), and in the background, m_(image): w_i = m_(target) / m_(image). This computation considers not only which features are the strongest in the region of interest, it also regards which features best separate the region from the rest of the image.

Several training images: Learning weights from one single training image usually yields good results if the target object occurs in all test images in a similar way, i.e., on a similar background. To enable a more stable recognition even on varying backgrounds, we determine the average weights from several training images by computing the geometric mean of the weights, i.e., w_{i,(1..n)} = (Π_{j=1}^{n} w_{i,j})^(1/n), where n is the number of training images.

Fig. 3. Left: Sobel filter applied to a colored image and then thresholded. Right: Edge, line, diagonal, center-surround and 45° features are used for classification.

An algorithm for choosing the training images is proposed in [6]. It showed that, usually, even in complex scenarios 5 training images suffice; for ball detection, already two training images yielded the best performance.
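The weight computation and the geometric-mean averaging over several training images can be written down compactly. The following Python sketch assumes the feature maps and the target-region mask are already available; the array layout and helper names are our own choices for illustration, not the paper's implementation.

```python
# Illustrative sketch of the top-down weight learning described above:
# w_i = mean saliency inside the target region / mean saliency in the
# rest of the image, averaged over training images by a geometric mean.
import numpy as np

def feature_weights(feature_maps, target_mask):
    """feature_maps: list of 2D arrays, target_mask: boolean 2D array."""
    weights = []
    for fmap in feature_maps:
        m_target = fmap[target_mask].mean()
        m_background = fmap[~target_mask].mean()
        weights.append(m_target / (m_background + 1e-9))  # avoid division by zero
    return np.array(weights)

def average_weights(weights_per_image):
    """Geometric mean over the weight vectors of n training images."""
    stacked = np.stack(weights_per_image)          # shape (n, num_features)
    return np.exp(np.log(stacked + 1e-12).mean(axis=0))

# Example with two training images and three feature maps each:
maps1 = [np.random.rand(60, 80) for _ in range(3)]
maps2 = [np.random.rand(60, 80) for _ in range(3)]
mask = np.zeros((60, 80), dtype=bool)
mask[20:40, 30:50] = True
w = average_weights([feature_weights(maps1, mask), feature_weights(maps2, mask)])
```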
2) Search mode: In search mode, firstly the bottom-up saliency map is computed. Additionally, we determine a top-down saliency map that competes with the bottom-up map for saliency. The top-down map is composed of an excitation and an inhibition map. The excitation map E is the weighted sum of all feature maps that are important for the learned object, namely the features with weights greater than 1. The inhibition map I contains the feature maps that are not present in the learned object, namely the features with weights smaller than 1:

E = (Σ_i (w_i * Map_i)) / (Σ_j w_j)   for all i with w_i > 1,
I = (Σ_i ((1/w_i) * Map_i)) / (Σ_j w_j)   for all i with w_i < 1.

The top-down saliency map S_td is obtained by S_td = E - I. The final saliency map S is composed as a combination of bottom-up and top-down influences. When fusing the maps, it is possible to determine the degree to which each map contributes by weighting the maps with a top-down factor t ∈ [0..1]: S = (1 - t) * S_bu + t * S_td. With t = 1, VOCUS looks only for the specified target. With t < 1, bottom-up cues also have an influence and may divert the focus of attention. This is also an important mechanism in human visual attention: e.g., a person suddenly entering a room immediately catches our attention, independently of the task. For the application discussed in this paper, we always use t = 1 and use the bottom-up saliency only to learn the weights of the training objects. Thus, the robot focuses its attention completely on the ball rather than on other robots, so that it does not play foul.

III. COLOR-INDEPENDENT BALL CLASSIFICATION

In this section we briefly discuss the classifier for ball detection that is applied to the foci of attention. The algorithm refers to previous work discussed in [16], which was inspired by Viola and Jones' boosted cascade of simple classifiers for fast face detection [22].

A. Color Invariance using Linear Image Filters

The problem with recognizing general shapes, such as balls in our particular case, is the number of possibilities in the visual appearance of a ball. A ball can take on any color and size and may have any pattern on its surface.

In order to generalize the concept of a ball, the initial goal was to eliminate any color information in the images representing the balls. To detect the edges in the image, we use linear image filters followed by a threshold to eliminate noise; the result is then given as input to the classifier, which in turn handles differences in size, pattern, lighting, etc. For this paper, we are using a Sobel filter, as described in [4]. In order to eliminate the color information in the images, we apply the filter to the colored image and then use a threshold t: any pixel that crosses the threshold t in any of the 3 color channels is included in the output image. The resulting image is a binary image containing the thresholded pixels of the 3 color channels. A typical output image of this technique is shown in Fig. 3 (left). This edge detection and thresholding technique is applied to all images used as input to the training of the Haar classifier. The training process is described in the following subsections.

B. Feature Detection using Integral Images

There are many motivations for using features rather than pixels directly. For mobile robots, a critical motivation is that feature-based systems operate much faster than pixel-based systems [22]. The features are called Haar-like, since they follow the same structure as the Haar basis, i.e., the step functions introduced by Alfred Haar to define wavelets. They are also used in [13], [3], [20], [22]. Fig. 3 (right) shows the eleven basis features, i.e., edge, line, diagonal and center-surround features. The base resolution of the object detector is 30 x 30 pixels; thus, the set of possible features in this area is very large (642,592 features, see [13] for calculation details).

A single feature is effectively computed on input images using integral images [22], also known as summed area tables [13]. An integral image I is an intermediate representation for the image and contains the sum of the gray-scale pixel values of image N with height y and width x, i.e.,

I(x, y) = Σ_{x'=0}^{x} Σ_{y'=0}^{y} N(x', y').

The integral image is computed recursively by the formula

I(x, y) = I(x, y-1) + I(x-1, y) + N(x, y) - I(x-1, y-1)

with I(-1, y) = I(x, -1) = I(-1, -1) = 0, therefore requiring only one scan over the input data. This intermediate representation I(x, y) allows the computation of a rectangle feature value at (x, y) with height and width (h, w) using four references (see Fig. 4 (left)):

F(x, y, h, w) = I(x, y) + I(x + w, y + h) - I(x, y + h) - I(x + w, y).

Fig. 4. Left: Computation of the feature value F in the shaded region is based on the four upper rectangles. Middle: Calculation of the rotated integral image I_r. Right: Four lookups in the rotated integral image are required to compute the value of a rotated feature F_r.
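A minimal Python sketch of the summed-area-table idea follows. It uses cumulative sums instead of the explicit recursion, and the boundary handling (prepending a zero row and column) is our own convention for illustration rather than the paper's.

```python
# Illustrative sketch: integral image (summed area table) and the
# four-lookup rectangle sum described above. Uses cumulative sums rather
# than the explicit recursion; a zero border simplifies boundary cases.
import numpy as np

def integral_image(n):
    """ii[y, x] = sum of n over all pixels with row < y and column < x."""
    ii = np.zeros((n.shape[0] + 1, n.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = n.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle whose top-left pixel is (x, y)."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

# Example: a Haar-like edge feature = left half minus right half of a patch.
img = np.random.rand(30, 30)
ii = integral_image(img)
feature_value = rect_sum(ii, 0, 0, 15, 30) - rect_sum(ii, 15, 0, 15, 30)
```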
For the computation of the rotated features, Lienhart et al. introduced rotated summed area tables that contain the sum of the pixels of the rectangle rotated by 45° with the bottom-most corner at (x, y) and extending to the boundaries of the image (see Fig. 4 (middle and right)) [13]:

I_r(x, y) = Σ_{x'} Σ_{y'=0}^{y-|x-x'|} N(x', y').

The rotated integral image I_r is computed recursively, i.e.,

I_r(x, y) = I_r(x-1, y-1) + I_r(x+1, y-1) - I_r(x, y-2) + N(x, y) + N(x, y-1),

using the start values I_r(-1, y) = I_r(x, -1) = I_r(x, -2) = I_r(-1, -1) = I_r(-1, -2) = 0. Four table lookups are required to compute the pixel sum of any rotated rectangle with the formula:

F_r(x, y, h, w) = I_r(x + w - h, y + w + h - 1) + I_r(x, y - 1) - I_r(x - h, y + h - 1) - I_r(x + w, y + w - 1).

Since the features are compositions of rectangles, they are computed with several lookups and subtractions weighted with the area of the black and white rectangles. To detect a feature, a threshold is required. This threshold is automatically determined during a fitting process, such that a minimum number of examples are misclassified. Furthermore, the return values (α, β) of the feature are determined, such that the error on the examples is minimized. The examples are given in a set of images that are classified as positive or negative samples. The set is also used in the learning phase that is briefly described next.

C. Learning Classification Functions

1) Classification and Regression Trees: For all 642,592 possible features a Classification and Regression Tree (CART) is created. CART analysis is a form of binary recursive partitioning. Each node is split into two child nodes, in which case the original node is called a parent node. The term recursive refers to the fact that the binary partitioning process is applied over and over again, up to a given number of splits (4 in this case). In order to find the best possible split features, all possible splits are calculated, as well as all possible return values to be used in a split node. The program seeks to maximize the average purity of the two child nodes using the misclassification error measure. Fig. 5 (left) shows a CART classifier.

Fig. 5. Left: A Classification and Regression Tree with 4 splits. According to the specific filter applied to the image input section x, the output of the tree, h_t(x), is calculated, depending on the threshold values. Right: A cascade of CARTs [16]. h_t(x) is determined depending on the path through the tree.
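To illustrate what such a small tree computes at detection time, here is a hedged Python sketch of a depth-limited CART over Haar feature values. The particular feature indices, thresholds and return values are invented for illustration and are not taken from the paper.

```python
# Illustrative sketch (values invented): a tiny CART with a few splits.
# Each node thresholds one Haar-like feature value; leaves return a
# real-valued vote h_t(x) as used later by the boosted cascade.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: int = 0                 # index into the feature vector
    threshold: float = 0.0           # split threshold
    left: Optional["Node"] = None    # taken if feature value <  threshold
    right: Optional["Node"] = None   # taken if feature value >= threshold
    value: Optional[float] = None    # set for leaves only

def evaluate(node, features):
    """Follow the splits down to a leaf and return its vote."""
    while node.value is None:
        node = node.left if features[node.feature] < node.threshold else node.right
    return node.value

# A hand-made tree with 2 splits (the paper uses 4):
leaf = lambda v: Node(value=v)
tree = Node(feature=3, threshold=0.01,
            left=leaf(-0.8),
            right=Node(feature=7, threshold=0.07, left=leaf(0.2), right=leaf(0.9)))
vote = evaluate(tree, features=[0.0] * 16)
```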

2) Gentle AdaBoost for CARTs: The Gentle AdaBoost algorithm [5] is used to select a set of simple CARTs to achieve a given detection and error rate [13]. In the following, a detection is referred to as a hit and an error as a false alarm. The learning is based on N weighted training examples (x_1, y_1), ..., (x_N, y_N), where the x_i are the images and y_i ∈ {-1, 1}, i ∈ {1, ..., N}, the classified output. At the beginning of the learning phase the weights w_i are initialized with w_i = 1/N. The following three steps are repeated to select CARTs until a given detection rate d is reached:

1) Every classifier, i.e., a CART, is fit to the data. Hereby the error e is calculated with respect to the weights w_i.
2) The best CART h_t is chosen for the classification function. The counter t is incremented.
3) The weights are updated with w_i := w_i * e^(-y_i h_t(x_i)) and renormalized.

The final output of the classifier is sign(Σ_{t=1}^{T} h_t(x)) > 0, with h_t(x) the weighted return value of the CART. Next, a cascade based on these classifiers is built.

D. The Cascade of Classifiers

The performance of a single classifier is not suitable for object classification, since it produces a high hit rate, e.g., 0.999, but also a high error rate, e.g., 0.5. Nevertheless, the hit rate is much higher than the error rate. To construct an overall good classifier, several classifiers are arranged in a cascade, i.e., a degenerated decision tree. In every stage of the cascade, a decision is made whether the image contains the object or not. This computation reduces both rates. Since the hit rate is close to one, their multiplication also results in a value close to one, while the multiplication of the smaller error rates approaches zero. Furthermore, this speeds up the whole classification process. An overall effective cascade is learned by a simple iterative method. For every stage the classification function h_t(x) is learned, until the required hit rate is reached. The process continues with the next stage using the correctly classified positive and the currently misclassified negative examples. The number of CARTs used in each classifier may increase with additional stages.
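The boosting loop and the cascade decision can be sketched in a few lines of Python. Decision stumps stand in for the CARTs here, and the data, round count and stage setup are invented, so this is only a schematic of the procedure described above, not the paper's training code.

```python
# Schematic sketch of the boosting loop and cascade described above.
# Decision stumps on single feature values stand in for the CARTs.
import numpy as np

def fit_stump(X, y, w):
    """Pick the (feature, threshold, sign) with the lowest weighted error."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (1.0, -1.0):
                pred = np.where(X[:, f] >= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    _, f, thr, sign = best
    return lambda X, f=f, thr=thr, sign=sign: np.where(X[:, f] >= thr, sign, -sign)

def boost(X, y, rounds=5):
    """Select stumps h_t and reweight examples with w_i *= exp(-y_i h_t(x_i))."""
    w = np.full(len(y), 1.0 / len(y))
    stumps = []
    for _ in range(rounds):
        h = fit_stump(X, y, w)
        stumps.append(h)
        w *= np.exp(-y * h(X))
        w /= w.sum()
    return lambda X: np.sign(sum(h(X) for h in stumps))  # strong classifier

def cascade_predict(stages, X):
    """An example is accepted only if every stage of the cascade accepts it."""
    accepted = np.ones(len(X), dtype=bool)
    for stage in stages:
        accepted &= (stage(X) > 0)
    return accepted

X = np.random.rand(200, 16)
y = np.where(X[:, 3] > 0.5, 1.0, -1.0)
stage1 = boost(X, y, rounds=3)
print(cascade_predict([stage1], X).mean())
```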
IV. EXPERIMENTS AND RESULTS

First, the performance of the classifier alone is shown. Then, the attention algorithm is additionally applied to adapt the detection and to reduce the false positives.

A. Results of the classifier alone

The ball detection cascade was learned with a total of 1000 images, with complex scenes included in the training set, and tested by using three soccer balls of different colors and patterns. The process of generating the cascade of classifiers is relatively time-consuming, but it only needs to be executed once, provided a good cascade is generated. Fig. 6 shows detection results on five different kinds of balls; thus, the CARTs form a correct dependency of features. Since only the upper two balls (white and yellow/red ball) and the red one given in Fig. 7 were used for learning, the figure demonstrates the classifier's ability to generalize to all balls. For each kind of ball we ran the test with 60 images, making a total of 180 test images. The results in Table I reveal how many red, white or yellow/red balls were correctly classified or not detected, as well as the number of false positives for each ball. The problem we were facing with this approach was the difficulty of differentiating between soccer balls and other spherical objects (Fig. 7).

Fig. 6. Five different kinds of balls are detected by the classifier.

TABLE I
DETECTION RATE OF THE CASCADE OF CLASSIFIERS DEPENDING ON THE NUMBER OF STAGES USED. THE CASCADE WITH 10 STAGES WAS USED FOR THE EXPERIMENTS WITH THE ATTENTION SYSTEM.

# stages  Ball          Correct   Not detected  False pos.
9         red ball      52/60     8/60          114
          white ball    48/60     12/60         70
          yel/red ball  57/60     3/60          108
          Total         157/180   23/180        292
10        red ball      45/60     15/60         52
          white ball    44/60     16/60         45
          yel/red ball  57/60     3/60          63
          Total         146/180   34/180        160
11        red ball      45/60     15/60         51
          white ball    42/60     18/60         47
          yel/red ball  56/60     4/60          65
          Total         143/180   37/180        163
12        red ball      44/60     16/60         26
          white ball    29/60     31/60         31
          yel/red ball  37/60     23/60         23
          Total         110/180   70/180        80

The detection rate of the classifier is adjustable, i.e., a lower number of cascade stages increases the number of detections (hits), but also the amount of false detections. By combining the classifier and the attention algorithm, the false positive rate is reduced.

B. Combining the classifier and the attention algorithm

The output of the combination of the two algorithms is the intersection of both result sets: the balls detected must be found both by the ball classifier and by the attention algorithm. First, the foci are found in the image. Then, the classifier tries to detect balls at these specific regions. The results of the combination are shown in Table II. The test data is composed of a set of 60 realistic RoboCup images for each ball, where there is exactly one ball in each image. These were taken with backgrounds of different lighting (color) and complexity. The classifier searches the areas of the first 5 foci found by the attention algorithm. The combination is very useful in eliminating false positives in images.
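The intersection step can be phrased as a simple filter over the classifier's detections. The box and focus representations below, and the containment test, are our own assumptions for illustration rather than the paper's implementation.

```python
# Illustrative sketch of the combination step: keep only classifier
# detections that coincide with one of the first k foci of attention.
# Box format (x, y, w, h) and the point-in-box test are our assumptions.

def combine(detections, foci, k=5):
    """detections: list of (x, y, w, h); foci: list of (x, y) focus points."""
    kept = []
    for (x, y, w, h) in detections:
        for (fx, fy) in foci[:k]:
            if x <= fx <= x + w and y <= fy <= y + h:
                kept.append((x, y, w, h))
                break
    return kept

# Example: two detections, one of which coincides with a focus of attention.
detections = [(10, 20, 40, 40), (200, 50, 30, 30)]
foci = [(25, 35), (120, 90)]
print(combine(detections, foci))   # -> [(10, 20, 40, 40)]
```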

TABLE II
DETECTION RATE OF THE COMBINED ALGORITHM. COLUMN 2 (ATTENTION) SHOWS WHICH OF THE 5 FOCI POINTS TO THE BALL (AVERAGE).

              Att.    Classifier only        Att. and Class.
                      Found      False Pos.  Found      False Pos.
red ball      1.0     45/60      52          45/60      3
white ball    1.0     44/60      45          41/60      0
yel/red b.    1.2     57/60      63          55/60      20
Total         1.07    146/180    160         141/180    23

Fig. 7. Top: Input images including round objects. Middle: False alarms in filtered images. Bottom left: False positives eliminated, ball not found. Bottom right: False detections eliminated.

This is shown in Fig. 7, where the false positives we were suffering from with the classifier alone are eliminated. The focus of attention is calculated in ca. 1.5 s, and the classification at these regions of interest needs 200 ms (image size: 240 x 320, Pentium-M 1.7 GHz). The bulk of the running time of VOCUS is taken up by the feature computations. These may be parallelized by splitting up the processing over several CPUs [9] or with dedicated hardware [18], which would make the system real-time capable. We consider this for future work.

V. CONCLUSIONS

Using the visual attention system VOCUS combined with a fast classifier, we have designed a robust ball detection system with a very low misclassification rate, even in complex, cluttered images. Due to the use of an edge-detecting Sobel filter and a threshold to preprocess the training images for the cascade, the classifier is color-invariant, leaving the color to be learned by the attention system. Assuming short-term prior knowledge about the ball to be used for a RoboCup match, VOCUS is quickly adjusted to the ball with very few images. The success of the algorithm is reached by only searching for balls in regions hypothesized by the attention algorithm to contain the ball, thereby eliminating false positives. Although the algorithm misses a few balls, what we are concerned with is how it will perform in the RoboCup environment. In this case, the reliability of the algorithm seems to be sufficient: even if the ball is not detected in, for example, one in every 5 pictures, the robot will still be able to follow it quite confidently.

Needless to say, much work remains to be done. As the detection of regions of interest is currently relatively slow compared to the ball detection, the next step is to work on increasing the efficiency of the attention system and therefore of the whole detection scheme. In addition, it is planned to enhance the presented algorithms by adding time-dependent behavior, either by using standard tracking with particle filters or by using a time-dependent attention control.

REFERENCES

[1] G. Backer, B. Mertsching, and M. Bollmann. Data- and model-driven gaze control for an active-vision system. IEEE Trans. on Pattern Analysis & Machine Intelligence, 23(12):1415-1429, 2001.
[2] R. E. Burger. Colormanagement. Konzepte, Begriffe, Systeme. Springer, 1997.
[3] C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Proceedings of the 6th International Conference on Computer Vision (ICCV '98), Bombay, India, January 1998.
[4] M. Das and J. Anand. Robust edge detection in noisy images using an adaptive stochastic gradient technique. In Proc. ICIP, 1995.
[5] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proc. of the 13th Int. Conf., 1996.
[6] S. Frintrop, G. Backer, and E. Rome. Goal-directed search with a top-down modulated computational attention system. Submitted.
[7] S. Frintrop, E. Rome, A. Nüchter, and H. Surmann. A bimodal laser-based attention system. CVIU (accepted).
[8] RoboCup. http://www.robocup.org.
[9] L. Itti. Real-time high-performance attention focusing in outdoor color video streams. In SPIE Human Vision and Electronic Imaging IV, 2002.
[10] J. Bruce, T. Balch, and M. Veloso. Fast and inexpensive color image segmentation for interactive robots. In Proc. IROS, 2000.
[11] C. Koch and S. Ullman. Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 1985.
[12] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE PAMI, 20(11), 1998.
[13] R. Lienhart and J. Maydt. An extended set of Haar-like features for rapid object detection. In Proc. of the IEEE Conf. on Image Processing (ICIP '02), pages 155-162, New York, USA, 2002.
[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International J. of Computer Vision, 60(2):91-110, 2004.
[15] F. Miau, C. Papageorgiou, and L. Itti. Neuromorphic algorithms for computer vision and attention. In Proc. SPIE 46th Ann. Int. Symp. on Optical Science and Technology, vol. 4479, Nov. 2001.
[16] S. Mitri, K. Pervölz, A. Nüchter, and H. Surmann. Fast color-independent ball detection for mobile autonomous robots. In Proc. IEEE MechRob, Aachen, Germany, 2004.
[17] U. Neisser. Cognitive Psychology. Appleton-Century-Crofts, 1967.
[18] N. Ouerhani and H. Hügli. Real time visual attention on a massively parallel SIMD architecture. J. of Real-Time Imaging, 9(3), 2003.
[19] T. Bandlow, M. Klupsch, R. Hanek, and T. Schmitt. Fast image segmentation, object recognition and localization in a RoboCup scenario. In 3rd RoboCup Workshop, IJCAI, 1999.
[20] A. Treptow and A. Zell. Real-time object tracking for soccer-robots without color information. Robotics and Autonomous Systems, 48(1):41-48, August 2004.
[21] J. K. Tsotsos. Complexity, vision, and attention. In Vision and Attention, chapter 6, 2001.
[22] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-154, May 2004.
[23] D. Walther, U. Rutishauser, C. Koch, and P. Perona. On the usefulness of attention for object recognition. In Proc. WAPCV, 2004.

ACKNOWLEDGMENTS

We would like to thank G. Backer, M. Hennig, J. Hertzberg, and E. Rome for supporting our work.