A User-Attention Based Focus Detection Framework and Its Application

Chia-Chiang Ho, Wen-Huang Cheng, Ting-Jian Pan, Ja-Ling Wu
Communication and Multimedia Laboratory, Department of Computer Science and Information Engineering, National Taiwan University, No. 1, Roosevelt Rd. Sec. 4, Taipei, Taiwan

Abstract

In this paper, a generic user-attention based focus detection framework is developed to capture user focus points for video frames. The proposed framework considers both bottom-up and top-down attention, and integrates both image-based and video-based visual features for saliency map computation. For efficiency, the number of adopted features is kept as small as possible. The realized framework is extensible and flexible in integrating more features with a variety of fusion schemes. One application of the proposed framework, the user-assisted spatial resolution reduction, is also addressed.

1. Introduction

Attention refers to the ability of a human to focus and concentrate upon some visual or auditory object, by carefully watching or listening. Assuming that a human has limited processing resources, attention also refers to the allocation of these resources. Here, the resources can be either neurological or cognitive. The former is often referred to as bottom-up attention and the latter as top-down attention. Roughly speaking, bottom-up attention models what people are attracted to see, and top-down attention models what people are willing to see. These two models can be summarized as follows:

Bottom-up attention: Bottom-up attention can be modeled as an integration of different measurable, low-level image features [1]. Koch and Ullman proposed the first neurally plausible computational architecture of a bottom-up attention model in 1985 [2]. Later research on bottom-up attention generally followed this pioneering architecture. Nowadays, the bottom-up attention model proposed by Itti et al. draws great attention [3].
Top-down attention: Although the bottom-up attention model may capture the deployment of attention within the first few hundred milliseconds after the visual scene is presented, a complete attention model must consider top-down, task-oriented influences as well. Building on bottom-up attention models, several studies have integrated the concept of top-down attention for object recognition tasks [4][5][6]. For general purposes, top-down attention is usually modeled by detecting some meaningful (semantic) objects or video features. For example, in [7], top-down attention was modeled by face detection and camera motion detection.

In this paper, we integrate both bottom-up and top-down features in our user attention model. Then, we propose a general user-attention based framework to detect user focuses in video frames. Our work may provide another point of view for solving some content-aware problems. We present one application and related experimental results of the proposed focus detection framework: the user-assisted spatial resolution reduction, which aims at obtaining a better spatial resolution reduction of the input video than direct sub-sampling. Modeling user attention may also benefit other applications, such as video encoding, surveillance, watermarking, and video summarization [7].

The rest of this paper is organized as follows: Section 2 introduces the proposed user-attention based focus detection framework. Section 3 discusses the user-assisted spatial resolution reduction. Section 4 reports some experimental results. Finally, Section 5 presents the conclusions and future work.

2. The Proposed User-Attention Based Focus Detection Framework

In this section, we propose a user-attention based focus detection framework, which combines both bottom-up and top-down attention features, such as intensity, color, motion and face. Without full semantic understanding of video content, the proposed framework provides us another way to benefit many content-based applications. Meanwhile, the system is carefully designed to deal with the speed issues that real-time applications are concerned about. Fig. 1 shows the snapshot and illustration of the implemented focus detection framework.

Figure 1. The operational snapshot of the proposed user-attention based focus detection framework.

2.1 Attentional Visual Features Calculation

We model computable attentional visual features at three levels: low level, medium level and high level.

Low-level features (intensity and color): Features belonging to this level correspond to the so-called early visual features in biological vision, including but not limited to intensity contrast, color opponency, and orientation. Speed is a major concern in our system, so only intensity and color are used in our attention model. It is worth mentioning that, in our observation, most of the salient regions (or objects) found through computation of different early visual features might actually be similar. In our system, intensity and color features are calculated in the YCbCr color space, which is used by most video coding standards. The generation of the adopted feature maps is similar to that of the Itti method [8].

A map normalization operation is then applied to each feature map, which globally promotes maps in which a small number of strong peaks are present and globally suppresses those maps with many peaks. Another effect of the normalization operation is to let all feature maps share a common dynamic range. The normalization operation is performed by:

a) Finding the minimum and maximum values, MinVal and MaxVal, of the feature map, and then calculating a threshold value $\delta$, which is defined as

$$\delta = MinVal + (MaxVal - MinVal)/10. \tag{1}$$

b) Computing the average value, $V_\delta$, of the pixels with values larger than $\delta$.

c) Calculating the map scaling factor as

$$F = \frac{MaxVal - V_\delta}{MaxVal - MinVal} \cdot \frac{255}{MaxVal}, \tag{2}$$

and multiplying the feature map by this scaling factor.

The dynamic range after the map normalization operation is [0, 255] for all feature maps.

The last thing to do is to combine all feature maps belonging to one feature into one integrated map. All maps are re-scaled to the same size, and a pixel-by-pixel maximum operation is performed among all these re-scaled maps. This resembles a winner-take-all competition among different feature maps. An additional maximum operation is performed to combine the Cb and Cr feature maps into one color saliency map. Thus the computation of low-level features yields two saliency maps, i.e., the intensity saliency map and the color saliency map.

Medium-level features (motion): In our view, motion carries not only bottom-up but also top-down attention information. Some basic observations can be drawn:

a) Image segments with spatially consistent motion fields are more likely to be parts of foreground moving objects and receive more user attention than those in the background do.
b) The user is more aware of objects with temporally consistent motion.
c) Objects with larger motion draw more attention than those with smaller motion do.
d) People can pay attention to a very limited number of objects in a scene. When there are many different objects (possibly with different motions), people lose the ability to attend to them.

In this paper, block motion vectors are used for motion analysis and user-attention modeling. Although motion vectors sometimes do not reflect the true motion field well, using them requires far less computation than fine-grained optical flow. An approach similar to [7] is adopted to compute our motion saliency map. For one macroblock i, we define the intensity of its motion vector (dx, dy) as

$$I(i) = \sqrt{dx^2 + dy^2}. \tag{3}$$

By computing the motion vector intensities of all macroblocks in a frame, we get an intensity map I. Let W(i) be the set of motion vectors of all macroblocks within a predefined window, W. The phase of one motion vector (dx, dy) is defined as

$$Phase = \arctan\left(\frac{dy}{dx}\right). \tag{4}$$

The range of a phase is $[0, 2\pi]$. We calculate an eight-bin phase histogram of W(i) and measure the spatial-temporal consistency as

$$C(i) = 1 + \frac{\sum_{h=1}^{H} p(h)\log(p(h))}{\log(H)} \tag{5}$$
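As an illustration only, the motion quantities of Eqs. (3) to (5), combined by the product of Eq. (6), can be sketched in Python with NumPy. Here H is the number of histogram bins (eight in the text) and p(h) is the fraction of the window's motion vectors falling in bin h. The function name `motion_saliency`, the purely spatial square window of half-size `radius` (the window W in the text is spatial-temporal), and the final scaling of the peak to 255 (a simple stand-in for the normalization operation) are assumptions made for this sketch, not details of the implemented system.

```python
import numpy as np


def motion_saliency(dx, dy, radius=1, bins=8):
    """Sketch of Eqs. (3)-(6) for one frame of block motion vectors.

    dx, dy : 2-D arrays holding one motion vector component per macroblock.
    radius : half-size of the square spatial window W (an assumption).
    bins   : number of phase histogram bins H (eight in the text).
    """
    dx = np.asarray(dx, dtype=float)
    dy = np.asarray(dy, dtype=float)

    intensity = np.hypot(dx, dy)                      # Eq. (3)
    phase = np.mod(np.arctan2(dy, dx), 2.0 * np.pi)   # Eq. (4), mapped to [0, 2*pi)
    # Quantize each phase into one of `bins` equal angular sectors.
    sector = np.minimum((phase / (2.0 * np.pi) * bins).astype(int), bins - 1)

    rows, cols = dx.shape
    consistency = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            win = sector[max(0, r - radius):r + radius + 1,
                         max(0, c - radius):c + radius + 1]
            p = np.bincount(win.ravel(), minlength=bins) / win.size
            p = p[p > 0]                              # treat 0 * log(0) as 0
            # Eq. (5): one minus the normalized entropy of the phase histogram.
            consistency[r, c] = 1.0 + np.sum(p * np.log(p)) / np.log(bins)

    saliency = intensity * consistency                # Eq. (6)
    peak = saliency.max()
    # Stand-in for the normalization operation: scale the peak to 255.
    return saliency * (255.0 / peak) if peak > 0 else saliency
```

With a perfectly uniform motion field every window histogram has a single occupied bin, so the consistency is 1 everywhere and the saliency is driven by the motion intensity alone.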

In Eq. (5), H is the number of histogram bins and p(h) is the probability of one particular bin h. A larger C(i) means a more consistent motion field. By computing the spatial-temporal consistency values of all macroblocks, we get a motion consistency map C. The motion intensity map, I, and the motion consistency map, C, are combined to yield a single value of motion attention. That is,

$$M = I \times C. \tag{6}$$

We call M the motion saliency map, and a normalization operation is performed to yield a dynamic range of [0, 255].

High-level features (face): Dominant faces in video frames certainly attract user attention. In fact, we can assume that people naturally locate faces in video frames with priority over other types of objects. In our implementation, two kinds of face detection schemes are investigated. Traditional face detection is based on template matching. We investigate an object detector specifically trained for face detection. The idea was initially proposed by Paul Viola [9] and improved by Rainer Lienhart [10]. This scheme provides comparably good results for frontal faces; however, it performs poorly on non-frontal and tilted faces. The alternative class of face detection is based on skin-region detection. We implemented the skin color model given in [11]. This scheme can find most of the face regions in video frames. While its miss detection rate is very low, using skin-color detection alone sometimes suffers from a high false alarm rate. It is our experience that using the morphological opening operation and imposing some size and aspect ratio constraints on the detected regions can help to reduce false alarms. Finally, after face region detection, we get a face saliency map.

2.2 The Fusion Stage and the Focus Point Detection

The fusion stage: After the four saliency maps (intensity, color, motion and face maps) are available, a fusion stage is required to integrate all maps into a final saliency map. In our implementation, we use a particular fusion scheme, called priority-based competition. Feature maps in the same level are combined using the maximum operation.
Then the integrated map in the lower level is scaled down by a pre-defined factor, and then competes with the integrated map in the higher level. The block diagram of the priority-based competition is shown in Fig. 2.

Focus point detection: Once the final saliency map is generated through the fusion stage, we are ready to detect a single focus point for each video frame. The following steps accomplish this work:

a) Thresholding and binarization. Let MIN and MAX be the minimal and the maximal values of the final saliency map. The threshold BinThr is determined as

$$BinThr = MAX - (MAX - MIN)/ThrFactor, \tag{7}$$

where ThrFactor is an adjustable parameter.
b) Connected component analysis is then performed and the largest component is found. The center of the largest connected component is set as the candidate focus point. If no connected component is found, or the area of the largest component is smaller than a pre-defined threshold, the candidate focus point is set to be the center of the frame.
c) To avoid possible false alarms, we restrict the distance between the focus points of two neighboring frames.
d) To maintain a smooth locus of focus points, a Gaussian-like filter is applied to successive detected focuses.

Figure 2. The priority-based fusion scheme for final saliency map calculation.

3. User-Assisted Spatial Resolution Reduction

Spatial resolution reduction is necessary for some content repurposing applications, for example, adapting video with higher spatial resolution to devices with smaller displays. The easiest way to perform spatial resolution reduction is direct sub-sampling; however, this may be undesirable because the interesting subject(s) may become too small to view. For better user satisfaction, we propose the so-called user-assisted spatial resolution reduction. First, we perform the attention focus detection based on the user-attention modeling system described above. Then we let the user specify the wanted spatial sub-sampling factor and the spatial cropping factor.
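The attention focus detection used in the first step can be illustrated with a minimal Python sketch of Section 2.2: priority-based fusion followed by steps a) and b) of focus point detection. Several choices here are assumptions made for illustration and are not stated in the text: "competes" is read as a pixel-wise maximum, the scale-down factor is applied once per level, foreground pixels are those at or above BinThr, components use 4-connectivity, the "center" of a component is taken as its centroid, and the names and default values of `scale`, `thr_factor` and `min_area` are hypothetical. Steps c) and d), which act across frames, are not covered.

```python
from collections import deque

import numpy as np


def priority_fusion(levels, scale=0.5):
    """levels: list ordered from lowest to highest priority; each entry is a
    list of equally sized saliency maps with values in [0, 255]."""
    fused = None
    for maps in levels:
        # Maps in the same level are combined with a pixel-wise maximum.
        level_map = np.maximum.reduce([np.asarray(m, dtype=float) for m in maps])
        # The lower-level result is scaled down, then competes (assumed: max).
        fused = level_map if fused is None else np.maximum(fused * scale, level_map)
    return fused


def detect_focus(saliency, thr_factor=4.0, min_area=4):
    """Return the (row, col) candidate focus point of one final saliency map."""
    s = np.asarray(saliency, dtype=float)
    h, w = s.shape
    center = (h // 2, w // 2)
    lo, hi = s.min(), s.max()
    if hi == lo:                              # flat map: nothing stands out
        return center
    bin_thr = hi - (hi - lo) / thr_factor     # Eq. (7)
    mask = s >= bin_thr                       # binarization

    seen = np.zeros((h, w), dtype=bool)
    best = []
    for r, c in zip(*np.nonzero(mask)):
        if seen[r, c]:
            continue
        # Breadth-first search collects one 4-connected component.
        comp, queue = [], deque([(int(r), int(c))])
        seen[r, c] = True
        while queue:
            y, x = queue.popleft()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(comp) > len(best):
            best = comp

    if len(best) < min_area:                  # too small: fall back to frame center
        return center
    ys, xs = zip(*best)
    return (int(round(sum(ys) / len(ys))), int(round(sum(xs) / len(xs))))
```

A call such as `detect_focus(priority_fusion([[intensity_map, color_map], [motion_map], [face_map]]))` would then give the candidate focus point of one frame, where the four map names stand for the saliency maps of Section 2.1.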
Let these two values be denoted as $r_s$ and $r_c$, respectively; the valid range of both is (0, 1]. Finally, we perform the required sub-sampling and cropping operations by setting the detected focus point as the center of the operational region. We can then adjust the spatial resolution of an image or video by cropping the operational region instead of just down-sampling the entire image or video.

4. Experimental Results

In this section, we present some experimental results of the proposed user-attention based focus point detection framework, the user-assisted spatial resolution reduction, and the attentional focus-point subjective experiment.

We test the proposed focus detection framework on some well-known sequences. Fig. 4 and Fig. 5 show some focus detection results for the sequences "foreman" and "mobile", respectively. In Fig. 4, the foreman has conspicuous motion activities and an explicit face, which are successfully detected through motion and face saliency calculation, and the detected focus points are reasonable and satisfactory.

Figure 4. Results of saliency maps and focus point detection for the sequence "foreman": (a) the 9th frame, (b) the 80th frame.

Fig. 5 shows saliency maps and detected focus points for the frames of interest in the "mobile" sequence. Although these frames have a lot of intensity or color saliency spots (it can be seen that the background is complex), the motion maps pop out because (1) only one conspicuous region exists and (2) motion maps have larger weights in the fusion stage. The results are satisfactory.

Figure 5. Results of saliency maps and focus point detection for the sequence "mobile": (a) the 15th frame, and (b) the 73rd frame.

To further validate the proposed framework of focus finding, we performed subjective experiments. Testing video clips are classified into six categories according to different characteristics. In different categories, we use different weighting factors in combining the saliency maps of the focus detection framework. Then, we invited 20 observers to participate in the test. Every observer gives a score from 1 to 5 (a larger value means better perceived quality) according to his or her intuition. The subjective results are shown in Table 1. In each category, the score mean is satisfactory and the score variance is small. We find that, by choosing proper parameter sets, the detected focus points in different kinds of video content are representative. In conclusion, the focus-point detection framework successfully models the observers' attention.

Table 1. Results of the focus-point subjective experiment (columns: Category, Score mean, Score var.; rows: Home Video, High Motion, Nature, Sport, TV & Movie, Other).

The second kind of experiment concerns spatial resolution reduction. Fig. 6 shows different spatial resolution reduction results of the sequence "horn" obtained with different parameters but the same final image size. The cropping regions shown in Fig. 6 are determined with the assistance of the user attention modeling system. The difference in semantic information revealed by different spatial resolution reduction parameters is obvious, and the user-attention modeling system does help to reveal more semantic information when the final image size is very small compared to the original one.

Figure 6. Spatial resolution reduction with parameters: (a) $r_s$ = 0.25 and $r_c$ = 1.0, (b) $r_s$ = 0.5 and $r_c$ = 0.5, and (c) $r_s$ = 1.0 and $r_c$ = [not recoverable].

5. The Conclusion and Future Work

The following conclusions can be made for the proposed framework of focus detection:

a) A general user-attention based focus detection framework is developed to capture user focus points on video frames. The proposed framework considers both bottom-up and top-down attention, and is extensible and flexible for integrating more features with a variety of fusion schemes.

b) Combined with other perceptual models, our framework can support many applications, such as user-attention based video encoding and the user-assisted spatial resolution reduction, which has also been addressed in this paper.

In the future, more work must be done to extend the capability of the proposed framework, e.g., using more complex modeling schemes to improve the adopted features, integrating more robust face detection schemes, and adopting more complex fusion schemes in the fusion stage. As for applications of focus detection, one interesting application is the so-called user-attention based video encoding, which aims at reducing the bitrate requirement without sacrificing perceived quality for typical encoding schemes. User-attention based video encoding can be done by discarding unimportant visual information as much as possible, under the guidance of the foveation model [12]. It also selectively preserves higher quality for the focused regions, in exchange for worse quality in the peripheral regions, to maximally match user expectation. These scenarios can also be applied to generate the base layer bitstream of a scalable video when the bitrate constraint is very strict. Fig. 3 shows the proposed architecture of the user-attention based video encoding. Some research issues and experimental results are addressed in [13].

Figure 3. The proposed architecture of the user-attention based video encoding.

References

[1] A. M. Treisman and G. Gelade, "A feature integration theory of attention," Cognitive Psychology, Vol. 12, No. 1.
[2] C. Koch and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," Human Neurobiology, Vol. 4, 1985.
[3] L. Itti and C. Koch, "Computational modeling of visual attention," Nature Reviews Neuroscience, Vol. 2, No. 3, Mar.
[4] K. Schill, E. Umkehrer, S. Beinlich, G. Krieger, and C. Zetzsche, "Scene analysis with saccadic eye movements: top-down and bottom-up modeling," Journal of Electronic Imaging, Special Issue on Human Vision and Electronic Imaging, Vol. 10, No. 1.
[5] I. A. Rybak, V. I. Gusakova, A. Golovan, L. N. Podladchikova, and N. A. Shevtsova, "A model of attention-guided visual perception and recognition," Vision Research, Vol. 38.
[6] G. Deco and J. Zihl, "A neurodynamical model of visual attention: Feedback enhancement of spatial resolution in a hierarchical system," Journal of Computational Neuroscience, Vol. 10.
[7] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, "A user attention model for video summarization," in Proc. ACM Multimedia (ACMMM'02), Dec. 2002.
[8] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Machine Intell., Vol. 20, No. 11, Nov.
[9] P. Viola and M. Jones, "Robust real-time object detection," in Second Intl. Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing and Sampling, July.
[10] R. Lienhart and J. Maydt, "An extended set of Haar-like features for rapid object detection," in Proc. IEEE Intl. Conf. Image Processing (ICIP'02), Sept. 2002.
[11] C. Garcia and G. Tziritas, "Face detection using quantized skin color regions merging and wavelet packet analysis," IEEE Trans. Multimedia, Vol. 1, No. 3, Sept.
[12] C.-C. Ho and J.-L. Wu, "A foveation-based rate shaping mechanism for MPEG video," in Proc. 3rd IEEE Pacific-Rim Conference on Multimedia (PCM'02), Springer-Verlag (LNCS 2532), Hsinchu, Taiwan, Dec. 2002.
[13] C.-C. Ho, "A Study of Effective Techniques for User-Centric Video Streaming," Ph.D. dissertation, National Taiwan University, Taipei, Taiwan, June.
