Face Detection Using DCT Coefficients in MPEG Video. Jun Wang, Mohan S Kankanhalli, Philippe Mulhem, Hadi Hassan Abdulredha


School of Computing, National University of Singapore
3 Science Drive 2, Singapore 117543
E-mail: {wangj, mohan, mulhem, hadi}@comp.nus.edu.sg

ABSTRACT

We present a neural network-based frontal face detection system which is implemented entirely in the compressed domain. The features used for this purpose are the DCT components of Y, Cr and Cb available from the compressed data of I-frames in MPEG videos. Since DCT coefficients capture frame information concisely, the use of DCT features reduces the complexity of the neural network used in the algorithm. In addition, it increases the computational efficiency. The data is used in two stages: in the first stage, a skin color filter based on the Cr and Cb DCT information is used to locate skin regions. In the second stage, a window of 4×4 blocks is used to scan the skin regions of the compressed-domain image and extract Y-DCT features. A neural network is then trained on these DCT features to classify patterns as faces or non-faces. The preliminary results obtained are encouraging enough to continue research in this direction.

1. INTRODUCTION

With recent advances in broadband networks, image/video compression standards (MPEG) and consumer electronics (including amateur digital video cameras), video data now ranges from simple home videos to movies to news video clips. The huge amount of video data generated every day makes it imperative to index the data in a way that enables fast content-based search and retrieval. This has resulted in active research into developing efficient video indexing technologies. A recent critical evaluation of image and video indexing techniques in the compressed domain can be found in [1]. Typically, such indexing techniques are based on features such as histograms, color, texture, etc. However, these low-level features do not allow for content-based semantic search and retrieval of the video data of interest.

In most videos, visual features resulting from recorded human activities are more likely to be of value for content description and indexing of the video sequences (news, movies, home video, etc.) than anything else, so human faces often constitute the most important content in video sequences. The ability to recognize faces and index them would therefore be a crucial feature of any indexing system. As a first step towards face identification, however, the faces in the frames of a video clip have to be detected. A number of face detection algorithms have been proposed in the pixel domain [11] to this end. Roughly, these algorithms can be classified into two groups:

1. A face pattern is considered as a set of facial features, such as eye corners, mouth and nose, with positions and sizes within an oval-shaped area. The presence of a face is concluded from the integration of several detection results [2][3][6]. The advantage of these component-based approaches is that the patterns of the components (eye corners, nose, oval, etc.) may vary less under pose, orientation and viewpoint changes than the pattern of the face as a whole [4][5]. However, it is hard to choose a suitable set of facial features and to model their geometrical configuration in such component-based approaches.

2. A face can also be considered as a single pattern, with features extracted from the entire face region. Methods following this philosophy range from the Gaussian mixture distribution model [7] and neural networks [7][8] to principal component analysis [9] and SVMs [10]. This second group of algorithms, which treat the face as a single unit, though more complex and slower, has proven to be more effective.
The algorithm presented in this work falls into this second group, with the difference that the detection of faces is done on the compressed video data itself. Compressed-domain semantic indexing techniques have been gaining ground of late because of the need for speed and the accumulation of huge amounts of image/video data in compressed form (JPEG, MPEG, H.261). Wang and Chang [12] combine chrominance, shape and DCT frequency information to achieve high-speed face detection without decoding the compressed video sequence. Luo and Eleftheriadis [13] perform face detection using Sung's [7] Gaussian mixture model in the compressed domain. Chua, Zhao and Kankanhalli [14] propose a face detection method that uses a gradient energy representation extracted directly from the compressed MPEG video data. To tackle the face recognition problem, [15][16] use DCT coefficients to build HMMs that operate entirely in the compressed domain. DCT coefficients are attractive as features for pattern recognition, since DCT-based compression reduces spatial redundancy and gives compact information about patterns. Meanwhile, it would be efficient if face detection could be implemented entirely in the compressed domain, without performing the inverse DCT followed by feature extraction for thousands of compressed videos.

This paper presents an algorithm for detecting faces in the compressed domain using DCT coefficients. In the first stage, using information from the chrominance components obtained from the compressed data, a skin-color-based filter finds the skin color regions in a frame. In the second stage, a neural network detects faces among the skin color regions found in the first stage, using the DCT luminance components of these regions. Many successful pixel-domain face detection methods can easily be adapted to work in the compressed domain by using DCT coefficients as features, as is done in the work presented here. Even a Gaussian distribution can be built on the DCT coefficients, since the DCT is an orthonormal transform and both the Euclidean and the Mahalanobis distance are unchanged by the transform [13].

This paper is organized as follows. Section 2 describes the luminance-DCT-based face detection scheme and the neural network used for this detection. In Section 3, skin color information obtained from the DCT chrominance components is added to make the detector more robust and faster. Finally, Section 4 presents the conclusion.

2. FACE DETECTION USING Y-COMPONENTS IN THE COMPRESSED DOMAIN

2.1 The DCT Transform

The MPEG compression standard uses the block-based discrete cosine transform (DCT). Basically, every I-frame is sampled using non-overlapping blocks of 8×8 pixels, which are transformed using the 2D DCT. The coefficients of each transformed block are quantized and then coded by a Huffman entropy encoder. In the rest of this paper, the compressed DCT domain refers to JPEG images and MPEG I-frames that have been partially decoded (i.e. entropy decoded and de-quantized), so that the DCT coefficients are available in 8×8 block structures. In the following, we discuss grayscale face detection using the luminance components (Y) in the compressed DCT domain. The use of the chrominance components is introduced in Section 3.

2.2 Face Detection Procedure

In the uncompressed or pixel domain, many successful face detection methods share the following common algorithm [13], which works on grayscale images. A fixed-size rectangular window (Sung [7] uses a masked 19×19-pixel window, Rowley [8] a 20×20-pixel window) is used to scan the whole image, extracting luminance features at each position. These features are used to build face models that encode the texture of the face pattern. At each position, the extracted pattern is compared with a previously trained face model (Sung [7] uses a Gaussian mixture distribution model, Rowley [8] neural networks, and Osuna [10] an SVM) to decide whether it belongs to a face or not. To detect faces at different scales, the image is repeatedly downscaled by a factor of 1.25 (as long as the downscaled image remains at least as large as the scanning window) and the features are extracted again. Since this method has proven to be highly successful in the pixel domain, it has been adapted here to the compressed domain.
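For pixel-domain material (such as the training images discussed in Section 2.3.2), the 8×8 block DCT coefficients that a partially decoded I-frame would provide can be computed directly. The following Python sketch illustrates this step; it is not the authors' C++ implementation, and the function names, the orthonormal scaling and the omission of codec-specific level shifts and DC scaling are simplifying assumptions.

import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix C, so that block_dct = C @ block @ C.T
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def block_dct_coefficients(y_plane, block=8):
    # Returns DCT coefficients arranged as [block_row, block_col, 8, 8];
    # codec-specific level shifts and scaling are omitted for simplicity.
    h, w = y_plane.shape
    h, w = h - h % block, w - w % block          # drop partial border blocks
    c = dct_matrix(block)
    tiles = y_plane[:h, :w].astype(np.float64)
    tiles = tiles.reshape(h // block, block, w // block, block).transpose(0, 2, 1, 3)
    return np.einsum('ij,abjk,lk->abil', c, tiles, c)   # C @ tile @ C.T for every tile

For actual MPEG input, these coefficients would instead be read from the entropy-decoded, de-quantized bitstream, which is exactly the data the method operates on.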
2.2.1 Feature Extraction

Figure 1. DCT coefficient extraction.

As shown in Figure 1, a window of 4×4 blocks is used to extract the DCT coefficient features of the Y components; these form a feature matrix. The bottom-left and bottom-right blocks (shown in dark) are ignored, since most of the texture information in these two blocks comes from the background. It is not necessary to extract all DCT coefficients from all sixteen blocks of the square window, since the DCT compacts the feature energy into the low-frequency components. The low-frequency DCT coefficients retain enough encoded information to make inter-class distinctions (i.e. to distinguish a face region from a non-face region), so the first ten low-frequency coefficients of the sixty-four coefficients in each block are chosen. Moreover, as far as learning is concerned, the goal of feature selection should be to select features that are less sensitive to intra-class differences (i.e. differences within faces) but significantly sensitive to inter-class variations. Since the DC values encode variations resulting from illumination and camera properties rather than inter-class differences, we choose to drop the fourteen DC values of each square window from the feature matrix. In all, the feature matrix has 126 elements for a 32×32-pixel window (4×4 blocks). In comparison with pixel-domain methods, where the number of features used for classification is large (a 20×20-pixel face pattern needs 400 features), the number of features used in the compressed domain is much smaller. This feature matrix (Figure 1) is simply converted to a 1D vector for classification by a neural network.
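Concretely, with 16 blocks per window, 2 background blocks ignored and the DC value of each of the remaining 14 blocks dropped, 14 × 9 = 126 low-frequency AC coefficients remain. The sketch below assembles this vector from the block coefficients produced by the earlier block_dct_coefficients sketch; the zig-zag ordering and the exact positions of the two ignored blocks are assumptions based on the description of Figure 1, not details given explicitly in the paper.

import numpy as np

def zigzag_indices(n=8):
    # Zig-zag scan order of an n x n block as (row, col) pairs, low frequencies first
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]))

def window_feature_vector(coeffs, top, left, n_lowfreq=10):
    # Builds the 126-dimensional vector for one 4x4-block (32x32-pixel) window,
    # where `coeffs` is the [block_row, block_col, 8, 8] array of DCT coefficients.
    zz = zigzag_indices(8)[1:n_lowfreq]          # skip DC, keep 9 low-frequency AC terms
    ignored = {(3, 0), (3, 3)}                   # bottom-left and bottom-right blocks (Figure 1)
    feats = []
    for r in range(4):
        for c in range(4):
            if (r, c) in ignored:
                continue
            block = coeffs[top + r, left + c]
            feats.extend(block[i, j] for (i, j) in zz)
    return np.array(feats)                       # 14 blocks x 9 coefficients = 126 features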

The DCT coefficients are obtained by performing the 2D DCT on 8×8-pixel blocks, so a square window of 4×4 blocks (the face model size) cannot scan the whole image in steps of one pixel, but only in steps of eight pixels. The pattern of a face region therefore cannot be captured faithfully, except by reverting to the pixel domain. Luo and Eleftheriadis [13] tackle this problem by including additional face patterns, arising from different block alignment positions, as positive training examples. However, this method introduces too many variations, not necessarily belonging to inter-class variations, into the positive training samples, which in turn makes the high-frequency DCT coefficients unreliable for both face model training and face detection. Fortunately, there exist fast algorithms to calculate reconstructed DCT coefficients for overlapping blocks [16][17][18]. These methods make it possible to calculate DCT coefficients for scan window steps of down to one pixel; however, a good tradeoff between speed and accuracy is obtained by reconstructing the coefficients for scan window steps of two pixels.

In order to detect faces of different sizes, compressed-domain image downscaling is used before the DCT coefficients are extracted. Several fast downscaling algorithms that operate directly in the compressed domain exist [17][18][19]. We apply the algorithm of [19], which can downsample images and video by a fractional factor of 1.25 in the DCT domain. This algorithm permits us to derive a wide range of scaling factors by cascading several scaling steps, such as 1.25, 1.56, 1.95, ..., 3.05, 3.81.

2.2.2 Normalization

DCT coefficients at different locations of a block have different orders of magnitude (for example, the DC value ranges from -1024 to +1024 in an MPEG-2 I-frame). Therefore, we need to estimate the upper and lower bounds of the DCT coefficients and use them to map the coefficients into [-1, 1]. This helps prevent features with large values from dominating the detection process.

Suppose x_1, x_2, ..., x_n are the DCT coefficients retained from the feature extraction stage and x_1^(j), x_2^(j), ..., x_n^(j), j = 1, 2, ..., p, are the corresponding DCT coefficients retained from the training examples, where n is the number of DCT coefficient features retained (currently, 126 DCT coefficient features extracted from a 4×4-block square window) and p is the number of training samples. The upper bounds U_i and lower bounds L_i can be estimated by

  U_i = α · max{1, x_i^(1), ..., x_i^(p)},  i = 1, 2, ..., n,   (1)

  L_i = α · min{-1, x_i^(1), ..., x_i^(p)},  i = 1, 2, ..., n,   (2)

where α ≥ 1 is a factor to extend the bounds (here we set α = 1). Then the normalized vectors z_1^(j), z_2^(j), ..., z_n^(j), j = 1, 2, ..., p, are determined by

  z_i^(j) = 2 (x_i^(j) - L_i) / (U_i - L_i) - 1,  i = 1, 2, ..., n.   (3)
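A minimal sketch of equations (1)-(3) follows, assuming the training feature vectors are stacked in a p × n matrix (n = 126 here); the function names are illustrative.

import numpy as np

def fit_bounds(train_feats, alpha=1.0):
    # Eqs. (1)-(2): per-feature upper and lower bounds from the p x n training matrix
    upper = alpha * np.maximum(1.0, train_feats.max(axis=0))
    lower = alpha * np.minimum(-1.0, train_feats.min(axis=0))
    return lower, upper

def normalize(feats, lower, upper):
    # Eq. (3): map each feature into [-1, 1] using the estimated bounds
    return 2.0 * (feats - lower) / (upper - lower) - 1.0

At detection time, the same bounds fitted on the training set are applied to every 126-dimensional window vector before it is fed to the network.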

2.3 Neural Network-based Classifier

2.3.1 Structure

Figure 2. System for neural network-based face detection in the compressed domain.

We use a neural network as the classifier to separate patterns into faces and non-faces. For a given MPEG video, each I-frame is scanned in steps of two pixels by a sliding window of 4×4 blocks. From this sliding window, the DCT coefficient features are extracted using the scheme of Section 2.2.1. After normalization, these features are fed into a neural network (126 input units, 20 hidden units and 1 output unit) that classifies whether the pattern inside the sliding window is a face pattern or not. The structure is shown in Figure 2. To detect faces larger than this sliding window, the window is passed successively over images downscaled by a factor of 1.25 as in [19].

2.3.2 Training Data Preparation

Figure 3. Training data preparation.

We use frontal-view faces to create the positive training samples. In order to improve the robustness of the face detection process, some of the positive training samples are taken at closer distances. As the initial training sets, we have collected 1088 face samples and 2000 non-face samples. To increase the number of training samples, we follow Sung's [7] method of synthesizing positive samples by slightly rotating and mirroring the images. Since training a neural network requires many negative training samples, we expand the non-face training samples by applying the bootstrap algorithm during training. Since the training sets available to us were pixel-domain images, they first had to be converted to compressed-domain coefficient features. In order to obtain the same DCT coefficient accuracy as in MPEG, we convert the raw data into DCT coefficients by DCT transformation, quantization and de-quantization (shown in Figure 3), using the default quantization table.

2.3.3 Neural Network Training

A back-propagation weight tuning method is used to train the three-layer fully connected neural network (126 input units, 20 hidden units and 1 output unit). During training, non-face samples are selected from the false-alarm samples by applying the bootstrap algorithm. In order to prevent the neural network from concentrating too much on certain features, the order of the face and non-face training samples is randomly arranged. To prevent over-training, a test set is used during training to measure the generalizability of the currently trained neural network.
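For illustration, a minimal Python sketch of the 126-20-1 classifier described above is given below; the original system was written in C++, and the sigmoid activations, the 0.5 decision threshold and the random initialization shown here are assumptions rather than details reported in the paper.

import numpy as np

class FaceMLP:
    # 126-20-1 fully connected classifier; the weights would come from back-propagation
    # training with bootstrapped non-face samples, as described above.

    def __init__(self, n_in=126, n_hidden=20, rng=None):
        rng = rng or np.random.default_rng(0)
        # Small random initial weights; the back-propagation training itself is not shown.
        self.w1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0.0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)

    @staticmethod
    def _sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def score(self, z):
        # Face score in (0, 1) for one normalized 126-dimensional feature vector
        h = self._sigmoid(z @ self.w1 + self.b1)
        return float(self._sigmoid(h @ self.w2 + self.b2))

    def is_face(self, z, threshold=0.5):
        return self.score(z) >= threshold

During training, windows that the current network falsely accepts would be added to the non-face set (the bootstrap step described above) and training continued.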
2.4 Experiments

The above algorithm has been implemented in C++. In order to evaluate the performance of our algorithm, we have used some gray images from the CMU face detection test sets (at http://www.cs.cmu.edu/~har/faces.html). This test database consists of 3 subsets of gray images with a total of 130 images and 507 faces. Since not all of the gray images are suitable in our case (explained below), the experiments are performed on 43 of them. To simulate the compressed-domain situation in MPEG, we convert these grayscale images in the same way as the training data (Figure 3) before using them. This is the first time the CMU database has been used to evaluate face detection algorithms in the compressed domain ([13] also used the CMU database, but did not give quantitative performance figures for it).

Before comparing with the performance of the pixel-domain approaches, the following factors specific to the compressed domain should be taken into consideration. First, the conversion from the gray-image raw data to compressed-domain DCT coefficients causes some information loss as well as errors. Second, the DCT block recalculation and the I-frame downsampling induce errors, which become more noticeable as the scaling factors become larger. Third, the pre-processing algorithms are different: in the pixel domain, non-uniform lighting conditions are compensated by linear fitting [8] or histogram equalization, while in the compressed domain we simply remove the features (the DC values) whose variations mainly come from the lighting conditions. Fourth, the scale ratio in the multi-scale search is different: most approaches in the pixel domain use a factor between 1.1 and 1.2, while we use 1.25 in the compressed domain. Fifth, in the pixel domain a sliding window is shifted pixel by pixel over each image, while we shift the sliding window in steps of two pixels for the sake of efficiency. Finally, the size of the sliding window is different: we use a 32×32-pixel square window, while smaller square windows are used in the pixel domain (19×19 in [7] and 20×20 in [8]). Since in most cases the resolution of the I-frames in compressed video is lower than that of still images, finding faces bigger than 32×32 pixels is enough for most compressed-video applications. Because of this, we do not consider faces smaller than 32×32 and only count faces bigger than 32×32 when using the CMU test sets. For comparison, the neural network-based face detection system proposed by Rowley [8] achieves a detection rate ranging from 76.5% to 92.5% on the CMU database, depending on the heuristics used and the arbitration among neural networks.
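To make these settings concrete (a 32×32-pixel window of 4×4 blocks, two-pixel steps and a 1.25 scale ratio), the following sketch shows the multi-scale scan loop, reusing the helpers from the earlier sketches. For simplicity it downscales and recomputes the block DCTs in the pixel domain, whereas the paper performs both the fractional downscaling [19] and the coefficient recomputation for shifted windows [16][17][18] directly in the DCT domain.

import numpy as np

def downscale(img, factor=1.25):
    # Crude nearest-neighbour downscaling; the paper does this step in the DCT domain [19]
    rows = (np.arange(int(img.shape[0] / factor)) * factor).astype(int)
    cols = (np.arange(int(img.shape[1] / factor)) * factor).astype(int)
    return img[np.ix_(rows, cols)]

def detect_faces(y_plane, net, lower, upper, step=2, scale=1.25, win=32):
    # Multi-scale scan: classify every 32x32 window on an image pyramid with ratio 1.25
    hits, factor, img = [], 1.0, y_plane.astype(np.float64)
    while min(img.shape) >= win:
        for r in range(0, img.shape[0] - win + 1, step):
            for c in range(0, img.shape[1] - win + 1, step):
                coeffs = block_dct_coefficients(img[r:r + win, c:c + win])
                z = normalize(window_feature_vector(coeffs, 0, 0), lower, upper)
                if net.is_face(z):
                    hits.append((int(r * factor), int(c * factor), factor))
        img, factor = downscale(img, scale), factor * scale
    return hits    # (row, col, scale) of detections in original-image coordinates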

Since some of the faces in the CMU gray-image test sets are smaller than 32×32 pixels, we picked from the CMU test sets those gray images (43 images) in which all faces are bigger than 32×32 to test our face detection system. The detection rate for frontal-view faces is almost 70% on these 43 gray images from the CMU database, as listed in Table 1 (the final output is a combination of the different scale levels).

Total faces        176
Detected faces     122
Missed faces        54
False alarms        61
Detection rate    69.3%

Table 1. Face detection results on the 43 gray images from the CMU database.

Figure 5 shows some test results on the CMU database (to show the raw detection ability, the results shown there are obtained directly from the neural network, without overlap removal and multi-level combination). These preliminary results show that our algorithm is reliable, given the conditions of the compressed domain. The way to reduce the false alarms and improve the face detection rate of our system is to arbitrate among multiple networks [8] and to increase the initial face training samples. Since we are doing this face detection for compressed video, the results obtained here are encouraging enough for us to continue research in this direction.

Figure 4. Face detection experiments.

3. COLOR-BASED FILTER COMBINING

3.1 Candidate Region Selection

Each MPEG I-frame is divided into 16×16-pixel macroblocks, and each macroblock is composed of four 8×8-pixel luminance blocks and two 8×8-pixel chrominance blocks (we assume the color format of the MPEG I-frame is 4:2:0). In the previous section, we discussed a face detection method based on the luminance components. However, it is not necessary to search the entire frame area. A skin color filter can be used to locate the skin color regions in each frame, from which the luminance components can be picked up for classification by the neural network. This makes the face detector faster and more robust.

3.2 Skin Color Model

In the 4:2:0 format, each macroblock has one Cb block and one Cr block. The DC values of the Cb and Cr blocks represent the average Cb and Cr values of the macroblock, respectively, so the DC values of each Cb and Cr block are used to represent the average chrominance of the corresponding macroblock in a frame. In intra-coded MPEG I-frames, the DC values of the chrominance blocks are directly available in the compressed domain. In inter-coded P- and B-frames, the DC values of inter-coded blocks can be simply reconstructed by using the DC values of the reference frame(s) [12].

In the normalized color space obtained by dividing the red and green components by the intensity, the skin-color distribution forms a tight cluster under constant lighting conditions, irrespective of race [21]. An approach similar to [19] is taken to build the skin color model, using a bivariate Gaussian distribution model N(µ, Σ) over the two normalized variables r and g, which form the vector x in the conditional probability (which gives the likelihood of x belonging to N):

  p(x | N) = (2π)^(-1) |Σ|^(-1/2) exp{-d(x)/2},   (4)

where d(x) is the Mahalanobis distance given by

  d(x) = (x - µ)^T Σ^(-1) (x - µ).   (5)

The larger the distance d(x), the lower the probability that the block belongs to the skin color class. To decide which blocks are skin color blocks, we set a threshold H: a block whose d(x) is smaller than H belongs to the skin color regions of the frame. It is difficult to find a universally optimal threshold H for the various different video sources, since skin colors vary with lighting conditions and environments. However, given that we use the skin color information only to find candidate regions for the subsequent face detection step, we simply choose one loose threshold H.
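A minimal sketch of this macroblock-level skin filter follows, assuming µ and Σ have already been estimated from labeled skin samples in the normalized (r, g) space. The conversion from the macroblock's mean Y, Cb, Cr values (recovered from the block DC coefficients) to normalized r and g uses the standard BT.601 YCbCr-to-RGB relation, which the paper does not spell out, so it should be read as an assumption.

import numpy as np

def macroblock_rg(y_avg, cb_avg, cr_avg):
    # Average (r, g) chromaticity of a macroblock from its mean Y, Cb, Cr values
    # (recovered from the block DC coefficients after the codec's DC scaling).
    # The BT.601 full-range YCbCr-to-RGB relation used here is an assumption.
    red = y_avg + 1.402 * (cr_avg - 128.0)
    green = y_avg - 0.344136 * (cb_avg - 128.0) - 0.714136 * (cr_avg - 128.0)
    blue = y_avg + 1.772 * (cb_avg - 128.0)
    s = max(red + green + blue, 1e-6)
    return np.array([red / s, green / s])

def mahalanobis_sq(x, mu, sigma_inv):
    # Eq. (5): squared Mahalanobis distance of x from the skin model N(mu, Sigma)
    d = x - mu
    return float(d @ sigma_inv @ d)

def is_skin_macroblock(y_avg, cb_avg, cr_avg, mu, sigma_inv, threshold_h):
    # A macroblock is a skin-color candidate if d(x) is below the (loose) threshold H
    x = macroblock_rg(y_avg, cb_avg, cr_avg)
    return mahalanobis_sq(x, mu, sigma_inv) < threshold_h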
3.3 Experiments

We use the combined skin color filter and neural network face detector in our content-based video indexing application. In order to assess the face pattern extraction ability of this face detector, we extract the face patterns in each MPEG I-frame and measure the overlaps between every pair of neighboring I-frames to create face pattern timelines for the individuals in a video, as shown in Figure 4. The experimental results show that our face detection system is reliable for video, except at places along the timeline where the face is non-frontal (this problem can be overcome in the future by incorporating a face tracking mechanism). The results also show that, by adding the skin color filter, the system avoids most of the false alarms and speeds up face detection as well. Note that we cannot present quantitative results on the standard CMU face database for this part, because it contains only grayscale images. We are currently building our own color video test sets to conduct further experiments on face detection in video.

4. CONCLUSION AND FUTURE WORK

In this paper, an algorithm for detecting faces in the compressed domain using DCT coefficients has been presented, together with a DCT coefficient extraction scheme. Using this scheme, it is possible to adapt many successful pixel-domain face detection algorithms to operate directly in the compressed domain. Future work will include experiments with other methods (Gaussian mixture distribution models, PCA, etc.) in order to find the best classification method for use with DCT coefficient features. The use of other information available in MPEG video (such as motion vector information) and of domain knowledge (video attributes) is also planned. In order to obtain an empirical evaluation of face detection methods for video, we plan to build a standard color video database for the evaluation of face detection approaches in video.

5. ACKNOWLEDGEMENTS

We are grateful to Yunlong Zhao for providing some source code and lots of advice. We would also like to thank CMU (Henry A. Rowley, Shumeet Baluja, and Takeo Kanade) and the AI Lab, MIT (K. Sung and T. Poggio) for providing the test databases.

6. REFERENCES

[1] M. K. Mandal, F. Idris, and S. Panchanathan, "A critical evaluation of image and video indexing techniques in the compressed domain," Image and Vision Computing, vol. 17, pp. 513-529, 1999.
[2] E. Saber and A. M. Tekalp, "Frontal-view face detection and facial feature extraction using color, shape and symmetry based cost functions," Pattern Recognition Letters, vol. 19, pp. 669-680, 1998.
[3] J. Wang and T. Tan, "A new face detection method based on shape information," Pattern Recognition Letters, vol. 21, pp. 463-471, 2000.
[4] K. C. Yow and R. Cipolla, "Detection of human faces under scale, orientation and viewpoint variations," Proc. Int'l Conf. on Automatic Face and Gesture Recognition, pp. 295-300, 1996.
[5] D. J. Beymer, "Face recognition under varying pose," A.I. Memo 1461, Center for Biological and Computational Learning, M.I.T., Cambridge, MA, 1993.
[6] R. Brunelli and T. Poggio, "Face recognition: features versus templates," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 1042-1052, 1993.
[7] K. K. Sung and T. Poggio, "Example-based learning for view-based human face detection," Tech. Rep. 1532, M.I.T. Artificial Intelligence Laboratory and Center for Biological and Computational Learning, 1994.
[8] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[9] B. Moghaddam and A. Pentland, "Probabilistic visual learning for object detection," Proc. Fifth Int'l Conf. on Computer Vision, June 1995.
[10] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: an application to face detection," Proc. Computer Vision and Pattern Recognition, 1997.
[11] M.-H. Yang, N. Ahuja, and D. Kriegman, "Detecting faces in images: a survey," to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2001.
[12] H. Wang and S. F. Chang, "A highly efficient system for automatic face region detection in MPEG video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 4, 1997.
[13] H. Luo and A. Eleftheriadis, "On face detection in the compressed domain," Proc. ACM Multimedia 2000, Oct. 2000.
[14] T.-S. Chua, Y. Zhao, and M. S. Kankanhalli, "An automated compressed-domain face detection method for video stratification," Proceedings of the MMM-2000 International Conference on Multimedia Modelling, Nagano, Japan, pp. 333-347, 2000.
[15] B. Heisele, T. Poggio, and M. Pontil, "Face detection in still gray images," A.I. Memo No. 1687, M.I.T., 2000.
[16] W. Kou and T. Fjallbrant, "A direct computation of DCT coefficients for a signal block taken from two adjacent blocks," IEEE Transactions on Signal Processing, vol. 39, no. 7, pp. 1423-1435, 1997.
[17] N. Merhav and V. Bhaskaran, "Fast algorithms for DCT-domain image down-sampling and for inverse motion compensation," IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 3, pp. 468-476, June 1997.
[18] S. F. Chang and D. G. Messerschmitt, "Manipulation and compositing of MC-DCT compressed video," IEEE Journal on Selected Areas in Communications, vol. 13, no. 1, pp. 1-11, Jan. 1995.
[19] Y. Zhao, M. S. Kankanhalli, and T.-S. Chua, "A compressed-domain fractional scaling technique for image and video," Technical Report, School of Computing, National University of Singapore, 2000.
[20] Z. Pan, R. Adams, and H. Bolouri, "Image redundancy reduction for neural network classification using discrete cosine transforms," Proc. of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. III, pp. 149-154, Como, Italy, 2000.
[21] J. Yang, W. Lu, and A. Waibel, "Skin-color modeling and adaptation," Technical Report CMU-CS-97-146, School of Computer Science, Carnegie Mellon University, May 1997.

Figure 5. Some test results on the CMU database. To show the raw detection ability, the results shown here are obtained directly from the neural network, without overlap removal and multi-level combination.