Efficient Content Representation in MPEG Video Databases


Yannis S. Avrithis, Nikolaos D. Doulamis, Anastasios D. Doulamis and Stefanos D. Kollias
Department of Electrical and Computer Engineering, National Technical University of Athens
Heroon Polytechniou 9, 157 73 Zografou, Greece
E-mail: iavr@image.ntua.gr

Abstract

In this paper, an efficient video content representation system is presented which permits automatic extraction of a limited number of characteristic frames or scenes that provide sufficient information about the content of an MPEG video sequence. This can be used to reduce the amount of stored information that is necessary in order to provide search capabilities in a multimedia database, resulting in faster and more efficient video queries. Moreover, the proposed system can be used for automatic generation of low-resolution video clip previews (trailers), giving the ability to browse databases on web pages. Finally, direct content-based retrieval with image queries is possible using the feature vector representation incorporated in our system.

1. Introduction

Multimedia systems have recently led to the development of a series of applications which make use of different kinds of information, such as text, voice, sounds, graphics, animation, images and video. The resulting multimedia databases require new technologies and tools for their organization and management, especially in the case of digital video databases, mainly due to the size of the information involved. For this reason, a new standardization phase is currently in progress by the MPEG group in order to develop algorithms for audiovisual coding (MPEG-4 [9]) and content-based video storage, retrieval and indexing, based on object extraction (MPEG-7 [10]). Several prototype systems, such as Virage, WebSEEK and QBIC [3], have already been developed and are now in the first stage of commercial exploitation, providing indexing capabilities. However, most of them are restricted to still images and use simple features, e.g. the color histogram, to perform queries.
Moreover, these systems cannot be easily extended to video databases, since it is practically impossible to perform queries on every video frame. Several approaches have been proposed in the recent literature for content-based video indexing which mainly deal with scene cut detection [1], video object tracking [4], as well as with single frame extraction [3,5] or image retrieval based on hidden Markov models [7].

In the context of this paper, another approach has been adopted for representing video content. In particular, we present a system that, apart from providing visual search capabilities, also permits automatic extraction of a limited number of characteristic frames or scenes which provide sufficient information about the content of a video sequence. The scene/frame selection mechanism is based on a transformation of the image to a feature domain, which is more suitable for image comparisons, queries and content-based retrieval. A similar approach has been proposed in one of our earlier works [2]. However, better performance is now achieved by integrating object-tracking functionality in the color segmentation algorithm, thus resulting in smoother trajectories of the feature vectors and in a selection mechanism which is less susceptible to noise. Moreover, the most representative scenes are selected in an optimal way, and all the calculations involved are directly applied to MPEG coded video sequences, resulting in a robust and fast implementation which is not very far from real-time.

The proposed system can be very useful for multimedia database management, because it provides a means of reducing the amount of information which needs to be stored in order to provide search capabilities. Instead of performing a query on all available video frames, one can consider only the selected ones, because they include most information about the content of the database. Multimedia interactive services are another interesting application of our system.
It is possible to automatically generate low-resolution video clip previews (trailers) or still image mosaics, which play exactly the same role for video sequences as thumbnails do for still images. These previews can be used to improve the user interface of digital video databases or to browse databases on web pages.

2. System Overview

The proposed system consists of several modules using color and motion information. The partitioning of the input video stream into scenes is performed in the first stage of the proposed system, using a scene cut detection technique. This is achieved by computing the sum of the block motion estimation error over each frame and

detecting frames for which this sum exceeds a certain threshold. Since we use MPEG coded video streams, the block motion estimation error is available and the resulting implementation is very fast. A multidimensional feature vector is then generated for each frame, containing global frame characteristics, such as color histogram and texture complexity, as well as object characteristics. Unsupervised color and motion segmentation is performed, and information such as the number of segments, their location, size, mean color and motion vectors is used in the construction of the feature vector. Object tracking is supported by taking into account motion compensated segmentation results of previous frames. The feature vector is formed as a multidimensional histogram using fuzzy classification of object properties into predefined categories. Based on the feature vectors of all frames within each scene, a scene feature vector is computed which characterizes the respective scene. The scene vector is then fed as input to a clustering mechanism, which optimally extracts the most representative scenes by minimizing a distortion criterion. Finally, the most characteristic frames of the selected scenes are extracted, based on the temporal fluctuation of the frame vectors.

3. Hierarchical block-based color segmentation

A hierarchical block-based color segmentation scheme is adopted for providing color information. Images consisting of uniform areas will be characterized by large segments, while images containing many small objects or noise, by small and distributed segments.

Figure 2. Color segmentation: (a) original frame, (b) 1st iteration of segmentation, (c) 3rd iteration, and (d) final result (6th step).

Figure 1. System architecture (block diagram: original video sequence, scene cut detection, global frame analysis, color segmentation with motion compensation feedback, motion analysis, motion segmentation, feature vector formulation by fuzzy classification, frame/scene selection).

The overall system architecture is depicted in Figure 1.
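The scene cut stage described here reduces to thresholding a per-frame error sum and splitting the frame range at the resulting cut points. A minimal sketch, assuming the per-frame error sums are already available; the values, threshold and function names below are illustrative, not from the paper:

```python
def detect_scene_cuts(frame_errors, threshold):
    """Indices of frames whose summed block motion estimation error
    exceeds the threshold, i.e. candidate scene cut locations."""
    return [i for i, e in enumerate(frame_errors) if e > threshold]

def split_into_scenes(num_frames, cuts):
    """Partition frame indices [0, num_frames) into scenes at the cuts."""
    bounds = [0] + list(cuts) + [num_frames]
    return [range(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]

# Toy example: a spike in the error sum at frame 3 marks a scene change.
errors = [10.0, 12.0, 9.0, 250.0, 11.0, 13.0]
cuts = detect_scene_cuts(errors, threshold=100.0)
scenes = split_into_scenes(len(errors), cuts)
```

In an MPEG stream the error sums come almost for free from the encoder's block motion estimation, which is what makes this stage so fast.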
Note that the Motion Analysis step in this figure is actually not implemented in our system, as the motion vectors of the MPEG sequence are directly used for motion segmentation and compensation.

3.1 Segmentation

The most important task involved in video representation is the extraction of features such as motion, luminosity, color, shape and texture. For this purpose, a feature vector for each frame is calculated, containing global frame characteristics, such as color histogram and texture complexity, obtained through global frame analysis, as well as object characteristics, obtained through color and motion segmentation. Segmentation is studied next.

Figure 3. Tracking capabilities: (a) two successive frames, (b) segmentation results without tracking, and (c) with tracking.

Block resolution is used in order to reduce computational time and to exploit information available in MPEG sequences (the DC coefficient of the block DCT transform). Since oversegmentation usually results in noisy feature vectors, hierarchical merging of similar segments is proposed, which takes into account segment sizes apart from spatial homogeneity, in order to eliminate small segments while preserving large ones. The aforementioned algorithm, which is fully described in one of our earlier works [2], is enhanced by introducing video object tracking capabilities. This is achieved by taking into account motion compensated segmentation results of previous frames. In particular, the decision for merging two adjacent regions is modified by adding to the threshold function a positive or a negative constant, depending on whether the two regions belong to the same segment in the previous frame or not. Thus, connected regions are encouraged to remain connected in successive frames. However, in order to take
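The tracking-biased merging decision can be sketched as follows. The paper only states that a constant is added to or subtracted from the threshold function depending on previous-frame membership, so the distance measure, threshold and bias value here are illustrative assumptions:

```python
def should_merge(color_dist, base_threshold, same_segment_before, bias=0.2):
    """Decide whether to merge two adjacent regions. The merging threshold
    is raised (encouraging the merge) when the regions belonged to the same
    segment in the motion-compensated previous frame, and lowered otherwise,
    so connected regions tend to stay connected across frames."""
    threshold = base_threshold + (bias if same_segment_before else -bias)
    return color_dist < threshold
```

With a borderline color distance, the previous-frame evidence tips the decision either way, which is the temporal-consistency effect illustrated in Figure 3.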

account of motion in the picture, the color segmentation results are first motion compensated.

Figure 2 depicts the color segmentation results of a frame extracted from a TV news program. It can be clearly seen that in each iteration, more and more segments are merged together, and that the smallest segments are only merged in the last iteration. Our segmentation algorithm has similar performance to the Recursive Shortest Spanning Tree (RSST) technique [8], but is much faster. Moreover, the noise effects which appear due to the random order of the merging procedure [2] are now eliminated with the use of object tracking. This is illustrated in Figure 3, where segmentation results are shown for two successive frames, with and without tracking. A stationary background gives different segmentation results without tracking, whereas with tracking only the moving parts of the image result in different segments.

3.2 Block-based motion segmentation

A similar approach is carried out for the case of motion segmentation. The proposed procedure is still at block resolution, for exploiting properties of the MPEG bit stream. However, the motion vectors, which can either be computed with a motion analysis algorithm, as shown in Figure 1, or taken directly from the MPEG stream (as is done in our experiments), usually appear noisy due to luminosity fluctuations. To achieve smoothness of motion vectors within a moving area, a median filter is used for eliminating noise while preserving edges between regions of different motion.

Figure 4. Motion segmentation without, and with, filtering.

After the appropriate motion vectors are extracted, we use a technique similar to the previous one to derive the motion segments, except for the fact that no tracking of objects takes place here (since tracking is based on the motion vectors). Figure 4 illustrates the motion segmentation results of a frame extracted from a TV news program. It is clear that without filtering the motion vectors, wrong segmentation results are produced, even in a uniform and almost stationary background (Figure 4).
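The median smoothing of the motion vector field can be sketched componentwise over a square neighborhood; the grid representation, neighborhood radius and border handling are illustrative choices that the paper does not specify:

```python
def median(values):
    """Median of a list (upper median for even counts)."""
    s = sorted(values)
    return s[len(s) // 2]

def median_filter_field(mv, radius=1):
    """Componentwise median filter over a 2-D grid of (vx, vy) motion
    vectors: suppresses isolated noisy vectors while preserving edges
    between regions of different motion."""
    h, w = len(mv), len(mv[0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            neigh = [mv[j][i]
                     for j in range(max(0, y - radius), min(h, y + radius + 1))
                     for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = (median([v[0] for v in neigh]),
                         median([v[1] for v in neigh]))
    return out
```

Unlike an averaging filter, the median removes a single outlying vector inside a uniformly moving area without blurring the boundary between two differently moving regions.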
On the contrary, only the actually moving objects are extracted in the case of filtered motion vectors (Figure 4(c)).

4. Fuzzy Classification

All of the above features are gathered in order to form a multidimensional feature vector for each frame. Properties of color or motion segments cannot be used directly as elements of the feature vector, since its size would then differ between frames. We therefore classify color as well as motion segments into predetermined categories, forming a multidimensional histogram. In order to eliminate the possibility of classifying two similar segments into different categories, a degree of membership is allocated to each category, resulting in a fuzzy classification. The feature vector is then constructed by calculating the sum, over all segments, of the corresponding degrees of membership, and gathering these sums into the appropriate categories:

F(\mathbf{n}) = \sum_{i=1}^{K} \prod_{j=1}^{L} \mu_{n_j}(f_j^{(i)})    (1)

where K is the total number of color or motion segments, L is the number of features (e.g., size, color or location) of the segments that are taken into account for classification, \mathbf{n} = [n_1 \cdots n_L]^T specifies the category into which a segment is classified, and n_j \in \{1, 2, \ldots, Q\} is an index for each feature, where Q is the number of regions into which each feature space is partitioned. The j-th feature, f_j^{(i)}, of the i-th segment, S_i, is the j-th element of the vector [P(S_i)\; \mathbf{c}(S_i)^T\; \mathbf{l}(S_i)^T]^T for color segmentation, or [P(S_i)\; \mathbf{v}(S_i)^T\; \mathbf{l}(S_i)^T]^T for motion segmentation, where P, \mathbf{c}, \mathbf{v} and \mathbf{l} denote the size, color, motion vector and location of each segment, respectively. Finally, \mu_{n_j}(f) is the degree of membership of feature f in partition n_j. Triangular membership functions \mu_{n_j} are used, with 50% overlap between partitions. The feature vector is formed by gathering the values F(\mathbf{n}) for all combinations of n_1, \ldots, n_L, for both color and motion segments. Global frame characteristics are also included in the feature vector.
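The classification rule above amounts to a soft multidimensional histogram. A sketch with triangular membership functions at 50% overlap and per-segment accumulation; the normalization of each feature to [0, 1] and the bin layout are illustrative assumptions:

```python
from itertools import product

def triangular_memberships(f, Q):
    """Degrees of membership of a feature value f, normalized to [0, 1],
    in Q triangular partitions with 50% overlap; they always sum to 1."""
    step = 1.0 / (Q - 1)
    return [max(0.0, 1.0 - abs(f - k * step) / step) for k in range(Q)]

def fuzzy_histogram(segments, Q):
    """Accumulate, over all segments, the product of membership degrees of
    each feature into a soft multidimensional histogram F(n), where n ranges
    over all category combinations. Each segment is a tuple of L normalized
    feature values."""
    L = len(segments[0])
    F = {}
    for n in product(range(Q), repeat=L):
        total = 0.0
        for feats in segments:
            contrib = 1.0
            for j, nj in enumerate(n):
                contrib *= triangular_memberships(feats[j], Q)[nj]
            total += contrib
        if total > 0.0:
            F[n] = total
    return F
```

Because the memberships at any point sum to one, the histogram's total mass equals the number of segments, so a segment near a bin boundary contributes partially to both bins instead of flipping between them.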
In particular, the color histogram of each frame is calculated using YUV coordinates for color description, and the average texture complexity is estimated using the high frequency DCT coefficients of each block, derived from the MPEG stream.

5. Scene and Frame Selection

The trajectory of the feature vector for all frames within a scene indicates the way in which the frame properties fluctuate during a scene period. Consequently, a vector that characterizes a whole scene is constructed by calculating the mean value of the feature vectors over the whole duration of the scene. The scene feature vectors are used for the selection of the most representative scenes, as described below.

5.1 Scene selection

The extraction of a small but sufficient number of scenes that satisfactorily represent the video content is accomplished by clustering similar scene feature vectors and selecting only a limited number of cluster representatives. For example, in TV news recordings, consecutive scenes of the same person would reduce to just one. Let \mathbf{s}_i \in R^M, i = 1, 2, \ldots, N_S, be the scene feature vector for the i-th scene, where N_S is the total number of scenes. Then S = \{\mathbf{s}_i,\; i = 1, 2, \ldots, N_S\} is the set of all scene feature vectors. Let also K_S be the number of scenes to be selected, and \mathbf{c}_i, i = 1, 2, \ldots, K_S, the feature vectors which best represent those scenes. For each \mathbf{c}_i, an influence set is formed which contains all scene feature vectors \mathbf{s} \in S which are closest to \mathbf{c}_i:

Z_i = \{\mathbf{s} \in S : d(\mathbf{s}, \mathbf{c}_i) < d(\mathbf{s}, \mathbf{c}_j)\; \forall j \neq i\}    (2)

where d(\cdot,\cdot) denotes the distance between two vectors; a common choice for d is the Euclidean norm. In effect, the set of all Z_i defines a partition of S into clusters of similar scenes, which are represented by the feature vectors \mathbf{c}_i. Then the average distortion, defined as

D(\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_{K_S}) = \sum_{i=1}^{K_S} \sum_{\mathbf{s} \in Z_i} d(\mathbf{s}, \mathbf{c}_i)    (3)

is a performance measure of the representation of the scene feature vectors by the cluster centers \mathbf{c}_i. The optimal vectors \hat{\mathbf{c}}_i are thus calculated by minimizing D:

(\hat{\mathbf{c}}_1, \hat{\mathbf{c}}_2, \ldots, \hat{\mathbf{c}}_{K_S}) = \arg\min_{\mathbf{c}_1, \ldots, \mathbf{c}_{K_S} \in R^M} D(\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_{K_S})    (4)

Direct minimization of the previous equation is a tedious task, since the unknown parameters are involved both in the distances d and in the influence zones. For this reason, minimization is performed in an iterative way using the generalized Lloyd or K-means algorithm. Starting from arbitrary initial values \mathbf{c}_i(0), i = 1, 2, \ldots, K_S, the new centers are calculated through the following equations for n \geq 0:

Z_i(n) = \{\mathbf{s} \in S : d(\mathbf{s}, \mathbf{c}_i(n)) < d(\mathbf{s}, \mathbf{c}_j(n))\; \forall j \neq i\}    (5)

\mathbf{c}_i(n+1) = \mathrm{cent}(Z_i(n))    (6)

where \mathbf{c}_i(n) denotes the i-th center at the n-th iteration, and Z_i(n) its influence set. The center of Z_i(n) is estimated by the function

\mathrm{cent}(Z_i(n)) = \frac{1}{|Z_i(n)|} \sum_{\mathbf{s} \in Z_i(n)} \mathbf{s}    (7)

where |Z_i(n)| is the cardinality of Z_i(n).
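The iteration in equations (5)-(7), followed by the selection of the scenes nearest to the converged centers, can be sketched directly; the Euclidean distance, fixed iteration count and toy data are illustrative choices:

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centers, iters=10):
    """Generalized Lloyd / K-means iteration: form the influence sets by
    nearest-center assignment, then move each center to the mean of its
    influence set; an empty set keeps its previous center."""
    for _ in range(iters):
        sets = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
            sets[i].append(p)
        centers = [tuple(sum(c) / len(Z) for c in zip(*Z)) if Z else c0
                   for Z, c0 in zip(sets, centers)]
    return centers

def representatives(points, centers):
    """Pick, for each center, the actual scene vector closest to it."""
    return [min(points, key=lambda p: dist2(p, c)) for c in centers]

# Two obvious clusters of scene feature vectors in a 2-D feature space.
pts = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
centers = lloyd(pts, [(0.2, 0.2), (0.8, 0.8)])
```

Note that the representatives are drawn from the actual scene vectors, not the centers themselves, so the selected "scenes" are always real scenes of the sequence.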
The algorithm converges to the solution (\hat{\mathbf{c}}_1, \hat{\mathbf{c}}_2, \ldots, \hat{\mathbf{c}}_{K_S}) after a small number of iterations. Finally, the K_S most representative scenes are extracted, as the ones whose feature vectors are closest to (\hat{\mathbf{c}}_1, \hat{\mathbf{c}}_2, \ldots, \hat{\mathbf{c}}_{K_S}):

\hat{\mathbf{s}}_i = \arg\min_{\mathbf{s} \in S} d(\mathbf{s}, \hat{\mathbf{c}}_i), \quad i = 1, 2, \ldots, K_S    (8)

5.2 Frame selection

After extracting the most representative scenes, the next step is to select the most characteristic frames within each one of the selected scenes. The decision mechanism is based on the detection of those frames whose feature vectors reside in extreme locations of the feature vector trajectory. For this purpose, the magnitude of the second derivative of the feature vector with respect to time is used as a curvature measure.

Figure 5. (a) A continuous curve r(t) = (x(t), y(t)), and (b) the magnitude of its second derivative D(t) versus t.

For example, supposing that we have a 2-dimensional feature vector as a function of time t, which corresponds to the continuous curve r(t) = (x(t), y(t)) of Figure 5(a), the local maxima of the magnitude of the second derivative

D(t) = \left\| \frac{d^2 \mathbf{r}(t)}{dt^2} \right\|    (9)

(shown as small circles in Figure 5) correspond to the extreme locations of the curve and provide sufficient information about it, since r(t) could be almost reproduced from them using some kind of interpolation. Based on the above observation, local maxima of this curvature measure were extracted as the locations (frame numbers) of the characteristic frames in our experiments. Note that although this technique is extremely fast and very easy to implement in hardware, it may not work in all cases. For this reason, other techniques are currently under investigation, such as logarithmic search and genetic algorithms. Optimal selection can be achieved in this way, at the expense of increased computational complexity.
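A discrete version of the curvature measure (9) over a sampled feature trajectory, with characteristic frames taken at local maxima of the second-difference magnitude; the discretization and the toy trajectory are illustrative:

```python
def curvature(traj):
    """Magnitude of the discrete second derivative r(t-1) - 2 r(t) + r(t+1)
    at each interior sample of a feature vector trajectory."""
    D = []
    for t in range(1, len(traj) - 1):
        d2 = [a - 2 * b + c
              for a, b, c in zip(traj[t - 1], traj[t], traj[t + 1])]
        D.append(sum(v * v for v in d2) ** 0.5)
    return D

def select_frames(traj):
    """Frame indices where the curvature measure has a strict local maximum;
    D[t] corresponds to trajectory sample t + 1."""
    D = curvature(traj)
    return [t + 1 for t in range(1, len(D) - 1)
            if D[t] > D[t - 1] and D[t] > D[t + 1]]

# A trajectory that bends at two points; the bends become selected frames.
traj = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 2), (5, 2), (6, 2)]
```

The two bends of the toy trajectory are picked out as characteristic frames, while the straight-line stretches between them, where the feature vector changes predictably, contribute none.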
5.3 Video queries

Once the feature vector is formed as a function of time, a video database can also be searched in order to retrieve frames which possess particular properties, such as dark frames, frames with a lot of motion, and so on. The feature vector space is ideal for such comparisons, as it contains all essential frame properties, while its dimension is much lower than that of the image space. Moreover, a dramatic

reduction is achieved in the number of frames that are required for retrieval, browsing or indexing. Instead of examining every frame of a video sequence, queries are performed on a very small set of frames which provide a meaningful representation of the sequence.

6. Experimental Results

The proposed algorithms were integrated into a system that was tested using several video sequences from video databases. The results obtained from a TV news reporting sequence of total duration 35 seconds (875 frames) are illustrated in Figures 6, 7 and 8. Using scene cut detection, the test sequence was partitioned into 8 scenes, and the respective frame and scene feature vectors were calculated. For each scene, the frame whose feature vector is closest to the respective scene feature vector is depicted in Figure 6. Using the scene selection mechanism described above, three scenes were extracted as most representative and are shown in Figure 7. In effect, three scene clusters were generated, each containing scenes with similar properties, such as the number and complexity of objects. Moreover, it is clear that the selected scenes give a meaningful representation of the content of the 35 sec video sequence.

Figure 6. The 8 scenes of a test video sequence.

Figure 7. The 3 scenes that were selected as most representative.

Figure 8. The most representative frames of scene 3.

Scene 3 of Figure 6 was used in order to test the frame selection procedure. Out of a total of 137 frames, only 6 were selected as most representative and are shown in Figure 8. Due to object tracking, the trajectory of the feature vector versus time (frame number) is smoother than the one described in [2], and thus the selection procedure is more reliable. Although a very small percentage of frames is retained, it is obvious that one can perceive the content of the scene by just examining the 6 selected frames.

7. Conclusions - Further Work

An efficient content-based representation has been proposed in this paper.
In particular, a small but meaningful amount of information is extracted from a video sequence, which is capable of providing a representation suitable for visualization, browsing and content-based retrieval in video databases. Several improvements of the proposed system are possible, such as integration of the color and motion segmentation results, a more robust object tracking algorithm, more intelligent object extraction (e.g., extraction of human faces [6]), enhancement of the frame selection mechanism (based on correlation properties between feature vectors), and interweaving of audio and video information. These topics are currently under investigation.

8. References

[1] Y. Ariki and Y. Saito, "Extraction of TV News Articles Based on Scene Cut Detection using DCT Clustering," Proceedings of ICIP, Sept. 1996, Switzerland.
[2] A. Doulamis, Y. Avrithis, N. Doulamis and S. Kollias, "Indexing and Retrieval of the Most Characteristic Frames/Scenes in Video Databases," Proc. of WIAMIS, June 1997, Belgium.
[3] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele and P. Yanker, "Query by Image and Video Content: the QBIC System," IEEE Computer Magazine, pp. 23-32, Sept. 1995.
[4] M. Gelgon and P. Bouthemy, "A Hierarchical Motion-Based Segmentation and Tracking Technique for Video Storyboard-Like Representation and Content-Based Indexing," Proc. of WIAMIS, June 1997, Belgium.
[5] G. Iyengar and A.B. Lippman, "Videobook: An Experiment in Characterization of Video," Proceedings of ICIP, Sept. 1996, Switzerland.
[6] D. Kalogeras, N. Doulamis, A. Doulamis and S. Kollias, "Low Bit Rate Coding of Image Sequences using Adaptive Regions of Interest," accepted for publication, IEEE Trans. Circuits and Systems for Video Technology.
[7] H.-C. Lin, L.-L. Wang and S.-N. Yang, "Color Image Retrieval Based on Hidden Markov Models," IEEE Trans. Image Processing, Vol. 6, No. 2, pp. 332-339, Feb. 1997.
[8] O. J. Morris, M. J. Lee and A. G. Constantinides, "Graph Theory for Image Analysis: an Approach Based on the Shortest Spanning Tree," IEE Proceedings, Vol. 133, pp. 146-152, April 1986.
[9] MPEG Video Group, "MPEG-4 Requirements," ISO/IEC JTC1/SC29/WG11 N1679, Bristol MPEG Meeting, April 1997.
[10] MPEG Video Group, "MPEG-7: Context and Objectives (v.3)," ISO/IEC JTC1/SC29/WG11 N1678, Bristol MPEG Meeting, April 1997.