Practical Elimination of Near-Duplicates from Web Video Search


Xiao Wu +#, Alexander G. Hauptmann #, Chong-Wah Ngo +
+ Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
# School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, USA
alex@cs.cmu.edu, cwngo@cs.cityu.edu.hk

ABSTRACT

Current web video search results rely exclusively on text keywords or user-supplied tags. A search on a typical popular video often returns many duplicate and near-duplicate videos in the top results. This paper outlines ways to cluster and filter out the near-duplicate videos using a hierarchical approach. Initial triage is performed using fast signatures derived from color histograms. Only when a video cannot be clearly classified as novel or near-duplicate using global signatures do we apply a more expensive local feature based near-duplicate detection, which provides very accurate duplicate analysis through more costly computation. The results of 24 queries in a data set of 12,790 videos retrieved from Google, Yahoo! and YouTube show that this hierarchical approach can dramatically reduce redundant video displayed to the user in the top result set, at relatively small computational cost.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering, Search process; I.2.10 [Artificial Intelligence]: Vision and Scene Understanding - Video analysis

General Terms
Algorithms, Design, Experimentation, Performance.

Keywords
Similarity Measure, Novelty and Redundancy Detection, Filtering, Multimodality, Near-Duplicates, Copy Detection, Web Video.

1. INTRODUCTION

As bandwidth accessible to average users is increasing, video is becoming one of the fastest growing types of data on the Internet. Especially with the popularity of social media in Web 2.0, there has been exponential growth in videos available on the net. Users can obtain web videos easily, and distribute them again with some modifications.
For example, users upload 65,000 new videos each day on the video sharing website YouTube, and the daily video views are over 100 million [29]. Among these huge volumes of videos, there exist large numbers of duplicate and near-duplicate videos. It becomes important to manage these videos in an automatic and efficient way. To avoid getting swamped by almost identical copies of the same video in any search, efficient near-duplicate video detection and elimination is essential for effective search, retrieval, and browsing.

Current web video search engines tend to provide a list of search results ranked according to their relevance scores given a text query. While some users' information needs may be satisfied with the relevant items ranked at the very top, the topmost search results usually contain a vast amount of redundant videos. Based on a sample of 24 popular queries from YouTube [34], Google Video [10] and Yahoo! Video [32] (see Table 1), on average 27% of the videos in the search results are duplicates or near-duplicates of the most popular version of a video. Figure 1 shows actual search results from three currently popular web video search engines, with redundancy fairly obvious in this case. As a consequence, users need to spend a significant amount of time to find the videos they need and are subjected to repeatedly watching similar copies of videos which have been viewed previously. This process is extremely time-consuming, particularly for web videos, where the users need to watch different versions of duplicate or near-duplicate videos streamed over the Internet.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'07, September 23-28, 2007, Augsburg, Bavaria, Germany. Copyright 2007 ACM /07/ $5.00.
An ideal solution would be to return a list which not only maximizes precision with respect to the query, but also novelty (or diversity) of the query topic. This problem is generally referred to as novelty ranking (or sub-topic retrieval) in information retrieval (IR) [5, 36, 37]. Unfortunately, the text-based techniques from IR cannot be directly applied to discover video novelty. First, text keywords and user-supplied tags attached to web videos are usually abbreviated and imprecise. Second, most videos lack the web link structure typical in HTML documents which can be exploited for finding sub-topic relatedness. Finding novelty (or conversely, eliminating duplicates) among the relevant web videos must largely rely on the power of content analysis.

Due to the large variety of near-duplicate web videos ranging from simple formatting to complex editing, near-duplicate detection remains a challenging problem. Accurate detection generally comes at the cost of time complexity [20], particularly in a large video corpus. On the other hand, timely response to user queries is one important factor that fuels the popularity of Web 2.0. To balance the speed and the accuracy aspects, in this paper we propose a hierarchical approach combining global signatures and local feature based pairwise comparison to detect near-duplicate web videos. The tool of near-duplicate detection can be used in several ways: as a filter to remove redundant videos in the listing of retrieval results, as a tool for finding similar videos in different variations (e.g. to prevent copyright infringement), or as a way to discover the essential version of content appearing in different presentations. We show that the approach is practical for near-duplicate retrieval and novelty re-ranking of web videos where the majority of duplicates can be detected and removed from the top rankings.

The rest of this paper is organized as follows. In section 2 we give a brief overview of related work. A characterization of different types of near-duplicate web videos is provided in section 3. The proposed framework for efficient near-duplicate detection is introduced in section 4. Section 5 describes the data set used. Section 6 presents experiments and results for the two tasks: a) web result novelty re-ranking and b) finding similar videos. Finally, we conclude the paper with a summary.

2. RELATED WORK

2.1 Novelty Detection and Re-Ranking

Novelty/redundancy detection has been explored in text information retrieval from the event level [4, 33] to the document/sentence level [3, 39]. It is closely related to New Event Detection (NED) [4] or First Story Detection (FSD) in Topic Detection and Tracking (TDT) [2], which investigates several aspects of the automatic organization of news stories in the text area. The NED task is to detect the first story that discusses a previously unknown event. A common solution to NED is to compare news stories to clusters of stories from previously identified events. The novelty detection approaches for documents and sentences mainly focus on vector space models and statistical language models to measure the degree of novelty expressed in words. The idea of novelty detection has also been applied to web search to improve the search results [36]. Query relevance and information novelty have been combined to re-rank the documents/pages by using Maximal Marginal Relevance [5], Affinity Graph [37] and language models [36].
However, these approaches are mainly based on textual information. Recently, multimedia based novelty/redundancy detection has also been applied to cross-lingual news video similarity measure [30] and video re-ranking [3] by utilizing both textual and visual modalities. Hsu [3] used an information bottleneck method to re-rank video search results. For web videos, the textual information is usually limited and inaccurate. Therefore, applying text analysis to web videos makes little sense. To the best of our knowledge, there is little research on near-duplicate video detection and re-ranking for large scale web video search.

2.2 Video Copy and Similarity Detection

Video copy and similarity detection has been actively studied for its potential in search [6], topic tracking [3] and copyright protection [9]. Various approaches using different features and matching algorithms have been proposed. Generally speaking, global features are suitable for identifying the majority of copies with formatting modifications such as coding and frame resolution changes [7, 8,, 2, 8, 35], while segment or shot-level features can detect some of the copies with a simple to moderate level of editing [35]. More sophisticated approaches normally involve the intensive use of feature matching at the image region level [20]. Thus an associated issue is the computation and scalability problem [7, 9, 20].

Table 1. 24 Video Queries Collected from YouTube, Google Video and Yahoo! Video (#: number of videos)
Columns: Queries (ID, Query, #); Near-Duplicate (#, %)

ID  Query
1   The lion sleeps tonight
2   Evolution of dance
3   Fold shirt
4   Cat massage
5   Ok go here it goes again
6   Urban ninja
7   Real life Simpsons
8   Free hugs
9   Where the hell is Matt
10  U2 and green day
11  Little superstar
12  Napoleon dynamite dance
13  I will survive Jesus
14  Ronaldinho ping pong
15  White and Nerdy
16  Korean karaoke
17  Panic at the disco I write sins not tragedies
18  Bus uncle (巴士阿叔)
19  Sony Bravia
20  Changes Tupac
21  Afternoon delight
22  Numa Gary
23  Shakira hips don't lie
24  India driving
    Total

Among existing approaches, many emphasize the rapid identification of duplicate videos with global but compact and reliable features. These features are generally referred to as signatures or fingerprints, which summarize the global statistics of low-level features. Typical features include color, motion and ordinal signature [, 35] and prototype-based signature [7, 8, 22]. The matching between signatures is usually through bin-to-bin distance measures, possibly with intelligent frame skipping [8, 35] and randomization [7, 8] so as to minimize the number of feature comparisons. These approaches are suitable for identifying almost identical videos, and can detect minor editing in the spatial and temporal domain.

Another branch of approaches derives low-level features at the segment or shot level to facilitate local matching [, 2, 23, 28]. Typically the granularity of the segment-level matching, the changes in temporal order, and the insertion/deletion of frames all contribute to the similarity score of videos. The emphasis of these approaches is mostly on variants of matching algorithms such as dynamic time warping [], as well as maximal and optimal bipartite graph matching [28]. Compared to signature based methods, these approaches are slower but capable of retrieving approximate copies that have undergone a substantial degree of editing.

Duplicates with changes in background, color, and lighting make serious demands for stable and reliable features at region-level detail.
Differing from global features, local features can be extracted after segmenting an image into regions and computing a set of color, texture and shape features for each region. A simpler approach merely segments the image into NxN blocks, and extracts features for each block. Promising approaches, which have received a lot of attention recently, are to extract local feature points [5, 7, 9, 20, 27, 38]. These local points are salient local regions (e.g. corners) detected over image scales, which locate local regions that are tolerant to geometric and photometric variations [24]. While local points appear as promising features, a real challenge concerns the matching and scalability issues, since there simply exist too many local points for efficient, exhaustive comparison even between two frames. As a consequence, a major emphasis of these approaches is on exploring indexing structures [7, 9] and fast tracking with heuristics [27]. Most approaches indeed focus on keyframe-level duplicate detection [5, 27, 38]. Recent work in [20] shows how to perform video-level copy detection with a novel keypoint-against-trajectory search.

In web video search [8, 22], the duplicates can be of any variation from different formats to mixtures of complex modifications. Thus the right choice of features and matching algorithms cannot be pre-determined. This issue has not been seriously addressed, while the popularity of Web 2.0 has indeed made the problem timely and critical. In this paper, we explore a practical approach for near-duplicate web video filtering and retrieval.

Figure 1. Search results from different video search engines for the query "The lion sleeps tonight" demonstrate that there are a large number of near-duplicate videos in the topmost results.

3. NEAR-DUPLICATE WEB VIDEOS

3.1 Definition of Near-Duplicate Videos

Definition: Near-duplicate web videos are identical or approximately identical videos close to the exact duplicate of each other, but different in file formats, encoding parameters, photometric variations (color, lighting changes), editing operations (caption, logo and border insertion), different lengths, and certain modifications (frames added/removed). A user would clearly identify the videos as essentially the same.

A video is a duplicate of another if it looks the same, corresponds to approximately the same scene, and does not contain new and important information.
Two videos do not have to be pixel identical to be considered duplicates; whether two videos are duplicates depends entirely on the type of differences between them and the purpose of the comparison. Copyright law might consider even a portion of a single frame within a full-length motion picture video as a duplicate, if that frame was copied and cropped from another video source. A user searching for entertaining video content on the web might not care about individual frames, but about the overall content and subjective impression when filtering near-duplicate videos for more effective search. Exact duplicate videos are a special case of near-duplicate videos. In this paper, we include exact duplicates in our definition of near-duplicate videos, as these videos are also frequently returned by video search services.

3.2 Categories of Near-Duplicate Videos

To facilitate our further discussion, we classify near-duplicate web videos into the following categories:

Formatting differences
  Encoding format: flv, wmv, avi, mpg, mp4, ram and so on.
  Frame rate: 5fps, 25fps, 29.97fps
  Bit rate: 529kbps, 89kbps
  Frame resolution: 74x44, 320x240, 240x320

Content differences
  Photometric variations: color change, lighting change.
  Editing: logo insertion, adding borders around frames, superposition of overlay text.
  Content modification: adding unrelated frames with different content at the beginning, end, or in the middle.
  Versions: same content in different lengths for different releases.

Figure 2. Keyframe sequence of near-duplicate videos with different variations (each row corresponds to one video). (a) is the standard version (b) brightness and resolution change (c) frame rate change (d) adding overlay text, borders and content modification at the end (e, f) content modification at beginning and end (g) longer version with borders (h) resolution differences

Furthermore, to avoid performing duplicate comparison on all frames, a video is usually viewed as a list of shots represented by representative keyframes, which can cause near-duplicate videos to have different keyframe sequences. A web video is a sequence of consecutive frames that describes a meaningful scene. Commonly, a video is first partitioned into a set of shots based on editing cuts and transitions between frames, and then a representative keyframe is extracted to represent each shot. Extracting a representative keyframe from the middle of a shot is therefore relatively reliable for extracting basically similar keyframes from different near-duplicates. This mapping of video to keyframes reduces the number of frames that need to be analyzed by a factor that depends on the type of video. Although methods for detecting shots are overall quite robust for finding identical videos with the same format, when applied to near-duplicate videos with different frame rates they could generate different keyframe sequences. This potentially induces the problem of viewpoint changes, zooming and so on, which makes near-duplicate detection more complex.
Figure 2 shows examples of near-duplicate web videos for the query "The lion sleeps tonight" with simple scenes. We can see that the extracted keyframes are slightly different near-duplicate variations. The overall scene is relatively simple because there are some common things throughout the videos (brown object and blue background). Figure 3 demonstrates another query, "White and Nerdy", with complex scenes in which the content in the keyframes changes dramatically. Both simple and extensive changes are frequently mixed together to form more complicated transformations, making near-duplicate video detection a challenging problem.

4. HIERARCHICAL NEAR-DUPLICATE VIDEO DETECTION

In this section, we introduce the proposed hierarchical approach for near-duplicate web video detection. The framework combining global signatures and pairwise comparison is first presented in section 4.1, followed by the detailed description of global signatures with color histograms (SIG_CH) as a fast filter in section 4.2, and a more accurate but expensive local feature based pairwise comparison among keyframes (SET_NDK) in section 4.3. Finally, we compare global signatures and pairwise comparison for near-duplicate video detection in section 4.4.

4.1 Hierarchical Framework

Our analysis of a diverse set of popular web videos shows that there are around 20% exact duplicate videos among all near-duplicate web videos. It is common for web users to upload exact duplicate videos with minimal change. This demands an approach for fast detection of duplicate videos. A global signature from color histograms (SIG_CH) is just this kind of fast measure, suitable for matching videos with identical and almost identical content with only minor changes. The global signatures are basically the global statistics or summaries of low-level color features in videos. The similarity of videos is measured by the distance between signatures [35].

Figure 3. Two videos of the complex scene query "White and Nerdy" with complex transformations (only the first ten keyframes are displayed): logo insertion, geometric and photometric variations (lighting change, black border), and keyframes added/removed

However, for videos with major editing, content modification, or dramatic photometric and geometric transformations, global signatures tend to be inadequate. Especially when multiple variations are mixed together, near-duplicate detection becomes even harder. Furthermore, due to different frame rates and content modifications such as the insertion of commercials or title frames at the beginning and credits at the end, the extracted keyframe sequences could be different. Even non-duplicate videos could have color distributions similar to those of duplicate videos, and would then be falsely detected as similar videos.

In contrast to global signatures, pairwise keyframe comparison treats each keyframe as an independent node, and two videos are compared by measuring the pairwise similarity among these nodes. Local feature based methods can accurately capture the mapping among keypoints. Pairwise comparison among keyframes can further measure the degree of overlap between two videos. Therefore local feature based pairwise comparison (SET_NDK) has great potential for detecting near-duplicate keyframes and ultimately providing a reliable measurement for videos that have been nontrivially modified. However, the computation of local points is more expensive than mere color histograms, and the keyframes have to be compared pairwise.

To guarantee effective near-duplicate detection while meeting the speed requirements for Google-scale video collections, we propose a hierarchical method which utilizes both global signatures and local keypoints for detecting near-duplicate web videos. A global signature from color histograms is first used to detect the near-duplicate videos with high confidence and filter out very dissimilar videos. Figure 4 shows the signature distance distributions of near-duplicate and novel videos from our test set. Some videos can be directly identified as near-duplicate videos, for example, the ones with distance less than 0.2, while other videos with large distance can safely be labeled as novel ones, for example, those with distance greater than 0.7.
With this filtering, a large portion of videos can be successfully identified, which reduces the computation for the more expensive pairwise comparison. For videos that cannot be clearly classified as either novel or near-duplicate using global signatures (at distances between 0.2 and 0.7), we apply local feature based near-duplicate detection, which provides very accurate duplicate analysis at higher cost. The combination of global signature and pairwise comparison can balance performance and cost.

Figure 4. Signature distance distribution of near-duplicate and novel videos (horizontal axis: signature distance; series: near-duplicate videos, novel videos)

4.2 Global Signature on Color Histograms

A color histogram is calculated for each keyframe of the video, which is represented as $H = (h_1, h_2, \ldots, h_m)$. As a typical feature here, we use the HSV color space. A histogram is concatenated with 18 bins for Hue, 3 bins for Saturation, and 3 bins for Value, hence m = 24. A video signature (VS) is defined as an m-dimensional vector of a normalized color histogram over all keyframes in the video:

$VS = (s_1, s_2, \ldots, s_m)$, where $s_i = \frac{1}{n} \sum_{j=1}^{n} h_{ij}$

where n is the number of keyframes in the video, and $h_{ij}$ is the i-th bin of the color histogram at keyframe j. We compute the distance of two signatures $VS_i$ and $VS_j$ based on the Euclidean distance:

$R(V_i \mid V_j) = d(VS_i, VS_j) = \sqrt{\sum_{k=1}^{m} (x_k - y_k)^2}$

where $VS_i = (x_1, \ldots, x_m)$ and $VS_j = (y_1, \ldots, y_m)$. Two videos are regarded as near-duplicate if their distance is considered close.

The signatures of videos can be indexed and then searched without accessing the original videos. So the retrieval speed is rather fast with efficient mechanisms available for searching distance between moderately sized feature vectors [8].

4.3 Pairwise Comparison among Keyframes

For web videos that cannot be determined to be novel or near-duplicate using the global signature, a local feature based method (SET_NDK) is used to measure the similarity of keyframes by pairwise comparison of keyframes from two videos, and then the redundancy of these two videos can be determined by comparing the ratio of the number of similar keyframes.
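The signature of section 4.2 and the distance-based triage of section 4.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the per-keyframe normalization (each histogram is scaled to sum to 1) and the label strings are assumptions, and the thresholds 0.2 and 0.7 are only the example values quoted in the text.

```python
import math


def video_signature(keyframe_histograms):
    """Average the normalized keyframe histograms bin by bin.

    Implements s_i = (1/n) * sum_j h_ij, assuming (hypothetically) that
    each keyframe histogram is first normalized to sum to 1.
    """
    n = len(keyframe_histograms)
    if n == 0:
        raise ValueError("a video needs at least one keyframe")
    m = len(keyframe_histograms[0])
    signature = [0.0] * m
    for hist in keyframe_histograms:
        total = sum(hist)
        if len(hist) != m or total <= 0:
            raise ValueError("histograms must share a length and be non-empty")
        for i, value in enumerate(hist):
            signature[i] += value / total / n
    return signature


def signature_distance(vs_a, vs_b):
    """Euclidean distance between two video signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vs_a, vs_b)))


def triage(distance, near_dup_below=0.2, novel_above=0.7):
    """Decide from the signature distance alone, or defer to local features.

    The two thresholds are the example values from the text; the returned
    labels are illustrative names, not terms from the paper.
    """
    if distance < near_dup_below:
        return "near-duplicate"
    if distance > novel_above:
        return "novel"
    return "needs-local-check"
```

Only the pairs that come back as "needs-local-check" would be passed on to the more expensive keyframe comparison, which is where the cost saving of the hierarchy comes from.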
In this section we first introduce the local feature based technique to detect the near-duplicate keyframes (NDK) in videos with a sliding window, followed by the measure (set difference) of video redundancy with the information of keyframe similarity.

4.3.1 Near-duplicate Keyframe Detection with Local Features

In contrast to global features, features derived from local points can recognize various transformations from editing, viewpoint, and photometric changes. Salient regions in each keyframe can be extracted with local point detectors (e.g. DOG [24], Hessian-Affine [26]) and their descriptors (e.g. SIFT [25]) are mostly invariant to local transformations. The keypoint based local feature detection approach avoids the shortcoming of global features and therefore is particularly suitable for detecting near-duplicate web videos having complex variations.

To detect near-duplicate keyframes, the local points of each keyframe were located by the Hessian-Affine detector [26]. The local points were then described by PCA-SIFT [9], which is a 36 dimensional vector for each local point. With a fast indexing structure, local points were matched based on a point-to-point symmetric matching scheme [27]. In our experiments, we treat two keyframes as similar if the number of local point matching pairs between two keyframes is above a certain threshold.

4.3.2 Keyframe Matching Window

To find all near-duplicate/similar keyframes in two videos, the traditional method is to exhaustively compare each keyframe pair, in which case the time complexity is the product of the numbers of keyframes in the two videos. When videos consist of a large number of keyframes, this is expensive and not feasible for large scale web video collections. To reduce the computation, each keyframe was only compared to the corresponding keyframes in another video within a certain sliding window. For near-duplicate web videos, there exists a certain mapping among keyframes. For example, the corresponding near-duplicate keyframes of one video in Figure 3 are within a certain distance in another video. To avoid unnecessary comparison and guarantee minimal missed detection, we utilize a sliding window policy to effectively reduce the computation. The i-th keyframe in one video is only compared with the keyframes of another video within the following range:

$Range = [\max(1, i - df - w), \min(i + df + w, n)]$

where n is the length of the other video, i.e. its number of keyframes, df is the length difference between the two videos, and w is the window size. In our experiments, the window size w is fixed to 5. Figure 5 gives an example of the matching window between two videos. The whole near-duplicate keyframe list is generated by transitive closure based on the information of each two keyframes, which forms a set of NDK groups [27].

This scheme is especially useful for complex scene videos with a large number of keyframes, such as queries 5, 7 and 23 in Table 1. These videos are represented by as many as 100 keyframes, where this scheme can greatly diminish the number of necessary comparisons. Although the sliding window scheme might miss part of the near-duplicate keyframes for a single keyframe in videos of simple scenes, these missed near-duplicate keyframes will eventually be included by transitive closure, considering the fact that keyframes for simple scene videos are usually very similar.

4.3.3 Set Difference of Keyframes

Once the similar keyframes have been identified, we use normalized set difference as the metric to evaluate the similarity between two videos. The set difference measure represents each video as a set of keyframes, either near-duplicate keyframes (NDK) or non-near-duplicate keyframes (non-NDK).
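The matching window and the transitive closure described above can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions: keyframes are indexed from 1 as in the range formula, the function names and the keyframe labels are hypothetical, and the keyframe similarity test itself (local point matching) is left out, so the matched pairs are simply given as input.

```python
def matching_window(i, n_other, length_diff, w=5):
    """Inclusive 1-based range of keyframes in the other video for keyframe i.

    Follows Range = [max(1, i - df - w), min(i + df + w, n)].
    """
    start = max(1, i - length_diff - w)
    end = min(i + length_diff + w, n_other)
    return start, end


def candidate_pairs(n_a, n_b, w=5):
    """Yield the (i, j) keyframe pairs that the window policy would compare."""
    df = abs(n_a - n_b)
    for i in range(1, n_a + 1):
        start, end = matching_window(i, n_b, df, w)
        for j in range(start, end + 1):
            yield i, j


def ndk_groups(matched_pairs):
    """Group keyframes by transitive closure of pairwise matches (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```

For two videos of 100 keyframes each, the window with w = 5 visits at most 11 candidates per keyframe instead of 100, and a keyframe that the window misses can still end up in the right group if it matches any other member of that group.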
It calculates the ratio of the number of duplicate keyframes to the total number of keyframes in a video. It is measured by the following formulation:

$R(V_i \mid V_j) = \frac{|KF_i \cap KF_j|}{(|KF_i| + |KF_j|) / 2}$

where $KF_i$ is the set of keyframes contained in video $V_i$. This measure counts the ratio of intersected near-duplicate keyframes. The higher the rate, the more redundant the video.

Figure 5. Matching window for keyframes between two videos (the i-th keyframe of one video is compared with keyframes max(1, i-df-w) to min(i+df+w, n) of the other)

Table 2. Comparison of Near-Duplicate Detection Capability for Global Color Histogram Signatures (SIG_CH) and Pairwise Comparison among Keyframes (SET_NDK)

Typical Near-Duplicate Categories      Freq %   SIG_CH / SET_NDK
Exactly duplicate                      20%
Photometric variations                 20%      X
Editing (inserting logo, text)         5%       P
Resolution                             2%
Border (Zoom)                          8%       P
Simple scene: Content modification     20%      X  P
Simple scene: Different lengths        10%
Complex scene: Content modification    25%      P
Complex scene: Different lengths       5%       X
Other                                  5%       X  P
X: unable to detect; P: partially able to detect; all other entries: able to detect

4.4 Signature vs. Pairwise Comparison

The categories of web video variations and the capability of the global signature based on color histograms (SIG_CH) and the local feature based pairwise comparison of keyframes (SET_NDK) are listed in Table 2. The table categorizes different types of near-duplicates, and provides estimates of how frequently each category appeared in our web video test collection of 12,790 videos (Freq %). It also identifies which of the two approaches, SIG_CH and SET_NDK, is suitable for each type of near-duplicate detection.

The color histogram based global signature is able to detect duplicate and near-duplicate videos with certain minor variations (e.g. small logo insertion). Furthermore, the detection capability for simple scenes and complex scenes is different. For a simple scene video like "The lion sleeps tonight" in Figure 2, the key aspect (theme) of the extracted keyframes is a brown lion with a blue background. Dropping/inserting a couple of similar keyframes will not seriously affect the color distribution.
A global signature using color histograms can therefore potentially detect certain kinds of near-duplicate videos. But for complex scenes, such as "White and Nerdy" in Figure 3, the insertion and removal of keyframes will cause extensive changes in the global color signatures. The global signature is unable to recognize near-duplicates with different lengths because the color and ordinal distributions have changed quite dramatically. Generally, computing global signatures is fast, but their potential to detect near-duplicate videos is limited.

On the other hand, local points are effective for finding duplicates with photometric and geometric variations, complex editing and zooming. Moreover, the local mapping among keyframes is especially suitable for detecting duplicate videos with different versions, inserted/deleted keyframes and the various keyframe sequences caused by shot boundary detection algorithms. However, the matching process is naturally slow due to the large numbers of keypoints and the high dimensionality of the keypoint descriptors. Typically there are hundreds to thousands of keypoints identified in one keyframe. Although a fast indexing structure (e.g. LSH [9], LIP-IS [27]) can filter out comparisons among feature points and the matching window strategy reduces the comparisons among keyframes, the matching (nearest neighbor search) is computationally expensive and not scalable to very large video databases.

The hierarchical approach combining the global signature and pairwise comparison is a reasonable solution to provide effective and efficient near-duplicate web video detection. Even though our experiments were done with one specific set of global features and local point descriptors, the basic principles of the approach, and its cost/effectiveness analysis, would easily apply to other sets of global features and other spatial or local point descriptors.

5. DATASET

To test our approach, we selected 24 queries designed to retrieve the most viewed and top favorite videos from YouTube. Each text query was issued to YouTube, Google Video, and Yahoo! Video respectively and we collected all retrieved videos as our dataset. The videos were collected in November 2006. Videos with time duration over 10 minutes were removed from the dataset since they were usually documentaries or TV programs retrieved from Google, and were only minimally related to the queries. The final data set consists of 12,790 videos. Tables 3 and 4 summarize the formats and sources of web videos respectively. The query information and the number of near-duplicates to the dominant version (the video most frequently appearing in the results) are listed in Table 1. For example, there are 1,771 videos in query 15 "White and Nerdy", and among them there are 696 near-duplicates of the most common version in the result lists. Shot boundaries were detected using tools from CMU [4] and each shot was represented by a keyframe. In total there are 398,015 keyframes in the set.

To analyze the performance of the novelty re-ranking and near-duplicate video retrieval, two non-expert assessors were asked to watch videos one query at a time. The videos were ordered according to the sequence returned by the video search engines. For near-duplicate video retrieval, the most popular video was selected as the seed video for each query. The assessors were requested to label the videos with a judgment (redundant or novel) and to form the ground truth.
To evaluate the re-rankng results, the assessors were also requested to dentfy the near-duplcate clusters n an ncremental way and the fnal rankng lst was formed based on the orgnal relevance rankng after removng near-duplcate vdeos. 5. Performance Metrc To evaluate the performance, we use measures: precson and recall, and novelty mean average precson (NMAP). The former measure s to assess the performance of near-duplcate detecton, whle the latter measures the ablty to re-rank relevant web vdeos accordng to ther novelty. Let G be the ground truth set of redundant vdeos and D be the detected one. Table 3. Vdeo Format Informaton Formats No. Vdeos Percentage FLV % MPG % AVI % WMV % MP % Table 4. Vdeo Source Informaton Sources YouTube Google Yahoo! No. vdeos Percentage 83.8 %.2 % 5 % Total 2790 Re call = G D / G Pr ecson = G D / D The novelty mean average precson (NMAP) measures the mean average precson of all tested queres, consderng only novel and relevant vdeos as the ground truth set. In other words, f two vdeos are relevant to a query but near-duplcate to each other, only the frst vdeo s consdered as a correct match. For a gven query, there are total of N vdeos n the collecton that are relevant to the query. Assume that the system only retreves the top k canddate novel vdeos where r s the number of novel vdeos seen so far from rank to. The NMAP s computed as: k NMAP = ( / r ) N = / 6. EXPERIMENTS In ths paper, we dscuss two expermental tasks: search result novelty re-rankng and near-duplcate web vdeo retreval. Search result novelty re-rankng ams to provde novel vdeos based on relevance rankng by elmnatng all near-duplcate vdeos. Nearduplcate web vdeo retreval seeks to fnd all vdeos that are nearduplcates to a query (seed) vdeo. Potentally the frst scenaro s a more challengng task snce the number of possble nearduplcate vdeos ncreases quadratcally. 6. 
6.1 Task 1: Novelty Re-Ranking

The objective of search result novelty re-ranking is to list all the novel videos while maintaining the relevance order. To combine query relevance and novelty, the redundancy of each video V_i is computed through a pairwise comparison between V_i and every previously ranked novel video V_j:

\[ R(V_i \mid V_1, \ldots, V_{i-1}) = \max_{1 \le j \le i-1} R(V_i \mid V_j) \]

The preceding ranked video that is most similar to V_i determines the redundancy of V_i. The ranked list that remains after removing all near-duplicate videos is presented to the user.

To evaluate the performance of novelty re-ranking, we compared the re-ranking results based on time duration, global signatures and the hierarchical method. The original ranking from the search engine acts as the baseline. Re-ranking based on time duration was tested because of the intuition that duplicate videos usually have similar time durations. In addition to the most popular version in the results, there are other subordinate versions that differ from the dominant one. Figure 6 illustrates the time duration distribution of videos in the query "Sony Bravia", which potentially indicates a couple of subsidiary versions (e.g. a version of 47 seconds) in the results differing from the most popular one (a version of 70 seconds). If the time difference between two videos is within an interval (e.g. 3 seconds), they are treated as redundant. Similarly, two videos are regarded as duplicates when their signature difference is small enough (e.g. less than 0.5). In this experiment, we tested different intervals (e.g. 0, 3, 5 seconds) and signature thresholds (e.g. 0.5, 0.2, 0.3), and the setting with the best performance is reported.

Usually, the top search results receive the most attention from users. The performance comparison up to the top 30 search results is illustrated in Figure 7, and the average NMAP over all top k levels is listed in Table 5. The performance of the original search results is clearly poor because duplicate videos commonly appear in the top list.
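The re-ranking rule of Task 1 (a video is judged by its most similar predecessor among the videos already kept as novel) can be sketched as a single greedy pass over the ranked list. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `novelty_rerank` and `duration_redundancy`, the binary redundancy score, and the 0.5 cut-off on that score are all hypothetical. Only the idea of treating videos within a few seconds of each other as redundant comes from the text.

```python
from typing import Callable, List, Sequence, TypeVar

V = TypeVar("V")


def novelty_rerank(
    ranked: Sequence[V],
    redundancy: Callable[[V, V], float],
    threshold: float = 0.5,  # hypothetical cut-off on the redundancy score
) -> List[V]:
    """Keep videos in relevance order, dropping those redundant to an earlier novel one.

    The redundancy of a video is the maximum pairwise redundancy against
    every previously kept (novel) video.
    """
    novel: List[V] = []
    for video in ranked:
        score = max((redundancy(video, kept) for kept in novel), default=0.0)
        if score < threshold:
            novel.append(video)
    return novel


def duration_redundancy(interval: float = 3.0) -> Callable[[float, float], float]:
    """Binary redundancy from time duration: 1.0 if two durations differ by at most `interval` seconds."""
    def redundancy(a: float, b: float) -> float:
        return 1.0 if abs(a - b) <= interval else 0.0
    return redundancy


# Durations (in seconds) of six results in relevance order.
durations = [70, 71, 47, 69, 120, 48]
print(novelty_rerank(durations, duration_redundancy(3.0)))  # -> [70, 47, 120]
```

Passing the redundancy measure as a function keeps the pass itself unchanged when the duration test is replaced by a signature distance or by the hierarchical measure.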
Time duration information can distinguish novel videos at the top of the list; however, different web videos can have the same duration, especially for queries accompanied by background music or for music videos, e.g. queries 8, 10, 23. As the number of videos increases, time duration information becomes inadequate, and the performance drops considerably. Although the global signature method can identify duplicate videos to some extent, its ability to detect them is limited: many near-duplicate videos cannot be correctly detected. The re-ranked list therefore still contains some duplicate videos, and some novel videos are falsely removed. Overall, our hierarchical method effectively eliminates duplicate videos, which improves the diversity of the search results, and it achieves good and stable performance across all top k levels.

Table 5. Overall Novelty Re-Ranking Performance

Solutions                            Average NMAP
Original Ranking                     0.76
Re-Ranking by Time Duration          0.74
Re-Ranking by Global Signature       0.84
Re-Ranking by Hierarchical Method    0.94

As search engines demand quick responses, computation time is an important factor. The average number of keyframe pair comparisons for top k re-ranking over 24 queries is listed in Table 6. Compared to fast re-ranking with global signatures and time duration, the hierarchical method is more expensive. However, by using global signature filtering and the sliding window, the hierarchical method greatly reduces the computation compared to exhaustive comparison among keyframes, which makes novelty re-ranking feasible. Depending on the complexity of the keyframes, the time for a keyframe pair comparison ranges from 0.0 to 0. second on a Pentium-4 machine with a 3.4 GHz CPU and 1G main memory. The average time to re-rank the top-10 results is around a couple of minutes. With the fast development of computers and parallel processing, especially on platforms such as Google's parallel architecture, responding to queries quickly with our hierarchical approach is not a problem.

Figure 6. The time duration distribution for the query "Sony Bravia" (query 9) indicates that there might be multiple sets of duplicate videos different from the most popular video in the search results. (Axes: time duration, # of videos.)

Figure 7. Performance comparison of novelty re-ranking. (Axes: top k, NMAP; curves: Original, Duration, Signature, Hierarchical.)

Table 6. Average number of keyframe pair comparisons for top k ranking over all queries with the hierarchical method

Top k
Pairs

6.2 Task 2: Near-Duplicate Video Retrieval

In addition to novelty re-ranking, users can also retrieve all videos that are near-duplicates of a query video. Given a seed (query) video V_s, all relevant videos are compared with the seed video to see whether they are near-duplicates. The redundancy is computed by:

\[ R(V_i) = R(V_i \mid V_s) \]

Here, the redundancy measure is based on the proposed hierarchical method, which combines the global signature and the pairwise measure. Videos with a small signature distance are directly labeled as near-duplicates, while dissimilar ones are filtered out as novel videos. For the uncertain videos, local features are further used to measure the redundancy of the videos.

In this task, we retrieve the most popular video in each query. The seed (query) video can be determined automatically or manually according to the time duration distribution of the videos in the ranked list, the relevance ranking and the global signature. The popular video in the topmost list with the dominant time duration was picked as the seed video, and the other videos were compared with it to see whether they are near-duplicates of it.

The detailed and general performance comparisons for near-duplicate retrieval are shown in Figures 8 and 9 respectively. As seen from Figure 8(a), the global signature on color histogram (SIG_CH) achieves good performance for queries with simple scenes, or complex scenes with minor editing and variations, e.g. queries 3 and 24. These near-duplicate videos have minor changes, so the signature alone can detect most of the near-duplicate videos and filter out dissimilar videos. But for queries with complex scenes (e.g. queries 10, 5, 22, 23), the signature-based method is insufficient. Dissimilar videos can have a color distribution similar to that of the seed video.
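The three-way decision described for Task 2 (accept on a small signature distance, reject on a large one, and run the costly local-feature comparison only in between) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The Euclidean distance between signatures, the threshold values `dup_threshold` and `novel_threshold`, and the `local_match` callback standing in for the local-feature keyframe comparison are all assumptions made for the example.

```python
import math
from typing import Callable, Dict, List, Sequence

Signature = Sequence[float]


def signature_distance(a: Signature, b: Signature) -> float:
    """Euclidean distance between two global signatures (assumed distance measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_near_duplicates(
    seed_id: str,
    signatures: Dict[str, Signature],
    local_match: Callable[[str, str], bool],  # hypothetical expensive local-feature test
    dup_threshold: float = 0.15,    # hypothetical: at or below this, clearly near-duplicate
    novel_threshold: float = 0.5,   # hypothetical: at or above this, clearly novel
) -> List[str]:
    """Return ids of videos judged near-duplicate to the seed video."""
    seed_signature = signatures[seed_id]
    near_duplicates: List[str] = []
    for video_id, signature in signatures.items():
        if video_id == seed_id:
            continue
        distance = signature_distance(seed_signature, signature)
        if distance <= dup_threshold:
            # Cheap accept: the global signatures are almost identical.
            near_duplicates.append(video_id)
        elif distance < novel_threshold and local_match(seed_id, video_id):
            # Uncertain band: fall back to the costly pairwise comparison.
            near_duplicates.append(video_id)
        # Otherwise the video is filtered out as novel.
    return near_duplicates
```

Only the videos in the uncertain band reach `local_match`, which is where the saving over exhaustive keyframe comparison comes from.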
Moreover, videos with major variations, or with keyframes inserted or removed, show remarkable differences in color distribution. However, the pairwise comparison method based on local features can effectively identify the near-duplicate keyframe mapping and eliminate dissimilar videos that have similar color signatures. Compared to Figure 8(a), the precision-recall curves of the hierarchical method (HIRACH, Figure 8(b)) show a prominent improvement. Most of the queries have high precision, especially at high recall levels. The pairwise comparison is especially useful for queries with complex scenes (e.g. queries 10, 5, 22, 23). The queries having relatively low precision and recall with HIRACH are queries 8 and 22. For query 8 ("Bus uncle"), the video was originally captured by a cell phone on a bus, so the scene is a little vague and the quality is poor. Furthermore, the near-duplicate videos have undergone extensive editing and content modification (e.g. overlay text, frame insertion), while the query video clip consists of only two keyframes, which makes detection difficult. As a result, the precision and recall are low. For query 22 ("Numa Gary"), many unrelated frames were inserted at the beginning and end of some near-duplicate videos, which induces low similarity scores. Therefore, the performance of query 22 is not good enough at high recall. Overall, the hierarchical method achieves satisfactory results.

Figure 9 shows the average precision over 24 queries. HIRACH improves the performance considerably: it successfully detects near-duplicate videos with complex transformations and filters out dissimilar ones. The average precision over all recall levels (0.05.0) is shown in Table 7 and in the last column of Figure 9 (denoted AVG). The average precision is improved from (SIG_CH) to (HIRACH).

Table 7. Average precision of all queries over all recall levels

Methods    SIG_CH    HIRACH
Average

7. CONCLUSION

With the exponential growth of web videos, especially with the coming of the Web 2.0 era, a huge number of near-duplicate videos are commonly returned by current video search engines. The diversity of near-duplicate videos ranges from simple formatting to complex mixtures of different editing effects, which makes near-duplicate video detection a challenging task. To trade off the performance and speed requirements, we proposed a hierarchical method that combines global signatures and a local pairwise measure. Global signatures on color histograms were first used to detect clear near-duplicate videos with high confidence and to filter out obviously dissimilar ones.
For videos that cannot be clearly classified as novel or near-duplicate using global signatures, we applied the local feature based near-duplicate detection, which provides very accurate duplicate analysis at a higher cost. Experiments on a data set of 2,790 videos retrieved from YouTube, Google Video, and Yahoo! Video show that the hierarchical approach can effectively detect a large diversity of near-duplicate videos and dramatically reduce redundant video displayed to the user in the top result set, at relatively small computational cost.

Our current research can be further extended to find the essential content that frequently appears across relevant videos. It could act as a good tool for gleaning a quick summary of the most important clips from the returned videos. This approach could also be used to develop customized web video crawlers that are tailored to recognize users' interests and are sent out on autonomous search missions. Furthermore, in the future we will build classifiers to automatically partition videos into simple and complex scenes and then apply different strategies to each.

Figure 8. Performance of near-duplicate video retrieval: (a) SIG_CH, (b) HIRACH. (Axes: recall, precision.)

Figure 9. Average near-duplicate retrieval performance comparison for different approaches over all queries. (Average precision of 24 queries for SIG_CH and HIRACH; last column: AVG.)

8. ACKNOWLEDGEMENT

The work described in this paper was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 8905). We'd like to thank Rong Yan for the web video crawler and Wan-Lei Zhao for the NDK detection.

9. REFERENCES

[1] D. A. Adjeroh, M. C. Lee, and I. King. A Distance Measure for Video Sequences. CVIU, 1999.
[2] J. Allan, editor. Topic Detection and Tracking: Event-based Information Organization. Kluwer Academic Publishers.
[3] J. Allan, C. Wade, and A. Bolivar. Retrieval and Novelty Detection at the Sentence Level. ACM SIGIR'03.
[4] T. Brants, F. Chen, and A. Farahat. A System for New Event Detection. ACM SIGIR'03, Canada, Jul.
[5] J. Carbonell and J. Goldstein. The Use of MMR, Diversity-based Reranking for Reordering Documents and Producing Summaries. ACM SIGIR'98.
[6] S-F. Chang, W. Hsu, L. Kennedy, L. Xie, et al. Columbia University TRECVID-2005 Video Search and High-Level Feature Extraction. TRECVID 2005, Washington DC.
[7] S. C. Cheung and A. Zakhor. Efficient Video Similarity Measurement with Video Signature. IEEE Trans. on CSVT, vol. 3, Jan.
[8] S. C. Cheung and A. Zakhor. Fast Similarity Search and Clustering of Video Sequences on the World-Wide-Web. IEEE Trans. on CSVT, vol. 7, no. 3, June 2005.
[9] E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty. WWW'04, USA, 2004.
[10] Google Video. Available:
[11] A. Hampapur and R. Bolle. Comparison of Sequence Matching Techniques for Video Copy Detection. Conf. on Storage and Retrieval for Media Databases.
[12] T. C. Hoad and J. Zobel. Fast Video Matching with Signature Alignment. MIR'03, USA.
[13] W. H. Hsu, L. S. Kennedy and S-F. Chang. Video Search Reranking via Information Bottleneck Principle. ACM MM'06, USA.
[14] Informedia. Available:
[15] A. Jaimes. Conceptual Structures and Computational Methods for Indexing and Organization of Visual Information. Ph.D. Thesis.
[16] A. K. Jain, A. Vailaya, and W. Xiong. Query by Video Clip. ACM Multimedia Syst. J., vol. 7, 1999.
[17] A. Joly, O. Buisson and C. Frelicot. Content-Based Copy Retrieval Using Distortion-based Probabilistic Similarity Search. IEEE Trans. on MM, vol. 9, no. 2, Feb.
[18] K. Kashino, Takayuki, and H. Murase. A Quick Search Method for Audio and Video Signals Based on Histogram Pruning. IEEE Trans. on MM, vol. 5, no. 3.
[19] Y. Ke, R. Sukthankar, and L. Huston. Efficient Near-Duplicate Detection and Sub-Image Retrieval. ACM MM'04.
[20] J. Law-To, B. Olivier, V. Gouet-Brunet and B. Nozha. Robust Voting Algorithm Based on Labels of Behavior for Video Copy Detection. ACM MM'06.
[21] R. Lienhart and W. Effelsberg. VisualGREP: A Systematic Method to Compare and Retrieve Video Sequences. Multimedia Tools Appl., vol. 10, Jan.
[22] L. Liu, W. Lai, X.-S. Hua, and S.-Q. Yang. Video Histogram: A Novel Video Signature for Efficient Web Video Duplicate Detection. MMM'07.
[23] X. Liu, Y. Zhuang, and Y. Pan. A New Approach to Retrieve Video by Example Video Clip. ACM MM'99, 1999.
[24] D. Lowe. Distinctive Image Features from Scale-Invariant Key Points. IJCV, vol. 60, pp. 9-0.
[25] K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. CVPR'03.
[26] K. Mikolajczyk and C. Schmid. Scale and Affine Invariant Interest Point Detectors. IJCV, 60 (2004).
[27] C-W. Ngo, W-L. Zhao, Y-G. Jiang. Fast Tracking of Near-Duplicate Keyframes in Broadcast Domain with Transitivity Propagation. ACM MM'06, USA, Oct.
[28] Y. Peng and C-W. Ngo. Clip-based Similarity Measure for Query-Dependent Clip Retrieval and Video Summarization. IEEE Trans. on CSVT, vol. 6, no. 5, May.
[29] Wikipedia.
[30] X. Wu, A. G. Hauptmann, and C.-W. Ngo. Novelty Detection for Cross-Lingual News Stories with Visual Duplicates and Speech Transcripts. ACM MM'07.
[31] X. Wu, C-W. Ngo, and Q. Li. Threading and Autodocumenting News Videos. IEEE Signal Processing Magazine, vol. 23, no. 2, March.
[32] Yahoo! Video. Available:
[33] Y. Yang, J. Zhang, J. Carbonell and C. Jin. Topic-conditioned Novelty Detection. ACM SIGKDD'02, Canada.
[34] YouTube. Available:
[35] J. Yuan, L. Y. Duan, Q. Tian, S. Ranganath and C. Xu. Fast and Robust Short Video Clip Search for Copy Detection. Pacific Rim Conf. on Multimedia (PCM).
[36] C. Zhai, W. Cohen and J. Lafferty. Beyond Independent Relevance: Methods and Evaluation Metrics for Subtopic Retrieval. ACM SIGIR'03.
[37] B. Zhang et al. Improving Web Search Results Using Affinity Graph. ACM SIGIR'05.
[38] D-Q. Zhang and S-F. Chang. Detecting Image Near-Duplicate by Stochastic Attributed Relational Graph Matching with Learning. ACM MM'04, USA, Oct.
[39] Y. Zhang, J. Callan, and T. Minka. Novelty and Redundancy Detection in Adaptive Filtering. ACM SIGIR'02, 2002.
