Visually Summarizing he Web using Inernal Images and Keyphrases M.V.Gedam, S. A. Taale Deparmen of compuer engineering, PUNE Universiy Vidya Praishhan s College of Engg., India Absrac Visual summarizaion of web pages helps o achieve a friendlier user experience in search and refinding ass. Wih he use of his mehod users quicly ge an idea of wha he web page is all abou and help in recalling he visied web page. The inernal images of web page are usually used o describe he conen of a webpage. Inernal images in web pages are generally worhy for his purpose. Neverheless, in many web pages dominan inernal images are unavailable. So, we move owards a new approach i.e. exernal image summarizaion o summarize web pages wihou any dominan inernal images. Bu, exernal images are someimes unreliable because of he way hey are rerieved from inerne. By aing advanage of inernal and exernal images we propose a scheme for visual summarizaion. To improve he reliabiliy of exernal images, relevan exernal images are rerieved from he Inerne by considering exual and visual imporance beween arge webpage and hosing webpage. Relevance, Dominance and ypicaliy of he image wih respec o web page are he feaures of images. We propose an Affiniy Propagaion Clusering based algorihm o selec he bes of inernal and exernal images as summarizaion. Index Terms Affiniy propagaion, Inernal images, exernal image, eyphrases, visual ran, visual summarizaion I. INTRODUCTION When The Web has become he larges informaion reposiory over he world, hus effecively and efficienly searching he informaion becomes provides a means for fas and effecive Web browsing and informaion rerieval. The goal of summarizaion is o produce coheren summaries. This allows for faser and beer undersanding of he conens of a Websie wihou firs browsing is conen A search engine resuls normally include a lis of iems wih iles, a reference o he full version, and a shor descripion showing where he eywords have mached conen wihin he page. A search engine resul page may refer o a single page of lins reurned, or o he se of all lins reurned for a search query. Nearly all search engines provide search resuls using summarizaion consiss of page ile, URL and a shor exual snippe. On he WWW users frequenly revisi informaion ha everyone have previously seen, bu eeping found hings found is difficul when he informaion has no been visied frequenly or recenly. Generally webuser loo a he boomars, hisory in heir browser o find ou WebPages ha everybody visied before. Research has shown ha visual images are much easier and beer o remember and undersand when compare wih words. Almos all applicaions concerned o summarizaion of webpage should become more efficien if he web pages are visually summarized. The visual summarizaion represens a webpage using visual informaion, such as images and humbnail. Google inegraes he Sies wih images feaure in web search sysem ha summarizes he webpage using he images in he page. This feaure provides several images seleced from webpage above he exual snippe, from hose web users may quicly idenify he required webpage. Mos applicaion (Search engines, browser, ec) uses Thumbnail and inernal images based approach for visual summarizaion of webpage. Bu everyone face problem if WebPages having rich conens and wihou dominan inernal images. These problems are solved by reriving relevan exernal image from he inerne. On oher hand, he exernal images are ofen unreliable o represen he arge webpage because of he imperfecion of he way in which hey are rerieved. Bu boh he approaches inernal and exernal image based summarizaion has heir respecive advanages which may complemen each oher. So, by aing ino consideraion boh he sources of images, we proposed a clusering based scheme Affiniy propagaion o selec bes visual summary for webpage. Secion II briefly describes he exising sysem relaed o visual summarizaion. Secion III gives implemenaion deails of proposed approach o generae visual summary. Nex secion gives he deails abou daase used for his sysem. A las, we conclude his paper and fuure wor II. LITERATURE REVIEW People regularly inerac wih differen represenaions of Web pages. While considering he visual represenaion of webpage, mosly Thumbnail and Inernal image based visual summarizaion are used in exising producs and research communiies. a) Thumbnail based visual summarizaion Curren commercial web browsers such as Mozilla Firefox and Microsof Inerne Explorer provide a wide range of uiliies, such as hisory liss and boomars, which suppor revisiing previously seen pages on he web. Ye previous research indicaes ha hese uiliies are largely unused. WebView; a prooype designed o improve he efficiency and usabiliy of page revisiaion. I does his by paying paricular aenion o how previous pages are inegraing many revisiaion capabiliies ino a single display space. Bu his sysem did no evaluae implici boomars, dogears, and he hub and spoe view, complex revisiaion sequences clearly. Thumbnail is a reduced snap of a webpage depiced in mos browsers. For example, Firefox FasDial visualizes he boomars as humbnail. Viewzi presens search resuls using ex snippe and humbnail www.ijcsi.com 4304
of webpage for web search. Users found problem in reading ex conen from small snapsho of webpage which have rich conens. To overcome his problem in search as Dziadosz and Chandrasear [2] sugges ha in he search engine humbnails of web pages should appear wih ex snippes.woodruff e al. design exually enhanced humbnails, which enhance he legibiliy of he ex conens hrough highlighing some eywords in he scaled-down snapsho. In his way, he problem of reading ex informaion is removed, bu he ime of loading he snapsho increases dramaically. Erol e al. [3] propose a mulimedia humbnail, one of he inds of documen represenaion which is a pan-and-zoom movie railer for he documen. In heir sysem, he salien documen informaion (visual and audible) is exraced and synhesized ino a playable humbnail. Mulimedia humbnail address he problem of represening mulipage and high resoluion documens on small form facor devices. User needs assisance in navigaion, bu i provides hrough sric formaing, no via lins. b) Inernal image based visual summarizaion Inernal images are he bes represenaive of is webpage; hence commonly use o visually summarize he webpage. To improve he relevance judgmen of web search resul, Li. e al. [4] propose an inernal image based sysem which exracs he dominan images from he web page as image excerp. A ex snippe is generaed by selecing eywords around he mached query erms for each reurned page. This sysem auomaically generaes image excerps by considering he dominance of each picure in each web page and he relevance of he picure o he query. Anoher approach Visual Snippe [5], which is an image generaed by composing a dominan image from he web page, salien ex (e. g., ile), and a waermared logo from he web page, is an exension of image excerps. In mos indusrial producs, such as Google s Sies Wih images, adop hese inernal image based summarizaions approach. However, he problems in he applicabiliy of such approaches are for a large amoun of web pages, dominan inernal images are unavailable. Fig.. Sysem Archiecure III. SYSTEM OVERVIEW Inernal and exernal images are considered as wo useful sources for image based summarizaion. Since hey have respecive advanages and may complemen each oher, we joinly oo ino consideraion hese wo sources of images for summarizing web pages. Sysem archiecure for he sysem as shown in fig. Projec is divided ino hree modules which are as follows a. Dominan Inernal image exracion The conen of he page; such images are called as inernal images. These images can be direcly used o summarize he webpage. However, a ypical web page may have a lo of images, mos of which may be decoraion, adverisemen or logo. Hence we use an algorihm o deec he dominan ones among all of he images in he webpage for visual summarizaion. Here, we adop he learning-based algorihm proposed in [6] for his as, which is described as shown in fig.2 Fig. 2. Dominan Inernal image exracion Feaures are exraced for each image from hree levels, including he image iself, he hosing web page and he hosing web sie. These feaures are summarized in Table TABLE FEATURES FOR DOMINANT INTERNAL IMAGE SN Feaure Kind Feaure Name Image Level size, widh/heigh raio, blurriness,conras,colorfulness 2 Page Level relaive posiion,relauve size, relaive widh/heigh raio 3 Sie Level IsInnerSie Afer exracing feaures from hree level of an image, dominance model can be learned from some labeled raining samples, which are represened as (x i, ; y i, ) where x i, is he exraced feaure vecor of he image i in he page and y i, is is labeled dominance, namely, 0 (non-dominan) and (dominan). Linear Raning Suppor vecor machine is hen adoped o rain he raning model for dominance deecion. And finally dominan images from webpage are separae ou. b. Exernal image rerieval For a large amoun of web pages, here are no dominan images. So, o generae a meaningful visual summarizaion for such web pages we have o rerieve images which can visually represen he web page from he Inerne. The idea we design is saed as follows. Firs, he ey phrases, which can represen he salien opics in he web page, are exraced. These ey phrases are used as queries o search images relevan o he opics of he web page in an image search engine. Finally he resul images are raned based on he exual similariies beween he web pages which hos hese images and he arge web page. Nonrelevan images www.ijcsi.com 4305
are furher filered ou based on he visual similariies among he resul images. The op raned image is adoped as he mos relevan exernal image for given webpage. Fig. 3 illusraes he wor flow of his an approach. ) Keyphrase exracion Keyphrases provide semanic meadaa ha characerize and Summarize documen. We use hese phrases o faciliae Web users grasping he main opic(s) of a Web page. For exracion of Keyphrases from webpage. We use KEX algorihm [7]. 2) Raning and Filering Afer exracing Keyphrases from webpage, hese are given o he image search engine. We use an algorihm o re-ran and filer images from he search resul. In our algorihm, we firs ran resul images based on he exual similariies beween he arge web page and he web pages which hos he images. Then we propose an algorihm o filer ou visually unimporan images. Texual similariy: Using cosine similariy based on vecor space model (VSM) we calculae he exual similariy beween wo WebPages. A web page is represened as a vecor where each componen is a Term Frequency score of a paricular erm in he page. And Cosine similariy [8] is adoped o measure he exual similariy beween each resul web page and he arge web page. The cosine similariy funcion can be formulaed as: Similariy T, H ) = w. w h = 2 ( w ). ( = = ( () Where, w =Weigh of erm in Targe webpage w =Weigh of erm in Hosing webpage h Fig. 3. Exernal image rerieval w h ) 2 Visual Similariy: Visual imporance of he resul image is calculaed using VisualRan [9]. For each image, SIFT feaures are exraced. The visual similariy beween wo images is defined as he number of similar SIFT feaures divided by he average number of SIFT feaures in he wo images. Then a graph is consruced wih images as verices. Each pair of images is conneced wih an edge while he weigh on he edge is defined as he visual similariy. No. ofsimilarsiftfeaure VisualSimilariy( u, v) = (2) Avg. no. ofsiftfeaure Page Ran is applied on he graph o calculae he imporance score for each image. The images whose visual imporance score is below a hreshold will be filered ou. Afer raning and filering, he op raned image is adoped as relevan exernal image used for summarizaion. c. Affiniy Propagaion Clusering The exising sysem provides visual summary of webpage are focused on eiher inernal image based summarizaion or exernal image based summarizaion. Each of hese approaches has heir respecive advanages and disadvanages. While focusing on inernal image based summarizaion, hough inernal images are bes o represen is hosing webpage, abou 40 percen of webpage doesn have any dominan inernal image. In case of exernal image based summarizaion, mos images ha were exracing from inerne are irrelevan because of imperfecion in ways hey are rerieved. So, we are considering boh he ind of images joinly. We apply Affiniy propagaion clusering [0] on exernal and inernal images ha we go from above wo modules. ) Feaures of good image summarizaion An effecive image based web page summarizaion should saisfy he following hree properies in order o alleviae user in search and refinding ass. Relevance - This propery reflecs wheher he image is relevan o he opic of inpu webpage. We are considering boh inernal and exernal images for visual summarizaion, so wih consideraion of his propery we are able o selec he summary represening he opics in he webpage. In our projec his propery is incorporaed by compuing exual similariy beween arge webpage and he webpage who hos he candidae summary images. Dominance- A webpage consiss of los of images represening differen opics among hem some are dominan one while ohers are nondominan. We have o selec image from he candidae summary images which reflecs he dominan opic of webpage. So, by seeing a summary, users quicly ge idea abou subjec of webpage. In he firs module we generae model o selec he dominan inernal image from webpage which describe he dominan opic of arge webpage. The sudies [7] shows ha he dominan images specifically opical dominan images are ofen observed in aracive areas of webpage wih high qualiy and large size compared wih oher images in webpage. Here, we use mehod in firs module o www.ijcsi.com 4306
compue dominance of an image. Typicaliy- The above wo properies relevance and dominance depic he relaionship beween a webpage and an image, he ypicaliy describe relaionship beween image and muliple aspecs of a opic in webpage. According o he heory of graded srucure, people ofen evaluae some images imiae a paricular opic han ohers. Generally ypicaliy can be measured by some cenral endency since all images of same opic follow same graded srucure. Cenral endency normally relaed o he way in which daa incline o cluser around some value. In our projec we are measuring ypicaliy using a clusering mehod. To inegrae all hree properies relevance, dominance, ypicaliy ino an inegraed framewor, we adop affiniy propagaion clusering (AP). This clusering mehod used in our projec has several benefis. I is very hard o predefine number of clusers in summarizaion problem, which is no required in case of AP clusering. Also his clusering mehod easily considers a relevance and dominance propery o indicae probabiliy of an image o be an exemplar. In he clusering, firs compue relevance and dominance for each image hen we adop AP o generae exemplar which are served as summarizaion candidae. Finally, visual summarizaion seleced from exemplars having highes score. The aim of AP is o idenify Small no. of images ha accuraely represen se of images. Given a se of images I = { I } N i i= and he similariy marix S = { s } N i, Where s i, = i, is he similariy beween image I i and I, he AP algorihm ries o cluser I ino P (P < N) groups, each represened by an exemplar. The generaed clusers are denoed as I { } M e = I ei i= and e ( I i ) represens he exemplar of image I i. Iniially all images as poenial exemplars. AP clusers images according o wo inds of messages propagaed beween images:. The responsibiliy r i. sen from image i o, reflecing how well-suied is o serve as he exemplar for i. 2. The availabiliy a i, sen from image o i, reflecing how appropriae i would be for image i o choose image as is exemplar, wih considering suppor from oher images ha should be an exemplar. Afer messages of all images are updaed, image I which maximizes r i, +a i, is seleced as e(i i ).The clusering procedure is erminaed afer a fixed number of ieraions or afer I e says consan for some number of ieraions. The algorihm is shown as follows: Algorihm: Affiniy Propagaion Inpu: I, S Oupu: I e (0) (0) Iniializaion: a 0, r 0 i = i, = For = o a fixed no. of ieraion do Updae and Damp all responsibiliy: ( ) r = s { a + s } i i, max ' i, i ' ( ) ri = λri + ( λ) ri Updae and Damp all availabiliies : a min 0, r + max{ 0 r } =, i i { i, } i, a = max { 0, ri } a Selec exemplars: e I i ) = I r i ( ) i = λai, + ( λ) a i, (, where r= arg max { + a } Add I r ino exemplar se I e if r i i r = i If Ie say consan for some no. of ieraion hen Reurn Ie End if End for Affiniy propagaion aes as inpu a collecion of real valued similariies beween images, where he similariy s(i, indicaes how well he images wih index is suied o be he exemplar for image i. Each similariy (3) is compued using a linear model of exual similariy and visual similariy. v sim( i, = βsim ( i, + ( β ) sim ( i, (3) The exual similariy beween images sim is evaluaed using he cosine similariy over he TF vecors of heir hosing v webpage. And Visual Similariy sim is evaluaed as: sim v ( i, = dis( i, / dismax (4) Where Euclidean disance dis is calculaed in feaure space as(5) m 28 2 dis( i, = min ( s jq fq ), (5) m f n j= q= Where, m is he number of SIFT feaure in image I, n is he number of SIFT feaure of image s jq is he q h elemen of he j h feaure vecor of image I and fq is he q h elemen of f h feaure vecor of image. The responsibiliies are evaluaed using he rule(6) ( ) r i = si, max ' { ai, + si ' } (6) In he firs ieraion, as availabiliies are se o zero iniially, r(i, is se o he inpu similariy beween image i and image minus he larges of he similariies beween image i and oher candidae exemplars. In laer ieraions, when some images are effecively assigned o oher exemplars, heir availabiliies for oher images will goes negaive as prescribed by he availabiliy updae rule(7) : www.ijcsi.com 4307
{ 0 r } ai = min0, r + max, i i { i, } (7) For image i, he value ha maximizes a(i, + r(i, eiher idenifies image i as an exemplar if = i, or idenifies he image ha is he exemplar for image i. The message passing procedure may be erminaed afer a fixed number of ieraions, afer changes in he messages fall below a hreshold, or afer he local decisions say consan for some number of ieraions. When updaing he messages, i is imporan ha hey be damped o avoid numerical oscillaions ha arise in some circumsances. Each message is se o λ imes is value from he previous ieraion plus ( - λ ) imes is prescribed updaed value, where he damping facor λ is beween 0 and. Here, we used a defaul damping facor of λ = 0.5. In he dominan image deecion, we develop a model o deec he dominan image on webpage and dominance score are mapped ino [0,]. In exernal image exracion, we found ha he relevance beween an image and a webpage can be evaluaed by cosine similariy over erm frequency vecor of arge webpage and hosing webpage. Afer performing all ieraion of AP algorihm, we have o selec visual summary from candidae exemplar. All candidae exemplar represen muliple opics in webpage bu we have o choose he one as summary which represen he dominan opic of webpage. The selecion of summary is based on wo evaluaion crieria. Firs one is number of images in a cluser, if more images are presen in a cluser means ha more images are similar o candidae exemplar represens is imporance. Second is preference, o leverage he relevance and dominance, we also consider preference as an evaluaion crierion. 2) Visual Summary Selecion: In our sysem, he preference for image I i is calculaed as a combinaion of he relevance score r( Ii, TW ) and he dominance score d ( I i ) : pref ( Ii ) = [ αr( Ii, TW ) + ( α) d( Ii )] simmed Where, pref i =preference of exemplar of i h cluser prop i =proporion of images in i h cluser A model is developed based on hese evaluaion crieria o compue score for each exemplar o selec he summary he exemplar wih he larges score is seleced as he image based summarizaion i score = γprefi + ( γ ) propi Where, TW= Targe Webpage = median of he similariies in he sim med similariy marix α =parameer o rade off he conribuion from Relevance and dominance IV RESULT AND ANALYSIS If an inpu given o our sysem any webpage suppose hp://www.live-conferencing.com/aricles/sony-equipmen-forvideo-conferencing.hml. The resul obained are shown in above figure. This page is abou sales of video conferencing equipmen. The oupu we ge from our sysem is mos appropriae o describe he conens of webpage. The exising produc or sysem regarding visual summarizaion of webpages consider only eiher of inernal image or exernal image or someimes humbnails. In our sysem we aing ino consideraion all ypes of webpages and all summarizaion crieria. Thus obain a beer resuls. V CONCLUSION AND FUTURE WORK The various inds of visual summarizaions based on humbnails, inernal images, visual snippes and exernal image are presen in exising produc. Since heir resul can complemen each oher in erms of availabiliy and reliabiliy, we propose a clusering based scheme o joinly selec a summary which bes presens hree properies (relevance, dominance and ypicaliy). The exising approaches considered eiher inernal image or exernal image. So, he resuls could complemens as conens of webpages changes. Thus, he proposed sysem considered boh he ind of images joinly o selec bes summary for any ind of webpages. ACKNOWLEDGMENT This is o acnowledge and han all he individuals who played defining role in shaping his journal. I avail his opporuniy o express my deep sense of graiude and whole heared hans o my guide Prof. S. A. Taale madam for giving her valuable guidance, inspiraion and encouragemen o embar his paper. www.ijcsi.com 4308
REFERENCES [] Binxing Jiao, Linjun Yang, Jizheng Xu, Qi Tian Visually Summarizing Webpages Through Inernal and Exernal Images, IEEE Trans. Mulimedia., vol. 4, no. 6, pp. 673-683, Dec. 202 [2] S. Dziadosz and R. Chandrasear, Do humbnail previews help users mae beer relevance decisions abou web search resuls?, in Proc. SIGIR 02, New Yor, 2002, pp. 365 366. [3] B. Erol, K. Berner, and S. Joshi, Mulimedia humbnails for documens in Proc. 4h Annual ACM In. Conf. Mulimedia,New Yor, 2006, pp. 23 240. [4] Z. Li and L. Zhang, Improving relevance judgmen of web search resuls wih image excerps, in Proc. 7h In. World Wide Web Conf. (WWW2008), Apr. 2008, pp. 2 30. [5] J. Teevan, E. Curell, D. Fisher, S. M. Drucer, G. Ramos, P. André, and C. Hu, Visual snippes: Summarizing web pages for search and revisiaion, in Proc. CHI 09, New Yor, 2009, pp. 2023 2032. [6] Q. Yu, S. Shi, Z. Li, J.-R. Wen, and W.-Y. Ma, Improve raning by using image informaion, in Proc. ECIR 2007, 2007, pp. 645 652. [7] M. Chen, J.-T. Sun, H.-J. Zeng, and K.-Y. Lam, A pracical sysem of eyphrase exracion for web pages, in Proc. CIKM 05, New Yor, 2005, pp. 277 278 [8] G. Salon and C. Bucley, Term-weighing approaches in auomaic ex rerieval, Inf. Process. Manage., vol. 24, no. 5, pp. 53 523, 988. [9] Y. Jing and S. Baluja, Visualran: Applying pageran o large-scale image search, IEEE Trans. Paern Anal. Mach. Inell., vol. 30, no., pp. 877 890, Nov. 2008. [0] B. J. Frey and D.Duec, Clusering by passing messages beween daa poins, Science, vol. 35, pp. 972 976, 2007. www.ijcsi.com 4309