The EigenRumor Algorithm for Ranking Blogs

Ko Fujimura, NTT Cyber Solutions Laboratories, NTT Corporation
Takafumi Inoue, NTT Cyber Solutions Laboratories, NTT Corporation
Masayuki Sugisaki, NTT Resonant Inc.

ABSTRACT
The advent of easy-to-use blogging tools is increasing the number of bloggers, leading to more diversity in the quality of the blogspace. Blog search technologies that help users find good blogs are thus becoming more and more important. This paper proposes a new algorithm called EigenRumor that scores each blog entry by weighting the hub and authority scores of the bloggers based on eigenvector calculations. This algorithm enables a higher score to be assigned to blog entries submitted by a good blogger but not yet linked to by any other blogs, based on acceptance of the blogger's prior work.

General Terms
Algorithms, Management, Experimentation

Keywords
Weblog, link-analysis, ranking, search engine.

1. INTRODUCTION
Many approaches to ranking Web pages have been proposed and studied [3]. PageRank [2] and HITS [7] are the most successful of these, and their effectiveness has been shown in both industry and the academic world. Of course, these techniques are also effective for ranking blogs. The simple adoption of these algorithms to blogs, however, raises the following issues:
(1) The number of links to a blog entry is generally very small. As a result, the scores of blog entries calculated by PageRank, for example, are generally too small to permit blog entries to be ranked by importance.
(2) Generally, some time is needed to develop a number of in-links and thus have a higher PageRank score. Since blogs are considered to be a communication tool for discussing new topics, it is desirable to assign a higher score to an entry submitted by a blogger who has received a lot of attention in the past, even if the entry itself has no in-links at first.
Considering these issues, this paper proposes a new link-analysis algorithm called EigenRumor.
The algorithm is designed for ranking information resources provided as blogs or in other cyberspace communities in which the identities of information providers are observable. Unlike generic web pages, a blog site is constructed from a set of blog entries written by a single blogger, and the quality of blog entries and topics is dominated by the ability or interests of the blogger. Using this structural characteristic of blogs, the EigenRumor algorithm rates a new blog entry, or other blog entries that have no in-links, according to the past behavior of the blogger.

Copyright is held by the author/owner(s). WWW 2005, May 10-14, 2005, Chiba, Japan.

In this paper, we define a blog (or blog site) from just the structure point of view, i.e., we do not concern ourselves with the contents of the blog. We assume that a blog has the following structure:
(a) A blog consists of a top page and a set of blog entries. A blog is generally updated and maintained by a single blogger.
(b) There are links from the top page of the blog to each blog entry, and each blog entry has a permanent URI.
(c) Blog entries are frequently added and the notification of updates is, as an option, sent to a ping server [11].
(d) A mechanism to construct a trackback [10] is provided.

The EigenRumor algorithm has similarities to PageRank [2] and HITS [7] in that all are based on eigenvector calculation of the adjacency matrix of the links. In the EigenRumor model, however, the adjacency matrix is constructed from agent-to-object links, not page-to-page (or object-to-object) links. Note that in this paper an agent represents an aspect of a human being, such as a blogger, and an object represents any object, such as a blog entry. Using the EigenRumor algorithm, the hub and authority scores are calculated as attributes of agents (bloggers), and by weighting these scores onto the blog entries submitted by the blogger, the attractiveness of a blog entry submitted by the blogger that does not yet have any in-links can be estimated.
This paper also reports on the experimental implementation of a blog search engine that returns search results sorted by the scores calculated by this algorithm, and evaluates the effectiveness of the ranking by submitting several queries. Our experience shows that links between blog entries are very sparse. Only 1.2% of blog entries have links to the blog entries of others. The aggregation on the agent (blogger) provided by the EigenRumor algorithm enables us to assign non-zero scores to about 9.3% of blog entries. This greatly improves the usability of blog searches.

In Section 2, we discuss the classification of blog rankings and clarify the target of this paper. In Section 3, we present the EigenRumor algorithm that calculates the hub and authority scores for agents and the reputation score of objects. In Section 4, we describe how to apply the EigenRumor algorithm to blog ranking. In particular, we describe the normalization strategy of links to reduce the effect of search engine optimization (SEO) and so get better ranking. In Section 5, we briefly present an implementation for blog search engines and the lessons learned from experiments with the system. Finally, we present related works and the conclusions in Sections 6 and 7, respectively.

2. BLOG RANKING
There are various types of ranks in so-called blog ranking technology. In this section, we classify them and clarify the target of this paper. Although this is not exhaustive, blog rankings are classified using the following axes:
(1) Subject of ranking
    (a) Blog entries
    (b) Bloggers
    (c) Articles referred to by blogs
    (d) Goods or services referred to by blogs
(2) Space of ranking
    (a) All blogs
    (b) Blogs that send notification of update to a specific ping server
    (c) Blogs in a specific provider
(3) Temporal space of ranking
    (a) All blogs
    (b) Specific period
    (c) Damping model
(4) Semantics of ranking
    (a) Strength of support from the community
    (b) Trustworthiness
    (c) Recency / freshness
    (d) Specific attribute, e.g., funniness or usefulness
(5) Source of evaluations collected
    (a) Hyperlink, e.g., trackbacks
    (b) Access, e.g., number of clicks
    (c) Collection of explicit votes
    (d) Natural language analysis

Regarding the subject of ranking (1), the target of this paper is both (a) and (b). We think that the ranking of goods or services referred to by blogs is important for marketing purposes. However, ranking bloggers and blog entries is more important because, if we have a reliable ranking of bloggers or blog entries, we can then easily and reliably rank goods or services by weighting the reliability of the bloggers or blog entries. This paper thus focuses on (a) and (b) as the first step.

Regarding the space of ranking (2), it is important from the viewpoints of business or implementation, but it has no theoretical impact, and we make no assumption regarding ranking space in this paper.

Regarding the temporal space of ranking (3), it is important to weight newer topics since blogs are usually used to find or discuss new topics. This paper thus presents a mechanism to support it.

Regarding the semantics of ranking (4), it depends on how the evaluations of blogs are collected, which is axis (5) above. At this moment, there is no mechanism to explicitly express the semantics and strength of support for the resources that a blog refers to. Technorati [12] introduced a new attribute tag called rel to specify the category of a link, but this is not widely used yet.
This paper thus collects evaluations of each blog entry by assuming that a link is an indication of interest in some aspect of the blog. Thus the semantics of ranking in this paper might be attractiveness rather than strength of support from the blog community.

3. THE ALGORITHM
The EigenRumor algorithm proposed here is a highly generic algorithm, applicable not only to blog communities but also to any other cyberspace community in which the identities of information providers (agents) are observable, in other words, communities in which membership registration is required. In this section, therefore, we describe the algorithm in an abstract manner and we use agent and object for blogger and blog entry, respectively.

3.1 Community model
We assume a universe of m agents and n information objects. When agent i provides (posts) object j, a provisioning link is established from i to j. We will use the provisioning matrix $P = [p_{ij}]$ ($i = 1 \ldots m$, $j = 1 \ldots n$) to represent all provisioning links in the universe. In this notation, $p_{ij} = 1$ if agent i provides object j and zero otherwise.

When agent i evaluates the usefulness of an existing object j with the scoring value $e_{ij}$, an evaluation link is established from i to j. We will use the evaluation matrix $E = [e_{ij}]$ ($i = 1 \ldots m$, $j = 1 \ldots n$) to represent all evaluation links in the universe (Figure 1). The evaluation link is assigned weight $e_{ij}$ based on the strength of the support given to object j. We assume $e_{ij}$ has the range [0,1] and higher values indicate stronger support. For simplicity, we do not consider negative values for $e_{ij}$. Note that scoring value $e_{ij}$ is not always given explicitly. It can be generated by a translation rule, e.g., $e_{ij} = 1$ when an article (object j) receives a comment from an agent i, $e_{ij} = 0$ otherwise. An example of a translation rule applied to blogspace is given in Section 4.

Figure 1. EigenRumor community model (agents $i = 1 \ldots m$ are connected to objects $j = 1 \ldots n$ by information provisioning links and by information evaluation links weighted by $e_{ij}$).

3.2 Scores
The EigenRumor algorithm scores agents in two aspects: information evaluation (hub score) and information provisioning (authority score).
These scores enable us to calculate the weighted score of an object. To implement this idea, two scores for each agent and one score for each object are introduced in the algorithm:

Authority score (agent property): This indicates to what level agent i provided objects in the past that followed the community direction. It is considered that the higher the score, the better the ability of the agent to provide objects to the community. We define $a$ as a vector that contains the authority scores $a_i$ for agent i ($i = 1 \ldots m$).

Hub score (agent property): This indicates to what level agent i submitted comments (evaluations) on other past objects that followed the community direction. It is considered that the higher the score, the better the ability of the agent to contribute evaluations to the community. We define $h$ as a vector that contains the hub scores $h_i$ for agent i ($i = 1 \ldots m$).

Reputation score (object property): This indicates the level of support object j received from the agents, i.e., the degree to which j follows the community direction. It is considered that the higher the score, the better the object conforms to the community direction. We define $r$ as a vector that contains the reputation scores $r_j$ ($j = 1 \ldots n$) for object j.

3.3 The EigenRumor Algorithm
The EigenRumor algorithm calculates three vectors, i.e., authority vector $a$, hub vector $h$, and reputation vector $r$, defined in Section 3.2, from information provisioning matrix $P$ and information evaluation matrix $E$, defined in Section 3.1. Based on the following assumptions, these score vectors are mutually influenced:

Assumption 1: The objects that are provided by a good authority will follow the direction of the community.
Assumption 2: The objects that are supported by a good hub will follow the direction of the community.
Assumption 3: The agents that provide objects that follow the community direction are good authorities of the community.
Assumption 4: The agents that evaluate objects that follow the community direction are good hubs of the community.

Corresponding to the above assumptions, the algorithm introduces four equations as follows:

$$ r = P^T a \quad (1) $$
$$ r = E^T h \quad (2) $$
$$ a = P r \quad (3) $$
$$ h = E r \quad (4) $$

In order to merge equations (1) and (2) above, we use the following convex combination:

$$ r = \alpha P^T a + (1 - \alpha) E^T h \quad (5) $$

where $\alpha$ is a constant in the range [0,1] that controls the relative weight of the authority score and the hub score. It is adjusted depending on the target community or application. Note that $\alpha$ can be assigned to each object separately and can be designed to decrease with the time since submission or with the number of evaluations submitted to object j. We now have three equations, (3), (4), and (5), that recursively define the three score vectors $a$, $h$, and $r$.
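As a rough illustration of equations (3), (4), and (5), the following sketch applies one round of updates to a toy community. The 2-agent, 3-object matrices, the starting scores, and the helper name `reputation` are made up for this example and do not come from the paper's data.

```python
import numpy as np


def reputation(P, E, a, h, alpha):
    """Equation (5): r = alpha * P^T a + (1 - alpha) * E^T h."""
    return alpha * (P.T @ a) + (1.0 - alpha) * (E.T @ h)


# Hypothetical toy community: 2 agents (rows), 3 objects (columns).
# Agent 0 provides objects 0 and 1; agent 1 provides object 2.
P = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
# Agent 0 evaluates object 2; agent 1 evaluates object 0.
# Nobody has evaluated object 1 yet.
E = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

a = np.array([2.0, 1.0])  # assumed authority scores (illustrative only)
h = np.array([1.0, 1.0])  # assumed hub scores (illustrative only)

r = reputation(P, E, a, h, alpha=0.5)  # equation (5)
a_next = P @ r                         # equation (3)
h_next = E @ r                         # equation (4)
```

In this toy case object 1 has no evaluation link at all, yet it receives a nonzero reputation score through the provisioning term, because its provider (agent 0) carries authority. This is the effect the convex combination is meant to capture.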
To find the equilibrium values for the score vectors, we substitute equations (3) and (4) into equation (5), and get

$$ r = \alpha P^T P r + (1 - \alpha) E^T E r = (\alpha P^T P + (1 - \alpha) E^T E) r = S r \quad (6) $$

where

$$ S = \alpha P^T P + (1 - \alpha) E^T E $$

If $S$ is a stochastic matrix, $r$ will converge to the principal eigenvector of $S$ simply by iterating procedure (6). Fortunately, the principal eigenvector of any non-negative matrix can be calculated by just adding a normalization step to each iteration. In other words, we can get the equilibrium value for $r$ such that the following equality is satisfied:

$$ S r = \lambda r \quad (7) $$

where $\lambda$ is the largest eigenvalue of matrix $S$. After getting $r$, we can also get $a$ and $h$ by equations (3) and (4). We can also get all of these scores simultaneously by the procedure shown in Figure 2.

    $a^{(0)} = (1, \ldots, 1)$
    $h^{(0)} = (1, \ldots, 1)$
    while $r$ changes significantly do
        $r^{(k)} = \alpha P^T a^{(k)} + (1 - \alpha) E^T h^{(k)}$
        $r^{(k+1)} = r^{(k)} / \| r^{(k)} \|_2$
        $a^{(k+1)} = P r^{(k+1)}$
        $h^{(k+1)} = E r^{(k+1)}$
    end while

Figure 2. The EigenRumor Algorithm. $\| \cdot \|_2$ is a function that computes the $L_2$ vector norm.

4. MAPPING TO BLOG COMMUNITY
There are several ways in which the EigenRumor community model described in Section 3.1 can be applied to the blog community. We applied the simplest mapping, shown in Figure 3, to the blog search system described in Section 5. As shown in this figure, the links from the top page of the blog site to the blog entries are considered to be information provisioning links, and links to blog entries in other blogs are considered to be information evaluation links. In this mapping, the scoring value $e_{ij}$ of each information evaluation link is 1 if there is a link and 0 otherwise, since no explicit scoring value is given. Note that this information evaluation link is actually an entry-to-entry hyperlink and no blogger-to-entry link exists. We use the translation rule to interpret actual entry-to-entry links as blogger-to-entry links. Note also that entry-to-entry hyperlinks are sometimes created by the trackback mechanism [10].
Our system deals with both normal hyperlinks and (forward) trackback links equally, since both are considered to be an indication of the interest of the blogger who cites the entry. In contrast, (backward) trackback links, i.e., those automatically generated by the trackback protocol, are not considered to be an indication of interest on the part of the blogger whose entries are referred to, and they are often generated by spamming. We accordingly ignore these links.
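The mapping above and the iteration of Figure 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the input format (a list of entry authors plus a list of entry-to-entry link pairs), the stopping tolerance, and the toy data are all hypothetical. The optional `sqrt_normalize` helper reflects one reading of the square-root normalization that this section goes on to describe, and it is likewise an assumption.

```python
import numpy as np


def build_matrices(entry_author, links, n_bloggers):
    """Build P and E (bloggers x entries) with the simple 0/1 mapping.

    entry_author[j] is the index of the blogger who wrote entry j.
    links holds (source_entry, target_entry) pairs for normal hyperlinks
    and forward trackbacks; backward trackbacks are assumed to have been
    filtered out already.
    """
    n_entries = len(entry_author)
    P = np.zeros((n_bloggers, n_entries))
    E = np.zeros((n_bloggers, n_entries))
    for entry, blogger in enumerate(entry_author):
        P[blogger, entry] = 1.0  # provisioning link: blogger posted entry
    for source, target in links:
        citing_blogger = entry_author[source]
        # Translation rule: an entry-to-entry link is read as a
        # blogger-to-entry link; only links into other blogs count.
        if citing_blogger != entry_author[target]:
            E[citing_blogger, target] = 1.0
    return P, E


def sqrt_normalize(M):
    """Divide each row by the square root of its number of nonzero links."""
    counts = np.count_nonzero(M, axis=1)
    scale = np.sqrt(np.maximum(counts, 1))  # rows without links stay zero
    return M / scale[:, None]


def eigenrumor(P, E, alpha=0.5, tol=1e-10, max_iter=1000):
    """Iterate the procedure of Figure 2 until r stops changing."""
    m, n = P.shape
    a = np.ones(m)
    h = np.ones(m)
    r = np.zeros(n)
    for _ in range(max_iter):
        r_new = alpha * (P.T @ a) + (1.0 - alpha) * (E.T @ h)
        norm = np.linalg.norm(r_new)  # L2 vector norm
        if norm == 0.0:
            return np.zeros(m), np.zeros(m), np.zeros(n)  # no links at all
        r_new = r_new / norm
        a = P @ r_new
        h = E @ r_new
        converged = np.linalg.norm(r_new - r) < tol
        r = r_new
        if converged:
            break
    return a, h, r


# Hypothetical data: blogger 0 wrote entries 0 and 1, blogger 1 wrote
# entry 2, blogger 2 wrote entry 3. Entries 2 and 3 both link to entry 0.
entry_author = [0, 0, 1, 2]
links = [(2, 0), (3, 0)]
P, E = build_matrices(entry_author, links, n_bloggers=3)
a, h, r = eigenrumor(P, E, alpha=0.5)
P_normalized = sqrt_normalize(P)
```

In this toy run, entry 1 has no in-links, but it still ends up with a nonzero reputation score because its author's other entry is cited by two bloggers. The tolerance-based stopping rule stands in for "while r changes significantly", which the paper does not quantify.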
Figure 3. Mapping to blog community (elements shown: blog entry, blogger, blog site).

Since the basic algorithm described in the previous section does not normalize information provisioning matrix $P$ or information evaluation matrix $E$, it is susceptible to spamming. If a user creates many blog accounts and interlinks them, he/she can inflate the scores. To reduce the effect of this attack, normalization of the matrices is important. PageRank [2] uses out-link normalization such that the total sum of out-links from one page is normalized to one. We have applied this method to the EigenRumor algorithm. It was found, however, that this approach does not work well for normalizing the links from agents (bloggers). Unlike web pages, the levels of activity of agents are quite diverse. Therefore, it is not fair to normalize the total sum of out-links from every agent equally to one. Our experiments show that some blogs with only a few blog entries can earn the same level of authority scores as blogs with a hundred entries when we apply this normalization. We also studied the behavior of scores in the case where no normalization is applied. In this case, as expected, scores were seriously impacted by spamming. The best normalization function we have found so far is to use the square root of the number of objects submitted or evaluated by the agent, i.e.,

$$ P = \left[ \frac{p_{ij}}{\sqrt{P_i}} \right] \quad (i = 1 \ldots m,\ j = 1 \ldots n) \quad (7) $$

$$ E = \left[ \frac{e_{ij}}{\sqrt{E_i}} \right] \quad (i = 1 \ldots m,\ j = 1 \ldots n) \quad (8) $$

where $P_i$ and $E_i$ are the total numbers of objects provided and evaluated by agent i, respectively.

Generally, blogger interest in a specific blog entry submitted or cited decreases day by day. To implement this effect, we introduce an optional longevity factor to information provisioning links and information evaluation links, and we use the following $P(t)$ and $E(t)$ instead of $P$ and $E$:

$$ P(t) = [p_{ij}(t)], \quad p_{ij}(t) = \frac{\rho^{\,t - \mathrm{time}(p_{ij})}\, p_{ij}}{\sqrt{\sum_{j = 1 \ldots n} \rho^{\,t - \mathrm{time}(p_{ij})}\, p_{ij}}} \quad (9) $$

$$ E(t) = [e_{ij}(t)], \quad e_{ij}(t) = \frac{\gamma^{\,t - \mathrm{time}(e_{ij})}\, e_{ij}}{\sqrt{\sum_{j = 1 \ldots n} \gamma^{\,t - \mathrm{time}(e_{ij})}\, e_{ij}}} \quad (10) $$

where t is the current time and time(x) is the time when link x was created. $\rho$ and $\gamma$ are damping factors with range [0,1].

5. EXPERIMENTS
We implemented a blog search system that receives one or more keywords from the user terminal and returns a list of blog entries with the blog name as the search result. In the database of the system, we stored about 9,280,000 entries from 305,000 blog sites collected by our crawler from October 6, 2004 to February 3, 2005. The collected data are mainly from 10 major blog providers in Japan and all of the entries are written in Japanese.

Of the 9,280,000 entries, 1,520,000 (16.3%) have one or more hyperlinks. Only 116,000 entries (1.25%) are linked to other blogs. Note that we determined whether a link points to a blog by checking whether the URI of the linked entry is also stored in the database. Therefore, the actual ratio of blog entries that are linked to other blogs is somewhat higher. Very few blog entries are referred to by other blogs, only 107,000 (1.15%). This means that only 1.15% of blog entries can be scored by PageRank if we use only this dataset. (The actual number is higher since there are some links from non-blog pages to the blogs in the database.) This ratio, 1.15%, seems too small to yield useful ranked search results.

The EigenRumor algorithm solves the above problem since it assigns hub and authority scores to bloggers and then propagates these scores to all entries submitted by the blogger. As a result, 36,200 bloggers (blog sites) have at least one blog entry linked to (or from) other blogs and 28,300 bloggers have nonzero authority scores. This is 9.28% of the 305,000 bloggers. These authority scores are propagated to their entries, so 862,000 (9.28%) of blog entries have nonzero reputation scores. This ratio is still small, but it is sufficient for ranking search results, since ranking matters most when the number of search results is large. Moreover, our observation shows that search engine users check only the top 20 search results.

We also investigated the effectiveness of the ranking by conducting a face-to-face user survey. We asked 40 guests who visited our exhibition, held in February 2005, to use our blog search system.
They were asked to compare the ranking quality with that of traditional blog ranking schemes, i.e., sorting by the number of in-links and TFIDF sorting [9]. The number of in-links is counted as the total number of links to all articles submitted by the agent. In this survey, each guest was asked to submit only one query, which could be freely selected. We only intervened when a guest submitted a query that had already been submitted. The blog search system showed the three rankings and the subjects were asked to indicate the best ranking. According to their replies, about 48% of queries showed no significant difference from the simple count of in-links. For 45% of the queries, the proposed scheme was superior, while for about 7.5% of queries it was inferior (Table 1).

Table 1. The summary of user survey

Best result   EigenRumor   In-link   TFIDF      Not determined
Queries       18 (45%)     2 (5%)    1 (2.5%)   19 (48%)

In this experiment, we also found that if the query was generic, such as "baseball", i.e., many search results are returned, there was no prominent difference between EigenRumor and In-link. However, in the case of more specific queries such as "baseball ichiro", EigenRumor generally provided the better ranking. This is considered to indicate the effect of the score aggregation on agents provided by the algorithm.

It was also observed that simple in-link rankings are more susceptible to spamming in which bloggers create several accounts and link them to each other to inflate the ratings. In fact, we often found such attacks in the rankings generated by the number of in-links. This type of attack is more prominent when we submit specific queries.

6. RELATED WORKS
Blog ranking is an important topic in web mining but it has still not been widely studied. Adar et al. [1] proposed a ranking concept called iRank, which assigns high ratings to the sites that contain original (source) information, whereas PageRank and EigenRumor assign high ratings to popular sites. In this sense, iRank and EigenRumor have different purposes. However, both approaches are similar in addressing the issue of the sparseness of the blogspace and the importance of the dynamic structure of links. (We introduced a link longevity factor in Section 4.)

Technorati [12] provided a commercial blog search, and some similarities with our system appear to exist. However, details of the ranking algorithm were not published. Access ranking is widely used in the blogspace, but it requires the bloggers or blog providers to participate in the ranking process and thus has a fundamental disadvantage in terms of limited coverage.

Apart from the area of the blogspace, the EigenRumor algorithm has a unique characteristic as a new link-analysis tool. Most link-analysis schemes proposed so far consider page-to-page links or agent-to-agent links [6].
In contrast, the EigenRumor algorithm analyzes agent-to-object links directly and dispenses with the need to collect agent-to-agent links. This widens the application field of link analysis.

The EigenRumor algorithm is based on eigenvector analysis, similar to PageRank [2] and HITS [7], but it manages scores for agents and objects separately, and reputation scores are introduced in addition to hub and authority scores, as illustrated in Figure 4. As a result, an object provided by an agent with a high authority score can be ranked highly from the time it is submitted. This is impossible with PageRank or HITS, which require many reviews before useful scores can be assigned. The normalization of links described in Section 4 is also a unique feature of the EigenRumor algorithm, since it allows the analysis of agent-to-object links, and the levels of activity of agents are quite different from those of static web pages.

The authors have presented some related ranking algorithms [4][5], but none of them are based on eigenvector calculation or address blogspace-specific issues.

7. CONCLUSION
In this paper, we presented a new algorithm for ranking blogs and showed its effectiveness by calculating the scores of 9,280,000 blog entries. The important feature of the algorithm is that it widens the coverage of blog entries that are assigned a score using only static link analysis. This feature is especially important for blog ranking since the link structure of the blogspace is sparser than that of the Web.

Figure 4. Comparison with PageRank and HITS Algorithms (the figure contrasts entities, link types, and scores: PageRank and HITS operate on Web pages, with authority scores $a$ for PageRank and hub and authority scores $h$, $a$ for HITS; EigenRumor operates on agents and objects with evaluation ($E$) and provisioning ($P$) links, authority ($a$) and hub ($h$) scores for agents, and reputation ($r$) scores for objects).
This approach also enables a higher score to be assigned when the blog entry is submitted by a blogger who has received a lot of attention in the past, even if the entry itself has no in-links at first. This is a desirable feature of blog rankings since the blogspace is considered to be a community for discussing new topics.

Future work includes a new user interface or visualization of search results that takes advantage of the three scores calculated by the algorithm, i.e., authority, hub, and reputation scores. More detailed analysis of the algorithm's durability against spamming is also important future work.

8. ACKNOWLEDGEMENTS
We would like to thank Naoto Tanimoto, Yoshinobu Tonomura, and Masahiro Oku for helpful discussions and comments.

9. REFERENCES
[1] E. Adar, L. Zhang, L. Adamic, and R. Lukose, Implicit Structure and the Dynamics of Blogspace, In Proceedings of the Workshop on the Weblogging Ecosystem at the 13th International World Wide Web Conference, 2004.
[2] S. Brin and L. Page, The Anatomy of a Large-scale Hypertextual Web Search Engine, In Proceedings of 7th International World Wide Web Conference, 1998.
[3] S. Chakrabarti, Mining the Web, Morgan Kaufmann Publishers, 2003.
[4] K. Fujimura and T. Nishihara, Reputation Rating System based on Past Behavior of Evaluators, In Proceedings of the 4th ACM Conference on Electronic Commerce, 2003.
[5] K. Fujimura, N. Tanimoto, and M. Iguchi, Calculating Contribution in Cyberspace Community Using Reputation System "RuMoR", In Proceedings of the AAMAS Workshop on Trust in Cyber-societies, July 2004.
[6] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina, The EigenTrust Algorithm for Reputation Management in P2P Networks, In Proceedings of 12th International World Wide Web Conference, 2003.
[7] J. M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM, Vol. 46, No. 5, 1999.
[8] D. Libby, RDF Site Summary (RSS) 0.9 official DTD, http://my.netscape.com/publish/formats/rss-0.9.dtd, 1999.
[9] C. D. Manning and H. Schutze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, 1999.
[10] B. and M. Trott, TrackBack Technical Specification, http://www.sixapart.com/movabletype/docs/mttrackback, 2002.
[11] D. Winer, Blog.Com XML-RPC interface, http://www.xmlrpc.com/weblogscom, 2001.
[12] Technorati, Inc. www.technorati.com.