Searching and Ranking

1 Searching and Ranking Michal Cap May 14, 2008

2 Introduction Outline
Introduction: Search Engines
1 Crawling: Crawler; Creating the Index
2 Searching: Querying
3 Ranking: Content-based Ranking; Inbound Links; PageRank; Using Link Text; Combining all the Techniques
4 Learning from Clicks: Neural Network; Implementing the Neural Network; Training the Neural Network

3 Introduction Search Engines Full-Text Search Engines
- Allow people to search a large set of documents for a list of words
- Modern ranking algorithms are among the most widely used collective intelligence algorithms
- Google's success is based on PageRank, an example of a collective intelligence algorithm

4 Introduction Search Engines History of Searching on the Internet
- 1990: Archie (indexing FTP directory listings)
- 1993: Wandex (first Web search engine)
- 1994: WebCrawler, Lycos
- 1995: Altavista, Yahoo!
- 1998: Google

5 Introduction Search Engines [Figure: Google homepage, 1998]

9 Introduction Search Engines Architecture of a Search Engine
- Crawler: collects data
- Database: stores indexed data
- Searcher: returns a list of documents for a given query
- Ranking algorithm: ensures that the most relevant results are returned first

10 Crawling Crawler What is a Crawler
- A robot that wanders through webpages to index their contents
- Indexed data is stored in a database
- No need to store the entire contents of each webpage
- May operate on the Internet or on a corporate intranet

11 Crawling Crawler Programming a Simple Crawler in Python

    class crawler:
        # Auxiliary function for getting an entry id and adding it if it's not present
        def getentryid(self,table,field,value,createnew=True):
            pass

        # Index an individual page
        def addtoindex(self,url,soup):
            pass

        # Extract the text from an HTML page (no tags)
        def gettextonly(self,soup):
            pass

        # Separate the words at any non-word character
        def separatewords(self,text):
            pass

        # Return True if this url is already indexed
        def isindexed(self,url):
            pass

        # Add a link between two pages
        def addlinkref(self,urlfrom,urlto,linktext):
            pass

        # Starting with a list of pages, do a breadth-first search to the given depth
        def crawl(self,pages,depth=2):
            pass

12 Crawling Crawler Parsing the Webpage, urllib2
Our parser uses urllib2 to fetch the contents of a web page over HTTP:

    >>> import urllib2
    >>> c=urllib2.urlopen( )
    >>> contents=c.read()
    >>> print contents[0:250]
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN""
    <html lang="en">
    <head>
    <title>cnn.com - Breaking News, U.S., World, Weather, Entertainment & Video News</title>
    <meta http-equiv="refresh" conte
    >>>

13 Crawling Crawler Parsing the Webpage, BeautifulSoup
Beautiful Soup is a library that builds a structured representation of an HTML document. It can give us all outbound links from the current page, which the crawler then follows.

    >>> from BeautifulSoup import *
    >>> c=urllib2.urlopen( )
    >>> soup = BeautifulSoup(c.read())
    >>> for link in soup('a'):
    ...     print dict(link.attrs)['href']
    >>>
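
The link-extraction step can also be tried without a network connection or third-party packages. The following is a minimal Python 3 sketch that swaps Beautiful Soup for the standard library's html.parser; the class name LinkCollector, the helper extract_links, and the example.com URLs are assumptions made for illustration and are not part of the slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop the #fragment part.
            self.links.append(urljoin(self.base_url, href).split("#")[0])


def extract_links(html, base_url):
    """Return the outbound links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used instead of a live download.
sample = (
    '<p><a href="/about.html#team">About</a> '
    '<a href="http://example.org/">External</a> '
    '<a name="x">anchor without href</a></p>'
)
print(extract_links(sample, "http://example.com/index.html"))
```

Anchors without an href are skipped, which mirrors the check the crawler performs before following a link.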

16 Crawling Crawler Parsing the Webpage, Finding the Words on Page
We have to break the webpage into separate words:
- Use Beautiful Soup to search for text nodes and collect them
- This gives a plain-text representation of the webpage
- Split the text representation into a list of separate words

17 Crawling Crawler Parsing the Webpage, gettextonly and separatewords

    # Extract the text from an HTML page (no tags)
    def gettextonly(self,soup):
        v=soup.string
        if v==None:
            # No single string here: recurse into the child nodes
            c=soup.contents
            resulttext=''
            for t in c:
                subtext=self.gettextonly(t)
                resulttext+=subtext+'\n'
            return resulttext
        else:
            return v.strip()

    # Separate the words at any non-word character (requires "import re")
    def separatewords(self,text):
        splitter=re.compile('\\W*')
        return [s.lower() for s in splitter.split(text) if s!='']
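
The word-splitting step is easy to check in isolation. Here is a small Python 3 sketch of the same idea; it uses the pattern \W+ rather than \W*, because recent Python 3 versions also split on empty matches, and the function name separate_words is an assumption for this example.

```python
import re

# One or more non-word characters act as a separator.
_SPLITTER = re.compile(r"\W+")


def separate_words(text):
    """Split text into lowercase words, dropping empty fragments."""
    return [word.lower() for word in _SPLITTER.split(text) if word]


print(separate_words("Crawling, Indexing & Ranking: 2008!"))
```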

18 Crawling Crawler Stemming
Another method for obtaining separate words:
- Converts words into their stems
- For example, "indexing" becomes "index"
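
To make the idea concrete, here is a deliberately naive suffix-stripping sketch in Python 3. It is not the stemmer used by any real search engine (a production system would more likely use an established algorithm such as Porter's); the suffix list and the minimum stem length of three letters are assumptions chosen only for illustration.

```python
# Assumed suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of the stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("Indexing", "crawled", "pages", "is"):
    print(w, "->", naive_stem(w))
```

With stemming applied both when indexing and when querying, a search for "index" would also match pages that only contain "indexing".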

19 Crawling Crawler Parsing the Webpage, addtoindex method

    # Index an individual page
    def addtoindex(self,url,soup):
        if self.isindexed(url): return
        print 'Indexing '+url

        # Get the individual words
        text=self.gettextonly(soup)
        words=self.separatewords(text)

        # Get the URL id
        urlid=self.getentryid('urllist','url',url)

        # Link each word to this url
        for i in range(len(words)):
            word=words[i]
            if word in ignorewords: continue
            wordid=self.getentryid('wordlist','word',word)
            self.con.execute("insert into wordlocation(urlid,wordid,location) values (%d,%d,%d)" % (urlid,wordid
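
The indexing step can be sketched end to end against an in-memory SQLite database. The Python 3 example below is a sketch under stated assumptions: the three table names follow the slides, but the stop-word list, the helper names, and the choice of the word's position in the list as its location are assumptions, and parameterized queries replace the string formatting used in the slides.

```python
import sqlite3

# Assumed stop-word list; the slides refer to ignorewords without listing it.
IGNORE_WORDS = {"the", "of", "to", "and", "a", "in", "is", "it"}


def create_tables(con):
    con.execute("create table urllist(url)")
    con.execute("create table wordlist(word)")
    con.execute("create table wordlocation(urlid, wordid, location)")


def get_entry_id(con, table, field, value):
    """Return the rowid of value in table, inserting it first if absent."""
    # table and field come from our own code, never from user input.
    row = con.execute(
        "select rowid from %s where %s=?" % (table, field), (value,)
    ).fetchone()
    if row is not None:
        return row[0]
    return con.execute(
        "insert into %s(%s) values (?)" % (table, field), (value,)
    ).lastrowid


def add_to_index(con, url, words):
    """Record every non-ignored word of a page together with its position."""
    urlid = get_entry_id(con, "urllist", "url", url)
    for location, word in enumerate(words):
        if word in IGNORE_WORDS:
            continue
        wordid = get_entry_id(con, "wordlist", "word", word)
        con.execute(
            "insert into wordlocation(urlid, wordid, location) values (?,?,?)",
            (urlid, wordid, location),
        )


con = sqlite3.connect(":memory:")
create_tables(con)
add_to_index(con, "http://example.com/a", ["a", "recursive", "function", "is", "recursive"])
rows = con.execute("select urlid, wordid, location from wordlocation").fetchall()
print(rows)
```

Storing only (urlid, wordid, location) triples is what lets the crawler avoid keeping the full page contents.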

20 Crawling Crawler Parsing the Webpage, crawl method

    def crawl(self,pages,depth=2):
        for i in range(depth):
            newpages={}
            for page in pages:
                try:
                    c=urllib2.urlopen(page)
                except:
                    print "Could not open %s" % page
                    continue
                try:
                    soup=BeautifulSoup(c.read())
                    self.addtoindex(page,soup)

                    links=soup('a')
                    for link in links:
                        if ('href' in dict(link.attrs)):
                            url=urljoin(page,link['href'])
                            if url.find("'")!=-1: continue
                            url=url.split('#')[0]  # remove location portion
                            if url[0:4]=='http' and not self.isindexed(url):
                                newpages[url]=1
                            linktext=self.gettextonly(link)
                            self.addlinkref(page,url,linktext)

                    self.dbcommit()
                except:
                    print "Could not parse page %s" % page

            # The pages found in this round are crawled in the next one
            pages=newpages
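
The breadth-first structure of the crawl is easier to see without the download and parsing code. The Python 3 sketch below runs the same loop over a hypothetical in-memory link graph; the SITE dictionary and the get_links callback are assumptions standing in for real pages.

```python
def crawl(start_pages, get_links, depth=2):
    """Breadth-first crawl; returns pages in the order they were indexed."""
    indexed = []
    seen = set()
    pages = list(start_pages)
    for _ in range(depth):
        new_pages = []
        for page in pages:
            if page in seen:
                continue
            seen.add(page)
            indexed.append(page)  # stands in for addtoindex
            for url in get_links(page):
                if url not in seen and url not in new_pages:
                    new_pages.append(url)
        # Links discovered in this round form the next round's page list.
        pages = new_pages
    return indexed


# Hypothetical site: page -> outbound links.
SITE = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "D"],
    "D": ["E"],
    "E": [],
}

print(crawl(["A"], lambda page: SITE.get(page, []), depth=2))
```

Each pass of the outer loop moves one link further from the starting pages, so depth bounds how far the crawler wanders.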

21 Crawling Crawler Running the Crawler

    >> import searchengine
    >> pagelist=[ ]
    >> crawler=searchengine.crawler( )
    >> crawler.crawl(pagelist)
    Indexing
    Could not open
    Indexing
    Indexing

22 Crawling Creating the Index Database with the Index We will use sqlite to store the database in our simple crawler

23 Crawling Creating the Index Table: urllist sqlite> select rowid, url from urllist limit 10;

24 Crawling Creating the Index Table: wordlist

    sqlite> select rowid, word from wordlist where rowid>300 and rowid<310;
    301 ibm
    302 system mainframe
    305 c
    306 name
    307 used
    308 few
    309 bring

25 Crawling Creating the Index Table: wordlocation

    sqlite> select urlid, wordid, location from wordlocation where rowid>54000 limit 5;
    sqlite> select * from wordlist where rowid=1310;
    changes
    sqlite> select * from wordlist where rowid=1311;
    random
    sqlite> select * from wordlist where rowid=1294;
    article
    sqlite> select * from urllist where rowid=260;
    sqlite>

26 Crawling Creating the Index Storing links Apart from indexing the contents of the webpages, we also store links between pages and the words they contain.

27 Searching Querying Searching in the Index
To search the index for a specific word, for example "recursive", we can run a simple query:

    sqlite> select word, url, location from wordlist w, wordlocation l, urllist u
            where l.wordid = w.rowid and w.word = 'recursive' and u.rowid = l.urlid;

Each result row lists the word, the URL of a page that contains it, and the word's location on that page.

28 Searching Querying Searching in the Index
A single-word search would make for a quite limited search engine, so we need to add support for multi-word queries:

    sqlite> select w1.word, w2.word, url, l1.location, l2.location
            from wordlist w1, wordlist w2, wordlocation l1, wordlocation l2, urllist u
            where l1.wordid = w1.rowid and l2.wordid = w2.rowid
            and w1.word = 'recursive' and w2.word = 'function'
            and l1.urlid = l2.urlid and u.rowid = l1.urlid limit 17;

Each result row lists both words, the URL of a page that contains them, and the location of each word on that page.
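
The multi-word query generalizes to any number of words by joining wordlocation with itself once per word. The Python 3 sketch below builds such a query against a tiny in-memory database; the function name get_match_rows, the sample rows, and the example.com URLs are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table urllist(url);
create table wordlist(word);
create table wordlocation(urlid, wordid, location);
insert into urllist(url) values ('http://example.com/a'), ('http://example.com/b');
insert into wordlist(word) values ('recursive'), ('function');
-- page 1 contains both words, page 2 only 'recursive'
insert into wordlocation values (1, 1, 3), (1, 2, 4), (2, 1, 0);
""")


def get_match_rows(con, words):
    """Return (urlid, loc_word0, loc_word1, ...) for pages containing all words."""
    fields, tables, clauses, params = ["w0.urlid"], [], [], []
    for i, word in enumerate(words):
        row = con.execute("select rowid from wordlist where word=?", (word,)).fetchone()
        if row is None:
            return []  # an unknown word means no page can match
        tables.append("wordlocation w%d" % i)
        fields.append("w%d.location" % i)
        if i > 0:
            # Every word must occur on the same page as the previous one.
            clauses.append("w%d.urlid=w%d.urlid" % (i - 1, i))
        clauses.append("w%d.wordid=?" % i)
        params.append(row[0])
    sql = "select %s from %s where %s" % (
        ", ".join(fields), ", ".join(tables), " and ".join(clauses))
    return con.execute(sql, params).fetchall()


print(get_match_rows(con, ["recursive", "function"]))
print(get_match_rows(con, ["recursive"]))
```

Rows of this shape, a URL id followed by one location per query word, are the input that the ranking functions work on.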

34 Ranking Ranking the results
- Until now, results have been returned in the order in which they were indexed
- To return relevant pages first, we need ranking algorithms:
  - Content-based ranking
  - Ranking based on inbound links
  - PageRank algorithm
  - Ranking based on user feedback

35 Ranking Content-based Ranking Word Frequency
Based on the intuition that relevant pages will contain more occurrences of the search terms than irrelevant ones.

    def frequencyscore(self,rows):
        counts=dict([(row[0],0) for row in rows])
        for row in rows: counts[row[0]]+=1
        return self.normalizescores(counts)
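
A standalone Python 3 sketch of the same metric follows. It assumes each row has the form (urlid, location, ...), one row per match, and it scales the counts inline so that the block runs on its own; the sample rows are made up.

```python
from collections import Counter


def frequency_score(rows):
    """Score each URL by its number of matching rows, scaled so the best is 1.0."""
    counts = Counter(row[0] for row in rows)
    if not counts:
        return {}
    top = max(counts.values())
    return {urlid: n / top for urlid, n in counts.items()}


# Hypothetical match rows: (urlid, location of word 1, location of word 2).
rows = [(1, 3, 4), (1, 3, 9), (1, 7, 4), (2, 0, 5)]
print(frequency_score(rows))
```

Page 1 produces three matching rows and page 2 only one, so page 1 receives the top score.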

36 Ranking Content-based Ranking Document Location
Based on the intuition that the most relevant pages will contain the search terms near the beginning of the page.

    def locationscore(self,rows):
        locations=dict([(row[0], ) for row in rows])
        for row in rows:
            loc=sum(row[1:])
            if loc<locations[row[0]]: locations[row[0]]=loc
        # Smaller location sums are better, as with distancescore
        return self.normalizescores(locations,smallisbetter=1)

37 Ranking Content-based Ranking Word Distance
When searching for multi-word queries, it is desirable to return first the pages in which the query words appear close together.

    def distancescore(self,rows):
        # If there's only one word, everyone wins!
        if len(rows[0])<=2: return dict([(row[0],1.0) for row in rows])

        # Initialize the dictionary with large values
        mindistance=dict([(row[0], ) for row in rows])

        for row in rows:
            dist=sum([abs(row[i]-row[i-1]) for i in range(2,len(row))])
            if dist<mindistance[row[0]]: mindistance[row[0]]=dist
        return self.normalizescores(mindistance,smallisbetter=1)
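
The location and distance metrics share the same shape: keep the smallest value seen per page, then rescale so that smaller is better. The Python 3 sketch below shows both under some assumptions: float("inf") is used as the initial "large value", the floor of 1e-5 is an assumed constant, and the floor is applied to the numerator as well so that a best value of 0 still scores 1.0.

```python
def normalize_small_is_better(scores):
    """Rescale so the smallest raw value maps to 1.0 and larger values approach 0."""
    vsmall = 1e-5  # assumed floor that avoids division by zero
    lowest = max(vsmall, min(scores.values()))
    return {u: lowest / max(vsmall, s) for u, s in scores.items()}


def location_score(rows):
    """Prefer pages where the query words occur early (small location sum)."""
    best = {}
    for row in rows:
        loc = sum(row[1:])
        if loc < best.get(row[0], float("inf")):
            best[row[0]] = loc
    return normalize_small_is_better(best)


def distance_score(rows):
    """Prefer pages where consecutive query words occur close together."""
    if len(rows[0]) <= 2:
        # A single query word has no distance to measure.
        return {row[0]: 1.0 for row in rows}
    best = {}
    for row in rows:
        dist = sum(abs(row[i] - row[i - 1]) for i in range(2, len(row)))
        if dist < best.get(row[0], float("inf")):
            best[row[0]] = dist
    return normalize_small_is_better(best)


# Hypothetical match rows: (urlid, location of word 1, location of word 2).
rows = [(1, 3, 4), (1, 3, 9), (2, 0, 5)]
print(location_score(rows))
print(distance_score(rows))
```

On these rows the two metrics disagree: page 2 mentions the words earlier, while page 1 mentions them closer together, which is one reason to combine several metrics.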

40 Ranking Content-based Ranking Examples of Results
Word Frequency

    >> e.query('functional programming')

Document Location

    >> e.query('functional programming')

Word Distance

    >> e.query('functional programming')

42 Ranking Content-based Ranking Combining Metrics
Different metrics serve different purposes, so it makes sense to combine them and rank the results by a weighted average.

    weights=[(1.0,self.locationscore(rows)),
             (1.0,self.frequencyscore(rows)),
             (1.0,self.distancescore(rows))]

Normalization: the different metrics have to be on a common scale (0,1).

    def normalizescores(self,scores,smallisbetter=0):
        vsmall=  # Avoid division by zero errors
        if smallisbetter:
            minscore=min(scores.values())
            return dict([(u,float(minscore)/max(vsmall,l)) for (u,l) in scores.items()])
        else:
            maxscore=max(scores.values())
            if maxscore==0: maxscore=vsmall
            return dict([(u,float(c)/maxscore) for (u,c) in scores.items()])
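
Normalization and weighting can be tried together on made-up numbers. In the Python 3 sketch below, the value 1e-5 for the division guard, the function name combine, and the raw scores are assumptions; the weighted scores are summed, which ranks pages the same way as a weighted average.

```python
def normalize_scores(scores, small_is_better=False):
    """Map raw scores onto a common 0..1 scale where 1.0 is best."""
    vsmall = 1e-5  # assumed guard against division by zero
    if small_is_better:
        lowest = max(vsmall, min(scores.values()))
        return {u: lowest / max(vsmall, s) for u, s in scores.items()}
    highest = max(scores.values()) or vsmall
    return {u: s / highest for u, s in scores.items()}


def combine(weighted_scores):
    """Sum weight * score per URL and return (urlid, total) pairs, best first."""
    total = {}
    for weight, scores in weighted_scores:
        for urlid, score in scores.items():
            total[urlid] = total.get(urlid, 0.0) + weight * score
    return sorted(total.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw values: match counts and earliest location sums per page.
frequency = normalize_scores({1: 3, 2: 1})
location = normalize_scores({1: 7, 2: 5}, small_is_better=True)

print(combine([(1.0, frequency), (1.0, location)]))
print(combine([(1.0, frequency), (3.0, location)]))
```

With equal weights page 1 ranks first; tripling the weight of the location metric moves page 2 to the top, which shows how much the choice of weights matters.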

43 Ranking Content-based Ranking Combining Word Count and Document Location Metrics
Weights 1:1

    >>> s.query('functional programming')

44 Ranking Inbound Links Inbound Links
Content-based metrics:
- Still used
- Consider only the contents of the document
- Susceptible to manipulation
Off-page metrics:
- Use inbound links
- More difficult to manipulate
- An example of collective intelligence
- Based on the opinions of many website authors, who decide whether or not to link to a certain page

45 Ranking Inbound Links Counting Inbound Links
- Considers the links pointing to the page being ranked
- Academic papers are rated this way
- The algorithm weights each link equally
- Does not consider the text of the link

    def inboundlinkscore(self,rows):
        uniqueurls=dict([(row[0],1) for row in rows])
        inboundcount=dict([(u,self.con.execute('select count(*) from link where toid=%d' % u).fetchone()[0]) f
        return self.normalizescores(inboundcount)
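
The same count can be computed from an in-memory list of links instead of the link table. This Python 3 sketch assumes links are given as (fromid, toid) pairs; the function name and the sample data are made up for illustration.

```python
from collections import Counter


def inbound_link_score(urlids, links):
    """Score each URL by how many links point to it, scaled so the best is 1.0."""
    inbound = Counter(toid for _fromid, toid in links)
    counts = {u: inbound.get(u, 0) for u in urlids}
    if not counts:
        return {}
    highest = max(counts.values()) or 1  # avoid dividing by zero
    return {u: n / highest for u, n in counts.items()}


# Hypothetical links as (fromid, toid) pairs.
links = [(1, 3), (2, 3), (4, 3), (3, 1)]
print(inbound_link_score([1, 2, 3], links))
```

Every link counts the same here, no matter how important the linking page is; PageRank removes exactly that limitation.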

46 Ranking Inbound Links Counting Inbound Links

    >>> s.query('functional programming')

50 Ranking PageRank PageRank
- Algorithm invented by the founders of Google
- Named after Larry Page
- Every page is assigned a PageRank score, calculated from the PageRank of all the other pages that link to it
- Intended to model the probability that someone randomly clicking on links ends up at a certain page

51 Ranking PageRank Computing PageRank Each page gives an equal portion (multiplied by damping factor 0.85) of its own PageRank to the pages it links to.

53 Ranking PageRank Computing PageRank
What if we don't know in advance what the PageRank of the linking pages is?
Initialize every score to an arbitrary value and repeat the PageRank computation: after each iteration we get closer to the true PageRank values.
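
The iteration can be sketched in a few lines of Python 3. This sketch assumes the common formulation in which every page receives a base score of 1 minus the damping factor (0.15 here) plus 0.85 times the shares passed on by the pages linking to it; the base score, the function name, and the three-page graph are assumptions, since the slides give only the damping factor.

```python
def pagerank(links, iterations=20, damping=0.85, initial=1.0):
    """Iteratively estimate PageRank for a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Invert the graph: which pages link to each page?
    inbound = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)
    out_count = {p: len(set(links.get(p, []))) for p in pages}

    scores = {p: initial for p in pages}
    for _ in range(iterations):
        # Each linking page q passes on an equal share of its current score.
        scores = {
            p: (1 - damping)
            + damping * sum(scores[q] / out_count[q] for q in inbound[p])
            for p in pages
        }
    return scores


# Hypothetical graph: A and B both link to C, and C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
result = pagerank(graph, iterations=100)
for page in sorted(result):
    print(page, round(result[page], 3))
```

Page B has no inbound links and settles at the base score, while C, which collects links from both A and B, ends up highest.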

54 Ranking PageRank Table: pagerank sqlite> select score,url from pagerank p, urllist u where u.rowid = p.urlid order by score desc limit 10;

55 Ranking PageRank Results when using PageRank Metrics

    >>> s.query('functional programming')

56 Ranking Using Link Text Using Link Text
- A powerful way to rank search results
- We can get better information from what the links say about the page
- Add up the PageRank scores of all the pages with relevant links and use the sum as the link text score

    def linktextscore(self,rows,wordids):
        linkscores=dict([(row[0],0) for row in rows])
        for wordid in wordids:
            cur=self.con.execute('select link.fromid,link.toid from linkwords,link where wordid=%d and linkwords
            for (fromid,toid) in cur:
                if toid in linkscores:
                    pr=self.con.execute('select score from pagerank where urlid=%d' % fromid).fetchone()[0]
                    linkscores[toid]+=pr
        maxscore=max(linkscores.values())
        normalizedscores=dict([(u,float(l)/maxscore) for (u,l) in linkscores.items()])
        return normalizedscores
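
The scoring rule can be illustrated without a database. In the Python 3 sketch below, link_words is an assumed in-memory stand-in for the joined linkwords and link tables, with one (wordid, fromid, toid) triple per word in a link's text; the PageRank values are made up, and a guard is added for the case where no link text matches.

```python
def link_text_score(urlids, wordids, link_words, pagerank):
    """Add the PageRank of each page whose link text contains a query word."""
    scores = {u: 0.0 for u in urlids}
    for wordid, fromid, toid in link_words:
        if wordid in wordids and toid in scores:
            scores[toid] += pagerank[fromid]
    highest = max(scores.values(), default=0.0)
    if highest == 0:
        return scores  # no link text matched; avoid dividing by zero
    return {u: s / highest for u, s in scores.items()}


# Hypothetical data: PageRank per linking page, and words used in link text.
pagerank = {10: 2.0, 11: 0.5, 12: 1.0}
link_words = [(1, 10, 20), (2, 10, 20), (1, 11, 21), (3, 12, 21)]
print(link_text_score([20, 21], {1, 2}, link_words, pagerank))
```

Page 20 is linked from a high-PageRank page using both query words, so it outranks page 21, whose only relevant link comes from a low-PageRank page.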

57 Ranking Using Link Text Results when using Link Text Metrics

    >>> s.query('functional programming')

58 Ranking Combining all the Techniques Different Metrics Combined
- There is no single best metric
- Averaging a few different metrics may work better than any single one
- Finding the right weights is crucial when tuning a search engine

    weights=[(1.0,self.locationscore(rows)),
             (1.0,self.frequencyscore(rows)),
             (1.0,self.pagerankscore(rows)),
             (1.0,self.linktextscore(rows,wordids))]

59 Ranking Combining all the Techniques Results

    >>> s.query('functional programming')
    select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=144 and w0.ur

63 Learning from Clicks Neural Network Learning from Clicks
- Let's improve relevance by learning which link people actually choose after submitting a query
- An artificial neural network is a good way to do this
- First, train the network: words as the input, the chosen URL as the output
- Then let the network guess which URL will be chosen next and rank it high

64 Learning from Clicks Neural Network Artificial Neural Network
Our neural network will consist of three layers of neurons:
- Input layer: neurons activated by the words of the query
- Hidden layer
- Output layer: activated neurons represent URLs

68 Learning from Clicks Implementing the Neural Network Implementation of the Neural Network
- Usually, all nodes in the network are created in advance
- However, we will take an easier approach: new nodes in the hidden layer are created only when needed
- Every time we are passed a combination of words we haven't seen before, we create a new neuron in the hidden layer for that combination
- The complete representation of the hidden layer will be stored as a table in our database
- The input and output layers don't need to be represented explicitly: we already have the tables wordids and urlids
- We will only store the weights of the connections between layers

69 Learning from Clicks Implementing the Neural Network Creating a New Hidden Node

    >> import nn
    >> mynet=nn.searchnet('nn.db')
    >> mynet.maketables()
    >> wworld,wriver,wbank =101,102,103
    >> uworldbank,uriver,uearth =201,202,203
    >> mynet.generatehiddennode([wworld,wbank],[uworldbank,uriver,uearth])
    >> for c in mynet.con.execute('select * from wordhidden'): print c
    (101, 1, 0.5)
    (103, 1, 0.5)
    >> for c in mynet.con.execute('select * from hiddenurl'): print c
    (1, 201, 0.1)
    (1, 202, 0.1)

70 Learning from Clicks Implementing the Neural Network Feeding Forward
- Now the network can take words as inputs, activate the links, and give a set of URLs as output
- Neurons in the hidden layer compute their output with the tanh function
- Before running the algorithm, we build only the relevant part of the network in memory

71 Learning from Clicks Implementing the Neural Network Setting up the Network

    def setupnetwork(self,wordids,urlids):
        # value lists
        self.wordids=wordids
        self.hiddenids=self.getallhiddenids(wordids,urlids)
        self.urlids=urlids

        # node outputs
        self.ai = [1.0]*len(self.wordids)
        self.ah = [1.0]*len(self.hiddenids)
        self.ao = [1.0]*len(self.urlids)

        # create weights matrix
        self.wi = [[self.getstrength(wordid,hiddenid,0)
                    for hiddenid in self.hiddenids]
                   for wordid in self.wordids]
        self.wo = [[self.getstrength(hiddenid,urlid,1)
                    for urlid in self.urlids]
                   for hiddenid in self.hiddenids]

72 Learning from Clicks Implementing the Neural Network Feed Forward

    def feedforward(self):
        # the only inputs are the query words
        for i in range(len(self.wordids)):
            self.ai[i] = 1.0

        # hidden activations
        for j in range(len(self.hiddenids)):
            sum = 0.0
            for i in range(len(self.wordids)):
                sum = sum + self.ai[i] * self.wi[i][j]
            self.ah[j] = tanh(sum)

        # output activations
        for k in range(len(self.urlids)):
            sum = 0.0
            for j in range(len(self.hiddenids)):
                sum = sum + self.ah[j] * self.wo[j][k]
            self.ao[k] = tanh(sum)

        return self.ao[:]

    >> reload(nn)
    >> mynet=nn.searchnet('nn.db')
    >> mynet.getresult([wworld,wbank],[uworldbank,uriver,uearth])
    [0.076,0.076,0.076]
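
The forward pass can be checked by hand with the weights created for the new hidden node: 0.5 from each of the two query words into the hidden node, and 0.1 from the hidden node to each of the three URLs. The Python 3 sketch below keeps the weights in plain lists instead of database tables; the function name and the list layout are assumptions.

```python
from math import tanh


def feed_forward(wi, wo):
    """wi[i][j]: input i -> hidden j; wo[j][k]: hidden j -> output k."""
    n_in, n_hidden, n_out = len(wi), len(wo), len(wo[0])
    ai = [1.0] * n_in  # every query word is fully active
    ah = [tanh(sum(ai[i] * wi[i][j] for i in range(n_in))) for j in range(n_hidden)]
    ao = [tanh(sum(ah[j] * wo[j][k] for j in range(n_hidden))) for k in range(n_out)]
    return ah, ao


wi = [[0.5], [0.5]]     # two query words, one hidden node
wo = [[0.1, 0.1, 0.1]]  # one hidden node, three URLs
ah, ao = feed_forward(wi, wo)
print([round(x, 3) for x in ao])
```

The hidden node outputs tanh(1.0), roughly 0.762, and each URL then receives tanh(0.1 * 0.762), roughly 0.076. All three outputs are equal because the untrained network has no reason to prefer any URL.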

74 Learning from Clicks Training the Neural Network Training the Network
- So far the network produces no useful output
- We need to train the network first
- We will use the backpropagation algorithm to adjust the weights in the network

77 Learning from Clicks Training the Neural Network Backpropagation
1. Calculate the error: the difference between the node's current output and what it is supposed to be
2. Use the dtanh function to determine how much the node's output has to change
3. Change the strength of each incoming link in proportion to the link's current strength and the learning rate

78 Learning from Clicks Training the Neural Network Backpropagation

    def backpropagate(self, targets, N=0.5):
        # calculate errors for output
        output_deltas = [0.0] * len(self.urlids)
        for k in range(len(self.urlids)):
            error = targets[k]-self.ao[k]
            output_deltas[k] = dtanh(self.ao[k]) * error

        # calculate errors for hidden layer
        hidden_deltas = [0.0] * len(self.hiddenids)
        for j in range(len(self.hiddenids)):
            error = 0.0
            for k in range(len(self.urlids)):
                error = error + output_deltas[k]*self.wo[j][k]
            hidden_deltas[j] = dtanh(self.ah[j]) * error

        # update output weights
        for j in range(len(self.hiddenids)):
            for k in range(len(self.urlids)):
                change = output_deltas[k]*self.ah[j]
                self.wo[j][k] = self.wo[j][k] + N*change

        # update input weights
        for i in range(len(self.wordids)):
            for j in range(len(self.hiddenids)):
                change = hidden_deltas[j]*self.ai[i]
                self.wi[i][j] = self.wi[i][j] + N*change
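
A single training step can be reproduced on plain lists. The Python 3 sketch below assumes that dtanh is the slope of tanh written in terms of the node's output, 1 - y*y, and starts from the weights of a freshly created hidden node (0.5 in, 0.1 out); the function names and the list layout are assumptions for illustration.

```python
from math import tanh


def dtanh(y):
    # Slope of tanh expressed through its output y = tanh(x).
    return 1.0 - y * y


def feed_forward(wi, wo):
    """Forward pass with every input set to 1.0; returns hidden and output values."""
    n_in, n_hidden, n_out = len(wi), len(wo), len(wo[0])
    ah = [tanh(sum(wi[i][j] for i in range(n_in))) for j in range(n_hidden)]
    ao = [tanh(sum(ah[j] * wo[j][k] for j in range(n_hidden))) for k in range(n_out)]
    return ah, ao


def backpropagate(wi, wo, ah, ao, targets, rate=0.5):
    """Adjust the weights in place for one training example."""
    n_in, n_hidden, n_out = len(wi), len(wo), len(wo[0])
    output_deltas = [dtanh(ao[k]) * (targets[k] - ao[k]) for k in range(n_out)]
    # Hidden errors use the output weights from before this update.
    hidden_deltas = [
        dtanh(ah[j]) * sum(output_deltas[k] * wo[j][k] for k in range(n_out))
        for j in range(n_hidden)
    ]
    for j in range(n_hidden):
        for k in range(n_out):
            wo[j][k] += rate * output_deltas[k] * ah[j]
    for i in range(n_in):
        for j in range(n_hidden):
            wi[i][j] += rate * hidden_deltas[j] * 1.0  # each input value is 1.0


wi = [[0.5], [0.5]]     # two query words, one hidden node
wo = [[0.1, 0.1, 0.1]]  # one hidden node, three URLs
ah, ao = feed_forward(wi, wo)
backpropagate(wi, wo, ah, ao, targets=[1.0, 0.0, 0.0])
_, trained = feed_forward(wi, wo)
print([round(x, 3) for x in trained])
```

After one step the output for the selected URL rises from about 0.076 to about 0.335, while the other two fall to about 0.055.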

79 Learning from Clicks Training the Neural Network Train Query

    def trainquery(self,wordids,urlids,selectedurl):
        # generate a hidden node if necessary
        self.generatehiddennode(wordids,urlids)

        self.setupnetwork(wordids,urlids)
        self.feedforward()
        targets=[0.0]*len(urlids)
        targets[urlids.index(selectedurl)]=1.0
        error = self.backpropagate(targets)
        self.updatedatabase()

    >> mynet=nn.searchnet('nn.db')
    >> mynet.trainquery([wworld,wbank],[uworldbank,uriver,uearth],uworldbank)
    >> mynet.getresult([wworld,wbank],[uworldbank,uriver,uearth])
    [0.335,0.055,0.055]

80 Learning from Clicks Training the Neural Network Power of Neural Networks
A neural network can even answer queries it has never seen before reasonably well:

    >> allurls=[uworldbank,uriver,uearth]
    >> for i in range(30):
    ...     mynet.trainquery([wworld,wbank],allurls,uworldbank)
    ...     mynet.trainquery([wriver,wbank],allurls,uriver)
    ...     mynet.trainquery([wworld],allurls,uearth)
    ...
    >> mynet.getresult([wworld,wbank],allurls)
    [0.861, 0.011, 0.016]
    >> mynet.getresult([wriver,wbank],allurls)
    [-0.030, 0.883, 0.006]
    >> mynet.getresult([wbank],allurls)
    [0.865, 0.001, -0.85]

81 Learning from Clicks Training the Neural Network Connecting Network to Search Engine
Finally, we can connect the neural network to our search engine's ranking scheme:

    def nnscore(self,rows,wordids):
        # Get unique URL IDs as an ordered list
        urlids=[urlid for urlid in dict([(row[0],1) for row in rows])]
        nnres=mynet.getresult(wordids,urlids)
        scores=dict([(urlids[i],nnres[i]) for i in range(len(urlids))])
        return self.normalizescores(scores)

82 Learning from Clicks Training the Neural Network Does Google Use It? <a href=" class=l onmousedown="return rwt(this,,, res, 4, AFQjCNG2ybB-4tLBf8_ZxyXx5brQsgSYAQ, &sig2=l6txgxnqoadbdzhm8zkn8w )"> <b>python</b> Tutorial </a>

83 Learning from Clicks Training the Neural Network Thank you for your attention
