Searching and Ranking Michal Cap May 14, 2008
Introduction Outline
Introduction: Search Engines
1 Crawling: Crawler, Creating the Index
2 Searching: Querying
3 Ranking: Content-based Ranking, Inbound Links, PageRank, Using Link Text, Combining all the Techniques
4 Learning from Clicks: Neural Network, Implementing the Neural Network, Training the Neural Network
Introduction Search Engines Full-Text Search Engines Allow people to search a large set of documents for a list of words. Modern ranking algorithms are among the most widely used collective intelligence algorithms. Google's success is based on PageRank, an example of a collective intelligence algorithm.
Introduction Search Engines History of Searching on the Internet 1990 Archie: indexing FTP directory listings 1993 Wandex: first Web search engine 1994 WebCrawler, Lycos 1995 Altavista, Yahoo! 1998 Google
Introduction Search Engines Google Homepage 1998
Introduction Search Engines Architecture of a Search Engine: Crawler collects the data; Database stores the indexed data; Searcher returns a list of documents for a given query; Ranking Algorithm ensures that the most relevant results are returned first
Crawling Crawler What is a Crawler A robot wandering through webpages to index their contents. Indexed data is stored in a database. No need to store the entire contents of the webpage. May operate on the Internet or a corporate intranet.
Crawling Crawler Programming a Simple Crawler in Python
class crawler:
  # Auxiliary function for getting an entry id and adding it if it's not present
  def getentryid(self,table,field,value,createnew=True):
  # Index an individual page
  def addtoindex(self,url,soup):
  # Extract the text from an HTML page (no tags)
  def gettextonly(self,soup):
  # Separate the words by any non-alphanumeric character
  def separatewords(self,text):
  # Return True if this url is already indexed
  def isindexed(self,url):
  # Add a link between two pages
  def addlinkref(self,urlfrom,urlto,linktext):
  # Starting with a list of pages, do a breadth-first search to the given depth
  def crawl(self,pages,depth=2):
Crawling Crawler Parsing the Webpage, urllib2
Our parser uses urllib2 to get the contents of the web page via the HTTP protocol:
>>> import urllib2
>>> c=urllib2.urlopen('http://www.cnn.com')
>>> contents=c.read()
>>> print contents[0:250]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<title>CNN.com - Breaking News, U.S., World, Weather, Entertainment & Video News</title>
<meta http-equiv="refresh" conte
>>>
Crawling Crawler Parsing the Webpage, BeautifulSoup
Beautiful Soup is a library for building a structured representation of an HTML document. It can be used to give us all outbound links from the current page to be followed further.
>>> from BeautifulSoup import *
>>> c=urllib2.urlopen('http://www.google.com')
>>> soup = BeautifulSoup(c.read())
>>> for link in soup('a'):
...     print dict(link.attrs)['href']
...
http://images.google.nl/imghp?hl=nl&tab=wi
http://maps.google.nl/maps?hl=nl&tab=wl
http://news.google.nl/nwshp?hl=nl&tab=wn
http://video.google.nl/?hl=nl&tab=wv
http://mail.google.com/mail/?hl=nl&tab=wm
http://www.google.nl/intl/nl/options/
...
>>>
Crawling Crawler Parsing the Webpage, Finding the Words on a Page We have to break the webpage into separate words: use Beautiful Soup to search for text nodes and collect them. Now we have a plain-text representation of the webpage. Split the text representation into a list of separate words.
Crawling Crawler Parsing the Webpage, gettextonly and separatewords
# Extract the text from an HTML page (no tags)
def gettextonly(self,soup):
  v=soup.string
  if v==None:
    c=soup.contents
    resulttext=''
    for t in c:
      subtext=self.gettextonly(t)
      resulttext+=subtext+'\n'
    return resulttext
  else:
    return v.strip()

# Separate the words by any non-alphanumeric character
def separatewords(self,text):
  splitter=re.compile('\\W*')
  return [s.lower() for s in splitter.split(text) if s!='']
Crawling Crawler Stemming Another method for obtaining separate words: converts words into their stems, e.g. 'Indexing' becomes 'index'.
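A real stemmer (such as Porter's algorithm) applies many carefully ordered rules; a toy suffix-stripping sketch is enough to illustrate the idea. The `stem` helper below is hypothetical and not part of the crawler class:

```python
# Toy stemmer: strips a few common English suffixes so that
# inflected forms map to one index entry. A real search engine
# would use a full Porter stemmer instead of these ad-hoc rules.
def stem(word):
    for suffix in ('ing', 'ed', 'es', 's'):
        # only strip when a reasonably long stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:len(word) - len(suffix)]
    return word

# 'indexing' and 'indexes' both collapse to the stem 'index'
```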
Crawling Crawler Parsing the Webpage, addtoindex method
# Index an individual page
def addtoindex(self,url,soup):
  if self.isindexed(url): return
  print 'Indexing '+url
  # Get the individual words
  text=self.gettextonly(soup)
  words=self.separatewords(text)
  # Get the URL id
  urlid=self.getentryid('urllist','url',url)
  # Link each word to this url
  for i in range(len(words)):
    word=words[i]
    if word in ignorewords: continue
    wordid=self.getentryid('wordlist','word',word)
    self.con.execute("insert into wordlocation(urlid,wordid,location) values (%d,%d,%d)" % (urlid,wordid,i))
Crawling Crawler Parsing the Webpage, crawl method
def crawl(self,pages,depth=2):
  for i in range(depth):
    newpages={}
    for page in pages:
      try:
        c=urllib2.urlopen(page)
      except:
        print "Could not open %s" % page
        continue
      try:
        soup=BeautifulSoup(c.read())
        self.addtoindex(page,soup)
        links=soup('a')
        for link in links:
          if ('href' in dict(link.attrs)):
            url=urljoin(page,link['href'])
            if url.find("'")!=-1: continue
            url=url.split('#')[0]  # remove location portion
            if url[0:4]=='http' and not self.isindexed(url):
              newpages[url]=1
            linktext=self.gettextonly(link)
            self.addlinkref(page,url,linktext)
        self.dbcommit()
      except:
        print "Could not parse page %s" % page
    pages=newpages
Crawling Crawler Running the Crawler
>> import searchengine
>> pagelist=['http://kiwitobes.com/wiki/perl.html']
>> crawler=searchengine.crawler('')
>> crawler.crawl(pagelist)
Indexing http://kiwitobes.com/wiki/perl.html
Could not open http://kiwitobes.com/wiki/module_%28programming%29.html
Indexing http://kiwitobes.com/wiki/open_directory_project.html
Indexing http://kiwitobes.com/wiki/common_gateway_interface.html
Crawling Creating the Index Database with the Index We will use SQLite to store the index in our simple crawler
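The slides don't show the schema-creation code, so here is a reconstruction from the table and column names used in the queries on the following slides, using Python's built-in sqlite3 module:

```python
import sqlite3

# Create the tables the crawler uses: the URL and word lists,
# word positions per page, and the link graph with its link text.
def createindextables(con):
    con.execute('create table urllist(url)')
    con.execute('create table wordlist(word)')
    con.execute('create table wordlocation(urlid,wordid,location)')
    con.execute('create table link(fromid integer,toid integer)')
    con.execute('create table linkwords(wordid,linkid)')
    con.execute('create index urlidx on urllist(url)')
    con.execute('create index wordidx on wordlist(word)')
    con.commit()

con = sqlite3.connect(':memory:')
createindextables(con)
```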
Crawling Creating the Index Table: urllist sqlite> select rowid, url from urllist limit 10; 1 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html 2 http://kiwitobes.com/wiki/programming_language.html 3 http://kiwitobes.com/wiki/alphabetical_list_of_programming_languages.html 4 http://kiwitobes.com/wiki/timeline_of_programming_languages.html 5 http://kiwitobes.com/wiki/generational_list_of_programming_languages.html 6 http://kiwitobes.com/wiki/array_programming.html 7 http://kiwitobes.com/wiki/a%2b_%28programming_language%29.html 8 http://kiwitobes.com/wiki/analytica.html 9 http://kiwitobes.com/wiki/apl_programming_language.html 10 http://kiwitobes.com/wiki/f_programming_language.html
Crawling Creating the Index Table: wordlist sqlite> select rowid, word from wordlist where rowid>300 and rowid<310; 301 ibm 302 system 303 360 304 mainframe 305 c 306 name 307 used 308 few 309 bring
Crawling Creating the Index Table: wordlocation sqlite> select urlid, wordid, location from wordlocation where rowid>54000 limit 5; 260 1310 610 260 1311 611 260 1294 612 260 1312 613 260 1313 614 sqlite> select * from wordlist where rowid=1310; changes sqlite> select * from wordlist where rowid=1311; random sqlite> select * from wordlist where rowid=1294; article sqlite> select * from urllist where rowid=260; http://kiwitobes.com/wiki/janus_computer_programming_language.html sqlite>
Crawling Creating the Index Storing links Apart from indexing the contents of the webpages, we also store links between pages and the words they contain.
Searching Querying Searching in the Index
To search the index for a specific word, e.g. 'recursive', we can run a simple query:
sqlite> select word, url, location from wordlist w, wordlocation l, urllist u where l.wordid = w.rowid and w.word = 'recursive' and u.rowid = l.urlid;
recursive http://kiwitobes.com/wiki/cilk.html 606
recursive http://kiwitobes.com/wiki/cilk.html 616
recursive http://kiwitobes.com/wiki/cilk.html 1440
recursive http://kiwitobes.com/wiki/joy_programming_language.html 639
recursive http://kiwitobes.com/wiki/declarative_programming_language.html 462
recursive http://kiwitobes.com/wiki/haskell_programming_language.html 622
recursive http://kiwitobes.com/wiki/haskell_programming_language.html 900
recursive http://kiwitobes.com/wiki/xslt.html 2804
recursive http://kiwitobes.com/wiki/xsl_transformations.html 2801
recursive http://kiwitobes.com/wiki/logo_programming_language.html 1996
recursive http://kiwitobes.com/wiki/e_programming_language.html 775
recursive http://kiwitobes.com/wiki/procedural_programming.html 902
...
Searching Querying Searching in the Index
This would be a quite limited search engine, so we need to add support for multi-word queries:
sqlite> select w1.word, w2.word, url, l1.location, l2.location from wordlist w1, wordlist w2, wordlocation l1, wordlocation l2, urllist u where l1.wordid = w1.rowid and l2.wordid = w2.rowid and w1.word='recursive' and w2.word='function' and l1.urlid = l2.urlid and u.rowid = l1.urlid limit 17;
recursive function http://kiwitobes.com/wiki/cilk.html 606 460
recursive function http://kiwitobes.com/wiki/cilk.html 606 611
recursive function http://kiwitobes.com/wiki/cilk.html 606 1250
recursive function http://kiwitobes.com/wiki/cilk.html 606 1328
recursive function http://kiwitobes.com/wiki/cilk.html 616 460
recursive function http://kiwitobes.com/wiki/cilk.html 616 611
recursive function http://kiwitobes.com/wiki/cilk.html 616 1250
recursive function http://kiwitobes.com/wiki/cilk.html 616 1328
recursive function http://kiwitobes.com/wiki/cilk.html 1440 460
recursive function http://kiwitobes.com/wiki/cilk.html 1440 611
recursive function http://kiwitobes.com/wiki/cilk.html 1440 1250
recursive function http://kiwitobes.com/wiki/cilk.html 1440 1328
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 352
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 388
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 424
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 433
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 472
Ranking Ranking the Results Until now, results were given in the order in which they were indexed. To return relevant pages first, we need ranking algorithms: content-based ranking; ranking based on inbound links; the PageRank algorithm; ranking based on user feedback.
Ranking Content-based Ranking Word Frequency
Based on the intuition that relevant pages will contain more occurrences of the search term than irrelevant ones.
def frequencyscore(self,rows):
  counts=dict([(row[0],0) for row in rows])
  for row in rows: counts[row[0]]+=1
  return self.normalizescores(counts)
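On a hypothetical result set (each row is a URL id followed by one location per query word), the scoring works like this. These are standalone functions rather than methods, with the `normalizescores` helper from a later slide inlined:

```python
# Scale all scores onto (0,1], taken from the normalizescores helper.
def normalizescores(scores, smallisbetter=0):
    vsmall = 0.00001  # avoid division by zero errors
    if smallisbetter:
        minscore = min(scores.values())
        return dict([(u, float(minscore) / max(vsmall, l))
                     for (u, l) in scores.items()])
    else:
        maxscore = max(scores.values())
        if maxscore == 0: maxscore = vsmall
        return dict([(u, float(c) / maxscore) for (u, c) in scores.items()])

# Count how many matching rows each URL has, then normalize.
def frequencyscore(rows):
    counts = dict([(row[0], 0) for row in rows])
    for row in rows:
        counts[row[0]] += 1
    return normalizescores(counts)

# URL 1 matches three times, URL 2 once (made-up data)
rows = [(1, 10, 25), (1, 40, 25), (1, 70, 80), (2, 5, 6)]
scores = frequencyscore(rows)
```

URL 1, with the most matches, gets the top score of 1.0 and URL 2 gets a third of that.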
Ranking Content-based Ranking Document Location
Based on the intuition that the most relevant pages will mention the search term near the beginning of the page.
def locationscore(self,rows):
  locations=dict([(row[0],1000000) for row in rows])
  for row in rows:
    loc=sum(row[1:])
    if loc<locations[row[0]]: locations[row[0]]=loc
  return self.normalizescores(locations,smallisbetter=1)
Ranking Content-based Ranking Word Distance
When searching for multi-word queries, it is desirable to return first the pages where the query words appear close together.
def distancescore(self,rows):
  # If there's only one word, everyone wins!
  if len(rows[0])<=2: return dict([(row[0],1.0) for row in rows])
  # Initialize the dictionary with large values
  mindistance=dict([(row[0],1000000) for row in rows])
  for row in rows:
    dist=sum([abs(row[i]-row[i-1]) for i in range(2,len(row))])
    if dist<mindistance[row[0]]: mindistance[row[0]]=dist
  return self.normalizescores(mindistance,smallisbetter=1)
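For example, with the query words at these made-up locations, the minimum sum of gaps per URL comes out as follows (a standalone sketch of the distance step; normalization is omitted here):

```python
# Smallest sum of gaps between adjacent query-word locations per URL.
def mindistances(rows):
    mindistance = dict([(row[0], 1000000) for row in rows])
    for row in rows:
        dist = sum([abs(row[i] - row[i-1]) for i in range(2, len(row))])
        if dist < mindistance[row[0]]:
            mindistance[row[0]] = dist
    return mindistance

# URL 1 has the two words only 2 positions apart at best;
# on URL 2 they are 90 positions apart.
rows = [(1, 10, 12), (1, 40, 90), (2, 5, 95)]
d = mindistances(rows)
```

With smallisbetter normalization, URL 1 would then score far higher than URL 2.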
Ranking Content-based Ranking Examples of Results
Word Frequency
>> e.query('functional programming')
1.000000 http://kiwitobes.com/wiki/functional_programming.html
0.262476 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html
0.062310 http://kiwitobes.com/wiki/programming_language.html
0.043976 http://kiwitobes.com/wiki/lisp_programming_language.html
0.036394 http://kiwitobes.com/wiki/programming_paradigm.html
Document Location
>> e.query('functional programming')
1.000000 http://kiwitobes.com/wiki/functional_programming.html
0.150183 http://kiwitobes.com/wiki/haskell_programming_language.html
0.149635 http://kiwitobes.com/wiki/opal_programming_language.html
0.149091 http://kiwitobes.com/wiki/miranda_programming_language.html
0.149091 http://kiwitobes.com/wiki/joy_programming_language.html
Word Distance
>> e.query('functional programming')
1.000000 http://kiwitobes.com/wiki/xslt.html
1.000000 http://kiwitobes.com/wiki/xquery.html
1.000000 http://kiwitobes.com/wiki/procedural_programming.html
1.000000 http://kiwitobes.com/wiki/miranda_programming_language.html
1.000000 http://kiwitobes.com/wiki/iswim.html
Ranking Content-based Ranking Combining Metrics
Different metrics serve different purposes, so it makes sense to combine them and use a weighted average to rank the results.
weights=[(1.0,self.locationscore(rows)),
         (1.0,self.frequencyscore(rows)),
         (1.0,self.distancescore(rows))]
Normalization: the different metrics have to be on a common scale (0,1)
def normalizescores(self,scores,smallisbetter=0):
  vsmall=0.00001 # Avoid division by zero errors
  if smallisbetter:
    minscore=min(scores.values())
    return dict([(u,float(minscore)/max(vsmall,l)) for (u,l) in scores.items()])
  else:
    maxscore=max(scores.values())
    if maxscore==0: maxscore=vsmall
    return dict([(u,float(c)/maxscore) for (u,c) in scores.items()])
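The weighted combination itself can be sketched as follows. The function name and the sample score dicts are illustrative, assuming each metric already returns a normalized dict of url → score:

```python
# Combine several normalized score dicts into one using a weighted sum.
# weights is a list of (weight, scoredict) pairs as on the slide above.
def combinedscore(weights):
    totalscores = {}
    for (weight, scores) in weights:
        for url in scores:
            totalscores[url] = totalscores.get(url, 0) + weight * scores[url]
    return totalscores

# Two made-up metrics: the second is given twice the weight of the first
weights = [(1.0, {1: 1.0, 2: 0.3}),
           (2.0, {1: 0.5, 2: 1.0})]
scores = combinedscore(weights)
```

URL 2 wins here because the doubly weighted second metric favors it strongly.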
Ranking Content-based Ranking Combining Word Count and Document Location Metrics
Weights 1:1
>>> s.query('functional programming')
2.000000 http://kiwitobes.com/wiki/functional_programming.html
0.379619 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html
0.191990 http://kiwitobes.com/wiki/lisp_programming_language.html
0.167829 http://kiwitobes.com/wiki/haskell_programming_language.html
0.164944 http://kiwitobes.com/wiki/scheme_programming_language.html
0.161776 http://kiwitobes.com/wiki/programming_paradigm.html
0.161647 http://kiwitobes.com/wiki/logo_programming_language.html
0.160671 http://kiwitobes.com/wiki/miranda_programming_language.html
0.158189 http://kiwitobes.com/wiki/dylan_programming_language.html
0.156673 http://kiwitobes.com/wiki/curry_programming_language.html
Ranking Inbound Links Inbound Links
Content-based metrics: still used; consider only the contents of the document; susceptible to manipulation.
Off-page metrics: use inbound links; more difficult to manipulate; an example of collective intelligence, based on the opinions of the many website authors who decide whether or not to link to a certain page.
Ranking Inbound Links Counting Inbound Links
Considers the links pointing to the ranked page. Academic papers are often rated this way. The algorithm weights each link equally and does not consider the text of the link.
def inboundlinkscore(self,rows):
  uniqueurls=dict([(row[0],1) for row in rows])
  inboundcount=dict([(u,self.con.execute('select count(*) from link where toid=%d' % u).fetchone()[0]) for u in uniqueurls])
  return self.normalizescores(inboundcount)
Ranking Inbound Links Counting Inbound Links
>>> s.query('functional programming')
1.000000 http://kiwitobes.com/wiki/programming_language.html
0.519048 http://kiwitobes.com/wiki/object-oriented_programming.html
0.442857 http://kiwitobes.com/wiki/unix.html
0.376190 http://kiwitobes.com/wiki/functional_programming.html
0.361905 http://kiwitobes.com/wiki/python_programming_language.html
0.338095 http://kiwitobes.com/wiki/programming_paradigm.html
0.319048 http://kiwitobes.com/wiki/perl.html
0.295238 http://kiwitobes.com/wiki/lisp_programming_language.html
0.285714 http://kiwitobes.com/wiki/assembly_language.html
0.280952 http://kiwitobes.com/wiki/smalltalk.html
Ranking PageRank PageRank An algorithm invented by the founders of Google. Named after Larry Page. Every page is assigned a PageRank score, calculated from the importance of all the other pages that link to it and their own PageRank. It models the probability that someone randomly clicking on links will end up at a certain page.
Ranking PageRank Computing PageRank Each page distributes its own PageRank (multiplied by the damping factor 0.85) in equal portions among the pages it links to.
Ranking PageRank Computing PageRank What if we don't know beforehand what the PageRank of the linking pages is? Initialize all pages to an arbitrary value and repeat the PageRank calculation; after each iteration we get closer to the true PageRank values.
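The iteration can be sketched in pure Python on a toy link graph (the graph, function name, and iteration count are made up; the real crawler reads the link table from the database and stores scores in a pagerank table):

```python
# Iterative PageRank on an adjacency dict: links[p] = list of pages p links to.
def calculatepagerank(links, iterations=20, damping=0.85):
    # start every page at an arbitrary PageRank of 1.0
    pr = dict([(p, 1.0) for p in links])
    for i in range(iterations):
        newpr = {}
        for page in links:
            score = 1 - damping  # minimum PageRank of 0.15
            # each page linking here contributes a share of its own rank
            for other in links:
                if page in links[other]:
                    score += damping * pr[other] / len(links[other])
            newpr[page] = score
        pr = newpr  # each iteration moves closer to the true values
    return pr

# A links to B and C; B and C both link back to A, so A should rank highest
links = {'A': ['B', 'C'], 'B': ['A'], 'C': ['A']}
pr = calculatepagerank(links)
```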
Ranking PageRank Table: pagerank sqlite> select score,url from pagerank p, urllist u where u.rowid = p.urlid order by score desc limit 10; 2.528516 http://kiwitobes.com/wiki/main_page.html 1.161464 http://kiwitobes.com/wiki/programming_language.html 1.064252 http://kiwitobes.com/wiki/computer_language.html 0.542686 http://kiwitobes.com/wiki/c_programming_language.html 0.496406 http://kiwitobes.com/wiki/java_programming_language.html 0.427582 http://kiwitobes.com/wiki/object-oriented_programming.html 0.398397 http://kiwitobes.com/wiki/compiler.html 0.395712 http://kiwitobes.com/wiki/c%2b%2b.html 0.38577 http://kiwitobes.com/wiki/operating_system.html 0.370058 http://kiwitobes.com/wiki/microsoft_windows.html
Ranking PageRank Results when using the PageRank Metric
>>> s.query('functional programming')
1.000000 http://kiwitobes.com/wiki/programming_language.html
0.368141 http://kiwitobes.com/wiki/object-oriented_programming.html
0.318146 http://kiwitobes.com/wiki/functional_programming.html
0.291282 http://kiwitobes.com/wiki/unix.html
0.277793 http://kiwitobes.com/wiki/programming_paradigm.html
0.255929 http://kiwitobes.com/wiki/smalltalk.html
0.255763 http://kiwitobes.com/wiki/assembly_language.html
0.240539 http://kiwitobes.com/wiki/python_programming_language.html
0.234827 http://kiwitobes.com/wiki/lisp_programming_language.html
0.232237 http://kiwitobes.com/wiki/haskell_programming_language.html
Ranking Using Link Text Using Link Text
A powerful way to rank searches: we can often get better information from what the links say about a page than from the page itself. Add up the PageRank scores of the pages whose links contain the query words and use this as the Link Text score.
def linktextscore(self,rows,wordids):
  linkscores=dict([(row[0],0) for row in rows])
  for wordid in wordids:
    cur=self.con.execute('select link.fromid,link.toid from linkwords,link where wordid=%d and linkwords.linkid=link.rowid' % wordid)
    for (fromid,toid) in cur:
      if toid in linkscores:
        pr=self.con.execute('select score from pagerank where urlid=%d' % fromid).fetchone()[0]
        linkscores[toid]+=pr
  maxscore=max(linkscores.values())
  normalizedscores=dict([(u,float(l)/maxscore) for (u,l) in linkscores.items()])
  return normalizedscores
Ranking Using Link Text Results when using the Link Text Metric
>>> s.query('functional programming')
1.000000 http://kiwitobes.com/wiki/programming_language.html
0.802978 http://kiwitobes.com/wiki/functional_programming.html
0.311898 http://kiwitobes.com/wiki/object-oriented_programming.html
0.182475 http://kiwitobes.com/wiki/programming_paradigm.html
0.136933 http://kiwitobes.com/wiki/logic_programming.html
0.133477 http://kiwitobes.com/wiki/procedural_programming.html
0.120658 http://kiwitobes.com/wiki/imperative_programming.html
0.093312 http://kiwitobes.com/wiki/generic_programming.html
0.046538 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html
0.044233 http://kiwitobes.com/wiki/mozart_programming_system.html
Ranking Combining all the Techniques Different Metrics Combined
There is no single best metric. Averaging a few different metrics may work better than any single one. Finding the right weights is crucial when tuning a search engine.
weights=[(1.0,self.locationscore(rows)),
         (1.0,self.frequencyscore(rows)),
         (1.0,self.pagerankscore(rows)),
         (1.0,self.linktextscore(rows,wordids))]
Ranking Combining all the Techniques Results
>>> s.query('functional programming')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=144 and w0.ur
3.121124 http://kiwitobes.com/wiki/functional_programming.html
2.074506 http://kiwitobes.com/wiki/programming_language.html
0.712191 http://kiwitobes.com/wiki/object-oriented_programming.html
0.622044 http://kiwitobes.com/wiki/programming_paradigm.html
0.564171 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html
0.469566 http://kiwitobes.com/wiki/procedural_programming.html
0.463690 http://kiwitobes.com/wiki/lisp_programming_language.html
0.454014 http://kiwitobes.com/wiki/imperative_programming.html
0.433878 http://kiwitobes.com/wiki/haskell_programming_language.html
0.384647 http://kiwitobes.com/wiki/multi-paradigm_programming_language.html
Learning from Clicks Neural Network Learning from Clicks Let's improve relevance by learning which links people actually choose after submitting a query! Using an artificial neural network is a great method to do this. First train the network: words as the input, the chosen URL as the output. Then let the network guess which URL will be chosen next and rank it high.
Learning from Clicks Neural Network Artificial Neural Network Our neural network will consist of 3 layers of neurons: Input layer: neurons activated by the words of the query; Hidden layer; Output layer: activated neurons represent URLs
Learning from Clicks Implementing the Neural Network Implementation of the Neural Network
Usually, all nodes in the network are created in advance. However, we will take an easier approach: new nodes in the hidden layer are created only when needed. Every time we are passed a combination of words we haven't seen before, we create a new neuron in the hidden layer for that combination. The complete representation of the hidden layer will be stored as a table in our database. The input and output layers don't need to be represented explicitly - we already have the word ids and url ids. We will only store the weights of the connections between layers.
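A sketch of the node-creation step, consistent with the default weights shown in the session on the next slide. It uses in-memory dicts in place of the wordhidden and hiddenurl database tables, and the hidden-node bookkeeping is simplified:

```python
wordhidden = {}   # (wordid, hiddenid) -> input-side weight
hiddenurl = {}    # (hiddenid, urlid) -> output-side weight
hiddennodes = {}  # word combination -> hiddenid

# Create one hidden node per unseen word combination, with default
# weights 1/len(words) on incoming links and 0.1 on outgoing links.
def generatehiddennode(wordids, urlids):
    key = tuple(sorted(wordids))
    if key in hiddennodes:
        return hiddennodes[key]
    hiddenid = len(hiddennodes) + 1
    hiddennodes[key] = hiddenid
    for wordid in wordids:
        wordhidden[(wordid, hiddenid)] = 1.0 / len(wordids)
    for urlid in urlids:
        hiddenurl[(hiddenid, urlid)] = 0.1
    return hiddenid

# the word pair (101, 103) gets hidden node 1, connected to all three URLs
generatehiddennode([101, 103], [201, 202, 203])
```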
Learning from Clicks Implementing the Neural Network Creating a new Hidden Node
>> import nn
>> mynet=nn.searchnet('nn.db')
>> mynet.maketables()
>> wworld,wriver,wbank = 101,102,103
>> uworldbank,uriver,uearth = 201,202,203
>> mynet.generatehiddennode([wworld,wbank],[uworldbank,uriver,uearth])
>> for c in mynet.con.execute('select * from wordhidden'): print c
(101, 1, 0.5)
(103, 1, 0.5)
>> for c in mynet.con.execute('select * from hiddenurl'): print c
(1, 201, 0.1)
(1, 202, 0.1)
Learning from Clicks Implementing the Neural Network Feeding Forward Now the network can take words as inputs, activate the links, and give a set of URLs as output. Neurons in the hidden layer produce their output according to the tanh function. Before running the algorithm, we build up only the relevant part of the network in memory.
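tanh squashes a node's total input into the range (-1,1), and its slope can be computed directly from the output value, which is what the dtanh helper used by the backpropagation code does:

```python
from math import tanh

# Slope of tanh expressed in terms of its output value y = tanh(x):
# d/dx tanh(x) = 1 - tanh(x)^2
def dtanh(y):
    return 1.0 - y * y
```

The slope is steepest (1.0) around an input of zero and flattens out toward saturated outputs near -1 and 1, which is why strongly activated nodes change their weights more slowly during training.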
Learning from Clicks Implementing the Neural Network Setting up the Network
def setupnetwork(self,wordids,urlids):
  # value lists
  self.wordids=wordids
  self.hiddenids=self.getallhiddenids(wordids,urlids)
  self.urlids=urlids
  # node outputs
  self.ai = [1.0]*len(self.wordids)
  self.ah = [1.0]*len(self.hiddenids)
  self.ao = [1.0]*len(self.urlids)
  # create weights matrix
  self.wi = [[self.getstrength(wordid,hiddenid,0) for hiddenid in self.hiddenids] for wordid in self.wordids]
  self.wo = [[self.getstrength(hiddenid,urlid,1) for urlid in self.urlids] for hiddenid in self.hiddenids]
Learning from Clicks Implementing the Neural Network Feed Forward
def feedforward(self):
  # the only inputs are the query words
  for i in range(len(self.wordids)):
    self.ai[i] = 1.0
  # hidden activations
  for j in range(len(self.hiddenids)):
    sum = 0.0
    for i in range(len(self.wordids)):
      sum = sum + self.ai[i] * self.wi[i][j]
    self.ah[j] = tanh(sum)
  # output activations
  for k in range(len(self.urlids)):
    sum = 0.0
    for j in range(len(self.hiddenids)):
      sum = sum + self.ah[j] * self.wo[j][k]
    self.ao[k] = tanh(sum)
  return self.ao[:]
>> reload(nn)
>> mynet=nn.searchnet('nn.db')
>> mynet.getresult([wworld,wbank],[uworldbank,uriver,uearth])
[0.76,0.76,0.76]
Learning from Clicks Training the Neural Network Training the Network Until now, the network produces no useful output; we need to train it first. We will use the backpropagation algorithm to adjust the weights in the network.
Learning from Clicks Training the Neural Network Backpropagation 1 Calculate the error: the difference between the node's current output and what it is supposed to be 2 Use the dtanh function to determine how much the node's total input has to change 3 Change the strength of each incoming link in proportion to the link's current strength and the learning rate
Learning from Clicks Training the Neural Network Backpropagation
def backpropagate(self, targets, N=0.5):
  # calculate errors for output
  output_deltas = [0.0] * len(self.urlids)
  for k in range(len(self.urlids)):
    error = targets[k]-self.ao[k]
    output_deltas[k] = dtanh(self.ao[k]) * error
  # calculate errors for hidden layer
  hidden_deltas = [0.0] * len(self.hiddenids)
  for j in range(len(self.hiddenids)):
    error = 0.0
    for k in range(len(self.urlids)):
      error = error + output_deltas[k]*self.wo[j][k]
    hidden_deltas[j] = dtanh(self.ah[j]) * error
  # update output weights
  for j in range(len(self.hiddenids)):
    for k in range(len(self.urlids)):
      change = output_deltas[k]*self.ah[j]
      self.wo[j][k] = self.wo[j][k] + N*change
  # update input weights
  for i in range(len(self.wordids)):
    for j in range(len(self.hiddenids)):
      change = hidden_deltas[j]*self.ai[i]
      self.wi[i][j] = self.wi[i][j] + N*change
Learning from Clicks Training the Neural Network Train Query
def trainquery(self,wordids,urlids,selectedurl):
  # generate a hidden node if necessary
  self.generatehiddennode(wordids,urlids)
  self.setupnetwork(wordids,urlids)
  self.feedforward()
  targets=[0.0]*len(urlids)
  targets[urlids.index(selectedurl)]=1.0
  error = self.backpropagate(targets)
  self.updatedatabase()
>> mynet=nn.searchnet('nn.db')
>> mynet.trainquery([wworld,wbank],[uworldbank,uriver,uearth],uworldbank)
>> mynet.getresult([wworld,wbank],[uworldbank,uriver,uearth])
[0.335,0.055,0.055]
Learning from Clicks Training the Neural Network Power of Neural Networks
A neural network is even capable of answering queries it has never seen before reasonably well:
>> allurls=[uworldbank,uriver,uearth]
>> for i in range(30):
...   mynet.trainquery([wworld,wbank],allurls,uworldbank)
...   mynet.trainquery([wriver,wbank],allurls,uriver)
...   mynet.trainquery([wworld],allurls,uearth)
...
>> mynet.getresult([wworld,wbank],allurls)
[0.861, 0.011, 0.016]
>> mynet.getresult([wriver,wbank],allurls)
[-0.030, 0.883, 0.006]
>> mynet.getresult([wbank],allurls)
[0.865, 0.001, -0.85]
Learning from Clicks Training the Neural Network Connecting the Network to the Search Engine
Finally, we can connect the neural network to our search engine's ranking scheme:
def nnscore(self,rows,wordids):
  # Get unique URL IDs as an ordered list
  urlids=[urlid for urlid in dict([(row[0],1) for row in rows])]
  nnres=mynet.getresult(wordids,urlids)
  scores=dict([(urlids[i],nnres[i]) for i in range(len(urlids))])
  return self.normalizescores(scores)
Learning from Clicks Training the Neural Network Does Google Use It?
Google result links carry an onmousedown handler, suggesting that clicks on results are logged:
<a href="http://docs.python.org/tut/" class=l onmousedown="return rwt(this,,, res, 4, AFQjCNG2ybB-4tLBf8_ZxyXx5brQsgSYAQ, &sig2=l6txgxnqoadbdzhm8zkn8w )"> <b>Python</b> Tutorial </a>
Learning from Clicks Training the Neural Network Thank you for your attention