Searching and Ranking


Searching and Ranking
Michal Cap
May 14, 2008

Introduction Outline
Search Engines
1 Crawling: Crawler, Creating the Index
2 Searching: Querying
3 Ranking: Content-based Ranking, Inbound Links, PageRank, Using Link Text, Combining all the Techniques
4 Learning from Clicks: Neural Network, Implementing the Neural Network, Training the Neural Network

Introduction Search Engines Full-Text Search Engines allow people to search a large set of documents for a list of words. Modern ranking algorithms are among the most widely used collective intelligence algorithms; Google's success is based on PageRank, an example of such a collective intelligence algorithm.

Introduction Search Engines History of Searching on the Internet: 1990 Archie (indexing FTP directory listings), 1993 Wandex (first Web search engine), 1994 WebCrawler and Lycos, 1995 AltaVista and Yahoo!, 1998 Google

Introduction Search Engines Google Homepage 1998

Introduction Search Engines Architecture of a Search Engine: the Crawler collects data, the Database stores the indexed data, the Searcher returns a list of documents for a given query, and the Ranking Algorithm ensures that the most relevant results are returned first.
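The four components above can be sketched end to end in a few lines. This is a toy in-memory sketch, not the deck's implementation; all function names and the sample pages are hypothetical.

```python
# Toy sketch of the architecture: the "Database" is a dict mapping
# each word to the set of urls containing it.
def build_index(docs):
    index = {}
    for url, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    # the "Searcher": urls containing every query word
    sets = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

def rank(docs, urls, query):
    # the "Ranking Algorithm": crude word-frequency ordering
    words = query.lower().split()
    score = lambda u: sum(docs[u].lower().split().count(w) for w in words)
    return sorted(urls, key=score, reverse=True)

# The "Crawler" would normally fetch these pages; here they are given.
docs = {'a.html': 'python search engine', 'b.html': 'python python crawler'}
index = build_index(docs)
results = rank(docs, search(index, 'python'), 'python')
```

Running the sketch ranks b.html first, since it mentions the query word twice.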

Crawling Crawler What is a Crawler A robot wandering through webpages to index their contents. The indexed data is stored in a database; there is no need to store the entire contents of each webpage. A crawler may operate on the Internet or on a corporate intranet.

Crawling Crawler Programming a Simple Crawler in Python

class crawler:
  # Auxiliary function for getting an entry id and adding it if it's not present
  def getentryid(self,table,field,value,createnew=True):
  # Index an individual page
  def addtoindex(self,url,soup):
  # Extract the text from an HTML page (no tags)
  def gettextonly(self,soup):
  # Separate the words by any non-whitespace character
  def separatewords(self,text):
  # Return True if this url is already indexed
  def isindexed(self,url):
  # Add a link between two pages
  def addlinkref(self,urlfrom,urlto,linktext):
  # Starting with a list of pages, do a breadth first search to the given depth
  def crawl(self,pages,depth=2):

Crawling Crawler Parsing the Webpage, urllib2 Our parser uses urllib2 to get the contents of the web page via the HTTP protocol:

>>> import urllib2
>>> c=urllib2.urlopen('http://www.cnn.com')
>>> contents=c.read()
>>> print contents[0:250]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<title>CNN.com - Breaking News, U.S., World, Weather, Entertainment & Video News</title>
<meta http-equiv="refresh" conte
>>>

Crawling Crawler Parsing the Webpage, BeautifulSoup Beautiful Soup is a library that builds a structured representation of an HTML document. It can be used to give us all outbound links from the current page to be followed further.

>>> from BeautifulSoup import *
>>> c=urllib2.urlopen('http://www.google.com')
>>> soup = BeautifulSoup(c.read())
>>> for link in soup('a'):
...     print dict(link.attrs)['href']
...
http://images.google.nl/imghp?hl=nl&tab=wi
http://maps.google.nl/maps?hl=nl&tab=wl
http://news.google.nl/nwshp?hl=nl&tab=wn
http://video.google.nl/?hl=nl&tab=wv
http://mail.google.com/mail/?hl=nl&tab=wm
http://www.google.nl/intl/nl/options/
...
>>>

Crawling Crawler Parsing the Webpage, Finding the Words on the Page We have to break the webpage into separate words: use Beautiful Soup to search for text nodes and collect them. Now we have a plain-text representation of the webpage. Split that text representation into a list of separate words.

Crawling Crawler Parsing the Webpage, gettextonly and separatewords

# Extract the text from an HTML page (no tags)
def gettextonly(self,soup):
  v=soup.string
  if v==None:
    c=soup.contents
    resulttext=''
    for t in c:
      subtext=self.gettextonly(t)
      resulttext+=subtext+'\n'
    return resulttext
  else:
    return v.strip()

# Separate the words by any non-whitespace character
def separatewords(self,text):
  splitter=re.compile('\\W*')
  return [s.lower() for s in splitter.split(text) if s!='']

Crawling Crawler Stemming Another method for obtaining separate words: it converts words into their stems, e.g. 'Indexing' becomes 'Index'.
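The slide does not show a stemmer, so here is a toy suffix-stripping sketch of the idea. A real search engine would use the full Porter stemming algorithm; this simplified function is only an illustration.

```python
# Toy stemmer: strips a few common suffixes. Not the Porter algorithm,
# just a minimal sketch of reducing words to a shared stem.
def stem(word):
    word = word.lower()
    for suffix in ('ing', 'edly', 'ed', 'es', 's'):
        # keep at least 3 characters of the stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(stem('Indexing'))   # index
print(stem('searches'))   # search
```

With stemming applied before indexing, a query for "index" would also match documents containing "indexing" or "indexed".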

Crawling Crawler Parsing the Webpage, addtoindex method

# Index an individual page
def addtoindex(self,url,soup):
  if self.isindexed(url): return
  print 'Indexing '+url
  # Get the individual words
  text=self.gettextonly(soup)
  words=self.separatewords(text)
  # Get the URL id
  urlid=self.getentryid('urllist','url',url)
  # Link each word to this url
  for i in range(len(words)):
    word=words[i]
    if word in ignorewords: continue
    wordid=self.getentryid('wordlist','word',word)
    self.con.execute("insert into wordlocation(urlid,wordid,location) values (%d,%d,%d)" % (urlid,wordid,i))

Crawling Crawler Parsing the Webpage, crawl method

def crawl(self,pages,depth=2):
  for i in range(depth):
    newpages={}
    for page in pages:
      try:
        c=urllib2.urlopen(page)
      except:
        print "Could not open %s" % page
        continue
      try:
        soup=BeautifulSoup(c.read())
        self.addtoindex(page,soup)
        links=soup('a')
        for link in links:
          if ('href' in dict(link.attrs)):
            url=urljoin(page,link['href'])
            if url.find("'")!=-1: continue
            url=url.split('#')[0] # remove location portion
            if url[0:4]=='http' and not self.isindexed(url):
              newpages[url]=1
            linktext=self.gettextonly(link)
            self.addlinkref(page,url,linktext)
        self.dbcommit()
      except:
        print "Could not parse page %s" % page
    pages=newpages

Crawling Crawler Running the Crawler

>> import searchengine
>> pagelist=['http://kiwitobes.com/wiki/perl.html']
>> crawler=searchengine.crawler( )
>> crawler.crawl(pagelist)
Indexing http://kiwitobes.com/wiki/perl.html
Could not open http://kiwitobes.com/wiki/module_%28programming%29.html
Indexing http://kiwitobes.com/wiki/open_directory_project.html
Indexing http://kiwitobes.com/wiki/common_gateway_interface.html

Crawling Creating the Index Database with the Index We will use sqlite to store the index database in our simple crawler.

Crawling Creating the Index Table: urllist

sqlite> select rowid, url from urllist limit 10;
1 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html
2 http://kiwitobes.com/wiki/programming_language.html
3 http://kiwitobes.com/wiki/alphabetical_list_of_programming_languages.html
4 http://kiwitobes.com/wiki/timeline_of_programming_languages.html
5 http://kiwitobes.com/wiki/generational_list_of_programming_languages.html
6 http://kiwitobes.com/wiki/array_programming.html
7 http://kiwitobes.com/wiki/a%2b_%28programming_language%29.html
8 http://kiwitobes.com/wiki/analytica.html
9 http://kiwitobes.com/wiki/apl_programming_language.html
10 http://kiwitobes.com/wiki/f_programming_language.html

Crawling Creating the Index Table: wordlist

sqlite> select rowid, word from wordlist where rowid>300 and rowid<310;
301 ibm
302 system
303 360
304 mainframe
305 c
306 name
307 used
308 few
309 bring

Crawling Creating the Index Table: wordlocation

sqlite> select urlid, wordid, location from wordlocation where rowid>54000 limit 5;
260 1310 610
260 1311 611
260 1294 612
260 1312 613
260 1313 614
sqlite> select * from wordlist where rowid=1310;
changes
sqlite> select * from wordlist where rowid=1311;
random
sqlite> select * from wordlist where rowid=1294;
article
sqlite> select * from urllist where rowid=260;
http://kiwitobes.com/wiki/janus_computer_programming_language.html

Crawling Creating the Index Storing links Apart from indexing the contents of the webpages, we also store links between pages and the words they contain.
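The slide does not show the schema for those link tables; a minimal sketch of what addlinkref needs is below. The table and column names (link(fromid, toid), linkwords(wordid, linkid)) are taken from the queries used later in the deck; the sample ids are made up.

```python
import sqlite3

# Minimal sketch of the link tables implied by the later queries:
# link(fromid, toid) for page-to-page links, and
# linkwords(wordid, linkid) for the words in each link's anchor text.
con = sqlite3.connect(':memory:')
con.execute('create table link(fromid integer, toid integer)')
con.execute('create table linkwords(wordid integer, linkid integer)')

# Record a link from url 1 to url 2 whose anchor text contains word 42
cur = con.execute('insert into link(fromid, toid) values (1, 2)')
linkid = cur.lastrowid
con.execute('insert into linkwords(wordid, linkid) values (42, ?)', (linkid,))

count = con.execute('select count(*) from link where toid=2').fetchone()[0]
print(count)  # 1
```

The inbound-link and link-text ranking metrics later in the deck are simple aggregate queries over these two tables.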

Searching Querying Searching in the Index To search the index for a specific word, e.g. 'recursive', we can run a simple query:

sqlite> select word, url, location from wordlist w, wordlocation l, urllist u where l.wordid = w.rowid and w.word = 'recursive' and u.rowid = l.urlid;
recursive http://kiwitobes.com/wiki/cilk.html 606
recursive http://kiwitobes.com/wiki/cilk.html 616
recursive http://kiwitobes.com/wiki/cilk.html 1440
recursive http://kiwitobes.com/wiki/joy_programming_language.html 639
recursive http://kiwitobes.com/wiki/declarative_programming_language.html 462
recursive http://kiwitobes.com/wiki/haskell_programming_language.html 622
recursive http://kiwitobes.com/wiki/haskell_programming_language.html 900
recursive http://kiwitobes.com/wiki/xslt.html 2804
recursive http://kiwitobes.com/wiki/xsl_transformations.html 2801
recursive http://kiwitobes.com/wiki/logo_programming_language.html 1996
recursive http://kiwitobes.com/wiki/e_programming_language.html 775
recursive http://kiwitobes.com/wiki/procedural_programming.html 902
...

Searching Querying Searching in the Index This would be a quite limited search engine, so we need to add support for multi-word queries:

sqlite> select w1.word, w2.word, url, l1.location, l2.location from wordlist w1, wordlist w2, wordlocation l1, wordlocation l2, urllist u where l1.wordid = w1.rowid and l2.wordid = w2.rowid and w1.word='recursive' and w2.word='function' and l1.urlid = l2.urlid and u.rowid = l1.urlid limit 17;
recursive function http://kiwitobes.com/wiki/cilk.html 606 460
recursive function http://kiwitobes.com/wiki/cilk.html 606 611
recursive function http://kiwitobes.com/wiki/cilk.html 606 1250
recursive function http://kiwitobes.com/wiki/cilk.html 606 1328
recursive function http://kiwitobes.com/wiki/cilk.html 616 460
recursive function http://kiwitobes.com/wiki/cilk.html 616 611
recursive function http://kiwitobes.com/wiki/cilk.html 616 1250
recursive function http://kiwitobes.com/wiki/cilk.html 616 1328
recursive function http://kiwitobes.com/wiki/cilk.html 1440 460
recursive function http://kiwitobes.com/wiki/cilk.html 1440 611
recursive function http://kiwitobes.com/wiki/cilk.html 1440 1250
recursive function http://kiwitobes.com/wiki/cilk.html 1440 1328
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 352
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 388
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 424
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 433
recursive function http://kiwitobes.com/wiki/joy_programming_language.html 639 472

Ranking Ranking the results Until now the results were given in the order in which they were indexed. To return the most relevant pages first we need ranking algorithms: content-based ranking, ranking based on inbound links, the PageRank algorithm, and ranking based on user feedback.

Ranking Content-based Ranking Word Frequency Based on the intuition that the relevant pages will contain more occurrences of the search term than the irrelevant ones.

def frequencyscore(self,rows):
  counts=dict([(row[0],0) for row in rows])
  for row in rows: counts[row[0]]+=1
  return self.normalizescores(counts)

Ranking Content-based Ranking Document Location Based on the intuition that the most relevant pages will contain the search term near the beginning of the page.

def locationscore(self,rows):
  locations=dict([(row[0],1000000) for row in rows])
  for row in rows:
    loc=sum(row[1:])
    if loc<locations[row[0]]: locations[row[0]]=loc
  return self.normalizescores(locations,smallisbetter=1)

Ranking Content-based Ranking Word Distance When searching for multi-word queries, it is desirable to first return pages where the query words appear close together.

def distancescore(self,rows):
  # If there's only one word, everyone wins!
  if len(rows[0])<=2: return dict([(row[0],1.0) for row in rows])
  # Initialize the dictionary with large values
  mindistance=dict([(row[0],1000000) for row in rows])
  for row in rows:
    dist=sum([abs(row[i]-row[i-1]) for i in range(2,len(row))])
    if dist<mindistance[row[0]]: mindistance[row[0]]=dist
  return self.normalizescores(mindistance,smallisbetter=1)

Ranking Content-based Ranking Examples of Results

Word Frequency
>> e.query('functional programming')
1.000000 http://kiwitobes.com/wiki/functional_programming.html
0.262476 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html
0.062310 http://kiwitobes.com/wiki/programming_language.html
0.043976 http://kiwitobes.com/wiki/lisp_programming_language.html
0.036394 http://kiwitobes.com/wiki/programming_paradigm.html

Document Location
>> e.query('functional programming')
1.000000 http://kiwitobes.com/wiki/functional_programming.html
0.150183 http://kiwitobes.com/wiki/haskell_programming_language.html
0.149635 http://kiwitobes.com/wiki/opal_programming_language.html
0.149091 http://kiwitobes.com/wiki/miranda_programming_language.html
0.149091 http://kiwitobes.com/wiki/joy_programming_language.html

Word Distance
>> e.query('functional programming')
1.000000 http://kiwitobes.com/wiki/xslt.html
1.000000 http://kiwitobes.com/wiki/xquery.html
1.000000 http://kiwitobes.com/wiki/procedural_programming.html
1.000000 http://kiwitobes.com/wiki/miranda_programming_language.html
1.000000 http://kiwitobes.com/wiki/iswim.html

Ranking Content-based Ranking Combining Metrics Different metrics serve different purposes, so it makes sense to combine them and use a weighted average to rank the results.

weights=[(1.0,self.locationscore(rows)),
         (1.0,self.frequencyscore(rows)),
         (1.0,self.distancescore(rows))]

Normalization: the different metrics have to be brought to a common scale (0,1).

def normalizescores(self,scores,smallisbetter=0):
  vsmall=0.00001 # Avoid division by zero errors
  if smallisbetter:
    minscore=min(scores.values())
    return dict([(u,float(minscore)/max(vsmall,l)) for (u,l) in scores.items()])
  else:
    maxscore=max(scores.values())
    if maxscore==0: maxscore=vsmall
    return dict([(u,float(c)/maxscore) for (u,c) in scores.items()])

Ranking Content-based Ranking Combining Word Count and Document Location Metrics, weight 1:1

>>> s.query('functional programming')
2.000000 http://kiwitobes.com/wiki/functional_programming.html
0.379619 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html
0.191990 http://kiwitobes.com/wiki/lisp_programming_language.html
0.167829 http://kiwitobes.com/wiki/haskell_programming_language.html
0.164944 http://kiwitobes.com/wiki/scheme_programming_language.html
0.161776 http://kiwitobes.com/wiki/programming_paradigm.html
0.161647 http://kiwitobes.com/wiki/logo_programming_language.html
0.160671 http://kiwitobes.com/wiki/miranda_programming_language.html
0.158189 http://kiwitobes.com/wiki/dylan_programming_language.html
0.156673 http://kiwitobes.com/wiki/curry_programming_language.html

Ranking Inbound Links Inbound Links Content-based metrics are still used, but they consider only the contents of the document and are susceptible to manipulation. Off-page metrics use inbound links instead: they are more difficult to manipulate and are an example of collective intelligence, being based on the opinions of the many website authors who decide whether or not to link to a certain page.

Ranking Inbound Links Counting Inbound Links Considers the links pointing to the ranked page (academic papers are rated this way). The algorithm weights each link equally and does not consider the text of the link.

def inboundlinkscore(self,rows):
  uniqueurls=dict([(row[0],1) for row in rows])
  inboundcount=dict([(u,self.con.execute('select count(*) from link where toid=%d' % u).fetchone()[0]) for u in uniqueurls])
  return self.normalizescores(inboundcount)

Ranking Inbound Links Counting Inbound Links

>>> s.query('functional programming')
1.000000 http://kiwitobes.com/wiki/programming_language.html
0.519048 http://kiwitobes.com/wiki/object-oriented_programming.html
0.442857 http://kiwitobes.com/wiki/unix.html
0.376190 http://kiwitobes.com/wiki/functional_programming.html
0.361905 http://kiwitobes.com/wiki/python_programming_language.html
0.338095 http://kiwitobes.com/wiki/programming_paradigm.html
0.319048 http://kiwitobes.com/wiki/perl.html
0.295238 http://kiwitobes.com/wiki/lisp_programming_language.html
0.285714 http://kiwitobes.com/wiki/assembly_language.html
0.280952 http://kiwitobes.com/wiki/smalltalk.html

Ranking PageRank The PageRank algorithm was invented by the founders of Google and named after Larry Page. Every page is assigned a PageRank score, calculated from the importance of all the other pages that link to it and their own PageRank. It is supposed to model the probability that someone randomly clicking on links ends up at a certain page.

Ranking PageRank Computing PageRank Each page gives an equal portion of its own PageRank (multiplied by the damping factor 0.85) to each of the pages it links to.

Ranking PageRank Computing PageRank What if we don't know beforehand the PageRank of the linking pages? Initialize it to an arbitrary value and repeat the PageRank calculation; after each iteration we get closer to the true PageRank values.
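The iteration just described can be sketched outside the database: start every page at 1.0 and repeatedly redistribute scores with the 0.85 damping factor. This is a self-contained sketch over a made-up three-page graph, not the deck's sqlite-based implementation.

```python
def calculate_pagerank(links, iterations=20, damping=0.85):
    """Iterative PageRank sketch. links maps each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pr = dict((p, 1.0) for p in pages)          # arbitrary initial value
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Each page keeps a 0.15 minimum and receives a damped share of
            # the PageRank of every page that links to it.
            incoming = sum(pr[q] / len(links[q]) for q in links if p in links[q])
            new[p] = (1 - damping) + damping * incoming
        pr = new
    return pr

# Hypothetical graph: A and C both link to B, B links to C
ranks = calculate_pagerank({'A': ['B'], 'B': ['C'], 'C': ['B']})
```

After a few iterations the scores settle: B, linked to by both other pages, ends up with the highest PageRank, while A, with no inbound links, converges to the 0.15 minimum.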

Ranking PageRank Table: pagerank

sqlite> select score,url from pagerank p, urllist u where u.rowid = p.urlid order by score desc limit 10;
2.528516 http://kiwitobes.com/wiki/main_page.html
1.161464 http://kiwitobes.com/wiki/programming_language.html
1.064252 http://kiwitobes.com/wiki/computer_language.html
0.542686 http://kiwitobes.com/wiki/c_programming_language.html
0.496406 http://kiwitobes.com/wiki/java_programming_language.html
0.427582 http://kiwitobes.com/wiki/object-oriented_programming.html
0.398397 http://kiwitobes.com/wiki/compiler.html
0.395712 http://kiwitobes.com/wiki/c%2b%2b.html
0.38577 http://kiwitobes.com/wiki/operating_system.html
0.370058 http://kiwitobes.com/wiki/microsoft_windows.html

Ranking PageRank Results when using the PageRank Metric

>>> s.query('functional programming')
1.000000 http://kiwitobes.com/wiki/programming_language.html
0.368141 http://kiwitobes.com/wiki/object-oriented_programming.html
0.318146 http://kiwitobes.com/wiki/functional_programming.html
0.291282 http://kiwitobes.com/wiki/unix.html
0.277793 http://kiwitobes.com/wiki/programming_paradigm.html
0.255929 http://kiwitobes.com/wiki/smalltalk.html
0.255763 http://kiwitobes.com/wiki/assembly_language.html
0.240539 http://kiwitobes.com/wiki/python_programming_language.html
0.234827 http://kiwitobes.com/wiki/lisp_programming_language.html
0.232237 http://kiwitobes.com/wiki/haskell_programming_language.html

Ranking Using Link Text Using Link Text A powerful way to rank searches: we can often get better information from what the links say about a page than from the page itself. Add up the PageRank scores of all the pages with relevant links and use the sum as the Link Text score.

def linktextscore(self,rows,wordids):
  linkscores=dict([(row[0],0) for row in rows])
  for wordid in wordids:
    cur=self.con.execute('select link.fromid,link.toid from linkwords,link where wordid=%d and linkwords.linkid=link.rowid' % wordid)
    for (fromid,toid) in cur:
      if toid in linkscores:
        pr=self.con.execute('select score from pagerank where urlid=%d' % fromid).fetchone()[0]
        linkscores[toid]+=pr
  maxscore=max(linkscores.values())
  normalizedscores=dict([(u,float(l)/maxscore) for (u,l) in linkscores.items()])
  return normalizedscores

Ranking Using Link Text Results when using the Link Text Metric

>>> s.query('functional programming')
1.000000 http://kiwitobes.com/wiki/programming_language.html
0.802978 http://kiwitobes.com/wiki/functional_programming.html
0.311898 http://kiwitobes.com/wiki/object-oriented_programming.html
0.182475 http://kiwitobes.com/wiki/programming_paradigm.html
0.136933 http://kiwitobes.com/wiki/logic_programming.html
0.133477 http://kiwitobes.com/wiki/procedural_programming.html
0.120658 http://kiwitobes.com/wiki/imperative_programming.html
0.093312 http://kiwitobes.com/wiki/generic_programming.html
0.046538 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html
0.044233 http://kiwitobes.com/wiki/mozart_programming_system.html

Ranking Combining all the Techniques Different Metrics Combined There is no single best metric; averaging a few different metrics may work better than any one of them. Finding the right weights is crucial when tuning a search engine.

weights=[(1.0,self.locationscore(rows)),
         (1.0,self.frequencyscore(rows)),
         (1.0,self.pagerankscore(rows)),
         (1.0,self.linktextscore(rows,wordids))]

Ranking Combining all the Techniques Results

>>> s.query('functional programming')
select w0.urlid,w0.location,w1.location from wordlocation w0,wordlocation w1 where w0.wordid=144 and w0.ur
3.121124 http://kiwitobes.com/wiki/functional_programming.html
2.074506 http://kiwitobes.com/wiki/programming_language.html
0.712191 http://kiwitobes.com/wiki/object-oriented_programming.html
0.622044 http://kiwitobes.com/wiki/programming_paradigm.html
0.564171 http://kiwitobes.com/wiki/categorical_list_of_programming_languages.html
0.469566 http://kiwitobes.com/wiki/procedural_programming.html
0.463690 http://kiwitobes.com/wiki/lisp_programming_language.html
0.454014 http://kiwitobes.com/wiki/imperative_programming.html
0.433878 http://kiwitobes.com/wiki/haskell_programming_language.html
0.384647 http://kiwitobes.com/wiki/multi-paradigm_programming_language.html

Learning from Clicks Neural Network Learning from Clicks Let's improve relevance by learning which link people actually choose after submitting a query! An artificial neural network is a great method for this: first train the network, with the query words as the input and the chosen URL as the output; then let the network guess which URL will be chosen next and rank it higher.

Learning from Clicks Neural Network Artificial Neural Network Our neural network will consist of 3 layers of neurons: an input layer (neurons activated by the words of the query), a hidden layer, and an output layer (the activated neurons represent URLs).

Learning from Clicks Implementing the Neural Network Implementation of the Neural Network Usually, all nodes in the network are created in advance However, we will take an easier approach new nodes in hidden layer are created only when needed Every time we are passed a combination of words we haven t seen before, we create new neuron in the hidden layer for that combination Complete representation of the hidden layer will be stored as an table in our database Input and output layer don t need to be represented explicitly - we already have tables wordids and urlids We will only store the weights of connections between layers
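A minimal sketch of that storage, assuming an SQLite database with the two connection tables queried in the session that follows; the column names are an assumption, only the table names come from the slides:

```python
import sqlite3

# In-memory database for illustration; the real code persists to a file.
con = sqlite3.connect(':memory:')

# One row per connection: from-node id, to-node id, weight.
con.execute('create table wordhidden(fromid, toid, strength)')
con.execute('create table hiddenurl(fromid, toid, strength)')

# A hidden node (id 1) for the word combination (101, 103), with
# default strengths of 0.5 (word->hidden) and 0.1 (hidden->url)
for word in (101, 103):
    con.execute('insert into wordhidden values (?,?,?)', (word, 1, 0.5))
for url in (201, 202):
    con.execute('insert into hiddenurl values (?,?,?)', (1, url, 0.1))

rows = list(con.execute('select * from wordhidden'))
print(rows)  # [(101, 1, 0.5), (103, 1, 0.5)]
```

Storing one row per connection keeps the network sparse: only word combinations that have actually been queried ever get a hidden node.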

Learning from Clicks: Implementing the Neural Network

Creating a new Hidden Node

>>> import nn
>>> mynet=nn.searchnet('nn.db')
>>> mynet.maketables()
>>> wworld,wriver,wbank=101,102,103
>>> uworldbank,uriver,uearth=201,202,203
>>> mynet.generatehiddennode([wworld,wbank],[uworldbank,uriver,uearth])
>>> for c in mynet.con.execute('select * from wordhidden'): print c
(101, 1, 0.5)
(103, 1, 0.5)
>>> for c in mynet.con.execute('select * from hiddenurl'): print c
(1, 201, 0.1)
(1, 202, 0.1)

Learning from Clicks: Implementing the Neural Network

Feeding Forward

- Now the network can take words as inputs, activate the links, and give a set of URLs as an output.
- Neurons in the hidden layer will compute their output according to the tanh function.
- Before running the algorithm, we will build up only the relevant part of the network in memory.
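For reference, tanh squashes any total input into the range (-1, 1), so a neuron's activation stays bounded no matter how many connections feed it. A quick check with the standard-library function:

```python
from math import tanh

# tanh is close to linear near 0 and saturates toward +/-1
print(tanh(0.0))   # 0.0
print(tanh(0.5))   # ~0.462
print(tanh(10.0))  # ~1.0
```

This bounded, smooth response is also what makes the later weight-update step well behaved: the derivative exists everywhere and shrinks as a neuron saturates.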

Learning from Clicks: Implementing the Neural Network

Set-up the Network

def setupnetwork(self,wordids,urlids):
    # value lists
    self.wordids=wordids
    self.hiddenids=self.getallhiddenids(wordids,urlids)
    self.urlids=urlids
    # node outputs
    self.ai=[1.0]*len(self.wordids)
    self.ah=[1.0]*len(self.hiddenids)
    self.ao=[1.0]*len(self.urlids)
    # create weights matrix
    self.wi=[[self.getstrength(wordid,hiddenid,0)
              for hiddenid in self.hiddenids]
             for wordid in self.wordids]
    self.wo=[[self.getstrength(hiddenid,urlid,1)
              for urlid in self.urlids]
             for hiddenid in self.hiddenids]

Learning from Clicks: Implementing the Neural Network

Feed Forward

def feedforward(self):
    # the only inputs are the query words
    for i in range(len(self.wordids)):
        self.ai[i]=1.0
    # hidden activations
    for j in range(len(self.hiddenids)):
        sum=0.0
        for i in range(len(self.wordids)):
            sum=sum+self.ai[i]*self.wi[i][j]
        self.ah[j]=tanh(sum)
    # output activations
    for k in range(len(self.urlids)):
        sum=0.0
        for j in range(len(self.hiddenids)):
            sum=sum+self.ah[j]*self.wo[j][k]
        self.ao[k]=tanh(sum)
    return self.ao[:]

>>> reload(nn)
>>> mynet=nn.searchnet('nn.db')
>>> mynet.getresult([wworld,wbank],[uworldbank,uriver,uearth])
[0.76,0.76,0.76]

Learning from Clicks: Training the Neural Network

Training the Network

- Until now, no useful output: we need to train the network first.
- We will use the backpropagation algorithm to adjust the weights in the network.

Learning from Clicks: Training the Neural Network

Backpropagation

1. Calculate the error: the difference between the node's current output and what it is supposed to be.
2. Use the dtanh function to determine how much the node's output has to change.
3. Change the strength of each incoming link in proportion to the link's current strength and the learning rate.
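The dtanh helper is not defined on these slides. A common choice, and the one consistent with the code that follows (which passes a node's output rather than its raw input), expresses the slope of tanh in terms of its own output y = tanh(x), using d/dx tanh(x) = 1 - tanh(x)^2:

```python
from math import tanh

def dtanh(y):
    # slope of tanh at the point whose *output* is y
    return 1.0 - y * y

# numerical sanity check against a central finite difference at x = 0.3
x, h = 0.3, 1e-6
numeric = (tanh(x + h) - tanh(x - h)) / (2 * h)
print(abs(dtanh(tanh(x)) - numeric) < 1e-6)  # True
```

Because the slope shrinks toward 0 as a neuron saturates near -1 or 1, heavily activated nodes receive smaller weight updates, which keeps training stable.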

Learning from Clicks: Training the Neural Network

Backpropagation

def backpropagate(self, targets, N=0.5):
    # calculate errors for output
    output_deltas=[0.0]*len(self.urlids)
    for k in range(len(self.urlids)):
        error=targets[k]-self.ao[k]
        output_deltas[k]=dtanh(self.ao[k])*error
    # calculate errors for hidden layer
    hidden_deltas=[0.0]*len(self.hiddenids)
    for j in range(len(self.hiddenids)):
        error=0.0
        for k in range(len(self.urlids)):
            error=error+output_deltas[k]*self.wo[j][k]
        hidden_deltas[j]=dtanh(self.ah[j])*error
    # update output weights
    for j in range(len(self.hiddenids)):
        for k in range(len(self.urlids)):
            change=output_deltas[k]*self.ah[j]
            self.wo[j][k]=self.wo[j][k]+N*change
    # update input weights
    for i in range(len(self.wordids)):
        for j in range(len(self.hiddenids)):
            change=hidden_deltas[j]*self.ai[i]
            self.wi[i][j]=self.wi[i][j]+N*change

Learning from Clicks: Training the Neural Network

Train Query

def trainquery(self,wordids,urlids,selectedurl):
    # generate a hidden node if necessary
    self.generatehiddennode(wordids,urlids)
    self.setupnetwork(wordids,urlids)
    self.feedforward()
    targets=[0.0]*len(urlids)
    targets[urlids.index(selectedurl)]=1.0
    error=self.backpropagate(targets)
    self.updatedatabase()

>>> mynet=nn.searchnet('nn.db')
>>> mynet.trainquery([wworld,wbank],[uworldbank,uriver,uearth],uworldbank)
>>> mynet.getresult([wworld,wbank],[uworldbank,uriver,uearth])
[0.335,0.055,0.055]

Learning from Clicks: Training the Neural Network

Power of Neural Networks

A neural network is even capable of answering queries it has never seen before reasonably well:

>>> allurls=[uworldbank,uriver,uearth]
>>> for i in range(30):
...     mynet.trainquery([wworld,wbank],allurls,uworldbank)
...     mynet.trainquery([wriver,wbank],allurls,uriver)
...     mynet.trainquery([wworld],allurls,uearth)
...
>>> mynet.getresult([wworld,wbank],allurls)
[0.861, 0.011, 0.016]
>>> mynet.getresult([wriver,wbank],allurls)
[-0.030, 0.883, 0.006]
>>> mynet.getresult([wbank],allurls)
[0.865, 0.001, -0.85]

Learning from Clicks: Training the Neural Network

Connecting the Network to the Search Engine

Finally, we can connect the neural network to our search engine's ranking scheme:

def nnscore(self,rows,wordids):
    # get unique URL IDs as an ordered list
    urlids=[urlid for urlid in dict([(row[0],1) for row in rows])]
    nnres=mynet.getresult(wordids,urlids)
    scores=dict([(urlids[i],nnres[i]) for i in range(len(urlids))])
    return self.normalizescores(scores)
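The normalizescores helper called at the end is not shown on these slides. One plausible implementation (its behavior is assumed from how it is used, not taken verbatim from the slides) rescales every score dictionary to the 0..1 range so that different metrics become comparable before they are combined:

```python
def normalizescores(scores, smallIsBetter=False):
    """Rescale a {urlid: score} dict so the best URL gets 1.0."""
    vsmall = 0.00001  # avoid division by zero
    if smallIsBetter:
        # lower raw scores are better, so invert
        minscore = min(scores.values())
        return dict([(u, float(minscore) / max(vsmall, s))
                     for (u, s) in scores.items()])
    else:
        maxscore = max(scores.values())
        if maxscore == 0:
            maxscore = vsmall
        return dict([(u, float(s) / maxscore)
                     for (u, s) in scores.items()])

# Usage: best URL gets 1.0, the rest scale proportionally
print(normalizescores({201: 4.0, 202: 2.0, 203: 1.0}))
# {201: 1.0, 202: 0.5, 203: 0.25}
```

The smallIsBetter flag covers metrics like word location, where a smaller raw value should yield a score closer to 1.0.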

Learning from Clicks: Training the Neural Network

Does Google Use It?

Google's result links carry an onmousedown handler (rwt) that rewrites the link when it is clicked, so the chosen result is reported back to Google; clicks are being tracked:

<a href="http://docs.python.org/tut/" class=l
   onmousedown="return rwt(this,,, res, 4,
       AFQjCNG2ybB-4tLBf8_ZxyXx5brQsgSYAQ, &sig2=l6txgxnqoadbdzhm8zkn8w )">
  <b>python</b> Tutorial
</a>

Thank you for your attention