CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7: Information Retrieval II Aidan Hogan aidhog@gmail.com

How does Google know about the Web?

Inverted Index: Example
Document 1: Fruitvale Station is a 2013 American drama film written and directed by Ryan Coogler.
Inverted index (term: posting list):
a: (1,2,…)
american: (1,5,…)
and: (1,2,…)
by: (1,2,…)
directed: (1,2,…)
drama: (1,16,…)
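
To make the idea concrete, here is a minimal Python sketch of building an inverted index over the example sentence. The posting layout used here, (document id, term count, list of positions), is an assumption for illustration and may differ from the layout on the slide; the function name `build_inverted_index` is hypothetical, and the tokenisation is deliberately naive.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_count, positions)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        # Naive tokenisation: lower-case, keep runs of letters and digits.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for pos, term in enumerate(tokens, start=1):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)


docs = {
    1: "Fruitvale Station is a 2013 American drama film "
       "written and directed by Ryan Coogler."
}
index = build_inverted_index(docs)
for term in ("a", "american", "drama"):
    print(term, index[term])
```

A keyword lookup then only needs to read the posting list of each query term instead of scanning every document.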

Apache Lucene: Inverted Index. They built one so you don't have to! Open Source in Java.

INFORMATION RETRIEVAL: RANKING

How Does Google Get Such Good Results? Example query: obama

How Does Google Get Such Good Results? Example query: aidan hogan

How does Google Get Such Good Results?

Two Sides to Ranking: Relevance

Two Sides to Ranking: Importance

RANKING: RELEVANCE

Example Query Which of these three keyword terms is most important?

Matches in a Document
freedom: 7 occurrences
movie: 16 occurrences
wallace: 88 occurrences

Usefulness of Words
movie occurs very frequently
freedom occurs frequently
wallace occurs occasionally

Estimating Relevance
Rare words are more important than common words: wallace (49M) is more important than freedom (198M), which is more important than movie (835M).
Words occurring more frequently in a document indicate higher relevance: wallace (88) has more matches than movie (16), which has more matches than freedom (7).

Relevance Measure: TF-IDF
TF: Term Frequency
Measures occurrences of a term in a document; various options:
Raw count of occurrences
Logarithmically scaled
Normalised by document length
A combination / something else

Relevance Measure: TF-IDF
IDF: Inverse Document Frequency
How common a term is across all documents
Logarithmically scaled document occurrences
Note: The rarer the term, the larger the value

Relevance Measure: TF-IDF
TF-IDF: Combine Term Frequency and Inverse Document Frequency
Score for a query: let the query be a set of terms; the score of a document for the query is computed from the TF-IDF values of those terms
(There are other possibilities)
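
One common way to write this down, as a sketch under assumptions rather than the lecture's exact definition: let $N$ be the number of documents, $n_t$ the number of documents containing term $t$, and $f_{t,d}$ the raw count of $t$ in document $d$. These symbols are introduced here for illustration, and the raw-count TF is only one of the options listed earlier.

```latex
\mathrm{tf}(t,d) = f_{t,d}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\mathrm{score}(q,d) = \sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

Under this variant a term that appears in every document has $\mathrm{idf}(t) = \log 1 = 0$ and contributes nothing to the score, which matches the intuition that very common words say little about relevance.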

Relevance Measure: TF-IDF (Term Frequency, Inverse Document Frequency)
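
A minimal Python sketch of TF-IDF scoring may help here. It assumes raw counts for TF and log(N / n_t) for IDF, which is only one of several possible weightings; the function name `tf_idf_scores` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if doc_freq[term] == 0:
                continue  # term occurs nowhere, so it cannot contribute
            tf = counts[term]                        # raw count (assumed TF variant)
            idf = math.log(n_docs / doc_freq[term])  # rarer term, larger value
            score += tf * idf
        scores[doc_id] = score
    return scores


docs = {
    1: "wallace fights for freedom in this movie",
    2: "a movie about a dog",
    3: "another movie review",
}
scores = tf_idf_scores(["freedom", "movie", "wallace"], docs)
print(scores)
```

In this toy collection "movie" occurs in every document, so only "freedom" and "wallace" separate document 1 from the others.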

Vector Space Model (a mention): Cosine Similarity
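
As a rough illustration of cosine similarity in the vector space model, the Python sketch below treats each document or query as a sparse vector stored in a dictionary from term to weight (for example TF-IDF weights). The function name `cosine_similarity` and the example weights are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


query = {"wallace": 1.0, "freedom": 1.0}
doc_a = {"wallace": 3.0, "freedom": 3.0}
doc_b = {"movie": 2.0}
print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Because the measure depends only on the angle between the vectors, a long document is not favoured merely for repeating the same terms more often.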

Relevance Measure: TF-IDF
TF-IDF: Combine Term Frequency and Inverse Document Frequency
Score for a query: let the query be a set of terms; the score of a document for the query is computed from the TF-IDF values of those terms
(There are other possibilities: we could also use cosine similarity between query and document using TF-IDF weights)

Two Sides to Ranking: Relevance

Field-Based Boosting Not all text is equal: titles, headers, etc.
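
One simple way to realise field-based boosting is to score each field separately and combine the scores with per-field weights. The Python sketch below assumes exactly that; the function name `boosted_score`, the field names and the boost values are all hypothetical and not taken from any particular engine.

```python
def boosted_score(field_scores, boosts, default_boost=1.0):
    """Weighted sum of per-field relevance scores."""
    return sum(
        boosts.get(field, default_boost) * score
        for field, score in field_scores.items()
    )


# Hypothetical boosts: a match in the title counts three times as much as in the body.
boosts = {"title": 3.0, "header": 2.0, "body": 1.0}

doc_with_title_match = {"title": 0.5, "body": 0.2}
doc_with_body_match = {"title": 0.0, "body": 0.6}
print(boosted_score(doc_with_title_match, boosts))
print(boosted_score(doc_with_body_match, boosts))
```

With these weights the document matching in its title ranks higher even though its body score is lower.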

Anchor Text See how the Web views/tags a page

Lucene uses relevance scoring. Inverted Index: They built one so you don't have to! Open Source in Java.

RANKING: IMPORTANCE

Two Sides to Ranking: Importance
How could we determine that Barack Obama is more important than Mount Obama as a search result on the Web?

Link Analysis Which will have more links from other pages? The Wikipedia article for Mount Obama? The Wikipedia article for Barack Obama?

Link Analysis
Consider links as votes of confidence in a page.
A hyperlink is the open Web's version of … (even if the page is linked in a negative way).

Link Analysis So if we just count links to a page we can determine its importance and we are done?

Link Spamming

Link Importance So which should count for more? A link from http://en.wikipedia.org/wiki/barack_obama? Or a link from http://freev1agra.com/shop.html?

Link Importance
Maybe we could consider links from some domains as carrying more weight?

PageRank

PageRank
Not just a count of inlinks.
A link from a more important page is more important.
A link from a page with fewer links is more important.
A page with lots of inlinks from important pages (which have few outlinks) is more important.

PageRank is Recursive
The importance of a page depends on the importance of the pages that link to it.

PageRank Model
The Web: a directed graph with vertices (pages) and edges (links).
[Figure: directed graph over vertices a to f, annotated with the scores 0.225, 0.265, 0.138, 0.127, 0.172 and 0.074]
Which vertex is most important?

PageRank: Random Surfer Model
Random surfer: someone surfing the web, clicking links randomly.
What is the probability of being at page x after n hops?
Initial state: the surfer is equally likely to start at any node.
PageRank is applied iteratively for each hop: the score indicates the probability of being at that page after that many hops.
What would happen with g over time?
If the surfer reaches a page without links, the surfer randomly jumps to another page.
What would happen with g and i over time?
[Figure: directed graph over vertices a to f, extended with vertices g and i]

PageRank: Random Surfer Model
The surfer will jump to a random page at any time with probability 1 − d; this avoids traps and ensures convergence!
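
The random surfer with random jumps is commonly written as the recurrence below. This is a sketch of the standard formulation, not necessarily the exact notation of the lecture: $V$ is the set of pages, $u \to v$ means page $u$ links to page $v$, $\mathrm{out}(u)$ is the number of outlinks of $u$, and $d$ is the probability of following a link (values around 0.85 are a common choice). All of these symbols are assumptions introduced here.

```latex
\mathrm{PR}_0(v) = \frac{1}{|V|}, \qquad
\mathrm{PR}_{n+1}(v) = \frac{1-d}{|V|} + d \sum_{u \to v} \frac{\mathrm{PR}_n(u)}{\mathrm{out}(u)}
```

A page without outlinks can be treated as if it linked to every page, which corresponds to the surfer jumping to a random page from a dead end and keeps the scores summing to 1.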

PageRank Model: Final Version
The Web: a directed graph with vertices (pages) and edges (links).
[Figure: directed graph over vertices a to f]
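
The iterative computation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `pagerank`, the fixed number of iterations, the value d = 0.85 and the small example graph are all hypothetical (the edges of the graph on the slides are not recoverable from the transcript).

```python
def pagerank(graph, d=0.85, iterations=50):
    """Iterative PageRank; graph maps each page to the list of pages it links to."""
    nodes = sorted(set(graph) | {v for targets in graph.values() for v in targets})
    n = len(nodes)
    # Initial state: the surfer is equally likely to start at any page.
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # With probability 1 - d the surfer jumps to a random page.
        new_rank = {v: (1.0 - d) / n for v in nodes}
        for u in nodes:
            targets = graph.get(u, [])
            if targets:
                # Otherwise the surfer follows one of u's outlinks uniformly.
                share = d * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Page without links: the surfer jumps to a random page.
                share = d * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical graph: g has no outlinks, d has no inlinks.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"], "g": []}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Each iteration corresponds to one hop of the random surfer, so the scores always sum to 1 and can be read as probabilities.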

PageRank: Benefits
More robust than a simple link count
Fewer ties than link counting
Scalable to approximate (for sparse graphs)
Convergence guaranteed

Two Sides to Ranking: Importance

GOOGLE: A GUESS

How Modern Google ranks results (maybe)
According to a survey of SEO experts, not people in Google.
Why so secretive?

INFORMATION RETRIEVAL: RECAP

How Does Google Get Such Good Results? aidan hogan

Ranking in Information Retrieval
Relevance: Is the document relevant for the query?
Term Frequency * Inverse Document Frequency
Cosine similarity
Importance: Is the document a popular one?
Link analysis
PageRank

Ranking: Science or Art?

Questions?