The Technology Behind. Slides organized by: Sudhanva Gurumurthi


The World Wide Web. In July 2008, Google announced that they had found 1 trillion unique webpages! Billions of new web pages appear each day! About 1 billion Internet users (and growing)!! The World Wide Web in 2003. Image source: http://www.opte.org/

Use a huge number of computers: Data Centers. An ordinary Google search uses 700-1000 machines! Search through a massive number of webpages: MapReduce. Find which webpages match your query: PageRank.

Data Centers. Buildings that house computing equipment. Contain 1000s of computers. Inside a Microsoft Data Center. Google's Data Center in The Dalles, Oregon.

Google's 36 Data Centers

Why do we need so many computers? Searching the Internet is like looking for a needle in a haystack! There are a trillion webpages. There are millions of users. Imagine writing a for- or while-loop to search the contents of each webpage! Use the 1000s of computers in parallel to speed up the search.

Map/Reduce. Adapted from the Lisp programming language. Easy to distribute across many computers. MapReduce slides adapted from Dan Weld's slides at U. Washington: http://rakaposhi.eas.asu.edu/cse494/notes/s07-map-reduce.ppt

Map/Reduce in Lisp: (map f list [list2 list3 ...]) (map square '(1 2 3 4)) → (1 4 9 16) (reduce + '(1 4 9 16)) → (+ 16 (+ 9 (+ 4 1))) → 30
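The same two primitives are available in Python's standard library; a minimal sketch of the Lisp example above:

```python
from functools import reduce

# map applies a function to each element of a sequence
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
print(squares)  # [1, 4, 9, 16]

# reduce folds the sequence down to a single value
total = reduce(lambda a, b: a + b, squares)
print(total)  # 30
```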

Map/Reduce à la Google. map(key, val) is run on each item in the set; it emits new-key / new-val pairs. reduce(key, vals) is run for each unique key emitted by map(); it emits the final output.

Example: Counting words in webpages. Input consists of (url, contents) pairs. map(key=url, val=contents): for each word w in contents, emit (w, 1). reduce(key=word, values=uniq_counts): sum all the 1's in the values list, emit result (word, sum).

Count, Illustrated. map(key=url, val=contents): for each word w in contents, emit (w, 1). reduce(key=word, values=uniq_counts): sum all the 1's in the values list, emit result (word, sum). Input: "see bob throw" and "see spot run". After map: (see, 1) (bob, 1) (throw, 1) (see, 1) (spot, 1) (run, 1). After reduce: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1).
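The word-count example can be sketched in plain Python, simulating the map, group-by-key, and reduce phases on the two pages above (the names map_fn and reduce_fn are illustrative, not Google's API):

```python
from collections import defaultdict

def map_fn(url, contents):
    # emit (word, 1) for each word on the page
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    # sum the 1's emitted for this word
    return (word, sum(counts))

pages = [("url1", "see bob throw"), ("url2", "see spot run")]

# group the map output by key, then reduce each key
grouped = defaultdict(list)
for url, contents in pages:
    for word, count in map_fn(url, contents):
        grouped[word].append(count)

result = dict(reduce_fn(w, c) for w, c in grouped.items())
print(result)  # {'see': 2, 'bob': 1, 'throw': 1, 'spot': 1, 'run': 1}
```

In the real system the grouping step is the shuffle, where workers exchange map output over the network before reducing.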

Map/Reduce Job Processing. A Master coordinates Workers 0-5 for the count job: 1. Client submits the count job, indicating code and input data. 2. Master breaks the input data into 6 chunks and assigns work to Workers. 3. After map(), Workers exchange map output so that they can perform the reduce() function. 4. Master breaks the reduce() keyspace into 6 chunks and assigns work to the Workers. 5. The final reduce() step is done by the Master.

The Life of a Google Query: MapReduce + PageRank. Image source: www.google.com/corporate/tech.html

Finding the Right Websites for a Query. Relevance: is the document similar to the query term? Importance: is the document useful to a variety of users? Search engine approaches: paid advertisers, manually created classification, feature detection (based on title, text, anchors, ...), "popularity".

Google's PageRank Algorithm. Measures the popularity of pages based on the hyperlink structure of the Web. Google founders Larry Page and Sergey Brin.

90-10 Rule Model. The web surfer chooses the next page as follows: 90% of the time, the surfer clicks a link on the current page; 10% of the time, the surfer types a random page. This is a crude, but useful, web-surfing model: no one chooses links on a page with equal probability, the 90-10 breakdown is just a guess, and it does not take the back button or bookmarks into account.
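The 90-10 surfer can be simulated directly. A minimal sketch, assuming a made-up 4-page link structure (not from the slides); the long-run fraction of time spent on each page approximates its PageRank:

```python
import random

# Hypothetical link structure for illustration:
# each page maps to the pages it links to.
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pages = list(links)

random.seed(0)  # fixed seed so the run is repeatable
visits = {p: 0 for p in pages}
page = "A"
steps = 100_000
for _ in range(steps):
    if random.random() < 0.9:
        page = random.choice(links[page])  # 90%: click a link on the current page
    else:
        page = random.choice(pages)        # 10%: jump to a random page
    visits[page] += 1

# fraction of time spent on each page
rank = {p: visits[p] / steps for p in pages}
```

Page D has no inbound links here, so the surfer only reaches it by random jumps, and its rank ends up lowest.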

Basic Ideas Behind PageRank. PageRank is a probability distribution that denotes the likelihood that the random surfer will arrive at a particular webpage. Links coming from important pages convey more importance to a page: if a web page has a link from the CNN home page, it may be just one link, but it is a very important one. A page has high rank if the sum of the ranks of its inbound links is high. This covers both the case where a page has many inbound links and the case where a page has a few highly ranked inbound links.

The PageRank Algorithm. Assume that there are only 4 pages, A, B, C, and D, and that the distribution is evenly divided among the pages: PR(A) = PR(B) = PR(C) = PR(D) = 0.25

If B, C, and D each link only to A, then B, C, and D each confer their 0.25 PageRank to A: PR(A) = PR(B) + PR(C) + PR(D) = 0.75

Now assume that, in addition to linking to A, B also links to C, and D also links to B and C. The value of a page's link-votes is divided among its outbound links: B gives a vote worth 0.125 to A and to C, and D gives a vote worth 0.083 to each of A, B, and C. PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3

PageRank. The PageRank of any page u is: PR(u) = Σ_{v ∈ B_u} PR(v) / L(v), where B_u is the set of all pages linking to page u, and L(v) is the number of outbound links from page v.
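Iterating that update rule until it stabilizes yields the PageRank values. A minimal sketch on the 4-page example from the preceding slides; A's outlink to B is an assumption added so that every page has at least one outbound link:

```python
# B links to A and C, C links to A, D links to A, B, and C.
# A -> B is an added assumption, not part of the slide example.
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pr = {p: 0.25 for p in links}  # start from the even distribution

for _ in range(50):  # iterate the update rule toward a fixed point
    new = {p: 0.0 for p in links}
    for v, outs in links.items():
        share = pr[v] / len(outs)  # PR(v) / L(v)
        for u in outs:
            new[u] += share        # page u sums PR(v)/L(v) over all v in B_u
    pr = new
```

Because every page passes all of its rank along its outbound links, the total rank stays at 1; D, which nothing links to, drops to rank 0 in this simplified version without the 10% random jump.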

References. The paper by Larry Page and Sergey Brin that describes their Google prototype: http://infolab.stanford.edu/~backrub/google.html. The paper by Jeffrey Dean and Sanjay Ghemawat that describes MapReduce: http://labs.google.com/papers/mapreduce.html. Wikipedia article on PageRank: http://en.wikipedia.org/wiki/PageRank