Hyperlink-Induced Topic Search (HITS) over Wikipedia Articles using Apache Spark

Due: Wednesday, Sept. 27, 5:00 PM
Submission: via Canvas, individual submission
Instructor: Sangmi Lee Pallickara
Web page: http://www.cs.colostate.edu/~cs535/assignments.html

Objectives

The goal of this programming assignment is to enable you to gain experience in:
- Installing and using analytics tools such as HDFS and Apache Spark
- Generating the root set and base set based on link information
- Implementing iterative algorithms to estimate hub and authority scores of Wikipedia articles using Wikipedia dump data

1. Overview

In this assignment, you will design and implement a system that extracts a subgraph of the currently available Wikipedia articles [3] and analyzes it with Hyperlink-Induced Topic Search (HITS) [1][2]. HITS, also known as "hubs and authorities," is a link analysis algorithm designed to find the key pages of specific web communities. HITS relies on two values associated with each page: its hub score and its authority score. For any query, you compute two ranked lists of results: one ranking is induced by the hub scores and the other by the authority scores.

This approach stems from a particular insight into the early creation of web pages: there are two primary kinds of web pages that are useful as results for broad-topic searches. For example, consider an informational query such as "I wish to learn about the eclipse." There are authoritative sources of information on the topic; in this case, the National Aeronautics and Space Administration (NASA)'s page on eclipses would be such a page. We call such pages authorities. In contrast, there are pages containing hand-compiled lists of links to authoritative web pages on a particular topic. These hub pages are not themselves authoritative sources on the topic, but they provide aggregated information. We use the hub pages to discover the authority pages.

[1] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Chapter 21, Link analysis, Introduction to Information Retrieval, https://nlp.stanford.edu/ir-book/pdf/21link.pdf
[2] U.S. Patent 6,112,202, https://www.google.com/patents/us6112202
[3] https://meta.wikimedia.org/wiki/data_dumps

1.1. Calculating scores

A good hub page is one that points to many good authorities; a good authority page is one that is pointed to by many good hub pages. The definition of hubs and authorities is therefore circular, and we turn it into an iterative computation. Suppose that we have a subset of the web containing good hub and authority pages, with hyperlinks among them. We will iteratively compute a hub score and an authority score for every web page in this subset. For a web page v in our subset of the web, we use Hub(v) to denote its hub score and Authority(v) to denote its authority score. Initially, we set Hub(v) = Authority(v) = 1 for all nodes v. We also denote by v → y the existence of a hyperlink from v to y. The core of the iterative algorithm is a pair of updates to the hub and authority scores of all pages. The following equations capture the intuitive notions that good hubs point to good authorities and that good authorities are pointed to by good hubs:

    Hub(v) = Σ_{v→y} Authority(y)
    Authority(v) = Σ_{y→v} Hub(y)

The first equation sets the hub score of page v to the sum of the authority scores of the pages it links to; in other words, if v links to pages with high authority scores, its hub score increases. The second equation plays the reverse role: if page v is linked to by good hubs, its authority score increases.

1.2. Choosing the subset of the Web

In assembling a subset of web pages around a topic such as "eclipse," we must cope with the fact that good authority pages may not contain the specific query term "eclipse." For instance, many pages on the Apple website are authoritative sources of information on computer hardware, even though these pages may not contain the terms "computer" or "hardware." However, a hub compiling computer hardware resources is likely to use these terms and also to link to the relevant pages on the Apple website. To compile the subset of web pages over which hub and authority scores are computed, HITS requires the following steps. Note that we consider only internal links, i.e., links between Wikipedia pages.

Step 1. Given a query (e.g., "eclipse"; there are more than 100 Wikipedia pages whose titles contain the text "eclipse"), get all pages containing your topic in their title. Call this the root set of pages.

Step 2. Build a base set of pages that includes the root set as well as any page that either links to a page in the root set, or is linked to by a page in the root set.
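For concreteness, here is a minimal Spark (Scala) sketch of Steps 1 and 2 over the dataset described in Section 4. The HDFS paths, the object and variable names, and the case-insensitive title match are illustrative assumptions, not requirements of the assignment:

    import org.apache.spark.sql.SparkSession

    object BaseSetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("HITS-BaseSet").getOrCreate()
        val sc = spark.sparkContext
        val query = args(0).toLowerCase  // user-specified topic, e.g. "eclipse"

        // titles-sorted.txt: the n-th line (1-indexed) holds the title of page n.
        val titles = sc.textFile("hdfs:///data/titles-sorted.txt")
          .zipWithIndex()                                  // 0-based line index
          .map { case (title, idx) => (idx + 1, title) }   // page ids are 1-based

        // Step 1: the root set is every page whose title contains the query term.
        val rootSet = titles
          .filter { case (_, title) => title.toLowerCase.contains(query) }
          .keys

        // links-simple-sorted.txt: "from: to1 to2 to3 ..." (one line per source page).
        val edges = sc.textFile("hdfs:///data/links-simple-sorted.txt")
          .flatMap { line =>
            val Array(from, tos) = line.split(":", 2)
            tos.trim.split("\\s+").filter(_.nonEmpty)
              .map(to => (from.trim.toLong, to.toLong))
          }

        // Broadcasting the root ids lets the neighbor filters run without a shuffle.
        val root = sc.broadcast(rootSet.collect().toSet)

        // Step 2: base set = root set plus all in- and out-neighbors of root pages.
        val outNeighbors = edges.filter { case (from, _) => root.value(from) }.values
        val inNeighbors  = edges.filter { case (_, to)   => root.value(to)   }.keys
        val baseSet = (rootSet ++ outNeighbors ++ inNeighbors).distinct()
        baseSet.saveAsTextFile("hdfs:///output/base-set")  // illustrative output path
      }
    }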

Step 3. We then use the base set for computing hub and authority scores. The initial hub and authority scores of pages in the base set should be set to 1 before the iterations. The iterations should continue automatically until the values converge, and you should normalize the set of values after every iteration: divide each authority value by the sum of all authority values, and divide each hub value by the sum of all hub values. The hub and authority scores can also be calculated using eigenvectors and Singular Value Decomposition (SVD); in programming assignment 1, however, you must implement the algorithm with the iterative method. The final hub and authority scores are determined only after a large number of repetitions of the algorithm.

1.3. Example

Figure 1 shows an example of calculating hub scores for each page. In the diagram, each node depicts a Wikipedia page, and the edges represent the links included in the source page.

Figure 1. Example with 4 Wikipedia pages: A, B, C, and D

Initially, the hub and authority scores for each page are set to 1:

    Authority_0(A) = Authority_0(B) = Authority_0(C) = Authority_0(D) = 1
    Hub_0(A) = Hub_0(B) = Hub_0(C) = Hub_0(D) = 1

In the first iteration, you calculate the authority scores:

    Authority_1(A) = Hub_0(B) = 1
    Authority_1(B) = Hub_0(A) = 1
    Authority_1(C) = Hub_0(A) + Hub_0(B) = 2
    Authority_1(D) = Hub_0(A) + Hub_0(B) + Hub_0(C) = 3

After normalizing these values with the sum of the new authority scores:

    Authority_1(A) = 1/(1 + 1 + 2 + 3) = 0.143
    Authority_1(B) = 1/(1 + 1 + 2 + 3) = 0.143
    Authority_1(C) = 2/(1 + 1 + 2 + 3) = 0.286
    Authority_1(D) = 3/(1 + 1 + 2 + 3) = 0.429

Then, calculate the hub scores:

    Hub_1(A) = Authority_1(B) + Authority_1(C) + Authority_1(D) = 0.858
    Hub_1(B) = Authority_1(A) + Authority_1(C) + Authority_1(D) = 0.858
    Hub_1(C) = Authority_1(D) = 0.429
    Hub_1(D) = 0

After normalizing these values with the sum of the new hub scores:

    Hub_1(A) = 0.858/(0.858 + 0.858 + 0.429 + 0) = 0.400
    Hub_1(B) = 0.858/(0.858 + 0.858 + 0.429 + 0) = 0.400
    Hub_1(C) = 0.429/(0.858 + 0.858 + 0.429 + 0) = 0.200
    Hub_1(D) = 0

You should repeat this iteration until the hub and authority scores have converged.
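The iteration maps naturally onto Spark pair-RDD joins. Below is a minimal sketch, assuming an RDD edges of (from, to) pairs restricted to the base set and an RDD pages of base-set ids (as built in the sketch above); the function name hits and the threshold epsilon are illustrative choices, since the assignment only requires iterating until the values converge:

    import org.apache.spark.rdd.RDD

    def hits(edges: RDD[(Long, Long)], pages: RDD[Long], epsilon: Double = 1e-6)
        : (RDD[(Long, Double)], RDD[(Long, Double)]) = {
      var hub: RDD[(Long, Double)] = pages.map(p => (p, 1.0)).cache()
      var auth: RDD[(Long, Double)] = pages.map(p => (p, 1.0)).cache()
      var delta = Double.MaxValue

      while (delta > epsilon) {
        // Authority(v) = sum of Hub(y) over all y with y -> v, then normalize.
        val authRaw = edges.join(hub)                       // (from, (to, Hub(from)))
          .map { case (_, (to, h)) => (to, h) }
          .reduceByKey(_ + _)
        val authTotal = authRaw.values.sum()
        val newAuth = pages.map(p => (p, ())).leftOuterJoin(authRaw)
          .mapValues { case (_, s) => s.getOrElse(0.0) / authTotal }.cache()

        // Hub(v) = sum of Authority(y) over all y with v -> y, then normalize.
        val hubRaw = edges.map { case (from, to) => (to, from) }.join(newAuth)
          .map { case (_, (from, a)) => (from, a) }
          .reduceByKey(_ + _)
        val hubTotal = hubRaw.values.sum()
        val newHub = pages.map(p => (p, ())).leftOuterJoin(hubRaw)
          .mapValues { case (_, s) => s.getOrElse(0.0) / hubTotal }.cache()

        // Converged once no score moves by more than epsilon between iterations.
        delta = math.max(
          auth.join(newAuth).values.map { case (o, n) => math.abs(o - n) }.max(),
          hub.join(newHub).values.map { case (o, n) => math.abs(o - n) }.max())
        auth = newAuth
        hub = newHub
      }
      (hub, auth)
    }

The ranked lists required in Section 2 can then be produced by joining the returned scores with the titles RDD and sorting in descending order of score.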

2. Requirements of Programming Assignment 1

Your implementation of this assignment should provide:

A. A selection of your root set based on a specified topic. This must be implemented with Apache Spark.

B. The base set for hubs and authorities. This must be implemented with Apache Spark.

C. A sorted list of Wikipedia pages, in descending order of hub score, for the user-specified (non-hardcoded) topic. Each entry should contain the title of the article and its hub value. This computation should be performed on your own Spark cluster with 5 machines, and you should present at least 50 pages from the list. This must be implemented with Apache Spark.

D. A sorted list of Wikipedia pages, in descending order of authority score, for the user-specified (non-hardcoded) topic. Each entry should contain the title of the article and its authority value. This computation should be performed on your own Spark cluster with 5 machines, and you should present at least 50 pages from the list. This must be implemented with Apache Spark.

You are not allowed to use existing HITS implementations; a 100% deduction will be assessed in that case. Demonstration of your software will take place on machines in CSB-120 and will include an interview discussing implementation and design details. Your submission should include your source code and outputs, and should be made via Canvas.

3. Programming Environment

A. Requirements. All software demonstrations for this assignment will be done in CSB-120. You should make sure your software works on the machines in CSB-120.

B. Set up your Hadoop cluster. This assignment requires setting up your own HDFS and Spark cluster on every node. For writing your intermediate and output data, you should use your own cluster setup.
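For reference, launching such a job on a standalone Spark cluster typically looks like the following; the class name, assembly jar, and master host are placeholders for illustration, not values prescribed by the assignment:

    $SPARK_HOME/bin/spark-submit \
      --class cs535.HITS \
      --master spark://<master-host>:7077 \
      hits-assembly.jar eclipse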

4. Dataset

The original dump dataset from Wikipedia is 12.1 GB as a bz2 file, and about 50 GB when decompressed. It follows an XML format and includes other information that is not relevant to this task. To streamline your work on this assignment, we will use the Wikipedia dump [4] generated by Henry Haselgrove. Please find the instructions for downloading the data file from CSB-120 (instead of the external server) on the course assignment web page. If you have any problem with storage capacity, you should contact Prof. Pallickara immediately.

The dataset contains two files:

    links-simple-sorted.zip (323 MB)
    titles-sorted.zip (28 MB)

These files contain all links between proper Wikipedia pages, including disambiguation pages and redirect pages. Only the English-language Wikipedia is included. In links-simple-sorted.txt, there is one line for each page that has outgoing links. The format of the lines is:

    from1: to11 to12 to13 ...
    from2: to21 to22 to23 ...
    ...

where from1 is an integer labeling a page that has links from it, and to11, to12, to13, ... are integers labeling all the pages that the page links to. To find the page title that corresponds to the integer n, look up the n-th line in titles-sorted.txt, a UTF-8 encoded text file.

5. Grading

Your submission will be graded based on the demonstration to the professor in CSB-120. All of the items described in Section 2 will be evaluated. This assignment accounts for 15% of your final course grade:

    2 points: Generating the root set
    2 points: Generating the base set
    5 points: Generating a sorted list based on the hub scores
    5 points: Generating a sorted list based on the authority scores
    1 point: Setting up the cluster

6. Late Policy

Please check the late policy posted on the course web page.

[4] https://wayback.archive.org/web/20160516004046/http://haselgrove.id.au/wikipedia.htm