Final Exam Search Engines (11-442 / 11-642) December 8, 2014


Student Name: Andrew ID: Seat Number:

Final Exam Search Engines (11-442 / 11-642) December 8, 2014

Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points will be deducted for irrelevant details. Use the back of the pages if you need more room for your answer. Calculators, phones, and other computational devices are not permitted. If a calculation is required, write fractions, and show your work so that it is clear that you know how to do the calculation.

1. Distributed vs. Selective Search

You have a collection of 100 million documents that will be searched using 10 machines. Your boss wants you to use the selective search approach to searching this collection, using 1,000 equally-sized index shards. The system uses ReDDE as the resource selection algorithm, and searches 10 shards per query. Compare the selective search system with a distributed search system that divides the collection randomly into 10 equally-sized shards and searches all shards. Thus, both configurations (distributed and selective) search 10 shards for each query.

a. Real-world search systems often care about latency, the amount of time a query takes to finish processing. When a system searches multiple shards in parallel, the latency is determined by the time taken to search the slowest shard, plus any overheads or fixed search costs. Assume that your system can search up to 10 indexes in parallel. The typical distributed search system will have a latency equivalent to searching one 10 million document index, because all 10 shards are processed in parallel. If selective search uses a 1% CSI for ReDDE, which system (selective search or distributed search) has a lower latency? What about a 10% CSI? Assume shards have a processing cost proportional to the number of documents in the index. [12 points]

b. ReDDE is an unsupervised resource selection algorithm. You could also use a supervised resource selection algorithm, like the one discussed in class for federated search. Which type of resource selection algorithm, unsupervised or supervised, do you expect to be faster at search time? Why? Be sure to consider the cost of the features used. [4 points]
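The cost model stated in part (a) can be sketched in a few lines. This is an illustrative sketch only, under the exam's own assumptions (cost proportional to documents scored, parallel shards cost as much as the largest shard); the function names and the choice to measure latency in "documents" are mine.

```python
# Latency model from Question 1a, measured in documents processed.
# Assumption (mine): selective search first scores the CSI sample,
# then searches its 10 selected shards in parallel.

COLLECTION = 100_000_000

def distributed_latency():
    # 10 equally-sized shards searched fully in parallel:
    # latency is the cost of one 10M-document shard.
    return COLLECTION // 10

def selective_latency(csi_percent):
    # ReDDE scores the CSI first, then the 10 chosen shards
    # (each 100M / 1,000 = 100K documents) run in parallel.
    csi_cost = COLLECTION * csi_percent // 100
    shard_cost = COLLECTION // 1_000
    return csi_cost + shard_cost

print(distributed_latency())   # 10000000
print(selective_latency(1))    # 1100000
print(selective_latency(10))   # 10100000
```

With a 1% CSI the selective configuration processes far fewer documents on its critical path; with a 10% CSI the CSI alone already matches the distributed system's latency.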

2. Diversification

a. Describe why and when a search engine may need to diversify its search results. In class, we discussed two types of queries that benefit from diversification. Your answer must address both query types. [6 points]

b. Provide an example query for each type described above, and show why it needs to be diversified (it is not enough to just state that it needs to be diversified). Your examples must be original. You may not use a query that was in any of the lecture notes. [4 points]

c. Describe the procedures of implicit search result diversification and explicit search result diversification. Compare them, and describe their differences. [11 points]
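One classic instance of the implicit approach asked about in part (c) is Maximal Marginal Relevance (MMR), which diversifies using only inter-document similarity, without an explicit model of query aspects. A minimal sketch, with hypothetical relevance scores and similarities:

```python
# Implicit diversification via MMR: greedily select documents that
# balance relevance against similarity to already-selected documents.
# All scores below are illustrative, not from any real ranker.

def mmr_rerank(candidates, relevance, similarity, lam=0.5, k=3):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            # Penalty: similarity to the most similar selected document.
            max_sim = max((similarity[d][s] for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * max_sim
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

rel = {"d1": 0.9, "d2": 0.85, "d3": 0.5}
sim = {"d1": {"d2": 0.95, "d3": 0.1},
       "d2": {"d1": 0.95, "d3": 0.1},
       "d3": {"d1": 0.1, "d2": 0.1}}
print(mmr_rerank(["d1", "d2", "d3"], rel, sim))  # ['d1', 'd3', 'd2']
```

Note how d2, although more relevant than d3, is demoted because it is nearly a duplicate of d1; an explicit method would instead score documents against modeled query aspects.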

3. HITS

a. The root set of a query is {P1, P2, P3, P4, P5} in the web graph shown below. Provide the authority and hub scores of each page after 2 iterations; if HITS would ignore the page, write n/a. The table below is provided to help guide your work. Scores can be fractions. If you need to calculate a square root, round the result to the nearest integer to simplify your calculation; be clear about how you did this so that we will understand your answer. [12 points]

Authority            P1  P2  P3  P4  P5  P6  P7  P8  P9  P10  P11
Initial Value
Iteration 1
After Normalization
Iteration 2
After Normalization

Hub                  P1  P2  P3  P4  P5  P6  P7  P8  P9  P10  P11
Initial Value
Iteration 1
After Normalization
Iteration 2
After Normalization

b. At the end of iteration 2, which nodes have the highest hub and authority scores? Why do they have the highest scores? [4 points]
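The exam's web graph figure is not reproduced in this text, so the table cannot be filled in here. As a sketch of the update rule Question 3 exercises, here is the standard HITS iteration run on a small hypothetical graph (the graph, and the choice of Euclidean normalization, are my assumptions):

```python
# Sketch of the HITS algorithm on a made-up 3-page graph.
# links maps each page to the list of pages it links to.
import math

def hits(links, iterations=2):
    pages = set(links) | {p for outs in links.values() for p in outs}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages that link in.
        auth = {p: sum(hub[q] for q in links if p in links[q]) for p in pages}
        # Hub score: sum of authority scores of pages linked to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize each vector to unit Euclidean length.
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

# P1 links to P2 and P3; P2 links to P3.
auth, hub = hits({"P1": ["P2", "P3"], "P2": ["P3"], "P3": []})
```

In this toy graph P3, which receives the most links, ends up with the highest authority score, and P1, which links to both other pages, ends up with the highest hub score.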

4. Personalization

a. Personalization methods assume that a person's interests are known to the search engine. Describe how the search engine acquires this information, and how it represents it (i.e., what kind of model of the user it creates). [9 points]

b. Describe how the search engine can use its model of a person's interests to produce a personalized ranking for a query. [6 points]

c. Most personalization methods assume that your interests do not change. However, analysis shows that 6-8% of a person's queries are about topics that are atypical for that individual. How can a search engine provide a personalized search experience for these queries? [6 points]
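One common family of answers to parts (a) and (b) represents the user as a term-weight vector built from search history and re-ranks by blending the retrieval score with profile similarity. The sketch below assumes that representation; the vectors, the blend weight alpha, and the function names are illustrative, not a prescribed solution.

```python
# Profile-based re-ranking: combine the original retrieval score with
# cosine similarity between a user-profile vector and each document vector.
# All vectors and weights here are hypothetical examples.
import math

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def personalize(results, profile, alpha=0.7):
    """results: list of (doc_id, retrieval_score, doc_term_vector)."""
    rescored = [
        (doc_id, alpha * score + (1 - alpha) * cosine(profile, vec))
        for doc_id, score, vec in results
    ]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

profile = {"python": 1.0, "programming": 0.5}
results = [("d1", 0.50, {"snake": 1.0}),
           ("d2", 0.45, {"python": 1.0, "programming": 1.0})]
ranking = personalize(results, profile)  # d2 is promoted above d1
```

For part (c), a blend weight like alpha can itself be raised toward 1.0 when the query looks atypical for the user, falling back to the unpersonalized ranking.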

5. Query Suggestions

a. Describe an approach to generating query suggestions for a web search engine. You may assume that you have access to a search engine log that contains a unique user id; the user's query; information about the search results and clicked web pages; and time stamps. If your method requires additional information in the search engine log, describe what it needs and why. [8 points]

b. Describe an approach to generating query suggestions in an enterprise search environment. You may assume that you have access to an enterprise search log if you wish, but this cannot be your only source of information; your method must also use information that is specific to the enterprise. Describe how these suggestions may differ from the types of suggestions produced by a web search engine. [8 points]
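One simple log-based approach of the kind part (a) asks for is to suggest the queries that most often follow the input query within the same user session. A minimal sketch; the session grouping (already segmented by user id and time stamps) is assumed to be done upstream:

```python
# Session-based query suggestion: count which reformulations follow
# each query within a session, then suggest the most frequent ones.
from collections import Counter, defaultdict

def build_suggestions(sessions):
    """sessions: list of query sequences, one list per user session."""
    follows = defaultdict(Counter)
    for queries in sessions:
        for q, next_q in zip(queries, queries[1:]):
            if next_q != q:
                follows[q][next_q] += 1
    return follows

def suggest(follows, query, k=3):
    return [q for q, _ in follows[query].most_common(k)]

sessions = [["jaguar", "jaguar car"],
            ["jaguar", "jaguar animal"],
            ["jaguar", "jaguar car"]]
model = build_suggestions(sessions)
print(suggest(model, "jaguar"))  # ['jaguar car', 'jaguar animal']
```

Click information from the log could further filter the suggestions to reformulations that led to a clicked result; in an enterprise setting (part b), enterprise-specific sources such as an internal glossary or org chart would supplement the sparse log.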

6. Page Quality

Suppose that you work for a web search company. Your job is to develop a classifier that will use the page URL, inlink text, and contents (i.e., not the web graph) to determine the likelihood that a web page is of low quality ("spam"). Describe 5 features that you might use in your classifier, and explain why each feature might be expected to be effective. At least 2 of your features must use the page URL, and at least 2 of your features must use the page contents. Partial or no credit will be given for features that are minor variations of the same idea. [10 points]
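To make the question concrete, here is a sketch of a few URL- and content-based features of the kind it asks for. The specific features and the naive hostname parsing are my own illustrative choices, not a model answer:

```python
# Example spam-signal features. Long, hyphen- and digit-heavy hostnames
# often indicate keyword-stuffed spam domains; low vocabulary diversity
# in the body text can indicate keyword stuffing.
import re

def url_features(url):
    # Naive hostname extraction; assumes a well-formed http(s) URL.
    host = url.split("/")[2] if "//" in url else url.split("/")[0]
    return {
        "url_length": len(url),
        "num_digits_in_host": sum(c.isdigit() for c in host),
        "num_hyphens_in_host": host.count("-"),
    }

def content_features(text):
    words = re.findall(r"\w+", text.lower())
    return {
        "num_words": len(words),
        # Distinct words / total words: very low values suggest stuffing.
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
    }

f = url_features("http://buy-cheap-pills-123.example.com/page")
g = content_features("cheap pills cheap pills cheap pills")
```

A fifth feature using inlink text (e.g., the fraction of inlinks whose anchor text is a commercial keyword) would complete the mix the question requires.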