CS371R: Final Exam Dec. 18, 2017

Similar documents
Birkbeck (University of London)

Mining Web Data. Lijun Zhang

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

Classification. 1st Semester 2007/2008

VALLIAMMAI ENGINEERING COLLEGE SRM Nagar, Kattankulathur DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING QUESTION BANK VII SEMESTER

Part I: Data Mining Foundations

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Information Retrieval. CS630 Representing and Accessing Digital Information. What is a Retrieval Model? Basic IR Processes

Based on Raymond J. Mooney's slides

1 Document Classification [60 points]

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Machine Learning using MapReduce

Search Engines Information Retrieval in Practice

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Exam IST 441 Spring 2014

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

Building Search Applications

Exam IST 441 Spring 2011

CSE 494: Information Retrieval, Mining and Integration on the Internet

SOCIAL MEDIA MINING. Data Mining Essentials

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

VECTOR SPACE CLASSIFICATION

CSE 158. Web Mining and Recommender Systems. Midterm recap

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

60-538: Information Retrieval

Chapter 9. Classification and Clustering

Search Engines. Information Retrieval in Practice

Representation/Indexing (fig 1.2). IR models - overview (fig 2.1). IR models - vector space. Weighting TF*IDF. User. Tasks

Chapter 27 Introduction to Information Retrieval and Web Search

Information Retrieval. (M&S Ch 15)

Midterm Examination CS540-2: Introduction to Artificial Intelligence

Today's topic CS347. Results list clustering example. Why cluster documents. Clustering documents. Lecture 8 May 7, 2001 Prabhakar Raghavan

Information Retrieval

Keyword Extraction by KNN considering Similarity among Features

Automatic Summarization

Multimedia Information Systems

Chapter 6: Information Retrieval and Web Search. An introduction

ΕΠΛ660. Retrieval with the vector space model

Clustering. Partition unlabeled examples into disjoint subsets of clusters, such that:

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

Unsupervised Learning. Pantelis P. Analytis. Introduction. Finding structure in graphs. Clustering analysis. Dimensionality reduction.

Computer Vision. Exercise Session 10 Image Categorization

10-701/15-781, Fall 2006, Final

Clustering Results. Result List Example. Clustering Results. Information Retrieval

I_n = number of words appearing exactly n times, N = number of words in the collection of words, A = a constant. For example, if N=100 and the most

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Optimal Query. Assume that the relevant set of documents C_r is known, where N is the total number of documents.

A Survey on Web Information Retrieval Technologies

Natural Language Processing

Information Retrieval

Semantic Estimation for Texts in Software Engineering

Reading group on Ontologies and NLP:

Efficient query processing

Classification and Clustering

Information Retrieval. Information Retrieval and Web Search

CIS 520, Machine Learning, Fall 2015: Assignment 7 Due: Mon, Nov 16, :59pm, PDF to Canvas [100 points]

Collective Intelligence in Action

Semi-Supervised Clustering with Partial Background Information

CS54701: Information Retrieval

Exam in course TDT4215 Web Intelligence - Solutions and guidelines - Wednesday June 4, 2008 Time:

CSE 158, Fall 2017: Midterm

Recommender Systems. Collaborative Filtering & Content-Based Recommending

Statistics 202: Statistical Aspects of Data Mining

What's to come. There will be a few more topics we will cover on supervised learning

Lecture 8 May 7, Prabhakar Raghavan

CS 6320 Natural Language Processing

Text Categorization (I)

6.034 Quiz 2, Spring 2005


Relevance Feedback and Query Reformulation. Lecture 10 CS 510 Information Retrieval on the Internet Thanks to Susan Price. Outline

CSE 494/598 Lecture-11: Clustering & Classification

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 2013 ISSN:

Text Analytics (Text Mining)

Exam IST 441 Spring 2013

Information Retrieval and Web Search

Remaining Enhanced Labs

Introduction to Text Mining. Hongning Wang

Final Exam. Introduction to Artificial Intelligence. CS 188 Spring 2010 INSTRUCTIONS. You have 3 hours.

Midterm Exam Search Engines ( / ) October 20, 2015

Lemur (Draft last modified 12/08/2010)

University of Maryland. Tuesday, March 2, 2010

Midterm Examination CS 540-2: Introduction to Artificial Intelligence

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BIG DATA SCIENTIST Certification. Big Data Scientist

Topics du jour CS347. Centroid/NN. Example

Computer Science 572 Exam Prof. Horowitz Tuesday, April 24, 2017, 8:00am - 9:00am

USC Viterbi School of Engineering

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

Multi-label classification using rule-based classifier systems

An Introduction to Search Engines and Web Navigation

CSCI544, Fall 2016: Assignment 1

Machine Learning Classifiers and Boosting

CS371R: Final Exam Dec. 18, 2017

NAME:

This exam has 11 problems and 16 pages. Before beginning, be sure your exam is complete. In order to maximize your chance of getting partial credit, show all of your work and intermediate results. Final grades will be available on Canvas on or before Dec. 21. Good luck and have a good break!

1. (8 points) Assuming simple term-frequency weights (no IDF factor), no length normalization, and NO stop words, compute the cosine similarity of the following two simple documents:

(a) five thousand five hundred and fifty five dollars

(b) fifty six thousand five hundred sixty five dollars
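A minimal Python sketch of this computation, assuming raw term-frequency vectors over whitespace-separated tokens and the standard cosine formula:

from collections import Counter
import math

def cosine(doc_a, doc_b):
    # Raw term-frequency vectors: no IDF, no stop-word removal.
    ta, tb = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(ta[w] * tb[w] for w in ta)
    norm_a = math.sqrt(sum(v * v for v in ta.values()))
    norm_b = math.sqrt(sum(v * v for v in tb.values()))
    return dot / (norm_a * norm_b)  # cosine = dot product over norms

print(cosine("five thousand five hundred and fifty five dollars",
             "fifty six thousand five hundred sixty five dollars"))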

2. (8 points) Graphically illustrate, with an array of linked-list diagrams, the basic inverted index constructed for the following simple corpus:

Doc1: school of business
Doc2: school of social work
Doc3: school of public affairs

You are given the following list of stop words: at, in, of, on. Perform stop-word removal and order the tokens in the inverted index alphabetically. You need to include TF information, but neither IDF nor document-vector lengths.
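A rough Python sketch of the index construction being asked for (the exam wants a drawn array-of-linked-lists diagram; a dictionary of postings lists plays the same role here):

from collections import defaultdict, Counter

STOP_WORDS = {"at", "in", "of", "on"}

def build_index(docs):
    # token -> postings list of (doc id, term frequency) pairs
    index = defaultdict(list)
    for doc_id, text in docs.items():
        counts = Counter(w for w in text.split() if w not in STOP_WORDS)
        for token, tf in counts.items():
            index[token].append((doc_id, tf))
    return dict(sorted(index.items()))  # tokens in alphabetical order

docs = {"Doc1": "school of business",
        "Doc2": "school of social work",
        "Doc3": "school of public affairs"}
for token, postings in build_index(docs).items():
    print(token, "->", postings)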

3. (8 points) Consider the following web pages and the set of web pages that they link to:

Page A points to pages B, D, and E.
Page B points to pages C and E.
Page C points to pages F and G.
Page D points to page G.
Page G points to page E.

Show the order in which the pages are indexed when starting at page A and using a breadth-first spider (with duplicate page detection) as implemented in the course Spider class. Assume links on a page are examined in the order given above.
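A minimal Python sketch of the traversal, assuming pages are marked visited when first enqueued (whether that matches the course Spider class exactly should be checked against its code):

from collections import deque

LINKS = {"A": ["B", "D", "E"], "B": ["C", "E"],
         "C": ["F", "G"], "D": ["G"], "G": ["E"]}

def bfs_order(start):
    visited = {start}           # duplicate page detection
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)      # a page is indexed when dequeued
        for target in LINKS.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order

print(bfs_order("A"))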

4. (9 points) Consider the following web pages and the set of web pages that they link to:

Page A points to pages B, C, and E.
Page D points to pages B and C.
All other pages have no outgoing links.

Consider running the HITS (Hubs and Authorities) algorithm on this subgraph of pages. Simulate the algorithm for three iterations. Show the authority and hub scores for each page twice in each iteration, both before and after normalization; order the elements in the vectors in the sequence A, B, C, D, E.
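A compact Python sketch of the HITS updates; the 2-norm is assumed for the normalization step, which may differ from the convention used in class:

import math

PAGES = ["A", "B", "C", "D", "E"]
OUT = {"A": ["B", "C", "E"], "D": ["B", "C"]}

def normalize(scores):
    # 2-norm normalization (an assumption; class may normalize differently)
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / norm for p, v in scores.items()}

auth = {p: 1.0 for p in PAGES}
hub = {p: 1.0 for p in PAGES}
for i in range(3):
    # authority: sum of hub scores of pages linking in
    auth = {p: sum(hub[q] for q in PAGES if p in OUT.get(q, []))
            for p in PAGES}
    # hub: sum of updated authority scores of pages linked to
    hub = {p: sum(auth[t] for t in OUT.get(p, [])) for p in PAGES}
    print("iteration", i + 1, "before normalization:", auth, hub)
    auth, hub = normalize(auth), normalize(hub)
    print("iteration", i + 1, "after normalization:", auth, hub)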

5. (9 points) Consider the following web pages and the set of web pages that they link to:

Page A points to pages B and C.
Page B points to page C.
All other pages have no outgoing links.

Consider running the PageRank algorithm on this subgraph of pages. Assume α = 0.15. Simulate the algorithm for three iterations. Show the PageRank scores for each page twice in each iteration, both before and after normalization; order the elements in the vectors in the sequence A, B, C.
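A Python sketch of the update, assuming the formula R(p) = α/N + (1 − α) · Σ over pages q linking to p of R(q)/outdegree(q), with rank from sink pages simply dropped and each iteration renormalized to sum to 1 (assumptions to check against the in-class version):

PAGES = ["A", "B", "C"]
OUT = {"A": ["B", "C"], "B": ["C"]}
ALPHA = 0.15
N = len(PAGES)

rank = {p: 1.0 / N for p in PAGES}  # uniform initial ranks
for i in range(3):
    new_rank = {}
    for p in PAGES:
        # rank flowing in from pages that link to p
        incoming = sum(rank[q] / len(OUT[q]) for q in OUT if p in OUT[q])
        new_rank[p] = ALPHA / N + (1 - ALPHA) * incoming
    print("iteration", i + 1, "before normalization:", new_rank)
    total = sum(new_rank.values())
    rank = {p: r / total for p, r in new_rank.items()}
    print("iteration", i + 1, "after normalization:", rank)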

6. (8 points) Consider the problem of learning to classify a name as being Food or Beverage. Assume the following training set:

Food: cherry pie
Food: buffalo wings
Beverage: cream soda
Beverage: orange soda

Apply 3-nearest-neighbor text categorization to the name cherry soda. Show all the similarity calculations needed to classify the name, and the final categorization. Assume simple term-frequency weights (no IDF) with cosine similarity. Would the result be guaranteed to be the same with 1-nearest-neighbor? Why or why not?
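A Python sketch of k-nearest-neighbor classification with TF cosine similarity; note that similarity ties among training names are broken arbitrarily here:

from collections import Counter
import math

TRAIN = [("Food", "cherry pie"), ("Food", "buffalo wings"),
         ("Beverage", "cream soda"), ("Beverage", "orange soda")]

def cosine(a, b):
    ta, tb = Counter(a.split()), Counter(b.split())
    dot = sum(ta[w] * tb[w] for w in ta)
    return dot / (math.sqrt(sum(v * v for v in ta.values()))
                  * math.sqrt(sum(v * v for v in tb.values())))

def knn(name, k):
    sims = sorted(((cosine(name, text), label) for label, text in TRAIN),
                  reverse=True)              # most similar first
    votes = Counter(label for _, label in sims[:k])
    return sims, votes.most_common(1)[0][0]  # majority vote of top k

sims, label = knn("cherry soda", 3)
print(sims, "->", label)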

7. (10 points) Assume we want to categorize science texts into the following categories: Physics, Biology, Chemistry. Consider performing naive Bayes classification with a simple model in which there is a binary feature for each significant word indicating its presence or absence in the document. The following probabilities have been estimated from analyzing a corpus of pre-classified web pages gathered from Yahoo:

c                Physics   Biology   Chemistry
P(c)             0.35      0.4       0.25
P(atom | c)      0.2       0.01      0.2
P(carbon | c)    0.01      0.1       0.05
P(proton | c)    0.1       0.001     0.05
P(life | c)      0.001     0.2       0.005
P(earth | c)     0.005     0.008     0.01

Assuming the probability of each evidence word is independent given the category of the text, compute the posterior probability for each of the possible categories for each of the following short texts. Assume the categories are disjoint and complete for this application. Assume that words are first stemmed to reduce them to their base form, and ignore any words that are not in the table.

(a) The carbon atom is the foundation of life on earth.

(b) The carbon atom contains 12 protons.
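A Python sketch of the posterior computation; it multiplies the prior by P(w | c) for each table word present in the text and renormalizes (if the in-class binary model also includes (1 − P(w | c)) factors for absent table words, those would need to be added):

PRIORS = {"Physics": 0.35, "Biology": 0.4, "Chemistry": 0.25}
COND = {  # P(word | class), from the table above
    "atom":   {"Physics": 0.2,   "Biology": 0.01,  "Chemistry": 0.2},
    "carbon": {"Physics": 0.01,  "Biology": 0.1,   "Chemistry": 0.05},
    "proton": {"Physics": 0.1,   "Biology": 0.001, "Chemistry": 0.05},
    "life":   {"Physics": 0.001, "Biology": 0.2,   "Chemistry": 0.005},
    "earth":  {"Physics": 0.005, "Biology": 0.008, "Chemistry": 0.01},
}

def posterior(words):
    scores = {}
    for c, prior in PRIORS.items():
        p = prior
        for w in words:
            if w in COND:          # ignore words not in the table
                p *= COND[w][c]
        scores[c] = p
    total = sum(scores.values())   # normalize over disjoint categories
    return {c: s / total for c, s in scores.items()}

print(posterior(["carbon", "atom", "life", "earth"]))  # text (a), stemmed
print(posterior(["carbon", "atom", "proton"]))         # text (b), stemmed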

8. (9 points) Consider the problem of clustering the following documents using K-means with K = 2 and cosine similarity:

Doc1: go Longhorns go
Doc2: go Texas
Doc3: Texas Longhorns
Doc4: Longhorns Longhorns

Assume Doc1 and Doc3 are chosen as the initial seeds. Assume simple term-frequency weights (no IDF, no length normalization). Show all similarity calculations needed to cluster the documents, the centroid computations for each iteration, and the final clustering. The algorithm should converge after only 2 iterations.
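A Python sketch of the clustering loop, assuming centroids are the mean TF vector of each cluster and similarity ties go to the first cluster:

from collections import Counter
import math

DOCS = {"Doc1": Counter("go Longhorns go".split()),
        "Doc2": Counter("go Texas".split()),
        "Doc3": Counter("Texas Longhorns".split()),
        "Doc4": Counter("Longhorns Longhorns".split())}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

centroids = [dict(DOCS["Doc1"]), dict(DOCS["Doc3"])]  # initial seeds
for i in range(2):
    clusters = [[], []]
    for name, vec in DOCS.items():
        sims = [cosine(vec, c) for c in centroids]
        clusters[sims.index(max(sims))].append(name)  # nearest centroid
    # recompute each centroid as the mean TF vector of its members
    for j, members in enumerate(clusters):
        total = Counter()
        for name in members:
            total.update(DOCS[name])
        centroids[j] = {w: v / len(members) for w, v in total.items()}
    print("iteration", i + 1, clusters)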

9. (8 points) Consider the following item ratings to be used by collaborative filtering:

Item     User1   User2   User3   User4   Active User
A        8       5               10      10
B        6       2       3               3
C        2       8       2       5       1
D        9       2       6       2
E        1       7       3       1
c_a,u    0.88    0.21    1       1

The Pearson correlation of each of the existing users with the active user (c_a,u) is already given in the table. Compute the predicted rating for the active user for items D and E, using standard significance weighting and the two most similar neighbors to make predictions, following the method discussed in class.
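A Python sketch of the prediction, assuming Herlocker-style significance weighting (multiply each correlation by n/50 when the number of co-rated items n is below 50), neighbor selection by weighted correlation, and the usual mean-offset weighted average; the hard-coded ratings mirror the table above:

RATINGS = {  # user -> {item: rating}, mirroring the table above
    "User1": {"A": 8, "B": 6, "C": 2, "D": 9, "E": 1},
    "User2": {"A": 5, "B": 2, "C": 8, "D": 2, "E": 7},
    "User3": {"B": 3, "C": 2, "D": 6, "E": 3},
    "User4": {"A": 10, "C": 5, "D": 2, "E": 1},
}
ACTIVE = {"A": 10, "B": 3, "C": 1}
CORR = {"User1": 0.88, "User2": 0.21, "User3": 1.0, "User4": 1.0}

def mean(d):
    return sum(d.values()) / len(d)

def predict(item, k=2):
    # significance weighting: deweight correlations computed from
    # fewer than 50 co-rated items by a factor of n/50
    weight = {u: CORR[u] * min(len(set(r) & set(ACTIVE)), 50) / 50
              for u, r in RATINGS.items()}
    neighbors = sorted(weight, key=weight.get, reverse=True)[:k]
    num = sum(weight[u] * (RATINGS[u][item] - mean(RATINGS[u]))
              for u in neighbors)
    den = sum(abs(weight[u]) for u in neighbors)
    return mean(ACTIVE) + num / den  # offset from the active user's mean

print(predict("D"), predict("E"))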

10. (6 points) Write a regular expression in Perl syntax for matching a time and date expression such as "12:07 PM 9/27/61" and "7AM 12/2/15"; specifically: a word boundary, followed by a one- or two-digit hour, followed optionally by a colon and two digits for the minutes, followed by an optional space, then AM or PM, followed by some amount of whitespace, then a one- or two-digit month, a slash, a one- or two-digit day, a slash, and a two-digit year, ending in a word boundary.
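One regex that fits this specification, sketched with Python's re module (the same constructs carry over to Perl):

import re

# \b hour, optional :minutes, optional space, AM|PM, whitespace, M/D/YY \b
PATTERN = re.compile(r"\b\d{1,2}(?::\d{2})? ?[AP]M\s+\d{1,2}/\d{1,2}/\d{2}\b")

for s in ["12:07 PM 9/27/61", "7AM 12/2/15", "noon 13/13/13"]:
    print(s, "->", bool(PATTERN.search(s)))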

11. (17 points) Provide short answers (1-3 sentences) for each of the following questions (1 point each):

What is IDF and what role does it play in vector-space retrieval?

Why is there typically a trade-off between precision and recall?

What is pseudo relevance feedback?

Why is Zipf's law important to the significant performance advantage achieved by using an inverted index?

Why is having a multi-threaded spider important, even if it is run on a single processor?

What is a robots.txt file for?

What is the Random Surfer model and why is it relevant to web search?

In machine learning, what is over-fitting?

What is a potential advantage (with respect to classification accuracy) of k-nearest-neighbor over Rocchio for text categorization?

What is smoothing (e.g. the Laplace estimate) with respect to probabilistic categorization and why is it important?

How does the language model approach use Naive Bayes text classification to solve the ad-hoc document retrieval problem?

What is the primary advantage of K-means clustering over hierarchical agglomerative clustering?

What is semi-supervised learning for text categorization?

What is Word2Vec?

What is a visual word?

Name and define two problems with collaborative filtering for recommending.

Briefly describe two different approaches to combining collaborative filtering and content-based recommending.

(Extra credit) What is the Six Degrees of Kevin Bacon game?

(Extra credit) Digital Equipment Corp. (DEC), which developed the AltaVista search engine, is now effectively part of what large tech company (after 2 separate corporate acquisitions)?

(Extra credit) The statistical techniques now generally referred to as Bayesian really owe their development more to what French scientist and mathematician?

(Extra credit) Why did Netflix decide not to have a second follow-up competition to their $1M prize for improving their recommender system?