Topics du jour
CS347 Lecture 10, May 14, 2001
Prabhakar Raghavan

- Centroid / nearest-neighbor classification
- Bayesian classification
- Link-based classification
- Document summarization

Centroid/NN
Given the training docs for a topic, compute their centroid. We now have a centroid for each topic. Given a query doc, assign it to the topic whose centroid is nearest.
[Figure: training docs plotted in vector space, grouped into Government, Science, and Arts clusters, each with its centroid.]
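A minimal sketch of this centroid rule, assuming each doc has already been turned into a term-weight vector (e.g., tf-idf); the function names and the use of cosine similarity for "nearest" are illustrative choices, not part of the lecture.

```python
import numpy as np

def train_centroids(doc_vectors, labels):
    """Compute one centroid per topic by averaging that topic's training vectors."""
    centroids = {}
    for topic in set(labels):
        members = np.array([v for v, y in zip(doc_vectors, labels) if y == topic])
        centroids[topic] = members.mean(axis=0)
    return centroids

def classify_nearest_centroid(query_vector, centroids):
    """Assign the query doc to the topic whose centroid is nearest (highest cosine similarity)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(centroids, key=lambda topic: cosine(query_vector, centroids[topic]))

# e.g., with the three topics from the figure:
# centroids = train_centroids(vectors, ["Government", "Science", "Arts", ...])
# classify_nearest_centroid(query_vec, centroids)
```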

Bayesian classification
As before, train the classifier on exemplary docs from classes c_1, c_2, ..., c_r. Given a test doc d, estimate

  Pr[d belongs to class c_j] = Pr[c_j | d].

Apply Bayes' theorem:

  Pr[c_j | d] * Pr[d] = Pr[d | c_j] * Pr[c_j],

so

  Pr[c_j | d] = Pr[d | c_j] * Pr[c_j] / Pr[d],

and express Pr[d] as the sum over i = 1, ..., r of Pr[d | c_i] * Pr[c_i].

Reverse engineering
To compute Pr[c_j | d], all we need are Pr[d | c_i] and Pr[c_i], for all i. We will get these from training.

Training
We are given a set of training docs, together with a class label for each training doc; e.g., these docs belong to Physics, those others to Astronomy, etc.

Estimating Pr[c_i]
Pr[c_i] = fraction of training docs that are labeled c_i. In practice, use more sophisticated smoothing to boost the probabilities of classes under-represented in the sample.

Estimating Pr[d | c_i]
Basic assumption: each occurrence of each word in each doc is independent of all others. For a word w (from the sample docs),

  Pr[w | c_i] = frequency of word w amongst all docs labeled c_i,

and

  Pr[d | c_i] = product over all words w in d of Pr[w | c_i].

Thus the probability of a doc consisting of "Friends, Romans, Countrymen" is Pr[Friends | c_i] * Pr[Romans | c_i] * Pr[Countrymen | c_i]. In implementations, pay attention to precision/underflow. Extract all probabilities from the term-doc matrix.

To summarize
Training:
- Use class frequencies in the training data for Pr[c_i].
- Estimate word frequencies for each word and each class to estimate Pr[w | c_i].
Test doc d:
- Use the Pr[w | c_i] values to estimate Pr[d | c_i] for each class c_i.
- Determine the class c_j for which Pr[c_j | d] is maximized.
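A minimal sketch of this training/test recipe, assuming docs are lists of tokens. Working in log space handles the underflow the slide warns about; add-one (Laplace) smoothing is shown here as one concrete choice of the "more sophisticated smoothing" mentioned above, not necessarily the one used in the lecture.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate Pr[c_i] from class frequencies, and per-class word counts for Pr[w | c_i].
    `docs` is a list of token lists; `labels` gives the class of each doc."""
    priors = {c: n / len(docs) for c, n in Counter(labels).items()}
    word_counts = defaultdict(Counter)          # word_counts[c][w] = count of w in class c
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    vocab = {w for counts in word_counts.values() for w in counts}
    return priors, word_counts, vocab

def classify_nb(doc, priors, word_counts, vocab):
    """Pick the class c_j maximizing Pr[c_j] * prod_{w in d} Pr[w | c_j], in log space."""
    best, best_score = None, float("-inf")
    for c in priors:
        total = sum(word_counts[c].values())
        # add-one smoothing so unseen words don't zero out the product
        score = math.log(priors[c]) + sum(
            math.log((word_counts[c][w] + 1) / (total + len(vocab))) for w in doc)
        if score > best_score:
            best, best_score = c, score
    return best
```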

Abstract features
So far we have relied on word counts as the features to train and classify on. In general, a feature could be any statistic:
- terms in boldface count for more
- authors of cited docs
- number of equations
- square of the number of commas

Bayesian methods in practice
Many improvements are used over the naïve version discussed above:
- various models for document generation
- varying emphasis on words in different portions of docs
- smoothing statistics for infrequent terms
- classifying into a hierarchy

Supervised learning: deployment issues
- Uniformity of docs in training/test
- Quality of authorship
- Volume of training data

Typical empirical observations:
- Training: ~1000+ docs per class
- Accuracy up to 90% in the very best circumstances, below 50% in the worst

SVM vs. Bayesian
- SVM appears to beat a variety of Bayesian approaches; both beat centroid-based methods.
- SVM needs quadratic programming: more computation than naïve Bayes at training time, less at classification time.
- Bayesian classifiers also partition the vector space, but not using linear decision surfaces.

Classifying hypertext
Given a set of hyperlinked docs, with class labels available for some of them, figure out the class labels for the remaining docs.

Bayesian hypertext classification
Besides the terms in a doc, derive cues from linked docs to assign a class to the test doc. The cues could be any abstract features from the doc and its neighbors.
[Figure: a small web of linked docs, some labeled c1, c2, c3, c4 and others marked "?" awaiting labels.]

Feature selection
Attempt 1: use the terms in the doc plus those in its neighbors. This generally does worse than the terms in the doc alone. Why? The neighbors' terms diffuse the focus of the doc's terms.

Attempt 2: use the terms in the doc, plus tagged terms from its neighbors. E.g., car denotes a term occurring in d; car@i denotes a term occurring in a doc with a link into d; car@o denotes a term occurring in a doc with a link from d. Generalizations are possible: car@oioi. (A small sketch of this tagging appears below.)

Attempt 2 also fails: key terms lose density; e.g., car gets split into car, car@i, car@o.

Better attempt: use the class labels of the (in- and out-) neighbors as features in classifying d; e.g., docs about physics point to docs about physics. Setting: some neighbors have pre-assigned labels; we need to figure out the rest.
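A tiny sketch of the "@i"/"@o" tagging from Attempt 2, purely for illustration (the slide notes this attempt also fails); the function name and data layout are mine.

```python
def tagged_features(doc_terms, in_neighbor_terms, out_neighbor_terms):
    """Attempt 2 features for doc d: plain terms from d, '@i'-tagged terms from docs
    that link into d, and '@o'-tagged terms from docs that d links out to."""
    features = set(doc_terms)
    features |= {t + "@i" for terms in in_neighbor_terms for t in terms}
    features |= {t + "@o" for terms in out_neighbor_terms for t in terms}
    return features

# tagged_features({"car"}, [{"car"}], [{"engine"}]) -> {"car", "car@i", "engine@o"}
```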

Content + neighbors' classes
Naïve Bayes gives Pr[c_j | d] based on the words in d. Now consider Pr[c_j | N], where N is the set of labels of d's neighbors. (We can separate N into in- and out-neighbors.) We can combine the conditional probabilities for c_j from the text-based and link-based evidence.

Training
As before, use the training data to compute Pr[N | c_j], etc. Assume the labels of d's neighbors are independent (as we did with word occurrences). (Also continue to assume word occurrences within d are independent.)

Classification
We can invert the probabilities using Bayes' theorem to derive Pr[c_j | N]. But this needs the class labels of all of d's neighbors.

Unknown neighbor labels
What if not all of the neighbors' class labels are known? First, use word content alone to assign a tentative class label to each unlabelled doc. Next, iteratively recompute all tentative labels using word content as well as the neighbors' classes (some of them tentative).
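A sketch of this iterative relabeling, under the independence assumptions above. The data structures, names, and the way the text and link evidence are multiplied together are my own illustrative choices, not the exact scheme of the referenced work; all probabilities are assumed to be smoothed (nonzero).

```python
import math

def iterative_relabel(docs, text_probs, neighbors, label_given_class, known_labels, n_iters=10):
    """Start from text-only labels, then repeatedly re-pick, for each doc with an
    unknown label, the class c maximizing
        log Pr[c | d] + sum over neighbors n of log Pr[label(n) | c].
    text_probs[d]             : dict class -> Pr[c | d] from the word-based classifier
    neighbors[d]              : list of d's (in- and out-) neighbors
    label_given_class[(l, c)] : trained estimate Pr[a neighbor is labeled l | class c]
    known_labels              : dict of docs whose labels are given in advance"""
    labels = dict(known_labels)
    for d in docs:                      # step 1: tentative labels from word content alone
        labels.setdefault(d, max(text_probs[d], key=text_probs[d].get))
    for _ in range(n_iters):            # step 2: iteratively recompute tentative labels
        for d in docs:
            if d in known_labels:
                continue
            def score(c):
                s = math.log(text_probs[d][c])
                for n in neighbors[d]:
                    s += math.log(label_given_class[(labels[n], c)])
                return s
            labels[d] = max(text_probs[d], key=score)
    return labels
```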

Convergence
This iterative relabeling will converge provided the tentative labels are not too far off. The guarantee requires ideas from Markov random fields, used in computer vision. Error rates are significantly below text-alone classification.

End of classification; move on to document summarization.

Document summarization
Given a doc, produce a short summary.
- The length of the summary is a parameter (application/form factor).
- Very sensitive to doc quality.
- Typically corpus-independent.

Summarization
Simplest algorithm: output the first 50 (or however many) words of the doc. Hard to beat on high-quality docs. For very short summaries (e.g., 5 words), drop stop words.
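A minimal sketch of this baseline, assuming the doc is plain text; the stop-word set is passed in by the caller and the function name is mine.

```python
def lead_summary(text, n_words=50, stop_words=frozenset()):
    """Baseline summarizer: the first n_words words of the doc.
    For very short summaries, pass a stop-word set so those words are dropped first."""
    words = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(words[:n_words])

# lead_summary(article)                                            # first 50 words
# lead_summary(article, 5, {"the", "a", "of", "to", "and", "in"})  # 5-word summary, stop words dropped
```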

Summarization
Slightly more complex heuristics: compute an importance score for each sentence. The summary output contains sentences from the original doc, beginning with the most important.

Article from WSJ
1: Tandon Corp. will introduce a portable hard-disk drive today that will enable personal computer owners to put all of their programs and data on a single transportable cartridge that can be plugged into other computers.
2: Tandon, which has reported big losses in recent quarters as it shifted its product emphasis from disk drives to personal computer systems, asserts that the new device, called Personal Data Pac, could change the way software is sold, and even the way computers are used.
3: The company, based in Moorpark, Calif., also will unveil a new personal computer compatible with International Business Machines Corp.'s PC-AT advanced personal computer that incorporates the new portable hard-disks.
4: "It's an idea we've been working on for several years," said Chuck Peddle, president of Tandon's computer systems division.
5: "As the price of hard disks kept coming down, we realized that if we could make them portable, the floppy disk drive would be a useless accessory.
6: Later, we realized it could change the way people use their computers."
7: Each Data Pac cartridge, which will be priced at about $400, is about the size of a thick paperback book and contains a hard-disk drive that can hold 30 million pieces of information, or the equivalent of about four Bibles.
8: To use the Data Pacs, they must be plugged into a cabinet called the Ad-Pac 2. That device, to be priced at about $500, contains circuitry to connect the cartridge to an IBM-compatible personal computer, and holds the cartridge steadily in place.
9: The cartridges, which weigh about two pounds, are so durable they can be dropped on the floor without being damaged.
10: Tandon developed the portable cartridge in conjunction with Xerox Corp., which supplied much of the research and development funding.
11: Tandon also said it is negotiating with several other personal computer makers, which weren't identified, to incorporate the cartridges into their own designs.
12: Mr. Peddle, who is credited with inventing Commodore International Ltd.'s Pet personal computer in the late 1970s, one of the first personal computers, said the Data Pac will enable personal computer users to carry sensitive data with them or lock it away.

Example summary: sentences 1 and 12.
Example longer summary: sentences 1, 7, and 12.

Using lexical chains
Score each word based on lexical chains; score sentences; extract a hierarchy of key sentences. Better (WAP) summarization.

What are lexical chains?
Dependency relationships between words:
- reiteration: e.g., tree - tree
- superordinate: e.g., tree - plant
- systematic semantic relation: e.g., tree - bush
- non-systematic semantic relation: e.g., tree - tall

Using lexical chains
Look for chains of reiterated words; score the chains; use them to score sentences, i.e., determine how important a sentence is to the content of the doc.

Computing lexical chains
Quite simple if only dealing with reiteration. Issues: How far apart can two nodes in a chain be? How do we score a chain?

Article from WSJ (same Tandon article as above)
For each word: How far apart can two nodes in a chain be? Continue a chain if the next occurrence is within 2 sentences. How to score a chain? A function of word frequency and chain size.

Chain structure analysis: compute & score chains
- Chain 1, score 3674265: PAC, sentences 12 13 14 15 17
- Chain 2, score 2383334: TANDON, sentences 1 2 4
- Chain 3, score 1903214: PORTABLE, sentences 1 3 5
- Chain 4, score 1466674: CARTRIDGE, sentences 7 8 9 10 11
- Chain 5, score 959779: COMPUTER, sentences 1 2 3 4 6 8
- Chain 6, score 951902: TANDON, sentences 15 17
- Chain 7, score 951902: TANDON, sentences 10 11
- Chain 8, score 760142: PEDDLE, sentences 12 14
- Chain 9, score 726256: COMPUTER, sentences 11 12 13 15 17
- Chain 10, score 476633: DATA, sentences 12 13 14 15 17

Sentence scoring
For each sentence s in the doc, f(s) = a*h(s) - b*t(s), where h(s) is the total score of all chains starting at s, and t(s) is the total score of all chains covering s but not starting at s.
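A sketch of reiteration-only chain building and the sentence score f(s) above, assuming the doc is already split into sentences of lowercased tokens. The chain-scoring function (frequency times chain length) and the a, b weights are illustrative choices, since the slide does not specify them.

```python
from collections import Counter, defaultdict

def build_chains(sentences, max_gap=2):
    """Reiteration-only chains: occurrences of the same word are linked into a chain
    as long as consecutive occurrences are at most max_gap sentences apart.
    `sentences` is a list of token lists; returns a list of (word, [sentence indices])."""
    positions = defaultdict(list)
    for i, sent in enumerate(sentences):
        for w in set(sent):
            positions[w].append(i)
    chains = []
    for w, occ in positions.items():
        current = [occ[0]]
        for i in occ[1:]:
            if i - current[-1] <= max_gap:
                current.append(i)
            else:
                chains.append((w, current))
                current = [i]
        chains.append((w, current))
    return chains

def chain_score(word, members, word_freq):
    # the slide only says "a function of word frequency and chain size";
    # frequency times chain length is one simple such function
    return word_freq[word] * len(members)

def sentence_scores(sentences, a=1.0, b=0.3, max_gap=2):
    """f(s) = a*h(s) - b*t(s): h(s) sums the scores of chains starting at s,
    t(s) sums the scores of chains covering s but not starting at s."""
    word_freq = Counter(w for sent in sentences for w in sent)
    h, t = defaultdict(float), defaultdict(float)
    for w, members in build_chains(sentences, max_gap):
        score = chain_score(w, members, word_freq)
        h[members[0]] += score
        for s in range(members[0] + 1, members[-1] + 1):
            t[s] += score
    return {s: a * h[s] - b * t[s] for s in range(len(sentences))}
```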

Semantic structure analysis
[Figure: plot of scores over sentences 1-19.]

Derive the document structure.
[Figure: hierarchy of key sentences: top level: 1; next: 1, 7, 12; next: 1, 5, 7, 8, 9, 10, 11, 12, 15, 18; next: 1, 2, 3, 4, 5, 6, 12, 13, 14, 15, 16, 17, 18, 19.]

Resources
- S. Chakrabarti, B. Dom, P. Indyk. Enhanced hypertext categorization using hyperlinks. http://citeseer.nj.nec.com/chakrabarti98enhanced.html
- R. Barzilay. Using lexical chains for text summarization. http://citeseer.nj.nec.com/barzilay97using.html