CSCI 5417 Information Retrieval Systems, Jim Martin


CSCI 5417 Information Retrieval Systems, Jim Martin
Lecture 11, 9/29/2011

Today (9/29): Classification; Naïve Bayes classification; Unigram LM

Where we are...
- Basics of ad hoc retrieval
- Indexing
- Term weighting/scoring
- Cosine
- Evaluation
- Document classification
- Clustering
- Information extraction
- Sentiment/Opinion mining

Is this spam?

From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down. Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses. I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW!
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Text Categorization Examples
Assign labels to each document or web-page:
- Labels are most often topics, such as Yahoo categories: finance, sports, news>world>asia>business
- Labels may be genres: editorials, movie-reviews, news
- Labels may be opinion: like, hate, neutral
- Labels may be domain-specific: "interesting-to-me" vs. "not-interesting-to-me"; spam vs. not-spam; "contains adult content" vs. "doesn't"; important to read now vs. not important

Categorization/Classification
Given:
- A description of an instance, x ∈ X, where X is the instance language or instance space. The issue for us is how to represent text documents.
- A fixed set of categories: C = {c_1, c_2, ..., c_n}
Determine:
- The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C. We want to know how to build categorization functions (i.e., classifiers).

Text Classification Types
Those examples can be further classified by type:
- Binary: spam/not spam, contains adult content/doesn't
- Multiway: business vs. sports vs. gossip
- Hierarchical: News > UK > Wales > Weather
- Mixture model: .8 basketball, .2 business

Document Classification
[Figure: a test document containing the words "planning, language, proof, intelligence" is to be assigned to one of the classes ML, Planning, Semantics, Garb.Coll., Multimedia, GUI, grouped under the areas (AI), (Programming), and (HCI). Each class has its own training data, e.g. "learning, intelligence, algorithm, reinforcement, network..." for ML; "planning, temporal, reasoning, plan, language..." for Planning; "programming, semantics, language, proof..." for Semantics; "garbage, collection, memory, optimization, region..." for Garb.Coll.]

Bayesian Classifiers
Task: classify a new instance D based on a tuple of attribute values D = (x_1, x_2, ..., x_n) into one of the classes c ∈ C:

$c_{MAP} = \arg\max_{c \in C} P(c \mid x_1, \ldots, x_n) = \arg\max_{c \in C} \frac{P(x_1, \ldots, x_n \mid c)\, P(c)}{P(x_1, \ldots, x_n)} = \arg\max_{c \in C} P(x_1, \ldots, x_n \mid c)\, P(c)$

Naïve Bayes Classifiers
- P(c) can be estimated from the frequency of classes in the training examples.
- P(x_1, x_2, ..., x_n | c) has O(|X|^n |C|) parameters and could only be estimated if a very, very large number of training examples was available.
- Naïve Bayes conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i | c).

The Naïve Bayes Classifier (Belief Net)
[Figure: a belief net with class node Flu and feature nodes X_1 ... X_5 for runny nose, sinus, cough, fever, and muscle-ache.]
Conditional independence assumption: features detect term presence and are independent of each other given the class:

$P(C, X_1, \ldots, X_5) = P(C)\, P(X_1 \mid C)\, P(X_2 \mid C) \cdots P(X_5 \mid C)$

Learning the Model
[Figure: a belief net with class node C and feature nodes X_1 ... X_6.]
First attempt: maximum likelihood estimates, simply using the frequencies in the data:

$\hat{P}(c) = \frac{N(C = c)}{N} \qquad \hat{P}(x_i \mid c) = \frac{N(X_i = x_i, C = c)}{N(C = c)}$

Smoothing to Avoid Overfitting
Add-one smoothing:

$\hat{P}(x_i \mid c) = \frac{N(X_i = x_i, C = c) + 1}{N(C = c) + k}$, where k is the number of values of X_i.

Stochastic Language Models
Models the probability of generating strings (each word in turn) in the language (commonly all strings over the alphabet ∑). E.g., a unigram model:

Model M: the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02

"the man likes the woman": 0.2 × 0.01 × 0.02 × 0.2 × 0.01 (multiply), so P(s | M) = 0.00000008.
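A minimal sketch of this computation in Python (the probabilities are those of Model M above; the function name unigram_prob is my own):

```python
# Toy unigram language model, using the probabilities of Model M above.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def unigram_prob(sentence, model):
    """P(s|M): multiply the model probability of each token in turn."""
    p = 1.0
    for token in sentence.split():
        p *= model[token]
    return p

print(unigram_prob("the man likes the woman", M))  # 0.2*0.01*0.02*0.2*0.01 = 8e-08
```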

Stochastic Language Models
Model the probability of generating any string:

Model M1: the 0.2, class 0.01, sayst 0.0001, pleaseth 0.0001, yon 0.0001, maiden 0.0005, woman 0.01
Model M2: the 0.2, class 0.0001, sayst 0.03, pleaseth 0.02, yon 0.1, maiden 0.01, woman 0.0001

Scoring "the class pleaseth yon maiden":
- under M1: 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
- under M2: 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
so P(s | M2) > P(s | M1).

Unigram and higher-order models
- Unigram language models: $P(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2)\, P(t_3)\, P(t_4)$
- Bigram (generally, n-gram) language models: $P(t_1 t_2 t_3 t_4) = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_2)\, P(t_4 \mid t_3)$
- Other language models: grammar-based models (PCFGs, etc.); probably not the first thing to try in IR
Easy. Effective!

Naïve Bayes via a class-conditional language model = multinomial NB
[Figure: a belief net with class node Cat and word nodes w_1 ... w_6.]
Effectively, the probability of each class is computed as a class-specific unigram language model.

Using Multinomial Naive Bayes to Classify Text
Attributes are text positions; values are words:

$c_{NB} = \arg\max_{c \in C} P(c) \prod_i P(x_i \mid c) = \arg\max_{c \in C} P(c)\, P(x_1 = \text{"our"} \mid c) \cdots P(x_n = \text{"text"} \mid c)$

Still too many possibilities. Assume that classification is independent of the positions of the words, i.e., use the same parameters for each position. The result is a bag-of-words model (over tokens, not types).
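As a small illustration (mine, not from the slides), the bag-of-words reduction just keeps per-token counts and discards positions:

```python
from collections import Counter

# Positions are discarded; only token counts (over tokens, not types) remain.
bag = Counter("japan japan exports".split())
print(bag)  # Counter({'japan': 2, 'exports': 1})
```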

Naïve Bayes: Learning (Multinomial Model)
From the training corpus, extract the Vocabulary, then calculate the required P(c_j) and P(x_k | c_j) terms. For each c_j in C:
- docs_j ← the subset of documents for which the target class is c_j
- P(c_j) ← |docs_j| / (total # of documents)
- Text_j ← a single document containing all of docs_j
- for each word x_k in Vocabulary:
  - n_k ← the number of occurrences of x_k in Text_j
  - P(x_k | c_j) ← (n_k + α) / (n + α |Vocabulary|), where n is the total number of tokens in Text_j
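A minimal Python sketch of this training loop, assuming documents arrive as (class, token-list) pairs (train_nb and the data layout are my own choices; add-α smoothing as above):

```python
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (class_label, token_list) pairs.
    Returns priors P(c), smoothed conditionals P(x_k|c), and the vocabulary."""
    vocab = {tok for _, tokens in docs for tok in tokens}
    class_text = defaultdict(list)                 # Text_j: all docs of class j, concatenated
    class_ndocs = Counter(label for label, _ in docs)
    for label, tokens in docs:
        class_text[label].extend(tokens)
    prior, cond = {}, {}
    for label, text in class_text.items():
        prior[label] = class_ndocs[label] / len(docs)
        counts, n = Counter(text), len(text)       # n: total tokens in Text_j
        cond[label] = {tok: (counts[tok] + alpha) / (n + alpha * len(vocab))
                       for tok in vocab}
    return prior, cond, vocab
```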

Naïve Bayes: Classifying (Multinomial)
- positions ← all word positions in the current document which contain tokens found in Vocabulary
- return c_NB, where

$c_{NB} = \arg\max_{c \in C} P(c) \prod_{i \in \text{positions}} P(x_i \mid c)$
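And a direct sketch of this decision rule, multiplying the probabilities (a log-space version follows the next slide):

```python
def classify_nb(tokens, prior, cond, vocab):
    """Return argmax_c P(c) * prod_i P(x_i|c), skipping out-of-vocabulary tokens."""
    best_label, best_score = None, 0.0
    for label in prior:
        score = prior[label]
        for tok in tokens:
            if tok in vocab:
                score *= cond[label][tok]
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```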

Naive Bayes: Time Complexity
- Training time: O(|D| L_d + |C||V|), where L_d is the average length of a document in D. Assumes the vocabulary V and all document and word counts are pre-computed in O(|D| L_d) time during one pass through all of the data. Generally just O(|D| L_d), since usually |C||V| < |D| L_d.
- Test time: O(|C| L_t), where L_t is the average length of a test document.
- Very efficient overall: linearly proportional to the time needed to just read in all the data.

Underflow Prevention: log space
Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. The class with the highest final un-normalized log probability score is still the most probable:

$c_{NB} = \arg\max_{c \in C} \Big[ \log P(c) + \sum_{i \in \text{positions}} \log P(x_i \mid c) \Big]$

Note that the model is now just a max of a sum of weights.
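The same rule in log space, per the slide (a sketch building on the train_nb/classify_nb functions above):

```python
import math

def classify_nb_log(tokens, prior, cond, vocab):
    """argmax_c [log P(c) + sum_i log P(x_i|c)]; summing logs avoids underflow."""
    def log_score(label):
        return math.log(prior[label]) + sum(math.log(cond[label][tok])
                                            for tok in tokens if tok in vocab)
    return max(prior, key=log_score)
```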

Naïve Bayes example
Given 4 documents:
- D1 (sports): China soccer
- D2 (sports): Japan baseball
- D3 (politics): China trade
- D4 (politics): Japan Japan exports
Classify:
- D5: soccer
- D6: Japan
Use add-one smoothing, the multinomial model, and the multivariate binomial model.

Naïve Bayes example
V is {China, soccer, Japan, baseball, trade, exports}; |V| = 6
Sizes: Sports = 2 docs, 4 tokens; Politics = 2 docs, 5 tokens

         Sports (raw, smoothed)   Politics (raw, smoothed)
Japan    1/4, 2/10                2/5, 3/11
soccer   1/4, 2/10                0/5, 1/11

Naïve Bayes example
Classifying D5 ("soccer", as a doc):
- P(soccer | sports) = .2; P(soccer | politics) = .09
- Sports > Politics, or .2/(.2+.09) = .69 vs. .09/(.2+.09) = .31

New example
What about a doc like "Japan soccer"?
- Sports: P(japan|sports) P(soccer|sports) P(sports) = .2 × .2 × .5 = .02
- Politics: P(japan|politics) P(soccer|politics) P(politics) = .27 × .09 × .5 ≈ .01
- Or .66 to .33 after normalizing.
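Running the sketches above on the four training documents reproduces the slide's numbers (tokens lower-cased; a sanity check of mine, not part of the original):

```python
docs = [("sports",   "china soccer".split()),
        ("sports",   "japan baseball".split()),
        ("politics", "china trade".split()),
        ("politics", "japan japan exports".split())]
prior, cond, vocab = train_nb(docs, alpha=1.0)

print(round(cond["sports"]["soccer"], 2))    # 2/10 = 0.2
print(round(cond["politics"]["soccer"], 2))  # 1/11 ~ 0.09
print(round(cond["politics"]["japan"], 2))   # 3/11 ~ 0.27
print(classify_nb("japan soccer".split(), prior, cond, vocab))  # sports
```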

Evaluating Categorization
- Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
- Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
- Average results over multiple training and test sets (splits of the overall data) for the best results.

Example: AutoYahoo!
Classify 13,589 Yahoo! webpages in the "Science" subtree into 95 different topics (hierarchy depth 2).

WebKB Experiment
Classify webpages from CS departments into: student, faculty, course, project.
Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin); then crawl and classify a new site (CMU):

            Student   Faculty   Person   Project   Course   Departmt
Extracted   180       66        246      99        28       1
Correct     130       28        194      72        25       1
Accuracy    72%       42%       79%      73%       89%      100%

NB Model Comparison
[Figure comparing NB model variants; not recoverable from the transcript.]

SpamAssassin
- Naïve Bayes made a big splash with spam filtering: Paul Graham's "A Plan for Spam" and its offspring...
- A Naive Bayes-like classifier with weird parameter estimation
- Widely used in spam filters
- Classic Naive Bayes is superior when appropriately used (according to David D. Lewis)
- Many email filters use NB classifiers, but also many other things: black-hole lists, etc.

Naïve Bayes on spam email
[Figure with results of Naïve Bayes on spam email; not recoverable from the transcript.]

Naive Bayes is Not So Naive
- Does well in many standard evaluation competitions
- Robust to irrelevant features: irrelevant features cancel each other out without affecting results. Decision trees, by contrast, can suffer heavily from this.
- Very good in domains with many equally important features; decision trees suffer from fragmentation in such cases, especially if there is little data.
- A good, dependable baseline for text classification
- Very fast: learning requires one pass over the data; testing is linear in the number of attributes and the document-collection size
- Low storage requirements

Next couple of classes
- Other classification issues
- What about vector spaces?
- Lucene infrastructure
- Better ML approaches: SVMs, etc.