CSCI 5417 Information Retrieval Systems
Jim Martin
Lecture 11, 9/29/2011

Today 9/29
- Classification
- Naïve Bayes classification
- Unigram LM
Where we are...
- Basics of ad hoc retrieval
- Indexing
- Term weighting/scoring
  - Cosine
- Evaluation
- Document classification
- Clustering
- Information extraction
- Sentiment/Opinion mining

Is this spam?
From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm
=================================================
Text Categorization Examples
Assign labels to each document or web-page:
- Labels are most often topics such as Yahoo-categories
  finance, sports, news>world>asia>business
- Labels may be genres
  editorials, movie-reviews, news
- Labels may be opinion
  like, hate, neutral
- Labels may be domain-specific
  "interesting-to-me" : "not-interesting-to-me"
  spam : not-spam
  contains adult content : doesn't
  important to read now : not important

Categorization/Classification
Given:
- A description of an instance, x ∈ X, where X is the instance language or instance space.
  - Issue for us is how to represent text documents
- A fixed set of categories: C = {c_1, c_2, ..., c_n}
Determine:
- The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
- We want to know how to build categorization functions (i.e., classifiers).
Text Classification Types
Those examples can be further classified by type
- Binary
  Spam/not spam, contains adult content/doesn't
- Multiway
  Business vs. sports vs. gossip
- Hierarchical
  News > UK > Wales > Weather
- Mixture model
  .8 basketball, .2 business

Document Classification
[Figure: a test document containing "planning, language, proof, intelligence" must be assigned to one of the classes ML, AI, Planning, Programming, Semantics, Garb.Coll., HCI, Multimedia, GUI. Training data per class, e.g.: ML: "learning, intelligence, algorithm, reinforcement, network..."; Planning: "planning, temporal, reasoning, plan, language..."; Semantics: "programming, semantics, language, proof..."; Garb.Coll.: "garbage, collection, memory, optimization, region..."]
Bayesian Classifiers
Task: Classify a new instance D based on a tuple of attribute values D = <x_1, x_2, ..., x_n> into one of the classes c ∈ C

c_MAP = argmax_{c ∈ C} P(c | x_1, x_2, ..., x_n)
      = argmax_{c ∈ C} P(x_1, x_2, ..., x_n | c) P(c) / P(x_1, x_2, ..., x_n)
      = argmax_{c ∈ C} P(x_1, x_2, ..., x_n | c) P(c)

Naïve Bayes Classifiers
- P(c)
  Can be estimated from the frequency of classes in the training examples.
- P(x_1, x_2, ..., x_n | c)
  O(|X|^n · |C|) parameters
  Could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption:
- Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i | c).
The Naïve Bayes Classifier (Belief Net)

[Figure: class node Flu with feature nodes X_1 ... X_5: runnynose, sinus, cough, fever, muscle-ache]

Conditional Independence Assumption: features detect term presence and are independent of each other given the class:
P(C, X_1, ..., X_5) = P(C) · P(X_1 | C) · P(X_2 | C) · ... · P(X_5 | C)

Learning the Model

[Figure: class node C with feature nodes X_1 ... X_6]

First attempt: maximum likelihood estimates. Simply use the frequencies in the data:
P̂(c) = N(C = c) / N
P̂(x_i | c) = N(X_i = x_i, C = c) / N(C = c)
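The maximum-likelihood estimates above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code; the toy labeled data (feature tuples and classes) is invented to match the belief-net example.

```python
from collections import Counter

# Toy labeled data: (features present in the instance, class label).
# Invented for illustration only.
data = [(("runnynose", "cough"), "flu"),
        (("cough",), "flu"),
        (("runnynose",), "cold")]

n = len(data)
class_counts = Counter(c for _, c in data)
feat_class_counts = Counter((f, c) for feats, c in data for f in feats)

def p_class(c):
    # MLE prior: P-hat(c) = N(C = c) / N
    return class_counts[c] / n

def p_feat_given_class(f, c):
    # MLE conditional: P-hat(x_i | c) = N(X_i = x_i, C = c) / N(C = c)
    return feat_class_counts[(f, c)] / class_counts[c]

print(p_class("flu"))                      # 2/3
print(p_feat_given_class("cough", "flu"))  # 2/2 = 1.0
```

Note that "cough" gets probability 1.0 for flu because it occurs in every flu instance; this overfitting is exactly what the smoothing on the next slide addresses.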
Smoothing to Avoid Overfitting
P̂(x_i | c) = (N(X_i = x_i, C = c) + 1) / (N(C = c) + k)
Add-One smoothing, where k = number of values of X_i

Stochastic Language Models
Model the probability of generating strings (each word in turn) in the language (commonly all strings over the vocabulary). E.g., a unigram model:

Model M:
  0.2   the
  0.1   a
  0.01  man
  0.01  woman
  0.03  said
  0.02  likes

  the   man   likes  the   woman
  0.2   0.01  0.02   0.2   0.01    (multiply)

P(s | M) = 0.00000008

[IIR 13.2.1]
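The unigram scoring on this slide can be sketched directly: multiply the per-word probabilities. The model dictionary below just hard-codes the slide's Model M; words outside the model get probability 0 here (a simplifying assumption, not part of the slide).

```python
# Model M from the slide: unigram probabilities per word.
model_m = {"the": 0.2, "a": 0.1, "man": 0.01,
           "woman": 0.01, "said": 0.03, "likes": 0.02}

def unigram_prob(sentence, model):
    """P(s | M) = product of per-word unigram probabilities."""
    p = 1.0
    for word in sentence.split():
        p *= model.get(word, 0.0)  # unseen word -> 0 (no smoothing here)
    return p

# 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 0.00000008
print(unigram_prob("the man likes the woman", model_m))
```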
Stochastic Language Models
Model probability of generating any string

  Model M1                Model M2
  0.2     the             0.2     the
  0.01    class           0.0001  class
  0.0001  sayst           0.03    sayst
  0.0001  pleaseth        0.02    pleaseth
  0.0001  yon             0.1     yon
  0.0005  maiden          0.01    maiden
  0.01    woman           0.0001  woman

      the   class   pleaseth  yon     maiden
  M1: 0.2   0.01    0.0001    0.0001  0.0005
  M2: 0.2   0.0001  0.02      0.1     0.01

P(s | M2) > P(s | M1)

Unigram and higher-order models
P(w_1 w_2 w_3 w_4) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) P(w_4 | w_1 w_2 w_3)
- Unigram Language Models
  P(w_1) P(w_2) P(w_3) P(w_4)
- Bigram (generally, n-gram) Language Models
  P(w_1) P(w_2 | w_1) P(w_3 | w_2) P(w_4 | w_3)
- Other Language Models
  Grammar-based models (PCFGs, etc.)
  Probably not the first thing to try in IR

Easy. Effective!
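The unigram/bigram contrast above can be sketched as code. The probabilities below are invented for illustration (they are not the slide's M1/M2 numbers), and the `"<s>"` start symbol is an added convention for the bigram case.

```python
# Toy probability tables, invented for illustration.
unigram = {"the": 0.2, "class": 0.01}
bigram = {("<s>", "the"): 0.3, ("the", "class"): 0.05}

def unigram_score(words):
    # P(w_1 ... w_n) = P(w_1) * ... * P(w_n): each word independent
    p = 1.0
    for w in words:
        p *= unigram[w]
    return p

def bigram_score(words):
    # P(w_1 ... w_n) = P(w_1 | <s>) * P(w_2 | w_1) * ...:
    # each word conditioned on the previous one
    p = 1.0
    prev = "<s>"
    for w in words:
        p *= bigram[(prev, w)]
        prev = w
    return p

print(unigram_score(["the", "class"]))  # 0.2 * 0.01
print(bigram_score(["the", "class"]))   # 0.3 * 0.05
```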
Naïve Bayes via a class conditional language model = multinomial NB

[Figure: class node Cat with word nodes w_1 ... w_6]

Effectively, the probability of each class is done as a class-specific unigram language model

Using Multinomial Naive Bayes to Classify Text
- Attributes are text positions, values are words.

c_NB = argmax_{c ∈ C} P(c) ∏_i P(x_i | c)
     = argmax_{c ∈ C} P(c) P(x_1 = "our" | c) ... P(x_n = "text" | c)

- Still too many possibilities
- Assume that classification is independent of the positions of the words
  - Use same parameters for each position
  - Result is bag-of-words model (over tokens, not types)
Naïve Bayes: Learning (Multinomial Model)
From training corpus, extract Vocabulary
Calculate required P(c_j) and P(x_k | c_j) terms
For each c_j in C do
  docs_j ← subset of documents for which the target class is c_j
  P(c_j) ← |docs_j| / total # documents
  Text_j ← single document containing all docs_j
  for each word x_k in Vocabulary
    n_k ← number of occurrences of x_k in Text_j
    P(x_k | c_j) ← (n_k + α) / (n + α |Vocabulary|)
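The training loop above can be sketched as follows, with α = 1 (add-one smoothing). The training documents are the four-document sports/politics toy set used later in the lecture's worked example.

```python
from collections import Counter

# Toy training corpus from the lecture's worked example: (text, class).
docs = [("china soccer", "sports"), ("japan baseball", "sports"),
        ("china trade", "politics"), ("japan japan exports", "politics")]

alpha = 1.0
vocab = {w for text, _ in docs for w in text.split()}
classes = {c for _, c in docs}

priors, cond = {}, {}
for c in classes:
    class_docs = [text for text, label in docs if label == c]
    priors[c] = len(class_docs) / len(docs)   # P(c_j) = |docs_j| / #docs
    tokens = " ".join(class_docs).split()     # Text_j: all docs of class c_j
    n = len(tokens)                           # tokens in Text_j
    counts = Counter(tokens)                  # n_k per word
    for w in vocab:
        # P(x_k | c_j) = (n_k + alpha) / (n + alpha * |Vocabulary|)
        cond[(w, c)] = (counts[w] + alpha) / (n + alpha * len(vocab))

print(priors["sports"])          # 0.5
print(cond[("japan", "sports")]) # (1 + 1) / (4 + 6) = 0.2
```

With |V| = 6 and 4 sports tokens, this reproduces the smoothed estimates shown in the example slides (e.g. P(japan | sports) = 2/10).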
Naïve Bayes: Classifying (Multinomial)
positions ← all word positions in current document which contain tokens found in Vocabulary
Return c_NB, where
c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)
Naive Bayes: Time Complexity
- Training Time: O(|D| L_d + |C| |V|), where L_d is the average length of a document in D.
  - Assumes V and all D_j, n_j, and n_jk pre-computed in O(|D| L_d) time during one pass through all of the data.
  - Generally just O(|D| L_d) since usually |C| |V| < |D| L_d
- Test Time: O(|C| L_t), where L_t is the average length of a test document.
- Very efficient overall, linearly proportional to the time needed to just read in all the data.

Underflow Prevention: log space
- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- Class with highest final un-normalized log probability score is still the most probable:
  c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]
- Note that the model is now just a max of a sum of weights
Naïve Bayes example
Given: 4 documents
  D1 (sports): China soccer
  D2 (sports): Japan baseball
  D3 (politics): China trade
  D4 (politics): Japan Japan exports
Classify:
  D5: soccer
  D6: Japan
Use
- Add-one smoothing
- Multinomial model
- Multivariate binomial model

Naïve Bayes example
V is {China, soccer, Japan, baseball, trade, exports}, |V| = 6
Sizes: Sports = 2 docs, 4 tokens; Politics = 2 docs, 5 tokens

  Japan      Raw    Smoothed
  Sports     1/4    2/10
  Politics   2/5    3/11

  soccer     Raw    Smoothed
  Sports     1/4    2/10
  Politics   0/5    1/11
Naïve Bayes example
Classifying "soccer" (as a doc):
  P(soccer | sports) = .2
  P(soccer | politics) = .09
Sports > Politics (the priors are equal, so they cancel), or
  .2 / (.2 + .09) = .69
  .09 / (.2 + .09) = .31

New example
What about a doc like the following?
  "Japan soccer"
Sports: P(japan | sports) P(soccer | sports) P(sports)
  .2 * .2 * .5 = .02
Politics: P(japan | politics) P(soccer | politics) P(politics)
  .27 * .09 * .5 = .01
Or roughly .66 to .33
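The arithmetic above can be checked directly. Note that with the unrounded fractions the normalized score for sports comes out near .62; the slide's .66/.33 reflects rounding the politics score down to .01 before normalizing.

```python
# Reproducing the worked example with add-one smoothing:
# |V| = 6; sports has 4 training tokens, politics has 5.
V = 6
p_soccer_sports = (1 + 1) / (4 + V)    # 2/10 = 0.2
p_soccer_politics = (0 + 1) / (5 + V)  # 1/11, about 0.09
p_japan_sports = (1 + 1) / (4 + V)     # 0.2
p_japan_politics = (2 + 1) / (5 + V)   # 3/11, about 0.27

# Scoring "Japan soccer" with uniform priors (0.5 each):
sports = p_japan_sports * p_soccer_sports * 0.5        # 0.02
politics = p_japan_politics * p_soccer_politics * 0.5  # about 0.012

print(sports, politics)
print(sports / (sports + politics))  # about 0.62 with unrounded values
```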
Evaluating Categorization
- Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
- Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
- Average results over multiple training and test sets (splits of the overall data) for the best results.
- Example: AutoYahoo! Classify 13,589 Yahoo! webpages in "Science" subtree into 95 different topics (hierarchy depth 2)
WebKB Experiment
Classify webpages from CS departments into: student, faculty, course, project
- Train on ~5,000 hand-labeled web pages (Cornell, Washington, U.Texas, Wisconsin)
- Crawl and classify a new site (CMU)

             Student  Faculty  Person  Project  Course  Departmt
  Extracted  180      66       246     99       28      1
  Correct    130      28       194     72       25      1
  Accuracy:  72%      42%      79%     73%      89%     100%

NB Model Comparison
SpamAssassin
- Naïve Bayes made a big splash with spam filtering
  - Paul Graham's "A Plan for Spam"
  - And its offspring...
- Naive Bayes-like classifier with weird parameter estimation
- Widely used in spam filters
  - Classic Naive Bayes superior when appropriately used (according to David D. Lewis)
- Many email filters use NB classifiers
  - But also many other things: black hole lists, etc.
Naïve Bayes on spam email
Naive Bayes is Not So Naive
- Does well in many standard evaluation competitions
- Robust to irrelevant features
  Irrelevant features cancel each other without affecting results; decision trees, in contrast, can suffer heavily from this.
- Very good in domains with many equally important features
  Decision trees suffer from fragmentation in such cases, especially with little data
- A good dependable baseline for text classification
- Very fast: learning with one pass over the data; testing linear in the number of attributes and document collection size
- Low storage requirements
Next couple of classes
- Other classification issues
- What about vector spaces?
- Lucene infrastructure
- Better ML approaches
  - SVMs, etc.