SI485i : NLP. Set 5 Using Naïve Bayes


Motivation. We want to predict something. We have some text related to this something. something = target label Y; text = text features X. Given X, what is the most probable Y?

Motivation: Author Detection. X = "Alas the day! take heed of him; he stabbed me in mine own house, and that most beastly: in good faith, he cares not what mischief he does. If his weapon be out: he will foin like any devil; he will spare neither man, woman, nor child." Y = { Charles Dickens, William Shakespeare, Herman Melville, Jane Austen, Homer, Leo Tolstoy }. Predict: argmax_y P(Y = y | X)

More Motivation. Y = spam, X = e-mail. Y = worth, X = review sentence.

The Naïve Bayes Classifier. Recall Bayes rule:

P(Y | X) = P(X | Y) P(Y) / P(X)

Which is short for:

P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / P(X = x_j)

We can re-write this as:

P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / Σ_k P(X = x_j | Y = y_k) P(Y = y_k)

Remaining slides adapted from Tom Mitchell.
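As a quick numeric sanity check of the rule, here is a tiny Python sketch with made-up numbers (the spam example and all probabilities are illustrative assumptions, not from the slides). The denominator is expanded as a sum over both values of Y:

```python
# Assumed toy numbers: P(Y=spam) = 0.3, P("free" appears | spam) = 0.6,
# P("free" appears | not spam) = 0.1.
p_spam = 0.3
p_free_given_spam = 0.6
p_free_given_ham = 0.1

# Denominator: sum over all values of Y, as in the expanded form of Bayes rule.
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes rule: P(spam | "free") = P("free" | spam) P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.18 / 0.25 = 0.72
```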

Deriving Naïve Bayes. Idea: use the training data to directly estimate P(X | Y) and P(Y). We can use these values to estimate P(Y | X_new) using Bayes rule. Recall that representing the full joint probability P(X_1, X_2, ..., X_n | Y) is not practical.

Deriving Naïve Bayes. However, if we make the assumption that the attributes are independent, estimation is easy!

P(X_1, ..., X_n | Y) = Π_i P(X_i | Y)

In other words, we assume all attributes are conditionally independent given Y. Often this assumption is violated in practice, but more on that later.

Deriving Naïve Bayes. Let X = <X_1, ..., X_n> and the label Y be discrete. Then, we can estimate P(X_i | Y) and P(Y) directly from the training data by counting!

Sky    Temp  Humid   Wind    Water  Forecast  Play?
sunny  warm  normal  strong  warm   same      yes
sunny  warm  high    strong  warm   same      yes
rainy  cold  high    strong  warm   change    no
sunny  warm  high    strong  cool   change    yes

P(Sky = sunny | Play = yes) = ?
P(Humid = high | Play = yes) = ?
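These "estimate by counting" probabilities can be computed with a few lines of Python. This is a minimal sketch over the four training rows from the table above; the function name `cond_prob` is just an illustrative choice:

```python
# The four training examples from the table:
# (Sky, Temp, Humid, Wind, Water, Forecast) -> Play?
data = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), "yes"),
    (("sunny", "warm", "high", "strong", "warm", "same"), "yes"),
    (("rainy", "cold", "high", "strong", "warm", "change"), "no"),
    (("sunny", "warm", "high", "strong", "cool", "change"), "yes"),
]

def cond_prob(attr_index, attr_value, label):
    """Estimate P(X_i = attr_value | Y = label) by counting rows."""
    with_label = [x for x, y in data if y == label]
    matches = sum(1 for x in with_label if x[attr_index] == attr_value)
    return matches / len(with_label)

print(cond_prob(0, "sunny", "yes"))  # P(Sky=sunny | Play=yes) = 3/3 = 1.0
print(cond_prob(2, "high", "yes"))   # P(Humid=high | Play=yes) = 2/3
```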

The Naïve Bayes Classifier. Now we have:

P(Y = y_j | X_1, ..., X_n) = P(Y = y_j) Π_i P(X_i | Y = y_j) / Σ_k P(Y = y_k) Π_i P(X_i | Y = y_k)

To classify a new point X_new:

Y_new = argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)

The Naïve Bayes Algorithm. For each value y_k: estimate P(Y = y_k) from the data. For each value x_ij of each attribute X_i: estimate P(X_i = x_ij | Y = y_k). Classify a new point via:

Y_new = argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)

In practice, the independence assumption doesn't often hold true, but Naïve Bayes performs very well despite it.
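Putting the algorithm together, here is a minimal Python sketch (not the lab's reference code): train by counting, classify with log-probabilities, and add simple add-alpha smoothing so an attribute value never seen with a label doesn't zero out the whole product:

```python
import math
from collections import Counter, defaultdict

# Training data from the table slide: attribute tuple -> Play? label.
data = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), "yes"),
    (("sunny", "warm", "high", "strong", "warm", "same"), "yes"),
    (("rainy", "cold", "high", "strong", "warm", "change"), "no"),
    (("sunny", "warm", "high", "strong", "cool", "change"), "yes"),
]

def train_nb(examples):
    """Estimate P(Y) and P(X_i | Y) statistics by counting."""
    label_counts = Counter(y for _, y in examples)
    cond_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    for x, y in examples:
        for i, v in enumerate(x):
            cond_counts[(i, y)][v] += 1
    return label_counts, cond_counts

def classify_nb(x_new, label_counts, cond_counts, alpha=1.0):
    """Y_new = argmax_y log P(y) + sum_i log P(x_i | y)."""
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for y, n_y in label_counts.items():
        score = math.log(n_y / total)
        for i, v in enumerate(x_new):
            counts = cond_counts[(i, y)]
            # +1 in the denominator reserves mass for one unseen value
            # (a simple smoothing choice, not the only one).
            score += math.log((counts[v] + alpha) / (n_y + alpha * (len(counts) + 1)))
        if score > best_score:
            best, best_score = y, score
    return best

label_counts, cond_counts = train_nb(data)
print(classify_nb(("sunny", "warm", "high", "strong", "cool", "same"),
                  label_counts, cond_counts))  # "yes"
```

Working in log space avoids underflow when multiplying many small probabilities, which matters once the feature set grows beyond a toy example.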

An alternate view of NB as LMs. Y_1 = dickens, Y_2 = twain.

P(Y_1 | X) ∝ P(Y_1) * P(X | Y_1)
P(Y_2 | X) ∝ P(Y_2) * P(X | Y_2)

Bigrams: P(X | Y_1) = Π_i P(x_i | x_{i-1}, Y_1)
Bigrams: P(X | Y_2) = Π_i P(x_i | x_{i-1}, Y_2)
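In this view, each author gets their own language model, and classification is just "whose model gives the text the highest probability, weighted by the prior". A minimal Python sketch, assuming toy one-sentence training texts and an arbitrary vocabulary size for smoothing (none of these choices come from the lab):

```python
import math
from collections import Counter

def train_bigram(text):
    """Collect bigram and preceding-unigram counts for one author's text."""
    tokens = ["<s>"] + text.lower().split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens[:-1])
    return bigrams, unigrams

def log_prob(text, model, alpha=1.0, vocab=1000):
    """log P(text | author) under a bigram model with add-alpha smoothing."""
    bigrams, unigrams = model
    tokens = ["<s>"] + text.lower().split()
    lp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        lp += math.log((bigrams[(prev, cur)] + alpha) /
                       (unigrams[prev] + alpha * vocab))
    return lp

# P(Y | X) ∝ P(Y) * P(X | Y): pick the author whose LM scores the text highest.
models = {
    "dickens": train_bigram("it was the best of times it was the worst of times"),
    "twain":   train_bigram("the report of my death was an exaggeration"),
}
prior = {"dickens": 0.5, "twain": 0.5}
text = "it was the best"
best = max(models, key=lambda a: math.log(prior[a]) + log_prob(text, models[a]))
print(best)  # "dickens" -- its bigram counts cover this phrase
```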

Naïve Bayes Applications. Text classification: Which e-mails are spam? Which e-mails are meeting notices? Which author wrote a document? Which webpages are about current events? Which blog contains angry writing? Which sentence in a document talks about a company? etc.

Text and Features. What is X = <X_1, ..., X_n>? Could be unigrams, hopefully bigrams too. It can be anything that is computed from the text. Yes, I really mean anything. Creativity and intuition into language is where the real gains come from in NLP. Non n-gram examples: X_10 = the number of sentences that begin with conjunctions; X_356 = existence of a semi-colon in the paragraph.

Features. In machine learning, features are the attributes to which you assign weights (probabilities in Naïve Bayes) that help in the final classification. Up until now, our features have been n-grams. You now want to consider other types of features. You count features just like n-grams: how many did you see? X = set of features; P(Y | X) = probability of Y given a set of features.

How do you count features? Feature idea: "a semicolon exists in this sentence". Count them: Count(FEAT-SEMICOLON) += 1. Make up a unique name for the feature, then count! Compute the probability: P(FEAT-SEMICOLON | author = dickens) = Count(FEAT-SEMICOLON) / # dickens sentences
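The name-it-then-count-it recipe can be sketched in a few lines of Python. The mini-corpus, feature names, and helper functions below are all illustrative assumptions, not real training data or lab code:

```python
# Hypothetical mini-corpus: a couple of sentences per author.
sentences = {
    "dickens": ["It was the best of times; it was the worst of times.",
                "Please, sir, I want some more."],
    "austen":  ["It is a truth universally acknowledged.",
                "She was a woman of mean understanding; little information."],
}

def extract_features(sentence):
    """Turn a sentence into uniquely named feature events, counted like n-grams."""
    feats = []
    if ";" in sentence:
        feats.append("FEAT-SEMICOLON")
    if sentence.split()[0].lower() in ("and", "but", "or"):
        feats.append("FEAT-STARTS-WITH-CONJUNCTION")
    return feats

def feature_prob(feature, author):
    """P(feature | author) = # author sentences showing it / # author sentences."""
    sents = sentences[author]
    count = sum(1 for s in sents if feature in extract_features(s))
    return count / len(sents)

print(feature_prob("FEAT-SEMICOLON", "dickens"))  # 1 of 2 sentences -> 0.5
```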

Authorship Lab. 1. Figure out how to use your Language Models from Lab 2. They can be your initial features. Can you train a model on one author's text? 2. P(dickens | text) = P(dickens) * BigramModel(text). 3. New code for new features. Call your language models, get a probability, and then multiply in the new feature probabilities.