4/4/18. MeSH Subject Category Hierarchy. Arch. Graphics. Theory. Text Classification and Naïve Bayes K-Nearest Neighbor (KNN) Classifier



Text Classification and Naïve Bayes; K-Nearest Neighbor (KNN) Classifier. Lecturer: Burcu Can, 2017-2018


Text Classification and Naïve Bayes; K-Nearest Neighbor (KNN) Classifier
Lecturer: Burcu Can, Spring

From: ""
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
=================================================
Click Below to order:
=================================================

1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
Authorship of 12 of the letters in dispute.
1963: solved by Mosteller and Wallace using Bayesian methods.
[Portraits: James Madison, Alexander Hamilton]

MEDLINE Article: which MeSH Subject Category Hierarchy?
- Antagonists and Inhibitors
- Blood Supply
- Chemistry
- Drug Therapy
- Embryology
- Epidemiology

Example classes: Arch., Graphics, Theory, NLP, AI

Label sets:
- LABELS=BINARY: spam / not spam
- LABELS=TOPICS: finance / sports / asia
- LABELS=OPINION: like / hate / neutral
- LABELS=AUTHOR: Shakespeare / Marlowe / Ben Jonson; The Federalist papers

Text classification tasks:
- Assigning subject categories, topics, or genres
- Spam detection
- Authorship identification
- Age/gender identification
- Language identification
- Sentiment analysis

Text classification: definition
Input:
- a document $d$
- a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
Output: a predicted class $c \in C$

Classification methods:
- Manual classification
- Hand-coded rules
- Supervised learning
- Unsupervised learning (i.e., clustering)

Manual classification
- Used by Yahoo!, Looksmart, about.com, Medline
- Very accurate when the job is done by experts
- Consistent when the problem size and team are small
- Difficult and expensive to scale

Automatic document classification: hand-coded rule-based systems
- Used by Reuters, CIA, Verity, ...
- Rules are based on combinations of words or other features, for example:
  spam: black-list-address OR ("dollars" AND "have been selected")
- Accuracy can be high if the rules are carefully refined by an expert
- But building and maintaining these rules is expensive

Supervised learning
Input:
- a document $d$
- a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
- a training set of $m$ hand-labeled documents $(d_1, c_1), \ldots, (d_m, c_m)$
Output: a learned classifier $\gamma: d \rightarrow c$

Any kind of classifier can be used:
- Naïve Bayes
- Logistic regression
- Support-vector machines
- k-Nearest Neighbors

Supervised learning of a document-label assignment function
- Many systems partly rely on machine learning (Autonomy, MSN, Verity, Enkata, Yahoo!, ...)
  - k-Nearest Neighbors (simple, powerful)
  - Naïve Bayes (simple, common method)
  - Support-vector machines (more powerful)
  - plus many other methods
- No free lunch: requires hand-classified training data
- But the data can be built up (and refined) by amateurs
- Note that many commercial systems use a mixture of methods

The bag-of-words representation
Original review:
"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"

The same review as an unordered bag of words:
fairy always love it it to whimsical it and I seen are friend anyone happy dialogue adventure recommend who sweet of satirical it it I to movie but romantic I several yet again it the humor the seen would to scenes I the manages fun the I times and and about whenever while have conventions with

Word counts:
it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1
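The bag-of-words step above can be sketched in a few lines of Python. This is only an illustration: the tokenizer (lowercasing plus a simple regular expression) and the function name `bag_of_words` are assumptions, not something the slides specify, so the counts it produces on the full review need not match the slide's figure exactly.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token.

    The regex tokenizer is an assumption made for this sketch; word order
    is discarded, which is exactly the bag-of-words simplification.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


review = "I love this movie! I would recommend it to just about anyone."
counts = bag_of_words(review)
print(counts.most_common(3))
```

A `Counter` returns 0 for words it has never seen, which is convenient when a classifier later looks up words that are absent from a document.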


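Bag-of-words input to the classifier:
$\gamma(\text{seen } 2, \text{ sweet } 1, \text{ whimsical } 1, \text{ recommend } 1, \text{ happy } 1, \ldots) = c$

Bayes' rule, for a document $d$ and a class $c$:

$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$

MAP is "maximum a posteriori", i.e. the most likely class:

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(c \mid d)$$

$$= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} \qquad \text{(Bayes' rule)}$$

$$= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \qquad \text{(dropping the denominator)}$$

With document $d$ represented as features $x_1, \ldots, x_n$:

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$

- $P(x_1, x_2, \ldots, x_n \mid c)$ has $O(|X|^n \cdot |C|)$ parameters. It could only be estimated if a very, very large number of training examples was available.
- $P(c)$: how often does this class occur? We can just count the relative frequencies in a corpus.

Multinomial Naïve Bayes independence assumptions
- Bag of Words assumption: assume position doesn't matter.
- Conditional Independence: assume the feature probabilities $P(x_i \mid c_j)$ are independent given the class $c$.

$$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)$$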
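To get a feel for why the independence assumptions matter, the following sketch compares the order-of-magnitude parameter counts of the full joint model and the Naïve Bayes model. The function names and the toy sizes are assumptions for illustration, and the counts ignore the sum-to-one constraints, so they are rough orders of magnitude rather than exact degrees of freedom.

```python
def joint_parameter_count(vocab_size: int, doc_length: int, num_classes: int) -> int:
    """One probability per (word sequence, class) pair: |X|^n * |C|."""
    return vocab_size ** doc_length * num_classes


def naive_bayes_parameter_count(vocab_size: int, num_classes: int) -> int:
    """One probability per (word, class) pair, shared across all positions."""
    return vocab_size * num_classes


# Toy sizes, chosen only to make the gap visible.
vocab_size, doc_length, num_classes = 10_000, 20, 2
full = joint_parameter_count(vocab_size, doc_length, num_classes)
naive = naive_bayes_parameter_count(vocab_size, num_classes)
print(f"full joint: {full:.1e} parameters, Naive Bayes: {naive} parameters")
```

Under these assumed sizes the joint model would need far more parameters than any corpus could support, while the Naïve Bayes table stays small enough to estimate from counts.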

Learning the multinomial Naïve Bayes model (Sec. 13.3)
First attempt: maximum likelihood estimates, which simply use the frequencies in the data.

$$\hat{P}(c_j) = \frac{\mathrm{doccount}(C = c_j)}{N_{doc}}$$

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}$$

$\hat{P}(w_i \mid c_j)$ is the fraction of times word $w_i$ appears among all words in documents of topic $c_j$.
- Create a mega-document for topic $j$ by concatenating all docs in this topic
- Use the frequency of $w$ in the mega-document

Example: $c$ = China; $X_1$ = Shanghai, $X_2$ = and, $X_3$ = Shenzhen, $X_4$ = issue, $X_5$ = bonds

Naïve Bayes and language modeling (Sec. 13.2.1)
- Naïve Bayes classifiers can use any sort of feature: URL, email address, dictionaries, network features
- But if, as in the previous slides,
  - we use only word features, and
  - we use all of the words in the text (not a subset),
  then Naïve Bayes has an important similarity to language modeling.

Each class is a unigram language model
- Assigning each word: $P(\text{word} \mid c)$
- Assigning each sentence: $P(s \mid c) = \prod P(\text{word} \mid c)$

Which class assigns the higher probability to $s$ = "I love this fun film"?

Word    Model pos   Model neg
I       0.1         0.2
love    0.1         0.001
this    0.01        0.01
fun     ...         ...
film    0.1         0.1

$P(s \mid pos) = \ldots$
$P(s \mid pos) > P(s \mid neg)$
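The class-as-language-model idea can be sketched as follows. Note the assumptions: the probabilities for "fun" (0.05 under pos, 0.005 under neg) are illustrative values chosen for this sketch, the other entries follow the two models on the slide as best they can be read, and the function name `sentence_likelihood` is hypothetical.

```python
import math

# Per-class unigram models. The "fun" entries are assumed values for
# illustration only; they are not taken from the slide.
POS = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
NEG = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}


def sentence_likelihood(sentence: str, model: dict) -> float:
    """P(s | c) as the product of the per-word probabilities P(word | c)."""
    return math.prod(model[word] for word in sentence.split())


sentence = "I love this fun film"
p_pos = sentence_likelihood(sentence, POS)
p_neg = sentence_likelihood(sentence, NEG)
print("pos" if p_pos > p_neg else "neg")
```

With these numbers the positive model assigns the higher probability, mainly because "love" is a hundred times more likely under pos than under neg.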

A worked example with add-one (Laplace) smoothing

$$\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\mathrm{count}(c) + |V|} \qquad \hat{P}(c) = \frac{N_c}{N}$$

           Doc   Words                                   Class
Training   1     Chinese Beijing Chinese                 c
           2     Chinese Chinese Shanghai                c
           3     Chinese Macao                           c
           4     Tokyo Japan Chinese                     j
Test       5     Chinese Chinese Chinese Tokyo Japan     ?

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional probabilities:
P(Chinese | c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo | c) = (0+1) / (8+6) = 1/14
P(Japan | c) = (0+1) / (8+6) = 1/14
P(Chinese | j) = (1+1) / (3+6) = 2/9
P(Tokyo | j) = (1+1) / (3+6) = 2/9
P(Japan | j) = (1+1) / (3+6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
P(j | d5) ∝ 1/4 * (2/9)^3 * 2/9 * 2/9 ≈ 0.0001

SpamAssassin features:
- Mentions Generic Viagra
- Online Pharmacy
- Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
- Phrase: impress ... girl
- From: starts with many numbers
- Subject is all capitals
- HTML has a low ratio of text to image area
- One hundred percent guaranteed
- Claims you can be removed from the list
- 'Prestigious Non-Accredited Universities'

Very little data?
- Use Naïve Bayes: Naïve Bayes is a "high-bias" algorithm (Ng and Jordan 2002 NIPS)
- Get more labeled data: find clever ways to get humans to label data for you
- Try semi-supervised training methods: bootstrapping, EM over unlabeled documents, ...

A reasonable amount of data?
- Perfect for all the clever classifiers: SVM, regularized logistic regression
- You can even use user-interpretable decision trees: users like to hack, and management likes quick fixes

A huge amount of data?
- Can achieve high accuracy!
- At a cost: SVMs (train time) or kNN (test time) can be too slow; regularized logistic regression can be somewhat better
- So Naïve Bayes can come back into its own again!

With enough data, the classifier may not matter (Brill and Banko on spelling correction).
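The worked example can be checked mechanically. The sketch below trains add-one-smoothed multinomial Naïve Bayes on the four training documents and scores the test document; exact fractions are used so the results can be compared digit for digit with the hand calculation. The function names (`train`, `cond_prob`, `score`) are hypothetical, and whitespace tokenization is an assumption that happens to suffice for this toy data.

```python
from collections import Counter
from fractions import Fraction

TRAIN = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
TEST_DOC = "Chinese Chinese Chinese Tokyo Japan"


def train(docs):
    """Collect class priors, per-class word counts, and the vocabulary."""
    doc_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in doc_counts}
    for text, label in docs:
        word_counts[label].update(text.split())
    vocab = {word for text, _ in docs for word in text.split()}
    priors = {c: Fraction(n, len(docs)) for c, n in doc_counts.items()}
    return priors, word_counts, vocab


def cond_prob(word, label, word_counts, vocab):
    """Add-one smoothed estimate (count(w, c) + 1) / (count(c) + |V|)."""
    total = sum(word_counts[label].values())
    return Fraction(word_counts[label][word] + 1, total + len(vocab))


def score(text, label, priors, word_counts, vocab):
    """Unnormalized posterior: P(c) times the product of P(w | c)."""
    result = priors[label]
    for word in text.split():
        result *= cond_prob(word, label, word_counts, vocab)
    return result


priors, word_counts, vocab = train(TRAIN)
scores = {c: score(TEST_DOC, c, priors, word_counts, vocab) for c in priors}
predicted = max(scores, key=scores.get)
print(predicted, {c: float(s) for c, s in scores.items()})
```

The smoothing is what keeps P(Tokyo | c) and P(Japan | c) above zero; with unsmoothed maximum likelihood estimates the score for class c would collapse to 0 even though "Chinese" appears three times in the test document.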

Underflow prevention: log space
- Multiplying lots of probabilities can result in floating-point underflow.
- Since $\log(xy) = \log(x) + \log(y)$, it is better to sum logs of probabilities instead of multiplying probabilities.
- The class with the highest un-normalized log probability score is still the most probable.

$$c_{NB} = \operatorname*{argmax}_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \Big]$$

- The model is now just a max of a sum of weights.

Rocchio classification (Sec. 14.2)
Definition of centroid:

$$\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)$$

where $D_c$ is the set of all documents that belong to class $c$ and $\vec{v}(d)$ is the vector space representation of $d$.
Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.

- Rocchio forms a simple representative for each class: the centroid/prototype
- Classification: nearest prototype/centroid
- It does not guarantee that classifications are consistent with the given training data
- Little used outside text classification
- It has been used quite effectively for text classification
- But in general worse than Naïve Bayes
- Again, cheap to train and test documents
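Two points from this slide lend themselves to a short sketch: the underflow that motivates log-space scoring, and the Rocchio centroid. Everything named here (`product_score`, `log_score`, `centroid`, `nearest_centroid`) is hypothetical, the probabilities are made up, and Euclidean distance to the centroid is an assumption; a cosine-based variant would be equally reasonable.

```python
import math


def product_score(prior, word_probs):
    """Multiply the probabilities directly; this underflows for long documents."""
    score = prior
    for p in word_probs:
        score *= p
    return score


def log_score(prior, word_probs):
    """Sum log probabilities instead; the argmax over classes is unchanged."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)


def centroid(vectors):
    """Component-wise mean of the document vectors of one class."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_centroid(x, centroids):
    """Rocchio decision: the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))


# A 400-word document where every word has probability 0.01 (made-up numbers).
probs = [0.01] * 400
print(product_score(0.5, probs))  # underflows to 0.0
print(log_score(0.5, probs))      # a large negative but finite number

# Two unit vectors whose centroid is not itself a unit vector.
mu = centroid([[1.0, 0.0], [0.0, 1.0]])
print(mu, math.hypot(*mu))
```

The last line illustrates the note about centroids: averaging two unit vectors gives a vector of length below 1.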

kNN = k Nearest Neighbor
To classify a document d:
- Define the k-neighborhood as the k nearest neighbors of d
- Pick the majority class label in the k-neighborhood

[Figure: a test document among documents of the classes Government, Science, and Arts; what is P(government | test document)?]

Nearest-neighbor learning
- Learning: just store the labeled training examples D
- Testing instance x (under 1NN):
  - Compute the similarity between x and all examples in D
  - Assign x the category of the most similar example in D
- Does not compute anything beyond storing the examples
- Also called: case-based learning, memory-based learning, lazy learning

k Nearest Neighbor
- Using only the closest example (1NN) is subject to errors due to:
  - a single atypical example
  - noise (i.e., an error) in the category label of a single training example
- More robust: find the k nearest examples and return the majority category of these k
- k is typically odd to avoid ties; 3 and 5 are most common

kNN decision boundaries
- Boundaries are in principle arbitrary surfaces, but usually polyhedra
- kNN gives locally defined decision boundaries between classes: far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)

[Figure: decision boundaries between the classes Government, Science, and Arts]
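A minimal kNN sketch, assuming documents are already represented as numeric vectors and using cosine similarity as the similarity measure (the slides do not fix a particular measure). The toy vectors, the deliberately mislabeled example, and the function names are all made up for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_classify(x, examples, k=3):
    """Return the majority label among the k examples most similar to x."""
    ranked = sorted(examples, key=lambda ex: cosine(x, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# "Training" is just storing labeled vectors. The third example is a
# deliberately mislabeled point sitting inside the government region.
examples = [
    ([1.0, 0.1], "government"),
    ([0.9, 0.2], "government"),
    ([0.95, 0.15], "science"),   # label noise
    ([0.1, 1.0], "science"),
    ([0.2, 0.9], "science"),
]

x = [0.96, 0.15]
print(knn_classify(x, examples, k=1))  # follows the noisy neighbor
print(knn_classify(x, examples, k=3))  # majority vote overrides the noise
```

All the work happens at test time: each query is compared against every stored example, which is the cost the slides refer to when they say kNN can be slow at test time.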

kNN: discussion (Sec. 14.6)
- No feature selection necessary
- No training necessary
- Scales well with a large number of classes: no need to train n classifiers for n classes
- Classes can influence each other: small changes to one class can have a ripple effect
- May be expensive at test time
- In most cases it's more accurate than Naïve Bayes or Rocchio

Bias vs. variance
- NB has low variance and high bias: linear decision surface (hyperplane, see later)
- kNN has high variance and low bias: infinite memory

Which classifier to use? (Sec. 14.6)
Is there a learning method that is optimal for all text classification problems? No, because there is a tradeoff between bias and variance.
Factors to take into account:
- How much training data is available?
- How simple/complex is the problem? (linear vs. nonlinear decision boundary)
- How noisy is the data?
- How stable is the problem over time? For an unstable problem, it's better to use a simple and robust classifier.

Dan Jurafsky, Text Classification and Naïve Bayes
Roger Levy, Text Categorization through Naïve Bayes
