Parts of Speech, Named Entity Recognizer
Artificial Intelligence @ Allegheny College
Janyl Jumadinova
November 8, 2018
NLTK
  $ python3
  >>> import nltk
  >>> nltk.download()
NLTK
  Tokenize using Python
    1. urllib module to crawl the webpage
    2. BeautifulSoup to strip the HTML tags from the text
    3. convert text into tokens using the split() function
  Remove Stop Words
    1. get English stop words from NLTK
    2. remove stop words before plotting
  Frequency Analysis
    1. NLTK's FreqDist to calculate the frequency distribution
    2. plot function to produce a graph
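The three steps above can be sketched as follows. This is a minimal illustration under several assumptions: the HTML string is a made-up example rather than a crawled page, the stop-word set is a tiny stand-in for the list NLTK provides through nltk.corpus.stopwords.words("english"), and collections.Counter stands in for nltk.FreqDist (which is a Counter subclass). Tags are stripped with the standard-library html.parser instead of BeautifulSoup so that the sketch runs without extra installs.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_tokens(html):
    """Strip tags, lowercase, then tokenize on whitespace with split()."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).lower().split()


# Hypothetical stand-in for nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def word_frequencies(tokens, stop_words=STOP_WORDS):
    """Count only the tokens that are not stop words."""
    return Counter(t for t in tokens if t not in stop_words)


# Made-up page content; a real run would fetch it with urllib.request.urlopen.
SAMPLE_HTML = "<html><body><h1>The cat</h1><p>The cat sat in the hat</p></body></html>"

tokens = html_to_tokens(SAMPLE_HTML)
freq = word_frequencies(tokens)
print(freq.most_common(3))
```

With NLTK and matplotlib installed, passing the filtered tokens to nltk.FreqDist and calling its plot() method draws the frequency graph mentioned in the last step.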
Parts of Speech (POS)
POS Tagging
  Words often have more than one POS:
    - The back door (adjective)
    - On my back (noun)
    - Win the voters back (adverb)
    - Promised to back the bill (verb)
  The POS tagging problem is to determine the POS tag for a particular instance of a word.
POS Tagging
  Input: Plays well with others
  Ambiguity: NNS/VBZ UH/JJ/NN/RB IN NNS
  Output: Plays/VBZ well/RB with/IN others/NNS
  Tags are from the Penn Treebank tag set.
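To make the ambiguity concrete, the sketch below enumerates every tag sequence that the candidate tags on this slide allow. The lexicon is copied from the slide; the function name and the brute-force enumeration are illustrative assumptions, not how a real tagger searches.

```python
from itertools import product

# Candidate Penn Treebank tags per word, copied from the slide's example.
CANDIDATE_TAGS = {
    "Plays": ["NNS", "VBZ"],
    "well": ["UH", "JJ", "NN", "RB"],
    "with": ["IN"],
    "others": ["NNS"],
}


def candidate_taggings(words, lexicon=CANDIDATE_TAGS):
    """Return every tag sequence the lexicon allows for the sentence."""
    return list(product(*(lexicon[w] for w in words)))


sentence = ["Plays", "well", "with", "others"]
taggings = candidate_taggings(sentence)
print(len(taggings))  # 2 * 4 * 1 * 1 = 8 candidate sequences

# The tagger's job is to pick one of them, here the slide's output.
gold = ("VBZ", "RB", "IN", "NNS")
print(gold in taggings)
```

With NLTK and its tagger model downloaded, nltk.pos_tag(nltk.word_tokenize("Plays well with others")) returns a list of (word, tag) pairs chosen by a trained model.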
Sentiment Analysis
Sentiment Analysis
  https://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
  www.sentiment140.com
  https://textblob.readthedocs.io/en/dev/
Sentiment analysis has many other names
  - Opinion extraction
  - Opinion mining
  - Sentiment mining
  - Subjectivity analysis
Sentiment Analysis
  Sentiment analysis is the detection of attitudes: "enduring, affectively colored beliefs, dispositions towards objects or persons"
Attitudes
  Holder (source) of attitude
  Target (aspect) of attitude
  Type of attitude
    - From a set of types: like, love, hate, value, desire, etc.
    - Or (more commonly) simple weighted polarity: positive, negative, neutral, together with strength
  Text containing the attitude
    - Sentence or entire document
Sentiment analysis
  Simplest task: Is the attitude of this text positive or negative?
  More complex: Rank the attitude of this text from 1 to 5
  Advanced: Detect the target, source, or complex attitude types
Baseline Algorithm
  Tokenization
  Feature Extraction
  Classification using different classifiers
    - Naive Bayes
    - MaxEnt
    - SVM
Sentiment Tokenization Issues
  - Deal with HTML and XML markup
  - Twitter/Facebook/... markup (names, hashtags)
  - Capitalization (preserve for words in all caps)
  - Phone numbers, dates
  - Emoticons
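One way to address the issues listed above is a single regular expression that recognizes the special token types before falling back to ordinary words. The sketch below is only an illustration: the patterns (US-style phone numbers, numeric dates, a handful of emoticons), the sample text, and the function name are all assumptions, and a production tokenizer would need many more cases.

```python
import re

TOKEN_RE = re.compile(
    r"""
      <[^>]+>                     # HTML/XML tag (dropped by tokenize)
    | [@#]\w+                     # Twitter-style mention or hashtag
    | [:;=]-?[)(\]\[DP|]          # a few emoticons such as :) ;-) :D
    | \d{3}-\d{3}-\d{4}           # US-style phone number
    | \d{1,2}/\d{1,2}/\d{2,4}     # numeric date
    | \w+(?:'\w+)?                # word, optionally with an apostrophe
    | [^\w\s]                     # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, dropping markup and keeping all-caps words."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        if tok.startswith("<") and tok.endswith(">"):
            continue  # drop HTML/XML markup
        # Lowercase ordinary words, but preserve emphasis such as "GREAT".
        if tok[0].isalpha() and not (len(tok) > 1 and tok.isupper()):
            tok = tok.lower()
        tokens.append(tok)
    return tokens


# Made-up example text.
SAMPLE = "<p>@anna This movie was GREAT :) call 555-123-4567 on 11/8/2018 #nlp</p>"
print(tokenize(SAMPLE))
```

NLTK also ships nltk.tokenize.TweetTokenizer, which targets the same kind of social-media text.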
Extracting Features for Sentiment Classification
  How to handle negation:
    - I didn't like this movie
      vs.
    - I really like this movie
  Which words to use?
    - Only adjectives
    - All words
Negation
  Prepend NOT_ to every word between a negation and the following punctuation mark
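The negation rule above can be sketched in a few lines. The list of negation cues, the punctuation set, and the NOT_ prefix spelling are assumptions made for illustration; the input is assumed to be already tokenized, with punctuation as separate tokens.

```python
import re

# Assumed negation cues; real systems use a longer list.
NEGATION_RE = re.compile(r"^(?:not|no|never|\w*n't)$", re.IGNORECASE)
# Punctuation marks that end the scope of a negation.
PUNCT_RE = re.compile(r"^[.,:;!?]$")


def mark_negation(tokens):
    """Prefix NOT_ to each token after a negation cue, up to the next punctuation."""
    marked = []
    negating = False
    for tok in tokens:
        if PUNCT_RE.match(tok):
            negating = False          # punctuation closes the negation scope
            marked.append(tok)
        elif negating:
            marked.append("NOT_" + tok)
        elif NEGATION_RE.match(tok):
            negating = True           # start marking from the next token
            marked.append(tok)
        else:
            marked.append(tok)
    return marked


tokens = "I didn't like this movie , but I liked the cast".split()
print(mark_negation(tokens))
```

After marking, "like" and "NOT_like" become distinct features, so a bag-of-words classifier can learn different weights for them. NLTK's nltk.sentiment.util.mark_negation implements a similar idea with a _NEG suffix.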
Naive Bayes Algorithm
  Simple ("naive") classification method based on Bayes' rule
  Relies on a very simple representation of the document:
    - Bag of words
Naive Bayes Algorithm
  For a document d and a class c
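The formula on this slide did not survive extraction. Assuming it is the standard application of Bayes' rule to a document $d$ and a class $c$, it would read as below; the bag-of-words form in the last line additionally assumes that the words $w_1, \dots, w_n$ of $d$ are conditionally independent given the class.

```latex
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}

c_{\mathrm{MAP}} = \arg\max_{c} P(c \mid d) = \arg\max_{c} P(d \mid c)\, P(c)

c_{\mathrm{NB}} = \arg\max_{c} P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The denominator $P(d)$ can be dropped in the second line because it is the same for every class.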
Binarized (Boolean feature) Multinomial Naive Bayes
  Intuition:
    - Word occurrence may matter more than word frequency
    - The occurrence of the word "fantastic" tells us a lot
    - The fact that it occurs 5 times may not tell us much more
  Boolean Multinomial Naive Bayes
    - Clips all the word counts in each document at 1
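The clipping step can be sketched as a small multinomial Naive Bayes classifier in which each document is reduced to its set of distinct words before counting. This is a minimal illustration, not a reference implementation: the training documents are made up, and add-one smoothing and ignoring unseen words at prediction time are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_binarized_nb(docs):
    """Train on (tokens, label) pairs, clipping word counts at 1 per document."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # per-class binarized word counts
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        unique = set(tokens)            # each word counts once per document
        word_counts[label].update(unique)
        vocab |= unique
    return class_docs, word_counts, vocab


def predict(model, tokens):
    """Return the class with the highest log-probability score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # log prior
        for w in set(tokens):           # binarize the test document too
            if w in vocab:              # assumed: skip words never seen in training
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data.
TRAIN = [
    ("fantastic fantastic fantastic plot".split(), "pos"),
    ("great acting fantastic cast".split(), "pos"),
    ("boring plot terrible acting".split(), "neg"),
    ("terrible terrible boring".split(), "neg"),
]

model = train_binarized_nb(TRAIN)
print(predict(model, "fantastic plot".split()))
print(predict(model, "boring terrible acting".split()))
```

In this toy data "fantastic" appears four times in the positive documents but contributes a count of only 2, one per document that contains it.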