Practical session 3: Machine learning for NLP
Traitement Automatique des Langues
21 February 2018

1 Introduction

In this practical session, we will explore machine learning models for NLP applications; specifically, we will train a classifier for sentiment analysis on a French dataset of movie reviews. For these exercises, we will make use of Python (v2.7) and a number of modules for data processing and machine learning: numpy, scipy, scikit-learn, and pandas. If you want to use your own computer, you will need to make sure these are installed (e.g. using the command pip). If you're using Miniconda, you can use the command conda install <modulename>. We will also make use of nltk (the natural language processing module that we experimented with in the first practical session).

First, download the archive for the practical session to an appropriate working directory from the following address:

    http://www.irit.fr/~tim.van-de-cruys/tal/tp/tp3/tp3.zip

Under Linux, you can issue the following commands:

    $ wget http://www.irit.fr/~tim.van-de-cruys/tal/tp/tp3/tp3.zip
    $ unzip tp3.zip
    $ cd tp3

The first command downloads a ZIP archive (which contains the sentiment analysis dataset) to your working directory. The second command unpacks the archive.

An NLP machine learning pipeline contains the following stages:
- data preprocessing (tokenization)
- feature extraction
- model training
- evaluation

We'll go through these stages step by step, using sentiment classification as an application. As a dataset, we'll be using a set of reviews of television series in French, extracted from the website allocine.fr. The dataset consists of the text of each review, as well as a sentiment label (positive or negative). (Note that the original ratings on the site allocine.fr range from 0 to 4 stars; we will use binary classification instead: original reviews of 0 and 1 stars are considered negative, while reviews of 3 and 4 stars are considered positive.) The dataset is divided into a training part (5576 reviews, ±90%) and a test part (544 reviews, ±10%). The dataset is balanced, which means that positive and negative instances are evenly distributed. Additionally, the training and test sets contain reviews about different TV series (in order to avoid possible bias when evaluating).

Exercise 1
Why might the evaluation results be biased when reviews in the training and test set talk about the same television series?

2 Preprocessing

First, we'll load the training set. In Python, issue the following commands (you can also put the commands in a file and run the script separately if you like):

    import pandas as pd

    # header=0: the first line contains the column names;
    # delimiter="\t": the file is tab-separated;
    # quoting=3: ignore quote characters within the reviews
    train = pd.read_csv("allocine_train.tsv", header=0, \
                        delimiter="\t", quoting=3)

We are now able to examine the data.
Explore the dataset using the following commands:

    train.shape
    train.columns.values
    print train["review"][0]

As we've seen before, we need to preprocess the dataset in order to properly extract features from it. To do so, we'll create a function that makes use of the tokenisation functions of nltk. In order to reuse the function, we can save the commands below in a separate file named sentitools.py.

    import nltk

    # load the French sentence splitter and a word tokenizer
    french_tok_file = 'tokenizers/punkt/french.pickle'
    sent_tok = nltk.data.load(french_tok_file)
    word_tok = nltk.tokenize.TreebankWordTokenizer()

    def review_to_words(raw_review):
        review_string = raw_review.decode('utf8')
        review_lower = review_string.lower()
        # split the review into sentences, then into tokens
        sents = sent_tok.tokenize(review_lower)
        tokens = []
        for s in sents:
            tokens.extend(word_tok.tokenize(s))
        return " ".join(tokens)

Once we have our function ready, we can use it to carry out the actual tokenisation of the texts in the training set.

    from sentitools import review_to_words

    num_reviews = len(train["review"])
    clean_train_reviews = []
    for i in range(num_reviews):
        clean_review = review_to_words(train["review"][i])
        clean_train_reviews.append(clean_review)
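To get a first impression of the result, you can compare a raw review with its tokenised counterpart. The following is a minimal sketch (index 0 is an arbitrary choice):

    # compare a raw review with its tokenised version
    print train["review"][0]
    print clean_train_reviews[0]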
Exercise 2
Examine the tokenised reviews. What errors are made? What could be improved?

3 Feature extraction

Now it's time to decide which features to use in our classifier. We'll start with simple bag-of-words features.

    from sklearn.feature_extraction.text \
        import CountVectorizer

    vectorizer = CountVectorizer(
        analyzer = "word",
        max_features = 5000
    )
    train_data_features = vectorizer.fit_transform(
        clean_train_reviews
    )
    train_data_features = train_data_features.toarray()

We can look at the extracted feature vectors, as well as the vocabulary used by the vectorizer.

    print train_data_features.shape

    vocab = vectorizer.get_feature_names()
    print vocab
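It can also be instructive to see which vocabulary items occur most frequently in the training set. The following is a minimal sketch (it assumes the variables defined above):

    import numpy as np

    # sum the counts of each vocabulary item over all reviews,
    # and print the twenty most frequent items
    counts = np.sum(train_data_features, axis=0)
    for count, word in sorted(zip(counts, vocab), reverse=True)[:20]:
        print count, word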
4 Classification

Scikit-learn contains many different implementations of classification algorithms. We'll start with the example from last week's class: naïve Bayes.

    from sklearn.naive_bayes import MultinomialNB, BernoulliNB

    classifier = MultinomialNB()
    classifier.fit(train_data_features, train["sentiment"])

Our model has now been trained on the training set; we can now test its performance on the test set. First, we carry out the same preprocessing and feature extraction on the test set.

    test = pd.read_csv("allocine_test.tsv", header=0, \
                       delimiter="\t", quoting=3)

    num_reviews = len(test["review"])
    clean_test_reviews = []
    for i in range(num_reviews):
        clean_review = review_to_words(test["review"][i])
        clean_test_reviews.append(clean_review)

    # note: we use transform (not fit_transform), so that the test set
    # is represented with the features extracted from the training set
    test_data_features = vectorizer.transform(
        clean_test_reviews
    )
    test_data_features = test_data_features.toarray()

Next, we can compute the performance on the test set.

    score = classifier.score(
        test_data_features, test["sentiment"]
    )
    print score

Exercise 3
What does the score represent? Look at the instances that were classified badly. Do you see why these reviews were misclassified? Hint: use the predict function.
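As a starting point for this exercise, the following sketch prints the misclassified test reviews, together with the predicted and the gold label (it assumes the variables defined above):

    # predict a label for each test review, and print
    # the ones for which the prediction is wrong
    predictions = classifier.predict(test_data_features)
    for i in range(len(predictions)):
        if predictions[i] != test["sentiment"][i]:
            print "predicted:", predictions[i], \
                  "- gold:", test["sentiment"][i]
            print clean_test_reviews[i]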
4.1 K-fold cross validation

Usually, we will want to try out different parameters, in order to see what works best for our task. As such, we might experiment with:

- different features
- different classification algorithms
- different model parameters

However, we have to be careful: we cannot use our test set over and over again, as we'll be optimizing our parameters for that particular test set (and run the risk of overfitting, which means we are not able to properly generalize to data we haven't trained on). For this reason, we need to make use of a validation set. However, our training set is already quite small; creating a separate validation set would give us even less training data. Fortunately, we don't have to create a separate set: we can use k-fold cross validation. The idea is the following:

- Break up the data into k (e.g. 10) parts (folds)
- For each fold:
  - the current fold is used as a temporary test set
  - the other 9 folds are used as training data
  - performance is computed on the test fold
- Average the performance over the 10 runs

Note that, again, we want to make sure that the movies that are reviewed in our training set are different from the ones that appear in our validation set. Scikit-learn has a function for this:
    from sklearn.model_selection import GroupKFold

    # split the data into 10 folds, such that reviews of the same
    # movie (movie_id) never appear in both the training part and
    # the test part of a fold
    group_kfold = GroupKFold(n_splits=10)
    score_kfold = []
    for train_index, test_index in group_kfold.split(
            train_data_features, train["sentiment"],
            train["movie_id"]):
        X_train, X_test = train_data_features[train_index], \
                          train_data_features[test_index]
        y_train, y_test = train["sentiment"][train_index], \
                          train["sentiment"][test_index]
        classifier.fit(X_train, y_train)
        score_kfold.append(classifier.score(X_test, y_test))

    print sum(score_kfold) / float(len(score_kfold))

Exercise 4
Experiment with different feature sets (a sketch of some relevant arguments is given after Exercise 5):

- Exclude a list of stopwords. Hint: NLTK provides a list of stopwords for French; look at the arguments of CountVectorizer to include them.
- Experiment with n-grams instead of bag of words. Hint: again, look at the arguments of CountVectorizer in order to extract n-grams.
- What if you change the number of vocabulary elements included?
- Can you think of other features to include?

Exercise 5
Experiment with different models:

- Try a naïve Bayes classifier that uses binary features (word presence instead of word count).
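As a starting point for Exercises 4 and 5, the following sketch shows some of the relevant CountVectorizer arguments; the parameter values are arbitrary illustrations, not recommended settings:

    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB

    # list of French stopwords (running nltk.download('stopwords')
    # may be needed the first time)
    french_stopwords = stopwords.words('french')

    # stopword removal, unigrams and bigrams
    vectorizer = CountVectorizer(analyzer = "word",
                                 stop_words = french_stopwords,
                                 ngram_range = (1, 2),
                                 max_features = 10000)

    # binary features (word presence instead of word count),
    # combined with a Bernoulli naïve Bayes classifier
    binary_vectorizer = CountVectorizer(analyzer = "word",
                                        binary = True,
                                        max_features = 5000)
    classifier = BernoulliNB()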
Exercise 6
- Try any other classifier included with scikit-learn (decision trees, SVMs, ...). How does it perform?
- When you've determined the best set of parameters (according to cross-validation), compute the performance on the test set.

4.2 Intrinsic model evaluation

Some models allow us to look at the most informative features. Using logistic regression, you can do the following:

    from sklearn.linear_model import LogisticRegression

    classifier = LogisticRegression()
    classifier.fit(train_data_features, train["sentiment"])

    # pair each feature's coefficient with the corresponding word,
    # and sort from most positive to most negative coefficient
    allcoefficients = [(classifier.coef_[0,i], vocab[i]) \
                       for i in range(len(vocab))]
    allcoefficients.sort()
    allcoefficients.reverse()

Exercise 7
Examine both the top and the bottom of the list. Which features are the most informative?
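To get started, both ends of the sorted list can be printed as follows (a minimal sketch; the choice of ten features is arbitrary):

    # ten most positive coefficients (indicative of positive reviews)
    for coef, word in allcoefficients[:10]:
        print coef, word

    # ten most negative coefficients (indicative of negative reviews)
    for coef, word in allcoefficients[-10:]:
        print coef, word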