Practical session 3: Machine learning for NLP


Traitement Automatique des Langues
21 February 2018

1 Introduction

In this practical session, we will explore machine learning models for NLP applications; specifically, we will train a classifier for sentiment analysis on a French dataset of movie reviews.

For these exercises, we will make use of Python (v2.7) and a number of modules for data processing and machine learning: numpy, scipy, scikit-learn, and pandas. If you want to use your own computer, you will need to make sure these are installed (e.g. using the command pip). If you're using Miniconda, you can use the command conda install <modulename>. We will also make use of nltk (the natural language processing module that we experimented with in the first practical session).

First, download the archive for the practical session to an appropriate working directory from the following address:

    http://www.irit.fr/~tim.van-de-cruys/tal/tp/tp3/tp3.zip

Under Linux, you can issue the following commands:

    $ wget http://www.irit.fr/~tim.van-de-cruys/tal/tp/tp3/tp3.zip
    $ unzip tp3.zip
    $ cd tp3

The first command will download a ZIP archive (which contains the sentiment analysis data set) to your working directory. The second command will unpack the archive.

An NLP machine learning pipeline contains the following stages:

- data preprocessing (tokenization)
- feature extraction
- model training
- evaluation

We'll go through these stages step by step, using sentiment classification as an application. As a dataset, we'll be using a set of reviews for television series in French, extracted from the website allocine.fr. The dataset consists of the text of the review, as well as a sentiment label (positive or negative).

Note that the original ratings on the site allocine.fr range from 0 to 4 stars. We will use binary classification instead. In our dataset, original reviews of 0 and 1 stars are considered negative, while reviews of 3 and 4 stars are considered positive.

The dataset is divided into a training part (for training, 5576 reviews, about 90%) and a test part (for evaluation, 544 reviews, about 10%). The dataset is balanced, which means positive and negative instances are evenly distributed. Additionally, the training and test sets contain reviews about different TV series (in order to avoid possible bias when evaluating).

Exercise 1
Why might the evaluation results be biased when reviews in the training and test sets talk about the same television series?

2 Preprocessing

First, we'll load the training set. In Python, issue the following commands (you can also put the commands in a file and run the script separately if you like):

    import pandas as pd

    # Tab-separated file with a header row; quoting=3 (csv.QUOTE_NONE)
    # tells pandas not to treat quote characters specially.
    train = pd.read_csv("allocine_train.tsv", header=0,
                        delimiter="\t", quoting=3)

We are now able to examine the data. Explore the dataset using the following commands.

    train.shape
    ...
    train.columns.values
    ...
    print train["review"][0]
    ...

As we've seen before, we need to preprocess the dataset to be able to properly extract features from it. In order to do so, we'll create a function that makes use of the tokenisation functions of nltk. In order to reuse the function, we can save the commands below in a separate file named sentitools.py.

    import nltk

    # Pretrained Punkt sentence tokenizer for French
    french_tok_file = "tokenizers/punkt/french.pickle"
    sent_tok = nltk.data.load(french_tok_file)
    word_tok = nltk.tokenize.TreebankWordTokenizer()

    def review_to_words(raw_review):
        # Decode the raw bytes (Python 2) and lowercase the text
        review_string = raw_review.decode("utf8")
        review_lower = review_string.lower()
        # Split into sentences, then split each sentence into tokens
        sents = sent_tok.tokenize(review_lower)
        tokens = []
        for s in sents:
            tokens.extend(word_tok.tokenize(s))
        # Return the tokens as one space-separated string
        return " ".join(tokens)

Once we have our function ready, we can use it to carry out the actual tokenisation of the texts in the training set.

    from sentitools import review_to_words

    num_reviews = len(train["review"])
    clean_train_reviews = []
    for i in range(num_reviews):
        clean_review = review_to_words(train["review"][i])
        clean_train_reviews.append(clean_review)
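To get a feel for the kind of output review_to_words produces (a lower-cased string of space-separated tokens) without loading the NLTK models, a rough stand-in can be written with a regular expression. This is only a sketch: simple_review_to_words and its pattern are assumptions made for illustration, not part of the session's code, and it does not tokenise exactly like the Punkt/Treebank combination above.

```python
import re

# Hypothetical pattern: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

def simple_review_to_words(text):
    """Lower-case a (unicode) string and join its tokens with spaces."""
    return " ".join(TOKEN_RE.findall(text.lower()))

# Punctuation is split off and everything is lower-cased.
example = simple_review_to_words(u"Bien, tres BIEN!")
```

Comparing the output of such a stand-in with that of review_to_words on a few reviews (for instance on elided forms such as j'adore) can help when looking for tokenisation errors.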

Exercise 2
Examine the tokenised reviews. What errors are made? What could be improved?

3 Feature extraction

Now it's time to decide which features to use in our classifier. We'll start with simple bag of words features.

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(analyzer="word", max_features=5000)
    train_data_features = vectorizer.fit_transform(clean_train_reviews)
    train_data_features = train_data_features.toarray()

We can look at the extracted feature vectors. We can also look at the vocabulary used by the vectorizer.

    print train_data_features.shape
    ...
    vocab = vectorizer.get_feature_names()
    print vocab

4 Classification

Scikit-learn contains many different implementations of classification algorithms. We'll start with the example of last week's class: Naïve Bayes.

    from sklearn.naive_bayes import MultinomialNB, BernoulliNB

    classifier = MultinomialNB()
    classifier.fit(train_data_features, train["sentiment"])

Our model has now been trained on the training set; we can now test its performance on the test set. First, we carry out the same preprocessing and feature extraction on the test set.

    test = pd.read_csv("allocine_test.tsv", header=0,
                       delimiter="\t", quoting=3)

    num_reviews = len(test["review"])
    clean_test_reviews = []
    for i in range(num_reviews):
        clean_review = review_to_words(test["review"][i])
        clean_test_reviews.append(clean_review)

    # Reuse the vectorizer fitted on the training set (transform, not fit_transform)
    test_data_features = vectorizer.transform(clean_test_reviews)
    test_data_features = test_data_features.toarray()

Next, we can compute the performance on the test set.

    score = classifier.score(test_data_features, test["sentiment"])
    print score

Exercise 3
What does the score represent? Look at the instances that were classified badly. Do you see why the review was misclassified? Hint: use function predict
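The test set goes through vectorizer.transform rather than fit_transform because the columns of the feature matrix must be the vocabulary learned on the training set. The plain-Python sketch below illustrates that idea only; fit_vocabulary and transform are hypothetical helper names, the toy sentences are made up, and the real CountVectorizer does more (its own tokenisation, frequency-based selection for max_features, sparse output).

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each word seen in the training documents to a column index."""
    words = sorted(set(w for doc in docs for w in doc.split()))
    return dict((w, i) for i, w in enumerate(words))

def transform(docs, vocabulary):
    """Turn each document into a row of word counts, one column per word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        row = [0] * len(vocabulary)
        for word, n in counts.items():
            if word in vocabulary:      # words unseen in training are ignored
                row[vocabulary[word]] = n
        rows.append(row)
    return rows

# Toy data, invented for this illustration.
train_docs = ["tres bon film", "film tres tres mauvais"]
test_docs = ["bon film ennuyeux"]

vocabulary = fit_vocabulary(train_docs)       # learned on training data only
train_rows = transform(train_docs, vocabulary)
test_rows = transform(test_docs, vocabulary)  # same columns as train_rows
```

In this toy example the word "ennuyeux" never occurs in the training documents, so it has no column and simply disappears from the test row; both matrices keep the same width, which is what the classifier expects.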

4.1 K-fold cross validation

Usually, we will want to try out different parameters, in order to see what works best for our task. As such, we might experiment with:

- different features
- different classification algorithms
- different model parameters

However, we have to be careful: we cannot use our test set over and over again, because we would then be optimizing our parameters for that particular test set and run the risk of overfitting, which means the model does not generalize properly to data it has not been trained on. For this reason, we need to make use of a validation set. However, our training set is already quite small; creating a separate validation set would give us even less training data.

Fortunately, we don't have to create a separate set: we can use k-fold cross validation. The idea is the following:

- Break up the data into k (e.g. 10) parts (folds).
- For each fold:
    - the current fold is used as a temporary test set;
    - the other k - 1 folds (e.g. 9) are used as training data;
    - performance is computed on the test fold.
- Average the performance over the k (e.g. 10) runs.

Note that, again, we want to make sure that the movies that are reviewed in our training set are different from the ones that appear in our validation set. Scikit-learn provides the class GroupKFold (in sklearn.model_selection) for this purpose.
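To make the grouping constraint concrete, the sketch below builds the folds in plain Python so that all items sharing a group identifier land in the same fold. grouped_folds and mean are hypothetical helpers written for this illustration, and the toy identifiers are made up; groups are assigned to folds round-robin, which is simpler than (and not identical to) the balancing strategy used by scikit-learn's GroupKFold.

```python
def grouped_folds(groups, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    All items with the same group id are placed in the same fold.
    """
    distinct = sorted(set(groups))
    if k > len(distinct):
        raise ValueError("need at least as many groups as folds")
    # Round-robin assignment of groups to folds (an illustrative choice).
    fold_of_group = dict((g, i % k) for i, g in enumerate(distinct))
    for fold in range(k):
        test = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train, test

def mean(scores):
    """Average the per-fold scores (the last step of the procedure)."""
    return sum(scores) / float(len(scores))

# Toy example: six reviews about three different titles.
movie_ids = ["a", "a", "b", "c", "c", "c"]
splits = list(grouped_folds(movie_ids, 3))
```

Each element of splits is a pair of index lists; in every pair, no identifier from movie_ids occurs on both sides, which is exactly the property required of the validation folds.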

The cross-validation loop with GroupKFold can be written as follows:

    from sklearn.model_selection import GroupKFold

    group_kfold = GroupKFold(n_splits=10)
    score_kfold = []
    # The third argument gives the group of each review, so that reviews
    # of the same movie never end up on both sides of a split.
    for train_index, test_index in group_kfold.split(
            train_data_features, train["sentiment"], train["movie_id"]):
        X_train = train_data_features[train_index]
        X_test = train_data_features[test_index]
        y_train = train["sentiment"][train_index]
        y_test = train["sentiment"][test_index]
        classifier.fit(X_train, y_train)
        score_kfold.append(classifier.score(X_test, y_test))

    # Average score over the folds
    print sum(score_kfold) / float(len(score_kfold))

Exercise 4
Experiment with different feature sets:

- Exclude a list of stopwords. Hint: NLTK provides a list of stopwords for French; look at the arguments of CountVectorizer to include them.
- Experiment with n-grams instead of bag of words. Hint: look at the arguments of CountVectorizer again in order to extract n-grams.
- What if you change the number of vocabulary elements included?
- Can you think of other features to include?

Exercise 5
Experiment with different models:

- Try a Naïve Bayes classifier that uses binary features (word presence instead of word count).

- Try any other classifier included with scikit-learn (decision trees, SVM, ...). How does it perform?

Exercise 6
When you've determined the best set of parameters (according to cross-validation), compute the performance on the test set.

4.2 Intrinsic model evaluation

Some models allow us to look at the most informative features. Using a logistic regression, you can do the following:

    import sklearn.linear_model

    classifier = sklearn.linear_model.LogisticRegression()
    classifier.fit(train_data_features, train["sentiment"])

    # Pair each learned coefficient with its vocabulary word,
    # then sort from the largest to the smallest coefficient.
    allcoefficients = [(classifier.coef_[0, i], vocab[i])
                       for i in range(len(vocab))]
    allcoefficients.sort()
    allcoefficients.reverse()

Exercise 7
Examine both the top and the bottom of the list. Which features are most informative?
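Once the list is sorted, both ends can be pulled out in one step. The sketch below assumes a list of (coefficient, word) pairs shaped like allcoefficients; the helper name extremes and the toy weights are invented for illustration and are not results obtained on the dataset.

```python
def extremes(pairs, n):
    """Return the n largest and the n smallest (coefficient, word) pairs."""
    ordered = sorted(pairs, reverse=True)
    return ordered[:n], ordered[-n:]

# Made-up weights, only to show the shape of the data.
toy = [(0.8, "excellent"), (-1.1, "nul"), (0.1, "film"), (-0.4, "long")]
top, bottom = extremes(toy, 2)
```

With a binary scikit-learn logistic regression, positive coefficients push the decision towards classifier.classes_[1] and negative ones towards classifier.classes_[0], so it is worth checking classes_ before interpreting the two ends of the list.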