Practical session 3: Machine learning for NLP


Traitement Automatique des Langues
21 February

1 Introduction

In this practical session, we will explore machine learning models for NLP applications; specifically, we will train a classifier for sentiment analysis on a French dataset of movie reviews.

For these exercises, we will make use of Python (v2.7) and a number of modules for data processing and machine learning: numpy, scipy, scikit-learn, and pandas. If you want to use your own computer, you will need to make sure these are installed (e.g. using the command pip). If you're using Miniconda, you can use the command conda install <modulename>. We will also make use of nltk (the natural language processing module that we experimented with in the first practical session).

First, download the archive for the practical session to an appropriate working directory from the following address:

Under Linux, you can issue the following commands:

    $ wget
    $ unzip tp3.zip
    $ cd tp3

The first command will download a ZIP archive (which contains the sentiment analysis dataset) to your working directory. The second command will unpack the archive.

An NLP machine learning pipeline contains the following stages:

- data preprocessing (tokenization)
- feature extraction
- model training
- evaluation

We'll go through these stages step by step, using sentiment classification as an application. As a dataset, we'll be using a set of reviews for television series in French, extracted from the website allocine.fr. The dataset consists of the text of the review, as well as a sentiment label (positive or negative) [1]. The dataset is divided into a training part (for training, 5576 reviews, about 90%) and a test part (for evaluation, 544 reviews, about 10%). The dataset is balanced, which means positive and negative instances are evenly distributed. Additionally, the training and test sets contain reviews about different TV series (in order to avoid possible bias when evaluating).

[1] Note that the original ratings on the site allocine.fr range from 0 to 4 stars. We will use binary classification instead. In our dataset, original reviews of 0 and 1 stars are considered negative, while reviews of 3 and 4 stars are considered positive.

Exercise 1
Why might the evaluation results be biased when reviews in the train and test sets talk about the same television series?

2 Preprocessing

First, we'll load the training set. In Python, issue the following commands (you can also put the commands in a file and run the script separately if you like):

    import pandas as pd

    # Tab-separated file with a header row; quoting=3 (csv.QUOTE_NONE)
    # tells pandas to treat quote characters as ordinary text.
    train = pd.read_csv("allocine_train.tsv", header=0,
                        delimiter="\t", quoting=3)

We are now able to examine the data. Explore the dataset using the following commands.

    train.shape
    ...
    train.columns.values
    ...
    print train["review"][0]
    ...

As we've seen before, we need to preprocess the dataset to be able to properly extract features from it. In order to do so, we'll create a function that makes use of the tokenisation functions of nltk. In order to reuse the function, we can save the commands below in a separate file named sentitools.py.

    import nltk

    # French sentence splitter (punkt model) and a word tokeniser
    french_tok_file = 'tokenizers/punkt/french.pickle'
    sent_tok = nltk.data.load(french_tok_file)
    word_tok = nltk.tokenize.TreebankWordTokenizer()

    def review_to_words(raw_review):
        # Decode the raw bytes (Python 2) and lowercase the text
        review_string = raw_review.decode('utf8')
        review_lower = review_string.lower()
        # Split into sentences, then tokenise each sentence
        sents = sent_tok.tokenize(review_lower)
        tokens = []
        for s in sents:
            tokens.extend(word_tok.tokenize(s))
        # Return the tokens as one space-separated string
        return " ".join(tokens)

Once we have our function ready, we can use it to carry out the actual tokenisation of the texts in the training set.

    from sentitools import review_to_words

    num_reviews = len(train["review"])
    clean_train_reviews = []
    for i in range(num_reviews):
        clean_review = review_to_words(train["review"][i])
        clean_train_reviews.append(clean_review)
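To make the shape of the tokenised output concrete, here is a minimal sketch of the same lowercase / sentence-split / word-split / join steps. It is not the nltk-based function above: the name review_to_words_sketch and the two regular expressions are stand-ins assumed for illustration, and the sketch is written for Python 3 (where \w matches accented letters), unlike the Python 2.7 used in this session.

```python
import re

def review_to_words_sketch(raw_review):
    """Hypothetical stand-in for review_to_words, using regexes instead of nltk."""
    review_lower = raw_review.lower()
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sents = re.split(r"(?<=[.!?])\s+", review_lower)
    tokens = []
    for s in sents:
        # A token is a run of letters/digits/apostrophes, or one punctuation mark.
        tokens.extend(re.findall(r"[\w']+|[^\w\s]", s))
    # Same output format as review_to_words: tokens joined by single spaces.
    return " ".join(tokens)

# Made-up sample sentence, not taken from the dataset.
sample = "Une série géniale ! J'ai adoré."
print(review_to_words_sketch(sample))
```

Comparing this kind of output with what the nltk tokenisers produce on the real reviews is one way to spot where a tokeniser behaves unexpectedly on French text.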

Exercise 2
Examine the tokenised reviews. What errors are made? What could be improved?

3 Feature extraction

Now it's time to decide which features to use in our classifier. We'll start with simple bag of words features.

    from sklearn.feature_extraction.text import CountVectorizer

    # Keep the 5000 most frequent words as features
    vectorizer = CountVectorizer(analyzer="word", max_features=5000)
    train_data_features = vectorizer.fit_transform(clean_train_reviews)
    # Convert the sparse matrix to a dense array
    train_data_features = train_data_features.toarray()

We can look at the extracted feature vectors. We can also look at the vocabulary used by the vectorizer.

    print train_data_features.shape
    ...
    vocab = vectorizer.get_feature_names()
    print vocab

4 Classification

Scikit-learn contains many different implementations of classification algorithms. We'll start with the example of last week's class: Naïve Bayes.
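As a reminder of what a multinomial Naïve Bayes model computes from word counts, here is a minimal plain-Python sketch. It is an illustration under stated assumptions, not scikit-learn's implementation: the function names train_nb and predict_nb and the four toy documents are hypothetical, and add-one (alpha = 1) smoothing is assumed.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = sorted({w for d in docs for w in d.split()})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())
    model = {}
    for y in docs_per_class:
        total = sum(word_counts[y].values())
        log_prior = math.log(docs_per_class[y] / float(len(docs)))
        # Additive smoothing so unseen (word, class) pairs get non-zero probability.
        log_lik = {w: math.log((word_counts[y][w] + alpha) /
                               (total + alpha * len(vocab)))
                   for w in vocab}
        model[y] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for y, (log_prior, log_lik) in model.items():
        scores[y] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)

# Toy, made-up training data: 1 = positive, 0 = negative.
toy_docs = ["bon film superbe", "bon acteurs bon film",
            "mauvais film ennuyeux", "film nul mauvais acteurs"]
toy_labels = [1, 1, 0, 0]
toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, "bon film"))
print(predict_nb(toy_model, "film ennuyeux"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for reviews that are longer than these toy examples.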

    from sklearn.naive_bayes import MultinomialNB, BernoulliNB

    classifier = MultinomialNB()
    classifier.fit(train_data_features, train["sentiment"])

Our model has now been trained on the training set; we can now test its performance on the test set. First, we carry out the same preprocessing and feature extraction on the test set.

    test = pd.read_csv("allocine_test.tsv", header=0,
                       delimiter="\t", quoting=3)

    num_reviews = len(test["review"])
    clean_test_reviews = []
    for i in range(num_reviews):
        clean_review = review_to_words(test["review"][i])
        clean_test_reviews.append(clean_review)

    # Reuse the vocabulary learned on the training set (transform, not fit_transform)
    test_data_features = vectorizer.transform(clean_test_reviews)
    test_data_features = test_data_features.toarray()

Next, we can compute the performance on the test set.

    score = classifier.score(test_data_features, test["sentiment"])
    print score

Exercise 3
What does the score represent? Look at the instances that were classified badly. Do you see why the review was misclassified?
Hint: use function predict
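One possible way to follow the hint is to compare the predicted labels with the gold labels and collect the positions where they differ. The sketch below uses two short made-up lists so that it runs on its own; in the session, predicted would be assumed to come from classifier.predict(test_data_features) and gold from test["sentiment"]. The helper name misclassified_indices is hypothetical.

```python
def misclassified_indices(predicted, gold):
    """Return the positions where the predicted label differs from the gold label."""
    return [i for i, (p, g) in enumerate(zip(predicted, gold)) if p != g]

# Made-up labels for illustration only (1 = positive, 0 = negative).
predicted = [1, 0, 1, 1, 0]
gold = [1, 0, 0, 1, 1]

errors = misclassified_indices(predicted, gold)
for i in errors:
    print("review %d: predicted %d, gold %d" % (i, predicted[i], gold[i]))
```

The resulting indices can then be used to print the corresponding review texts and read them alongside the wrong prediction.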

4.1 K-fold cross validation

Usually, we will want to try out different parameters, in order to see what works best for our task. As such, we might experiment with:

- Different features
- Different classification algorithms
- Different model parameters

However, we have to be careful: we cannot use our test set over and over again, as we'll be optimizing our parameters for that particular test set (and run the risk of overfitting, which means we are not able to properly generalize to data we haven't trained on). For this reason, we need to make use of a validation set. However, our training set is already quite small; creating a separate validation set would give us even less training data. Fortunately, we don't have to create a separate set: we can use k-fold cross validation. The idea is the following:

- Break up data into k (e.g. 10) parts (folds)
- For each fold:
  - Current fold is used as temporary test set
  - Use other 9 folds as training data
  - Performance is computed on test fold
- Average performance over 10 runs

Note that, again, we want to make sure that the movies that are reviewed in our training set are different from the ones that appear in our validation set. Scikit-learn has a function for this:
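Before turning to that function, the procedure can be made concrete with a small plain-Python sketch. It is a simplification under stated assumptions: groups are assigned to folds round-robin, which is not necessarily how scikit-learn distributes them, and the names group_k_fold, cross_validate and fit_and_score are hypothetical.

```python
def group_k_fold(groups, k):
    """Yield (train_indices, test_indices) so that no group is split across both."""
    unique_groups = sorted(set(groups))
    # Simplifying assumption: assign whole groups to folds in round-robin order.
    fold_of_group = {g: i % k for i, g in enumerate(unique_groups)}
    for fold in range(k):
        test_idx = [i for i, g in enumerate(groups) if fold_of_group[g] == fold]
        train_idx = [i for i, g in enumerate(groups) if fold_of_group[g] != fold]
        yield train_idx, test_idx

def cross_validate(groups, k, fit_and_score):
    """Average the score returned by fit_and_score(train_idx, test_idx) over k folds."""
    scores = [fit_and_score(tr, te) for tr, te in group_k_fold(groups, k)]
    return sum(scores) / float(len(scores))

# Made-up group ids: one entry per review, naming the reviewed title.
toy_groups = ["a", "a", "b", "c", "c", "d"]
toy_splits = list(group_k_fold(toy_groups, 2))
for train_idx, test_idx in toy_splits:
    print(train_idx, test_idx)
```

Each review ends up in the temporary test set exactly once, and all reviews of a given title stay on the same side of every split, which is the property the note above asks for.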

    from sklearn.model_selection import GroupKFold

    group_kfold = GroupKFold(n_splits=10)
    score_kfold = []
    # Reviews sharing a movie_id always end up in the same fold
    for train_index, test_index in group_kfold.split(
            train_data_features, train["sentiment"], train["movie_id"]):
        X_train, X_test = (train_data_features[train_index],
                           train_data_features[test_index])
        y_train, y_test = (train["sentiment"][train_index],
                           train["sentiment"][test_index])
        classifier.fit(X_train, y_train)
        score_kfold.append(classifier.score(X_test, y_test))
    # Average score over the 10 folds
    print sum(score_kfold) / float(len(score_kfold))

Exercise 4
Experiment with different feature sets

Exercise 5
- Exclude a list of stopwords
  Hint: NLTK provides a list of stopwords for French; look at the arguments of CountVectorizer to include them
- Experiment with n-grams instead of bag of words
  Hint: look at the arguments of CountVectorizer again in order to extract n-grams
- What if you change the number of vocabulary elements included?
- Can you think of other features to include?

Experiment with different models
- Try a naïve Bayes classifier that uses binary features (word presence instead of word count)

Exercise 6
- Try any other classifier included with scikit-learn (decision trees, SVM, ...). How does it perform?

When you've determined the best set of parameters (according to cross-validation), compute the performance on the test set.

4.2 Intrinsic model evaluation

Some models allow us to look at the most informative features. Using a logistic regression, you can do the following:

    import sklearn.linear_model

    classifier = sklearn.linear_model.LogisticRegression()
    classifier.fit(train_data_features, train["sentiment"])
    # Pair each learned coefficient with its vocabulary word
    allcoefficients = [(classifier.coef_[0, i], vocab[i])
                       for i in range(len(vocab))]
    # Sort from the largest coefficient to the smallest
    allcoefficients.sort()
    allcoefficients.reverse()

Exercise 7
Examine both the top and the bottom of the list. Which features are most informative?
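A small helper can make it easier to look at both ends of the list at once. The sketch below is self-contained: the helper name top_and_bottom is hypothetical, and the (coefficient, word) pairs are invented for illustration, not results from the dataset. In the session, the allcoefficients list built above would be passed in instead.

```python
def top_and_bottom(coefficients, n=3):
    """Return the n largest and the n smallest (coefficient, word) pairs."""
    ranked = sorted(coefficients, reverse=True)
    return ranked[:n], ranked[-n:]

# Invented values for illustration only.
toy_coefficients = [(1.8, "excellent"), (-2.1, "nul"), (0.1, "saison"),
                    (1.2, "superbe"), (-1.5, "ennuyeux"), (0.0, "acteur")]

top, bottom = top_and_bottom(toy_coefficients, n=2)
print(top)
print(bottom)
```

With a binary logistic regression, large positive coefficients push a review towards the second class in classifier.classes_ and large negative ones towards the first, so which end corresponds to "positive" depends on how the sentiment labels are encoded.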


More information

We ll be using data on loans. The website also has data on lenders.

We ll be using data on loans. The website also has data on lenders. Economics 1660: Big Data PS 0: Programming for Large Data Sets Brown University Prof. Daniel Björkegren The spread of information technology has generated massive amounts of data, even in developing countries.

More information

Lecture 20: Neural Networks for NLP. Zubin Pahuja

Lecture 20: Neural Networks for NLP. Zubin Pahuja Lecture 20: Neural Networks for NLP Zubin Pahuja zpahuja2@illinois.edu courses.engr.illinois.edu/cs447 CS447: Natural Language Processing 1 Today s Lecture Feed-forward neural networks as classifiers simple

More information

Extracting data governance information from Slack chat channels

Extracting data governance information from Slack chat channels Extracting data governance information from Slack chat channels By Simon Quigley Supervisor: Dr. Rob Brennan Assistant supervisor: Dr. Alfredo Maldonado Dissertation Presented to University of Dublin,

More information

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution

Solution 1 (python) Performance: Enron Samples Rate Recall Precision Total Contribution Summary Each of the ham/spam classifiers has been tested against random samples from pre- processed enron sets 1 through 6 obtained via: http://www.aueb.gr/users/ion/data/enron- spam/, or the entire set

More information

A bit of theory: Algorithms

A bit of theory: Algorithms A bit of theory: Algorithms There are different kinds of algorithms Vector space models. e.g. support vector machines Decision trees, e.g. C45 Probabilistic models, e.g. Naive Bayes Neural networks, e.g.

More information

Intel Distribution for Python* и Intel Performance Libraries

Intel Distribution for Python* и Intel Performance Libraries Intel Distribution for Python* и Intel Performance Libraries 1 Motivation * L.Prechelt, An empirical comparison of seven programming languages, IEEE Computer, 2000, Vol. 33, Issue 10, pp. 23-29 ** RedMonk

More information

On the automatic classification of app reviews

On the automatic classification of app reviews Requirements Eng (2016) 21:311 331 DOI 10.1007/s00766-016-0251-9 RE 2015 On the automatic classification of app reviews Walid Maalej 1 Zijad Kurtanović 1 Hadeer Nabil 2 Christoph Stanik 1 Walid: please

More information

Exercise 3. AMTH/CPSC 445a/545a - Fall Semester October 7, 2017

Exercise 3. AMTH/CPSC 445a/545a - Fall Semester October 7, 2017 Exercise 3 AMTH/CPSC 445a/545a - Fall Semester 2016 October 7, 2017 Problem 1 Compress your solutions into a single zip file titled assignment3.zip, e.g. for a student named Tom

More information

Principles of Machine Learning

Principles of Machine Learning Principles of Machine Learning Lab 3 Improving Machine Learning Models Overview In this lab you will explore techniques for improving and evaluating the performance of machine learning models. You will

More information

Practical example - classifier margin

Practical example - classifier margin Support Vector Machines (SVMs) SVMs are very powerful binary classifiers, based on the Statistical Learning Theory (SLT) framework. SVMs can be used to solve hard classification problems, where they look

More information

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes

More information

Introduction to Machine Learning. Useful tools: Python, NumPy, scikit-learn

Introduction to Machine Learning. Useful tools: Python, NumPy, scikit-learn Introduction to Machine Learning Useful tools: Python, NumPy, scikit-learn Antonio Sutera and Jean-Michel Begon September 29, 2016 2 / 37 How to install Python? Download and use the Anaconda python distribution

More information

Tutorial. Docking School SAnDReS Tutorial Cyclin-Dependent Kinases with K i Information (Introduction)

Tutorial. Docking School SAnDReS Tutorial Cyclin-Dependent Kinases with K i Information (Introduction) Tutorial Docking School SAnDReS Tutorial Cyclin-Dependent Kinases with K i Information (Introduction) Prof. Dr. Walter Filgueira de Azevedo Jr. Laboratory of Computational Systems Biology azevedolab.net

More information

Lab 10 - Ridge Regression and the Lasso in Python

Lab 10 - Ridge Regression and the Lasso in Python Lab 10 - Ridge Regression and the Lasso in Python March 9, 2016 This lab on Ridge Regression and the Lasso is a Python adaptation of p. 251-255 of Introduction to Statistical Learning with Applications

More information

Lecture 3: Linear Classification

Lecture 3: Linear Classification Lecture 3: Linear Classification Roger Grosse 1 Introduction Last week, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features.

More information

In stochastic gradient descent implementations, the fixed learning rate η is often replaced by an adaptive learning rate that decreases over time,

In stochastic gradient descent implementations, the fixed learning rate η is often replaced by an adaptive learning rate that decreases over time, Chapter 2 Although stochastic gradient descent can be considered as an approximation of gradient descent, it typically reaches convergence much faster because of the more frequent weight updates. Since

More information

List of Exercises: Data Mining 1 December 12th, 2015

List of Exercises: Data Mining 1 December 12th, 2015 List of Exercises: Data Mining 1 December 12th, 2015 1. We trained a model on a two-class balanced dataset using five-fold cross validation. One person calculated the performance of the classifier by measuring

More information

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course:

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course: DATA SCIENCE About the Course: In this course you will get an introduction to the main tools and ideas which are required for Data Scientist/Business Analyst/Data Analyst/Analytics Manager/Actuarial Scientist/Business

More information

Log- linear models. Natural Language Processing: Lecture Kairit Sirts

Log- linear models. Natural Language Processing: Lecture Kairit Sirts Log- linear models Natural Language Processing: Lecture 3 21.09.2017 Kairit Sirts The goal of today s lecture Introduce the log- linear/maximum entropy model Explain the model components: features, parameters,

More information

Python for. Data Science. by Luca Massaron. and John Paul Mueller

Python for. Data Science. by Luca Massaron. and John Paul Mueller Python for Data Science by Luca Massaron and John Paul Mueller Table of Contents #»» *» «»>»»» Introduction 1 About This Book 1 Foolish Assumptions 2 Icons Used in This Book 3 Beyond the Book 4 Where to

More information

ADVANCED CLASSIFICATION TECHNIQUES

ADVANCED CLASSIFICATION TECHNIQUES Admin ML lab next Monday Project proposals: Sunday at 11:59pm ADVANCED CLASSIFICATION TECHNIQUES David Kauchak CS 159 Fall 2014 Project proposal presentations Machine Learning: A Geometric View 1 Apples

More information

Detecting ads in a machine learning approach

Detecting ads in a machine learning approach Detecting ads in a machine learning approach Di Zhang (zhangdi@stanford.edu) 1. Background There are lots of advertisements over the Internet, who have become one of the major approaches for companies

More information

Intel Distribution For Python*

Intel Distribution For Python* Intel Distribution For Python* Intel Distribution for Python* 2017 Advancing Python performance closer to native speeds Easy, out-of-the-box access to high performance Python High performance with multiple

More information

EPL451: Data Mining on the Web Lab 10

EPL451: Data Mining on the Web Lab 10 EPL451: Data Mining on the Web Lab 10 Παύλος Αντωνίου Γραφείο: B109, ΘΕΕ01 University of Cyprus Department of Computer Science Dimensionality Reduction Map points in high-dimensional (high-feature) space

More information

CS 224N: Assignment #1

CS 224N: Assignment #1 Due date: assignment) 1/25 11:59 PM PST (You are allowed to use three (3) late days maximum for this These questions require thought, but do not require long answers. Please be as concise as possible.

More information

Text classification with Naïve Bayes. Lab 3

Text classification with Naïve Bayes. Lab 3 Text classification with Naïve Bayes Lab 3 1 The Task Building a model for movies reviews in English for classifying it into positive or negative. Test classifier on new reviews Takes time 2 Sentiment

More information

Support Vector Machines + Classification for IR

Support Vector Machines + Classification for IR Support Vector Machines + Classification for IR Pierre Lison University of Oslo, Dep. of Informatics INF3800: Søketeknologi April 30, 2014 Outline of the lecture Recap of last week Support Vector Machines

More information

Personalized Web Search

Personalized Web Search Personalized Web Search Dhanraj Mavilodan (dhanrajm@stanford.edu), Kapil Jaisinghani (kjaising@stanford.edu), Radhika Bansal (radhika3@stanford.edu) Abstract: With the increase in the diversity of contents

More information