Practical session 3: Machine learning for NLP
- Paula McCarthy
Traitement Automatique des Langues
21 February

1 Introduction

In this practical session, we will explore machine learning models for NLP applications; specifically, we will train a classifier for sentiment analysis on a French dataset of movie reviews.

For these exercises, we will make use of Python (v2.7) and a number of modules for data processing and machine learning: numpy, scipy, scikit-learn, and pandas. If you want to use your own computer, you will need to make sure these are installed (e.g. using the command pip). If you're using Miniconda, you can use the command conda install <modulename>. We will also make use of nltk (the natural language processing module that we experimented with in the first practical session).

First, download the archive for the practical session to an appropriate working directory from the following address:

Under Linux, you can issue the following commands:

    $ wget
    $ unzip tp3.zip
    $ cd tp3

The first command will download a ZIP archive (which contains the sentiment analysis dataset) to your working directory. The second command will unpack the archive.

An NLP machine learning pipeline contains the following stages:
- data preprocessing (tokenization)
- feature extraction
- model training
- evaluation

We'll go through these stages step by step, using sentiment classification as an application. As a dataset, we'll be using a set of reviews for television series in French, extracted from the website allocine.fr. The dataset consists of the text of the review, as well as a sentiment label (positive or negative). [1] The dataset is divided into a training part (for training, 5576 reviews, ± 90%) and a test part (for evaluation, 544 reviews, ± 10%). The dataset is balanced, which means positive and negative instances are evenly distributed. Additionally, the training and test sets contain reviews about different TV series (in order to avoid possible bias when evaluating).

[1] Note that the original ratings on the site allocine.fr range from 0 to 4 stars. We will use binary classification instead. In our dataset, original reviews of 0 and 1 stars are considered negative, while reviews of 3 and 4 stars are considered positive.

Exercise 1
Why might the evaluation results be biased when reviews in the train and test set talk about the same television series?

2 Preprocessing

First, we'll load the training set. In Python, issue the following commands (you can also put the commands in a file and run the script separately if you like):

    import pandas as pd

    train = pd.read_csv("allocine_train.tsv", header=0, \
                        delimiter="\t", quoting=3)

We are now able to examine the data. Explore the dataset using the following commands.
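Before going further, it can be useful to sanity-check the claimed class balance. A minimal sketch with a toy DataFrame standing in for the real TSV file (the column names "review" and "sentiment" are taken from the commands above; the example texts are invented):

```python
import pandas as pd

# A toy stand-in for allocine_train.tsv, just to illustrate the format
data = pd.DataFrame({
    "review": ["une serie formidable", "un scenario catastrophique",
               "des acteurs excellents", "une intrigue sans interet"],
    "sentiment": [1, 0, 1, 0],
})

# In a balanced dataset, both labels occur equally often
counts = data["sentiment"].value_counts()
print(counts[0] == counts[1])  # True for this toy example
```

On the real training set, `train["sentiment"].value_counts()` gives the same check.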
    train.shape
    ...
    train.columns.values
    ...
    print train["review"][0]
    ...

As we've seen before, we need to preprocess the dataset to be able to properly extract features from it. In order to do so, we'll create a function that makes use of the tokenisation functions of nltk. In order to reuse the function, we can save the commands below in a separate file named sentitools.py.

    import nltk

    french_tok_file = 'tokenizers/punkt/french.pickle'
    sent_tok = nltk.data.load(french_tok_file)
    word_tok = nltk.tokenize.TreebankWordTokenizer()

    def review_to_words(raw_review):
        review_string = raw_review.decode('utf8')
        review_lower = review_string.lower()
        sents = sent_tok.tokenize(review_lower)
        tokens = []
        for s in sents:
            tokens.extend(word_tok.tokenize(s))
        return " ".join(tokens)

Once we have our function ready, we can use it to carry out the actual tokenisation of the texts in the training set.

    from sentitools import review_to_words

    num_reviews = len(train["review"])
    clean_train_reviews = []
    for i in range(num_reviews):
        clean_review = review_to_words(train["review"][i])
        clean_train_reviews.append(clean_review)
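The overall shape of review_to_words (lowercase, tokenise, rejoin with spaces) can be sketched without the punkt data files, using str.split as a crude stand-in for the NLTK sentence and word tokenizers. This is an illustration of the pipeline only, not the real tokenisation:

```python
# Simplified sketch of review_to_words, with str.split standing in for
# the NLTK tokenizers (which require the punkt data to be downloaded)
def review_to_words_sketch(raw_review):
    review_lower = raw_review.lower()  # lowercase, as in review_to_words
    tokens = review_lower.split()      # stand-in tokenizer
    return " ".join(tokens)            # one space-separated token string

print(review_to_words_sketch("Une  SERIE formidable !"))
# une serie formidable !
```

The real function differs mainly in that the punkt model splits sentences and the Treebank tokenizer separates punctuation from words.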
Exercise 2
Examine the tokenised reviews. What errors are made? What could be improved?

3 Feature extraction

Now it's time to decide which features to use in our classifier. We'll start with simple bag-of-words features.

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(analyzer="word",
                                 max_features=5000)

    train_data_features = vectorizer.fit_transform(clean_train_reviews)
    train_data_features = train_data_features.toarray()

We can look at the extracted feature vectors. We can also look at the vocabulary used by the vectorizer.

    print train_data_features.shape
    ...
    vocab = vectorizer.get_feature_names()
    print vocab

4 Classification

Scikit-learn contains many different implementations of classification algorithms. We'll start with the example of last week's class: Naïve Bayes.
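As a quick aside before training: what fit_transform produces is easiest to see on a toy corpus rather than on the real clean_train_reviews (a sketch; the documents here are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three tiny "documents" instead of the real cleaned reviews
docs = ["bonne serie", "mauvaise serie", "serie bonne serie"]
vectorizer = CountVectorizer(analyzer="word")
X = vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each word to its column index in the feature matrix
print(sorted(vectorizer.vocabulary_))  # ['bonne', 'mauvaise', 'serie']
print(X.shape)                         # (3, 3): 3 documents, 3 vocabulary words
```

Each row of X counts how often each vocabulary word occurs in the corresponding document; the third row has a 2 in the column for "serie".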
    from sklearn.naive_bayes import MultinomialNB, BernoulliNB

    classifier = MultinomialNB()
    classifier.fit(train_data_features, train["sentiment"])

Our model has now been trained on the training set; we can now test its performance on the test set. First, we carry out the same preprocessing and feature extraction on the test set.

    test = pd.read_csv("allocine_test.tsv", header=0, \
                       delimiter="\t", quoting=3)

    num_reviews = len(test["review"])
    clean_test_reviews = []
    for i in range(num_reviews):
        clean_review = review_to_words(test["review"][i])
        clean_test_reviews.append(clean_review)

    test_data_features = vectorizer.transform(clean_test_reviews)
    test_data_features = test_data_features.toarray()

Next, we can compute the performance on the test set.

    score = classifier.score(test_data_features, test["sentiment"])
    print score

Exercise 3
What does the score represent? Look at the instances that were classified badly. Do you see why the review was misclassified? Hint: use the function predict.
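For the hint in Exercise 3: predict returns one label per row of the feature matrix, which can be compared against the gold labels to locate the misclassified reviews. A sketch on toy count features (the matrices and labels here are invented, not the real train/test data):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count features: column 0 ~ positive words, column 1 ~ negative words
X_tr = np.array([[3, 0], [2, 1], [0, 2], [1, 3]])
y_tr = np.array([1, 1, 0, 0])
X_te = np.array([[2, 0], [0, 3]])
y_te = np.array([1, 0])

clf = MultinomialNB()
clf.fit(X_tr, y_tr)

# Indices where the prediction disagrees with the gold label
predictions = clf.predict(X_te)
errors = [i for i in range(len(y_te)) if predictions[i] != y_te[i]]
print(errors)  # indices of misclassified test instances (may be empty)
```

On the real data, the same comparison over test_data_features and test["sentiment"] gives the indices of the badly classified reviews, whose text you can then inspect.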
4.1 K-fold cross-validation

Usually, we will want to try out different parameters, in order to see what works best for our task. As such, we might experiment with:

- different features
- different classification algorithms
- different model parameters

However, we have to be careful: we cannot use our test set over and over again, as we'll be optimizing our parameters for that particular test set (and run the risk of overfitting, which means we are not able to properly generalize to data we haven't trained on). For this reason, we need to make use of a validation set. However, our training set is already quite small; creating a separate validation set would give us even less training data. Fortunately, we don't have to create a separate set: we can use k-fold cross-validation. The idea is the following:

- Break up the data into k (e.g. 10) parts (folds)
- For each fold:
  - the current fold is used as a temporary test set
  - the other 9 folds are used as training data
  - performance is computed on the test fold
- Average the performance over the 10 runs

Note that, again, we want to make sure that the series that are reviewed in our training set are different from the ones that appear in our validation set. Scikit-learn has a function for this:
    from sklearn.model_selection import GroupKFold

    group_kfold = GroupKFold(n_splits=10)
    score_kfold = []
    for train_index, test_index in group_kfold.split(
            train_data_features, train["sentiment"], train["movie_id"]):
        X_train, X_test = train_data_features[train_index], \
            train_data_features[test_index]
        y_train, y_test = train["sentiment"][train_index], \
            train["sentiment"][test_index]
        classifier.fit(X_train, y_train)
        score_kfold.append(classifier.score(X_test, y_test))

    print sum(score_kfold) / float(len(score_kfold))

Exercise 4
Experiment with different feature sets:

- Exclude a list of stopwords. Hint: NLTK provides a list of stopwords for French; look at the arguments of CountVectorizer to include them.
- Experiment with n-grams instead of bag of words. Hint: look at the arguments of CountVectorizer again in order to extract n-grams.
- What if you change the number of vocabulary elements included?
- Can you think of other features to include?

Exercise 5
Experiment with different models:

- Try a naïve Bayes classifier that uses binary features (word presence instead of word count).
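For the hints in Exercises 4 and 5: the relevant CountVectorizer arguments are stop_words, ngram_range, and binary. A sketch with a hypothetical mini stopword list and invented documents (in the exercise, take the real list from nltk.corpus.stopwords.words('french') and use the cleaned reviews):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical mini stopword list, for illustration only
french_stopwords = ["un", "une", "le", "la", "de"]

vectorizer = CountVectorizer(
    analyzer="word",
    stop_words=french_stopwords,  # stopwords are dropped before n-grams are built
    ngram_range=(1, 2),           # unigrams and bigrams
    binary=True,                  # word presence instead of word count
    max_features=5000,
)
docs = ["une tres bonne serie", "une serie tres decevante"]
X = vectorizer.fit_transform(docs).toarray()

print("une" in vectorizer.vocabulary_)          # False: stopword removed
print("bonne serie" in vectorizer.vocabulary_)  # True: a bigram feature

# Binary features pair naturally with the Bernoulli variant of naive Bayes
clf = BernoulliNB()
clf.fit(X, np.array([1, 0]))
```

With binary=True every cell of the matrix is 0 or 1, which matches the event model that BernoulliNB assumes.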
Exercise 6
Try any other classifier included with scikit-learn (decision trees, SVM, ...). How does it perform? When you've determined the best set of parameters (according to cross-validation), compute the performance on the test set.

4.2 Intrinsic model evaluation

Some models allow us to look at the most informative features. Using a logistic regression, you can do the following:

    from sklearn.linear_model import LogisticRegression

    classifier = LogisticRegression()
    classifier.fit(train_data_features, train["sentiment"])
    allcoefficients = [(classifier.coef_[0,i], vocab[i]) \
        for i in range(len(vocab))]
    allcoefficients.sort()
    allcoefficients.reverse()

Exercise 7
Examine both the top and the bottom of the list. Which features are most informative?
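A toy version of the coefficient inspection above, with a hypothetical two-word vocabulary (on the real data, vocab comes from the vectorizer and the feature matrix from fit_transform):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy counts: column 0 = "excellent", column 1 = "horrible" (invented vocab)
X = np.array([[3, 0], [2, 0], [0, 2], [0, 3]], dtype=float)
y = np.array([1, 1, 0, 0])
vocab = ["excellent", "horrible"]

clf = LogisticRegression()
clf.fit(X, y)

# Pair each weight with its word; most positive weight first
coeffs = sorted(zip(clf.coef_[0], vocab), reverse=True)
print(coeffs[0][1])   # word most indicative of the positive class
print(coeffs[-1][1])  # word most indicative of the negative class
```

Words at the top of the list push the prediction towards the positive class, words at the bottom towards the negative class, which is exactly what Exercise 7 asks you to inspect.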
DATA SCIENCE About the Course: In this course you will get an introduction to the main tools and ideas which are required for Data Scientist/Business Analyst/Data Analyst/Analytics Manager/Actuarial Scientist/Business
More informationLog- linear models. Natural Language Processing: Lecture Kairit Sirts
Log- linear models Natural Language Processing: Lecture 3 21.09.2017 Kairit Sirts The goal of today s lecture Introduce the log- linear/maximum entropy model Explain the model components: features, parameters,
More informationPython for. Data Science. by Luca Massaron. and John Paul Mueller
Python for Data Science by Luca Massaron and John Paul Mueller Table of Contents #»» *» «»>»»» Introduction 1 About This Book 1 Foolish Assumptions 2 Icons Used in This Book 3 Beyond the Book 4 Where to
More informationADVANCED CLASSIFICATION TECHNIQUES
Admin ML lab next Monday Project proposals: Sunday at 11:59pm ADVANCED CLASSIFICATION TECHNIQUES David Kauchak CS 159 Fall 2014 Project proposal presentations Machine Learning: A Geometric View 1 Apples
More informationDetecting ads in a machine learning approach
Detecting ads in a machine learning approach Di Zhang (zhangdi@stanford.edu) 1. Background There are lots of advertisements over the Internet, who have become one of the major approaches for companies
More informationIntel Distribution For Python*
Intel Distribution For Python* Intel Distribution for Python* 2017 Advancing Python performance closer to native speeds Easy, out-of-the-box access to high performance Python High performance with multiple
More informationEPL451: Data Mining on the Web Lab 10
EPL451: Data Mining on the Web Lab 10 Παύλος Αντωνίου Γραφείο: B109, ΘΕΕ01 University of Cyprus Department of Computer Science Dimensionality Reduction Map points in high-dimensional (high-feature) space
More informationCS 224N: Assignment #1
Due date: assignment) 1/25 11:59 PM PST (You are allowed to use three (3) late days maximum for this These questions require thought, but do not require long answers. Please be as concise as possible.
More informationText classification with Naïve Bayes. Lab 3
Text classification with Naïve Bayes Lab 3 1 The Task Building a model for movies reviews in English for classifying it into positive or negative. Test classifier on new reviews Takes time 2 Sentiment
More informationSupport Vector Machines + Classification for IR
Support Vector Machines + Classification for IR Pierre Lison University of Oslo, Dep. of Informatics INF3800: Søketeknologi April 30, 2014 Outline of the lecture Recap of last week Support Vector Machines
More informationPersonalized Web Search
Personalized Web Search Dhanraj Mavilodan (dhanrajm@stanford.edu), Kapil Jaisinghani (kjaising@stanford.edu), Radhika Bansal (radhika3@stanford.edu) Abstract: With the increase in the diversity of contents
More information