Machine Learning in Python. Rohith Mohan GradQuant Spring 2018

Similar documents
Lecture 25: Review I

Lecture 27: Review. Reading: All chapters in ISLR. STATS 202: Data mining and analysis. December 6, 2017

Using Machine Learning to Optimize Storage Systems

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Contents Machine Learning concepts 4 Learning Algorithm 4 Predictive Model (Model) 4 Model, Classification 4 Model, Regression 4 Representation

Python With Data Science

Large Scale Data Analysis Using Deep Learning

CS6375: Machine Learning Gautam Kunapuli. Mid-Term Review

Lecture on Modeling Tools for Clustering & Regression

PARALLEL CLASSIFICATION ALGORITHMS

SUPERVISED LEARNING METHODS. Stanley Liang, PhD Candidate, Lassonde School of Engineering, York University Helix Science Engagement Programs 2018

Machine Learning with Python

Applied Statistics for Neuroscientists Part IIa: Machine Learning

Supervised vs unsupervised clustering

Introduction. Welcome. Machine Learning

K Nearest Neighbor Wrap Up K- Means Clustering. Slides adapted from Prof. Carpuat

K-Means Clustering 3/3/17

Accelerated Machine Learning Algorithms in Python

Evaluating Classifiers

Network Traffic Measurements and Analysis

An Unsupervised Approach for Combining Scores of Outlier Detection Techniques, Based on Similarity Measures

Regularization and model selection

CS4445 Data Mining and Knowledge Discovery in Databases. A Term 2008 Exam 2 October 14, 2008

Model Complexity and Generalization

Introduction to Artificial Intelligence

MIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

COMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 6: k-nn Cross-validation Regularization

Evaluating Classifiers

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error

Unsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing

CS178: Machine Learning and Data Mining. Complexity & Nearest Neighbor Methods

10 things I wish I knew. about Machine Learning Competitions

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Facial Expression Classification with Random Filters Feature Extraction

NMLRG #4 meeting in Berlin. Mobile network state characterization and prediction. P.Demestichas (1), S. Vassaki (2,3), A.Georgakopoulos (2,3)

Tutorials (M. Biehl)

The exam is closed book, closed notes except your one-page (two-sided) cheat sheet.

Using Machine Learning to Identify Security Issues in Open-Source Libraries. Asankhaya Sharma Yaqin Zhou SourceClear

Topics in Machine Learning

Detecting Thoracic Diseases from Chest X-Ray Images Binit Topiwala, Mariam Alawadi, Hari Prasad { topbinit, malawadi, hprasad

A Survey On Data Mining Algorithm

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

CS249: ADVANCED DATA MINING

Predicting Popular Xbox games based on Search Queries of Users

Classification by Support Vector Machines

Business Club. Decision Trees

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Machine Learning in the Process Industry. Anders Hedlund Analytics Specialist

Weka ( )

Lecture #11: The Perceptron

UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 15: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

K-Nearest Neighbors. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824

UVA CS 4501: Machine Learning. Lecture 10: K-nearest-neighbor Classifier / Bias-Variance Tradeoff. Dr. Yanjun Qi. University of Virginia

Data Mining: Models and Methods

Cross-validation. Cross-validation is a resampling method.

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Artificial Intelligence. Programming Styles

Robust PDF Table Locator

Random Forest A. Fornaser

Artificial Neural Networks (Feedforward Nets)

Application of Support Vector Machine Algorithm in Spam Filtering

Lecture 1 Notes. Outline. Machine Learning. What is it? Instructors: Parth Shah, Riju Pahwa

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

Slides for Data Mining by I. H. Witten and E. Frank

CSC411/2515 Tutorial: K-NN and Decision Tree

Data Science Bootcamp Curriculum. NYC Data Science Academy

Data Science Training

R (2) Data analysis case study using R for readily available data set using any one machine learning algorithm.

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Hyperparameters and Validation Sets. Sargur N. Srihari

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

Tutorial on Machine Learning Tools

Orange3 Educational Add-on Documentation

RESAMPLING METHODS. Chapter 05

Predict the Likelihood of Responding to Direct Mail Campaign in Consumer Lending Industry

Machine Learning Software ROOT/TMVA

ECG782: Multidimensional Digital Signal Processing

Evaluation Metrics. (Classifiers) CS229 Section Anand Avati

1 Machine Learning System Design

Lecture 13: Model selection and regularization

Clustering & Classification (chapter 15)

CS 8520: Artificial Intelligence. Machine Learning 2. Paula Matuszek Fall, CSC 8520 Fall Paula Matuszek

Fraud Detection using Machine Learning

CSE Data Mining Concepts and Techniques STATISTICAL METHODS (REGRESSION) Professor- Anita Wasilewska. Team 13

scikit-learn (Machine Learning in Python)

Machine Learning / Jan 27, 2010

A Dendrogram. Bioinformatics (Lec 17)

Problems 1 and 5 were graded by Amin Sorkhei, Problems 2 and 3 by Johannes Verwijnen and Problem 4 by Jyrki Kivinen. Entropy(D) = Gini(D) = 1

CS 8520: Artificial Intelligence

Machine Learning Techniques for Data Mining

CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp

Natural Language Processing

Understanding Andrew Ng s Machine Learning Course Notes and codes (Matlab version)

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

Features: representation, normalization, selection. Chapter e-9

Interpretable Machine Learning with Applications to Banking

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

CS4491/CS 7265 BIG DATA ANALYTICS

Transcription:

Machine Learning in Python Rohith Mohan GradQuant Spring 2018

What is Machine Learning? https://twitter.com/myusuf3/status/995425049170489344

Traditional Programming Data Computer Program Output Getting computers to program themselves Coding is the bottleneck, let data dictate programming Machine Learning Data Computer Output Program http://www.hlt.utdallas.edu/~vgogate/ml/2013f/lectures.html

Formal Definitions Arthur Samuel (1959) Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed. Created a program for computer to play itself in checkers (10000s games) and learn at IBM Tom Mitchell (1998) Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. Andrew Ng Machine Learning Coursera

Machine Learning Developed out of initial work in Artificial Intelligence (AI) Increased availability of large datasets and advances in computing architecture boosted usage in recent times https://en.wikipedia.org/wiki/timeline_of_machine_learning

Usage Natural Language Processing + Computer Vision Mining and clustering gene expression data to identify individuals Reproducing human behavior (True AI) Recommendation algorithms https://www.irishnews.com/magazine/science/2018/01/01/news/12-of-the-biggest-scientificbreakthroughs-of-2017-that-might-just-change-the-world-1222695/ https://www.flickr.com/photos/theadamclarke/2589233355 http://www.idownloadblog.com/2016/05/12/google-translate-offline-mode/ https://de.wikipedia.org/wiki/genexpressionsanalyse

Common steps in ML workflow Collect data (various sources, UCI data repository, news orgs, Kaggle) Prepare data (exploratory analysis, feature selection, regularization) Selecting and training model (train and test datasets, what model?) Evaluating model (accuracy, precision, ROC curves, F1 score) Optimizing performance (change model, # of features, scaling)

scikit-learn http://scikit-learn.org/stable/index.html

Preprocessing Clean data and deal with missing values, etc. Feature scaling - rescaling features to be more sensible Standardization - getting various features into similar range (e.g. -1 to 1) Square footage of a house (100s of ft) vs # of rooms (1-5) http://scikit-learn.org/stable/modules/preprocessing.html

Preprocessing Clean data and deal with missing values, etc. Feature scaling - rescaling features to be more sensible Standardization - getting various features into similar range (e.g. -1 to 1) Square footage of a house (100s of ft) vs # of rooms (1-5) Normalization scaling to some standard (e.g. subtract mean & divide by SD) Many others (regularization,imputation, generating polynomial features, etc.) http://scikit-learn.org/stable/modules/preprocessing.html

Preprocessing Clean data and deal with missing values, etc. Feature scaling - rescaling features to be more sensible Standardization - getting various features into similar range (e.g. -1 to 1) Square footage of a house (100s of ft) vs # of rooms (1-5) Normalization scaling to some standard (e.g. subtract mean & divide by SD) http://scikit-learn.org/stable/modules/preprocessing.html

Importance of feature scaling http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html

Comparison of scaling StandardScaler

Comparison of scaling RobustScaler

Train Test (Cross Validate?) Why do we need to split up our datasets? Overfitting Split dataset Train for training your model on Test evaluate performance of model Usually 40% for testing is enough Validation set? Cross-validation Split up training set into subsets and evaluate performance (can be more computationally expensive but conserves data) Hyper-parameter tuning

Bias-variance tradeoff Underfitting High Bias Overfitting High Variance http://scott.fortmann-roe.com/docs/biasvariance.html http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selectionplot-underfitting-overfitting-py

Bias-variance tradeoff http://scott.fortmann-roe.com/docs/biasvariance.html

How to select a model?

http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Supervised vs Unsupervised Learning Supervised Regression, classification Input variables, output variable, learn mapping of input to output Unsupervised Clustering, association, etc. No correct answers and no teacher Semi-supervised Partially labeled dataset of images Mixing both techniques is what occurs in real-world

Regression Linear regression (OLS) Prediction Multiple variables/features? Feature selection https://www.xlstat.com/en/solutions/features/ordinary-least-squares-regression-ols http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

Feature Selection https://www.coursera.org/learn/machine-learning

Feature Selection https://www.coursera.org/learn/machine-learning

Regression Linear regression (OLS) Prediction Multiple variables/features? Feature selection Length, width of a house (area?) Regularization https://www.xlstat.com/en/solutions/features/ordinary-least-squares-regression-ols http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

Regularization https://www.coursera.org/learn/machine-learning

Regularization http://enhancedatascience.com/2017/07/04/machine-learning-explained-regularization/

http://scikit-learn.org/stable/auto_examples/model_selection/plot_train_error_vs_test_error.html#sphx-glr-auto-examples-model-selectionplot-train-error-vs-test-error-py

Classification Logistic Regression http://www.saedsayad.com/logistic_regression.htm

Classification Logistic Regression https://medium.com/technology-nineleaps/logistic-regression-bac1db38cb8c https://mapr.com/blog/predicting-breast-cancer-using-apache-spark-machine-learning-logistic-regression/

Classification SVM http://scikit-learn.org/0.18/auto_examples/svm/plot_separating_hyperplane.html

Evaluating Performance Accuracy how many predictions are correct out of the entire dataset? Can be a flawed metric Precision and Recall https://en.wikipedia.org/wiki/precision_and_recall

Evaluating Performance Accuracy how many predictions are correct out of the entire dataset? Can be a flawed metric Precision and Recall ROC curves F1 score

Evaluating Performance http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py

Classification - K-Nearest Neighbors Robust to noisy training data More effective with larger datasets Need to determine parameter K (number of nearest neighbors) What type of distance metric? High computation cost

Clustering Unsupervised learning Can help you understand structure of your data Various types of clustering: K-means, Hierarchical, Ward

K-means Randomly choose k centroids Form clusters around it Take mean of cluster to identify new centroid Repeat until convergence https://www.youtube.com/watch?v=_awzggnrcic