Introduction to machine learning, pattern recognition and statistical data modelling Coryn Bailer-Jones

What is machine learning?

Data interpretation
- describing the relationship between predictors and responses
- finding natural groups or classes in data
- relating observables (or a function thereof) to physical quantities

Prediction
- capturing the relationship between inputs and outputs for a set of labelled data, with the goal of predicting outputs for unlabelled data ("pattern recognition")

Learning from data.

What is statistics?
- the systematic and quantitative way of making inferences from data
- the data are considered as the outcome of a random event
- variability is expressed by a probability distribution
- mathematics is used to manipulate probability distributions
- this allows us to write down statistical models for the data and solve for the parameters of interest

Many names...
- machine learning
- statistical learning
- pattern recognition
- statistical data modelling
- data mining (although this has other meanings)
- multivariate data analysis
- ...

Parameter estimation from stellar spectra
- learn the mapping from spectra to parameters using labelled examples
- a multidimensional (few hundred dimensions), nonlinear inverse problem
[Figures: Willemsen et al. (2005)]

Galaxy spectral classification
- we want to classify galaxies based on the appearance of their observed spectra
- we can simulate spectra of known classes
- there is much variance within these basic classes
[Figures, Tsalmantza et al. (2007), www.astro.princeton.edu. Right: optical synthetic spectra of 8 basic galaxy types (SDSS filters overlaid). Top right: colour-colour diagram showing the 8 basic types compared to the locus of SDSS galaxies. Above: locus of 10,000 simulated galaxies in which various parameters have been varied.]

Course objectives
- learn the basic concepts of machine learning
- learn the basic tools and methods of machine learning
- identify appropriate methods
- interpret results in light of the methods used
- recognise inherent limitations and assumptions
- linear and nonlinear methods
- methods for high-dimensional data
- become familiar with a freely available package for modelling data (R)

[Figures, Tsalmantza et al. (2007). Right: optical synthetic spectra of the 8 basic galaxy types (SDSS filters overlaid). Top right: simulated Gaia spectra of the 8 basic galaxy types. Above: the 10,000 simulated galaxy spectra projected into the space of the first 3 principal components; these plus the mean explain 99.7% of the variance.]

Lecture schedule

Online material and texts
http://www.mpia.de/homes/calj/ss2007_mlpr.html
- viewgraphs
- R scripts used in lectures
- bibliography, links to articles
- links to R tutorials
- recommended book: The Elements of Statistical Learning, Hastie et al. (2001), Springer
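Where the figure caption above mentions projecting the simulated spectra onto their first few principal components, the operation can be sketched in R (the package used throughout the course). This is a minimal illustration on random stand-in data, not the Tsalmantza et al. spectra:

    # PCA sketch: project high-dimensional "spectra" onto the first 3 PCs.
    # The data matrix is random stand-in data for illustration only.
    set.seed(1)
    spectra <- matrix(rnorm(200 * 50), nrow = 200)   # 200 objects x 50 flux bins

    pca <- prcomp(spectra, center = TRUE)            # PCA about the mean spectrum
    summary(pca)$importance["Cumulative Proportion", 1:3]  # variance explained by PC1-PC3
    proj <- pca$x[, 1:3]                             # each object in the space of the first 3 PCs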

The course
- assumed knowledge:
  - simple probability theory and statistics (distributions, hypothesis testing, least squares, ...)
  - linear algebra (matrices, eigenvalue problems, ...)
  - elementary calculus
- interact: learn by being active, not passive! question, criticize, etc.
- hands-on: learn a package and play with data

R (S, S-PLUS)
http://www.r-project.org
- a language and environment for statistical computing and graphics
- open source; runs on Linux, Windows, MacOS
- operations for vectors and matrices
- a large number of statistical and machine learning packages
- can link to C, C++ and Fortran code
- a good book on using R for statistics: Modern Applied Statistics with S, Venables & Ripley, 2002, Springer

Supervised and unsupervised learning

Supervised learning
- for each observed vector of predictors ("inputs", "independent variables"), x, there are one or more dependent variables ("responses", "outputs"), y, or two or more classes, C
- regression problems: the goal is to learn a function y = f(x; θ), where θ is a set of parameters to be inferred from a training set of pre-labelled vectors {x, y}
- classification problems: the goal is either to define decision boundaries between objects with different classes, or to model the density of the class probabilities over the data, i.e. P(C = C_1) = f(x; θ), where θ parametrizes the probability density function (PDF) and is learned from a training set of pre-classified vectors {x, C}

Unsupervised learning
- no pre-labelled data or pre-defined dependent variables or classes
- the goal is to find either natural classes/clusterings in the data, or simpler (e.g. lower-dimensional) variables which explain the data
- examples: PCA, K-means clustering

A simple problem of data fitting (a sketch in R follows)
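To make the data-fitting problem concrete, here is a hedged sketch in R of the situation the next slides discuss: noisy samples of a smooth function, fitted with models of different flexibility. The generating function and noise level are invented for illustration:

    # Data fitting: rigid vs. flexible models on noisy samples.
    # The underlying function sin(2*pi*x) is an invented example.
    set.seed(2)
    x <- seq(0, 1, length.out = 50)
    y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)   # "truth" plus noise

    fit1 <- lm(y ~ x)            # global linear model: stable but biased
    fit9 <- lm(y ~ poly(x, 9))   # 9th-order polynomial: low bias, high variance

    plot(x, y)
    lines(x, predict(fit1), lty = 2)   # underfits the curvature
    lines(x, predict(fit9), lty = 1)   # chases the noise without regularization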

Learning, generalization and regularization
- see the R scripts on the web page
- make assumptions about the smoothness of the function: regularization
- generalization: take into account the variance (errors) in the data
- domain knowledge helps (if it's reliable...)
- this is data interpolation; extrapolation is less constrained

Notation (as used by Hastie et al.)
- input variables are denoted by X, output variables by Y
- there are i = 1..N observations, each with j = 1..p dimensions
- upper case is used to refer to generic aspects of variables; if a vector, subscripts access components
- specific observed values are written in lower case
- bold is used for vectors and matrices: vectors in lower case, matrices in upper case (Hastie et al. do not use bold for p-vectors)
- parameters use Greek letters and can, of course, also be vectors

Fitting a model: linear least squares
Determine the parameters by minimizing the sum-of-squares error over all N training data,

    RSS(β) = (y - Xβ)^T (y - Xβ),

where X is an N x p matrix and β is a p x 1 column vector. This is quadratic in β, so it always has a minimum; differentiate with respect to β and set the derivative to zero. If X^T X (the "information matrix") is non-singular, then the unique solution is

    β̂ = (X^T X)^{-1} X^T y.
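The closed-form solution above can be checked directly in R against the built-in lm() fit; the data below are simulated for illustration:

    # Linear least squares: closed-form solution vs. lm().
    # Simulated data; the leading column of ones carries the intercept.
    set.seed(3)
    N <- 100
    X <- cbind(1, matrix(rnorm(N * 2), ncol = 2))   # N x p design matrix (p = 3)
    beta_true <- c(0.5, 2, -1)
    y <- X %*% beta_true + rnorm(N, sd = 0.2)

    beta_hat <- solve(t(X) %*% X, t(X) %*% y)       # (X^T X)^{-1} X^T y
    coef(lm(y ~ X - 1))                             # same estimates from lm()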

2-class classification: linear decision boundary
- code the classes as G_i = 0 for the green class and G_i = 1 for the red class
- fit by least squares; the boundary is the set of points where the fitted value equals 0.5
[Figures: Hastie, Tibshirani, Friedman (2001)]

2-class classification: K-nearest neighbours
[Figures: Hastie, Tibshirani, Friedman (2001)]

Comparison
Linear model:
- makes a very strong assumption about the data, viz. that it is well approximated by a globally linear function
- stable but biased
- learns the relationship between (X, y) and encapsulates it in the parameters β
K-nearest neighbours:
- makes no assumption about the functional form of the relationship (X, y), i.e. it is nonparametric
- but does assume that the function is well approximated by a locally constant function
- less stable but less biased
- no free parameters to learn, so application to new data is relatively slow: a brute-force search for the neighbours takes O(N)

Which solution is optimal?
- if we know nothing about how the data were generated (underlying model, noise), we don't know
- if the data are drawn from two uncorrelated Gaussians, the linear decision boundary is almost optimal
- if the data are drawn from a mixture of multiple distributions, the linear boundary is not optimal (nonlinear, disjoint)
- what is optimal? the smallest generalization error; a simple solution (interpretability)
- more complex models permit lower errors on the training data, but we want models to generalize, so we need to control complexity/nonlinearity (regularization)
- with enough training data, wouldn't k-NN be best?
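The two classifiers compared above can be run side by side in R on simulated data. This is a minimal sketch using the 'class' package (shipped with R) for k-NN; the two-Gaussian setup is the case where the linear boundary is nearly optimal:

    # Linear least-squares classifier vs. k-nearest neighbours.
    # Two Gaussian classes: the setting where the linear boundary is near-optimal.
    library(class)
    set.seed(4)
    n <- 100
    X <- rbind(matrix(rnorm(2 * n), ncol = 2),              # class 0
               matrix(rnorm(2 * n, mean = 1.5), ncol = 2))  # class 1
    g <- rep(c(0, 1), each = n)

    fit <- lm(g ~ X)                             # regress the 0/1 labels on the inputs
    pred_lin <- as.numeric(fitted(fit) > 0.5)    # decision boundary at fitted value 0.5

    pred_knn <- knn(train = X, test = X, cl = factor(g), k = 15)  # locally constant fit

    mean(pred_lin == g)    # training accuracy, linear
    mean(pred_knn == g)    # training accuracy, 15-NN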

Summary
- supervised vs. unsupervised learning
- generalization and regularization
- regression vs. classification
- parametric vs. non-parametric
- linear regression, k-nearest neighbours
- least squares