Lecture on Modeling Tools for Clustering & Regression. CS 590.21 Analysis and Modeling of Brain Networks, Department of Computer Science, University of Crete.

Data Clustering Overview Organizing data into sensible groupings is critical for understanding and learning. Cluster analysis: methods/algorithms for grouping objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category labels distinguishes data clustering (unsupervised learning) from classification (supervised learning). Clustering aims to find structure in data and is therefore exploratory in nature.

Clustering has a long, rich history in various scientific fields. K-means (1955) is one of the most popular simple clustering algorithms, and it is still widely used! The design of a general-purpose clustering algorithm is a difficult task.

Clustering

Clustering objectives

The centroid of a cluster can be computed as the point whose coordinates are the means of the coordinates of the cluster's points. Other metrics, based on similarity (e.g., cosine), could also be employed.

k-means: the simplest unsupervised learning algorithm. Iterative greedy algorithm (for K clusters): 1. Place K points into the space represented by the objects being clustered; these points represent the initial group centroids (e.g., start by randomly selecting K objects as centroids). 2. Assign each object to the group with the closest centroid (e.g., by Euclidean distance). 3. When all objects have been assigned, recalculate the positions of the K centroids. Repeat Steps 2 and 3 until the centroids no longer move. It converges but does not guarantee an optimal solution. It is a heuristic!
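
A minimal NumPy sketch of this loop, for illustration only (the function name, initialization, and stopping test below are assumptions, not the lecture's code):

    import numpy as np

    def kmeans(X, K, max_iter=100, seed=0):
        """Cluster the rows of X into K groups with the basic k-means loop."""
        rng = np.random.default_rng(seed)
        # Step 1: randomly pick K objects as the initial centroids
        centroids = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(max_iter):
            # Step 2: assign each object to the closest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 3: recompute each centroid as the mean of its assigned points
            new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                      else centroids[k] for k in range(K)])
            if np.allclose(new_centroids, centroids):   # centroids no longer move
                break
            centroids = new_centroids
        return labels, centroids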

Example 1 Well-defined clusters

Example 1 Well-defined clusters (cont'd)

Example 2 Not so well-defined clusters The initial data points before clustering.

Example 2 Not so well-defined clusters The initial data points before clustering. Random selection of centroids (circled points).

Example 2 (cont'd) After the grouping of the points (in the first iteration), the true centroid of each group has been computed. Note that it differs from the initially, randomly selected centroid. The process continues until the centroids no longer change.

Criteria for Assessing a Clustering An internal criterion analyzes the intrinsic characteristics of a clustering. An external criterion analyzes how close a clustering is to a reference. A relative criterion analyzes the sensitivity of the internal criterion during clustering generation. The measured quality of a clustering depends on both the object representation & the similarity measure used.

Properties of a good clustering according to the internal criterion: high intra-class (intra-cluster) similarity and low inter-class similarity. Cluster cohesion measures how closely related the objects in a cluster are. Cluster separation measures how well separated a cluster is from the other clusters. The measured quality depends on the object representation & the similarity measure used.

Silhouette Coefficient (figure: scatter plot of clustered points)

Silhouette Measures Cohesion Compared to Separation How similar an object is to its own cluster (cohesion) compared to other clusters (separation). Ranges in [-1, 1]: a high value indicates that the object is well matched to its own cluster & poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. a(i): average dissimilarity of i with all other data within the same cluster. b(i): lowest average dissimilarity of i to any other cluster of which i is not a member. The silhouette of i is s(i) = (b(i) - a(i)) / max(a(i), b(i)).
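
A short scikit-learn sketch of the silhouette computation (the library choice and the synthetic data are assumptions for illustration; the lecture does not prescribe an implementation):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    rng = np.random.default_rng(0)
    # Two synthetic, well-separated blobs
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    s = silhouette_samples(X, labels)   # per-object s(i) = (b(i) - a(i)) / max(a(i), b(i))
    print("mean silhouette:", silhouette_score(X, labels))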

External criteria for clustering quality External criteria analyze how close a clustering is to a reference. Quality is measured by the ability to discover some or all of the hidden patterns or latent classes in gold-standard data. Assessing a clustering with respect to ground truth requires labeled data. Assume items with C gold-standard classes, while our clustering algorithm produces K clusters, ω_1, ω_2, ..., ω_K, with n_i members.

External Evaluation of Cluster Quality (cont'd) Assume items with C gold-standard classes, while our clustering produces K clusters, ω_1, ω_2, ..., ω_K, with n_i members. Purity: the ratio between the count of the dominant class in cluster ω_i and the size of cluster ω_i, i.e., Purity(ω_i) = (1/n_i) max_j n_ij, where n_ij is the number of items of class j in cluster ω_i. Purity is biased because having n clusters (one per item) maximizes it. Alternatives: entropy of the classes in the clusters, and mutual information between classes and clusters.

Purity example (figure: Clusters I, II and III) Cluster I: Purity = (1/6) max(5, 1, 0) = 5/6. Cluster II: Purity = (1/6) max(1, 4, 1) = 4/6. Cluster III: Purity = (1/5) max(2, 0, 3) = 3/5.
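
The same computation as a small NumPy sketch, using the contingency counts implied by the example above:

    import numpy as np

    # Rows: clusters I-III; columns: counts of each gold-standard class in the cluster
    counts = np.array([[5, 1, 0],
                       [1, 4, 1],
                       [2, 0, 3]])

    per_cluster_purity = counts.max(axis=1) / counts.sum(axis=1)   # [5/6, 4/6, 3/5]
    overall_purity = counts.max(axis=1).sum() / counts.sum()       # (5 + 4 + 3) / 17
    print(per_cluster_purity, overall_purity)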

Entropy-based Measure of the Quality of Clustering

Mutual-information based Measure of Quality of Clustering
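
The formulas on these two slides are not in the transcript; as a hedged sketch, both measures can be computed from the cluster/class labels as follows (the labels below are made up, and the library calls are one possible choice, not the lecture's):

    import numpy as np
    from scipy.stats import entropy
    from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

    classes  = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])   # gold-standard class labels
    clusters = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0])   # cluster assignments

    # Entropy of the class distribution inside each cluster (lower is better)
    for k in np.unique(clusters):
        _, counts = np.unique(classes[clusters == k], return_counts=True)
        print(f"cluster {k}: class entropy = {entropy(counts, base=2):.3f}")

    # Mutual information between classes and clusters (higher is better)
    print("MI :", mutual_info_score(classes, clusters))
    print("NMI:", normalized_mutual_info_score(classes, clusters))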

Example Clustering of neurons at positions A, B and C in the conditional STTC (A, B | C). 1st cluster: neurons with approximately equal participation at each position. 2nd cluster: neurons with a high presence at position B.

Linear Regression for Predictive Modeling Suppose a set of observations y and a set of explanatory variables (i.e., predictors) x_1, ..., x_p. We build a linear model in which y is given as a weighted sum of the predictors, with the weights being the coefficients β_1, ..., β_p.

Why use linear regression? To assess the strength of the relationship between y and a variable x_i, and the impact of each predictor x_i on y, through the magnitude of β_i. To identify subsets of X that contain redundant information about y.

Simple linear regression Suppose that we have obtained the dataset (x_i, y_i), i = 1, ..., n, and we want to model the y_i as a linear function of the x_i. To determine the optimal β, we solve the least squares problem, i.e., we find the β that minimizes the Sum of Squared Errors (SSE).

An intercept term β_0 captures the part of y not caught by the predictor variable. Again we estimate the coefficients using least squares, once without the intercept term and once with the intercept term.
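
A small NumPy sketch of both fits (the x/y values are invented; only the mechanics of adding a column of ones for the intercept are meant to be illustrative):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.0, 1.8, 2.9, 3.4, 4.1])

    # Without intercept: y ≈ β1·x
    beta, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
    sse_no_intercept = np.sum((y - x * beta[0]) ** 2)

    # With intercept: y ≈ β0 + β1·x (prepend a column of ones)
    X = np.column_stack([np.ones_like(x), x])
    beta0_beta1, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse_intercept = np.sum((y - X @ beta0_beta1) ** 2)

    print(sse_no_intercept, sse_intercept)   # adding the intercept can only lower (or match) the training SSE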

Example 2 The intercept term improves the accuracy of the model.
Without intercept (SSE = 3.55): Predicted Y = 0.70, 1.40, 2.10, 2.80, 3.50; Squared Error = 0.09, 0.36, 0.64, 0.90, 1.56.
With intercept (SSE = 2.67): Predicted Y = 1.20, 1.60, 2.00, 2.50, 2.90; Squared Error = 0.04, 0.16, 0.49, 1.56, 0.42.

Multiple linear regression Models the relationship between two or more predictors & the target: y = β_0 + β_1 x_1 + ... + β_p x_p, where β_1, β_2, ..., β_p are the optimal coefficients of the predictors x_1, x_2, ..., x_p, respectively, i.e., the coefficients that minimize the sum of squared errors.
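
A scikit-learn sketch of a multiple linear regression fit (synthetic predictors and target; the library choice is an assumption, not the lecture's tooling):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                                   # predictors x1, x2, x3
    y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.1, size=100)

    model = LinearRegression().fit(X, y)                            # least-squares fit
    print("coefficients:", model.coef_)                             # estimates of β1..βp
    print("intercept:   ", model.intercept_)                        # estimate of β0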

Regularization The process of introducing additional information in order to prevent overfitting. The loss function becomes the sum of squared errors plus a penalty term, where λ controls the importance of the regularization.

Regularization Bias: error from erroneous assumptions about the training data; high bias leads to underfitting. Variance: error from sensitivity to small fluctuations (noise) in the training data; high variance leads to overfitting. Increasing λ shrinks the magnitude of the coefficients.

Regularization Bias: error from erroneous assumptions about the training data; we miss relevant relations between predictors & target (underfitting). Variance: error from sensitivity to small fluctuations in the training data; we fit the noise, not the intended output (overfitting). Bias-variance tradeoff: ignore small details to get a more general big picture.

Ridge regression Given a vector of observations y & a predictor matrix X, the ridge regression coefficients are defined as the β minimizing the sum of squared errors plus λ times the squared l2 norm of β. We are not only minimizing the squared error but also the size of the coefficients!

Ridge regression as regularization If the β_j are unconstrained, they can explode and hence are susceptible to very high variance! To control variance, we regularize the coefficients, i.e., control how large they can grow.
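
A scikit-learn sketch of ridge shrinkage (alpha plays the role of λ; the data are synthetic and only illustrate how the coefficient magnitudes shrink as λ grows):

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))
    y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(0, 0.5, size=60)

    for lam in [0.01, 1.0, 10.0, 100.0]:
        coefs = Ridge(alpha=lam).fit(X, y).coef_
        print(f"λ = {lam:>6}: {np.round(coefs, 3)}")   # shrinks toward 0, but rarely exactly 0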

Example 3 (figure): fits ranging from underfitting to overfitting, for different sizes of λ.

Variable selection The problem of selecting the most relevant predictors from a larger set of predictors. In linear modeling, this means estimating some coefficients to be exactly zero. This can be very important for the purposes of model interpretation. Ridge regression cannot perform variable selection: it does not set coefficients exactly to zero, unless λ = ∞.

Example 4 Suppose that we study the level of prostate-specific antigen (PSA), which is often elevated in men with prostate cancer. We examine n = 97 men with prostate cancer & p = 8 clinical measurements. Interested in identifying a small number of predictors, say 2 or 3, that impact PSA. We perform ridge regression over a wide range of λ

Lasso regression The lasso coefficients are defined as the β minimizing the sum of squared errors plus λ times the l1 norm of β. The only difference between lasso & ridge regression is the penalty term: ridge uses an l2 penalty, lasso uses an l1 penalty.

Lasso regression λ ≥ 0 is a tuning parameter controlling the strength of the penalty. The nature of the l1 penalty causes some coefficients to be shrunken exactly to zero. As λ increases, more coefficients are set to zero, i.e., fewer predictors are selected. The lasso can therefore perform variable selection.
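
A matching scikit-learn sketch for the lasso (same synthetic-data idea as the ridge sketch above; alpha again plays the role of λ):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))
    y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(0, 0.5, size=60)

    for lam in [0.01, 0.1, 0.5, 1.0]:
        coefs = Lasso(alpha=lam).fit(X, y).coef_
        print(f"λ = {lam:>4}: {np.round(coefs, 3)}  zeros = {int(np.sum(coefs == 0))}")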

Example 5: Ridge vs. Lasso lcp, age & gleason, the least important predictors, are set to zero.

Example 6: Ridge (left) vs. Lasso (right)

Constrained form of lasso & ridge For any λ and its corresponding solution of the penalized form, there is a value of t such that the constrained form (minimize the SSE subject to the penalty being at most t) has the same solution. The imposed constraint restricts the coefficient vector to lie in a geometric shape centered around the origin. The type of shape (i.e., the type of constraint) really matters!
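
The slide's own equations are not in the transcript; in the usual textbook notation (an assumption about the exact form used in the lecture), the two formulations read:

    % Penalized form (q = 2 for ridge, q = 1 for lasso)
    \hat{\beta} = \arg\min_{\beta} \ \lVert y - X\beta \rVert_2^2
                  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert^{q}

    % Equivalent constrained form: for every \lambda there is a budget t giving the same solution
    \hat{\beta} = \arg\min_{\beta} \ \lVert y - X\beta \rVert_2^2
    \quad \text{subject to} \quad \sum_{j=1}^{p} \lvert \beta_j \rvert^{q} \le t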

The elliptical contours represent the sum-of-squared-errors term; the diamond shape indicates the constraint region; the coefficient β_1 is 0 at the solution.

Coefficients β_1 & β_2 ≠ 0

Why does lasso set coefficients to zero? The elliptical contour plot represents the sum-of-squared-errors term; the diamond shape in the middle indicates the constraint region. The optimal point is the first intersection between the ellipse and the constraint region; for the lasso this is often a corner of the diamond region, where a coefficient is zero. With ridge, instead, the circular constraint region has no corners, so the solution rarely falls on an axis and coefficients are not set exactly to zero.

Regularization penalizes hypothesis complexity. L2 regularization leads to small weights. L1 regularization leads to many zero weights (sparsity). Feature selection aims to discard irrelevant features.