Clustering & Classification (chapter 15)


Transcription:

Clustering & Classification (chapter 5)
Kai Goebel, Bill Cheetham
RPI / GE Global Research
goebel@cs.rpi.edu, cheetham@cs.rpi.edu
9/9/2003

Outline:
- k-means
- Fuzzy c-means
- Mountain Clustering
- kNN
- Fuzzy kNN
- Hierarchical Methods
- Adaptive Clustering

Preliminaries
- Partitioning of data into several groups such that the similarity within a group is larger than that among groups
- Clustering = classification
- Need a similarity metric
- Need to normalize the data
- Supervised vs. unsupervised clustering issues
- Unsupervised: labeling cost is high (large number of data, costly experiments, data not available, ...)
- Understand the internal distribution
- Preprocessing for classification

k-means Clustering
- Partitions data into k (or c) groups
- Finds cluster centers
- Dissimilarity measure, e.g. the squared Euclidean distance

  $d(x_m, c_i) = \sum_{p=1}^{q} (x_{mp} - c_{ip})^2$

  such that the cost function

  $J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{m,\, x_m \in G_i} d(x_m, c_i) = \sum_{i=1}^{c} \sum_{m,\, x_m \in G_i} \sum_{p=1}^{q} (x_{mp} - c_{ip})^2$

  is minimized, where c = number of clusters, n = number of data points within a cluster, q = number of dimensions.
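The dissimilarity and cost function above can be sketched in plain Python; this is a minimal illustration (the function names and toy values are mine, not from the slides):

```python
def sq_euclidean(x, c):
    # d(x_m, c_i) = sum over dimensions p of (x_mp - c_ip)^2
    return sum((xp - cp) ** 2 for xp, cp in zip(x, c))

def kmeans_cost(data, centers, assignment):
    # J = sum over clusters i of sum over x_m in G_i of d(x_m, c_i);
    # assignment[m] is the cluster index of data point m
    return sum(sq_euclidean(x, centers[assignment[m]])
               for m, x in enumerate(data))
```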

Other Dissimilarity Measures

- Cosine-based: $d(x, y) = \frac{x^T y}{\|x\| \, \|y\|}$
- Tanimoto (binary x, y): $d(x, y) = \frac{x^T y}{x^T x + y^T y - x^T y}$
- Minkowski: $d(x, y) = \|x - y\|_m = \left( \sum_{i=1}^{N} (x_i - y_i)^m \right)^{1/m}$

k-means Algorithm
1. Initialize the cluster centers (randomly?)
2. Assign each data point to the closest cluster center
3. Compute the cost function (stop if below a threshold)
4. Update the cluster centers: $c_i = \frac{1}{|G_i|} \sum_{k,\, x_k \in G_i} x_k$ (mean of all vectors in group i), where $|G_i|$ is the number of elements in cluster $G_i$
5. Go to step 2

Issues: selection of the location and the number of clusters.
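The k-means loop above can be sketched in plain Python. This is a toy implementation, not the authors' code; it assumes points are equal-length tuples of floats and stops when the centers stop moving rather than on a cost threshold:

```python
import random

def kmeans(data, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(data, k)              # 1. initialize centers (randomly)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in data:                         # 2. assign each point to the closest center
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
            groups[i].append(x)
        # 4. new center = mean of all vectors in its group (keep old center if empty)
        new = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
               for i, g in enumerate(groups)]
        if new == centers:                     # 3. stop once the assignment is stable
            break
        centers = new
    return centers
```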

Cluster Class Exercise
- height
- shirt color

Fuzzy c-means
- Allows points to belong partly to several clusters. Why?

Butterfly Example: Fuzzy Clustering

Formulate the optimization problem:

$\min J(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m \, d^2(x_k, v_i)$

Calculate the membership values:

$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d(x_k, v_i)}{d(x_k, v_j)} \right)^{2/(m-1)}}$

Update the centroids:

$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m x_k}{\sum_{k=1}^{n} (u_{ik})^m}$
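The two update formulas can be sketched in plain Python; a minimal illustration with Euclidean distance (the function names and toy data are mine, and coincident points/centers are not handled):

```python
def fcm_memberships(data, centers, m=2.0):
    # u_ik = 1 / sum_j (d(x_k, v_i) / d(x_k, v_j))^(2/(m-1))
    def d(x, v):
        return sum((a - b) ** 2 for a, b in zip(x, v)) ** 0.5
    U = []
    for vi in centers:
        row = []
        for x in data:
            dik = d(x, vi)
            row.append(1.0 / sum((dik / d(x, vj)) ** (2 / (m - 1))
                                 for vj in centers))
        U.append(row)
    return U

def fcm_centers(data, U, m=2.0):
    # v_i = sum_k u_ik^m x_k / sum_k u_ik^m
    centers = []
    for row in U:
        w = [u ** m for u in row]
        s = sum(w)
        centers.append(tuple(sum(wk * x[p] for wk, x in zip(w, data)) / s
                             for p in range(len(data[0]))))
    return centers
```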

Butterfly Simulation

PowerPoint Butterfly Animation

Many Simulations for Butterfly
- 50 runs from random starting points
- The centroids always settle at the same points

Fuzzy Cluster in 3-D Space

Mountain Clustering
- No need to set the number of clusters a priori
- Simple
- Computationally expensive
- Can be used to determine the number of clusters for c-means

Mountain Clustering: steps
1. Form a grid on the data space; the grid intersections are candidates for cluster centers.
2. Construct a mountain function representing the data density:

   $m(v) = \sum_{i=1}^{N} e^{-\frac{\|v - x_i\|^2}{2\sigma^2}}$

   (each data point contributes to the height)
3. Sequentially destruct the mountain function: make a dent where the highest value is:

   $m_{new}(v) = m(v) - m(c_1) \, e^{-\frac{\|v - c_1\|^2}{2\beta^2}}$

   The subtracted amount is proportional to the height $m(c_1)$ and decreases with the distance between $v$ and $c_1$.
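The two steps can be sketched in plain Python: build the mountain function on a user-supplied grid, then repeatedly pick the highest remaining peak and subtract a scaled Gaussian around it. A minimal sketch under my own choices of grid, σ, and β:

```python
import math

def mountain_clustering(data, grid, n_clusters, sigma=0.5, beta=0.5):
    def sq(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    # mountain function m(v) = sum_i exp(-||v - x_i||^2 / (2 sigma^2))
    m = {v: sum(math.exp(-sq(v, x) / (2 * sigma ** 2)) for x in data)
         for v in grid}
    centers = []
    for _ in range(n_clusters):
        c = max(m, key=m.get)                  # highest remaining peak
        centers.append(c)
        mc = m[c]
        for v in m:                            # destruct: subtract m(c) * Gaussian
            m[v] -= mc * math.exp(-sq(v, c) / (2 * beta ** 2))
    return centers
```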

2-D data for Mountain Clustering
(Figure: the mountain function for three values of σ, and its destruction with β after the first, second, and third dent.)

kNN Algorithm
- Looks for the k nearest neighbors (kNN) to classify data
- Assigns the class based on the majority among the kNN
- Supervised method: needs labeled data for training

Crisp kNN Algorithm
1. Compute the distance from the data point to a labeled sample
2. IF the k nearest neighbors have not been found yet THEN include the labeled sample
3. ELSE IF a labeled sample is closer to the data point than any of the current kNN THEN replace the farthest with the new one
4. Deal with ties
5. Repeat for the next labeled sample

Example
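The crisp procedure amounts to keeping the k nearest labeled samples and taking a majority vote, which can be sketched in plain Python. Tie handling is not specified in the slides; here `Counter.most_common` simply keeps the label encountered first:

```python
from collections import Counter

def knn_classify(samples, labels, x, k=3):
    # distance from the data point x to every labeled sample (squared Euclidean)
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b))
    # indices of the k nearest labeled samples
    nearest = sorted(range(len(samples)), key=lambda i: dist(samples[i], x))[:k]
    # assign the class by majority among the kNN
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```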

Fuzzy kNN
- Assigns class membership
- Computationally simple
- Assigns membership based on the distance to the kNN and their memberships in the classes

Fuzzy kNN Algorithm
1. Compute the distance from the data point to a labeled sample
2. If the k nearest neighbors have not been found yet, then include the labeled sample
3. Else, if a labeled sample is closer to the data point than any of the current kNN, then replace the farthest with the new one
4. Compute the membership (from the inverse distances to the nearest neighbors and their class memberships):

   $u_i(x) = \frac{\sum_{j=1}^{k} u_{ij} \, \|x - x_j\|^{-2/(m-1)}}{\sum_{j=1}^{k} \|x - x_j\|^{-2/(m-1)}}$

5. Repeat for the next labeled sample
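The membership formula can be sketched in plain Python; a minimal illustration assuming the k nearest neighbors have already been found, with `memberships[j][i]` holding $u_{ij}$, neighbor j's membership in class i (that layout is my choice, not from the slides):

```python
def fuzzy_knn_membership(neighbors, memberships, x, m=2.0):
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    # inverse-distance weights ||x - x_j||^(-2/(m-1))
    w = [dist(x, xj) ** (-2.0 / (m - 1)) for xj in neighbors]
    total = sum(w)
    n_classes = len(memberships[0])
    # u_i(x) = weighted average of the neighbors' class memberships
    return [sum(wj * u[i] for wj, u in zip(w, memberships)) / total
            for i in range(n_classes)]
```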

Hierarchical Clustering
- Merge method: start with each x_i as its own cluster; merge the nearest pairs until the number of clusters = 1
- Split method: start with the number of clusters = 1; split until a predefined goal is reached

Classifying Turbine Anomalies with Adaptive Fuzzy Clustering
Objective: track the behavior of engines
- measure system parameters
- trend analysis for change detection
Challenges:
- large amounts of noise
- changing operating conditions; corrections with first-principles models or regression models work only to some extent
- changes of schedules, maintenance, etc., which are not necessarily known to the analyst
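The merge method above (agglomerative clustering) can be sketched in plain Python, stopping at a target number of clusters instead of merging all the way down to one. Single-linkage distance is my assumption; the slides do not specify a linkage:

```python
def agglomerate(points, target=2):
    def d(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    # start with each point as its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target:
        # find the nearest pair of clusters (single linkage: closest members)
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(d(a, b) for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)         # merge the pair
    return clusters
```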

Trending
Monitor data and report observations indicative of abnormal conditions.
Issues:
- the definition of abnormality is not crisp: step changes, different slopes, combinations of events
- trade-off between false positives and false negatives
- trade-off also with time to recognition
- tool performance is not the same for all conditions of interest
- noise in the data will influence the quality of reporting

Trending of Shift Changes
Problem: detect abnormal shifts in process data
- noise masks shifts
- slow drifts are superimposed on the data
Solution: multivariate adaptive fuzzy clustering
- self-learning of the normal drift
- capability to recognize abnormal changes
(Figure: example time series of the de GT, wf, and N parameters.)

Shift Detection Tool
Components:
- Completeness checker and new-data checker
- Normalizer
- Classifier (fuzzy kNN)
- Calculator of statistical properties (performs learning)
- Adaptor (tracks normal wear)
- Persistency checker (vigilance)
- Composite alert score detector (generates the alert)

Cluster Adaptation: tracks system changes

Shift Detection Tool: Example Data
(Figure: raw data and tool output; time series of the de GT, wf, and N parameters; multi-variate view for 75658.)

last slide