COSC 6339 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2017.


Clustering

Clustering is a technique for finding similarity groups in data, called clusters. It groups data instances that are similar to (near) each other into one cluster, and data instances that are very different (far away) from each other into different clusters.

Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given.

K-means algorithm

Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).

Stopping/convergence criterion:
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),

$$SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2$$

where $C_j$ is the j-th cluster, $m_j$ is the centroid of cluster $C_j$ (the mean vector of all the data points in $C_j$), and $dist(x, m_j)$ is the distance between data point $x$ and centroid $m_j$.
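The four steps and the SSE can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the `seed` and `max_iter` parameters, the choice of Euclidean distance, the rule of keeping the old centroid when a cluster becomes empty, and the sample data are all assumptions made for the example. It stops on criterion 1 (no re-assignments).

```python
import math
import random


def dist(x, m):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m)))


def mean(points):
    """Component-wise mean vector of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(clusters, centroids):
    """Sum of squared distances of every point to its cluster centroid."""
    return sum(dist(x, m) ** 2
               for c, m in zip(clusters, centroids) for x in c)


def kmeans(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)            # 1) random seeds
    assignment = None
    for _ in range(max_iter):
        # 2) assign each point to its closest centroid
        new_assignment = [min(range(k), key=lambda j: dist(x, centroids[j]))
                          for x in data]
        if new_assignment == assignment:       # 4) no re-assignments: stop
            break
        assignment = new_assignment
        # 3) re-compute centroids from the current memberships
        clusters = [[x for x, a in zip(data, assignment) if a == j]
                    for j in range(k)]
        centroids = [mean(c) if c else centroids[j]   # empty cluster: keep old
                     for j, c in enumerate(clusters)]
    clusters = [[x for x, a in zip(data, assignment) if a == j]
                for j in range(k)]
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(data, k=2)
print(sorted(centroids), sse(clusters, centroids))
```

Each iteration touches every point once per centroid, which is where the O(tkn) cost comes from.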

Strengths of k-means

Simple: easy to understand and to implement.
Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are usually small compared to n, k-means is considered a linear algorithm.
K-means is the most popular clustering algorithm.
Note that it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.

Weaknesses of k-means

The algorithm is only applicable if the mean is defined. For categorical data, use k-modes, where the centroid is represented by the most frequent values.
The user needs to specify k.
The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points. They could be errors in the data recording or some special data points with very different values.

Weaknesses of k-means: problems with outliers

[Figure: problems with outliers]

Weaknesses of k-means: outliers

One method is to remove, during the clustering process, data points that are much further away from the centroids than other data points. To be safe, we may want to monitor these possible outliers over a few iterations and then decide whether to remove them.
Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small. The rest of the data points are then assigned to the clusters by distance or similarity comparison, or by classification.
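The first method (dropping points that are much further from the centroid than the others) could look roughly like the sketch below. The slides do not say how "much further" is measured, so the rule used here (distance greater than `factor` times the median distance in the cluster, with `factor=3.0`) is purely an assumption for illustration, as are the function name and the sample data.

```python
import math
import statistics


def flag_outliers(cluster, centroid, factor=3.0):
    """Split a cluster into (kept, flagged) by distance to the centroid.

    A point is flagged when its distance exceeds `factor` times the median
    distance in the cluster. Both the median rule and factor=3.0 are
    illustrative assumptions, not part of the slides.
    """
    d = [math.dist(x, centroid) for x in cluster]
    cutoff = factor * statistics.median(d)
    kept = [x for x, di in zip(cluster, d) if di <= cutoff]
    flagged = [x for x, di in zip(cluster, d) if di > cutoff]
    return kept, flagged


# Hypothetical cluster: four nearby points and one far-away point.
cluster = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (9.0, 9.0)]
centroid = tuple(sum(c) / len(cluster) for c in zip(*cluster))
kept, flagged = flag_outliers(cluster, centroid)
print(centroid, flagged)
```

In this example the single distant point drags the mean away from the other four points, which is the sensitivity the slides describe. In practice the flagged points would be monitored over a few iterations before being removed.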

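Weaknesses of k-means (cont.)

The algorithm is sensitive to initial seeds.
[Figure: clustering result for one choice of initial seeds]

Weaknesses of k-means (cont.)

If we use different seeds: good results.
[Figure: clustering result for a different choice of initial seeds]
There are some methods to help choose good seeds.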

Weaknesses of k-means (cont.)

The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
[Figure: clusters that are not hyper-ellipsoids]

Weaknesses of k-means (cont.)

The membership of a point in a single cluster is not always clear.
-> Fuzzy clustering can help with that.

Boolean Logic

In Boolean logic, an object either is a member of a set or is not, i.e. the membership function can be expressed as

$$\mu_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases}$$

In Boolean logic, $\mu_{A \cap \sim A}(x) = 0$ and $\mu_{A \cup \sim A}(x) = 1$, i.e. $A \cap \sim A = \emptyset$ and $A \cup \sim A = U$.
A set is a collection of objects sharing a common property.
A Boolean set is also referred to as a crisp set.

Fuzzy Logic

Logic based on continuous variables.
Provides the ability to represent intrinsic ambiguity.
Fuzzification: the process of finding the membership value of a (scalar) number in a fuzzy set.
Defuzzification: the process of converting the outcome of a fuzzy set to a single representative number.
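To make the contrast concrete, the sketch below puts a crisp membership test next to one possible fuzzification and defuzzification. The slides do not prescribe a membership shape or a defuzzification method, so the linear ramp and the membership-weighted average used here are assumptions, and all names and numbers are hypothetical.

```python
def crisp(x, threshold):
    """Boolean set membership: x is either in the set (1) or not (0)."""
    return 1 if x >= threshold else 0


def ramp(x, low, high):
    """Fuzzification: degree in [0, 1], rising linearly from low to high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def defuzzify(values, degrees):
    """Defuzzification by the membership-weighted average of the values."""
    return sum(v * d for v, d in zip(values, degrees)) / sum(degrees)


values = [10, 20, 30, 40]
degrees = [ramp(v, low=10, high=30) for v in values]

print(crisp(20, threshold=25))     # crisp answer for the value 20
print(ramp(20, low=10, high=30))   # fuzzy degree for the same value
print(defuzzify(values, degrees))  # single representative number
```

The crisp test forces a yes/no answer for the value 20, while the ramp reports how strongly it belongs to the set.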

Fuzzy Sets

The membership function can take values other than just 0 and 1:
  0 indicates no membership
  1 indicates complete set membership
  values in (0, 1) indicate partial membership
Superset of Boolean logic.

A fuzzy set has three principal components:
  Degree of membership
  Possible domain values
  Membership function: a continuous function that connects a domain value to its degree of membership in the set

Fuzzy Numbers

Fuzzy number: a fuzzy set representing an approximation to a number.
[Figure: a fuzzy number with its support set and domain marked; vertical axis: grade of membership m(x)]

Fuzzy number "About 20"

[Figure: fuzzy number "About 20" with its expectancy marked; horizontal axis: 14 to 26; vertical axis: grade of membership m(x)]
Expectancy e: degree of spread
e = 0: normal scalar value

Other fuzzy sets

[Figure: fuzzy set of tall men; horizontal axis: height in ft, 4.5 to 7.5; vertical axis: grade of membership m(x)]
[Figure: fuzzy set for long project; horizontal axis: project duration in weeks, 4 to 16; vertical axis: grade of membership m(x)]
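A fuzzy number such as "about 20" can be sketched as a membership function with a center and an expectancy e. The triangular shape below is an assumption (the original figure may use a different curve), and the function name `about` is hypothetical. With e = 0 the fuzzy number collapses to an ordinary scalar.

```python
def about(x, center, e):
    """Membership of x in the fuzzy number 'about center'.

    e is the expectancy (degree of spread). A triangular shape is assumed
    here for illustration only.
    """
    if e == 0:
        return 1.0 if x == center else 0.0   # ordinary scalar value
    return max(0.0, 1.0 - abs(x - center) / e)


for x in (14, 17, 20, 23, 26):
    print(x, about(x, center=20, e=6))
```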

Collection of Fuzzy Sets

[Figure: overlapping fuzzy sets Child, Teen, Young adult, Middle aged, Senior; horizontal axis: client age in years, 10 to 70; vertical axis: grade of membership m(x)]
Each underlying fuzzy set defines a portion of the variable's domain.
A portion is not necessarily uniquely defined.

Hedges: Fuzzy set transformers

A hedge acts on a fuzzy set the same way an adjective acts on a noun. Hedges can:
  Increase or decrease the expectancy of a fuzzy number
  Intensify or dilute the membership of a fuzzy set
  Change the shape of a fuzzy set through contrast or restriction

Hedge           Mathematical expression
A little        $[\mu_A(x)]^{1.3}$
Slightly        $[\mu_A(x)]^{1.7}$
Very            $[\mu_A(x)]^{2}$
Extremely       $[\mu_A(x)]^{3}$
Very very       $[\mu_A(x)]^{4}$
More or less    $\sqrt{\mu_A(x)}$
Somewhat        $\sqrt{\mu_A(x)}$
Indeed          $2[\mu_A(x)]^{2}$ if $0 \le \mu_A \le 0.5$; $1 - 2[1 - \mu_A(x)]^{2}$ if $0.5 < \mu_A \le 1$

[Graphical representation column: plots not reproduced]
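The hedge table translates almost directly into code. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the square root for "more or less" and "somewhat" is an assumption, since the root symbol is missing from the extracted text of the slide.

```python
import math

# Each hedge maps a membership degree mu in [0, 1] to a new degree in [0, 1].
HEDGES = {
    "a little":     lambda mu: mu ** 1.3,
    "slightly":     lambda mu: mu ** 1.7,
    "very":         lambda mu: mu ** 2,
    "extremely":    lambda mu: mu ** 3,
    "very very":    lambda mu: mu ** 4,
    "more or less": math.sqrt,   # assumed: square root (dilution)
    "somewhat":     math.sqrt,   # assumed: square root (dilution)
    "indeed":       lambda mu: (2 * mu ** 2 if mu <= 0.5
                                else 1 - 2 * (1 - mu) ** 2),
}


def apply_hedge(name, mu):
    """Transform a membership degree mu with the named hedge."""
    return HEDGES[name](mu)


mu_tall = 0.5   # hypothetical degree of membership in "tall"
for name in HEDGES:
    print(name, apply_hedge(name, mu_tall))
```

Exponents above 1 intensify the set (degrees below 1 shrink), the square root dilutes it (degrees grow), and "indeed" increases contrast around 0.5.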

Alpha Cut Threshold

An alpha cut threshold defines a minimum truth membership level for a fuzzy set.
[Figure: fuzzy set for long project with alpha cut µ[0.15]; horizontal axis: project duration in weeks, 4 to 16; vertical axis: grade of membership m(x)]

Fuzzy AND Operator

[Figure: fuzzy sets Young adult and Middle aged; horizontal axis: client age in years, 10 to 70; vertical axis: grade of membership m(x)]
Example: region produced by the proposition "Young Adult AND Middle Aged".
Mathematical representation:

$$\mu_T(x_i) = \min(\mu_A(x_i), \mu_B(x_i))$$
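An alpha cut can be sketched as a simple filter over sampled membership values. The membership degrees below are made up for illustration (the slide only shows a curve), the function name is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
def alpha_cut(memberships, alpha):
    """Keep only the domain values whose membership reaches alpha."""
    return {x: mu for x, mu in memberships.items() if mu >= alpha}


# Hypothetical sampled memberships for "long project" (duration in weeks).
long_project = {4: 0.0, 6: 0.05, 8: 0.2, 10: 0.5, 12: 0.8, 14: 0.95, 16: 1.0}

print(alpha_cut(long_project, alpha=0.15))
```

Durations whose degree of membership falls below the threshold are treated as not belonging to the set at all.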

Fuzzy OR Operator

[Figure: fuzzy sets Young adult and Middle aged; horizontal axis: client age in years, 10 to 70; vertical axis: grade of membership m(x)]
Example: region produced by the proposition "Young Adult OR Middle Aged".
Mathematical representation:

$$\mu_T(x_i) = \max(\mu_A(x_i), \mu_B(x_i))$$

Fuzzy NOT Operator

[Figure: fuzzy set Middle aged; horizontal axis: client age in years, 10 to 70; vertical axis: grade of membership m(x)]
Example: region produced by the proposition "NOT Middle Aged".
Mathematical representation:

$$\mu_T(x_i) = 1 - \mu_A(x_i)$$
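The three operators (AND as min, OR as max, NOT as complement) can be tried out on the age example. The trapezoidal membership functions and their breakpoints below are hypothetical, chosen only so that "young adult" and "middle aged" overlap between 30 and 40 years; the slides show curves without giving numbers.

```python
def fuzzy_and(mu_a, mu_b):
    return min(mu_a, mu_b)


def fuzzy_or(mu_a, mu_b):
    return max(mu_a, mu_b)


def fuzzy_not(mu_a):
    return 1.0 - mu_a


def trapezoid(x, a, b, c, d):
    """Membership rising on [a, b], equal to 1 on [b, c], falling on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical breakpoints in years (assumed, not taken from the slides).
def young_adult(age):
    return trapezoid(age, 20, 25, 30, 40)


def middle_aged(age):
    return trapezoid(age, 30, 40, 50, 60)


age = 33
ya, ma = young_adult(age), middle_aged(age)
print("young adult:", ya, "middle aged:", ma)
print("AND:", fuzzy_and(ya, ma))
print("OR: ", fuzzy_or(ya, ma))
print("NOT middle aged:", fuzzy_not(ma))
```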

Fuzzy Clustering: Motivation

Crisp clustering allows each data point to be a member of exactly one cluster.
Fuzzy clustering assigns each data point a membership value for each cluster. The value might be zero for some points.

Fuzzy Clustering Concepts

Each data point has an associated degree of membership for each cluster center, in the range [0, 1].

Fuzzy clustering concepts

Fuzzification parameter m:
  m = 1: clusters do not overlap
  m > 1: clusters overlap

Fuzzy c-means clustering

Extension of the k-means algorithm. Two steps:
  Calculation of cluster centers
  Assignment of points to the clusters with varying degrees of membership

Constraint on the fuzzy membership function associated with each point:

$$\sum_{j=1}^{p} \mu_j(x_i) = 1, \quad i = 1, \dots, k$$

p: number of clusters
k: number of data points (written n on the slides that follow)
$x_i$: i-th data point
$\mu_j()$: function returning the membership value of $x_i$ in the j-th cluster

Fuzzy c-means clustering

Minimization of the standard loss function

$$\sum_{k=1}^{p} \sum_{i=1}^{n} [\mu_k(x_i)]^m \, \|x_i - c_k\|^2$$

Basic algorithm

Initialize
  p = number of clusters
  m = fuzzification parameter
  c_j = cluster centers
Repeat
  for all data points: calculate distance d_ji to all centers c_j
  for i = 1 to n: update µ_j(x_i) using c_j
  for j = 1 to p: update c_j using current µ_j(x_i)
Until c_j estimates stabilize

Fuzzy c-means clustering

With

$$\mu_j(x_i) = \frac{\left(\frac{1}{d_{ji}}\right)^{\frac{1}{m-1}}}{\sum_{k=1}^{p} \left(\frac{1}{d_{ki}}\right)^{\frac{1}{m-1}}}$$

$d_{ji}$ being the distance of $x_i$ to cluster center $c_j$ (e.g. Euclidean distance), and

$$c_j = \frac{\sum_{i} [\mu_j(x_i)]^m \, x_i}{\sum_{i} [\mu_j(x_i)]^m}$$
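The basic algorithm and the two update formulas can be sketched in plain Python as below. This is a minimal illustration under several assumptions: the initial centers are passed in by the caller, $d_{ji}$ is the plain Euclidean distance as suggested on the slide, a point lying exactly on a center gets full membership there, and the names `fuzzy_c_means`, `tol` and `max_iter` are hypothetical. Many textbook formulations apply the same exponent to the squared distance instead, which would be a one-line change in `memberships`.

```python
import math


def memberships(data, centers, m):
    """mu[j][i]: membership of data[i] in cluster j (sums to 1 over j)."""
    power = 1.0 / (m - 1.0)
    mu = [[0.0] * len(data) for _ in centers]
    for i, x in enumerate(data):
        d = [math.dist(x, c) for c in centers]
        if 0.0 in d:
            # Assumption: a point sitting on a center belongs fully to it.
            mu[d.index(0.0)][i] = 1.0
            continue
        w = [(1.0 / dj) ** power for dj in d]
        total = sum(w)
        for j, wj in enumerate(w):
            mu[j][i] = wj / total
    return mu


def update_centers(data, mu, m):
    """c_j = sum_i mu_j(x_i)^m x_i / sum_i mu_j(x_i)^m."""
    dim = len(data[0])
    centers = []
    for row in mu:
        w = [u ** m for u in row]
        total = sum(w)
        centers.append(tuple(
            sum(wi * x[t] for wi, x in zip(w, data)) / total
            for t in range(dim)))
    return centers


def fuzzy_c_means(data, centers, m=2.0, tol=1e-6, max_iter=100):
    if m <= 1.0:
        raise ValueError("this sketch needs m > 1")
    for _ in range(max_iter):
        mu = memberships(data, centers, m)
        new_centers = update_centers(data, mu, m)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:            # center estimates have stabilized
            break
    return centers, memberships(data, centers, m)


# Hypothetical data: two groups of three points, p = 2 initial centers.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, mu = fuzzy_c_means(data, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
print([round(u, 3) for u in mu[0]])   # memberships of all points in cluster 0
```

Unlike k-means, every point keeps a nonzero degree of membership in both clusters, and for each point the degrees sum to 1.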

Fuzzy c-means clustering

Problem with c-means clustering: outlier data points still have to be assigned to a cluster.

Fuzzy Adaptive Clustering

Alternative formulation of the constraint on membership:

$$\sum_{j=1}^{p} \sum_{i=1}^{n} \mu_j(x_i) = n$$

The membership values of all sample points together sum to n.
An individual point can have a total membership value of less than 1.

=>

$$\mu_j(x_i) = \frac{n \left(\frac{1}{d_{ji}}\right)^{\frac{1}{m-1}}}{\sum_{k=1}^{p} \sum_{z=1}^{n} \left(\frac{1}{d_{kz}}\right)^{\frac{1}{m-1}}}$$
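The relaxed constraint can be checked numerically with a small sketch of the adaptive membership formula. The function name, the fixed centers, the Euclidean distance and the sample data (four points near the centers plus one distant outlier) are assumptions for illustration; the code also assumes that no point lies exactly on a center.

```python
import math


def fac_memberships(data, centers, m=2.0):
    """mu[j][i] normalized so that all memberships together sum to n."""
    power = 1.0 / (m - 1.0)
    n = len(data)
    # w[j][i] = (1 / d_ji)^(1/(m-1)); assumes d_ji > 0 for every pair
    w = [[(1.0 / math.dist(x, c)) ** power for x in data] for c in centers]
    total = sum(sum(row) for row in w)
    return [[n * wji / total for wji in row] for row in w]


centers = [(0.0, 0.0), (10.0, 0.0)]
data = [(0.0, 1.0), (1.0, 0.0), (10.0, 1.0), (9.0, 0.0), (5.0, 50.0)]
mu = fac_memberships(data, centers)

grand_total = sum(sum(row) for row in mu)
per_point = [sum(mu[j][i] for j in range(len(centers)))
             for i in range(len(data))]
print(grand_total)   # sums to n over all clusters and points
print(per_point)     # the distant last point gets a small total membership
```

Because the normalization runs over all points at once, the distant point receives a total membership well below 1. Note also that nothing in this constraint caps an individual value at 1, so points close to a center can end up with a total above 1.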