Clustering Part 1. CSC 4510/9010: Applied Machine Learning. Dr. Paula Matuszek


CSC 4510/9010: Applied Machine Learning 1
Clustering Part 1
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu, Paula.Matuszek@gmail.com, (610) 647-9789

What is Clustering? 2
Given some instances with data, group the instances such that:
- examples within a group are similar
- examples in different groups are different
These groups are clusters. Clustering is unsupervised learning: the instances do not include a class attribute.

Clustering Example 3
[Scatter plot of unlabeled points falling into visually distinct groups]
Based on: www.cs.utexas.edu/~mooney/cs388/slides/textclustering.ppt

A Different Example 4
How would you group:
- 'The price of crude oil has increased significantly'
- 'Demand for crude oil outstrips supply'
- 'Some people do not like the flavor of olive oil'
- 'The food was very oily'
- 'Crude oil is in short supply'
- 'Oil platforms extract crude oil'
- 'Canola oil is supposed to be healthy'
- 'Iraq has significant oil reserves'
- 'There are different types of cooking oil'

Another (Familiar) Example 5

Introduction: Real Applications, Emerging Applications
http://studentnet.cs.manchester.ac.uk/ugt/comp24111/materials/slides/ 8

In the Real World
A technique demanded by many real-world tasks:
- Bank/Internet security: fraud/spam pattern discovery
- Biology: taxonomy of living things, such as kingdom, phylum, class, order, family, genus and species
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Climate change: understanding earth's climate; finding patterns in atmospheric and ocean data
- Finance: stock clustering analysis to uncover correlations underlying shares
- Image compression/segmentation: grouping coherent pixels
- Information retrieval/organisation: Google search, topic-based news
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: helping marketers discover distinct groups in their customer bases, then using this knowledge to develop targeted marketing programs
- Social network mining: automatic discovery of special-interest groups
http://studentnet.cs.manchester.ac.uk/ugt/comp24111/materials/slides/ 9

Clustering Basics 8
- Collect examples
- Compute similarity among examples according to some metric
- Group examples together such that examples within a cluster are similar and examples in different clusters are different
- Summarize each cluster
- Sometimes -- assign new instances to the most similar cluster

Measures of Similarity 9
In order to do clustering we need some kind of measure of similarity; this is essentially our critic. Each example is a vector of values, which depend on the domain:
- documents: bag of words, linguistic features
- purchases: cost, purchaser data, item data
- census data: most of what is collected
Multiple different measures are available.

Measures of Similarity 10
- Semantic similarity -- but this is hard.
- Similar attribute counts: the number of attributes with the same value. Appropriate for large, sparse vectors, such as bag-of-words.
- More complex vector comparisons: Euclidean distance, cosine similarity.
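As a small sketch (not from the slides), the attribute-count measure can be computed by comparing vectors position by position; the bag-of-words presence vectors and vocabulary here are invented for illustration:

```python
def matching_attributes(x, y):
    """Similarity as the number of attributes with the same value."""
    return sum(1 for a, b in zip(x, y) if a == b)

# Hypothetical bag-of-words presence vectors for two short documents.
doc1 = [1, 0, 1, 1, 0]
doc2 = [1, 1, 1, 0, 0]
print(matching_attributes(doc1, doc2))  # positions 0, 2, and 4 agree -> 3
```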

Euclidean Distance 11
Euclidean distance: the distance between two instances, summed across each feature. Differences are squared, to give more weight to larger differences.
dist(xi, xj) = sqrt((xi1-xj1)^2 + (xi2-xj2)^2 + ... + (xin-xjn)^2)

Euclidean 12
Calculate differences across features:
- Ears: pointy? (0/1)
- Muzzle: how many inches long?
- Tail: how many inches long?
dist(x1, x2) = sqrt((0-1)^2 + (3-1)^2 + (2-4)^2) = sqrt(9) = 3
dist(x1, x3) = sqrt((0-0)^2 + (3-3)^2 + (2-3)^2) = 1
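The formula and this worked example can be checked with a short Python sketch; the feature vectors x1 = (0, 3, 2), x2 = (1, 1, 4), x3 = (0, 3, 3) are read off the differences above:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Features: ears pointy? (0/1), muzzle length, tail length.
x1, x2, x3 = [0, 3, 2], [1, 1, 4], [0, 3, 3]
print(euclidean(x1, x2))  # sqrt(1 + 4 + 4) = 3.0
print(euclidean(x1, x3))  # sqrt(0 + 0 + 1) = 1.0
```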

Cosine Similarity Measurement 13
Cosine similarity measures the similarity between two vectors as the cosine of the angle between them. The cosine is equal to 1 when the angle is 0, and less than 1 for any other angle. As the angle between the vectors shrinks, the cosine approaches 1, meaning the two vectors are getting closer in direction, and the similarity of whatever they represent increases.
Based on: home.iitk.ac.in/~mfelixor/files/non-numeric-clustering-seminar.pp

Cosine Similarity 14
[Plot: points A(3,2), B(1,4), C(3,3) on axes labeled Muzzle and Tail]
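A minimal Python sketch of cosine similarity, applied to the plotted points by treating each point's coordinates as a two-dimensional vector:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: dot(x, y) / (|x| * |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

A, B, C = (3, 2), (1, 4), (3, 3)
print(cosine_similarity(A, C))  # A and C point in nearly the same direction
print(cosine_similarity(A, B))  # A and B are separated by a larger angle
```

A and C come out more similar than A and B, matching the smaller angle between them in the plot.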

Clustering Algorithms 15
- Flat: K-means
- Hierarchical: bottom-up; top-down (not common)
- Probabilistic: Expectation Maximization (E-M)

Partitioning (Flat) Algorithms 16
Partitioning method: construct a partition of n documents into a set of K clusters.
Given: a set of documents and the number K.
Find: a partition into K clusters that optimizes the chosen partitioning criterion.
Globally optimal: exhaustively enumerate all partitions -- usually too expensive. Effective heuristic method: the K-means algorithm.
http://www.csee.umbc.edu/~nicholas/676/mrsslides/lecture17-clustering.ppt

K-Means Clustering 17
The simplest flat (partitioning) method, and widely used. Creates clusters based on centroids; each instance is assigned to the closest centroid. K is given as a parameter. Heuristic and iterative.

K-Means Clustering 18
1. Provide the number of desired clusters, k.
2. Randomly choose k instances as seeds.
3. Form initial clusters based on these seeds.
4. Calculate the centroid of each cluster.
5. Iterate, repeatedly reallocating instances to the closest centroids and calculating the new centroids.
6. Stop when the clustering converges or after a fixed number of iterations.
Based on: www.cs.utexas.edu/~mooney/cs388/slides/textclustering.ppt
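The steps above can be sketched as a minimal pure-Python implementation; the toy 2-D points are invented for illustration:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal k-means: random seeds, then alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: choose k instances as seeds
    for _ in range(iterations):
        # steps 3/5: assign each point to its closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # steps 4/5: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # step 6: stop when converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # two clusters of 3 points each
```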

K-Means Example (K=2) 19
[Diagram: pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged!]
www.cs.utexas.edu/~mooney/cs388/slides/textclustering.ppt

K-Means 20
- There is a tradeoff between having more clusters (better focus within each cluster) and having too many clusters -- overfitting again.
- Results can vary based on random seed selection. Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
- The algorithm is sensitive to outliers: data points that are far from other data points. These could be errors in the data recording, or special data points with very different values.
http://www.csee.umbc.edu/~nicholas/676/mrsslides/lecture17-clustering.ppt

Problem! 21 Poor clusters based on initial seeds https://datasciencelab.wordpress.com/2014/01/15/improved-seeding-for-clustering-with-k-means/
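One common mitigation (not shown on the slide) is to restart k-means from several different random seeds and keep the run with the lowest within-cluster sum of squared distances (inertia); a self-contained sketch, with toy data invented for illustration:

```python
import random

def kmeans_once(points, k, rng, iterations=100):
    """One k-means run from random seeds; returns (inertia, clusters)."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    # inertia: total squared distance of points to their cluster's centroid
    inertia = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[i]))
                  for i, cl in enumerate(clusters) for p in cl)
    return inertia, clusters

def kmeans_restarts(points, k, restarts=10, seed=0):
    """Run k-means several times and keep the lowest-inertia clustering."""
    rng = random.Random(seed)
    return min(kmeans_once(points, k, rng) for _ in range(restarts))

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
inertia, clusters = kmeans_restarts(points, k=2)
print(sorted(len(c) for c in clusters))
```

Libraries such as scikit-learn do this automatically (the `n_init` parameter of its KMeans estimator).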

Strengths of K-Means 22
- Simple: easy to understand and to implement.
- Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are small, k-means is considered a linear algorithm.
- K-means is the most popular clustering algorithm. In practice it performs well, especially on text.
www.cs.uic.edu/~liub/teach/cs583-fall-05/cs583-unsupervised-learning.ppt

K-Means Weaknesses 23
- Must choose K; a poor choice can lead to poor clusters.
- Clusters may differ in size or density.
- All attributes are weighted equally.
- Heuristic, based on initial random seeds; clusters may differ from run to run.