Data Mining Learning from Large Data Sets
|
|
- Terence Shaw
- 5 years ago
- Views:
Transcription
1 Data Mining Learning from Large Data Sets Lecture 8 Clustering large data sets L Andreas Krause
2 Announcements! Homework 4 out tomorrow 2
3 Course organizapon! Retrieval! Given a query, find most similar item in a large data set! Determine relevance of search results! Applica'ons: GoogleGoggles, Shazam,! Supervised learning (ClassificaPon, Regression)! Learn a concept (funcpon mapping queries to labels)! Applica'ons: Spam filtering, predicpng price changes,! Unsupervised learning (Clustering, dimension reducpon)! IdenPfy clusters, common pa]erns ; anomaly detecpon! Applica'ons: Recommender systems, fraud detecpon,! Learning with limited feedback! Learn to oppmize a funcpon that s expensive to evaluate! Applica'ons: Online adverpsing, opt. UI, learning rankings, 3
4 Unsupervised learning! Learning without labels! Typically useful for exploratory data analysis ( find pa]erns ; visualizapon; )! Most common methods:! Clustering (unsupervised classificapon)! Dimension reducpon (unsupervised regression) 4
5 What is clustering?! Given data points, group into clusters such that! Similar points are in the same cluster! Dissimilar points are in different clusters! Points are typically represented either! in (high- dimensional) Euclidean space! in a metric space, given in terms of pairwise distances (Jaccard, cosine, )! Anomaly / outlier detecpon: IdenPficaPon of points that don t fit well in any of the clusters 5
6 Examples of clustering! Cluster! Documents based on the words they contain! Images based on image features! DNA sequences based on edit distance! Products based on which customers bought them! Customers based on their purchase history! Web surfers based on their queries / sites they visit! 6
7 Standard approaches to clustering! Hierarchical clustering! Build a tree (either bo]om- up or top- down), represenpng the distances among the data points! Example: single-, average- linkage agglomerapve clustering! ParPPonal approaches! Define and oppmize a nopon of goodness defined over parppons! Example: Spectral clustering, graph- cut based approaches! Model- based approaches! Maintain cluster models and infer cluster membership (e.g., assign each point to closest center)! Example: k- means, Gaussian mixture models, 7
8 We will! Review standard clustering algorithms! K- means! ProbabilisPc mixture models! Discuss how to scale them to massive data sets and data streams 8
9 Clustering example 9
10 k- means! Assumes points are in Euclidean space! Represent clusters as centers! Each point is assigned to closest center x i 2 R d µ j 2 R d Goal: Pick centers to minimize average squared distance NX i=1! Non- convex oppmizapon! min j µ j x i 2 2! NP- hard è can t solve oppmally in general 10
11 Classical k- means algorithm! IniPalize cluster centers! E.g., pick one point at random, the other ones with maximum distance! While not converged! Assign each point x i to closest center! Update center as mean of assigned data points 11
12 K- means 12
13 13
14 14
15 15
16 16
17 17
18 18
19 ProperPes of k- means! Guaranteed to monotonically decrease average squared distance in each iterapon NX L(µ) = min µ j x i 2 2 j i=1! Converges to a local oppmum! Complexity:! Per iterapon! Have to process enpre data set in each iterapon 19
20 K- means for large data sets / streams! What if data set does not fit in main memory?! In principle not a problem (why?)! But each iterapon spll requires an enpre pass through the data set! Recall supervised learning (online SVM, etc.)! There we were able to process one data point at a Pme! Get (provably) good solupons from a single pass through the data! Could even do it in parallel!! Can we do the same thing for clustering?? 20
21 Streaming clustering! How should me maintain clusters as new data arrives? 21
22 Recall online SVM! Recall Online SVMs (& stochaspc gradient descent)! Loss funcpon decomposes addi'vely over data set L(w) = X i hinge(x i ; y i, w)! Can take a (sub- )gradient step for each data point 22
23 Online k- means! For k- means, loss funcpon also decomposes addipvely over data set L(µ) = NX i=1 min j µ j x i 2 2! Let s try take a (sub- )gradient step for each data point 23
24 CalculaPng the gradient L(µ) = NX i=1 min j µ j x i
25 Online k- means algorithm! IniPalize centers randomly! For t = 1:N! Find! Set c = arg min j µ j x t 2 µ c µ c + t (x t µ c ) X! To converge to local oppmum, need that t t = 1 X t 2 t < 1 25
26 PracPcal aspects! Generally works best if data is «randomly» ordered (like stochaspc gradient descent)! Typically, want to choose larger value for k! How can one implement mulpple random restarts in one pass? 26
27 Problems with online k- means! Have to commit to k in advance! ObjecPve funcpon non- convex (and problem NP- hard) à guarantees for online convex programming / SGD do not apply!! Not clear how to parallelize 27
28 AlternaPve: Summarizing large data sets! Idea:! Efficiently construct a compact version C of the data set D such that solving k- means on C gives a good solupon to D! Approach:! First construct C such that it allows approximately answer k- means queries NX i.e., approximately evaluate L(µ) = i=1 min j µ j x i 2 2! Then solve k- means using the approximate loss funcpon 28
29 k- mean queries 29
30 30
31 31
32 Data set summarizapon for k- means 32
33 Data set summarizapon for k- means! Key idea: Replace many points by one weighted representapve L k (µ; C) = X (w,x)2c w min j2{1,...,k} µ j x
34 Coresets L k (µ; C) = X (w,x)2c w min j2{1,...,k} µ j x 2 2 C is called a (k,ε)- coreset for data set D, if (1 ")L k (µ; D) apple L k (µ; C) apple (1 + ")L k (µ; D) 34
35 ConstrucPng coresets! Suppose for all pairs of points:! First a]empt: Random sampling x i x j 2 apple 1! Pick n << N points C uniformly at random from D E[L k (µ; C)] = L k (µ; D)! Hoeffding s inequality gives Pr L k (µ; C) L k (µ; D) " 1 1 O log " 2 apple 2exp! Thus need points to ensure 2" 2 n absolute error at most ε with probability at least 1- δ 35
36 ConstrucPng coresets! Suppose for all pairs of points:! First a]empt: Random sampling x i x j 2 apple 1! Pick n << N points C uniformly at random from D! Assign uniform weights N/n 36
37 Can we get small relapve error?! To ensure low mulpplicapve error, need more complex construcpon! è will use non- uniform sampling! 37
38 Sampling DistribuPon Bias sampling towards small clusters Sampling distribupon
39 Importance Weights Sampling distribupon Weights
40 CreaPng a Sampling DistribuPon Itera@vely find representa@ve points 40
41 CreaPng a Sampling DistribuPon Itera@vely find representa@ve points Sample a small set uniformly at random 41
42 CreaPng a Sampling DistribuPon Itera@vely find representa@ve points Sample a small set uniformly at random Remove half the blue points nearest the samples 42
43 CreaPng a Sampling DistribuPon Itera@vely find representa@ve points Sample a small set uniformly at random Remove half the blue points nearest the samples 43
44 CreaPng a Sampling DistribuPon Itera@vely find representa@ve points Sample a small set uniformly at random Remove half the blue points nearest the samples 44
45 CreaPng a Sampling DistribuPon Itera@vely find representa@ve points Sample a small set uniformly at random Remove half the blue points nearest the samples 45
46 CreaPng a Sampling DistribuPon Itera@vely find representa@ve points Sample a small set uniformly at random Remove half the blue points nearest the samples 46
47 CreaPng a Sampling DistribuPon Itera@vely find representa@ve points Sample a small set uniformly at random Remove half the blue points nearest the samples 47
48 CreaPng a Sampling DistribuPon Small clusters are represented Itera@vely find representa@ve points Sample a small set uniformly at random Remove half the blue points nearest the samples 48
49 CreaPng a Sampling DistribuPon ParPPon data via a Voronoi diagram centered at points 49
50 CreaPng a Sampling DistribuPon Points in sparse cells get more mass and points far from centers Sampling distribupon 50
51 Importance Weights Points in sparse cells get more mass and points far from centers Sampling distribupon Weights 51
52 Non- uniform sample 52
53 Coresets via AdapPve Sampling C is (k,ε)- coreset of size polynomial in d,k,log n, 1/ε, 1/δ 53
54 Can do be]er: Coresets for k- means! Theorem [Har- Peled and Kushal, 05] One can find efficiently a (k,ε)- coreset for k- means of size k O 3 /" d+1! Theorem [Feldman et al 07] One can efficiently find a weak (k,ε)- coreset of size O poly(k, 1/")! Allows PTAS for k- means! 54
55 Coresets exist for! K- means, K- median! K- line means / median! PageRank! SVMs! Diameter of a point set! Matrix low- rank approximapon! 55
56 ComposiPon of Coresets [c.f. Har- Peled, Mazumdar 04] Merge 56
57 ComposiPon of Coresets [Har- Peled, Mazumdar 04] Merge Compress 57
58 Coresets on Streams [Har- Peled, Mazumdar 04] Merge Compress 58
59 Coresets on Streams [Har- Peled, Mazumdar 04] Merge Compress 59
60 Coresets on Streams [Har- Peled, Mazumdar 04] Merge Compress Error grows linearly with number of compressions 60
61 Coresets on Streams Error grows with height of tree
62 Coresets in Parallel 62
63 k- means clustering with coresets! Given data set D, desired number of clusters k, precision ε! Construct (k, ε) - coreset C! E.g., in parallel using MapReduce! Solve k- means on coreset! If coreset small, can even do exhauspve search!! In pracpce, run k- means with many restarts! ResulPng solupon will be (1+ε)- oppmal for D! è Provably near- oppmal solupon! 63
64 Summary so far! Clustering is a central problem in unsupervised learning! Two main classes of approaches! Hierarchical (difficult to scale)! Assignment based! Discussed k- means algorithm! Widely used clustering algorithm! Non- linear versions available! Can scale to large data sets using online oppmizapon and coreset construcpons 64
65 Acknowledgments! The slides are partly based on material By Chris Bishop, Andrew Moore and Danny Feldman 65
Introduc)on to Machine Learning: Clustering Algorithms How to Analyze Your Own Genome Fall 2013
Introduc)on to Machine Learning: Clustering Algorithms 02-223 How to Analyze Your Own Genome Fall 203 Organizing data into clusters such that there is high intra- cluster similarity low inter- cluster
More informationCOMP 551 Applied Machine Learning Lecture 13: Unsupervised learning
COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551
More informationCoresets for k-means clustering
Melanie Schmidt, TU Dortmund Resource-aware Machine Learning - International Summer School 02.10.2014 Coresets and k-means Coresets and k-means 75 KB Coresets and k-means 75 KB 55 KB Coresets and k-means
More informationMachine Learning and Data Mining. Clustering (1): Basics. Kalev Kask
Machine Learning and Data Mining Clustering (1): Basics Kalev Kask Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand patterns of
More informationToday s lecture. Clustering and unsupervised learning. Hierarchical clustering. K-means, K-medoids, VQ
Clustering CS498 Today s lecture Clustering and unsupervised learning Hierarchical clustering K-means, K-medoids, VQ Unsupervised learning Supervised learning Use labeled data to do something smart What
More informationClustering. CS294 Practical Machine Learning Junming Yin 10/09/06
Clustering CS294 Practical Machine Learning Junming Yin 10/09/06 Outline Introduction Unsupervised learning What is clustering? Application Dissimilarity (similarity) of objects Clustering algorithm K-means,
More informationClustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April April 2015
Clustering Subhransu Maji CMPSCI 689: Machine Learning 2 April 2015 7 April 2015 So far in the course Supervised learning: learning with a teacher You had training data which was (feature, label) pairs
More informationClustering. So far in the course. Clustering. Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. dist(x, y) = x y 2 2
So far in the course Clustering Subhransu Maji : Machine Learning 2 April 2015 7 April 2015 Supervised learning: learning with a teacher You had training data which was (feature, label) pairs and the goal
More informationClustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani
Clustering CE-717: Machine Learning Sharif University of Technology Spring 2016 Soleymani Outline Clustering Definition Clustering main approaches Partitional (flat) Hierarchical Clustering validation
More informationCSC 411: Lecture 12: Clustering
CSC 411: Lecture 12: Clustering Raquel Urtasun & Rich Zemel University of Toronto Oct 22, 2015 Urtasun & Zemel (UofT) CSC 411: 12-Clustering Oct 22, 2015 1 / 18 Today Unsupervised learning Clustering -means
More informationProblem 1: Complexity of Update Rules for Logistic Regression
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 11/05/2018 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More informationData Clustering. Danushka Bollegala
Data Clustering Danushka Bollegala Outline Why cluster data? Clustering as unsupervised learning Clustering algorithms k-means, k-medoids agglomerative clustering Brown s clustering Spectral clustering
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 03/02/2016 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would
More informationSupervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationOlmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.
Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 9: Data Mining (4/4) March 9, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides
More informationClustering. Unsupervised Learning
Clustering. Unsupervised Learning Maria-Florina Balcan 04/06/2015 Reading: Chapter 14.3: Hastie, Tibshirani, Friedman. Additional resources: Center Based Clustering: A Foundational Perspective. Awasthi,
More informationWhat to come. There will be a few more topics we will cover on supervised learning
Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression
More informationInstance-based Learning
Instance-based Learning Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 19 th, 2007 2005-2007 Carlos Guestrin 1 Why not just use Linear Regression? 2005-2007 Carlos Guestrin
More information1 Case study of SVM (Rob)
DRAFT a final version will be posted shortly COS 424: Interacting with Data Lecturer: Rob Schapire and David Blei Lecture # 8 Scribe: Indraneel Mukherjee March 1, 2007 In the previous lecture we saw how
More informationGene Clustering & Classification
BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering
More informationToday. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time
Today Lecture 4: We examine clustering in a little more detail; we went over it a somewhat quickly last time The CAD data will return and give us an opportunity to work with curves (!) We then examine
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Unsupervised Learning: Clustering Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com (Some material
More informationPerceptron as a graph
Neural Networks Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 10 th, 2007 2005-2007 Carlos Guestrin 1 Perceptron as a graph 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0-6 -4-2
More informationHard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering
An unsupervised machine learning problem Grouping a set of objects in such a way that objects in the same group (a cluster) are more similar (in some sense or another) to each other than to those in other
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.
More informationA Dendrogram. Bioinformatics (Lec 17)
A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu
More informationCPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016
CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2016 A2/Midterm: Admin Grades/solutions will be posted after class. Assignment 4: Posted, due November 14. Extra office hours:
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster
More informationMachine Learning and Data Mining. Clustering. (adapted from) Prof. Alexander Ihler
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Overview What is clustering and its applications? Distance between two clusters. Hierarchical Agglomerative clustering.
More informationCS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample
CS 1675 Introduction to Machine Learning Lecture 18 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem:
More informationMachine Learning. Unsupervised Learning. Manfred Huber
Machine Learning Unsupervised Learning Manfred Huber 2015 1 Unsupervised Learning In supervised learning the training data provides desired target output for learning In unsupervised learning the training
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,
More informationBBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler
BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for
More informationK-Means and Gaussian Mixture Models
K-Means and Gaussian Mixture Models David Rosenberg New York University June 15, 2015 David Rosenberg (New York University) DS-GA 1003 June 15, 2015 1 / 43 K-Means Clustering Example: Old Faithful Geyser
More informationInstance-Based Learning: Nearest neighbor and kernel regression and classificiation
Instance-Based Learning: Nearest neighbor and kernel regression and classificiation Emily Fox University of Washington February 3, 2017 Simplest approach: Nearest neighbor regression 1 Fit locally to each
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning
More informationDS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li
Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand
More informationECLT 5810 Clustering
ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping
More informationIntroduction to Machine Learning. Xiaojin Zhu
Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
More informationUnderstanding Clustering Supervising the unsupervised
Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data
More informationCSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning
CSE 40171: Artificial Intelligence Learning from Data: Unsupervised Learning 32 Homework #6 has been released. It is due at 11:59PM on 11/7. 33 CSE Seminar: 11/1 Amy Reibman Purdue University 3:30pm DBART
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #14: Clustering Seoul National University 1 In This Lecture Learn the motivation, applications, and goal of clustering Understand the basic methods of clustering (bottom-up
More informationPower to the points: Local certificates for clustering
Power to the points: Local certificates for clustering Suresh Venkatasubramanian University of Utah Joint work with Parasaran Raman Data Mining Pipeline Modelling Data Cleaning Exploration Modelling Decision
More information10601 Machine Learning. Hierarchical clustering. Reading: Bishop: 9-9.2
161 Machine Learning Hierarchical clustering Reading: Bishop: 9-9.2 Second half: Overview Clustering - Hierarchical, semi-supervised learning Graphical models - Bayesian networks, HMMs, Reasoning under
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1
More informationDistributed Databases. Distributed Database Systems. CISC437/637, Lecture #21 Ben Cartere?e
Distributed Databases CISC437/637, Lecture #21 Ben Cartere?e Copyright Ben Cartere?e 1 Distributed Database Systems A distributed database consists of loosely coupled sites with no shared physical components
More informationSTATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010
STATS306B Jonathan Taylor Department of Statistics Stanford University June 3, 2010 Spring 2010 Outline K-means, K-medoids, EM algorithm choosing number of clusters: Gap test hierarchical clustering spectral
More informationClustering CS 550: Machine Learning
Clustering CS 550: Machine Learning This slide set mainly uses the slides given in the following links: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf http://www-users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
More informationClustering in R d. Clustering. Widely-used clustering methods. The k-means optimization problem CSE 250B
Clustering in R d Clustering CSE 250B Two common uses of clustering: Vector quantization Find a finite set of representatives that provides good coverage of a complex, possibly infinite, high-dimensional
More informationA Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis
A Weighted Majority Voting based on Normalized Mutual Information for Cluster Analysis Meshal Shutaywi and Nezamoddin N. Kachouie Department of Mathematical Sciences, Florida Institute of Technology Abstract
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationCS 6140: Machine Learning Spring 2017
CS 6140: Machine Learning Spring 2017 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Logis@cs Grades
More informationApproximation Algorithms for Clustering Uncertain Data
Approximation Algorithms for Clustering Uncertain Data Graham Cormode AT&T Labs - Research graham@research.att.com Andrew McGregor UCSD / MSR / UMass Amherst andrewm@ucsd.edu Introduction Many applications
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems
More informationCS 61C: Great Ideas in Computer Architecture Lecture 15: Caches, Part 2
CS 61C: Great Ideas in Computer Architecture Lecture 15: Caches, Part 2 Instructor: Sagar Karandikar sagark@eecsberkeleyedu hbp://insteecsberkeleyedu/~cs61c 1 So/ware Parallel Requests Assigned to computer
More informationClustering web search results
Clustering K-means Machine Learning CSE546 Emily Fox University of Washington November 4, 2013 1 Clustering images Set of Images [Goldberger et al.] 2 1 Clustering web search results 3 Some Data 4 2 K-means
More information10701 Machine Learning. Clustering
171 Machine Learning Clustering What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among
More informationNearest Neighbor with KD Trees
Case Study 2: Document Retrieval Finding Similar Documents Using Nearest Neighbors Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Emily Fox January 22 nd, 2013 1 Nearest
More informationInstance-based Learning
Instance-based Learning Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 15 th, 2007 2005-2007 Carlos Guestrin 1 1-Nearest Neighbor Four things make a memory based learner:
More informationUnsupervised Learning. Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi
Unsupervised Learning Presenter: Anil Sharma, PhD Scholar, IIIT-Delhi Content Motivation Introduction Applications Types of clustering Clustering criterion functions Distance functions Normalization Which
More informationMining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams
Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 2
Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical
More informationClustering Lecture 5: Mixture Model
Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics
More informationData Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Descriptive model A descriptive model presents the main features of the data
More informationUnsupervised Data Mining: Clustering. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Unsupervised Data Mining: Clustering Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 1. Supervised Data Mining Classification Regression Outlier detection
More informationMachine learning - HT Clustering
Machine learning - HT 2016 10. Clustering Varun Kanade University of Oxford March 4, 2016 Announcements Practical Next Week - No submission Final Exam: Pick up on Monday Material covered next week is not
More informationNearest Neighbor Predictors
Nearest Neighbor Predictors September 2, 2018 Perhaps the simplest machine learning prediction method, from a conceptual point of view, and perhaps also the most unusual, is the nearest-neighbor method,
More informationCPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2017
CPSC 340: Machine Learning and Data Mining Principal Component Analysis Fall 2017 Assignment 3: 2 late days to hand in tonight. Admin Assignment 4: Due Friday of next week. Last Time: MAP Estimation MAP
More informationUnsupervised Learning I: K-Means Clustering
Unsupervised Learning I: K-Means Clustering Reading: Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp. 487-515, 532-541, 546-552 (http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)
More informationLecture-17: Clustering with K-Means (Contd: DT + Random Forest)
Lecture-17: Clustering with K-Means (Contd: DT + Random Forest) Medha Vidyotma April 24, 2018 1 Contd. Random Forest For Example, if there are 50 scholars who take the measurement of the length of the
More informationClustering Results. Result List Example. Clustering Results. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Presenting Results Clustering Clustering Results! Result lists often contain documents related to different aspects of the query topic! Clustering is used to
More informationUnsupervised Learning: Clustering
Unsupervised Learning: Clustering Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer Machine Learning Supervised Learning Unsupervised Learning
More informationPart I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a
Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering
More informationClustering. Supervised vs. Unsupervised Learning
Clustering Supervised vs. Unsupervised Learning So far we have assumed that the training samples used to design the classifier were labeled by their class membership (supervised learning) We assume now
More informationGradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz
Gradient Descent Wed Sept 20th, 2017 James McInenrey Adapted from slides by Francisco J. R. Ruiz Housekeeping A few clarifications of and adjustments to the course schedule: No more breaks at the midpoint
More informationMay 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch
May 1, CODY, Error Backpropagation, Bischop 5.3, and Support Vector Machines (SVM) Bishop Ch 7. May 3, Class HW SVM, PCA, and K-means, Bishop Ch 12.1, 9.1 May 8, CODY Machine Learning for finding oil,
More informationCPSC 340: Machine Learning and Data Mining. Kernel Trick Fall 2017
CPSC 340: Machine Learning and Data Mining Kernel Trick Fall 2017 Admin Assignment 3: Due Friday. Midterm: Can view your exam during instructor office hours or after class this week. Digression: the other
More informationSGN (4 cr) Chapter 11
SGN-41006 (4 cr) Chapter 11 Clustering Jussi Tohka & Jari Niemi Department of Signal Processing Tampere University of Technology February 25, 2014 J. Tohka & J. Niemi (TUT-SGN) SGN-41006 (4 cr) Chapter
More informationData Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140
Data Mining CS 5140 / CS 6140 Jeff M. Phillips January 7, 2019 What is Data Mining? What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational
More informationCS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample
Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups
More informationJarek Szlichta
Jarek Szlichta http://data.science.uoit.ca/ Approximate terminology, though there is some overlap: Data(base) operations Executing specific operations or queries over data Data mining Looking for patterns
More informationClustering: K-means and Kernel K-means
Clustering: K-means and Kernel K-means Piyush Rai Machine Learning (CS771A) Aug 31, 2016 Machine Learning (CS771A) Clustering: K-means and Kernel K-means 1 Clustering Usually an unsupervised learning problem
More informationUnsupervised Learning. Pantelis P. Analytis. Introduction. Finding structure in graphs. Clustering analysis. Dimensionality reduction.
March 19, 2018 1 / 40 1 2 3 4 2 / 40 What s unsupervised learning? Most of the data available on the internet do not have labels. How can we make sense of it? 3 / 40 4 / 40 5 / 40 Organizing the web First
More informationk-means, k-means++ Barna Saha March 8, 2016
k-means, k-means++ Barna Saha March 8, 2016 K-Means: The Most Popular Clustering Algorithm k-means clustering problem is one of the oldest and most important problem. K-Means: The Most Popular Clustering
More informationUnsupervised Learning
Networks for Pattern Recognition, 2014 Networks for Single Linkage K-Means Soft DBSCAN PCA Networks for Kohonen Maps Linear Vector Quantization Networks for Problems/Approaches in Machine Learning Supervised
More informationData Preprocessing. Javier Béjar. URL - Spring 2018 CS - MAI 1/78 BY: $\
Data Preprocessing Javier Béjar BY: $\ URL - Spring 2018 C CS - MAI 1/78 Introduction Data representation Unstructured datasets: Examples described by a flat set of attributes: attribute-value matrix Structured
More informationThe exam is closed book, closed notes except your one-page (two-sided) cheat sheet.
CS 189 Spring 2015 Introduction to Machine Learning Final You have 2 hours 50 minutes for the exam. The exam is closed book, closed notes except your one-page (two-sided) cheat sheet. No calculators or
More informationChapter 4: Non-Parametric Techniques
Chapter 4: Non-Parametric Techniques Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Supervised Learning How to fit a density
More informationPreface to the Second Edition. Preface to the First Edition. 1 Introduction 1
Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches
More informationConvex Optimization MLSS 2015
Convex Optimization MLSS 2015 Constantine Caramanis The University of Texas at Austin The Optimization Problem minimize : f (x) subject to : x X. The Optimization Problem minimize : f (x) subject to :
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationMachine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.1 Cluster Analysis Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim,
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams
More informationLecture 2 The k-means clustering problem
CSE 29: Unsupervised learning Spring 2008 Lecture 2 The -means clustering problem 2. The -means cost function Last time we saw the -center problem, in which the input is a set S of data points and the
More informationCS570: Introduction to Data Mining
CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.
More informationIntroduction to Computer Science
DM534 Introduction to Computer Science Clustering and Feature Spaces Richard Roettger: About Me Computer Science (Technical University of Munich and thesis at the ICSI at the University of California at
More information