Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection
1 Scalable PDEs p.1/107. Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection. Dan Pelleg. Committee: Andrew Moore (chair), Manuela Veloso, Geoff Gordon, Nir Friedman (the Hebrew University).
2 Clustering Large Data Sets Quickly. [Scatter plot of vehicle data; axes: weight, length; legend: axles.]
3 Sloan Digital Sky Survey, as an example of data collection, storage, and sharing. Goal: map, in detail, one-quarter of the entire sky. 5 years to complete. 200 million objects in catalog. 25 TB raw data, 5 TB catalog data. Access over the web.
4 SkyServer. Supported activities on the SDSS SkyServer: browse; learn; search by coordinates; send SQL queries; APIs for direct integration.
5 Advancing SkyServers. Make it easier to ask the right question. Make it easier to understand the answer.
6 Requirements from next-generation data analysis tools: fast; comprehensible output; turn-key.
7 Focus on clustering. Very general, with lots of applications. In particular, mixture-model-based clustering.
8 Talk outline: K-means and X-means (fast spatial clustering); mixture of rectangles (a highly legible model); anomaly hunting (sub-linear component learner, active learner, user interface).
9 K-means. [Figure: data points in the plane.]
10 K-means. During the K-means algorithm, we maintain a set of centroids.
11 K-means. In every iteration, each data point is associated with its closest centroid.
12 K-means. At the end of an iteration, we move each centroid to the center of mass of all points associated with it.
13-15 K-means. [Animation frames: centroids move to the centers of mass of their assigned points over successive iterations.]
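The loop on the slides above can be sketched as plain Lloyd's algorithm. This is a minimal NumPy version for illustration; the random initialization and the `kmeans` name are my own, not the thesis code:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest
    centroid, then move each centroid to the center of mass of the
    points associated with it."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    owner = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # distance from every point to every centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        owner = d.argmin(axis=1)
        new = np.array([points[owner == j].mean(axis=0) if np.any(owner == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, owner
```

Each iteration does exactly the two slide steps: assignment to the closest centroid, then the center-of-mass update.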
16 Cost of K-means. Cost per iteration: #records × #centroids distance computations.
17-23 A kd-tree. [Animation frames: a point set is recursively split by axis-aligned cuts into a binary tree of bounding rectangles.]
24 A kd-tree. A binary tree to store data points. Each node stores statistics about all points contained in it. Not the only structure meeting these conditions.
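A minimal sketch of such a structure, assuming the cached per-node statistics are the point count and the vector sum (enough for the center-of-mass trick used later); the class and field names are illustrative, not from the original code:

```python
import numpy as np

class KDNode:
    """Axis-aligned kd-tree node caching summary statistics
    (point count and vector sum) for all points below it."""
    def __init__(self, points, depth=0, leaf_size=8):
        self.count = len(points)
        self.vector_sum = points.sum(axis=0)
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)  # bounding box
        self.left = self.right = None
        if len(points) > leaf_size:
            axis = depth % points.shape[1]      # cycle through dimensions
            order = points[:, axis].argsort()
            mid = len(points) // 2              # split at the median
            self.left = KDNode(points[order[:mid]], depth + 1, leaf_size)
            self.right = KDNode(points[order[mid:]], depth + 1, leaf_size)
```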
25 K-means. [Figure: centroids overlaid on the kd-tree decomposition.]
26 Center-of-mass calculation. Suppose Q is the set of all points that belong to some centroid C. The new position of C is: C ← (1/|Q|) Σ_{x∈Q} x. Let {Q_p} be a partition of Q. Then we can write the new position as: C ← (1/|Q|) Σ_p Σ_{x∈Q_p} x. This helps if the sums of each Q_p are known. They are known for kd-nodes.
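Given nodes that cache (count, vector-sum) pairs, the partitioned center-of-mass formula is one line. A sketch with a hypothetical `parts` list of per-node statistics:

```python
import numpy as np

def centroid_from_parts(parts):
    """New centroid position from a partition {Q_p} of the owned points,
    using only each part's cached (count, vector_sum); no individual
    point is touched. `parts` is a list of (count, sum_vector) pairs."""
    total = sum(c for c, _ in parts)
    return sum(s for _, s in parts) / total
```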
27 K-means. [Figure: centroid ownership of kd-nodes.]
28-30 A kd-node owned by a centroid. The boundary line between centroids G and R does not intersect the rectangle H. The point in H which is closest to R is on the same side of the boundary as G. Scanning every point in the node is not needed.
31-35 A kd-node not owned by a centroid. The boundary line between centroids G and R does intersect the rectangle H. The point in H which is closest to G is not on the same side of the boundary as R. We try our luck with the child rectangles.
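One standard way to implement the owned/not-owned test is the bisector check sketched below: a box is entirely owned by centroid g against rival r if even the box corner most favorable to r is still closer to g. This is a common formulation of the pruning test, not necessarily the exact code from the thesis:

```python
import numpy as np

def g_dominates(lo, hi, g, r):
    """True if every point of the box [lo, hi] is at least as close to
    centroid g as to rival r. Since (r - g) . x is linear in x, it
    suffices to test the box corner extreme in the direction r - g,
    i.e. the corner most favorable to r."""
    p = np.where(r > g, hi, lo)
    return np.linalg.norm(p - g) <= np.linalg.norm(p - r)
```

If the test passes for every rival, the whole node can be credited to g using its cached statistics; otherwise we recurse into the children, as on the slides.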
36 Run time. [Plot: time vs. number of points, 2-D data (gpetro), for 5, 50, and 500 clusters.]
37 K-means: summary. Popular and trusted statistical method. Very fast algorithm; exact, not an approximation. Not restricted to kd-trees. Still requires K from the user.
38 X-means. [Title slide.]
39 X-means. The number of clusters K is not always known in advance. Estimate it from data: measure the goodness of fit, and penalize complex models. Do this on a local scale.
40 Local Splits. Start with a small value for K. Run K-means to convergence.
41 This defines regions of points which belong to a specific class.
42 In each region, run 2-means independently.
43 [Figure: each region split into two tentative children.]
44 For each region, compute the contribution of splitting the class in two: BIC(k=1)=2471 vs. BIC(k=2)=3088; BIC(k=1)=2018 vs. BIC(k=2)=1859; BIC(k=1)=1935 vs. BIC(k=2)=1784.
45 Commit the split only if the score goes up.
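A simplified sketch of the split decision, using a single spherical Gaussian per region and the standard BIC penalty; the actual X-means scoring function differs in detail:

```python
import numpy as np

def gauss_loglik(points):
    """Log-likelihood of points under one spherical Gaussian (MLE fit)."""
    n, d = points.shape
    var = max(((points - points.mean(axis=0)) ** 2).sum() / (n * d), 1e-12)
    return -0.5 * n * d * (np.log(2 * np.pi * var) + 1)

def should_split(left, right):
    """Commit a local 2-means split only if the two-component BIC
    beats the one-component BIC (stand-in for the X-means score)."""
    parent = np.vstack([left, right])
    n, d = parent.shape
    bic1 = gauss_loglik(parent) - 0.5 * (d + 1) * np.log(n)
    loglik2 = (gauss_loglik(left) + gauss_loglik(right)
               + len(left) * np.log(len(left) / n)     # mixing weights
               + len(right) * np.log(len(right) / n))
    bic2 = loglik2 - 0.5 * (2 * (d + 1) + 1) * np.log(n)
    return bic2 > bic1
```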
46 X-means: summary. Can accurately estimate the K in K-means. Naturally fits in the fast K-means framework. In a single step, chooses between 2^K options. Better, and faster, than looping over K.
47 K-means and X-means package. Code released in late 2000. Over 200 licenses granted. Users in: bioinformatics; music information retrieval; computer hardware and software analysis; many more. X-means scoring function independently analyzed and improved (Hamerly et al. 2003).
48-49 K-means and X-means users. [Figures: user affiliations.]
50 Mixtures of Rectangles.
51 Gaussian clusters. Domain: credit card approval. Take the following vector: [(AGE − 18)², (taxrate − 6)², (income − 10000)², (edunum − 8)²] and compute its dot product with the element-wise reciprocals 1/[4.9, 0.3, 730, 209]. If the result is small enough, approve.
52 My approach. If 18 ≤ AGE ≤ 46 and 5 ≤ taxrate ≤ 7, then approve.
53 2-D PDF. [Figure: a two-dimensional density modeled as a mixture of rectangles.]
54 Mixture of Dependency Trees.
55 Motivation. Given a data set, we want to understand it. A Bayes net fits the bill, but is expensive to find. Compromise: look for a simpler structure, a dependency tree.
56 [Figure: an example network over the variables Burglar, Thunder, Barking, Phone Call, Alarm.] Conditional densities are linear Gaussians: P(A | B) = N(c_A + m_A·b, σ_A²).
57-60 The Chow-Liu algorithm. [Figure: a data matrix with attributes A_1 ... A_M and records X_1 ... X_R.] Compute the mutual information I(i; j) between every pair of attributes (e.g. I(1; 4), I(1; 5), I(1; 3), I(4; 3)), weight the complete graph over the attributes with these values, and take its MST. Total cost: O(RM²) + cost of the MST algorithm.
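A small self-contained sketch of the algorithm for discrete data: fill in the O(RM²) pairwise mutual-information weights, then take the maximum-weight spanning tree (here with Kruskal and union-find; the function names are my own):

```python
import numpy as np
from itertools import combinations

def mutual_info(x, y):
    """Empirical mutual information between two discrete columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu(data):
    """Chow-Liu dependency tree: maximum-weight spanning tree of the
    complete attribute graph, weighted by pairwise mutual information."""
    m = data.shape[1]
    edges = sorted(((mutual_info(data[:, i], data[:, j]), i, j)
                    for i, j in combinations(range(m), 2)), reverse=True)
    parent = list(range(m))            # union-find forest
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for w, i, j in edges:              # Kruskal, heaviest MI first
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree
```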
61 MST: using the blue-edge rule. Given a cut, the lightest edge across it must be part of the MST.
62 MST: using the red-edge rule. Given a cycle, the heaviest edge in it must not be part of the MST.
63 Idea: repeatedly use the red-edge rule. Stop when all we have left is a tree; this tree must be the MST. Tarjan: bad idea.
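The repeated red-edge rule corresponds to the reverse-delete algorithm: scan edges heaviest-first and delete any edge that lies on a cycle. A small sketch, with edges as (weight, u, v) triples and a naive connectivity check (so this is the slow version the slide warns about):

```python
def mst_red_rule(n, edges):
    """Reverse-delete: visit edges heaviest-first; an edge whose removal
    keeps the graph connected lies on a cycle, so the red-edge rule
    deletes it. What remains is the MST."""
    kept = set(range(len(edges)))
    order = sorted(kept, key=lambda k: edges[k][0], reverse=True)
    def connected(active):
        adj = {v: [] for v in range(n)}
        for k in active:
            _, u, v = edges[k]
            adj[u].append(v)
            adj[v].append(u)
        seen, stack = {0}, [0]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return len(seen) == n
    for k in order:
        if connected(kept - {k}):
            kept.discard(k)            # heaviest edge on some cycle
    return [edges[k] for k in kept]
```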
64-80 Walkthrough. [Animation frames over a weighted graph: edges are marked as tree edges or non-tree edges; for each non-tree edge we ask "Can I eliminate this edge?" and, when the red-edge rule applies, mark it as an eliminated edge.]
81 Saving Work. We want to avoid scanning the full data set for a given edge. Scan just a sample, and derive a confidence interval using the CLT or Hoeffding bounds. Now we need to deal with intervals instead of point estimates.
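A sketch of the Hoeffding-bound interval for the mean of a bounded quantity estimated from n samples (parameter names are my own):

```python
import math

def hoeffding_interval(sample_mean, n, value_range, delta=0.05):
    """Hoeffding bound: with probability >= 1 - delta, the true mean of
    a quantity bounded in an interval of width `value_range` lies within
    +/- eps of the sample mean, eps = range * sqrt(ln(2/delta) / (2n)).
    No distributional assumptions, unlike the CLT interval."""
    eps = value_range * math.sqrt(math.log(2 / delta) / (2 * n))
    return sample_mean - eps, sample_mean + eps
```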
82-84 Comparing intervals. We compare two interval estimates [a, b] and [c, d]. Case 1: [a, b] lies entirely below [c, d]. If this happens, we save work.
85-87 Case 2: [c, d] lies entirely below [a, b]. Another lucky occurrence.
88-92 Case 3: the intervals overlap (a ≤ c ≤ b ≤ d). We have two options: work harder, or procrastinate.
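The three cases reduce to a tiny decision function over the two intervals (the return labels are my own):

```python
def compare_intervals(ab, cd):
    """Decide which of two interval-valued estimates is smaller.
    Returns 'first', 'second', or 'overlap' (meaning: work harder by
    sampling more, or procrastinate and defer the edge)."""
    a, b = ab
    c, d = cd
    if b <= c:
        return 'first'    # [a, b] entirely below [c, d]: decided
    if d <= a:
        return 'second'   # the symmetric lucky case
    return 'overlap'      # cannot decide yet
```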
93 So far we assumed that we can always eliminate an edge in the cycle. In fact, this is not necessary.
94-100 Walkthrough, alternative scenario. [Animation frames.] Not enough information to eliminate; leave the edge for later. Later, we examine this edge again: the tree path has changed, and now we can eliminate!
101-104 Experimental Results. How much work does it save? [Plot: cells scanned per edge vs. number of records, up into the millions.] Most of it.
105-107 Experimental Results. Does it scale with the number of attributes? [Plot: running time vs. number of attributes.] Yes!
108 Experimental Results. How good are the generated trees?
109-112 Evaluation. Compared methods: the exhaustive algorithm; my algorithm; a 35% subsample; an informed subsample.
113-115 Experimental Results. How good are the generated trees? [Plot: relative log-likelihood vs. number of records.] Better than those obtained by uniformly using the same fraction of data.
116-118 Experimental Results. Does it work for real data?

NAME             TYPE  DATA USAGE
CENSUS-HOUSE     N     1.0%
COLORHISTOGRAM   N     0.5%
COOCTEXTURE      N     4.6%
ABALONE          N     21.0%
COLORMOMENTS     N     0.6%
CENSUS-INCOME    C     0.05%
COIL             C     0.9%
IPUMS            C     0.06%
KDDCUP           C     0.02%
LETTER           N     1.5%
COVTYPE          C     0.009%
PHOTOZ           N     0.008%

Better 7/12 times, worse 4/12, one tie.
119 Anomaly Hunting.
120 Anomaly Hunting. We want to sift a large data set for its strangest objects. First attempt: build a statistical model from the data, and flag whatever does not fit it well.
121 Boring Anomalies. [Figure: examples of uninteresting flagged objects.]
122-130 The Oracle Framework. [Animation: the loop builds up step by step.] Take a random set of records; ask an expert to classify them; build a model from the data and labels; run all the data through the model; spot "important" records; hand those back to the expert, and repeat.
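The loop can be sketched generically, with the expert, the model fitter, and the importance score as plug-in callables. All names and interfaces here are hypothetical, not the thesis code:

```python
import random

def oracle_loop(records, oracle, fit, score, rounds=3, batch=5, seed=0):
    """Sketch of the oracle framework: label a random seed set, fit a
    model, score all records, then send the most 'important' (here:
    highest-scoring unlabeled) records back to the expert."""
    rng = random.Random(seed)
    labeled = {i: oracle(records[i])
               for i in rng.sample(range(len(records)), batch)}
    model = None
    for _ in range(rounds):
        model = fit([(records[i], y) for i, y in labeled.items()])
        unlabeled = [i for i in range(len(records)) if i not in labeled]
        unlabeled.sort(key=lambda i: score(model, records[i]), reverse=True)
        for i in unlabeled[:batch]:          # spot "important" records...
            labeled[i] = oracle(records[i])  # ...and ask the expert
    return model, labeled
```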
131 Anomaly Hunting. [Live demo: run the GUI.]
132 Interesting Anomalies. [Figure: examples of genuinely interesting flagged objects.]
133 Contributions. Fast K-means implementation [KDD99]. Extension to X-means [ICML00]. Widely used and cited [HPL124, NIPS03, ICME01, ASPL02, IEEE01]. Novel mixture model for comprehensibility [ICML01]. Probably-approximately-correct approach for dependency trees [NIPS02]. Active learning framework for general mixtures. User-centered anomaly hunting process [GLC03].
135 Why scientific? Assumptions on the data: mostly real-valued; not sparse; no or very few labels.
136 Thesis Statement. We can efficiently perform clustering on very large data sets.
More informationData Mining Techniques for Massive Spatial Databases. Daniel B. Neill Andrew Moore Ting Liu
Data Mining Techniques for Massive Spatial Databases Daniel B. Neill Andrew Moore Ting Liu What is data mining? Finding relevant patterns in data Datasets are often huge and highdimensional, e.g. astrophysical
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationSpyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems
Spyglass: Fast, Scalable Metadata Search for Large-Scale Storage Systems Andrew W Leung Ethan L Miller University of California, Santa Cruz Minglong Shao Timothy Bisson Shankar Pasupathy NetApp 7th USENIX
More informationPASCAL. A Parallel Algorithmic SCALable Framework for N-body Problems. Laleh Aghababaie Beni, Aparna Chandramowlishwaran. Euro-Par 2017.
PASCAL A Parallel Algorithmic SCALable Framework for N-body Problems Laleh Aghababaie Beni, Aparna Chandramowlishwaran Euro-Par 2017 Outline Introduction PASCAL Framework Space Partitioning Trees Tree
More informationParallel Physically Based Path-tracing and Shading Part 3 of 2. CIS565 Fall 2012 University of Pennsylvania by Yining Karl Li
Parallel Physically Based Path-tracing and Shading Part 3 of 2 CIS565 Fall 202 University of Pennsylvania by Yining Karl Li Jim Scott 2009 Spatial cceleration Structures: KD-Trees *Some portions of these
More informationNearest Neighbors Classifiers
Nearest Neighbors Classifiers Raúl Rojas Freie Universität Berlin July 2014 In pattern recognition we want to analyze data sets of many different types (pictures, vectors of health symptoms, audio streams,
More informationRecommender Systems New Approaches with Netflix Dataset
Recommender Systems New Approaches with Netflix Dataset Robert Bell Yehuda Koren AT&T Labs ICDM 2007 Presented by Matt Rodriguez Outline Overview of Recommender System Approaches which are Content based
More informationIntroduction to Machine Learning. Xiaojin Zhu
Introduction to Machine Learning Xiaojin Zhu jerryzhu@cs.wisc.edu Read Chapter 1 of this book: Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi- Supervised Learning. http://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford
More informationCS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample
Lecture 9 Clustering Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square Clustering Groups together similar instances in the data sample Basic clustering problem: distribute data into k different groups
More informationDensity estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate
Density estimation In density estimation problems, we are given a random sample from an unknown density Our objective is to estimate? Applications Classification If we estimate the density for each class,
More informationChapter 5: Outlier Detection
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 5: Outlier Detection Lecture: Prof. Dr.
More informationINF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22
INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task
More informationIntroduction to Pattern Recognition Part II. Selim Aksoy Bilkent University Department of Computer Engineering
Introduction to Pattern Recognition Part II Selim Aksoy Bilkent University Department of Computer Engineering saksoy@cs.bilkent.edu.tr RETINA Pattern Recognition Tutorial, Summer 2005 Overview Statistical
More informationClustering Lecture 3: Hierarchical Methods
Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced
More informationOracle9i Data Mining. Data Sheet August 2002
Oracle9i Data Mining Data Sheet August 2002 Oracle9i Data Mining enables companies to build integrated business intelligence applications. Using data mining functionality embedded in the Oracle9i Database,
More informationChapter 4: Text Clustering
4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can
More informationImage Segmentation. Shengnan Wang
Image Segmentation Shengnan Wang shengnan@cs.wisc.edu Contents I. Introduction to Segmentation II. Mean Shift Theory 1. What is Mean Shift? 2. Density Estimation Methods 3. Deriving the Mean Shift 4. Mean
More informationCLUSTERING. JELENA JOVANOVIĆ Web:
CLUSTERING JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is clustering? Application domains K-Means clustering Understanding it through an example The K-Means algorithm
More informationColorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.
Professor William Hoff Dept of Electrical Engineering &Computer Science http://inside.mines.edu/~whoff/ 1 Image Segmentation Some material for these slides comes from https://www.csd.uwo.ca/courses/cs4487a/
More informationk-means demo Administrative Machine learning: Unsupervised learning" Assignment 5 out
Machine learning: Unsupervised learning" David Kauchak cs Spring 0 adapted from: http://www.stanford.edu/class/cs76/handouts/lecture7-clustering.ppt http://www.youtube.com/watch?v=or_-y-eilqo Administrative
More informationSpatial Data Management
Spatial Data Management [R&G] Chapter 28 CS432 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite imagery, where each pixel stores a measured value
More informationMachine Learning (BSMC-GA 4439) Wenke Liu
Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Unsupervised Learning: Kmeans, GMM, EM Readings: Barber 20.1-20.3 Stefan Lee Virginia Tech Tasks Supervised Learning x Classification y Discrete x Regression
More informationHierarchical Clustering
Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree-like diagram that records the sequences of merges
More informationClustering. Shishir K. Shah
Clustering Shishir K. Shah Acknowledgement: Notes by Profs. M. Pollefeys, R. Jin, B. Liu, Y. Ukrainitz, B. Sarel, D. Forsyth, M. Shah, K. Grauman, and S. K. Shah Clustering l Clustering is a technique
More informationMachine Learning for Signal Processing Clustering. Bhiksha Raj Class Oct 2016
Machine Learning for Signal Processing Clustering Bhiksha Raj Class 11. 13 Oct 2016 1 Statistical Modelling and Latent Structure Much of statistical modelling attempts to identify latent structure in the
More informationCS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 10: Learning with Partially Observed Data Theo Rekatsinas 1 Partially Observed GMs Speech recognition 2 Partially Observed GMs Evolution 3 Partially Observed
More informationLecture 12 Recognition. Davide Scaramuzza
Lecture 12 Recognition Davide Scaramuzza Oral exam dates UZH January 19-20 ETH 30.01 to 9.02 2017 (schedule handled by ETH) Exam location Davide Scaramuzza s office: Andreasstrasse 15, 2.10, 8050 Zurich
More informationSpatial Data Management
Spatial Data Management Chapter 28 Database management Systems, 3ed, R. Ramakrishnan and J. Gehrke 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite
More informationGaussian Mixture Models For Clustering Data. Soft Clustering and the EM Algorithm
Gaussian Mixture Models For Clustering Data Soft Clustering and the EM Algorithm K-Means Clustering Input: Observations: xx ii R dd ii {1,., NN} Number of Clusters: kk Output: Cluster Assignments. Cluster
More information