Data Warehousing and Machine Learning

1 Data Warehousing and Machine Learning Preprocessing Thomas D. Nielsen Aalborg University Department of Computer Science Spring 2008 DWML Spring / 35

2 Preprocessing
Before you can start on the actual data mining, the data may require some preprocessing:
- Attributes may be redundant.
- Values may be missing.
- The data may contain outliers.
- The data may not be in a suitable format.
- The values may appear inconsistent.
Garbage in, garbage out.

3 Preprocessing Data Cleaning
ID Zip Gender Income Age Marital status Transaction amount
M C M J2S7K7 F W ?? S M S F D 3000
Issues flagged in the table:
- Correct zip code?
- Missing value!
- Error/outlier!
- Error!
- Unexpected precision.
- Categorical value?
- Error/missing value?
Other issues:
- What are the semantics of the marital status?
- What is the unit of measure for the transaction field?

14 Preprocessing Missing Values
In many real-world databases you will be faced with the problem of missing data (missing values are shown as "?"):

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    ?        ?       25               Bad
4    Medium   Medium  ?                Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low      ?       25               Bad
8    Medium   Medium  75               Good

By simply discarding the records with missing data we might unintentionally bias the data.

15 Preprocessing Missing Values
Possible strategies for handling missing data:
- Use a predefined constant.
- Use the mean (for numerical variables) or the mode (for categorical variables).
- Use a value drawn randomly from the observed distribution.

Applied to the example table:
- Savings (record 3): Low. Both Low and Medium are modes for Savings.
- Assets (records 3 and 7): High and Medium, drawn randomly from the observed distribution for Assets.
- Income (record 4): 54, the mean of the seven observed incomes (375 / 7 ≈ 53.6), rounded.

Id.  Savings  Assets  Income ($1000s)  Credit Risk
1    Medium   High    75               Good
2    Low      Low     50               Bad
3    Low      High    25               Bad
4    Medium   Medium  54               Good
5    Low      Medium  100              Good
6    High     High    25               Good
7    Low      Medium  25               Bad
8    Medium   Medium  75               Good
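The mean, mode, and random-draw strategies can be sketched in a few lines of Python. This is a minimal illustration on the example table, not a general-purpose imputer; the helper names (`observed`, `impute_mean`, `modes`, `impute_random`) and the use of `None` for a missing cell are assumptions made for the sketch.

```python
import random
from collections import Counter
from statistics import mean

# Columns of the example table; None marks a missing value (an assumption
# of this sketch, the slides only show an empty cell).
savings = ["Medium", "Low", None, "Medium", "Low", "High", "Low", "Medium"]
assets = ["High", "Low", None, "Medium", "Medium", "High", None, "Medium"]
income = [75, 50, 25, None, 100, 25, 25, 75]  # in $1000s


def observed(column):
    """Return the non-missing values of a column."""
    return [v for v in column if v is not None]


def impute_mean(column):
    """Replace missing numeric values by the mean of the observed values."""
    m = mean(observed(column))
    return [m if v is None else v for v in column]


def modes(column):
    """Return all most frequent observed values (ties are possible)."""
    counts = Counter(observed(column))
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def impute_random(column, rng):
    """Replace each missing value by a random draw from the observed values."""
    pool = observed(column)
    return [rng.choice(pool) if v is None else v for v in column]


filled_income = impute_mean(income)
savings_modes = modes(savings)
filled_assets = impute_random(assets, random.Random(0))
```

Because Savings has two modes, a mode-based imputer needs an explicit tie-breaking rule; the sketch only reports the tie and leaves that choice to the caller.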

19 Preprocessing Discretization
Some data mining algorithms can only handle discrete attributes. Possible solution: divide the continuous range into intervals.
Example: (Income, Risk) = (25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)
Unsupervised discretization

Equal width binning (width 25):
Bin 1: 25, 25            [25, 50)
Bin 2: 50, 51, 54        [50, 75)
Bin 3: 75, 75, 100, 100  [75, 100]

Equal frequency binning (bin density 3):
Bin 1: 25, 25, 50        [25, 50.5)
Bin 2: 51, 54, 75, 75    [50.5, 87.5)
Bin 3: 100, 100          [87.5, 100]
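Both unsupervised binning schemes can be reproduced with a short Python sketch. The function names are hypothetical. The tie rule in `equal_frequency_bins` (equal values are never split across bins) and the midpoint rule in `cut_points` are assumptions chosen because they reproduce the bins and boundaries of the example; other conventions exist.

```python
import math

incomes = [25, 25, 50, 51, 54, 75, 75, 100, 100]


def equal_width_bins(values, width):
    """Group values into bins of fixed width, starting at the minimum.

    The last bin is closed on the right so that the maximum is included.
    """
    lo, hi = min(values), max(values)
    n_bins = max(1, math.ceil((hi - lo) / width))
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        idx = min(int((v - lo) // width), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, size):
    """Group sorted values into bins of about `size` items each.

    Assumption: a bin is extended rather than splitting equal values.
    """
    xs = sorted(values)
    bins, start = [], 0
    while start < len(xs):
        end = min(start + size, len(xs))
        while end < len(xs) and xs[end] == xs[end - 1]:
            end += 1  # keep tied values in the same bin
        bins.append(xs[start:end])
        start = end
    return bins


def cut_points(bins):
    """Boundaries placed midway between neighbouring bins."""
    return [(left[-1] + right[0]) / 2 for left, right in zip(bins, bins[1:])]


width_bins = equal_width_bins(incomes, 25)
freq_bins = equal_frequency_bins(incomes, 3)
boundaries = cut_points(freq_bins)
```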

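20 Preprocessing Supervised discretization
Take the class distribution into account when selecting the intervals. For example, recursively bisect the interval by selecting the split point giving the highest information gain:

$$\mathrm{Gain}(S, v) = \mathrm{Ent}(S) - \left[ \frac{|S_{\le v}|}{|S|}\,\mathrm{Ent}(S_{\le v}) + \frac{|S_{>v}|}{|S|}\,\mathrm{Ent}(S_{>v}) \right]$$

until some stopping criterion is met.
(Income, Risk) = (25, B), (25, B), (50, G), (51, B), (54, G), (75, G), (75, G), (100, G), (100, G)

$$\mathrm{Ent}(S) = -\frac{3}{9}\log_2\frac{3}{9} - \frac{6}{9}\log_2\frac{6}{9} \approx 0.918$$

E-Ent is the expected entropy after the split, i.e. the bracketed term in the gain:

Split  E-Ent  Intervals
25     0.460  (-∞, 25], (25, ∞)
50     0.739  (-∞, 50], (50, ∞)
51     0.361  (-∞, 51], (51, ∞)
54     0.539  (-∞, 54], (54, ∞)
75     0.766  (-∞, 75], (75, ∞)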
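One level of the entropy-based bisection can be checked with a small Python sketch. It evaluates every candidate split of the example data and picks the one with the highest gain; the function names are hypothetical, and the recursion and stopping criterion are deliberately left out.

```python
from math import log2

data = [(25, "B"), (25, "B"), (50, "G"), (51, "B"), (54, "G"),
        (75, "G"), (75, "G"), (100, "G"), (100, "G")]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)


def expected_entropy(data, v):
    """Weighted entropy of the two parts produced by splitting at x <= v."""
    left = [c for x, c in data if x <= v]
    right = [c for x, c in data if x > v]
    n = len(data)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)


def gain(data, v):
    """Information gain of splitting the data at value v."""
    return entropy([c for _, c in data]) - expected_entropy(data, v)


# Splitting at the maximum would leave the right part empty, so drop it.
candidates = sorted({x for x, _ in data})[:-1]
best = max(candidates, key=lambda v: gain(data, v))
total_entropy = entropy([c for _, c in data])
```

On this data the split at 51 wins, because everything above 51 belongs to class G and contributes zero entropy.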

21 Preprocessing Data Transformation
Some data mining tools tend to give variables with a large range a higher significance than variables with a smaller range. For example, age versus income.
The typical approach is to standardize the scales:

1. Min-max normalization:
$$X^* = \frac{X - \min(X)}{\max(X) - \min(X)}$$
[Figure: normalized values plotted against original values.]

2. Z-score standardization:
$$X^* = \frac{X - \mathrm{mean}(X)}{\mathrm{SD}(X)}$$
[Figure: standardized values plotted against original values.]
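A minimal Python sketch of the two rescalings follows. The sample `ages` and `incomes` are hypothetical values invented for the illustration, and using the population standard deviation (`pstdev`) for SD(X) is an assumption; the sample standard deviation is an equally common choice.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to [0, 1]; undefined for a constant column."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


def z_score(values):
    """Centre values on the mean and scale by the standard deviation."""
    m, sd = mean(values), pstdev(values)  # pstdev: an assumption, see lead-in
    return [(x - m) / sd for x in values]


# Hypothetical data: two attributes on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20000, 40000, 60000, 80000, 100000]

ages_mm, incomes_mm = min_max(ages), min_max(incomes)
ages_z = z_score(ages)
```

After min-max normalization both attributes cover the same range, so neither dominates a distance computation simply because of its unit.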

24 Preprocessing Outliers
Data: 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8, ...
Summary statistics:
- First quartile (1Q), 25% of the data: 4.
- Second quartile (2Q), 50% of the data: 6.
- Third quartile (3Q), 75% of the data: 7.
- Interquartile range: IQR = 3Q - 1Q = 7 - 4 = 3.
A data point may be an outlier if:
- It is lower than 1Q - 1.5 · IQR = 4 - 4.5 = -0.5.
- It is higher than 3Q + 1.5 · IQR = 7 + 4.5 = 11.5.
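The 1.5 · IQR rule is easy to express in Python. The sketch below takes the quartiles as given in the summary statistics rather than recomputing them, since quartile conventions differ between tools. The function name `iqr_fences` is hypothetical, and the value 20 appended to the data is an invented probe, not part of the original example.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) bounds outside of which a point is suspect."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_fences(4, 7)  # quartiles as stated in the summary

data = [1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8, 8]
probe = data + [20]  # 20 is a hypothetical extra value for illustration
suspects = [x for x in probe if x < lower or x > upper]
```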

26 Data Warehousing and Machine Learning
Clustering: partitional and hierarchical
Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008

27 Clustering Unlabeled Data
The Iris data with class labels removed:
[Table: Iris data with attribute columns SL, SW, PL, PW.]
Unlabeled data in general: (discrete or continuous) attributes, no class variable.

28 Clustering Clustering
A clustering of the data $S = s_1, \ldots, s_N$ consists of a set $C = \{c_1, \ldots, c_k\}$ of cluster labels and a cluster assignment $ca : S \to C$.
[Figure: clustering of the Iris data with C = {blue, red}.]
Note: a clustering partitions the data points, not necessarily the instance space.
When the cluster labels have no particular significance, a clustering can also be identified with the partition $S = S_1 \cup \ldots \cup S_k$, where $S_i = ca^{-1}(c_i)$.

29 Clustering Clustering goal
[Figure: a candidate clustering (indicated by colors) of data cases in instance space. Arrows indicate selected between-cluster and within-cluster distances.]
General goal: find a clustering with large between-cluster variation (sum of between-cluster distances) and small within-cluster variation (sum of within-cluster distances).
The concrete goal varies according to the exact distance definition.

30 Clustering Examples
- Group plants/animals into families or related species, based on morphological features or molecular features.
- Identify types of customers based on attributes in a database (they can then be targeted by special advertising campaigns).
- Web mining: group web pages according to content.

31 Clustering Clustering vs. Classification
The cluster label can be interpreted as a hidden class variable
- that is never observed,
- whose number of states is unknown,
- on which the distribution of attribute values depends.
Clustering is often called unsupervised learning, as opposed to the supervised learning of classifiers: in supervised learning, correct class labels for the training data are provided to the learning algorithm by a supervisor, or teacher.
One key problem in clustering is determining the right number of clusters. Two different approaches:
- Partition-based clustering
- Hierarchical clustering
All clustering methods require a distance measure on the instance space!

32 Clustering Partition-based Clustering
The number k of clusters is fixed (user defined). Partition the data into k clusters.
k-means clustering
Assume that
- there is a distance function $d(s, s')$ defined between data items,
- we can compute the mean value of a collection $\{s_1, \ldots, s_l\}$ of data items.

    Initialize: randomly pick initial cluster centers c = c_1, ..., c_k from S
    repeat
        for i = 1, ..., k
            S_i := {s ∈ S | c_i = arg min_{c ∈ c} d(c, s)}
            c_old,i := c_i
            c_i := mean S_i
            ca(s) := c_i  (s ∈ S_i)
    until c = c_old
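The k-means loop can be sketched in plain Python as follows. The sketch uses the common batch form (assign all points first, then recompute all centers) and squared Euclidean distance; both are assumptions, as are the function names, the `max_iter` cap, and the rule that an empty cluster keeps its old center.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, rng, max_iter=100):
    """Return (centers, clusters) after iterating until the centers are stable."""
    centers = rng.sample(points, k)  # random initial centers taken from the data
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two well separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(points, 2, random.Random(0))
```

On less clearly separated data the result depends on the initial centers, so in practice the procedure is usually run several times.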

33 Clustering Example
[Figure sequence: k-means with k = 3, showing the cluster centers c_1, c_2, c_3 and the clusters S_1, S_2, S_3 over successive iterations.]

43 Clustering Example (cont.)
[Figure: result of clustering the same data with k = 2, showing centers c_1, c_2 and clusters S_1, S_2.]
The result can depend on the choice of initial cluster centers!

44 Clustering Outliers
The result of partitional clustering can be skewed by outliers.
[Figure: example with k = 2.]
Useful preprocessing: outlier detection and elimination (be careful not to eliminate interesting outliers!).

45 Clustering k-means as optimization
With a Euclidean distance function dist we can use the sum of squared errors (SSE) for evaluating a clustering:

$$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, c_i)^2$$

Here $C_i$ is the set of data points assigned to center $c_i$ (written $S_i$ in the algorithm). k-means directly tries to minimize this error:

    Initialize: randomly pick initial cluster centers c = c_1, ..., c_k from S
    repeat
        for i = 1, ..., k
            S_i := {s ∈ S | c_i = arg min_{c ∈ c} d(c, s)}   // minimizes the SSE for the current centers
            c_old,i := c_i
            c_i := mean S_i                                  // the centroid minimizes the SSE for the assigned objects
            ca(s) := c_i  (s ∈ S_i)
    until c = c_old

Only guaranteed to find a local minimum.
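The SSE itself is a two-level sum and can be written directly. The following sketch, with a hypothetical function name and invented toy data, also illustrates why the update step cannot increase the error: moving a center to the mean of its cluster lowers the SSE from 4 to 2 here.

```python
def sse(clusters, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, c))
        for cluster, c in zip(clusters, centers)
        for x in cluster
    )


# Hypothetical toy data: two clusters, the first with an off-centre center.
clusters = [[(0, 0), (2, 0)], [(10, 10)]]
sse_before = sse(clusters, [(0, 0), (10, 10)])      # center sits on a data point
sse_after = sse(clusters, [(1.0, 0.0), (10, 10)])   # center moved to the centroid
```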

47 Hierarchical Clustering Reducing SSE
Choosing initial centroids:
- Perform multiple runs with random initializations.
- Initialize centroids based on results from another algorithm (e.g. hierarchical).
- ...
Postprocessing:
- Split a cluster.
- Disperse a cluster (choose the one that increases the SSE the least).
- Merge two clusters (the two with the closest centroids, or the two whose merge increases the SSE the least).

49 Hierarchical Clustering Hierarchical clustering
The right number of clusters may not only be unknown, it may also be quite ambiguous:
[Figure sequence: the same data grouped at different levels of granularity.]
Hierarchical clustering provides an explicit representation of nested clusterings of different granularity.

53 Hierarchical Clustering Agglomerative hierarchical clustering
Extend the distance function $d(s, s')$ to a distance function $D(S, S')$ between sets of data items. Two out of many possibilities:

$$D_{average}(S, S') := \frac{1}{|S|\,|S'|} \sum_{s \in S,\, s' \in S'} d(s, s')$$

$$D_{min}(S, S') := \min_{s \in S,\, s' \in S'} d(s, s')$$

    for i = 1, ..., N: S_i := {s_i}
    while current partition S_1 ∪ ... ∪ S_k of S contains more than one element
        (i, j) := arg min_{i,j ∈ 1,...,k} D(S_i, S_j)
        form new partition by merging S_i and S_j

When $D_{average}$ is used, this is also called average link clustering; with $D_{min}$, single link clustering.
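The agglomerative loop, with the two set distances as interchangeable linkage functions, can be sketched as follows. The names `d_min`, `d_average`, and `agglomerate` are hypothetical, Euclidean distance is assumed for d, and the one-dimensional sample points are invented. The brute-force search over all cluster pairs is meant for illustration only.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def d_min(a, b):
    """Single link: distance of the closest pair across the two sets."""
    return min(dist(s, t) for s in a for t in b)


def d_average(a, b):
    """Average link: mean distance over all pairs across the two sets."""
    return sum(dist(s, t) for s in a for t in b) / (len(a) * len(b))


def agglomerate(points, linkage):
    """Merge the two closest clusters until one remains.

    Returns the list of merges as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional data, written as 1-tuples.
points = [(0,), (1,), (3,), (7,)]
single_heights = [d for _, _, d in agglomerate(points, d_min)]
average_heights = [d for _, _, d in agglomerate(points, d_average)]
```

The recorded merge distances are the heights at which components are joined in a dendrogram of the result.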

67 Hierarchical Clustering Dendrogram Representation of Hierarchical Clustering
[Figure: dendrogram; the vertical axis shows the distance of the merged components, with the 3-clustering and the 5-clustering marked.]
The length of the distance interval corresponding to a specific clustering can be interpreted as a measure of the significance of this particular clustering.

69 Hierarchical Clustering Single link vs. Average link
[Figure: 4-clustering for single link and average link.]
[Figure: single link 2-clustering.]
[Figure: average link 2-clustering.]
Generally: single link will produce rather elongated, linear clusters; average link produces more convex clusters.

74 Hierarchical Clustering Another Example
[Figure: single link 2-clustering.]
[Figure: average link 2-clustering (or similar).]

77 Data Warehousing and Machine Learning
Self Organizing Maps
Thomas D. Nielsen, Aalborg University, Department of Computer Science, Spring 2008

78 Self Organizing Maps SOMs as Special Neural Networks
[Figure: input layer connected to output layer.]
- Neural network structure without hidden layers
- Output neurons structured as a two-dimensional array
- The connection from the ith input to the jth output has weight $w_{i,j}$
- No activation function for the output nodes

79 Self Organizing Maps Kohonen Learning
Given:
- Unlabeled data $a_1, \ldots, a_N \in \mathbb{R}^n$
- Distance measure $d_n(\cdot, \cdot)$ on $\mathbb{R}^n$
- Distance measure $d_{out}(\cdot, \cdot)$ on output neurons
- Update function $\eta(t, d) : \mathbb{N} \times \mathbb{R} \to \mathbb{R}$, decreasing in t and d

    Initialize weight vectors w_j^(0) for output nodes o_j
    t := 0
    repeat
        t := t + 1
        for i = 1, ..., N
            let o_j be the output neuron minimizing d_n(w_j, a_i)
            for all output nodes o_h:
                w_h^(t) := w_h^(t-1) + η(t, d_out(o_h, o_j)) (a_i - w_h^(t-1))
    until termination condition applies

80 Self Organizing Maps Distances etc.
Possible choices:
- $d_n$: Euclidean
- $d_{out}(o_j, o_h)$: e.g. 1 if $o_j, o_h$ are neighbors (rectangular or hexagonal layout), or the Euclidean distance on grid indices
- $\eta(t, d)$: e.g. $\alpha(t)\exp\!\left(-d^2 / 2\sigma^2(t)\right)$ with $\alpha(t), \sigma(t)$ decreasing in t
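Kohonen learning with these choices can be sketched in plain Python. The sketch assumes a rectangular grid, Euclidean distance both in data space and on grid indices, a fixed number of epochs as the termination condition, and the schedules α(t) = α0 / t and σ(t) = σ0 / t; the schedules, the function name `train_som`, and the sample data are illustrative assumptions. Weights are updated after every data case (on-line updating).

```python
import math
import random


def train_som(data, rows, cols, epochs, rng, alpha0=0.5, sigma0=1.0):
    """Train a rows x cols self organizing map; returns {grid index: weight vector}."""
    dim = len(data[0])
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    weights = {o: [rng.random() for _ in range(dim)] for o in nodes}
    for t in range(1, epochs + 1):
        alpha = alpha0 / t  # learning rate, decreasing in t (assumed schedule)
        sigma = sigma0 / t  # neighborhood width, decreasing in t (assumed schedule)
        for a in data:
            # Winner: output node whose weight vector is closest to the data case.
            winner = min(nodes, key=lambda o: math.dist(weights[o], a))
            for o in nodes:
                d_out = math.dist(o, winner)  # distance on grid indices
                eta = alpha * math.exp(-d_out ** 2 / (2 * sigma ** 2))
                # Pull the node's weights towards the data case by a factor eta.
                weights[o] = [w + eta * (x - w) for w, x in zip(weights[o], a)]
    return weights


# Hypothetical data in the unit square: two small groups.
data = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.9), (0.8, 0.9)]
weights = train_som(data, rows=2, cols=2, epochs=20, rng=random.Random(0))
```

Because η never exceeds α0 ≤ 1, every update is a convex combination of the old weight vector and a data case, so the weights stay inside the region spanned by the initial weights and the data.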

81 Self Organizing Maps Intuition
SOM learning can be understood as fitting a 2-dimensional surface to the data:
[Figure: grid of output neurons $o_{0,0}$, $o_{0,1}$, $o_{1,0}$, $o_{1,1}$ fitted to the data.]
Colors indicate association with different output neurons, not data attributes.
Some output neurons may not have any associated data cases.

82 Self Organizing Maps Example (from Tan et al.)
Data: word occurrence data (?) from 3204 articles from the Los Angeles Times with (hidden) section labels Entertainment, Financial, Foreign, Metro, National, Sports.
Result of SOM clustering on a 4 × 4 hexagonal grid:

Sports         Sports  Metro      Metro
Sports         Sports  Metro      Foreign
Entertainment  Metro   Metro      National
Entertainment  Metro   Financial  Financial

Output nodes are labelled with the majority label of the associated cases and colored, on a density scale from low to high, according to the number of cases associated with them (the coloring is fictional).

83 Self Organizing Maps SOMs and k-means
In spite of their roots in neural networks, SOMs are more closely related to k-means clustering:
- The weight vectors $w_j$ are cluster centers.
- Kohonen updating associates data cases with cluster centers and repositions the cluster centers to fit the associated data cases.
Differences:
- 2-dimensional spatial relationship among the cluster centers
- Data cases are associated with more than one cluster center
- On-line updating (one case at a time)

84 Self Organizing Maps Pros and Cons
+ Provides more insight than a basic clustering (i.e. a partitioning of the data)
+ Can produce intuitive representations of clustering results
- No well-defined objective function that is optimized


Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

DATA PREPROCESSING. Pronalaženje skrivenog znanja Bojan Furlan

DATA PREPROCESSING. Pronalaženje skrivenog znanja Bojan Furlan DATA PREPROCESSING Pronalaženje skrivenog znanja Bojan Furlan WHY DO WE NEED TO PREPROCESS THE DATA? Raw data contained in databases is unpreprocessed, incomplete, and noisy. For example, the databases

More information

Machine Learning - Clustering. CS102 Fall 2017

Machine Learning - Clustering. CS102 Fall 2017 Machine Learning - Fall 2017 Big Data Tools and Techniques Basic Data Manipulation and Analysis Performing well-defined computations or asking well-defined questions ( queries ) Data Mining Looking for

More information

Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April April 2015

Clustering. Subhransu Maji. CMPSCI 689: Machine Learning. 2 April April 2015 Clustering Subhransu Maji CMPSCI 689: Machine Learning 2 April 2015 7 April 2015 So far in the course Supervised learning: learning with a teacher You had training data which was (feature, label) pairs

More information

Cluster analysis. Agnieszka Nowak - Brzezinska

Cluster analysis. Agnieszka Nowak - Brzezinska Cluster analysis Agnieszka Nowak - Brzezinska Outline of lecture What is cluster analysis? Clustering algorithms Measures of Cluster Validity What is Cluster Analysis? Finding groups of objects such that

More information

A Dendrogram. Bioinformatics (Lec 17)

A Dendrogram. Bioinformatics (Lec 17) A Dendrogram 3/15/05 1 Hierarchical Clustering [Johnson, SC, 1967] Given n points in R d, compute the distance between every pair of points While (not done) Pick closest pair of points s i and s j and

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering

INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering INF4820, Algorithms for AI and NLP: Evaluating Classifiers Clustering Erik Velldal University of Oslo Sept. 18, 2012 Topics for today 2 Classification Recap Evaluating classifiers Accuracy, precision,

More information

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Statistical Analysis of Metabolomics Data Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte Outline Introduction Data pre-treatment 1. Normalization 2. Centering,

More information

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10) CMPUT 391 Database Management Systems Data Mining Textbook: Chapter 17.7-17.11 (without 17.10) University of Alberta 1 Overview Motivation KDD and Data Mining Association Rules Clustering Classification

More information

3. Data Preprocessing. 3.1 Introduction

3. Data Preprocessing. 3.1 Introduction 3. Data Preprocessing Contents of this Chapter 3.1 Introduction 3.2 Data cleaning 3.3 Data integration 3.4 Data transformation 3.5 Data reduction SFU, CMPT 740, 03-3, Martin Ester 84 3.1 Introduction Motivation

More information

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray

Exploratory Data Analysis using Self-Organizing Maps. Madhumanti Ray Exploratory Data Analysis using Self-Organizing Maps Madhumanti Ray Content Introduction Data Analysis methods Self-Organizing Maps Conclusion Visualization of high-dimensional data items Exploratory data

More information

Hierarchical Clustering

Hierarchical Clustering Hierarchical Clustering Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram A tree-like diagram that records the sequences of merges

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

UNIT 2 Data Preprocessing

UNIT 2 Data Preprocessing UNIT 2 Data Preprocessing Lecture Topic ********************************************** Lecture 13 Why preprocess the data? Lecture 14 Lecture 15 Lecture 16 Lecture 17 Data cleaning Data integration and

More information

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES

2.1 Objectives. Math Chapter 2. Chapter 2. Variable. Categorical Variable EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 Objectives 2.1 What Are the Types of Data? www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

Figure (5) Kohonen Self-Organized Map

Figure (5) Kohonen Self-Organized Map 2- KOHONEN SELF-ORGANIZING MAPS (SOM) - The self-organizing neural networks assume a topological structure among the cluster units. - There are m cluster units, arranged in a one- or two-dimensional array;

More information

K-Means Clustering 3/3/17

K-Means Clustering 3/3/17 K-Means Clustering 3/3/17 Unsupervised Learning We have a collection of unlabeled data points. We want to find underlying structure in the data. Examples: Identify groups of similar data points. Clustering

More information

AND NUMERICAL SUMMARIES. Chapter 2

AND NUMERICAL SUMMARIES. Chapter 2 EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES Chapter 2 2.1 What Are the Types of Data? 2.1 Objectives www.managementscientist.org 1. Know the definitions of a. Variable b. Categorical versus quantitative

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Fall 2013 Reading: Chapter 3 Han, Chapter 2 Tan Anca Doloc-Mihu, Ph.D. Some slides courtesy of Li Xiong, Ph.D. and 2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1 Week 8 Based in part on slides from textbook, slides of Susan Holmes December 2, 2012 1 / 1 Part I Clustering 2 / 1 Clustering Clustering Goal: Finding groups of objects such that the objects in a group

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/004 1

More information

Clustering. Unsupervised Learning

Clustering. Unsupervised Learning Clustering. Unsupervised Learning Maria-Florina Balcan 11/05/2018 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

arxiv: v1 [physics.data-an] 27 Sep 2007

arxiv: v1 [physics.data-an] 27 Sep 2007 Classification of Interest Rate Curves Using Self-Organising Maps arxiv:0709.4401v1 [physics.data-an] 27 Sep 2007 M.Kanevski a,, M.Maignan b, V.Timonin a,1, A.Pozdnoukhov a,1 a Institute of Geomatics and

More information

Basic Data Mining Technique

Basic Data Mining Technique Basic Data Mining Technique What is classification? What is prediction? Supervised and Unsupervised Learning Decision trees Association rule K-nearest neighbor classifier Case-based reasoning Genetic algorithm

More information

CSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning

CSE 40171: Artificial Intelligence. Learning from Data: Unsupervised Learning CSE 40171: Artificial Intelligence Learning from Data: Unsupervised Learning 32 Homework #6 has been released. It is due at 11:59PM on 11/7. 33 CSE Seminar: 11/1 Amy Reibman Purdue University 3:30pm DBART

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 03 Data Processing, Data Mining Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

More information

Clustering. Unsupervised Learning

Clustering. Unsupervised Learning Clustering. Unsupervised Learning Maria-Florina Balcan 03/02/2016 Clustering, Informal Goals Goal: Automatically partition unlabeled data into groups of similar datapoints. Question: When and why would

More information

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/28/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science

Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of Computer Science Data Mining Dr. Raed Ibraheem Hamed University of Human Development, College of Science and Technology Department of Computer Science 2016 201 Road map What is Cluster Analysis? Characteristics of Clustering

More information

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning

COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning COMP 551 Applied Machine Learning Lecture 13: Unsupervised learning Associate Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551

More information

Clustering Part 3. Hierarchical Clustering

Clustering Part 3. Hierarchical Clustering Clustering Part Dr Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Hierarchical Clustering Two main types: Agglomerative Start with the points

More information

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar

Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition. by Tan, Steinbach, Karpatne, Kumar Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, nd Edition by Tan, Steinbach, Karpatne, Kumar What is Cluster Analysis? Finding groups

More information

Unsupervised Learning I: K-Means Clustering

Unsupervised Learning I: K-Means Clustering Unsupervised Learning I: K-Means Clustering Reading: Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp. 487-515, 532-541, 546-552 (http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

Data Mining. Kohonen Networks. Data Mining Course: Sharif University of Technology 1

Data Mining. Kohonen Networks. Data Mining Course: Sharif University of Technology 1 Data Mining Kohonen Networks Data Mining Course: Sharif University of Technology 1 Self-Organizing Maps Kohonen Networks developed in 198 by Tuevo Kohonen Initially applied to image and sound analysis

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Clustering part II 1

Clustering part II 1 Clustering part II 1 Clustering What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods 2 Partitioning Algorithms:

More information

COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015.

COSC 6397 Big Data Analytics. Fuzzy Clustering. Some slides based on a lecture by Prof. Shishir Shah. Edgar Gabriel Spring 2015. COSC 6397 Big Data Analytics Fuzzy Clustering Some slides based on a lecture by Prof. Shishir Shah Edgar Gabriel Spring 215 Clustering Clustering is a technique for finding similarity groups in data, called

More information

CSE4334/5334 DATA MINING

CSE4334/5334 DATA MINING CSE4334/5334 DATA MINING Lecture 4: Classification (1) CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides courtesy

More information

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska

Preprocessing Short Lecture Notes cse352. Professor Anita Wasilewska Preprocessing Short Lecture Notes cse352 Professor Anita Wasilewska Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept

More information

Unsupervised Learning. Pantelis P. Analytis. Introduction. Finding structure in graphs. Clustering analysis. Dimensionality reduction.

Unsupervised Learning. Pantelis P. Analytis. Introduction. Finding structure in graphs. Clustering analysis. Dimensionality reduction. March 19, 2018 1 / 40 1 2 3 4 2 / 40 What s unsupervised learning? Most of the data available on the internet do not have labels. How can we make sense of it? 3 / 40 4 / 40 5 / 40 Organizing the web First

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu Clustering Overview Supervised vs. Unsupervised Learning Supervised learning (classification) Supervision: The training data (observations, measurements,

More information

Clustering COMS 4771

Clustering COMS 4771 Clustering COMS 4771 1. Clustering Unsupervised classification / clustering Unsupervised classification Input: x 1,..., x n R d, target cardinality k N. Output: function f : R d {1,..., k} =: [k]. Typical

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE Sundari NallamReddy, Samarandra Behera, Sanjeev Karadagi, Dr. Anantha Desik ABSTRACT: Tata

More information

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

Based on Raymond J. Mooney s slides

Based on Raymond J. Mooney s slides Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit

More information

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions Thomas Giraud Simon Chabot October 12, 2013 Contents 1 Discriminant analysis 3 1.1 Main idea................................

More information

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality

Data Preprocessing. Why Data Preprocessing? MIT-652 Data Mining Applications. Chapter 3: Data Preprocessing. Multi-Dimensional Measure of Data Quality Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., occupation = noisy: containing

More information

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data. Summary Statistics Acquisition Description Exploration Examination what data is collected Characterizing properties of data. Exploring the data distribution(s). Identifying data quality problems. Selecting

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Murhaf Fares & Stephan Oepen Language Technology Group (LTG) September 27, 2017 Today 2 Recap Evaluation of classifiers Unsupervised

More information

Data Mining. Moustafa ElBadry. A thesis submitted in fulfillment of the requirements for the degree of Bachelor of Arts in Mathematics

Data Mining. Moustafa ElBadry. A thesis submitted in fulfillment of the requirements for the degree of Bachelor of Arts in Mathematics Data Mining Moustafa ElBadry A thesis submitted in fulfillment of the requirements for the degree of Bachelor of Arts in Mathematics Department of Mathematics and Computer Science Whitman College 2016

More information